Transformer - token_embed outputs nan values #44
Comments
yea, this is just normal transformer instability; there's a bag of tricks for tackling this
Shoot, I'm using a dataset of 120 mesh models (1200 after augmentation). It worked a bit better with a bigger dataset, so it might be due to the 'small' dataset. lr 1e-4.
Could you give some examples of how to tackle this? I'm also getting NaN after a few epochs (~5) when training on full ShapeNet (~15k different mesh models) with a 1e-4 lr. I'm still investigating, so I'm not sure if it's exactly the same problem as @MarcusLoppe's, but it would be nice to have some ideas on how to solve it :)
there are no solutions. stabilizing transformers is still an active area of research, especially as you increase parameter count. there are various bandaids, however; most practitioners have a couple they apply, but none of them are panaceas yet
you can check out my x-transformers repo for more info
Any particular feature? I'm finding gate_residual, sandwich_norm, ResiDual and scale_residual.
I think experimenting with the optimizer would be a good start as well; the easiest parameters are probably max_grad_norm and weight_decay. In the paper they didn't mention any details other than using Adam and a batch size of 64, and I believe that increasing the batch size might help as well. Due to VRAM constraints I'm only using a batch size of 1 or 2.
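For reference, here is a minimal sketch of how those residual-stabilization options are toggled on an x-transformers decoder. The flag names (`sandwich_norm`, `gate_residual`, `scale_residual`, `resi_dual`) and the dimensions are assumptions based on the x-transformers README, so double-check them against the current repo:

```python
# minimal sketch, not the meshgpt-pytorch training code; flag names and sizes
# are assumptions taken from the x-transformers README and may differ
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 256,       # placeholder vocab size
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        sandwich_norm = True,     # extra norm around attention / feedforward blocks
        # alternatives mentioned above (you would normally pick one, not all):
        # gate_residual = True,   # gated residuals, a la "Stabilizing Transformers for RL"
        # scale_residual = True,  # learned scale on the residual branch
        # resi_dual = True,       # ResiDual dual-stream residual
    )
)

tokens = torch.randint(0, 256, (1, 1024))
logits = model(tokens)  # (1, 1024, 256)
```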
@MarcusLoppe you could try qk norm. some researchers at google brain are attached to this, but i suspect it has a slight generalization cost. yea, you are right about the optimizer; values to play with are beta1, beta2, and eps. your batch size def needs to be bigger once you scale up, but you can use gradient accumulation for this (which is built-in)
other things that would help are warmup, and gradient clipping of 0.5 (or 0.25 if you want to be really aggressive)
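As a rough illustration of how those knobs (Adam betas/eps, weight decay, warmup, gradient accumulation, clipping at 0.5 or 0.25) fit together, here is a plain-PyTorch sketch. It is not the meshgpt-pytorch trainer; `model`, `dataloader`, and all the hyperparameter values are placeholders:

```python
# plain-PyTorch illustration of the suggestions above; values are examples only,
# and `model` / `dataloader` are assumed to already exist
import torch
from torch.nn.utils import clip_grad_norm_

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr = 1e-4,
    betas = (0.9, 0.99),   # beta2 lowered from the 0.999 default for stability
    eps = 1e-8,
    weight_decay = 1e-2,
)

# linear warmup over the first 1000 optimizer steps
warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

grad_accum_every = 16  # effective batch = micro-batch size * grad_accum_every

for step, batch in enumerate(dataloader):
    loss = model(batch) / grad_accum_every
    loss.backward()
    if (step + 1) % grad_accum_every == 0:
        clip_grad_norm_(model.parameters(), 0.5)  # 0.25 to be really aggressive
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```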
@MarcusLoppe scratch everything i said; as Kurokabe noted, a potential source of instability was actually the gateloop layers
I still get NaN loss at 0.07 using 1e-4 as the learning rate, but above that it doesn't give any issues anymore.
Resolved by using a larger dataset; possible explanation: #68 (comment)
This issue occurs if you have too high a learning rate (1e-2) at a low loss (0.3), though it also occurred when I had 1e-3 as the lr at 0.01 loss.
edit: Using flash attention, it goes from a loss of 5.0 to NaN in the 5th epoch with a 1e-4 lr.
After the codes are masked and token_embed is called, it outputs NaN values.
Not sure if this issue is a PyTorch, meshgpt-pytorch, or user error :)
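To help localize where the NaNs first appear, one option (plain PyTorch, nothing specific to meshgpt-pytorch) is to register forward hooks that check every module's output; the helper name below is made up for illustration:

```python
# generic NaN-hunting helper; `model` and the hook name are placeholders
import torch

def nan_hook(module, inputs, output):
    tensors = output if isinstance(output, (tuple, list)) else (output,)
    for t in tensors:
        if torch.is_tensor(t) and torch.isnan(t).any():
            raise RuntimeError(f"NaN in output of {module.__class__.__name__}")

for name, module in model.named_modules():
    module.register_forward_hook(nan_hook)

# alternatively, let autograd flag the op that produced a NaN during backward
torch.autograd.set_detect_anomaly(True)
```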