The model doesn't converge #43
Comments
@SuX97 hmm, I'm not familiar with the dataset, but how large are the images?
Hi @lucidrains , this dataset has ~240k 10-second videos, which amounts to roughly 72 million frames. For training, I use 8 GPUs with a per-GPU mini-batch of 96, i.e. a total batch size of 768. The images are resized so the short edge is 224 and then random-cropped to 224 x 224.
@SuX97 hmm, I don't see any obvious problems, could you show me the full set of hyperparameters you used?
Here is my config for training. The original learning rate is 6e-4 at batch size 4096; I used a mini-batch of 32 videos, each with 3 frames, on 16 GPUs, so the scaled learning rate is derived in the comments below. Besides, I used the very large weight decay from the original paper:

```python
# optimizer
optimizer = dict(
    type='SGD',
    lr=0.000223,
    momentum=0.9,
    # linear scaling rule: 4096 / (32 batch * 8 gpus * 3 clips) = 5.3; 6e-4 / 5.3 ≈ 1.13e-4
    # that lr is for 8 gpus; for 16 gpus: 1.13e-4 * (16 / 8) ≈ 2.23e-4 (the lr above)
    weight_decay=0.1)  # very large weight_decay, as in the original paper
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_ratio=0.001,
    warmup_by_epoch=True,
    warmup_iters=20)
total_epochs = 200
```

Thanks!
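For reference, here is a quick sanity check of that linear-scaling arithmetic as a standalone snippet (the batch, GPU, and clip counts are taken from the config above):

```python
# verify the linear scaling rule used in the config above
base_lr, base_batch = 6e-4, 4096                                # ViT paper: 6e-4 @ 4096
videos_per_gpu, num_gpus, clips_per_video = 32, 16, 3
effective_batch = videos_per_gpu * num_gpus * clips_per_video   # 32 * 16 * 3 = 1536
scaled_lr = base_lr * effective_batch / base_batch              # 6e-4 * 1536 / 4096
print(f"effective batch = {effective_batch}, scaled lr = {scaled_lr:.3e}")
# -> effective batch = 1536, scaled lr = 2.250e-04 (close to the 0.000223 in the config)
```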
@lucidrains , FYI, I have also tested smaller learning rates, which all turn out to fail: the loss curves rise after a number of iterations and soon plateau.
What do your model hyperparameters look like? Could you give the Adam optimizer a try?
Hi @lucidrains , I am using the default hyperparameters, simply calling the model constructor with its defaults.
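For concreteness, a default vit-pytorch instantiation looks roughly like the following ViT-Base/16 setup; the exact argument values used here are assumptions, not taken from this thread:

```python
import torch
from vit_pytorch import ViT

# illustrative ViT-Base/16 configuration with the vit-pytorch constructor;
# the specific values are assumptions, not posted in this thread
model = ViT(
    image_size = 224,
    patch_size = 16,
    num_classes = 400,   # Kinetics-400
    dim = 768,
    depth = 12,
    heads = 12,
    mlp_dim = 3072,
    dropout = 0.1,
    emb_dropout = 0.1
)

frames = torch.randn(3, 3, 224, 224)  # 3 sampled frames treated as independent images
logits = model(frames)                # (3, 400)
```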
Yeah, I will give Adam a try. In addition, I have tried freezing the ViT parameters and doing linear-probe fine-tuning on Kinetics-400, which does converge (that is the loss curve I posted).
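A possible AdamW variant of the SGD block above, for anyone trying the Adam suggestion (a sketch only; the lr, betas, and weight decay here are common ViT fine-tuning choices, not values from this thread):

```python
# hypothetical AdamW replacement for the SGD optimizer block above
# (mmcv-style config); values would still need tuning for this setup
optimizer = dict(
    type='AdamW',
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.05)  # typically smaller than the 0.1 used with SGD above
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
```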
@SuX97 you'll have to scale up from here! smaller patch sizes, bigger dimensions, more heads, and then finally depth |
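Concretely, scaling along those axes with the vit-pytorch constructor might look like the following sketch (the specific values are illustrative, not from the thread):

```python
from vit_pytorch import ViT

# illustrative scaled-up variant along the suggested axes
bigger = ViT(
    image_size = 224,
    patch_size = 14,    # smaller patches -> longer token sequence
    num_classes = 400,
    dim = 1024,         # bigger dimension
    heads = 16,         # more heads
    depth = 24,         # depth scaled last
    mlp_dim = 4096
)
```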
Do you mean switching to larger models? Besides, the base 16x16 model I used still cannot converge (the graph I showed that converges is with the ViT parameters frozen; when I unfreeze them for optimization, the loss explodes again, and since I use warm-up I don't think the learning rate is too big). I suspect it would be even harder for bigger models to converge.
@SuX97 do you want to give the findings of this paper a try? https://github.com/lucidrains/vit-pytorch#distillation |
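The linked README section pairs a DistillableViT student with a CNN teacher; roughly, the recipe looks like this (closely following the linked image-classification example):

```python
import torch
from torchvision.models import resnet50
from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained=True)  # CNN teacher, as in the README example

student = DistillableViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

distiller = DistillWrapper(
    student = student,
    teacher = teacher,
    temperature = 3,   # soft-label temperature
    alpha = 0.5,       # weight between hard-label loss and distillation loss
    hard = False       # soft distillation, as in DeiT
)

img = torch.randn(2, 3, 256, 256)
labels = torch.randint(0, 1000, (2,))

loss = distiller(img, labels)  # combined classification + distillation loss
loss.backward()
```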
@SuX97 Hello! I trained on an action recognition dataset with a combination of ViT and TSN, and my model can't converge either. Have you solved the problem? Do you have any ideas?
Same here. My model is not able to converge. It worked better for single-image classification, but when I move to sequence classification with longer sequences, it is unable to learn. It looks like it needs a lot of data!
@TitaniumOne @MercyPrasanna Hi, maybe this paper can be a reference: https://arxiv.org/abs/2102.05095 (model code: https://github.com/lucidrains/TimeSformer-pytorch). I think the convergence is all about tuning, so I've given up :(
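For anyone going that route, the linked TimeSformer implementation is instantiated roughly as in its README (the num_classes value below is an assumption for Kinetics-400):

```python
import torch
from timesformer_pytorch import TimeSformer

model = TimeSformer(
    dim = 512,
    image_size = 224,
    patch_size = 16,
    num_frames = 8,
    num_classes = 400,   # assumed here for Kinetics-400
    depth = 12,
    heads = 8,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)

video = torch.randn(2, 8, 3, 224, 224)  # (batch, frames, channels, height, width)
pred = model(video)                     # (2, 400)
```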
Is it possible that the weight decay was set too large, preventing convergence?
Hi,
Thank you for your work.
I have tried training your implementation for action classification on Kinetics-400, but find that the training does not converge.
Note that the learning rate is scaled from the paper's 6e-4 at batch size 4096 via the linear scaling rule. I also applied warmup, but the loss plateaus while still warming up.
I have also tested the pretrained model from timm. It does not converge either.
Do you have any suggestions for better training of such stand-alone transformers? Thanks.
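For the timm test mentioned above, loading a pretrained ViT and re-heading it for Kinetics-400 would look something like this (the model name is an assumption; any timm ViT variant works the same way):

```python
import timm
import torch

# hypothetical example: pretrained ViT-B/16 from timm with a fresh
# 400-way classifier head for Kinetics-400 fine-tuning
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=400)

frames = torch.randn(8, 3, 224, 224)
logits = model(frames)  # (8, 400)
```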