
The model doesn't converge #43

Closed · SuX97 opened this issue Dec 16, 2020 · 14 comments


SuX97 commented Dec 16, 2020

Hi,

Thank you for your work.

I have tried training your implementation for action classification on Kinetics400, but the training does not converge.
[training loss curve]
Note that the learning rate is calculated from the paper's 6e-4 at batch size 4096 via the Linear Scaling Rule. I also applied warmup, but the loss plateaus while warming up.
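For context, the Linear Scaling Rule just rescales the paper's base learning rate by the batch-size ratio; a minimal sketch (base values taken from the paper):

```python
def linear_scaled_lr(batch_size, base_lr=6e-4, base_batch_size=4096):
    """Linear Scaling Rule: the LR grows proportionally with batch size."""
    return base_lr * batch_size / base_batch_size

print(linear_scaled_lr(768))  # ~1.13e-4 for a total batch size of 768
```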

I have also tested the pretrained model from timm; it does not converge either.
[training loss curve, timm pretrained model]

Do you have any suggestions for training such stand-alone transformers more effectively? Thanks.

@lucidrains (Owner)

@SuX97 hmm, I'm not familiar with the dataset, but how large are the images?


SuX97 commented Dec 18, 2020

> @SuX97 hmm, I'm not familiar with the dataset, but how large are the images?

Hi @lucidrains, this dataset has ~240k 10-second videos, which is roughly 72 million frames. For training I use 8 GPUs with a mini-batch of 96 each, i.e. a total batch size of 768. The images are resized so the short edge is 224 and then random-cropped to 224 × 224.
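For reference, that preprocessing corresponds to something like the following torchvision sketch (the actual pipeline may differ; this is only illustrative):

```python
from torchvision import transforms

# resize the short edge to 224, then take a random 224 x 224 crop
train_transform = transforms.Compose([
    transforms.Resize(224),      # short edge -> 224
    transforms.RandomCrop(224),  # random 224 x 224 crop
    transforms.ToTensor(),
])
```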

@lucidrains (Owner)

@SuX97 hmm, I don't see any obvious problems, could you show me the full set of hyperparameters you used?


SuX97 commented Dec 21, 2020

Here is my config for training. The original learning rate is 6e-4 at batch size 4096; I used a mini-batch of 32 videos, each with 3 frames, on 16 GPUs (the calculation is shown in the comments). Besides, I used a very large weight decay, as in the original paper:

```python
# optimizer
optimizer = dict(
    type='SGD',
    # Linear Scaling Rule: 4096 / (32 videos * 8 gpus * 3 frames) = 5.3,
    # so 6e-4 / 5.3 = 1.13e-4; scaled for 16 gpus: 1.13e-4 * (16 / 8) = 2.23e-4
    lr=0.000223,
    momentum=0.9,
    weight_decay=0.1)  # very large weight decay, following the paper
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_ratio=0.001,
    warmup_by_epoch=True,
    warmup_iters=20)

total_epochs = 200
```
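(For anyone reading along, a minimal sketch of the schedule this config produces, assuming mmcv-style semantics where `warmup_by_epoch=True` makes `warmup_iters=20` mean 20 warmup epochs.)

```python
import math

def lr_at_epoch(epoch, base_lr=2.23e-4, warmup_epochs=20,
                warmup_ratio=0.001, total_epochs=200, min_lr=0.0):
    """Approximate LR for linear warmup followed by cosine annealing."""
    if epoch < warmup_epochs:
        # linear ramp from base_lr * warmup_ratio up to base_lr
        frac = epoch / warmup_epochs
        return base_lr * (warmup_ratio + (1 - warmup_ratio) * frac)
    # cosine decay from base_lr down to min_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```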

Thanks!


SuX97 commented Dec 21, 2020

@lucidrains, FYI, I have also tested smaller learning rates, all of which fail as well: the loss curve rises after a number of iterations and soon plateaus.

@lucidrains (Owner)

What do your model hyperparameters look like? Could you give the Adam optimizer a try?


SuX97 commented Dec 22, 2020

Hi @lucidrains, I am using the default hyperparameters, by calling:

```python
self.m = timm.create_model(
    'vit_base_patch16_224', pretrained=use_pretrained)
```

which is defined as:

```python
@register_model
def vit_base_patch16_224(pretrained=False, **kwargs):
    model = VisionTransformer(
        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    model.default_cfg = default_cfgs['vit_base_patch16_224']
    if pretrained:
        load_pretrained(
            model, num_classes=model.num_classes, in_chans=kwargs.get('in_chans', 3), filter_fn=_conv_filter)
    return model
```

Yeah, I will give Adam a try.
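A minimal sketch of that swap, using AdamW with the paper-style large weight decay (the values are illustrative, not tuned):

```python
import timm
import torch

model = timm.create_model('vit_base_patch16_224', pretrained=False)

# AdamW decouples the weight decay from the gradient update, which suits
# the large decay used in the ViT paper
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2.23e-4,         # same linearly-scaled LR as the SGD config above
    betas=(0.9, 0.999),
    weight_decay=0.1)
```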

In addition, I have tried freezing the ViT parameters and fine-tuning only a linear probe on Kinetics400, which results in:
[linear-probe training loss curve]
It does converge, but a normal CNN on Kinetics400 can reach a loss lower than 1, so I believe the main problem is still optimization.
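A minimal sketch of the linear-probe setup described above, assuming the timm model from earlier:

```python
import timm
import torch.nn as nn

model = timm.create_model('vit_base_patch16_224', pretrained=True)

# freeze the ViT backbone so only the classification head is trained
for param in model.parameters():
    param.requires_grad = False

# replace the classifier with a fresh, trainable linear probe
model.head = nn.Linear(model.head.in_features, 400)  # 400 Kinetics classes
```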

@lucidrains (Owner)

@SuX97 you'll have to scale up from here! smaller patch sizes, bigger dimensions, more heads, and then finally depth


SuX97 commented Dec 24, 2020

> @SuX97 you'll have to scale up from here! smaller patch sizes, bigger dimensions, more heads, and then finally depth

Do you mean using models such as the Large or Huge variants? The patch size is already 16, the smallest one.

Besides, the Base 16 × 16 model I used still cannot converge (the curve I showed that converges comes from freezing the ViT parameters; when I unfreeze them for optimization, the loss explodes again, and since I use warm-up I don't think the learning rate is too big). I suspect it would be even harder for bigger models to converge.

@lucidrains (Owner)

@SuX97 do you want to give the findings of this paper a try? https://github.com/lucidrains/vit-pytorch#distillation
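For anyone following along, a minimal sketch of that distillation setup, adapted from the vit-pytorch README (the ResNet-50 teacher and all dimensions here are illustrative):

```python
import torch
from torchvision.models import resnet50
from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained=True)  # any strong convnet can act as the teacher

student = DistillableViT(
    image_size=224, patch_size=16, num_classes=400,
    dim=768, depth=12, heads=12, mlp_dim=3072)

distiller = DistillWrapper(
    student=student, teacher=teacher,
    temperature=3, alpha=0.5, hard=False)

img = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 400, (2,))

loss = distiller(img, labels)  # combined classification + distillation loss
loss.backward()
```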

@TitaniumOne

@SuX97 Hello! I trained on an action recognition dataset with a combination of ViT and TSN, and my model can't converge either. Have you solved the problem? Do you have any ideas?

@MercyPrasanna

Same here. My model is not able to converge. It worked better for single-image classification, but when I go for sequence classification with longer sequences, it is unable to learn. It looks like it needs a lot of data!


SuX97 commented Mar 3, 2021

@TitaniumOne @MercyPrasanna Hi, maybe this paper can be a reference: https://arxiv.org/abs/2102.05095 (model code: https://github.com/lucidrains/TimeSformer-pytorch). I think the convergence is all about tuning, so I give up :(
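For anyone landing here, a minimal usage sketch of that model, adapted from the TimeSformer-pytorch README (the dimensions are illustrative; num_classes=400 assumes Kinetics400):

```python
import torch
from timesformer_pytorch import TimeSformer

model = TimeSformer(
    dim=512, image_size=224, patch_size=16,
    num_frames=8, num_classes=400,
    depth=12, heads=8, dim_head=64,
    attn_dropout=0.1, ff_dropout=0.1)

video = torch.randn(2, 8, 3, 224, 224)  # (batch, frames, channels, height, width)
pred = model(video)                     # (2, 400) class logits
```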

SuX97 closed this as completed Mar 3, 2021

Blosslzy commented Jan 2, 2022

Is it possible that the weight decay was set too large, preventing convergence?
