
The model doesn't converge #43

Closed · SuX97 opened this issue Dec 16, 2020 · 14 comments


SuX97 commented Dec 16, 2020

Hi,

Thank you for your work.

I have tried training your implementation for action classification on Kinetics400, but the training does not converge.
[training loss curve]
Note that the learning rate is calculated from the paper's 6e-4 at batch size 4096 via the Linear Scaling Rule. I also applied warmup, but the loss plateaus while warming up.
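For context, the Linear Scaling Rule just rescales the paper's base learning rate by the batch-size ratio; a minimal sketch (base values taken from the paper):

```python
def linear_scaled_lr(batch_size, base_lr=6e-4, base_batch_size=4096):
    """Linear Scaling Rule: the LR grows proportionally with batch size."""
    return base_lr * batch_size / base_batch_size

print(linear_scaled_lr(768))  # ~1.13e-4 for a total batch size of 768
```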

I have also tested the pretrained model from timm; it does not converge either.
[training loss curve, timm pretrained model]

Do you have any suggestions for training such stand-alone transformers more effectively? Thanks.

@lucidrains (Owner)

@SuX97 hmm, I'm not familiar with the dataset, but how large are the images?


SuX97 commented Dec 18, 2020

> @SuX97 hmm, I'm not familiar with the dataset, but how large are the images?

Hi @lucidrains, this dataset has ~240k 10-second videos, which is roughly 72 million frames. For training I use 8 GPUs with a mini-batch of 96 each, i.e. a total batch size of 768. The images are resized so the short edge is 224 and then random-cropped to 224 × 224.
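For reference, that preprocessing corresponds to something like the following torchvision sketch (the actual pipeline may differ; this is only illustrative):

```python
from torchvision import transforms

# resize the short edge to 224, then take a random 224 x 224 crop
train_transform = transforms.Compose([
    transforms.Resize(224),      # short edge -> 224
    transforms.RandomCrop(224),  # random 224 x 224 crop
    transforms.ToTensor(),
])
```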

@lucidrains (Owner)

@SuX97 hmm, I don't see any obvious problems, could you show me the full set of hyperparameters you used?


SuX97 commented Dec 21, 2020

Here is my config for training. The original learning rate is 6e-4 at batch size 4096; I used a mini-batch of 32 videos, each with 3 frames, on 16 GPUs (the calculation is shown in the comments). Besides, I used a very large weight decay, as in the original paper:

```python
# optimizer
optimizer = dict(
    type='SGD',
    # Linear Scaling Rule: 4096 / (32 videos * 8 gpus * 3 frames) = 5.3,
    # so 6e-4 / 5.3 = 1.13e-4; scaled for 16 gpus: 1.13e-4 * (16 / 8) = 2.23e-4
    lr=0.000223,
    momentum=0.9,
    weight_decay=0.1)  # very large weight decay, following the paper
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_ratio=0.001,
    warmup_by_epoch=True,
    warmup_iters=20)

total_epochs = 200
```
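(For anyone reading along, a minimal sketch of the schedule this config produces, assuming mmcv-style semantics where `warmup_by_epoch=True` makes `warmup_iters=20` mean 20 warmup epochs.)

```python
import math

def lr_at_epoch(epoch, base_lr=2.23e-4, warmup_epochs=20,
                warmup_ratio=0.001, total_epochs=200, min_lr=0.0):
    """Approximate LR for linear warmup followed by cosine annealing."""
    if epoch < warmup_epochs:
        # linear ramp from base_lr * warmup_ratio up to base_lr
        frac = epoch / warmup_epochs
        return base_lr * (warmup_ratio + (1 - warmup_ratio) * frac)
    # cosine decay from base_lr down to min_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```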

Thanks!


SuX97 commented Dec 21, 2020

@lucidrains, FYI, I have also tested smaller learning rates, all of which fail as well: the loss curve rises after a number of iterations and soon plateaus.

@lucidrains (Owner)

What do your model hyperparameters look like? Could you give the Adam optimizer a try?


SuX97 commented Dec 22, 2020

Hi @lucidrains, I am using the default hyperparameters, by calling:

```python
self.m = timm.create_model(
    'vit_base_patch16_224', pretrained=use_pretrained)
```

which is defined as:

```python
@register_model
def vit_base_patch16_224(pretrained=False, **kwargs):
    model = VisionTransformer(
        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    model.default_cfg = default_cfgs['vit_base_patch16_224']
    if pretrained:
        load_pretrained(
            model, num_classes=model.num_classes, in_chans=kwargs.get('in_chans', 3), filter_fn=_conv_filter)
    return model
```

Yeah, I will give Adam a try.
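A minimal sketch of that swap, using AdamW with the paper-style large weight decay (the values are illustrative, not tuned):

```python
import timm
import torch

model = timm.create_model('vit_base_patch16_224', pretrained=False)

# AdamW decouples the weight decay from the gradient update, which suits
# the large decay used in the ViT paper
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2.23e-4,         # same linearly-scaled LR as the SGD config above
    betas=(0.9, 0.999),
    weight_decay=0.1)
```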

In addition, I have tried freezing the ViT parameters and fine-tuning only a linear probe on Kinetics400, which results in:
[linear-probe training loss curve]
It does converge, but a normal CNN on Kinetics400 can reach a loss lower than 1, so I believe the main problem is still optimization.
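A minimal sketch of the linear-probe setup described above, assuming the timm model from earlier:

```python
import timm
import torch.nn as nn

model = timm.create_model('vit_base_patch16_224', pretrained=True)

# freeze the ViT backbone so only the classification head is trained
for param in model.parameters():
    param.requires_grad = False

# replace the classifier with a fresh, trainable linear probe
model.head = nn.Linear(model.head.in_features, 400)  # 400 Kinetics classes
```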

@lucidrains (Owner)

@SuX97 you'll have to scale up from here! smaller patch sizes, bigger dimensions, more heads, and then finally depth


SuX97 commented Dec 24, 2020

> @SuX97 you'll have to scale up from here! smaller patch sizes, bigger dimensions, more heads, and then finally depth

Do you mean using models such as the Large or Huge variants? The patch size is already 16, the smallest one.

Besides, the Base 16 × 16 model I used still cannot converge (the curve I showed that converges comes from freezing the ViT parameters; when I unfreeze them for optimization, the loss explodes again, and since I use warm-up I don't think the learning rate is too big). I suspect it would be even harder for bigger models to converge.

@lucidrains (Owner)

@SuX97 do you want to give the findings of this paper a try? https://github.com/lucidrains/vit-pytorch#distillation
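For anyone following along, a minimal sketch of that distillation setup, adapted from the vit-pytorch README (the ResNet-50 teacher and all dimensions here are illustrative):

```python
import torch
from torchvision.models import resnet50
from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained=True)  # any strong convnet can act as the teacher

student = DistillableViT(
    image_size=224, patch_size=16, num_classes=400,
    dim=768, depth=12, heads=12, mlp_dim=3072)

distiller = DistillWrapper(
    student=student, teacher=teacher,
    temperature=3, alpha=0.5, hard=False)

img = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 400, (2,))

loss = distiller(img, labels)  # combined classification + distillation loss
loss.backward()
```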

@TitaniumOne

@SuX97 Hello! I trained on an action recognition dataset with a combination of ViT and TSN, and my model can't converge either. Have you solved the problem? Do you have any ideas?

@MercyPrasanna

Same here. My model is not able to converge. It worked better for single-image classification, but when I go for sequence classification with longer sequences, it is unable to learn. It looks like it needs a lot of data!


SuX97 commented Mar 3, 2021

@TitaniumOne @MercyPrasanna Hi, maybe this paper can be a reference: https://arxiv.org/abs/2102.05095 (model code: https://github.com/lucidrains/TimeSformer-pytorch). I think the convergence is all about tuning, so I give up :(
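For anyone landing here, a minimal usage sketch of that model, adapted from the TimeSformer-pytorch README (the dimensions are illustrative; num_classes=400 assumes Kinetics400):

```python
import torch
from timesformer_pytorch import TimeSformer

model = TimeSformer(
    dim=512, image_size=224, patch_size=16,
    num_frames=8, num_classes=400,
    depth=12, heads=8, dim_head=64,
    attn_dropout=0.1, ff_dropout=0.1)

video = torch.randn(2, 8, 3, 224, 224)  # (batch, frames, channels, height, width)
pred = model(video)                     # (2, 400) class logits
```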

SuX97 closed this as completed Mar 3, 2021

Blosslzy commented Jan 2, 2022

Is it possible that the weight decay was set too large, preventing convergence?
