Training issues and learning rates #128

Open
scarbain opened this issue Aug 29, 2023 · 6 comments

Comments

@scarbain

Hi! Thanks for releasing the models and the training code. That's a massive contribution!

I've tried to train the model, both by finetuning the released model and by training from scratch, but the result is always the same: the model starts collapsing and the frames produced during training are only noise.

Here is what I tested to prevent that:

  • Using 30 videos or 3500 videos
  • Using different batch sizes (I started at BS1 because 24GB of VRAM isn't enough to go higher):
    -- Using gradient accumulation steps of 4 with BS1: no real change
    -- Using BS4 + gradient accumulation steps of 1 with gradient checkpointing: strangely, the model didn't seem to learn ANYTHING when using gradient checkpointing
  • The only thing that had any effect was drastically reducing the learning rate:
    -- LR 1e-4: model collapses after only 40 steps
    -- LR 1e-5: model collapses after around 100 steps
    -- LR 1e-7: model collapses after 10K steps, but it didn't learn anything
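For reference, here is a minimal sketch of what gradient accumulation does to the gradient, using a hypothetical one-parameter toy model rather than the repository's training loop. It assumes a mean-reduced loss and that each micro-batch contribution is scaled by 1/steps; under those assumptions, BS1 with 4 accumulation steps yields the same averaged gradient as a single batch of 4. The names `grad_mse` and `accumulated_grad` and the data are made up for illustration.

```python
# Toy model (an assumption for illustration): y_hat = w * x, loss = mean((w*x - y)^2)

def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error over one batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Sum micro-batch gradients, each scaled by 1/steps, as gradient accumulation does."""
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        total += grad_mse(w, micro_batch) / steps
    return total

data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]  # made-up samples
w = 0.5

full = grad_mse(w, data)                          # one batch of 4 (BS4)
accum = accumulated_grad(w, [[s] for s in data])  # 4 micro-batches of 1 (BS1 x 4)
print(full, accum)
```

The two values agree, so accumulation mainly changes memory use rather than the update itself. If the 1/steps scaling is missing, the summed gradient is 4 times larger, which behaves like a 4 times higher learning rate.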

I haven't tried using the original video dataset yet; that will be my next test. Could the problem be the videos I used? Something to do with the FPS, perhaps?

Has anyone else managed to train from scratch or finetune? If so, what LR did you use? And which other parameters did you change in the training.yaml file?

Thanks

@shliu0

shliu0 commented Sep 1, 2023

Same issue here. Have you solved this problem?

@scarbain
Author

scarbain commented Sep 1, 2023

No, I haven't tried again.

@shliu0

shliu0 commented Sep 6, 2023

I updated xformers from 0.0.16 to 0.0.17 and then it worked. Maybe you can try this.
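A quick way to check which xformers version is installed, as a minimal sketch using only the Python standard library. The helper names `parse_version` and `installed_version` are hypothetical, and the 0.0.17 threshold is taken from the comment above, not from any official requirement.

```python
from importlib import metadata

def parse_version(text):
    """Turn '0.0.17' into (0, 0, 17); stops at suffixes such as '+cu118' or 'dev1'."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

MIN_XFORMERS = "0.0.17"  # version reported to work in this thread (an assumption, not a documented minimum)

found = installed_version("xformers")
if found is None:
    print("xformers is not installed")
elif parse_version(found) < parse_version(MIN_XFORMERS):
    print(f"xformers {found} is older than {MIN_XFORMERS}; consider upgrading")
else:
    print(f"xformers {found} is at least {MIN_XFORMERS}")
```

Comparing integer tuples avoids the string-comparison pitfall where "0.0.9" sorts after "0.0.17". Passing this check does not rule out the problem: a later comment in this thread reports the same collapse with 0.0.20.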

@yishuaidu

Hi, what do your videos look like? Do they all show the same motion?

@yifanliuu

Any update?
I have the same issue.
I'm not sure whether the collapse is related to the training dataset. I used TikTok videos to train the motion module from scratch without modifying any hyperparameters in the training config file, but got noisy videos after about 30 training steps. My xformers version is 0.0.20.

@liutaocode

Is it related to this issue?
