Drop in accuracy after rewinding convnext training #495
Replies: 5 comments
-
Are you using a few hundred GPUs and a very large batch size?
-
Yes, we are using a global batch size of ~70k. We are not sure exactly how you are doing the rewinding. Are you starting the training from the last checkpoint, with a larger LR and batch size? Can you please provide more details? We checked the details at https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind, and it seems the rewind is basically retraining for 256 epochs with a larger batch size and LR? I am confused why it says "For the rewind of last 10%". Thanks!
-
@AlaaKhaddaj It is not retraining for all 256 checkpoint epochs ('virtual' epochs, as they are not full dataset passes), only a 10% rewind, which in this case means I resumed from checkpoint 230 and reran the last 26. I basically set the initial LR to 2e-3 instead of 1e-3, so the resumed LR picks up at the 230/256 point of the cosine schedule, slightly altered by the change in step count from the global batch size change. I'd say the bump in global batch size, from 82k to 95k, is a bit of a factor. Augmentation was also increased. EDIT: I should also point out that the original hparams weren't ideal; I started with too low an LR and only figured that out after I got results from the base/large models, which finished sooner (but were started later)...
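For intuition, here is a minimal sketch of the LR arithmetic this implies, assuming a plain cosine decay to zero with warmup ignored (the 230/256 resume point is well past it) and treating the 256 'virtual' epochs as the schedule units rather than optimizer steps; it is not the actual open_clip scheduler code:

```python
import math

# Minimal sketch of the cosine LR math described above -- NOT the actual
# open_clip scheduler code. Assumes a plain cosine decay from base_lr to 0
# over `total` schedule units, ignoring warmup.

def cosine_lr(base_lr: float, step: float, total: float) -> float:
    """LR at `step` of a cosine schedule decaying from base_lr to 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total))

total = 256      # 'virtual' epochs used as schedule units
resume_at = 230  # checkpoint the rewind resumed from

print(cosine_lr(1e-3, resume_at, total))  # ~2.5e-5, original schedule
print(cosine_lr(2e-3, resume_at, total))  # ~5.0e-5, after raising the base LR
```

So doubling the base LR roughly doubles the LR over the rewound 10% segment while keeping the same decay-to-zero endpoint; in the real run the schedule is evaluated per optimizer step, so the 82k-to-95k batch size change also shifts where on the curve each step lands.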
-
@AlaaKhaddaj Moved to a discussion, as this is a fiddly hparams question that is likely useful as a reference rather than a bug.
-
Thank you for the clarification!
-
Hello,
I am trying to replicate the improvement you got in ConvNeXt using rewinding. However, I am noticing a big drop in accuracy. Can you please provide more details about that?
Thanks!