Save interval not working #3490
Comments
How many epochs does your run go for? Also, what is the save_filename? Is it unique per timestamp (to verify it isn't being overwritten each time)? |
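To make the overwrite question concrete, here is a minimal sketch in plain Python. It is not Composer's own code; it only assumes a filename template with `{epoch}`, `{batch}` and `{rank}` placeholders, modeled on the `ep3-ba1458-rank0.pt` name reported in this thread. The helper `render_names` and the figure of 486 batches per epoch are hypothetical, chosen for illustration.

```python
def render_names(template, num_epochs, batches_per_epoch, interval=1, rank=0):
    """Return the filename written at each save event (hypothetical helper)."""
    names = []
    for epoch in range(1, num_epochs + 1):
        # Save only on epochs that are a multiple of the interval.
        if epoch % interval != 0:
            continue
        batch = epoch * batches_per_epoch
        names.append(template.format(epoch=epoch, batch=batch, rank=rank))
    return names


# A template that includes the epoch and batch yields one distinct file per save.
unique = render_names("ep{epoch}-ba{batch}-rank{rank}.pt", 6, 486, interval=3)

# A template without them maps every save event onto the same file name.
fixed = render_names("latest-rank{rank}.pt", 6, 486, interval=3)

print(unique)
print(len(fixed), "saves,", len(set(fixed)), "distinct file name(s)")
```

If every save event renders to the same name, later saves replace earlier ones and only one file remains on disk, which is why the filename is worth checking first.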
Only 1 checkpoint is getting saved, in the format 'ep3-ba1458-rank0.pt'. If I specify a save interval of 1ep, only the 1st epoch gets saved even though I have 30 epochs. And if the save interval is 3ep, only the 3rd epoch gets saved; the rest are ignored. |
Hm... do you get any errors or traces? Mind sharing a minimal repro please? |
There are no errors, and everything else is working fine, including plotting the loss etc. |
Mind sharing what version of composer you are using? |
0.17.2 |
That's a very old (>6 months old) version of composer. Is there a reason you need to use that old of a version? |
I have been using this particular docker image https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples to use flashattention, triton etc. This setup requires specific versions; if I update to the latest composer, triton fails. |
Going to close as it seems to work. I'm not super sure what the bug was, but it's certainly possible there was an issue in older versions. Definitely recommend upgrading to latest :) |
I am running the MosaicBERT pretraining code from /examples/benchmarks/bert using the docker container. Everything works well except checkpoint saving: the expected number of checkpoints is not written.
My config is as below:
I even tried giving a batch interval, and also not specifying a save interval at all (which should save every epoch by default). Whatever interval I specify, only 1 checkpoint gets saved. I was wondering if anyone else is facing the same issue.
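For anyone trying to reproduce this, a small diagnostic can confirm which epochs actually have a checkpoint on disk. The sketch below is plain Python and does not use Composer; it assumes checkpoint names follow the `ep{epoch}-ba{batch}-rank{rank}.pt` pattern seen above and that the save interval is a whole number of epochs. The helpers `saved_epochs` and `missing_epochs` are hypothetical names.

```python
import re
import tempfile
from pathlib import Path

# Matches names like "ep3-ba1458-rank0.pt" (pattern assumed from this thread).
CKPT_PATTERN = re.compile(r"ep(\d+)-ba(\d+)-rank(\d+)\.pt$")


def saved_epochs(save_folder):
    """Return the sorted epoch numbers that have a checkpoint file."""
    epochs = set()
    for path in Path(save_folder).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match:
            epochs.add(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(save_folder, num_epochs, interval):
    """Return epochs that should have been saved but have no file."""
    expected = set(range(interval, num_epochs + 1, interval))
    return sorted(expected - set(saved_epochs(save_folder)))


# Demo on a throwaway folder holding a single checkpoint.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "ep3-ba1458-rank0.pt").touch()
    demo_saved = saved_epochs(folder)
    demo_missing = missing_epochs(folder, num_epochs=9, interval=3)

print(demo_saved, demo_missing)
```

Running the check after each epoch, rather than only at the end of training, would show whether files are never written or are written and later replaced.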