Save interval not working #3490

Closed
naimavahab opened this issue Jul 24, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@naimavahab

naimavahab commented Jul 24, 2024

I am running the MosaicBERT pretraining code from /examples/benchmarks/bert using the Docker container. Everything works well except for the number of checkpoints saved.
My config is as follows:

save_interval: 1ep
save_num_checkpoints_to_keep: -1 
save_folder: ./{run_name}/ckpt    
save_overwrite: True

I even tried specifying a batch interval, and also omitting the save interval entirely (which should save every epoch by default). Whatever interval I specify, only 1 checkpoint gets saved. I was wondering whether anyone else is facing the same issue.
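For reference, the behaviour expected from these settings can be sketched in plain Python. This is only an illustration of the intended semantics, assuming a checkpoint at every Nth epoch and nothing deleted when save_num_checkpoints_to_keep is -1. The helper `expected_checkpoint_epochs` is hypothetical and not part of Composer, and it handles only the `<N>ep` form of the interval.

```python
import re
from typing import List


def expected_checkpoint_epochs(save_interval: str, max_epochs: int) -> List[int]:
    """Return the epochs at which a checkpoint would be expected.

    Hypothetical helper for illustration; supports only '<N>ep' intervals.
    """
    match = re.fullmatch(r"(\d+)ep", save_interval.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {save_interval!r}")
    every = int(match.group(1))
    if every <= 0:
        raise ValueError("interval must be positive")
    # One checkpoint at every multiple of the interval, up to max_epochs.
    return list(range(every, max_epochs + 1, every))


if __name__ == "__main__":
    print(len(expected_checkpoint_epochs("1ep", 30)))
    print(len(expected_checkpoint_epochs("3ep", 30)))
```

Under these assumptions, a 30-epoch run should leave 30 files for `1ep` and 10 files for `3ep`, rather than the single file observed.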

@naimavahab naimavahab added the bug Something isn't working label Jul 24, 2024
@mvpatel2000
Contributor

How many epochs does your run go for? Also, what is the save_filename? Is it unique per timestamp (to verify it isn't being overwritten each time)?
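The overwrite concern can be illustrated with a small sketch. It does not use Composer. It only assumes that the filename is rendered from a template with Python's `str.format`, and the placeholder names `epoch`, `batch`, and `rank`, the helper `checkpoint_names`, and the value of 100 batches per epoch are all chosen for the example.

```python
from typing import List


def checkpoint_names(template: str, epochs: int, batches_per_epoch: int,
                     rank: int = 0) -> List[str]:
    """Render one checkpoint filename per epoch from a format template."""
    names = []
    for epoch in range(1, epochs + 1):
        batch = epoch * batches_per_epoch
        # Unused keyword arguments are ignored by str.format, so a template
        # without {epoch} or {batch} renders the same name every time.
        names.append(template.format(epoch=epoch, batch=batch, rank=rank))
    return names


if __name__ == "__main__":
    unique = checkpoint_names("ep{epoch}-ba{batch}-rank{rank}.pt", 30, 100)
    fixed = checkpoint_names("latest-rank{rank}.pt", 30, 100)
    print(len(set(unique)), len(set(fixed)))
```

With a template that lacks the epoch and batch placeholders, every save resolves to the same path, so with save_overwrite enabled only one file would remain on disk.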

@naimavahab
Author

Only 1 checkpoint is getting saved, with a name in the format 'ep3-ba1458-rank0.pt'. If I specify a save interval of 1ep, only the 1st epoch gets saved, even though I have 30 epochs. And if the save interval is 3ep, the 3rd epoch gets saved and the rest are ignored.
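A quick way to confirm which epochs actually reached disk is to scan the checkpoint folder. The sketch below uses only the standard library and assumes filenames follow the `ep<epoch>-ba<batch>-rank<rank>.pt` shape quoted above. The function names are hypothetical, and the folder path has to be supplied by the caller.

```python
import re
from pathlib import Path
from typing import List

# Matches names such as 'ep3-ba1458-rank0.pt'.
CKPT_PATTERN = re.compile(r"ep(\d+)-ba(\d+)-rank(\d+)\.pt")


def saved_epochs(folder: str) -> List[int]:
    """Return the sorted epoch numbers found among checkpoint files."""
    epochs = set()
    for path in Path(folder).glob("*.pt"):
        match = CKPT_PATTERN.fullmatch(path.name)
        if match:
            epochs.add(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(folder: str, interval: int, max_epochs: int) -> List[int]:
    """Return epochs expected at the given interval but absent on disk."""
    found = set(saved_epochs(folder))
    expected = range(interval, max_epochs + 1, interval)
    return [epoch for epoch in expected if epoch not in found]
```

Calling `missing_epochs(folder, 3, 30)` on the folder described here would be expected to list every multiple of 3 except 3 itself, which matches the reported symptom.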

@mvpatel2000
Contributor

Hm... do you get any errors or traces? Mind sharing a minimal repro please?

@naimavahab
Author

There are no errors, and everything else is working fine, including plotting the loss, etc.
I am using this git branch for MosaicBERT pretraining: https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples

@eracah
Contributor

eracah commented Jul 25, 2024

Mind sharing what version of composer you are using?
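One way to look this up from inside the container is via the standard library's package metadata. The sketch assumes Composer was installed from the `mosaicml` distribution on PyPI; if it was installed under a different distribution name, that name would need to be substituted. The helper `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # 'mosaicml' is assumed to be the distribution name for Composer.
    print(installed_version("mosaicml"))
```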

@naimavahab
Author

0.17.2

@eracah
Contributor

eracah commented Jul 25, 2024

That's a very old (>6 months old) version of Composer. Is there a reason you need to use such an old version?

@naimavahab
Author

I have been using this particular Docker image https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples to use FlashAttention, Triton, etc. This setup requires specific versions; if I update to the latest Composer, Triton fails.
But I have now managed by updating to a slightly higher version, like 0.19, and it works fine. I still wonder why Composer failed only at the checkpointing part in the previous 0.17.2 version.

@mvpatel2000
Contributor

Going to close as it seems to work.

I'm not super sure what the bug was, but it's certainly possible there was an issue in older versions. Definitely recommend upgrading to latest :)
