Cannot save checkpoint while train.py on single GPU #113

Closed
tiendung opened this issue Apr 8, 2023 · 3 comments · Fixed by #124

Comments

tiendung commented Apr 8, 2023

I got the following error in utils.py when saving a checkpoint during pre-training on a single GPU. Any hints on how I should fix it? Thanks.

state_dict = model._forward_module.state_dict()                            

NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet

lantiga (Collaborator) commented Apr 12, 2023

@awaelchli is this something related to the current set of fixes or should we configure FSDP differently?
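
For illustration, one way FSDP could be configured differently for the single-GPU case is to request CPU offload of the full state dict only when more than one process is involved. This is a minimal sketch, not necessarily what the linked fix does: the helper names `should_offload_to_cpu` and `full_state_dict` are hypothetical, and the assumption that a single process ends up with the `NO_SHARD` strategy is inferred from the error message above.

```python
def should_offload_to_cpu(world_size: int) -> bool:
    """Decide whether to offload the gathered state dict to CPU.

    Assumption: with a single process FSDP runs with NO_SHARD, and
    offload_to_cpu=True then raises the NotImplementedError reported above.
    """
    return world_size > 1


def full_state_dict(model, world_size: int):
    """Gather a full (unsharded) state dict from an FSDP-wrapped model."""
    # Imported lazily so the helper above stays usable without torch installed.
    from torch.distributed.fsdp import (
        FullStateDictConfig,
        FullyShardedDataParallel as FSDP,
        StateDictType,
    )

    cfg = FullStateDictConfig(
        offload_to_cpu=should_offload_to_cpu(world_size),
        rank0_only=True,  # only rank 0 materializes the full state dict
    )
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        return model.state_dict()
```

Without CPU offload the gathered weights stay on the GPU, so this trades the error for extra GPU memory use while saving.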

lantiga (Collaborator) commented Apr 12, 2023

@tiendung fix landed thanks to @awaelchli, can you give it a shot now?

awaelchli (Member) commented

FYI, the fix only gets rid of the error. Training on a single GPU still needs a lot of memory, and checkpointing can result in OOM. If you run into trouble, you can create a smaller model, for example by setting config.n_layers and config.n_embd.
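
To get a feel for how much those two settings matter, here is a rough back-of-the-envelope sketch. `TinyConfig` is a hypothetical stand-in for the real config (the field names follow the comment above, and the defaults are assumed 7B-style values), the `12 * n_embd**2` weights per block is a common approximation rather than an exact count, and 16 bytes per parameter assumes fp32 weights, gradients, and two Adam moments with activations ignored.

```python
from dataclasses import dataclass


@dataclass
class TinyConfig:
    """Hypothetical config; field names and defaults are assumptions."""
    n_layers: int = 32
    n_embd: int = 4096
    vocab_size: int = 32000


def approx_param_count(cfg: TinyConfig) -> int:
    # Roughly 12 * n_embd^2 weights per transformer block (attention + MLP),
    # plus the input embedding and the output head.
    per_block = 12 * cfg.n_embd ** 2
    embeddings = 2 * cfg.vocab_size * cfg.n_embd
    return cfg.n_layers * per_block + embeddings


def approx_training_gib(cfg: TinyConfig, bytes_per_param: int = 16) -> float:
    # Weights + gradients + optimizer state only; activations are not counted.
    return approx_param_count(cfg) * bytes_per_param / 2**30


if __name__ == "__main__":
    for cfg in (TinyConfig(), TinyConfig(n_layers=8, n_embd=512)):
        print(cfg, f"{approx_param_count(cfg):,} params",
              f"~{approx_training_gib(cfg):.1f} GiB")
```

Because the per-block term grows with the square of `n_embd`, halving the embedding width cuts that part of the model by about four times, while halving the layer count only halves it.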
