Cannot save checkpoint while train.py on single GPU #113

tiendung · 2023-04-08T08:50:27Z

I got the following error in utils.py when saving a checkpoint during pre-training on a single GPU. Any hints on how I should fix it? Thanks.

state_dict = model._forward_module.state_dict()                            

NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet

lantiga · 2023-04-12T07:02:39Z

@awaelchli is this something related to the current set of fixes or should we configure FSDP differently?

lantiga · 2023-04-12T11:36:25Z

@tiendung fix landed thanks to @awaelchli, can you give it a shot now?
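For anyone hitting the same error on an older checkout, here is a minimal sketch of one way to avoid it. It is not necessarily what the fix in #124 does. It assumes that FSDP falls back to NO_SHARD when only one process is running, so CPU offload is requested only when the world size is greater than one. The helper name `full_state_dict_kwargs` is made up for illustration; `FullStateDictConfig` and `FSDP.state_dict_type` are the PyTorch FSDP APIs it is meant to feed.

```python
def full_state_dict_kwargs(world_size: int) -> dict:
    """Keyword arguments intended for torch.distributed.fsdp.FullStateDictConfig.

    Hypothetical helper: request CPU offload only when the model is actually
    sharded across more than one process, because offload_to_cpu=True is
    rejected together with NO_SHARD (see the error above).
    """
    return {"offload_to_cpu": world_size > 1, "rank0_only": True}


# Intended use inside the checkpoint-saving code (requires PyTorch with FSDP;
# `fabric` and `model` are assumed to come from the training script):
#
#   from torch.distributed.fsdp import (
#       FullStateDictConfig,
#       FullyShardedDataParallel as FSDP,
#       StateDictType,
#   )
#   save_policy = FullStateDictConfig(**full_state_dict_kwargs(fabric.world_size))
#   with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
#       state_dict = model._forward_module.state_dict()

if __name__ == "__main__":
    print(full_state_dict_kwargs(1))  # single GPU: no CPU offload
    print(full_state_dict_kwargs(4))  # multi-GPU: offload the gathered state dict
```

With a single GPU the full state dict is materialized on the device, which is why memory can still be a problem while saving.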

awaelchli · 2023-04-12T11:54:39Z

FYI, the fix only gets rid of the error. Training on a single GPU still needs a lot of memory, and checkpointing can result in an out-of-memory (OOM) error. If you run into trouble, you can create a smaller model, for example by reducing config.n_layers and config.n_embd.
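To get a feel for how much those two settings matter, here is a rough back-of-the-envelope sketch. It rests on two assumptions that are not from this thread: the common rule of thumb of about 12 * n_embd^2 weights per transformer block (embeddings ignored), and about 16 bytes per parameter for fp32 weights, gradients, and the two Adam moments (activations excluded). The function names are made up for illustration.

```python
def approx_params(n_layers: int, n_embd: int) -> int:
    """Rough parameter count: ~12 * n_embd^2 per block, embeddings ignored."""
    return 12 * n_layers * n_embd ** 2


def approx_training_gib(n_params: int, bytes_per_param: int = 16) -> float:
    """Rough training memory in GiB, excluding activations.

    16 bytes/param assumes fp32 weights (4) + gradients (4) + Adam moments (8).
    """
    return n_params * bytes_per_param / 2**30


if __name__ == "__main__":
    for n_layers, n_embd in [(32, 4096), (12, 768)]:
        n = approx_params(n_layers, n_embd)
        print(f"n_layers={n_layers}, n_embd={n_embd}: "
              f"~{n / 1e6:.0f}M params, ~{approx_training_gib(n):.1f} GiB")
```

Because the estimate grows with the square of n_embd, halving n_embd cuts the estimate by about four times, while halving n_layers only cuts it in half.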

awaelchli mentioned this issue Apr 12, 2023

Enable checkpoint with FSDP on single device #124

Merged

lantiga closed this as completed in #124 Apr 12, 2023
