Is your feature request related to a problem? Please describe.
When a DeepSpeed model is initialised with an optimiser, the torch.nn.Module.to() functionality for moving the model between devices breaks: the optimiser holds references to the model parameters, so GPU memory is not freed when, for example, trying to move the model to CPU.
Describe the solution you'd like
Functionality similar to torch.nn.Module.to() that moves both the model and the optimiser between devices and de-allocates the previously occupied memory.
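In plain PyTorch (outside the DeepSpeed engine) the requested behaviour can be sketched by moving the optimiser's per-parameter state tensors alongside the module; the helper name model_and_optimizer_to below is hypothetical, not an existing DeepSpeed or PyTorch API:

```python
import torch

def model_and_optimizer_to(model, optimizer, device):
    # Hypothetical helper: move the module's parameters/buffers and the
    # optimiser's per-parameter state (e.g. momentum buffers) to `device`,
    # so no tensor on the old device stays referenced by the optimiser.
    model.to(device)
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)
    return model, optimizer
```

After such a call, torch.cuda.empty_cache() can return the now-unreferenced cached blocks to the driver; with the optimiser state left behind on the GPU, that memory would remain held.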
Describe alternatives you've considered
The alternative is to destroy the model instance and recreate it from a checkpoint, but this has a much higher time cost.