
[zero_to_fp32] 3x less cpu memory requirements #4025

Merged
merged 4 commits into from
Jul 25, 2023

Conversation

stas00
Collaborator

@stas00 stas00 commented Jul 24, 2023

As we have just discovered, converting a ZeRO-3 checkpoint of an 80B-param model requires more than 1TB of CPU memory, because the original script loads the complete state_dict of each optim file into memory.

Since we don't need the 2 optim states (2x4 bytes per param) but only the fp32 master weights (4 bytes per param), this PR discards the former immediately upon loading each shard. This reduces the peak memory requirement by 3x.
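The discard-on-load idea can be sketched as follows. The key names are modeled on DeepSpeed's ZeRO checkpoint layout but should be treated as illustrative, and plain lists stand in for the real torch tensors:

```python
# Sketch of the discard-on-load strategy. Key names mirror DeepSpeed's
# ZeRO checkpoint layout but are illustrative; plain lists stand in for
# the real torch tensors.
OPTIMIZER_STATE_DICT = "optimizer_state_dict"
FP32_FLAT_GROUPS = "fp32_flat_groups"

def extract_fp32_weights(shard_state_dict):
    """Pull out the fp32 master weights and drop everything else
    (e.g. Adam's exp_avg / exp_avg_sq moments) so the optimizer-state
    tensors become garbage before the next shard is loaded."""
    optim_sd = shard_state_dict[OPTIMIZER_STATE_DICT]
    fp32_groups = optim_sd[FP32_FLAT_GROUPS]
    shard_state_dict.clear()  # release references to the optim states
    return fp32_groups

# Toy shard: fp32 weights plus the two Adam moment buffers that get
# discarded (2x4 bytes dropped vs 4 bytes kept -> 3x reduction).
shard = {
    OPTIMIZER_STATE_DICT: {
        FP32_FLAT_GROUPS: [[0.0] * 4],
        "exp_avg": [[0.0] * 4],
        "exp_avg_sq": [[0.0] * 4],
    }
}
weights = extract_fp32_weights(shard)  # only the fp32 groups survive
```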

Unfortunately, it still takes a really long time to load each of these files, so the conversion remains slow, but it requires 3x less CPU memory. Possible solutions:

  1. If deepspeed switches to https://github.com/huggingface/safetensors/ it should be possible to load only the wanted parts from each shard. (fwiw, we are gradually moving all transformers models on the hub to safetensors)
  2. Alternatively, the optim states and fp32 master weights could live in 2 separate files, in which case it would be very fast to load just the fp32 weight files. Filed as a feature request: [REQUEST] split zero3 checkpoint files into optim states and master weights #4029. This is probably low-hanging fruit compared to solution (1).

@tjruwase

@tjruwase tjruwase added this pull request to the merge queue Jul 25, 2023
Merged via the queue into microsoft:master with commit 1cc9caa Jul 25, 2023
16 checks passed
@stas00 stas00 deleted the patch-4 branch July 25, 2023 16:03