
[zero_to_fp32] 3x less cpu memory requirements #4025

Merged
merged 4 commits into from
Jul 25, 2023

Conversation

stas00
Collaborator

@stas00 stas00 commented Jul 24, 2023

As we have just discovered, converting a ZeRO-3 checkpoint of an 80B-param model requires more than 1TB of CPU memory, because the original script loads the complete state_dict of each optim file into memory.

Since we don't need the 2 optim states (2x4 bytes per param) but only the fp32 master weights (4 bytes per param), this PR discards the former immediately upon loading each shard. This reduces the peak memory requirement by 3x.
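The discard-on-load idea can be sketched as follows. The key names are modeled on DeepSpeed's ZeRO checkpoint layout but should be treated as illustrative, and plain lists stand in for the real torch tensors:

```python
# Sketch of the discard-on-load strategy. Key names mirror DeepSpeed's
# ZeRO checkpoint layout but are illustrative; plain lists stand in for
# the real torch tensors.
OPTIMIZER_STATE_DICT = "optimizer_state_dict"
FP32_FLAT_GROUPS = "fp32_flat_groups"

def extract_fp32_weights(shard_state_dict):
    """Pull out the fp32 master weights and drop everything else
    (e.g. Adam's exp_avg / exp_avg_sq moments) so the optimizer-state
    tensors become garbage before the next shard is loaded."""
    optim_sd = shard_state_dict[OPTIMIZER_STATE_DICT]
    fp32_groups = optim_sd[FP32_FLAT_GROUPS]
    shard_state_dict.clear()  # release references to the optim states
    return fp32_groups

# Toy shard: fp32 weights plus the two Adam moment buffers that get
# discarded (2x4 bytes dropped vs 4 bytes kept -> 3x reduction).
shard = {
    OPTIMIZER_STATE_DICT: {
        FP32_FLAT_GROUPS: [[0.0] * 4],
        "exp_avg": [[0.0] * 4],
        "exp_avg_sq": [[0.0] * 4],
    }
}
weights = extract_fp32_weights(shard)  # only the fp32 groups survive
```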

Unfortunately, it still takes a really long time to load each of these files, so the conversion remains slow, but it requires 3x less CPU memory. Possible solutions:

  1. If deepspeed switches to https://github.com/huggingface/safetensors/ it should be possible to load only the wanted parts from each shard. (fwiw, we are gradually moving all transformers models on the hub to safetensors)
  2. Alternatively, the optim states and fp32 master weights could live in 2 separate files, in which case it would be very fast to load just the fp32 weight files. Filed as a feature request: [REQUEST] split zero3 checkpoint files into optim states and master weights #4029. This is probably low-hanging fruit compared to solution (1).

@tjruwase

@tjruwase tjruwase added this pull request to the merge queue Jul 25, 2023
Merged via the queue into microsoft:master with commit 1cc9caa Jul 25, 2023
16 checks passed
@stas00 stas00 deleted the patch-4 branch July 25, 2023 16:03