
[BUG] load_checkpoint should load directly to gpu #1971

Open

stas00 opened this issue May 21, 2022 · 3 comments
Labels: bug (Something isn't working), training


stas00 commented May 21, 2022

Describe the bug

Currently, HF Transformers integration users can finetune a model and save the checkpoint with a given set of resources. However, resuming from that same checkpoint requires much more peak CPU memory, which can be huge for large models and prevents users from resuming their finetuning. (The current workaround is to add a huge swap file.)

To Reproduce

I reproduced it as part of this bug report: huggingface/transformers#17258

The full reproduction steps are here: huggingface/transformers#17258 (comment)

I also verified that torch.load doesn't load everything into CPU memory when map_location="cpu": huggingface/transformers#17258 (comment)

and I tracked the issue down to deepspeed loading those potentially huge ZeRO checkpoints (70GB for gpt-j-6) into CPU memory first:

```python
_state = torch.load(ckpt_name, map_location='cpu')
```

Expected behavior

save_checkpoint and load_checkpoint should require approximately the same amount of memory. Loading should be lean and not need any CPU memory beyond the size of the largest param or optimizer state, since torch.load copies params via the CPU anyway.

With upcoming models like 176B, the current implementation just won't work, as it would require several TBs of CPU memory to load a ZeRO checkpoint.
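
For illustration only, a minimal sketch of what loading straight onto each rank's GPU could look like; the checkpoint path and variable names here are hypothetical, not deepspeed's actual code:

```python
import os
import torch

# Hypothetical per-rank checkpoint path, just for illustration
ckpt_name = "checkpoints/global_step1000/zero_pp_rank_0_mp_rank_00_optim_states.pt"

# Load straight onto this rank's GPU instead of staging the whole
# shard in CPU memory with map_location='cpu'
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
state = torch.load(ckpt_name, map_location=f"cuda:{local_rank}")
```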

@tjruwase, @jeffra


stas00 commented Jun 9, 2022

As this problem is recurrent for HF Transformers users, in the meantime I shared a hack to stagger checkpoint loading for those who need it here:
huggingface/transformers#17534 (comment)

If you're not using the HF Trainer, you can patch deepspeed's load_checkpoint directly using similar code; you just need the rank number the deepspeed way there, or you can get it from int(os.environ.get("LOCAL_RANK", "0")). A rough sketch of the staggering idea is shown below.
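
The sketch below is a hypothetical standalone function illustrating the staggering idea, not the actual patch from the linked comment:

```python
import os
import torch
import torch.distributed as dist

def staggered_torch_load(ckpt_name):
    """Sketch of the staggering idea only: load the checkpoint one local
    rank at a time so the per-node CPU memory isn't hit by all processes
    at once. Assumes the default process group is already initialized."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    state = None
    for turn in range(torch.cuda.device_count()):  # processes per node
        if turn == local_rank:
            state = torch.load(ckpt_name, map_location="cpu")
        dist.barrier()  # wait for the current rank before the next one loads
    return state
```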

Much later edit: this idea actually doesn't work because of the barrier calls, so staggering is not possible, since the first process won't free up its CPU memory until all the other processes have loaded the checkpoint.

@desperadoola

Any update?

I followed the suggestion here to make a large swapfile, but the loading takes forever...

@desperadoola


Changing 'pin_memory' to False and following #3629 solved the problem. Now we can resume training from a FALCON-40B checkpoint with 1TB of CPU memory.
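
For anyone else hitting this: the pin_memory switch referred to above lives in the ZeRO offload sections of the DeepSpeed config. A minimal fragment (shown as a Python dict; the values are just an example, not the poster's exact config):

```python
# Fragment of a DeepSpeed config (as a dict, e.g. for passing to the
# HF Trainer or deepspeed.initialize); only the relevant keys are shown.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": False},
        "offload_optimizer": {"device": "cpu", "pin_memory": False},
    },
}
```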
