add util for ram efficient loading of model when using fsdp #25107

Merged
merged 16 commits into main from smangrul/fsdp_cpu_ram_efficient_model_loading_util on Aug 17, 2023

Conversation

pacman100
Contributor

@pacman100 pacman100 commented Jul 26, 2023

What does this PR do?

  1. Fixes the issue explained in pytorch/pytorch#105840 ([FSDP] FSDP doesn't work (random accuracy performance) when using param_init_fn and sync_module_states=True) when using FSDP to train very large models. Should be merged after accelerate#1777 (support for RAM-efficient loading of the model with FSDP).

Currently, when using FSDP, each of the N processes loads the model in full on CPU, leading to huge CPU RAM usage. When training a model like Falcon-40B with FSDP on a DGX node with 8 GPUs, each process loads 160 GB (40B parameters x 4 bytes (FP32)) into CPU RAM, for a total requirement of 160 GB x 8 = 1280 GB, and the script gets killed because it runs out of CPU RAM.

To combat this, we load the model only on rank 0 and keep it on the meta device when rank != 0. We then use a no-op param_init_fn together with sync_module_states=True so that FSDP properly initializes the weights on the other ranks and broadcasts the parameters from rank 0 to the other ranks.
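
A minimal sketch of that pattern (illustrative only, not the exact code added in this PR; the Falcon checkpoint name is just an example, and the to_empty-based param_init_fn is a generic way to express the no-op init):

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoConfig, AutoModelForCausalLM

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

model_name = "tiiuae/falcon-40b"  # example checkpoint
if rank == 0:
    # Only rank 0 materializes the real weights in CPU RAM.
    model = AutoModelForCausalLM.from_pretrained(model_name)
else:
    # Other ranks build the architecture on the meta device: no memory is allocated.
    config = AutoConfig.from_pretrained(model_name)
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config)

model = FSDP(
    model,
    # Broadcast rank 0's parameters and buffers to all other ranks at wrap time.
    sync_module_states=True,
    # No-op init on non-zero ranks: materialize meta tensors as empty storage
    # that the sync_module_states broadcast then overwrites.
    param_init_fn=(lambda module: module.to_empty(device=torch.device("cuda", torch.cuda.current_device())))
    if rank != 0
    else None,
    device_id=torch.cuda.current_device(),
)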

Usage: no user-facing changes.

Output after this PR:

accelerator.process_index=0 GPU Memory before entering the loading : 0
accelerator.process_index=0 GPU Memory consumed at the end of the loading (end-begin): 0
accelerator.process_index=0 GPU Peak Memory consumed during the loading (max-begin): 0
accelerator.process_index=0 GPU Total Peak Memory consumed during the loading (max): 0
accelerator.process_index=0 CPU Memory before entering the loading : 926
accelerator.process_index=0 CPU Memory consumed at the end of the loading (end-begin): 26415
accelerator.process_index=0 CPU Peak Memory consumed during the loading (max-begin): 31818
accelerator.process_index=0 CPU Total Peak Memory consumed during the loading (max): 32744
accelerator.process_index=0 model.lm_head.weight=Parameter containing:
tensor([[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
        [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
        [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
        ...,
        [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
        [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
        [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]],
       requires_grad=True)
accelerator.process_index=1 GPU Memory before entering the loading : 0
accelerator.process_index=1 GPU Memory consumed at the end of the loading (end-begin): 0
accelerator.process_index=1 GPU Peak Memory consumed during the loading (max-begin): 0
accelerator.process_index=1 GPU Total Peak Memory consumed during the loading (max): 0
accelerator.process_index=1 CPU Memory before entering the loading : 933
accelerator.process_index=1 CPU Memory consumed at the end of the loading (end-begin): 10
accelerator.process_index=1 CPU Peak Memory consumed during the loading (max-begin): 573
accelerator.process_index=1 CPU Total Peak Memory consumed during the loading (max): 1506
accelerator.process_index=1 model.lm_head.weight=Parameter containing:
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], requires_grad=True)
accelerator.process_index=0 GPU Memory before entering the prepare : 0
accelerator.process_index=0 GPU Memory consumed at the end of the prepare (end-begin): 13202
accelerator.process_index=0 GPU Peak Memory consumed during the prepare (max-begin): 15458
accelerator.process_index=0 GPU Total Peak Memory consumed during the prepare (max): 15458
accelerator.process_index=0 CPU Memory before entering the prepare : 27345
accelerator.process_index=0 CPU Memory consumed at the end of the prepare (end-begin): -26394
accelerator.process_index=0 CPU Peak Memory consumed during the prepare (max-begin): 0
accelerator.process_index=0 CPU Total Peak Memory consumed during the prepare (max): 27345
FullyShardedDataParallel(
  (_fsdp_wrapped_module): RWForCausalLM(
    (transformer): RWModel(
      (word_embeddings): Embedding(65024, 4544)
      (h): ModuleList(
        (0-31): 32 x FullyShardedDataParallel(
          (_fsdp_wrapped_module): DecoderLayer(
            (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
            (self_attention): Attention(
              (maybe_rotary): RotaryEmbedding()
              (query_key_value): Linear(in_features=4544, out_features=4672, bias=False)
              (dense): Linear(in_features=4544, out_features=4544, bias=False)
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
            (mlp): MLP(
              (dense_h_to_4h): Linear(in_features=4544, out_features=18176, bias=False)
              (act): GELU(approximate='none')
              (dense_4h_to_h): Linear(in_features=18176, out_features=4544, bias=False)
            )
          )
        )
      )
      (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
  )
)
accelerator.process_index=1 GPU Memory before entering the prepare : 0
accelerator.process_index=1 GPU Memory consumed at the end of the prepare (end-begin): 13202
accelerator.process_index=1 GPU Peak Memory consumed during the prepare (max-begin): 15458
accelerator.process_index=1 GPU Total Peak Memory consumed during the prepare (max): 15458
accelerator.process_index=1 CPU Memory before entering the prepare : 945
accelerator.process_index=1 CPU Memory consumed at the end of the prepare (end-begin): 4
accelerator.process_index=1 CPU Peak Memory consumed during the prepare (max-begin): 4
accelerator.process_index=1 CPU Total Peak Memory consumed during the prepare (max): 949
accelerator.process_index=1 model.lm_head.weight=Parameter containing:
tensor([[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
        [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
        [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
        ...,
        [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
        [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
        [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]],
       device='cuda:1', requires_grad=True)
accelerator.process_index=0 model.lm_head.weight=Parameter containing:
tensor([[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
        [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
        [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
        ...,
        [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
        [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
        [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]],
       device='cuda:0', requires_grad=True)

So you can see that during loading, rank 1 doesn't consume any additional CPU RAM, and the performance of both setups matches.
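
For context, per-rank numbers like the ones above come from a memory-tracing helper; a rough equivalent can be put together with psutil for CPU RSS and torch.cuda peak stats (a sketch, not the exact tracing code used for these logs):

import os

import psutil
import torch

def log_memory(stage: str) -> None:
    # CPU resident set size of this process, in MiB.
    cpu_mib = psutil.Process(os.getpid()).memory_info().rss // 2**20
    # Peak GPU memory allocated by this process since the last reset, in MiB.
    gpu_mib = torch.cuda.max_memory_allocated() // 2**20 if torch.cuda.is_available() else 0
    rank = int(os.environ.get("RANK", "0"))
    print(f"process_index={rank} {stage}: CPU={cpu_mib} MiB, GPU peak={gpu_mib} MiB")

# Example usage around model loading:
# torch.cuda.reset_peak_memory_stats()
# log_memory("before loading")
# model = AutoModelForCausalLM.from_pretrained(...)
# log_memory("after loading")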

To Do:

  • Add docs in the FSDP section

@pacman100 pacman100 requested review from sgugger and removed request for sgugger July 26, 2023 10:50
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jul 26, 2023

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment

As said internally, I would prefer for this to be done automatically in from_pretrained when FSDP is detected, with the options that make sense. For instance, we have several tests for DeepSpeed ZeRO-3 in from_pretrained, one loading the state dict only on the main process.

Collaborator

@sgugger sgugger left a comment

Looks better, thanks!

src/transformers/trainer.py (outdated review thread, resolved)
src/transformers/training_args.py (outdated review thread, resolved)
src/transformers/trainer_pt_utils.py (outdated review thread, resolved)
Collaborator

@sgugger sgugger left a comment

LGTM, just the unresolved comment on why the prefetch setting is removed in the trainer file.

@lwmlyy

lwmlyy commented Aug 15, 2023

@pacman100 Hi, I've tried to run Llama 2 with the two PRs but it seems something went wrong. Please check, thanks!

While copying the parameter named "model.layers.29.self_attn.v_proj.weight", whose dimensions in the model are torch.Size([4096, 4096]) and whose dimensions in the checkpoint are torch.Size([4096, 4096]), an exception occurred: Cannot copy out of meta tensor; no data!
Exception raised from copy_impl at ../aten/src/ATen/native/Copy.cpp:188 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) (libc10.so)
frame #2: at::native::copy_(at::Tensor&, at::Tensor const&, bool) (libtorch_cpu.so)
frame #3-#7: at::ops::copy::redispatch / at::ops::copy::call (libtorch_cpu.so)
frame #8-#50: CPython interpreter frames (_PyEval_EvalFrameDefault, _PyFunction_Vectorcall, ...) down to __libc_start_main

@pacman100
Contributor Author

Hello @lwmlyy, I'm able to run 70B Llama on 32 A100 80GB GPUs with it without any issues. Can you share the config, a minimal example, and the launch command?

@lwmlyy

lwmlyy commented Aug 15, 2023

@pacman100 I run the code in meta-llama/llama-recipes#77 with the following command:
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --pure_bf16 --model_name ../Llama-2-7b-hf --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder ../Llama-2-7b-hf --dist_checkpoint_folder fine-tuned

Could you also share the script for running Llama-70b as you mentioned?

@pacman100
Contributor Author

Hello @lwmlyy, follow this: meta-llama/llama-recipes#77 (comment)

@lwmlyy

lwmlyy commented Aug 15, 2023

@pacman100 As you mentioned, if the model is loaded with Accelerate, no code change is needed. I wonder why the error shows up. Could you give some advice?

@pacman100
Contributor Author

pacman100 commented Aug 15, 2023

@pacman100 I run the code in meta-llama/llama-recipes#77 with the following command:
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --pure_bf16 --model_name ../Llama-2-7b-hf --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder ../Llama-2-7b-hf --dist_checkpoint_folder fine-tuned

Could you also share the script for running Llama-70b as you mentioned?

Hello, you aren't launching via accelerate launch (you are using torchrun), so the environment variable ACCELERATE_USE_FSDP isn't set.

@lwmlyy

lwmlyy commented Aug 15, 2023

@pacman100 I run the code in facebookresearch/llama-recipes#77 with the following command:
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --pure_bf16 --model_name ../Llama-2-7b-hf --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder ../Llama-2-7b-hf --dist_checkpoint_folder fine-tuned
Could you also share the script for running Llama-70b as you mentioned?

Hello, you aren't launching via accelerate launch (you are using torchrun), so the environment variable ACCELERATE_USE_FSDP isn't set.

@pacman100 Hi, I get the same error when using the following command:
accelerate launch llama_finetuning.py --enable_fsdp --model_name ../Llama-2-7b-hf --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder ../Llama-2-7b-hf --dist_checkpoint_folder fine-tuned

It works fine with the command:
ACCELERATE_USE_FSDP=True accelerate launch llama_finetuning.py --enable_fsdp --model_name ../Llama-2-7b-hf --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder ../Llama-2-7b-hf --dist_checkpoint_folder fine-tuned

But the loss is NaN.

@pacman100
Contributor Author

Hello, you aren't using the Accelerate integration of FSDP, and you are mixing in the llama-recipes implementation, which doesn't use Accelerate. Please refer to the Accelerate docs for the proper way to use FSDP with Accelerate. Also, please raise a separate issue.

Collaborator

@sgugger sgugger left a comment

Thanks for iterating!

@pacman100 pacman100 merged commit c4c0cef into main Aug 17, 2023
21 checks passed
@pacman100 pacman100 deleted the smangrul/fsdp_cpu_ram_efficient_model_loading_util branch August 17, 2023 16:23
@nabarunbaruaAIML

Hi @pacman100 ,

I am trying to train the Llama 70B model with FSDP. I was going through your repo https://github.com/pacman100/ram_efficient_fsdp/blob/main/train.py, and the code fails when importing the function load_pretrained_model_only_on_rank0, with the error "ImportError: cannot import name 'load_pretrained_model_only_on_rank0' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)". I checked for this function in the transformers repo but couldn't find it in the main branch.

Can you please help me run your code?

Regards
Nabarun Barua

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
…ace#25107)

* add util for ram efficient loading of model when using fsdp

* make fix-copies

* fixes 😅

* docs

* making it further easier to use

* rename the function

* refactor to handle fsdp ram efficiency in `from_pretrained`

* fixes

* fixes

* fixes

* update

* fixes

* revert `load_pretrained_model_only_on_rank0`

* resolve `load_from_checkpoint`
@pkaercher

pkaercher commented Feb 6, 2024

I'm currently using transformers v4.37.2 and accelerate v0.26.1 and am training on one machine with 2 GPUs. I'm seeing the Mistral 7B model being loaded into CPU RAM twice (once for each process). I don't understand why, since this fix was released with earlier versions (transformers v4.32.0 and accelerate v0.22.0) and should load the model into CPU RAM only once, independent of the number of processes. Any insight anyone has is super appreciated!

These are the settings in my fsdp config file:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@pacman100
Contributor Author

Hello @pkaercher,

To use this feature, you need to use an Accelerate config for FSDP along with the Accelerate launcher. For more details on how to do this, please refer to https://huggingface.co/docs/transformers/trainer#accelerate-and-trainer

@pkaercher

pkaercher commented Feb 8, 2024

Hi @pacman100,
Thanks for your reply. I am using an Accelerate config along with FSDP (see my config file in my post above, which I created with accelerate config --config_file "fsdp_config.yaml"). I am running my script with the command accelerate launch --config_file fsdp_config.yaml domain_adapt.py. Attached is my domain_adapt.py script. When I run it, I see the CPU RAM go up to 65 GB for the 7B Mistral model, which is twice as much space as it should take up given 7 billion x 4 bytes = 28 GB. Twice that (loading the model once for each of the 2 GPU processes I'm using) is 56 GB, which, plus the space taken up by my environment packages and my dataset, is roughly 65 GB. This makes me think that Accelerate is loading the Mistral model into my CPU RAM twice, which it shouldn't according to this fix.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model 
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Define training parameters

MODEL_CHECKPOINT = 'mistralai/Mistral-7B-Instruct-v0.2'
DATASET_PATH = '/appdata/data'
# SAVE_TO_PATH = '/appdata/embedding_models'
MASK_WHOLE_WORDS = False


ARGS = {
        'lora_alpha': 16,
        'lora_dropout': 0.1,
        'lora_r': 64,
        'output_dir': '/appdata/results',
        'per_device_train_batch_size': 1,
        'per_device_eval_batch_size': 1,
        'gradient_accumulation_steps': 16, 
        'optim': "paged_adamw_32bit",
        'evaluation_strategy': 'steps', # "epoch", default is 'no'
        'save_steps': 50, # defaults to 500
        'logging_steps': 50, # defaults to 500
        'num_train_epochs': 4,  # default is 3
        'learning_rate': 1e-4,
        'max_grad_norm': 0.3, # default is 1
        'max_steps': 500,  # training will only run to this number of steps; overrides num_train_epochs
        'warmup_ratio': 0.03,
        'lr_scheduler_type': "cosine",  # default is "linear"
        }


# Define functions

def run_domain_adaptation(model_checkpoint, dataset_path, args):  # training_dataset_name, mask_whole_words
    # Import model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

    # Import and tokenize the data
    data = load_dataset(dataset_path , split='train')
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    tokenizer.mask_token = '<MASK>'


    def tokenize_function(examples, tokenizer=tokenizer):
        """examples is a dataset object"""
        result = tokenizer(examples["text"])
        return result


    tokenized_datasets = data.map(tokenize_function, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(mlm=True, mlm_probability=0.15, tokenizer=tokenizer)

    peft_config = LoraConfig(
                            lora_alpha=args['lora_alpha'],
                            lora_dropout=args['lora_dropout'],
                            r=args['lora_r'],
                            bias="none",
                            task_type="CAUSAL_LM",
                            target_modules=[
                                "Wqkv",
                                "out_proj",
                                "up_proj",
                                "down_proj",
                            ])

    training_arguments = TrainingArguments(
                                        # output_dir=args['model_output_dir'],
                                        output_dir=args['output_dir'],
                                        per_device_train_batch_size=args['per_device_train_batch_size'],
                                        per_device_eval_batch_size=args['per_device_eval_batch_size'],
                                        gradient_accumulation_steps=args['gradient_accumulation_steps'],
                                        logging_steps=args['logging_steps'],
                                        save_strategy= 'epoch',
                                        # evaluation_strategy=args['evaluation_strategy'],
                                        num_train_epochs=args['num_train_epochs'],
                                        learning_rate=args['learning_rate'],
                                        bf16=True,
                                        # fsdp='full_shard',
                                        max_grad_norm=args['max_grad_norm'],
                                        warmup_ratio=args['warmup_ratio'],
                                        group_by_length=True,
                                        report_to='none',
                                        log_level='debug',
                                        )

    # Train
    model = get_peft_model(model, peft_config)
    trainer = Trainer(
                      model=model,
                      tokenizer=tokenizer,
                      data_collator=collator,
                      train_dataset=tokenized_datasets,
                    #   eval_dataset=tokenized_datasets['validation'],
                      args=training_arguments,
                     )

    trainer.train()

    # Save the model and tokenizer
    model_name = f"{model_checkpoint.split('/')[1]}"  # _{training_dataset_name}"
    trainer.save_model(fr"../embedding_models/{model_name}")
    tokenizer.save_pretrained(fr"../embedding_models/{model_name}")
    # trainer.save_model(SAVE_TO_PATH)

if __name__ == '__main__':
    run_domain_adaptation(model_checkpoint=MODEL_CHECKPOINT,
                          dataset_path=DATASET_PATH,
                          # training_dataset_name=TRAINING_DATASET_NAME,
                          args=ARGS)

@pacman100
Contributor Author

Hello @pkaercher,

Thank you for the above code, this helps. You are calling the from_pretrained method before initializing the distributed process group, so from_pretrained has no way of knowing whether a distributed training run is in place and which process is rank 0. For this to work when using Trainer, please create the TrainingArguments instance before calling from_pretrained, because creating TrainingArguments initializes the distributed process group.
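
Applied to the script above, the fix amounts to a reordering; a sketch (the checkpoint name and output_dir mirror the user's script, the other TrainingArguments are omitted for brevity):

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Creating TrainingArguments first initializes the distributed process group,
# so the subsequent from_pretrained call knows which process is (local) rank 0
# and can keep the other ranks on the meta device when FSDP's CPU-RAM-efficient
# loading is enabled.
training_arguments = TrainingArguments(output_dir="/appdata/results", bf16=True)

model_checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)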

Updating the docs here huggingface/accelerate#2430 with this information.

@pkaercher

Thank you @pacman100 ! I did as you suggested and saw the max CPU RAM usage during loading of the model drop from 65.2 GB to 47.2 GB, so it looks like it's working now.

@fabianlim
Contributor

fabianlim commented May 28, 2024

@pacman100 For quantized models, the meta-device creation seems to be skipped even for non-zero ranks; is my understanding correct?

new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
    model_to_load,
    state_dict,
    loaded_keys,
    start_prefix,
    expected_keys,
    device_map=device_map,
    offload_folder=offload_folder,
    offload_index=offload_index,
    state_dict_folder=state_dict_folder,
    state_dict_index=state_dict_index,
    dtype=dtype,
    hf_quantizer=hf_quantizer,
    is_safetensors=is_safetensors,
    keep_in_fp32_modules=keep_in_fp32_modules,
    unexpected_keys=unexpected_keys,
)
  1. This means that while quantized models are typically smaller than full precision, there is no real low_cpu_mem feature for quantized models, and quantized models are always loaded in duplicate across ranks?
  2. But even for the non-zero ranks, they seem to be moved to "cpu", so wouldn't they occupy the same amount of CPU memory as if they had been loaded from the checkpoint directly onto "cpu"?
if is_fsdp_enabled() and not is_local_dist_rank_0() and not is_quantized:
    for key, param in model_to_load.state_dict().items():
        if param.device == torch.device("meta"):
            set_module_tensor_to_device(
                model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
            )

@Neo9061

Neo9061 commented Jun 25, 2024

@pacman100 For quantized models, the meta-device creation seems to be skipped even for non-zero ranks; is my understanding correct?

new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
    model_to_load,
    state_dict,
    loaded_keys,
    start_prefix,
    expected_keys,
    device_map=device_map,
    offload_folder=offload_folder,
    offload_index=offload_index,
    state_dict_folder=state_dict_folder,
    state_dict_index=state_dict_index,
    dtype=dtype,
    hf_quantizer=hf_quantizer,
    is_safetensors=is_safetensors,
    keep_in_fp32_modules=keep_in_fp32_modules,
    unexpected_keys=unexpected_keys,
)
  1. This means that while quantized models are typically smaller than full precision, there is no real low_cpu_mem feature for quantized models, and quantized models are always loaded in duplicate across ranks?
  2. But even for the non-zero ranks, they seem to be moved to "cpu", so wouldn't they occupy the same amount of CPU memory as if they had been loaded from the checkpoint directly onto "cpu"?
if is_fsdp_enabled() and not is_local_dist_rank_0() and not is_quantized:
    for key, param in model_to_load.state_dict().items():
        if param.device == torch.device("meta"):
            set_module_tensor_to_device(
                model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
            )

Hi @ArthurZucker @pacman100 @sgugger, could you please help answer the question above?

It seems to be a problem when loading a very large model (e.g. 300B) with 4-bit quantization. I have a similar issue documented in #31577

In particular, regarding this line:

# For quantized modules with FSDP/DeepSpeed Stage 3, we need to quantize the parameter on the GPU

does the quantization algorithm quantize the model weights across all the GPUs, or on a single GPU with offloading to CPU? It seems to quantize on a single GPU, and thus it runs out of memory for a very large model.

On the other hand, using @philschmid's blog post and code, I am able to load and train two models:

  1. llama-3 70B on 4 instances of g5.24xlarge, and
  2. mixtral 8x22 on 4 instances of g5.8xlarge (failed at model merging stage).

Neither g5.12xlarge (4 GPUs with 96 GB in total, and 192 GB CPU) nor g5.16xlarge (1 GPU with 24 GB, and 256 GB CPU) has enough GPU memory on a single GPU to load the 4-bit quantized model, so I suspect you are offloading to CPU memory rather than using a single GPU (rank 0) to load the entire quantized model. But then that does not explain why the Grok-1 model (HF format) fails at the loading stage with 4-bit quantization.

@amyeroberts
Collaborator

cc @SunMarc

@Neo9061

Neo9061 commented Jun 27, 2024

Hi @SunMarc @amyeroberts, any chance you could take a look at my questions and share insights? Many thanks!
