Runtime errors when trying to call Trainer() on a model that exceeds GPU vRAM #41013

@ag-TJNII

System Info

  • transformers version: 4.56.1
  • Platform: Linux-6.12.43+deb13-amd64-x86_64-with-glibc2.41
  • Python version: 3.13.5
  • Huggingface_hub version: 0.35.0
  • Safetensors version: 0.6.2
  • Accelerate version: 1.10.1
  • Accelerate config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 5060 Ti

Who can help?

@zach-huggingface @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This is based on the quickstart instructions, but is not an officially published script.

Models referenced are git clones of the models named in the commands below (DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-14B).

#!/usr/bin/env python3 
# Simplification of the example at https://huggingface.co/docs/transformers/quicktour 

import os
import argparse
from pathlib import Path

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

def main():
    parser = argparse.ArgumentParser(description="Fine-tune a LLM on HTML files using LoRA.")
    parser.add_argument("--model", type=Path, required=True, help="Directory containing pretrained HuggingFace model")
    args = parser.parse_args()

    model = AutoModelForCausalLM.from_pretrained(args.model, dtype="auto", device_map="auto")
    training_args = TrainingArguments(
        output_dir="/tmp/spool",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    )

    Trainer(
        model=model,
        args=training_args,
    )

    print("Success")

if __name__ == "__main__":
    main()

$ ./trainer_example.py --model /host/models/DeepSeek-R1-Distill-Qwen-1.5B
Success
$ ./trainer_example.py --model /host/models/DeepSeek-R1-Distill-Qwen-14B 
Loading checkpoint shards: <Snip TUI status bar>
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "/host/trainer/build_files/app/./trainer_example.py", line 30, in <module>
    main()
    ~~~~^^
  File "/host/trainer/build_files/app/./trainer_example.py", line 22, in main
    Trainer(
    ~~~~~~~^
        model=model,
        ^^^^^^^^^^^^
        args=training_args,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.13/dist-packages/transformers/trainer.py", line 620, in __init__
    self._move_model_to_device(model, args.device)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/dist-packages/transformers/trainer.py", line 913, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.13/dist-packages/accelerate/big_modeling.py", line 462, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

Running the script via accelerate launch did not change the error.
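For anyone reproducing this, a rough way to confirm that offloading is what trips the move is to look at the device map the loader produced. The sketch below is only an illustration: the helper name offloaded_modules and the example map are hypothetical, and it assumes the map's values are GPU indices or the strings "cpu" and "disk".

```python
from typing import Mapping, Union

# Device-map values that indicate a module is not resident on an accelerator.
OFFLOAD_TARGETS = {"cpu", "disk"}


def offloaded_modules(device_map: Mapping[str, Union[int, str]]) -> list[str]:
    """Return the module names that a device map places on CPU or disk."""
    return [
        name
        for name, device in device_map.items()
        if str(device) in OFFLOAD_TARGETS
    ]


# Hypothetical device map, shaped like what device_map="auto" might produce
# when a model does not fit in GPU memory. The module names are made up.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.40": "cpu",
    "lm_head": "disk",
}

print(offloaded_modules(example_map))
```

In the script above, this could be called as offloaded_modules(getattr(model, "hf_device_map", {})) right after from_pretrained, assuming the loaded model exposes its device map under that attribute. A non-empty result would match the "offloaded to the cpu" warning in the output.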

Expected behavior

The exception concerns an internal action taken by the library. As a user of the library, I'm not sure what action I need to take to resolve it or what config I need to change. Ideally, the code attempting the move would catch the exception and mitigate it, or bubble it up in a way that lets the user identify which inputs are causing the error.

From reading the docs, offloading should allow this to work. There's still a high probability of user error here, but I can't suss it out from the exceptions I'm getting back, and that's an issue. If this is user error, then I think the bug is in how the error is presented to the top-level code.
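To make the request concrete, here is a minimal sketch of what a more actionable error could look like from the caller's side. It is not how transformers or accelerate behave today. The wrapper name construct_with_hint, its parameters, and the wording of the message are all hypothetical, and it assumes the device map uses "cpu" and "disk" for offloaded modules.

```python
from typing import Callable, Mapping, TypeVar, Union

T = TypeVar("T")


def construct_with_hint(
    factory: Callable[[], T],
    device_map: Mapping[str, Union[int, str]],
) -> T:
    """Call factory(); if it fails on an offloaded model, re-raise with context."""
    try:
        return factory()
    except RuntimeError as exc:
        # Hypothetical check: collect modules the device map put on CPU or disk.
        offloaded = sorted(
            name for name, device in device_map.items()
            if str(device) in {"cpu", "disk"}
        )
        # Leave unrelated RuntimeErrors untouched.
        if not offloaded or "offloaded" not in str(exc):
            raise
        raise RuntimeError(
            f"{exc} The device map offloads {len(offloaded)} module(s), "
            f"e.g. {offloaded[0]!r}; this likely comes from loading with "
            "device_map='auto' on a model larger than GPU memory."
        ) from exc
```

In the reproduction script, the factory would be something like lambda: Trainer(model=model, args=training_args). The original exception is kept as the cause, so the full traceback is still available.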

Thank you for your work here. I was able to train a small model without needing a deep understanding of what was going on. That's really cool!
