Runtime errors when trying to call Trainer() on a model that exceeds GPU VRAM #41013

### System Info

- `transformers` version: 4.56.1
- Platform: Linux-6.12.43+deb13-amd64-x86_64-with-glibc2.41
- Python version: 3.13.5
- Huggingface_hub version: 0.35.0
- Safetensors version: 0.6.2
- Accelerate version: 1.10.1
- Accelerate config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: Yes
- GPU type: NVIDIA GeForce RTX 5060 Ti





### Who can help?

@zach-huggingface @SunMarc

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

This is based on the quickstart instructions, but is not an officially published script.

The models referenced are local `git clone`s of the following repositories:
- https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

```python
#!/usr/bin/env python3
# Simplification of the example at https://huggingface.co/docs/transformers/quicktour

import os
import argparse
from pathlib import Path

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

def main():
    parser = argparse.ArgumentParser(description="Fine-tune an LLM on HTML files using LoRA.")
    parser.add_argument("--model", type=Path, required=True, help="Directory containing pretrained HuggingFace model")
    args = parser.parse_args()

    model = AutoModelForCausalLM.from_pretrained(args.model, dtype="auto", device_map="auto")
    training_args = TrainingArguments(
        output_dir="/tmp/spool",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    )

    Trainer(
        model=model,
        args=training_args,
    )

    print("Success")

if __name__ == "__main__":
    main()
```

```console
$ ./trainer_example.py --model /host/models/DeepSeek-R1-Distill-Qwen-1.5B
Success
$ ./trainer_example.py --model /host/models/DeepSeek-R1-Distill-Qwen-14B 
Loading checkpoint shards: <Snip TUI status bar>
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "/host/trainer/build_files/app/./trainer_example.py", line 30, in <module>
    main()
    ~~~~^^
  File "/host/trainer/build_files/app/./trainer_example.py", line 22, in main
    Trainer(
    ~~~~~~~^
        model=model,
        ^^^^^^^^^^^^
        args=training_args,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.13/dist-packages/transformers/trainer.py", line 620, in __init__
    self._move_model_to_device(model, args.device)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/dist-packages/transformers/trainer.py", line 913, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.13/dist-packages/accelerate/big_modeling.py", line 462, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
```

Running the script via `accelerate launch` produced the same error.
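For context on why only the larger model hits the offloading path, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. This is a sketch under assumptions: the parameter counts are the nominal ones from the model names, 2 bytes per parameter assumes the checkpoints load as 16-bit floats with `dtype="auto"`, and `weight_memory_gib` is a hypothetical helper, not part of any library. Training needs additional memory for gradients and optimizer state, which this does not count.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Nominal parameter counts taken from the model names (an assumption,
# not measured values).
models = [
    ("DeepSeek-R1-Distill-Qwen-1.5B", 1.5e9),
    ("DeepSeek-R1-Distill-Qwen-14B", 14e9),
]

for name, params in models:
    print(f"{name}: ~{weight_memory_gib(params):.1f} GiB for weights")
```

Comparing these figures against the card's VRAM (not stated in this report) would indicate whether `device_map="auto"` has to place some modules on the CPU, which matches the "offloaded to the cpu" warning in the log above.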

### Expected behavior

The exception comes from an internal step of the library (`Trainer.__init__` calling `model.to(device)`), so as a user of the library I can't tell what action I need to take or what config I need to change. Ideally the code attempting the move would catch the exception and mitigate it, or bubble it up in a way that lets the user identify which inputs are causing the error.
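To illustrate the kind of message that would have helped, here is a minimal sketch of a pre-flight check that user code could run before constructing `Trainer`. It relies on the `hf_device_map` attribute that `from_pretrained` sets when a `device_map` is passed; the names `find_offloaded` and `ensure_not_offloaded` and the wording of the error are hypothetical, and this is not how the library currently behaves.

```python
OFFLOAD_TARGETS = {"cpu", "disk"}


def find_offloaded(device_map):
    """Return the names of modules that a device map places on CPU or disk."""
    return sorted(
        name for name, device in device_map.items()
        if str(device) in OFFLOAD_TARGETS
    )


def ensure_not_offloaded(model):
    """Raise a descriptive error if the model has offloaded modules.

    Models loaded without a device_map have no hf_device_map attribute,
    so they pass this check unchanged.
    """
    device_map = getattr(model, "hf_device_map", None) or {}
    offloaded = find_offloaded(device_map)
    if offloaded:
        raise ValueError(
            f"{len(offloaded)} module(s) were offloaded to CPU or disk "
            f"(for example {offloaded[0]!r}). Trainer calls model.to(device) "
            "on the whole model, which fails for offloaded modules. "
            "Check the device_map argument passed to from_pretrained."
        )
```

In the reproduction script this would be called as `ensure_not_offloaded(model)` just before `Trainer(...)`, so the failure names the input (`device_map`) rather than an internal move.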

From reading the docs, offloading should allow this to work. There's still a high probability of user error here, but I can't suss it out from the exceptions I'm getting back, and that's an issue. If this is user error, then I think the bug is in how the error is presented to the top-level code.

Thank you for your work here.  I was able to train a small model without needing a deep understanding of what was going on.  That's really cool!
