System Info
- transformers version: 4.56.1
- Platform: Linux-6.12.43+deb13-amd64-x86_64-with-glibc2.41
- Python version: 3.13.5
- Huggingface_hub version: 0.35.0
- Safetensors version: 0.6.2
- Accelerate version: 1.10.1
- Accelerate config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?: Yes
- GPU type: NVIDIA GeForce RTX 5060 Ti
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
This is based on the quickstart instructions, but is not an officially published script.
Models referenced are git clones of the following models:
- https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
#!/usr/bin/env python3
# Simplification of the example at https://huggingface.co/docs/transformers/quicktour
import os
import argparse
from pathlib import Path

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer


def main():
    parser = argparse.ArgumentParser(description="Fine-tune a LLM on HTML files using LoRA.")
    parser.add_argument("--model", type=Path, required=True, help="Directory containing pretrained HuggingFace model")
    args = parser.parse_args()

    model = AutoModelForCausalLM.from_pretrained(args.model, dtype="auto", device_map="auto")
    training_args = TrainingArguments(
        output_dir="/tmp/spool",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    )
    Trainer(
        model=model,
        args=training_args,
    )
    print("Success")


if __name__ == "__main__":
    main()
$ ./trainer_example.py --model /host/models/DeepSeek-R1-Distill-Qwen-1.5B
Success
$ ./trainer_example.py --model /host/models/DeepSeek-R1-Distill-Qwen-14B
Loading checkpoint shards: <Snip TUI status bar>
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
File "/host/trainer/build_files/app/./trainer_example.py", line 30, in <module>
main()
~~~~^^
File "/host/trainer/build_files/app/./trainer_example.py", line 22, in main
Trainer(
~~~~~~~^
model=model,
^^^^^^^^^^^^
args=training_args,
^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/local/lib/python3.13/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.13/dist-packages/transformers/trainer.py", line 620, in __init__
self._move_model_to_device(model, args.device)
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/dist-packages/transformers/trainer.py", line 913, in _move_model_to_device
model = model.to(device)
File "/usr/local/lib/python3.13/dist-packages/accelerate/big_modeling.py", line 462, in wrapper
raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
Calling via accelerate launch did not impact the error.
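In case it helps triage, here is a minimal diagnostic sketch (separate from the repro above) showing how I would confirm the offload. It assumes the hf_device_map attribute that accelerate attaches to the model when device_map="auto" dispatches it; the expected output is an inference from the "offloaded to the cpu" warning, not something I have pasted verbatim here.

# Diagnostic sketch: list which modules accelerate placed on CPU/disk when
# device_map="auto" was used. Any such entries are what the later internal
# model.to(device) call in Trainer trips over.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/host/models/DeepSeek-R1-Distill-Qwen-14B", dtype="auto", device_map="auto"
)
offloaded = {name: dev for name, dev in model.hf_device_map.items() if dev in ("cpu", "disk")}
print(offloaded)  # expected: non-empty for the 14B model on this GPU, matching the warning above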
Expected behavior
The exception is about an internal action being taken by the library; as a user of the library I'm not sure what action I need to take or what config I need to change to resolve it. Ideally the code attempting the move should catch the exception and mitigate it, or re-raise it in a way that lets the user identify which inputs are causing the error.
From reading the docs, offloading should allow this to work. There's still a high probability of user error here, but I can't suss it out from the exceptions I'm getting back, and that's an issue. If this is user error, then I think the bug is in how the error is presented to the top-level code.
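To illustrate what I mean by surfacing the error in an actionable way, here is a sketch of the kind of check I would have expected to hit before the internal move, whether inside Trainer or in my own code. assert_not_offloaded is a hypothetical helper written for this report, not an existing transformers API.

# Hypothetical guard (illustration only): fail early, before Trainer's internal
# model.to(device) call, and name the inputs that caused the problem.
def assert_not_offloaded(model):
    device_map = getattr(model, "hf_device_map", None) or {}
    offloaded = [name for name, dev in device_map.items() if dev in ("cpu", "disk")]
    if offloaded:
        raise ValueError(
            "Model was dispatched with device_map='auto' and these modules were "
            f"offloaded to CPU/disk: {offloaded}. Trainer cannot move an offloaded "
            "model to a single device; load the model fully onto the GPU or drop device_map."
        )

assert_not_offloaded(model)  # called before Trainer(model=model, args=training_args)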
Thank you for your work here. I was able to train a small model without needing a deep understanding of what was going on. That's really cool!