
Multi-GPU Training with DPO Full Parameter Gets Stuck #147

Open
Taishi-N324 opened this issue Mar 31, 2024 · 0 comments

Taishi-N324 commented Mar 31, 2024

Environment:

transformers: 4.39.0.dev0
trl: 0.7.10
torch: 2.2.2
8 x H100 (80GB)

I am encountering an issue where full-parameter DPO training on a multi-GPU setup gets stuck. The problem arises when I launch training via the accelerate CLI with DeepSpeed's ZeRO-3 configuration.
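A hang at this stage is often a collective-communication problem rather than anything DPO-specific. As a first check, a minimal NCCL all_reduce test can confirm whether plain multi-GPU collectives work on the node. This is a sketch of my own (nccl_sanity.py is not part of the handbook), launched with torchrun --nproc_per_node=8 nccl_sanity.py:

# nccl_sanity.py - minimal multi-GPU collective check (hypothetical helper)
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Each rank contributes 1; after all_reduce every rank should hold the world size (8 here).
x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")

dist.destroy_process_group()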

Steps to Reproduce:

Clone the Alignment Handbook repository:

git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook

Install dependencies:

pip install wheel
python -m pip install .

Launch the training script with the specified configuration:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
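
To narrow down where the launch stops, the same command can be rerun with extra logging. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables and PYTHONFAULTHANDLER is a standard CPython one; none of them are handbook-specific:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL PYTHONFAULTHANDLER=1 \
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml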

Expected vs. Actual Behavior:
Expected: Multi-GPU training proceeds without interruption.
Actual: The process halts immediately after displaying the user warning:

UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.

After this warning, there is no further progress.
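
The warning indicates the script passes a model_id string to DPOTrainer, so the trainer instantiates the policy and reference models itself. As an isolation experiment (a sketch only, not the handbook's code: model_name, the toy dataset, and the TrainingArguments values are placeholders), the models can be created up front and passed in directly:

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy preference data with the prompt/chosen/rejected columns DPOTrainer expects.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 2+2?"],
    "chosen": ["4"],
    "rejected": ["5"],
})

training_args = TrainingArguments(output_dir="dpo-debug", per_device_train_batch_size=1, bf16=True)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # note: reference-model handling differs under ZeRO-3; see the TRL docs
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()

If this minimal setup also hangs under the same accelerate/ZeRO-3 launch, the problem is likely in the distributed setup rather than in the recipe itself.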
