Push sharded checkpoint to hub when `push_to_hub=True` in `TrainingArguments` #31808

SunMarc · 2024-07-05T15:59:08Z

What does this PR do ?

This PR makes sure that sharded checkpoints are pushed to the Hub when `push_to_hub=True` is set in `TrainingArguments`.
Fixes #30724.
Thanks @alvarobartt for the detailed issue !
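
For context, a sharded checkpoint consists of several weight files plus an index JSON whose `weight_map` maps each parameter name to the shard that stores it. The sketch below shows one way the set of files to upload could be collected from such an index. It only illustrates the idea and is not the diff from this PR: the helper name `collect_checkpoint_files` is hypothetical, and the file names are assumed to follow the usual transformers conventions.

```python
import json
import os

# Conventional file names for single-file checkpoints and for shard indexes.
SINGLE_FILE_NAMES = ["config.json", "pytorch_model.bin", "model.safetensors"]
INDEX_FILE_NAMES = ["pytorch_model.bin.index.json", "model.safetensors.index.json"]


def collect_checkpoint_files(checkpoint_folder):
    """Return the sorted names of model files present in `checkpoint_folder`.

    Hypothetical helper: besides the single-file names, it includes every
    shard file referenced by an index file found in the folder.
    """
    candidates = list(SINGLE_FILE_NAMES)
    for index_name in INDEX_FILE_NAMES:
        index_path = os.path.join(checkpoint_folder, index_name)
        if os.path.isfile(index_path):
            candidates.append(index_name)
            with open(index_path) as f:
                index = json.load(f)
            # Many parameters share one shard, so deduplicate the file names.
            candidates.extend(set(index["weight_map"].values()))
    # Keep only the files that actually exist on disk.
    return sorted(
        name
        for name in set(candidates)
        if os.path.isfile(os.path.join(checkpoint_folder, name))
    )
```

With a non-sharded checkpoint no index file exists, so only the single-file names are considered. With a sharded one, the index and every shard it references are picked up as well.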

I didn't add a test because it would require a large model: we can't change the shard size when saving from `Trainer`.
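
To illustrate why a small model does not exercise this path: a checkpoint is only split once its total size exceeds the maximum shard size. The following is a simplified sketch of greedy sharding, offered as an assumption about the general approach rather than the library's implementation. `plan_shards` and the byte sizes are made up for illustration.

```python
def plan_shards(param_sizes, max_shard_size):
    """Greedily group parameters into shards of at most `max_shard_size` bytes.

    `param_sizes` maps parameter name -> size in bytes. A parameter larger
    than the limit ends up in a shard of its own.
    """
    shards = [[]]
    current_size = 0
    for name, size in param_sizes.items():
        # Start a new shard when adding this parameter would exceed the limit.
        if shards[-1] and current_size + size > max_shard_size:
            shards.append([])
            current_size = 0
        shards[-1].append(name)
        current_size += size
    return shards


sizes = {"embed": 4, "layer.0": 3, "layer.1": 3, "head": 4}
print(plan_shards(sizes, max_shard_size=100))  # everything fits in one shard
print(plan_shards(sizes, max_shard_size=7))    # the same weights split in two
```

Under a fixed limit, only a model whose weights exceed that limit produces more than one shard, which is why a test would need either a big model or a configurable shard size.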

To reproduce:

python trl/examples/scripts/sft.py --model_name_or_path="mistralai/Mistral-7B-v0.1" --dataset_text_field="text" --report_to="wandb" --learning_rate=1.41e-5 --per_device_train_batch_size=2 --gradient_accumulation_steps=2 --output_dir="sft_openassistant-guanaco" --torch_dtype="bfloat16" --optim=adamw_bnb_8bit --logging_steps=1 --num_train_epochs=3 --hub_strategy="every_save" --save_strategy="steps" --save_steps=10 --push_to_hub

With this command, I was able to successfully push the checkpoints here: https://huggingface.co/marcsun13/sft_openassistant-guanaco/tree/main

HuggingFaceDocBuilderDev · 2024-07-05T16:18:09Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

muellerzr

Thanks for the fix! 🥳

amyeroberts

Thanks for enabling this!

Regarding testing, I'm happy for this to be merged since you've been able to run it. However, not being able to test this is a symptom that something needs to change on our side of the code to make it testable.

Save sharded checkpoint in Trainer

47b52e5

SunMarc requested a review from LysandreJik July 5, 2024 15:59

muellerzr approved these changes Jul 8, 2024

SunMarc requested a review from amyeroberts July 10, 2024 11:43

amyeroberts approved these changes Jul 10, 2024

SunMarc merged commit 8df28bb into main Jul 10, 2024
22 checks passed

SunMarc deleted the save_sharded_checkpoints_in_trainer branch July 10, 2024 13:14
