Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push sharded checkpoint to hub when push_to_hub=True in TrainingArguments #31808

Merged
merged 1 commit into from
Jul 10, 2024

Conversation

SunMarc
Copy link
Member

@SunMarc SunMarc commented Jul 5, 2024

What does this PR do ?

This PR make sure that sharded checkpoints are pushed to the hub when we set push_to_hub=True in TrainingArguments.
Fixes #30724.
Thanks @alvarobartt for the detailed issue !

I didn't add a test as it requires a big model to test it and we can't change the shard size when saving from Trainer.

To reproduce:

I was able to successfully push the checkpoints here : https://huggingface.co/marcsun13/sft_openassistant-guanaco/tree/main

python trl/examples/scripts/sft.py     --model_name_or_path="mistralai/Mistral-7B-v0.1"     --dataset_text_field="text"     --report_to="wandb"     --learning_rate=1.41e-5     --per_device_train_batch_size=2     --gradient_accumulation_steps=2     --output_dir="sft_openassistant-guanaco"   --torch_dtype="bfloat16"  --optim=adamw_bnb_8bit   --logging_steps=1     --num_train_epochs=3     --hub_strategy="every_save" --save_strategy="steps"  --save_steps=10    --push_to_hub

@SunMarc SunMarc requested a review from LysandreJik July 5, 2024 15:59
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! 🥳

@SunMarc SunMarc requested a review from amyeroberts July 10, 2024 11:43
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for enabling this!

Regarding testing, I'm happy to have merged as you've been able to run. However, not being able to test is a symptom of something needing to change on our code side to enable this.

@SunMarc SunMarc merged commit 8df28bb into main Jul 10, 2024
22 checks passed
@SunMarc SunMarc deleted the save_sharded_checkpoints_in_trainer branch July 10, 2024 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hub_strategy="every_save" won't push the model to the Hub if large
4 participants