Remove unconditional train_batch_size assignment #43770
SunMarc merged 3 commits into huggingface:main
Conversation
The `train_batch_size` should only be saved to `TrainerState` when `auto_find_batch_size` is enabled (which is already handled in the `auto_find_batch_size` block at line 2251). The unconditional assignment caused issues when resuming from a checkpoint with a different batch size configuration. Fixes huggingface#43708
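For context, the guarded save looks roughly like this (a minimal sketch of the inside of the training setup; the surrounding `Trainer` code and exact placement are assumed, not copied from the diff):

```python
# Inside Trainer's training setup (sketch, not the actual diff):
if args.auto_find_batch_size:
    # Persist only the automatically-found batch size so a resumed run
    # can continue with the value the search settled on.
    self.state.train_batch_size = self._train_batch_size
# No unconditional `self.state.train_batch_size = ...` anymore.
```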
SunMarc left a comment:
Nice, can you add a simple test to verify that it is indeed not saved when `auto_find_batch_size` is False?
Sure, I'll do just that.
Done @SunMarc! Added a test that verifies `train_batch_size` is not saved to `TrainerState` when `auto_find_batch_size` is False.
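In spirit, the test does something like this (a simplified sketch, not the PR's exact test; the `model` and `train_dataset` fixtures are placeholders):

```python
from transformers import Trainer, TrainingArguments

def test_train_batch_size_not_saved_without_auto_find(tmp_path, model, train_dataset):
    # auto_find_batch_size is left at its default of False
    args = TrainingArguments(
        output_dir=str(tmp_path),
        per_device_train_batch_size=8,
        max_steps=2,
        report_to="none",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    # With the fix, TrainerState only records a batch size when
    # auto_find_batch_size actually found one.
    assert trainer.state.train_batch_size is None
```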
Hi @SunMarc, looks like simply removing that line broke some tests: other parts of the resume code expect `state.train_batch_size` to be set. I think the correct fix is to only restore the checkpoint's `train_batch_size` when `auto_find_batch_size` is enabled, and to use the batch size from the user's current args otherwise (see the sketch below).

This way, `auto_find_batch_size` still resumes with the automatically-found batch size, while users who change their batch size get a correct `max_steps` calculation.

Should I update the PR with this approach?
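Roughly this (a minimal sketch of the resume path; `TrainerState.load_from_json` is the real loader, but the exact placement inside `Trainer` is assumed):

```python
import os
from transformers.trainer_callback import TrainerState

# Inside Trainer's resume-from-checkpoint handling (sketch):
state = TrainerState.load_from_json(
    os.path.join(resume_from_checkpoint, "trainer_state.json")
)
if args.auto_find_batch_size and state.train_batch_size is not None:
    # auto_find_batch_size must resume with the batch size it found,
    # otherwise the search would have to start over.
    self._train_batch_size = state.train_batch_size
# Otherwise self._train_batch_size stays as derived from the current args,
# so max_steps is computed from what the user actually asked for.
```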
Okay, let's do that for now! Thanks for checking.
Only restore train_batch_size from checkpoint when auto_find_batch_size is enabled

Fixes huggingface#43708

When resuming from a checkpoint, the trainer was unconditionally restoring the saved train_batch_size, overwriting the user's current batch size configuration. This caused incorrect max_steps calculation when users wanted to resume training with a different batch size.

Now the checkpoint's train_batch_size is only restored when auto_find_batch_size=True, as that feature specifically needs to resume with the automatically-found batch size. Otherwise, the user's current args batch size is used.

Added test to verify users can change batch size when resuming.
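To make the miscalculation concrete, a toy example with made-up numbers (the real `max_steps` formula in `Trainer` also accounts for epochs and gradient accumulation; this only shows the stale-value effect):

```python
# Hypothetical numbers illustrating the stale-batch-size effect.
num_train_examples = 1024

checkpoint_batch_size = 8   # value restored from TrainerState before the fix
resumed_batch_size = 16     # what the user's current args actually request

steps_from_stale_value = num_train_examples // checkpoint_batch_size  # 128
steps_user_expects = num_train_examples // resumed_batch_size         # 64

print(steps_from_stale_value, steps_user_expects)  # 128 64
```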
force-pushed from 10d456e to 0d8d4b6
Updated @SunMarc! I've implemented the approach we discussed. Also added a test to verify users can change batch size when resuming.
Thanks a lot! Merging.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
* Remove unconditional train_batch_size assignment
* Add test for train_batch_size not saved without auto_find_batch_size
* Only restore train_batch_size from checkpoint when auto_find_batch_size is enabled
What does this PR do?
Removes the unconditional `self.state.train_batch_size = self._train_batch_size` assignment that was causing issues when resuming from a checkpoint with a different batch configuration. The `train_batch_size` should only be saved to `TrainerState` when `auto_find_batch_size` is enabled, which is already handled in the `auto_find_batch_size` block. The unconditional assignment was redundant and caused the bug where `max_steps` was incorrectly calculated when resuming training with a different batch size configuration (even when keeping the same global batch size). Fixes #43708
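For example, the same-global-batch-size scenario from the issue (hypothetical values; `per_device_train_batch_size` and `gradient_accumulation_steps` are real `TrainingArguments` fields):

```python
from transformers import TrainingArguments

# Original run: 8 per device × 1 accumulation step
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
)

# Resumed run: 4 × 2 — same global batch size, but before this fix the
# checkpoint's stale train_batch_size was used to compute max_steps.
resume_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
)
# trainer = Trainer(model=model, args=resume_args, train_dataset=ds)
# trainer.train(resume_from_checkpoint=True)
```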
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Yes, discussed in "Trainer `resume_from_checkpoint` incorrectly calculates `max_steps` when changing `per_device_train_batch_size` with same global batch size" #43708
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@SunMarc