
Remove unconditional train_batch_size assignment #43770

Merged
SunMarc merged 3 commits into huggingface:main from lordaarush:remove-unconditional-train-batch-size
Feb 6, 2026

Conversation

@lordaarush (Contributor)

What does this PR do?

Removes the unconditional self.state.train_batch_size = self._train_batch_size assignment that was causing issues when resuming from a checkpoint with a different batch configuration.

The train_batch_size should only be saved to TrainerState when auto_find_batch_size is enabled, which is already handled in the auto_find_batch_size block. The unconditional assignment was redundant and caused the bug where max_steps was incorrectly calculated when resuming training with a different batch size configuration (even when keeping the same global batch size).

Fixes #43708
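
For illustration, a minimal sketch of the change (simplified, not the exact Trainer source; the attribute names follow the PR description above):

```python
# Sketch only: simplified view of the relevant logic in Trainer's training loop.

# Before this PR (removed): TrainerState always mirrored the runtime batch size.
# self.state.train_batch_size = self._train_batch_size

# After: the value is only recorded inside the existing auto_find_batch_size
# branch, which is the one place that actually needs it on resume.
if self.args.auto_find_batch_size:
    self.state.train_batch_size = self._train_batch_size
```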


Who can review?

@SunMarc

The train_batch_size should only be saved to TrainerState when
auto_find_batch_size is enabled (which is already handled in the
auto_find_batch_size block at line 2251). The unconditional assignment
caused issues when resuming from checkpoint with different batch
configurations.

Fixes huggingface#43708
@SunMarc (Member) left a comment


Nice, can you add a simple test to check that it is indeed not saved when auto_find_batch_size is False?

@lordaarush (Contributor, Author)

Sure, I'll do just that.

@lordaarush (Contributor, Author)

lordaarush commented Feb 5, 2026

Done @SunMarc! Added a test that verifies train_batch_size is None in the saved checkpoint when auto_find_batch_size=False.
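
Roughly what such a check could look like under this first approach (a sketch, not the test actually added in the PR; tiny_model and tiny_dataset are placeholder fixtures):

```python
import json
import os
import tempfile

from transformers import Trainer, TrainingArguments


def test_train_batch_size_not_saved_without_auto_find_batch_size():
    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(
            output_dir=tmp_dir,
            max_steps=2,
            save_steps=2,
            per_device_train_batch_size=2,
            auto_find_batch_size=False,
        )
        # tiny_model / tiny_dataset are placeholders for whatever small
        # fixtures the test suite provides.
        trainer = Trainer(model=tiny_model, args=args, train_dataset=tiny_dataset)
        trainer.train()

        # TrainerState is serialized to trainer_state.json inside each checkpoint.
        state_path = os.path.join(tmp_dir, "checkpoint-2", "trainer_state.json")
        with open(state_path) as f:
            state = json.load(f)

        # With the unconditional assignment removed, nothing writes this field.
        assert state["train_batch_size"] is None
```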

@SunMarc (Member) left a comment


Thanks!

@lordaarush (Contributor, Author)

Hi @SunMarc, looks like simply removing that line broke some tests:
FAILED test_resume_training_with_frozen_params - TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
FAILED test_resume_training_with_checkpoint - TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
FAILED test_auto_batch_size_with_resume_from_checkpoint - TypeError: '>=' not supported between instances of 'NoneType' and 'int'
... (9 total trainer resume tests failed)

The issue is that other resume code expects state.train_batch_size to have a value for calculations.

I think the correct fix is to:

  1. Keep saving train_batch_size to the checkpoint (as before)
  2. Only restore it to _train_batch_size when auto_find_batch_size=True

This way:

  • train_batch_size is still available in state for other resume calculations
  • But it only overwrites the user's batch config when auto_find_batch_size is enabled (which needs it)

Should I update the PR with this approach?
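
In code, the proposed restore path would look roughly like this (a simplified sketch of a fragment inside the trainer's resume logic, not the exact source; names like TrainerState.load_from_json and trainer_state.json reflect the usual trainer conventions and are assumptions here):

```python
# Sketch of the resume-from-checkpoint path (simplified fragment).
# TrainerState keeps its saved train_batch_size, so other resume calculations
# (e.g. the max_steps bookkeeping that failed above) still have a value to
# work with, but it only overrides the user's configured batch size when
# auto_find_batch_size is enabled.
if resume_from_checkpoint is not None:
    self.state = TrainerState.load_from_json(
        os.path.join(resume_from_checkpoint, "trainer_state.json")
    )
    if self.args.auto_find_batch_size and self.state.train_batch_size is not None:
        self._train_batch_size = self.state.train_batch_size
```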

@SunMarc (Member)

SunMarc commented Feb 5, 2026

Okay, let's do that for now! Thanks for checking.

Only restore train_batch_size from checkpoint when auto_find_batch_size is enabled

Fixes huggingface#43708

When resuming from a checkpoint, the trainer was unconditionally restoring
the saved train_batch_size, overwriting the user's current batch size
configuration. This caused incorrect max_steps calculation when users
wanted to resume training with a different batch size.

Now the checkpoint's train_batch_size is only restored when
auto_find_batch_size=True, as that feature specifically needs to resume
with the automatically-found batch size. Otherwise, the user's current
args batch size is used.

Added test to verify users can change batch size when resuming.
@lordaarush force-pushed the remove-unconditional-train-batch-size branch from 10d456e to 0d8d4b6 on February 5, 2026 at 19:18
@lordaarush (Contributor, Author)

Updated @SunMarc! I've implemented the approach we discussed. Also added test_resume_training_with_different_batch_size to verify users can change batch size when resuming. All tests passing now.
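
A rough sketch of what test_resume_training_with_different_batch_size checks (simplified; same placeholder fixtures and imports as the test sketch above, not the actual test code):

```python
def test_resume_training_with_different_batch_size():
    with tempfile.TemporaryDirectory() as tmp_dir:
        # First run: train a few steps with per_device_train_batch_size=4.
        args = TrainingArguments(
            output_dir=tmp_dir, max_steps=4, save_steps=2, per_device_train_batch_size=4
        )
        Trainer(model=tiny_model, args=args, train_dataset=tiny_dataset).train()

        # Resume with a different per-device batch size (auto_find_batch_size is False).
        resumed_args = TrainingArguments(
            output_dir=tmp_dir, max_steps=4, save_steps=2, per_device_train_batch_size=2
        )
        trainer = Trainer(model=tiny_model, args=resumed_args, train_dataset=tiny_dataset)
        trainer.train(resume_from_checkpoint=os.path.join(tmp_dir, "checkpoint-2"))

        # The checkpoint's saved batch size must not override the new config.
        assert trainer._train_batch_size == 2
```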

@SunMarc enabled auto-merge (squash) on February 6, 2026 at 14:38
@SunMarc (Member)

SunMarc commented Feb 6, 2026

Thanks a lot! Merging.

@SunMarc merged commit 5a1016a into huggingface:main on Feb 6, 2026
25 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

jiosephlee pushed a commit to jiosephlee/transformers_latest that referenced this pull request Feb 11, 2026
* Remove unconditional train_batch_size assignment

* Add test for train_batch_size not saved without auto_find_batch_size

* Only restore train_batch_size from checkpoint when auto_find_batch_size is enabled


Development

Successfully merging this pull request may close these issues.

Trainer resume_from_checkpoint incorrectly calculates max_steps when changing per_device_train_batch_size with same global batch size

3 participants