
Remove unconditional train_batch_size assignment #43770

Merged
SunMarc merged 3 commits into huggingface:main from lordaarush:remove-unconditional-train-batch-size
Feb 6, 2026

Conversation

@lordaarush (Contributor)

What does this PR do?

Removes the unconditional self.state.train_batch_size = self._train_batch_size assignment that was causing issues when resuming from a checkpoint with a different batch configuration.

The train_batch_size should only be saved to TrainerState when auto_find_batch_size is enabled, which is already handled in the auto_find_batch_size block. The unconditional assignment was redundant and caused the bug where max_steps was incorrectly calculated when resuming training with a different batch size configuration (even when keeping the same global batch size).

Fixes #43708
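
For illustration, a minimal sketch of the change (simplified, not the exact Trainer source; the attribute names follow the PR description above):

```python
# Sketch only: simplified view of the relevant logic in Trainer's training loop.

# Before this PR (removed): TrainerState always mirrored the runtime batch size.
# self.state.train_batch_size = self._train_batch_size

# After: the value is only recorded inside the existing auto_find_batch_size
# branch, which is the one place that actually needs it on resume.
if self.args.auto_find_batch_size:
    self.state.train_batch_size = self._train_batch_size
```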


Who can review?

@SunMarc

The train_batch_size should only be saved to TrainerState when
auto_find_batch_size is enabled (which is already handled in the
auto_find_batch_size block at line 2251). The unconditional assignment
caused issues when resuming from checkpoint with different batch
configurations.

Fixes huggingface#43708
@SunMarc (Member) left a comment


Nice, can you add a simple test to check that it is indeed not saved when auto_find_batch_size is False?

@lordaarush (Contributor, Author)

Sure, I'll do just that.

@lordaarush (Contributor, Author)

lordaarush commented Feb 5, 2026

Done @SunMarc! Added a test that verifies train_batch_size is None in the saved checkpoint when auto_find_batch_size=False.
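
Roughly what such a check could look like under this first approach (a sketch, not the test actually added in the PR; tiny_model and tiny_dataset are placeholder fixtures):

```python
import json
import os
import tempfile

from transformers import Trainer, TrainingArguments


def test_train_batch_size_not_saved_without_auto_find_batch_size():
    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(
            output_dir=tmp_dir,
            max_steps=2,
            save_steps=2,
            per_device_train_batch_size=2,
            auto_find_batch_size=False,
        )
        # tiny_model / tiny_dataset are placeholders for whatever small
        # fixtures the test suite provides.
        trainer = Trainer(model=tiny_model, args=args, train_dataset=tiny_dataset)
        trainer.train()

        # TrainerState is serialized to trainer_state.json inside each checkpoint.
        state_path = os.path.join(tmp_dir, "checkpoint-2", "trainer_state.json")
        with open(state_path) as f:
            state = json.load(f)

        # With the unconditional assignment removed, nothing writes this field.
        assert state["train_batch_size"] is None
```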

@SunMarc (Member) left a comment


Thanks!

@lordaarush (Contributor, Author)

Hi @SunMarc, looks like simply removing that line broke some tests:
FAILED test_resume_training_with_frozen_params - TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
FAILED test_resume_training_with_checkpoint - TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
FAILED test_auto_batch_size_with_resume_from_checkpoint - TypeError: '>=' not supported between instances of 'NoneType' and 'int'
... (9 total trainer resume tests failed)

The issue is that other resume code expects state.train_batch_size to have a value for calculations.

I think the correct fix is to:

  1. Keep saving train_batch_size to the checkpoint (as before)
  2. Only restore it to _train_batch_size when auto_find_batch_size=True

This way:

  • train_batch_size is still available in state for other resume calculations
  • But it only overwrites the user's batch config when auto_find_batch_size is enabled (which needs it)

Should I update the PR with this approach?
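
In code, the proposed restore path would look roughly like this (a simplified sketch of a fragment inside the trainer's resume logic, not the exact source; names like TrainerState.load_from_json and trainer_state.json reflect the usual trainer conventions and are assumptions here):

```python
# Sketch of the resume-from-checkpoint path (simplified fragment).
# TrainerState keeps its saved train_batch_size, so other resume calculations
# (e.g. the max_steps bookkeeping that failed above) still have a value to
# work with, but it only overrides the user's configured batch size when
# auto_find_batch_size is enabled.
if resume_from_checkpoint is not None:
    self.state = TrainerState.load_from_json(
        os.path.join(resume_from_checkpoint, "trainer_state.json")
    )
    if self.args.auto_find_batch_size and self.state.train_batch_size is not None:
        self._train_batch_size = self.state.train_batch_size
```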

@SunMarc (Member)

SunMarc commented Feb 5, 2026

Okay, let's do that for now! Thanks for checking.

Only restore train_batch_size from checkpoint when auto_find_batch_size is enabled

Fixes huggingface#43708

When resuming from a checkpoint, the trainer was unconditionally restoring
the saved train_batch_size, overwriting the user's current batch size
configuration. This caused incorrect max_steps calculation when users
wanted to resume training with a different batch size.

Now the checkpoint's train_batch_size is only restored when
auto_find_batch_size=True, as that feature specifically needs to resume
with the automatically-found batch size. Otherwise, the user's current
args batch size is used.

Added test to verify users can change batch size when resuming.
@lordaarush force-pushed the remove-unconditional-train-batch-size branch from 10d456e to 0d8d4b6 on February 5, 2026 at 19:18
@lordaarush (Contributor, Author)

Updated @SunMarc! I've implemented the approach we discussed. Also added test_resume_training_with_different_batch_size to verify users can change batch size when resuming. All tests passing now.
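
A rough sketch of what test_resume_training_with_different_batch_size checks (simplified; same placeholder fixtures and imports as the test sketch above, not the actual test code):

```python
def test_resume_training_with_different_batch_size():
    with tempfile.TemporaryDirectory() as tmp_dir:
        # First run: train a few steps with per_device_train_batch_size=4.
        args = TrainingArguments(
            output_dir=tmp_dir, max_steps=4, save_steps=2, per_device_train_batch_size=4
        )
        Trainer(model=tiny_model, args=args, train_dataset=tiny_dataset).train()

        # Resume with a different per-device batch size (auto_find_batch_size is False).
        resumed_args = TrainingArguments(
            output_dir=tmp_dir, max_steps=4, save_steps=2, per_device_train_batch_size=2
        )
        trainer = Trainer(model=tiny_model, args=resumed_args, train_dataset=tiny_dataset)
        trainer.train(resume_from_checkpoint=os.path.join(tmp_dir, "checkpoint-2"))

        # The checkpoint's saved batch size must not override the new config.
        assert trainer._train_batch_size == 2
```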

@SunMarc enabled auto-merge (squash) on February 6, 2026 at 14:38
@SunMarc (Member)

SunMarc commented Feb 6, 2026

Thanks a lot! Merging.

@SunMarc merged commit 5a1016a into huggingface:main on Feb 6, 2026
25 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

jiosephlee pushed a commit to jiosephlee/transformers_latest that referenced this pull request Feb 11, 2026
* Remove unconditional train_batch_size assignment

* Add test for train_batch_size not saved without auto_find_batch_size

* Only restore train_batch_size from checkpoint when auto_find_batch_size is enabled


Development

Successfully merging this pull request may close these issues.

Trainer resume_from_checkpoint incorrectly calculates max_steps when changing per_device_train_batch_size with same global batch size

3 participants