Fix `num_items_in_batch` over-counting for causal LM losses by qgallouedec · Pull Request #46204 · huggingface/transformers

qgallouedec · 2026-05-26T00:32:13Z

The bug

Trainer._get_num_items_in_batch counts labels at every position. For causal LM, however, the loss shifts the labels, so position 0 is never a prediction target. The denominator is therefore too large by num_rows per micro-batch (one extra label per row), which systematically under-scales every causal LM loss and gradient.

labels (trainer counts these):           shift_labels = labels[..., 1:]
                                         (what ForCausalLMLoss uses)
┌────┬────┬────┬────┬────┐                    ┌────┬────┬────┬────┐
│ t0 │ t1 │ t2 │ t3 │ t4 │  count = 5         │ t1 │ t2 │ t3 │ t4 │  count = 4
└────┴────┴────┴────┴────┘                    └────┴────┴────┴────┘
  ↑
  position 0 is dropped by the shift — over-count = 1 per row

shift_labels = labels[..., 1:]           # ← numerator: 4 CE terms
loss = sum_ce / num_items_in_batch       # ← denominator: counted 5
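To make the off-by-one concrete, here is a minimal pure-Python sketch of the two counts. The helper names `count_all` and `count_shifted` are hypothetical and only illustrate the idea; they are not the Trainer's code, and the sketch assumes `-100` is the ignore index.

```python
IGNORE_INDEX = -100  # assumed ignore index for labels excluded from the loss


def count_all(labels):
    """Count non-ignored labels at every position (the over-counting behaviour)."""
    return sum(tok != IGNORE_INDEX for row in labels for tok in row)


def count_shifted(labels):
    """Count non-ignored labels after dropping position 0 of each row."""
    return sum(tok != IGNORE_INDEX for row in labels for tok in row[1:])


# Two rows of five positions; the second row ends with two ignored positions.
labels = [
    [11, 12, 13, 14, 15],
    [21, 22, 23, IGNORE_INDEX, IGNORE_INDEX],
]

full = count_all(labels)         # 5 + 3 = 8
shifted = count_shifted(labels)  # 4 + 2 = 6
print(full, shifted, full - shifted)  # 8 6 2
```

The difference equals the number of rows, as long as position 0 of each row holds a non-ignored label.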

loss_type is set on every PreTrainedModel (see modeling_utils.py). Only causal LM loss types are touched; ForMaskedLM, classification, etc. are unaffected.

Reported causal LM loss becomes slightly larger (correctly so, since the denominator was too big). The relative shift is roughly num_rows / total_tokens per step, and gradient magnitudes scale by the same factor.
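As a rough worked example of that scaling, the sketch below compares the loss under the old and new denominators. All numbers are made up for illustration, and it assumes position 0 of every row carried a non-ignored label, so the denominator drops by exactly `num_rows`.

```python
def rescale_factor(num_rows, counted_tokens):
    """Factor by which loss and gradients grow once the denominator drops by num_rows."""
    return counted_tokens / (counted_tokens - num_rows)


# Illustrative values only (not taken from the PR).
num_rows = 8            # rows in the accumulated batch
counted_tokens = 4096   # labels counted before the fix
sum_ce = 9000.0         # summed cross-entropy over the shifted targets

old_loss = sum_ce / counted_tokens
new_loss = sum_ce / (counted_tokens - num_rows)
factor = rescale_factor(num_rows, counted_tokens)

# The relative change is close to num_rows / counted_tokens when rows are long.
print(f"factor={factor:.6f}, approx={1 + num_rows / counted_tokens:.6f}")
```

For short sequences the ratio num_rows / total_tokens is larger, so the correction is more visible there.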

Tests

All TrainerGradientAccumulationTest tests pass (4/4). A padded vs padding-free reproducer now matches with Δgrad_norm = 0.

How it surfaced

TRL's invariance suite compared SFT with padding_free=False vs padding_free=True. Loss curves almost matched, but grad_norm drifted with a systematic +0.17 bias over 50 steps. The padding-free collator masks labels[position_ids == 0] = -100, which incidentally matches the post-shift count, so it exposed the padded path's over-count.
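The coincidence described above can be checked on a toy example. This is a pure-Python sketch with made-up token ids, not TRL's collator: it masks the first position of each packed sequence and compares the resulting count with the padded path.

```python
IGNORE_INDEX = -100  # assumed ignore index


def count_valid(row):
    """Number of labels in a row that are not ignored."""
    return sum(tok != IGNORE_INDEX for tok in row)


# Padded path: two sequences (lengths 3 and 2), padded to length 3.
padded = [[101, 102, 103], [201, 202, IGNORE_INDEX]]
padded_full = sum(count_valid(row) for row in padded)         # counts every position
padded_shifted = sum(count_valid(row[1:]) for row in padded)  # counts labels[..., 1:]

# Padding-free path: the same sequences packed into one row, with position_ids
# restarting at 0 for each sequence and those positions masked to IGNORE_INDEX.
packed_labels = [101, 102, 103, 201, 202]
position_ids = [0, 1, 2, 0, 1]
packed_masked = [
    IGNORE_INDEX if pos == 0 else tok
    for tok, pos in zip(packed_labels, position_ids)
]
packed_full = count_valid(packed_masked)

print(padded_full, padded_shifted, packed_full)  # 5 3 3
```

The packed count already equals the post-shift count, so only the padded path was over-counting.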

HuggingFaceDocBuilderDev · 2026-05-26T00:44:23Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu · 2026-05-26T12:20:30Z

+        # Causal LM losses shift labels internally (predictions at position i target label[i+1]), so position 0 of
+        # each row is never a prediction target. The valid-prediction count used by `num_items_in_batch` must therefore
+        # be taken over `labels[..., 1:]`, not the full label tensor
+        self._loss_shifts_labels = getattr(model_to_inspect, "loss_type", None) in (


I think this is too simple. If we go that way, we should inspect the actual loss function; e.g. CsmForConditionalGeneration would fail here, no?

But my bigger issue is that this covers only the simplest use case, where we just prepare the tokenized input and pass it through. What if we use a data collator that properly prepares the shifted labels? We would now count 1 too much

But definitely needed to fix in general!

vasqu · 2026-05-26T12:21:37Z

Also definitely need a test that covers this edge case

SunMarc

Thanks, left some minor comments but happy to merge it in general. Can you add a small test to check that we correctly calculate num_items_in_batch when we have ForCausalLM?

SunMarc · 2026-05-27T11:30:07Z

+        # Causal LM losses shift labels internally (predictions at position i target label[i+1]), so position 0 of
+        # each row is never a prediction target. The valid-prediction count used by `num_items_in_batch` must therefore
+        # be taken over `labels[..., 1:]`, not the full label tensor
+        self._loss_shifts_labels = getattr(model_to_inspect, "loss_type", None) in (
+            "ForCausalLM",
+            "ForConditionalGeneration",
+        )
+


As @vasqu pointed out, maybe we can use the following so that it is a bit more robust:

from transformers.loss.loss_utils import LOSS_MAPPING, ForCausalLMLoss

self._loss_shifts_labels = (
    LOSS_MAPPING.get(getattr(model_to_inspect, "loss_type", None)) is ForCausalLMLoss
)
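To illustrate why an identity check on the loss function is more robust than matching strings, here is a self-contained sketch. `STAND_IN_LOSS_MAPPING` and the two loss functions are hypothetical stand-ins, not the real `LOSS_MAPPING`; in particular, the `CsmForConditionalGeneration` entry is assumed for illustration, based on the review comment.

```python
def causal_lm_loss():
    """Stand-in for a loss that shifts labels internally."""


def masked_lm_loss():
    """Stand-in for a loss that does not shift labels."""


# Hypothetical mapping from loss_type strings to loss functions.
STAND_IN_LOSS_MAPPING = {
    "ForCausalLM": causal_lm_loss,
    "ForConditionalGeneration": causal_lm_loss,
    "CsmForConditionalGeneration": causal_lm_loss,  # assumed entry, for illustration
    "ForMaskedLM": masked_lm_loss,
}


def shifts_by_string(loss_type):
    """String matching: misses any loss_type not listed explicitly."""
    return loss_type in ("ForCausalLM", "ForConditionalGeneration")


def shifts_by_function(loss_type):
    """Identity check: true for every loss_type that resolves to the causal LM loss."""
    return STAND_IN_LOSS_MAPPING.get(loss_type) is causal_lm_loss


for name in ("ForCausalLM", "CsmForConditionalGeneration", "ForMaskedLM", None):
    print(name, shifts_by_string(name), shifts_by_function(name))
```

With the identity check, a new `loss_type` that maps to the same loss function is picked up without editing a string list.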

SunMarc · 2026-05-27T11:32:43Z

+                # Causal LM losses shift labels; count over `labels[..., 1:]` to avoid over-counting position 0.
+                labels_for_count = (
+                    [batch["labels"][..., 1:] for batch in batch_samples]
+                    if self._loss_shifts_labels
+                    else [batch["labels"] for batch in batch_samples]
+                )


Maybe to address @vasqu's point, we can also take into account the case where shift_labels is prepared by the user?

labels_for_count = [
    batch["shift_labels"]
    if "shift_labels" in batch
    else batch["labels"][..., 1:]
    if self._loss_shifts_labels
    else batch["labels"]
    for batch in batch_samples
]
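The selection order suggested above can be sketched in plain Python, using nested lists in place of tensors. The function names `select_labels` and `num_items` are hypothetical, and `-100` is assumed as the ignore index; this is an illustration of the three cases, not the merged implementation.

```python
IGNORE_INDEX = -100  # assumed ignore index


def select_labels(batch, loss_shifts_labels):
    """Pick the rows to count: pre-shifted labels win, then sliced labels, then raw labels."""
    if "shift_labels" in batch:
        return batch["shift_labels"]
    if loss_shifts_labels:
        return [row[1:] for row in batch["labels"]]
    return batch["labels"]


def num_items(batch_samples, loss_shifts_labels):
    """Count non-ignored labels across all micro-batches."""
    return sum(
        tok != IGNORE_INDEX
        for batch in batch_samples
        for row in select_labels(batch, loss_shifts_labels)
        for tok in row
    )


raw = {"labels": [[1, 2, 3, 4]]}
pre_shifted = {"labels": [[1, 2, 3, 4]], "shift_labels": [[2, 3, 4, IGNORE_INDEX]]}

causal_count = num_items([raw], loss_shifts_labels=True)               # 3
non_causal_count = num_items([raw], loss_shifts_labels=False)          # 4
pre_shifted_count = num_items([pre_shifted], loss_shifts_labels=True)  # 3
print(causal_count, non_causal_count, pre_shifted_count)  # 3 4 3
```

Checking for `shift_labels` first avoids slicing a tensor that the collator has already shifted.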

- Inspect the actual loss function via LOSS_MAPPING instead of matching loss_type strings (catches CsmForConditionalGeneration etc.).
- If the data collator already provides `shift_labels`, count over that tensor directly instead of slicing labels again.
- Add unit tests for `_get_num_items_in_batch` covering the causal LM path (with and without pre-shifted labels) and the non-causal-LM path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SunMarc

Thanks!

vasqu · 2026-05-27T14:15:43Z

Seems like we have 1 new failure https://app.circleci.com/pipelines/github/huggingface/transformers/175911/workflows/b1032ab4-41a0-44f1-b1cd-072bad362f4d/jobs/2326963

Thanks tho, overall LGTM as well

qgallouedec · 2026-05-27T14:30:08Z

> Seems like we have 1 new failure

0a3d375 should fix it

github-actions · 2026-05-27T14:44:31Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46204&sha=0a3d37

…ace#46204)

* Fix `num_items_in_batch` over-counting for causal LM losses
* Address review: use LOSS_MAPPING, honor pre-shifted labels, add tests
  - Inspect the actual loss function via LOSS_MAPPING instead of matching loss_type strings (catches CsmForConditionalGeneration etc.).
  - If the data collator already provides `shift_labels`, count over that tensor directly instead of slicing labels again.
  - Add unit tests for `_get_num_items_in_batch` covering the causal LM path (with and without pre-shifted labels) and the non-causal-LM path.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix test_train_and_predict_loss_parity

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix num_items_in_batch over-counting for causal LM losses

22d3aeb

qgallouedec requested review from SunMarc and vasqu May 26, 2026 00:32

qgallouedec mentioned this pull request May 26, 2026

Refresh sft.json / dpo.json snapshots after transformers num_items_in_batch fix huggingface/trl#5845

Open

qgallouedec added a commit to huggingface/trl that referenced this pull request May 26, 2026

Fixed values (huggingface/transformers#46204)

7375f45

vasqu reviewed May 26, 2026


SunMarc reviewed May 27, 2026


qgallouedec and others added 2 commits May 27, 2026 13:54

Merge branch 'main' into fix-num-items-in-batch-causal-lm-v2

4d09230

SunMarc approved these changes May 27, 2026


vasqu enabled auto-merge May 27, 2026 14:11

fix test_train_and_predict_loss_parity

0a3d375

vasqu added this pull request to the merge queue May 27, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026

vasqu added this pull request to the merge queue May 27, 2026

Merged via the queue into main with commit 67265ef May 27, 2026
44 of 45 checks passed

vasqu deleted the fix-num-items-in-batch-causal-lm-v2 branch May 27, 2026 16:03

stevhliu mentioned this pull request Jun 1, 2026

[docs] update num_items_in_batch for causal LMs #46335

Merged

vasqu merged 4 commits into main from fix-num-items-in-batch-causal-lm-v2


4 participants
