
Tensor mismatch when fine-tuning a SmolLM3 model #41129

@codefusser

Description


System Info

When performing a fine-tuning job with a per-device batch size of 4 and max steps of 1000, training fails with a tensor size mismatch; the only preceding hint in the logs is a tokenizer warning:

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': None}.


  0%|          | 0/500 [00:00<?, ?it/s]Traceback (most recent call last):

  File "/tmp/script.py", line 191, in <module>

    trainer.train()

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/transformers/trainer.py", line 2328, in train

    return inner_training_loop(

           ^^^^^^^^^^^^^^^^^^^^

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/transformers/trainer.py", line 2672, in _inner_training_loop

    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 1189, in training_step

    return super().training_step(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/transformers/trainer.py", line 4009, in training_step

    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 1123, in compute_loss

    entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()

                        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~

RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0


  0%|          | 0/500 [00:01<?, ?it/s]

Please, what might be wrong here?
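
For reference, the failing line from trl/trainer/sft_trainer.py can be reproduced in isolation with mismatched batch dimensions. This is only a sketch with made-up shapes to illustrate what the RuntimeError means, not the trainer's actual tensors:

import torch

# Hypothetical shapes: a per-token entropy tensor for a batch of 4 sequences...
per_token_entropy = torch.rand(4, 128)
# ...multiplied by an attention mask whose batch dimension is 8
attention_mask = torch.ones(8, 128)

# Same expression as the line shown in the traceback above; raises
# "The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0"
entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()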

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

SFT fine-tuning on an a10-largex2 GPU setup.

Training parameters:

# Configure training
config = SFTConfig(
    output_dir="./smollm3-jobs-sft",
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    max_steps=1000,
    logging_steps=50,
    save_steps=200,
    push_to_hub=True,
    hub_model_id="hubsnippetai/smollm3-jobs-sft"
)

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=processed_train,
    args=config,
)
trainer.train()
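
For context, the rest of the script loads the model, tokenizer, and dataset roughly like this; the checkpoint and dataset IDs below are placeholders, not the ones from the actual script:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Placeholder checkpoint and dataset IDs, for illustration only
model_id = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# processed_train is assumed to be a datasets.Dataset in a format SFTTrainer accepts
# (e.g. a "text" or "messages" column)
processed_train = load_dataset("username/jobs-dataset", split="train")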

Expected behavior

After the dataset is loaded, training should run to completion without erroring out.
