System Info
When performing a fine-tuning job with a batch size of 4 and max steps of 1000, training errors out right after a tokenizer warning:
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': None}.
0%| | 0/500 [00:00<?, ?it/s]Traceback (most recent call last):
File "/tmp/script.py", line 191, in <module>
trainer.train()
File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/transformers/trainer.py", line 2328, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/transformers/trainer.py", line 2672, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 1189, in training_step
return super().training_step(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/transformers/trainer.py", line 4009, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/uv/environments-v2/script-912247c0edd68a55/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 1123, in compute_loss
entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0
0%| | 0/500 [00:01<?, ?it/s]
What might be wrong here?
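The failing line is an element-wise multiply between the per-token entropy and the attention mask. The shapes below are made up from the numbers in the error message, only to show the same broadcast failure in isolation:
import torch

# Shapes taken from the error message: batch dim 4 vs. batch dim 8 cannot broadcast.
per_token_entropy = torch.randn(4, 512)  # (batch_size, seq_len)
attention_mask = torch.ones(8, 512)      # same seq_len, but a different batch dimension
entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()
# -> RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0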
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
SFT fine-tuning on an a10-largex2 GPU.
Training parameters:
from trl import SFTConfig, SFTTrainer

# model and processed_train are created earlier in the script

# Configure training
config = SFTConfig(
    output_dir="./smollm3-jobs-sft",
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    max_steps=1000,
    logging_steps=50,
    save_steps=200,
    push_to_hub=True,
    hub_model_id="hubsnippetai/smollm3-jobs-sft",
)

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=processed_train,
    args=config,
)
trainer.train()
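For context, model and processed_train come from earlier in the script. A rough sketch of that part, where the base checkpoint and dataset file are assumptions inferred from the "smollm3-jobs" naming above (not the exact code):
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base checkpoint; the real script may use a different one.
base_model = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Assumed dataset loading; the real script builds processed_train from its own jobs data.
raw_dataset = load_dataset("json", data_files="jobs_dataset.jsonl", split="train")
processed_train = raw_dataset  # stand-in for whatever preprocessing the script applies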
Expected behavior
After loading the datasets, the job should keep running and train the model without erroring out.