
loss = 0 after first log with trainer API #31391

Closed · 2 of 4 tasks
not-lain opened this issue Jun 12, 2024 · 3 comments

not-lain (Contributor) commented Jun 12, 2024

System Info

  • transformers version: 4.41.2
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

cc @muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://colab.research.google.com/drive/1Rlcbd3SCibgJQmzAGZW7pyiK01pwumB4?usp=sharing
[screenshot: training logs showing the loss dropping to 0.0 after the first logged step]

Expected behavior

A normal, non-zero training loss.

SunMarc (Member) commented Jun 12, 2024

Hi @not-lain, that's probably not a bug. The dataset has only 200 rows. Could you try with a bigger dataset?

not-lain (Contributor, Author) commented

Hi @SunMarc, I have tried with both not-lain/docci and not-lain/docci-small, and the same behavior persisted after the first log.

not-lain (Contributor, Author) commented Jun 12, 2024

@SunMarc after some fiddling, it seems this is not related to the Trainer API but rather to my training script:

  • before: [screenshot: loss stuck at 0.0 after the first logged step]
  • after applying:

```python
model.text_model.train()                                      # ensure the text model is in training mode
model.config.use_cache = False                                # the KV cache is incompatible with gradient checkpointing
model.text_model.transformer.gradient_checkpointing_enable()  # recompute activations in backward to save memory
torch.autograd.set_detect_anomaly(True)                       # report which backward op produces NaN
```
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 10
      1 from transformers import Trainer
      3 trainer = Trainer(
      4         model=model,
      5         train_dataset=data['train'],
   (...)
      8         args=args
      9         )
---> 10 trainer.train()

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:3250, in Trainer.training_step(***failed resolving arguments***)
   3248         scaled_loss.backward()
   3249 else:
-> 3250     self.accelerator.backward(loss)
   3252 return loss.detach() / self.args.gradient_accumulation_steps

File ~/.local/lib/python3.11/site-packages/accelerate/accelerator.py:2134, in Accelerator.backward(self, loss, **kwargs)
   2132     self.lomo_backward(loss, learning_rate)
   2133 else:
-> 2134     loss.backward(**kwargs)

File ~/.local/lib/python3.11/site-packages/torch/_tensor.py:525, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    515 if has_torch_function_unary(self):
    516     return handle_torch_function(
    517         Tensor.backward,
    518         (self,),
   (...)
    523         inputs=inputs,
    524     )
--> 525 torch.autograd.backward(
    526     self, gradient, retain_graph, create_graph, inputs=inputs
    527 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/__init__.py:267, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    262     retain_graph = create_graph
    264 # The reason we repeat the same comment below is that
    265 # some Python versions print out the first line of a multi-line function
    266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
    268     tensors,
    269     grad_tensors_,
    270     retain_graph,
    271     create_graph,
    272     inputs,
    273     allow_unreachable=True,
    274     accumulate_grad=True,
    275 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
    742     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743 try:
--> 744     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745         t_outputs, *args, **kwargs
    746     )
    747 finally:
    748     if attach_logging_hooks:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
```
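For context, `torch.autograd.set_detect_anomaly(True)` is what turns a silently propagating NaN into the RuntimeError above. A minimal, self-contained sketch of the same failure mode (not the original training script; the `inf` logit here is just a stand-in for whatever produced the NaN in the real run):

```python
import torch

# Anomaly mode makes autograd raise as soon as a backward op emits NaN,
# naming the offending function (here LogSoftmaxBackward0).
torch.autograd.set_detect_anomaly(True)

# An inf logit makes log_softmax produce NaN: inf - logsumexp(...) = inf - inf.
logits = torch.tensor([float("inf"), 1.0], requires_grad=True)
loss = torch.log_softmax(logits, dim=0).sum()

caught = False
try:
    loss.backward()
except RuntimeError:
    caught = True  # raised instead of silently filling grads with NaN

torch.autograd.set_detect_anomaly(False)
print(caught)
```

Without anomaly mode the backward pass would simply write NaN gradients, which an optimizer step can then turn into a degenerate (e.g. constant-zero) loss.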

Since this is not related to the Trainer API, I'm closing this one. Thanks for the support!
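A side note on the fix: `gradient_checkpointing_enable()` saves memory by recomputing activations during the backward pass, which is why `use_cache = False` is also set (the generation KV cache conflicts with recomputation). A torch-level sketch of the same idea, using a toy model rather than the actual architecture from this issue:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy two-layer model: the hidden activation of layer1 is not stored for
# backward; it is recomputed when gradients are needed.
layer1 = torch.nn.Linear(8, 8)
layer2 = torch.nn.Linear(8, 1)

x = torch.randn(4, 8, requires_grad=True)
hidden = checkpoint(layer1, x, use_reentrant=False)  # checkpointed segment
loss = layer2(hidden).sum()
loss.backward()  # layer1's forward runs again here to rebuild `hidden`
```

Gradients still flow to `x` exactly as without checkpointing; only the memory/compute trade-off changes.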
