
loss = 0 after first log with trainer API #31391

Closed · 2 of 4 tasks
not-lain opened this issue Jun 12, 2024 · 3 comments

not-lain (Contributor) commented Jun 12, 2024

System Info

  • transformers version: 4.41.2
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

cc @muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://colab.research.google.com/drive/1Rlcbd3SCibgJQmzAGZW7pyiK01pwumB4?usp=sharing
[screenshot: training logs showing the loss dropping to 0.0 after the first logged step]

Expected behavior

A normal, non-zero training loss.

SunMarc (Member) commented Jun 12, 2024

Hi @not-lain, that's probably not a bug. The dataset has only 200 rows. Could you try with a bigger dataset?

not-lain (Contributor, Author) commented

Hi @SunMarc, I have tried with both not-lain/docci and not-lain/docci-small, and the same behavior persisted after the first log.

not-lain (Contributor, Author) commented Jun 12, 2024

@SunMarc after some fiddling, it seems this is not related to the Trainer API but rather to my training script:

  • before: [screenshot: loss stuck at 0.0 after the first logged step]
  • after applying:

```python
model.text_model.train()                                      # ensure the text model is in training mode
model.config.use_cache = False                                # the KV cache is incompatible with gradient checkpointing
model.text_model.transformer.gradient_checkpointing_enable()  # recompute activations in backward to save memory
torch.autograd.set_detect_anomaly(True)                       # report which backward op produces NaN
```
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 10
      1 from transformers import Trainer
      3 trainer = Trainer(
      4         model=model,
      5         train_dataset=data['train'],
   (...)
      8         args=args
      9         )
---> 10 trainer.train()

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:3250, in Trainer.training_step(***failed resolving arguments***)
   3248         scaled_loss.backward()
   3249 else:
-> 3250     self.accelerator.backward(loss)
   3252 return loss.detach() / self.args.gradient_accumulation_steps

File ~/.local/lib/python3.11/site-packages/accelerate/accelerator.py:2134, in Accelerator.backward(self, loss, **kwargs)
   2132     self.lomo_backward(loss, learning_rate)
   2133 else:
-> 2134     loss.backward(**kwargs)

File ~/.local/lib/python3.11/site-packages/torch/_tensor.py:525, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    515 if has_torch_function_unary(self):
    516     return handle_torch_function(
    517         Tensor.backward,
    518         (self,),
   (...)
    523         inputs=inputs,
    524     )
--> 525 torch.autograd.backward(
    526     self, gradient, retain_graph, create_graph, inputs=inputs
    527 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/__init__.py:267, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    262     retain_graph = create_graph
    264 # The reason we repeat the same comment below is that
    265 # some Python versions print out the first line of a multi-line function
    266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
    268     tensors,
    269     grad_tensors_,
    270     retain_graph,
    271     create_graph,
    272     inputs,
    273     allow_unreachable=True,
    274     accumulate_grad=True,
    275 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
    742     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743 try:
--> 744     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745         t_outputs, *args, **kwargs
    746     )
    747 finally:
    748     if attach_logging_hooks:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
```
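For context, `torch.autograd.set_detect_anomaly(True)` is what turns a silently propagating NaN into the RuntimeError above. A minimal, self-contained sketch of the same failure mode (not the original training script; the `inf` logit here is just a stand-in for whatever produced the NaN in the real run):

```python
import torch

# Anomaly mode makes autograd raise as soon as a backward op emits NaN,
# naming the offending function (here LogSoftmaxBackward0).
torch.autograd.set_detect_anomaly(True)

# An inf logit makes log_softmax produce NaN: inf - logsumexp(...) = inf - inf.
logits = torch.tensor([float("inf"), 1.0], requires_grad=True)
loss = torch.log_softmax(logits, dim=0).sum()

caught = False
try:
    loss.backward()
except RuntimeError:
    caught = True  # raised instead of silently filling grads with NaN

torch.autograd.set_detect_anomaly(False)
print(caught)
```

Without anomaly mode the backward pass would simply write NaN gradients, which an optimizer step can then turn into a degenerate (e.g. constant-zero) loss.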

Since this is not related to the Trainer API, I'm closing this one. Thanks for the support!
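A side note on the fix: `gradient_checkpointing_enable()` saves memory by recomputing activations during the backward pass, which is why `use_cache = False` is also set (the generation KV cache conflicts with recomputation). A torch-level sketch of the same idea, using a toy model rather than the actual architecture from this issue:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy two-layer model: the hidden activation of layer1 is not stored for
# backward; it is recomputed when gradients are needed.
layer1 = torch.nn.Linear(8, 8)
layer2 = torch.nn.Linear(8, 1)

x = torch.randn(4, 8, requires_grad=True)
hidden = checkpoint(layer1, x, use_reentrant=False)  # checkpointed segment
loss = layer2(hidden).sum()
loss.backward()  # layer1's forward runs again here to rebuild `hidden`
```

Gradients still flow to `x` exactly as without checkpointing; only the memory/compute trade-off changes.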
