
Is there any solution to overcome underflow issues? #5426

Closed
zhoubay opened this issue Apr 17, 2024 · 2 comments

Comments

@zhoubay

zhoubay commented Apr 17, 2024

Hi there,

I've recently been running code with deepspeed and hit the following underflow issue:
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

I'm using pytorch_lightning together with deepspeed. The full traceback is:

Traceback (most recent call last):
  File "/myproject/scripts/train.py", line 141, in <module>
    main(get_args())
  File "/myproject/scripts/train.py", line 137, in main
    run(config)
  File "/myproject/scripts/train.py", line 88, in run
    trainer.fit(model=model, datamodule=data_module)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 41, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 91, in launch
    return function(*args, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 218, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 185, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 260, in _optimizer_step
    call._call_lightning_module_hook(
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 140, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/myproject/model/abstract_model.py", line 118, in optimizer_step
    super().optimizer_step(
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1256, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 155, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 256, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 225, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 102, in optimizer_step
    return deepspeed_engine.step(**kwargs)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2157, in step
    self._take_model_step(lr_kwargs)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2063, in _take_model_step
    self.optimizer.step()
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1799, in step
    self._update_scale(self.overflow)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2050, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

I've found that the exception is raised by the following check. To avoid it, I'd like to set the property raise_error_at_min_scale to False.

if (self.cur_scale == self.min_scale) and self.raise_error_at_min_scale:
    raise Exception(
        "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.")

However, after searching the official docs, I couldn't find any way to configure this, or even a page that mentions the property.

The search URL is here:

https://deepspeed.readthedocs.io/en/stable/search.html?q=raise_error_at_min_scale&check_keywords=yes&area=default

Could you please tell me how to set this property, what scenarios can cause this issue, and how to avoid it?

Thank you for your attention!

@xuanhua

xuanhua commented Apr 29, 2024

@zhoubay Yes, it looks like there is nowhere to set raise_error_at_min_scale to False if you grep deepspeed's source code.
Also, this is not an 'underflow' issue but an 'overflow' one. Maybe you should just train in fp32 or bf16 instead of fp16. You could also double-check your code for a bug that might be causing the overflow.
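As a rough sketch (adjust to your own setup; keys can differ across DeepSpeed versions), switching the precision section of the DeepSpeed config from fp16 to bf16 would look something like this. bf16 has the same exponent range as fp32, so no dynamic loss scaling is needed, but it requires hardware support (e.g. Ampere or newer GPUs):

# Hypothetical config fragment; batch size and ZeRO stage are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},
    # "fp16": {"enabled": True},   # dynamic loss scaling, can hit the minimum scale
    "bf16": {"enabled": True},     # wider exponent range, no loss scaling
}

If you stay on fp16, the fp16 section also accepts knobs such as initial_scale_power, loss_scale_window, hysteresis and min_loss_scale that control the dynamic scaler. With pytorch_lightning you would additionally pick the matching precision on the Trainer, e.g. precision="bf16-mixed" in recent versions (or "bf16" in older ones).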

@zhoubay
Author

zhoubay commented Apr 29, 2024


Thank you for your reply! After switching training from fp16 to bf16, the error disappeared!

I'm closing this issue!

@zhoubay zhoubay closed this as completed Apr 29, 2024