Hi there,
I've recently been running code with DeepSpeed and ran into what looks like an underflow issue:
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
I'm using pytorch_lightning together with deepspeed; the full traceback is:
Traceback (most recent call last):
File "/myproject/scripts/train.py", line 141, in <module>
main(get_args())
File "/myproject/scripts/train.py", line 137, in main
run(config)
File "/myproject/scripts/train.py", line 88, in run
trainer.fit(model=model, datamodule=data_module)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
call._call_and_handle_interrupt(
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 41, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 91, in launch
return function(*args, **kwargs)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
results = self._run_stage()
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
self.fit_loop.run()
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
self.advance()
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
self.advance(data_fetcher)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 218, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 185, in run
self._optimizer_step(kwargs.get("batch_idx", 0), closure)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 260, in _optimizer_step
call._call_lightning_module_hook(
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 140, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/myproject/model/abstract_model.py", line 118, in optimizer_step
super().optimizer_step(
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1256, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 155, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 256, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 225, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/myenv/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 102, in optimizer_step
return deepspeed_engine.step(**kwargs)
File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2157, in step
self._take_model_step(lr_kwargs)
File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2063, in _take_model_step
self.optimizer.step()
File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1799, in step
self._update_scale(self.overflow)
File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2050, in _update_scale
self.loss_scaler.update_scale(has_overflow)
File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
I've found that the exception is raised by the following code (DeepSpeed/deepspeed/runtime/fp16/loss_scaler.py, lines 174 to 176 at commit a9cbd68). To avoid the exception, I want to set the property raise_error_at_min_scale to False. However, after searching your official docs, there is no method for setting it, or even a page mentioning this property; the search URL is https://deepspeed.readthedocs.io/en/stable/search.html?q=raise_error_at_min_scale&check_keywords=yes&area=default
Could you please tell me how to set this property, which scenarios can cause this issue, and how to avoid it? Thank you for your attention!
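For readers who hit the same error: the logic behind that exception can be sketched in a few lines. The class below is a simplified, hypothetical re-implementation of a DeepSpeed-style dynamic loss scaler, not the actual source; names mirror the traceback, and the defaults are illustrative.

```python
# Simplified sketch of dynamic loss scaling as used for fp16 training.
# NOT DeepSpeed's actual code: names and defaults are illustrative only.

class DynamicLossScaler:
    def __init__(self, init_scale=2.0**16, scale_factor=2.0, min_scale=1.0,
                 raise_error_at_min_scale=True):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale
        self.raise_error_at_min_scale = raise_error_at_min_scale

    def update_scale(self, has_overflow):
        if has_overflow:
            if self.cur_scale == self.min_scale and self.raise_error_at_min_scale:
                raise Exception(
                    "Current loss scale already at minimum - "
                    "cannot decrease scale anymore. Exiting run.")
            # Halve the scale on overflow, but never drop below the minimum.
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
        else:
            # Grow the scale again after overflow-free steps (the real scaler
            # waits for a window of clean steps; omitted here for brevity).
            self.cur_scale *= self.scale_factor

scaler = DynamicLossScaler()
# 16 consecutive overflowing steps drive the scale from 2**16 down to 1 ...
for _ in range(16):
    scaler.update_scale(has_overflow=True)
assert scaler.cur_scale == 1.0
# ... and one more overflow at the minimum reproduces the exception above.
```

In other words, the exception only fires after every step has overflowed all the way down to the minimum scale, which is why it usually signals a persistent overflow in the model rather than a scaler bug.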
@zhoubay Yes, it looks like there is nowhere to set raise_error_at_min_scale to False if you grep DeepSpeed's source code.
Also, this is not an 'underflow' issue but an 'overflow' one. You could train in fp32 or bf16 instead of fp16. It's also worth double-checking your code for a bug that causes the overflow.
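To illustrate why switching away from fp16 helps: float16 has only a 5-bit exponent, so its largest finite value is 65504, and activations or gradients beyond that overflow to infinity, which is what keeps forcing the loss scale down. The sketch below checks this range using only the standard library (struct's "e" format is IEEE-754 half precision and refuses to pack finite values outside it); bfloat16 keeps float32's 8-bit exponent, so its range tops out near 3.4e38 and the same values are unremarkable.

```python
import struct

def fits_in_fp16(x: float) -> bool:
    """Return True if the finite value x is within IEEE-754 half range."""
    try:
        struct.pack("e", x)  # "e" = half-precision (float16) format
        return True
    except OverflowError:
        return False

print(fits_in_fp16(65504.0))  # True: 65504 is float16's largest finite value
print(fits_in_fp16(1.0e5))    # False: overflows half precision
# bfloat16 shares float32's exponent width, so 1e5 (and anything up to
# roughly 3.4e38) is representable there, at the cost of less mantissa.
```

On the Lightning side, recent PyTorch Lightning versions expose this as the Trainer's precision flag (e.g. "bf16-mixed" instead of "16-mixed"); the exact spelling depends on your Lightning version, so check its docs.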
Thank you for your reply! After switching training from fp16 to bf16, the error disappeared!
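For anyone configuring DeepSpeed directly (rather than through Lightning's precision flag), the equivalent change is a config edit along these lines. This is a sketch following DeepSpeed's fp16/bf16 config schema; note that bf16 requires hardware support (e.g. Ampere-or-newer GPUs) and a reasonably recent DeepSpeed.

```json
{
  "bf16": { "enabled": true },
  "fp16": { "enabled": false }
}
```

If you must stay on fp16, the "fp16" section also accepts knobs such as "initial_scale_power", "loss_scale_window", and "min_loss_scale" that tune the dynamic loss scaler, but none of them disables raise_error_at_min_scale; fixing the source of the overflow (or moving to bf16/fp32) is the actual remedy.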