
Is there any solution to overcome underflow issues? #5426

Closed
zhoubay opened this issue Apr 17, 2024 · 2 comments

Comments

@zhoubay

zhoubay commented Apr 17, 2024

Hi there,

I've recently been running code with deepspeed and hit the following underflow issue:
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

I'm using pytorch_lightning together with deepspeed. The full traceback is:

Traceback (most recent call last):
  File "/myproject/scripts/train.py", line 141, in <module>
    main(get_args())
  File "/myproject/scripts/train.py", line 137, in main
    run(config)
  File "/myproject/scripts/train.py", line 88, in run
    trainer.fit(model=model, datamodule=data_module)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 41, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 91, in launch
    return function(*args, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 218, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 185, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 260, in _optimizer_step
    call._call_lightning_module_hook(
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 140, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/myproject/model/abstract_model.py", line 118, in optimizer_step
    super().optimizer_step(
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1256, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 155, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 256, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 225, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/myenv/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 102, in optimizer_step
    return deepspeed_engine.step(**kwargs)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2157, in step
    self._take_model_step(lr_kwargs)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2063, in _take_model_step
    self.optimizer.step()
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1799, in step
    self._update_scale(self.overflow)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2050, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

I've found that the exception is raised by the following check. To avoid it, I'd like to set the property raise_error_at_min_scale to False.

if (self.cur_scale == self.min_scale) and self.raise_error_at_min_scale:
    raise Exception(
        "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.")

However, after searching the official docs, I couldn't find any way to configure this, or even a page that mentions the property.

The search URL is here:

https://deepspeed.readthedocs.io/en/stable/search.html?q=raise_error_at_min_scale&check_keywords=yes&area=default

Could you please tell me how to set this property, what scenarios can cause this issue, and how to avoid it?

Thank you for your attention!

@xuanhua

xuanhua commented Apr 29, 2024

@zhoubay Yes, it looks like there is nowhere to set raise_error_at_min_scale to False if you grep deepspeed's source code.
Also, this is not an 'underflow' issue but an 'overflow' one. Maybe you should just train in fp32 or bf16 instead of fp16. You could also double-check your code for a bug that might be causing the overflow.
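As a rough sketch (adjust to your own setup; keys can differ across DeepSpeed versions), switching the precision section of the DeepSpeed config from fp16 to bf16 would look something like this. bf16 has the same exponent range as fp32, so no dynamic loss scaling is needed, but it requires hardware support (e.g. Ampere or newer GPUs):

# Hypothetical config fragment; batch size and ZeRO stage are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},
    # "fp16": {"enabled": True},   # dynamic loss scaling, can hit the minimum scale
    "bf16": {"enabled": True},     # wider exponent range, no loss scaling
}

If you stay on fp16, the fp16 section also accepts knobs such as initial_scale_power, loss_scale_window, hysteresis and min_loss_scale that control the dynamic scaler. With pytorch_lightning you would additionally pick the matching precision on the Trainer, e.g. precision="bf16-mixed" in recent versions (or "bf16" in older ones).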

@zhoubay
Author

zhoubay commented Apr 29, 2024


Thank you for your reply! After switching training from fp16 to bf16, the error disappeared!

I'm closing this issue!

@zhoubay zhoubay closed this as completed Apr 29, 2024