
MisconfigurationException: Do not set gradient_accumulation_steps in the DeepSpeed config #19891

Open
mxkrn opened this issue May 22, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)


mxkrn commented May 22, 2024

Bug description

I want to use gradient accumulation in a training process that uses a manually configured DeepSpeedStrategy (via a config file) for distributed training. My first attempt was to set gradient_accumulation_steps in deepspeed_config.json while also passing the same value to the Trainer via accumulate_grad_batches. In this case, Lightning raises the following exception:

lightning.fabric.utilities.exceptions.MisconfigurationException: Do not set `gradient_accumulation_steps` in the DeepSpeed config as this will be set with the `accumulate_grad_batches` argument passed via the Lightning Trainer.

That's fine, but when I follow this advice and remove gradient_accumulation_steps from the DeepSpeed config, DeepSpeed itself throws an exception:

AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 16 * 1 * 8
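
To spell out the arithmetic behind that assertion (the invariant is taken from the error message; the numbers are from my config below):

# DeepSpeed requires:
#   train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
train_batch_size = 256
micro_batch_per_gpu = 16
world_size = 8

# The only value that satisfies the invariant with these numbers is 2:
assert train_batch_size == micro_batch_per_gpu * 2 * world_size  # 256 == 16 * 2 * 8

# But DeepSpeed is apparently handed gradient_accumulation_steps = 1, so the check fails:
# 16 * 1 * 8 == 128 != 256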

As a consequence, I'm unable to use gradient accumulation with DeepSpeedStrategy. Am I doing something wrong here, or is this actually a conflict between DeepSpeed and Lightning? In my view, it would be sufficient for Lightning to emit a warning here, or to follow the DeepSpeed configuration file entirely.

I've tested this using deepspeed==0.12.6 and deepspeed==0.14.2 (latest).

What version are you seeing the problem on?

v2.2

How to reproduce the bug

// This is the DeepSpeed config I'm using; gradient_accumulation_steps is shown commented out, per the exception message above.
{
  "train_batch_size": 256,
  // "gradient_accumulation_steps": 2,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_clipping": 1.0,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "scheduler": {
    "type": "WarmupCosineLR",
    "params": {
      "total_num_steps": 250000,
      "warmup_min_ratio": 1e-3,
      "warmup_num_steps": 10000,
      "cos_min_ratio": 1e-3,
      "warmup_type": "linear"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e6,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e6,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
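
The Trainer and strategy are wired up roughly like this (a paraphrased sketch, not my exact script; the config path is a placeholder and the LightningModule / LightningDataModule definitions are omitted):

import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

# Strategy configured manually from the JSON file above
strategy = DeepSpeedStrategy(config="deepspeed_config.json")

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,                   # matches world_size = 8 in the assertion
    strategy=strategy,
    accumulate_grad_batches=2,   # same value as the (commented-out) config entry
)

# lit_klaymm and datamodule are my LightningModule / LightningDataModule, defined elsewhere
trainer.fit(lit_klaymm, datamodule=datamodule)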

Error messages and logs

Here's the full traceback of the first exception, raised by Lightning.

Error executing job with overrides: ['ops.num_workers=48', 'wandb=offline']                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                                      
  File "/home/ubuntu/repos/klaymm/bin/train_lightning.py", line 138, in main                                                                                                                                            
    trainer.fit(lit_klaymm, datamodule=datamodule, ckpt_path=cfg.ops.model_ckpt_path)                                                                                                                                   
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit                                                                                        
    call._call_and_handle_interrupt(                                                                                                                                                                                    
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt                                                                     
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch                                                              
    return function(*args, **kwargs)                                                                                                                                                                                    
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl                                                                                  
    self._run(model, ckpt_path=ckpt_path)                                                                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 963, in _run                                                                                       
    self.strategy.setup(self)                                                                                                                                                                                           
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 335, in setup                                                                                 
    self._init_config_if_needed()                                                                                                                                                                                       
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 802, in _init_config_if_needed                                                                
    self._format_config()                                                                                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 811, in _format_config                                                                        
    self._format_batch_size_and_grad_accum_config()                                                                                                                                                                     
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 898, in _format_batch_size_and_grad_accum_config                                              
    raise MisconfigurationException(                                                                                                                                                                                    
lightning.fabric.utilities.exceptions.MisconfigurationException: Do not set `gradient_accumulation_steps` in the DeepSpeed config as this will be set with the `accumulate_grad_batches` argument passed via the Lightning Trainer.
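
For context, the check that raises this lives in _format_batch_size_and_grad_accum_config (the last frame above). As far as I can tell from the message, Lightning refuses a user-supplied gradient_accumulation_steps and instead writes the Trainer's accumulate_grad_batches into the config itself; a paraphrased sketch of that behaviour (my reading of the message, not the actual Lightning source) is:

from lightning.fabric.utilities.exceptions import MisconfigurationException

def format_grad_accum(ds_config: dict, accumulate_grad_batches: int) -> dict:
    # Paraphrased from the exception message, not copied from
    # lightning/pytorch/strategies/deepspeed.py.
    if "gradient_accumulation_steps" in ds_config:
        raise MisconfigurationException(
            "Do not set `gradient_accumulation_steps` in the DeepSpeed config ..."
        )
    # Lightning fills the value in from the Trainer argument instead
    ds_config["gradient_accumulation_steps"] = accumulate_grad_batches
    return ds_config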

And here's the full traceback of the second exception, raised by DeepSpeed.

Traceback (most recent call last):                                                                                                                                                                                      
  File "/home/ubuntu/repos/klaymm/bin/train_lightning.py", line 138, in main                                                                                                                                            
    # fit                                                                                                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit                                                                                        
    call._call_and_handle_interrupt(                                                                                                                                                                                    
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt                                                                     
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch                                                              
    return function(*args, **kwargs)                                                                                                                                                                                    
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl                                                                                  
    self._run(model, ckpt_path=ckpt_path)                                                                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 963, in _run                                                                                       
    self.strategy.setup(self)                                                                                                                                                                                           
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 353, in setup                                                                                 
    self.init_deepspeed()                                                                                                                                                                                               
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 454, in init_deepspeed                                                                        
    self._initialize_deepspeed_train(self.model)
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 490, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/lightning/pytorch/strategies/deepspeed.py", line 426, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/deepspeed/__init__.py", line 157, in initialize
    config_class = DeepSpeedConfig(config, mpu)
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 786, in __init__
    self._configure_train_batch_size()
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 966, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/ubuntu/miniconda3/envs/klaymm/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 914, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 16 * 1 * 8

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response
