Crash when saving MPT checkpoints trained via FSDP #221

Closed
jquesnelle opened this issue May 25, 2023 · 11 comments

jquesnelle commented May 25, 2023

When finetuning mpt-7B-storywriter with FSDP, llm-foundry crashes when saving a checkpoint. This was using the latest master (4c94c20), running on a single node with 4x A100 40GB.

```
Traceback (most recent call last):
  File "/workspace/llm-foundry/scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "/workspace/llm-foundry/scripts/train/train.py", line 243, in main
    trainer.fit()
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1996, in _train_loop
    self.engine.run_event(Event.BATCH_CHECKPOINT)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/engine.py", line 293, in run_event
    self._run_nonlogger_callbacks(event)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/engine.py", line 475, in _run_nonlogger_callbacks
    self._run_callbacks(event, callbacks)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/engine.py", line 467, in _run_callbacks
    cb.run_event(event, self.state, self.logger)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/callback.py", line 96, in run_event
    return event_cb(state, logger)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/callbacks/checkpoint_saver.py", line 346, in batch_checkpoint
    self._save_checkpoint(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
    'state': state.state_dict(),
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/state.py", line 802, in state_dict
    fsdp_get_optim_state_dict(self.model, optimizer, state_dict_type=self.fsdp_state_dict_type)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/state.py", line 127, in fsdp_get_optim_state_dict
    optim_state_dict = FSDP.optim_state_dict(model, optim)  # type: ignore
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1753, in optim_state_dict
    return FullyShardedDataParallel._optim_state_dict_impl(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1154, in _optim_state_dict_impl
    return _optim_state_dict(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1455, in _optim_state_dict
    _gather_orig_param_state(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1690, in _gather_orig_param_state
    gathered_state = _all_gather_optim_state(fsdp_state, optim_state)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1637, in _all_gather_optim_state
    for name, non_tensor_value in object_state.non_tensors.items():
AttributeError: 'int' object has no attribute 'items'
```
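
To make the failing pattern in the last two frames concrete, here is a minimal, self-contained sketch; this is an illustration only, not torch's actual internals. The gathering helper expects each parameter's non-tensor optimizer state to be a dict it can iterate with .items(), but it receives a bare int instead.

```python
# Illustration of the AttributeError above (not torch's real code): the FSDP
# optim-state gathering step assumes a dict of non-tensor state values, but an
# Adam-style optimizer can hand it a plain Python int (e.g. the `step` counter).
def gather_non_tensor_state(non_tensors):
    # Assumes `non_tensors` is a mapping like {"step": 42}.
    return {name: value for name, value in non_tensors.items()}

print(gather_non_tensor_state({"step": 42}))  # fine: {'step': 42}

try:
    gather_non_tensor_state(42)               # the situation hit in the traceback
except AttributeError as err:
    print(err)                                # 'int' object has no attribute 'items'
```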
jquesnelle changed the title from "Crash when saving checkpoints trained via FSDP" to "Crash when saving MPT checkpoints trained via FSDP" on May 25, 2023
abhi-mosaic (Member) commented May 26, 2023

@jquesnelle Thank you for the report. Could you confirm if you are using our Docker image, and if not, what version of torch are you using? This potentially looks like an issue with torch 2.0 and FSDP, or Composer's usage of it.
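
For anyone collecting the same information, the relevant versions can be read directly from the torch build (standard torch attributes; the example outputs are illustrative):

```python
# One-liner to report the torch build in use.
import torch

print(torch.__version__)   # e.g. "2.0.1+cu118" or "1.13.1+cu117"
print(torch.version.cuda)  # CUDA toolkit the wheel was built against
```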

@dakinggg (Collaborator):

This is a pytorch 2 bug. We are implementing a workaround in composer (mosaicml/composer#2237). Your options in the meantime are:

  1. Go back to torch 1.13.1
  2. Use a different optimizer, like LION (https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/optim/lion.py); a rough sketch follows this list. You can see the contents of the linked PR for the root cause, but in short, PyTorch 2 + FSDP errors out when there are non-tensor elements in the optimizer state.
  3. Try to use the branch from that composer PR.
  4. Wait for it to be fixed in pytorch (this could be a while :))
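
For option 2, here is a rough sketch of constructing the LION optimizer from llm-foundry in a standalone script. The import path, class name, and hyperparameters are assumptions based on the linked lion.py, so verify them against your checkout; in the YAML-driven scripts/train/train.py flow, the equivalent change is made in the optimizer section of the training config instead.

```python
# Rough sketch of workaround 2 (assumptions noted): swap the optimizer for LION
# from llm-foundry so the optimizer state holds only tensors. The import path,
# class name, and hyperparameters come from reading llmfoundry/optim/lion.py and
# may differ in your checkout.
import torch
from llmfoundry.optim import DecoupledLionW  # assumed export

model = torch.nn.Linear(8, 8)  # stand-in for the FSDP-wrapped MPT model

optimizer = DecoupledLionW(
    model.parameters(),
    lr=1e-4,                # illustrative values, not tuned for mpt-7b-storywriter
    betas=(0.9, 0.99),
    weight_decay=0.0,
)
```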

@sashaDoubov (Contributor):

To add to @dakinggg's answer, we reproduced this issue, and the workaround is now merged into composer. So saving with FSDP + torch 2.0 should work when using composer dev, i.e. pip install git+https://github.com/mosaicml/composer.git@dev, together with the latest llm-foundry (7754c71).

jquesnelle (Author) commented May 26, 2023

@sashaDoubov I installed composer directly from dev (as described above) and got the same result; although perhaps llm-foundry overwrote the composer requirement? I'll try with torch==1.13.1 now (these are slightly expensive experiments 😂)

Edit: Yeah, I think pip install -e . overwrote the installed composer@dev, since llm-foundry pins specific version requirements for composer in install_requires.

sashaDoubov (Contributor) commented May 26, 2023

I see... As you mentioned, pip install -e ".[gpu]" will overwrite the composer pip install. I was able to get it working by pip installing composer dev after installing the llm-foundry requirements. Hope torch==1.13.1 is working for you, though; please let us know if you have issues.
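
A quick way to confirm which Composer build actually ended up installed after the dependency resolution (the version strings in the comments are examples only, not the exact versions involved in this issue):

```python
# Check which Composer install survived `pip install -e ".[gpu]"`.
import composer

print(composer.__version__)
# A released wheel prints something like "0.14.1"; an install from
# git+https://github.com/mosaicml/composer.git@dev typically carries a dev-style
# version string, which is what you want while the fix is unreleased.
```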

@abhi-mosaic (Member):

Closing this as it seems like we have a workaround, but @jquesnelle please feel free to reopen if you run into further issues.

@jquesnelle (Author):

> Closing this as it seems like we have a workaround, but @jquesnelle please feel free to reopen if you run into further issues.

Meant to circle back and confirm that this does work!

@germanjke:

Was this fixed in the latest composer version?

@germanjke:

I'm using the mosaicml/llm-foundry:2.0.1_cu118-latest Docker image and am hitting this bug.

@sashaDoubov (Contributor):

Hi @germanjke, the llm-foundry Docker images currently install dependencies from setup.py, which points to our latest composer 0.14.1 release. That release does not contain the fix for this bug, but our dev branch does, so we recommend running pip uninstall composer and then pip install git+https://github.com/mosaicml/composer.git@dev before running composer train/train.py. This will be fixed in a future release, but that is the workaround for now (torch 2 support is still a work in progress).
Please let us know if that workaround works for you!

bmosaicml pushed a commit that referenced this issue Jun 6, 2023
* enabling logit scaling

* abhi review cmt

* adding warning

* add tests to exercise code path

* enable str logit_scale options (samhavens review)

@germanjke:

Hi @sashaDoubov @abhi-mosaic @vchiley, has this issue been fixed so that the composer version required in llm-foundry/setup.py is now OK, or should I still use the composer dev version?
