Crash when saving MPT checkpoints trained via FSDP #221

Closed
jquesnelle opened this issue May 25, 2023 · 11 comments

jquesnelle commented May 25, 2023

When finetuning mpt-7B-storywriter with FSDP, llm-foundry crashes when saving a checkpoint. This was using the latest master (4c94c20), running on a single node with 4x A100 40GB.

```
Traceback (most recent call last):
  File "/workspace/llm-foundry/scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "/workspace/llm-foundry/scripts/train/train.py", line 243, in main
    trainer.fit()
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1996, in _train_loop
    self.engine.run_event(Event.BATCH_CHECKPOINT)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/engine.py", line 293, in run_event
    self._run_nonlogger_callbacks(event)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/engine.py", line 475, in _run_nonlogger_callbacks
    self._run_callbacks(event, callbacks)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/engine.py", line 467, in _run_callbacks
    cb.run_event(event, self.state, self.logger)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/callback.py", line 96, in run_event
    return event_cb(state, logger)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/callbacks/checkpoint_saver.py", line 346, in batch_checkpoint
    self._save_checkpoint(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
    'state': state.state_dict(),
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/state.py", line 802, in state_dict
    fsdp_get_optim_state_dict(self.model, optimizer, state_dict_type=self.fsdp_state_dict_type)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/composer/core/state.py", line 127, in fsdp_get_optim_state_dict
    optim_state_dict = FSDP.optim_state_dict(model, optim)  # type: ignore
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1753, in optim_state_dict
    return FullyShardedDataParallel._optim_state_dict_impl(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1154, in _optim_state_dict_impl
    return _optim_state_dict(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1455, in _optim_state_dict
    _gather_orig_param_state(
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1690, in _gather_orig_param_state
    gathered_state = _all_gather_optim_state(fsdp_state, optim_state)
  File "/workspace/llm-foundry/env/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1637, in _all_gather_optim_state
    for name, non_tensor_value in object_state.non_tensors.items():
AttributeError: 'int' object has no attribute 'items'
```
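
To make the failing pattern in the last two frames concrete, here is a minimal, self-contained sketch; this is an illustration only, not torch's actual internals. The gathering helper expects each parameter's non-tensor optimizer state to be a dict it can iterate with .items(), but it receives a bare int instead.

```python
# Illustration of the AttributeError above (not torch's real code): the FSDP
# optim-state gathering step assumes a dict of non-tensor state values, but an
# Adam-style optimizer can hand it a plain Python int (e.g. the `step` counter).
def gather_non_tensor_state(non_tensors):
    # Assumes `non_tensors` is a mapping like {"step": 42}.
    return {name: value for name, value in non_tensors.items()}

print(gather_non_tensor_state({"step": 42}))  # fine: {'step': 42}

try:
    gather_non_tensor_state(42)               # the situation hit in the traceback
except AttributeError as err:
    print(err)                                # 'int' object has no attribute 'items'
```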
jquesnelle changed the title from "Crash when saving checkpoints trained via FSDP" to "Crash when saving MPT checkpoints trained via FSDP" on May 25, 2023
abhi-mosaic (Member) commented May 26, 2023

@jquesnelle Thank you for the report. Could you confirm if you are using our Docker image, and if not, what version of torch are you using? This potentially looks like an issue with torch 2.0 and FSDP, or Composer's usage of it.
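
For anyone collecting the same information, the relevant versions can be read directly from the torch build (standard torch attributes; the example outputs are illustrative):

```python
# One-liner to report the torch build in use.
import torch

print(torch.__version__)   # e.g. "2.0.1+cu118" or "1.13.1+cu117"
print(torch.version.cuda)  # CUDA toolkit the wheel was built against
```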

@dakinggg (Collaborator):

This is a pytorch 2 bug. We are implementing a workaround in composer (mosaicml/composer#2237). Your options in the meantime are:

  1. Go back to torch 1.13.1
  2. Use a different optimizer, like LION (https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/optim/lion.py); a rough sketch follows this list. You can see the contents of the linked PR for the root cause, but in short, PyTorch 2 + FSDP errors out when there are non-tensor elements in the optimizer state.
  3. Try to use the branch from that composer PR.
  4. Wait for it to be fixed in pytorch (this could be a while :))
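
For option 2, here is a rough sketch of constructing the LION optimizer from llm-foundry in a standalone script. The import path, class name, and hyperparameters are assumptions based on the linked lion.py, so verify them against your checkout; in the YAML-driven scripts/train/train.py flow, the equivalent change is made in the optimizer section of the training config instead.

```python
# Rough sketch of workaround 2 (assumptions noted): swap the optimizer for LION
# from llm-foundry so the optimizer state holds only tensors. The import path,
# class name, and hyperparameters come from reading llmfoundry/optim/lion.py and
# may differ in your checkout.
import torch
from llmfoundry.optim import DecoupledLionW  # assumed export

model = torch.nn.Linear(8, 8)  # stand-in for the FSDP-wrapped MPT model

optimizer = DecoupledLionW(
    model.parameters(),
    lr=1e-4,                # illustrative values, not tuned for mpt-7b-storywriter
    betas=(0.9, 0.99),
    weight_decay=0.0,
)
```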

@sashaDoubov (Contributor):

To add to @dakinggg's answer, we reproduced this issue, and the workaround is now merged into composer. So saving with FSDP + torch 2.0 should work when using composer dev, i.e. pip install git+https://github.com/mosaicml/composer.git@dev, together with the latest llm-foundry (7754c71).

jquesnelle (Author) commented May 26, 2023

@sashaDoubov I installed composer directly from dev (as described above) and got the same result; although perhaps llm-foundry overwrote the composer requirement? I'll try with torch==1.13.1 now (these are slightly expensive experiments 😂)

Edit: Yeah, I think pip install -e . overwrote the installed composer@dev, since llm-foundry pins specific version requirements for composer in install_requires.

sashaDoubov (Contributor) commented May 26, 2023

I see... As you mentioned, pip install -e ".[gpu]" will overwrite the composer pip install. I was able to get it working by pip installing composer dev after installing the llm-foundry requirements. Hope torch==1.13.1 is working for you, though; please let us know if you have issues.
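
A quick way to confirm which Composer build actually ended up installed after the dependency resolution (the version strings in the comments are examples only, not the exact versions involved in this issue):

```python
# Check which Composer install survived `pip install -e ".[gpu]"`.
import composer

print(composer.__version__)
# A released wheel prints something like "0.14.1"; an install from
# git+https://github.com/mosaicml/composer.git@dev typically carries a dev-style
# version string, which is what you want while the fix is unreleased.
```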

@abhi-mosaic (Member):

Closing this as it seems like we have a workaround, but @jquesnelle please feel free to reopen if you run into further issues.

@jquesnelle (Author):

> Closing this as it seems like we have a workaround, but @jquesnelle please feel free to reopen if you run into further issues.

Meant to circle back and confirm that this does work!

@germanjke:

Was this fixed in the latest composer version?

@germanjke:

I'm using the mosaicml/llm-foundry:2.0.1_cu118-latest Docker image and am hitting this bug.

@sashaDoubov (Contributor):

Hi @germanjke, the llm-foundry Docker images currently install dependencies from setup.py, which points to our latest composer 0.14.1 release. That release does not contain the fix for this bug, but our dev branch does, so we recommend running pip uninstall composer and then pip install git+https://github.com/mosaicml/composer.git@dev before running composer train/train.py. This will be fixed in a future release, but that is the workaround for now (torch 2 support is still a work in progress).
Please let us know if that workaround works for you!

bmosaicml pushed a commit that referenced this issue Jun 6, 2023
* enabling logit scaling

* abhi review cmt

* adding warning

* add tests to exercise code path

* enable str logit_scale options (samhavens review)

@germanjke:

Hi @sashaDoubov @abhi-mosaic @vchiley, has this issue been fixed so that the composer version required in llm-foundry/setup.py is now OK, or should I still use the composer dev version?
