Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to implement Fairseq-MoE training checkpoint like Swin-MoE? #219

Open
withinmiaov opened this issue Nov 10, 2023 · 1 comment
Open

Comments

@withinmiaov
Copy link

First, I want to thank the tutel team for open-sourcing this work, it's a very good and practical framework.
I want to use tutel's moe in fairseq nlp tasks, but I encountered a problem, the original checkpoint setting of fairseq can't save and load Experts parameters distributed on different GPUs. How should I modify the fairseq model to support checkpoints like Swin-moe?

@ghostplant
Copy link
Contributor

ghostplant commented Nov 12, 2023

Hi, you may need to rename the save_dir to make per-device process save to a unique destination:

https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/dataclass/configs.py#L645

You can change the default save_dir path to: f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}" or
f"checkpoints-dev{os.environ.get('RANK', 0)}"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants