How to implement Fairseq-MoE training checkpoint like Swin-MoE? #219

withinmiaov · 2023-11-10T07:54:50Z

First, I want to thank the Tutel team for open-sourcing this work. It's a very good and practical framework.
I want to use Tutel's MoE in fairseq NLP tasks, but I ran into a problem: fairseq's original checkpoint setup can't save and load the expert parameters that are distributed across different GPUs. How should I modify the fairseq model to support checkpoints like Swin-MoE?

ghostplant · 2023-11-12T05:56:00Z

Hi, you may need to change save_dir so that each per-device process saves to its own unique destination:

https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/dataclass/configs.py#L645

You can change the default save_dir path to f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}" or f"checkpoints-dev{os.environ.get('RANK', 0)}".
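As a rough illustration of the suggestion above, here is a minimal sketch of building a per-process checkpoint directory name from the launcher's rank variable. The helper name `per_rank_save_dir` and its parameters are hypothetical and not part of fairseq or Tutel. It only computes the string, so you would still need to set it as the default in the linked config, or pass it to fairseq as the save directory.

```python
import os


def per_rank_save_dir(base="checkpoints", rank_var="RANK", env=None):
    """Return a checkpoint directory name unique to this process.

    Hypothetical helper (not a fairseq or Tutel API). `rank_var` is the
    environment variable that carries the rank, e.g. "RANK" or "LOCAL_RANK"
    as set by the distributed launcher. Falls back to 0 when it is unset.
    """
    env = os.environ if env is None else env
    rank = env.get(rank_var, 0)
    return f"{base}-dev{rank}"


if __name__ == "__main__":
    # With RANK=3 in the environment this prints "checkpoints-dev3".
    print(per_rank_save_dir())
```

One thing to keep in mind when choosing between the two variables: LOCAL_RANK is only unique within a node, while RANK is unique across the whole job. If several nodes write to a shared filesystem, RANK is probably the safer choice, since LOCAL_RANK would make processes on different nodes write to the same directory.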
