
v1.1.0

Performance

  • FasterMoE's smart scheduling now uses correct stream management and runs faster.

Testing

  • All unit tests have been reviewed and now pass.

Adaptation

  • Megatron-LM 3.2 supported.

Documentation

v1.0.1

Compatibility

  • PyTorch 2.0 supported.
  • Megatron-LM 2.5 supported.

Documentation

Performance related

  • Generalize FasterMoE's schedule to n_expert > 1, plus more bug fixes.
  • Synchronization reduction, thanks to @Fragile-azalea.

v1.0.0

FasterMoE

  • The new performance-boosting features from the PPoPP'22 paper FasterMoE, detailed in the documentation:
    • Expert Shadowing.
    • Smart Scheduling.
    • Topology-aware gate.

Bug fixes

  • Transformer-XL examples.
  • Compatibility with different PyTorch versions.
  • Megatron-LM documentation.
  • GShardGate.

v0.3.0

FMoE core

  • The previous mp_group is renamed to slice_group, indicating that all workers in the group receive the same input batch and each processes a slice of it. mp_group will be deprecated in our next release. A usage sketch follows this list.
  • ROCm supported.
  • FMoELinear is moved to a stand-alone file.
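
Below is a minimal usage sketch of the renamed argument. It assumes a multi-GPU launch (e.g. via torchrun) and that FMoETransformerMLP forwards slice_group through to the underlying FMoE layer; the process-group construction here is purely illustrative.

```python
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP

# Illustrative setup: assumes a distributed launch (e.g. torchrun).
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# All workers in `slice_group` receive the same batch and each processes
# a slice of it; `slice_group` replaces the deprecated `mp_group` name.
slice_group = dist.new_group(ranks=list(range(dist.get_world_size())))

moe = FMoETransformerMLP(
    num_expert=4,              # experts hosted on each worker
    d_model=1024,
    d_hidden=4096,
    world_size=dist.get_world_size(),
    slice_group=slice_group,
    top_k=2,
).cuda()

y = moe(torch.randn(8, 1024, device="cuda"))
```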

Grouped data parallel

  • Support arbitrary group names, referenced by their relative tag names.

Load balancing

  • A brand-new balancing strategy, SWIPE, contributed by the authors of a (currently unpublished) paper.
  • A has_loss property is added to each gate to indicate whether its balance loss should be collected; a sketch follows this list.
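
A sketch of how a training loop might consume the new property. It assumes each MoE layer exposes its gate as a gate attribute and that the gate returns its accumulated balance loss from a get_loss() method; both names are assumptions to be checked against the actual gate implementation.

```python
def collect_balance_loss(model):
    """Sum the balance losses of all gates that declare has_loss (sketch)."""
    total = 0.0
    for module in model.modules():
        gate = getattr(module, "gate", None)          # assumed attribute name
        if gate is not None and getattr(gate, "has_loss", False):
            total = total + gate.get_loss()           # assumed accessor
    return total

# Illustrative use inside a training step:
# loss = task_loss + balance_weight * collect_balance_loss(model)
```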

Megatron-LM support

  • Experts are partitioned by tensor model parallelism in mp_group, instead of expert parallelism.
  • Support arbitrary customized gates in MegatronMLP.
  • Move the patches to a stand-alone file.

Tests

  • Move util functions into test_ddp.py.

v0.2.1

Load balancing

  • Fix gradient for balance loss.

Misc

  • Fix typos.
  • Update benchmark interface.
  • Remove some redundant code for performance improvement.
  • Enable USE_NCCL by default.
  • Compatibility with PyTorch <1.8.0 and >=1.8.0.

Megatron adaptation

  • Patch for numerical correctness of gradient clipping.
  • Support for pipeline parallelism.

v0.2.0

Load balancing

  • A brand-new gate module with capacity-related utilities.
  • GShard's and Switch Transformer's balance strategies are implemented as integrated gates; a sketch follows this list.
  • Balance loss is enabled.
  • Balance monitor is provided.
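
A sketch of selecting one of the integrated gates, assuming the gate class is passed to the layer through a gate argument and instantiated internally; constructor details may differ between versions.

```python
from fmoe import FMoETransformerMLP
from fmoe.gates import GShardGate  # SwitchGate is the other integrated option

# Assumption: the gate class (not an instance) is passed via `gate` and the
# layer constructs it with its capacity-related settings internally.
moe = FMoETransformerMLP(
    num_expert=4,
    d_model=1024,
    d_hidden=4096,
    world_size=1,
    gate=GShardGate,
)
```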

Checkpointing

  • MoE models can be loaded and saved by fmoe's checkpointing module.

Performance

  • FP16 training performance is improved.

Misc

  • The CUDA code directory is restructured.
  • More tests are added.

v0.1.2

Compilation

  • Remove dependency on the CUDA examples repository.

Distributed

  • Fix a bug related to PyTorch v1.8.0. FastMoE can now operate on multiple GPUs on multiple nodes with PyTorch v1.8.0.

Misc

  • Fix tons of typos.
  • Format the code.

v0.1.1

Distributed

  • Broadcast data-parallel parameters before training.

Megatron adaptation

  • Initialize FMoELinear parameters with a different seed on each model-parallel rank, even when Megatron uses the same random seed.
  • Use the proper communicators for model parallelism and data parallelism.

Transformer-XL example

  • Improve scripts.

Misc

  • Logo and Slack workspace link.
  • Documentation in Chinese.
  • Figures to explain how FastMoE works.

v0.1.0

Functions

  • An easy-to-use, model-injection-style interface for Megatron-LM.
  • Support data parallelism, model parallelism, and a hybrid of the two.
  • Provide a new customized DDP module that synchronizes parameters within their respective communication groups.
  • Support customized nn.Module classes as experts; a sketch follows this list.
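
A sketch combining two of these features: a user-defined nn.Module used as the expert, and the customized DDP wrapper. It assumes the expert argument accepts a constructor callable and that the wrapper is importable as fmoe.DistributedGroupedDataParallel; the extra count argument in the expert's forward is accepted defensively.

```python
import torch
import torch.nn as nn
from fmoe import FMoE, DistributedGroupedDataParallel

class MyExpert(nn.Module):
    """Any nn.Module mapping d_model -> d_model can serve as an expert."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, expert_count=None):  # count arg accepted defensively
        return self.net(x)

# Assumption: `expert` takes a callable that builds one expert per local slot.
moe = FMoE(num_expert=4, d_model=512, expert=MyExpert).cuda()
out = moe(torch.randn(16, 512, device="cuda"))

# In a distributed run, the customized DDP module keeps parameters in sync
# within their respective communication groups (sketch, not executed here):
# model = DistributedGroupedDataParallel(moe)
```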

Document and infrastructure

  • Use PyTest.
  • Set up PyLint.
  • Installation and usage guide.
  • In-code explanations of functions and the code structure.

Performance

  • A benchmark comparing FastMoE with the old PyTorch implementation.