Decent-DP is a cutting-edge PyTorch extension designed to simplify and accelerate decentralized data-parallel training. As the official implementation of the ICLR'25 paper *From Promise to Practice: Realizing High-Performance Decentralized Training*, Decent-DP lets you scale multi-worker training efficiently, eliminating centralized bottlenecks and streamlining your deep learning pipelines.
- Decentralized Architecture: Efficiently distributes training across multiple workers without relying on a central coordinator.
- Seamless PyTorch Integration: Plugs into your existing PyTorch codebase with minimal modifications.
- High Performance: Optimized for speed and scalability based on state-of-the-art research.
- Flexible and Extensible: Supports various algorithmic schemas to suit different training scenarios and model architectures.
- Python 3.11+
- PyTorch
Install directly from PyPI:

```shell
pip install decent-dp
```

Or clone the repository and install in editable mode:

```shell
git clone https://github.com/WangZesen/Decent-DP.git
cd Decent-DP
pip install -e .
```

Here is a pseudocode example of how to use Decent-DP to train a model:
```python
import torch
import torch.distributed as dist
from decent_dp.ddp import DecentralizedDataParallel as DecentDP

# Initialize the process group
dist.init_process_group(backend='nccl' if torch.cuda.is_available() else 'gloo',
                        init_method='env://')

# Initialize the model (move it to the device before wrapping it with DecentDP)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ...
model = model.to(device)

# Wrap the model with DecentDP
model = DecentDP(
    model,
    # Optimizer constructor function: takes a List[Tuple[str, Tensor]] of named
    # parameters and returns an optimizer. Examples can be found in the
    # `decent_dp.optim` module.
    optim_fn=<optimizer constructor function>,
    # LR scheduler constructor function: takes an optimizer and returns an LR
    # scheduler, or None if no LR scheduler is used. Examples can be found in
    # the `decent_dp.optim` module.
    lr_scheduler_fn=<lr scheduler constructor function>,
    # Communication topology, given as a string. Supported topologies are
    # 'ring', 'exp', 'complete', and 'alternating-exp-ring'; see the section
    # `Communication topology` for details.
    topology=<topology>,
)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in data_loader:
        loss = model(batch)
        model.zero_grad()
        loss.backward()
        # No need to call optimizer.step(); the update is handled by DecentDP

    model.eval()
    for batch in val_data_loader:
        with torch.no_grad():
            loss = model(batch)
```

Launch the script on multiple processes/nodes using `torchrun`.
The code for the experiments conducted in the paper is available at 🔍 WangZesen/Decentralized-Training-Exp.
Comprehensive documentation, including tutorials, API references, and performance tips, is available on the GitHub page: Decent-DP Documentation.
If you use Decent-DP in your research, please cite our work:
```bibtex
@article{wang2024decentralized,
  title={From Promise to Practice: Realizing High-Performance Decentralized Training},
  author={Wang, Zesen and Zhang, Jiaojiao and Wu, Xuyang and Johansson, Mikael},
  journal={arXiv preprint arXiv:2410.11998},
  year={2024}
}
```

We welcome contributions from the community!
To get involved:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a clear description of your changes.
- For any issues or feature requests, please open an issue on GitHub.
Decent-DP is released under the MIT License.
The computations and storage resources were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
🚀 Happy training with Decent-DP!