Generative Language Model Pretrained on Inspur's Yuan Dataset, codebase for ASC22 supercomputing competition

Project Structure

To simplify experiments across different distributed training frameworks, we decoupled the training code into four modules: config, data, model and trainer.

The decoupling is inspired by pytorch-lightning, but we take it a step further so that the code is easier to integrate with other frameworks.

config Module

We put all hyperparameters and configurations into the config module so that they are easier to trace and log.
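As a rough illustration of keeping every hyperparameter in one traceable place, the sketch below uses a frozen dataclass that can be dumped to JSON for logging. The class name `TrainConfig` and all field names are hypothetical and are not taken from this repository.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical hyperparameter container; field names are illustrative only."""

    model_name: str = "native"
    hidden_size: int = 1024
    num_layers: int = 12
    learning_rate: float = 1e-4
    batch_size: int = 8

    def to_json(self) -> str:
        # Serialize every field so a run can be logged and reproduced later.
        return json.dumps(asdict(self), sort_keys=True)


# Override only what differs from the defaults, then log the full config.
config = TrainConfig(learning_rate=3e-4)
print(config.to_json())
```

Freezing the dataclass means the logged values cannot drift from the values actually used during the run.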

data Module

We use pytorch_lightning.LightningDataModule directly, since its interface is well designed and easy to use.

model Module

Most distributed training frameworks need to wrap the model before or after it is initialized, and pytorch_lightning.LightningModule has already shown problems when several frameworks are integrated at the same time, so we decoupled this module further into a BaseModel class.

BaseModel inherits directly from nn.Module, which is compatible with most distributed training frameworks. Every language model implementation derives from BaseModel and maintains only the model config, the model structure, the forward method, the loss function and the optimizer.

Currently, implemented models include:

  • native model: written in native PyTorch
  • huggingface model: written with HuggingFace's transformers
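The BaseModel contract described above (a plain nn.Module that owns its config, structure, forward method, loss function and optimizer) could look roughly like the sketch below. This is not the repository's actual class: the method names `loss_fn` and `configure_optimizer`, the dict-based config and the toy `TinyLM` model are assumptions made for illustration.

```python
import torch
from torch import nn


class BaseModel(nn.Module):
    """Sketch only: a plain nn.Module plus loss and optimizer hooks (names assumed)."""

    def __init__(self, config: dict):
        super().__init__()
        self.config = config

    def loss_fn(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Token-level cross-entropy over the flattened (batch * sequence) positions.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

    def configure_optimizer(self) -> torch.optim.Optimizer:
        return torch.optim.AdamW(self.parameters(), lr=self.config["lr"])


class TinyLM(BaseModel):
    """Toy stand-in for a real language model implementation."""

    def __init__(self, config: dict):
        super().__init__(config)
        self.embed = nn.Embedding(config["vocab_size"], config["hidden_size"])
        self.head = nn.Linear(config["hidden_size"], config["vocab_size"])

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(input_ids))


config = {"vocab_size": 100, "hidden_size": 16, "lr": 1e-3}
model = TinyLM(config)
input_ids = torch.randint(0, config["vocab_size"], (2, 8))
logits = model(input_ids)
loss = model.loss_fn(logits, input_ids)
optimizer = model.configure_optimizer()
```

Because the class is an ordinary nn.Module, a trainer is free to wrap it with whichever framework-specific wrapper it needs before or after construction.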

trainer Module

Everything else, including model initialization, training, validation and testing, lives in the trainer module. All training preparation and iteration happens here.

Currently, implemented trainers include:

  • PytorchLightning trainer: distributed training with pytorch-lightning, using the DeepSpeed integration provided by the Lightning team
  • PatrickStar trainer

Distributed Launch

Below are examples of how to launch the training job on different distributed frameworks.

DDP in PyTorch-Lightning

num_nodes must be set to the number of nodes taking part in training; otherwise only the GPUs on the master node are counted.

torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1
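For orientation, torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to every worker it starts. The sketch below reads them into a small structure; the helper names `DistEnv` and `read_dist_env` are hypothetical, and the environment dict is hand-written to imitate what one worker on the second node might see for the 2-node, 2-process launch above.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the variables torchrun exports, with single-process fallbacks."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


# Hand-written stand-in for a worker's environment (values are illustrative).
fake_env = {
    "RANK": "2",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "4",  # 2 nodes * 2 processes per node
    "MASTER_ADDR": "GPU04",
    "MASTER_PORT": "9001",
}
dist = read_dist_env(fake_env)
print(dist)
```

Printing this structure at startup on each worker is a quick way to confirm that the world size matches nnodes times nproc_per_node.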

DeepSpeed in PyTorch-Lightning

OMP_NUM_THREADS=32 torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1

Note that OMP_NUM_THREADS must be set when offload is used, because the optimizer then runs on the CPU.
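When OMP_NUM_THREADS is unset and several processes are launched per node, torchrun typically defaults it to 1 for each worker, which would leave a CPU-side optimizer with a single thread. A minimal sanity check along these lines can catch that early; the function name `omp_threads_ok` and the `offload` flag are assumptions for illustration, not part of this codebase.

```python
import os
from typing import Mapping


def omp_threads_ok(env: Mapping[str, str] = os.environ, offload: bool = True) -> bool:
    """Return False when offload is on but OMP_NUM_THREADS is missing or <= 1."""
    if not offload:
        # Without offload the optimizer runs on the GPU, so the setting matters less.
        return True
    raw = env.get("OMP_NUM_THREADS")
    if raw is None:
        return False
    try:
        return int(raw) > 1
    except ValueError:
        # A non-numeric value is treated as a misconfiguration.
        return False


if not omp_threads_ok():
    print("warning: set OMP_NUM_THREADS > 1 when the optimizer is offloaded to the CPU")
```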

Horovod in PyTorch-Lightning

horovodrun -np 2 python

We still prefer to use torchrun:

torchrun --nnodes=1 --nproc_per_node=2

Colossal AI

GLOO_SOCKET_IFNAME=ibs5 OMP_NUM_THREADS=32 torchrun --master_addr="" --master_port=29500 --nnodes=2 --node_rank=1 --nproc_per_node=2 --config=trainer/colossal_ai/

Run Profile

OMP_NUM_THREADS=32 nsys profile -o cpu_adam torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 0

OMP_NUM_THREADS=32 nsys profile --gpu-metrics-device=all --gpuctxsw=true --nic-metrics=true --cuda-memory-usage=true --cudabacktrace=all torchrun  --nnodes=2 --nproc_per_node=2 --config=trainer/colossal_ai/

Docker Environment

docker run -it --name pytorch --gpus all --privileged --cap-add=SYS_ADMIN --ipc=host --network=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/infiniband -v $(pwd):/workspace bash

See the Dockerfile for details.