We are refactoring our internal LLM training system components to meet open-source standards. The tentative timeline is as follows:
- by mid-April, 4D parallelism (tensor parallelism, sequence parallelism, data parallelism and ZeRO) examples for nanoGPT, Llama2 and Mixtral models
- by end of May, fast checkpointing system
- by end of July, CUDA event monitor, pipeline parallelism and supporting components for large-scale training
bash patches/build_pytorch_w_patch.sh
This will compile and install a patched version of PyTorch (based on v2.2.1_rc3). The patch code can be found here: PyTorch-Patch
bash patches/build_torchdistX_w_patch.sh
This will compile and install a patched version of TorchdistX (based on its master branch). The patch code can be found here: TorchDistX-Patch
pushd python && pip3 install -r requirements.txt && pip3 install -e . && popd
This will install veScale and its dependencies.
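The three steps above can be chained into one script that stops at the first failure. The following is a minimal sketch, not part of veScale: the `DRY_RUN` switch and the function names `run_step` and `install_vescale` are hypothetical names introduced for illustration. It assumes it is run from the veScale repository root. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: run the three install steps in order, stopping at the first failure.
set -euo pipefail

# DRY_RUN is a hypothetical switch for this sketch, not a veScale option.
# It defaults to 1 (print only); set DRY_RUN=0 to actually run the commands.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, then run it unless we are in dry-run mode.
run_step() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

install_vescale() {
  # Step 1: build and install the patched PyTorch.
  run_step bash patches/build_pytorch_w_patch.sh
  # Step 2: build and install the patched TorchdistX.
  run_step bash patches/build_torchdistX_w_patch.sh
  # Step 3: install veScale and its dependencies. Running in a child shell
  # keeps the directory change local, like pushd/popd in the one-liner.
  run_step bash -c 'cd python && pip3 install -r requirements.txt && pip3 install -e .'
}

install_vescale
```

Because of `set -e`, a failed PyTorch build prevents the later steps from running against a half-installed environment.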
Make sure you are in the veScale directory.
docker build .
It may take a while to build the image.
Once the build has finished, you can start a container with docker run and the image ID.
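As a minimal sketch of the build-then-run flow, the helpers below capture the image ID instead of copying it by hand. The function names `build_image` and `run_image` are hypothetical, and the `--rm -it` flags are an assumption about how you want to use the container (interactive, removed on exit), not something the project prescribes. `docker build -q` suppresses the build log and prints only the image ID.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build the image from the Dockerfile in the current directory.
# With -q, docker prints only the resulting image ID on success.
build_image() {
  docker build -q .
}

# Start an interactive container from the given image ID and
# remove the container when it exits.
run_image() {
  local image_id="${1:?image ID required}"
  docker run --rm -it "$image_id"
}

# Usage, from the veScale directory:
#   image_id="$(build_image)"
#   run_image "$image_id"
```

Since `-q` hides the build log, drop it and read the ID from the normal output if you need to debug a failing build.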
The veScale project is licensed under the Apache License v2.0.
The veScale team would like to sincerely acknowledge the assistance of and collaboration with the PyTorch DTensor team.