veScale: A PyTorch Native LLM Training Framework

Coming Soon

We are refactoring the components of our internal LLM training system to meet open-source standards. The tentative timeline is as follows:

By mid-April: 4D parallelism (tensor parallelism, sequence parallelism, data parallelism and ZeRO) examples for nanoGPT, Llama2 and Mixtral models
By end of May: fast checkpointing system
By end of July: CUDA event monitor, pipeline parallelism and supporting components for large-scale training

bash patches/build_pytorch_w_patch.sh

This will compile and install a patched version of PyTorch (based on v2.2.1_rc3). The patch code can be found here: PyTorch-Patch

bash patches/build_torchdistX_w_patch.sh

This will compile and install a patched version of torchdistX (based on its master branch). The patch code can be found here: TorchDistX-Patch

pushd python && pip3 install -r requirements.txt && pip3 install -e . && popd

This will install veScale and its dependencies.
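After the three install steps above, a quick sanity check can confirm that the packages are visible to the current Python environment. The following is a minimal sketch, not part of veScale itself: the helper name check_install is hypothetical, and the import names torch, torchdistx and vescale are assumptions about how the installed packages are exposed.

```python
import importlib.util


def check_install(modules=("torch", "torchdistx", "vescale")):
    """Report which top-level modules can be located, without importing them.

    The default module names are assumptions about the import names of the
    patched PyTorch, the patched torchdistX, and veScale.
    """
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only checks that each module can be located. It does not verify that the patched builds, rather than stock releases, are the ones installed.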

Make sure you are in the veScale directory.

docker build .

It may take a while to build the image.

Once the build has finished, you can start a container with docker run, passing the ID of the image that was built.

The veScale project is licensed under the Apache License v2.0.

The veScale team would like to sincerely acknowledge the assistance of and collaboration with the PyTorch DTensor team.
