veScale: A PyTorch Native LLM Training Framework

Coming Soon

We are refactoring the components of our internal LLM training system to meet open-source standards. The tentative timeline is as follows:

  1. By mid-April: 4D parallelism (tensor parallelism, sequence parallelism, data parallelism, and ZeRO) examples for the nanoGPT, Llama2, and Mixtral models
  2. By the end of May: a fast checkpointing system
  3. By the end of July: a CUDA event monitor, pipeline parallelism, and supporting components for large-scale training

Installation

From Source

Install a Patched Version of PyTorch (optional)

bash patches/build_pytorch_w_patch.sh

This will compile and install a patched version of PyTorch (based on v2.2.1_rc3). The patch code can be found here: PyTorch-Patch

Install a Patched Version of TorchDistX

bash patches/build_torchdistX_w_patch.sh

This will compile and install a patched version of TorchDistX (based on its master branch). The patch code can be found here: TorchDistX-Patch

Install veScale

pushd python && pip3 install -r requirements.txt && pip3 install -e . && popd

This will install veScale and its dependencies.
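After the install steps above, a quick sanity check can confirm that the packages are visible to the Python interpreter. The following is a minimal sketch, not part of veScale itself: the helper name find_missing is hypothetical, and the import names torchdistx and vescale are assumptions about what the install steps expose (torch is PyTorch's standard import name). It only locates the modules and does not import or run them.

```python
import importlib.util

# Assumed import names: "torch" is PyTorch's standard import name;
# "torchdistx" and "vescale" are assumptions about what the install
# steps above make available. Adjust them if your environment differs.
ASSUMED_MODULES = ("torch", "torchdistx", "vescale")


def find_missing(modules=ASSUMED_MODULES):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package is absent or a loaded module
            # has no spec; treat both cases as "not found".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All assumed modules are importable.")
```

Run it with the same interpreter that pip3 installed into. A module reported as not importable usually points to a skipped build step or a different virtual environment.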

Docker Image

Build the Docker Image

Make sure you are in the veScale directory.

docker build .

It may take a while to build the image.

Once the build finishes, you can start a container by passing the resulting image ID to docker run.

The veScale project is licensed under the Apache License v2.0.

Acknowledgement

The veScale team would like to sincerely acknowledge the assistance of and collaboration with the PyTorch DTensor team.
