forked from NVIDIA/Megatron-LM
Commit
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: chengming-zhang <chengming.zhang@anl.gov>
1 parent 41116e0 · commit a2a476e
Showing 17 changed files with 1,164 additions and 63 deletions.
@@ -0,0 +1,36 @@
# Sequence Parallelism

This folder contains examples that demonstrate how to use DeepSpeed's sequence parallelism.

## Setting Up the Environment for FlashAttention

DeepSpeed's sequence parallelism can be combined with the following types of attention:

- Classic attention
- FlashAttention (enabled by `--use-flash-attn`)
- FlashAttention + Triton (enabled by `--use-flash-attn-triton`)

For the best performance, we recommend using FlashAttention + Triton. Here are the installation steps and the versions we have tested. Note that FlashAttention is compatible only with Turing, Ampere, Ada, or Hopper GPUs.

```shell
# Install Triton (legacy backend, used by the FlashAttention + Triton path)
git clone -b legacy-backend https://github.com/openai/triton
cd triton/python/
pip install cmake
pip install .

# Install FlashAttention (tested with v1.0.4)
cd ${WORK_DIR}
git clone -b v1.0.4 https://github.com/HazyResearch/flash-attention
cd flash-attention
python setup.py install
```
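
After installation, a quick sanity check is to confirm that both packages import in the Python environment you will launch training from. This is a minimal sketch, not part of the original instructions:

```shell
# Sanity check: both packages should import without errors.
python -c "import triton, flash_attn; print('Triton and FlashAttention import OK')"
```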

## Enabling Sequence Parallelism

To enable sequence parallelism, set the degree of parallelism using the `--ds-sequence-parallel-size` argument. Ensure that the number of attention heads is divisible by this value.
Also ensure that your model configuration complies with FlashAttention's requirements; for instance, the head size should be divisible by 8 for optimal performance. Refer to the [FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/v1.0.4) documentation for more details.
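
As an illustration, here is a hedged sketch of how these flags might fit together in a launch command. The model sizes and parallel degree below are example values, not the ones used in this folder's scripts; see the working scripts referenced below for the full configuration.

```shell
# Example only: 32 attention heads divide evenly across 4 sequence-parallel ranks,
# and the head size (2048 / 32 = 64) is divisible by 8, as FlashAttention prefers.
deepspeed pretrain_gpt.py \
    --ds-sequence-parallel-size 4 \
    --num-attention-heads 32 \
    --hidden-size 2048 \
    --use-flash-attn-triton \
    ...  # remaining Megatron-DeepSpeed arguments
```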

Some working examples that enable sequence parallelism ([GPT 1.3B](ds_pretrain_gpt_1.3B_seq_parallel_32k.sh), [GPT 30B](ds_pretrain_gpt_30B_seq_parallel_32k.sh)) are available in this folder.

Please note that our sequence parallelism feature is currently incompatible with Megatron-LM's tensor or pipeline parallelism.
examples_deepspeed/sequence_parallel/ds_config_gpt_TEMPLATE.json
24 changes: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
```json
{
  "train_batch_size": GBSIZE,
  "train_micro_batch_size_per_gpu": MBSIZE,
  "steps_per_print": LOG_INTERVAL,

  "zero_optimization": {
    "stage": ZERO_STAGE,
    "elastic_checkpoint": true
  },

  "gradient_clipping": 1.0,
  "prescale_gradients": PRESCALE_GRAD,

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },

  "wall_clock_breakdown": false
}
```
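
The uppercase tokens (GBSIZE, MBSIZE, LOG_INTERVAL, ZERO_STAGE, PRESCALE_GRAD) are placeholders that the launch scripts substitute before handing the config to DeepSpeed. A minimal sketch of that substitution, using example values rather than the ones chosen in the provided scripts:

```shell
# Example only: fill the template's placeholders with concrete values via sed.
config_json="ds_config_gpt_gbs256_mbs4.json"
sed -e "s/GBSIZE/256/" \
    -e "s/MBSIZE/4/" \
    -e "s/LOG_INTERVAL/10/" \
    -e "s/ZERO_STAGE/1/" \
    -e "s/PRESCALE_GRAD/true/" \
    ds_config_gpt_TEMPLATE.json > ${config_json}
```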