From 7ac7fdbac70bf864b008a9587c997cb2919ef10f Mon Sep 17 00:00:00 2001
From: HeyangQin
Date: Thu, 31 Aug 2023 17:12:16 +0000
Subject: [PATCH 1/3] add mixz tutorial

---
 docs/_tutorials/mixed_precision_zeropp.md | 55 +++++++++++++++++++++++
 1 file changed, 55 insertions(+)
 create mode 100644 docs/_tutorials/mixed_precision_zeropp.md

diff --git a/docs/_tutorials/mixed_precision_zeropp.md b/docs/_tutorials/mixed_precision_zeropp.md
new file mode 100644
index 000000000000..cd5f309911de
--- /dev/null
+++ b/docs/_tutorials/mixed_precision_zeropp.md
@@ -0,0 +1,55 @@
+---
+title: "Mixed Precision ZeRO++"
+tags: training ZeRO communication-efficiency large-model
+---
+
+Mixed Precision ZeRO++ (MixZ++) is optimization strategies based on [ZeRO](/tutorials/zero/) and [ZeRO++](/tutorials/zeropp/) to improve the efficiency and reduce memory usage for large model training and inference when users use [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) training. Read our [ZeRO++ blog](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/) and [paper](https://arxiv.org/pdf/2306.10209.pdf) to learn more!
+
+We recommend that you read the tutorials on [Getting Started](/getting-started/), [ZeRO](/tutorials/zero/) and [Megatron-DeepSpeed](/tutorials/megatron/) before stepping through this tutorial.
+
+## Key Designs
+Mixed Precision ZeRO++ (MixZ++) inherits key designs from [ZeRO++](/tutorials/zeropp/), namely quantized weights (*qwZ*) and hierarchical partitioning ZeRO (*hpZ*), but with different applicability:
+ - *qwZ* applies block-based quantization on frozen weights to reduce memory usage and all-gather communication volume. Compared with ZeRO++, *qwZ* in Mixed Precision ZeRO++ keeps the frozen weights quantized, so there is no quantization overhead during runtime and memory usage is reduced.
+ - *hpZ* eliminates inter-node parameter all-gather communication through data remapping and recomputation. Compared with ZeRO++, *hpZ* in Mixed Precision ZeRO++ applies to both backward and generation passes.
+
+Collectively, the optimizations bring better scalability and efficiency to LoRA training. Each component can be enabled independently of the others or combined as a group.
+
+## Enabling Mixed Precision ZeRO++
+
+A ready-to-go Mixed Precision ZeRO++ example has been prepared at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2/run_llama2_7b_mixz.sh. If you prefer to manually enable Mixed Precision ZeRO++ in your pipeline, please refer to the instructions below.
+
+### DeepSpeed Configuration Changes
+An example snippet of a DeepSpeed configuration with all Mixed Precision ZeRO++ optimizations enabled is shown below:
+```json
+{
+  "zero_optimization": {
+    "stage": 3,
+    "..."
+    "zero_quantized_nontrainable_weights": true,
+    "zero_hpz_partition_size": 16,
+    "..."
+  }
+}
+```
+Note that the `"zero_hpz_partition_size"` should be set to the number of GPUs per node. For example, if you have 8 GPUs per node, then `"zero_hpz_partition_size"` should be set to 8.
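+
+If you assemble the DeepSpeed configuration in Python, the sketch below shows one way to toggle the two MixZ++ components independently and to derive `"zero_hpz_partition_size"` from the per-node GPU count for a multi-node job. The `build_mixz_config` helper and the surrounding keys (batch size, precision) are illustrative assumptions for this tutorial, not part of the DeepSpeed API:
+```python
+import torch
+
+def build_mixz_config(gpus_per_node=None, enable_qwz=True, enable_hpz=True):
+    # ZeRO stage 3 is required for the MixZ++ optimizations.
+    zero_config = {"stage": 3}
+    if enable_qwz:
+        # qwZ: keep the frozen (non-trainable) weights quantized.
+        zero_config["zero_quantized_nontrainable_weights"] = True
+    if enable_hpz:
+        # hpZ: set the secondary partition size to the number of GPUs per node.
+        # Defaulting to the locally visible GPU count assumes homogeneous nodes.
+        zero_config["zero_hpz_partition_size"] = (
+            gpus_per_node if gpus_per_node is not None else torch.cuda.device_count()
+        )
+    return {
+        "train_batch_size": 32,          # illustrative value
+        "bf16": {"enabled": True},       # illustrative precision setting
+        "zero_optimization": zero_config,
+    }
+
+ds_config = build_mixz_config(gpus_per_node=8)
+```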
+
+### Training Script Changes
+The DeepSpeed engine will identify the frozen LoRA parameters if the LoRA model is passed in when DeepSpeed initializes. However, a popular pattern is to initialize a base model first and convert it to a LoRA model later. In such cases, users need to explicitly call the DeepSpeed engine again after the LoRA model is converted. This is only a one-line effort. An example snippet of a training script is shown below:
+
+```python
+model, optimizer, _, lr_scheduler = deepspeed.initialize(
+    model=model,
+    optimizer=optimizer,
+    args=args,
+    config=ds_config,
+    lr_scheduler=lr_scheduler,
+    dist_init_required=True)
+# ...
+# (the custom code to convert base model to LoRA model)
+# ...
+# call DeepSpeed engine again to identify LoRA frozen parameters
+model.optimizer.quantize_nontrainable_params()
+# ...
+```
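+
+Alternatively, if the LoRA conversion happens before initialization, the engine can pick up the frozen parameters on its own and the explicit call above is not required. The sketch below illustrates that path with Hugging Face `peft` as one possible conversion tool; the model name, LoRA hyperparameters, and config path are placeholders rather than recommendations:
+```python
+import deepspeed
+import torch
+from peft import LoraConfig, get_peft_model
+from transformers import AutoModelForCausalLM
+
+# Placeholder base model; any causal LM follows the same pattern.
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
+                                             torch_dtype=torch.float16)
+
+# Convert to a LoRA model first, which freezes the base weights.
+lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
+                         target_modules=["q_proj", "v_proj"],
+                         task_type="CAUSAL_LM")
+model = get_peft_model(model, lora_config)
+
+# Placeholder path to the MixZ++-enabled DeepSpeed config shown above.
+ds_config = "ds_config.json"
+
+# The frozen parameters already exist at this point, so a single
+# deepspeed.initialize call with the MixZ++ config is sufficient.
+model, optimizer, _, _ = deepspeed.initialize(
+    model=model,
+    model_parameters=[p for p in model.parameters() if p.requires_grad],
+    config=ds_config)
+```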
+
+Congratulations! You have completed the Mixed Precision ZeRO++ tutorial.

From 56999f0c493c4cc85890b7e1ccc7460068dd38b8 Mon Sep 17 00:00:00 2001
From: HeyangQin
Date: Thu, 31 Aug 2023 17:25:32 +0000
Subject: [PATCH 2/3] update tutorial

---
 docs/_tutorials/mixed_precision_zeropp.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/_tutorials/mixed_precision_zeropp.md b/docs/_tutorials/mixed_precision_zeropp.md
index cd5f309911de..82cb399dd01d 100644
--- a/docs/_tutorials/mixed_precision_zeropp.md
+++ b/docs/_tutorials/mixed_precision_zeropp.md
@@ -3,7 +3,7 @@ title: "Mixed Precision ZeRO++"
 tags: training ZeRO communication-efficiency large-model
 ---
 
-Mixed Precision ZeRO++ (MixZ++) is optimization strategies based on [ZeRO](/tutorials/zero/) and [ZeRO++](/tutorials/zeropp/) to improve the efficiency and reduce memory usage for large model training and inference when users use [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) training. Read our [ZeRO++ blog](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/) and [paper](https://arxiv.org/pdf/2306.10209.pdf) to learn more!
+Mixed Precision ZeRO++ (MixZ++) is optimization strategies based on [ZeRO](/tutorials/zero/) and [ZeRO++](/tutorials/zeropp/) to improve the efficiency and reduce memory usage for large model training and inference when users use [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) training. MixZ++ partitions model parameters across GPUs to reduce the memory footprint and gathers them with quantized communication only when needed, similar to its ZeRO and ZeRO++ siblings. Our evaluation indicates MixZ++ increases the training throughput by up to 3.2x for the Llama-2-70B model running on 128 V100 GPUs. Read our [ZeRO++ blog](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/) and [paper](https://arxiv.org/pdf/2306.10209.pdf) to learn more!
 
 We recommend that you read the tutorials on [Getting Started](/getting-started/), [ZeRO](/tutorials/zero/) and [Megatron-DeepSpeed](/tutorials/megatron/) before stepping through this tutorial.
 
@@ -14,12 +14,12 @@ Mixed Precision ZeRO++ (MixZ++) inherits key designs from [ZeRO++](/tutorials/ze
  - *hpZ* eliminates inter-node parameter all-gather communication through data remapping and recomputation. Compared with ZeRO++, *hpZ* in Mixed Precision ZeRO++ applies to both backward and generation passes.
 
 Collectively, the optimizations bring better scalability and efficiency to LoRA training. Each component can be enabled independently of the others or combined as a group.
 
-## Enabling Mixed Precision ZeRO++
+## Enabling Mixed Precision ZeRO++ (MixZ++)
 
-A ready-to-go Mixed Precision ZeRO++ example has been prepared at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2/run_llama2_7b_mixz.sh. If you prefer to manually enable Mixed Precision ZeRO++ in your pipeline, please refer to the instructions below.
+A ready-to-go MixZ++ example has been prepared at [MixZ++ example script](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2/run_llama2_7b_mixz.sh). If you prefer to manually enable MixZ++ in your pipeline, please refer to the instructions below.
 
 ### DeepSpeed Configuration Changes
-An example snippet of a DeepSpeed configuration with all Mixed Precision ZeRO++ optimizations enabled is shown below:
+An example snippet of a DeepSpeed configuration with all MixZ++ optimizations enabled is shown below:
 ```json
 {
   "zero_optimization": {
     "stage": 3,
@@ -31,7 +31,7 @@ An example snippet of deepspeed configurations with all Mixed Precision ZeRO++ o
   }
 }
 ```
-Note that the `"zero_hpz_partition_size"` should be set to the number of GPUs per node. For example, if you have 8 GPUs per node, then `"zero_hpz_partition_size"` should be set to 8.
+Note that for multi-node training, the `"zero_hpz_partition_size"` should be set to the number of GPUs per node. For example, if you have 8 GPUs per node, then `"zero_hpz_partition_size"` should be set to 8. For single-node training, the `"zero_hpz_partition_size"` should not be set.
 
 ### Training Script Changes
 The DeepSpeed engine will identify the frozen LoRA parameters if the LoRA model is passed in when DeepSpeed initializes. However, a popular pattern is to initialize a base model first and convert it to a LoRA model later. In such cases, users need to explicitly call the DeepSpeed engine again after the LoRA model is converted. This is only a one-line effort. An example snippet of a training script is shown below:

From 9d0a0218a9636f4114c61a07d2da3bc9cd0ce2ea Mon Sep 17 00:00:00 2001
From: HeyangQin
Date: Thu, 31 Aug 2023 17:33:28 +0000
Subject: [PATCH 3/3] update

---
 docs/_tutorials/mixed_precision_zeropp.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_tutorials/mixed_precision_zeropp.md b/docs/_tutorials/mixed_precision_zeropp.md
index 82cb399dd01d..12ad3556abde 100644
--- a/docs/_tutorials/mixed_precision_zeropp.md
+++ b/docs/_tutorials/mixed_precision_zeropp.md
@@ -3,7 +3,7 @@ title: "Mixed Precision ZeRO++"
 tags: training ZeRO communication-efficiency large-model
 ---
 
-Mixed Precision ZeRO++ (MixZ++) is optimization strategies based on [ZeRO](/tutorials/zero/) and [ZeRO++](/tutorials/zeropp/) to improve the efficiency and reduce memory usage for large model training and inference when users use [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) training. MixZ++ partitions model parameters across GPUs to reduce the memory footprint and gathers them with quantized communication only when needed, similar to its ZeRO and ZeRO++ siblings. Our evaluation indicates MixZ++ increases the training throughput by up to 3.2x for the Llama-2-70B model running on 128 V100 GPUs. Read our [ZeRO++ blog](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/) and [paper](https://arxiv.org/pdf/2306.10209.pdf) to learn more!
+Mixed Precision ZeRO++ (MixZ++) is a set of optimization strategies based on [ZeRO](/tutorials/zero/) and [ZeRO++](/tutorials/zeropp/) to improve the efficiency and reduce memory usage for large model training and inference when users use [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) training. MixZ++ partitions model parameters across GPUs to reduce the memory footprint and gathers them with quantized communication only when needed, similar to its ZeRO and ZeRO++ siblings. Our evaluation indicates MixZ++ increases the training throughput by up to [3.3x](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat/ds-chat-release-8-31) for the Llama-2-70B model running on 128 V100 GPUs. Read our [DeepSpeed Chat Blog](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat/ds-chat-release-8-31), [ZeRO++ blog](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/) and [paper](https://arxiv.org/pdf/2306.10209.pdf) to learn more!
 
 We recommend that you read the tutorials on [Getting Started](/getting-started/), [ZeRO](/tutorials/zero/) and [Megatron-DeepSpeed](/tutorials/megatron/) before stepping through this tutorial.