Commit

add npu optim docs
luomaoling authored and LRJKD committed Apr 21, 2023
1 parent 76bbfbc commit c779b2c
Showing 2 changed files with 36 additions and 38 deletions.
37 changes: 18 additions & 19 deletions docs/en/common_usage/speed_up_training.md
@@ -18,12 +18,6 @@ MMEngine supports training models with CPU, single GPU, multiple GPUs in single
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python examples/train.py --cfg-options optimizer.type=NpuFusedSGD
```

- multiple machines

Assume that there are 2 machines connected with ethernet, you can simply run following commands.
@@ -64,19 +58,6 @@ MMEngine supports training models with CPU, single GPU, multiple GPUs in single
python examples/train.py --launcher="slurm"
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python -m torch.distributed.launch \
--nnodes 8 \
--node_rank 0 \
--master_addr 127.0.0.1 \
--master_port 29500 \
--nproc_per_node=8 \
examples/train.py --launcher pytorch \
--cfg-options optimizer.type=NpuFusedSGD
```

## Mixed Precision Training

Nvidia introduced the Tensor Core unit into the Volta and Turing architectures to support FP32 and FP16 mixed precision computing. They further support BF16 in Ampere architectures. With automatic mixed precision training enabled, some operators operate at FP16/BF16 and the rest operate at FP32, which reduces training time and storage requirements without changing the model or degrading its training precision, thus supporting training with larger batch sizes, larger models, and larger input sizes.
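A minimal sketch of how this can be switched on in MMEngine follows: mixed precision is enabled by using `AmpOptimWrapper` as the optimizer wrapper. The toy model and random dataset below are placeholders for illustration only, and running the snippet requires a GPU or another device supported by automatic mixed precision.

```python
import torch
import torch.nn as nn
from mmengine.model import BaseModel
from mmengine.runner import Runner
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(BaseModel):
    """Hypothetical model used only to illustrate enabling AMP."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, inputs, data_samples=None, mode='loss'):
        loss = ((self.linear(inputs) - data_samples) ** 2).mean()
        return dict(loss=loss)


def collate(batch):
    # Pack samples into the dict format expected by ToyModel.forward.
    return dict(inputs=torch.stack([x for x, _ in batch]),
                data_samples=torch.stack([y for _, y in batch]))


train_dataloader = DataLoader(
    TensorDataset(torch.rand(64, 2), torch.rand(64, 1)),
    batch_size=8, collate_fn=collate)

runner = Runner(
    model=ToyModel(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    # Switching the wrapper type to AmpOptimWrapper enables automatic
    # mixed precision; the wrapped optimizer config stays unchanged.
    optim_wrapper=dict(
        type='AmpOptimWrapper',
        optimizer=dict(type='SGD', lr=0.01, momentum=0.9)),
)
runner.train()
```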
@@ -132,3 +113,21 @@ This feature is only available for PyTorch >= 2.0.0.
```{warning}
`torch.compile` is still under development by PyTorch team. Some models may fail compilation. If you encounter errors during compilation, you can refer to [PyTorch Dynamo FAQ](https://pytorch.org/docs/2.0/dynamo/faq.html) for quick fix, or [TorchDynamo Troubleshooting](https://pytorch.org/docs/2.0/dynamo/troubleshooting.html) to post an issue in PyTorch.
```

## Faster Optimizers

If you train on Ascend devices, you can use Ascend's fused optimizers to reduce the training time of the model. The optimizers supported on Ascend devices are as follows:

```
NpuFusedAdadelta
NpuFusedAdam
NpuFusedAdamP
NpuFusedAdamW
NpuFusedBertAdam
NpuFusedLamb
NpuFusedRMSprop
NpuFusedRMSpropTF
NpuFusedSGD
```

These optimizers are used in the same way as the native optimizers. For more details on configuring optimizers in MMEngine, refer to [optimizers](https://mmengine.readthedocs.io/zh_CN/latest/tutorials/optim_wrapper.html?highlight=%E4%BC%98%E5%8C%96%E5%99%A8).
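As a minimal sketch (assuming an Ascend device with `torch_npu` installed, so that the `NpuFused*` optimizers are available to MMEngine's optimizer registry; the learning rate and momentum are illustrative values), switching to a fused optimizer only requires changing the optimizer type in the optim wrapper config:

```python
# Illustrative optim_wrapper config; requires an Ascend device with torch_npu.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='NpuFusedSGD', lr=0.01, momentum=0.9))
```

The rest of the training setup is unchanged: this dict is passed to `Runner(optim_wrapper=...)` just like a native optimizer config.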
37 changes: 18 additions & 19 deletions docs/zh_cn/common_usage/speed_up_training.md
@@ -18,12 +18,6 @@ MMEngine supports CPU, single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training.
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python examples/train.py --cfg-options optimizer.type=NpuFusedSGD
```

- multiple machines

Assume that there are 2 machines, each with 8 GPUs.
@@ -65,19 +59,6 @@ MMEngine supports CPU, single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training.
python examples/train.py --launcher="slurm"
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python -m torch.distributed.launch \
--nnodes 8 \
--node_rank 0 \
--master_addr 127.0.0.1 \
--master_port 29500 \
--nproc_per_node=8 \
examples/train.py --launcher pytorch \
--cfg-options optimizer.type=NpuFusedSGD
```

## Mixed Precision Training

Nvidia introduced the Tensor Core unit into the Volta and Turing architectures to support FP32 and FP16 mixed precision computing, and further added BF16 support in the Ampere architecture. With automatic mixed precision training enabled, some operators run at FP16/BF16 precision while the rest run at FP32, which shortens training time and reduces memory requirements without changing the model or degrading its training precision, thus supporting training with larger batch sizes, larger models, and larger input sizes.
@@ -133,3 +114,21 @@ runner = Runner(
```{warning}
`torch.compile` is still under active development by the PyTorch team, and some models may fail to compile. If you encounter such problems, you can refer to the [PyTorch Dynamo FAQ](https://pytorch.org/docs/2.0/dynamo/faq.html) for common fixes, or consult [TorchDynamo Troubleshooting](https://pytorch.org/docs/2.0/dynamo/troubleshooting.html) to file an issue with PyTorch.
```

## Optimizers

If you use Ascend devices, you can use Ascend's fused optimizers to shorten the training time of the model. The optimizers supported on Ascend devices are as follows:

```
NpuFusedAdadelta
NpuFusedAdam
NpuFusedAdamP
NpuFusedAdamW
NpuFusedBertAdam
NpuFusedLamb
NpuFusedRMSprop
NpuFusedRMSpropTF
NpuFusedSGD
```

These optimizers are used in the same way as the native optimizers. For more details on how to use optimizers in MMEngine, refer to [optimizers](https://mmengine.readthedocs.io/zh_CN/latest/tutorials/optim_wrapper.html?highlight=%E4%BC%98%E5%8C%96%E5%99%A8).
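For illustration, a sketch of building such a wrapper directly with `build_optim_wrapper` is shown below (again assuming an Ascend device with `torch_npu` installed; the tiny linear model and hyperparameter values are placeholders):

```python
import torch.nn as nn
from mmengine.optim import build_optim_wrapper

model = nn.Linear(8, 2)  # placeholder model, for illustration only
# Build an optimizer wrapper around an Ascend fused optimizer; requires
# torch_npu so that 'NpuFusedAdamW' is registered with MMEngine.
optim_wrapper = build_optim_wrapper(
    model,
    dict(type='OptimWrapper',
         optimizer=dict(type='NpuFusedAdamW', lr=1e-3, weight_decay=0.01)))
```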
