Commit

add npu optim docs
luomaoling authored and LRJKD committed Apr 21, 2023
1 parent 76bbfbc commit c779b2c
Showing 2 changed files with 36 additions and 38 deletions.
37 changes: 18 additions & 19 deletions docs/en/common_usage/speed_up_training.md
@@ -18,12 +18,6 @@ MMEngine supports training models with CPU, single GPU, multiple GPUs in single
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python examples/train.py --cfg-options optimizer.type=NpuFusedSGD
```

- multiple machines

Assume that there are 2 machines connected with ethernet, you can simply run following commands.
@@ -64,19 +58,6 @@ MMEngine supports training models with CPU, single GPU, multiple GPUs in single
python examples/train.py --launcher="slurm"
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python -m torch.distributed.launch \
--nnodes 8 \
--node_rank 0 \
--master_addr 127.0.0.1 \
--master_port 29500 \
--nproc_per_node=8 \
examples/train.py --launcher pytorch \
--cfg-options optimizer.type=NpuFusedSGD
```

## Mixed Precision Training

Nvidia introduced the Tensor Core unit into the Volta and Turing architectures to support FP32 and FP16 mixed precision computing. They further support BF16 in Ampere architectures. With automatic mixed precision training enabled, some operators operate at FP16/BF16 and the rest operate at FP32, which reduces training time and storage requirements without changing the model or degrading its training precision, thus supporting training with larger batch sizes, larger models, and larger input sizes.
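A minimal sketch of how this can be switched on in MMEngine follows: mixed precision is enabled by using `AmpOptimWrapper` as the optimizer wrapper. The toy model and random dataset below are placeholders for illustration only, and running the snippet requires a GPU or another device supported by automatic mixed precision.

```python
import torch
import torch.nn as nn
from mmengine.model import BaseModel
from mmengine.runner import Runner
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(BaseModel):
    """Hypothetical model used only to illustrate enabling AMP."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, inputs, data_samples=None, mode='loss'):
        loss = ((self.linear(inputs) - data_samples) ** 2).mean()
        return dict(loss=loss)


def collate(batch):
    # Pack samples into the dict format expected by ToyModel.forward.
    return dict(inputs=torch.stack([x for x, _ in batch]),
                data_samples=torch.stack([y for _, y in batch]))


train_dataloader = DataLoader(
    TensorDataset(torch.rand(64, 2), torch.rand(64, 1)),
    batch_size=8, collate_fn=collate)

runner = Runner(
    model=ToyModel(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    # Switching the wrapper type to AmpOptimWrapper enables automatic
    # mixed precision; the wrapped optimizer config stays unchanged.
    optim_wrapper=dict(
        type='AmpOptimWrapper',
        optimizer=dict(type='SGD', lr=0.01, momentum=0.9)),
)
runner.train()
```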
@@ -132,3 +113,21 @@ This feature is only available for PyTorch >= 2.0.0.
```{warning}
`torch.compile` is still under development by PyTorch team. Some models may fail compilation. If you encounter errors during compilation, you can refer to [PyTorch Dynamo FAQ](https://pytorch.org/docs/2.0/dynamo/faq.html) for quick fix, or [TorchDynamo Troubleshooting](https://pytorch.org/docs/2.0/dynamo/troubleshooting.html) to post an issue in PyTorch.
```

## Faster Optimizers

If you train on Ascend devices, you can use Ascend's fused optimizers to reduce the training time of the model. The optimizers supported on Ascend devices are as follows:

```
NpuFusedAdadelta
NpuFusedAdam
NpuFusedAdamP
NpuFusedAdamW
NpuFusedBertAdam
NpuFusedLamb
NpuFusedRMSprop
NpuFusedRMSpropTF
NpuFusedSGD
```

These optimizers are used in the same way as the native optimizers. For more details on configuring optimizers in MMEngine, refer to [optimizers](https://mmengine.readthedocs.io/zh_CN/latest/tutorials/optim_wrapper.html?highlight=%E4%BC%98%E5%8C%96%E5%99%A8).
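As a minimal sketch (assuming an Ascend device with `torch_npu` installed, so that the `NpuFused*` optimizers are available to MMEngine's optimizer registry; the learning rate and momentum are illustrative values), switching to a fused optimizer only requires changing the optimizer type in the optim wrapper config:

```python
# Illustrative optim_wrapper config; requires an Ascend device with torch_npu.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='NpuFusedSGD', lr=0.01, momentum=0.9))
```

The rest of the training setup is unchanged: this dict is passed to `Runner(optim_wrapper=...)` just like a native optimizer config.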
37 changes: 18 additions & 19 deletions docs/zh_cn/common_usage/speed_up_training.md
@@ -18,12 +18,6 @@ MMEngine supports CPU, single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training.
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python examples/train.py --cfg-options optimizer.type=NpuFusedSGD
```

- multiple machines

Assume that there are 2 machines, each with 8 GPUs.
@@ -65,19 +59,6 @@ MMEngine supports CPU, single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training.
python examples/train.py --launcher="slurm"
```

If you use Ascend devices, you can train with the corresponding NPU fused optimizer, for example:

```bash
python -m torch.distributed.launch \
--nnodes 8 \
--node_rank 0 \
--master_addr 127.0.0.1 \
--master_port 29500 \
--nproc_per_node=8 \
examples/train.py --launcher pytorch \
--cfg-options optimizer.type=NpuFusedSGD
```

## Mixed Precision Training

Nvidia introduced the Tensor Core unit into the Volta and Turing architectures to support FP32 and FP16 mixed precision computing, and further added BF16 support in the Ampere architecture. With automatic mixed precision training enabled, some operators run at FP16/BF16 precision while the rest run at FP32, which shortens training time and reduces memory requirements without changing the model or degrading its training precision, thus supporting training with larger batch sizes, larger models, and larger input sizes.
@@ -133,3 +114,21 @@ runner = Runner(
```{warning}
`torch.compile` is still under active development by the PyTorch team, and some models may fail to compile. If you encounter such problems, you can refer to the [PyTorch Dynamo FAQ](https://pytorch.org/docs/2.0/dynamo/faq.html) for common fixes, or consult [TorchDynamo Troubleshooting](https://pytorch.org/docs/2.0/dynamo/troubleshooting.html) to file an issue with PyTorch.
```

## Optimizers

If you use Ascend devices, you can use Ascend's fused optimizers to shorten the training time of the model. The optimizers supported on Ascend devices are as follows:

```
NpuFusedAdadelta
NpuFusedAdam
NpuFusedAdamP
NpuFusedAdamW
NpuFusedBertAdam
NpuFusedLamb
NpuFusedRMSprop
NpuFusedRMSpropTF
NpuFusedSGD
```

These optimizers are used in the same way as the native optimizers. For more details on how to use optimizers in MMEngine, refer to [optimizers](https://mmengine.readthedocs.io/zh_CN/latest/tutorials/optim_wrapper.html?highlight=%E4%BC%98%E5%8C%96%E5%99%A8).
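For illustration, a sketch of building such a wrapper directly with `build_optim_wrapper` is shown below (again assuming an Ascend device with `torch_npu` installed; the tiny linear model and hyperparameter values are placeholders):

```python
import torch.nn as nn
from mmengine.optim import build_optim_wrapper

model = nn.Linear(8, 2)  # placeholder model, for illustration only
# Build an optimizer wrapper around an Ascend fused optimizer; requires
# torch_npu so that 'NpuFusedAdamW' is registered with MMEngine.
optim_wrapper = build_optim_wrapper(
    model,
    dict(type='OptimWrapper',
         optimizer=dict(type='NpuFusedAdamW', lr=1e-3, weight_decay=0.01)))
```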
