[Docs] Introduce how to customize distributed training settings #1279

Merged · 2 commits · Jul 31, 2023
54 changes: 52 additions & 2 deletions docs/en/common_usage/distributed_training.md
@@ -2,7 +2,9 @@

MMEngine supports training models on CPU, a single GPU, multiple GPUs on a single machine, and multiple machines. When multiple GPUs are available in the environment, we can use the following commands to enable multi-GPU training on a single machine or across multiple machines and shorten the training time of the model.

## Multiple GPUs on a Single Machine
## Launch Training

### Multiple GPUs on a Single Machine

Assuming the current machine has 8 GPUs, you can enable multi-GPU training with the following command:

@@ -16,7 +18,7 @@

If you need to specify the GPU index, you can set the `CUDA_VISIBLE_DEVICES` environment variable, for example:

```bash
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/distributed_training.py --launcher pytorch
```

## Multiple Machines
### Multiple Machines

Assuming there are 2 machines connected via Ethernet, you can simply run the following commands.
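The collapsed block presumably contains the per-machine launch commands; the sketch below uses `torch.distributed.launch`'s standard multi-node flags, where the master address and port are placeholder assumptions:

```shell
# On the first machine (rank 0); replace the address with machine 0's IP.
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.1" \
    --master_port=29500 \
    --nproc_per_node=8 \
    examples/distributed_training.py --launcher pytorch

# On the second machine, run the same command with --node_rank=1.
```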

@@ -56,3 +58,51 @@ srun -p mm_dev \
--kill-on-bad-exit=1 \
python examples/distributed_training.py --launcher="slurm"
```

## Customize Distributed Training

When users switch from single-GPU training to multi-GPU training, no changes need to be made: [Runner](mmengine.runner.Runner.wrap_model) wraps the model with [MMDistributedDataParallel](mmengine.model.MMDistributedDataParallel) by default, which enables multi-GPU training.
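For instance, the same training script is launched unchanged in both modes; the script path below follows the examples used earlier on this page:

```shell
# Single GPU: run the script directly; the model is not wrapped.
python examples/distributed_training.py

# Multiple GPUs on one machine: same script, started via the launcher;
# Runner wraps the model with MMDistributedDataParallel automatically.
python -m torch.distributed.launch --nproc_per_node=8 \
    examples/distributed_training.py --launcher pytorch
```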

If you want to pass more parameters to MMDistributedDataParallel or use your own `CustomDistributedDataParallel`, you can set `model_wrapper_cfg`.

### Pass More Parameters to MMDistributedDataParallel

For example, setting `find_unused_parameters` to `True` (useful when some model parameters receive no gradient in a forward pass, at the cost of extra overhead):

```python
cfg = dict(
    model_wrapper_cfg=dict(
        type='MMDistributedDataParallel', find_unused_parameters=True)
)
runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
    cfg=cfg,
)
runner.train()
```

### Use a Customized CustomDistributedDataParallel

```python
from torch.nn.parallel import DistributedDataParallel

from mmengine.registry import MODEL_WRAPPERS


@MODEL_WRAPPERS.register_module()
class CustomDistributedDataParallel(DistributedDataParallel):
    pass


cfg = dict(model_wrapper_cfg=dict(type='CustomDistributedDataParallel'))
runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
    cfg=cfg,
)
runner.train()
```
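To make the registry mechanism concrete, here is a dependency-free sketch of what registering a wrapper and building it from `model_wrapper_cfg` amounts to; this is a simplified illustration, not MMEngine's actual `Registry` implementation:

```python
# Minimal stand-in for MMEngine's registry pattern: the `type` key in the
# config selects a registered class; remaining keys are passed as kwargs.
MODEL_WRAPPERS = {}


def register_module(cls):
    MODEL_WRAPPERS[cls.__name__] = cls
    return cls


@register_module
class CustomDistributedDataParallel:
    def __init__(self, module, **kwargs):
        self.module = module
        self.kwargs = kwargs


def build_wrapper(model, model_wrapper_cfg):
    cfg = dict(model_wrapper_cfg)          # copy so pop() does not mutate input
    wrapper_cls = MODEL_WRAPPERS[cfg.pop('type')]
    return wrapper_cls(model, **cfg)


wrapped = build_wrapper('model', dict(type='CustomDistributedDataParallel',
                                      find_unused_parameters=True))
print(type(wrapped).__name__)   # CustomDistributedDataParallel
print(wrapped.kwargs)           # {'find_unused_parameters': True}
```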
53 changes: 51 additions & 2 deletions docs/zh_cn/common_usage/distributed_training.md
@@ -2,7 +2,9 @@

MMEngine supports training on CPU, a single GPU, multiple GPUs on a single machine, and multiple machines. When multiple GPUs are available in the environment, we can use the following commands to enable single-machine or multi-machine multi-GPU training and shorten the training time of the model.

## Multiple GPUs on a Single Machine
## Launch Training

### Multiple GPUs on a Single Machine

Assuming the current machine has 8 GPUs, you can enable multi-GPU training with the following command:

@@ -16,7 +18,7 @@

```bash
python -m torch.distributed.launch --nproc_per_node=8 examples/distributed_training.py --launcher pytorch
```

If you need to specify the GPU index, you can set the `CUDA_VISIBLE_DEVICES` environment variable, for example:

```bash
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/distributed_training.py --launcher pytorch
```

## Multiple Machines
### Multiple Machines

Assume there are 2 machines, each with 8 GPUs.

@@ -56,3 +58,50 @@ srun -p mm_dev \
--kill-on-bad-exit=1 \
python examples/distributed_training.py --launcher="slurm"
```

## Customize Distributed Training

When users switch from single-GPU training to multi-GPU training, no changes need to be made: [Runner](mmengine.runner.Runner.wrap_model) wraps the model with [MMDistributedDataParallel](mmengine.model.MMDistributedDataParallel) by default, which enables multi-GPU training.
If you want to pass more parameters to MMDistributedDataParallel or use your own `CustomDistributedDataParallel`, you can set `model_wrapper_cfg`.

### Pass More Parameters to MMDistributedDataParallel

For example, setting `find_unused_parameters` to `True` (useful when some model parameters receive no gradient in a forward pass, at the cost of extra overhead):

```python
cfg = dict(
    model_wrapper_cfg=dict(
        type='MMDistributedDataParallel', find_unused_parameters=True)
)
runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
    cfg=cfg,
)
runner.train()
```

### Use a Customized CustomDistributedDataParallel

```python
from torch.nn.parallel import DistributedDataParallel

from mmengine.registry import MODEL_WRAPPERS


@MODEL_WRAPPERS.register_module()
class CustomDistributedDataParallel(DistributedDataParallel):
    pass


cfg = dict(model_wrapper_cfg=dict(type='CustomDistributedDataParallel'))
runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
    cfg=cfg,
)
runner.train()
```