[Enhance] New-style CPU training and inference. (#1251)
* [Enhance] New-style CPU training and inference.

* assert mmcv version

* SyncBN to BN in training and testing

* SyncBN to BN in training and testing

* upload untracked files to this branch

* delete gpu_ids

* fix bugs

* assert args.gpu_id in train.py

* use cfg.gpu_ids = [args.gpu_id]

* use cfg.gpu_ids = [args.gpu_id]

* fix typo

* fix typo

* fix typos
MengzhangLI committed Jan 28, 2022
1 parent 02d2790 commit ecda117
Showing 6 changed files with 63 additions and 5 deletions.
docs/en/inference.md (5 additions, 0 deletions)
@@ -6,6 +6,7 @@ and also some high-level apis for easier integration to other projects.
 ### Test a dataset
 
 - single GPU
+- CPU
 - single node multiple GPU
 - multiple node
 
@@ -15,6 +16,10 @@ You can use the following commands to test a dataset.
 # single-gpu testing
 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] [--show]
 
+# CPU: disable GPUs and run single-gpu testing script
+export CUDA_VISIBLE_DEVICES=-1
+python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] [--show]
+
 # multi-gpu testing
 ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}]
 ```
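The same CPU fallback is available through the high-level Python API as well. A minimal sketch, not part of this commit; the config and checkpoint paths are placeholders for any pair available locally, and the SyncBN-to-BN conversion mirrors what tools/test.py now does in its non-distributed branch below:

```python
from mmcv.cnn.utils import revert_sync_batchnorm
from mmseg.apis import inference_segmentor, init_segmentor

# Placeholder paths: substitute any config/checkpoint pair downloaded locally.
config_file = 'configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py'
checkpoint_file = 'checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes.pth'

# device='cpu' keeps the whole model on the CPU, the API-level analogue
# of exporting CUDA_VISIBLE_DEVICES=-1 before running tools/test.py.
model = init_segmentor(config_file, checkpoint_file, device='cpu')

# Configs trained with SyncBN need the SyncBN-to-BN conversion, since
# PyTorch's SyncBatchNorm cannot run a forward pass on CPU tensors.
model = revert_sync_batchnorm(model)
result = inference_segmentor(model, 'demo/demo.png')
```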
docs/en/train.md (14 additions, 0 deletions)
@@ -33,6 +33,20 @@ python tools/train.py ${CONFIG_FILE} [optional arguments]

 If you want to specify the working directory in the command, you can add an argument `--work-dir ${YOUR_WORK_DIR}`.
 
+### Train with CPU
+
+The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+```
+
+And then run the script [above](#train-with-a-single-gpu).
+
+```{warning}
+We do not recommend users to use CPU for training because it is too slow. We support this feature to allow users to debug on machines without GPU.
+```
+
 ### Train with multiple GPUs
 
 ```shell
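A quick sanity check, not part of this commit, that the environment variable actually hides the GPUs before any mmseg code path queries them:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before torch initializes CUDA,
# so set it before the first import of torch in the process.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import torch

# With every device hidden, torch reports no CUDA support and mmseg
# takes the CPU branch guarded by the MMCV >= 1.4.4 assert in this commit.
assert not torch.cuda.is_available()
```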
docs/zh_cn/inference.md (5 additions, 0 deletions)
@@ -5,6 +5,7 @@
 ### Test a dataset
 
 - single GPU
+- CPU
 - single node multiple GPUs
 - multiple nodes
 
@@ -14,6 +15,10 @@
 # single-GPU testing
 python tools/test.py ${配置文件} ${检查点文件} [--out ${结果文件}] [--eval ${评估指标}] [--show]
 
+# CPU: disable GPUs and run the single-GPU testing script
+export CUDA_VISIBLE_DEVICES=-1
+python tools/test.py ${配置文件} ${检查点文件} [--out ${结果文件}] [--eval ${评估指标}] [--show]
+
 # multi-GPU testing
 ./tools/dist_test.sh ${配置文件} ${检查点文件} ${GPU数目} [--out ${结果文件}] [--eval ${评估指标}]
 ```
docs/zh_cn/train.md (14 additions, 0 deletions)
@@ -23,6 +23,20 @@ python tools/train.py ${配置文件} [可选参数]

 If you want to specify the working directory in the command, you can add the argument `--work-dir ${YOUR_WORK_DIR}`.
 
+### Train with CPU
+
+The process of training on CPU is consistent with single-GPU training; we only need to disable the GPUs before training starts.
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+```
+
+Then simply run the single-GPU training script.
+
+```{warning}
+We do not recommend users to train with CPU because it is too slow. We support this feature so that users can debug on machines without a GPU.
+```
+
 ### Train with multiple GPUs
 
 ```shell
mmseg/apis/train.py (6 additions, 3 deletions)
@@ -2,13 +2,15 @@
 import random
 import warnings
 
+import mmcv
 import numpy as np
 import torch
 import torch.distributed as dist
 from mmcv.parallel import MMDataParallel, MMDistributedDataParallel
 from mmcv.runner import HOOKS, build_optimizer, build_runner, get_dist_info
 from mmcv.utils import build_from_cfg
 
+from mmseg import digit_version
 from mmseg.core import DistEvalHook, EvalHook
 from mmseg.datasets import build_dataloader, build_dataset
 from mmseg.utils import find_latest_checkpoint, get_root_logger
@@ -99,9 +101,10 @@ def train_segmentor(model,
             broadcast_buffers=False,
             find_unused_parameters=find_unused_parameters)
     else:
-        model = MMDataParallel(
-            model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
-
+        if not torch.cuda.is_available():
+            assert digit_version(mmcv.__version__) >= digit_version('1.4.4'), \
+                'Please use MMCV >= 1.4.4 for CPU training!'
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
     # build runner
     optimizer = build_optimizer(model, cfg.optimizer)
 
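The new else-branch boils down to a version-gated wrapper. A condensed sketch of the same pattern; `wrap_non_distributed` is a hypothetical name, while the guard itself is copied from the diff:

```python
import mmcv
import torch
from mmcv.parallel import MMDataParallel
from mmseg import digit_version


def wrap_non_distributed(model, gpu_ids):
    """Hypothetical helper mirroring the new else-branch of train_segmentor."""
    if not torch.cuda.is_available():
        # Per the commit, CPU runs need the CPU-compatible MMDataParallel
        # shipped in mmcv 1.4.4, hence the version assert.
        assert digit_version(mmcv.__version__) >= digit_version('1.4.4'), \
            'Please use MMCV >= 1.4.4 for CPU training!'
    return MMDataParallel(model, device_ids=gpu_ids)
```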
tools/test.py (19 additions, 2 deletions)
@@ -8,11 +8,13 @@

 import mmcv
 import torch
+from mmcv.cnn.utils import revert_sync_batchnorm
 from mmcv.parallel import MMDataParallel, MMDistributedDataParallel
 from mmcv.runner import (get_dist_info, init_dist, load_checkpoint,
                          wrap_fp16_model)
 from mmcv.utils import DictAction
 
+from mmseg import digit_version
 from mmseg.apis import multi_gpu_test, single_gpu_test
 from mmseg.datasets import build_dataloader, build_dataset
 from mmseg.models import build_segmentor
@@ -147,11 +149,18 @@ def main():
     cfg.model.pretrained = None
     cfg.data.test.test_mode = True
 
-    cfg.gpu_ids = [args.gpu_id]
+    if args.gpu_id is not None:
+        cfg.gpu_ids = [args.gpu_id]
 
     # init distributed env first, since logger depends on the dist info.
     if args.launcher == 'none':
+        cfg.gpu_ids = [args.gpu_id]
         distributed = False
+        if len(cfg.gpu_ids) > 1:
+            warnings.warn(f'`gpu_ids` is reset from {cfg.gpu_ids} to '
+                          f'{cfg.gpu_ids[0:1]} to avoid potential errors in '
+                          'non-distributed testing.')
+            cfg.gpu_ids = cfg.gpu_ids[0:1]
     else:
         distributed = True
         init_dist(args.launcher, **cfg.dist_params)
@@ -236,7 +245,15 @@ def main():
         tmpdir = None
 
     if not distributed:
-        model = MMDataParallel(model, device_ids=[0])
+        warnings.warn(
+            'SyncBN is only supported with DDP. To be compatible with DP, '
+            'we convert SyncBN to BN. Please use dist_test.sh which can '
+            'avoid this error.')
+        if not torch.cuda.is_available():
+            assert digit_version(mmcv.__version__) >= digit_version('1.4.4'), \
+                'Please use MMCV >= 1.4.4 for CPU inference!'
+        model = revert_sync_batchnorm(model)
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
         results = single_gpu_test(
             model,
             data_loader,
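`revert_sync_batchnorm` is what makes the non-distributed path work for configs trained with SyncBN. A small self-contained check of its effect; the toy module is illustrative only:

```python
import torch.nn as nn
from mmcv.cnn.utils import revert_sync_batchnorm

# Toy stand-in for a segmentor built with norm_cfg=dict(type='SyncBN').
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.SyncBatchNorm(8))
model = revert_sync_batchnorm(model)

# Every SyncBN layer is swapped for a plain batch-norm equivalent,
# which can run forward passes without DDP or a GPU.
assert not any(isinstance(m, nn.SyncBatchNorm) for m in model.modules())
```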
