[Enhance] Update FAQ docs (#6587)
* Fix mosaic repr typo (#6523)

* Include mmflow in readme (#6545)

* Include mmflow in readme

* Include mmflow in README_zh-CN

* Add mmflow url into the document menu in docs/conf.py and docs_zh-CN/conf.py.

* Make OHEM work with seesaw loss (#6514)

* [Enhance] Support file_client in Datasets and evaluating panoptic results on Ceph (#6489)

* first version

* Replace with our api

* Add copyright

* Move the runtime error to multi_core interface

* Add docstring

* Fix comments

* Add comments

* Add unit test for pq_compute_single_core

* Fix MMDetection model to ONNX command (#6558)

* Update README.md (#6567)

* [Feature] Support custom persistent_workers (#6435)

* Fix aug test error when the number of prediction bboxes is 0 (#6398)

* Fix aug test error when the number of prediction bboxes is 0

* test

* test

* fix lint

* Support custom pin_memory and persistent_workers

* fix comment

* fix docstr

* remove pin_memory

* Fix SSD512 config error (#6574)

* Fix mosaic repr typo (#6523)

* Include mmflow in readme (#6545)

* Include mmflow in readme

* Include mmflow in README_zh-CN

* Add mmflow url into the document menu in docs/conf.py and docs_zh-CN/conf.py.

* Make OHEM work with seesaw loss (#6514)

* Fix ssd512 config error

Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: Czm369 <40661020+Czm369@users.noreply.github.com>
Co-authored-by: ohwi <supebulous@gmail.com>

* Catch symlink failure on Windows (#6482)

* Catch symlink failure on Windows

Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com>

* Set copy mode on Windows

Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com>

* Fix lint

Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com>

* Fix logic error

Signed-off-by: del-zhenwu <dele.zhenwu@gmail.com>

* [Feature] Support Label Assignment Distillation (LAD) (#6342)

* add LAD

* inherit LAD from KnowledgeDistillationSingleStageDetector

* add configs/lad/lad_r101_paa_r50_fpn_coco_1x.py

* update LAD readme

* update configs/lad/README.md

* try not to use abbreviations for variable names

* add unittest for lad_head

* update test_lad_head

* remove main in tests/test_models/test_dense_heads/test_lad_head.py

* [Fix] Avoid infinite GPU waiting in dist training (#6501)

* [#6495] fix infinite GPU waiting in dist training

* print log_vars keys in assertion msg

* linting issue

* Support to collect the best models (#6560)

* Fix mosaic repr typo (#6523)

* Include mmflow in readme (#6545)

* Include mmflow in readme

* Include mmflow in README_zh-CN

* Add mmflow url into the document menu in docs/conf.py and docs_zh-CN/conf.py.

* Make OHEM work with seesaw loss (#6514)

* update

* support gather best model

Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: Czm369 <40661020+Czm369@users.noreply.github.com>
Co-authored-by: ohwi <supebulous@gmail.com>

* [Enhance]: Optimize augmentation pipeline to speed up training. (#6442)

* Refactor YOLOX (#6443)

* Fix aug test error when the number of prediction bboxes is 0 (#6398)

* Fix aug test error when the number of prediction bboxes is 0

* test

* test

* fix lint

* Support custom pin_memory and persistent_workers

* [Docs] Chinese version of robustness_benchmarking.md (#6375)

* Chinese version of robustness_benchmarking.md

* Update docs_zh-CN/robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* Update docs_zh-CN/robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* Update docs_zh-CN/robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* Update docs_zh-CN/robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* Update docs_zh-CN/robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* Update docs_zh-CN/robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* Update robustness_benchmarking.md

* Update robustness_benchmarking.md

* Update robustness_benchmarking.md

* Update robustness_benchmarking.md

* Update robustness_benchmarking.md

* Update robustness_benchmarking.md

Co-authored-by: RangiLyu <lyuchqi@gmail.com>

* update yolox_s

* update yolox_s

* support dynamic eval interval

* fix some error

* support ceph

* fix none error

* fix batch error

* replace resize

* fix comment

* fix docstr

* Update the link of checkpoints (#6460)

* [Feature]: Support plot confusion matrix. (#6344)

* remove pin_memory

* update

* fix unittest

* update cfg

* fix error

* add unittest

* [Fix] Fix SpatialReductionAttention in PVT. (#6488)

* [Fix] Fix SpatialReductionAttention in PVT

* Add warning

* Save coco summarize print information to logger (#6505)

* Fix type error in 2_new_data_mode (#6469)

* Always map location to cpu when load checkpoint (#6405)

* configs: update groie README (#6401)

Signed-off-by: Leonardo Rossi <leonardo.rossi@unipr.it>

* [Fix] fix config path in docs (#6396)

* [Enhance] Set a random seed when the user does not set a seed. (#6457)

* fix random seed bug

* add comment

* enhance random seed

* rename

Co-authored-by: Haobo Yuan <yuanhaobo@whu.edu.cn>

* [BugFixed] fix wrong trunc_normal_init use (#6432)

* fix wrong trunc_normal_init use

* fix wrong trunc_normal_init use

* fix #6446

Co-authored-by: Uno Wu <st9007a@gmail.com>
Co-authored-by: Leonardo Rossi <leonardo.rossi@unipr.it>
Co-authored-by: BigDong <yudongwang@tju.edu.cn>
Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>
Co-authored-by: Haobo Yuan <yuanhaobo@whu.edu.cn>
Co-authored-by: Shusheng Yang <shusheng.yang@qq.com>

* bump version to v2.18.1 (#6510)

* bump version to v2.18.1

* Update changelog.md

* add some comment

* fix some comment

* update readme

* fix lint

* add reduce mean

* update

* update readme

* update params

Co-authored-by: Cedric Luo <luochunhua1996@outlook.com>
Co-authored-by: RangiLyu <lyuchqi@gmail.com>
Co-authored-by: Guangchen Lin <347630870@qq.com>
Co-authored-by: Andrea Panizza <8233615+AndreaPi@users.noreply.github.com>
Co-authored-by: Uno Wu <st9007a@gmail.com>
Co-authored-by: Leonardo Rossi <leonardo.rossi@unipr.it>
Co-authored-by: BigDong <yudongwang@tju.edu.cn>
Co-authored-by: Haobo Yuan <yuanhaobo@whu.edu.cn>
Co-authored-by: Shusheng Yang <shusheng.yang@qq.com>

* [Refactor] Remove some code in `mmdet/apis/train.py` (#6576)

* remove some code about custom hooks in apis/train.py

* files were modified by yapf

* Fix lad repeatedly output warning message (#6584)

* update faq docs

* update

* update

* update

* fix lint

* update

* update

* update

* update readme

* Rephrase

Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: Czm369 <40661020+Czm369@users.noreply.github.com>
Co-authored-by: ohwi <supebulous@gmail.com>
Co-authored-by: Guangchen Lin <347630870@qq.com>
Co-authored-by: Rishit Dagli <rishit.dagli@gmail.com>
Co-authored-by: RangiLyu <lyuchqi@gmail.com>
Co-authored-by: del-zhenwu <dele.zhenwu@gmail.com>
Co-authored-by: Thuy Ng <thuypn9a4@gmail.com>
Co-authored-by: Han Zhang <623606860@qq.com>
Co-authored-by: Cedric Luo <luochunhua1996@outlook.com>
Co-authored-by: Andrea Panizza <8233615+AndreaPi@users.noreply.github.com>
Co-authored-by: Uno Wu <st9007a@gmail.com>
Co-authored-by: Leonardo Rossi <leonardo.rossi@unipr.it>
Co-authored-by: BigDong <yudongwang@tju.edu.cn>
Co-authored-by: Haobo Yuan <yuanhaobo@whu.edu.cn>
Co-authored-by: Shusheng Yang <shusheng.yang@qq.com>
Co-authored-by: Wenwei Zhang <40779233+ZwwWayne@users.noreply.github.com>
18 people committed Dec 8, 2021
1 parent de60de7 commit 3c91d21
Showing 2 changed files with 89 additions and 0 deletions.
44 changes: 44 additions & 0 deletions docs/faq.md
@@ -74,6 +74,7 @@ We list some common troubles faced by many users and their corresponding solutions
2. Reduce the learning rate: the learning rate might be too large for some reason, e.g., a change of batch size. You can rescale it to a value that trains the model stably.
3. Extend the warmup iterations: some models are sensitive to the learning rate at the start of training. You can extend the warmup iterations, e.g., change `warmup_iters` from 500 to 1000 or 2000.
4. Add gradient clipping: some models require gradient clipping to stabilize the training process. The default `grad_clip` is `None`; you can add gradient clipping to avoid gradients that are too large, i.e., set `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` in your config file. If your config does not inherit from any base config that contains `optimizer_config=dict(grad_clip=None)`, you can simply add `optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`. A minimal sketch is shown below.
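
For reference, a minimal sketch of how such an override might look in a config file; the `_base_` path here is only a placeholder for whichever base config you actually inherit from:

```python
# Hypothetical config fragment: enable gradient clipping on top of a base
# config that already defines optimizer_config=dict(grad_clip=None).
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'  # placeholder base config

optimizer_config = dict(
    _delete_=True,  # drop the inherited grad_clip=None entry
    grad_clip=dict(max_norm=35, norm_type=2))
```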

- "GPU out of memory"
1. In some scenarios there is a large number of ground truth boxes, which may cause OOM during target assignment. You can set `gpu_assign_thr=N` in the assigner's config so that the assigner computes box overlaps on the CPU when there are more than N GT boxes (see the sketch after this list).
2. Set `with_cp=True` in the backbone. This uses the sublinear strategy in PyTorch to reduce GPU memory cost in the backbone.
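
A hedged sketch of where these two switches live; the surrounding backbone and assigner keys are illustrative placeholders rather than a specific config, the relevant fields are `with_cp` and `gpu_assign_thr`:

```python
# Illustrative fragment only; the structure mimics a generic two-stage
# detector config and is not taken from any particular file.
model = dict(
    backbone=dict(
        type='ResNet',
        depth=50,
        with_cp=True),  # sublinear/checkpointing strategy to save backbone memory
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                gpu_assign_thr=500))))  # compute overlaps on CPU above 500 GT boxes
```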
@@ -83,8 +84,51 @@ We list some common troubles faced by many users and their corresponding solutions
1. This error indicates that your module has parameters that were not used in producing loss. This phenomenon may be caused by running different branches in your code in DDP mode.
2. You can set `find_unused_parameters = True` in the config to solve this problem, or find those unused parameters manually.

- Save the best model

It can be turned on by configuring `evaluation = dict(save_best='auto')`. With the `auto` parameter, the first key in the returned evaluation results is used as the criterion for selecting the best model. You can also set a specific key from the evaluation results directly, for example, `evaluation = dict(save_best='mAP')`.

- Resume training with `ExpMomentumEMAHook`

If you use `ExpMomentumEMAHook` during training, you cannot restore the model parameters at resume time simply through the command-line arguments `--resume-from` or `--cfg-options resume_from`; i.e., the command `python tools/train.py configs/yolox/yolox_s_8x8_300e_coco.py --resume-from ./work_dir/yolox_s_8x8_300e_coco/epoch_x.pth` will not work. Since `ExpMomentumEMAHook` needs to reload the weights, taking the `yolox_s` algorithm as an example, you should modify the value of `resume_from` in two places of the config, as below:

```python
# Open configs/yolox/yolox_s_8x8_300e_coco.py directly and modify all resume_from fields
resume_from = './work_dir/yolox_s_8x8_300e_coco/epoch_x.pth'
custom_hooks = [
    ...,
    dict(
        type='ExpMomentumEMAHook',
        resume_from='./work_dir/yolox_s_8x8_300e_coco/epoch_x.pth',
        momentum=0.0001,
        priority=49)
]
```

## Evaluation

- COCO Dataset, AP or AR = -1
1. According to the COCO dataset definition, the area thresholds for small and medium objects in an image are 1024 (32\*32) and 9216 (96\*96) pixels, respectively (see the sketch after this list).
2. If the corresponding area range contains no object, the AP and AR results are set to -1.
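
For reference, a small sketch of the area buckets implied by the definition above (values in squared pixels):

```python
# COCO object-size buckets; if a bucket contains no ground-truth object,
# the corresponding AP/AR entry is reported as -1.
coco_area_ranges = {
    'small': (0, 32 ** 2),         # area < 1024
    'medium': (32 ** 2, 96 ** 2),  # 1024 <= area < 9216
    'large': (96 ** 2, 1e5 ** 2),
}
```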

## Model

- `style` in ResNet

The `style` parameter in ResNet allows either the `pytorch` or the `caffe` style. It determines a difference in the Bottleneck module, which is a stack of `1x1-3x3-1x1` convolutional layers. In `caffe` mode, the convolution layer with `stride=2` is the first `1x1` convolution, while in `pytorch` mode it is the second `3x3` convolution that has `stride=2`. Sample code is shown below:

```python
if self.style == 'pytorch':
self.conv1_stride = 1
self.conv2_stride = stride
else:
self.conv1_stride = stride
self.conv2_stride = 1
```

- ResNeXt parameter description

ResNeXt comes from the paper [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/abs/1611.05431). It introduces grouped convolutions and uses "cardinality" to control the number of groups, achieving a balance between accuracy and complexity. The basic width and grouping of the internal Bottleneck module are controlled through the two hyperparameters `baseWidth` and `cardinality`. An example configuration name in MMDetection is `mask_rcnn_x101_64x4d_fpn_mstrain-poly_3x_coco.py`, where `mask_rcnn` indicates the Mask R-CNN algorithm, `x101` indicates a ResNeXt-101 backbone, and `64x4d` indicates that the Bottleneck block has 64 groups, each with a basic width of 4. A config sketch follows.
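
A minimal backbone sketch showing how the `64x4d` naming typically maps onto config fields; the `groups` and `base_width` values mirror the name above, while the remaining keys are illustrative:

```python
# ResNeXt-101 64x4d backbone fragment: 64 groups, basic width 4 per group.
model = dict(
    backbone=dict(
        type='ResNeXt',
        depth=101,
        groups=64,     # "cardinality": number of groups
        base_width=4,  # basic width of each group
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        style='pytorch'))
```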

- `norm_eval` in backbone

Since detection models are usually large and the input image resolution is high, the batch size of a detection model is small, which makes the variance of the statistics computed by BatchNorm during training very large and less stable than the statistics obtained during backbone pre-training. Therefore, `norm_eval=True` is generally used during training, i.e., the BatchNorm statistics of the pre-trained backbone are used directly. The few algorithms that use large batch sizes use `norm_eval=False`, such as NAS-FPN. For a backbone without ImageNet pre-training and with a relatively small batch size, consider using `SyncBN`. A config sketch is shown below.
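
A hedged sketch of the two modes as they might appear in a backbone config; the SyncBN variant assumes multi-GPU training:

```python
# Common case: keep the BatchNorm statistics of the ImageNet-pretrained
# backbone frozen during detection training.
backbone = dict(
    type='ResNet',
    depth=50,
    norm_cfg=dict(type='BN', requires_grad=True),
    norm_eval=True)

# Alternative for backbones without ImageNet pre-training and small per-GPU
# batches: synchronize BN statistics across GPUs instead of freezing them.
backbone_syncbn = dict(
    type='ResNet',
    depth=50,
    norm_cfg=dict(type='SyncBN', requires_grad=True),
    norm_eval=False)
```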
45 changes: 45 additions & 0 deletions docs_zh-CN/faq.md
@@ -76,16 +76,61 @@
2. Reduce the learning rate: for some reason, e.g., a change of batch size, the current learning rate may be too large. You can lower it to a value that trains the model stably.
3. Extend the warmup period: some models are sensitive to the learning rate at the start of training; you can change `warmup_iters` from 500 to 1000 or 2000.
4. Add gradient clipping: some models need gradient clipping to stabilize the training process. The default `grad_clip` is `None`; you can set `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` in the config. If your config does not inherit from any base config containing `optimizer_config=dict(grad_clip=None)`, you can directly set `optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`.

- "GPU out of memory"
1. In scenarios with a large number of ground truth boxes or anchors, the assigner may run out of memory. You can set `gpu_assign_thr=N` in the assigner's config so that the assigner computes IoU on the CPU when there are more than N GT boxes.
2. Set `with_cp=True` in the backbone. This uses the sublinear strategy in PyTorch to reduce the GPU memory used by the backbone.
3. Try mixed-precision training with the examples in `config/fp16`. The `loss_scale` may need to be tuned for different models.

- "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one"
1. This error occurs when some parameters are not used in producing the loss, which easily happens when different branches are run in DDP mode.
2. You can set `find_unused_parameters = True` in the config, or manually find which parameters are unused.

- Save the best model during training

It can be enabled by configuring `evaluation = dict(save_best='auto')`. With the `auto` parameter, the first key in the returned validation results is used as the criterion for selecting the best model. You can also set a specific key from the evaluation results directly, e.g., `evaluation = dict(save_best='mAP')`.

- Resume training with `ExpMomentumEMAHook`

If `ExpMomentumEMAHook` is used during training, the model parameters cannot be restored at resume time simply through the command-line arguments `--resume-from` or `--cfg-options resume_from`, e.g., `python tools/train.py configs/yolox/yolox_s_8x8_300e_coco.py --resume-from ./work_dir/yolox_s_8x8_300e_coco/epoch_x.pth`. Taking the `yolox_s` algorithm as an example, since `ExpMomentumEMAHook` needs to reload the weights, you can do it as follows:

```python
# Open configs/yolox/yolox_s_8x8_300e_coco.py directly and modify all resume_from fields
resume_from = './work_dir/yolox_s_8x8_300e_coco/epoch_x.pth'
custom_hooks = [
    ...,
    dict(
        type='ExpMomentumEMAHook',
        resume_from='./work_dir/yolox_s_8x8_300e_coco/epoch_x.pth',
        momentum=0.0001,
        priority=49)
]
```

## Evaluation

- AP or AR = -1 when using the COCO dataset evaluation interface
1. According to the COCO dataset definition, the area thresholds for medium and small objects in an image are 9216 (96\*96) and 1024 (32\*32), respectively.
2. If there are no detection boxes in a given area range, AP and AR are reported as -1.

## Model

- **The `style` parameter in ResNet**

The optional `style` parameter of ResNet can be `pytorch` or `caffe`; the difference lies in the Bottleneck module. The Bottleneck is a stack of `1x1-3x3-1x1` convolutions: in `caffe` mode, `stride=2` is placed on the first `1x1` convolution, while in `pytorch` mode, `stride=2` is placed on the second `3x3` convolution. A simple example follows:

```python
if self.style == 'pytorch':
self.conv1_stride = 1
self.conv2_stride = stride
else:
self.conv1_stride = stride
self.conv2_stride = 1
```

- **ResNeXt parameter description**

ResNeXt comes from the paper [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/abs/1611.05431). It introduces grouped convolutions and controls the number of groups via the cardinality to balance accuracy and complexity; the two hyperparameters `baseWidth` and `cardinality` control the basic width and the number of groups of the internal Bottleneck module. Taking the MMDetection config `mask_rcnn_x101_64x4d_fpn_mstrain-poly_3x_coco.py` as an example, `mask_rcnn` means the algorithm is Mask R-CNN, `x101` means the backbone is ResNeXt-101, and `64x4d` means the Bottleneck is split into 64 groups, each with a basic width of 4.

- **Backbone `norm_eval` mode**

Because detection models are usually large and the input image resolution is high, the batch size of a detection model is small, typically 2, which makes the variance of the statistics computed by BatchNorm during training very large and less stable than the statistics obtained when pre-training the backbone. Therefore, `norm_eval=True` is generally used during training, i.e., the BatchNorm statistics of the pre-trained backbone are used directly. The few algorithms that use large batch sizes use `norm_eval=False`, e.g., NAS-FPN. For a backbone without ImageNet pre-training, if the batch size is relatively small, consider using `SyncBN`.
