[Docs] Add a document about debug tricks (#938)
* fix typo

* [Docs] Add debug skills

* minor fix

* refine

* rename debug_skills to debug_tricks

* refine

* Update docs/en/common_usage/debug_tricks.md
zhouzaida committed Feb 21, 2023
1 parent 4861f03 commit 67acdbe
Showing 5 changed files with 59 additions and 3 deletions.
3 changes: 3 additions & 0 deletions docs/en/common_usage/debug_tricks.md
@@ -0,0 +1,3 @@
# Debug Tricks

Coming soon. Please refer to the [Chinese documentation](https://mmengine.readthedocs.io/zh_CN/latest/common_usage/debug_tricks.html).
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -23,6 +23,7 @@ You can switch between Chinese and English documents in the lower-left corner of
common_usage/resume_training.md
common_usage/speed_up_training.md
common_usage/save_gpu_memory.md
common_usage/debug_tricks.md
common_usage/epoch_to_iter.md

.. toctree::
51 changes: 51 additions & 0 deletions docs/zh_cn/common_usage/debug_tricks.md
@@ -0,0 +1,51 @@
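# Debug Tricks

## Set the Dataset's Length

When debugging, you sometimes need to train for a few epochs, for example to check that validation runs correctly or that checkpoints are saved as expected. If the dataset is large, a single epoch can take a long time; in that case you can limit the length of the dataset. Note that only datasets that inherit from [BaseDataset](mmengine.dataset.BaseDataset) support this feature. For how to use `BaseDataset`, see [Dataset base class (BaseDataset)](../advanced_tutorials/basedataset.md).

Take `MMClassification` as an example (see the [documentation](https://mmclassification.readthedocs.io/zh_CN/dev-1.x/get_started.html#id2) to install MMClassification).

Start training: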

```bash
python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
```

Below is part of the training log, where `3125` is the number of iterations in one epoch.

```
02/20 14:43:11 - mmengine - INFO - Epoch(train) [1][ 100/3125] lr: 1.0000e-01 eta: 6:12:01 time: 0.0149 data_time: 0.0003 memory: 214 loss: 2.0611
02/20 14:43:13 - mmengine - INFO - Epoch(train) [1][ 200/3125] lr: 1.0000e-01 eta: 4:23:08 time: 0.0154 data_time: 0.0003 memory: 214 loss: 2.0963
02/20 14:43:14 - mmengine - INFO - Epoch(train) [1][ 300/3125] lr: 1.0000e-01 eta: 3:46:27 time: 0.0146 data_time: 0.0003 memory: 214 loss: 1.9858
```

Stop the training, then edit the `dataset` field in [configs/_base_/datasets/cifar10_bs16.py](https://github.com/open-mmlab/mmclassification/blob/dev-1.x/configs/_base_/datasets/cifar10_bs16.py) and set `indices=5000`:

```python
train_dataloader = dict(
    batch_size=16,
    num_workers=2,
    dataset=dict(
        type=dataset_type,
        data_prefix='data/cifar10',
        test_mode=False,
        indices=5000,  # set indices=5000 so each epoch iterates over only 5000 samples
        pipeline=train_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=True),
)
```
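
When editing the base config file is inconvenient, the same change can be made on the config dictionary in memory. The following is only a sketch on plain Python dicts: `with_debug_indices` is a hypothetical helper (not part of MMEngine), the dictionary mirrors the config above with `pipeline` omitted, and `type='CIFAR10'` is an assumed value for `dataset_type`.

```python
import copy


def with_debug_indices(dataloader_cfg: dict, num_samples: int) -> dict:
    """Return a copy of a dataloader config whose dataset is capped at num_samples.

    Hypothetical helper for illustration; it only sets the `indices` key
    that the dataset config above already uses.
    """
    cfg = copy.deepcopy(dataloader_cfg)  # leave the original config untouched
    cfg['dataset']['indices'] = num_samples
    return cfg


train_dataloader = dict(
    batch_size=16,
    num_workers=2,
    # 'CIFAR10' is an assumed value for dataset_type.
    dataset=dict(type='CIFAR10', data_prefix='data/cifar10', test_mode=False),
    sampler=dict(type='DefaultSampler', shuffle=True),
)

debug_dataloader = with_debug_indices(train_dataloader, 5000)
print(debug_dataloader['dataset']['indices'])
```

Working on a deep copy keeps the full-length config available for the real training run.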

Restart training:

```bash
python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
```

As the log shows, the number of iterations per epoch is now `313`, so an epoch finishes much sooner than before.

```
02/20 14:44:58 - mmengine - INFO - Epoch(train) [1][100/313] lr: 1.0000e-01 eta: 0:31:09 time: 0.0154 data_time: 0.0004 memory: 214 loss: 2.1852
02/20 14:44:59 - mmengine - INFO - Epoch(train) [1][200/313] lr: 1.0000e-01 eta: 0:23:18 time: 0.0143 data_time: 0.0002 memory: 214 loss: 2.0424
02/20 14:45:01 - mmengine - INFO - Epoch(train) [1][300/313] lr: 1.0000e-01 eta: 0:20:39 time: 0.0143 data_time: 0.0003 memory: 214 loss: 1.814
```
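
The iteration counts in the two logs follow from dividing the number of samples by the batch size and rounding up. A minimal sketch, assuming the CIFAR-10 training set has 50,000 images, a single training process, and that the last incomplete batch is kept; `iters_per_epoch` is a hypothetical helper, not an MMEngine function:

```python
import math


def iters_per_epoch(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size)


full_run = iters_per_epoch(50_000, 16)   # full CIFAR-10 training set (assumed size)
debug_run = iters_per_epoch(5_000, 16)   # with indices=5000
print(full_run, debug_run)  # 3125 313
```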
1 change: 1 addition & 0 deletions docs/zh_cn/index.rst
@@ -24,6 +24,7 @@
common_usage/speed_up_training.md
common_usage/save_gpu_memory.md
common_usage/set_random_seed.md
common_usage/debug_tricks.md
common_usage/model_analysis.md
common_usage/set_interval.md
common_usage/epoch_to_iter.md
6 changes: 3 additions & 3 deletions mmengine/dataset/base_dataset.py
@@ -714,9 +714,9 @@ def _get_unserialized_subset(self, indices: Union[Sequence[int],
Args:
indices (int or Sequence[int]): If type of indices is int,
indices represents the first or last few data of data
- information. If indices of indices is Sequence, indices
- represents the target data information index which consist
- of subset data information.
+ information. If type of indices is Sequence, indices represents
+ the target data information index which consist of subset data
+ information.
Returns:
Tuple[np.ndarray, np.ndarray]: subset of data information.
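
As a rough illustration of the docstring above, the two forms of `indices` can be mimicked on a plain Python list. This is a sketch of the described behaviour, not the MMEngine implementation: `select_subset` is a hypothetical name, and treating a negative integer as "the last few" items is an assumption drawn from the docstring's wording.

```python
from typing import List, Sequence, Union


def select_subset(data_list: List[dict],
                  indices: Union[int, Sequence[int]]) -> List[dict]:
    """Pick a subset of data_list, following the two cases in the docstring."""
    if isinstance(indices, int):
        # Non-negative int: keep the first `indices` items.
        # Negative int: keep the last `abs(indices)` items (assumed).
        return data_list[:indices] if indices >= 0 else data_list[indices:]
    # Sequence of ints: keep exactly the items at those positions.
    return [data_list[i] for i in indices]


samples = [{'id': i} for i in range(10)]
first_three = select_subset(samples, 3)
last_two = select_subset(samples, -2)
picked = select_subset(samples, [0, 4, 9])
```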