[Docs] Add a document about debug tricks #938

Merged: 8 commits, Feb 21, 2023
3 changes: 3 additions & 0 deletions docs/en/common_usage/debug_tricks.md
@@ -0,0 +1,3 @@
# Debug Tricks

Coming soon. Please refer to the [Chinese documentation](https://mmengine.readthedocs.io/zh_CN/latest/common_usage/debug_tricks.html).
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -23,6 +23,7 @@ You can switch between Chinese and English documents in the lower-left corner of
common_usage/resume_training.md
common_usage/speed_up_training.md
common_usage/save_gpu_memory.md
common_usage/debug_tricks.md
common_usage/epoch_to_iter.md

.. toctree::
51 changes: 51 additions & 0 deletions docs/zh_cn/common_usage/debug_tricks.md
@@ -0,0 +1,51 @@
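# Debug Tricks

## Setting the Dataset Length

While debugging code, you sometimes need to train for a few epochs, for example to check the validation process or whether checkpoints are saved as expected. If the dataset is large, however, a single epoch can take a long time; in that case you can limit the length of the dataset. Note that only datasets that inherit from [BaseDataset](mmengine.dataset.BaseDataset) support this feature. For the usage of `BaseDataset`, see [BaseDataset](../advanced_tutorials/basedataset.md).

Take `MMClassification` as an example (install MMClassification by following its [documentation](https://mmclassification.readthedocs.io/zh_CN/dev-1.x/get_started.html#id2)).

Launch training: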

```bash
python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
```

Below is part of the training log, where `3125` is the number of iterations per epoch.

```
02/20 14:43:11 - mmengine - INFO - Epoch(train) [1][ 100/3125] lr: 1.0000e-01 eta: 6:12:01 time: 0.0149 data_time: 0.0003 memory: 214 loss: 2.0611
02/20 14:43:13 - mmengine - INFO - Epoch(train) [1][ 200/3125] lr: 1.0000e-01 eta: 4:23:08 time: 0.0154 data_time: 0.0003 memory: 214 loss: 2.0963
02/20 14:43:14 - mmengine - INFO - Epoch(train) [1][ 300/3125] lr: 1.0000e-01 eta: 3:46:27 time: 0.0146 data_time: 0.0003 memory: 214 loss: 1.9858
```

Stop training, then modify the `dataset` field in [configs/_base_/datasets/cifar10_bs16.py](https://github.com/open-mmlab/mmclassification/blob/dev-1.x/configs/_base_/datasets/cifar10_bs16.py) and set `indices=5000`.

```python
train_dataloader = dict(
    batch_size=16,
    num_workers=2,
    dataset=dict(
        type=dataset_type,
        data_prefix='data/cifar10',
        test_mode=False,
        indices=5000,  # set indices=5000 so that each epoch iterates over only 5000 samples
        pipeline=train_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=True),
)
```

Restart training:

```bash
python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
```

As the log shows, the number of iterations drops to `313`, so an epoch now finishes much sooner than before.

```
02/20 14:44:58 - mmengine - INFO - Epoch(train) [1][100/313] lr: 1.0000e-01 eta: 0:31:09 time: 0.0154 data_time: 0.0004 memory: 214 loss: 2.1852
02/20 14:44:59 - mmengine - INFO - Epoch(train) [1][200/313] lr: 1.0000e-01 eta: 0:23:18 time: 0.0143 data_time: 0.0002 memory: 214 loss: 2.0424
02/20 14:45:01 - mmengine - INFO - Epoch(train) [1][300/313] lr: 1.0000e-01 eta: 0:20:39 time: 0.0143 data_time: 0.0003 memory: 214 loss: 1.814
```
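The two iteration counts in the logs follow from the dataset length and the batch size. The sketch below reproduces them under stated assumptions: a single training process, a batch size of 16, the last incomplete batch is kept, and the CIFAR-10 training set has 50000 images. The helper name `iters_per_epoch` is made up for illustration.

```python
import math


def iters_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches one process sees in one epoch."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_samples // batch_size
    # An incomplete final batch still counts as one iteration.
    return math.ceil(num_samples / batch_size)


full_run = iters_per_epoch(50000, 16)   # whole CIFAR-10 training set
debug_run = iters_per_epoch(5000, 16)   # with indices=5000
print(full_run, debug_run)  # 3125 313
```

With `indices=5000`, the last batch holds only 8 samples, which is why the count is 313 rather than 312.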
1 change: 1 addition & 0 deletions docs/zh_cn/index.rst
@@ -24,6 +24,7 @@
common_usage/speed_up_training.md
common_usage/save_gpu_memory.md
common_usage/set_random_seed.md
common_usage/debug_tricks.md
common_usage/model_analysis.md
common_usage/set_interval.md
common_usage/epoch_to_iter.md
6 changes: 3 additions & 3 deletions mmengine/dataset/base_dataset.py
@@ -714,9 +714,9 @@ def _get_unserialized_subset(self, indices: Union[Sequence[int],
Args:
indices (int or Sequence[int]): If type of indices is int,
indices represents the first or last few data of data
- information. If indices of indices is Sequence, indices
- represents the target data information index which consist
- of subset data information.
+ information. If type of indices is Sequence, indices represents
+ the target data information index which consist of subset data
+ information.

Returns:
Tuple[np.ndarray, np.ndarray]: subset of data information.
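
The `indices` semantics described in this docstring can be illustrated with a simplified sketch. `take_subset` is a hypothetical helper that works on a plain Python list; it is not MMEngine's implementation, which operates on the dataset's stored data information. The sketch assumes that a non-negative int keeps the first items and a negative int keeps the last ones.

```python
from typing import Any, List, Sequence, Union


def take_subset(data_list: List[Any], indices: Union[int, Sequence[int]]) -> List[Any]:
    """Select a subset of ``data_list`` (illustrative only)."""
    if isinstance(indices, int):
        if indices >= 0:
            # Assumed behaviour: keep the first `indices` items.
            return data_list[:indices]
        # Assumed behaviour: keep the last `-indices` items.
        return data_list[indices:]
    if isinstance(indices, Sequence):
        # Keep exactly the items at the listed positions, in the given order.
        return [data_list[i] for i in indices]
    raise TypeError(f'indices must be int or Sequence[int], got {type(indices)}')


samples = list(range(10))
first_three = take_subset(samples, 3)
last_two = take_subset(samples, -2)
picked = take_subset(samples, [1, 4, 7])
```

Under these assumptions, `indices=5000` in the config above corresponds to the int branch and keeps the first 5000 samples.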