
[CodeCamp2023-470] Runner supports setting the number of iterations for each epoch #1292

Merged
merged 56 commits into open-mmlab:main on Oct 8, 2023

Conversation

ShuRaymond
Contributor

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

One of the OpenMMLab CodeCamp tasks.

Modification

Modified _flexible_runner.py and runner.py so that FlexibleRunner supports setting the number of iterations per epoch, which saves debugging time.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@CLAassistant

CLAassistant commented Aug 4, 2023

CLA assistant check
All committers have signed the CLA.

@ShuRaymond ShuRaymond reopened this Aug 4, 2023
@ShuRaymond ShuRaymond closed this Aug 4, 2023
@ShuRaymond ShuRaymond reopened this Aug 4, 2023
@HAOCHENYE
Collaborator

Hi, we should also update a unit test to validate this feature works as expected 😄

@zhouzaida
Member

Hi @ShuRaymond , thanks for your contribution.

Here are several comments:

  1. No need to modify FlexibleRunner, as num_batch_per_epoch can be passed to the loop directly through train_cfg, val_cfg, or test_cfg.
  2. No need to update IterBasedTrainLoop.
  3. Unit tests need to be added.
  4. Documentation needs to be updated.
    • docs/zh_cn/common_usage/debug_tricks.md
    • docs/en/common_usage/debug_tricks.md
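
For illustration, a minimal sketch of the configuration shape described in item 1 above. The values are placeholders, and whether the validation and test loops accept the field in exactly this way is an assumption at this point in the review:

```python
# Sketch only: pass num_batch_per_epoch to the loops through the loop configs.
train_cfg = dict(by_epoch=True, max_epochs=3, num_batch_per_epoch=2)
val_cfg = dict(num_batch_per_epoch=2)   # assumed mirror of the train_cfg usage
test_cfg = dict(num_batch_per_epoch=2)  # assumed mirror of the train_cfg usage
```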

@zhouzaida zhouzaida linked an issue Aug 7, 2023 that may be closed by this pull request
@ShuRaymond ShuRaymond requested a review from C1rN09 as a code owner August 7, 2023 12:36
@ShuRaymond
Contributor Author

Hi, we should also update a unit test to validate this feature works as expected 😄

Thanks for the reminder, I am working on it.

@ShuRaymond
Contributor Author

Hi @ShuRaymond , thanks for your contribution.

Here are several comments:

  1. No need to modify FlexibleRunner, as num_batch_per_epoch can be passed to the loop directly through train_cfg, val_cfg, or test_cfg.

  2. No need to update IterBasedTrainLoop.

  3. Unit tests need to be added.

  4. Documentation needs to be updated.

    • docs/zh_cn/common_usage/debug_tricks.md
    • docs/en/common_usage/debug_tricks.md

Thanks for the reminder and the guidance; it's done.

@zhouzaida
Member

Hi, we also need to add several unit tests (checking whether num_batch_per_epoch works as expected) in the following methods:

def test_train(self):

def test_val(self):

def test_test(self):

  • test_train
def test_train(self):
    # 15 test num_batch_per_epoch
    cfg = copy.deepcopy(self.epoch_based_cfg)
    cfg.train_cfg = dict(
        by_epoch=True,
        max_epochs=3,
        num_batch_per_epoch=2,
    )
    runner = Runner.from_cfg(cfg)
    runner.train()
    self.assertEqual(runner.iter, 3 * 2)
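
A possible analogous check for test_val, mirroring the example above. This is only a sketch: whether the field is read from val_cfg or from the dataloader config follows the approach finally adopted, so treat its exact placement here as an assumption.

```python
def test_val(self):
    # sketch: limit validation to 2 batches per epoch
    cfg = copy.deepcopy(self.epoch_based_cfg)
    cfg.val_cfg = dict(num_batch_per_epoch=2)  # placement of the field is assumed
    runner = Runner.from_cfg(cfg)
    runner.val()
    # expectation: the validation loop stops after 2 batches
```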

@zhouzaida
Member

Also, the docstring needs to be updated here:

corresponding milestone. Defaults to None.

num_batch_per_epoch (int, optional): 
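
For example, the new entry could read roughly as follows (the wording is only a suggestion, not taken from the PR):

```python
    num_batch_per_epoch (int, optional): The number of batches to run in each
        epoch. Handy for quick debugging on large datasets. Defaults to None,
        which means the whole dataloader is iterated.
```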

@@ -40,6 +40,7 @@ def __init__(
max_epochs: int,
val_begin: int = 1,
val_interval: int = 1,
num_batch_per_epoch: Optional[int] = None,
Member

Adding a new parameter in the middle position may cause a bc issue. Suggest moving it to the end.
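
To illustrate the concern with generic Python (not MMEngine code): inserting a new parameter before existing ones silently re-binds positional arguments at existing call sites, while appending it keeps old calls working.

```python
def loop_old(max_epochs, val_begin=1, val_interval=1):
    return max_epochs, val_begin, val_interval

# New parameter inserted in the middle: existing positional calls change meaning.
def loop_mid(max_epochs, val_begin=1, num_batch_per_epoch=None, val_interval=1):
    return max_epochs, val_begin, val_interval

# New parameter appended at the end: existing positional calls are unaffected.
def loop_end(max_epochs, val_begin=1, val_interval=1, num_batch_per_epoch=None):
    return max_epochs, val_begin, val_interval

print(loop_old(3, 1, 2))  # (3, 1, 2)
print(loop_mid(3, 1, 2))  # (3, 1, 1) -- the 2 now binds to num_batch_per_epoch
print(loop_end(3, 1, 2))  # (3, 1, 2) -- behaves exactly as before
```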

Example of a training script

# Copyright (c) OpenMMLab. All rights reserved.
Collaborator

Suggested change
# Copyright (c) OpenMMLab. All rights reserved.


## Training for a fixed number of iterations (epoch-based training)

During the process of debugging code, sometimes it is necessary to train for several epochs, such as debugging the validation process or checking whether the checkpoint saving meets expectations. However, if the dataset is too large, it may take a long time to complete one epoch, in which case the cfg parameter can be added.
Collaborator

Suggested change
During the process of debugging code, sometimes it is necessary to train for several epochs, such as debugging the validation process or checking whether the checkpoint saving meets expectations. However, if the dataset is too large, it may take a long time to complete one epoch, in which case the cfg parameter can be added.
During the process of debugging code, sometimes it is necessary to train for several epochs, such as debugging the validation process or checking whether the checkpoint saving meets expectations. However, if the dataset is too large, it may take a long time to complete one epoch, in which case the `num_batch_per_epoch` could be configured:

from mmengine.model import BaseModel
from mmengine.runner import Runner

os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
Collaborator

Why should we configure this env variable?

Contributor Author

This was to work around a conflict between my conda environment and the torch package. I will delete it in a commit.

Comment on lines 159 to 161
Take `MMEngine` as an example (refer to the [documentation](https://mmengine.readthedocs.io/zh_CN/latest/get_started/installation.html) for installing MMEngine).

Example of a training script
Collaborator

Suggested change
Take `MMEngine` as an example (refer to the [documentation](https://mmengine.readthedocs.io/zh_CN/latest/get_started/installation.html) for installing MMEngine).
Example of a training script



Fast debugging is achieved by adding the `num_batch_per_epoch` parameter to `train_dataloader` and `val_dataloader`.
Collaborator

Suggested change
Fast debugging is achieved by adding the `num_batch_per_epoch` parameter to `train_dataloader` and `val_dataloader`.
Fast debugging is achieved by configuring `num_batch_per_epoch` in `train_dataloader` and `val_dataloader`. You can quickly debug the validation code after just 5 training iterations.


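For reference, a minimal sketch of the dataloader configuration described above. The batch size, dataset, and sampler values are placeholders rather than taken from the PR; train_set and val_set stand for dataset objects defined earlier in the script.

```python
train_dataloader = dict(
    batch_size=32,
    dataset=train_set,  # placeholder: a dataset built earlier in the script
    sampler=dict(type='DefaultSampler', shuffle=True),
    num_batch_per_epoch=5)  # each training epoch runs only 5 batches

val_dataloader = dict(
    batch_size=32,
    dataset=val_set,  # placeholder: a dataset built earlier in the script
    sampler=dict(type='DefaultSampler', shuffle=False),
    num_batch_per_epoch=5)  # validation also stops after 5 batches
```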

Run the training script. You can see that each epoch finishes after running only 5 batches. Compared with the original setup, debugging is faster and more flexible.
Collaborator

Suggested change
Run the training script. You can see that each epoch finishes after running only 5 batches. Compared with the original setup, debugging is faster and more flexible.

@zhouzaida zhouzaida changed the title [CodeCamp2023-470] FlexibleRunner supports setting the number of iterations for each epoch, allowing for time-saving debugging. [CodeCamp2023-470] FlexibleRunner supports setting the number of iterations for each epoch Oct 8, 2023
@zhouzaida
Member

FlexibleRunner supports setting the number of iterations for each epoch

@zhouzaida zhouzaida changed the title [CodeCamp2023-470] FlexibleRunner supports setting the number of iterations for each epoch [CodeCamp2023-470] Runner supports setting the number of iterations for each epoch Oct 8, 2023
@zhouzaida zhouzaida merged commit b8a3167 into open-mmlab:main Oct 8, 2023
16 of 19 checks passed

Successfully merging this pull request may close these issues.

[Feature] Support setting num_batch_per_epoch for debugging