
[Fix] training time validation bug fix #611

Merged
merged 10 commits into open-mmlab:master on Dec 23, 2021

Conversation

0x4f5da2
Contributor

Motivation

This PR fixes the bug introduced in #588.

Modification

Add `sampler_cfg` for the validation dataloader in `mmcls/apis/train.py`.
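
For illustration, a minimal sketch of the kind of change described above (not the exact diff; the config key `sampler` and the surrounding variable names are assumptions):

    # mmcls/apis/train.py (sketch): pass the sampler config through to the
    # validation dataloader as well, so the custom-sampler support added in
    # #588 no longer breaks validation during training.
    if validate:
        val_dataloader = build_dataloader(
            val_dataset,
            samples_per_gpu=cfg.data.samples_per_gpu,
            workers_per_gpu=cfg.data.workers_per_gpu,
            dist=distributed,
            shuffle=False,
            round_up=True,
            sampler_cfg=cfg.data.get('sampler', None))  # the added argument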

@codecov

codecov bot commented Dec 22, 2021

Codecov Report

Merging #611 (21fd957) into master (d5ae255) will increase coverage by 0.44%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #611      +/-   ##
==========================================
+ Coverage   81.33%   81.78%   +0.44%     
==========================================
  Files         116      116              
  Lines        6720     6709      -11     
  Branches     1156     1151       -5     
==========================================
+ Hits         5466     5487      +21     
+ Misses       1103     1069      -34     
- Partials      151      153       +2     
Flag        Coverage Δ
unittests   81.78% <100.00%> (+0.44%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                    Coverage Δ
mmcls/apis/train.py                               21.25% <ø> (-6.53%) ⬇️
mmcls/datasets/builder.py                         73.21% <100.00%> (+27.06%) ⬆️
mmcls/datasets/samplers/repeat_aug.py             90.47% <100.00%> (+5.36%) ⬆️
mmcls/datasets/samplers/distributed_sampler.py    84.00% <0.00%> (+56.00%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mzr1996
Member

mzr1996 commented Dec 22, 2021

Please also add it in tools/test.py

@0x4f5da2
Contributor Author

Please also add it in tools/test.py

Fixed.
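
For reference, a hedged sketch of what the corresponding change in tools/test.py could look like (the config key `sampler` and surrounding names are assumptions, not the exact diff):

    # tools/test.py (sketch): forward any custom sampler config to the test
    # dataloader as well, so a custom sampler is honored consistently.
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=cfg.data.samples_per_gpu,
        workers_per_gpu=cfg.data.workers_per_gpu,
        dist=distributed,
        shuffle=False,  # never shuffle at test time
        round_up=True,
        sampler_cfg=cfg.data.get('sampler', None))  # the added argument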

@@ -83,6 +102,8 @@ def build_dataloader(dataset,
"""
rank, world_size = get_dist_info()

sampler_cfg = get_sampler_cfg(cfg, distributed=dist)
Member

It is not a good choice to pass the whole config into a module's builder.

What if we change it like this?

    rank, world_size = get_dist_info()
    # default logic
    if dist:
        sampler = DistributedSampler(
            dataset, world_size, rank, shuffle=shuffle, round_up=round_up)
        batch_size = samples_per_gpu
        num_workers = workers_per_gpu
    else:
        sampler = None
        batch_size = num_gpus * samples_per_gpu
        num_workers = num_gpus * workers_per_gpu

    # build custom sampler logic
    if sampler_cfg:
        # shuffle=False when val and test
        sampler_cfg.update(shuffle=shuffle)
        sampler = build_sampler(
            sampler_cfg,
            default_args=dict(dataset=dataset, num_replicas=world_size, rank=rank))

    # If sampler exists, turn off dataloader shuffle
    if sampler is not None:
        shuffle = False

Member

@mzr1996 mzr1996 Dec 23, 2021

I think it's better to use:

    rank, world_size = get_dist_info()

    # Custom sampler logic
    if sampler_cfg:
        # shuffle=False when val and test
        sampler_cfg.update(shuffle=shuffle)
        sampler = build_sampler(
            sampler_cfg,
            default_args=dict(dataset=dataset, num_replicas=world_size, rank=rank))
    # Default sampler logic
    elif dist:
        sampler = DistributedSampler(
            dataset, world_size, rank, shuffle=shuffle, round_up=round_up)
    else:
        sampler = None

    # If sampler exists, turn off dataloader shuffle
    if sampler is not None:
        shuffle = False

    if dist:
        batch_size = samples_per_gpu
        num_workers = workers_per_gpu
    else:
        batch_size = num_gpus * samples_per_gpu
        num_workers = num_gpus * workers_per_gpu

Then we can avoid an unnecessary instantiation of DistributedSampler, and the sampler logic is split from the batch_size / num_workers logic.
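
As a hedged usage example (the config schema below is assumed, following MMClassification's config-dict conventions), a data config like this would make build_dataloader construct the custom sampler instead of the default DistributedSampler:

    # Sketch of a user config: `sampler` is read as sampler_cfg and built
    # through the SAMPLERS registry; omit it to fall back to the default logic.
    data = dict(
        samples_per_gpu=32,
        workers_per_gpu=2,
        sampler=dict(type='RepeatAugSampler'),
    )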

@@ -23,6 +23,25 @@
SAMPLERS = Registry('sampler')


def get_sampler_cfg(cfg, distributed):
Member

MME needs to handle both DistributedSampler and InfiniteSampler as the default sampler, depending on the runner type, so it has a set_default_sampler_cfg function.
But MMClassification only has one default sampler, DistributedSampler, so I think this function is unnecessary here.
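
For context, a hedged sketch of how the SAMPLERS registry above is typically paired with a builder in mmcv-based code (the helper below is illustrative and may differ from the one actually added in this PR):

    # Build a sampler instance from a config dict such as
    # dict(type='RepeatAugSampler'); return None when no custom sampler is set.
    from mmcv.utils import Registry, build_from_cfg

    SAMPLERS = Registry('sampler')

    def build_sampler(cfg, default_args=None):
        if cfg is None:
            return None
        return build_from_cfg(cfg, SAMPLERS, default_args=default_args)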

Member

@mzr1996 mzr1996 left a comment

LGTM

@mzr1996 mzr1996 merged commit da39ca6 into open-mmlab:master Dec 23, 2021
mzr1996 added a commit to mzr1996/mmpretrain that referenced this pull request Nov 24, 2022
* sampler bugfixes

* sampler bugfixes

* reorganize code

* minor fixes

* reorganize code

* minor fixes

* Use `mmcv.runner.get_dist_info` instead of `dist` package to get rank and world size.

* Add `build_dataloader` unit tests and fix sampler's unit tests.

* Fix unit tests

* Fix unit tests

Co-authored-by: mzr1996 <mzr1996@163.com>