[Fix] Convert SyncBN to BN when training on DP #772

sennnnn · 2021-08-10T12:18:09Z

Incompatibility between DP and SyncBN

Run this command without setting --launcher:

python tools/train.py [config_path]

Python then raises an error about the process group:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

SyncBN causes this error. When SyncBN.training is True, SyncBN needs an initialized process_group, but a process group only exists under DDP.

SyncBN works in either of the following situations:

process_group is not None, or torch.distributed.group.WORLD is not None;
SyncBN.training is False, i.e. SyncBN.eval() has been called.
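
The two conditions can be written as a small check. This is only a sketch: `sync_bn_can_run` is a hypothetical helper name, not part of PyTorch or mmseg, and it approximates "a process group exists" with `torch.distributed.is_initialized()`.

```python
import torch.distributed as dist


def sync_bn_can_run(training: bool) -> bool:
    """Return True if a SyncBN layer is expected to run without the error.

    Hypothetical helper for illustration only.
    """
    # In eval mode SyncBN uses its running statistics, so nothing is synced.
    if not training:
        return True
    # In training mode the batch statistics are synced across processes,
    # which needs an initialized default process group (the DDP case).
    return dist.is_available() and dist.is_initialized()


# Plain `python tools/train.py ...` without --launcher: no process group.
print(sync_bn_can_run(training=True))
print(sync_bn_can_run(training=False))
```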

Given the above, this PR converts SyncBN to BN when training with DP.
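
A minimal sketch of such a conversion in plain PyTorch follows. It is not the code merged in this PR (the commit list shows the final version uses mmcv's official revert_sync_batchnorm). The helper name `sync_bn_to_bn` is hypothetical, and the sketch assumes 4D image inputs (so `nn.BatchNorm2d` is the right replacement) and that it runs before the model is moved to the GPU.

```python
import torch
from torch import nn


def sync_bn_to_bn(module: nn.Module) -> nn.Module:
    """Recursively replace nn.SyncBatchNorm layers with nn.BatchNorm2d.

    Hypothetical helper; assumes the layers normalize 4D (N, C, H, W) inputs.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        # Both layer types use the same state_dict keys, so the weights and
        # running statistics can be copied directly.
        bn.load_state_dict(module.state_dict())
        bn.train(module.training)
        return bn
    # Copy the child list first because children are replaced in place.
    for name, child in list(module.named_children()):
        setattr(module, name, sync_bn_to_bn(child))
    return module


model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.SyncBatchNorm(8), nn.ReLU())
model = sync_bn_to_bn(model)
# Training-mode forward pass on CPU, with no process group initialized.
out = model(torch.randn(2, 3, 16, 16))
```

After the conversion the model contains no SyncBN layers, so a training-mode forward pass no longer needs a process group.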

This PR is blocked by MMCV #1253.

MMCV PR #1253 is expected to ship in MMCV 1.3.13, so mmseg 0.18 will require MMCV 1.3.13 or later.

CI will pass once MMCV 1.3.13 is released.
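
A version requirement like this can be checked numerically rather than by string comparison. The sketch below is illustrative only: `parse_version` and `meets_minimum` are hypothetical names, not mmseg's actual check; only the minimum 1.3.13 comes from this description, and pre-release suffixes such as `rc1` are ignored.

```python
import re

MMCV_MINIMUM = "1.3.13"  # minimum taken from the PR description


def parse_version(version: str) -> tuple:
    """Turn '1.3.13' into (1, 3, 13); hypothetical helper."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MMCV_MINIMUM) -> bool:
    """Compare versions component by component, not as strings."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.3.12"))
print(meets_minimum("1.3.13"))
```

Comparing integer tuples avoids the string-comparison pitfall where "1.10.0" sorts before "1.3.13".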

codecov · 2021-08-10T12:54:09Z

Codecov Report

Merging #772 (fdc4208) into master (56e18ba) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head fdc4208 differs from pull request most recent head c7446d7. Consider uploading reports for the commit c7446d7 to get more accurate results

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
+ Coverage   89.02%   89.05%   +0.03%     
==========================================
  Files         111      111              
  Lines        6043     6051       +8     
  Branches      969      969              
==========================================
+ Hits         5380     5389       +9     
  Misses        467      467              
+ Partials      196      195       -1

Flag	Coverage Δ
unittests	`89.05% <100.00%> (+0.03%)`	⬆️

Impacted Files	Coverage Δ
mmseg/__init__.py	`90.32% <100.00%> (ø)`
mmseg/datasets/custom.py	`92.43% <100.00%> (+0.34%)`	⬆️
mmseg/datasets/pipelines/loading.py	`98.52% <0.00%> (+1.47%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

xvjiarui · 2021-08-10T17:48:27Z

This could be moved to models/utils.

tools/train.py

Junjun2016

Please fix the conflict.

xvjiarui · 2021-08-31T17:03:41Z

open-mmlab/mmcv#1253

Junjun2016 · 2021-09-02T06:10:28Z

We can import it from mmcv after that PR is merged.

Junjun2016 · 2021-09-08T05:36:48Z

We can now import revert_sync_batchnorm from open-mmlab/mmcv#1253.

docker/serve/Dockerfile

tools/train.py

Junjun2016

Please fix the conflict.

Junjun2016

LGTM

xvjiarui · 2021-09-15T03:14:16Z

Please upgrade the mmcv requirement.

Junjun2016

https://github.com/open-mmlab/mmsegmentation/blob/master/docs/get_started.md#installation

Please update the mmcv dependency on the master branch.

Junjun2016

LGTM

* [Fix] Convert SyncBN to BN when training on DP.
* Modify SyncBN2BN.
* Add SyncBN2BN unit test.
* Resolve some comments.
* use mmcv official revert_sync_batchnorm
* Remove local syncbn2bn unit tests.
* Update mmcv version.
* Fix bugs of gather model tools.
* Modify warnings.
* Modify docker mmcv version.
* Update mmcv version table.

[Fix] Convert SyncBN to BN when training on DP.

cbd0a97

sennnnn added 2 commits August 11, 2021 02:04

Modify SyncBN2BN.

2ed3d20

Add SyncBN2BN unit test.

1ddcac2

Junjun2016 mentioned this pull request Aug 23, 2021

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. #809

Closed

Junjun2016 self-requested a review August 30, 2021 12:26

Junjun2016 reviewed Aug 31, 2021

tools/train.py (outdated, resolved)

Junjun2016 reviewed Aug 31, 2021

tools/train.py (outdated, resolved)

Junjun2016 reviewed Aug 31, 2021

tools/train.py (outdated, resolved)

Junjun2016 requested changes Aug 31, 2021

sennnnn added 2 commits August 31, 2021 21:20

Resolve some comments.

bcd7db4

Merge Master.

ba38a20

Junjun2016 approved these changes Aug 31, 2021

Junjun2016 requested a review from xvjiarui August 31, 2021 16:12

Junjun2016 mentioned this pull request Sep 2, 2021

[Fix] fix_torchserver1.1 #844

Merged

sennnnn added 5 commits September 9, 2021 20:38

Merge branch 'master' into syncbn2bn

06a3b60

use mmcv official revert_sync_batchnorm

4082c62

Remove local syncbn2bn unit tests.

c681ad0

Update mmcv version.

9cf3611

Fix bugs of gather model tools.

5b94937

Junjun2016 reviewed Sep 11, 2021

docker/serve/Dockerfile (resolved)

Junjun2016 reviewed Sep 11, 2021

tools/train.py (outdated, resolved)

Junjun2016 reviewed Sep 11, 2021

sennnnn added 3 commits September 12, 2021 23:18

Merge master.

57c4dce

Modify warnings.

14f33de

Modify docker mmcv version.

3946e9c

Junjun2016 approved these changes Sep 13, 2021

Junjun2016 reviewed Sep 15, 2021

Update mmcv version table.

c7446d7

Junjun2016 approved these changes Sep 15, 2021

xvjiarui merged commit cae715a into open-mmlab:master Sep 15, 2021

MengzhangLI mentioned this pull request Jul 5, 2022

SyncBN is only supported with DDP. #1738

Closed

Gsunshine mentioned this pull request Aug 24, 2022

Default process group has not been initialized Gsunshine/Enjoy-Hamburger#7

Closed

sibozhang pushed a commit to sibozhang/mmsegmentation that referenced this pull request Mar 22, 2024

[Docs] Update indexing of config readme (open-mmlab#772)

78eb446

* modify stat.py merge_docs * unify merge_docs style * fix bugs * fix bugs
