
[Feature] Add revert_sync_batchnorm #1253

Merged 9 commits into open-mmlab:master on Sep 8, 2021

Conversation

gaotongxiao
Collaborator

Motivation

A model that uses SyncBN cannot run inference on CPU, and SyncBN can also cause other issues during inference. We introduce revert_sync_batchnorm from @kapily's work (pytorch/pytorch#41081 (comment)), which converts the SyncBN layers in any model back to BN.

Modification

Added revert_sync_batchnorm to mmcv/cnn/utils/sync_bn.py along with its unit test.

BC-breaking (Optional)

No, but there is a potential minor risk:

PyTorch provides convert_sync_batchnorm, which converts BatchNorm1d, BatchNorm2d, and BatchNorm3d to SyncBatchNorm, but it does not provide an inverse function. The reason is that SyncBatchNorm neither performs strict input-dimension checking nor stores the expected input dimension, whereas BatchNormXd strictly validates the input dimension (and this is the only difference between BatchNorm1d, 2d, and 3d). Therefore, if one converts BatchNormXd to SyncBN using PyTorch's implementation and then converts it back to BN using this implementation, the input-dimension check is no longer retained.
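The mmcv code itself is not shown in this thread, but kapily's approach that this PR adopts can be sketched roughly as below. This is a minimal illustration, not the exact mmcv implementation; note that it converts every SyncBatchNorm to BatchNorm2d, which is exactly why the original 1d/3d dimension check is lost:

```python
import torch


def revert_sync_batchnorm(module):
    """Recursively replace all SyncBatchNorm layers with BatchNorm2d.

    Sketch of the approach from pytorch/pytorch#41081; the real mmcv
    implementation may differ in details.
    """
    module_output = module
    if isinstance(module, torch.nn.SyncBatchNorm):
        # SyncBatchNorm does not record whether it came from a 1d/2d/3d BN,
        # so we can only pick one target type (here BatchNorm2d).
        module_output = torch.nn.BatchNorm2d(
            module.num_features, module.eps, module.momentum,
            module.affine, module.track_running_stats)
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        module_output.add_module(name, revert_sync_batchnorm(child))
    return module_output
```

After the conversion, the model can run a forward pass on CPU, which is the main goal of this PR.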

@codecov

codecov bot commented Aug 10, 2021

Codecov Report

Merging #1253 (35df788) into master (9341856) will increase coverage by 0.69%.
The diff coverage is 63.05%.

❗ Current head 35df788 differs from pull request most recent head 7f85e59. Consider uploading reports for the commit 7f85e59 to get more accurate results

@@            Coverage Diff             @@
##           master    #1253      +/-   ##
==========================================
+ Coverage   68.20%   68.90%   +0.69%     
==========================================
  Files         160      162       +2     
  Lines       10607    10775     +168     
  Branches     1938     1978      +40     
==========================================
+ Hits         7234     7424     +190     
+ Misses       2992     2962      -30     
- Partials      381      389       +8     
Flag Coverage Δ
unittests 68.90% <63.05%> (+0.69%) ⬆️

Flags with carried-forward coverage won't be shown.

Impacted Files Coverage Δ
mmcv/image/photometric.py 99.27% <ø> (ø)
mmcv/ops/carafe.py 23.30% <ø> (ø)
mmcv/ops/cc_attention.py 38.70% <0.00%> (ø)
mmcv/ops/deform_conv.py 72.26% <0.00%> (+10.21%) ⬆️
mmcv/ops/deform_roi_pool.py 28.20% <ø> (ø)
mmcv/ops/focal_loss.py 23.89% <ø> (ø)
mmcv/ops/fused_bias_leakyrelu.py 30.90% <ø> (ø)
mmcv/ops/masked_conv.py 40.00% <ø> (ø)
mmcv/ops/psa_mask.py 31.81% <ø> (ø)
mmcv/ops/sync_bn.py 19.64% <ø> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9341856...7f85e59. Read the comment docs.

@ZwwWayne
Collaborator

Please also consider/check/test MMSyncBN

@gaotongxiao
Collaborator Author

@ZwwWayne It seems MMSyncBN cannot even be initialized in a non-distributed environment.

@ZwwWayne
Collaborator

> It seems MMSyncBN cannot even be initialized in a non-distributed environment.

So is it also necessary to convert it if it is used in the config for training?

@gaotongxiao
Collaborator Author

@ZwwWayne A model with an MMSyncBN layer cannot be built in a non-distributed environment. I think the only solution is to regenerate a Config that replaces MMSyncBN with BN and use it to build the model, but that idea is quite different from the current implementation. I'm also not sure whether the checkpoint remains compatible after such a conversion. BTW, what was the goal of MMSyncBN?

@ZwwWayne
Collaborator

> A model with an MMSyncBN layer cannot be built in a non-distributed environment. I think the only solution is to regenerate a Config that replaces MMSyncBN with BN and use it to build the model, but that idea is quite different from the current implementation. I'm also not sure whether the checkpoint remains compatible after such a conversion. BTW, what was the goal of MMSyncBN?

It was written a long time ago, when PyTorch did not provide SyncBN. It also supports some corner cases that are currently not supported in PyTorch.

@gaotongxiao
Collaborator Author

gaotongxiao commented Aug 17, 2021

@ZwwWayne Do we have any examples using MMSyncBN?

@ZwwWayne
Collaborator

> Do we have any examples using MMSyncBN?

Simply changing the SyncBN in the norm_cfg to MMSyncBN should work.

@ZwwWayne
Collaborator

ZwwWayne commented Aug 17, 2021

I suggest implementing a function that simply modifies the config to change SyncBN/MMSyncBN back to BN. It works for both SyncBN and MMSyncBN and meets the demands this PR wants to address. It is a little late to convert the model back after the SyncBN layers have been built.
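The suggested config-level rewrite could look roughly like the sketch below. The helper name `revert_sync_bn_in_cfg` and the plain-dict config are illustrative assumptions, not mmcv's actual API; the idea is just to walk the config and swap the `type` field before the model is ever built:

```python
def revert_sync_bn_in_cfg(cfg):
    """Recursively replace SyncBN/MMSyncBN norm types with BN in a config.

    Hypothetical sketch: operates on plain dicts/lists, as mmcv Config
    objects behave like nested dicts.
    """
    if isinstance(cfg, dict):
        return {
            k: ('BN' if k == 'type' and v in ('SyncBN', 'MMSyncBN')
                else revert_sync_bn_in_cfg(v))
            for k, v in cfg.items()
        }
    if isinstance(cfg, list):
        return [revert_sync_bn_in_cfg(v) for v in cfg]
    return cfg
```

Because the swap happens before the model is constructed, no SyncBN (or MMSyncBN) layer is ever instantiated, which avoids the initialization problem in non-distributed environments.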

@gaotongxiao
Collaborator Author

@ZwwWayne Done

@ZwwWayne ZwwWayne merged commit 642d281 into open-mmlab:master Sep 8, 2021
@zhouzaida zhouzaida mentioned this pull request Sep 8, 2021