
[Enhancement] Raise a warning when a different number of GPUs is set for resuming #1360

Open · wants to merge 3 commits into base: master

Conversation

@Junjun2016 (Contributor) commented Sep 22, 2021

Motivation

Raise a warning when a different number of GPUs is set for resuming.

Modification

The resume function in iter_based_runner.py.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of downstream repositories?
No.

Checklist

Before PR:

  • I have read and followed the workflow indicated in CONTRIBUTING.md to create this PR.
  • Pre-commit or other linting tools indicated in CONTRIBUTING.md have been used to fix potential lint issues.
  • Bug fixes are covered by unit tests; the case that caused the bug has been added to the unit tests.
  • New functionality is covered by complete unit tests; if not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, including docstrings and example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

@codecov bot commented Sep 22, 2021

Codecov Report

Merging #1360 (d055500) into master (4bab292) will decrease coverage by 0.12%.
The diff coverage is 0.00%.

❗ Current head d055500 differs from pull request most recent head 7a6d411. Consider uploading reports for the commit 7a6d411 to get more accurate results.

@@            Coverage Diff             @@
##           master    #1360      +/-   ##
==========================================
- Coverage   69.14%   69.01%   -0.13%     
==========================================
  Files         162      162              
  Lines       10746    10765      +19     
  Branches     1978     1984       +6     
==========================================
  Hits         7430     7430              
- Misses       2927     2944      +17     
- Partials      389      391       +2     
Flag        Coverage Δ
unittests   69.01% <0.00%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
mmcv/runner/iter_based_runner.py 60.71% <0.00%> (-2.25%) ⬇️
mmcv/ops/pixel_group.py 72.72% <0.00%> (-27.28%) ⬇️
mmcv/ops/contour_expand.py 75.00% <0.00%> (-25.00%) ⬇️
mmcv/utils/ext_loader.py 35.89% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bab292...7a6d411.

Comment on lines +168 to +178
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        warnings.warn(
            f'The number of GPUs was {len(previous_gpu_ids)} before '
            f'resuming, but is {self.world_size} after resuming. It is '
            'better to use the same number of GPUs for resuming.')
@zhouzaida (Member) commented Sep 23, 2021

Should we keep behavior consistent with BaseRunner and modify self._iter?

# Re-calculate the number of iterations when resuming
# models with a different number of GPUs
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        self._iter = int(self._iter * len(previous_gpu_ids) /
                         self.world_size)
        self.logger.info('the iteration number is changed due to '
                         'change of GPU number')
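For intuition, the rescaling above keeps the total number of samples seen constant: with a fixed per-GPU batch size, the effective batch size is proportional to the GPU count, so the iteration counter scales inversely with it. A minimal standalone sketch (the helper name and the example numbers are illustrative, not from this PR):

```python
def rescale_iter(resumed_iter, previous_num_gpus, current_num_gpus):
    """Re-scale the iteration counter so that the total number of samples
    seen (iters x num_gpus x batch_per_gpu) stays the same after resuming
    with a different number of GPUs."""
    return int(resumed_iter * previous_num_gpus / current_num_gpus)

# A checkpoint saved at iteration 1000 on 8 GPUs has consumed 1000 * 8
# per-GPU batches; resumed on 4 GPUs, that corresponds to iteration 2000.
print(rescale_iter(1000, 8, 4))  # -> 2000
```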

@Junjun2016 (Contributor, author) replied:

No; there is no solid conclusion about the relationship between optimizer state and batch size.

A collaborator replied:

So? Should we rethink modifying self._iter in BaseRunner?

# Re-calculate the number of iterations when resuming
# models with a different number of GPUs
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        self._iter = int(self._iter * len(previous_gpu_ids) /
                         self.world_size)
        self.logger.info('the iteration number is changed due to '
                         'change of GPU number')

# since the optimizer state is related to the batch size
# (#GPUs x bs/GPU)
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
@Junjun2016 (Contributor, author) replied:

We could also check the difference between the config stored in the checkpoint and the current config.
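One possible sketch of that idea, using plain dict operations (the helper name and example keys are assumptions for illustration, not part of this PR):

```python
def diff_configs(prev_cfg, curr_cfg):
    """Return the keys whose values differ between the config stored in
    the checkpoint and the config used for resuming, mapped to the
    (previous, current) value pair."""
    keys = set(prev_cfg) | set(curr_cfg)
    return {k: (prev_cfg.get(k), curr_cfg.get(k))
            for k in keys
            if prev_cfg.get(k) != curr_cfg.get(k)}

# Example: only gpu_ids changed between saving and resuming,
# so only that key appears in the diff.
prev = {'gpu_ids': [0, 1, 2, 3], 'optimizer': 'SGD'}
curr = {'gpu_ids': [0, 1], 'optimizer': 'SGD'}
print(diff_configs(prev, curr))  # -> {'gpu_ids': ([0, 1, 2, 3], [0, 1])}
```

Each differing key could then be reported in one warning, rather than special-casing gpu_ids alone.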

@Junjun2016 (Contributor, author) replied:

In the future.

@ZwwWayne (Collaborator) commented:

Suggest only raising a warning to remind users, because we are not sure of the users' intention.

@MeowZheng (Collaborator) commented Dec 24, 2021

Might we also raise a warning in BaseRunner, since no one takes log messages seriously?
