
[Enhancement] Raise a warning when a different number of GPUs is set for resuming #1360

Open · wants to merge 3 commits into base: master

Conversation

@Junjun2016 (Contributor) commented Sep 22, 2021

Motivation

Raise a warning when a different number of GPUs is set for resuming.

Modification

The resume function in iter_based_runner.py.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of downstream repositories?
No.

Checklist

Before PR:

  • I have read and followed the workflow indicated in CONTRIBUTING.md to create this PR.
  • Pre-commit or other linting tools indicated in CONTRIBUTING.md have been used to fix potential lint issues.
  • Bug fixes are covered by unit tests; the case that caused the bug has been added to the unit tests.
  • New functionality is covered by complete unit tests; if not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, including docstrings and example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

@codecov bot commented Sep 22, 2021

Codecov Report

Merging #1360 (d055500) into master (4bab292) will decrease coverage by 0.12%.
The diff coverage is 0.00%.

❗ Current head d055500 differs from pull request most recent head 7a6d411. Consider uploading reports for the commit 7a6d411 to get more accurate results.

@@            Coverage Diff             @@
##           master    #1360      +/-   ##
==========================================
- Coverage   69.14%   69.01%   -0.13%     
==========================================
  Files         162      162              
  Lines       10746    10765      +19     
  Branches     1978     1984       +6     
==========================================
  Hits         7430     7430              
- Misses       2927     2944      +17     
- Partials      389      391       +2     
Flag        Coverage Δ
unittests   69.01% <0.00%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
mmcv/runner/iter_based_runner.py 60.71% <0.00%> (-2.25%) ⬇️
mmcv/ops/pixel_group.py 72.72% <0.00%> (-27.28%) ⬇️
mmcv/ops/contour_expand.py 75.00% <0.00%> (-25.00%) ⬇️
mmcv/utils/ext_loader.py 35.89% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bab292...7a6d411.

Comment on lines +168 to +178
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        warnings.warn(
            f'The number of GPUs was {len(previous_gpu_ids)} before '
            f'resuming, but is {self.world_size} after resuming. It is '
            'better to use the same number of GPUs for resuming.')
@zhouzaida (Member) commented Sep 23, 2021

Should we keep behavior consistent with BaseRunner and modify self._iter?

# Re-calculate the number of iterations when resuming
# models with a different number of GPUs
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        self._iter = int(self._iter * len(previous_gpu_ids) /
                         self.world_size)
        self.logger.info('the iteration number is changed due to '
                         'change of GPU number')
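For intuition, the rescaling above keeps the total number of samples seen constant: with a fixed per-GPU batch size, the effective batch size is proportional to the GPU count, so the iteration counter scales inversely with it. A minimal standalone sketch (the helper name and the example numbers are illustrative, not from this PR):

```python
def rescale_iter(resumed_iter, previous_num_gpus, current_num_gpus):
    """Re-scale the iteration counter so that the total number of samples
    seen (iters x num_gpus x batch_per_gpu) stays the same after resuming
    with a different number of GPUs."""
    return int(resumed_iter * previous_num_gpus / current_num_gpus)

# A checkpoint saved at iteration 1000 on 8 GPUs has consumed 1000 * 8
# per-GPU batches; resumed on 4 GPUs, that corresponds to iteration 2000.
print(rescale_iter(1000, 8, 4))  # -> 2000
```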

@Junjun2016 (Contributor, author) replied:

No; there is no solid conclusion about the relationship between optimizer state and batch size.

A collaborator replied:

So? Should we rethink modifying self._iter in BaseRunner?

# Re-calculate the number of iterations when resuming
# models with a different number of GPUs
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        self._iter = int(self._iter * len(previous_gpu_ids) /
                         self.world_size)
        self.logger.info('the iteration number is changed due to '
                         'change of GPU number')

# since the optimizer state is related to the batch size
# (#GPUs x bs/GPU)
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
@Junjun2016 (Contributor, author) replied:

We could also check the difference between the config stored in the checkpoint and the current config.
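One possible sketch of that idea, using plain dict operations (the helper name and example keys are assumptions for illustration, not part of this PR):

```python
def diff_configs(prev_cfg, curr_cfg):
    """Return the keys whose values differ between the config stored in
    the checkpoint and the config used for resuming, mapped to the
    (previous, current) value pair."""
    keys = set(prev_cfg) | set(curr_cfg)
    return {k: (prev_cfg.get(k), curr_cfg.get(k))
            for k in keys
            if prev_cfg.get(k) != curr_cfg.get(k)}

# Example: only gpu_ids changed between saving and resuming,
# so only that key appears in the diff.
prev = {'gpu_ids': [0, 1, 2, 3], 'optimizer': 'SGD'}
curr = {'gpu_ids': [0, 1], 'optimizer': 'SGD'}
print(diff_configs(prev, curr))  # -> {'gpu_ids': ([0, 1, 2, 3], [0, 1])}
```

Each differing key could then be reported in one warning, rather than special-casing gpu_ids alone.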

@Junjun2016 (Contributor, author) replied:

In the future.

@ZwwWayne (Collaborator) commented:

Suggest only raising a warning to remind users, because we are not sure of the users' intention.

@MeowZheng (Collaborator) commented Dec 24, 2021

Might we also raise a warning in BaseRunner, since no one takes log messages seriously?
