Add gradient cumulative optimizer #784
Conversation
Unit tests to be added.
Codecov Report
@@            Coverage Diff             @@
##           master     #784      +/-   ##
==========================================
- Coverage   62.89%   62.81%    -0.08%
  Files         144      145        +1
  Lines        8467     8702      +235
  Branches     1520     1574       +54
==========================================
+ Hits         5325     5466      +141
- Misses       2874     2971       +97
+ Partials      268      265        -3
Please fix the linting error.
- Unit tests are missing.
- The detailed implementation should be clarified. Please see my comments.
@@ -34,6 +34,27 @@ def after_train_iter(self, runner):
        runner.optimizer.step()


@HOOKS.register_module()
class GradientCumulativeOptimizerHook(OptimizerHook):
Docs are missing.
mmcv/runner/hooks/optimizer.py
Outdated
        self.cumulative_iters = cumulative_iters

    def after_train_iter(self, runner):
        runner.outputs['loss'] = runner.outputs['loss'] / self.cumulative_iters
I'm confused about the detailed implementation of this function. Could you please offer a reference for this implementation to show that it is a general case?
In my opinion, accumulative gradients are adopted to avoid large batch sizes. Is that right? If so, why should we divide by `self.cumulative_iters`?
Gradient accumulation is used to achieve an equivalent larger batch size with a small batch size, so the loss should be normalised. See more at https://discuss.pytorch.org/t/pytorch-gradient-accumulation/55955/2
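To make the normalisation argument concrete, here is a minimal self-contained sketch (not mmcv's actual hook) showing that accumulating the gradients of `loss / cumulative_iters` over several small batches reproduces the gradient of one large batch, up to BatchNorm statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1, bias=False)
ref = nn.Linear(4, 1, bias=False)
ref.load_state_dict(model.state_dict())  # identical initial weights

data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = nn.MSELoss()

# Reference: a single large batch of 8.
loss_fn(ref(data), target).backward()

# Accumulation: 4 small batches of 2, each loss divided by cumulative_iters.
cumulative_iters = 4
for chunk_x, chunk_y in zip(data.chunk(4), target.chunk(4)):
    loss = loss_fn(model(chunk_x), chunk_y) / cumulative_iters
    loss.backward()  # gradients sum across the 4 backward calls

# The accumulated gradient matches the large-batch gradient.
print(torch.allclose(model.weight.grad, ref.weight.grad, atol=1e-6))
```

Without the division, each small-batch loss would be averaged over only its own samples, so the summed gradient would be `cumulative_iters` times too large.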
Ok, got it. You may specify this in the docs.
In addition, please also fix the corner case where `total_iters % cumulative_iters != 0`.
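One hypothetical way to handle this corner case (names `loss_factor`, `divisible_iters`, and `remainder_iters` are illustrative, not necessarily the final implementation): the trailing iterations form a smaller equivalent batch, so their losses should be divided by the remainder count rather than by `cumulative_iters`:

```python
def loss_factor(cur_iter, max_iters, cumulative_iters):
    """Return the divisor for the loss at 0-based iteration `cur_iter`."""
    # Largest multiple of cumulative_iters that fits in max_iters.
    divisible_iters = max_iters // cumulative_iters * cumulative_iters
    # Leftover iterations that cannot fill a full accumulation window.
    remainder_iters = max_iters - divisible_iters
    if cur_iter < divisible_iters:
        return cumulative_iters
    return remainder_iters

# With max_iters=10 and cumulative_iters=4, iterations 0-7 use a factor
# of 4 and the trailing iterations 8-9 use a factor of 2.
print([loss_factor(i, 10, 4) for i in range(10)])  # [4, 4, 4, 4, 4, 4, 4, 4, 2, 2]
```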
Hello, isn't BatchNormalization an issue, as mentioned here: https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3#gistcomment-3381285
@ggalan87 Yes, the behavior of batch norm is different; however, not all networks contain batch norm.
As mentioned by @ggalan87
Hi @ZhiyuanChen, many thanks for your contribution. Please fix the linting error first. It seems that you have not adopted the pre-commit hook as requested in CONTRIBUTING.
See comments.
@@ -40,11 +40,22 @@ class GradientCumulativeOptimizerHook(OptimizerHook):
    def __init__(self, grad_clip=None, cumulative_iters=1):
        super(GradientCumulativeOptimizerHook, self).__init__(grad_clip)
        self.cumulative_iters = cumulative_iters
        self.divisible_ietrs = 0
Typo: should be `self.divisible_iters`
        self.initialized = False

    def _init(self, runner):
        self.divisible_ietrs = runner.max_iters // self.cumulative_iters * self.cumulative_iters
There is another corner case where users resume from `iter=2` but `cumulative_iters=4`. It seems this implementation will produce wrong gradients. If there is no good solution, please put a warning here.
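A minimal sketch of such a warning (the helper name `check_resume` is illustrative): iterations completed before the resume point cannot be re-accumulated, so if the resume iteration is not aligned to an accumulation window, the first window after resuming will mix a partial accumulation:

```python
import warnings

def check_resume(start_iter, cumulative_iters):
    """Warn when resuming mid-way through an accumulation window."""
    if start_iter % cumulative_iters != 0:
        warnings.warn(
            'Resume iter is not divisible by cumulative_iters; gradients of '
            'the first accumulation window after resuming may be incorrect.')

check_resume(2, 4)  # warns: 2 % 4 != 0
check_resume(4, 4)  # silent: resume is aligned to a full window
```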
        self.remainder_iters = 0
        self.initialized = False

    def _init(self, runner):
Please put a warning for the usage of BN.
* Add gradient cumulative optimizer, fixes #190
* Update optimizer.py
* Update optimizer.py
* Fix loss scale improperly in last equivalent_iter
* Add `GradientCumulativeOptimizerHook` in `__init__.py`
* Add docstring of `GradientCumulativeOptimizerHook`
* Add type check, BN warning and resume warning; fix typo, lint the code
* Add unit test
* Update docstring example
* Change `GradientCumulativeOptimizerHook` `__init__` arguments
* Add `GradientCumulativeOptimizerHook` unit tests with IterBasedRunner
* Add `GradientCumulativeFp16OptimizerHook`
* Add unit tests of `GradientCumulativeFp16OptimizerHook`
* Use `!=` instead of `>` to determine resume

Co-authored-by: Zhiyuan Chen <this@zyc.ai>
closed by #1221
fixes #190