What happens when EMAHook and GradientCumulativeOptimizerHook are both used? #1509
-
I am trying to reproduce the results of ConvNeXt-Tiny on my GPU server. Since I only have a small number of GPUs, I used gradient accumulation, expecting it to be equivalent to a larger batch size since there are no batch norm layers in ConvNeXt. However, I couldn't reproduce the results (I got around 81.6%~81.7%). After looking at the config file, I realized that since GradientCumulativeOptimizerHook delays the optimizer update for 8 iterations, EMAHook is still called 8 times per parameter update (which would be similar to momentum=8e-4). Is this concern valid? If it is, is there a possible fix? Snippet of my config:
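To make the arithmetic behind the concern concrete, here is a minimal standalone sketch (not the config itself; the momentum value 1e-4 is only an assumed example): applying the EMA update `ema = (1 - m) * ema + m * param` 8 times to a frozen parameter is, for small `m`, nearly identical to a single update with momentum `8 * m`.

```python
# Minimal sketch of the concern (illustrative values, not the real config).
# EMA update rule: ema = (1 - m) * ema + m * param
m = 1e-4          # assumed per-iteration EMA momentum
k = 8             # cumulative_iters: the parameter is frozen for 8 iterations

ema, param = 0.0, 1.0

# EMAHook fires every iteration, but param only changes every k iterations,
# so the same param value is blended into the EMA k times in a row.
for _ in range(k):
    ema = (1 - m) * ema + m * param
print(ema)        # ~7.997e-4, i.e. almost exactly one update with momentum k*m

# Single update with the effective momentum k * m, for comparison.
print(k * m * param)   # 8e-4
```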
Replies: 1 comment 1 reply
-
I am very sorry, but we have not done the relevant experiments. I don't think the biggest problem is the EMA: generally speaking, EMA only gives a gain of about 0.2%, and more frequent EMA updates should not reduce the effectiveness of the model.
Maybe you can try `--auto-scale` without `GradientCumulativeOptimizerHook`. Our result was obtained with only 16 GPUs, which is different from the 32 GPUs in the official paper.
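If you do want to keep gradient accumulation, one possible workaround (a sketch we have not verified experimentally, with illustrative values taken from your description) is to keep the EMA update in step with the effective optimizer step, either via EMAHook's `interval` argument or by dividing the momentum by the number of accumulated iterations:

```python
# Sketch of a possible workaround (not verified experimentally).
# cumulative_iters=8 and momentum=1e-4 are illustrative, from the question above.
cumulative_iters = 8

optimizer_config = dict(
    type='GradientCumulativeOptimizerHook',
    cumulative_iters=cumulative_iters,
)

custom_hooks = [
    # Option A: only update the EMA once per effective optimizer step.
    dict(type='EMAHook', momentum=1e-4, interval=cumulative_iters),
    # Option B (alternative): keep interval=1 but shrink the momentum so the
    # effective momentum per optimizer step stays roughly the same.
    # dict(type='EMAHook', momentum=1e-4 / cumulative_iters, interval=1),
]
```

Both options target the same effective momentum per parameter update; Option A is closer to how the run would behave without accumulation, since the EMA then only ever sees parameter values that have actually changed.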