Gradient Accumulation #345

Answered by rwightman
ademyanchuk asked this question in Q&A

@ademyanchuk I don't use it because these models all use BatchNorm by default. Gradient accumulation isn't a clear win with BatchNorm: accumulating gradients does not increase the effective batch size for the BN running-statistics calculation, so the small per-step batches can still cause instability or poor results. One could use GroupNorm instead, and several models do support switching the norm layer quite easily, but that's not something I've experimented with.
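
To make the mechanics concrete, here is a minimal plain-PyTorch gradient-accumulation sketch (the toy model, data, and `accum_steps` value are illustrative, not timm's training loop). The caveat from the answer above is visible in the loop: BatchNorm updates its running statistics during each forward pass, so it only ever sees the small micro-batch, no matter how many steps the gradients are accumulated over.

```python
import torch
import torch.nn as nn

# Toy model containing a BatchNorm layer; accum_steps=4 simulates a
# 4x larger batch for the *gradients*, but NOT for the BN statistics.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accum_steps = 4

# Fake data: 16 micro-batches of 8 samples each.
data = [(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
        for _ in range(16)]

optimizer.zero_grad()
for step, (images, targets) in enumerate(data):
    # BatchNorm updates its running stats here, from only 8 samples,
    # regardless of accum_steps -- this is the issue described above.
    loss = criterion(model(images), targets)
    # Scale so the accumulated gradient matches one large-batch step.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```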
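And a hedged sketch of the norm-layer swap mentioned above. Several timm models (the ResNet family among them) accept a `norm_layer` argument in their constructor, which `timm.create_model` forwards; check the specific model before relying on this. The 32-group choice is a common GroupNorm default, not something the answer prescribes, and pretrained BatchNorm weights will not transfer, so `pretrained=False` is used here.

```python
import torch
import torch.nn as nn
import timm

def group_norm(num_features, **kwargs):
    # 32 groups is a common GroupNorm default; tune per architecture.
    return nn.GroupNorm(32, num_features)

# Swap BatchNorm for GroupNorm via the constructor override.
model = timm.create_model('resnet50', pretrained=False,
                          norm_layer=group_norm)

# GroupNorm keeps no running batch statistics, so the per-step batch
# size no longer affects normalization and gradient accumulation
# approximates a larger batch faithfully.
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)  # torch.Size([2, 1000])
```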
