The paper says you may need to run the resulting net over the training set to recompute the batch-norm statistics as well. In Leela Zero this did not seem strictly necessary, though.
Since SWA was successful for Leela Zero in producing stronger network weights (see leela-zero/leela-zero#814, leela-zero/leela-zero#1030), I want to record this as a possible improvement here.
What is Stochastic Weight Averaging?
Izmailov et al. (2018) observed that SGD explores regions of the weight space where high-performing networks lie, but tends not to converge to the central points of those regions. By tracking a running average of the weights traversed by SGD, they found better-performing weights than those found by SGD alone.
They also demonstrate that SWA leads to solutions in wider optima, a property conjectured to be important for generalization.
(Figure from the paper: comparison of SWA and SGD with a ResNet-110 on CIFAR-100.)
Implementation
The implementation is straightforward: in addition to the current weight vector, we only need to maintain a running average of the weights.
Since we use batch normalization, we also need to recompute the running means and variances of the batch-norm layers for the averaged network; these statistics are not weights, so they cannot simply be averaged along with the rest.
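As a rough sketch of what recomputing those statistics involves (the per-channel activations and the momentum value are illustrative assumptions, not this repo's actual code):

```python
def recompute_bn_stats(batches, momentum=0.1):
    """Rebuild running mean/variance for one batch-norm channel from scratch.

    batches: iterable of lists of floats, the channel's activations per batch.
    After averaging the weights, the old running statistics no longer match
    the network, so we reset them and re-estimate over the training data
    using the usual exponential-moving-average update.
    """
    running_mean, running_var = 0.0, 1.0  # common reset values
    for batch in batches:
        m = sum(batch) / len(batch)
        v = sum((x - m) ** 2 for x in batch) / len(batch)
        running_mean = (1 - momentum) * running_mean + momentum * m
        running_var = (1 - momentum) * running_var + momentum * v
    return running_mean, running_var
```

PyTorch, for example, ships a helper for exactly this step, `torch.optim.swa_utils.update_bn`, which performs a single forward pass over a data loader to refresh the statistics of the averaged model.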
The algorithm is given as pseudocode in the paper.
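The averaging step itself can be sketched in a few lines of plain Python (the weight vectors here are bare lists of floats standing in for real network parameters):

```python
def swa_update(swa_weights, current_weights, n_models):
    """Fold the current weight vector into the running SWA average.

    swa_weights: running average over the n_models snapshots seen so far.
    current_weights: the weights just produced by SGD.
    Returns the average over n_models + 1 snapshots.
    """
    return [
        (swa * n_models + w) / (n_models + 1)
        for swa, w in zip(swa_weights, current_weights)
    ]

# Example: folding in snapshots one at a time yields their plain mean.
snapshots = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
avg = snapshots[0]
for n, w in enumerate(snapshots[1:], start=1):
    avg = swa_update(avg, w, n)
print(avg)  # → [2.0, 2.0]
```

In practice the snapshots are taken at the end of each learning-rate cycle (or every epoch with a constant rate), rather than every SGD step.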
The authors recommend starting from a pretrained model before beginning to average the weights. We get this for free, since we always initialize from the last best network.