Regularizers

Regularizers add extra penalties or constraints on the network parameters to restrict the model complexity. The corresponding concept in Caffe is weight decay. Regularizers and weight decays are equivalent in back-propagation. The conceptual difference in the forward pass is that a weight decay is not considered part of the objective function; however, to save computation, Mocha also omits the forward computation for regularizers by default. We use the term regularization instead of weight decay because it generalizes more naturally to sparse, group-sparse, or even more complicated structural regularizations.

All regularizers have the property coefficient, which is the regularization coefficient. During training, a global regularization coefficient can also be specified (see user-guide/solver); it globally scales all local regularization coefficients.
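As a small illustration of how the two coefficients combine (this is not actual Mocha code; the variable names below are made up), the effective coefficient for a parameter blob is simply the product of the global and local values:

.. code-block:: julia

    # Illustrative sketch only: how a global regularization coefficient scales a
    # regularizer's local `coefficient` property. Variable names are hypothetical.
    local_coefficient = 0.0005   # the regularizer's own coefficient property
    global_regu_coef  = 10.0     # global coefficient specified on the solver
    effective_lambda  = global_regu_coef * local_coefficient
    println(effective_lambda)    # prints 0.005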

NoRegu is a regularizer that imposes no regularization.

L2 regularizer. The parameter blob W is treated as a 1D vector. During the forward pass, the squared L2-norm ‖W‖₂² = ⟨W, W⟩ is computed, and λ‖W‖₂² is added to the objective function, where λ is the regularization coefficient. During the backward pass, 2λW is added to the parameter gradient, enforcing a weight decay as the solver moves the parameters in the negative gradient direction.
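The following Julia sketch spells out this arithmetic; it is illustrative only (the function names are made up and this is not Mocha's actual implementation):

.. code-block:: julia

    using LinearAlgebra

    # Sketch of the L2 regularizer arithmetic described above (illustrative only).
    # Forward pass: compute the penalty λ⟨W, W⟩ that is added to the objective.
    function l2_forward(W::AbstractArray, λ::Real)
        w = vec(W)                  # treat the parameter blob as a 1D vector
        return λ * dot(w, w)
    end

    # Backward pass: add the weight-decay term 2λW to the parameter gradient.
    function l2_backward!(gradient::AbstractArray, W::AbstractArray, λ::Real)
        gradient .+= 2 .* λ .* W
        return gradient
    end

    W    = randn(4, 3)
    grad = zeros(4, 3)
    penalty = l2_forward(W, 0.01)   # added to the objective in the forward pass
    l2_backward!(grad, W, 0.01)     # weight decay applied in the backward pass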

Note

In Caffe, only λW is added as a weight decay during back-propagation, which is equivalent to an L2 regularizer with coefficient 0.5λ.

L1 regularizer. The parameter blob W is treated as a 1D vector. During the forward pass, the L1-norm


‖W‖₁ = ∑ᵢ |Wᵢ|

is computed, and λ‖W‖₁ is added to the objective function. During the backward pass, λ·sign(W) is added to the parameter gradient. The L1 regularizer has the property of encouraging sparsity in the parameters.
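A matching Julia sketch of this arithmetic (again illustrative only, with made-up function names, not Mocha's actual implementation):

.. code-block:: julia

    # Sketch of the L1 regularizer arithmetic described above (illustrative only).
    # Forward pass: compute the penalty λ∑ᵢ|Wᵢ| that is added to the objective.
    l1_forward(W::AbstractArray, λ::Real) = λ * sum(abs, W)

    # Backward pass: add λ·sign(W) to the parameter gradient
    # (sign(W) is used as a subgradient; sign(0) = 0 in Julia).
    function l1_backward!(gradient::AbstractArray, W::AbstractArray, λ::Real)
        gradient .+= λ .* sign.(W)
        return gradient
    end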