Normalization Techniques in Training DNNs: Methodology, Analysis and Application #73

Metadata

The first author's work focuses mainly on normalization; some of his recent work:

TL;DR

This paper reviews the past, present, and future of normalization methods for DNN training, and aims to answer the following questions:

  1. What are the main motivations behind different normalization methods in DNNs, and how can we present a taxonomy for understanding the similarities and differences between a wide variety of approaches?
  2. How can we reduce the gap between the empirical success of normalization techniques and our theoretical understanding of them?
  3. What recent advances have been made in designing/tailoring normalization techniques for different tasks, and what are the main insights behind them?

Introduction

Normalization techniques typically serve as a "layer" between the learnable weights and the activations in DNN architectures.
More importantly, they have advanced deep learning research and become an essential module in DNN architectures for various applications; for example, Layer Normalization (LN) is used in Transformers for NLP, and Spectral Normalization (SN) is used in the discriminator of GANs for generative modeling.

Question 1

Five normalization operations considered

  • Centering: Makes the input zero-mean.
  • Scaling: Makes the input unit-variance.
  • Decorrelating: Makes the dimensions of the input uncorrelated (i.e., zeroes the off-diagonals of the covariance matrix).
  • Standardization: Composition of centering and scaling.
  • Whitening: Makes the input a spherical Gaussian distribution. Composition of standardization and decorrelating, also called PCA whitening (see the NumPy sketch below).
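A minimal NumPy sketch of these operations on a data matrix X of shape (n, d); the function names and the eps stabilizer are my own, not from the paper:

```python
import numpy as np

def centering(X):
    # Subtract the per-dimension mean.
    return X - X.mean(axis=0, keepdims=True)

def scaling(X, eps=1e-5):
    # Divide by the per-dimension standard deviation.
    return X / np.sqrt(X.var(axis=0, keepdims=True) + eps)

def standardization(X, eps=1e-5):
    # Centering followed by scaling.
    return scaling(centering(X), eps)

def decorrelating(X):
    # Rotate onto the eigenbasis of the covariance so its off-diagonals vanish.
    Xc = centering(X)
    cov = Xc.T @ Xc / Xc.shape[0]
    _, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs

def whitening(X, eps=1e-5):
    # PCA whitening: decorrelate, then scale each component to unit variance.
    Xc = centering(X)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs / np.sqrt(eigvals + eps)
```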

Motivation of normalization

Convergence has been shown to be related to the statistics of the input to a linear model; e.g., if the Hessian of a linear model's loss is the identity matrix, the model converges within one iteration of full-batch gradient descent (GD); a toy check follows the list below. Several normalizations are discussed:

  1. Normalizing the activations (non-learnable or learnable)
  2. Normalizing the weights with a constrained distribution such that the activations' gradients are implicitly normalized; these methods are inspired by weight normalization and extended to satisfy the desired properties during training.
  3. Normalizing the gradients to exploit the curvature information for GD/SGD.
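The toy check referenced above: a sketch under my own assumptions (least-squares loss L(w) = ||Xw - y||^2 / (2n), whose Hessian is X^T X / n, and learning rate 1), showing that whitening the input makes one full GD step land on the optimum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Whiten X so the Hessian X^T X / n becomes (numerically) the identity matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = Xc @ eigvecs / np.sqrt(eigvals)

w = np.zeros(d)
grad = X_white.T @ (X_white @ w - y) / n   # full-batch gradient at w = 0
w -= 1.0 * grad                            # one GD step with learning rate 1

w_star = np.linalg.lstsq(X_white, y, rcond=None)[0]
print(np.allclose(w, w_star))              # True: converged in one iteration
```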

Normalization framework Π -> Φ -> Ψ

Take batch normalization (BN; ICML'15) for example: for a given channel-first input batch X with shape (c, b, h, w) (a PyTorch sketch follows this list),

  1. Normalization area partitioning (Π): (c, b, h, w) -> (c, b*h*w)
  2. Normalization operation (Φ): standardization along the last dimension of (c, b*h*w)
  3. Normalization representation recovery (Ψ): affine transformation of the normalized X with learnable parameters.
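A minimal PyTorch sketch of training-mode BN decomposed into these three stages, keeping the paper's (c, b, h, w) layout; it has no running statistics or inference path, so it is not a drop-in BN layer:

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x follows the paper's layout (c, b, h, w); gamma/beta are (c,) parameters.
    c = x.shape[0]
    # Pi -- normalization area partitioning: (c, b, h, w) -> (c, b*h*w),
    # i.e., each channel is one normalization area.
    area = x.reshape(c, -1)
    # Phi -- normalization operation: standardize along the last dimension.
    mean = area.mean(dim=1, keepdim=True)
    var = area.var(dim=1, unbiased=False, keepdim=True)
    normalized = (area - mean) / torch.sqrt(var + eps)
    # Psi -- representation recovery: per-channel learnable affine transform.
    out = gamma.view(c, 1) * normalized + beta.view(c, 1)
    return out.reshape(x.shape)

x = torch.randn(3, 8, 16, 16)                     # (c, b, h, w)
y = batch_norm(x, torch.ones(3), torch.zeros(3))  # identity affine parameters
```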

Several weaknesses of BN

  1. Inconsistency between training and inference limits its usage in complex networks such as RNNs and GANs (a quick demo follows this list).
  2. It suffers in the small-batch-size setting (e.g., object detection and segmentation).
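A quick PyTorch illustration of weakness 1 (a toy setup of my own, not from the paper): the same BN layer maps the same batch differently in training mode (batch statistics) and in eval mode (running statistics).

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm2d(num_features=3)  # running stats start at mean=0, var=1
x = torch.randn(8, 3, 16, 16)              # PyTorch layout (b, c, h, w)

bn.train()
y_train = bn(x)   # normalized with this batch's statistics; running stats updated
bn.eval()
y_eval = bn(x)    # normalized with the running estimates instead

print(torch.allclose(y_train, y_eval))     # False: train/inference mismatch
```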
To address these weaknesses of BN, several normalization variants have been proposed; we discuss them below under the framework.

Normalization area partitioning

  • Layer normalization (LN; arXiv'16): (c, b, h, w) -> (b, c*h*w). Widely used in NLP.
  • Group normalization (GN; ECCV'18): (c, b, h, w) -> (b*g, s*h*w), where g is the number of groups along the channel dimension and s = c/g is the number of channels per group. When g=1, GN reduces to LN. Widely used in object detection and segmentation.
  • Instance normalization (IN; arXiv'16): (c, b, h, w) -> (b*c, h*w). Widely used in image style transfer.
  • Position normalization (PN; NeurIPS'19): (c, b, h, w) -> (b*h*w, c). Designed to deal with spatial information and has the potential to enhance generative models.
  • Batch-group normalization (BGN; ICLR'20): (c, b, h, w) -> (g_b*g_c, s_b*s_c*h*w), where s_b = b/g_b and s_c = c/g_c. Extends GN by also grouping the batch dimension (see the reshape sketch below).
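A sketch of the Π step for these variants as plain reshapes on the paper's (c, b, h, w) layout; the partition helper and the exact element grouping chosen for BGN are my own reading, with g, g_b, g_c as the methods' group hyperparameters:

```python
import torch

def partition(x, method, g=2, g_b=2, g_c=2):
    # x: (c, b, h, w) as in the paper; returns (num_areas, area_size).
    c, b, h, w = x.shape
    if method == "BN":    # one area per channel
        return x.reshape(c, b * h * w)
    if method == "LN":    # one area per sample
        return x.permute(1, 0, 2, 3).reshape(b, c * h * w)
    if method == "GN":    # one area per (sample, group of s = c/g channels)
        return x.permute(1, 0, 2, 3).reshape(b * g, (c // g) * h * w)
    if method == "IN":    # one area per (sample, channel)
        return x.permute(1, 0, 2, 3).reshape(b * c, h * w)
    if method == "PN":    # one area per spatial position
        return x.permute(1, 2, 3, 0).reshape(b * h * w, c)
    if method == "BGN":   # group both the batch and the channel dimension
        s_b, s_c = b // g_b, c // g_c
        return (x.reshape(g_c, s_c, g_b, s_b, h, w)
                 .permute(2, 0, 3, 1, 4, 5)
                 .reshape(g_b * g_c, s_b * s_c * h * w))
    raise ValueError(f"unknown method: {method}")

x = torch.randn(6, 4, 8, 8)            # (c, b, h, w)
print(partition(x, "GN", g=3).shape)   # torch.Size([12, 128])
```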

TBD
