# Batch Normalization
One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. 
One idea along these lines is batch normalization, which was proposed by [1] in 2015.

The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.
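As a small illustration of the point above, the following sketch standardizes each feature of a synthetic data matrix and then pushes it through one random affine + ReLU layer. The data, the layer width, and the weight matrix `W` are made up for illustration and are not part of the assignment code. Note that per-feature standardization only centers and scales; it does not decorrelate the features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs (illustrative): 1000 examples, 4 features,
# deliberately given a nonzero mean and a non-unit scale.
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# Per-feature standardization: zero mean and unit variance along the batch axis.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# One hidden layer with random weights (hypothetical layer, for illustration only).
W = rng.normal(size=(4, 8))
H = np.maximum(0, X_std @ W)

print("input means :", X_std.mean(axis=0).round(3))
print("input stds  :", X_std.std(axis=0).round(3))
print("hidden means:", H.mean(axis=0).round(3))
print("hidden stds :", H.std(axis=0).round(3))
```

The inputs come out standardized, but the hidden activations do not: ReLU outputs are non-negative, so their means are shifted away from zero even though the layer's input was well behaved.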

The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.

This normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit-variance. To avoid this, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
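The description above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reference solution for `batchnorm_forward`: the function name, the `state` dictionary, the `momentum` and `eps` defaults, and the choice to track a running variance (rather than a running standard deviation) are all assumptions made here.

```python
import numpy as np

def batchnorm_forward_sketch(x, gamma, beta, state, mode="train",
                             momentum=0.9, eps=1e-5):
    """Hypothetical sketch of a batch normalization forward pass."""
    if mode == "train":
        # Estimate per-feature statistics from the current minibatch.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        # Exponentially decaying running averages, kept for test time.
        state["running_mean"] = momentum * state["running_mean"] + (1 - momentum) * mu
        state["running_var"] = momentum * state["running_var"] + (1 - momentum) * var
    else:
        # At test time, normalize with the running averages instead.
        x_hat = (x - state["running_mean"]) / np.sqrt(state["running_var"] + eps)
    # Learnable scale (gamma) and shift (beta) for each feature dimension.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
D = 3
state = {"running_mean": np.zeros(D), "running_var": np.ones(D)}
gamma, beta = np.ones(D), np.zeros(D)

# Warm up the running averages on data with mean 10 and std 4.
for _ in range(200):
    batch = 4.0 * rng.normal(size=(64, D)) + 10.0
    out_train = batchnorm_forward_sketch(batch, gamma, beta, state, mode="train")

test_batch = 4.0 * rng.normal(size=(64, D)) + 10.0
out_test = batchnorm_forward_sketch(test_batch, gamma, beta, state, mode="test")

print("train-mode means:", out_train.mean(axis=0))
print("test-mode means :", out_test.mean(axis=0))
print("running mean    :", state["running_mean"])
```

In train mode the output means are zero up to floating-point error; in test mode they are only approximately zero, because the running averages are estimates rather than the statistics of the batch being normalized.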

[1] [Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.](https://arxiv.org/abs/1502.03167)

In [1]:
# As usual, a bit of setup
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def print_mean_std(x,axis=0):
    print('  means: ', x.mean(axis=axis))
    print('  stds:  ', x.std(axis=axis))
    print() 

In [2]:
# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()
for k, v in data.items():
  print('%s: ' % k, v.shape)

X_train:  (49000, 3, 32, 32)
y_train:  (49000,)
X_val:  (1000, 3, 32, 32)
y_val:  (1000,)
X_test:  (1000, 3, 32, 32)
y_test:  (1000,)


## Batch normalization: forward
In the file `cs231n/layers.py`, implement the batch normalization forward pass in the function `batchnorm_forward`. Once you have done so, run the following to test your implementation.

Consulting the paper [1] linked above may be helpful!

In [12]:
# Check the training-time forward pass by checking means and variances
# of features both before and after batch normalization   

# Simulate the forward pass for a two-layer network
np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)

print('Before batch normalization:')
print_mean_std(a,axis=0)

gamma = np.ones((D3,))
beta = np.zeros((D3,))
# Means should be close to zero and stds close to one
print('After batch normalization (gamma=1, beta=0)')
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=0)

gamma = np.asarray([1.0, 2.0, 3.0])
beta = np.asarray([11.0, 12.0, 13.0])
# Now means should be close to beta and stds close to gamma
print('After batch normalization (gamma=', gamma, ', beta=', beta, ')')
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=0)

Before batch normalization:
  means:  [ -2.3814598  -13.18038246   1.91780462]
  stds:   [27.18502186 34.21455511 37.68611762]

After batch normalization (gamma=1, beta=0)
  means:  [ 1.55431223e-17  7.88258347e-17 -3.40005801e-18]
  stds:   [0.99749686 0.99749686 0.99749686]

After batch normalization (gamma= [1. 2. 3.] , beta= [11. 12. 13.] )
  means:  [11. 12. 13.]
  stds:   [0.99749686 1.99499373 2.99249059]



In [34]:
# Check the test-time forward pass by running the training-time
# forward pass many times to warm up the running averages, and then
# checking the means and variances of activations after a test-time
# forward pass.

np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)

bn_param = {'mode': 'train'}
gamma = np.ones(D3)
beta = np.zeros(D3)

for t in range(50):
  X = np.random.randn(N, D1)
  a = np.maximum(0, X.dot(W1)).dot(W2)
  batchnorm_forward(a, gamma, beta, bn_param)

bn_param['mode'] = 'test'
X = np.random.randn(N, D1)
a = np.maximum(0, X.dot(W1)).dot(W2)
a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)

# Means should be close to zero and stds close to one, but will be
# noisier than training-time forward passes.
print('After batch normalization (test-time):')
print_mean_std(a_norm,axis=0)

After batch normalization (test-time):
  means:  [-0.03917523 -0.04338266 -0.10426524]
  stds:   [1.01277281 1.0098496  0.97575131]



## Batch normalization: backward
Now implement the backward pass for batch normalization in the function `batchnorm_backward`.

To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.
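One possible way to stage the computation graph is sketched below, mainly to show where gradients from multiple branches get summed. The function names, the cache layout, and the small finite-difference checker are assumptions made for this illustration; they are not the assignment's `batchnorm_backward` or its gradient-check utilities.

```python
import numpy as np

def bn_forward_staged(x, gamma, beta, eps=1e-5):
    """Forward pass split into simple nodes (train mode only)."""
    mu = x.mean(axis=0)
    xc = x - mu                      # centered input; feeds both var and x_hat
    var = (xc ** 2).mean(axis=0)
    std = np.sqrt(var + eps)
    x_hat = xc / std
    out = gamma * x_hat + beta
    return out, (xc, std, x_hat, gamma)

def bn_backward_staged(dout, cache):
    """Backprop node by node, summing gradients where branches meet."""
    xc, std, x_hat, gamma = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    dx_hat = dout * gamma
    dxc = dx_hat / std                            # branch 1: x_hat = xc / std
    dstd = -(dx_hat * xc).sum(axis=0) / std ** 2
    dvar = 0.5 * dstd / std                       # std = sqrt(var + eps)
    dxc = dxc + 2.0 * xc * dvar / N               # branch 2: var = mean(xc ** 2)
    dmu = -dxc.sum(axis=0)                        # xc = x - mu
    dx = dxc + dmu / N                            # direct path plus path through mu
    return dx, dgamma, dbeta

def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
N, D = 4, 5
x = 5 * rng.normal(size=(N, D)) + 12
gamma, beta = rng.normal(size=D), rng.normal(size=D)
dout = rng.normal(size=(N, D))

_, cache = bn_forward_staged(x, gamma, beta)
dx, dgamma, dbeta = bn_backward_staged(dout, cache)
dx_num = numerical_grad(
    lambda v: np.sum(bn_forward_staged(v, gamma, beta)[0] * dout), x)
print("max abs dx difference:", np.max(np.abs(dx - dx_num)))
```

The centered input `xc` is the node with two outgoing branches here (into the variance and into the normalized output), and `x` itself reaches the output both directly and through the mean, which is why `dxc` and `dx` are each assembled from two terms.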

Once you have finished, run the following to numerically check your backward pass.

In [35]:
# Gradient check batchnorm backward pass
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {'mode': 'train'}
fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
fg = lambda a: batchnorm_forward(x, a, beta, bn_param)[0]
fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
# You should expect to see relative errors between 1e-13 and 1e-8
print('dx error: ', rel_error(dx_num, dx))
print('dgamma error: ', rel_error(da_num, dgamma))
print('dbeta error: ', rel_error(db_num, dbeta))

dx error:  1.6934271864958244e-09
dgamma error:  1.1188362943000848e-12
dbeta error:  2.379446949959628e-12


## Batch normalization: alternative backward
In class we talked about two different implementations for the sigmoid backward pass. One strategy is to write out a computation graph composed of simple operations and backprop through all intermediate values. Another strategy is to work out the derivatives on paper. For example, you can derive a very simple formula for the sigmoid function's backward pass by simplifying gradients on paper.

Surprisingly, it turns out that you can do a similar simplification for the batch normalization backward pass too!  

In the forward pass, given a set of inputs $X=\begin{bmatrix}x_1\\x_2\\\vdots\\x_N\end{bmatrix}$, 

we first calculate the mean $\mu$ and variance $v$.
With $\mu$ and $v$ calculated, we can calculate the standard deviation $\sigma$  and normalized data $Y$.
The equations and graph illustration below describe the computation ($y_i$ is the i-th element of the vector $Y$).

\begin{align}
& \mu=\frac{1}{N}\sum_{k=1}^N x_k  &  v=\frac{1}{N}\sum_{k=1}^N (x_k-\mu)^2 \\
& \sigma=\sqrt{v+\epsilon}         &  y_i=\frac{x_i-\mu}{\sigma}
\end{align}

<img src="notebook_images/batchnorm_graph.png" width=691 height=202>

The meat of our problem during backpropagation is to compute $\frac{\partial L}{\partial X}$, given the upstream gradient we receive, $\frac{\partial L}{\partial Y}$. To do this, recall that the chain rule gives us $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial X}$.

The unknown (hard) part is $\frac{\partial Y}{\partial X}$. We can find it by first deriving, step by step, the local gradients 
$\frac{\partial v}{\partial X}$, $\frac{\partial \mu}{\partial X}$,
$\frac{\partial \sigma}{\partial v}$, 
$\frac{\partial Y}{\partial \sigma}$, and $\frac{\partial Y}{\partial \mu}$,
and then using the chain rule to compose these gradients (which appear in the form of vectors!) appropriately to compute $\frac{\partial Y}{\partial X}$.

If it's challenging to reason directly about the gradients over $X$ and $Y$, which require matrix multiplication, try reasoning in terms of the individual elements $x_i$ and $y_i$ first. In that case, you need to derive $\frac{\partial L}{\partial x_i}$ by using the chain rule to first calculate the intermediates $\frac{\partial \mu}{\partial x_i}, \frac{\partial v}{\partial x_i}, \frac{\partial \sigma}{\partial x_i},$ and then assembling these pieces to calculate $\frac{\partial y_i}{\partial x_i}$. 
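For reference, here is one way the element-wise intermediates can work out under the definitions of $\mu$, $v$, $\sigma$, and $y_i$ given earlier. It is offered as a check on your own derivation rather than as the official solution. The symbol $\delta_{ij}$ (equal to 1 if $i=j$ and 0 otherwise) is notation introduced here, and the expression for $\frac{\partial v}{\partial x_i}$ uses the fact that $\sum_k (x_k-\mu)=0$.

\begin{align}
& \frac{\partial \mu}{\partial x_i}=\frac{1}{N}  &  \frac{\partial v}{\partial x_i}=\frac{2}{N}(x_i-\mu) \\
& \frac{\partial \sigma}{\partial x_i}=\frac{1}{2\sigma}\frac{\partial v}{\partial x_i}=\frac{x_i-\mu}{N\sigma}  &  \frac{\partial y_j}{\partial x_i}=\frac{1}{\sigma}\left(\delta_{ij}-\frac{1}{N}\right)-\frac{y_i\,y_j}{N\sigma}
\end{align}

Summing over all outputs $y_j$ that depend on $x_i$ then gives

\begin{align}
\frac{\partial L}{\partial x_i}=\sum_{j=1}^N \frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial x_i}
=\frac{1}{N\sigma}\left(N\frac{\partial L}{\partial y_i}-\sum_{j=1}^N\frac{\partial L}{\partial y_j}-y_i\sum_{j=1}^N\frac{\partial L}{\partial y_j}\,y_j\right)
\end{align}

Note that the cross terms with $j \neq i$ matter: every $y_j$ depends on $x_i$ through $\mu$ and $\sigma$.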

Make sure each intermediate gradient derivation is simplified as far as possible, for ease of implementation. 

After doing so, implement the simplified batch normalization backward pass in the function `batchnorm_backward_alt` and compare the two implementations by running the following. Your two implementations should compute nearly identical results, but the alternative implementation should be a bit faster.
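As a hedged illustration of what such a simplified backward pass might look like, the sketch below assumes the derivation collapses to $\frac{\partial L}{\partial x_i}=\frac{1}{N\sigma}\left(N\,dy_i-\sum_j dy_j-y_i\sum_j dy_j\,y_j\right)$, where $dy$ is the upstream gradient with respect to the normalized values. The function names are hypothetical, and for self-containment the sketch recomputes the statistics from `x` instead of reading them from a cache as `batchnorm_backward_alt` would.

```python
import numpy as np

def bn_forward_plain(x, gamma, beta, eps=1e-5):
    """Train-mode batch normalization forward pass, written in one step."""
    y = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * y + beta

def bn_backward_closed_form(dout, x, gamma, eps=1e-5):
    """Backward pass using a single simplified expression for dx."""
    N = x.shape[0]
    sigma = np.sqrt(x.var(axis=0) + eps)
    y = (x - x.mean(axis=0)) / sigma
    dy = dout * gamma                      # gradient w.r.t. the normalized values
    dx = (N * dy - dy.sum(axis=0) - y * (dy * y).sum(axis=0)) / (N * sigma)
    dgamma = (dout * y).sum(axis=0)
    dbeta = dout.sum(axis=0)
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
N, D = 6, 4
x = 5 * rng.normal(size=(N, D)) + 12
gamma, beta = rng.normal(size=D), rng.normal(size=D)
dout = rng.normal(size=(N, D))

dx, dgamma, dbeta = bn_backward_closed_form(dout, x, gamma)

# Central-difference check of dx, one input entry at a time.
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[idx] += h
    x_minus[idx] -= h
    diff = bn_forward_plain(x_plus, gamma, beta) - bn_forward_plain(x_minus, gamma, beta)
    dx_num[idx] = np.sum(diff * dout) / (2 * h)

print("max abs dx difference:", np.max(np.abs(dx - dx_num)))
```

The whole input gradient is computed with a handful of array operations and two reductions, which is where the speedup over a node-by-node backward pass would come from.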

In [36]:
np.random.seed(231)
N, D = 100, 500
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {'mode': 'train'}
out, cache = batchnorm_forward(x, gamma, beta, bn_param)

t1 = time.time()
dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
t2 = time.time()
dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)
t3 = time.time()

print('dx difference: ', rel_error(dx1, dx2))
print('dgamma difference: ', rel_error(dgamma1, dgamma2))
print('dbeta difference: ', rel_error(dbeta1, dbeta2))
print('speedup: %.2fx' % ((t2 - t1) / (t3 - t2)))

dx difference:  2.263198472648062e-12
dgamma difference:  0.0
dbeta difference:  0.0
speedup: 3.92x


## Fully Connected Nets with Batch Normalization
Now that you have a working implementation for batch normalization, go back to your `FullyConnectedNet` in the file `cs231n/classifiers/fc_net.py`. Modify your implementation to add batch normalization.

Concretely, when the `normalization` flag is set to `"batchnorm"` in the constructor, you should insert a batch normalization layer before each ReLU nonlinearity. The outputs from the last layer of the network should not be normalized. Once you are done, run the following to gradient-check your implementation.

HINT: You might find it useful to define an additional helper layer similar to those in the file `cs231n/layer_utils.py`. If you decide to do so, do it in the file `cs231n/classifiers/fc_net.py`.
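A helper layer of the kind the hint mentions could look roughly like the forward-only sketch below. The name `affine_bn_relu_forward_sketch`, its signature, and the cache contents are hypothetical and do not mirror the actual helpers in `cs231n/layer_utils.py`; the batch normalization step is inlined in train mode so that the example runs on its own.

```python
import numpy as np

def affine_bn_relu_forward_sketch(x, w, b, gamma, beta, eps=1e-5):
    """Hypothetical helper: affine -> batch norm (train mode) -> ReLU."""
    a = x.reshape(x.shape[0], -1) @ w + b        # affine scores, shape (N, H)
    mu, var = a.mean(axis=0), a.var(axis=0)      # per-feature minibatch statistics
    a_hat = (a - mu) / np.sqrt(var + eps)        # normalize before the nonlinearity
    z = gamma * a_hat + beta                     # learnable scale and shift
    out = np.maximum(0, z)                       # ReLU
    cache = (x, w, a_hat, var, gamma, z, eps)    # values a backward pass would need
    return out, cache

rng = np.random.default_rng(0)
N, D, H = 64, 15, 20
x = rng.normal(size=(N, D))
w = 5e-2 * rng.normal(size=(D, H))
b = np.zeros(H)
gamma, beta = np.ones(H), np.zeros(H)

out, _ = affine_bn_relu_forward_sketch(x, w, b, gamma, beta)
active = (out > 0).mean()
print("output shape:", out.shape)
print("fraction of active units:", round(float(active), 3))
```

Placing the normalization before the ReLU means that, with `gamma=1` and `beta=0`, roughly half of the units are active regardless of the weight scale.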

In [79]:
np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

# You should expect relative errors between 1e-4 and 1e-10 for W,
# between 1e-08 and 1e-10 for b,
# and between 1e-08 and 1e-09 for the betas and gammas.
for reg in [0, 3.14]:
  print('Running check with reg = ', reg)
  model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                            reg=reg, weight_scale=5e-2, dtype=np.float64,
                            normalization='batchnorm')

  loss, grads = model.loss(X, y)
  print('Initial loss: ', loss)

  for name in sorted(grads):
    f = lambda _: model.loss(X, y)[0]
    grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
    print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
  if reg == 0: print()

Running check with reg =  0
Initial loss:  2.2621133407844862
[... per-element numerical gradient printout, truncated in the source ...]

# Batchnorm for deep networks
Run the following to train a six-layer network on a subset of 1000 training examples both with and without batch normalization.

In [44]:
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [100, 100, 100, 100, 100]

num_train = 1000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

weight_scale = 2e-2
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

print('Solver with batch norm:')
bn_solver = Solver(bn_model, small_data,
                num_epochs=10, batch_size=50,
                update_rule='adam',
                optim_config={
                  'learning_rate': 1e-3,
                },
                verbose=True,print_every=20)
bn_solver.train()

print('\nSolver without batch norm:')
solver = Solver(model, small_data,
                num_epochs=10, batch_size=50,
                update_rule='adam',
                optim_config={
                  'learning_rate': 1e-3,
                },
                verbose=True, print_every=20)
solver.train()

Solver with batch norm:
(Iteration 1 / 200) loss: 2.340512
(Epoch 0 / 10) train acc: 0.107000; val_acc: 0.116000
(Epoch 1 / 10) train acc: 0.317000; val_acc: 0.260000
(Iteration 21 / 200) loss: 2.002910
(Epoch 2 / 10) train acc: 0.417000; val_acc: 0.285000
(Iteration 41 / 200) loss: 2.030562
(Epoch 3 / 10) train acc: 0.493000; val_acc: 0.294000
(Iteration 61 / 200) loss: 1.776793
(Epoch 4 / 10) train acc: 0.553000; val_acc: 0.311000
(Iteration 81 / 200) loss: 1.298337
(Epoch 5 / 10) train acc: 0.571000; val_acc: 0.307000
(Iteration 101 / 200) loss: 1.265137
(Epoch 6 / 10) train acc: 0.627000; val_acc: 0.329000
(Iteration 121 / 200) loss: 1.060890
(Epoch 7 / 10) train acc: 0.657000; val_acc: 0.318000
(Iteration 141 / 200) loss: 1.124555
(Epoch 8 / 10) train acc: 0.666000; val_acc: 0.292000
(Iteration 161 / 200) loss: 0.770253
(Epoch 9 / 10) train acc: 0.756000; val_acc: 0.322000
(Iteration 181 / 200) loss: 0.888094
(Epoch 10 / 10) train acc: 0.786000; val_acc: 0.309000

Solver without batch norm:
[remaining output truncated in the source]

Run the following to visualize the results from the two networks trained above. You should find that using batch normalization helps the network converge much faster.

In [None]:
def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):
    """utility function for plotting training history"""
    plt.title(title)
    plt.xlabel(label)
    bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]
    bl_plot = plot_fn(baseline)
    num_bn = len(bn_plots)
    for i in range(num_bn):
        label='with_norm'
        if labels is not None:
            label += str(labels[i])
        plt.plot(bn_plots[i], bn_marker, label=label)
    label='baseline'
    if labels is not None:
        label += str(labels[0])
    plt.plot(bl_plot, bl_marker, label=label)
    plt.legend(loc='lower center', ncol=num_bn+1) 

    
plt.subplot(3, 1, 1)
plot_training_history('Training loss','Iteration', solver, [bn_solver], \
                      lambda x: x.loss_history, bl_marker='o', bn_marker='o')
plt.subplot(3, 1, 2)
plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \
                      lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')
plt.subplot(3, 1, 3)
plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \
                      lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')

plt.gcf().set_size_inches(15, 15)
plt.show()

# Batch normalization and initialization
We will now run a small experiment to study the interaction of batch normalization and weight initialization.

The first cell will train 8-layer networks both with and without batch normalization using different scales for weight initialization. The second cell will plot training accuracy, validation set accuracy, and training loss as a function of the weight initialization scale.

In [None]:
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [50, 50, 50, 50, 50, 50, 50]
num_train = 1000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

bn_solvers_ws = {}
solvers_ws = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
  print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))
  bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
  model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

  bn_solver = Solver(bn_model, small_data,
                  num_epochs=10, batch_size=50,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  verbose=False, print_every=200)
  bn_solver.train()
  bn_solvers_ws[weight_scale] = bn_solver

  solver = Solver(model, small_data,
                  num_epochs=10, batch_size=50,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  verbose=False, print_every=200)
  solver.train()
  solvers_ws[weight_scale] = solver

In [None]:
# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []

for ws in weight_scales:
  best_train_accs.append(max(solvers_ws[ws].train_acc_history))
  bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))
  
  best_val_accs.append(max(solvers_ws[ws].val_acc_history))
  bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))
  
  final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))
  bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))
  
plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')

plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend()

plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend()
plt.gca().set_ylim(1.0, 3.5)

plt.gcf().set_size_inches(15, 15)
plt.show()

## Inline Question 1:
Describe the results of this experiment. How does the scale of weight initialization affect models with/without batch normalization differently, and why?

## Answer:
[FILL THIS IN]


# Batch normalization and batch size
We will now run a small experiment to study the interaction of batch normalization and batch size.

The first cell will train 6-layer networks both with and without batch normalization using different batch sizes. The second cell will plot training accuracy and validation set accuracy over time.

In [None]:
def run_batchsize_experiments(normalization_mode):
    np.random.seed(231)
    # Try training a very deep net with batchnorm
    hidden_dims = [100, 100, 100, 100, 100]
    num_train = 1000
    small_data = {
      'X_train': data['X_train'][:num_train],
      'y_train': data['y_train'][:num_train],
      'X_val': data['X_val'],
      'y_val': data['y_val'],
    }
    n_epochs=10
    weight_scale = 2e-2
    batch_sizes = [5,10,50]
    lr = 10**(-3.5)
    solver_bsize = batch_sizes[0]

    print('No normalization: batch size = ',solver_bsize)
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)
    solver = Solver(model, small_data,
                    num_epochs=n_epochs, batch_size=solver_bsize,
                    update_rule='adam',
                    optim_config={
                      'learning_rate': lr,
                    },
                    verbose=False)
    solver.train()
    
    bn_solvers = []
    for i in range(len(batch_sizes)):
        b_size=batch_sizes[i]
        print('Normalization: batch size = ',b_size)
        bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)
        bn_solver = Solver(bn_model, small_data,
                        num_epochs=n_epochs, batch_size=b_size,
                        update_rule='adam',
                        optim_config={
                          'learning_rate': lr,
                        },
                        verbose=False)
        bn_solver.train()
        bn_solvers.append(bn_solver)
        
    return bn_solvers, solver, batch_sizes

batch_sizes = [5,10,50]
bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')

In [None]:
plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
                      lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
                      lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)

plt.gcf().set_size_inches(15, 10)
plt.show()

## Inline Question 2:
Describe the results of this experiment. What does this imply about the relationship between batch normalization and batch size? Why is this relationship observed?

## Answer:
[FILL THIS IN]


# Layer Normalization
Batch normalization has proved effective at making networks easier to train, but its dependence on batch size makes it less useful in complex networks where hardware limitations cap the input batch size. 
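One way to see why batch size matters is to look at how noisy the minibatch statistics are. The sketch below is a synthetic illustration only: the distribution parameters and the number of trials are arbitrary choices, and it measures just the spread of the batch mean, not any effect on training.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 2.0, 3.0   # arbitrary feature distribution for illustration
num_trials = 2000

spread = {}
for batch_size in [5, 10, 50]:
    # Draw many independent minibatches and compute the mean of each.
    batches = rng.normal(true_mean, true_std, size=(num_trials, batch_size))
    batch_means = batches.mean(axis=1)
    # How much the estimated mean fluctuates from batch to batch.
    spread[batch_size] = batch_means.std()
    print("batch size %2d: std of batch means = %.3f (theory: %.3f)"
          % (batch_size, spread[batch_size], true_std / np.sqrt(batch_size)))
```

The fluctuation of the estimated mean shrinks roughly as one over the square root of the batch size, so with very small batches each example is normalized by statistics that vary noticeably from one minibatch to the next.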

Several alternatives to batch normalization have been proposed to mitigate this problem; one such technique is Layer Normalization [2]. Instead of normalizing over the batch, we normalize over the features. In other words, when using Layer Normalization, each feature vector corresponding to a single datapoint is normalized using the mean and variance of all terms within that feature vector.
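The change of normalization axis can be sketched as follows. This is an illustrative function with a hypothetical name, not the assignment's `layernorm_forward`; `eps` and the test data are assumptions.

```python
import numpy as np

def layernorm_sketch(x, gamma, beta, eps=1e-5):
    """Normalize each row (datapoint) over its own features."""
    mu = x.mean(axis=1, keepdims=True)    # one mean per datapoint
    var = x.var(axis=1, keepdims=True)    # one variance per datapoint
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # scale and shift remain per feature

rng = np.random.default_rng(0)
N, D = 4, 6
x = 3.0 * rng.normal(size=(N, D)) + 7.0
gamma, beta = np.ones(D), np.zeros(D)

out = layernorm_sketch(x, gamma, beta)
out_single = layernorm_sketch(x[:1], gamma, beta)   # the first datapoint on its own

print("per-datapoint means:", out.mean(axis=1))
print("per-datapoint stds :", out.std(axis=1))
print("first row unchanged when batched alone:", np.allclose(out[:1], out_single))
```

Because the statistics are computed per datapoint, the result for one example does not depend on which other examples share its minibatch, which is what removes the dependence on batch size.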

[2] [Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer Normalization", arXiv:1607.06450, 2016.](https://arxiv.org/pdf/1607.06450.pdf)

## Inline Question 3:
Which of these data preprocessing steps is analogous to batch normalization, and which is analogous to layer normalization?

1. Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sum to 1.
2. Scaling each image in the dataset, so that the RGB channels for all pixels within an image sum to 1.
3. Subtracting the mean image of the dataset from each image in the dataset.
4. Setting all RGB values to either 0 or 1 depending on a given threshold.

## Answer:
[FILL THIS IN]


# Layer Normalization: Implementation

Now you'll implement layer normalization. This step should be relatively straightforward, as conceptually the implementation is almost identical to that of batch normalization. One significant difference, though, is that layer normalization does not keep track of moving averages of the moments: the testing phase is identical to the training phase, with the mean and variance computed directly from each datapoint.
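
Written out, the per-datapoint computation can be summarized as follows. This is a sketch under assumed notation: $x \in \mathbb{R}^{N \times D}$ is the input, $i$ indexes datapoints, $j$ indexes features, and $\epsilon$ is a small constant added for numerical stability.

$$
\mu_i = \frac{1}{D}\sum_{j=1}^{D} x_{ij}, \qquad
\sigma_i^2 = \frac{1}{D}\sum_{j=1}^{D} \left(x_{ij} - \mu_i\right)^2, \qquad
\hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad
y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j
$$

Since $\mu_i$ and $\sigma_i^2$ depend only on row $i$, nothing needs to be carried over from training to test time.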

Here's what you need to do:

* In `cs231n/layers.py`, implement the forward pass for layer normalization in the function `layernorm_forward`. 

Run the cell below to check your results.
* In `cs231n/layers.py`, implement the backward pass for layer normalization in the function `layernorm_backward`. 

Run the second cell below to check your results.
* Modify `cs231n/classifiers/fc_net.py` to add layer normalization to the `FullyConnectedNet`. When the `normalization` flag is set to `"layernorm"` in the constructor, you should insert a layer normalization layer before each ReLU nonlinearity. 

Run the third cell below to run the batch size experiment on layer normalization.

In [None]:
# Check the training-time forward pass by checking means and variances
# of features both before and after layer normalization   

# Simulate the forward pass for a two-layer network
np.random.seed(231)
N, D1, D2, D3 = 4, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)

print('Before layer normalization:')
print_mean_std(a,axis=1)

gamma = np.ones(D3)
beta = np.zeros(D3)
# Means should be close to zero and stds close to one
print('After layer normalization (gamma=1, beta=0)')
a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=1)

gamma = np.asarray([3.0,3.0,3.0])
beta = np.asarray([5.0,5.0,5.0])
# Now means should be close to beta and stds close to gamma
print('After layer normalization (gamma=', gamma, ', beta=', beta, ')')
a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=1)

In [None]:
# Gradient check layernorm backward pass
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

ln_param = {}
fx = lambda x: layernorm_forward(x, gamma, beta, ln_param)[0]
fg = lambda a: layernorm_forward(x, a, beta, ln_param)[0]
fb = lambda b: layernorm_forward(x, gamma, b, ln_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = layernorm_forward(x, gamma, beta, ln_param)
dx, dgamma, dbeta = layernorm_backward(dout, cache)

# You should expect to see relative errors between 1e-12 and 1e-8
print('dx error: ', rel_error(dx_num, dx))
print('dgamma error: ', rel_error(da_num, dgamma))
print('dbeta error: ', rel_error(db_num, dbeta))
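
The gradient check above compares an analytic gradient against a numerical estimate. The helpers `eval_numerical_gradient_array` and `rel_error` are not shown in this section, so the sketch below uses hypothetical stand-ins (`numerical_gradient_array` and `relative_error`) that illustrate the general idea with centered differences on a toy function whose gradient is known; the course's versions may differ in detail.

```python
import numpy as np

def numerical_gradient_array(f, x, dout, h=1e-5):
    """Centered-difference estimate of d(sum(f(x) * dout)) / dx (hypothetical helper)."""
    grad = np.zeros_like(x)
    for ix in np.ndindex(*x.shape):
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()      # f evaluated with x[ix] nudged up
        x[ix] = old - h
        neg = f(x).copy()      # f evaluated with x[ix] nudged down
        x[ix] = old            # restore the original value
        grad[ix] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

def relative_error(a, b):
    """Largest elementwise relative difference (hypothetical helper)."""
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

# Toy function with a known gradient: f(x) = x**2, so dx = 2 * x * dout.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
dout = rng.standard_normal((4, 5))

dx_num = numerical_gradient_array(lambda v: v ** 2, x, dout)
dx = 2 * x * dout
print('dx error:', relative_error(dx_num, dx))
```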

# Layer Normalization and batch size

We will now run the previous batch size experiment with layer normalization instead of batch normalization. Compared to the previous experiment, you should see a markedly smaller influence of batch size on the training history!
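
One way to see why batch size should matter less here is to compare what each scheme does to the same datapoint when it sits in a small batch versus a large one. The sketch below is an illustration in plain NumPy, not the assignment's code: the helper `normalize`, the `eps` value, and the batch sizes are assumptions, and the learnable scale and shift are omitted.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Center and scale x along the given axis (no learnable gamma/beta)."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(231)
large = rng.standard_normal((50, 8))  # batch of 50 datapoints, 8 features
small = large[:5]                     # the first 5 of the same datapoints

# Layer-norm style (axis=1): the first datapoint is normalized using only its own features.
ln_same = np.allclose(normalize(small, axis=1)[0], normalize(large, axis=1)[0])
# Batch-norm style (axis=0): the first datapoint is normalized using batch statistics.
bn_same = np.allclose(normalize(small, axis=0)[0], normalize(large, axis=0)[0])

print('layer-norm output unchanged by batch size:', ln_same)
print('batch-norm output unchanged by batch size:', bn_same)
```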

In [None]:
ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')

plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Layer Normalization)', 'Epoch', solver_bsize, ln_solvers_bsize,
                      lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Layer Normalization)', 'Epoch', solver_bsize, ln_solvers_bsize,
                      lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)

plt.gcf().set_size_inches(15, 10)
plt.show()

## Inline Question 4:
When is layer normalization likely to not work well, and why?

1. Using it in a very deep network
2. Having a very small dimension of features
3. Having a high regularization term


## Answer:
[FILL THIS IN]
