# Initialization

Welcome to the first assignment of "Improving Deep Neural Networks". 

Training your neural network requires specifying initial values for the weights. A well-chosen initialization method helps learning.

If you completed the previous course of this specialization, you probably followed our instructions for weight initialization, and it has worked out so far. But how do you choose the initialization for a new neural network? In this notebook, you will see how different initializations lead to different results. 

A well-chosen initialization can:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error 

To get started, run the following cell to import the packages and fix the random seeds so that the results are reproducible.

In [1]:
import random
import torch
import numpy as np

random_seed = 40
torch.manual_seed(random_seed)
torch.cuda.manual_seed(random_seed)
# torch.cuda.manual_seed_all(random_seed)  # if using multiple GPUs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(random_seed)
random.seed(random_seed)

## 1 - Neural Network model 

You will use a 3-layer neural network (already implemented for you). Here are the initialization methods you will experiment with:  
- *Zeros initialization* --  setting `initialization = "zeros"` in the input argument.
- *Random initialization* -- setting `initialization = "random"` in the input argument. This initializes the weights to large random values.  
- *He initialization* -- setting `initialization = "he"` in the input argument. This initializes the weights to random values scaled according to a paper by He et al., 2015. 
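
As a rough sketch of what these three options can look like in PyTorch, the snippet below applies each one to a small stand-alone layer. The helper name `init_layer` and the layer sizes are hypothetical, chosen only for illustration, and `std=10` is an assumed stand-in for "large random values".

```python
import math
import torch
import torch.nn as nn

def init_layer(layer, initialization):
    """Hypothetical helper: initialize one nn.Linear layer in place."""
    if initialization == "zeros":
        nn.init.zeros_(layer.weight)
    elif initialization == "random":
        # Assumed scale: std=10 stands in for "large random values"
        nn.init.normal_(layer.weight, mean=0.0, std=10.0)
    elif initialization == "he":
        # He et al., 2015: std = sqrt(2 / fan_in), intended for ReLU layers
        nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
    else:
        raise ValueError(f"unknown initialization: {initialization}")
    # Biases are set to zero in all three cases
    nn.init.zeros_(layer.bias)
    return layer

torch.manual_seed(0)
for name in ("zeros", "random", "he"):
    layer = init_layer(nn.Linear(512, 256), name)
    print(name, float(layer.weight.std()))
```

The printed standard deviations give a quick check that each method produces weights at the intended scale.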

**Instructions**: Please quickly read over the code below, and run it. It loads and standardizes the Boston housing data and defines the `training()` function; in the following parts you will apply the three initialization methods to the model.

In [2]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

bos = load_boston()
bos.keys()

df = pd.DataFrame(bos.data)
df.columns = bos.feature_names
df['Price'] = bos.target
df.head()

data = df[df.columns[:-1]]
data = data.apply(
    lambda x: (x - x.mean()) / x.std()
)

data['Price'] = df.Price
X = data.drop('Price', axis=1).to_numpy()
Y = data['Price'].to_numpy()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)


n_train = X_train.shape[0]
X_train = torch.tensor(X_train, dtype=torch.float)
X_test = torch.tensor(X_test, dtype=torch.float)
Y_train = torch.tensor(Y_train, dtype=torch.float).view(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float).view(-1, 1)

(354, 13)
(152, 13)
(354,)
(152,)


In [3]:
# from torch.utils.data import DataLoader, TensorDataset

# datasets = TensorDataset(X_train, Y_train)
# train_set = DataLoader(datasets, batch_size=10, shuffle=True)

# datasets = TensorDataset(X_test, Y_test)
# test_set = DataLoader(datasets, batch_size=10, shuffle=True)

# Since PyTorch 0.4, tensors track gradients directly, so wrapping them in
# Variable is no longer needed
x, y = X_train, Y_train

In [4]:
import torch.nn as nn

def training(net, X_train, Y_train, X_test, Y_test, batch_size, patience=5000, learning_rate=0.1, best_loss=1e06):
    # Mean squared error, since predicting prices is a regression task
    criterion = nn.MSELoss()

    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

    epoch = 0
    p = patience  # evaluations left without improvement before stopping
    while best_loss > 1e-06:
        # Note: each step uses the full training set; batch_size only sets
        # the number of optimizer steps per epoch
        for i in range(len(X_train) // batch_size):
            inputs = X_train
            labels = Y_train
            # Clear gradients w.r.t. parameters
            optimizer.zero_grad()

            # Forward pass to get predictions
            out = net(inputs)

            # Calculate loss: mean squared error
            loss = criterion(out, labels)

            # Getting gradients w.r.t. parameters
            loss.backward()

            # Updating parameters
            optimizer.step()

        epoch += 1
        # Exact-match "accuracy" is kept for the log only; it is not a
        # meaningful metric for regression and is expected to stay near 0
        correct = 0
        total = 0
        # Evaluate on the test set without tracking gradients
        with torch.no_grad():
            for j in range(len(X_test) // batch_size):
                inputs = X_test
                labels = Y_test
                # Forward pass only to get predictions
                outputs = net(inputs)

                val_loss = criterion(outputs, labels)

                # Total number of labels
                total += labels.size(0)

                # Total exact matches between predictions and labels
                correct += (outputs == labels).sum().item()

        accuracy = 100. * correct / total

        # Print the loss whenever it improves; otherwise count down patience
        if best_loss > val_loss.item():
            p = patience
            best_loss = val_loss.item()
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(epoch, val_loss.item(), accuracy))
        else:
            p -= 1
            if p == 0:
                break

## 2 - Zero initialization

There are two types of parameters to initialize in a neural network:
- the weight matrices $(W^{[1]}, W^{[2]}, W^{[3]}, ..., W^{[L-1]}, W^{[L]})$
- the bias vectors $(b^{[1]}, b^{[2]}, b^{[3]}, ..., b^{[L-1]}, b^{[L]})$
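
To make the two parameter types concrete, the sketch below builds a small three-layer network and lists each weight matrix and bias vector with its shape. The layer sizes and the name `net_demo` are hypothetical. Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`.

```python
import torch.nn as nn

# Hypothetical layer sizes, chosen only for illustration
net_demo = nn.Sequential(
    nn.Linear(13, 10), nn.ReLU(),
    nn.Linear(10, 5), nn.ReLU(),
    nn.Linear(5, 1),
)

# Collect the shape of every trainable parameter by name
shapes = {name: tuple(p.shape) for name, p in net_demo.named_parameters()}
for name, shape in shapes.items():
    kind = "weight matrix" if name.endswith("weight") else "bias vector"
    print(f"{name:10s} {kind:14s} {shape}")
```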

**Exercise**: Initialize all parameters to zeros. The relevant functions are documented at:
- https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.constant_
- https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.zeros_

Run the following code to train your model using zeros initialization. Training stops once the loss on the test set has not improved for `patience` consecutive evaluations.

In [5]:
import torch.nn as nn
import torch.nn.functional as F

w_num = X_train.shape[1]
net = nn.Sequential(
    nn.Linear(w_num, 1)
)

nn.init.constant_(net[0].weight, val=0)
nn.init.constant_(net[0].bias, val=0)

Parameter containing:
tensor([0.], requires_grad=True)

In [6]:
batch_size = 12
training(net, X_train, Y_train, X_test, Y_test, batch_size)

Iteration: 1. Loss: 422.0935974121094. Accuracy: 0.0
Iteration: 2. Loss: 331.7978515625. Accuracy: 0.0
Iteration: 3. Loss: 252.31092834472656. Accuracy: 0.0
Iteration: 4. Loss: 186.39996337890625. Accuracy: 0.0
Iteration: 5. Loss: 136.51019287109375. Accuracy: 0.0
Iteration: 6. Loss: 99.80962371826172. Accuracy: 0.0
Iteration: 7. Loss: 73.47848510742188. Accuracy: 0.0
Iteration: 8. Loss: 55.1191291809082. Accuracy: 0.0
Iteration: 9. Loss: 42.69768524169922. Accuracy: 0.0
Iteration: 10. Loss: 34.540130615234375. Accuracy: 0.0
Iteration: 11. Loss: 29.339153289794922. Accuracy: 0.0
Iteration: 12. Loss: 26.117141723632812. Accuracy: 0.0
Iteration: 13. Loss: 24.174522399902344. Accuracy: 0.0
Iteration: 14. Loss: 23.03174591064453. Accuracy: 0.0
Iteration: 15. Loss: 22.373502731323242. Accuracy: 0.0
Iteration: 16. Loss: 22.000469207763672. Accuracy: 0.0
Iteration: 17. Loss: 21.79120635986328. Accuracy: 0.0
Iteration: 18. Loss: 21.674148559570312. Accuracy: 0.0
Iteration: 19. Loss: 21.6083602

## 3 - Random initialization

To break symmetry, let's initialize the weights randomly. Following random initialization, each neuron can then proceed to learn a different function of its inputs. In this exercise, you will see what happens if the weights are initialized randomly, but to very large values.
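
The symmetry problem can be checked directly. The sketch below uses a hypothetical two-layer network with a tanh hidden layer and compares the rows of the first-layer weight gradient after one backward pass. With a constant initialization every hidden unit receives the same gradient, so the units stay identical to one another; with random initialization the rows differ. The constant 0.5 (rather than 0) is an assumption made so that the gradients are non-zero and the effect is visible.

```python
import torch
import torch.nn as nn

def first_layer_grad(init_fn):
    """Return the gradient of the first-layer weights after one backward pass."""
    torch.manual_seed(0)
    net_demo = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
    for layer in (net_demo[0], net_demo[2]):
        init_fn(layer.weight)
        nn.init.zeros_(layer.bias)
    x = torch.randn(8, 4)  # synthetic inputs
    y = torch.randn(8, 1)  # synthetic targets
    loss = nn.functional.mse_loss(net_demo(x), y)
    loss.backward()
    return net_demo[0].weight.grad

def rows_identical(grad):
    # True when every hidden unit (row) receives the same gradient
    return bool(torch.allclose(grad, grad[0].expand_as(grad)))

grad_const = first_layer_grad(lambda w: nn.init.constant_(w, 0.5))
grad_rand = first_layer_grad(lambda w: nn.init.normal_(w, mean=0.0, std=0.5))

print("constant init, identical rows:", rows_identical(grad_const))
print("random init, identical rows:  ", rows_identical(grad_rand))
```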

**Exercise**: Initialize your weights to large random values (scaled by 10) and your biases to zeros. You may choose the random distribution, for example uniform or normal:
- https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.uniform_
- https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.normal_
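
One way to see the effect of the weight scale before any training is to compare the initial mean squared error of a single linear layer under two scales. In the sketch below the inputs and targets are synthetic stand-ins, not the dataset used in this notebook, and the helper name `initial_loss` and both `std` values are assumptions for illustration.

```python
import torch
import torch.nn as nn

def initial_loss(std, n_features=13, n_samples=256):
    """Mean squared error of a freshly initialized linear layer."""
    torch.manual_seed(0)
    x = torch.randn(n_samples, n_features)      # synthetic standardized inputs
    y = 22.0 + 9.0 * torch.randn(n_samples, 1)  # synthetic targets
    layer = nn.Linear(n_features, 1)
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    nn.init.zeros_(layer.bias)
    with torch.no_grad():
        return nn.functional.mse_loss(layer(x), y).item()

loss_small = initial_loss(std=0.1)
loss_large = initial_loss(std=10.0)
print(f"std=0.1 -> initial loss {loss_small:.1f}")
print(f"std=10  -> initial loss {loss_large:.1f}")
```

With large weights the initial predictions are far from the targets, so the optimizer starts from a much higher loss.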

In [7]:
import torch.nn as nn
import torch.nn.functional as F

w_num = X_train.shape[1]
net = nn.Sequential(
    nn.Linear(w_num, 1)
)

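# Note: std=0.1 gives small weights; raise std (for example to 10) to try the
# "large random values" case described above
nn.init.normal_(net[0].weight, mean=0, std=0.1)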
nn.init.constant_(net[0].bias, val=0)

Parameter containing:
tensor([0.], requires_grad=True)

In [8]:
batch_size = 12
training(net, X_train, Y_train, X_test, Y_test, batch_size)

Iteration: 1. Loss: 422.0658874511719. Accuracy: 0.0
Iteration: 2. Loss: 331.5777282714844. Accuracy: 0.0
Iteration: 3. Loss: 252.1563262939453. Accuracy: 0.0
Iteration: 4. Loss: 186.18951416015625. Accuracy: 0.0
Iteration: 5. Loss: 136.3537139892578. Accuracy: 0.0
Iteration: 6. Loss: 99.70662689208984. Accuracy: 0.0
Iteration: 7. Loss: 73.39952087402344. Accuracy: 0.0
Iteration: 8. Loss: 55.06049728393555. Accuracy: 0.0
Iteration: 9. Loss: 42.65496063232422. Accuracy: 0.0
Iteration: 10. Loss: 34.5093879699707. Accuracy: 0.0
Iteration: 11. Loss: 29.317346572875977. Accuracy: 0.0
Iteration: 12. Loss: 26.101869583129883. Accuracy: 0.0
Iteration: 13. Loss: 24.163951873779297. Accuracy: 0.0
Iteration: 14. Loss: 23.024532318115234. Accuracy: 0.0
Iteration: 15. Loss: 22.368633270263672. Accuracy: 0.0
Iteration: 16. Loss: 21.997209548950195. Accuracy: 0.0
Iteration: 17. Loss: 21.789033889770508. Accuracy: 0.0
Iteration: 18. Loss: 21.672712326049805. Accuracy: 0.0
Iteration: 19. Loss: 21.60741

## 4 - He initialization

Finally, try "He Initialization"; this is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar except Xavier initialization uses a scaling factor for the weights $W^{[l]}$ of `sqrt(1./layers_dims[l-1])` where He initialization would use `sqrt(2./layers_dims[l-1])`.)
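
For a concrete sense of the two scaling factors, the sketch below evaluates them for a layer whose previous layer has 13 units, matching the 13 input features used in this notebook. The function names are hypothetical.

```python
import math

def xavier_scale(n_prev):
    # Scaling factor quoted above for Xavier initialization
    return math.sqrt(1.0 / n_prev)

def he_scale(n_prev):
    # Scaling factor quoted above for He initialization
    return math.sqrt(2.0 / n_prev)

n_prev = 13
print(f"Xavier: {xavier_scale(n_prev):.4f}")
print(f"He:     {he_scale(n_prev):.4f}")
print(f"ratio:  {he_scale(n_prev) / xavier_scale(n_prev):.4f}")
```

Whatever the layer size, the He factor is larger than the Xavier factor by $\sqrt{2}$.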

**Exercise**: Implement the following function to initialize your parameters with He initialization.

**Hint**: This is similar to the random initialization in the previous section. The only difference is that instead of scaling standard normal samples by 10, you scale them by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialization recommends for layers with a ReLU activation. In PyTorch, `nn.init.kaiming_normal_` provides this initialization.
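
The hint translates to PyTorch roughly as follows. This sketch draws standard normal weights and multiplies them by $\sqrt{2/n_{prev}}$, then compares the empirical standard deviation with that of `nn.init.kaiming_normal_`, PyTorch's built-in He initializer. The layer sizes and variable names are hypothetical.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n_prev, n_curr = 512, 256  # hypothetical layer sizes

# Manual He initialization: standard normal scaled by sqrt(2 / n_prev)
w_manual = torch.randn(n_curr, n_prev) * math.sqrt(2.0 / n_prev)

# Built-in initializer; with mode="fan_in" the scale uses the previous layer's size
w_builtin = torch.empty(n_curr, n_prev)
nn.init.kaiming_normal_(w_builtin, mode="fan_in", nonlinearity="relu")

target = math.sqrt(2.0 / n_prev)
print(f"target std:  {target:.4f}")
print(f"manual std:  {w_manual.std().item():.4f}")
print(f"builtin std: {w_builtin.std().item():.4f}")
```

Both empirical values should land close to the target, which suggests the two approaches are interchangeable for a layer of this shape.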

In [9]:
import torch.nn as nn
import torch.nn.functional as F

w_num = X_train.shape[1]
net = nn.Sequential(
    nn.Linear(w_num, 1)
)

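# Note: xavier_normal_ applies Xavier (Glorot) initialization. For He initialization,
# use nn.init.kaiming_normal_(net[0].weight, nonlinearity='relu') instead.
nn.init.xavier_normal_(net[0].weight, gain=1.0)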
nn.init.constant_(net[0].bias, val=0)

Parameter containing:
tensor([0.], requires_grad=True)

In [10]:
batch_size = 12
training(net, X_train, Y_train, X_test, Y_test, batch_size)

Iteration: 1. Loss: 424.1647033691406. Accuracy: 0.0
Iteration: 2. Loss: 332.1796875. Accuracy: 0.0
Iteration: 3. Loss: 251.9687042236328. Accuracy: 0.0
Iteration: 4. Loss: 185.84878540039062. Accuracy: 0.0
Iteration: 5. Loss: 136.1215362548828. Accuracy: 0.0
Iteration: 6. Loss: 99.51551818847656. Accuracy: 0.0
Iteration: 7. Loss: 73.26202392578125. Accuracy: 0.0
Iteration: 8. Loss: 54.966861724853516. Accuracy: 0.0
Iteration: 9. Loss: 42.59405517578125. Accuracy: 0.0
Iteration: 10. Loss: 34.4719352722168. Accuracy: 0.0
Iteration: 11. Loss: 29.295969009399414. Accuracy: 0.0
Iteration: 12. Loss: 26.090869903564453. Accuracy: 0.0
Iteration: 13. Loss: 24.159215927124023. Accuracy: 0.0
Iteration: 14. Loss: 23.02326202392578. Accuracy: 0.0
Iteration: 15. Loss: 22.369064331054688. Accuracy: 0.0
Iteration: 16. Loss: 21.99831199645996. Accuracy: 0.0
Iteration: 17. Loss: 21.790246963500977. Accuracy: 0.0
Iteration: 18. Loss: 21.673789978027344. Accuracy: 0.0
Iteration: 19. Loss: 21.608253479003