# 2. Neural Networks with Numpy

In this notebook we build our first neural network, using `numpy` as the only library for the network itself.

We work with the same dataset as last week and try to predict which digit an image shows from its pixel values.

In [1]:
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, data_home="./data", cache=True)

We already know from last time what the data looks like:

In [2]:
X.head(3)

Unnamed: 0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The label is a number from 0 to 9 representing the digit shown in the image.

In [3]:
y.head(3)

0    5
1    0
2    4
Name: class, dtype: category
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']

As shown above, the label is given as a digit. To calculate the loss of the neural network, however, we need a one-hot-encoded version of the label: a vector of length 10 in which the position of the true digit is `1` and all other positions are `0`.

For instance:
- `3` -> `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`
- `9` -> `[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]`

This is done in the following:

In [4]:
import pandas as pd
y_categorical = pd.get_dummies(y).astype('float32').values
y_categorical[0:5]

array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)
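
The same encoding can also be produced with `numpy` alone. The following is a minimal sketch, not the notebook's method: the helper name `one_hot` is made up for illustration, and it assumes the labels are the digits 0 to 9, given either as integers or as digit strings.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float32 one-hot matrix."""
    # Labels may arrive as strings such as '5', so convert them to integers first.
    labels = np.asarray(labels).astype(int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype='float32')[labels]

encoded = one_hot(['5', '0', '4'])
print(encoded)
```

Indexing the identity matrix with the label array picks one row per label, so no explicit loop is needed.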

Before we start, we scale the pixel values to the range [0, 1] and split the data into training and test sets (85% / 15%):

In [5]:
from sklearn.model_selection import train_test_split

X_scaled = (X/255).astype('float32').values
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_categorical, test_size=0.15, random_state=42)

## Task 1: Implement the Forward pass

We start with the following structure of a neural network with two hidden layers:
- Input layer of size 784 (the scaled pixel values; no activation is applied here)
- First hidden layer of size 128 with sigmoid activation
- Second hidden layer of size 64 with sigmoid activation
- Output layer of size 10 with softmax activation

Skeleton code for this network is given in the following class. Your first task is to complete the method `forward_pass` so that it calculates the forward pass for one data point. After the class you can find a test to check whether your implementation is correct.

In [None]:
import time
import numpy as np

class DeepNeuralNetwork():
    
    # do not touch this method
    def __init__(self):
        
        # initialize weights randomly
        np.random.seed(0)
        self.w1 = np.random.randn(128, 784)
        self.w2 = np.random.randn(64, 128)
        self.w3 = np.random.randn(10, 64)

    def forward_pass(self, x_train):
        
        z1 = None  # TODO: implement the dot product for w1 * x_train
        a1 = None  # TODO: implement the sigmoid activation on z1
        z2 = None  # TODO: implement the dot product for w2 * a1
        a2 = None  # TODO: implement the sigmoid activation on z2
        z3 = None  # TODO: implement the dot product for w3 * a2
        a3 = None  # TODO: implement the softmax activation for z3
        
        # we need to remember all values for backpropagation
        self.fwdpass = [x_train, z1, a1, z2, a2, z3, a3]
        
        return a3
    
    # do not touch this method
    def backprop(self, y, y_hat):
        # restore values from forward pass
        a0, z1, a1, z2, a2, z3, a3 = self.fwdpass
        
        # Calculate W3 update
        exps = np.exp(z3 - z3.max())
        softmax_derivative = exps / np.sum(exps, axis=0) * (1 - exps / np.sum(exps, axis=0))
        error = 2 * (y_hat - y) / y_hat.shape[0] * softmax_derivative
        gradient_w3 = np.outer(error, a2)

        # Calculate W2 update
        sigmoid_derivative = (np.exp(-z2))/((np.exp(-z2)+1)**2)
        error = np.dot(self.w3.T, error) * sigmoid_derivative
        gradient_w2 = np.outer(error, a1)

        # Calculate W1 update
        sigmoid_derivative = (np.exp(-z1))/((np.exp(-z1)+1)**2)
        error = np.dot(self.w2.T, error) * sigmoid_derivative
        gradient_w1 = np.outer(error, a0)

        return [gradient_w1, gradient_w2, gradient_w3]
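
The `backprop` method writes the sigmoid derivative as `exp(-z) / (exp(-z) + 1)**2`. This is the same quantity as the more familiar form `sigmoid(z) * (1 - sigmoid(z))`. The short check below is only a sketch: the helper `sigmoid` and the sample grid `z` are chosen for illustration and are not part of the class.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# A small grid of sample inputs, chosen only for illustration.
z = np.linspace(-5.0, 5.0, 11)

# Form used in backprop above.
derivative_backprop = np.exp(-z) / ((np.exp(-z) + 1) ** 2)
# Textbook form expressed through the sigmoid itself.
derivative_identity = sigmoid(z) * (1.0 - sigmoid(z))

forms_agree = np.allclose(derivative_backprop, derivative_identity)
print(forms_agree)
```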

Hint: To calculate the output of a layer you can use numpy's matrix operations. For instance:

In [5]:
import numpy as np

w = np.array([[2,2,2],[1,1,1]])
print(w)
print(w.shape)
x = np.array([3, 4, 5])
print(x.shape)

z = np.dot(w, x)

z

[[2 2 2]
 [1 1 1]]
(2, 3)
(3,)

array([24, 12])
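
The two activation functions named in the network description could be written along the following lines. This is a sketch under assumptions rather than the reference solution: the function names `sigmoid` and `softmax` are our own, and the softmax subtracts the maximum score before exponentiating, in the same way `backprop` computes `np.exp(z3 - z3.max())`.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

# Toy scores, chosen only for illustration.
scores = np.array([1.0, 2.0, 3.0])
probabilities = softmax(scores)
print(sigmoid(0.0))
print(probabilities)
```

The largest score receives the largest probability, which is why the index of the maximum output can later be read as the predicted digit.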

In [None]:
# Test for task 1:
dnn = DeepNeuralNetwork()

# the network outputs a probability for every neuron in the last layer
y_hat = dnn.forward_pass(X_train[0])
print("The output of the last layer looks like this:\n", y_hat)

# to check if the network works correctly, check if the following condition is True
abs(y_hat[8] - 0.946) < 0.001

## Task 2: Implement the training procedure

We can now start training the network by implementing the training procedure. We train the network for 10 epochs as shown in the code below.
In each epoch we go over every data point `x` in `X_train` and:
1. Calculate a forward pass on `x` and save it as `y_hat`
2. Calculate the gradients for the weights `w1`, `w2` and `w3` using the `backprop` method of the network
3. Update the weights `w1`, `w2` and `w3` of the network by moving in the negative direction of the gradient, scaled by the `learning_rate`
4. Bonus: Calculate the cross-entropy loss after each epoch and plot it against the epoch number.



In [None]:
dnn = DeepNeuralNetwork()

no_epochs = 10
learning_rate = 0.01

start_time = time.time()
losses = []
for iteration in range(no_epochs):
    # your code goes here
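
Steps 3 and 4 of the procedure can be illustrated in isolation on toy values. The sketch below is not the solution to the task: the helpers `sgd_update` and `cross_entropy` and the `eps` clipping constant are assumptions made for illustration.

```python
import numpy as np

def sgd_update(weight, gradient, learning_rate):
    """Return the weight matrix after one step against the gradient."""
    return weight - learning_rate * gradient

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy between a one-hot label y and predicted probabilities y_hat."""
    # Clipping avoids log(0) when a predicted probability is exactly zero.
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

# Toy weight matrix and gradient, chosen only for illustration.
w = np.array([[1.0, 2.0], [3.0, 4.0]])
grad = np.array([[10.0, 0.0], [0.0, -10.0]])
w_new = sgd_update(w, grad, learning_rate=0.01)
print(w_new)

# Toy one-hot label and prediction for a three-class problem.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.8, 0.1])
loss = cross_entropy(y, y_hat)
print(loss)
```

In the training loop, the same kind of update would be applied to `dnn.w1`, `dnn.w2` and `dnn.w3` with the gradients returned by `backprop`, and the per-sample losses could be averaged over an epoch before being appended to `losses`.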

## Task 3: Predict on the test data

After the network is trained, we can use it to predict on the test data.

Task:
- Iterate over the test data and use the trained network to predict on every test data point.
- Identify the index of the neuron which returned the highest probability.
- Compare this value to the true label in the test data.
- Compute the accuracy.
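
The comparison and accuracy steps above can be sketched on a toy example. The helper name `accuracy` and the small arrays are made up for illustration; in the notebook, the predictions would come from `forward_pass` and the true labels from `y_test`.

```python
import numpy as np

def accuracy(y_true_one_hot, y_pred_proba):
    """Fraction of rows where the most probable class matches the true class."""
    predicted = np.argmax(y_pred_proba, axis=1)   # index of the highest probability
    actual = np.argmax(y_true_one_hot, axis=1)    # index of the 1 in the one-hot label
    return np.mean(predicted == actual)

# Four toy samples with three classes each, chosen only for illustration.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]], dtype='float32')
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.7, 0.1]])

acc = accuracy(y_true, y_pred)
print(acc)
```

Here three of the four toy predictions match their labels, so the sketch reports an accuracy of 0.75.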

## Bonus Task
- Remove the first hidden layer. Train the network and check the performance on the test data.