## Test out the neural network algorithm

Use the MNIST dataset to test the neural network algorithm with different options. Keep only two digit classes (2 and 3, relabeled as 0 and 1) to turn this into a very simple binary classification problem.

In [152]:
import numpy as np

In [153]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [154]:
from NeuralNetNLayer import NeuralNet

**Load in and standardize the dataset**

In [173]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
np.random.seed(1)

In [116]:
df = pd.read_csv('mnist/train.csv')

In [120]:
# Keep only two digits (2 and 3) and relabel them as 0 and 1 to convert this to binary classification
df_two = df[(df['label'] >= 2) & (df['label'] <= 3)].copy(deep=True)
df_two['label'] = np.where(df_two['label'] == 3, 1, 0)
df_two.label.value_counts()

1    4351
0    4177
Name: label, dtype: int64

In [121]:
# Check out the data
df_two.head()

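| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |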


In [122]:
# Split into X and y training and validation sets 
X_train, X_test, y_train, y_test = train_test_split(df_two.drop('label', axis=1), df_two['label'], train_size = 0.67)

In [123]:
# Standardize before transposing (sklearn assumes an observations x features layout)
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_s = scaler.transform(X_train)
# Transpose X and convert to an array for compatibility with the neural network
# neural network requires features as rows and observations as columns
X_t = X_s.T
X_t.shape

(784, 5713)
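The standardize-then-transpose step can be illustrated on a toy array. This is a minimal NumPy sketch, assuming the same observations x features input layout; the array values and variable names are made up for illustration. It mirrors how `StandardScaler` centers and scales each column using statistics from the training data, and how a constant column (common for MNIST border pixels) ends up as zeros rather than causing a division by zero.

```python
import numpy as np

# Toy data: 4 observations x 3 features (hypothetical values).
# The last column is constant, like an always-black border pixel.
X_toy = np.array([[1.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [3.0, 30.0, 0.0],
                  [4.0, 40.0, 0.0]])

# Column statistics computed on the training data only
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)

# Avoid dividing by zero for constant columns (StandardScaler does the same)
sigma_safe = np.where(sigma == 0, 1.0, sigma)

X_toy_s = (X_toy - mu) / sigma_safe   # still observations x features
X_toy_t = X_toy_s.T                   # features x observations for the network

print(X_toy_t.shape)
```

The same `mu` and `sigma_safe` would be reused for any held-out data, which is what calling `scaler.transform` on the test split does.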

In [124]:
# Convert y to an array and reshape it into a (1, m) row vector to match X_t
y_t = y_train.values.reshape(1, len(y_train))
y_t.shape

(1, 5713)

**Create and train the test models**

In [198]:
from NeuralNetNLayer import NeuralNet

In [218]:
# Create and train a base model with He initialization
clf = NeuralNet(X_t, y_t, [6, 3, 1], initialization = 'he', rseed=9)

In [219]:
clf.train(1.0, 500)

batch size:  5713
Iteration: 0 , Cost, Accuracy, lr:  0.5434235869984946 83.84386486959566 1.0
Iteration: 100 , Cost, Accuracy, lr:  0.012629088185597841 99.82496061613864 1.0
Iteration: 200 , Cost, Accuracy, lr:  0.00350515235723548 99.94748818484159 1.0
Iteration: 300 , Cost, Accuracy, lr:  0.0015640257304202676 100.0 1.0
Iteration: 400 , Cost, Accuracy, lr:  0.0009389916395429696 100.0 1.0


In [220]:
# Create and train a deeper model with regularization
clf_reg = NeuralNet(X_t, y_t, [6, 4, 4, 1], initialization='he', rseed=9)

In [221]:
clf_reg.train(1.0, 500, lambd=0.2)

batch size:  5713
Iteration: 0 , Cost, Accuracy, lr:  0.6087225183919193 74.82933660073516 1.0
Iteration: 100 , Cost, Accuracy, lr:  0.008207127233780398 99.70243304743566 1.0
Iteration: 200 , Cost, Accuracy, lr:  0.0046039442613619395 99.84246455452477 1.0
Iteration: 300 , Cost, Accuracy, lr:  0.003559489519287469 99.87747243129704 1.0
Iteration: 400 , Cost, Accuracy, lr:  0.0033522921001466 99.87747243129704 1.0
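The source does not show how `lambd` enters the cost. A common choice is an L2 penalty added to the cross-entropy, and the sketch below assumes that form. The function name `l2_regularized_cost` and the toy values are hypothetical, not taken from `NeuralNetNLayer`.

```python
import numpy as np

def l2_regularized_cost(A_out, y, weights, lambd):
    """Cross-entropy cost plus an L2 penalty (assumed form, not confirmed by the source)."""
    m = y.shape[1]
    cross_entropy = -np.sum(y * np.log(A_out) + (1 - y) * np.log(1 - A_out)) / m
    # Penalty: lambd / (2m) times the sum of squared weights over all layers
    penalty = (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + penalty

# Hypothetical predictions, labels, and weight matrices
A_out = np.array([[0.9, 0.2, 0.7]])
y_toy = np.array([[1, 0, 1]])
weights = [np.array([[0.5, -0.5]]), np.array([[1.0]])]

cost_plain = l2_regularized_cost(A_out, y_toy, weights, lambd=0.0)
cost_reg = l2_regularized_cost(A_out, y_toy, weights, lambd=0.2)
print(cost_reg - cost_plain)  # size of the penalty term alone
```

Under this assumption, each weight gradient would also gain a `(lambd / m) * W` term, which shrinks the weights a little on every update.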


In [202]:
# Test training with minibatch and momentum
clf_mb = NeuralNet(X_t, y_t, [6, 4, 4, 1], initialization='he', rseed=9)

In [203]:
clf_mb.train(1.5, 500, lambd = 0.2, batch_size=512, beta=0.9)

batch size:  512
Iteration: 0 , Cost, Accuracy, lr:  0.23657557244958335 96.04410992473306 1.5
Iteration: 100 , Cost, Accuracy, lr:  0.003974552847330727 99.89497636968318 1.5
Iteration: 200 , Cost, Accuracy, lr:  0.0008883884679227721 100.0 1.5
Iteration: 300 , Cost, Accuracy, lr:  0.0007145784812119167 100.0 1.5
Iteration: 400 , Cost, Accuracy, lr:  0.0006763909040891826 100.0 1.5
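A minimal sketch of what `batch_size` and `beta` typically control, assuming shuffled mini-batches and the exponentially-weighted-average form of momentum. Whether `NeuralNet` shuffles, or uses exactly this momentum formula, is not shown in the source. The helper names `make_minibatches` and `momentum_step` are hypothetical.

```python
import numpy as np

def make_minibatches(X, y, batch_size, rng):
    """Shuffle the columns (observations) and split them into mini-batches."""
    m = X.shape[1]
    order = rng.permutation(m)
    return [(X[:, order[i:i + batch_size]], y[:, order[i:i + batch_size]])
            for i in range(0, m, batch_size)]

def momentum_step(W, dW, v, lr, beta):
    """One parameter update using a running average of past gradients."""
    v = beta * v + (1 - beta) * dW
    return W - lr * v, v

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 5713))            # 4 toy features, same m as X_t
y_demo = rng.integers(0, 2, size=(1, 5713))
batches = make_minibatches(X_demo, y_demo, batch_size=512, rng=rng)
print(len(batches), batches[-1][0].shape)      # 12 batches, the last holds 81 columns

# A single momentum update on a toy weight
W_new, v_new = momentum_step(np.array([1.0]), np.array([1.0]), np.array([0.0]), lr=0.1, beta=0.9)
```

With 5713 training examples and a batch size of 512, each pass over the data yields 11 full batches plus one partial batch of 81 examples.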


In [215]:
# add in learning rate decay
clf_decay = NeuralNet(X_t, y_t, [6, 4, 4, 1], initialization='he', rseed=9)

In [216]:
clf_decay.train(3.0, 500, lambd = 0.3, batch_size=512, beta=0.9, decay_rate=0.3)

batch size:  512
Iteration: 0 , Cost, Accuracy, lr:  0.12603733529432928 96.14913355504989 3.0
Iteration: 100 , Cost, Accuracy, lr:  0.0010567218816566414 100.0 3.0
Iteration: 200 , Cost, Accuracy, lr:  0.0009963434845881721 100.0 2.3076923076923075
Iteration: 300 , Cost, Accuracy, lr:  0.0009788022172324703 100.0 1.875
Iteration: 400 , Cost, Accuracy, lr:  0.0009703391634436365 100.0 1.5789473684210527
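The logged learning rates above are consistent with inverse-time decay, `lr = lr0 / (1 + decay_rate * k)`, for k = 0, 1, 2, 3. The sketch below reproduces those values. How `NeuralNet` maps iterations to the step count `k` is an inference from the log, not something the source states, and `decayed_lr` is a hypothetical helper.

```python
lr0, decay_rate = 3.0, 0.3

def decayed_lr(lr0, decay_rate, k):
    """Inverse-time decay: k counts how many decay steps have been applied."""
    return lr0 / (1 + decay_rate * k)

# k = 0, 1, 2, 3 gives roughly 3.0, 2.3077, 1.875, 1.5789
for k in range(4):
    print(k, decayed_lr(lr0, decay_rate, k))
```

In the log, the rate is still 3.0 at iteration 100 and first drops at iteration 200, so the first decay step appears to be applied somewhere between those two printouts.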


**Test accuracy on the validation dataset**

In [209]:
# Standardize with the scaler fitted on the training data, then transpose
X_s = scaler.transform(X_test)
X_v = X_s.T
X_v.shape

(784, 2815)

In [210]:
y_v = y_test.values.reshape(1, len(y_test))
y_v.shape

(1, 2815)

In [222]:
clf.validation_accuracy(X_v=X_v, y_v=y_v)

96.98046181172292

In [223]:
clf_reg.validation_accuracy(X_v=X_v, y_v=y_v)

97.22912966252221

In [224]:
clf_mb.validation_accuracy(X_v=X_v, y_v=y_v)

98.61456483126109

In [225]:
clf_decay.validation_accuracy(X_v=X_v, y_v=y_v)

98.50799289520427
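For reference, an accuracy percentage like the ones above can be computed by thresholding the predicted probabilities at 0.5. This is a minimal sketch with a hypothetical helper, `accuracy_percent`, and toy values. The internals of `validation_accuracy` are not shown in the source.

```python
import numpy as np

def accuracy_percent(probs, y):
    """Percent of examples whose thresholded prediction matches the label."""
    preds = (probs > 0.5).astype(int)
    return 100.0 * np.mean(preds == y)

probs = np.array([[0.9, 0.4, 0.6, 0.1]])   # hypothetical sigmoid outputs
y_true = np.array([[1, 0, 0, 0]])
acc = accuracy_percent(probs, y_true)
print(acc)  # 3 of 4 correct
```

On the 2815 validation examples, the base model's 96.98% corresponds to 2730 correct predictions (85 errors), and the mini-batch model's 98.61% corresponds to 2776 correct (39 errors).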

## Algorithm and Math Details

### Algorithm Steps

**Forward Propagation**

**Backpropagation**

First initialize the last layer using the derivative of the sigmoid (for 0/1 classification)

$dZ^{[L]} = A^{[L]} - y$

$dW^{[L]} = \frac{1}{m} dZ^{[L]} A^{[L-1]T}$

$db^{[L]} = \frac{1}{m}\sum_{i} dZ^{[L](i)}$

Next calculate $dA^{[L-1]}$, which the backward pass uses to form $dZ^{[L-1]}$ for layer $L-1$

$dA^{[L-1]} = W^{[L]T} dZ^{[L]}$
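The output-layer step can be sketched in NumPy as follows. It assumes the features x observations layout used earlier and a forward pass of the form `Z = W @ A_prev + b`, with `W` shaped (units in this layer, units in the previous layer). The function name `output_layer_backward` and the toy shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer_backward(A_prev, W, b, y):
    """Gradients for a sigmoid output layer with cross-entropy cost."""
    m = y.shape[1]
    A_out = sigmoid(W @ A_prev + b)              # forward pass (assumed convention)
    dZ = A_out - y                               # shape (1, m)
    dW = (dZ @ A_prev.T) / m                     # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m   # same shape as b
    dA_prev = W.T @ dZ                           # handed to the layer below
    return dW, db, dA_prev

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(3, 5))   # 3 hidden units, 5 examples
W = rng.normal(size=(1, 3))
b = np.zeros((1, 1))
y = np.array([[1, 0, 1, 1, 0]])

dW, db, dA_prev = output_layer_backward(A_prev, W, b, y)
print(dW.shape, db.shape, dA_prev.shape)
```

Each gradient has the same shape as the quantity it updates, which is a quick sanity check when wiring layers together.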

### Calculus Derivations

Cost function:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right) \tag{1}$$

**Last Layer L**



$$\frac{dJ}{dW^{[L]}} =  \frac{dJ}{dA^{[L]}} \frac{dA^{[L]}}{dW^{[L]}} $$

$$ = \frac{dJ}{dA^{[L]}} \frac{dA^{[L]}}{dZ^{[L]}} \frac{dZ^{[L]}}{dW^{[L]}}$$

For a single training example (the $\frac{1}{m}$ average is applied at the end), differentiating the cost, which carries a leading minus sign, gives:

$$ \frac{dJ}{dA^{[L]}} = -\frac{y}{A^{[L]}} + \frac{1-y}{1-A^{[L]}}$$

$$ A^{[L]} = \sigma(Z^{[L]}) $$

$$ \frac{dA^{[L]}}{dZ^{[L]}} =  \sigma(Z^{[L]})(1-\sigma(Z^{[L]})) = A^{[L]}(1-A^{[L]}) $$

$$ \frac{dJ}{dZ^{[L]}} = {dZ^{[L]}} =  \frac{dJ}{dA^{[L]}} \frac{dA^{[L]}}{dZ^{[L]}} = -\frac{A^{[L]}(1-A^{[L]})y}{A^{[L]}} + \frac{A^{[L]}(1-A^{[L]})(1-y)}{1-A^{[L]}}  $$

$$ = -y + A^{[L]}y + A^{[L]} - A^{[L]}y
= A^{[L]} - y $$

$$ dZ^{[L]} = A^{[L]} - y $$

$$ Z^{[L]} = W^{[L]}A^{[L-1]} + b^{[L]} $$

$$ \frac{dZ^{[L]}}{dW^{[L]}} = A^{[L-1]}$$

$$ dW^{[L]} = (A^{[L]} - y)A^{[L-1]T} $$
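The last-layer result, $dZ^{[L]} = A^{[L]} - y$ as used in the algorithm steps, can be checked numerically with a finite difference on the single-example loss. This is a small sketch; the values of `z`, `y` and `eps` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Single-example cross-entropy loss as a function of the pre-activation z."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.3, 1.0, 1e-6

# Central finite difference approximation of dL/dz
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)

# Closed-form result from the derivation: a - y
analytic = sigmoid(z) - y

print(numeric, analytic)
```

The two values agree to many decimal places, and both are negative here because the prediction is below the label `y = 1`.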

**Backpropagation Layers**

$$\frac{dJ}{dW^{[l]}} =  \frac{1}{m}\frac{dJ}{dA^{[l]}} \frac{dA^{[l]}}{dZ^{[l]}} \frac{dZ^{[l]}}{dW^{[l]}}$$ 

While processing layer $l+1$, calculate $dA^{[l]}$ for the layer below it using that layer's $dZ^{[l+1]}$:

$$\frac{dJ}{dA^{[l]}} = \frac{dJ}{dZ^{[l+1]}}\frac{dZ^{[l+1]}}{dA^{[l]}} $$

$A^{[l+1]}$ in terms of $A^{[l]}$:
$$ A^{[l+1]} = g(W^{[l+1]}A^{[l]} + b^{[l+1]})  $$

$Z^{[l+1]}$ is a function of $A^{[l]}$:
$$ Z^{[l+1]} = W^{[l+1]}A^{[l]} + b^{[l+1]} $$

$$ \frac{dZ^{[l+1]}}{dA^{[l]}} = W^{[l+1]} $$

So the calculations for layer $l+1$ directly give us $dA^{[l]}$ (the transpose makes the dimensions line up when the gradient is passed backward):

$$\frac{dJ}{dA^{[l]}} = W^{[l+1]T} \frac{dJ}{dZ^{[l+1]}} $$

Now use $dA^{[l]}$ in layer $l$ to calculate the remaining terms:

$$ \frac{dA^{[l]}}{dZ^{[l]}} = g'(Z^{[l]})$$

$$ \frac{dZ^{[l]}}{dW^{[l]}} = A^{[l-1]} $$

So putting it all together (the product with $g'$ is elementwise):

$$\frac{dJ}{dW^{[l]}} =  \frac{1}{m}\frac{dJ}{dA^{[l]}} \frac{dA^{[l]}}{dZ^{[l]}} \frac{dZ^{[l]}}{dW^{[l]}} = \frac{1}{m}\left(\frac{dJ}{dA^{[l]}} g'(Z^{[l]})\right) A^{[l-1]T}
$$ 

where $dA^{[l]}$ was already calculated while processing layer $l+1$.

Save the $dZ^{[l]}$ part of this product, since it is needed to calculate $dA^{[l-1]}$ for the layer below:

$$ dZ^{[l]} = \frac{dJ}{dA^{[l]}} \frac{dA^{[l]}}{dZ^{[l]}} = 
\frac{dJ}{dA^{[l]}} g'(Z^{[l]}) $$
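The hidden-layer formulas can be sanity-checked with a numerical gradient on a tiny two-layer network. This is a sketch under stated assumptions: a tanh hidden layer (chosen so the finite difference is smooth), a sigmoid output, the `Z = W @ A + b` convention, and made-up shapes and seeds. None of it is taken from `NeuralNetNLayer`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1, b1, W2, b2, X, y):
    """Average cross-entropy of a tanh-hidden, sigmoid-output network."""
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))                          # 4 features, 6 examples
y = rng.integers(0, 2, size=(1, 6)).astype(float)
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)) * 0.5, np.zeros((1, 1))
m = X.shape[1]

# Forward pass, keeping the activations that backprop needs
A1 = np.tanh(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)

# Backward pass following the derivation
dZ2 = A2 - y                      # last layer
dA1 = W2.T @ dZ2                  # computed while processing layer 2
dZ1 = dA1 * (1 - A1 ** 2)         # elementwise product with g'(Z1) for tanh
dW1 = (dZ1 @ X.T) / m             # dZ times previous activations, averaged

# Finite-difference check of one entry of dW1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (cost(W1_plus, b1, W2, b2, X, y) - cost(W1_minus, b1, W2, b2, X, y)) / (2 * eps)

print(numeric, dW1[0, 0])
```

If the analytic and numeric values disagree, the usual culprits are a missing transpose or the last term being written as $W^{[l]}$ instead of $A^{[l-1]}$.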

- The math is easiest to follow for a single training example (using the per-example loss L instead of the cost J)
- Applying the gradient from each sample separately is stochastic gradient descent
- Applying all samples at once is regular (batch) gradient descent: extend the single-example result by taking the average gradient, (1/m)*grad
- Mini-batch gradient descent is somewhere in between
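The relationship between the three variants can be shown with a single-layer (logistic regression) model. The full-batch gradient equals the mean of the per-example gradients, and a mini-batch gradient is the same average taken over a subset. This is a sketch with hypothetical names and random toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_w(w, X, y):
    """Average cross-entropy gradient with respect to w over the columns of X."""
    m = X.shape[1]
    dZ = sigmoid(w @ X) - y
    return (dZ @ X.T) / m

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 8))                 # 3 features, 8 examples
y = rng.integers(0, 2, size=(1, 8))
w = rng.normal(size=(1, 3))

# Batch gradient descent: one gradient from all examples
batch_grad = grad_w(w, X, y)

# Stochastic gradient descent: one gradient per example
per_example = [grad_w(w, X[:, i:i + 1], y[:, i:i + 1]) for i in range(8)]
mean_of_singles = np.mean(per_example, axis=0)

# Mini-batch: the same average over a subset of the columns
minibatch_grad = grad_w(w, X[:, :4], y[:, :4])

print(np.allclose(batch_grad, mean_of_singles))
```

Stochastic and mini-batch updates use noisier estimates of this average gradient in exchange for more frequent parameter updates.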

$$ \frac{dJ}{dA^{[L]}} = -\frac{y}{\sigma(Z^{[L]})} + \frac{1-y}{1-\sigma(Z^{[L]})}$$

In [2]:
i = 5
print(i)

5
