In [39]:
import numpy as np

In [54]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [74]:
from NeuralNetNLayer import NeuralNet

**Load in and transform the dataset**

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
np.random.seed(1)

In [4]:
df = pd.read_csv('mnist/train.csv')

In [5]:
# Keep only the digits 0 and 1 to turn this into a binary classification problem
df_two = df[df['label'] <= 1]
df_two.label.value_counts()

1    4684
0    4132
Name: label, dtype: int64

In [6]:
# Check out the data - pixel values range from 0 to 255
df_two.head()

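| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |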


In [9]:
# Split into X and y training and validation sets 
X_train, X_test, y_train, y_test = train_test_split(df_two.drop('label', axis=1), df_two['label'], train_size = 0.67)

In [32]:
# Standardize before transposing (sklearn assumes observations x features format)
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_s = scaler.transform(X_train)
# Transpose X to features x observations (transform already returns a NumPy array)
X_t = X_s.T
X_t.shape

(784, 5906)

In [29]:
# Convert y to a NumPy array and reshape it into a (1 x observations) row vector
y_t = y_train.values.reshape(1, len(y_train))
y_t.shape

(1, 5906)

**Create and train the model**

In [89]:
# Create the model with random initialization (X_t is the standardized, transposed training set)
clf = NeuralNet(X_t, y_t, [6, 3, 1])

In [90]:
# Train
clf.train(0.5, 400)

Iteration: 0 , Cost, Accuracy 0.6931515484101237 47.9004402302743
Iteration: 100 , Cost, Accuracy 0.03890550505828735 99.94920419911954
Iteration: 200 , Cost, Accuracy 0.014119458725749564 100.0
Iteration: 300 , Cost, Accuracy 0.00849792250452857 100.0


In [91]:
# Create the model with He initialization (L2 regularization is applied at training time)
clf_reg = NeuralNet(X_t, y_t, [6, 3, 1], initialization='he')
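
He initialization is not spelled out in this notebook, so here is a minimal sketch of what `initialization='he'` plausibly does, assuming weights of shape (units in layer, units in previous layer) and zero biases. `he_initialize` and its `seed` argument are hypothetical names used for illustration only; they are not taken from `NeuralNetNLayer`.

```python
import numpy as np

def he_initialize(layer_dims, seed=1):
    """Hypothetical helper: He initialization for a fully connected network.

    layer_dims lists the unit counts with the input layer first,
    e.g. [784, 6, 3, 1] for 784 pixels and hidden/output sizes [6, 3, 1].
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        # Scale standard-normal draws by sqrt(2 / fan_in)
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        # Biases start at zero
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = he_initialize([784, 6, 3, 1])
```

Scaling by the square root of 2 over the fan-in keeps the variance of the pre-activations roughly constant from layer to layer when the hidden units are ReLU-like.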


In [92]:
# Train with L2 regularization
clf_reg.train(0.5, 500, lambd=0.5)

Iteration: 0 , Cost, Accuracy 0.6158982195925158 47.17236708432103
Iteration: 100 , Cost, Accuracy 0.0224070735931505 100.0
Iteration: 200 , Cost, Accuracy 0.010984320981792299 100.0
Iteration: 300 , Cost, Accuracy 0.00725203808159974 100.0
Iteration: 400 , Cost, Accuracy 0.00541335532532747 100.0
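
The `lambd` argument is the L2 regularization strength. As a rough sketch, assuming the usual formulation in which the penalty $\frac{\lambda}{2m}\sum_l \lVert W^{[l]} \rVert^2$ is added to the cost and $\frac{\lambda}{m} W^{[l]}$ is added to each $dW^{[l]}$ (the `NeuralNet` class may implement it differently), the two extra terms look like this. `l2_penalty` and `l2_gradient` are hypothetical helper names.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 term added to the cost: (lambd / (2m)) * sum of all squared weights."""
    return (lambd / (2.0 * m)) * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lambd, m):
    """Extra term added to dW for one layer: (lambd / m) * W."""
    return (lambd / m) * W

# Toy weights for a two-layer network and a batch of m = 10 examples
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.5, m=10)
extra_dW2 = l2_gradient(W2, lambd=0.5, m=10)
```

Biases are left out of the penalty in this sketch, which is a common choice.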


**Test accuracy on the validation dataset**

In [93]:
# Standardize with the scaler fitted on the training set, then transpose
X_s = scaler.transform(X_test)
X_v = X_s.T
X_v.shape

(784, 2910)

In [94]:
y_v = y_test.values.reshape(1, len(y_test))
y_v.shape

(1, 2910)

In [95]:
clf.validation_accuracy(X_v=X_v, y_v=y_v)

96.73539518900344

In [96]:
# He initialization with L2 regularization gives higher validation accuracy (97.77% vs. 96.74%)
clf_reg.validation_accuracy(X_v=X_v, y_v=y_v)

97.76632302405498
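
For reference, an accuracy figure like the ones above can be computed from the network's output probabilities roughly as follows. This is a sketch under the assumption that predictions are thresholded at 0.5 and reported as a percentage; `accuracy_percent` is a hypothetical function, not the actual `validation_accuracy` method.

```python
import numpy as np

def accuracy_percent(probs, y, threshold=0.5):
    """Percentage of examples whose thresholded probability matches the label.

    probs and y are both (1 x observations) arrays.
    """
    preds = (probs > threshold).astype(int)
    return 100.0 * np.mean(preds == y)

probs = np.array([[0.9, 0.2, 0.6, 0.4]])
y = np.array([[1, 0, 0, 0]])
acc = accuracy_percent(probs, y)  # three of the four predictions match
```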

**Forward Propagation**

**Backpropagation**

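First initialize the backward pass at the last layer. With a sigmoid output and the cross-entropy cost (0/1 classification), the gradient with respect to $Z^{[L]}$ simplifies to:

$dZ^{[L]} = A^{[L]} - y$

$dW^{[L]} = \frac{1}{m} dZ^{[L]} • A^{[L-1]T}$

$db^{[L]} = \frac{1}{m} \sum_{i} dZ^{[L](i)}$

Next calculate $dA^{[L-1]}$ to use in the next step of the backward pass (the $dZ$ of layer $L-1$):

$dA^{[L-1]} = W^{[L]T} • dZ^{[L]}$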
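
The last-layer step can be sketched in NumPy as below. It assumes the forward pass computes $Z^{[L]} = W^{[L]}A^{[L-1]} + b^{[L]}$ with examples stored as columns, so that $dW^{[L]}$ is $dZ^{[L]}$ times the transpose of $A^{[L-1]}$; `output_layer_grads` and its argument names are hypothetical.

```python
import numpy as np

def output_layer_grads(A_L, A_prev, W_L, y):
    """Gradients for a sigmoid output layer with cross-entropy cost."""
    m = y.shape[1]
    dZ = A_L - y                                  # (1, m)
    dW = (dZ @ A_prev.T) / m                      # same shape as W_L
    db = np.sum(dZ, axis=1, keepdims=True) / m    # (1, 1)
    dA_prev = W_L.T @ dZ                          # same shape as A_prev
    return dW, db, dA_prev

# Toy example: 3 units in layer L-1, 5 training examples
rng = np.random.default_rng(0)
A_prev = rng.random((3, 5))
W_L = rng.standard_normal((1, 3))
b_L = np.zeros((1, 1))
A_L = 1.0 / (1.0 + np.exp(-(W_L @ A_prev + b_L)))
y = np.array([[1, 0, 1, 1, 0]])
dW, db, dA_prev = output_layer_grads(A_L, A_prev, W_L, y)
```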

### Calculus Derivations

Start from the cross-entropy cost function:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
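
Equation (1) translates almost directly into NumPy. The sketch below is an illustration rather than the notebook's implementation; the `eps` clipping is an added assumption to avoid `log(0)`, and `cross_entropy_cost` is a hypothetical name.

```python
import numpy as np

def cross_entropy_cost(A_L, y, eps=1e-12):
    """Binary cross-entropy cost J from equation (1).

    A_L and y are (1 x m) arrays; A_L holds the sigmoid outputs.
    """
    m = y.shape[1]
    A = np.clip(A_L, eps, 1 - eps)  # keep the logs finite
    return float(-np.sum(y * np.log(A) + (1 - y) * np.log(1 - A)) / m)

# A network that outputs 0.5 for every example has cost ln(2)
cost_at_half = cross_entropy_cost(np.full((1, 4), 0.5), np.array([[1, 0, 1, 0]]))
```

A cost close to $\ln 2 \approx 0.693$, like the iteration 0 value in the first training log, is what an untrained network with outputs near 0.5 produces.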

**Last Layer L**



$$\frac{dJ}{dW^{[L]}} =  \frac{dJ}{dA^{[L]}} \frac{dA^{[L]}}{dW^{[L]}} $$

$$ = \frac{dJ}{dA^{[L]}} \frac{dA^{[L]}}{dZ^{[L]}} \frac{dZ^{[L]}}{dW^{[L]}}$$

The derivatives below are written for a single example; averaging over the batch adds the factor $\frac{1}{m}$.

$$ \frac{dJ}{dA^{[L]}} = -\frac{y}{A^{[L]}} + \frac{1-y}{1-A^{[L]}}$$

$$ A^{[L]} = \sigma(Z^{[L]})   $$


$$ \frac{dA^{[L]}}{dZ^{[L]}} =  \sigma(Z^{[L]})(1-\sigma(Z^{[L]})) = A^{[L]}(1-A^{[L]}) $$

$$ \frac{dJ}{dZ^{[L]}} = {dZ^{[L]}} =  \frac{dJ}{dA^{[L]}} \frac{dA^{[L]}}{dZ^{[L]}} = -\frac{A^{[L]}(1-A^{[L]})y}{A^{[L]}} + \frac{A^{[L]}(1-A^{[L]})(1-y)}{1-A^{[L]}}  $$

$$ = -y + A^{[L]}y + A^{[L]} - A^{[L]}y
= A^{[L]} - y $$

$$ dZ^{[L]} = A^{[L]} - y $$

$$ Z^{[L]} = W^{[L]}A^{[L-1]} + b^{[L]} $$

$$ \frac{dZ^{[L]}}{dW^{[L]}} = A^{[L-1]}$$

$$ dW^{[L]} = (A^{[L]} - y)A^{[L-1]T} $$
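
A finite-difference check is a quick way to settle the sign of $dZ^{[L]}$. For the per-example loss inside cost (1), the numerical derivative with respect to $Z^{[L]}$ agrees with $A^{[L]} - y$. The values of `z` and `y` below are arbitrary test inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Per-example cross-entropy loss as a function of the pre-activation z."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([-2.0, -0.5, 0.3, 1.7])
y = np.array([1.0, 0.0, 1.0, 0.0])

h = 1e-6
numeric = (loss(z + h, y) - loss(z - h, y)) / (2 * h)  # central difference
analytic = sigmoid(z) - y                              # A - y
max_error = np.max(np.abs(numeric - analytic))
```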

**Backpropagation Layers**

$$\frac{dJ}{dW^{[l]}} =  \frac{1}{m}\frac{dJ}{dA^{[l]}} \frac{dA^{[l]}}{dZ^{[l]}} \frac{dZ^{[l]}}{dW^{[l]}}$$ 

The backward step for layer $l+1$ has already produced $\frac{dJ}{dZ^{[l+1]}}$, which gives this layer's $dA$:

$$\frac{dJ}{dA^{[l]}} = \frac{dJ}{dZ^{[l+1]}}\frac{dZ^{[l+1]}}{dA^{[l]}} $$

$A^{[l+1]}$ in terms of $A^{[l]}$:
$$ A^{[l+1]} = g(W^{[l+1]}A^{[l]} + b^{[l+1]})  $$

$Z^{[l+1]}$ is a function of $A^{[l]}$:
$$ Z^{[l+1]} = W^{[l+1]}A^{[l]} + b^{[l+1]} $$

$$ \frac{dZ^{[l+1]}}{dA^{[l]}} = W^{[l+1]} $$

So the calculations for layer $l+1$ give us $dA^{[l]}$ directly (the transpose appears because the gradient is propagated backward through $W^{[l+1]}$):

$$\frac{dJ}{dA^{[l]}} = W^{[l+1]T} • \frac{dJ}{dZ^{[l+1]}} $$

Now use $dA^{[l]}$ in this layer and calculate the remaining factors:

$$ \frac{dA^{[l]}}{dZ^{[l]}} = g'(Z^{[l]})$$

$$ \frac{dZ^{[l]}}{dW^{[l]}} = A^{[l-1]} $$

Putting it all together, with $\odot$ denoting the elementwise product:

$$\frac{dJ}{dW^{[l]}} =  \frac{1}{m}\frac{dJ}{dA^{[l]}} \frac{dA^{[l]}}{dZ^{[l]}} \frac{dZ^{[l]}}{dW^{[l]}} = \frac{1}{m}\left(\frac{dJ}{dA^{[l]}} \odot g'(Z^{[l]})\right) • A^{[l-1]T}
$$ 

where $dA^{[l]}$ was already calculated in the backward step for layer $l+1$.

Save the $dZ$ part of this product for use in the next backward step (layer $l-1$):

$$ dZ^{[l]} = \frac{dJ}{dA^{[l]}} \frac{dA^{[l]}}{dZ^{[l]}} = 
\frac{dJ}{dA^{[l]}} \odot g'(Z^{[l]}) $$
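
One backward step for a hidden layer might look like the sketch below. It assumes ReLU as the hidden activation $g$ (the notebook does not say which activation is used) and a forward pass of the form $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$; `hidden_layer_backward` and `relu_grad` are hypothetical names.

```python
import numpy as np

def relu_grad(Z):
    """g'(Z) for ReLU: 1 where Z > 0, else 0 (ReLU is an assumption here)."""
    return (Z > 0).astype(float)

def hidden_layer_backward(dA, Z, A_prev, W):
    """One backward step for hidden layer l.

    dA: gradient of the cost w.r.t. this layer's activations, from layer l+1
    Z: this layer's cached pre-activations
    A_prev: activations of layer l-1
    W: this layer's weights, shape (units in l, units in l-1)
    """
    m = A_prev.shape[1]
    dZ = dA * relu_grad(Z)                       # elementwise product
    dW = (dZ @ A_prev.T) / m                     # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                           # passed on to layer l-1
    return dW, db, dA_prev

# Toy shapes: 4 units in layer l-1, 3 units in layer l, 7 examples
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 7))
W = rng.standard_normal((3, 4))
b = np.zeros((3, 1))
Z = W @ A_prev + b
dA = rng.standard_normal((3, 7))
dW, db, dA_prev = hidden_layer_backward(dA, Z, A_prev, W)
```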

- The math is simplest for a single training example (using the per-example loss $\mathcal{L}$ instead of the cost $J$)
- Applying the gradient from each example separately is stochastic gradient descent
- Applying all examples at once is regular (batch) gradient descent: extend the single-example result by averaging the per-example gradients, (1/m)*grad
- Mini-batch gradient descent sits in between: average the gradient over a subset of the examples
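
The three variants differ only in how many examples feed each update. A small sketch, assuming examples are stored as columns as in the rest of this notebook; `iterate_batches` is a hypothetical helper and the random data is a stand-in for the real training set.

```python
import numpy as np

def iterate_batches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs, with examples stored as columns.

    batch_size = 1 gives stochastic gradient descent,
    batch_size = number of examples gives regular (batch) gradient descent,
    anything in between is mini-batch gradient descent.
    """
    m = X.shape[1]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], y[:, idx]

rng = np.random.default_rng(0)
X = rng.standard_normal((784, 10))       # stand-in data: 10 examples
y = rng.integers(0, 2, size=(1, 10))
sizes = [xb.shape[1] for xb, _ in iterate_batches(X, y, batch_size=4, rng=rng)]
```

Each yielded batch would then go through one forward pass, one backward pass, and one parameter update.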

$$ \frac{dJ}{dA^{[L]}} = -\frac{y}{\sigma(Z^{[L]})} + \frac{1-y}{1-\sigma(Z^{[L]})}$$