# Training a Two-Layer Dense Neural Network Classifier on Spiral Data

We will train a two-layer dense neural network, using a cross-entropy loss, on our toy dataset of spiral points. After training our model, we will be able to visualize the decision boundary that it learned!

Our model consists of two dense layers:
- Layer 1: dense mapping with a ReLU activation.
  - Layer 1 has 100 neurons: $W_{1}$ has the shape (2, 100), $b_{1}$ has the shape (100,) 
  - $f_{1}(X; W_{1},b_{1}) = ReLU(XW_{1} + b_{1})$
- Layer 2: dense mapping with a softmax activation
  -  Layer 2 has 3 neurons: $W_{2}$ has the shape (100, 3), $b_{2}$ has the shape (3,) 
  - $f_{2}(X; W_{2}, b_{2}) = softmax(XW_{2} + b_{2})$

#### Composing Layers
$f_{model}(X; W_{1}, W_{2}, b_{1}, b_{2}) = f_{2}(f_{1}(X; W_{1}, b_{1}); W_{2}, b_{2})$

#### The Model in Full
$f(X; W_{1}, W_{2}, b_{1}, b_{2}) = softmax(ReLU(XW_{1} + b_{1})W_{2} + b_{2})$
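
To make the shapes concrete, here is a minimal NumPy-only sketch of this forward pass. It is an illustration, not the MyGrad code used for training later in this notebook: the helper names `relu_np`, `softmax_np`, and `forward_np`, the random inputs and weights, and the batch size of 5 are all assumptions introduced for this example.

```python
import numpy as np

def relu_np(x):
    # Element-wise max(0, x)
    return np.maximum(0, x)

def softmax_np(x):
    # Subtract the row-wise max for numerical stability; each row then sums to 1
    shifted = x - x.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def forward_np(X, W1, b1, W2, b2):
    # f(X) = softmax(ReLU(X W1 + b1) W2 + b2)
    hidden = relu_np(X @ W1 + b1)        # shape (N, 100)
    return softmax_np(hidden @ W2 + b2)  # shape (N, 3)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # 5 hypothetical 2D points
W1, b1 = rng.normal(size=(2, 100)), np.zeros(100)
W2, b2 = rng.normal(size=(100, 3)), np.zeros(3)

probs = forward_np(X, W1, b1, W2, b2)
print(probs.shape)        # (5, 3)
print(probs.sum(axis=1))  # each row sums to 1, up to floating-point error
```

Each row of `probs` is a distribution over the 3 classes for one input point, which is what lets us compare it against a one-hot label in the loss.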

Loss:
- cross-entropy loss
  - $L_{i} = -\sum_{k=0}^{2}{p^{(i)}_{k} \log{q^{(i)}_{k}}}$

  > $p^{(i)}$ is the **true** probability distribution for classification. E.g. $p^{(i)} = (0, 0, 1)$ if datum $i$ belongs to tendril 2.

  > $q^{(i)}$ is the **predicted** probability distribution for classification. E.g. $q^{(i)} = (0.1, 0.1, 0.8)$ predicts that datum $i$ is class 2 with an 80% probability.
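
As a worked instance of this formula, take the example distributions above, $p^{(i)} = (0, 0, 1)$ and $q^{(i)} = (0.1, 0.1, 0.8)$, and assume the natural logarithm. Only the term for the true class survives:

\begin{equation}
L_{i} = -\left(0 \cdot \log{0.1} + 0 \cdot \log{0.1} + 1 \cdot \log{0.8}\right) = -\log{0.8} \approx 0.223
\end{equation}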

In [None]:
import numpy as np

try:
    from jupyterthemes import jtplot
    jtplot.style()
except ImportError:
    pass

import matplotlib.pyplot as plt
%matplotlib notebook

In [None]:
from datasets import ToyData
toy_data = ToyData(num_classes=3)
xtrain, ytrain, xtest, ytest = toy_data.load_data()

In [None]:
toy_data.plot_spiraldata()

### Initializing Our Model Parameters
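We will be using an initialization technique known as "He-normal" initialization (pronounced "hey"). We draw all of our W-values from a normal distribution and then scale them by $\frac{1}{\sqrt{2N_{row}}}$, where $N_{row}$ is the number of rows in $W$. (Note that the usual formulation of He-normal initialization uses a standard deviation of $\sqrt{2/N_{row}}$; this notebook uses the smaller scale given here.)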

We need to take care in our initialization since we are now working with many more W-parameters: arrays of shape (2, 100) and (100, 3), instead of a single array of shape (2, 3)!
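
For a sense of the magnitudes involved, the sketch below evaluates the scale factor described above for the two weight shapes used in this notebook. The helper name `he_scale` is an assumption made for this illustration only.

```python
import numpy as np

def he_scale(shape):
    # Scale factor 1 / sqrt(2 * N_row), as described in the text above
    n_row = shape[0]
    return 1 / np.sqrt(2 * n_row)

print(he_scale((2, 100)))  # 0.5
print(he_scale((100, 3)))  # ~0.0707
```

The second layer has many more rows, so its weights are drawn from a much narrower distribution than those of the first layer.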

In [None]:
def he_normal(shape):
    """ Given the desired shape of your array, draws random
        values from a scaled-Gaussian distribution.
        
        Returns
        -------
        numpy.ndarray"""
    N = shape[0]
    scale = 1 / np.sqrt(2*N)
    return np.random.randn(*shape)*scale

In [None]:
def sgd(param, rate):
    """ Performs a gradient-descent update on the parameter.
    
        Parameters
        ----------
        param : mygrad.Tensor
            The parameter to be updated.
        
        rate : float
            The step size used in the update"""
    param.data -= rate*param.grad
    return None
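
As a standalone illustration of this update rule, the sketch below applies it to a plain NumPy value rather than a `mygrad.Tensor`. The objective $f(w) = w^2$, its hand-written gradient $2w$, the starting point, and the step size of `0.1` are all assumptions chosen for this example.

```python
import numpy as np

w = np.array(4.0)  # hypothetical starting value
rate = 0.1         # step size

for _ in range(50):
    grad = 2 * w      # analytic gradient of f(w) = w**2
    w -= rate * grad  # same form of update as in `sgd`: param -= rate * grad

print(w)  # close to 0, the minimizer of f
```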

In [None]:
def compute_accuracy(model_out, labels):
    """ Computes the mean accuracy, given predictions and true-labels.
        
        Parameters
        ----------
        model_out : numpy.ndarray, shape=(N, K)
            The predicted class-scores/probabilities
        labels : numpy.ndarray, shape=(N, K)
            The one-hot encoded labels for the data.
        
        Returns
        -------
        float
            The mean classification accuracy of the N samples."""
    return np.mean(np.argmax(model_out, axis=1) == np.argmax(labels, axis=1))

## Cross-Entropy Loss
Because we apply the softmax at the end of our neural network, the (N, 3) scores that we compute can be interpreted as (N, 3) probabilities. Just as with the scores, the largest probability determines which class we predict. For instance, we might get the following **predicted** probability distribution:

```
p_pred = [[0.1,  0.4,  0.5],
          [0.9, 0.05, 0.05],
          ...]
```

This means that datum-0 is predicted to be in tendril-2 (with 50% probability), datum-1 is predicted to be in tendril-0 (with 90% probability), and so on.

Our one-hot labels give us the "true" probability distribution. The correct class has a 100% probability of being correct, and the wrong classes have a 0% probability of being correct:

```
p_true = [[0, 0, 1],
          [1, 0, 0],
          ...]
```
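
The comparison between predicted and true distributions can be sketched in NumPy, using only the two example rows shown above (the elided `...` rows are left out). Taking `argmax` along the class axis recovers the predicted and true class of each datum, which is the same comparison that `compute_accuracy` performs.

```python
import numpy as np

# The two example rows from above
p_pred = np.array([[0.1, 0.4, 0.5],
                   [0.9, 0.05, 0.05]])
p_true = np.array([[0, 0, 1],
                   [1, 0, 0]])

pred_class = np.argmax(p_pred, axis=1)  # predicted class per datum
true_class = np.argmax(p_true, axis=1)  # true class per datum

# Fraction of data whose predicted class matches the true class
accuracy = np.mean(pred_class == true_class)
print(pred_class, true_class, accuracy)  # [2 0] [2 0] 1.0
```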

Cross-entropy measures how **dissimilar** two probability distributions are. We want our predicted probabilities to be close to our true distribution. Thus a large cross-entropy, meaning the distributions are dissimilar, is bad. A small cross-entropy, meaning the distributions are similar, is good. It is also a differentiable function: all the makings of a great loss function!
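
To illustrate, the NumPy sketch below evaluates the per-datum loss $L_{i}$ given at the start of this notebook for three hypothetical predictions against the same true label. The function name `cross_entropy_single` and the example predictions are assumptions made for this illustration; the natural logarithm is used.

```python
import numpy as np

def cross_entropy_single(p, q):
    # -sum_k p_k * log(q_k) for one datum; assumes every q_k > 0
    return -np.sum(p * np.log(q))

p_true = np.array([0.0, 0.0, 1.0])  # the datum belongs to class 2

confident_right = np.array([0.05, 0.05, 0.9])
uncertain = np.array([1 / 3, 1 / 3, 1 / 3])
confident_wrong = np.array([0.8, 0.1, 0.1])

print(cross_entropy_single(p_true, confident_right))  # ~0.105
print(cross_entropy_single(p_true, uncertain))        # ~1.099
print(cross_entropy_single(p_true, confident_wrong))  # ~2.303
```

The loss grows as the probability assigned to the true class shrinks, so a confidently wrong prediction is penalized far more than an uncertain one.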

The following is the cross-entropy loss for datum-i. $p$ is the **true** probability distribution. $q$ is the **predicted** probability distribution. The sum is over the different possible classes:

\begin{equation}
L_{i} = -\sum_{k=0}^{2}{p^{(i)}_{k} \log{q^{(i)}_{k}}}
\end{equation}

Thus our total cross-entropy loss is just the mean of these per-datum losses over our N data points.
\begin{equation}
L = \frac{1}{N}\sum_{i=0}^{N-1}{L_{i}} = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{k=0}^{2}{p^{(i)}_{k} \log{q^{(i)}_{k}}}
\end{equation}

Let's code this up!

In [None]:
from mygrad.math import log
def cross_entropy(p_pred, p_true):
    """ Computes the mean cross-entropy.
        
        Parameters
        ----------
        p_pred : mygrad.Tensor, shape:(N, K)
            N predicted distributions, each over K classes.
        
        p_true : mygrad.Tensor, shape:(N, K)
            N 'true' distributions, each over K classes
        
        Returns
        -------
        mygrad.Tensor, shape=()
            The mean cross entropy (scalar)."""
    
    N = p_pred.shape[0]
    p_logq = p_true * log(p_pred)
    return (-1 / N) * p_logq.sum()

In [None]:
from mygrad.nnet.layers import dense
from mygrad.nnet.activations import softmax, relu

Using a learning rate of `1.` and **no** regularization, train your model for 1000 iterations. Record the loss and accuracy for each iteration. Plot them afterwards.

In [None]:
from mygrad import Tensor

D = 2  # dimensionality of a piece of input data
K = 3  # number of distinct classes

W1 = Tensor(he_normal((D, 100)))
b1 = Tensor(np.zeros((100,), dtype=W1.dtype))
W2 = Tensor(he_normal((100, K)))
b2 = Tensor(np.zeros((K,), dtype=W2.dtype))
params = [b1, W1, b2, W2]

rate = 1.

l = []
acc = []
for i in range(1000):
    o1 = relu(dense(xtrain, W1) + b1)
    p_pred = softmax(dense(o1, W2) + b2)
    
    # Li = -1 * sum_over_classes(p_true * log(p_predict))
    # L = 1/N sum_over_i(Li)
    loss = cross_entropy(p_pred=p_pred, p_true=ytrain)
    
    l.append(loss.data.item())
    loss.backward()

    acc.append(compute_accuracy(p_pred.data, ytrain))
    
    for param in params:
        sgd(param, rate)
    
    loss.null_gradients()

In [None]:
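# Plot training performance: loss (top) and accuracy (bottom) per iteration
fig, (ax, ax2) = plt.subplots(nrows=2)
ax.plot(l)
ax.set_ylabel("cross-entropy loss")
ax2.plot(acc)
ax2.set_ylabel("accuracy")
ax2.set_xlabel("iteration")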

In [None]:
def fwd_pass(x):
    """ Computes the forward-pass of our model, using numpy arrays
        since we don't need to bother with back-prop when computing
        predictions."""
    o1 = relu(dense(x, W1.data) + b1.data)
    o2 = softmax(dense(o1, W2.data) + b2.data)
    return o2.data

In [None]:
toy_data.visualize_model(fwd_pass, entropy=False)

In [None]:
toy_data.visualize_model(fwd_pass, entropy=True)