# Background

I wrote this notebook as a training exercise to understand feedforward neural networks using the simplest possible case. The naming conventions in the code match those used in [Andrew Ng's](http://andrewng.com) free online [course in Machine Learning on Coursera](http://ml-class.org) (highly recommended). The example neural network has a single hidden layer.

Here's how the neural network is connected, along with the equations for calculating the hypothesis, $h_{\theta}(x)$.
![Feed Forward](feedforward.png)

This neural network also uses [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) during training: the difference between the hypothesis and the training data is propagated backward through the network and used to update the weights ($\theta$ values).
![Backpropagation](backpropagation.png)

The example has a trivial training set with **X** equal to

<table width="50%">
<tr><td>0</td><td>0</td></tr>
<tr><td>0</td><td>1</td></tr>
<tr><td>1</td><td>0</td></tr>
<tr><td>1</td><td>1</td></tr>
</table>

and the **y** vector used for this supervised learning matches the [exclusive or (XOR)](https://en.wikipedia.org/wiki/Exclusive_or) pattern.

<table width="50%">
<tr><td>0</td></tr>
<tr><td>1</td></tr>
<tr><td>1</td></tr>
<tr><td>0</td></tr>
</table>


*Note: the images above are from Andrew Ng's [Machine Learning Course](http://ml-class.org).*

In [1]:
# NumPy is the fundamental package for scientific computing with Python.
import numpy as np

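The `theta_init` function is used to initialize the thetas (weights) in the network. It returns a random matrix of shape `(in_size + 1, out_size)` with values drawn uniformly from the range [-epsilon, epsilon). The extra row holds the weights for the bias unit.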

In [2]:
def theta_init(in_size, out_size, epsilon = 0.12):
    return np.random.rand(in_size + 1, out_size) * 2 * epsilon - epsilon
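
As a quick sanity check, the sketch below calls `theta_init` and inspects the result. The layer sizes (2 inputs, 5 hidden nodes) are simply the ones this notebook uses later; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def theta_init(in_size, out_size, epsilon=0.12):
    # Same definition as in the cell above, repeated so this sketch is self-contained.
    return np.random.rand(in_size + 1, out_size) * 2 * epsilon - epsilon

# 2 input features feeding 5 hidden nodes.
theta = theta_init(2, 5)

# One row per input feature plus one row for the bias unit.
print(theta.shape)

# Largest absolute weight; it should not exceed epsilon (0.12).
print(np.abs(theta).max())
```

Small random starting weights break the symmetry between hidden nodes; if every weight started at the same value, every hidden node would receive the same update.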

This network uses a sigmoid activation function. The [sigmoid derivative](http://kawahara.ca/how-to-compute-the-derivative-of-a-sigmoid-function-fully-worked-example/) is used during backpropagation.

The sigmoid function ($\sigma$) is: $\sigma(x) = \frac{1}{1+e^{-x}}$

The derivative of the sigmoid function is: $\sigma'(x) = \sigma(x) \times (1 - \sigma(x))$
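
For readers who want the intermediate steps, the derivative follows from the chain rule applied to the definition of $\sigma$, together with the identity $1 - \sigma(x) = \frac{e^{-x}}{1+e^{-x}}$:

$$
\frac{d}{dx}\,\sigma(x)
= \frac{d}{dx}\left(1+e^{-x}\right)^{-1}
= \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}
= \sigma(x) \times (1 - \sigma(x))
$$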

In [3]:
def sigmoid(x):
    return np.divide(1.0, (1.0 + np.exp(-x)))

def sigmoid_derivative(x):
    # x must already be a sigmoid output (an activation), not the raw input.
    return np.multiply(x, (1.0 - x))
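
Note that `sigmoid_derivative` expects a value that has already passed through `sigmoid`. The sketch below checks this numerically by comparing it against a central finite difference; the sample points and step size `h` are arbitrary choices for illustration, and the argument is named `a` here only to stress that it is an activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(a):
    # a is assumed to be sigmoid(x), matching how the notebook calls it.
    return a * (1.0 - a)

x = np.array([-2.0, 0.0, 3.0])  # arbitrary sample points
h = 1e-6                        # arbitrary small step

# Central finite-difference estimate of the slope of sigmoid at x.
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Analytic slope: pass the activation, not x itself.
analytic = sigmoid_derivative(sigmoid(x))

print(np.allclose(numeric, analytic))
```

Passing the raw input `x` instead of `sigmoid(x)` would compute `x * (1 - x)`, which is not the slope of the sigmoid.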

The [mean squared error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) provides a measure of the distance between the actual values and the values estimated by the neural network. The function below takes the difference between the two (the error) as its argument.

In [4]:
def mean_squared_error(X):
    return np.power(X, 2).mean(axis=None)
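
A small worked example may help here. The predicted values below are made up for illustration; the point is that the function receives the element-wise difference rather than the two vectors separately.

```python
import numpy as np

def mean_squared_error(X):
    # Same definition as in the cell above; X is the error (actual - predicted).
    return np.power(X, 2).mean(axis=None)

actual = np.array([[0.0], [1.0], [1.0], [0.0]])
predicted = np.array([[0.1], [0.8], [0.9], [0.2]])  # made-up predictions

# Errors are -0.1, 0.2, 0.1, -0.2, so the squares are 0.01, 0.04, 0.01, 0.04.
mse = mean_squared_error(actual - predicted)
print(mse)  # approximately 0.025, the mean of the four squares
```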

The `nn_train` function trains an artificial neural network with a single hidden layer. Each column in **X** is a feature and each row in **X** is a single training observation. The **y** value contains the classifications for each observation. For multiclass problems, **y** will have more than one column. After training, this function returns the calculated theta values (weights) that can be used for predictions.

Training ends when the desired error or the maximum number of iterations is reached, whichever comes first.

In [5]:
def nn_train(X, y, desired_error = 0.001, max_iterations = 100000, hidden_nodes = 5):

    m = X.shape[0] # number of rows (samples)
    input_nodes = X.shape[1] # number of columns
    print(f"Training data with {m} samples and {input_nodes} features")
    assert m == y.shape[0], f"There are {m} samples in X but {y.shape[0]} y values"
    output_nodes = y.shape[1]

    a1 = np.insert(X, 0, 1, axis=1) # add a_0^(1)
    theta1 = theta_init(input_nodes, hidden_nodes)
    theta2 = theta_init(hidden_nodes, output_nodes)
    
    for x in range(0, max_iterations):
        # Feedforward
        a2 = np.insert(sigmoid(a1.dot(theta1)), 0, 1, axis=1)
        a3 = sigmoid(a2.dot(theta2))
        
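        # Output error (y - a3) and the stopping check
        a3_delta = np.subtract(y, a3)
        mse = mean_squared_error(a3_delta)
        if mse <= desired_error:
            print(f"Achieved requested MSE {mse} at iteration {x}")
            break

        # Backpropagate the output error to the hidden layer
        a2_error = a3_delta.dot(theta2.T)
        a2_delta = np.multiply(a2_error, sigmoid_derivative(a2))
        
        # Update thetas to reduce the error on the next iteration.
        # a3_delta is (y - a3), so the updates are added rather than subtracted.
        theta2 += np.divide(a2.T.dot(a3_delta), m)
        # Column 0 of a2_delta belongs to the hidden bias unit, which has no
        # incoming weights, so it is dropped before updating theta1.
        theta1 += np.delete(np.divide(a1.T.dot(a2_delta), m), 0, 1)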
        
    return (theta1, theta2)
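
To make the matrix shapes in the loop easier to follow, here is a sketch of a single training iteration on the XOR data. It departs from the notebook in two labeled ways: it uses plain NumPy arrays instead of `np.matrix`, and it draws the initial weights with `rng.uniform` and an arbitrary seed instead of calling `theta_init`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
m, hidden_nodes, epsilon = X.shape[0], 5, 0.12

# Assumption: uniform initial weights stand in for theta_init.
theta1 = rng.uniform(-epsilon, epsilon, size=(X.shape[1] + 1, hidden_nodes))
theta2 = rng.uniform(-epsilon, epsilon, size=(hidden_nodes + 1, y.shape[1]))

# Feedforward: a bias column of ones is prepended to a1 and a2.
a1 = np.insert(X, 0, 1, axis=1)
a2 = np.insert(sigmoid(a1.dot(theta1)), 0, 1, axis=1)
a3 = sigmoid(a2.dot(theta2))

# Backpropagation: output error, then hidden-layer error.
a3_delta = y - a3
a2_delta = a3_delta.dot(theta2.T) * (a2 * (1.0 - a2))

# Weight steps; the bias column of a2_delta is dropped for theta1.
step2 = a2.T.dot(a3_delta) / m
step1 = np.delete(a1.T.dot(a2_delta) / m, 0, axis=1)

for name, arr in [("a1", a1), ("a2", a2), ("a3", a3),
                  ("a2_delta", a2_delta), ("step1", step1), ("step2", step2)]:
    print(name, arr.shape)
```

The bias entries of `a2` are exactly 1, so `a2 * (1 - a2)` is 0 in column 0 and that column of `a2_delta` carries no information. Dropping it leaves `step1` with the same shape as `theta1`.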

The `nn_predict` function takes the theta values (weights) calculated by `nn_train` to make predictions about the data in **X**.

In [6]:
def nn_predict(X, theta1, theta2):
    a2 = sigmoid(np.insert(X, 0, 1, axis=1).dot(theta1))
    return sigmoid(np.insert(a2, 0, 1, axis=1).dot(theta2))

# Example

We start by plugging our data and classifications into our neural network, which returns the weights we can use to make predictions.

In [7]:
X = np.matrix('0 0; 0 1; 1 0; 1 1')
y = np.matrix('0; 1; 1; 0')
print(X)
print(y)

[[0 0]
 [0 1]
 [1 0]
 [1 1]]
[[0]
 [1]
 [1]
 [0]]


In [8]:
(theta1, theta2) = nn_train(X, y)
print ("\nTrained weights for calculating the hidden layer from the input layer")
print (theta1)
print ("\nTrained weights for calculating from the hidden layer to the output layer")
print (theta2)

Training data with 4 samples and 2 features
Achieved requested MSE 0.0009971347541874588 at iteration 3262

Trained weights for calculating the hidden layer from the input layer
[[ 0.05187587 -0.75564316  6.01590861  2.55036487 -0.61780587]
 [-0.05123144  0.85352135 -3.98152019 -6.28200152  0.79135377]
 [-0.18365105  0.857064   -3.97298513 -6.2828984   0.82203806]]

Trained weights for calculating from the hidden layer to the output layer
[[-0.91130206]
 [-0.20098316]
 [-2.29209456]
 [ 8.0253236 ]
 [-9.79200751]
 [-2.15564216]]


Now that we've trained the neural network, we can use the weights to make predictions.

In [9]:
# Our test input contains the same four rows as our training input 'X', in a different order
X_test = np.matrix('1 1; 0 1; 0 0; 1 0')
y_test = np.matrix('0; 1; 0; 1')
y_calc = nn_predict(X_test, theta1, theta2)
y_diff = np.subtract(y_test, y_calc)
print ("The MSE for our test set is %f" % (mean_squared_error(y_diff)))
print (np.concatenate((y_test, y_calc), axis=1))

The MSE for our test set is 0.000997
[[0.         0.03832304]
 [1.         0.97007261]
 [0.         0.02713896]
 [1.         0.97020551]]


Column one is the correct value; column two is the value predicted by this simple neural network. Every prediction is within 0.04 of its target.

The neural network correctly learned the XOR pattern.
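
The network outputs values between 0 and 1 rather than hard labels. One way to confirm the result is to threshold the predictions and compare them with XOR computed directly. The sketch below does this using the trained weights copied from the output printed above (rounded to 8 decimals); the 0.5 cut-off and the use of plain arrays instead of `np.matrix` are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nn_predict(X, theta1, theta2):
    # Same logic as the notebook's nn_predict.
    a2 = sigmoid(np.insert(X, 0, 1, axis=1).dot(theta1))
    return sigmoid(np.insert(a2, 0, 1, axis=1).dot(theta2))

# Weights copied from the printed training output in this notebook.
theta1 = np.array([
    [ 0.05187587, -0.75564316,  6.01590861,  2.55036487, -0.61780587],
    [-0.05123144,  0.85352135, -3.98152019, -6.28200152,  0.79135377],
    [-0.18365105,  0.857064,   -3.97298513, -6.2828984,   0.82203806],
])
theta2 = np.array([[-0.91130206], [-0.20098316], [-2.29209456],
                   [ 8.0253236], [-9.79200751], [-2.15564216]])

X_test = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
y_calc = nn_predict(X_test, theta1, theta2)

threshold = 0.5  # assumed cut-off between class 0 and class 1
labels = (y_calc >= threshold).astype(int).ravel()

# XOR computed directly from the two input columns.
expected = np.logical_xor(X_test[:, 0], X_test[:, 1]).astype(int)

print(labels)
print(np.array_equal(labels, expected))
```

A different training run would produce different weights, since the initial thetas are random, so the numbers here apply only to the run shown in this notebook.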