# Neural Networks
Neural networks are a machine learning technique that functions similarly to the neurons in our brain. In this notebook, we'll cover some of the basics of neural networks and walk through an example or two.

## Basic Function
Neural networks have three parts: the inputs, the outputs, and the "hidden" layers. The inputs are the data fed into the neural network and the outputs are the predictions of the neural network. The hidden layers are the inner workings of the neural network and are where the magic happens. Well, not really magic, just math.

There can be multiple hidden layers but for this example, let's just have a single hidden layer. Suppose we have three inputs $x_1, x_2, x_3$ and are trying to predict $y$. The hidden layer creates a weighted sum of the inputs before returning an output value. Mathematically, if input $x_i$ has weight $W_i$, the output from the hidden layer is $y_{hidden}=W_1*x_1 + W_2*x_2 + W_3*x_3$. A hidden layer can have multiple neurons, each with its own set of weights, so we'd have to find $y_{hidden}$ for each neuron.
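
As a minimal sketch of the weighted sum above (the input values and weights here are made up purely for illustration), the sum for one neuron is a dot product, and a layer with several neurons is a matrix product with one column of weights per neuron:

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the arithmetic.
x = np.array([1.0, 0.0, 1.0])      # inputs x_1, x_2, x_3
W = np.array([0.5, -0.2, 0.1])     # weights W_1, W_2, W_3 for one neuron

# Weighted sum for a single neuron: W_1*x_1 + W_2*x_2 + W_3*x_3
y_hidden = np.dot(x, W)
print(y_hidden)

# Two neurons: one column of weights per neuron, giving one sum per neuron.
W_layer = np.array([[0.5, 1.0],
                    [-0.2, 0.3],
                    [0.1, -1.0]])
y_hidden_layer = np.dot(x, W_layer)
print(y_hidden_layer)
```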

Here's where the "neural" part comes into play. In the brain, neurons only fire if the input signals are above a certain threshold. To create that threshold, an activation function is used, which tells the neuron whether to "fire" or not. While the activation function could be chosen to be a step function to simulate this binary nature, a sigmoid function, $f(x)=\frac{1}{1+e^{-x}}$, is often used so that we instead get a value that can be read as the probability of the neuron firing (since the sigmoid function varies between 0 and 1). The weighted sum from the hidden layer is passed through the activation function, which then gives the actual output of that hidden layer neuron. In our single hidden layer example, the hidden layer's output will be the output of the neural network since there are no more layers to pass through.
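
To make the difference between the two activation choices concrete, here is a small sketch comparing a step function with the sigmoid on a few example weighted sums. The `step` helper and its `threshold` argument are hypothetical names introduced for this illustration only:

```python
import numpy as np

def step(z, threshold=0.0):
    """Binary activation: 1 if the weighted sum exceeds the threshold, else 0."""
    return (z > threshold).astype(float)

def sigmoid(z):
    """Smooth activation that maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # example weighted sums
print(step(z))       # hard 0/1 decisions
print(sigmoid(z))    # values rising smoothly from near 0 to near 1
```

The step function throws away how far the sum is from the threshold, while the sigmoid keeps that information, which is also what makes it usable with derivative-based training later on.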

Let's actually do an example now to see how this works.

### Example
The code is shamelessly based off an example I found at https://iamtrask.github.io/2015/07/12/basic-python-network/. I've changed variable names to make it easier to follow.

First, let's import numpy since we'll need it for all the data manipulations.

In [None]:
import numpy as np

Now let's make some training data. Let's suppose we have three binary inputs and a single binary output.

| Input 1 | Input 2 | Input 3 | Output |
|---------|---------|---------|--------|
| 0       | 0       | 1       | 0      |
| 0       | 1       | 1       | 0      |
| 1       | 0       | 1       | 1      |
| 1       | 1       | 1       | 1      |

Let's call the input array $X$ and the outputs $y$. The ".T" is so that the output is a column vector rather than a row vector.

In [None]:
X = np.array([  [0,0,1],
                [0,1,1],
                [1,0,1],
                [1,1,1] ])
y = np.array([[0,0,1,1]]).T
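
To see what the ".T" does to the shape, here is a tiny check on an array with the same layout (the name `row` is just for this illustration):

```python
import numpy as np

row = np.array([[0, 0, 1, 1]])
print(row.shape)      # (1, 4): one row, four columns
print(row.T.shape)    # (4, 1): four rows, one column, so one target per training row
```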

The weights are unknown and the goal of the neural network is to find the optimal values for the weights to make accurate predictions. We start by randomly assigning weights to each connection between the inputs and the outputs. Here, we pick the weights to fall in [-1,1] so that the mean is 0. (There's a whole theory behind the weight initialization but I didn't really look into it beyond that having a mean of zero is good). Since we have three inputs and want a single output, we need 3 initial weights. These are stored in a 3x1 array. In general, the array is of size (number inputs)x(number outputs). 

For reproducibility, we will set a seed. This just means that as long as the seed is the same, we'll always get the same random numbers.

In [None]:
np.random.seed(5)
weights=2*np.random.random((3,1))-1
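
As a quick sanity check on the initialization, the sketch below draws many values with the same formula and looks at their range and mean. It uses a separate `RandomState` generator (a choice made here so that the notebook's own seed is left untouched), and the sample size is arbitrary:

```python
import numpy as np

rng_check = np.random.RandomState(5)   # separate generator, does not affect np.random.seed
samples = 2 * rng_check.random_sample((100000, 1)) - 1

print(samples.min(), samples.max())    # both fall between -1 and 1
print(samples.mean())                  # close to 0
```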

Before we start training, we need to define our activation function. We'll use the sigmoid one in this example. The $\tanh$ function and the rectified linear unit (ReLU) are other popular choices, so feel free to see how the results differ based on the activation function.

In [147]:
def sigmoid(x):
    return 1/(1+np.exp(-x))
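
For anyone who wants to try the other activation functions mentioned above, here is a minimal sketch of both. The function names `tanh` and `relu` are chosen here for illustration; note that swapping one in would also mean changing the derivative used in the weight update later on:

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent: shaped like the sigmoid but ranging over (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return np.maximum(0, x)

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z))
print(relu(z))
```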

We are now ready to start the training. Since the inputs are in an array and the weights are in an array, we just use the dot product to multiply everything together. To get the output, we just pass the result through the activation function.

In [None]:
output=sigmoid(np.dot(X,weights))

Let's see how the model did, with the actual outputs as reference.

In [None]:
print(output)
print(y)

Since we randomly picked the weights, we shouldn't expect much from the model. The strength of the neural network comes from backpropagation, which allows the network to update itself after seeing its predictions. Essentially, a loss function is calculated after each run and the neural network adjusts the weights to reduce the loss.

One way to do this is gradient descent, which moves each weight in the direction that lowers the loss, using the derivative of the loss with respect to that weight (which in turn involves the derivative of the activation function). Since the derivative at a minimum is zero, the sign of the derivative tells us which way the weights should be adjusted. Here, we will adjust the weights by the delta rule, which means the change in each weight is proportional to the derivative of the activation function times the error times the value of the input.
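
As a sketch of where the delta rule comes from, assume a squared-error loss over the training rows $j$ (the text does not name the loss, so this is an assumption), with weighted sum $z_j=\sum_i W_i x_{ji}$, prediction $\hat{y}_j=f(z_j)$, and a step size $\eta$ introduced here for illustration:

$$L=\frac{1}{2}\sum_j (y_j-\hat{y}_j)^2, \qquad \frac{\partial L}{\partial W_i}=-\sum_j (y_j-\hat{y}_j)\,f^\prime(z_j)\,x_{ji}, \qquad \Delta W_i=-\eta\,\frac{\partial L}{\partial W_i}=\eta\sum_j (y_j-\hat{y}_j)\,f^\prime(z_j)\,x_{ji}$$

Under that assumption, each term in the update is the error times the derivative of the activation function times the input, and the code in this notebook corresponds to $\eta=1$.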

Let's add the backpropagation step now. It's useful to know that the derivative of the sigmoid function, $S(x)$, is $S^\prime(x)=S(x)*(1-S(x))$.

In [None]:
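def Dsigmoid(s):
    # s is expected to already be a sigmoid output S(x), as at every call site below,
    # so the derivative S(x)*(1-S(x)) is simply s*(1-s) (no second sigmoid call)
    return s*(1-s)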
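
The derivative formula can be checked numerically. The sketch below compares the closed form $S(x)*(1-S(x))$ with a central finite-difference estimate; the helper name `sigmoid_fn`, the grid of points, and the step `h` are all choices made for this illustration:

```python
import numpy as np

def sigmoid_fn(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-5

# Central finite-difference estimate of the slope of the sigmoid at each x.
numeric = (sigmoid_fn(x + h) - sigmoid_fn(x - h)) / (2 * h)

# Closed-form derivative S(x) * (1 - S(x)).
analytic = sigmoid_fn(x) * (1 - sigmoid_fn(x))

print(np.max(np.abs(numeric - analytic)))   # largest disagreement, should be tiny
```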

In [None]:
errors=y-output

#the changes to the weights
deltas=errors*Dsigmoid(output)

#now the new weights
weights +=np.dot(X.T,deltas)

In [None]:
weights

Now we train again and look at the new predictions.

In [None]:
output=sigmoid(np.dot(X,weights))
print(output)
print(y)

Again, pretty far off, so let's write a loop that repeats this process a few thousand times and see how it looks at the end.

In [None]:
for n in range(10000):
    output=sigmoid(np.dot(X,weights))
    error=y-output
    delta=error*Dsigmoid(output)
    weights +=np.dot(X.T,delta)
print(output)

These represent the predicted outputs of the training data. Remember, the correct responses are $0, 0, 1, 1$ so the neural network did a pretty good job.
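
To watch the training make progress rather than only seeing the final predictions, one option is to record the error on every pass. The sketch below is a self-contained variant of the loop above; the names `X_demo`, `y_demo`, `w` and `losses` are introduced here, the mean squared error is used as an assumed measure of fit, and the slope of the sigmoid is written directly in terms of its output as `out * (1 - out)`:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X_demo = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])
y_demo = np.array([[0, 0, 1, 1]]).T

rng = np.random.RandomState(5)
w = 2 * rng.random_sample((3, 1)) - 1

losses = []
for n in range(10000):
    out = sigmoid(np.dot(X_demo, w))
    err = y_demo - out
    losses.append(float(np.mean(err ** 2)))   # mean squared error on this pass
    # Delta rule: error times sigmoid slope (out * (1 - out)) times input.
    w += np.dot(X_demo.T, err * out * (1 - out))

print(losses[0], losses[-1])   # error at the start and at the end of training
```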

Since we have three binary variables, there are 8 possible combinations. Let's test the neural network on the other 4 and see how well it can predict their outputs. 

| Input 1 | Input 2 | Input 3 | Output |
|---------|---------|---------|--------|
| 0       | 0       | 0       | 0      |
| 1       | 0       | 0       | 1      |
| 0       | 1       | 0       | 0      |
| 1       | 1       | 0       | 1      |

To get the outputs, we note that in the original data, the output was always the same as input 1, so we have applied that pattern in generating the outputs for this new dataset.

In [None]:
test = np.array([  [0,0,0],
                [1,0,0],
                [0,1,0],
                [1,1,0] ])

In [None]:
test_output=sigmoid(np.dot(test,weights))
print(test_output)

Hm, we were able to predict the cases with $1$ as the output successfully but not the cases with $0$ as the output. This is likely due to how we trained the model. Recall that input 3 was always $1$ while input 1 and input 2 varied, so its weight acted as a constant offset. Here, input 3 is fixed at $0$, which is something the neural network hasn't seen before, and that offset disappears. Nevertheless, the neural network is still correct about 50% of the time.
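
One way to see why the all-zero row is a problem: with every input at 0, the weighted sum is 0 no matter what the weights are, and the sigmoid of 0 is exactly 0.5. The sketch below tries a few arbitrary weight vectors (drawn here only for illustration) to show the prediction never moves:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

zero_input = np.array([[0, 0, 0]])

# Try several arbitrary weight vectors: the prediction stays at 0.5 every time.
rng = np.random.RandomState(0)
zero_predictions = []
for _ in range(3):
    any_weights = 2 * rng.random_sample((3, 1)) - 1
    zero_predictions.append(sigmoid(np.dot(zero_input, any_weights))[0, 0])

print(zero_predictions)
```

In the training data, the constant third input let its weight play the role of a threshold; without it, this network has no way to push the output for an all-zero row away from 0.5.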

## Multilayer networks
In the first example, the output was just the value in input 1, so the relationship was linear and a single layer was sufficient. Now, let's change the output values a bit so that we have a nonlinear relationship.

| Input 1 | Input 2 | Input 3 | Output |
|---------|---------|---------|--------|
| 0       | 0       | 1       | 0      |
| 0       | 1       | 1       | 1      |
| 1       | 0       | 1       | 1      |
| 1       | 1       | 1       | 0      |

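The output is now $1$ only when exactly one of inputs 1 and 2 is $1$ (an exclusive-or), so it depends on both inputs together and no single weighted sum can reproduce it. To make a neural network for this problem, we need a multilayer network. For simplicity, let's use two layers this time: a first hidden layer, and a second layer whose output is the network's prediction (we'll keep calling it the second hidden layer). Since the first hidden layer maps to the second hidden layer instead of the output now, we can have as many neurons as we want in the first hidden layer. Let's choose 5.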

The initial code is basically the same.

In [None]:
X2 = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]])
y2 = np.array([[0],
            [1],
            [1],
            [0]])

In [None]:
np.random.seed(1)
l1_weights=2*np.random.random((3,5))-1
l2_weights=2*np.random.random((5,1))-1

The (3,5) forms the matrix that connects the 3 inputs to the 5 neurons in the first hidden layer. The 5 can be changed to any number as long as the 5 in the (5,1) is changed to match, since that creates the matrix that connects the 5 neurons in the first hidden layer to the single output.
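
The shape bookkeeping can be checked without any training. In this sketch the hidden size is kept in one variable, `n_hidden` (a name introduced here), so both matrices always agree; the activation function is left out because it does not change shapes:

```python
import numpy as np

n_hidden = 5                                   # can be changed freely
rng = np.random.RandomState(1)
w1 = 2 * rng.random_sample((3, n_hidden)) - 1  # inputs -> first hidden layer
w2 = 2 * rng.random_sample((n_hidden, 1)) - 1  # first hidden layer -> output

inputs = np.zeros((4, 3))                      # 4 training rows, 3 inputs each
hidden = np.dot(inputs, w1)
final = np.dot(hidden, w2)
print(hidden.shape, final.shape)               # (4, 5) (4, 1)
```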

We now pass each input through the first hidden layer and then pass the output of the first hidden layer to the second hidden layer.

In [None]:
l1_output=sigmoid(np.dot(X2,l1_weights))
l2_output=sigmoid(np.dot(l1_output,l2_weights))
print(l2_output)
print(y2)

For the backpropagation, we need to update the weights of both the first and second hidden layers. We start by finding the error with respect to the final layer and update those weights the same way as in the previous example. To find the changes in the weights of the first hidden layer, we need to find out how much each neuron in the first hidden layer contributed to the error at the second hidden layer. This is just the dot product of the deltas for the second hidden layer and the (transposed) second hidden layer weights.

The changes in the weights are still found by the Delta rule: derivative of activation function evaluated at the output of that hidden layer times the error of the layer times the input of that layer.

In [None]:
l2_error=y2-l2_output
l2_delta=l2_error*Dsigmoid(l2_output)
l1_error=np.dot(l2_delta,l2_weights.T)
l1_delta=l1_error*Dsigmoid(l1_output)
l2_weights += np.dot(l1_output.T,l2_delta)
l1_weights += np.dot(X2.T,l1_delta)
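
A common way to gain confidence in a backpropagation step is a gradient check: nudge one weight, see how the loss changes, and compare that with the backpropagated update. The sketch below does this for one weight in each layer. It assumes a loss of half the sum of squared errors (the notebook does not name its loss) and writes the sigmoid slope in terms of the layer outputs; all names ending in `_chk`, the `loss` helper, and `eps` are introduced for this illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(w1, w2, X, y):
    """Assumed loss: half the sum of squared errors of the two-layer network."""
    out = sigmoid(np.dot(sigmoid(np.dot(X, w1)), w2))
    return 0.5 * np.sum((y - out) ** 2)

X_chk = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y_chk = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.RandomState(1)
w1 = 2 * rng.random_sample((3, 5)) - 1
w2 = 2 * rng.random_sample((5, 1)) - 1

# Backpropagated weight changes, with slopes written via the layer outputs.
h = sigmoid(np.dot(X_chk, w1))
out = sigmoid(np.dot(h, w2))
d2 = (y_chk - out) * out * (1 - out)
d1 = np.dot(d2, w2.T) * h * (1 - h)
step_w2 = np.dot(h.T, d2)
step_w1 = np.dot(X_chk.T, d1)

# Finite-difference gradient of the loss for one weight in each layer.
eps = 1e-6
w1_p, w1_m = w1.copy(), w1.copy()
w1_p[0, 0] += eps
w1_m[0, 0] -= eps
grad_w1_00 = (loss(w1_p, w2, X_chk, y_chk) - loss(w1_m, w2, X_chk, y_chk)) / (2 * eps)

w2_p, w2_m = w2.copy(), w2.copy()
w2_p[0, 0] += eps
w2_m[0, 0] -= eps
grad_w2_00 = (loss(w1, w2_p, X_chk, y_chk) - loss(w1, w2_m, X_chk, y_chk)) / (2 * eps)

# The update is the negative gradient, so each pair should sum to about zero.
print(step_w1[0, 0] + grad_w1_00, step_w2[0, 0] + grad_w2_00)
```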

In [None]:
l1_output=sigmoid(np.dot(X2,l1_weights))
l2_output=sigmoid(np.dot(l1_output,l2_weights))
print(l2_output)
print(y2)

Again, let's create a loop to do this process a few thousand times.

In [None]:
for n in range(10000):
    l1_output=sigmoid(np.dot(X2,l1_weights))
    l2_output=sigmoid(np.dot(l1_output,l2_weights))
    
    l2_error=y2-l2_output
    l2_delta=l2_error*Dsigmoid(l2_output)
    l1_error=np.dot(l2_delta,l2_weights.T)
    l1_delta=l1_error*Dsigmoid(l1_output)
    
    l2_weights += np.dot(l1_output.T,l2_delta)
    l1_weights += np.dot(X2.T,l1_delta)
print(l2_output)

Looks like the neural network works pretty well.
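
For contrast, the sketch below trains the single-layer network from the first example on this same table. A single weighted sum followed by a sigmoid can only split the inputs with a plane, and the exclusive-or pattern cannot be split that way, so at least one row always ends up on the wrong side. The names `X_xor`, `y_xor`, `w_single` and `out_single` are introduced here, and the sigmoid slope is written in terms of the output:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X_xor = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y_xor = np.array([[0], [1], [1], [0]])

rng = np.random.RandomState(1)
w_single = 2 * rng.random_sample((3, 1)) - 1

for n in range(10000):
    out_single = sigmoid(np.dot(X_xor, w_single))
    # Same delta rule as in the first example, applied to the new targets.
    w_single += np.dot(X_xor.T, (y_xor - out_single) * out_single * (1 - out_single))

print(out_single)                          # compare with the targets 0, 1, 1, 0
print((out_single > 0.5).astype(int))      # thresholded predictions
```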

We can keep adding more hidden layers to handle more complex data. This is the basis for deep learning.

## SciKit-Learn Example
So far, we've just built a simple neural network from scratch on simple data. Let's use some real data now. We'll use the Breast Cancer dataset in SciKit-Learn's library with the goal of predicting whether a tumor is cancerous or not based on a variety of factors.

Example based off https://www.kdnuggets.com/2016/10/beginners-guide-neural-networks-python-scikit-learn.html

In [157]:
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()

Let's create our inputs and outputs.

In [158]:
X=cancer['data']
y=cancer['target']

Much like the other machine learning techniques, we should really train the neural network on some of the data and then test the model on new data. We can use the *train_test_split()* function to do this.

In [159]:
from sklearn.model_selection import train_test_split

In [160]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
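
Because *train_test_split()* shuffles the rows randomly, the exact numbers further down change from run to run. If repeatable results are wanted, one option is to pass `random_state`, and `stratify` keeps the class proportions similar in both parts. The sketch below uses separate variable names (`Xa_train` and so on) so it does not overwrite the split above, and the value `0` for `random_state` is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    data['data'], data['target'], random_state=0, stratify=data['target'])

print(Xa_train.shape, Xa_test.shape)
print(ya_train.mean(), ya_test.mean())   # share of class 1 in each part
```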

By default, scikit-learn's neural network keeps updating the weights until the training loss stops improving by more than a tolerance (*tol*) or the maximum number of iterations (*max_iter*) is reached. Therefore, preprocessing the data is especially important so that the training converges. *StandardScaler* rescales each feature to have a mean of zero and a variance of 1, so we use that. The scale should be computed from the training data only and then applied to both the training and testing data.

In [161]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [162]:
#set the scale based on the training data
scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [164]:
#Apply the scale to the training and testing data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
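
To see what the scaler does, here is a small sketch on made-up data with two columns on very different scales (the array `raw` and the name `demo_scaler` exist only in this illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])      # made-up data on very different scales

demo_scaler = StandardScaler().fit(raw)
scaled = demo_scaler.transform(raw)

print(scaled.mean(axis=0))   # approximately 0 for each column
print(scaled.std(axis=0))    # approximately 1 for each column
```

Note that on the real data only the training set is guaranteed to end up with mean 0 and variance 1; the test set is scaled with the training statistics, so its values will be close but not exact.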

Now we need to import the neural network function from SciKit-learn.

In [165]:
from sklearn.neural_network import MLPClassifier

There are a lot of parameters that can be adjusted, but let's just focus on the hidden layers. We can select how many layers we want and how many neurons to be in each. For all the parameters and what they do, go to http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

Since we have 30 inputs, let's pick all the hidden layers to have 30 neurons. Let's use 3 hidden layers.

In [172]:
mlp=MLPClassifier(hidden_layer_sizes=(30,30,30))

In [173]:
#now train the model
mlp.fit(X_train,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(30, 30, 30), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

Now let's see how well the model works. We'll predict the test data from our model.

In [182]:
predictions=mlp.predict(X_test)
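
The scale, fit, and predict steps can also be bundled into a single object with scikit-learn's `make_pipeline`, which fits the scaler on the training data and reuses it automatically at prediction time. This is an alternative sketch, not the approach used in this notebook; the names `Xp_train` and so on, and the `max_iter=1000` and `random_state=0` settings, are choices made here for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    data['data'], data['target'], random_state=0)

# The pipeline fits the scaler on the training data only, then reuses it when predicting.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(30, 30, 30), max_iter=1000, random_state=0))
model.fit(Xp_train, yp_train)

pipeline_accuracy = model.score(Xp_test, yp_test)
print(pipeline_accuracy)   # fraction of test rows classified correctly
```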

Now we make a confusion matrix to look at the results.

In [175]:
from sklearn.metrics import confusion_matrix

In [183]:
#actual are rows, predictions are columns
print(confusion_matrix(y_test,predictions))

[[48  4]
 [ 1 90]]
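
The matrix can be turned into a couple of summary numbers by hand. The sketch below copies the values printed above; in this dataset label 0 is malignant and label 1 is benign, so the 4 in the top row counts malignant tumors that were predicted benign:

```python
import numpy as np

# Confusion matrix copied from the run above (rows: actual, columns: predicted).
cm = np.array([[48, 4],
               [1, 90]])

accuracy = np.trace(cm) / cm.sum()                # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)   # fraction of each actual class recovered
print(accuracy)
print(recall_per_class)
```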


Not bad. We have a few misclassifications, so let's try changing the hidden layers.

In [178]:
mlp4=MLPClassifier(hidden_layer_sizes=(30,30,30,30))
mlp_more=MLPClassifier(hidden_layer_sizes=(30,40,50))

In [179]:
mlp4.fit(X_train,y_train)
mlp_more.fit(X_train,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(30, 40, 50), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [181]:
predictions4=mlp4.predict(X_test)
predictions_more=mlp_more.predict(X_test)

In [184]:
print(confusion_matrix(y_test,predictions4))
print(confusion_matrix(y_test,predictions_more))

[[50  2]
 [ 1 90]]
[[49  3]
 [ 1 90]]


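In this run we get a slight improvement with more layers or more neurons, but the difference is only one or two test samples. Since the train/test split and the initial weights are random (*random_state=None*), differences this small can change from run to run. Feel free to play around with the hidden layers or the other parameters and see if you can get a perfect classification.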
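
When comparing layer layouts, a single train/test split can be misleading, so one option is to average over several splits with cross-validation. The sketch below is one possible setup, not a definitive benchmark: the `layouts` list repeats the three configurations tried above, while `cv=5`, `max_iter=1000` and `random_state=0` are assumptions made here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
layouts = [(30, 30, 30), (30, 30, 30, 30), (30, 40, 50)]

cv_means = {}
for layout in layouts:
    # Scaling sits inside the pipeline so each fold is scaled from its own training part.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=layout, max_iter=1000, random_state=0))
    scores = cross_val_score(model, data['data'], data['target'], cv=5)
    cv_means[layout] = scores.mean()
    print(layout, scores.mean(), scores.std())
```

If the mean scores differ by less than their spread across folds, the layouts are probably not meaningfully different on this dataset.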