# Introduction to Neural Networks and Keras

Deep learning models layer simple machine learning models and train the composition jointly, so that earlier layers can learn features that are useful to later layers.

The simplest neural network is the **multilayer perceptron** (MLP). Consider a set of training data of the form $\{(\mathbf{x}_1,\mathbf{y}_1), \ldots, (\mathbf{x}_n,\mathbf{y}_n)\}$ where each input $\mathbf{x}_i$ is in $\mathbb{R}^p$ and each output $\mathbf{y}_i$ is in $\mathbb{R}^q$. The idea of the MLP is to model the input-output relationship as a function which alternately applies linear transformations (of the form $A(\mathbf{x}) = W\mathbf{x} + \mathbf{b}$) and component-wise applications of the nonlinear _ReLU_ function $\operatorname{ReLU}(x) = \max(0,x)$. We call this nonlinear function the model's **activation function**. The **depth** of the neural network is the number of layers. 

Here's a simple example, illustrated three ways: (1) in Python, (2) using a vector-based **computational graph**, and (3) using a component-based computational graph. 

In [None]:
import numpy as np
def ReLU(x):
    return np.maximum(0,x)
x  = np.array([1,-2,3])
W1 = np.array([[4,2,-1],[5,-2,-2]])
b1 = np.array([0,1])
W2 = np.array([[0,1],[-2,2],[3,4],[-2,0]])
b2 = np.array([-1,-7,-14,-5])
W3 = np.array([[-2,3,-6,4],[-1,0,-1,0]])
b3 = np.array([14, 4])
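# forward pass: alternate affine maps with component-wise ReLU
output = W3 @ ReLU(W2 @ ReLU(W1 @ x + b1) + b2) + b3
# the output, and its distance from the target vector [3, 2] (the loss)
output, np.linalg.norm(output - np.array([3,2]))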

<img src="nn.png">
<img src="nodes-nn.png">

_**Exercise**. Why is it necessary to include a nonlinear function like ReLU for an MLP's expressive power to increase as the depth is increased?_

## Training

The next task is to identify parameters such that the model accurately reflects the input-output relationship for a given set of data. This is done using a technique called **backpropagation**. The idea is to measure the error (or **loss**) of an output value $\widehat{\mathbf{y}}$ given the desired output $\mathbf{y}$ and take the derivative of this error value with respect to each of the weights in the model. These derivatives can be averaged over a set of training samples, and that information can be used to nudge each weight in a direction that decreases average error.
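As a minimal sketch of that nudging step, the following uses a single linear layer with squared-error loss, where the derivatives can be written out by hand. The data values, the learning rate `lr`, and the variable names are illustrative assumptions. Backpropagation obtains the same kind of derivatives for deeper models by applying the chain rule layer by layer.

```python
import numpy as np

# Toy model: y_hat = W @ x + b with squared-error loss (illustrative values)
x = np.array([1.0, -2.0, 3.0])
y = np.array([3.0, 2.0])
W = np.zeros((2, 3))
b = np.zeros(2)

def loss(W, b):
    return np.sum((W @ x + b - y) ** 2)

# Derivatives of the loss with respect to each parameter
residual = W @ x + b - y            # shape (2,)
grad_W = 2 * np.outer(residual, x)  # dL/dW, shape (2, 3)
grad_b = 2 * residual               # dL/db, shape (2,)

# Nudge each parameter in the direction opposite its derivative
lr = 0.01  # learning rate (assumed value)
W_new = W - lr * grad_W
b_new = b - lr * grad_b

loss_before, loss_after = loss(W, b), loss(W_new, b_new)
print(loss_before, loss_after)
```

The loss after the step should be smaller than the loss before it, provided the learning rate is small enough.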

_**Exercise**. Compute a rough estimate of the derivative of the loss in the example above with respect to the top-left entry of $W_1$ by changing that value by a small amount and determining the resulting change in loss. Based on that information, should the value in that position be adjusted up or down (to decrease the loss)?_

## Softmax activation

For classification problems, it's common practice to let the number of dimensions in the last layer be equal to the number of classes (so that each node corresponds to a particular class) and apply the **softmax** function* at that layer so that the output values are nonnegative and sum to 1.

*The softmax function exponentiates each entry of a vector and then divides each exponentiated entry by the sum of all of the exponentiated entries:

In [None]:
a = np.exp(np.array([1,2,-3])) # exponentiate
a /= np.sum(a) # normalize
a

We interpret the values as confidence values, or *probabilities*, for each class. For example, a neural network which returns the vector `a` (above) for a particular input would be expressing a strong belief that that input is not in the third class, and is more likely to be in the second class than the first. 

Usually when we use softmax in the last layer of a classification problem, we also use the *cross-entropy loss* function, which returns the natural logarithm of the reciprocal of the value in the position of the correct class. For example, if the correct class were 2 and the neural network output `a`, the loss would be $-\log(0.727) \approx 0.319$, while a correct class of 3 would result in a much larger penalty of $-\log(0.0049) \approx 5.32$.
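The cross-entropy calculation can be sketched in a few lines of NumPy. The helper names `softmax` and `cross_entropy` are illustrative rather than part of any library, and classes are indexed from 0 in the code, so class 2 is index 1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow; result is unchanged
    return e / np.sum(e)

def cross_entropy(probs, correct_class):
    # correct_class is a 0-based index into probs
    return -np.log(probs[correct_class])

a = softmax(np.array([1.0, 2.0, -3.0]))
loss_class2 = cross_entropy(a, 1)  # correct class 2 -> index 1
loss_class3 = cross_entropy(a, 2)  # correct class 3 -> index 2
print(loss_class2, loss_class3)
```

The two printed values should agree with the losses quoted above up to rounding of the probabilities.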

## Keras

Keras is a Python module (built into the `tensorflow` module) that supports convenient layer-by-layer model building and automatic differentiation for training. Let's see a neural network in action. 

This example draws from the notebook at https://www.tensorflow.org/tutorials/keras/basic_classification

In [None]:
from tensorflow.keras.datasets import fashion_mnist
import tensorflow.keras as keras

In [None]:
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

The training data consists of 60,000 images of size $28\times28$, together with labels for each. 

In [None]:
X_train.shape

In [None]:
y_train

We will need to scale the images so that the pixel values are between 0 and 1. Initially they range from 0 to 255.

In [None]:
X_train = X_train / 255.0
X_test = X_test / 255.0

We can take a look at some of the images using `plt.imshow`:

In [None]:
import matplotlib.pyplot as plt
# show first image in grayscale
plt.imshow(X_train[0,:,:], cmap=plt.cm.binary)
plt.axis('off')
plt.show()

We build the model as a `Sequential` object which first flattens each image into a length-$28^2$ vector and then runs it through two MLP layers (called _Dense_ layers in Keras). 

In [None]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=(28,28)))
model.add(keras.layers.Dense(units=128,activation='relu'))
model.add(keras.layers.Dense(units=10,activation='softmax'))
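To get a feel for the size of this model, we can count its trainable parameters by hand: each `Dense` layer has one weight per input-output pair plus one bias per output. The variable names below are illustrative.

```python
# Parameter counts for Flatten -> Dense(128) -> Dense(10)
n_inputs = 28 * 28                     # length of the flattened image: 784
hidden_params = n_inputs * 128 + 128   # weights + biases of the first Dense layer
output_params = 128 * 10 + 10          # weights + biases of the second Dense layer
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)  # 100480 1290 101770
```

Calling `model.summary()` on the model built above should report the same counts.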

Next we _compile_ the model, which entails specifying an optimization algorithm, a loss function, and any extra data to track during training. In this case, we use `sparse_categorical_crossentropy`, which is appropriate when the classes are stored as integers. We'll use the `adam` optimization algorithm, which is a somewhat more sophisticated variant of gradient descent.

In [None]:
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
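The "sparse" in `sparse_categorical_crossentropy` refers only to how the labels are stored. The NumPy sketch below, with made-up probabilities and labels, shows that indexing by integer labels gives the same per-sample loss as the one-hot form (which is what Keras's `categorical_crossentropy` expects).

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 4 classes
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
labels = np.array([0, 1, 3])  # integer ("sparse") labels

# Sparse form: pick out the probability of the correct class by index
sparse_loss = -np.log(probs[np.arange(len(labels)), labels])

# One-hot form: the same labels as rows of an identity matrix
one_hot = np.eye(probs.shape[1])[labels]
categorical_loss = -np.sum(one_hot * np.log(probs), axis=1)

print(sparse_loss.mean(), categorical_loss.mean())
```

Since `y_train` holds integer class labels, the sparse version saves us from converting them to one-hot vectors.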

In [None]:
model.fit(X_train, y_train, epochs=5)

In [None]:
model.evaluate(X_test, y_test)

## Dropout

We can often get a small boost in performance by adding a _dropout_ layer, which causes a randomly chosen fraction of the units in the preceding layer to be zeroed out at each step of the training process. This prevents over-reliance on a small number of units and forces the network to compute more holistically.
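Here is a minimal NumPy sketch of what a dropout layer does to a vector of activations. It assumes the "inverted dropout" convention, in which the surviving units are scaled up by $1/(1-\text{rate})$ during training (the convention Keras documents for its `Dropout` layer). The function name, the seed, and the sample values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training=True):
    if not training:
        return activations  # dropout is inactive at evaluation time
    keep = rng.random(activations.shape) >= rate  # True for units that survive
    return activations * keep / (1.0 - rate)      # rescale the survivors

h = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
dropped = dropout(h, rate=0.2)
print(dropped)
```

Each entry of `dropped` is either 0 or the original value divided by 0.8, so the expected value of each activation is unchanged.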

_**Exercise**. Build a new network with a dropout layer between the `Dense` layers. Does the training accuracy improve relative to the version without dropout? Does the test accuracy improve?_

In [None]:
# Hint: a dropout layer is added like this; in a new model, place it between the two Dense layers
model.add(keras.layers.Dropout(0.2))

## Weight regularization

Dropout is a form of _regularization_: we impose some constraints on the model to try to improve test accuracy. Another way to regularize is to directly penalize large weights. This is done in Keras using the `kernel_regularizer` keyword argument.

In [None]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=(28,28)))
model.add(keras.layers.Dense(units=128,
                             activation='relu',
                             kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.Dense(units=10,
                             activation='softmax',
                             kernel_regularizer=keras.regularizers.l2(0.01)))
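To see what the `l2(0.01)` regularizer contributes, here is a small NumPy sketch. It assumes the penalty is the coefficient times the sum of the squared weights, which is how Keras documents `regularizers.l2`. The weight matrix and the value of `data_loss` are made up for illustration.

```python
import numpy as np

def l2_penalty(W, lam=0.01):
    # lam times the sum of the squared weights
    return lam * np.sum(W ** 2)

W = np.array([[4.0, 2.0, -1.0],
              [5.0, -2.0, -2.0]])
data_loss = 0.5  # hypothetical cross-entropy value
total_loss = data_loss + l2_penalty(W)
print(l2_penalty(W), total_loss)
```

The optimizer minimizes the total, so it is pushed toward smaller weights even at some cost to the data-fitting term.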

In [None]:
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
model.fit(X_train, y_train, epochs=5)

In [None]:
model.evaluate(X_test, y_test)

Would you say that the $L^2$ regularization we just tried is leading to an overfit model or an underfit model?