# Neural Networks

Innocent Mamvura (innocent.mamvura@wits.ac.za)<br>
Data Scientist <br>
[Wits University](https://www.wits.ac.za//)

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import tensorflow.keras as keras

## Introduction

An [Artificial Neural Network](https://en.wikipedia.org/wiki/Artificial_neural_network) (ANN) was conceived as a model which would learn in a manner similar to the brain.

It's a mistake to take this analogy too literally though.

> However, modern neural network research is guided by many mathematical and engineering disciplines, and **the goal of neural networks is not to perfectly model the brain**. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally **drawing some insights from what we know about the brain, rather than as models of brain function**.
>
> &mdash; Goodfellow, Bengio & Courville: *Deep Learning* (2016).

### Neurons

The neurons in a neural network were originally intended to model the neurons in the brain. It's a very rough, approximate model. The basic idea is this:

- a neuron receives stimuli ("inputs") from its dendrites;
- these stimuli are aggregated and if they exceed some threshold then they cause the neuron to "fire";
- when a neuron fires it sends output along its axon.

The original idea for this model was published in "[A logical calculus of the ideas immanent in nervous activity](https://doi.org/10.1007/BF02478259)" (McCulloch & Pitts, The Bulletin of Mathematical Biophysics, 5(4), 115–133, 1943).

![](https://github.com/datawookie/useful-images/raw/master/neuron-wikipedia.png)

### Neuron Model

This is what the mathematical model of a neuron looks like:

$$
g\left(\sum_i w_i \cdot x_i + b\right)
$$

where

- $x_i$ are the input signals;
- $w_i$ are the weights attached to each of the input channels;
- $b$ is a constant bias value; and
- $g()$ is a non-linear activation function.

The process for calculating the output from the neuron model is:

- multiply the inputs by the corresponding weights and sum the products;
- add the bias; and
- apply an activation function.

This calculation consists of two components:

- *linear component* &mdash; multiplication and addition; and
- *non-linear component* &mdash; the activation function.

This model is illustrated below.

![](https://raw.githack.com/datawookie/useful-images/master/neuron-artificial.svg)

The inputs and weights for a single neuron are often represented by vectors, in which case the operation is given as

$$
g\left(w \cdot x +b\right).
$$
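
The calculation above can be sketched in a few lines of NumPy. The input, weight and bias values are hypothetical, chosen only for illustration, and ReLU (covered under Activation Functions) stands in for $g()$.

```python
import numpy as np

# Hypothetical inputs, weights and bias, chosen only for illustration.
x = np.array([0.5, -1.0, 2.0])   # input signals x_i
w = np.array([0.4, 0.3, 0.2])    # weights w_i
b = 0.1                          # bias

def relu(z):
    # Non-linear activation g(): the maximum of z and zero.
    return np.maximum(0, z)

# Linear component: weighted sum of the inputs plus the bias.
z = np.dot(w, x) + b
# Non-linear component: apply the activation function.
output = relu(z)

print(z, output)
```

Because the weighted sum is positive here, ReLU passes it through unchanged; a negative sum would give an output of zero.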

### Network

A single neuron in isolation doesn't do an awful lot. However, when you take a bunch of neurons and connect them in a network, suddenly you have something which is rather powerful.

A [Multilayer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP) is a network consisting of at least three layers:

- input layer;
- one or more hidden layers; and
- output layer.

<!-- Image created with http://alexlenail.me/NN-SVG/. Can do CNN too. -->

![](https://github.com/datawookie/useful-images/raw/master/neural-network.png)

The above network is "dense" or "fully connected": every node in a layer is linked to each node in the next layer.

The input layer (left) consists of the features being fed into the model. The output layer (right) is the prediction being generated by the model.

The maths remains essentially the same (although there's now a lot more of it!). Now, rather than a vector of weights, there's a *weight matrix* associated with each layer.
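
As a rough sketch of what this looks like, the forward pass through a small network is just a sequence of matrix multiplications and activations. The layer sizes, the random initial weights and the choice of ReLU and sigmoid activations are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 input features, 4 hidden nodes, 1 output.
n_input, n_hidden, n_output = 3, 4, 1

# One weight matrix and one bias vector per layer (small random weights).
W1 = rng.normal(scale=0.1, size=(n_hidden, n_input))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_output, n_hidden))
b2 = np.zeros(n_output)

def forward(x):
    # Hidden layer: linear component followed by a ReLU activation.
    h = np.maximum(0, W1 @ x + b1)
    # Output layer: linear component followed by a sigmoid activation.
    return 1 / (1 + np.exp(-(W2 @ h + b2)))

x = np.array([0.5, -1.0, 2.0])
y = forward(x)
print(y.shape, y)
```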

### Universal Approximation Theorem

The [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) states that a feed-forward network with a single hidden layer containing a finite number of nodes *can* approximate essentially any continuous function to arbitrary accuracy, given enough nodes and a suitable activation function. However, it says nothing about whether such a network can feasibly be trained in practice.

> In summary, a **feedforward network with a single layer is sufficient to represent any function**, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
>
> &mdash; Goodfellow, Bengio & Courville: *Deep Learning* (2016).

## Activation Functions

A key component of how an ANN works is the activation function. This determines whether or not a given neuron will respond to its inputs.

For a multi-layer network it's often the case that a different activation function is used for each layer in the network.

There is a wide range of activation functions to choose from. The most common ones are listed below.

### Linear

A linear activation function simply maps its input to its output. It's just the identity function.

In [None]:
x = np.linspace(-10, 10, 1000)
y = x

# Recent seaborn releases require x and y to be passed as keyword arguments.
sns.lineplot(x=x, y=y)
plt.show()

We'll ultimately be building networks using Keras, so this is a good time to see what activations in Keras look like.

In [None]:
keras.activations.linear(tf.constant([-5, 0, 5], dtype=tf.float32)).numpy()

### ReLU

The [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) (Rectified Linear Unit) function returns the maximum of its argument and zero. ReLU is normally used in hidden layers.

Here the "response" nature of the activation function is apparent: if the input is less than zero then the output is zero (no response), but if the input is greater than zero then the response is equal to the input.

In [None]:
y = np.maximum(0, x)

sns.lineplot(x=x, y=y)
plt.show()

In [None]:
keras.activations.relu(tf.constant([-5, 0, 5], dtype=tf.float32)).numpy()

### ELU

The ELU (Exponential Linear Unit) activation function is like ReLU for positive arguments but exponential for negative arguments:

$$
f(x)={\begin{cases}x&{\text{if }}x>0,\\ \alpha(e^{x}-1)&{\text{otherwise}}.\end{cases}}
$$

It gives a smoother transition at zero.

In [None]:
alpha = 1
# Identity for non-negative arguments, scaled exponential otherwise.
y = np.where(x >= 0, x, alpha * (np.exp(x) - 1))

sns.lineplot(x=x, y=y)
plt.show()

In [None]:
keras.activations.elu(tf.constant([-5, 0, 5], dtype=tf.float32)).numpy()

### Sigmoid

The [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) activation function is generally used for the output layer of a binary (two class) classification problem. This function effectively "squashes" its input to a value between 0 and 1.

In [None]:
y = 1 / (1 + np.exp(-x))

sns.lineplot(x=x, y=y)
plt.show()

In [None]:
keras.activations.sigmoid(tf.constant([-5, 0, 5], dtype=tf.float32))

### Hyperbolic Tangent

The [hyperbolic tangent](https://en.wikipedia.org/wiki/Hyperbolic_function) (or *tanh*) is similar to the sigmoid, but its output is between -1 and +1.

In [None]:
y = np.tanh(x)

sns.lineplot(x=x, y=y)
plt.show()

In [None]:
keras.activations.tanh(tf.constant([-5, 0, 5], dtype=tf.float32))

### Softmax

The [softmax](https://en.wikipedia.org/wiki/Softmax_function) activation function is used in the output layer of classification problems with more than two target classes. The output can be interpreted as probabilities of each of the classes.

In [None]:
keras.activations.softmax(tf.constant([[1, 7, 2]], dtype=tf.float32))

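*Note:* The output from the softmax is not simply the arguments rescaled so that they sum to one. Each argument is exponentiated first, so the largest argument receives a disproportionately large share of the total.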
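
To make the difference concrete, here is a minimal NumPy sketch comparing plain rescaling with a hand-written softmax on the same values used above. Subtracting the maximum before exponentiating is a common numerical safeguard and does not change the result.

```python
import numpy as np

z = np.array([1.0, 7.0, 2.0])

# Naive scaling: divide each value by the total.
scaled = z / z.sum()

# Softmax: exponentiate first, then normalise.
# Subtracting the maximum leaves the result unchanged but avoids overflow.
exp_z = np.exp(z - z.max())
softmax = exp_z / exp_z.sum()

print(scaled)
print(softmax)
```

Both results sum to one, but the softmax assigns almost all of the probability to the largest argument.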

## Loss Functions

A *loss function* (or *cost function*) indicates how well a model fits the data. Smaller values of the loss function are associated with better fitting models. So fitting a model is equivalent to minimising the loss function.

These are some commonly used loss functions:

- Mean Squared Error (MSE) &mdash; regression;
- Mean Absolute Error (MAE) &mdash; regression;
- Binary Cross-Entropy &mdash; binary classification; and
- Categorical Cross-Entropy &mdash; multi-class classification.
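
As an illustration, the first three of these can be written directly in NumPy. The targets, predictions, labels and probabilities below are made-up values, not results from any model.

```python
import numpy as np

# Hypothetical regression targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean Squared Error: average of the squared differences.
mse = np.mean((y_true - y_pred) ** 2)
# Mean Absolute Error: average of the absolute differences.
mae = np.mean(np.abs(y_true - y_pred))

# Hypothetical binary labels and predicted probabilities of class 1.
labels = np.array([1, 0, 1, 1])
probs = np.array([0.9, 0.2, 0.7, 0.6])

# Binary Cross-Entropy: penalises confident predictions of the wrong class.
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, mae, bce)
```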

For many common Machine Learning applications the loss function is *convex*, which means that any local minimum is also the global minimum, so the minimum is relatively easy to find. Deep Learning problems, on the other hand, often have loss functions which are not convex, with many local minima. As a result, finding the global minimum can be much more challenging.

In [None]:
fig = plt.figure(figsize = (12, 6))

x = np.linspace(-5, 5, 1000)

fig.add_subplot(1, 2, 1)
# Convex loss: a single minimum.
sns.lineplot(x=x, y=x**2)

fig.add_subplot(1, 2, 2)
# Non-convex loss: many local minima.
sns.lineplot(x=x, y=x**2 + np.sin(x*10))

plt.show()

## Optimisers

An optimiser is an algorithm which tries to find a set of parameters which minimise a loss function.

> The largest difference between the linear models we have seen so far and neural networks is that the **nonlinearity of a neural network causes most interesting loss functions to become non-convex**. This means that **neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value**, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. **Convex optimization converges starting from any initial parameters. Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters.** For feedforward neural networks, **it is important to initialize all weights to small random values**. The **biases may be initialized to zero or to small positive values**.
>
> &mdash; Goodfellow, Bengio & Courville: *Deep Learning* (2016).

### Gradient Descent

The principle of the [Gradient Descent](https://en.wikipedia.org/wiki/Gradient_descent) algorithm is simple: just move "down hill" (in the opposite direction to the gradient) on the loss function until you get to the bottom.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Gradient_descent.svg/560px-Gradient_descent.svg.png">

The *learning rate* determines the size of each step:

- small learning rate &mdash; gradual convergence; and
- large learning rate &mdash; rapid convergence, but less stable and might overshoot the minimum.

This is also known as "batch" gradient descent because the entire dataset is used to calculate the gradient.

If the loss function is convex (seldom the case for a neural network) then this simple approach works fine. However, if there are local minima this algorithm is likely to get stuck.
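
A minimal sketch of this behaviour, using the two one-dimensional loss functions plotted earlier ($x^2$ and $x^2 + \sin(10x)$). The starting point, learning rates and number of steps are arbitrary choices for illustration.

```python
import math

def gradient_descent(grad, x0, learning_rate, n_steps):
    # Repeatedly step in the opposite direction to the gradient.
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Convex loss x**2: the gradient is 2x and the only minimum is at x = 0.
convex_min = gradient_descent(
    lambda x: 2 * x, x0=4.0, learning_rate=0.1, n_steps=100
)

# Non-convex loss x**2 + sin(10x): the gradient is 2x + 10 cos(10x).
bumpy_min = gradient_descent(
    lambda x: 2 * x + 10 * math.cos(10 * x), x0=4.0, learning_rate=0.01, n_steps=1000
)

print(convex_min, bumpy_min)
```

On the convex loss the algorithm ends up very close to zero. On the non-convex loss it settles into the first dip it reaches, a long way from the global minimum near zero.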

There are variations on Gradient Descent which make it applicable to non-convex problems.

### Stochastic Gradient Descent

[Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) is like Gradient Descent but, rather than calculating the actual gradient of the loss function using the entire dataset, it approximates the gradient using a randomly selected subset of the data. As a result there are more, faster iterations.

The *momentum* combines updates from previous iterations with the gradient from the current iteration. Momentum is an important feature because it can help the optimiser to jump out of shallow local minima.
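
The momentum update can be sketched as follows. This is one common formulation (a "velocity" that accumulates past updates), and for simplicity it uses the exact gradient of $x^2$ rather than an estimate from a random subset of data. The function name and default values are assumptions for illustration.

```python
def momentum_step(x, velocity, grad, learning_rate=0.01, momentum=0.9):
    # Combine the previous update (velocity) with the current gradient.
    velocity = momentum * velocity - learning_rate * grad
    return x + velocity, velocity

# Minimise the convex loss x**2, whose gradient is 2x.
x, velocity = 4.0, 0.0
for _ in range(500):
    x, velocity = momentum_step(x, velocity, grad=2 * x)

print(x)
```

Setting `momentum=0` recovers plain Gradient Descent, since the velocity then depends only on the current gradient.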

There are other optimisation techniques, such as RMSProp and Adam, which often converge faster than plain SGD.

![](https://github.com/datawookie/useful-images/raw/master/optimisation-techniques-saddle-point.gif)

### Batches and Mini-Batches

Strictly speaking, each update in Stochastic Gradient Descent should be based on a single, randomly selected sample from the data.

At the opposite extreme, "batch" Gradient Descent uses the complete dataset for each update.

There is a compromise between these two extremes: train on subsets (or "mini-batches") of the data. One complete pass through all of the mini-batches (and hence through the whole dataset) is called an "epoch".
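
The loop structure of mini-batch training might look something like the sketch below, which fits a single weight to synthetic data. The data, batch size, learning rate and number of epochs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y = 3x plus a little noise.
X = rng.uniform(-1, 1, size=200)
y = 3 * X + rng.normal(scale=0.1, size=200)

w = 0.0              # single weight to learn
learning_rate = 0.1
batch_size = 20      # 200 samples / 20 per batch = 10 updates per epoch

for epoch in range(20):
    # Shuffle so that each epoch splits the data into different mini-batches.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        # Gradient of the MSE loss estimated from this mini-batch only.
        grad = np.mean(2 * (w * X[batch] - y[batch]) * X[batch])
        w -= learning_rate * grad

print(w)
```

Setting `batch_size = 1` gives Stochastic Gradient Descent in the strict sense, while `batch_size = len(X)` gives "batch" Gradient Descent.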

**Advantages**

- Less time per iteration (more frequent updates).
- Less RAM per iteration (and can handle larger data).
- Results are more noisy, but this can be good because it helps escape from local minima.


**Disadvantages**

- More iterations.
- How to find "best" mini-batch size?

### RMS Propagation

The Root Mean Square Propagation (or RMSProp) optimiser has an adaptive learning rate *per parameter*. It keeps a running average of recent squared gradients and divides each update by the square root of that average, which smooths the updates.
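
A minimal sketch of a single RMSProp update, assuming the commonly used form of the rule. The function name, the default values and the gradient values are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def rmsprop_step(params, avg_sq_grad, grad, learning_rate=0.01, rho=0.9, epsilon=1e-7):
    # Running average of recent squared gradients (one value per parameter).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Divide each parameter's step by the root mean square of its gradients.
    params = params - learning_rate * grad / (np.sqrt(avg_sq_grad) + epsilon)
    return params, avg_sq_grad

params = np.array([1.0, 1.0])
avg_sq_grad = np.zeros(2)
# Hypothetical gradients which differ by four orders of magnitude.
grad = np.array([100.0, 0.01])

new_params, avg_sq_grad = rmsprop_step(params, avg_sq_grad, grad)
steps = params - new_params
print(steps)
```

Despite the very different gradient magnitudes, both parameters move by almost the same amount. This is the sense in which the learning rate adapts to each parameter.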

### Adaptive Moment

The Adaptive Moment Estimation (or Adam) optimiser is similar to RMS Propagation but, in addition to the running average of squared gradients, it keeps a running average of the gradients themselves (much like momentum). It tends to perform well without much tuning to individual problems.
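
The Adam update can be sketched in the same style, assuming the standard form of the rule with bias correction. The function name, default values and the learning rate used in the loop are illustrative assumptions.

```python
import math

def adam_step(x, m, v, grad, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    # Running average of gradients (first moment, similar to momentum).
    m = beta_1 * m + (1 - beta_1) * grad
    # Running average of squared gradients (second moment, as in RMSProp).
    v = beta_2 * v + (1 - beta_2) * grad ** 2
    # Bias correction: both averages start at zero, so early values are too small.
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    x = x - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x, m, v

# Minimise the convex loss x**2, whose gradient is 2x.
x, m, v = 4.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, m, v, grad=2 * x, t=t, learning_rate=0.1)

print(x)
```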

## Resources

- [How does backpropagation work?](https://brohrer.github.io/how_backpropagation_works.html)
- [What are the meanings of batch size, mini-batch, iterations and epoch in neural networks?](https://www.quora.com/What-are-the-meanings-of-batch-size-mini-batch-iterations-and-epoch-in-neural-networks)
- [What is meant by "Batch" in machine learning?](https://www.quora.com/What-is-meant-by-Batch-in-machine-learning)

![](https://github.com/datawookie/useful-images/raw/master/banner/banner-lab-tensorflow-keras.png)

The lab for this section will involve building a simple Neural Network model.