<h1 style="color:white;background-color:rgb(255, 108, 0);padding-top:1em;padding-bottom:0.7em;padding-left:1em;">2.2 Training Process and Gradient Descent</h1>
<hr>

<h2>Introduction</h2>

We already saw why artificial neurons and neural networks are such powerful computing tools. Now we will see why they
<br>
are practical. The training process enables us to tune the weights of a neural network model automatically.
<br>
For this purpose we will use the Gradient Descent algorithm.

First of all import the necessary modules:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

<h2>Gradient and Optimization</h2>

To train a neural network, a goal for the network first has to be defined.
<br>
This goal can appear in several forms. It can be to classify the inputs, perform clustering, regress
<br>
continuous values, etc. The objective is to modify the weights of the network so it performs
<br>
the given task well, that is, with minimal error.

So in order to train a neural network we should define a function that computes the error
<br>
of the predictions of the neural network. During training we use iterative optimization methods
<br>
to minimize the prediction error across the training dataset with respect to the weights of the
<br>
neural network.

The error function will be a multi-variable function. While in case of functions with a single
<br>
variable the derivative can be used to search for the minima, in case of multi-variable functions
<br>
a multi-variable generalization of the derivative has to be used. This is the gradient.

The gradient of a multi-variable function is a vector with the dimension of the number of variables.
<br>
The elements of the gradient are the partial derivatives of the function with respect to the
<br>
corresponding variables. The gradient can be formulated as:

$$\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial \mathrm{x_1}},
\frac{\partial f(\mathbf{x})}{\partial \mathrm{x_2}},
\dots,
\frac{\partial f(\mathbf{x})}{\partial \mathrm{x_n}}
\right]$$

where $f(\mathbf{x})$ is the multi-variable function,
<br>
<br>
$\nabla f(\mathbf{x})$ is the gradient,
<br>
<br>
$\frac{\partial f(\mathbf{x})}{\partial \mathrm{x_i}}$ is the $i^{th}$ partial derivative and $\mathbf{x} = \left[\mathrm{x_1}, \mathrm{x_2}, \dots, \mathrm{x_n}\right],\space (i \in \{1,2, \dots, n\})$
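The definition can be checked numerically. The following is a minimal sketch that approximates each partial derivative with a central difference and compares the result with the analytic gradient. The helper name `numerical_gradient`, the step size `eps` and the test point are assumptions made for this illustration only.

```python
import numpy as np

def f(x):
    # Example function f(x1, x2) = x1^2 + x2^2
    return x[0]**2 + x[1]**2

def numerical_gradient(func, x, eps=1e-6):
    # Approximate every partial derivative by perturbing one variable at a time
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

point = np.array([1.0, -2.0])
approx_grad = numerical_gradient(f, point)
exact_grad = 2 * point  # analytic gradient [2*x1, 2*x2]

print('Numerical gradient:', approx_grad)
print('Analytic gradient: ', exact_grad)
```

The two vectors should agree up to a small numerical error, and each has one element per input variable.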

Similar to the derivative, the gradient also represents the slope of the function. Furthermore
<br>
it is a vector that points in the direction of the steepest increase in the function and its
<br>
magnitude is the steepness of the function in that given direction.

Let's see what the gradient of a simple multi-variable function, $f(x,y) = x^2+y^2$, looks like:

In [None]:
#Create function:
def function(inp):
    return (inp[0]**2+inp[1]**2)

def gradient(inp):
    return(np.array([2*inp[0],2*inp[1]]))

#Plot function
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = y = np.arange(-3.0, 3.0, 0.05)
X, Y = np.meshgrid(x, y)
zs = np.array([function([x,y]) for x,y in zip(np.ravel(X), np.ravel(Y))])
Z = zs.reshape(X.shape)

ax.plot_surface(X, Y, Z)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
ax.set_title('Function with its gradient')

#Plot gradient
X, Y = np.meshgrid(np.arange(-3, 3, 0.4), np.arange(-3, 3, 0.4))
Z = np.zeros(X.shape)

U = np.zeros(X.shape)
V = np.zeros(X.shape)
W = np.zeros(X.shape)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        g = gradient([X[i][j],Y[i][j]])
        U[i][j] = g[0]
        V[i][j] = g[1]
        
ax.quiver(X, Y, Z, U, V, W, length=0.1, color = 'black')

#Plot gradient alone
plt.figure(2)
plt.title('Gradient of function')
Q = plt.quiver(X, Y, U, V, units='width')

plt.show()

Since the gradient points in the direction of the steepest increase, to find the
<br>
minimum of a function we have to take steps in the direction opposite to
<br>
the one the gradient points in.

During the training of a neural network we would like to modify the weights
<br>
of the network in order to minimize the error function.

So the basic idea of the gradient descent algorithm is the following:
 - Initialize the parameters to random values (select a random point in the parameter space)
 - Compute the gradient of the function at the current point
 - Get a new point in the parameter space by taking a step from the current point in the
 <br>
 direction opposite to the gradient.
 - Repeat this process until a stopping condition is met (a maximum number of iterations is reached or the error falls below
 <br> a previously defined level).
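The steps above can be sketched in a few lines of NumPy. This is only an illustration on the function $f(x,y) = x^2+y^2$. The function name `gradient_descent`, the default values of `learning_rate`, `max_iterations` and `tolerance`, and the fixed random seed are assumptions, not part of the original material.

```python
import numpy as np

def f(p):
    # Function to minimize: f(x, y) = x^2 + y^2
    return p[0]**2 + p[1]**2

def grad_f(p):
    # Analytic gradient of f
    return np.array([2 * p[0], 2 * p[1]])

def gradient_descent(func, grad, start, learning_rate=0.1,
                     max_iterations=1000, tolerance=1e-8):
    point = np.array(start, dtype=float)
    steps = 0
    # Stop when the iteration limit is reached or the function value is small enough
    while steps < max_iterations and func(point) >= tolerance:
        # Step in the direction opposite to the gradient
        point = point - learning_rate * grad(point)
        steps += 1
    return point, steps

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable
start = rng.uniform(-3.0, 3.0, size=2)     # random point in the parameter space
minimum, steps = gradient_descent(f, grad_f, start)

print('Start:', start)
print('Found minimum:', minimum, 'after', steps, 'steps')
```

For this function the iterates shrink towards the origin, which is the global minimum.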
 
The size of the step we take determines the speed of convergence. The step size is scaled by the learning rate,
<br>
which is a hyperparameter. This means that it is not directly modified during the training process,
<br>
but it has to be tuned using validation.

So the training is an iterative process where the update of the weights can be formulated as:

$$\mathbf{w}^{i+1} = \mathbf{w}^{i} - \eta \nabla E(\mathbf{w})(\mathbf{w}^{i}) \text{ , so}$$

$$w_{j}^{i+1} =w_j^{i} - \eta \frac{\partial E(\mathbf{w})}{\partial w_{j}}(\mathbf{w}^i)$$

where $\mathbf{w} = [w_1, w_2,\dots, w_n]$ and $\mathbf{w}^{i}$ is the collection of the weights in the $i^{th}$ iteration of the training process
<br>
(for $i = 0$ the weights are initialized with random values),
<br>
$E(\mathbf{w})$ is the error function and $ \nabla E(\mathbf{w})(\mathbf{w}^{i})$ is the gradient of the error function evaluated at the point $\mathbf{w}^{i}$,
<br>
$\eta$ is the learning rate and
<br>
$j \in \{1, 2, \dots, n\}$
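As a worked example of a single update step, assume the hypothetical error function $E(\mathbf{w}) = w_1^2 + w_2^2$, the starting point $\mathbf{w}^{0} = [1, -2]$ and the learning rate $\eta = 0.1$ (all three are chosen only for illustration):

$$\nabla E(\mathbf{w})(\mathbf{w}^{0}) = [2 \cdot 1,\ 2 \cdot (-2)] = [2, -4]$$

$$\mathbf{w}^{1} = \mathbf{w}^{0} - \eta \nabla E(\mathbf{w})(\mathbf{w}^{0}) = [1, -2] - 0.1 \cdot [2, -4] = [0.8, -1.6]$$

The error decreases from $E(\mathbf{w}^{0}) = 5$ to $E(\mathbf{w}^{1}) = 0.8^2 + 1.6^2 = 3.2$.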

<h3>Learning rate, iteration, batch and epoch</h3>

Increasing the learning rate means that we take larger steps during the training process.
<br>
This can speed up the training; however, a learning rate that is too large can make the training
<br>
unstable or even divergent. The effects of the learning rate can be seen in the figures below:

<center>
<img src="https://srdas.github.io/DLBook/DL_images/TNN2.png" width="60%"/>
    <img src="https://cdn-images-1.medium.com/max/1600/0*uIa_Dz3czXO5iWyI." width="40%"/>
</center>
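The same effect can be reproduced numerically. The sketch below minimizes the one-variable function $f(x) = x^2$ with three learning rates. The function name `run_gradient_descent` and the particular values 0.01, 0.4 and 1.1 are assumptions picked to show a slow, a fast and a diverging run.

```python
def run_gradient_descent(learning_rate, start=1.0, steps=20):
    # Minimize f(x) = x^2, whose derivative is 2x
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

# Distance from the minimum (x = 0) after 20 steps for each learning rate
results = {lr: abs(run_gradient_descent(lr)) for lr in (0.01, 0.4, 1.1)}
for lr, distance in results.items():
    print('Learning rate', lr, '-> distance from minimum:', distance)
```

With the smallest learning rate the point is still far from the minimum after 20 steps, with the middle one it is practically at the minimum, and with the largest one every step overshoots and the distance grows.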

The error function is often referred to as the loss function as well.

An iteration of the training process is a single weight update step.
<br>
A batch is a collection of training data for which the error is computed before the weight update
<br>
is performed. In case of Batch Gradient Descent, the whole training dataset is used as a batch.
<br>
In this case the weight update is performed only after the error is computed for the predictions
<br>
on all training samples.
<br>
In case of the Stochastic Gradient Descent algorithm, weight update is carried out for each training sample.
<br>
In case of Mini Batch Gradient Descent the batch size is greater than one but smaller than the size of the
<br>
training dataset.
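The three variants differ only in the batch size. The following minimal sketch splits a toy dataset into batches and counts how many weight updates one pass over the data would need. The helper `iterate_batches`, the toy arrays and the mini-batch size of 4 are assumptions for illustration.

```python
import numpy as np

def iterate_batches(inputs, labels, batch_size):
    # Yield consecutive slices of the dataset; the last batch may be smaller
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size], labels[start:start + batch_size]

# Toy dataset with 10 samples and 2 features per sample
inputs = np.arange(20, dtype=float).reshape(10, 2)
labels = np.arange(10, dtype=float).reshape(10, 1)

updates_per_pass = {}
for name, batch_size in (('batch', len(inputs)), ('stochastic', 1), ('mini-batch', 4)):
    # One weight update would be performed per yielded batch
    updates_per_pass[name] = sum(1 for _ in iterate_batches(inputs, labels, batch_size))

print(updates_per_pass)  # {'batch': 1, 'stochastic': 10, 'mini-batch': 3}
```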

The advantage of the Batch Gradient Descent method is its stability. However, it may get stuck in a suboptimal
<br>
solution and it requires the whole training set to be loaded in memory.

Stochastic Gradient Descent can converge faster than Batch Gradient Descent, but it requires more computations
<br>
due to the frequent weight updates and it can produce noisy gradients, so the error will not decrease smoothly.

The Mini Batch Gradient Descent can combine the advantages of the Batch and Stochastic Gradient Descent algorithms.
<br>
That is why choosing the right batch size for the given problem is very important. Just like the learning rate
<br>
the batch size is a hyperparameter as well.

An epoch is a full pass of all the training samples during the training process.
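To show how epochs, batches and iterations fit together, here is a small sketch of Mini Batch Gradient Descent that fits a line to noiseless toy data. The target line $y = 2x + 1$, the hyperparameter values and the variable names are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 2.0 * x + 1.0            # toy target without noise (assumption)

w, b = 0.0, 0.0              # parameters of the model w*x + b
learning_rate = 0.1
batch_size = 20
number_of_epochs = 100

for epoch in range(number_of_epochs):
    # One epoch: every training sample is used exactly once
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        # One iteration: a single weight update computed on one batch
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        grad_w = 2.0 * np.mean(error * x[idx])   # d(MSE)/dw on the batch
        grad_b = 2.0 * np.mean(error)            # d(MSE)/db on the batch
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print('w:', w, 'b:', b)
```

With 200 samples and a batch size of 20, each epoch consists of 10 iterations, and the learned parameters should end up close to 2 and 1.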

Now that we know how to compute the gradient of a multi-variable function and how to perform the training process
<br>
it is time to inspect the gradient in a neural network.

<h3>Derivatives of activation functions in neurons</h3>

The activation function of the neurons in the neural network has a great impact on the gradient computation,
<br>
so the derivatives of the activation functions should be calculated and inspected.

| Name | Equation of function, $f(x)$ | Equation of derivative, $f'(x)$ |
| ----- | ----- | ----- |
| Linear | $f(x) = x$ | $f'(x) = 1$ |
| Step | $f(x)=\begin{cases} 1 & \text{if $x \geq$ 0}\\ 0 & \text{if $x < 0$}\end{cases}$ | $f'(x)=\begin{cases} 0 & \text{if $x \neq 0$ }\\ \text{undefined} & \text{if $x = 0$ }\end{cases}$ |
| Sigmoid | $\sigma (x)=\frac{1}{1+e^{-x}}$ | $\sigma '(x) = \sigma (x)(1-\sigma (x))$ |
| Tangent Hyperbolic | $f(x) = \tanh(x)$ | $f'(x) = 1 - \tanh^2(x)$ |
| ReLU | $R(x) = \max{(0,x)}$ | $R'(x)=\begin{cases} 0 & \text{if $x \leq 0$ }\\ 1 & \text{if $x > 0$ }\end{cases}$ |
| Leaky ReLU | $LR(x) = \max{(\alpha x,x)}$ | $LR'(x)=\begin{cases} \alpha & \text{if $x \leq 0$ }\\ 1 & \text{if $x > 0$ }\end{cases}$ |
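The derivatives can be verified numerically. The sketch below implements some of the activation functions with their derivatives and compares each derivative with a central difference approximation. The helper `central_difference`, the value `alpha=0.01` and the test points (chosen to avoid $x = 0$, where ReLU and Leaky ReLU have a kink) are assumptions for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_derivative(x):
    return 1.0 - np.tanh(x)**2

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def central_difference(func, x, eps=1e-6):
    # Numerical approximation of the derivative
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.5, 2.0])
checks = {
    'sigmoid': np.allclose(sigmoid_derivative(x), central_difference(sigmoid, x), atol=1e-6),
    'tanh': np.allclose(tanh_derivative(x), central_difference(np.tanh, x), atol=1e-6),
    'relu': np.allclose(relu_derivative(x), central_difference(relu, x), atol=1e-6),
    'leaky_relu': np.allclose(leaky_relu_derivative(x), central_difference(leaky_relu, x), atol=1e-6),
}
print(checks)
```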


From this table it is clear why the step function is not used often in neural networks. The slope of the function is
<br>
zero everywhere except at $x = 0$, so the gradient cannot be used to determine in which direction we should modify the parameters.

The gradients in neural networks can be computed by using the chain rule of differentiation, because the calculation of the
<br>
output can be formulated as a composition of multi-variable functions.
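As a small worked example, consider a single neuron with a sigmoid activation. The symbols below are assumptions introduced for this illustration: $x_j$ are the inputs, $n$ is the weighted sum, $z$ is the output of the neuron, $y$ is the correct value and $E$ is a squared error with a factor of $\frac{1}{2}$ added for convenience.

$$n = \sum_{j} w_j x_j, \qquad z = \sigma(n), \qquad E = \frac{1}{2}(z - y)^2$$

Applying the chain rule gives the partial derivative of the error with respect to each weight:

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial z} \cdot \frac{\partial z}{\partial n} \cdot \frac{\partial n}{\partial w_j} = (z - y) \cdot \sigma(n)(1 - \sigma(n)) \cdot x_j$$

In a network with several layers the same idea is applied repeatedly, multiplying the local derivatives from the output back to the weight in question.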

The final component that should be discussed about the training process is the formulation of the error, or the so-called loss function.

<h2>Loss function</h2>

The two most popular loss functions for neural networks are the Mean Squared Error and the Cross Entropy loss.
<br>
The Mean Squared Error is used in case of regression problems, when the error is the difference between continuous values.
<br>
The Cross Entropy loss is used in case of classification problems, when the output of the network is a probability distribution
<br>
for the output classes.

These loss functions can be formulated like:

$$MSE = \frac{1}{N}\sum_{i = 1}^N(z_i - y_i)^2$$

where $MSE$ is the Mean Squared Error over $N$ predictions (during training $N$ is the batch size),
<br>
$z_i$ is the predicted value for the $i^{th}$ sample
<br>
$y_i$ is the correct value for the output if the input is the $i^{th}$ sample

and the Cross Entropy is:

$$CE = -\sum_{j = 1}^My_j\log(p_j)$$

where $CE$ is the Cross Entropy loss for a single prediction on a single training sample,
<br>
$M$ is the number of classes,
<br>
$p_j$ is the $j^{th}$ element of the predicted probability distribution and
<br>
$y_j$ is the $j^{th}$ element of the correct probability distribution

For a batch of samples, the sum or the mean of the Cross Entropy loss over the samples can be used.
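Both loss functions take only a few lines of NumPy. The following is a minimal sketch; the function names, the clipping constant `eps` (used to avoid taking the logarithm of zero) and the sample values are assumptions for illustration.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    # Mean of the squared differences over all predictions
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((predictions - targets)**2)

def cross_entropy(predicted_probs, true_probs, eps=1e-12):
    # Cross Entropy for a single sample; clip to avoid log(0)
    predicted_probs = np.clip(np.asarray(predicted_probs, dtype=float), eps, 1.0)
    true_probs = np.asarray(true_probs, dtype=float)
    return -np.sum(true_probs * np.log(predicted_probs))

# Regression: only the third prediction is wrong, by 2, so MSE = 4/3
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Classification with 3 classes: the correct class is the first one, so CE = -log(0.7)
ce = cross_entropy([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])

print('MSE:', mse)
print('CE: ', ce)
```

When the correct distribution is one-hot, as here, the Cross Entropy reduces to the negative logarithm of the probability predicted for the correct class.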

It can be seen that for both loss functions, the correct labels have to be provided. We call these methods
<br>
supervised learning, because the training samples have to be provided with correct labels.

Unsupervised learning can be used to detect similarities in the data and to perform clustering.
<br>
That is why the loss functions for unsupervised learning methods are usually distance measures.

Now let's see a simple example of a supervised regression problem using the MSE as the loss function:

In [None]:
#Define activation functions:
def sigmoid(x):
    return (1/ (1 + np.exp(-x)))

def linear(x):
    return (x)


#Function to regress:
def function(inp):
    return (0.1*inp[0]+0.4*inp[1])


#Layers of the network:
def layer(inputs, weights, activation):
    bias_inputs = np.ones((inputs.shape[0],1))
    inputs = np.append(inputs, bias_inputs, axis=1)
    return (activation(inputs.dot(weights)))


#Prepare inputs:
x = np.arange(-3.0, 3.0, 0.05)
y = np.arange(-3.0, 3.0, 0.05)
X, Y = np.meshgrid(x, y)
zs = np.array([function([x,y]) for x,y in zip(np.ravel(X), np.ravel(Y))])
Z = zs.reshape(X.shape)

#Build the training set directly as arrays
#(a ragged array of ([x,y], label) pairs raises an error in recent NumPy versions)
inputs = np.column_stack((np.ravel(X), np.ravel(Y))) #one [x,y] row per sample
labels = np.reshape(zs,[-1,1]) #one label per sample


#Hyperparameters:
number_of_inputs = 2
number_of_neurons = [2,1]
activations = [sigmoid,linear]
learning_rate = 0.1
number_of_epochs = 200


#Define weights:
weights = []
for i in range(len(number_of_neurons)):
    if i==0:
        weights.append(np.random.uniform(-1.0,1.0,(number_of_inputs+1,number_of_neurons[i])))
    else:
        weights.append(np.random.uniform(-1.0,1.0,(number_of_neurons[i-1]+1,number_of_neurons[i])))


#Build network:
losses = []

def net(inputs, weights, activations):
    acts = []
    acts.append(inputs)
    for i in range(len(weights)):
        hidden = layer(acts[-1],weights[i],activations[i])
        acts.append(hidden)
    return (acts)

#Training:
for k in range(number_of_epochs):
    acts = net(np.array(inputs), weights, activations)
    output = acts[-1]

    #Half of the MSE: the factor 1/2 cancels when the square is differentiated
    loss = ((output - labels)**2).sum()/(2*len(labels))

    print('Epoch', k, 'Loss: ', loss)
    losses.append(loss)

    dldz = (output-labels) #z-y
    dzdw2 = np.append(acts[-2], np.ones((len(acts[-2]),1)), axis=1) #h

    dldw2 = np.reshape((dldz*dzdw2).sum(axis=0)/len(labels),[-1,1])

    dzdh = np.reshape(weights[1][0:-1],-1) #w_2^T without the bias weight
    dhdn = acts[-2]*(1-acts[-2]) #h.(1-h)
    dndw1 = np.append(acts[-3], np.ones((len(acts[-3]),1)), axis=1) #x

    #The gradient has the same shape as weights[0]
    dldw1 = np.zeros((number_of_inputs+1,number_of_neurons[0]))
    for i in range(len(labels)):
        #Use the hidden activations of the i-th sample, not always those of the first one
        dldw1 += np.reshape(dndw1[i],[-1,1]).dot(np.reshape(dldz[i]*dzdh*dhdn[i],[1,-1]))

    dldw1 /= len(labels)

    weights[0] -= learning_rate*dldw1
    weights[1] -= learning_rate*dldw2


#Visualize the training and results:
plt.figure(1)
plt.plot(range(len(losses)),losses)
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.title('Training process')

fig = plt.figure(2)
ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(X, Y, Z)

pred_zs = np.array([net(np.array([[x,y]]), weights, activations)[-1] for x,y in zip(np.ravel(X), np.ravel(Y))])
pred_Z = pred_zs.reshape(X.shape)

ax.plot_surface(X, Y, pred_Z)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
ax.set_title('Function and predictions')

plt.show()

From the example above it can be seen that even a very simple neural network is tedious and error-prone to build
<br>
if we are coding everything from scratch. That is why in the next block we will learn how to use
<br>
the TensorFlow framework. However, first of all, we should discuss the theory of deep learning.

Continue: [2.3 Deep Learning Theory](Deep_Learning.ipynb)