<h1 style="color:white;background-color:rgb(255, 108, 0);padding-top:1em;padding-bottom:0.7em;padding-left:1em;">3.4 Optimizers and Training Process</h1>
<hr>

<h2>Introduction</h2>

In this lesson we first see how to use the built-in optimizers of TensorFlow on a simple task.
<br>
Next, we see how to create simple neural network structures for classification and regression and how to
<br>
perform a training process that tunes the variables of the models.

First of all, import the required modules:

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import SIT_visit.Block_3.utils as utils

<h2>Optimizers</h2>

The optimizers provided in TensorFlow can be found in the ```tf.train``` package.
<br>
The most important ones are the GradientDescentOptimizer, the AdamOptimizer and the MomentumOptimizer.
<br>
For more information on optimizers, see https://www.tensorflow.org/api_docs/python/tf/train

These optimizers iteratively modify the variables of the model in order to find a minimum of a function.

<p style="margin-top:2em;">Now let's see how to use an optimizer to find the minimum value of a simple function $(y=x^4+2x^3-4x^2-4x+5)$, where $x,y \in \mathbb{R}$:</p>

In [None]:
#Create a function that computes y
def function(x):
    return (x**4+2*x**3-4*x**2-4*x+5)

#Plot the function
x = np.linspace(-3, 2, 100)

plt.plot(x, function(x), color='black')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Function')
plt.show()

In [None]:
#Create lists to store values during all iterations
X_values = []
Y_values = []

tf.reset_default_graph()

#Create a variable and initialize it with a random normal distribution
x_1 = tf.get_variable('x_1', shape=[], dtype=tf.float32, initializer=tf.initializers.random_normal)

#Create a variable and initialize it as zero
x_2 = tf.get_variable('x_2', shape=[], dtype=tf.float32, initializer=tf.initializers.zeros)

#Create a variable and initialize it with a constant value of -1
x_3 = tf.get_variable('x_3', dtype=tf.float32, initializer=-1.0)

#Stack all variables into a vector, so the result of the function for each one can be computed in one pass
X = tf.stack(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES), axis=0, name='stack')

Y = function(X)

#Create a gradient descent optimizer and tell it to minimize the value of Y
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(Y)

#Create an initializer for the variables
init = tf.global_variables_initializer()

#Run the training steps until the change in every x value is less than 0.0001
train = True

with tf.Session() as sess:
    sess.run(init)
    while train:
        X_v,Y_v,_ = sess.run([X,Y,train_step])
        #Stop once no x value has moved by 0.0001 or more since the previous iteration
        if X_values and all(abs(X_values[-1]-X_v) < 0.0001):
            train = False
        X_values.append(X_v)
        Y_values.append(Y_v)
                
print('Training finished')
print('The minimum places and values (x,y) found from the initial positions:')
for i in range(len(X_values[-1])):
    print('x:', X_values[-1][i], 'y:', Y_values[-1][i])

#Plot the function and the iteration steps
x = np.linspace(-3, 2, 100)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, function(x), label='function', color='black')

cm1 = cm.Blues(np.linspace(0.4,1,len(X_values)))
cm2 = cm.Reds(np.linspace(0.4,1,len(X_values)))
cm3 = cm.YlGn(np.linspace(0.4,1,len(X_values)))

ax.scatter([x[0] for x in X_values], [y[0] for y in Y_values], s=100, label='random normal init', color=cm1, marker='^')
ax.scatter([x[1] for x in X_values], [y[1] for y in Y_values],  s=100, label='0 init', color=cm2, marker='x')
ax.scatter([x[2] for x in X_values], [y[2] for y in Y_values],  s=100, label='-1 init', color=cm3, marker='+')
plt.legend()
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Finding minimum from different start points')
plt.show()

<p style="margin-top:2em;">The example above shows that the Gradient Descent optimization method can
<br>
get stuck in local minima. That is why the initialization of the variables is very important.</p>
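
The dependence on the starting point can be reproduced without TensorFlow. The block below is a minimal pure-Python sketch, not the optimizer's actual implementation: the derivative in `gradient` is written out by hand, and the helper name `gradient_descent` together with its `tolerance` and `max_steps` arguments are assumptions made for illustration.

```python
def function(x):
    return x**4 + 2*x**3 - 4*x**2 - 4*x + 5

def gradient(x):
    # Analytic derivative of function(x)
    return 4*x**3 + 6*x**2 - 8*x - 4

def gradient_descent(x, learning_rate=0.01, tolerance=1e-4, max_steps=10000):
    # Repeat x <- x - learning_rate * dy/dx until the step is smaller than tolerance
    for _ in range(max_steps):
        step = learning_rate * gradient(x)
        x -= step
        if abs(step) < tolerance:
            break
    return x

for start in (0.0, -1.0):
    x_min = gradient_descent(start)
    print('start:', start, 'x:', round(x_min, 3), 'y:', round(function(x_min), 3))
```

Starting from 0 the iteration ends near $x \approx 1.11$, which is only a local minimum, while starting from -1 it reaches the lower minimum near $x \approx -2.20$.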

Next, let's see what difference the choice of optimizer makes:

In [None]:
#Create lists to store values during all iterations
X_values = []
Y_values = []

tf.reset_default_graph()

#Create variables and initialize them with values of -1
x_1 = tf.get_variable('x_1', dtype=tf.float32, initializer=-1.0)

x_2 = tf.get_variable('x_2', dtype=tf.float32, initializer=-1.0)

x_3 = tf.get_variable('x_3', dtype=tf.float32, initializer=-1.0)

#Stack all variables into a vector, so the result of the function for each one can be computed in one pass
X = tf.stack(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES), axis=0, name='stack')

Y = function(X)

#Create optimizers and tell them to minimize the value of Y
train_step_1 = tf.train.GradientDescentOptimizer(0.01).minimize(Y, var_list=[x_1])
train_step_2 = tf.train.AdamOptimizer(0.1).minimize(Y, var_list=[x_2])
train_step_3 = tf.train.MomentumOptimizer(0.01, 0.8).minimize(Y, var_list=[x_3])

#Create an initializer for the variables
init = tf.global_variables_initializer()

#Run the training steps until the change in every x value is less than 0.0001
train = True

with tf.Session() as sess:
    sess.run(init)
    while train:
        X_v,Y_v,_,_,_ = sess.run([X,Y,train_step_1,train_step_2,train_step_3])
        #Stop once no x value has moved by 0.0001 or more since the previous iteration
        if X_values and all(abs(X_values[-1]-X_v) < 0.0001):
            train = False
        X_values.append(X_v)
        Y_values.append(Y_v)
                
print('Training finished')
print('The minimum places and values (x,y) found from the initial positions:')
for i in range(len(X_values[-1])):
    print('x:', X_values[-1][i], 'y:', Y_values[-1][i])

#Plot the function and the iteration steps
x = np.linspace(-3, 2, 100)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, function(x), label='function', color='black')

cm1 = cm.Blues(np.linspace(0.4,1,len(X_values)))
cm2 = cm.Reds(np.linspace(0.4,1,len(X_values)))
cm3 = cm.YlGn(np.linspace(0.4,1,len(X_values)))

ax.scatter([x[0] for x in X_values], [y[0] for y in Y_values], s=100, label='Gradient Descent', color=cm1, marker='^')
ax.scatter([x[1] for x in X_values], [y[1] for y in Y_values],  s=100, label='Adam', color=cm2, marker='x')
ax.scatter([x[2] for x in X_values], [y[2] for y in Y_values],  s=100, label='Momentum', color=cm3, marker='+')
plt.legend()
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Finding minimum with different optimizers')

#Uncomment these to show the behaviour near the minimum point
#plt.xlim(-2.3,-2)
#plt.ylim(-3.5,-3)

plt.show()

<p style="margin-top:2em;">As seen from the previous example, the Adam and the Momentum optimizers settle in the
<br>
minimum with oscillations. This behaviour is due to the momentum term in their formulation.</p>
<br>
The Momentum optimizer uses a fixed momentum coefficient for the weight updates, while the Adam optimizer combines a moving average of the gradients with a step size that adapts to the gradient magnitudes.
<br>
The momentum term can help these optimizers to swing over shallow local minima.
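
The oscillation can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not TensorFlow's code: it uses the update rule documented for `tf.train.MomentumOptimizer` (`accumulation = momentum * accumulation + gradient`, then `variable -= learning_rate * accumulation`), a hand-written derivative, and a fixed number of steps; the helper name `run_momentum` is hypothetical.

```python
def gradient(x):
    # Analytic derivative of y = x^4 + 2x^3 - 4x^2 - 4x + 5
    return 4*x**3 + 6*x**2 - 8*x - 4

def run_momentum(x, learning_rate=0.01, momentum=0.8, steps=300):
    # accumulation <- momentum * accumulation + gradient
    # x            <- x - learning_rate * accumulation
    accumulation = 0.0
    trajectory = [x]
    for _ in range(steps):
        accumulation = momentum * accumulation + gradient(x)
        x -= learning_rate * accumulation
        trajectory.append(x)
    return trajectory

plain = run_momentum(-1.0, momentum=0.0)  # momentum=0 reduces to plain gradient descent
heavy = run_momentum(-1.0, momentum=0.8)

print('final x:', round(plain[-1], 3), round(heavy[-1], 3))
print('leftmost x visited:', round(min(plain), 3), round(min(heavy), 3))
```

Both runs end near the same minimum, but the run with momentum first overshoots it and swings back, whereas plain gradient descent approaches it from one side only.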

<h2>Training Process</h2>

Now, let's see two simple examples of building the whole data pipeline, from
<br>
data preparation to the end of the training process. First, we create a fully connected neural network
<br>
and train it to learn the classic XOr function.

**Problem statement:**

There are two logical variables $a$ and $b$ that will be the inputs for our network.
<br>
We would like to predict the XOr of $a$ and $b$ with our model.

The truth table of the XOr relation between $a$ and $b$ is:

| a | b | XOr |
| -- | -- | -- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

We can compute the right labels for the network according to this table.
<br>
This task can be solved by classification, because the output is a discrete value.
<br>
It can either be True (1) or False (0).

The network should have a hidden layer of two neurons and a single output neuron.
<br>
The network should look like this:

<center>
<img src="http://drive.google.com/uc?export=view&id=1vuDFz0vK6tKGyk-uHg4peybJCGcS_L4u" width="30%"/>
</center>

The neurons should implement a sigmoid activation function to force the outputs to be
<br>
between zero and one. We will use the built-in sigmoid activation function in TensorFlow.
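
To see that two sigmoid hidden neurons and one sigmoid output neuron are enough for this task, here is a small pure-Python sketch of the forward pass. The weights and biases are picked by hand for illustration; they are an assumption, not the values that the training will find.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def xor_net(a, b):
    # Hand-picked (not trained) weights and biases, chosen only for illustration
    h1 = sigmoid(20*a + 20*b - 10)      # behaves like OR
    h2 = sigmoid(-20*a - 20*b + 30)     # behaves like NAND
    return sigmoid(20*h1 + 20*h2 - 30)  # behaves like AND of h1 and h2

for a, b in itertools.product([0, 1], repeat=2):
    p = xor_net(a, b)
    print(a, b, 'prediction:', round(p, 4), 'thresholded:', int(p > 0.5))
```

Thresholding the output at 0.5 reproduces the XOr column of the truth table.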

The output of the network should be compared to the labels and a loss term should be introduced that we can minimize.
<br>
In this task we will use the binary cross entropy loss function:

$$L_{CE}=z(-\log{p})+(1-z)(-\log{(1-p)})$$

where $z$ is the correct label (0 or 1) and $p$ is the prediction of the network that can be interpreted as
<br>
the probability of the output being True.
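
As a quick numeric illustration of the formula, the following sketch evaluates the loss for a True label and several predictions. The function name `binary_cross_entropy` is chosen here for illustration and is not a TensorFlow function.

```python
import math

def binary_cross_entropy(z, p):
    # z: correct label (0 or 1), p: predicted probability of the output being True
    return z * (-math.log(p)) + (1 - z) * (-math.log(1 - p))

for p in (0.9, 0.5, 0.1):
    print('z=1, p=%.1f -> loss %.4f' % (p, binary_cross_entropy(1, p)))
```

The loss is small when the prediction agrees with the label and grows as the prediction moves towards the wrong value.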

In TensorFlow, this loss function is already defined, so we just have to use it.
<br>
It is also important to note that the loss function in TensorFlow expects the output of the network
<br>
without the sigmoid activation function, because the sigmoid is applied inside the calculation of the loss for numerical stability and performance purposes.
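
The sketch below shows why working on the raw network output helps. It uses the formulation given in the TensorFlow documentation of `tf.nn.sigmoid_cross_entropy_with_logits`, `max(x, 0) - x * z + log(1 + exp(-abs(x)))`, written in plain Python; the two function names are hypothetical and only serve the comparison.

```python
import math

def loss_from_logit(x, z):
    # x: network output before the sigmoid (logit), z: correct label (0 or 1)
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

def loss_from_probability(x, z):
    # Naive version: apply the sigmoid first, then the cross entropy formula
    p = 1.0 / (1.0 + math.exp(-x))
    return z * (-math.log(p)) + (1 - z) * (-math.log(1 - p))

# Both versions agree for moderate logits
print(loss_from_logit(2.0, 1), loss_from_probability(2.0, 1))

# The logit version stays finite for a large logit with the wrong label;
# the naive version would have to evaluate log(0) here
print(loss_from_logit(50.0, 0))
```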

We will use the Gradient Descent optimizer to minimize this loss by modifying the parameters of our network.

We have to address the following tasks:
 - Prepare the training data for the training process
 - Build the data pipeline for TensorFlow
 - Construct the neural network model
 - Define a loss function and select an optimizer for the task
 - Code the training process
 - Train the model
 - Evaluate the model

In [None]:
import itertools

#Data preparation:
inputs = list(itertools.product([0, 1], repeat=2))
labels = [[a^b] for a,b in inputs]
print(inputs)
print(labels)


#Data pipeline:
batch_size = 100

tf.reset_default_graph()

ds = tf.data.Dataset.from_tensor_slices((inputs,labels)).shuffle(len(labels)).repeat().batch(batch_size)

iterator = ds.make_initializable_iterator()
iterator_init = iterator.initializer

inps,labs = iterator.get_next()
inps = tf.identity(inps, name='inputs')
labs = tf.identity(labs, name='labels')


#Neural network model:
def net(x, layer_sizes=(2,1), reuse=False, name='FC'):
    with tf.variable_scope(name, reuse=reuse):
        hidden = tf.layers.dense(x, layer_sizes[0], activation=tf.nn.sigmoid, name='hidden_layer')
        output = tf.layers.dense(hidden, layer_sizes[1], name='output_layer')
        return output

logits = net(tf.to_float(inps))
predictions = tf.nn.sigmoid(logits, name='predictions')
th_predictions = tf.to_int32(predictions > 0.5, name='th_predictions')
accuracy = tf.identity(tf.reduce_sum(tf.to_int32(tf.equal(th_predictions,labs)))/batch_size, name='accuracy')


#Loss function and optimization:
loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.to_float(labs), logits=logits, name='loss'))

train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

variable_init = tf.global_variables_initializer()

#Training process and evaluation:
train = True

with tf.Session() as sess:
    sess.run([iterator_init,variable_init])
    c = 0
    while train:
        l,a,_ = sess.run([loss, accuracy, train_step])
        print('Batch:', c, 'loss:', l, 'accuracy:', a)
        if a==1 or c > 500:
            train = False
        c += 1
    print('\nEnd of training process\n')
    p,th_p = sess.run([predictions,th_predictions], feed_dict={inps: inputs})
    for j in range(len(inputs)):
        print('\nInputs:', inputs[j], 'prediction:', p[j], '\nth_prediction:', th_p[j], 'correct label:', labels[j])

<h2>Exercise 3.4</h2>

Train a fully connected neural network for a regression task!
<br>
The training data can be found in the NumPy file data.npy in the subdirectory called data.
<br>
Load the data like this:
```python
inputs,labels = np.load('SIT_visit/Block_3/data/data.npy')
```
The inputs for the network are single scalars $x$ and the labels $y$ are scalars as well, where $x,y\in \mathbb{R},\quad x\in[0,1]$

You will have to perform the following tasks:
 - Plot the training data
 - Prepare the training data for the training process
 - Build the data pipeline for TensorFlow
 - Construct the neural network model (Fully connected neural network with two hidden layers and one output neuron)
 <br>
 The hidden neurons should have hyperbolic tangent activation and the output neuron does not need an activation function.
 - Choose a suitable number of neurons for the hidden layers yourself.
 - Use the ```tf.losses.mean_squared_error()``` loss function and the Gradient Descent optimizer and choose a learning rate
 - Code the training process
 - Train the model and plot the loss during the training
 - Evaluate the model (Plot the training data and the predictions for values in the given ```test_data``` array)
 
 Extra tasks:
 - Experiment with the batch size, number of neurons, activation functions, optimizer and learning rate
 <br>
 to create models that converge fast and are accurate, or ones that require the least number of parameters.
 
 - Build a fully connected network that estimates the parameters $s_i,\space i \in \{0,1,2,3,4,5\}$ if we suppose
 <br>
 that the data can be best approximated with a function like:
 $$y=s_0\sin^2(s_1x+s_2)+s_3\sin(s_4x+s_5)$$
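
For plotting candidate curves next to the training data, the parametric function above can be written as a small Python helper. This is only a sketch: the name `model` and the parameter values in `example_s` are hypothetical and are not fitted to the data.

```python
import math

def model(x, s):
    # s: sequence of the six parameters s_0 ... s_5
    return s[0] * math.sin(s[1]*x + s[2])**2 + s[3] * math.sin(s[4]*x + s[5])

# Hypothetical parameter values, only to show the call; they are not fitted to the data
example_s = [1.0, 3.0, 0.0, 0.5, 10.0, 0.0]
print([round(model(x / 10, example_s), 3) for x in range(11)])
```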

In [None]:
#Test data
test_data = np.arange(0,1,0.01)

See the solution here: [Exercise 3.4 solution](Excersise_3_4.ipynb)

Continue: [3.5 TensorFlow Project](TensorFlow_Project.ipynb)