Inga Ulusoy, Computational modelling in python, SoSe2020 

# Machine learning using neural networks and Keras/TensorFlow

Parts of the notebooks are taken from 

https://developers.google.com/machine-learning/crash-course/ml-intro


## Neural networks

If you would like to implement your own neural network, you may take a look at this page:

https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6

However, we will use the framework provided by Keras/TensorFlow in the following and not start from scratch.

### Feedforward
The output is constructed as
\begin{align}
 z_1 & = W^{(1)}x + b^{(1)} \\
 z_2 & = W^{(2)} \sigma(z_1) + b^{(2)} \\
 y &= \sigma(z_2)
 \end{align}
 The "correct" weights $W$ and biases $b$ are obtained by training the neural network.
 The function $\sigma$ is the activation function. Many different activation functions are in use; the sigmoid function is a popular choice:
 \begin{align}
 \sigma(z)=\frac{1}{1+\exp(-z)}
 \end{align}
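
The feedforward equations can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes (one input, three hidden nodes, one output), the random weights and the helper names `sigmoid` and `feedforward` are assumptions made for this example and are not part of the Keras models used later.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, b1, W2, b2):
    """Two-layer forward pass: z1 -> sigma(z1) -> z2 -> y."""
    z1 = W1 @ x + b1
    z2 = W2 @ sigmoid(z1) + b2
    return sigmoid(z2)

# Illustrative shapes: 1 input, 3 hidden nodes, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 1)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

y = feedforward(np.array([0.5]), W1, b1, W2, b2)
print(y)
```

Because the last step applies the sigmoid, the output always lies between 0 and 1.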
### Loss function
We will use a simple sum-of-squares (least squares) loss function (see Problem2):
\begin{align}
R^2=\sum_{i=1}^n [y_{i,\rm{exact}}-y_{i}]^2
 \end{align}
Other loss functions can be used as well; the appropriate choice depends on the problem that is considered. The value $y_{i,\rm{exact}}$ is the exact solution (the actual $y$) from the dataset that is used in the training.
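
As a small sketch, the sum-of-squares loss can be written with NumPy as follows. The function name `sum_of_squares` and the sample numbers are made up for illustration.

```python
import numpy as np

def sum_of_squares(y_exact, y_pred):
    """Sum of squared differences between exact and predicted values."""
    y_exact = np.asarray(y_exact, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_exact - y_pred) ** 2))

# Differences are 0.0, -0.5 and 1.0, so the loss is 0.25 + 1.0 = 1.25.
print(sum_of_squares([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```

Dividing this value by the number of examples gives the mean squared error, which is the loss the Keras models in this notebook are compiled with.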

### Backpropagation
We need to follow the loss function towards its minimum. As $R^2$ does not directly depend on $W$ and $b$, we apply the chain rule:
\begin{align}
\frac{\partial R^2}{\partial W^{(2)}} = \frac{\partial R^2}{\partial y} \frac{\partial y}{\partial z_2} \frac{\partial z_2}{\partial W^{(2)}} 
\end{align}
and similarly for $b^{(2)}$. In this specific example, the contribution of a single example $i$ to the above equation is
\begin{align}
\frac{\partial R^2}{\partial W^{(2)}} = -2 (y_{i,\rm{exact}}-y_{i})
\cdot \frac{\partial \sigma}{\partial z_2} 
\cdot \sigma(z_1)
\end{align}
where the last factor is $\partial z_2 / \partial W^{(2)} = \sigma(z_1)$, since $b^{(2)}$ does not depend on $W^{(2)}$, and
\begin{align}
\frac{\partial \sigma}{\partial z_2}= \sigma(z_2)\left(1-\sigma(z_2)\right)
\end{align}
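
A chain-rule gradient of this kind can be checked numerically. The sketch below assumes a single example with a scalar output and uses the factors $\partial R^2/\partial y = -2(y_{\rm exact}-y)$, $\partial y/\partial z_2 = y(1-y)$ and $\partial z_2/\partial W^{(2)} = \sigma(z_1)$. All numerical values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W2, a1, b2, y_exact):
    """Squared error for one example with a scalar output."""
    y = sigmoid(W2 @ a1 + b2)
    return (y_exact - y) ** 2

# Hypothetical values: a1 stands for sigma(z_1), the hidden activations.
a1 = np.array([0.2, 0.7, 0.5])
W2 = np.array([0.1, -0.3, 0.8])
b2, y_exact = 0.05, 1.0

# Analytic gradient from the chain rule.
y = sigmoid(W2 @ a1 + b2)
grad_analytic = -2.0 * (y_exact - y) * y * (1.0 - y) * a1

# Central finite differences as an independent check.
eps = 1e-6
grad_numeric = np.zeros_like(W2)
for k in range(W2.size):
    step = np.zeros_like(W2)
    step[k] = eps
    grad_numeric[k] = (loss(W2 + step, a1, b2, y_exact)
                       - loss(W2 - step, a1, b2, y_exact)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))
```

Frameworks such as TensorFlow compute these derivatives automatically, so a check like this is mainly useful when implementing a network from scratch.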

In [None]:
#Modified google colab

#@title Copyright 2020 Google LLC. Double-click here for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

You need version 2.x of TensorFlow. If it is not installed yet, you may need to run

`pip install tensorflow`

In [None]:
#we will use numpy only once here so I import it as np
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt

We need to initialize the model first.

In [None]:
def build_model(my_learning_rate):
    """Create and compile a simple linear regression model."""
    # Most simple tf.keras models are sequential. 
    # A sequential model contains one or more layers.
    # Each layer has one input and one output.
    # A non-sequential model has branching in the layers (multiple inputs or outputs)
    model = tf.keras.models.Sequential()

    # Describe the topography of the model.
    # The topography of a simple linear regression model
    # is a single node in a single layer. 
    # Dense means that it is a densely connected NN
    # https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
    # units specifies the number of nodes in the layer
    # (the dimensionality of its output) - here one
    # input_shape specifies the dimensionality of the input - here one
    model.add(tf.keras.layers.Dense(units=1, 
                                  input_shape=(1,)))

    # Compile the model topography into code that 
    # TensorFlow can efficiently execute. 
    # Configure training to minimize the model's mean squared error. 
    # compile is a method of model, and will effectively configure the model 
    # for training.
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])

    return model        

This is the training function of the model. You need to pass the hyperparameters such as epochs and batch size.

In [None]:
def train_model(model, feature, label, epochs, batch_size):
    """Train the model by feeding it data."""

    # Feed the feature values and the label values to the 
    # model. The model will train for the specified number 
    # of epochs, gradually learning how the feature values
    # relate to the label values. 
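    # Convert the Python lists to arrays and pass the requested batch
    # size on to Keras (batch_size=None would fall back to its default of 32).
    history = model.fit(x=np.array(feature),
                        y=np.array(label),
                        batch_size=batch_size,
                        epochs=epochs)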
  
    # Gather the trained model's weight and bias.
    trained_weight = model.get_weights()[0]
    trained_bias = model.get_weights()[1]
  
    # The list of epochs is stored separately from the 
    # rest of history.
    epochs = history.epoch
    
    # Gather the history (a snapshot) of each epoch.
    hist = pd.DataFrame(history.history)
  
    # Specifically gather the model's root mean 
    #squared error at each epoch. 
    rmse = hist["root_mean_squared_error"]
  
    return trained_weight, trained_bias, epochs, rmse

Here we have some plotting functions that help us visualize the output from the training.

In [None]:
#@title Define the plotting functions
def plot_the_model(trained_weight, trained_bias, feature, label):
    """Plot the trained model against the training feature and label."""
  
    # Label the axes.
    plt.xlabel("feature")
    plt.ylabel("label")
  
    # Plot the feature values vs. label values.
    plt.scatter(feature, label)
  
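    # Create a red line representing the model. The red line starts
    # at coordinates (x0, y0) and ends at coordinates (x1, y1).
    # The trained weight and bias arrive as small arrays, so reduce
    # them to plain floats first.
    weight = float(np.squeeze(trained_weight))
    bias = float(np.squeeze(trained_bias))
    x0 = 0
    y0 = bias
    x1 = feature[-1]
    y1 = bias + (weight * x1)
    plt.plot([x0, x1], [y0, y1], c='r')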
  
    # Render the scatter plot and the red line.
    plt.show()

def plot_the_loss_curve(epochs, rmse):
    """Plot the loss curve, which shows loss vs. epoch."""
  
    plt.figure()
    plt.xlabel("Epoch")
    plt.ylabel("Root Mean Squared Error")
  
    plt.plot(epochs, rmse, label="Loss")
    plt.legend()
    plt.ylim([rmse.min()*0.97, rmse.max()])
    plt.show()

### Define the dataset

We will start with some simple artificial ("synthetic") data.

We will use the volume vs extinction data from Problem2 (Problem16), slightly linearized for the high volumes and with a few additional points. We add some noise to the data to make it more realistic. 

This dataset consists of 11 examples; the features (input x) are stored in `my_feature` and the label (output y) is stored in `my_label`.

In [None]:
my_feature = ([2.5,5.0,7.5,10.0,12.5,15.0,17.5,20.0,22.5,25.0,30.0])
my_label   = ([0.097, 0.195, 0.289, 0.387, 0.483, 0.581, 0.686, 0.792,0.890, 0.982, 1.166])
plt.plot(my_feature,my_label,marker='x')
#now add some noise
noise = np.random.normal(0,0.05,len(my_feature))
# convert to an array first: "+=" on a list would append the noise values
my_label = list(np.array(my_label) + noise)
plt.scatter(my_feature,my_label,marker='o')
plt.show()

# The hyperparameters

The hyperparameters are the learning rate, the number of epochs, and the batch size. The learning rate scales the gradient to give the size of each update step, so that
\begin{align}
\mathrm{learning\:rate} = \frac{\mathrm{step\:size}}{\mathrm{gradient\:magnitude}}
\end{align}
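
The role of the learning rate in a single update, parameter minus learning rate times gradient, can be sketched with plain gradient descent on a toy loss. This is a simplified illustration: the function `gradient_descent` and the quadratic loss are assumptions for this example, and the RMSprop optimizer used in this notebook additionally rescales the gradient.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly apply w <- w - learning_rate * grad(w)."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path

# Toy loss L(w) = (w - 3)^2 with gradient 2 (w - 3); the minimum is at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

small = gradient_descent(grad, w0=0.0, learning_rate=0.1, steps=50)
large = gradient_descent(grad, w0=0.0, learning_rate=1.1, steps=50)

# Distance from the minimum after 50 steps for each learning rate.
print(abs(small[-1] - 3.0), abs(large[-1] - 3.0))
```

With the small learning rate the distance to the minimum shrinks at every step, whereas with the large one each step overshoots the minimum and the distance grows.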

Build and train the model.

In [None]:
learning_rate=0.005
epochs=10
my_batch_size=11
my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                         my_label, epochs,
                                                         my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

The loss curve has not flattened out, so the model is not trained completely. The dots in the first plot show the correct $y$ values and the red line shows the model's prediction. If you execute the above cell several times, the result will look different because the weights are initialized with random numbers.

### Increase the number of epochs

In [None]:
learning_rate=0.005
epochs= 200   

my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                        my_label, epochs,
                                                        my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

Depending on the initial random numbers of the weights, you will observe that the loss curve may flatten out, but not completely, and the fit may still not be very good. In addition, this took quite a while to run due to the high number of epochs.

### Increase the learning rate

In [None]:
# Increase the learning rate and decrease the number of epochs.
learning_rate=1
epochs=50

my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                         my_label, epochs,
                                                         my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

You may see lots of bumps and peaks in the loss curve. The learning rate is too high.

### Find the ideal combination of learning rate and epochs

In [None]:
#learning_rate=0.022
learning_rate=0.010
epochs=250
my_batch_size=11

my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                         my_label, epochs,
                                                         my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

This somewhat depends on the random noise of the initial dataset. It is quite difficult to adjust the parameters!
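
For a linear model, a closed-form least-squares fit gives a reference for the trained weight and bias. The sketch below uses `np.polyfit` on the noise-free values of the dataset, which is a choice made here for illustration. With the noisy labels the numbers will differ slightly from run to run.

```python
import numpy as np

# Noise-free volume/extinction values copied from the dataset cell.
feature = np.array([2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 30.0])
label = np.array([0.097, 0.195, 0.289, 0.387, 0.483, 0.581,
                  0.686, 0.792, 0.890, 0.982, 1.166])

# Degree-1 polynomial fit: coefficients are returned highest power first,
# that is (slope, intercept).
slope, intercept = np.polyfit(feature, label, 1)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

A well-trained single-node model should end up with a weight close to this slope and a bias close to this intercept.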


### Adjust the batch size

The system recalculates the model's loss value and adjusts the model's weights and bias after each **iteration**.  Each iteration is the span in which the system processes one batch. For example, if the **batch size** is 6, then the system recalculates the model's loss value and adjusts the model's weights and bias after processing every 6 examples.  

One **epoch** spans sufficient iterations to process every example in the dataset. For example, if the batch size is 11, then each epoch lasts one iteration. However, if the batch size is 6, then each epoch consumes two iterations.  

It is tempting to simply set the batch size to the number of examples in the dataset (11, in this case). However, the model might actually train faster on smaller batches. Conversely, very small batches might not contain enough information to help the model converge. 
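
The relation between batch size and iterations per epoch can be sketched as follows. The helper name `iterations_per_epoch` is made up for this example. It rounds up because the last batch of an epoch may be smaller than the others.

```python
import math

def iterations_per_epoch(n_examples, batch_size):
    """Number of weight updates needed to process every example once."""
    return math.ceil(n_examples / batch_size)

# The dataset in this notebook has 11 examples.
for batch_size in (11, 6, 4, 1):
    print(batch_size, iterations_per_epoch(11, batch_size))
```

For 11 examples this gives one iteration per epoch for a batch size of 11 and two iterations for a batch size of 6, as described above.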

In [None]:
learning_rate=0.02
epochs=250
my_batch_size=6

my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                        my_label, epochs,
                                                        my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

## Summary of hyperparameter tuning

Most machine learning problems require a lot of hyperparameter tuning.  Unfortunately, we can't provide concrete tuning rules for every model. Lowering the learning rate can help one model converge efficiently but make another model converge much too slowly.  You must experiment to find the best set of hyperparameters for your dataset. That said, here are a few rules of thumb:

 * Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero. 
 * If the training loss does not converge, train for more epochs.
 * If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.
 * If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.
 * Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.
 * Setting the batch size to a *very* small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
 * For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you'll need to reduce the batch size to enable a batch to fit into memory. 

Remember: the ideal combination of hyperparameters is data dependent, so you must always experiment and verify.
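
The first rule of thumb can be turned into a rough automatic check. The helper below is only a sketch: the name `has_flattened`, the window of 10 epochs and the 1 % tolerance are arbitrary assumptions, not established criteria.

```python
def has_flattened(losses, window=10, rel_tol=0.01):
    """Return True if the loss changed by less than rel_tol (relative)
    over the last `window` epochs."""
    if len(losses) <= window:
        return False
    old, new = losses[-window - 1], losses[-1]
    return abs(old - new) <= rel_tol * abs(old)

# A loss that still drops by 10 % per epoch has not flattened out ...
still_falling = [0.9 ** k for k in range(30)]
# ... whereas a constant loss has.
flat = [0.5] * 30

print(has_flattened(still_falling), has_flattened(flat))
```

With the training functions in this notebook, the check could be applied to the returned loss values, for example `has_flattened(list(rmse))`.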

# A neural net model

The `create_model` function defines the topography of the deep neural net, specifying the following:

* The number of [layers](https://developers.google.com/machine-learning/glossary/#layer) in the deep neural net.
* The number of [nodes](https://developers.google.com/machine-learning/glossary/#node) in each layer.

The `create_model` function also defines the [activation function](https://developers.google.com/machine-learning/glossary/#activation_function) of each layer. Here the rectifier (rectified linear unit, ReLU) is used:
\begin{align}
f(x)=x^{+}=\max(0,x)
\end{align}
where $x$ is the node input. It is very popular for deep neural networks.
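
As a minimal sketch, the rectifier can be written with NumPy as follows. The function name `relu` is chosen here for illustration. In Keras the same activation is selected with `activation='relu'`.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))
```

Negative inputs are mapped to zero, while positive inputs pass through unchanged.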

In [None]:
# check version
import tensorflow
print(tensorflow.__version__)

In [None]:
#we will use numpy only once here so I import it as np
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
my_feature = ([2.5,5.0,7.5,10.0,12.5,15.0,17.5,20.0,22.5,25.0,30.0])
my_label   = ([0.097, 0.195, 0.289, 0.387, 0.483, 0.581, 0.686, 0.792,0.890, 0.982, 1.166])
plt.plot(my_feature,my_label,marker='x')
#now add some noise
noise = np.random.normal(0,0.05,len(my_feature))
# convert to an array first: "+=" on a list would append the noise values
my_label = list(np.array(my_label) + noise)
plt.scatter(my_feature,my_label,marker='o')
plt.show()

In [None]:
# define the model

def create_model(my_learning_rate):
    """Create and compile a simple neural net model."""
    model = tf.keras.models.Sequential()

    # Add the first layer to the model.
    # It takes one feature value per example (input_shape=(1,))
    # and consists of a single node.
    
    model.add(tf.keras.layers.Dense(1, input_shape=(1,)))
    

    #   * units specifies the number of nodes in this layer.
    #   * activation specifies the activation function (Rectified Linear Unit).
    #   * name is just a string that can be useful when debugging.
    # Define the hidden layer with 11 nodes.   
    model.add(tf.keras.layers.Dense(units=11, 
                                  activation='relu', 
                                  name='Hidden1'
                                   ))
  
    # Define the output layer.
    model.add(tf.keras.layers.Dense(units=1,  
                                  name='Output'))                              
  
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.MeanSquaredError()])

    return model

Define a training function.

In [None]:
def train_model(model, dataset, epochs, label_name, feature_name,
                batch_size=None):
    """Train the model by feeding it data."""

    # Split the dataset into features and label. This generates a dictionary
    features = {name:np.array(value) for name, value in dataset.items()}
    #print(features)
    
    label = np.array(features.pop(label_name))
    #print(label)
    feature = np.array(features.pop(feature_name))
    history = model.fit(x=feature, y=label, batch_size=batch_size,
           epochs=epochs, shuffle=True) 

    # The list of epochs is stored separately from the rest of history.
    epochs = history.epoch
  
    # To track the progression of training, gather a snapshot
    # of the model's mean squared error at each epoch. 
    hist = pd.DataFrame(history.history)
    mse = hist["mean_squared_error"]
    
    trained_weight = model.get_weights()[0]
    trained_bias = model.get_weights()[1]

    return trained_weight, trained_bias, epochs, mse

## Call the functions to build and train a deep neural net

We need to rephrase the input as a pandas dataframe and hand it to tensorflow.

In [None]:
train_df = pd.DataFrame(list(zip(my_feature, my_label)),columns=['volume','extinction'])

In [None]:
train_df.head()

The data frame now holds the feature (`volume`) and the label (`extinction`) as named columns. The training function extracts them by these names.

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.01
epochs = 20
batch_size = 4

# Specify the label
label_name = "extinction"
feature_name = "volume"


# Establish the model's topography.
my_model = create_model(learning_rate)

# Train the model on the training set.
# Note: this train_model returns the mean squared error (not its root).
trained_weight, trained_bias, epochs, mse = train_model(my_model, train_df, epochs,
                                                        label_name,feature_name,
                                                        batch_size)

#plotting the weights and biases only works for the linear model with one layer
#plot_the_model(trained_weight, trained_bias, my_feature, my_label)
#with more hidden layers, there is no simple linear relationship that allows the plotting
#of a linear output function - it is more complex than that
#the y-axis label of plot_the_loss_curve says "Root Mean Squared Error",
#but the values plotted here are mean squared errors
plot_the_loss_curve(epochs, mse)

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.01
epochs = 200
batch_size = 11

# Specify the label
label_name = "extinction"
feature_name = "volume"


# Establish the model's topography.
my_model = create_model(learning_rate)

# Train the model on the training set.
# Note: this train_model returns the mean squared error (not its root).
trained_weight, trained_bias, epochs, mse = train_model(my_model, train_df, epochs,
                                                        label_name,feature_name,
                                                        batch_size)

#plotting the weights and biases only works for the linear model with one layer
#plot_the_model(trained_weight, trained_bias, my_feature, my_label)
#with more hidden layers, there is no simple linear relationship that allows the plotting
#of a linear output function - it is more complex than that
#the y-axis label of plot_the_loss_curve says "Root Mean Squared Error",
#but the values plotted here are mean squared errors
plot_the_loss_curve(epochs, mse)

In [None]:
#let's generate some test data to check the performance of the model
from scipy import interpolate
func = interpolate.interp1d(my_feature, my_label)
test_x = np.linspace(my_feature[0],my_feature[-1],10)
plt.plot(test_x,func(test_x))
plt.scatter(my_feature, my_label)
plt.show()

Now let's select a few sample feature values and compute the predicted labels. For this purpose, we need to evaluate the model.

In [None]:
# After building a model against the training set, test that model
# against the test set.
test_features = test_x
test_label = func(test_x)
print("Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)

# Task 1 

Modify the number of nodes in the layer of the neural network and the hyperparameters. 

- What is the final loss in the training?
- What is the loss for the test set?

Summarize your results and consider Occam's razor in discussing your results.