**Notebook 1 – Introduction to Artificial Neural Networks with Keras**

_This notebook introduces ANNs._
jens.schwarzbach@ukr.de

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/jvschw/ml4ni/blob/master/ML4NI/1_introduction_to_neural_nets_with_keras.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

# Intro
To correctly classify images (such as "this is a house") is a difficult task at which biological organisms used to outperform machines. This has changed since developers have begun to mimic biological systems in machine learning by means of artificial neural networks.

Advance with SHIFT ENTER

## Biological and artificial neurons 

An Artificial Neural Network (ANN) is a Machine Learning model inspired by the networks of biological neurons found in the brain. Neurons receive input through the synapses on their dendritic branches, which leads to changes in the neuron's membrane potential. If the membrane potential exceeds a certain threshold, the neuron generates an action potential, which travels down the axon and eventually reaches the synaptic terminals, where the neuron makes contact with other neurons.

<img src="https://github.com/jvschw/ml4ni/blob/master/ML4NI/images/ann/Blausen_0657_MultipolarNeuron_small2.png?raw=1">

Neuronal activity is frequency encoded. This means that the more excitatory input a neuron receives, the higher its firing rate (measured in spikes per second or Hz) becomes. Below, you see how the spike rate (scaled to 100%) of four neurons in primary visual cortex (V1) increases as a function of stimulus contrast [(Albrecht & Hamilton, 1982)](https://www.physiology.org/doi/abs/10.1152/jn.1982.48.1.217).

<img src="https://github.com/jvschw/ml4ni/blob/master/ML4NI/images/ann/CRF_Albrecht_Hamilton_1982_small.jpg?raw=1">

This input-output dependency can be captured mathematically by a perceptron, an artificial neuron that computes the weighted sum of its inputs (input x synaptic strength) and passes it to an activation function, which determines the output of that neuron.

<img src="https://github.com/jvschw/ml4ni/blob/master/ML4NI/images/ann/Rosenblattperceptron.png?raw=1">
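
The computation described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the original text: the function name `perceptron`, the bias term, the step activation, and the example numbers are all hypothetical.

```python
import numpy as np

def perceptron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a step activation (assumed here)."""
    z = np.dot(inputs, weights) + bias   # sum of (input x synaptic strength)
    return 1 if z > 0 else 0             # step function: fire (1) or stay silent (0)

# Hypothetical example: two excitatory inputs and one inhibitory input
x = np.array([1.0, 0.5, 1.0])
w = np.array([0.6, 0.4, -0.5])
output = perceptron(x, w, bias=-0.2)
print(output)
```

Replacing the step function with a smoother activation function changes how the weighted sum is mapped to the output, but not the weighted-sum step itself.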

Artificial neurons can be equipped with a host of different activation functions. The left panel in the figure below shows some of the most common activation functions, the right panel shows their respective derivatives.
Derivatives are important because gradient-based learning changes the weights most strongly where the derivative is largest.

<img src="https://github.com/jvschw/ml4ni/blob/master/ML4NI/images/ann/activation_functions_plot.png?raw=1">
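
As a rough sketch of what such a plot is based on, the following code evaluates two common activation functions and their derivatives. It assumes that the sigmoid and ReLU are among the functions shown in the figure; the function names are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # largest at z = 0, close to 0 for large |z|

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    return (z > 0).astype(float)    # 0 for negative inputs, 1 for positive inputs

z = np.linspace(-5, 5, 11)          # -5, -4, ..., 5
print(sigmoid_derivative(z).round(3))
print(relu_derivative(z))
```

The sigmoid derivative peaks at zero input and vanishes for strongly positive or negative inputs, which is where a sigmoid unit learns most slowly.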

## Neural Networks

ANNs consist of different layers of (artificial) neurons, typically one input layer (red), one output layer (blue), and a variable number of hidden layers (yellow) in between. Neurons in one layer are connected to neurons in subsequent layers by means of (synaptic) weights. 

<img src="https://github.com/jvschw/ml4ni/blob/master/ML4NI/images/ann/ANN_example.jpg?raw=1">

# Simulating ANNs with Keras

## Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

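# Scikit-Learn ≥0.20 is required
import sklearn
# Parse the version numbers; comparing raw version strings is unreliable
from packaging import version
assert version.parse(sklearn.__version__) >= version.parse("0.20")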

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

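# TensorFlow ≥2.0 is required
import tensorflow as tf
# Parse the version numbers; comparing raw version strings is unreliable
from packaging import version
assert version.parse(tf.__version__) >= version.parse("2.0")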

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "ann"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## Building an Image Classifier

First let's import TensorFlow and Keras.

TensorFlow is a library for numerical computation developed by the Google Brain Team. In mathematics, a tensor is an algebraic object that describes a linear mapping from one set of algebraic objects (e.g. a scalar, vector, or matrix) to another. In the case of ANNs, tensors describe how the activity of one neuronal layer is transformed into the activity of the next layer.

Keras is a high-level neural networks application programming interface (API), written in Python and capable of running on top of TensorFlow, and other libraries such as CNTK or Theano. It was developed with a focus on enabling fast experimentation.

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
#which version is installed?
tf.__version__

In [None]:
#which version is installed?
keras.__version__

Now, let's load the Fashion MNIST dataset. Keras has a number of functions to load popular datasets in `keras.datasets`.
In general, the training data is a matrix (X) with the dimensions [nSamples, nFeatures]. Here, each sample is a 28x28 image, so X has the dimensions [nSamples, 28, 28]. The labels are contained in a vector (y) with nSamples elements.
The dataset is already split into a training set and a test set, but it can be useful to split the training set further to obtain a validation set:

In [None]:
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

The training set contains 60,000 grayscale images, each 28x28 pixels:

In [None]:
X_train_full.shape

and 60,000 class labels, each of which is a number from 0 to 9:

In [None]:
y_train_full.shape

The test set contains 10,000 grayscale images, each 28x28 pixels:

In [None]:
X_test.shape

and 10,000 class labels, each of which is a number from 0 to 9:

In [None]:
y_test.shape

Each pixel intensity is represented as a byte (0 to 255):

In [None]:
X_train_full.dtype

Let's split the full training set into a validation set and a (smaller) training set. We also scale the pixel intensities down to the 0-1 range and convert them to floats, by dividing by 255.

In [None]:
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.

In [None]:
X_train.shape

In [None]:
X_valid.shape

In [None]:
X_test.shape

You can plot an image using Matplotlib's `imshow()` function, with a `'binary'` color map:

In [None]:
plt.imshow(X_train[0], cmap="binary")
plt.axis('off')
plt.show()

The labels are the class IDs (represented as uint8), from 0 to 9:

In [None]:
y_train

Here are the corresponding class names:

In [None]:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

So the first image in the training set is a coat:

In [None]:
class_names[y_train[0]]

The validation set contains 5,000 images, and the test set contains 10,000 images:

In [None]:
X_valid.shape

In [None]:
X_test.shape

Let's take a look at a sample of the images in the dataset:

In [None]:
n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
        plt.axis('off')
        plt.title(class_names[y_train[index]], fontsize=12)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
save_fig('fashion_mnist_plot', tight_layout=False)
plt.show()

We start a fresh Keras session and set the random seeds to fixed values so that the session can be replicated.
Then we build a sequential model with Keras. Sequential means that information flows sequentially from one layer to the next.
We start with an input layer that flattens each 28x28 image into a vector of 784 values, then we add two hidden layers (the first with 300 neurons, the second with 100 neurons), which are densely connected (dense means that each unit of a given layer is connected to every unit of the preceding layer, as opposed to sparse connections). Both hidden layers use the ReLU activation function. Finally, we add an output layer with 10 neurons (one per output category) and a softmax activation function, which turns the outputs into class probabilities: each neuron gives the probability of the class it represents, and the 10 outputs sum to 1.


In [None]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))
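
To see what the softmax activation in the output layer does, here is a minimal NumPy sketch. The helper `softmax` and the example scores are made up for illustration; this is not the code Keras runs internally.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow in exp()
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw outputs of four output neurons
scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)
print(probs.round(3))
print(probs.sum())
```

The largest raw score receives the largest probability, and the outputs always sum to 1, which is why they can be read as class probabilities.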

`model.summary()` provides us with a text description of our model:

In [None]:
model.summary()
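
The parameter counts reported in the summary can be checked by hand. As a small sketch (the variable names are chosen here for illustration), each dense layer has one weight per connection plus one bias per unit:

```python
# Layer sizes of the model above: flattened input, two hidden layers, output layer
layer_sizes = [28 * 28, 300, 100, 10]

param_counts = [
    n_in * n_out + n_out          # weights (from x to) plus one bias per unit
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
print(param_counts)       # [235500, 30100, 1010]
print(sum(param_counts))  # 266610
```

Most of the parameters sit in the first hidden layer, because every one of the 784 input pixels is connected to each of its 300 units.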

Keras has a function that produces a graphical depiction of a model. [Note: the ? in the numerical descriptions denotes the batch-size, which we have not defined yet. More about batch-sizes further below when we talk about the model's optimizing function.]

In [None]:
keras.utils.plot_model(model, "my_mnist_model.png", show_shapes=True)

We can access layers and use variables to refer to them:

In [None]:
hidden1 = model.layers[1]
hidden1.name

In [None]:
model.get_layer(hidden1.name) is hidden1

In [None]:
weights, biases = hidden1.get_weights()

Weights are initialized with small random values. If all weights started at zero, the units of a layer would be indistinguishable, no useful gradients would flow back through the network, and the model would be unable to learn.

In [None]:
weights

The first hidden layer gets its input from the input layer, which has 784 (i.e. 28x28) units. The first hidden layer itself has 300 units. We can look at the shape of the weight matrix. You see it is organized as (from, to).

In [None]:
weights.shape

Biases are initialized with zeroes.

In [None]:
biases

Each unit has its own bias.

In [None]:
biases.shape
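
Putting weights and biases together, the forward pass through one dense layer is a matrix multiplication followed by the activation function. The sketch below uses randomly generated stand-in values (not the weights of the trained model) to show how the shapes fit together:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((1, 784))               # one flattened 28x28 image (stand-in values)
W = rng.normal(0.0, 0.05, (784, 300))  # weight matrix organized as (from, to)
b = np.zeros(300)                      # one bias per unit, initialized with zeros

h = np.maximum(0.0, x @ W + b)         # weighted sum plus bias, then ReLU
print(h.shape)
```

The result has one activation value for each of the 300 units of the first hidden layer.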

Now we are ready to compile the model, for which we have to provide three more pieces of information: 1) the loss function (here sparse_categorical_crossentropy), 2) the optimizer for learning (here stochastic gradient descent), and 3) the performance metrics (here accuracy).

Why 'sparse_categorical_crossentropy'?
Our labels are sparse: each sample carries a single integer class index (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) rather than a vector with one entry per class, and the classes are mutually exclusive.

If we had used one-hot encoding, such as [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.] to represent class 3, we would need to use the "categorical_crossentropy" loss instead.

If we were doing binary classification (e.g. patient vs control) with one or more binary labels (e.g. patient vs control, male vs female),  we would use the sigmoid (i.e. logistic) activation function in the output layer instead of the "softmax" function, and we would use the "binary_crossentropy" loss.
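
The difference between sparse and one-hot labels can be illustrated with NumPy. This is a sketch with made-up labels and made-up predicted probabilities; it only shows that both label formats lead to the same cross-entropy values.

```python
import numpy as np

rng = np.random.default_rng(0)

y_sparse = np.array([3, 0, 9])        # three hypothetical class indices
y_onehot = np.eye(10)[y_sparse]       # one row per sample, a single 1 per row
print(y_onehot[0])

# Hypothetical predicted probabilities (each row sums to 1)
p = rng.random((3, 10))
p /= p.sum(axis=1, keepdims=True)

# Sparse version: pick the probability of the true class directly
loss_sparse = -np.log(p[np.arange(3), y_sparse])
# One-hot version: multiply by the one-hot vector and sum over classes
loss_onehot = -(y_onehot * np.log(p)).sum(axis=1)
print(np.allclose(loss_sparse, loss_onehot))
```

The two losses measure the same quantity; the choice between them only depends on how the labels are stored.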

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

This is equivalent to:

```python
model.compile(loss=keras.losses.sparse_categorical_crossentropy,
              optimizer=keras.optimizers.SGD(),
              metrics=[keras.metrics.sparse_categorical_accuracy])
```

We have set up the data and the model. Now we need to tell the learning algorithm how many epochs to run (one epoch is one pass through all training samples) and set the batch size.
Stochastic Gradient Descent performs a weight update for every batch (i.e. a subset) of the training data, so there are multiple weight updates per epoch instead of a single update computed from all samples. This typically leads to faster convergence. The optimal batch size is a matter of debate: one strategy is to make batches as large as will fit in the memory of a GPU, which optimizes parallelization. However, there are claims that large batch sizes lead to unstable learning (which could be countered by learning-rate warmup, i.e. starting with a small learning rate, increasing it for a while, and later decreasing it again). The alternative, which is the approach we take here, is to use a batch size of 32 samples or fewer.
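
To get a feeling for the numbers, the following sketch computes how many weight updates result from the settings used in this notebook (55,000 training images remain after setting aside 5,000 for validation; 30 epochs and a batch size of 32 are the values used for fitting):

```python
import math

n_train = 55_000     # training samples after the validation split
batch_size = 32
epochs = 30

# The last batch of an epoch may be smaller, hence the ceiling
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch)  # 1719
print(total_updates)      # 51570
```

With one update per epoch instead, the same 30 epochs would yield only 30 weight updates.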

Now we are ready to fit the model (and grab a coffee).

In [None]:
history = model.fit(X_train, y_train, epochs=30, batch_size=32,
                    validation_data=(X_valid, y_valid))

Fitting a model provides us with a history object that contains a lot of information about the learning history.

In [None]:
history.params

In [None]:
print(history.epoch)

In [None]:
history.history.keys()

Most importantly, history contains a dictionary (history.history) containing the loss and extra metrics it measured at the end of each epoch on the training set and on the validation set.
You can create a pandas data frame from history.history and plot the learning curves below.

In [None]:
import pandas as pd

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(False)
plt.gca().set_ylim(0, 1)
save_fig("keras_learning_curves_plot")
plt.show()

Both the training accuracy and the validation accuracy steadily increase during training, while the training loss and the validation loss decrease. Moreover, since the validation curves are close to the training curves, there seems to be no problem with overfitting.

## Hyperparameter Tuning

At this point our model predicts the labels of the validation dataset with an accuracy of 87.2%. Not too bad, but is this the best we can do?
There are a couple of parameters that we can change before we even begin with the learning process. These are called hyperparameters, such as:
- number of hidden layers
- number of units per hidden layer
- learning rate
- learning algorithm

Although there are reasonable (and also unreasonable) assumptions you can start with, there is no a priori certainty about which combination of parameters will work best (as in "produces the best generalization"). Therefore, developing a useful machine learning model includes searching for the best set of hyperparameters. This is called hyperparameter tuning.

To this end, we will start a new TensorFlow session and write a function that creates a new model based on a given set of hyperparameters. Then we will search through the parameter space, which can be very large, in a loop that tests a given model and keeps in memory which model has fared best so far. This procedure should return the best model (already trained).
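
The search loop described here can be sketched in plain Python. Everything in this example is hypothetical: the search space, the number of iterations, and in particular `evaluate`, which is a toy stand-in for "build the model, train it, and return its validation score" so that the sketch runs without training anything.

```python
import random

random.seed(42)

# Hypothetical search space for three hyperparameters
search_space = {
    "n_hidden": [1, 2, 3],
    "n_neurons": [50, 100, 200, 300],
    "learning_rate": [1e-1, 1e-2, 1e-3],
}

def evaluate(params):
    """Toy stand-in for training a model and returning its validation score."""
    return 1.0 / (1.0
                  + abs(params["learning_rate"] - 1e-2) * 100
                  + abs(params["n_neurons"] - 200) / 100)

best_score, best_params = float("-inf"), None
for _ in range(10):
    # Draw one random combination of hyperparameters
    params = {name: random.choice(values) for name, values in search_space.items()}
    score = evaluate(params)
    # Remember the combination that has fared best so far
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

A random search like this tests only a sample of all combinations, which keeps the cost manageable when the parameter space is large.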

In [None]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
def build_model(n_hidden=2, n_neurons=30, learning_rate=3e-3, input_shape=[28, 28]):
    print(n_hidden, n_neurons, learning_rate)
    model = keras.models.Sequential()   
    model.add(keras.layers.Flatten(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)  # `lr` is a deprecated alias
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
    #model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
    return model

Scikit-Learn has some smart procedures for hyperparameter tuning. Here we wrap the Keras model in an object that Scikit-Learn can use. [Unfortunately, this wrapper may be faulty, leading to a cloning error at the end of the search. Let's nevertheless take a look at how this can be implemented, so that it becomes usable once the Keras/Scikit-Learn community has fixed the underlying problem.]

In [None]:
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

In [None]:
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

#param_distribs = {
#    #"n_hidden": [1, 2], #[0, 1, 2, 3],
#    #"n_neurons": np.arange(1, 100),
#    "learning_rate": reciprocal(3e-4, 3e-2),
#}

#create a dictionary of hyperparameters within certain ranges 
param_distribs = {
    "n_hidden": [1, 2], #[0, 1, 2, 3],
    # plain Python ints; NumPy values here are a reported trigger of the cloning error
    "n_neurons": np.arange(50, 300).tolist(),
    "learning_rate": [.1, .01, .001, .0001 , .00001],
}

rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3, verbose=2)
#you can speed this up by using all available cores on your computer (n_jobs=-1)
#rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3, verbose=2, n_jobs=-1)
rnd_search_cv.fit(X_train, y_train, epochs=30,
    validation_data=(X_valid, y_valid),
    callbacks=[keras.callbacks.EarlyStopping(patience=10)])


In [None]:
rnd_search_cv.best_params_

In [None]:
rnd_search_cv.best_score_

In [None]:
rnd_search_cv.best_estimator_

In [None]:
model = rnd_search_cv.best_estimator_.model

## Testing the model

Let's assume we are satisfied with the validation accuracy of the original model (without hyperparameter tuning). The next step is to evaluate the model on the test data to see whether it generalizes well to data it has not seen before.

In [None]:
model.evaluate(X_test, y_test)

## Using the model to make predictions

Let's predict the labels of the first three elements from the test set. For each test item you will see ten probabilities, i.e. the probability that a given item belongs to each of the respective output classes.

In [None]:
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)

We can directly output the label of the winning class per test element.

In [None]:
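# predict_classes() was removed in newer TensorFlow releases; argmax picks the most probable class
y_pred = np.argmax(model.predict(X_new), axis=-1)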
y_pred

We can also output the respective names of the winning classes.

In [None]:
np.array(class_names)[y_pred]

In [None]:
y_new = y_test[:3]
y_new

Last, but not least, we can plot the pictures for which we wanted to predict the labels and compare human and artificial intelligence.

In [None]:
plt.figure(figsize=(7.2, 2.4))
for index, image in enumerate(X_new):
    plt.subplot(1, 3, index + 1)
    plt.imshow(image, cmap="binary", interpolation="nearest")
    plt.axis('off')
    plt.title(class_names[y_test[index]], fontsize=12)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
save_fig('fashion_mnist_images_plot', tight_layout=False)
plt.show()

Let's take a look at all the predictions for the independent test set:

In [None]:
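# predict_classes() was removed in newer TensorFlow releases; argmax picks the most probable class
y_pred = np.argmax(model.predict(X_test), axis=-1)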
y_pred

It is always interesting to see whether there are systematic prediction errors. This can be visualized with a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
# With the arguments in this order, rows are predicted classes and columns are actual classes
confusion_matrix(y_pred, y_test)

Imagine we want to see how well our model can classify handwritings of the digit "5". A two by two confusion matrix of (predicted vs actual class) allows for four options: true positives ("5" | 5), true negatives ("not 5" | not 5), false positives ("5" | not 5) and false negatives ("not 5" | 5).
We can apply the same logic to the classification of fashion items and the confusion matrix above. 

From that we can compute different kinds of performance metrics:
- Precision (accuracy of the positive predictions): precision = TP/(TP+FP)
- Recall (aka sensitivity or true positive rate): recall = TP/(TP+FN)
- F1 (harmonic mean of precision and recall, which gives more weight to the lower value; thus, F1 will only be high if both precision and recall are high): F1 = TP/(TP + (FN+FP)/2)
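
As a worked example, the formulas above can be applied to made-up counts from a two by two confusion matrix (the numbers are hypothetical and not taken from our model):

```python
# Hypothetical counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 80, 10, 20, 890

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = tp / (tp + (fn + fp) / 2)

# The same F1 value, written explicitly as the harmonic mean of precision and recall
f1_harmonic = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Both ways of writing F1 give the same value, and it lies between precision and recall, closer to the lower of the two.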

<img src="https://github.com/jvschw/ml4ni/blob/master/ML4NI/images/ann/confusion_mat.png?raw=1">

Scikit-Learn has functions for that:

In [None]:
import statistics
from sklearn.metrics import precision_score, recall_score, f1_score
acc = statistics.mean((y_test==y_pred)*1.0)
ps = precision_score(y_test, y_pred, average='weighted')
rs = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print('accuracy: ', acc)
print('precision:', ps)
print('recall:   ', rs)
print('F1:       ', f1)

Why does this yield accuracy?

In [None]:
import statistics
acc = statistics.mean((y_test==y_pred)*1.0)
acc

# Extras

How well would a linear support vector classifier perform?