<a href="https://colab.research.google.com/github/marshka/ml-20-21/blob/main/03_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

Prof. Cesare Alippi\
Andrea Cini ([`andrea.cini@usi.ch`](mailto:andrea.cini@usi.ch))\
Ivan Marisca ([`ivan.marisca@usi.ch`](mailto:ivan.marisca@usi.ch))\
Nelson Brochado ([`nelson.brochado@usi.ch`](mailto:nelson.brochado@usi.ch))

---
# Lab 03: Feedforward Neural Networks


Also known as __multilayer perceptrons__, neural networks are computational models inspired by the connected structure of the brain. The core component of a neural network is the neuron, which applies an activation function $h$ to a linear combination of its inputs:

$$
f(x; \boldsymbol \theta) =  h( x^T \boldsymbol \theta).
$$

The main idea behind neural networks is to compose neurons in two different ways: 

1. by taking many neurons __in parallel__;
2. by stacking many __layers__ of neurons one after the other.

The result is a network of neurons that take data as input, and compute sequential transformations until the desired result is produced as output.

![alt text](https://res.cloudinary.com/practicaldev/image/fetch/s--4XiAvCCB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1200/1%2AYgJ6SYO7byjfCmt5uV0PmA.png)

---

We can write the output of the hidden layer as:

$$
\begin{bmatrix} 
h_0 \\
h_1 \\
h_2 \\
\vdots\\
h_l 
\end{bmatrix}
=
h\left(
\begin{bmatrix} 
w_{00} & w_{01} & w_{02} & \cdots & w_{0m} \\
w_{10} & w_{11} & w_{12} & \cdots & w_{1m} \\
w_{20} & w_{21} & w_{22} & \cdots & w_{2m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{l0} & w_{l1} & w_{l2} & \cdots & w_{lm} \\
\end{bmatrix}
\begin{bmatrix} 
x_0 \\
x_1 \\
x_2 \\
\vdots\\
x_m
\end{bmatrix}
+
\begin{bmatrix} 
b_0 \\
b_1 \\
b_2 \\
\vdots\\
b_l
\end{bmatrix}
\right)
$$

In short, we write the output of a __layer__ of neurons as:
$$
H = h(Wx + b_w)
$$

_NB: without the activation function a layer is a simple affine transformation._
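
As a quick check of the matrix form, the following NumPy sketch computes a layer both neuron by neuron and as a single matrix product. The sizes (3 inputs, 4 neurons), the random weights, and the choice $h = \tanh$ are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_neurons = 3, 4                  # arbitrary sizes, for illustration only
W = rng.normal(size=(n_neurons, n_inputs))  # one row of weights per neuron
b = rng.normal(size=n_neurons)              # one bias per neuron
x = rng.normal(size=n_inputs)               # a single input vector

# Neuron by neuron: each neuron applies h to its own weighted sum
H_loop = np.array([np.tanh(W[j] @ x + b[j]) for j in range(n_neurons)])

# All neurons in parallel: one matrix-vector product
H_layer = np.tanh(W @ x + b)

print(np.allclose(H_loop, H_layer))
```

Both computations give the same vector: the matrix form is just a compact (and much faster) way of evaluating many neurons at once.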

We can compute the output of the network by doing the same calculation for the "Output" neurons, with the difference that their input is not $x$, as for the hidden neurons, but the output $H$ of the last hidden layer. The output layer can be written as: 

$$
Y = \sigma(VH + b_v)
$$

(note that $V$ and $b_v$ are a separate set of parameters, and that $\sigma$ denotes the activation function of the output layer).

Finally, stacking the two layers simply means __composing__ them together, so that the whole neural network can be written as: 

$$
\hat y = f(x;\boldsymbol \theta = \{W, b_w, V, b_v\}) = \sigma\left(V h(Wx + b_w)  + b_v\right)
$$
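
The composition above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random parameters, and the choices $h = \tanh$ and $\sigma$ = logistic sigmoid are assumptions, and `forward` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b_w, V, b_v):
    """Forward pass of a network with one hidden layer."""
    H = np.tanh(W @ x + b_w)      # hidden layer, assuming h = tanh
    return sigmoid(V @ H + b_v)   # output layer

rng = np.random.default_rng(0)
W, b_w = rng.normal(size=(8, 2)), np.zeros(8)  # 2 inputs -> 8 hidden neurons
V, b_v = rng.normal(size=(1, 8)), np.zeros(1)  # 8 hidden neurons -> 1 output

y_hat = forward(np.array([0.5, -1.0]), W, b_w, V, b_v)
print(y_hat)  # an array with a single value in (0, 1)
```

With a sigmoid output, $\hat y$ can be read as the predicted probability of the positive class.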

---
Neural networks are trained with __stochastic gradient descent__ (SGD). The key idea behind SGD is to update all the parameters of the network at the same time, based on how each parameter contributed to the __loss__ function $L( \boldsymbol \theta)$. 

The generalized update rule reads: 

$$
{\boldsymbol \theta}^{i+1} = {\boldsymbol \theta}^{i} - \varepsilon \frac{\partial L({\boldsymbol \theta})}{\partial {\boldsymbol \theta}}\bigg\vert_{{\boldsymbol \theta} = {\boldsymbol \theta}^i}
$$

where $\varepsilon$ is again called __learning rate__.
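
To see the update in action, here is a minimal sketch on a toy one-dimensional loss $L(\theta) = (\theta - 3)^2$, whose minimum is at $\theta = 3$. The loss, the starting point, and the learning rate are assumptions chosen for illustration. Note that this is plain gradient descent: in SGD the gradient would be estimated on a random mini-batch of data at each step, but every step still moves against the gradient.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial guess
lr = 0.1      # learning rate (epsilon in the formula)

for _ in range(100):
    theta = theta - lr * grad(theta)   # step against the gradient

print(theta, loss(theta))
```

After 100 steps `theta` is very close to 3. Flipping the sign of the step would move the parameter away from the minimum instead.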

---

When training neural networks for binary classification, we take the loss to be the __cross-entropy error function__: 

$$
L({\boldsymbol \theta}) =  -\frac1n \sum_{i=1}^n \bigg[y_i  \log \hat y_i + (1 - y_i)  \log (1 - \hat y_i)\bigg]
$$
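
As an illustration, the formula can be written directly in NumPy. The labels and predictions below are made up, and the clipping constant `eps` is an extra assumption, added only to avoid computing $\log 0$.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep the logarithms finite
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident and correct predictions give a loss close to zero, while confident and wrong predictions are penalized heavily.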


# Neural networks in Python

To build our neural network we will use [TensorFlow](https://www.tensorflow.org/), one of the most popular deep learning libraries for Python (the other being [PyTorch](https://pytorch.org/)). 
Like NumPy, TensorFlow provides a huge number of functions that can be used to manipulate arrays, but it offers two great advantages over NumPy: 

1. the computation can be accelerated on GPU via the CUDA library;
2. the library implements __automatic differentiation__, meaning that the most analytically complex step of training, the computation of the gradient, is handled for you.

While TensorFlow is a very powerful library that offers fine-grained control over what you build, for this course we will skip the low-level details and instead use the official high-level API for TensorFlow: [Keras](https://keras.io).

## Introduction to Keras

![alt text](https://s3.amazonaws.com/keras.io/img/keras-logo-2018-large-1200.png)



Keras offers collections of TF operations already arranged to implement neural networks with little to no effort. 
For instance, building a layer of 4 neurons like the one we saw above is as easy as calling `Dense(4)`. That's it. 

Moreover, Keras offers a high-level API for the usual steps involved in working with a neural network, like training on some data, evaluating the performance, and predicting on unseen data. 

The core data structure of Keras is a model, a way to organize layers. The simplest type of model is the `Sequential` model, a linear stack of layers.

---

Let's start with a toy classification problem.

In [None]:
import numpy as np
from sklearn.datasets import make_classification, make_circles, make_moons
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# color_maps
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF']) 

# function to generate classification problems
def get_data(n, ctype='simple'):
  if ctype == 'simple':
    x, y = make_classification(n_samples=n,  # use the requested number of samples
                               n_features=2, 
                               n_redundant=0, 
                               n_informative=2, 
                               n_clusters_per_class=1)
    x += np.random.uniform(size=x.shape) # add some noise
  elif ctype == 'circles':
    x, y = make_circles(n, noise=0.1, factor=0.5)
  
  elif ctype == 'moons':
    x, y = make_moons(n, noise=0.1)
  else:
    raise ValueError
  return x, y.reshape(-1, 1)

# function to plot decision boundaries
def plot_decision_surface(model, x, y, transform=lambda x:x):    
  #init figure
  fig = plt.figure()

  # Create mesh
  h = .01  # step size in the mesh
  x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
  y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                        np.arange(y_min, y_max, h))

  # plot train data
  plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cm_bright,
              edgecolors='k')
  plt.xlim(xx.min(), xx.max())
  plt.ylim(yy.min(), yy.max())

  plt.xlabel(r'$x_1$')
  plt.ylabel(r'$x_2$');

  y_pred = model.predict(transform(np.c_[xx.ravel(), yy.ravel()]))

  y_pred = y_pred.reshape(xx.shape)
  plt.contourf(xx, yy, y_pred > 0.5, cmap=cm, alpha=.5)

Let's go back to the problem that we saw in the previous part.

In [None]:
np.random.seed(20)

# Create a classification problem
x, y = get_data(120, 'circles')

# Let's look at the data
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cm_bright, edgecolors='k')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$');

Now let's build a neural network to fit the data.

Using Keras, this will take only a few lines of code.

In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import losses
from tensorflow import keras
import tensorflow as tf

tf.random.set_seed(25) # this makes the experiment easy to reproduce

# Define the network
classifier = Sequential()
classifier.add(Dense(8, activation='tanh', input_shape=(x.shape[1],)))
classifier.add(Dense(1, activation='sigmoid'))

# Set up the model for training
classifier.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.025), # choose optimizer and learning rate
                   loss=losses.BinaryCrossentropy(),                        # define loss function 
                   metrics=['accuracy']                                     # define metric to monitor during training
                  )

# Train the network and plot the learned decision boundary
classifier.fit(x, y, epochs=1000, verbose=0)
plot_decision_surface(classifier, x, y)

Let's try to understand why this works by adding another hidden layer with 2 units, so that we can visualize the space into which the neural network projects the data.

In [None]:
def compute_boundary(out_layer):
  w = np.asarray(out_layer.weights[0]).ravel()
  b = np.asarray(out_layer.bias).ravel()
  theta = np.r_[b, w].ravel()
  b = -theta[0]/theta[2]
  m = -theta[1]/theta[2]

  x1 = np.array([x[:,0].min(), x[:,0].max()])
  x2 = b + m * x1
  return x1, x2



def plot_training_iterations(classifier, features, out_layer, iterations=200):
  from IPython import display

  phi = features.predict(x)
  xd, yd = compute_boundary(out_layer)


  #init figure
  fig = plt.figure()
  ax = fig.gca()
  #fixed plots
  ax.scatter(x[:, 0], x[:, 1], c=y, cmap=cm_bright, edgecolors='k', alpha=0.05)
  splt = ax.scatter(phi[:, 0], phi[:, 1], c=y, cmap=cm_bright, edgecolors='k')
  line = ax.plot(xd, yd, color='black')[0]
  ax.set_xlabel(r'$\phi_1$')
  ax.set_ylabel(r'$\phi_2$')
  ax.set_xlim([-1.3, 1.3])
  ax.set_ylim([-1.3, 1.3])
  display.display(plt.gcf(), display_id=40)

  n = iterations
  e = 10
  for i in range(n):
    hist = classifier.fit(x, y, epochs=e, verbose=0)
    phi = features.predict(x)
    _, yd = compute_boundary(out_layer)
    ax.set_title(f"Iteration {(i+1)*e}/{e*n} | Train accuracy: {hist.history['accuracy'][-1]:.2f}")
    # update plot
    splt.set_offsets(phi)
    line.set_ydata(yd)
    display.update_display(plt.gcf(), display_id=40)

In [None]:
tf.random.set_seed(25)
# An alternative way to define a network in keras as a sequence of operations

inp = Input((x.shape[1],)) # input layer
hidden1 = Dense(8, activation='tanh')(inp) # first nonlinear transformation
hidden2 = Dense(2, activation='tanh')(hidden1) # second nonlinear transformation

out_layer = Dense(1, activation='sigmoid')
out = out_layer(hidden2) # output layer

# define the model using the input and output layers
classifier = Model(inp, out)
classifier.compile(loss=losses.BinaryCrossentropy(), metrics=['accuracy'])

features = Model(inp, hidden2)

plot_training_iterations(classifier, features, out_layer, iterations=100);

The data are linearly separable in the projected space!

If you go back and remove the nonlinear activation functions from the hidden layers, you'll see that this is no longer true. In fact, without nonlinearities the hidden layers are simple affine transformations (i.e., they can only represent combinations of linear maps such as rotation, scaling, and shear, plus a translation), and a composition of affine transformations is itself affine.
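
This collapse can be checked numerically. The sketch below stacks two layers without activation functions and compares them with a single affine layer. The shapes and the random parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)  # first layer: 2 -> 8
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)  # second layer: 8 -> 2
x_toy = rng.normal(size=(5, 2))                       # 5 samples, one per row

# Two layers without activation functions (samples are rows, hence the transposes)
two_layers = (x_toy @ W1.T + b1) @ W2.T + b2

# The equivalent single affine layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = x_toy @ W.T + b

print(np.allclose(two_layers, one_layer))
```

No matter how many such layers are stacked, the result is still a single affine map, which cannot separate the two circles.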

_Homework: check [this](https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/) insightful blog post from Chris Olah._

In [None]:
features = Model(inp, hidden2)
phi = features.predict(x)

plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cm_bright, edgecolors='k', alpha=0.05)
plt.scatter(phi[:, 0], phi[:, 1], c=y, cmap=cm_bright, edgecolors='k')
plt.xlabel(r'$\phi_1$')
plt.ylabel(r'$\phi_2$');

# Wine quality dataset

Let's try with a real dataset now. 

We are given a set of wine reviews, with the following characteristics: 

In [None]:
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(url, delimiter=';')

data.head(10)

In [None]:
# Let's look at the distribution of the reviews
data['quality'].hist(bins=10)

We can turn this into a binary classification problem by setting a threshold on the reviews: was the wine good (>= 6) or not?

In [None]:
# Extract features
X = data[data.columns[:-1]].values

# Extract targets
quality = data['quality'].values.astype(np.int32)
y = (quality >= 6).astype(np.int32)
plt.hist(y);

Notice how the values of the features are not commensurable with one another. For instance, "total sulfur dioxide" can take values above 100, while "density" always stays close to 1. 

While this in principle is not a problem for our machine learning models, in practice it can lead to issues in the training procedure.

To standardize the data, we compute the following transformation: 

$$
X_{\textrm{standardized}} = \frac{X - \textrm{mean}(X)}{\textrm{std}(X)}
$$

NB: here we are scaling the complete dataset at once for simplicity, but in reality you should use only the training data to compute the mean and standard deviation. Do it the proper way in the assignments :D
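
A minimal sketch of the proper procedure, on synthetic data: the statistics are computed on the training split only and then reused for the test split. The arrays and the names `X_tr`, `X_te` are hypothetical and unrelated to the wine dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with two features on very different scales
X_all = rng.normal(loc=[10.0, 0.5], scale=[5.0, 0.1], size=(200, 2))
X_tr, X_te = X_all[:150], X_all[150:]   # simple split, for illustration only

# Statistics are computed on the training split only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)

# The same transformation is applied to both splits
X_tr_std = (X_tr - mean) / std
X_te_std = (X_te - mean) / std

print(X_tr_std.mean(axis=0), X_tr_std.std(axis=0))
```

The training split ends up with zero mean and unit standard deviation per feature, while the test split is only approximately standardized, as no information from it was used.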

In [None]:
# Normalize features
X -= np.mean(X, axis=0)
X /= np.std(X, axis=0)

In order to train and evaluate our network, we will split the data into training, validation, and test sets:

In [None]:
from sklearn.model_selection import train_test_split

# Split train / test / validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, stratify=y_train)

Now that we have loaded and pre-processed our data, we only need to build the neural network that we will train. 

In [None]:
# Define the network
network = Sequential()
network.add(Dense(32, activation='relu', input_shape=X.shape[1:]))
network.add(Dense(1, activation='sigmoid'))

# Prepare the computational graph and training operations
network.compile(optimizer='sgd', 
                loss='binary_crossentropy', 
                metrics=['acc'])

# Train the network
network.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))

# Evaluate the performance
eval_results = network.evaluate(X_test, y_test)
print('Test loss: {} - Test acc: {}'.format(*eval_results))

# Pokémon dataset: unbalanced classes

Pokémon are fictional creatures that are central to the Pokémon franchise.
Among them, Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon, generally featured prominently in the legends and myths of the Pokémon world [[source]](https://bulbapedia.bulbagarden.net/wiki/Legendary_Pok%C3%A9mon).

The task that we will tackle in this exercise is simple: can we tell whether a Pokémon is legendary or not, by looking at its statistics (like attack, defense, HP, etc.)?

Let's start by getting the data and looking at it...

In [None]:
url = 'https://raw.githubusercontent.com/lgreski/pokemonData/565a330aa57d1f60e1cab9d40320cf7473be566c/Pokemon.csv'
data = pd.read_csv(url)

data.head(10)

We will train a neural network to predict the "Legendary" labels using 'HP', 'Attack', 'Defense', 'SpecialAtk', 'SpecialDef', and 'Speed' as features.

In [None]:
# Extract features
X = data[['HP', 'Attack', 'Defense', 'SpecialAtk', 'SpecialDef', 'Speed']].values

# Extract targets
y = data['Legendary'].values.astype(np.float32)

Like we did before, we will need to standardize the data in order to have commensurable features.

In [None]:
# Standardize data
X = (X - X.mean(0)) / X.std(0)
print(X)



However, here we face a problem that we didn't have before: we have substantially fewer samples of one class than of the other. This means that our neural network is likely to ignore samples with $y=1$, because getting the samples with $y=0$ right is enough to achieve a low error. 

Would you study for an exam question that was only asked once by the professor, in previous years? Or would you focus on the more common exercises that are more likely to be asked again? :)
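
To make this concrete, consider a made-up dataset with 95 samples of class 0 and 5 of class 1 (the numbers are an assumption for illustration), and a "model" that always predicts class 0:

```python
import numpy as np

y_toy = np.array([0] * 95 + [1] * 5)   # 95 common samples, 5 rare ones
y_always_zero = np.zeros_like(y_toy)   # a "model" that ignores the rare class

accuracy = np.mean(y_always_zero == y_toy)
# Fraction of rare samples that are correctly recognized
recall_rare = np.mean(y_always_zero[y_toy == 1] == 1)

print(accuracy)     # 0.95
print(recall_rare)  # 0.0
```

The accuracy looks excellent, yet not a single rare sample is recognized. This is why accuracy alone is a misleading metric on unbalanced data.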


In [None]:
# Plot histogram of labels
plt.hist(y);

To deal with the __class imbalance__ we will use a simple trick that will allow our model to learn better. 

The trick consists in __re-weighting__ the loss function, so that the error on rare samples will count more than the error on common samples:

$$
L_{\textrm{reweighted}}(y, f(X; W)) =
\begin{cases}
\lambda_0 L(y, f(X; W))\textrm{, if } y=0 \\
\lambda_1 L(y, f(X; W))\textrm{, if } y=1
\end{cases}
$$
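
The effect of re-weighting can be sketched in NumPy with a weighted cross-entropy. The labels, the predictions, and the weights $\lambda_0 = 0.5$, $\lambda_1 = 5$ are arbitrary illustrative values, and `weighted_bce` is a hypothetical helper, not necessarily the exact reduction that Keras applies internally.

```python
import numpy as np

def weighted_bce(y_true, y_pred, lambda_0, lambda_1, eps=1e-7):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep the logarithms finite
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    weights = np.where(y_true == 1, lambda_1, lambda_0)
    return np.mean(weights * per_sample)

y_true = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
y_pred = np.array([0.1, 0.1, 0.1, 0.1, 0.1])  # the rare positive sample is missed

plain = weighted_bce(y_true, y_pred, 1.0, 1.0)       # no re-weighting
reweighted = weighted_bce(y_true, y_pred, 0.5, 5.0)  # rare class counts more
print(plain, reweighted)
```

With the re-weighted loss, missing the single positive sample costs much more, so the optimizer can no longer lower the error just by getting the common class right.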

Ideally, $\lambda_0$ and $\lambda_1$ should represent how rare the respective classes are in the dataset. 
A common way of computing the two values automatically is as: 

$$
\lambda_i = \frac{\textrm{# samples in dataset}}{\textrm{# classes}\cdot\textrm{# samples of class } i}
$$

In Keras (and also in Scikit-learn) we call these values `class_weight`.

Let's see how to compute them...

In [None]:
# Split train / test / validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, stratify=y_train)

# Compute class weights
n_pokemons = X_train.shape[0]
n_legendaries = y_train.sum()
n_classes = 2
class_weights = {0: n_pokemons / (n_classes * (n_pokemons - n_legendaries)),
                 1: n_pokemons / (n_classes * n_legendaries)}

print('Training data: {} legendaries out of {} mons'.format(int(n_legendaries), int(n_pokemons)))
print('Training data: class weights {}'.format(class_weights))

In order to train a neural network in Keras using class weights, we only need to make some minor modifications to the previous model:

In [None]:
network = Sequential()
network.add(Dense(32, activation='relu', input_shape=X.shape[1:]))
network.add(Dense(1, activation='sigmoid'))

network.compile('sgd', 'binary_crossentropy', weighted_metrics=['acc'])

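# Train on the training split only, so that the test set stays unseen.
# This first run does not use the class weights.
network.fit(X_train, y_train, epochs=25, validation_data=(X_val, y_val))

# To train with class weights, use this call instead:
# network.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), class_weight=class_weights)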

Finally, let's analyze the __test__ performance of our model:

In [None]:
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(y_true, y_pred, classes, cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Adapted from https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.grid(None)
    ax.figure.colorbar(im, ax=ax)
    ax.set(xticks=np.arange(cm.shape[1]), yticks=np.arange(cm.shape[0]),
           xticklabels=classes, yticklabels=classes,
           title='Confusion matrix',
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j],
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

eval_results = network.evaluate(X_test, y_test, verbose=False)
print('Loss: {:.4f} - Acc: {:.2f}'.format(*eval_results))

y_pred = network.predict(X_test)
y_pred = np.round(y_pred)
plot_confusion_matrix(y_test, y_pred, classes=['normal', 'legendary']); 