# Deep Learning 101

<div class="alert alert-success">
This lecture takes a practical approach to introducing modern deep learning.  It provides the foundational deep learning knowledge you need to move on to time series forecasting.
</div>

**By the end of this lecture you will have:**
    
* Developed a conceptual understanding of modern deep neural networks.
* Built intuition about what hidden layers within a deep network are doing and how they aid prediction.
* Understood the definition and benefits of mini-batches of training data.
* Built intuition about how networks 'learn' using the backpropagation algorithm and stochastic gradient descent.
* Learnt how to build simple neural network architectures in Keras and Tensorflow 2.0.
* Gained the foundational knowledge to move on to using feedforward neural networks for time series forecasting.

# Standard Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Keras and Tensorflow Imports

For your deep learning you will make use of [Keras](https://keras.io/).  This is a Python library that sits on top of Google's deep learning toolset: Tensorflow 2.0.  Keras makes deep learning relatively straightforward because it hides a lot of the complexity of Tensorflow.

> Another very powerful deep learning framework is [PyTorch](https://pytorch.org/), a pythonic deep learning toolkit.  Our research experience is that PyTorch is more efficient than Keras and Tensorflow (sometimes by a considerable margin), but that it requires more code to do the same things as Keras.  Another way to look at this is that Keras comes with 'more bells and whistles' than PyTorch and for learning that comes in very handy!  The exercises that you will tackle in this course are written in Keras/TF, but you will also have access to optional material written in PyTorch.

In [None]:
import tensorflow as tf
from tensorflow import keras

# if using hds_forecast this should be version 2.7
print(tf.__version__)

# Computational cost of deep learning

When you have a complex deep learning architecture (which isn't always the case) and lots of data you should expect it to be more computationally expensive (take longer to run and work your CPU hard) than other types of ML.  In these instances, you really need a powerful machine and for some models a GPU.  For **time series forecasting**, we will be using the OpenStack on the High Performance Cluster, but for personal learning and coursework you can also make use of Google Colaboratory (Jupyter in the cloud), which also provides a GPU.  All of the neural network notebooks in this course are runnable in Google Colab.

# A first look at Deep Learning using Tensorflow Playground

[Tensorflow playground](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.55467&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false) is provided by Google.  I recommend that you spend some time using it as it helps build intuition about how deep learning works.

# Components needed for deep learning.

## 1. Training and test data

The data for this lecture is a real dataset known as the [Wisconsin Breast Cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) dataset. The data set consists of records for individual patients.  To model it we will use a **Feedforward Neural Network Architecture.**

The data are published and open; feel free to take a look at them in more detail.

> The dataset contains 30 features and a binary label (benign/malignant)

In [None]:
url = 'https://raw.githubusercontent.com/MichaelAllen1966/' \
       + 'synthetic_data_pilot/main/01_wisconsin/wisconsin.csv'
cancer = pd.read_csv(url)

In [None]:
# Drop the 'id' column
cancer.drop('id', axis=1, inplace=True)

# Change 'diagnosis' column to 'malignant', and put in last column place
cancer['malignant'] = cancer['diagnosis'] == 'M'
cancer.drop('diagnosis', axis=1, inplace=True)

In [None]:
# 30 features and a single binary (0/1) label
cancer.shape

In [None]:
cancer.head()

# Train-Test Split

Just like other Machine Learning approaches we first do a train test split (and do not look at or use our test data until we are ready!).

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# split into x and y for split
X = cancer[cancer.columns[:-1]]
y = cancer['malignant']

In [None]:
# setting random_state means we always get the same split
X_train, X_test, y_train, y_test \
    = train_test_split(X.to_numpy(), y.to_numpy(), 
                       test_size=0.20, 
                       random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
# class balance
unique_elements, counts_elements = np.unique(y_train, return_counts=True)
print(unique_elements, counts_elements)
print(y_train.mean())

# Rescale data

You should always rescale the features that you use to train a neural network.

I recommend scaling **after** a train-test split where the scaler uses the **training** data.  In a production setting you cannot scale on the data you are about to predict!  You should aim to simulate this setting and avoid leakage!

Here we will use [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to rescale features to range from 0 to 1.
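To make the 'fit on training data only' idea concrete, here is a minimal NumPy sketch of min-max scaling.  The arrays `X_train_toy` and `X_test_toy` are made-up values for illustration, not the course data, and the arithmetic is a hand-rolled stand-in for what a scaler's fit and transform steps do.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_train_toy = np.array([[1.0, 10.0],
                        [2.0, 20.0],
                        [3.0, 30.0],
                        [5.0, 50.0]])
X_test_toy = np.array([[4.0, 60.0],
                       [0.0, 10.0]])

# "fit": learn the per-feature minimum and range from the TRAINING data only
train_min = X_train_toy.min(axis=0)
train_range = X_train_toy.max(axis=0) - train_min

# "transform": apply the same training statistics to both sets
X_train_toy_scaled = (X_train_toy - train_min) / train_range
X_test_toy_scaled = (X_test_toy - train_min) / train_range

print(X_train_toy_scaled.min(axis=0), X_train_toy_scaled.max(axis=0))
print(X_test_toy_scaled)
```

Note that the scaled test values can fall slightly outside the 0 to 1 range.  That is expected: the scaler has never seen the test data, which is exactly the production setting you are trying to simulate.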

In [None]:
# we will rescale all data to be between 0 and 1
from sklearn.preprocessing import MinMaxScaler

In [None]:
# do the rescaling and assign to dataframe
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)

# scale training data (cast to dataframe)
X_train_scaled = pd.DataFrame(scaler.transform(X_train))
X_train_scaled.columns = cancer.columns[:-1]

# scale test data (cast to dataframe)
X_test_scaled = pd.DataFrame(scaler.transform(X_test))
X_test_scaled.columns = cancer.columns[:-1]

# take a look at the rescaled data
print(X_train_scaled.shape)
X_train_scaled.head()

## Sequential layers and activation functions

Feedforward neural networks accept a vector of quantitative values as input and pass this through a sequence of fully connected layers.  Each neuron in a layer receives input from all of the neurons in the previous layer. The neuron weights the input vector, adds a bias and then passes the result through an activation function. In a hidden layer, for example, you could use a Rectified Linear Unit (ReLU; $f(x) = \max\left(0, x\right)$).  

The final layer is an **output** layer.  The breast cancer example is a binary classification problem, so the output layer is a fully connected layer with a single neuron.  You will need to use a `sigmoid` activation function to provide a probability, between 0.0 and 1.0, that the tumour is malignant.
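The 'weight the inputs, add a bias, apply an activation' step can be sketched in a few lines of NumPy.  This is only an illustration of the arithmetic: the weights are random, the input is a made-up patient, and the helper `dense` is a hypothetical function, not part of Keras.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b, activation):
    # weight the inputs, add the bias, then apply the activation function
    return activation(x @ W + b)

rng = np.random.default_rng(42)
n_features, n_hidden_units = 30, 10

x = rng.random((1, n_features))  # one hypothetical (already rescaled) patient
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden_units))
b1 = np.zeros(n_hidden_units)
W2 = rng.normal(scale=0.1, size=(n_hidden_units, 1))
b2 = np.zeros(1)

hidden = dense(x, W1, b1, relu)       # hidden layer: 10 ReLU neurons
prob = dense(hidden, W2, b2, sigmoid)  # output layer: 1 sigmoid neuron
print(hidden.shape, prob.shape)
```

The ReLU output is never negative and the sigmoid output always lies between 0 and 1, which is why it can be read as a probability.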

For feedforward networks, `Keras` provides simple classes to help you construct your model.

In [None]:
# a model consisting of a sequential set of layers
from tensorflow.keras.models import Sequential

# a fully connected layer, an input layer
from tensorflow.keras.layers import Dense, Input

In [None]:
# create an empty sequential model
model = Sequential(name='breast_cancer_nn')

# input layer shape = (no. features, )
model.add(Input(shape=(X_train.shape[1],)))

# hidden layer 1: relu: f(x) = max(0, x)
model.add(Dense(units=10, activation='relu'))

# hidden layer 2
model.add(Dense(units=10, activation='relu'))

# output layer
model.add(Dense(units=1, activation='sigmoid'))

# summary including number of trainable parameters
model.summary()
          

## Training a network

For each neuron you can think of the weights as the **strength** of the neuron's **connections** to all of the neurons in the **previous layer**.  The network is initialised with these weights set to **random** values.  The purpose of training is therefore to find the 'best' weights for your prediction problem.   

In this context, 'best' means the weights that minimise the training **loss** (sometimes called **cost** or **error**).  Loss is a measure of model fit, i.e. a metric quantifying the difference between the predictions from the model and the ground truth observations. For binary classification models, you will use **Binary Cross-Entropy**.  For each training example, this measure takes the negative log of the probability the model assigned to the correct label and then averages across all samples.  Or more formally, $$-\dfrac{1}{N} \sum_{i=1}^N \left[ y_i \cdot \log(p(y_i)) + (1 - y_i) \cdot \log(1 - p(y_i)) \right]$$ 

where $y_i$ is the 0/1 label of sample $i$ and $p(y_i)$ is the probability (0 to 1) that the model assigns to sample $i$ having label 1.  

> Note that in a regression model, you would use a loss metric such as **Mean Absolute Error** or **Mean Squared Error**.
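As a rough check on your understanding, binary cross-entropy can be written directly in NumPy.  The function name `binary_cross_entropy`, the clipping constant `eps` and the toy labels and probabilities are all assumptions made for this sketch; Keras has its own implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float)
    # clip so that log(0) can never occur
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(p_pred)
                   + (1 - y_true) * np.log(1 - p_pred))
    return per_sample.mean()

y_true_toy = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the correct label
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong label

loss_right = binary_cross_entropy(y_true_toy, mostly_right)
loss_wrong = binary_cross_entropy(y_true_toy, mostly_wrong)
print(loss_right, loss_wrong)
```

Confident predictions that are wrong are penalised much more heavily than confident predictions that are right.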

In [None]:
# let's assume these are the probs assigned to a TRUE value
probs = [0.1, 0.2, 0.8, 0.99]
# per-sample loss: low probability on the true label gives a large loss
-np.log(probs)

### Stochastic Gradient Descent and Backpropagation

Think of finding the best weights for your model as a large scale optimisation problem. Even in the simple first model we created there were 400+ parameters to optimise! To optimise these weights an algorithm is needed that estimates the gradient of the loss function and takes a **step** to gradually **descend** it into a local optimum. 
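If you want to see where the '400+' figure comes from, the parameter count of the first model (30 inputs, two hidden layers of 10 neurons, 1 output neuron) can be reproduced by hand.  The helper `dense_params` is a hypothetical function written for this sketch.

```python
def dense_params(n_inputs, n_units):
    # one weight per input per neuron, plus one bias per neuron
    return n_inputs * n_units + n_units

layer_sizes = [30, 10, 10, 1]  # inputs, hidden 1, hidden 2, output
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

The total should agree with the number of trainable parameters reported by `model.summary()`.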

<img src="gradient1.png" alt="drawing" width="500"/>

Finding these local optima is achieved using the **backpropagation algorithm** and **stochastic gradient descent**.  After pushing a training sample through our binary classification network we get a single probability of the patient having cancer. We also have the ground truth value labelling whether the patient has a malignant tumor or not. Starting from the network's output layer, backprop works backward through each layer of the network to work out how much each weight and bias contributed to the prediction error, i.e. how they should change to classify the patient correctly. This is repeated for all of the training data and the average of these values is the gradient of the loss function with respect to each weight. The weights are then nudged in the direction that reduces the loss. Repeated enough times the network's weights will converge on a solution that minimises the loss. 

<img src="iteration.png" alt="drawing" width="400"/>

For large networks and datasets it is computationally expensive (and possibly infeasible) to run backpropagation against every training sample individually. Therefore we use **stochastic gradient descent (SGD)** to estimate the gradient.  The gradient is estimated by averaging results after breaking the dataset into random **mini-batches** (subsets). This reduces the number of computations needed by a substantial amount (and makes large problems tractable).  
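Mini-batch SGD can be sketched in NumPy on a deliberately simple model.  To keep the sketch short it uses logistic regression (in effect a network with no hidden layers), so the gradient has a closed form and no backpropagation is needed.  The synthetic data, `true_w`, the learning rate and the batch size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 samples, 3 features, binary labels
X_toy = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = (X_toy @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(3)
learning_rate, batch_size, n_epochs = 0.1, 20, 20
loss_start = bce(y_toy, sigmoid(X_toy @ w))

for epoch in range(n_epochs):
    order = rng.permutation(len(X_toy))  # reshuffle the data each epoch
    for start in range(0, len(X_toy), batch_size):
        idx = order[start:start + batch_size]
        p = sigmoid(X_toy[idx] @ w)
        # gradient of the loss estimated from this mini-batch only
        grad = X_toy[idx].T @ (p - y_toy[idx]) / len(idx)
        w -= learning_rate * grad

loss_end = bce(y_toy, sigmoid(X_toy @ w))
print(loss_start, loss_end)
```

Each update only touches 20 samples, yet the loss over the whole dataset still falls as the epochs progress.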

<img src="yanlcun.png" alt="drawing" width="600"/>

The size of the step taken at each iteration is called the **learning rate**. It is used to subtract a fraction of the gradient from the current weights. The learning rate is a hyperparameter and may need tuning.  A typical starting point is to try powers of 10; for example, 0.1, 0.01, 0.001.  A larger step means that you can descend more quickly!  The downside is that you might overshoot the minimum!  
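The effect of the learning rate is easy to see on a toy one-dimensional loss.  The sketch below assumes the loss $L(w) = w^2$ (gradient $2w$) and three learning rates picked purely for illustration.

```python
def gradient_descent(learning_rate, n_steps=20, w_start=5.0):
    # minimise the toy loss L(w) = w**2, whose gradient is 2 * w
    w = w_start
    path = [w]
    for _ in range(n_steps):
        grad = 2 * w
        w = w - learning_rate * grad  # subtract a fraction of the gradient
        path.append(w)
    return path

small_step = gradient_descent(learning_rate=0.01)
medium_step = gradient_descent(learning_rate=0.1)
too_large = gradient_descent(learning_rate=1.1)

print(abs(small_step[-1]), abs(medium_step[-1]), abs(too_large[-1]))
```

The small learning rate descends slowly, the medium one gets close to the minimum at 0, and the largest one overshoots on every step and ends up further away than where it started.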

In practice, SGD is not quite so simple and there are a few other concepts and optimisation procedures to appreciate.  The first is that it is useful to tweak SGD to have **momentum**. This modification adds a fraction (typically 0.9) of the previous update to the current weight update. The second is adaptive learning rates. This comes in different flavours, from a simple reduction in learning rate over time to varying the learning rate by parameter.  Both momentum and adaptive learning rates help SGD converge more efficiently.
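Here is a minimal sketch of the momentum idea, again on the toy loss $L(w) = w^2$.  It uses one common textbook formulation (a 'velocity' that carries over a fraction of the previous update); the function name and settings are assumptions, and real optimisers may differ in detail.

```python
def sgd_momentum(learning_rate=0.01, momentum=0.9, n_steps=50, w_start=5.0):
    # toy loss L(w) = w**2 with gradient 2 * w
    w, velocity = w_start, 0.0
    for _ in range(n_steps):
        grad = 2 * w
        # keep a fraction of the previous update and add the new gradient step
        velocity = momentum * velocity - learning_rate * grad
        w = w + velocity
    return w

w_plain = sgd_momentum(momentum=0.0)     # ordinary gradient descent
w_momentum = sgd_momentum(momentum=0.9)  # same learning rate, with momentum
print(abs(w_plain), abs(w_momentum))
```

With the same small learning rate and the same number of steps, the momentum version ends up noticeably closer to the minimum at 0.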

<img src="overshoot.png" alt="drawing" width="400"/>



### Training in Keras

In Keras, the complexity described above is hidden and you simply **compile** a model and choose an `optimizer` and `loss` function.  You then call the model's `fit` method and pass in the training data, mini-batch size (32 is the default) and the number of **epochs** to run.  An epoch is a single pass through all of the training data.  If, for example, there are 1000 points and you are using mini-batches of 100 then there are 10 iterations within a single epoch.  You need to run multiple epochs for the network to descend the loss function.
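The relationship between samples, mini-batch size and iterations per epoch is simple arithmetic.  The helper `iterations_per_epoch` below is a hypothetical function for illustration; it assumes the final mini-batch is allowed to be smaller than the rest.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    # the final mini-batch may be smaller than batch_size, hence the ceiling
    return math.ceil(n_samples / batch_size)

iters_100 = iterations_per_epoch(1000, 100)
iters_32 = iterations_per_epoch(1000, 32)
print(iters_100, iters_32)
```

Multiplying the iterations per epoch by the number of epochs gives the total number of weight updates made during training.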

I recommend you choose the Adam optimiser.  This has been one of the most popular variants of stochastic gradient descent since about 2015.  

In [None]:
# set the optimizer, loss function and also report classification accuracy
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# fit the model and include a validation split to check for overfitting. 
results = model.fit(x=X_train_scaled.to_numpy(), 
                    y=y_train, 
                    batch_size=32,
                    validation_split=0.10, 
                    epochs=200, 
                    verbose=0)

In [None]:
def plot_loss(results):
    '''
    Two charts 1.) train versus validation loss and 2.) accuracy
    '''
    fig, ax = plt.subplots(2, 1, figsize=(12,6))
    ax[0].plot(results.history['loss'])
    ax[0].plot(results.history['val_loss'])
    ax[0].legend(['loss', 'val_loss'])
    ax[1].plot(results.history['accuracy'])
    ax[1].plot(results.history['val_accuracy'])
    ax[1].legend(['accuracy', 'val_accuracy'])

In [None]:
plot_loss(results)

# Overfitting

Given enough capacity, neural networks will overfit to your training data.  Models that are overfitted have highly variable performance when used for prediction.  Keras provides a couple of simple mechanisms to reduce overfitting: **dropout layers** and **early stopping callbacks**.
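The logic behind early stopping can be sketched in plain Python: watch the validation loss and stop once it has failed to improve for a set number of epochs (the 'patience').  This is a simplified illustration with made-up validation losses and a hypothetical function `early_stopping_epoch`; it is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # new best validation loss: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# hypothetical validation losses: improve, then start to rise (overfitting)
toy_val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]
stopped_at, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
print(stopped_at, best_epoch)
```

Training stops a few epochs after the best epoch, which is why it is useful to keep a copy of the weights from the best epoch rather than the last one.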


In [None]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout

In [None]:
def get_model(input_size, n_hidden=2, n_neurons=10, activation='relu', 
              dropout=False, d_rate=0.2):
    '''
    Create a simple Keras feedforward model.
    '''
    #create an empty sequential model
    model = Sequential(name='breast_cancer_nn')

    #input layer.
    model.add(Input(shape=(input_size,)))

    for i in range(n_hidden):
        #add a hidden layer
        model.add(Dense(units=n_neurons, activation=activation))
        #include a dropout layer
        if dropout:
            model.add(Dropout(d_rate))
    
    #output layer
    model.add(Dense(units=1, activation='sigmoid'))
    
    return model

In [None]:
############ General Parameters ############################

N_HIDDEN = 3
N_NEURONS = 32
N_EPOCHS = 200

# 0 fit silently; 1 show results per epoch
VERBOSE = 0

########### Regularization options ########################
INCLUDE_DROPOUT = True
DROPOUT_RATE = 0.2

INCLUDE_EARLY_STOP = True
PATIENCE = 10

#create an early stopping callback
es = EarlyStopping(monitor='val_loss', 
                   patience=PATIENCE,
                   restore_best_weights=True)

###########################################################

#include early stopping?
callbacks = []
if INCLUDE_EARLY_STOP:
    callbacks.append(es)

#get the custom feedforward model
model_2 = get_model(input_size=X_train.shape[1], 
                    n_hidden=N_HIDDEN,
                    n_neurons=N_NEURONS,
                    dropout=INCLUDE_DROPOUT,
                    d_rate=DROPOUT_RATE)

#summary to remind us what we have built!
model_2.summary()

#compile the new model
model_2.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])


#fit the model and also pass in the callback
results_2 = model_2.fit(x=X_train_scaled, 
                        y=y_train, 
                        batch_size=32,
                        validation_split=0.10, 
                        epochs=N_EPOCHS, 
                        verbose=VERBOSE,
                        callbacks=callbacks)

#plot loss and val loss
plot_loss(results_2)


# Prediction

Making predictions is very straightforward using the model's `.predict()` method.  

> When predicting results for individual patients you will need to `.reshape` your input

In [None]:
X_test_scaled.to_numpy()[0].shape

In [None]:
X_test_scaled.to_numpy()[0].reshape(1, -1).shape

### Predicting an individual patient's result

In [None]:
test_id = 1

pred = model_2.predict(x=X_test_scaled.to_numpy()[test_id].reshape(1, -1))[0,0]
print(f'prediction proba {pred:.2f}')
print(f'prediction: {pred >= 0.5}')
print(f'ground truth value: {y_test[test_id]}')

In [None]:
model_2.predict(x=X_test_scaled.to_numpy()[test_id].reshape(1, -1))[0, 0]

### Predicting the full test set

In [None]:
y_pred = model_2.predict(x=X_test_scaled.to_numpy()).flatten()
np.round(y_pred, 3)

### Quick reminder of classification metrics

* TP = True positives
* FP = False positives
* TN = True negatives
* FN = False negatives

$precision = \dfrac{TP}{TP + FP}$  e.g. if a model predicts that patients have a malignant tumor with precision 0.8, then when the model makes a positive cancer prediction it is right about 80% of the time.

$recall = sensitivity = \dfrac{TP}{TP + FN}$ i.e. the proportion of actual positives that are correctly identified. 

$specificity =  \dfrac{TN}{TN + FP}$ i.e. the proportion of actual negatives that are correctly identified.

For the Wisconsin breast cancer dataset we want a high sensitivity (detecting as many patients' tumors as possible), but not at the cost of specificity (lots of false positives).
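A small worked example may help fix the definitions.  The labels and predictions below are made up for illustration (1 = malignant, 0 = benign) and the `_toy` variable names are assumptions for this sketch.

```python
import numpy as np

# hypothetical ground truth and thresholded predictions for 10 patients
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 1))
fp_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 1))
tn_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 0))
fn_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 0))

precision_toy = tp_toy / (tp_toy + fp_toy)
recall_toy = tp_toy / (tp_toy + fn_toy)
specificity_toy = tn_toy / (tn_toy + fp_toy)

print(f'precision {precision_toy:.3f}')
print(f'recall/sensitivity {recall_toy:.3f}')
print(f'specificity {specificity_toy:.3f}')
```

Here one malignant case is missed (a false negative) and one benign case is flagged (a false positive), so precision and recall are both 3 out of 4 and specificity is 5 out of 6.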

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
#get predictions
y_pred = model_2.predict(x=X_test_scaled.to_numpy()).flatten()

In [None]:
#predictions are probabilities that the patient has a malignant tumour
np.round(y_pred, 2)

In [None]:
#view as 0/1
THRESHOLD = 0.5
(y_pred >= THRESHOLD).astype('int')

In [None]:
#classification results
THRESHOLD = 0.5
tn, fp, fn, tp = confusion_matrix(y_test, y_pred >= THRESHOLD).flatten()

#sensitivity and specificity
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f'sensitivity {sensitivity:.3f}')
print(f'specificity {specificity:.3f}')

In [None]:
report = classification_report(y_test, y_pred >= THRESHOLD, digits=3)
print(report)

# Prediction Uncertainty

One issue with neural networks is that they don't automatically produce estimates of uncertainty.  This is problematic because training with stochastic gradient descent is a random process, so a single predicted probability does not tell you how much that prediction might vary. 

One way to get an estimate of uncertainty from a neural network is to use **Monte Carlo Dropout**.  We have already learnt about `Dropout` layers and using them for regularisation.  What perhaps isn't clear is that when we make a prediction with a Keras model the `Dropout` layers are turned **off**.  If we instead turn them **on** and make a large number of predictions from the same data, we produce a distribution of predictions.  

> The below will only work **if you include dropout layers in your model!**
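To build intuition before using Keras, here is a NumPy sketch of the same idea on a tiny network with fixed, made-up weights.  The function `predict_toy` and its `training` flag are hypothetical stand-ins written for this sketch; the dropout step uses the common 'inverted dropout' formulation, which is an assumption rather than a description of Keras internals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed weights for a tiny network: 4 inputs -> 8 hidden -> 1 output
W1_toy = np.abs(rng.normal(size=(4, 8)))
W2_toy = rng.normal(scale=0.5, size=(8, 1))
x_toy = rng.random((1, 4))  # one made-up patient

def predict_toy(x, dropout_rate=0.2, training=False):
    hidden = np.maximum(0.0, x @ W1_toy)  # ReLU hidden layer
    if training:
        # inverted dropout: zero a random subset of units, rescale the rest
        keep_mask = rng.random(hidden.shape) >= dropout_rate
        hidden = hidden * keep_mask / (1.0 - dropout_rate)
    z = hidden @ W2_toy
    return float(1.0 / (1.0 + np.exp(-z[0, 0])))  # sigmoid output

# dropout off: the same input always gives the same prediction
deterministic = [predict_toy(x_toy, training=False) for _ in range(100)]

# dropout on: repeated predictions form a distribution
stochastic = [predict_toy(x_toy, training=True) for _ in range(100)]

print(len(set(deterministic)), np.mean(stochastic), np.std(stochastic))
```

With dropout off there is only one distinct prediction; with dropout on, the spread of the 100 predictions gives a rough indication of how uncertain the model is about this input.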

In [None]:
#get predictions
sample_no = 77
y_pred = model_2.predict(x=X_test_scaled.to_numpy()[sample_no].reshape(1, -1)).flatten()
print(np.round(y_pred, 3))

In [None]:
y_test[sample_no]

In [None]:
y_probas = [model_2(X_test_scaled.to_numpy()[sample_no].reshape(1, -1), training=True) 
            for sample in range(100)]

y_probas = np.round(np.stack(y_probas).flatten(), 2)
y_probas

In [None]:
np.quantile(y_probas, q=0.5)

In [None]:
y_probas.mean()

In [None]:
y_probas.std()

# End