# Deep Learning 101

<div class="alert alert-success">
This lecture takes a practical approach to introducing modern deep learning.  It provides the foundational knowledge you need to move on to time series forecasting.
</div>

By the end of this lecture you will have: 
    
* Developed a conceptual understanding of modern deep neural networks and their difference from shallow neural networks.
* Built intuition about what hidden layers within a deep network are doing and how they aid prediction.
* Learnt how to build deep learning models in Keras and TensorFlow 2.0.
* Gained the foundational knowledge to move on to using feedforward neural networks for time series forecasting.

# Standard Imports

In [6]:
import numpy as np
import pandas as pd

# Keras and TensorFlow Imports

For your deep learning you will make use of [Keras](https://keras.io/).  This is a Python library that sits on top of Google's deep learning toolset, TensorFlow 2.0.  Keras makes deep learning relatively straightforward because it hides a lot of TensorFlow's complexity.

> Another very powerful deep learning framework is [PyTorch](https://pytorch.org/), a pythonic deep learning toolkit.  Our research experience is that PyTorch is more efficient than Keras and TensorFlow (sometimes by a considerable margin), but that it requires more code to do the same things as Keras.  Another way to look at this is that Keras comes with 'more bells and whistles' than PyTorch, and for learning that comes in very handy!  The exercises that you will tackle in this course are written in Keras/TF, but you will also have access to optional material written in PyTorch.

In [2]:
import tensorflow as tf
from tensorflow import keras

#if using hds_forecast this should be version 2.1.0
print(tf.__version__)


2.1.0


# Computational cost of deep learning

When you have a complex deep learning architecture (which isn't always the case) and lots of data, you should expect training to be more computationally expensive (take longer to run and work your CPU hard) than other types of ML.  In these instances you really need a powerful machine and, for some models, a GPU.  For time series forecasting, we will be using OpenStack on the High Performance Cluster, but for personal learning and coursework you can also make use of Google Colaboratory (Jupyter in the cloud), which also provides a GPU.  All of the neural network notebooks in this course are runnable in Google Colab.

# A first look at Deep Learning using TensorFlow Playground

[TensorFlow Playground](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.55467&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false) is provided by Google.  I recommend that you spend some time using it, as it helps build intuition about how deep learning works.

# Components needed for deep learning

## 1. Training and test data

The data for this lecture is taken from a real ML study published in BMJ Open: ['*Can clinical audits be enhanced by pathway simulation and machine learning? An example from the acute stroke pathway*'](https://bmjopen.bmj.com/content/9/9/e028296) by Dr Michael Allen and colleagues.  The actual study used a Random Forest Classifier (from `sklearn`), but we will use a **Feedforward Neural Network Architecture.**

The data are published and open; feel free to take a look at them in more detail.

In [22]:
url = 'https://raw.githubusercontent.com/MichaelAllen1966/1807_stroke_pathway/master/machine_learning/data/data_for_ml_clin_only.csv'
lysis = pd.read_csv(url)

In [33]:
# 50 features and a single binary (0/1) label
lysis.shape

(1862, 51)

In [34]:
lysis.head()

Unnamed: 0,Thrombolysis given,Hosp_1,Hosp_2,Hosp_3,Hosp_4,Hosp_5,Hosp_6,Hosp_7,Male,Age,...,S2NihssArrivalFacialPalsy,S2NihssArrivalMotorArmLeft,S2NihssArrivalMotorArmRight,S2NihssArrivalMotorLegLeft,S2NihssArrivalMotorLegRight,S2NihssArrivalLimbAtaxia,S2NihssArrivalSensory,S2NihssArrivalBestLanguage,S2NihssArrivalDysarthria,S2NihssArrivalExtinctionInattention
0,1,0,1,0,0,0,0,0,0,63,...,3,4,0,4,0,0,0,0,1,1
1,1,0,1,0,0,0,0,0,0,85,...,0,0,0,0,0,0,0,2,1,0
2,0,0,1,0,0,0,0,0,0,91,...,0,1,0,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,0,0,90,...,1,1,0,1,0,0,1,0,1,0
4,1,0,1,0,0,0,0,0,0,69,...,2,0,4,1,4,0,1,2,2,1


# Rescaled data

In [47]:
from sklearn.preprocessing import MinMaxScaler

In [35]:
to_rescale = ['Age',
              '# Comorbidities',
              'S2RankinBeforeStroke',
              'S2NihssArrival',
              'S2NihssArrivalLocQuestions',
              'S2NihssArrivalLocCommands',
              'S2NihssArrivalBestGaze',
              'S2NihssArrivalVisual',
              'S2NihssArrivalFacialPalsy',
              'S2NihssArrivalMotorArmLeft',
              'S2NihssArrivalMotorArmRight',
              'S2NihssArrivalMotorLegLeft',
              'S2NihssArrivalMotorLegRight',
              'S2NihssArrivalLimbAtaxia',
              'S2NihssArrivalSensory',
              'S2NihssArrivalBestLanguage',
              'S2NihssArrivalDysarthria',
              'S2NihssArrivalExtinctionInattention']

In [68]:
#create dataframe of quantitative data to rescale 0 - 1
df_to_rescale = lysis[to_rescale]

#target label
y_name = 'Thrombolysis given'

#dataframe of dummy variables
categorical_features = lysis.drop(to_rescale + [y_name], axis=1)

#series containing the y variable (reuses the label name defined above)
y = lysis[y_name]

In [58]:
scaler = MinMaxScaler()
scaler.fit(df_to_rescale)
x_scaled = pd.DataFrame(scaler.transform(df_to_rescale))
x_scaled.columns = df_to_rescale.columns
print(x_scaled.shape)
x_scaled.head()

(1862, 18)


Unnamed: 0,Age,# Comorbidities,S2RankinBeforeStroke,S2NihssArrival,S2NihssArrivalLocQuestions,S2NihssArrivalLocCommands,S2NihssArrivalBestGaze,S2NihssArrivalVisual,S2NihssArrivalFacialPalsy,S2NihssArrivalMotorArmLeft,S2NihssArrivalMotorArmRight,S2NihssArrivalMotorLegLeft,S2NihssArrivalMotorLegRight,S2NihssArrivalLimbAtaxia,S2NihssArrivalSensory,S2NihssArrivalBestLanguage,S2NihssArrivalDysarthria,S2NihssArrivalExtinctionInattention
0,0.383333,0.25,0.0,0.404762,0.0,0.0,0.5,0.666667,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.5,0.5
1,0.75,0.5,0.2,0.166667,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.5,0.0
2,0.85,0.5,0.4,0.047619,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
3,0.833333,0.5,0.4,0.119048,0.0,0.0,0.0,0.0,0.333333,0.25,0.0,0.25,0.0,0.0,0.5,0.0,0.5,0.0
4,0.483333,0.0,0.0,0.52381,1.0,0.0,0.5,0.666667,0.666667,0.0,1.0,0.25,1.0,0.0,0.5,0.666667,1.0,0.5
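
For intuition, with its default settings `MinMaxScaler` maps every column to `(x - min) / (max - min)`.  The sketch below reproduces that arithmetic in NumPy.  The helper `min_max_scale` and the toy values are illustrative assumptions and are not taken from the stroke data.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # guard against division by zero for constant columns
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# hypothetical toy data: column 0 is an age, column 1 a 0-4 score
toy = np.array([[40.0, 0.0],
                [63.0, 2.0],
                [100.0, 4.0]])

toy_scaled = min_max_scale(toy)
print(toy_scaled)
```

In this toy column, which spans 40 to 100, an age of 63 is rescaled to roughly 0.383 and a score of 2 on a 0 to 4 scale becomes 0.5.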


In [69]:
X_processed = pd.concat([x_scaled, categorical_features], 
                         axis=1)
X_processed.head()

Unnamed: 0,Age,# Comorbidities,S2RankinBeforeStroke,S2NihssArrival,S2NihssArrivalLocQuestions,S2NihssArrivalLocCommands,S2NihssArrivalBestGaze,S2NihssArrivalVisual,S2NihssArrivalFacialPalsy,S2NihssArrivalMotorArmLeft,...,Anticoag before stroke_0,Anticoag before stroke_1,Anticoag before stroke_NK,Stroke severity group_1. No stroke symtpoms,Stroke severity group_2. Minor,Stroke severity group_3. Moderate,Stroke severity group_4. Moderate to severe,Stroke severity group_5. Severe,Stroke Type_I,Stroke Type_PIH
0,0.383333,0.25,0.0,0.404762,0.0,0.0,0.5,0.666667,1.0,1.0,...,0,0,1,0,0,0,1,0,1,0
1,0.75,0.5,0.2,0.166667,1.0,1.0,0.0,0.0,0.0,0.0,...,0,0,1,0,0,1,0,0,1,0
2,0.85,0.5,0.4,0.047619,0.0,0.0,0.0,0.0,0.0,0.25,...,0,0,1,0,1,0,0,0,1,0
3,0.833333,0.5,0.4,0.119048,0.0,0.0,0.0,0.0,0.333333,0.25,...,0,0,1,0,0,1,0,0,1,0
4,0.483333,0.0,0.0,0.52381,1.0,0.0,0.5,0.666667,0.666667,0.0,...,0,0,1,0,0,0,0,1,1,0


# Train-Test Split

In [67]:
from sklearn.model_selection import train_test_split

In [71]:
#setting random_state means we always get the same split
X_train, X_test, y_train, y_test \
    = train_test_split(X_processed.to_numpy(), y.to_numpy(), 
                       test_size=0.20, 
                       random_state=42)

In [74]:
print(X_train.shape)
print(X_test.shape)

(1489, 50)
(373, 50)
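
The shapes above can be checked by hand, and the effect of fixing a random seed can be sketched with NumPy.  This is not how `train_test_split` is implemented: `shuffled_split` is a hypothetical helper, and rounding the test set size up is an assumption that happens to agree with the shapes printed above.

```python
import math
import numpy as np

n_samples = 1862   # rows in the lysis dataframe
test_size = 0.20

# assumption: the test set size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

def shuffled_split(n, n_test, seed):
    """Return (train_idx, test_idx) taken from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    return order[n_test:], order[:n_test]

train_a, test_a = shuffled_split(n_samples, n_test, seed=42)
train_b, test_b = shuffled_split(n_samples, n_test, seed=42)

# the same seed reproduces exactly the same split
print(np.array_equal(test_a, test_b))
```

Because the permutation is driven entirely by the seed, rerunning the cell gives identical train and test rows, which is the behaviour that `random_state=42` provides.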


## 2. Sequential layers and activation functions

Feedforward neural networks accept a vector of quantitative values as input and pass this through a sequence of fully connected layers, in which each neuron receives input from all of the neurons in the previous layer.  The neuron weights the input vector, adds a bias and then passes the result through an activation function.  In a hidden layer, for example, you could use a Rectified Linear Unit (ReLU).
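
The computation inside one fully connected layer can be sketched in a few lines of NumPy.  The function names `relu` and `dense_forward` and all of the numbers are illustrative assumptions.  They are not part of Keras.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: negative values become zero."""
    return np.maximum(0.0, z)

def dense_forward(x, W, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(x @ W + b)

# hypothetical example: 3 input features feeding a layer of 2 neurons
x = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, -1.0],
              [0.0,  0.5],
              [1.0, -1.0]])   # shape (n_inputs, n_neurons)
b = np.array([0.5, 0.0])

hidden = dense_forward(x, W, b, relu)
print(hidden)
```

Here the first neuron's weighted sum is 3.5 + 0.5 = 4.0, which ReLU leaves unchanged.  The second neuron's weighted sum is -3.0, which ReLU clips to 0.0.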

The final layer is an **output** layer.  The thrombolysis example is a binary classification problem, so the output layer is a fully connected layer with a single neuron.  You will need to use a `sigmoid` activation function to provide a probability of receiving thrombolysis between 0.0 and 1.0.
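
To see why a sigmoid suits a binary output, the sketch below applies it to a few made-up values from a single output neuron.  The 0.5 cut-off used to turn probabilities into class labels is an assumption for illustration and is not something the lecture specifies.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical pre-activation values from the single output neuron
z = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(z)
print(probabilities)

# assumption: a 0.5 cut-off converts probabilities into 0/1 class labels
predicted_labels = (probabilities >= 0.5).astype(int)
print(predicted_labels)
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of exactly zero maps to 0.5.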

For feedforward networks, `Keras` provides simple classes to help you construct your model.

In [77]:
#a model consisting of a sequential set of layers
from tensorflow.keras.models import Sequential

#a fully connected layer and an input layer
from tensorflow.keras.layers import Dense, Input

In [83]:
#create an empty sequential model
model = Sequential(name='lysis_nn')

#input layer.
model.add(Input(shape=(X_train.shape[1],)))

#hidden layer 1 
model.add(Dense(units=4, activation='relu'))

#hidden layer 2
#model.add(Dense(units=2, activation='relu'))

#output layer
model.add(Dense(units=1, activation='sigmoid'))

#summary including the number of trainable parameters
model.summary()
          

Model: "lysis_nn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_11 (Dense)             (None, 4)                 204       
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 5         
Total params: 209
Trainable params: 209
Non-trainable params: 0
_________________________________________________________________
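
The parameter counts in the summary can be reproduced by hand: each neuron has one weight per input plus one bias.  The helper `dense_params` below is a hypothetical function written for this check.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a fully connected layer."""
    # one weight per input for every neuron, plus one bias per neuron
    return n_inputs * n_units + n_units

n_features = 50                                # columns in X_train
hidden_params = dense_params(n_features, 4)    # hidden layer with 4 neurons
output_params = dense_params(4, 1)             # output layer with 1 neuron
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

This gives 204 parameters for the hidden layer, 5 for the output layer and 209 in total, matching `model.summary()`.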


# Training a network

For each neuron you can think of the weights as the **strength** of the neuron's **connections** to all of the neurons in the **previous layer**.  The network is initialised with these weights set to **random** values.  The purpose of training is therefore to find the 'best' weights for your prediction problem.   

> In this context, 'best' means the weights that minimise the training **loss**.  Loss is a measure of model fit i.e. a metric quantifying the difference between the predictions from the model and the ground truth observations. For the thrombolysis example 
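
As one illustration of a loss for a 0/1 label, the sketch below computes binary cross-entropy in NumPy.  Treating binary cross-entropy as the loss is an assumption here (it is a common choice for a sigmoid output), and the labels and predictions are made up.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # clip the predictions to avoid taking log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred)
               + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean()

# hypothetical ground truth labels
y_true = [1, 0, 1, 0]

# predictions close to the labels and predictions on the wrong side of 0.5
good_predictions = [0.9, 0.1, 0.8, 0.2]
poor_predictions = [0.4, 0.6, 0.3, 0.7]

loss_good = binary_cross_entropy(y_true, good_predictions)
loss_poor = binary_cross_entropy(y_true, poor_predictions)
print(loss_good, loss_poor)
```

Predictions that sit closer to the ground truth give a smaller loss, which is the quantity that training tries to drive down.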

In [None]:
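def model_1(X_train, y_train, epochs=10):
    '''Build, compile and fit a network with a single hidden layer.'''
    #hidden layer with one neuron per input feature, then a sigmoid output
    model = keras.Sequential([
        keras.layers.Dense(X_train.shape[1], activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    #binary cross-entropy suits a sigmoid output and a 0/1 label
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    #hold back 25% of the training data for validation
    history = model.fit(X_train, y_train,
                        validation_split=0.25, epochs=epochs)

    return model, history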

In [3]:
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import EarlyStopping


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Sklearn imports

> The above is called a forward pass.  But the real innovation in deep learning came from the backpropagation algorithm.  Backprop provides a way for a neural network to `learn` the weights that minimise the   