# P-ai AI/ML Workshop: Session 4b

Welcome to session 4b of P-ai's AI/ML workshop series! Today we'll learn about
- Deep learning
    - How to build and train a neural net with TensorFlow and Keras
    - Types of neural nets
- Solving a real-world classification problem with a neural net

<img src="https://images.squarespace-cdn.com/content/5d5aca05ce74150001a5af3e/1580018583262-NKE94RECI46GRULKS152/Screen+Shot+2019-12-05+at+11.18.53+AM.png?content-type=image%2Fpng" width="200px">

## 1. Implementation: Intro to TensorFlow and Keras

<img src="https://3.bp.blogspot.com/-QZVBl08fmPk/XhO909Ha1dI/AAAAAAAACZI/q1a1UykGKe0KDUZ_ZITtWmM7bBJFRrvPQCLcBGAsYHQ/s1600/tensorflowkeras.jpg" width="500px">

You might be wondering how to actually build and train a neural net. The most popular frameworks for deep learning are Google's [TensorFlow](https://www.tensorflow.org/) and Facebook's [PyTorch](https://pytorch.org/). Under the hood, each one is basically a bunch of optimized graph algorithms that neural networks need.

While you can build a neural net with TensorFlow alone (and in the future, you might need to do this to create a more "customized" neural net), this can often be a bit more involved than it has to be for a beginner. Luckily, there's [Keras](https://keras.io/), a high-level API that ships with TensorFlow; in other words, it lets you write more readable and intuitive code, and Keras takes care of the nitty-gritty TensorFlow-y details.

Let's take a look at how we would build the hypothetical neural net above with Keras.

In [1]:
''' Imports; this might take a second to load '''
import tensorflow
# Import layers we need (in this case, just Dense)
from tensorflow.keras.layers import Dense
# Sequential model means you just add layers in a sequence
from tensorflow.keras.models import Sequential
# Adam optimizer
from tensorflow.keras.optimizers import Adam

In [2]:
''' Build model '''
model = Sequential()                                   # Define sequential model
# Note: Dense uses no activation (a linear layer) unless you pass one, e.g. activation='relu'
model.add(Dense(128, input_dim=26))                    # Input layer and first hidden layer
model.add(Dense(256))                                  # Second hidden layer
model.add(Dense(3, activation='softmax'))              # Output layer; note activation function


In [3]:
''' Choose optimizer and loss function '''
opt = Adam(learning_rate=0.001)              # Set learning rate to 0.001
loss = 'categorical_crossentropy'            # Using categorical crossentropy for multiclass classification

In [4]:
''' Compile model and print layer summary '''
model.compile(optimizer=opt, 
    loss=loss,
    metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 128)               3456      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               33024     
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 771       
Total params: 37,251
Trainable params: 37,251
Non-trainable params: 0
_________________________________________________________________
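
The `Param #` column in the summary can be checked by hand. Here is a small sketch, assuming each `Dense` layer has one weight per (input, neuron) pair plus one bias per neuron; `dense_params` is just a helper name made up for this example.

```python
def dense_params(n_inputs, n_units):
    ''' Weights (n_inputs * n_units) plus one bias per unit '''
    return n_inputs * n_units + n_units

# 26 input features, then the three Dense layers from the model above
layer_sizes = [26, 128, 256, 3]
counts = [dense_params(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The three counts match the `dense`, `dense_1`, and `dense_2` rows, and their sum matches `Total params`.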


And that's that! Now, let's imagine we have some data:

In [None]:
import random
import numpy as np

# Generate enough random numbers
random_numbers = np.random.rand(1000 * 26)
# This represents 1,000 examples, each with 26 features (frequencies a-z)
X = np.reshape(random_numbers, (1000, 26))
# This represents 1,000 one-hot vectors (which language the corresponding x actually is)
y = np.zeros((1000, 3), dtype=int)
for i in range(len(y)):
    y[i][random.randint(0, 2)] = 1

In [None]:
print("Fake dataset:")
print("X shape: ", X.shape)
print("First X:", X[0])
print("y shape: ", y.shape)
print("First y:", y[0])

Now, we can fit our model to this fake data.

In [None]:
from sklearn.model_selection import train_test_split
# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Here we can specify the number of epochs to train on, the batch size, and much more
# Check out the documentation for more!
history = model.fit(X_train, y_train, epochs=10, batch_size=32)

Cool! That's a lot of output, and it might seem overwhelming; here's the breakdown:

<img src="images/tf_output.png" width="700px">

Notice how, as the model trained on the training data for more epochs, the loss decreased and the accuracy increased. This is to be expected; the more times the model sees the data, the more it can fit to that data. You should probably be concerned about something, though...

Since the data is random and there are three classes, the model should be right only about 33% of the time. Yet, at the end of 10 epochs, the model's training accuracy is above 40%. How can this be? Did the model discover some hidden patterns in the apparently random data?

No. This is the classic problem of **overfitting**. 

<img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190523171258/overfitting_2.png" width="600px">

Virtually every machine learning model has the tendency to overfit on its training data; that is, it begins to learn the *specifics* of the data instead of the *general trend*. This isn't good, because that means your model is **unstable**, which means (among other things) it *won't generalize well*. This is why we have training and test sets; we test our model on data it's never seen before to test whether it can actually generalize what it's learned, or if it overfit.
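
To make the "specifics vs. general trend" idea concrete, here is a rough sketch of the most extreme overfitter possible: a model that simply memorizes the training set and answers with the label of the closest training example. This is not what our neural net does; it's a stand-in chosen for illustration, and the names (`X_fake`, `predict_nearest`, and so on) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_fake = rng.random((1000, 26))             # same shape as the fake dataset above
y_fake = rng.integers(0, 3, size=1000)      # random class ids 0, 1, 2 (not one-hot, for brevity)
X_tr, X_te = X_fake[:750], X_fake[750:]
y_tr, y_te = y_fake[:750], y_fake[750:]

def predict_nearest(X_query):
    ''' Return the label of the closest training example for each query row '''
    # Squared Euclidean distances, shape (n_query, n_train)
    d2 = ((X_query ** 2).sum(axis=1)[:, None]
          - 2 * X_query @ X_tr.T
          + (X_tr ** 2).sum(axis=1)[None, :])
    return y_tr[d2.argmin(axis=1)]

train_acc = np.mean(predict_nearest(X_tr) == y_tr)
test_acc = np.mean(predict_nearest(X_te) == y_te)
print("train accuracy:", train_acc)   # every training point is its own nearest neighbour
print("test accuracy: ", test_acc)    # should hover around chance (about 1/3)
```

Perfect training accuracy on random labels tells us nothing about the world; only the test number does.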

In [None]:
# Test our model on the test dataset
model.evaluate(X_test, y_test)

The result of `evaluate` is a list of the final loss and accuracy (or whatever metric we compiled the model with). We can see that the model is actually only right ~33% of the time on data it hasn't seen before, which is exactly what we would expect. That means that our model did overfit on the training data a bit, as we would expect would happen with random data.
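
For a sense of scale when reading the loss value, it can help to compute categorical crossentropy by hand. The sketch below assumes the textbook formula (the mean over examples of `-sum(y_true * log(y_prob))`); Keras additionally clips probabilities for numerical safety, which is left out here. A model that shrugs and predicts 1/3 for every class scores `ln(3)`, so a test loss near that value means "no better than guessing".

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob):
    ''' Mean over examples of -sum(y_true * log(y_prob)); y_true is one-hot '''
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])   # three one-hot labels
uniform = np.full((3, 3), 1 / 3)                       # "no idea" predictions
chance_loss = categorical_crossentropy(y_true, uniform)
print(chance_loss, np.log(3))
```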

**Helpful tip**: When calling `.fit()`, you can also pass in a validation split, which splits your training data *again* into training and validation data, and Tensorflow will test your model on the validation data after each epoch.

In [None]:
def buildModel():
    model = Sequential()
    model.add(Dense(128, input_dim=26))
    model.add(Dense(256))
    model.add(Dense(3, activation='softmax'))
    opt = Adam(learning_rate=0.001)
    loss = 'categorical_crossentropy'
    model.compile(optimizer=opt, 
        loss=loss,
        metrics=['accuracy'])
    return model

In [None]:
model = buildModel()
# Use 20% of the data for validation
history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2)

We can now see the validation loss and accuracy after each epoch and, as we would expect, the validation loss / accuracy don't get any better the more we train.
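
One detail worth knowing: according to the Keras documentation, `validation_split` holds out the *last* fraction of the samples you pass in, before any shuffling. Here is a NumPy sketch of that behavior; `split_like_validation_split` is a made-up helper for illustration, and the exact rounding Keras uses may differ.

```python
import numpy as np

def split_like_validation_split(X, y, validation_split=0.2):
    ''' Hold out the LAST fraction of the rows, without shuffling first '''
    n_val = int(len(X) * validation_split)
    split_at = len(X) - n_val
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]

X_demo = np.arange(20).reshape(10, 2)   # 10 toy examples, 2 features each
y_demo = np.arange(10)
X_fit, X_val, y_fit, y_val = split_like_validation_split(X_demo, y_demo)
print(len(X_fit), len(X_val), y_val)
```

This matters if your data is sorted (say, by class or by date): shuffle it yourself first, or the validation set won't be representative.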

Before moving on to other types of neural nets and an example, we can quickly see how easy it is to make predictions with our neural net.

In [None]:
# Dummy data; shape: (1, 26)
# That is, one example with 26 features
dataToClassify = np.reshape(np.random.rand(26), (1,26))
pred = model.predict(dataToClassify)
# Print prediction of first (only) example
print(pred[0])

Remember that, for multiclass classification, the output of the model is a vector of probabilities for each class. To get a single class, we can easily find the index with the highest probability.

In [None]:
print(np.argmax(pred[0]))

So, our model predicts that this hypothetical input (which we imagine is a vector of letter frequencies) belongs to language 0 (say, English).
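
The same idea works for a whole batch of predictions at once. In this sketch the probabilities are made up, and the class names other than "English" are hypothetical placeholders; the point is just `np.argmax` with `axis=1`.

```python
import numpy as np

# Hypothetical class names; only "English" for class 0 is suggested by the text above
LANGUAGES = ['English', 'Spanish', 'French']

# Made-up softmax outputs for three examples (each row sums to 1)
preds = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.2, 0.7],
                  [0.25, 0.6, 0.15]])
class_ids = np.argmax(preds, axis=1)          # index of the largest probability in each row
labels = [LANGUAGES[i] for i in class_ids]
print(class_ids, labels)
```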

Pat yourself on the back, that's the basics of building a neural net!

## 2. Other neural nets!

There are many "flavors" of neural networks. Here are some of the most common:

### Multilayer Perceptron (MLP) aka "Vanilla" Neural Net
The neural nets we just looked at are MLPs, or the "vanilla" feed-forward neural net. Many times, people will refer to these "basic" neural nets as just "neural nets" because they're not specifying a more specific type. MLPs find a relationship between 1D input and 1D output. So, if you can encode both your inputs and outputs as vectors, chances are, all you need is an MLP.

### Convolutional Neural Network (CNN)
A CNN is a type of neural net that works best with input where the **location** of the values is important. Take our vanilla neural net example; it doesn't matter in which order we store the letter frequencies in our input vector so long as we stick with an order. We can't say the same for images; it may matter a lot *where* in an image some feature is. CNNs work by first **convolving** over the image, which would probably be a bit tangential to explain now, but you can read more about it [here](https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/). In any case, convolution allows the CNN to learn features about the relative location of values in the matrix, which is perfect for image processing!
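
If you'd like a taste of convolution anyway, here is a minimal NumPy sketch: slide a small kernel over the image and, at each position, sum the elementwise products. The image and kernel are toy values chosen for this example; in a real CNN the kernel values are learned, and libraries use much faster implementations than these loops.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    ''' Slide kernel over image (no padding, stride 1) and sum the elementwise products '''
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy 4x4 "image": dark left half (0), bright right half (1)
image = np.array([[0, 0, 1, 1]] * 4)
# Toy kernel that responds to a dark-to-bright step from left to right
kernel = np.array([[-1, 1],
                   [-1, 1]])
feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The output is only non-zero in the middle column, exactly where the edge sits; that is the sense in which convolution picks up on *where* a feature is.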

<img src="https://miro.medium.com/max/1400/1*XbuW8WuRrAY5pC4t-9DZAQ.jpeg" width="600px">

### Long Short Term Memory (LSTM)
The LSTM is a type of **RNN** (Recurrent Neural Network), which means it's meant to deal with **temporal** data. For example, if you were trying to predict something about a **sequence** of values, a regular neural network wouldn't do the trick. The LSTM introduces mechanisms for the net to "remember" old information and synthesize it with new information (hence the name). For a long time, LSTMs were the go-to for learning language, since you can represent a series of words numerically (see [word embeddings](https://machinelearningmastery.com/what-are-word-embeddings/)!).

<img src="https://miro.medium.com/max/1400/1*ahafyNt0Ph_J6Ed9_2hvdg.png" width="600px">

### Transformer

Transformers have really taken off in the past few years, especially with the enormous success of language models like BERT and GPT. Transformers build off of another type of model called the **encoder-decoder**, where the model learns both how to encode and decode data to get from input to output; the classic example is language translation. First, you encode a series of words in one language into a vector (the "meaning"), which is then decoded into a series of words in the second language. The transformer is basically an encoder-decoder on steroids; it has stacks of encoders and stacks of decoders, and it also implements this cool concept called [Attention](https://arxiv.org/abs/1706.03762), which helps the model learn *which* parts of the input correspond to which parts of the output. On translation tasks, transformers have largely displaced LSTMs!
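
The core of attention fits in a few lines. Below is a NumPy sketch of scaled dot-product attention as defined in the linked paper, `softmax(Q K^T / sqrt(d_k)) V`. The shapes and random values are arbitrary choices for this example; a real transformer learns the projections that produce `Q`, `K`, and `V` and runs several attention "heads" in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    ''' softmax(Q K^T / sqrt(d_k)) V '''
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 output positions, vectors of size 4
K = rng.normal(size=(3, 4))   # 3 input positions
V = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))       # each row: how much one output position attends to each input
```

Each row of `weights` sums to 1, so every output position is a weighted average of the input values.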

<img src="https://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png" width="600px">

### GAN

GAN stands for Generative Adversarial Network, and they were invented by machine learning celebrity [Ian Goodfellow](https://en.wikipedia.org/wiki/Ian_Goodfellow). The main idea of GANs is to have two components of the neural net: the *generator* and the *discriminator*, which "compete" with each other. The generator tries to generate data that is similar to the real data, and the discriminator tries to figure out which data are real and which were generated by the generator. Both "adversaries" get better throughout the training, and at some point, the generator may become so good that the discriminator can't tell which data are real and which are fake. At this point, thanks to the GAN, you have a very convincing generator and, in the discriminator, a trained classifier! There are lots of uses for GANs, from [generating images of people that don't exist](https://thispersondoesnotexist.com/) to [turning sketches into photorealistic images](https://arxiv.org/pdf/1801.02753.pdf).

<img src="https://i1.wp.com/bdtechtalks.com/wp-content/uploads/2018/05/GANs.png?resize=696%2C304&ssl=1" width="600px">

These are just a few of the most common "types" of models you'll hear about out there. There's no limit to the kinds of model you can put together, though. For example, if you need to learn spatio-temporal data, you can combine a CNN with an LSTM to get a CNN-LSTM. When you work with TensorFlow directly, you can build your very own custom neural nets!

## 3. Case Study

To put our theory into practice, let's take a look at the [heart failure prediction dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). The goal of this task is to predict whether patients died from heart failure within a certain time frame after their last check-in. First, we should take a look at the data we have.

In [None]:
# Helpful imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [None]:
# Read and display data from file
heart_data = pd.read_csv('data/heart_failure.csv')
heart_data

We see that we have 12 features and one prediction variable, `DEATH_EVENT`. Some of the variables are continuous (like `age`, `platelets`), and others are binary (e.g. `anaemia`, `diabetes`). 

In [None]:
CONTINUOUS_COLS = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
BINARY_COLS = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']
TARGET_COL = 'DEATH_EVENT'

Let's plot all of our continuous variables against the target variable.

In [None]:
for col in CONTINUOUS_COLS:
    # Distribution plot
    sns.displot(heart_data, x=col, hue=TARGET_COL)
    plt.show()

We don't immediately see that any of these features would be excellent predictors of death from heart failure. The `time` variable (the follow-up period) might offer the most clues; it seems that patients with a very short follow-up period died more often than those with a much longer one.

We can also take a look at our binary variables; we'll plot them as pie graphs.

In [None]:
for col in BINARY_COLS:
    f, (ax1, ax2) = plt.subplots(1, 2)
    print(col)
    # sort_index() orders the counts as [0, 1] so they line up with the labels
    ax1.pie(heart_data[heart_data[TARGET_COL] == 0][col].value_counts().sort_index(), labels=[0,1], autopct='%1.1f%%')
    ax1.set_title('Survived')
    ax2.pie(heart_data[heart_data[TARGET_COL] == 1][col].value_counts().sort_index(), labels=[0,1], autopct='%1.1f%%')
    ax2.set_title('Died')
    plt.show()

Across the board, we see little difference in the target variable due to any of these binary variables alone. Thus, we might expect this classification task to be a bit challenging.

In [None]:
# X data is everything but the target column
X_data = heart_data[heart_data.columns[:-1]]
# y data is the target column
y_data = heart_data[TARGET_COL]

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2)
print(f"{len(X_train)} training examples and {len(X_test)} test examples")

It's also worth noting that the *scale* of the continuous variables varies drastically. Take a look at this:

In [None]:
print("Max age:", max(X_data['age']))
print("Max platelets:", max(X_data['platelets']))

This is generally not ideal for neural nets. A big difference in scale between different features means that the "larger scale" features will naturally overpower the "smaller scale" ones, and the net will need to learn extremely small / large weights to keep the features comparable. This is unnecessary work for the neural net when we can just **normalize** our input first.

There are a few different ways to normalize data, but the goal is to scale the features so they're comparable. We can use the [standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to perform standard normalization, which means transforming the data so that the mean and standard deviation of each feature are 0 and 1, respectively.
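
Standard normalization is simple enough to do by hand, which is a nice way to see what the scaler is up to. This sketch uses a made-up 3-row matrix with an age-like and a platelets-like column; it subtracts each column's mean and divides by its population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrix: column 0 is age-like, column 1 is platelets-like (values are made up)
X_toy = np.array([[40.0, 150000.0],
                  [60.0, 250000.0],
                  [80.0, 350000.0]])
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)           # population standard deviation (ddof=0)
X_scaled = (X_toy - mean) / std
print(X_scaled)
```

After scaling, the two columns are identical even though the raw values differed by four orders of magnitude, so neither one can drown out the other.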

In [None]:
def normalify(data, columns, scaler=None):
    ''' Apply normalization to the specified columns in data. Fits a scaler if one is not given '''
    # Get columns to be normalized
    data_to_normalize = data[columns]
    # Get remaining data (not to be normalized)
    remaining_cols = [c for c in data.columns if c not in columns]
    remaining_data = data[remaining_cols]
    # If no scaler is given, fit one
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(data_to_normalize)
    # Apply standard scaler
    data_normalized = scaler.transform(data_to_normalize)
    # Recombine normalized and remaining data, and also return the scaler
    return np.hstack((data_normalized, remaining_data)), scaler

In [None]:
# Apply our normalization process
X_train_normalized, scaler = normalify(X_train, CONTINUOUS_COLS)
X_test_normalized, _ = normalify(X_test, CONTINUOUS_COLS, scaler=scaler)

In [None]:
# For consistency, we'll also turn our y data into numpy arrays
y_train, y_test = np.array(y_train), np.array(y_test)

Our X data is now a matrix in which the continuous features have mean 0 and standard deviation 1, so most values sit within a unit or two of 0. Cool!

In [None]:
print("X training data:\n", X_train_normalized)
print("Shape:", X_train_normalized.shape)

And, as expected, our y data are binary 0s (survived) and 1s (died).

In [None]:
print("y training data:\n", y_train)
print("Shape:", y_train.shape)

Let's build a model! We'll incorporate something called `Dropout`, which randomly deactivates (sets to 0) a certain proportion of a layer's outputs on each training step. This might seem destructive; why would we throw away what the layer just computed? Dropout is a great way to combat overfitting! 

By deactivating outputs at random, the model can't rely on any single neuron and is forced to spread what it learns across many connections; in doing so, it learns the data more deeply instead of "memorizing" the input-output relationship in the training data. Dropout is only active during training; when you evaluate or predict, every neuron is used. You can think of dropout like working out; your muscles get damaged to grow back stronger.
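
Here is a rough NumPy sketch of what a dropout layer does during training, assuming the usual "inverted dropout" scheme that Keras documents: each value is zeroed with probability `rate`, and the survivors are scaled up by `1 / (1 - rate)` so the average stays the same. The `dropout` function and the all-ones input are made up for illustration.

```python
import numpy as np

def dropout(activations, rate, rng):
    ''' Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate) '''
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 64))            # stand-in for one layer's outputs
dropped = dropout(activations, rate=0.3, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0))
print("mean activation:", dropped.mean())
```

Roughly 30% of the values come out as 0, while the mean stays close to 1, which is why the next layer doesn't need to adjust when dropout is switched off at prediction time.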

In [None]:
from tensorflow.keras.layers import Dropout

In [None]:
def defineBaselineModel():
    ''' Define and return a neural net '''
    model = Sequential()
    model.add(Dense(64, input_dim=12))         # Input layer has 12 features; second layer of 64 neurons
    model.add(Dropout(0.3))                    # 30% of the previous layer's outputs are zeroed during training
    model.add(Dense(128))                      # Third layer has 128 neurons
    model.add(Dropout(0.3))
    model.add(Dense(256))                      # Fourth layer has 256 neurons
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))  # One neuron for output with sigmoid activation
    opt = Adam(learning_rate=0.01)             # Define optimizer
    loss = 'binary_crossentropy'               # Binary crossentropy for loss
    model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])
    return model

In [None]:
model = defineBaselineModel()
model.summary()

In [None]:
# Train model on data!
model.fit(X_train_normalized, y_train, batch_size=8, epochs=5)

In [None]:
# And evaluate!
acc = model.evaluate(X_test_normalized, y_test)
print("Accuracy:", acc[1])

The performance will vary every time you train the model, but when I ran it, the test accuracy was 78%. Let's check how well the model would do if it guessed `0` every time.

In [None]:
1 - (sum(y_data) / len(y_data))

So, that's a little bit reassuring; the model performs about 10 percentage points better than it would by just guessing that no patient died. This isn't stellar, but if we preprocessed the data better (e.g. apply a log scale to `creatinine_phosphokinase`), or if we had more data, we could hope for better performance.
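
As a sketch of what the log-scale idea looks like, here it is applied to a handful of made-up, right-skewed values standing in for a column like `creatinine_phosphokinase`. Using `np.log1p` (that is, `log(1 + x)`) is one common choice, not the only one, and whether it actually helps this model is something you'd have to test.

```python
import numpy as np

# Made-up, right-skewed values standing in for a column like creatinine_phosphokinase
skewed = np.array([50.0, 100.0, 200.0, 600.0, 8000.0])
logged = np.log1p(skewed)                    # log(1 + x), defined at x = 0
print("max / median before:", skewed.max() / np.median(skewed))
print("max / median after: ", logged.max() / np.median(logged))
```

The order of the values is unchanged, but the one huge outlier no longer dwarfs everything else, which gives the standard scaler a much friendlier distribution to work with.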

Even though we split our data into training and test data, it's very possible that, by random chance, our test set was "easy" and resulted in a better test accuracy than we would get with a different train-test split. For this reason, we often use **K-fold cross validation** as a way of combating this. Basically, the data gets split up into `k` "folds", and then the model is trained and tested on different combinations of those folds (each fold gets to be the test set once). We can take the average of the `k` test accuracies as a more honest evaluation.
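
To see the fold bookkeeping on its own, here is a NumPy-only sketch on 10 toy samples with `k=5`. `kfold_indices` is a made-up helper for illustration (scikit-learn's `KFold` does this job for real); the thing to notice is that every sample index shows up in a test fold exactly once.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    ''' Yield (train_idx, test_idx) pairs; each sample lands in the test fold exactly once '''
    shuffled = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

test_counts = np.zeros(10, dtype=int)
for train_idx, test_idx in kfold_indices(n_samples=10, k=5):
    test_counts[test_idx] += 1
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
print(test_counts)   # how many times each sample was used for testing
```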

In [None]:
from sklearn.model_selection import KFold

def create_data_splits(data, k=5):
    ''' Returns a list of k tuples (X_train, X_test, y_train, y_test) '''
    splits = []
    kf = KFold(n_splits=k, shuffle=True)
    # Get raw X and y data
    X_data = data[data.columns[:-1]]
    y_data = data[data.columns[-1]]
    # kf.split returns train and test indexes
    for train_indexes, test_indexes in kf.split(X_data):
        # Get actual data by "filtering" by index using pandas' iloc
        X_train, X_test = X_data.iloc[train_indexes], X_data.iloc[test_indexes]
        y_train, y_test = y_data.iloc[train_indexes], y_data.iloc[test_indexes]
        # Apply normalization
        X_train_normalized, scaler = normalify(X_train, CONTINUOUS_COLS)
        X_test_normalized, _ = normalify(X_test, CONTINUOUS_COLS, scaler=scaler)
        y_train, y_test = np.array(y_train), np.array(y_test)
        # Add tuple of data to splits
        splits.append((X_train_normalized, X_test_normalized, y_train, y_test))
    return splits

In [None]:
# Use 10-fold cross validation
k = 10
splits = create_data_splits(heart_data, k=k)
accuracies = []
for X_train, X_test, y_train, y_test in splits:
    # Important: define a new model every time you train on a new data split!
    model = defineBaselineModel()
    # verbose=0 doesn't show the progress bar
    model.fit(X_train, y_train, batch_size=8, epochs=10, verbose=0)
    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    accuracies.append(accuracy)
# Print average accuracy
print(sum(accuracies) / k)

When I ran this code, it resulted in an average accuracy around 80%, which isn't half bad, and clearly better than always guessing `0`.

That's all we have time for today! Another great thing to check out would be a confusion matrix, which will let you visualize *how* the model misclassifies. Check out workshop 3 if you'd like a refresher on how to make / interpret those.
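
As a quick starting point, here is a minimal sketch of a 2x2 confusion matrix for a binary classifier like ours. The labels and probabilities are made up, and the 0.5 threshold for turning a sigmoid output into a class is a common default rather than something we tuned; `confusion_matrix_2x2` is a helper written just for this example.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    ''' Rows are true classes (0, 1); columns are predicted classes (0, 1) '''
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels and sigmoid outputs for illustration
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.7, 0.2, 0.8, 0.9])
y_pred = (y_prob > 0.5).astype(int)      # threshold the probability at 0.5
cm = confusion_matrix_2x2(y_true, y_pred)
print(cm)
```

The off-diagonal entries are the mistakes: top right counts survivors predicted as deaths, bottom left counts deaths the model missed.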

## Closing remarks

<img src="working_on_workshop.jpg" width="400px">
<br />
<div width="100%" style="text-align: center">
    Me working on this workshop at midnight; hope you've enjoyed learning about machine learning!
</div>