# Workshop 5 - Model Building

In this workshop we're going to leverage the features we generated last week and train a few different models. If you completed the assignment last week you'll already have a high-level idea of the performance we should be aiming for when training our models.

This tutorial focuses on a few different model architectures and how to set up k-fold cross validation.

Let's run through the steps together (there are some questions and some blanks to fill in as we go).

## Imports

In [None]:
import os
from collections import defaultdict

import pandas as pd
import numpy as np
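import sklearn
# Submodules accessed below as sklearn.<module>, imported explicitly so they are guaranteed to be loaded
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.utils.class_weight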
from sklearn import linear_model
from sklearn import tree
from sklearn import ensemble
from tensorflow import keras

## Load Data


For this workshop please download the latest:
- `train_test_data` folder and put within `data/`

Key for data:
- train_pa_genes = presence absence binary features for training data
- test_pa_genes = presence absence binary features for test data
- train_kmers = kmer counts for training data
- test_kmers = kmer counts for testing data
- y_train = array of S/R target values
- y_train_ids = array of genome_ids in order of y_train
- y_test_ids = array of genome_ids in order of y_test

In [None]:
seed = 130

def load_data():
    """
    Load the data needed for Workshop 5
    """
    # Presence absence features
    train_pa_genes = pd.read_csv('../data/train_test_data/train_pa_genes.csv').set_index('genome_id')
    test_pa_genes = pd.read_csv('../data/train_test_data/test_pa_genes.csv').set_index('genome_id')
    
    # Load Kmer data
    train_kmers = np.load('../data/train_test_data/train_kmers.npy', allow_pickle=True)
    test_kmers = np.load('../data/train_test_data/test_kmers.npy', allow_pickle=True)

    # Load target data & IDs
    y_train = np.load('../data/train_test_data/y_train.npy', allow_pickle=True)
    y_train_ids = np.load('../data/train_test_data/train_ids.npy', allow_pickle=True).astype(str)
    y_test_ids = np.load('../data/train_test_data/test_ids.npy', allow_pickle=True).astype(str)

    # Load raw gene data for optional neural network section
    train_gene_alignment = pd.read_csv('../data/train_test_data/train_genes.csv')
    
    return train_pa_genes, test_pa_genes, train_kmers, test_kmers, y_train, y_train_ids, y_test_ids, train_gene_alignment

train_pa_genes, test_pa_genes, X_train_kmers, X_test_kmers, y_train, y_train_ids, y_test_ids, train_gene_alignment = load_data()

## 1. Linear Models

For our first model we're going to try a simple regression-based model. The key limitation of regression is that it only models linear combinations of our input features, which may or may not be sufficient.

If we wanted to use the linear model for inference (reviewing feature importances and understanding the impact of predictors on our response) we'd want to be much more careful about ensuring we're meeting the assumptions of linear regression (see this nice article: https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-regression/simple-linear-regression-assumptions.html)

#### Check our data

- Our target (response) is either S/R so we have a binary prediction
- This means we'll need to use logistic regression

In [None]:
np.unique(y_train)

In [None]:
train_pa_genes.head(3)

#### Convert dataframes to numpy arrays

In [None]:
X_train_pa = np.array(train_pa_genes)
X_test_pa = np.array(test_pa_genes)

#### Build Simple Logistic model

Sklearn has a simple interface for building logistic models: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

By default this model applies regularization; let's try with it switched off first.

In [None]:
logistic_model = ---
logistic_model.fit(---)

In [None]:
y_pred_train_log_pa = logistic_model.predict(X_train_pa)
sklearn.metrics.balanced_accuracy_score(y_train, y_pred_train_log_pa)
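
Balanced accuracy is the mean of the per-class recall, so a model that always predicts the majority class scores 0.5 no matter how imbalanced the data is. The sketch below recomputes it by hand on made-up labels (`y_true_toy`, `y_pred_toy` and the helper `balanced_accuracy` are illustrative names, not workshop data or sklearn functions):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy, imbalanced labels: 8 susceptible (S), 2 resistant (R)
y_true_toy = np.array(['S'] * 8 + ['R'] * 2)
y_pred_toy = np.array(['S'] * 10)  # always predict the majority class

plain_accuracy = float(np.mean(y_true_toy == y_pred_toy))
balanced = balanced_accuracy(y_true_toy, y_pred_toy)
print("Plain accuracy:", plain_accuracy)
print("Balanced accuracy:", balanced)
```

Plain accuracy rewards the majority-class guess here, while balanced accuracy does not, which is why we use it for our imbalanced S/R target.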

#### Try regularizing

- Regularizing adds a penalty to the loss function
- The idea is that it penalizes the model for high weights and so reduces overfitting to the training data

In [None]:
logistic_model = linear_model.LogisticRegression(max_iter=10000, penalty='l2')
logistic_model.fit(X_train_pa, y_train.reshape(-1))

y_pred_train_log_pa = logistic_model.predict(X_train_pa)
sklearn.metrics.balanced_accuracy_score(y_train, y_pred_train_log_pa)
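
To see the effect of the penalty on the weights, here is a minimal sketch on synthetic data (not the workshop features). It assumes we vary the strength through sklearn's `C` parameter, which is the inverse regularization strength; the values `1e4` and `0.01` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the workshop features)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# C is the inverse regularization strength: large C = weak penalty
weak_reg = LogisticRegression(C=1e4, max_iter=10000).fit(X_toy, y_toy)
strong_reg = LogisticRegression(C=0.01, max_iter=10000).fit(X_toy, y_toy)

weak_norm = np.linalg.norm(weak_reg.coef_)
strong_norm = np.linalg.norm(strong_reg.coef_)
print("Weight norm (weak penalty):  ", weak_norm)
print("Weight norm (strong penalty):", strong_norm)
print("Train accuracy (weak, strong):",
      weak_reg.score(X_toy, y_toy), strong_reg.score(X_toy, y_toy))
```

The stronger penalty shrinks the weights. Whether that helps can only be judged on held-out data, which is what the cross validation section is for.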

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. What do we think of these balanced accuracy scores?

</div>

#### Finally we could also use Kmers in exactly the same way
- Both are tabular datasets
- Given we have a lot more kmer features this will take longer to train
- You may also need more regularization to offset the much larger number of features

In [None]:
# logistic_model = linear_model.LogisticRegression(penalty='l2')
# logistic_model.fit(X_train_kmers, y_train.reshape(-1))

## 2. Tree Based Models

Sklearn has a very similar interface for fitting tree-based models.

In this case we'll try:
1. Simple decision tree
2. Ensemble random forest method

Tree-based models are a great fit for binary feature data because of their successive decision-making process, but they also work for both of our tabular feature sets.

In this case let's try using the kmer data.

In [None]:
# Use a decision tree classifier 
tree_model = tree.DecisionTreeClassifier(
    ---
)
tree_model.fit(---)

y_pred_train_tree_kmer = tree_model.predict(---)

In [None]:
# Check our balanced accuracy
sklearn.metrics.balanced_accuracy_score(y_train, y_pred_train_tree_kmer)

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. What would happen if we keep increasing max depth? Are there any alternatives?

</div>
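
One way to explore the question above is to compare train and held-out accuracy as the depth limit grows. This is a sketch on noisy synthetic data, so the numbers say nothing about the kmer features; the depths tried are arbitrary. Other ways to limit tree size in sklearn include `min_samples_leaf` and cost-complexity pruning via `ccp_alpha`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y assigns a random class to 20% of the samples (label noise)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

depth_scores = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    print("max_depth =", depth, "(train, validate) accuracy:", depth_scores[depth])
```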

#### Random Forest

In [None]:
# Use a random forest classifier
rf_model = ensemble.RandomForestClassifier(
   ---
)
rf_model.fit(---)

y_pred_train_rf_kmer = rf_model.predict(---)

In [None]:
# Check our balanced accuracy
sklearn.metrics.balanced_accuracy_score(y_train, y_pred_train_rf_kmer)

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. Why do we see this change in train accuracy?

</div>

## 3. Gradient Boosting

Gradient boosting is generally considered to be the best default choice for many predictive modeling problems from tabular data. It commonly comes out on top during Kaggle competitions for a wide array of datasets. 

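The ideal model will always depend on the data you're using, but gradient boosting is usually a good place to start.

In practice you might wish to use the XGBoost or LightGBM packages (feel free to install and use them for your final project), but here we'll use a simple sklearn booster. Note that `AdaBoostClassifier` implements adaptive boosting, a closely related method; sklearn's gradient boosting estimator is `GradientBoostingClassifier`.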

In [None]:
boost_model = sklearn.ensemble.AdaBoostClassifier(
    estimator = ---, # Can choose any simple base estimator
    n_estimators = 10,
    learning_rate = 2.0, # Another parameter to tune
    algorithm="SAMME",
)
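
The cell above uses `AdaBoostClassifier`. For comparison, sklearn also ships a gradient boosting estimator, `GradientBoostingClassifier`. Here's a minimal sketch on synthetic data (the data and the parameter values are illustrative assumptions, not tuned settings for the workshop):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score

# Synthetic, imbalanced stand-in data (roughly 80% / 20% class split)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, weights=[0.8, 0.2], random_state=0
)

gb_model = GradientBoostingClassifier(
    n_estimators=50,    # number of trees fitted one after another
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=2,        # shallow trees act as weak learners
    random_state=0,
)
gb_model.fit(X_toy, y_toy)

gb_train_score = balanced_accuracy_score(y_toy, gb_model.predict(X_toy))
print("Train balanced accuracy:", gb_train_score)
```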

### Sample Weighting

So far we've just fit basic default models. During your project you'll want to be more careful about tuning parameters and optimizing (we'll cover this next week).

One approach that will be useful across many model types however is sample weighting:
- We know our dataset is imbalanced
- We wish to encourage the model to learn both classes
- To do so we can upweight the minority class and downweight the majority class
- Sklearn can do this for us
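
To make the idea concrete, here is a hand-rolled sketch of the "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`, on made-up labels (`y_toy` is illustrative, not the workshop target):

```python
import numpy as np

# Toy labels: 8 susceptible (S), 2 resistant (R)
y_toy = np.array(['S'] * 8 + ['R'] * 2)

classes, counts = np.unique(y_toy, return_counts=True)

# "Balanced" heuristic: n_samples / (n_classes * class_count)
class_weight = {
    str(c): len(y_toy) / (len(classes) * n) for c, n in zip(classes, counts)
}
toy_weights = np.array([class_weight[str(label)] for label in y_toy])

print(class_weight)
print("Total weight on S:", toy_weights[y_toy == 'S'].sum())
print("Total weight on R:", toy_weights[y_toy == 'R'].sum())
```

Each class ends up with the same total weight, so the minority class counts as much as the majority class when the model is fit.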

In [None]:
sample_weights = sklearn.utils.class_weight.compute_sample_weight(---)

# Check the weights for a few samples
pd.DataFrame(list(zip(y_train[0:10], sample_weights[0:10])), columns=['y_train', 'weight'])

In [None]:
# Fit the booster (train time is a bit too long to demo so let's take only the first 100 samples)
boost_model.fit(X_train_kmers[0:100], y_train[0:100].reshape(-1), sample_weight=sample_weights[0:100])

y_pred_train_boost_kmer = boost_model.predict(X_train_kmers)

In [None]:
# Check our balanced accuracy
sklearn.metrics.balanced_accuracy_score(y_train, y_pred_train_boost_kmer)

## 4. Cross validation

So far we've seen a few different interfaces for training various models but we've only been looking at the training data.

For the test dataset we don't have access to the labels (they've been hidden as part of the final project) so what should we use to assess our models more fairly?

This is where K-fold CV and validation data in general comes into play.

We want to split our training data into train/validate where we hold out a portion of the data for checking model performance whilst tuning.

As usual there are a lot of nice packages where this has already been implemented for us!

#### Train Test Split

- If we just want a single split (one off)
- This randomly splits the data and keeps the rows of X and y aligned between train and validate for us
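
Conceptually, a single random split just shuffles the row indices once and uses the same indices for the features and the target. A rough numpy sketch of that idea on toy arrays (sklearn's own implementation differs in details such as rounding; the 25% hold-out is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(130)

# Toy stand-ins: 10 samples, 3 features, labels in the same row order
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.array(['S'] * 7 + ['R'] * 3)

# Shuffle the row indices once, then reuse the same indices for X and y
shuffled = rng.permutation(len(X_toy))
n_validate = int(0.25 * len(X_toy))  # hold out 25% (rounded down here)
val_idx, train_idx = shuffled[:n_validate], shuffled[n_validate:]

X_tr, X_val = X_toy[train_idx], X_toy[val_idx]
y_tr, y_val = y_toy[train_idx], y_toy[val_idx]
print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape)
```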

In [None]:
X_train_split, X_validate_split, y_train_split, y_validate_split = sklearn.model_selection.train_test_split(
    ---
    random_state=seed,
)

In [None]:
print(X_train_split.shape, y_train_split.shape)

In [None]:
print(X_validate_split.shape, y_validate_split.shape)

#### K-FOLD CV

In reality we want to make multiple splits so we can train multiple models.

This allows us to avoid overfitting to any specific split of the data.

In [None]:
K = 3
kfold = sklearn.model_selection.KFold(
    ---
    random_state = seed, # To ensure reproducible results (only valid together with shuffle=True)
)

kfold_dfs = {}
for --- in enumerate(kfold.split(X_train_pa)):
    
    # Can either train models directly here or save out the data for future training
    kfold_dfs[i] = (X_train_pa[train_index], X_train_pa[val_index], y_train[train_index], y_train[val_index])

In [None]:
print("Fold 0 X Train: ", kfold_dfs[0][0].shape)
print("Fold 0 y Train: ", kfold_dfs[0][2].shape)
print("Fold 0 X Validate: ", kfold_dfs[0][1].shape)
print("Fold 0 y Validate: ", kfold_dfs[0][3].shape)

In [None]:
print("Fold 1 X Train: ", kfold_dfs[1][0].shape)
print("Fold 1 y Train: ", kfold_dfs[1][2].shape)
print("Fold 1 X Validate: ", kfold_dfs[1][1].shape)
print("Fold 1 y Validate: ", kfold_dfs[1][3].shape)

#### Recommend stratifying on our target
- When using this in your project it would also be useful to stratify on the target variable
- This ensures we have an even balance of S/R in each split
- Avoids having any individual fold with an odd balance of S/R (e.g. missing any R examples)
- You can use StratifiedKFold for this: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
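
A minimal sketch of stratified splitting on toy, imbalanced labels (the arrays below are made up; in the project you would pass your own features and `y_train`):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced stand-in data: 12 susceptible (S), 3 resistant (R)
X_toy = np.arange(30).reshape(15, 2)
y_toy = np.array(['S'] * 12 + ['R'] * 3)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=130)

resistant_per_fold = []
# Unlike KFold, split() also needs the target so it can balance the classes
for fold, (train_index, val_index) in enumerate(skf.split(X_toy, y_toy)):
    n_resistant = int(np.sum(y_toy[val_index] == 'R'))
    resistant_per_fold.append(n_resistant)
    print(f"Fold {fold}: {len(train_index)} train, {len(val_index)} validate, "
          f"{n_resistant} R in validate")
```

With three R examples and three folds, each validation fold gets exactly one R example, which a plain random split does not guarantee.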

## 5. [BONUS] Convolutional Neural Network

If you're interested in trying to use the sequence features with a CNN, here is an example of both featurization and model training:

- This particular model doesn't learn effectively from the data (it predicts majority class)
- It's going to be challenging to get it to learn but will be an interesting task to try!
- This can act as a starting point for future experimentation
- Much of the work will need to be in the featurization step: how to combine genes

I'm happy to schedule an office hour session to talk through if interested!

You may wish to use Google Colab or another cloud-based platform with scalable resources to ensure you can train the model without running out of RAM.

In [None]:
# 1. Take only the unique genes which weren't redundant (based on presence absence)
# This is just to reduce the data size and make it easier to train, ideally we'd use all genes here
subset_gene_data = train_gene_alignment[train_gene_alignment.res_gene.isin(set(train_pa_genes.columns))].copy()

In [None]:
# 2. Find maximum length and set padding
subset_gene_features = subset_gene_data.groupby('genome_id', sort=False)['ref_gene_str'].sum()  
pad_char = 0
max_length = np.max([len(x) for x in subset_gene_features])

In [None]:
# Functions to featurize sequence data
def encode_seq(seq):
    label_enc = {'A':1, 'C':2, 'G':3, 'T':4}
    return [label_enc.get(x.upper(), 5) for x in seq]

def featurize_variant_sequences(variant_genes, amr_max_length, pad_char=0):
    gene_features = variant_genes.groupby('genome_id', sort=False)['ref_gene_str'].sum()
    gene_features = [encode_seq(x) for x in gene_features]
    # Use the length passed in (amr_max_length) rather than the global max_length
    gene_features = keras.utils.pad_sequences(gene_features, maxlen=amr_max_length, padding='post', value=pad_char)
       
    return gene_features

In [None]:
# 3. Featurize the data into our simple encoding
X_seq = featurize_variant_sequences(subset_gene_data, max_length)
X_seq.shape

#### Define a simple CNN

In [None]:
# Define CNN
input_layer = keras.layers.Input(shape=(X_seq.shape[-1], 1))
cnn_layer = keras.layers.Conv1D(
    20,
    11,
    strides=1,
    padding='same',
    activation='relu'
)(input_layer)
pool = keras.layers.MaxPool1D(pool_size=3)(cnn_layer)
cnn_layer2 = keras.layers.Conv1D(
    30,
    15,
    strides=1,
    padding='same',
    activation='relu'
)(pool)
pool2 = keras.layers.MaxPool1D(pool_size=5)(cnn_layer2)
cnn_layer3 = keras.layers.Conv1D(
    50,
    21,
    strides=1,
    padding='same',
    activation='relu'
)(pool2)
pool3 = keras.layers.MaxPool1D(pool_size=7)(cnn_layer3)  # pool the third conv layer (was cnn_layer2, leaving cnn_layer3 unused)
final_pool = keras.layers.Flatten()(pool3)
dense = keras.layers.Dense(20, activation='relu')(final_pool)
output = keras.layers.Dense(1, activation='sigmoid')(dense)

cnn = keras.Model(inputs=input_layer, outputs=output)

In [None]:
# Display model structure
cnn.summary()

In [None]:
# Compile model and select optimizer
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Encode y_train to numeric binary 1/0 from S/R
le = sklearn.preprocessing.LabelEncoder()
y_train_binary = le.fit_transform(y_train.reshape(-1))

In [None]:
# Fit the model
history = cnn.fit(X_seq, y_train_binary, validation_split=0.2, batch_size=16, epochs=10)

### This looks like it's started to learn something!

It's clearly lagging behind in terms of validation accuracy though and is potentially overfitting!

A few ideas:
- Use an embedding layer (e.g. `keras.layers.Embedding`) rather than simple numeric encoding
- Use all the genes!
- Pad the genes individually and then concatenate them to better preserve position information
- Try different pooling strategies
- Use dropout for regularizing
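
As a starting point for the per-gene padding idea, here is a small numpy sketch. The input layout (`toy_genomes`, a dict of gene name to sequence per genome), the gene names and the helper `pad_to` are hypothetical; you would need to build the equivalent from `train_gene_alignment` yourself:

```python
import numpy as np

LABEL_ENC = {'A': 1, 'C': 2, 'G': 3, 'T': 4}

def encode_seq(seq):
    # Same simple numeric encoding as above; 5 marks any other character
    return [LABEL_ENC.get(base.upper(), 5) for base in seq]

def pad_to(encoded, length, pad_value=0):
    """Right-pad (or truncate) an encoded sequence to a fixed length."""
    return (encoded + [pad_value] * length)[:length]

# Hypothetical toy input: genome -> {gene name: sequence}
toy_genomes = {
    'genome_1': {'geneA': 'ACGT', 'geneB': 'GG'},
    'genome_2': {'geneA': 'AC', 'geneB': 'GGTTA'},
}
gene_order = ['geneA', 'geneB']

# The longest observed sequence per gene sets that gene's fixed-width block
gene_lengths = {
    g: max(len(genes.get(g, '')) for genes in toy_genomes.values())
    for g in gene_order
}

rows = []
for genes in toy_genomes.values():
    row = []
    for g in gene_order:
        # A missing gene becomes a block made entirely of padding
        row.extend(pad_to(encode_seq(genes.get(g, '')), gene_lengths[g]))
    rows.append(row)

X_padded = np.array(rows)
print(X_padded)
```

Because every gene occupies the same columns in every genome, a given position in the input always refers to the same gene, which is not the case when the raw strings are concatenated before padding.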

You may use this architecture or feel free to start from scratch using your own CNN approach.

You could also try RNN or other convolution approaches!