# Workshop 6 - Model Tuning & Optimization

In this workshop we'll take a second look at one of the models we trained last week, this time training it with a range of different hyperparameters. We'll start with a naive search, then progress to more advanced search methods and nested cross-validation.

For this whole tutorial we'll just stick to a single model: **Random forest on our Kmer count dataset**

Let's run through the steps together (there are some questions and some blanks to fill in as we go).

## Imports

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
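import sklearn
from sklearn import metrics, model_selection # Make sure sklearn.metrics and sklearn.model_selection are loaded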
from sklearn import linear_model
from sklearn import tree
from sklearn import ensemble
from tensorflow import keras
import bayes_opt

## 1. Load Data


For this workshop please download the latest:
- `train_test_data` folder and put it within `data/`

Key for data:
- train_kmers = kmer counts for training data
- test_kmers = kmer counts for testing data
- y_train = array of S/R target values
- y_train_ids = array of genome_ids in order of y_train
- y_test_ids = array of genome_ids in order of y_test

In [2]:
seed = 130

def load_data():
    """
    Load the data needed for Workshop 6
    """
    # Load Kmer data
    train_kmers = np.load('../data/train_test_data/train_kmers.npy', allow_pickle=True)
    test_kmers = np.load('../data/train_test_data/test_kmers.npy', allow_pickle=True)

    # Load target data & IDs
    y_train = np.load('../data/train_test_data/y_train.npy', allow_pickle=True)
    y_train_ids = np.load('../data/train_test_data/train_ids.npy', allow_pickle=True).astype(str)
    y_test_ids = np.load('../data/train_test_data/test_ids.npy', allow_pickle=True).astype(str)
    
    return train_kmers, test_kmers, y_train, y_train_ids, y_test_ids

X_train_kmers, X_test_kmers, y_train, y_train_ids, y_test_ids = load_data()
y_train = y_train.reshape(-1) # convert to vector

## 2. Cross validation (recap)

Just to quickly recap on how to implement cross validation:
1. Pick a value of K
2. Shuffle Data
3. Generate K splits with a different 1/K as validation each time

In [2]:
K = 3
kfold = sklearn.model_selection.KFold(
    n_splits = K,
    shuffle = True, # Want to shuffle as seen in slides
    random_state = seed, # To ensure reproducible results
)

kfold_dfs = {}
val_idx = {}
for i, (train_index, val_index) in enumerate(kfold.split(X_train_kmers)):

    # Store val index for reference
    val_idx[i] = val_index
    
    # Can either train models directly here or save out the data for future training
    kfold_dfs[i] = (X_train_kmers[train_index], X_train_kmers[val_index], y_train[train_index], y_train[val_index])

In [3]:
print(f"Validation split 0: {val_idx[0][0:10]}")
print(f"Validation split 1: {val_idx[1][0:10]}")
print(f"Validation split 2: {val_idx[2][0:10]}")
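To see what `KFold` is doing for us, the three recap steps can be sketched by hand with NumPy. This is only an illustration: `manual_kfold_indices` is a hypothetical helper (not part of sklearn), and it will not reproduce the exact splits that `KFold` generates, even with the same seed.

```python
import numpy as np

def manual_kfold_indices(n_samples, k, seed):
    """Return a list of (train_idx, val_idx) pairs for K-fold CV."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # Step 2: shuffle the row indices
    folds = np.array_split(shuffled, k)     # Step 3: cut into K roughly equal chunks
    splits = []
    for i in range(k):
        val_idx = folds[i]                  # A different chunk is held out each time
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train_idx, val_idx))
    return splits

# Step 1: pick K (3 here, on a toy dataset of 10 rows)
splits = manual_kfold_indices(n_samples=10, k=3, seed=130)
for i, (train_idx, val_idx) in enumerate(splits):
    print(f"Fold {i}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

Every row ends up in exactly one validation fold, which is the property we rely on when averaging scores across folds.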

## 3. Grid Search

Brute force approach - let's pick a set of parameters and search across all possible combinations
- Optimization can be a slow process (training multiple models)
- We'll keep the parameter space small to allow speedy demos
- For now we'll just use a single fold (until we get to nested demo)
- We now have validation data! (Something we were missing last week)

In [4]:
# Take single fold
X_train_fold_0 = kfold_dfs[0][0]
X_val_fold_0 = kfold_dfs[0][1]
y_train_fold_0 = kfold_dfs[0][2]
y_val_fold_0 = kfold_dfs[0][3]

In [5]:
# Set up parameter grid
n_estimator_grid = --- # Integer-valued with a wide range (need to decide on grid size)
max_depth_grid = --- # Integer-valued with a wide range (need to decide on grid size)

# Manual Approach:
model_perf = {}
for n_estimator in n_estimator_grid:
    for max_depth in max_depth_grid:

        # TRAIN MODEL AND RECORD PERFORMANCE HERE
        cur_model = ---
        
        # Fit on fold train data
        cur_model.fit(---)

        # Evaluate on fold validation data
        model_perf[(n_estimator, max_depth)] = sklearn.metrics.balanced_accuracy_score(
            ---
        )


In [6]:
model_perf
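If you get stuck on the blanks, here is one possible way the manual loop could look. It is a sketch under assumptions: the data is a synthetic stand-in built with `make_classification` (not the Kmer dataset), and the grid values are illustrative choices rather than recommended ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one train/validation fold (assumption: not the workshop data)
X, y = make_classification(n_samples=200, n_features=20, random_state=130)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=130)

# Illustrative grid values (assumption: choose your own for the exercise)
n_estimator_grid = [5, 10, 20]
max_depth_grid = [2, 5, 10]

model_perf = {}
for n_estimator in n_estimator_grid:
    for max_depth in max_depth_grid:
        # Train one model per parameter combination
        model = RandomForestClassifier(
            n_estimators=n_estimator, max_depth=max_depth, random_state=130
        )
        model.fit(X_tr, y_tr)
        # Score on the held-out validation rows only
        model_perf[(n_estimator, max_depth)] = balanced_accuracy_score(
            y_val, model.predict(X_val)
        )

# Pick the combination with the highest validation score
best_params = max(model_perf, key=model_perf.get)
print(best_params, model_perf[best_params])
```

Note that the number of models grows multiplicatively: 3 values for each of 2 parameters already means 9 fits.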

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. Which model seems best?

</div>

#### Sklearn implementation

- We don't need to write out the above manually for sklearn models
- GridSearchCV will do a grid search across parameters AND use CV to select the best averaged across CV folds
- Above we just used a single fold

In [8]:
## DON'T RUN FOR DEMO (takes ~1-2mins)

# Sklearn can do this for us (FOR SKLEARN BASED MODELS)
rfc_cv = sklearn.model_selection.GridSearchCV(
    estimator = ensemble.RandomForestClassifier(random_state = seed),
    param_grid = {
        "n_estimators" : n_estimator_grid,
        "max_depth": max_depth_grid,
    },
    cv = 3, # Use 3 fold CV (will take the best parameters averaged across folds)
)

# Fit the model
rfc_cv.fit(X_train_kmers, y_train)  # Use the full training data (grid search CV will do the splitting for us)

In [9]:
rfc_cv.best_estimator_
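Beyond `best_estimator_`, the fitted search object also exposes the per-combination scores averaged across folds. The sketch below shows how these might be inspected; it assumes a synthetic dataset and a tiny illustrative grid. One difference from the cell above: sklearn scores classifiers with plain accuracy by default, so the sketch passes `scoring="balanced_accuracy"` to match the metric used in the manual loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the workshop data)
X, y = make_classification(n_samples=200, n_features=20, random_state=130)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=130),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 5]},  # illustrative values
    cv=3,
    scoring="balanced_accuracy",
)
search.fit(X, y)

# Best combination and its mean score across the 3 folds
print(search.best_params_)
print(search.best_score_)

# Mean and spread of the validation score for every combination tried
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, round(mean, 3), round(std, 3))
```

Looking at the spread across folds is a useful sanity check: if two combinations differ by less than the fold-to-fold variation, the "best" one may just be noise.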

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. Same model! This is good, but did we really need to use CV then?

</div>

## 4. Random Search

Rather than search all parameter combinations - let's see if we can do as good a job with random searching
- Won't look at a manual implementation
- Similar except we'd use random sampling instead of a grid
- Sklearn has an implementation for us again

Additionally here we're explicitly using Balanced Accuracy as our "scorer" when assessing models

In [7]:
# Set up parameter distributions
n_estimator_distribution = --- # Uniform integer sample from 1 -> 20 (a forest needs at least 1 tree)
max_depth_distribution = --- # Uniform integer sample from 1 -> 10

rfc_random_cv = sklearn.model_selection.RandomizedSearchCV(
    estimator = ensemble.RandomForestClassifier(random_state = seed),
    param_distributions = {
        "n_estimators" : n_estimator_distribution,
        "max_depth": max_depth_distribution,
    },
    n_iter = 4, # Sample 4 times from distribution
    cv = 2, # Use 2 fold CV (for speed, in reality we'd want to set to some higher number 5/10 etc.)
    scoring = sklearn.metrics.make_scorer(sklearn.metrics.balanced_accuracy_score), # Use balanced accuracy to score
    random_state = seed, # To ensure the same parameters are sampled on every run
)

# Fit the model
rfc_random_cv.fit(X_train_kmers, y_train)  # Use the full training data (RandomizedSearchCV will do the splitting for us)

In [8]:
rfc_random_cv.best_estimator_

In [9]:
rfc_random_cv.best_score_
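One possible way to define the two distributions is with `scipy.stats.randint`, which samples integers uniformly. The detail to watch is that `high` is exclusive, so the bounds below are an assumption chosen to cover 1 to 20 trees and depths 1 to 10. The sketch draws samples directly so we can check the range before handing the distributions to a search.

```python
from scipy import stats

# high is exclusive, so high=21 gives integers 1..20 and high=11 gives 1..10
n_estimator_distribution = stats.randint(low=1, high=21)
max_depth_distribution = stats.randint(low=1, high=11)

# Draw samples to confirm the ranges (RandomizedSearchCV samples in a similar way)
n_est_samples = n_estimator_distribution.rvs(size=1000, random_state=130)
depth_samples = max_depth_distribution.rvs(size=1000, random_state=130)

print("n_estimators range:", n_est_samples.min(), "to", n_est_samples.max())
print("max_depth range:", depth_samples.min(), "to", depth_samples.max())
```

Starting at `low=1` matters because a random forest with zero trees is not a valid model and the fit would fail.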

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Very similar to our grid search but with far fewer models fit!

</div>

## 5. Bayesian Optimization

Bayesian Optimization ends up looking very similar to Random Search
- We specify a bounded range for each parameter
- It searches within these bounds, using the results so far to choose the next point to try

Use the Bayesian Optimization package (bayes_opt) to do the heavy lifting
- For details on implementation check out their Github page:
- https://github.com/bayesian-optimization/BayesianOptimization

Again for speed we're just going to use a single fold - there is no inbuilt CV implementation here so we'd need to build the loop ourselves

In [10]:
# Main difference is we need to provide a single function to optimize
def random_forest_model_fit(n_est, max_depth):
    # Specify the model with params
    rfc = ---
    
    # Fit the model
    rfc.fit(X_train_fold_0, y_train_fold_0)

    # Evaluate the model and return the evaluation score
    score = ---
    
    return score

In [11]:
# Bounded region of parameter space
parameter_limits = {'n_est': (1, 20), 'max_depth': (1, 10)}

optimizer = bayes_opt.BayesianOptimization(
    f = ---,
    pbounds = ---,
    random_state=seed,
)

In [12]:
# Fit the model using our custom optimizer
optimizer.maximize(
    init_points=4, # Number of random starting points (recommend at least 5)
    n_iter=10, # Number of searches to perform (recommend at least 10+ to allow search to optimize)
)
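One thing to watch when filling in the objective function: the optimizer proposes continuous (float) values within the bounds, while `n_estimators` and `max_depth` must be integers. A possible way to handle this is to round inside the function. The sketch below assumes a synthetic dataset and uses a hypothetical function name, `random_forest_objective`; it calls the function directly rather than through `bayes_opt`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one train/validation fold (assumption: not the workshop data)
X, y = make_classification(n_samples=200, n_features=20, random_state=130)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=130)

def random_forest_objective(n_est, max_depth):
    """Score one (n_est, max_depth) point; inputs may arrive as floats."""
    rfc = RandomForestClassifier(
        n_estimators=int(round(n_est)),    # Round to the nearest valid integer
        max_depth=int(round(max_depth)),
        random_state=130,
    )
    rfc.fit(X_tr, y_tr)
    # Higher is better, which suits an optimizer that maximizes
    return balanced_accuracy_score(y_val, rfc.predict(X_val))

# Call it the way an optimizer would: with float-valued parameters
score = random_forest_objective(n_est=7.6, max_depth=3.2)
print(score)
```

Once `optimizer.maximize(...)` has run, `optimizer.max` should hold the best score and the parameters that produced it (remember these will also be floats and need the same rounding).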

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Best performance yet and fewer search points! What are our optimal parameters?

</div>

## 6. Nested CV

So far we've been training models individually and using CV to fairly search hyperparameters
- What if we also want to compare models?
- What data do we use for model comparison?

OR - even if we don't want to compare models, how well is our model doing?
- We only have the test set with which to benchmark our "best model"
- We don't get an idea of uncertainty, or of how we'd expect our model to perform on another random split of the data

So let's try nesting our CV using a single model!

In [13]:
# First build a manual K-fold loop using K=3
K = 3
kfold = sklearn.model_selection.KFold(
    n_splits = K,
    shuffle = True, # Want to shuffle as seen in slides
    random_state = seed, # To ensure reproducible results
)

fold_perf = {}

# Loop through each of our 3 outer folds one at a time
for i, (train_index, val_index) in enumerate(kfold.split(X_train_kmers)):

    X_train_outer, X_val_outer, y_train_outer, y_val_outer  = (
        X_train_kmers[train_index], X_train_kmers[val_index], y_train[train_index], y_train[val_index]
    )
    
    # Use Sklearn to do an inner CV loop for us just on the outer train data (in this case using K=2)
    n_estimator_distribution = stats.randint(low=1, high=20) # low must be >= 1: a forest needs at least one tree (high is exclusive)
    max_depth_distribution = stats.randint(low=1, high=10)
    
    rfc_random_cv = sklearn.model_selection.RandomizedSearchCV(
        estimator = ensemble.RandomForestClassifier(random_state = seed),
        param_distributions = {
            "n_estimators" : n_estimator_distribution,
            "max_depth": max_depth_distribution,
        },
        n_iter = 4, # Sample 4 times from distribution
        cv = 2, # Use 2 fold CV (for speed, in reality we'd want to set to some higher number 5/10 etc.)
        scoring = sklearn.metrics.make_scorer(sklearn.metrics.balanced_accuracy_score), # Use balanced accuracy to score
        random_state = seed, # To ensure the same parameters are sampled on every run
    )
    
    # Fit the model
    rfc_random_cv.fit(X_train_outer, y_train_outer)  # Here we're using just the train_outer from our manual K-fold split

    # Assess the best model using the outer validation data
    y_pred_outer = rfc_random_cv.predict(X_val_outer)

    fold_perf[i] = sklearn.metrics.balanced_accuracy_score(y_val_outer, y_pred_outer)

In [14]:
# Review performance across folds
for i in range(K):
    print(f"Fold {i} Balanced accuracy: {fold_perf[i]}")
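The same nested structure can also be written more compactly by passing the search object itself to `cross_val_score`, which plays the role of the outer loop. The sketch below is an illustration under assumptions (synthetic data, illustrative bounds), not a replacement for the manual loop above. It also summarises the outer scores as a mean and standard deviation, which is the uncertainty estimate a single validation split cannot give us.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic stand-in data (assumption: not the workshop data)
X, y = make_classification(n_samples=200, n_features=20, random_state=130)

# Inner loop: hyperparameter search with 2-fold CV
inner_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=130),
    param_distributions={
        "n_estimators": stats.randint(low=1, high=21),  # integers 1..20
        "max_depth": stats.randint(low=1, high=11),     # integers 1..10
    },
    n_iter=4,
    cv=2,
    scoring="balanced_accuracy",
    random_state=130,
)

# Outer loop: 3 folds, each one re-runs the whole inner search on its train part
outer_cv = KFold(n_splits=3, shuffle=True, random_state=130)
outer_scores = cross_val_score(
    inner_search, X, y, cv=outer_cv, scoring="balanced_accuracy"
)

print("Per-fold scores:", outer_scores)
print("Mean:", outer_scores.mean(), "Std:", outer_scores.std())
```

Each outer score comes from data the inner search never saw, so the mean is a fairer estimate of how the whole "search then fit" procedure performs.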

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. What does this tell us that the previous approach didn't?

</div>

## [Bonus] - Upload best model preds to Kaggle

In [18]:
# Fit a single model on all training data using the best parameters we've found thus far!
rfc_final = ensemble.RandomForestClassifier(
    n_estimators=3, 
    max_depth=6,
    random_state = seed
)
rfc_final.fit(X_train_kmers, y_train)

In [19]:
# Make test predictions and save out as a dataframe
test_preds = rfc_final.predict(X_test_kmers)

# Save
test_preds_df = pd.DataFrame(data={"genome_id":y_test_ids, "y_pred":test_preds})
test_preds_df.to_csv("random_forest_test_preds.csv", index=False) # IMPORTANT: Do not save the index

In [20]:
test_preds_df.head()

| | genome_id | y_pred |
|---|---|---|
| 0 | 562.42833 | R |
| 1 | 562.42739 | R |
| 2 | 562.22823 | S |
| 3 | 562.45646 | S |
| 4 | 562.22547 | S |


## [Bonus] - Neural Networks

- Generally all the above can and does apply to neural networks
- However, given the longer training times and higher resource costs:
    - Nested CV is very rarely performed for NNs
    - A simple single CV loop is sufficient
    - For really large models, a single train/validate split might be as good as you can do
    - In that case, compare parameters on the single validation dataset (this can end up overfitting slightly to that specific data split)