## RTP experiments

In this notebook, we experiment with the full range of RTP parameters and model hyperparameters, then distill the best results using both Support Vector Machine and Logistic Regression models.

Running this notebook in full takes approximately 9 hours.

In [1]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import json
import time

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

import RTP.TemporalAbstraction as ta
from  RTP.RTP_classifier import RTPclf, preprocess
from RTP.Config import Options as opts

from tools.tools import train_test_split, evaluate

random_state = 2022

#### Here is the code to drive our experimentation



In [2]:
def experiment(model, MSS_pos, MSS_neg, combs, desc, logfile):
    '''
    Grid search through RTP parameters.

    Inputs: model - base model (instantiated with hyperparameters)
            MSS_pos, MSS_neg - Multivariate State Sequences for positive and
                               negative data
            combs - iterable of (max_gap, min_support_pos, min_support_neg)
            desc - label shown on the progress bar
            logfile - log file

    Output: dataframe with the results of each combination
    '''
    results = []
    for comb in tqdm(combs, desc=desc):
        max_gap, min_support_pos, min_support_neg = comb
        rtp_params = {'max_gap': max_gap, 
                      'min_support_pos': min_support_pos, 
                      'min_support_neg': min_support_neg}

        clf = RTPclf(model, rtp_params, logfile)
        metrics = clf.trainCV(MSS_pos, MSS_neg, 5, verbose=False)
        metrics.update(rtp_params)
        results.append(metrics)

    output = pd.DataFrame(results, 
                          columns=['max_gap', 'min_support_pos', 'min_support_neg', 
                                   'accuracy', 'precision', 'recall', 'f1', 'auc'])
    return output
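
As a side note on the final step, passing `columns=` to `pd.DataFrame` together with a list of dicts selects and orders the listed keys. The sketch below uses a single record with values copied from one row of this notebook's SVM results; that `trainCV` returns a dict with exactly these metric keys is an assumption inferred from the column list.

```python
import pandas as pd

# One record shaped like the dict after metrics.update(rtp_params)
# (assumed keys; values copied from a row of the SVM results).
record = {
    "accuracy": 0.808333, "precision": 0.753986, "recall": 0.916898,
    "f1": 0.8275, "auc": 0.808031,
    "max_gap": 5.0, "min_support_pos": 0.1, "min_support_neg": 0.1,
}

columns = ["max_gap", "min_support_pos", "min_support_neg",
           "accuracy", "precision", "recall", "f1", "auc"]

# The column order follows `columns`, not the dict's insertion order.
output = pd.DataFrame([record], columns=columns)
print(list(output.columns))
```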
 

#### Load the positive and negative MIMIC-III data sets

We will ONLY work with the training sets in this notebook, to avoid data leakage!

In [3]:
pos_path = os.path.join('./data', 'pos_train.csv')
pos_train = pd.read_csv(pos_path)

neg_path = os.path.join('./data', 'neg_train.csv')
neg_train = pd.read_csv(neg_path)

Generate Multivariate State Sequences (this may take a minute or so)

In [4]:
MSS_pos, MSS_neg = preprocess(pos_train, neg_train)

100%|██████████| 361/361 [00:14<00:00, 24.51it/s]
100%|██████████| 361/361 [00:11<00:00, 32.32it/s] 


Get combinations of RTP parameters

In [5]:
max_gaps = np.arange(4, 11)
min_supports_pos = np.linspace(0.1, 0.3, 5, dtype=float)
min_supports_neg = np.linspace(0.1, 0.3, 5, dtype=float)
combs = np.array(np.meshgrid(max_gaps, min_supports_pos, min_supports_neg)).T.reshape(-1, 3)
print(f'Number of combinations of RTP parameters: {combs.shape[0]}')

Number of combinations of RTP parameters: 175
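
The same grid can also be enumerated with `itertools.product`, which keeps `max_gap` as an integer instead of upcasting it to float the way the stacked `meshgrid` array does. This is a minimal sketch of an alternative, not the notebook's code: the combinations come out in a different order, and `combs_alt` is a name introduced here for illustration.

```python
from itertools import product

import numpy as np

max_gaps = range(4, 11)                   # 4, 5, ..., 10
min_supports = np.linspace(0.1, 0.3, 5)   # five evenly spaced thresholds

# Each entry is a (max_gap, min_support_pos, min_support_neg) tuple.
# combs_alt is a hypothetical name, not used elsewhere in the notebook.
combs_alt = [
    (gap, float(pos), float(neg))
    for gap, pos, neg in product(max_gaps, min_supports, min_supports)
]

print(len(combs_alt))  # 7 * 5 * 5 = 175
```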


Prepare to accumulate best parameters

In [6]:
best_params = {}

### Trials with SVM

We run a grid search over all the candidate RTP parameters plus the candidate kernel hyperparameters, then store the results sorted in descending order by F1 score.
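
For reference, the F1 score used for ranking is the harmonic mean of precision and recall. The sketch below is illustrative only: `f1_from` is a hypothetical helper, not part of RTP or scikit-learn, and the inputs are the precision and recall of the top SVM row reported further down in this notebook. How `trainCV` aggregates the metric across folds is not shown here.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_from(0.75395, 0.925208)
print(round(f1, 4))  # 0.8308
```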

In [7]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
results_svm = pd.DataFrame(columns=['kernel', 'max_gap', 'min_support_pos', 'min_support_neg',
                                    'accuracy', 'precision', 'recall', 'f1', 'auc'])

logfile = os.path.join('./RTP/logs', 'svm_experiments.log')
if os.path.exists(logfile):
    os.remove(logfile)

start_time = time.time()
for kernel in kernels:
    model = SVC(gamma=0.1, C=1.0, kernel=kernel)
    desc = f'Using kernel {kernel}'
    results = experiment(model, MSS_pos, MSS_neg, combs, desc, logfile)
    results['kernel'] = kernel
    results_svm = pd.concat([results_svm, results])

elapsed_time = time.time() - start_time
hr = int(elapsed_time // 3600)
elapsed_time %= 3600
mins = int(elapsed_time // 60)  # named `mins` to avoid shadowing the built-in min()
sec = elapsed_time % 60
print(f'Total time: {hr:02d}:{mins:02d}:{sec:0.2f}')

results_svm.sort_values(by='f1', ascending=False, inplace=True)
results_path = os.path.join('./RTP/results', 'results_svm.csv')
results_svm.to_csv(results_path, index=False)

# view the best five
results_svm.head()

Using kernel linear: 100%|██████████| 175/175 [1:35:16<00:00, 32.67s/it]
Using kernel poly: 100%|██████████| 175/175 [1:35:05<00:00, 32.60s/it]
Using kernel rbf: 100%|██████████| 175/175 [1:33:47<00:00, 32.16s/it]
Using kernel sigmoid: 100%|██████████| 175/175 [1:36:54<00:00, 33.23s/it]

Total time: 06:21:5.15





| | kernel | max_gap | min_support_pos | min_support_neg | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|---|---|---|---|
| 25 | poly | 9.0 | 0.1 | 0.1 | 0.811111 | 0.75395 | 0.925208 | 0.830846 | 0.810793 |
| 30 | poly | 10.0 | 0.1 | 0.1 | 0.811111 | 0.75395 | 0.925208 | 0.830846 | 0.810793 |
| 5 | poly | 5.0 | 0.1 | 0.1 | 0.808333 | 0.753986 | 0.916898 | 0.8275 | 0.808031 |
| 65 | poly | 10.0 | 0.1 | 0.15 | 0.808333 | 0.755149 | 0.914127 | 0.827068 | 0.808039 |
| 60 | poly | 9.0 | 0.1 | 0.15 | 0.808333 | 0.755149 | 0.914127 | 0.827068 | 0.808039 |
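
The total printed above shows the seconds without zero padding (`5.15`) because the format spec `0.2f` sets no minimum width. A possible helper is sketched here, assuming a fixed `HH:MM:SS.ss` layout is wanted; `format_elapsed` is a hypothetical name and is not used elsewhere in the notebook.

```python
def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS.ss with zero padding."""
    # Work in whole centiseconds so rounding cannot produce "60.00" seconds.
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    return f"{hours:02d}:{minutes:02d}:{rem / 100:05.2f}"


print(format_elapsed(6 * 3600 + 21 * 60 + 5.15))  # 06:21:05.15
```

The same function could serve both timing cells, since the arithmetic is identical in each.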


Extract and save the best parameters/hyperparameters

In [8]:
kernel, max_gap, min_support_pos, min_support_neg = results_svm.iloc[0][['kernel', 
                                                                         'max_gap', 
                                                                         'min_support_pos', 
                                                                         'min_support_neg']]
best_params['SVM'] = {'kernel': kernel,
                      'max_gap': max_gap,
                      'min_support_pos': min_support_pos,
                      'min_support_neg': min_support_neg}


### Trials with Logistic Regression

Using only the default hyperparameters, we grid search across all candidate RTP parameters.

In [9]:
model = LogisticRegression(penalty='l2', dual=False, tol=1e-4, C=1.0, 
                           solver='lbfgs', max_iter=100)
                           
logfile = os.path.join('./RTP/logs', 'lr_experiments.log')
if os.path.exists(logfile):
    os.remove(logfile)

start_time = time.time()
results_lr = experiment(model, MSS_pos, MSS_neg, combs, 'Using defaults', 
                        logfile)

elapsed_time = time.time() - start_time
hr = int(elapsed_time // 3600)
elapsed_time %= 3600
mins = int(elapsed_time // 60)  # named `mins` to avoid shadowing the built-in min()
sec = elapsed_time % 60
print(f'Total time: {hr:02d}:{mins:02d}:{sec:0.2f}')

results_lr.sort_values(by='f1', ascending=False, inplace=True)
results_path = os.path.join('./RTP/results', 'results_lr.csv')
results_lr.to_csv(results_path, index=False)

# View the best five
results_lr.head()

Using defaults: 100%|██████████| 175/175 [1:35:30<00:00, 32.75s/it]

Total time: 01:35:30.73





| | max_gap | min_support_pos | min_support_neg | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|---|---|---|
| 174 | 10.0 | 0.3 | 0.3 | 0.7875 | 0.772251 | 0.817175 | 0.794078 | 0.787417 |
| 134 | 9.0 | 0.3 | 0.25 | 0.7875 | 0.772251 | 0.817175 | 0.794078 | 0.787417 |
| 139 | 10.0 | 0.3 | 0.25 | 0.7875 | 0.772251 | 0.817175 | 0.794078 | 0.787417 |
| 169 | 9.0 | 0.3 | 0.3 | 0.7875 | 0.772251 | 0.817175 | 0.794078 | 0.787417 |
| 9 | 5.0 | 0.3 | 0.1 | 0.786111 | 0.770235 | 0.817175 | 0.793011 | 0.786025 |
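
The top four rows above have identical metrics, so which combination ends up first depends on how the sort orders ties. One option, sketched here as an assumption about what a deterministic choice could look like, is to add explicit secondary sort keys. Preferring the smallest `max_gap` among ties is an illustrative choice, not something the notebook specifies. The rows are copied from the table above.

```python
import pandas as pd

# Three of the rows reported above (two tied on f1, one lower).
rows = pd.DataFrame(
    {
        "max_gap": [10.0, 9.0, 5.0],
        "min_support_pos": [0.3, 0.3, 0.3],
        "min_support_neg": [0.3, 0.25, 0.1],
        "f1": [0.794078, 0.794078, 0.793011],
    }
)

# Sort by f1 (descending), then break ties on the RTP parameters (ascending).
ranked = rows.sort_values(
    by=["f1", "max_gap", "min_support_pos", "min_support_neg"],
    ascending=[False, True, True, True],
).reset_index(drop=True)

print(ranked.iloc[0].to_dict())
```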


In [10]:
max_gap, min_support_pos, min_support_neg = results_lr.iloc[0][['max_gap', 
                                                                'min_support_pos', 
                                                                'min_support_neg']]
best_params['LogisticRegression'] = {'max_gap': max_gap,
                                     'min_support_pos': min_support_pos,
                                     'min_support_neg': min_support_neg}

#### Save best parameters as JSON file for evaluation process

In [11]:
param_path = os.path.join('./RTP/results', 'best_parameters.json')
with open(param_path, 'w') as FP:
    json.dump(best_params, FP)
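
A later evaluation step would presumably read this file back. The sketch below round-trips a dictionary of the same shape through a temporary file so that it runs on its own. The values are the top rows reported in this notebook, and the cast of `max_gap` to `int` is an assumption about what downstream code might want, since the grid stores it as a float.

```python
import json
import os
import tempfile

best_params = {
    "SVM": {"kernel": "poly", "max_gap": 9.0,
            "min_support_pos": 0.1, "min_support_neg": 0.1},
    "LogisticRegression": {"max_gap": 10.0,
                           "min_support_pos": 0.3, "min_support_neg": 0.3},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary stand-in for ./RTP/results/best_parameters.json
    param_path = os.path.join(tmp_dir, "best_parameters.json")
    with open(param_path, "w") as fp:
        json.dump(best_params, fp, indent=2)
    with open(param_path) as fp:
        loaded = json.load(fp)

svm_params = loaded["SVM"]
max_gap = int(svm_params["max_gap"])  # assumed cast; the JSON stores 9.0
print(svm_params["kernel"], max_gap)  # poly 9
```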