# Model Comparison

I created two classes: one for the dataset and one for the model.
These are the steps to run training, testing, and prediction successfully:

 1. Load datasets
 2. Apply transformations and feature engineering to the dataset (optional)
     1. Choose variables to be used for training the model (optional)
 3. Load model from SKLearn
 4. Run the simple test
 
Below is an example with the model that I had to test, an XGBoost classifier.
 
The dataset has the following columns:
 
1. `'Family_Case_ID'`
2. `'Severity'`
3. `'Birthday_year'`
4. `'Parents or siblings infected'`
5. `'Wife/Husband or children infected'`
6. `'Medical_Expenses_Family'`
7. `'Medical_Tent_A'`
8. `'Medical_Tent_B'`
9. `'Medical_Tent_C'`
10. `'Medical_Tent_D'`
11. `'Medical_Tent_E'`
12. `'Medical_Tent_F'`
13. `'Medical_Tent_G'`
14. `'Medical_Tent_T'`
15. `'Medical_Tent_n/a'`
16. `'City_Albuquerque'`
17. `'City_Santa Fe'`
18. `'City_Taos'`
19. `'Gender_M'`
20. `'family_size'`
21. `'Sev_by_city'`: Average severity in the city of the patient.
22. `'Sev_by_tent'`: Average severity in the medical tent of the patient.
23. `'Sev_by_gender'`: Average severity within the gender of the patient.
24. `'Sev_family'`: Average severity in the family of the patient.
25. `'spending_vs_severity'`: Medical Expenses Family / Patient's Severity
26. `'spending_family_member'`: Medical Expenses Family / Number of cases in the family
27. `'severity_against_avg_city'`: Patient's Severity / Sev_by_city
28. `'severity_against_avg_tent'`: Patient's Severity / Sev_by_tent
29. `'severity_against_avg_gender'`: Patient's Severity / Sev_by_gender
30. `'spending_family_severity'`: Patient's Severity / Sev_family
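
The aggregate and ratio features in the list above can be derived with a group-wise mean followed by element-wise division. The following is a minimal sketch on a made-up four-row frame. It assumes the raw data has `City` and `Severity` columns before one-hot encoding, which the notebook does not show, and it is not the preprocessing code behind `Dataset`.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
df = pd.DataFrame({
    "City": ["Santa Fe", "Santa Fe", "Taos", "Taos"],
    "Severity": [3, 1, 2, 2],
    "Medical_Expenses_Family": [225, 1663, 400, 812],
})

# Average severity in the city of the patient
df["Sev_by_city"] = df.groupby("City")["Severity"].transform("mean")

# Patient's Severity / Sev_by_city
df["severity_against_avg_city"] = df["Severity"] / df["Sev_by_city"]

# Medical Expenses Family / Patient's Severity
df["spending_vs_severity"] = df["Medical_Expenses_Family"] / df["Severity"]

print(df)
```

The same `groupby(...).transform("mean")` pattern would apply to the tent, gender, and family averages, with the grouping column swapped.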


In [1]:
from dataset import Dataset
from model import Model
import numpy as np

## First model - XGBoost - Alejandro

### Step 1: Load datasets

In [121]:
dataset = Dataset()            # Loads the preprocessed dataset
train_set = dataset.train_data # Training set without labels (train.csv)
target = dataset.target        # Labels for training set     (train.csv[Deceased])
test_set = dataset.test_data   # Unlabeled test set          (test.csv)

In [122]:
train_set.columns.values

array(['Family_Case_ID', 'Severity', 'Birthday_year',
       'Parents or siblings infected',
       'Wife/Husband or children infected', 'Medical_Expenses_Family',
       'Sev_by_city', 'Sev_by_tent', 'Sev_by_gender', 'Sev_family',
       'Medical_Tent_A', 'Medical_Tent_B', 'Medical_Tent_C',
       'Medical_Tent_D', 'Medical_Tent_E', 'Medical_Tent_F',
       'Medical_Tent_G', 'Medical_Tent_T', 'Medical_Tent_n/a',
       'City_Albuquerque', 'City_Santa Fe', 'City_Taos', 'Gender_M',
       'family_size', 'spending_vs_severity', 'spending_family_member',
       'severity_against_avg_city', 'severity_against_avg_tent',
       'severity_against_avg_gender', 'spending_family_severity'],
      dtype=object)

### Step 2: Apply transformations and select variables

In [123]:
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler


exclude_columns = [
    'Medical_Tent_T'
]

train_set = train_set.loc[:,~train_set.columns.isin(exclude_columns)]

In [124]:
#Scaling
scale_type = 'StandardScaler'
if scale_type == "RobustScaler":
    robust = RobustScaler().fit(train_set)
    train_set = robust.transform(train_set)
elif scale_type == "MinMaxScaler":
    minmax = MinMaxScaler().fit(train_set)
    train_set = minmax.transform(train_set)
elif scale_type == "StandardScaler":
    scaler = StandardScaler().fit(train_set)
    train_set = scaler.transform(train_set)
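
The cell above fits the scaler on `train_set` only. If the model later predicts on `test_set`, the same column selection and the same fitted scaler generally need to be applied there as well. A minimal sketch with toy arrays follows. Whether the `Model` class already handles this internally is not shown in this notebook, so treat it as an assumption to check.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for train_set and test_set (hypothetical values)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Fit on the training data only, then reuse the fitted statistics
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the training mean and std

print(X_test_scaled)
```

Fitting a second scaler on the test data would put the two sets on different scales, which is why the fitted object is reused here.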

In [125]:
#X = train_set.copy()
#y = target.copy()
#df = X.merge(y, left_index = True, right_index = True)

In [148]:
import xgboost as xgb
from sklearn.model_selection import cross_val_score
tree = xgb.XGBClassifier(random_state=1234, n_jobs=-1, scale_pos_weight=0.6238698010849909)
cross_val_score(tree, train_set, target, n_jobs=-1, cv=5, scoring='accuracy').mean()

0.8184419615145873
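
`scale_pos_weight` reweights the positive class, and the XGBoost documentation suggests the ratio of negative to positive instance counts as a typical starting value. The sketch below shows that calculation on hypothetical labels. The constant used above equals 345/553 to floating-point precision, which would be consistent with such a ratio, but the class counts behind it are an inference and are not stated in this notebook.

```python
# Hypothetical binary labels (1 = positive class); not the real target column
labels = [1, 1, 1, 0, 0]

n_pos = sum(1 for y in labels if y == 1)
n_neg = len(labels) - n_pos

# Ratio of negative to positive counts
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)

# The constant in the cell above matches 345 / 553 (the counts are an inference)
matches_ratio = abs(345 / 553 - 0.6238698010849909) < 1e-12
print(matches_ratio)
```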

### Step 3: Load model from SKLearn

In [149]:
import time
from scipy import stats
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd

clf = xgb.XGBClassifier(random_state = 1234, n_jobs = -1, scale_pos_weight = 0.6238698010849909)

param_grid = {
        'learning_rate': stats.uniform(0,1),
        'gamma': stats.uniform(0,0.1),
        'max_depth': stats.randint(3, 50),
        'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
        'max_delta_step':stats.randint(1,10),
        'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'reg_lambda': [0.1, 1.0, 5.0, 10.0, 25, 50.0, 75, 100.0]
}

rs_clf = RandomizedSearchCV(clf, param_grid, n_iter=1000,
                            n_jobs=-1, verbose=2, cv=5,
                            scoring='accuracy', random_state=42)
print("Randomized search..")
search_time_start = time.time()
rs_clf.fit(train_set, target)
print("Randomized search time:", time.time() - search_time_start)

best_score = rs_clf.best_score_
best_params = rs_clf.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
    print('%s: %r' % (param_name, best_params[param_name]))

Randomized search..
Fitting 5 folds for each of 1000 candidates, totalling 5000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 276 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 682 tasks      | elapsed:    9.9s
[Parallel(n_jobs=-1)]: Done 1248 tasks      | elapsed:   20.0s
[Parallel(n_jobs=-1)]: Done 1978 tasks      | elapsed:   31.4s
[Parallel(n_jobs=-1)]: Done 2868 tasks      | elapsed:   45.7s
[Parallel(n_jobs=-1)]: Done 3922 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 4985 out of 5000 | elapsed:  1.4min remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed:  1.4min finished


Randomized search time: 82.4287519454956
Best score: 0.8339913097454996
Best params: 
colsample_bylevel: 1.0
colsample_bytree: 0.4
gamma: 0.0683870124909032
learning_rate: 0.02534860426393848
max_delta_step: 1
max_depth: 13
min_child_weight: 1.0
reg_lambda: 0.1
subsample: 0.9
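
A note on how the search space above is read: `stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, and `stats.randint(low, high)` excludes `high`. The sketch below draws a few values to illustrate this. The sample size and seed are arbitrary choices.

```python
from scipy import stats

# uniform(0, 0.1) covers [0, 0.1]: the second argument is a width, not an upper bound
gamma_samples = stats.uniform(0, 0.1).rvs(size=1000, random_state=0)

# randint(3, 50) yields integers from 3 to 49 inclusive
depth_samples = stats.randint(3, 50).rvs(size=1000, random_state=0)

print(gamma_samples.min(), gamma_samples.max())
print(depth_samples.min(), depth_samples.max())
```

For a range that does not start at zero, for example 0.01 to 0.3, the call would be `stats.uniform(0.01, 0.29)`.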


In [38]:
means = rs_clf.cv_results_['mean_test_score']
stds = rs_clf.cv_results_['std_test_score']
params = rs_clf.cv_results_['params']

scores_random = pd.concat([pd.DataFrame(params),pd.DataFrame(means, columns=["Accuracy"])],axis=1)
scores_random.sort_values('Accuracy', ascending = False).head(3)

| | colsample_bylevel | colsample_bytree | gamma | learning_rate | max_delta_step | max_depth | min_child_weight | reg_lambda | subsample | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 5308 | 0.5 | 0.5 | 0.042058 | 0.076804 | 4 | 38 | 1.0 | 1.0 | 0.7 | 0.840708 |
| 4122 | 0.8 | 0.8 | 0.009009 | 0.758184 | 8 | 27 | 1.0 | 50.0 | 0.7 | 0.839603 |
| 108 | 0.8 | 0.5 | 0.008787 | 0.138825 | 9 | 19 | 0.5 | 5.0 | 1.0 | 0.838485 |


**Best score: 0.840707635009311**

Best params:

- `colsample_bylevel`: 0.5
- `colsample_bytree`: 0.5
- `gamma`: 0.042058384143182095
- `learning_rate`: 0.07680397716570297
- `max_delta_step`: 4
- `max_depth`: 38
- `min_child_weight`: 1.0
- `reg_lambda`: 1.0
- `subsample`: 0.7

In [10]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
import xgboost as xgb

import optuna
def objective(trial):
    # Sample one candidate configuration from the search space
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.0, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 0.1),
        'max_depth': trial.suggest_int('max_depth', 3, 50),
        'min_child_weight': trial.suggest_float('min_child_weight', 0.5, 10.0),
        'max_delta_step': trial.suggest_int('max_delta_step', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.3, 1.0),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.3, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 100.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.1, 100.0),
    }
    clf = xgb.XGBClassifier(random_state=1234, **params)

    # Objective value: mean 5-fold cross-validated accuracy
    score = cross_val_score(clf, train_set, target, n_jobs=-1, cv=5, scoring="accuracy")
    return score.mean()

is_training = True
if is_training:
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=2000, n_jobs=-1)

[I 2020-05-23 13:32:17,263] Finished trial#3 with value: 0.7894289261328369 with parameters: {'learning_rate': 0.822167337432132, 'gamma': 0.09640270786323546, 'max_depth': 5, 'min_child_weight': 1.2065134266348332, 'max_delta_step': 9, 'subsample': 0.8740754500090397, 'colsample_bytree': 0.5320232722201619, 'colsample_bylevel': 0.9885639600262732, 'reg_lambda': 97.99975154492552, 'reg_alpha': 35.53670574122104}. Best is trial#3 with value: 0.7894289261328369.
[I 2020-05-23 13:32:17,413] Finished trial#1 with value: 0.7160335195530727 with parameters: {'learning_rate': 0.9035185806014384, 'gamma': 0.053351194848105146, 'max_depth': 18, 'min_child_weight': 0.910281089158295, 'max_delta_step': 9, 'subsample': 0.7439087538085941, 'colsample_bytree': 0.34319995881332505, 'colsample_bylevel': 0.9824027569143821, 'reg_lambda': 83.88101655186195, 'reg_alpha': 70.25325448942871}. Best is trial#3 with value: 0.7894289261328369.
[I 2020-05-23 13:32:17,517] Finished trial#2 with value: 0.73711980

In [12]:
# top scores
opt_scores = study.trials_dataframe()
opt_scores.sort_values("value", ascending=False).head(5)

| | number | value | datetime_start | datetime_complete | params_colsample_bylevel | params_colsample_bytree | params_gamma | params_learning_rate | params_max_delta_step | params_max_depth | params_min_child_weight | params_reg_alpha | params_reg_lambda | params_subsample | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 648 | 648 | 0.841844 | 2020-05-23 13:35:03.621356 | 2020-05-23 13:35:05.929823 | 0.602264 | 0.666572 | 0.029668 | 0.53687 | 6 | 15 | 0.747512 | 3.168703 | 5.689629 | 0.730506 | COMPLETE |
| 234 | 234 | 0.838516 | 2020-05-23 13:33:00.847788 | 2020-05-23 13:33:02.857845 | 0.924291 | 0.892689 | 0.023562 | 0.50364 | 1 | 10 | 0.714042 | 2.992615 | 4.139067 | 0.976918 | COMPLETE |
| 1839 | 1839 | 0.838504 | 2020-05-23 13:45:04.816266 | 2020-05-23 13:45:09.145269 | 0.747391 | 0.858976 | 0.017743 | 0.525816 | 2 | 17 | 0.505997 | 3.810442 | 9.062423 | 0.905981 | COMPLETE |
| 236 | 236 | 0.838498 | 2020-05-23 13:33:01.080848 | 2020-05-23 13:33:02.855933 | 0.920195 | 0.899646 | 0.025084 | 0.504317 | 1 | 10 | 0.691808 | 3.181049 | 4.934252 | 0.951124 | COMPLETE |
| 1087 | 1087 | 0.838498 | 2020-05-23 13:38:02.182905 | 2020-05-23 13:38:05.559921 | 0.741688 | 0.880183 | 0.065563 | 0.63828 | 2 | 12 | 1.567493 | 1.67401 | 60.326872 | 0.900744 | COMPLETE |


In [14]:
import re
params = dict(opt_scores.sort_values("value", ascending=False).iloc[0,4:-1])
params = {re.sub('params_', '', key): val for key, val in params.items()}

In [15]:
params

{'colsample_bylevel': 0.6022636801339624,
 'colsample_bytree': 0.666571566502976,
 'gamma': 0.02966833944269946,
 'learning_rate': 0.5368697534586804,
 'max_delta_step': 6,
 'max_depth': 15,
 'min_child_weight': 0.7475122231419973,
 'reg_alpha': 3.168702933432126,
 'reg_lambda': 5.689628963117337,
 'subsample': 0.7305056798130244}
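
The prefix stripping above selects the parameter columns by position (`iloc[0, 4:-1]`), which depends on the column order of `trials_dataframe()`. A slightly more defensive variant selects keys by prefix instead. The sketch below uses a hand-written dictionary standing in for one row of `opt_scores`; the helper name `extract_params` is hypothetical.

```python
def extract_params(row, prefix="params_"):
    """Return only the prefixed entries of `row`, with the prefix removed."""
    return {
        key[len(prefix):]: val
        for key, val in row.items()
        if key.startswith(prefix)
    }

# Stand-in for one row of the trials dataframe (values rounded from the table above)
row = {
    "number": 648,
    "value": 0.841844,
    "params_gamma": 0.029668,
    "params_max_depth": 15,
    "state": "COMPLETE",
}

cleaned = extract_params(row)
print(cleaned)
```

Because a pandas `Series` also provides `.items()`, the same helper should work on a row taken with `.iloc[0]`. Optuna also exposes the best trial's parameters directly as `study.best_params`.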

In [16]:
xgb_model = xgb.XGBClassifier(**{'colsample_bylevel': 0.6022636801339624,
                 'colsample_bytree': 0.666571566502976,
                 'gamma': 0.02966833944269946,
                 'learning_rate': 0.5368697534586804,
                 'max_delta_step': 6,
                 'max_depth': 15,
                 'min_child_weight': 0.7475122231419973,
                 'reg_alpha': 3.168702933432126,
                 'reg_lambda': 5.689628963117337,
                 'random_state' : 1234,
                 'subsample': 0.7305056798130244})

In [17]:
cross_val_score(xgb_model, train_set, target, n_jobs=-1, cv=5, scoring = 'accuracy').mean()

0.8418435754189945

### Step 4: Run model

In [18]:
model = Model(model     = xgb_model,              # Initialized classifier model from SKLearn
         #     variables = selected_variables_SVC, # Subset of variables from data to be used for training
                                                  # If variables=None, then all variables in set are used
              
              train_set = train_set,              # Samples X for training and validating
              target    = target,                 # Samples Y for training and validating
              test_set  = test_set                # Unlabeled samples for creating prediction
              )                 

model.run_model(path="results/xgb_results_v2.csv")
model.train_data

Model - XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=0.6022636801339624,
              colsample_bytree=0.666571566502976, gamma=0.02966833944269946,
              learning_rate=0.5368697534586804, max_delta_step=6, max_depth=15,
              min_child_weight=0.7475122231419973, missing=None,
              n_estimators=100, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=1234,
              reg_alpha=3.168702933432126, reg_lambda=5.689628963117337,
              scale_pos_weight=1, seed=None, silent=True,
              subsample=0.7305056798130244)
Average model accuracy: 82.40%
Highest model accuracy: 86.11%
Solution set saved as 'results/xgb_results_v2.csv'.


| Patient_ID | Family_Case_ID | Severity | Birthday_year | Parents or siblings infected | Wife/Husband or children infected | Medical_Expenses_Family | Sev_by_city | Sev_by_tent | Sev_by_gender | Sev_family | ... | City_Santa Fe | City_Taos | Gender_M | family_size | spending_vs_severity | spending_family_member | severity_against_avg_city | severity_against_avg_tent | severity_against_avg_gender | spending_family_severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4696 | 3 | -1.0 | 0 | 0 | 225 | 2.354391 | 2.623932 | 2.169811 | 3.0 | ... | 1 | 0 | 0 | 1 | 75.000000 | 225.000000 | 1.274215 | 1.143322 | 1.382609 | 75.000000 |
| 2 | 21436 | 1 | 1966.0 | 0 | 1 | 1663 | 1.893491 | 2.623932 | 2.169811 | 1.0 | ... | 0 | 0 | 0 | 1 | 1663.000000 | 831.500000 | 0.528125 | 0.381107 | 0.460870 | 831.500000 |
| 3 | 7273 | 3 | 1982.0 | 0 | 0 | 221 | 2.354391 | 2.623932 | 2.391753 | 3.0 | ... | 1 | 0 | 1 | 1 | 73.666667 | 221.000000 | 1.274215 | 1.143322 | 1.254310 | 73.666667 |
| 4 | 8226 | 3 | 1997.0 | 0 | 0 | 220 | 2.354391 | 2.623932 | 2.391753 | 3.0 | ... | 1 | 0 | 1 | 1 | 73.333333 | 220.000000 | 1.274215 | 1.143322 | 1.254310 | 73.333333 |
| 5 | 19689 | 3 | 1994.0 | 0 | 0 | 222 | 2.354391 | 2.623932 | 2.169811 | 3.0 | ... | 1 | 0 | 0 | 1 | 74.000000 | 222.000000 | 1.274215 | 1.143322 | 1.382609 | 74.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 896 | 6253 | 3 | 1998.0 | 1 | 1 | 344 | 2.354391 | 2.623932 | 2.169811 | 3.0 | ... | 1 | 0 | 0 | 2 | 114.666667 | 114.666667 | 1.274215 | 1.143322 | 1.382609 | 38.222222 |
| 897 | 6483 | 3 | 2006.0 | 0 | 0 | 258 | 2.354391 | 2.623932 | 2.391753 | 3.0 | ... | 1 | 0 | 1 | 1 | 86.000000 | 258.000000 | 1.274215 | 1.143322 | 1.254310 | 86.000000 |
| 898 | 981 | 3 | 1990.0 | 0 | 0 | 214 | 2.900000 | 2.623932 | 2.169811 | 3.0 | ... | 0 | 1 | 0 | 1 | 71.333333 | 214.000000 | 1.034483 | 1.143322 | 1.382609 | 71.333333 |
| 899 | 16418 | 2 | 1994.0 | 1 | 1 | 812 | 2.354391 | 2.623932 | 2.391753 | 2.0 | ... | 1 | 0 | 1 | 3 | 406.000000 | 270.666667 | 0.849476 | 0.762215 | 0.836207 | 135.333333 |
