# Model Selection 

The objective of this notebook is to build a first end-to-end machine learning model that predicts the probability of a patient being discharged on a particular day.

We'll use the dataset generated by the `dataset` job of the ETL folder. 

At this stage, the focus is not on model performance but on understanding the potential value of the available data and the speed of the solution.

In [1]:
%cd /Users/josefinadallavia/Documents/MIM/Tesis/AML-hospital

/Users/josefinadallavia/Documents/MIM/Tesis/AML-hospital


In [2]:
import os
import pandas as pd
from matplotlib import pyplot as plt

# Tolerate more than one OpenMP runtime being loaded (workaround for OMP Error #15).
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

from thesis_lib.utils import *
from thesis_lib.modelling.data import *
from thesis_lib.modelling.model import *

In [3]:
data = Data().load('data/hospital_dataset')
variables = data.get_variables_dict()
data.get_stats()

Loading dataset:  hospital_train_data.parquet
Loading dataset:  hospital_val_data.parquet
Loading dataset:  hospital_test_data.parquet


| dataset_type | train | val | test |
| --- | --- | --- | --- |
| n_observations | 319150 | 33482 | 33309 |
| relative_size | 0.82694 | 0.0867542 | 0.0863059 |
| n_cols | 71 | 71 | 71 |
| positives | 42697 | 4555 | 4507 |
| negatives | 276453 | 28927 | 28802 |
| positive_prop | 0.133783 | 0.136043 | 0.135309 |
| negative_prop | 0.866217 | 0.863957 | 0.864691 |
| min_date | 2017-01-01 | 2018-11-11 | 2018-11-11 |
| max_date | 2018-11-10 | 2019-11-11 | 2019-11-11 |
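
The proportions in the summary above follow directly from the raw counts. As a quick sanity check, the sketch below recomputes `relative_size` and `positive_prop` from `n_observations` and `positives`. The dictionary layout and the helper name `split_summary` are illustrative and not part of `thesis_lib`.

```python
# Counts copied from the data.get_stats() output above.
counts = {
    "train": {"n_observations": 319150, "positives": 42697},
    "val": {"n_observations": 33482, "positives": 4555},
    "test": {"n_observations": 33309, "positives": 4507},
}


def split_summary(counts):
    """Return relative size and positive-class share per split (hypothetical helper)."""
    total = sum(split["n_observations"] for split in counts.values())
    return {
        name: {
            "relative_size": split["n_observations"] / total,
            "positive_prop": split["positives"] / split["n_observations"],
        }
        for name, split in counts.items()
    }


summary = split_summary(counts)
for name, stats in summary.items():
    print(f"{name}: relative_size={stats['relative_size']:.5f}, "
          f"positive_prop={stats['positive_prop']:.6f}")
```

Roughly 13% of the observations are positive in every split, so the classes are imbalanced but consistently so across train, validation and test.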


In [4]:
categorical_features = ['date_weekday', 'request_origin', 'origin', 'entity_group',
                        'gender', 'request_sector', 'insurance_entity',
                        'admission_sector', 'emergency_service']
numerical_features = ['patient_age', 'hosp_day_number', 'images_count', 'images_cumulative',
                      'labos_count', 'labos_cumulative', 'surgeries_count', 'surgeries_cumulative',
                      'new_born_weight', 'new_born_gestation_age']

In [5]:
model_params = {'classifier': 'lgbm',
                'accepts_sparse': True,
                'categorical_features': categorical_features,
                'numerical_features': numerical_features}

In [6]:
lgbm_extra_features = Model(model_params)
lgbm_extra_features.transform(data)

Fitting pipeline...
Transforming data...


In [7]:
### baseline

In [8]:
lgbm_extra_features.fit_classifier()

Training classifier




In [9]:
lgbm_extra_features.get_performance_metrics()

training AUC ROC score:  0.8359565244475374
validation AUC ROC score:  0.8245795531254589
relative overfitting:  0.013609525124045466
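
The `relative overfitting` figure is consistent with the gap between training and validation AUC expressed as a fraction of the training AUC. The sketch below reproduces the reported value under that assumption. `relative_overfitting` is a hypothetical helper, and the actual implementation in `thesis_lib` may differ.

```python
def relative_overfitting(train_auc: float, val_auc: float) -> float:
    """Gap between training and validation AUC, relative to the training AUC.

    Assumed formula: (train - val) / train. It agrees with the printed output,
    but it is inferred from the numbers, not taken from thesis_lib.
    """
    return (train_auc - val_auc) / train_auc


# AUC values copied from the get_performance_metrics() output above.
baseline = relative_overfitting(0.8359565244475374, 0.8245795531254589)
print(f"relative overfitting: {baseline:.6f}")
```

A value close to zero means the model scores almost as well on unseen data as on the data it was trained on.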


In [15]:
lgbm_param_grid = {'max_depth': [3,7,10],
                  'learning_rate': [0.1,0.01,0.001,0.0001],
                  'num_iterations': [50,100,150,200,250]}
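
An exhaustive search over this grid evaluates every combination of the three parameters. The sketch below uses only the standard library to count the candidates and the total number of fits for 3-fold cross-validation. It illustrates the cost of the search and is not meant to show how `optimize_hyperparams` is implemented.

```python
from itertools import product

lgbm_param_grid = {
    "max_depth": [3, 7, 10],
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
    "num_iterations": [50, 100, 150, 200, 250],
}
n_folds = 3

# Cartesian product of all parameter values: one dict per candidate.
names = list(lgbm_param_grid)
candidates = [dict(zip(names, values))
              for values in product(*lgbm_param_grid.values())]

n_candidates = len(candidates)   # 3 * 4 * 5 = 60
n_fits = n_candidates * n_folds  # 60 * 3 = 180
print(f"{n_candidates} candidates, {n_fits} fits")
```

A randomized search with `n_iter=10` would instead sample 10 of these 60 combinations, reducing the work to 30 fits.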

In [None]:
lgbm_extra_features.optimize_hyperparams(params_dict=lgbm_param_grid,
                                         n_folds=3,n_iter=10,
                                         search_type='grid')

Fitting 3 folds for each of 60 candidates, totalling 180 fits
[CV] learning_rate=0.1, max_depth=3, num_iterations=50 ...............


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  learning_rate=0.1, max_depth=3, num_iterations=50, score=0.809, total=   1.4s
[CV] learning_rate=0.1, max_depth=3, num_iterations=50 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=50, score=0.808, total=   1.2s
[CV] learning_rate=0.1, max_depth=3, num_iterations=50 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.6s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=50, score=0.790, total=   1.7s
[CV] learning_rate=0.1, max_depth=3, num_iterations=100 ..............


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    4.3s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=100, score=0.817, total=   2.9s
[CV] learning_rate=0.1, max_depth=3, num_iterations=100 ..............


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    7.2s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=100, score=0.814, total=   2.3s
[CV] learning_rate=0.1, max_depth=3, num_iterations=100 ..............


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    9.5s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=100, score=0.798, total=   4.4s
[CV] learning_rate=0.1, max_depth=3, num_iterations=150 ..............


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   14.0s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=150, score=0.822, total=   4.3s
[CV] learning_rate=0.1, max_depth=3, num_iterations=150 ..............


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   18.3s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=150, score=0.818, total=   4.3s
[CV] learning_rate=0.1, max_depth=3, num_iterations=150 ..............


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   22.6s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=150, score=0.801, total=   4.1s
[CV] learning_rate=0.1, max_depth=3, num_iterations=200 ..............


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   26.7s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=200, score=0.824, total=   4.9s
[CV] learning_rate=0.1, max_depth=3, num_iterations=200 ..............


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   31.5s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=200, score=0.821, total=   3.6s
[CV] learning_rate=0.1, max_depth=3, num_iterations=200 ..............


[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:   35.1s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=200, score=0.803, total=   3.5s
[CV] learning_rate=0.1, max_depth=3, num_iterations=250 ..............


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:   38.6s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=250, score=0.825, total=   4.2s
[CV] learning_rate=0.1, max_depth=3, num_iterations=250 ..............


[Parallel(n_jobs=1)]: Done  13 out of  13 | elapsed:   42.9s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=250, score=0.822, total=   4.5s
[CV] learning_rate=0.1, max_depth=3, num_iterations=250 ..............


[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed:   47.4s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=3, num_iterations=250, score=0.804, total=   4.7s
[CV] learning_rate=0.1, max_depth=7, num_iterations=50 ...............




[CV]  learning_rate=0.1, max_depth=7, num_iterations=50, score=0.830, total=   2.3s
[CV] learning_rate=0.1, max_depth=7, num_iterations=50 ...............




[CV]  learning_rate=0.1, max_depth=7, num_iterations=50, score=0.827, total=   2.3s
[CV] learning_rate=0.1, max_depth=7, num_iterations=50 ...............




[CV]  learning_rate=0.1, max_depth=7, num_iterations=50, score=0.806, total=   2.2s
[CV] learning_rate=0.1, max_depth=7, num_iterations=100 ..............




[CV]  learning_rate=0.1, max_depth=7, num_iterations=100, score=0.832, total=   3.4s
[CV] learning_rate=0.1, max_depth=7, num_iterations=100 ..............




[CV]  learning_rate=0.1, max_depth=7, num_iterations=100, score=0.829, total=   3.4s
[CV] learning_rate=0.1, max_depth=7, num_iterations=100 ..............




[CV]  learning_rate=0.1, max_depth=7, num_iterations=100, score=0.808, total=   3.7s
[CV] learning_rate=0.1, max_depth=7, num_iterations=150 ..............




In [12]:
lgbm_extra_features.get_model_selection_results()

| Unnamed: 0 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_num_iterations | param_max_depth | param_learning_rate | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.465347 | 0.081056 | 0.147165 | 0.010955 | 100 | 3 | 0.001 | 0.774558 | 0.773822 | 0.762009 | 0.77013 | 0.00575 | 9 |
| 1 | 1.277648 | 0.064539 | 0.155121 | 0.015306 | 50 | 7 | 0.01 | 0.813239 | 0.810185 | 0.793397 | 0.805607 | 0.008723 | 7 |
| 2 | 3.010449 | 0.130473 | 0.393538 | 0.02342 | 150 | 7 | 0.01 | 0.820552 | 0.817371 | 0.79779 | 0.811905 | 0.010065 | 4 |
| 3 | 1.267699 | 0.081786 | 0.20676 | 0.02613 | 100 | 3 | 0.1 | 0.816922 | 0.814082 | 0.79798 | 0.809661 | 0.008341 | 5 |
| 4 | 3.689918 | 0.194446 | 0.4407 | 0.044156 | 150 | 10 | 0.01 | 0.820588 | 0.817735 | 0.797567 | 0.811963 | 0.010246 | 3 |
| 5 | 2.128317 | 0.032742 | 0.212158 | 0.028657 | 100 | 10 | 0.001 | 0.801027 | 0.802542 | 0.783841 | 0.795803 | 0.008481 | 8 |
| 6 | 2.276986 | 0.160329 | 0.261567 | 0.008895 | 100 | 7 | 0.01 | 0.81776 | 0.814457 | 0.795494 | 0.809237 | 0.009811 | 6 |
| 7 | 1.838509 | 0.096142 | 0.308506 | 0.041188 | 100 | 7 | 0.1 | 0.831538 | 0.828655 | 0.807655 | 0.822616 | 0.010644 | 2 |
| 8 | 1.004968 | 0.071024 | 0.104967 | 0.003093 | 50 | 3 | 0.001 | 0.755075 | 0.77102 | 0.755399 | 0.760498 | 0.007441 | 10 |
| 9 | 3.510338 | 0.669706 | 0.474222 | 0.042961 | 150 | 7 | 0.1 | 0.831524 | 0.829082 | 0.808125 | 0.82291 | 0.010502 | 1 |
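
Each row's `mean_test_score` and `std_test_score` summarize the three per-fold scores. The sketch below recomputes them for the top-ranked row (index 9) with the standard library. The population standard deviation is used because it reproduces the reported value; that choice is an inference from the numbers, not something the notebook states.

```python
from statistics import mean, pstdev

# Per-fold validation AUC for the top-ranked candidate
# (num_iterations=150, max_depth=7, learning_rate=0.1).
split_scores = [0.831524, 0.829082, 0.808125]

mean_test_score = mean(split_scores)
# Population standard deviation (ddof=0); assumed because it matches the table.
std_test_score = pstdev(split_scores)

print(f"mean_test_score={mean_test_score:.6f}, std_test_score={std_test_score:.6f}")
```

The third fold scores about two points of AUC lower than the other two for every candidate, which accounts for most of the reported standard deviation.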


In [12]:
best_params = lgbm_extra_features.model_selection.best_params_
best_params

{'num_iterations': 150, 'max_depth': 7, 'learning_rate': 0.1}

In [13]:
lgbm_extra_features.fit_best_classifier()

Training classifier




In [14]:
lgbm_extra_features.get_performance_metrics()

training AUC ROC score:  0.8388949769347003
validation AUC ROC score:  0.8250257613159011
relative overfitting:  0.016532719828025327
