# Homework 5

Building a Rashomon set of models.

The models are built on top of the preprocessing described in the article: https://academic.oup.com/jamiaopen/article/1/1/87/5032901

The code accompanying the article is available at: https://github.com/illidanlab/urgent-care-comparative

Task: a classification problem, predicting mortality from the *X48* data representation (as defined in the article above).

### Libraries

In [1]:
import numpy as np
import pandas as pd

import pickle
import os.path

import xgboost as xgb

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import auc as auc_score
from sklearn.utils import shuffle

### Loading the preprocessed data

In [2]:
X = np.load("X48.npy")

In [3]:
with open('y.npy', 'rb') as f:
    labels = pickle.load(f)
    
task = [yy[0] for yy in labels]
y = np.array(task)

### Generating cross-validation samples

For modeling we use five-fold cross-validation. To ensure reproducibility, the indices of the samples used in each fold can be loaded from a file:

In [4]:
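def get_cv_samples_indexes(X, y):
    # Reuse previously saved folds so every run evaluates on the same splits.
    if os.path.isfile('samples.npy'):
        return np.load("samples.npy", allow_pickle = True)
    else:
        tab = []
        skf = StratifiedKFold(n_splits = 5)

        # Collect (train_index, test_index) pairs for each of the 5 folds.
        for train_index, test_index in skf.split(X, y):
            tab.append((train_index, test_index))

        # Persist the folds (pickled, despite the .npy extension).
        with open('samples.npy', 'wb') as f:
            pickle.dump(tab, f)

        return tab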

In [5]:
cv_tab = get_cv_samples_indexes(X, y)
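
As a quick sanity check of what stratified folds should look like, the sketch below runs `StratifiedKFold` on synthetic labels and verifies that the test folds cover every sample exactly once and keep the positive rate close to the global one. The arrays `X_demo` and `y_demo` are made up for illustration and are not the X48 data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (hypothetical): 100 samples, 20% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

skf = StratifiedKFold(n_splits=5)
folds = list(skf.split(X_demo, y_demo))

# Every sample should appear in exactly one test fold.
all_test = np.concatenate([test for _, test in folds])
covers_all = np.array_equal(np.sort(all_test), np.arange(len(y_demo)))

# Stratification keeps each test fold's positive rate close to the global rate.
fold_rates = [y_demo[test].mean() for _, test in folds]
print(covers_all, fold_rates)
```

The same two checks could be applied to the stored `cv_tab` to confirm that a loaded file matches the data it is used with.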

### Objects: model, random search, hyperparameter grid

In [6]:
model = xgb.XGBClassifier(objective='binary:logistic', n_jobs = -1, eval_metric = 'auc', use_label_encoder = False, seed = 123)

The hyperparameter ranges follow the article (Table 1): https://jmlr.org/papers/volume20/18-444/18-444.pdf

Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html

In [7]:
hyperparameters =  {
    'learning_rate' : 2 ** np.linspace(-10, 0, num = 20),
    'subsample' : np.linspace(0.1, 1, num = 20),
    'booster' : ['gbtree', 'dart'],
    'max_depth' : list(range(1, 15 + 1)),
    'min_child_weight' : 2 ** np.linspace(0, 7, num = 20),
    'colsample_bytree' : np.linspace(0.001, 1, num = 20),
    'colsample_bylevel' : np.linspace(0.001, 1, num = 20),
    'lambda' : 2 ** np.linspace(-10, 10, num = 20),
    'alpha' : 2 ** np.linspace(-10, 10, num = 20),
    'n_estimators' : list(range(30, 740, 50))
}
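
Grids of the form `2 ** np.linspace(a, b, num)` are evenly spaced on a log scale, so small and large values are represented equally. The sketch below illustrates this on a reduced grid and draws a few random configurations with scikit-learn's `ParameterSampler`. The name `demo_grid`, the choice of two parameters, and `random_state=0` are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# A reduced, illustrative grid (two of the parameters from the notebook).
demo_grid = {
    "learning_rate": 2 ** np.linspace(-10, 0, num=20),
    "max_depth": list(range(1, 15 + 1)),
}

# Log-spaced values: the ratio between consecutive entries is constant.
ratios = demo_grid["learning_rate"][1:] / demo_grid["learning_rate"][:-1]

# Draw 5 random configurations; random_state is an arbitrary choice for this demo.
samples = list(ParameterSampler(demo_grid, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

When every entry is a list or array, as here, the candidates are drawn from a finite grid, so the number of distinct configurations is bounded by the product of the list lengths.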

In [8]:
class cross_val_gen:
    def __init__(self, cv_tab):
        self.n_splits = len(cv_tab)
        self.cv_tab = cv_tab

    def split(self, X, y, groups=None):
        # Iterate over the folds stored on the instance, not the global cv_tab.
        for train_index, test_index in self.cv_tab:
            yield train_index, test_index

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
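
As an alternative to a wrapper class, scikit-learn also accepts a plain iterable of `(train, test)` index pairs as the `cv` argument. The sketch below shows this with `cross_val_score`; the synthetic data and the `LogisticRegression` stand-in are assumptions for illustration, not the XGBoost model or the X48 features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical), not the X48 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Precompute the folds once, as a list of (train_index, test_index) pairs.
demo_folds = list(StratifiedKFold(n_splits=5).split(X_demo, y_demo))

# The list itself is a valid `cv` argument; no wrapper class is required.
scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=demo_folds, scoring="roc_auc"
)
print(scores.shape, scores.mean())
```

Passing a list rather than a generator matters here: a list can be iterated once per candidate model, whereas a generator would be exhausted after the first pass.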

In [9]:
number_of_models = 50

In [10]:
cv_search_obj = RandomizedSearchCV(estimator = model, param_distributions = hyperparameters, n_iter = number_of_models, 
                                   scoring = 'roc_auc', cv = cross_val_gen(cv_tab), return_train_score = True, verbose = 2)

### Modeling

In [11]:
search = cv_search_obj.fit(X, y)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END alpha=0.00871607748251206, booster=gbtree, colsample_bylevel=0.7371052631578947, colsample_bytree=0.7896842105263158, lambda=237.9866508821521, learning_rate=0.6943255713073277, max_depth=4, min_child_weight=1.666524012797089, n_estimators=180, subsample=0.2894736842105263; total time=   2.7s
[CV] END alpha=0.00871607748251206, booster=gbtree, colsample_bylevel=0.7371052631578947, colsample_bytree=0.7896842105263158, lambda=237.9866508821521, learning_rate=0.6943255713073277, max_depth=4, min_child_weight=1.666524012797089, n_estimators=180, subsample=0.2894736842105263; total time=   2.2s
[CV] END alpha=0.00871607748251206, booster=gbtree, colsample_bylevel=0.7371052631578947, colsample_bytree=0.7896842105263158, lambda=237.9866508821521, learning_rate=0.6943255713073277, max_depth=4, min_child_weight=1.666524012797089, n_estimators=180, subsample=0.2894736842105263; total time=   2.6s
[CV] END alpha=0.008716077482

### Results data frame

In [12]:
results = pd.DataFrame(search.cv_results_)

In [13]:
results.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_subsample', 'param_n_estimators', 'param_min_child_weight',
       'param_max_depth', 'param_learning_rate', 'param_lambda',
       'param_colsample_bytree', 'param_colsample_bylevel', 'param_booster',
       'param_alpha', 'params', 'split0_test_score', 'split1_test_score',
       'split2_test_score', 'split3_test_score', 'split4_test_score',
       'mean_test_score', 'std_test_score', 'rank_test_score',
       'split0_train_score', 'split1_train_score', 'split2_train_score',
       'split3_train_score', 'split4_train_score', 'mean_train_score',
       'std_train_score'],
      dtype='object')

In [14]:
results.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_subsample | param_n_estimators | param_min_child_weight | param_max_depth | param_learning_rate | param_lambda | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.433976 | 0.203905 | 0.010634 | 0.001131 | 0.289474 | 180 | 1.666524 | 4 | 0.694326 | 237.986651 | ... | 0.848056 | 0.007312 | 12 | 0.928404 | 0.926079 | 0.927283 | 0.928269 | 0.930052 | 0.928018 | 0.001315 |
| 1 | 6.234745 | 1.474605 | 0.010191 | 0.000636 | 0.194737 | 580 | 76.806574 | 7 | 0.026039 | 55.310201 | ... | 0.620302 | 0.016279 | 49 | 0.602346 | 0.591441 | 0.641234 | 0.634114 | 0.642813 | 0.62239 | 0.021304 |
| 2 | 8.02495 | 0.267324 | 0.015839 | 0.001033 | 0.431579 | 380 | 99.152617 | 4 | 0.161367 | 0.161367 | ... | 0.833046 | 0.009093 | 17 | 0.851036 | 0.843831 | 0.849904 | 0.846633 | 0.850674 | 0.848416 | 0.002771 |
| 3 | 1.365094 | 0.014258 | 0.009709 | 7.3e-05 | 0.905263 | 130 | 12.85458 | 2 | 0.008716 | 0.161367 | ... | 0.768341 | 0.014254 | 42 | 0.781201 | 0.777799 | 0.781677 | 0.783601 | 0.779132 | 0.780682 | 0.002025 |
| 4 | 1.673668 | 0.586323 | 0.008294 | 0.000597 | 0.384211 | 530 | 7.713408 | 1 | 1.0 | 0.037503 | ... | 0.654 | 0.021556 | 48 | 0.672993 | 0.607689 | 0.68439 | 0.655589 | 0.665634 | 0.657259 | 0.02651 |


In [15]:
with open('results.npy', 'wb') as f:
    pickle.dump(results, f)

In [16]:
results.to_csv("results.csv")
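
One common way to turn such a results table into a Rashomon set is to keep every model whose score lies within a tolerance of the best one. The sketch below applies that rule to the `mean_test_score` column. The helper `rashomon_set`, the tolerance `epsilon = 0.02`, and the five-row demo frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Mean test AUC of the five rows shown by results.head() above; the full
# table has 50 rows, so this subset is only for illustration.
demo_results = pd.DataFrame(
    {"mean_test_score": [0.848056, 0.620302, 0.833046, 0.768341, 0.654]}
)

def rashomon_set(results, epsilon):
    """Keep rows whose mean test score is within `epsilon` of the best one."""
    best = results["mean_test_score"].max()
    return results[results["mean_test_score"] >= best - epsilon]

# epsilon is a hypothetical tolerance, not a value taken from the notebook.
selected = rashomon_set(demo_results, epsilon=0.02)
print(selected.index.tolist())
```

On the full `results` frame the same function could be called directly, with `epsilon` chosen according to how much loss in AUC is considered acceptable.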