# Traditional ML approach

In this notebook, an ML approach is followed to solve a cancer prediction task on gene-expression samples from a specific tumor type. Here, gene expression profiles are modelled directly as vectors. To help the ML models cope with the curse of dimensionality present in the gene expression dataset, feature selection/extraction methods are included in the workflow.

In [1]:
import numpy as np
import pandas as pd
import warnings

# Auxiliary components
from bio_dl_utils import *

Using TensorFlow backend.


## Progression-free interval

Here, we predict the discrete progression-free interval (PFI) of each patient (sample), which corresponds to a binary classification task:

In [2]:
# Define survival variable of interest
surv_variable = "PFI"
surv_variable_time = "PFI.time"

### Lung cancer

In [3]:
# Load samples-info dataset
Y_info_ft = pd.read_hdf("../data/PanCancer/Lung_pancan.h5", key="sample")
# Load survival clinical outcome dataset
Y_surv_ft = pd.read_hdf("../data/PanCancer/Lung_pancan.h5", key="survival_outcome")
# Filter tumor samples from survival clinical outcome dataset
Y_surv_ft = Y_surv_ft.loc[Y_info_ft.tumor_normal=="Tumor"]
# Drop rows where surv_variable or surv_variable_time is NA
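# (reassign instead of using inplace=True on a filtered frame, which can trigger SettingWithCopyWarning)
Y_surv_ft = Y_surv_ft.dropna(subset=[surv_variable, surv_variable_time])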
Y_surv_ft.shape

(999, 33)

We create a discrete-time class variable using the fixed time point selected in the `Lung_PFI_Prediction` notebook:

In [4]:
time = 230
Y_surv_disc_ft = Y_surv_ft[[surv_variable, surv_variable_time]].apply(
    lambda row: survival_fixed_time(time, row[surv_variable_time], row[surv_variable]), axis=1)

Y_surv_disc_ft.dropna(inplace=True)
Y_surv_disc_ft.shape

(855,)
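
`survival_fixed_time` comes from `bio_dl_utils` and its source is not shown here. The sketch below illustrates one common way to discretise a survival outcome at a fixed time point. The name `fixed_time_label` and its censoring rule are assumptions made for illustration, not the notebook's implementation. Under this rule, samples censored before the time point receive no label, which would be consistent with the drop from 999 to 855 rows after `dropna`.

```python
import math

def fixed_time_label(t, event_time, event):
    """Label a sample at fixed time point t (hypothetical helper).

    Returns 1.0 if the event was observed at or before t, 0.0 if the sample
    was still event-free after t, and NaN if it was censored at or before t.
    """
    if event_time > t:
        return 0.0       # event-free at t, whatever happens later
    if event == 1:
        return 1.0       # event observed within the window
    return math.nan      # censored before t: status at t is unknown

labels = [
    fixed_time_label(230, 100, 1),  # event at time 100 -> positive class
    fixed_time_label(230, 500, 1),  # event after the window -> negative class
    fixed_time_label(230, 120, 0),  # censored at time 120 -> no label
]
print(labels)
```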

In [5]:
# Event class fraction
sum(Y_surv_disc_ft)/len(Y_surv_disc_ft)

0.09473684210526316

We also load the gene expression dataset, and select the samples with the survival information of interest:

In [6]:
%%time
# Load gene-exp vectors: this dataset was obtained from the final KEGG BRITE functional hierarchies dataset generated in
# 1-KEGG_BRITE_Hierarchy notebook, by selecting only the columns corresponding to PanCancer samples, removing the 
# duplicated genes (rows) and transposing it
df_gene_exp = pd.read_csv("./KEGG_gene_exp.csv")

CPU times: user 10.7 s, sys: 434 ms, total: 11.1 s
Wall time: 11.1 s


In [7]:
df_gene_exp.shape

(10535, 7509)

In [10]:
df_gene_exp.head()

| | ENSG00000187961.13 | ENSG00000188290.10 | ENSG00000187608.8 | ENSG00000188157.13 | ENSG00000186891.13 | ENSG00000186827.10 | ENSG00000184163.3 | ENSG00000162572.19 | ENSG00000131584.18 | ENSG00000169962.4 | ... | ENSG00000067048.16 | ENSG00000183878.15 | ENSG00000154620.5 | ENSG00000165246.12 | ENSG00000012817.15 | ENSG00000198692.9 | ENSG00000105227.14 | ENSG00000164237.8 | ENSG00000175048.16 | ENSG00000188706.12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA.02.0047.01 | 1.3225 | 4.1604 | 5.8166 | 6.3983 | -1.9942 | 0.7493 | 0.3346 | 0.7321 | 5.7493 | -2.1779 | ... | 4.576 | 2.1013 | 1.2815 | 3.6497 | 3.7614 | 4.6508 | 1.2815 | 4.3618 | 4.9426 | 5.7748 |
| TCGA.02.0055.01 | 2.3135 | 3.6148 | 6.9599 | 4.3356 | 2.9281 | 1.5266 | 0.4016 | 1.1316 | 4.1692 | -3.458 | ... | -3.816 | -6.5064 | -9.9658 | -5.5735 | -3.0469 | -4.035 | 0.2881 | 2.5924 | 2.9488 | 5.6056 |
| TCGA.02.2483.01 | 2.5707 | 3.8729 | 5.9072 | 6.3946 | -1.9379 | 2.2813 | 0.2029 | 0.9419 | 5.3995 | -2.9324 | ... | 3.8391 | 1.2085 | 1.7744 | 3.0428 | 2.727 | 5.3042 | -1.1172 | 3.5523 | 3.345 | 4.836 |
| TCGA.02.2485.01 | 3.3814 | 5.8875 | 9.9433 | 6.2132 | -0.8599 | 1.3051 | 0.0014 | 1.8801 | 6.0637 | -2.4659 | ... | 4.1036 | 1.5661 | 0.5568 | 2.7095 | 4.0019 | 4.809 | 0.9642 | 3.6635 | 3.9468 | 4.5571 |
| TCGA.04.1331.01 | 2.05 | 4.7661 | 8.6119 | 6.6414 | -1.685 | 1.3846 | 0.7664 | 2.4831 | 3.6961 | -3.1714 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 0.5955 | 4.366 | 1.4547 | 5.1486 |


In [11]:
%%time
# Select samples with discrete time survival information associated
df_gene_exp_disc_ft = df_gene_exp.loc[[s.replace("-", ".") for s in Y_surv_disc_ft.index]]

CPU times: user 13.7 ms, sys: 116 µs, total: 13.8 ms
Wall time: 13.3 ms


In [12]:
df_gene_exp_disc_ft.shape

(855, 7509)
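
The selection above relies on every converted sample ID being present in the index of `df_gene_exp`; in recent pandas versions, `.loc` with a list raises a `KeyError` if any label is missing. A small sketch of the ID conversion and a pre-check follows. The hyphenated inputs are an assumption suggested by the `replace` call, `to_expression_id` is a hypothetical helper, and the last ID is made up to show a miss.

```python
def to_expression_id(sample_id):
    """Convert a hyphenated sample ID to the dotted form used in df_gene_exp."""
    return sample_id.replace("-", ".")

# Two dotted IDs taken from the df_gene_exp.head() output above
available = {"TCGA.02.0047.01", "TCGA.02.0055.01"}
# Assumed hyphenated form; the last entry is invented for the example
requested = ["TCGA-02-0047-01", "TCGA-02-0055-01", "TCGA-00-0000-01"]

converted = [to_expression_id(s) for s in requested]
missing = [s for s in converted if s not in available]
print(missing)
```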

Finally, we create the binary class variable to be predicted by the models:

In [13]:
from sklearn.preprocessing import LabelEncoder

# Convert discrete time survival numerical variables into binary variables
Y_surv_disc_class_ft = LabelEncoder().fit_transform(Y_surv_disc_ft)
np.unique(Y_surv_disc_class_ft)

array([0, 1])
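
`LabelEncoder` maps the sorted distinct values of its input to the integers `0 .. n_classes - 1`. The plain-Python sketch below mimics that behaviour; `label_encode` is an illustrative stand-in, not scikit-learn code. Because the event fraction is the same before and after encoding, the discretised labels were presumably already 0/1-valued, so the encoder here mainly casts them to integers.

```python
def label_encode(values):
    """Map each value to the rank of its class among the sorted distinct values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

encoded, classes = label_encode([0.0, 1.0, 0.0, 0.0, 1.0])
print(encoded, classes)
```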

In [14]:
# Event class fraction
sum(Y_surv_disc_class_ft)/len(Y_surv_disc_class_ft)

0.09473684210526316

#### Classical machine learning approach

We use classical ML methods both to reduce dimensionality (feature selection/extraction) and to predict PFI at the fixed time point. Because events make up only about 9.5% of the samples, different resampling techniques are applied to deal with this severe class imbalance in the lung cancer survival data.
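
The ANOVA selector configured in the next cell (`SelectKBest` with `f_classif`) scores each feature with a one-way ANOVA F statistic and keeps the `k` highest-scoring features. The plain-Python sketch below computes that statistic for a single feature; `anova_f` and the toy data are illustrative, not scikit-learn code.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the classes in `labels`."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [0, 0, 0, 1, 1, 1]
feature_a = [1, 2, 3, 7, 8, 9]   # classes well separated
feature_b = [1, 8, 3, 2, 9, 7]   # classes overlap
scores = {"a": anova_f(feature_a, labels), "b": anova_f(feature_b, labels)}
print(scores)
```

A feature whose class means differ strongly relative to its within-class spread (`feature_a`) gets a much larger score than one whose classes overlap (`feature_b`).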

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA, KernelPCA
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from hyperopt import hp

warnings.filterwarnings('ignore')

# Define the training data and the label to be predicted
X = df_gene_exp_disc_ft
# Using the numeric binary class variable
y = Y_surv_disc_class_ft

# Define the scaler
sc = StandardScaler()

# Define the dimensionality reduction techniques along with the hyperparameter space
# Feature selection methods
feat_sel_space = {'dim_reducer__k': hp.choice('k', [150, 200, 250, 300, 350])}
# Using ANOVA
anova = SelectKBest(score_func=f_classif)
# Feature extraction methods
feat_ext_space = {'dim_reducer__n_components': hp.choice('n_components', [150, 200, 250, 300, 350])}
# Using PCA
pca = PCA()
# Using KPCA
kpca = KernelPCA(kernel='rbf')

# Define the classifier along with the hyperparameter space
# Using logistic regression
log_reg = LogisticRegression(solver='saga', random_state=123)
log_reg_space = {
    'clf__penalty': hp.choice('penalty', ['l1', 'l2']), 
    'clf__C': hp.loguniform('C', np.log(1e-4), np.log(1e+3)), 
    'clf__max_iter': hp.choice('max_iter', [1e4, 1e5, 1e6, 1e7])}
# Using SVM
svm = SVC(kernel='rbf', probability=True, random_state=123)
svm_space = {
    'clf__C': hp.loguniform('C', np.log(1e-4), np.log(1e+3)),
    'clf__gamma': hp.loguniform('gamma', np.log(1e-4), np.log(1e+3)),
    'clf__max_iter': hp.choice('max_iter', [1e4, 1e5, 1e6, 1e7])
}
# Using ANN (adjust the number of hidden units depending on the number of selected features)
ann = MLPClassifier(learning_rate='constant', shuffle=True, 
                    tol=1e-5, verbose=False, early_stopping=True, validation_fraction=0.1, max_iter=200, 
                    solver='adam', activation='relu', random_state=123)
ann_space = {'clf__hidden_layer_sizes': hp.choice('hidden_layer_sizes', [(100,), (75,), (50,), (25,)]),
            'clf__alpha': hp.loguniform('alpha', np.log(1e-6), np.log(1e-1)),
            'clf__batch_size': hp.choice('batch_size', [20, 50, 80, 110, 140, 170]),
            'clf__learning_rate_init': hp.loguniform('learning_rate_init', np.log(5e-5), np.log(1e-1))}
# Using Random-Forest
# 'sqrt' is what the deprecated max_features='auto' meant for classifiers
rf = RandomForestClassifier(max_features="sqrt", criterion='gini', bootstrap=True, random_state=123)
rf_space = {'clf__n_estimators': hp.choice('n_estimators', [50, 100, 300, 500, 700]),
            'clf__max_depth': hp.choice('max_depth', [10, 30, 50, 70, 90]),
            'clf__min_samples_split': hp.choice('min_samples_split', [0.05, 0.1, 0.15, 0.2, 0.3]),
            'clf__min_samples_leaf': hp.choice('min_samples_leaf', [0.03, 0.06, 0.1, 0.2])}

# Define estimators hyper-parameter space
log_reg_feat_sel_space = dict(feat_sel_space, **log_reg_space)
log_reg_feat_ext_space = dict(feat_ext_space, **log_reg_space)
svm_feat_sel_space = dict(feat_sel_space, **svm_space)
svm_feat_ext_space = dict(feat_ext_space, **svm_space)
ann_feat_sel_space = dict(feat_sel_space, **ann_space)
ann_feat_ext_space = dict(feat_ext_space, **ann_space)
rf_feat_sel_space = dict(feat_sel_space, **rf_space)
rf_feat_ext_space = dict(feat_ext_space, **rf_space)

# Define dict where estimators Pipelines are stored
estim_pipeline = {}

# Define cross-validation iterator
outer_cv_split = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=23)

# Define evaluation metrics
model_sel_metric = 'auc'
eval_metric = {'auc': 'roc_auc', 
               'acc': make_scorer(opt_accuracy_score, needs_proba=True), 
               'sens': make_scorer(opt_recall_score, needs_proba=True),
               'spec': make_scorer(opt_specificity_score, needs_proba=True),
               'prec': make_scorer(opt_precision_score, needs_proba=True),
               'f1': make_scorer(opt_f1_score, needs_proba=True),
               'mcc': make_scorer(opt_mcc_score, needs_proba=True),
               'thres': make_scorer(opt_threshold_score, needs_proba=True)
              }

# Bayesian-optimization parameters
n = 100
random_state = 666
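
Several of the search spaces above use `hp.loguniform`, which draws `exp(u)` with `u` uniform between the two log-scale bounds, so every order of magnitude is equally likely. A standard-library sketch of that sampling scheme is shown below; `sample_loguniform` is a hypothetical helper that, unlike `hp.loguniform`, takes its bounds on the original scale.

```python
import math
import random

def sample_loguniform(rng, low, high):
    """Draw exp(u) with u ~ Uniform(log(low), log(high))."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
# Same range as the 'C' hyperparameter: 1e-4 to 1e+3 (seven decades)
draws = [sample_loguniform(rng, 1e-4, 1e3) for _ in range(10000)]
below_one = sum(d < 1.0 for d in draws) / len(draws)
print(round(below_one, 2))
```

Four of the seven decades lie below 1, so the printed fraction should be close to 4/7, whereas a plain uniform draw over the same range would almost never return a value below 1.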

In [16]:
from imblearn.pipeline import Pipeline

def create_pipelines(dict_pipe, re_sampler):
    """
    Auxiliary procedure to create/update the Pipeline of every estimator so that it includes a re-sampling method,
    with the goal of dealing with imbalanced datasets. Pipelines are stored in `dict_pipe`, which is modified in place.
    """
    classifiers = {'log_reg': log_reg, 'svm': svm, 'ann': ann, 'rf': rf}
    for name, clf in classifiers.items():
        # Feature selection: ANOVA is applied before scaling
        dict_pipe[name + '_anova_sc_pipe'] = Pipeline([('dim_reducer', anova), ('scaler', sc),
                                                       ('re_sample', re_sampler), ('clf', clf)])
        # Feature extraction: scaling is applied before PCA/KPCA
        for red_name, reducer in (('pca', pca), ('kpca', kpca)):
            dict_pipe[name + '_sc_' + red_name + '_pipe'] = Pipeline([('scaler', sc), ('dim_reducer', reducer),
                                                                      ('re_sample', re_sampler), ('clf', clf)])

    return None


#### Under-sampling

In [17]:
from imblearn.under_sampling import RandomUnderSampler

# Define the re-sampling method along with the hyper-parameter space
rus = RandomUnderSampler(replacement=False, random_state=69)
rus_space = {'re_sample__sampling_strategy': hp.choice('sampling_strategy', [1, 1/2, 1/3, 1/4])}

# Update Pipelines
create_pipelines(dict_pipe=estim_pipeline, re_sampler=rus)
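
With a float `sampling_strategy`, random under-sampling keeps all minority samples and discards majority samples at random until the minority/majority ratio reaches the requested value. The sketch below reimplements that idea in plain Python to show the resulting class counts. `random_under_sample` is a hypothetical stand-in, not the imbalanced-learn implementation. The class counts are those of the full dataset: 81 events among 855 samples, which is the 0.0947 event fraction above.

```python
import random
from collections import Counter

def random_under_sample(labels, ratio, seed=0):
    """Indices kept so that n_minority / n_majority is about `ratio` (label 1 = minority)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), int(len(minority) / ratio))
    kept_majority = rng.sample(majority, n_keep)  # sampling without replacement
    return sorted(minority + kept_majority)

labels = [1] * 81 + [0] * 774
counts_by_ratio = {}
for ratio in (1, 1/2, 1/3, 1/4):
    kept = random_under_sample(labels, ratio)
    counts_by_ratio[ratio] = Counter(labels[i] for i in kept)
    print(ratio, dict(counts_by_ratio[ratio]))
```

Inside the pipelines the sampler only sees the training part of each cross-validation split, so the actual counts are smaller than in this full-dataset illustration.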

##### RF

###### ANOVA

In [18]:
rf_anova_sc_search = HyperoptCV(estimator=estim_pipeline['rf_anova_sc_pipe'], 
                                hyper_space=dict(rf_feat_sel_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [22]:
%%time
best_trial = rf_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [08:04<00:00,  4.17s/it, best loss: 0.30676600581581615]
CPU times: user 32.8 s, sys: 2.2 s, total: 35 s
Wall time: 8min 4s


In [23]:
best_trial['result']['params']

{'clf__max_depth': 50,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.05,
 'clf__n_estimators': 500,
 'dim_reducer__k': 300,
 're_sample__sampling_strategy': 0.3333333333333333}

In [24]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.672398 | 0.693234 | 0.283892 | 0.210725 | 0.196749 | 0.637279 | 0.676102 | 0.270109 |
| std | 0.144455 | 0.065399 | 0.059464 | 0.07151 | 0.068716 | 0.168411 | 0.174136 | 0.073313 |
| min | 0.210526 | 0.510887 | 0.179487 | 0.050075 | 0.100671 | 0.235294 | 0.135484 | 0.114305 |
| 25% | 0.609649 | 0.663508 | 0.250525 | 0.170617 | 0.151834 | 0.5625 | 0.600974 | 0.23193 |
| 50% | 0.684211 | 0.696976 | 0.269281 | 0.195411 | 0.179144 | 0.625 | 0.686112 | 0.269156 |
| 75% | 0.783626 | 0.727923 | 0.309317 | 0.253294 | 0.241848 | 0.75 | 0.791935 | 0.293791 |
| max | 0.888889 | 0.829032 | 0.444444 | 0.382941 | 0.411765 | 0.9375 | 0.935484 | 0.421881 |
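
The `count` of 50 in this table matches the 5 splits × 10 repeats of `outer_cv_split`, and the `best loss` printed by the search agrees with one minus the mean test AUC. The check below reproduces that arithmetic from the printed values. That `HyperoptCV` minimises `1 - mean AUC` is inferred from these numbers, not from its source.

```python
# Values copied from the RF + ANOVA run above
n_splits, n_repeats = 5, 10
mean_test_auc = 0.693234                  # 'mean' row, AUC column (rounded)
reported_best_loss = 0.30676600581581615  # from the progress bar

n_scores = n_splits * n_repeats           # one test score per outer fold
loss_from_auc = 1 - mean_test_auc

print(n_scores)
print(abs(loss_from_auc - reported_best_loss) < 1e-6)
```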


In [26]:
file_path = 'results/rus_rf_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [25]:
rf_sc_pca_search = HyperoptCV(estimator=estim_pipeline['rf_sc_pca_pipe'], 
                                hyper_space=dict(rf_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [None]:
%%time
best_trial = rf_sc_pca_search.model_selection(X, y)

In [31]:
best_trial['result']['params']

{'clf__max_depth': 30,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.1,
 'clf__n_estimators': 700,
 'dim_reducer__n_components': 150,
 're_sample__sampling_strategy': 0.3333333333333333}

In [32]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.681404 | 0.646792 | 0.265345 | 0.182397 | 0.19693 | 0.563088 | 0.693779 | 0.221167 |
| std | 0.161991 | 0.072039 | 0.062905 | 0.080831 | 0.090288 | 0.194815 | 0.196347 | 0.038388 |
| min | 0.146199 | 0.423387 | 0.071429 | -0.009654 | 0.083333 | 0.0625 | 0.064516 | 0.098705 |
| 25% | 0.565789 | 0.601499 | 0.226829 | 0.135929 | 0.141123 | 0.4375 | 0.553226 | 0.200106 |
| 50% | 0.719298 | 0.654637 | 0.256372 | 0.176081 | 0.171241 | 0.5625 | 0.735484 | 0.222818 |
| 75% | 0.795322 | 0.701714 | 0.300489 | 0.232742 | 0.218688 | 0.734375 | 0.832258 | 0.24223 |
| max | 0.900585 | 0.770565 | 0.439024 | 0.378575 | 0.5 | 0.9375 | 0.96129 | 0.299016 |
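
All columns except AUC depend on the decision threshold applied to the predicted probabilities. The `opt_*_score` helpers, and the way they pick the reported `Thres`, come from `bio_dl_utils` and are not shown here. The sketch below therefore only illustrates the standard definitions of these metrics for a given threshold; `threshold_metrics` and the toy data are hypothetical.

```python
import math

def threshold_metrics(y_true, y_prob, threshold):
    """Confusion-matrix metrics after thresholding predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    acc = (tp + tn) / len(y_true)
    return {"acc": acc, "sens": sens, "spec": spec, "prec": prec, "f1": f1, "mcc": mcc}

# Toy imbalanced example: 2 events among 10 samples
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.6, 0.2, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3]
metrics = threshold_metrics(y_true, y_prob, threshold=0.25)
print(metrics)
```

With few positives, even a small number of false positives pulls precision down, which is why precision and F-1 stay low in the tables while sensitivity and specificity are moderate.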


In [33]:
file_path = 'results/rus_rf_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [34]:
rf_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['rf_sc_kpca_pipe'], 
                                hyper_space=dict(rf_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [35]:
%%time
best_trial = rf_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [12:20<00:00,  8.14s/it, best loss: 0.3508594654870746]
CPU times: user 56.7 s, sys: 2.43 s, total: 59.1 s
Wall time: 12min 20s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [36]:
best_trial['result']['params']

{'clf__max_depth': 10,
 'clf__min_samples_leaf': 0.1,
 'clf__min_samples_split': 0.15,
 'clf__n_estimators': 500,
 'dim_reducer__n_components': 150,
 're_sample__sampling_strategy': 0.25}

In [37]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.680351 | 0.649141 | 0.262376 | 0.177802 | 0.18027 | 0.576544 | 0.691233 | 0.167337 |
| std | 0.114697 | 0.062462 | 0.046896 | 0.061862 | 0.053077 | 0.149235 | 0.138904 | 0.013942 |
| min | 0.356725 | 0.511841 | 0.170213 | 0.028292 | 0.102564 | 0.1875 | 0.303226 | 0.133536 |
| 25% | 0.615497 | 0.617036 | 0.231579 | 0.146847 | 0.145138 | 0.477941 | 0.61129 | 0.159147 |
| 50% | 0.690058 | 0.646407 | 0.25641 | 0.175421 | 0.175735 | 0.606618 | 0.705509 | 0.167739 |
| 75% | 0.758772 | 0.691459 | 0.28169 | 0.206263 | 0.195933 | 0.67739 | 0.791935 | 0.17514 |
| max | 0.883041 | 0.801757 | 0.444444 | 0.382941 | 0.4 | 0.875 | 0.954839 | 0.196634 |


In [38]:
file_path = 'results/rus_rf_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### LR

###### ANOVA

In [29]:
log_reg_anova_sc_search = HyperoptCV(estimator=estim_pipeline['log_reg_anova_sc_pipe'], 
                                hyper_space=dict(log_reg_feat_sel_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [30]:
%%time
best_trial = log_reg_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [06:27<00:00,  3.78s/it, best loss: 0.2998064762561916]
CPU times: user 32 s, sys: 2.8 s, total: 34.8 s
Wall time: 6min 27s


In [31]:
best_trial['result']['params']

{'clf__C': 0.0013207106162047982,
 'clf__max_iter': 10000.0,
 'clf__penalty': 'l2',
 'dim_reducer__k': 350,
 're_sample__sampling_strategy': 0.25}

In [32]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.711696 | 0.700194 | 0.300485 | 0.230781 | 0.21607 | 0.614191 | 0.721763 | 0.216991 |
| std | 0.112564 | 0.060311 | 0.062791 | 0.077979 | 0.082899 | 0.158498 | 0.13684 | 0.066727 |
| min | 0.45614 | 0.558065 | 0.196078 | 0.08586 | 0.118812 | 0.25 | 0.422078 | 0.121258 |
| 25% | 0.649123 | 0.663911 | 0.257792 | 0.176071 | 0.165293 | 0.5 | 0.647832 | 0.173209 |
| 50% | 0.71345 | 0.70625 | 0.297769 | 0.230388 | 0.193277 | 0.625 | 0.722581 | 0.210211 |
| 75% | 0.783626 | 0.742344 | 0.339631 | 0.279667 | 0.236264 | 0.738971 | 0.806452 | 0.239766 |
| max | 0.900585 | 0.839919 | 0.45283 | 0.416365 | 0.5 | 0.882353 | 0.96129 | 0.444247 |


In [48]:
file_path = 'results/rus_lr_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [49]:
log_reg_sc_pca_search = HyperoptCV(estimator=estim_pipeline['log_reg_sc_pca_pipe'], 
                                hyper_space=dict(log_reg_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [50]:
%%time
best_trial = log_reg_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [35:13<00:00, 21.04s/it, best loss: 0.30410006407254997]
CPU times: user 1min 1s, sys: 2.3 s, total: 1min 4s
Wall time: 35min 13s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=350, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomUnderSampler(random_s...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [51]:
best_trial['result']['params']

{'clf__C': 0.0001046612729527639,
 'clf__max_iter': 100000.0,
 'clf__penalty': 'l2',
 'dim_reducer__n_components': 150,
 're_sample__sampling_strategy': 0.25}

In [52]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.698596 | 0.6959 | 0.290983 | 0.219282 | 0.19918 | 0.623382 | 0.706444 | 0.238072 |
| std | 0.116062 | 0.055998 | 0.051132 | 0.063231 | 0.053166 | 0.157551 | 0.141379 | 0.036238 |
| min | 0.280702 | 0.560081 | 0.196078 | 0.107077 | 0.109489 | 0.3125 | 0.212903 | 0.147955 |
| 25% | 0.638889 | 0.667816 | 0.250718 | 0.17814 | 0.155523 | 0.529412 | 0.625597 | 0.212248 |
| 50% | 0.707602 | 0.703429 | 0.291404 | 0.216609 | 0.188578 | 0.625 | 0.722581 | 0.239153 |
| 75% | 0.789474 | 0.730444 | 0.32 | 0.264531 | 0.224026 | 0.738971 | 0.8239 | 0.263083 |
| max | 0.871345 | 0.814516 | 0.423077 | 0.375895 | 0.333333 | 0.9375 | 0.929032 | 0.355704 |


In [53]:
file_path = 'results/rus_lr_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [54]:
log_reg_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['log_reg_sc_kpca_pipe'], 
                                hyper_space=dict(log_reg_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [55]:
%%time
best_trial = log_reg_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [11:01<00:00,  5.94s/it, best loss: 0.31659106311146157]
CPU times: user 59 s, sys: 2.66 s, total: 1min 1s
Wall time: 11min 1s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [56]:
best_trial['result']['params']

{'clf__C': 1.6622342338780551,
 'clf__max_iter': 100000.0,
 'clf__penalty': 'l2',
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 0.25}

In [57]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.687836 | 0.683409 | 0.283096 | 0.207468 | 0.193705 | 0.617279 | 0.695207 | 0.209891 |
| std | 0.11174 | 0.062393 | 0.058542 | 0.074935 | 0.061827 | 0.153248 | 0.1351 | 0.025148 |
| min | 0.444444 | 0.520161 | 0.174757 | 0.03453 | 0.103448 | 0.3125 | 0.396104 | 0.169291 |
| 25% | 0.596491 | 0.653327 | 0.245077 | 0.170765 | 0.153846 | 0.507353 | 0.578603 | 0.188077 |
| 50% | 0.687135 | 0.685256 | 0.27101 | 0.196833 | 0.176327 | 0.588235 | 0.69579 | 0.205961 |
| 75% | 0.788012 | 0.727781 | 0.321071 | 0.257018 | 0.238068 | 0.75 | 0.816129 | 0.225822 |
| max | 0.883041 | 0.820161 | 0.473684 | 0.416313 | 0.409091 | 0.882353 | 0.941935 | 0.257685 |


In [58]:
file_path = 'results/rus_lr_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### SVM

###### ANOVA

In [35]:
svm_anova_sc_search = HyperoptCV(estimator=estim_pipeline['svm_anova_sc_pipe'], 
                                hyper_space=dict(svm_feat_sel_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=1)

In [36]:
%%time
best_trial = svm_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [12:51<00:00,  7.07s/it, best loss: 0.29726027625126306]
CPU times: user 1h 15min 21s, sys: 1min 2s, total: 1h 16min 23s
Wall time: 12min 51s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=350,
                                                  score_func=<function f_classif at 0x7f440998f400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      RandomUnderSampler(random_state=69,
                                                         replacement=False,
                                                         sampling_strategy=1...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
 

In [37]:
best_trial['result']['params']

{'clf__C': 0.0025543413087670383,
 'clf__gamma': 0.0002681678316978474,
 'clf__max_iter': 100000.0,
 'dim_reducer__k': 250,
 're_sample__sampling_strategy': 0.3333333333333333}

In [38]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.694737 | 0.70274 | 0.299555 | 0.235347 | 0.219123 | 0.641912 | 0.700246 | 0.305218 |
| std | 0.127505 | 0.061456 | 0.064747 | 0.075949 | 0.098112 | 0.178659 | 0.156429 | 0.135905 |
| min | 0.327485 | 0.567339 | 0.206897 | 0.107077 | 0.116279 | 0.235294 | 0.264516 | 0.10823 |
| 25% | 0.608187 | 0.670665 | 0.255319 | 0.189558 | 0.157728 | 0.529412 | 0.590909 | 0.230638 |
| 50% | 0.690058 | 0.702379 | 0.282642 | 0.221728 | 0.179505 | 0.6875 | 0.693548 | 0.254266 |
| 75% | 0.788012 | 0.744456 | 0.327152 | 0.283981 | 0.239531 | 0.761029 | 0.814516 | 0.335761 |
| max | 0.918129 | 0.857258 | 0.478261 | 0.43339 | 0.6 | 0.9375 | 0.974194 | 0.736252 |


In [39]:
file_path = 'results/rus_svm_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [22]:
svm_sc_pca_search = HyperoptCV(estimator=estim_pipeline['svm_sc_pca_pipe'], 
                                hyper_space=dict(svm_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [23]:
%%time
best_trial = svm_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [25:52<00:00, 17.88s/it, best loss: 0.31929430247172175]
CPU times: user 1min 6s, sys: 2.81 s, total: 1min 9s
Wall time: 25min 52s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=250, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomUnderSampler(random_s...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [24]:
best_trial['result']['params']

{'clf__C': 1.3249043228863913,
 'clf__gamma': 0.00010200925988061845,
 'clf__max_iter': 10000.0,
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 0.5}

In [25]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.70386 | 0.680706 | 0.283962 | 0.211193 | 0.209679 | 0.589118 | 0.715911 | 0.362495 |
| std | 0.127438 | 0.062484 | 0.053664 | 0.0672 | 0.085832 | 0.177201 | 0.156205 | 0.096856 |
| min | 0.245614 | 0.507258 | 0.178344 | 0.042607 | 0.099291 | 0.176471 | 0.180645 | 0.131149 |
| 25% | 0.621345 | 0.64032 | 0.252632 | 0.170278 | 0.160714 | 0.5 | 0.627419 | 0.295567 |
| 50% | 0.716374 | 0.683266 | 0.278674 | 0.202816 | 0.186773 | 0.575368 | 0.719355 | 0.354678 |
| 75% | 0.788012 | 0.730042 | 0.31014 | 0.249751 | 0.22601 | 0.75 | 0.817742 | 0.4 |
| max | 0.906433 | 0.797984 | 0.512821 | 0.461883 | 0.6 | 0.9375 | 0.987013 | 0.629479 |
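
The `best loss` shown in the progress bar appears to equal one minus the mean test AUC over the 50 outer folds. This is an inference from the printed numbers, not something the notebook states: `HyperoptCV` and `model_sel_metric` are defined elsewhere, so the relation below is an assumption. A minimal check with the values printed for this pipeline:

```python
# Values copied from the outputs of the SVM + PCA (under-sampling) run.
best_loss = 0.31929430247172175   # from the progress bar
mean_test_auc = 0.680706          # 'mean' row, AUC column of res.describe()

# Assumption: the search minimises 1 - mean AUC across the outer CV folds.
implied_auc = 1.0 - best_loss
print(round(implied_auc, 6))

# The two values agree up to the 6-digit rounding of the describe() output.
assert abs(implied_auc - mean_test_auc) < 1e-6

# 5 stratified folds repeated 10 times give the 50 rows counted in the table.
n_splits, n_repeats = 5, 10
print(n_splits * n_repeats)
```

The same relation holds for the other runs in this section, for example `1 - 0.31152989846965184` against the mean AUC of `0.68847` for the KPCA variant.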


In [26]:
file_path = 'results/rus_svm_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [22]:
svm_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['svm_sc_kpca_pipe'], 
                                hyper_space=dict(svm_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)
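
The `hyper_space` argument is built with `dict(a, **b)`, which copies the first mapping and then adds the entries of the second. A small sketch with placeholder values follows. The real `svm_feat_ext_space` and `rus_space` hold hyperopt expressions defined earlier in the notebook, so the contents below are illustrative stand-ins only.

```python
# Illustrative stand-ins; the notebook's real spaces contain hyperopt expressions.
model_space = {
    "clf__C": "loguniform(...)",
    "dim_reducer__n_components": [150, 250],
}
resample_space = {"re_sample__sampling_strategy": [1, 1/2, 1/3, 1/4]}

# dict(a, **b) returns a new dict; keys in b override equal keys in a.
# It only works when every key in b is a string, which holds for
# scikit-learn style 'step__param' names.
merged = dict(model_space, **resample_space)
print(sorted(merged))

# The inputs are left untouched, so the same model space can be combined
# with a different re-sampling space later on.
print("re_sample__sampling_strategy" in model_space)
```

Because the merge does not mutate its inputs, the model-specific spaces can be paired with either `rus_space` or `ros_space` without redefinition.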

In [23]:
%%time
best_trial = svm_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [10:35<00:00,  6.08s/it, best loss: 0.31152989846965184]
CPU times: user 58.5 s, sys: 2.8 s, total: 1min 1s
Wall time: 10min 35s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [24]:
best_trial['result']['params']

{'clf__C': 0.009057060972495206,
 'clf__gamma': 0.62408134956178,
 'clf__max_iter': 10000.0,
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 0.3333333333333333}

In [25]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.711696 | 0.68847 | 0.288697 | 0.217802 | 0.205059 | 0.598088 | 0.723681 | 0.249849 |
| std | 0.110536 | 0.067041 | 0.064055 | 0.074227 | 0.063099 | 0.180308 | 0.136772 | 0.055994 |
| min | 0.467836 | 0.50121 | 0.090909 | 0.047865 | 0.121212 | 0.0625 | 0.43871 | 0.165042 |
| 25% | 0.644737 | 0.637601 | 0.256824 | 0.183072 | 0.168931 | 0.477941 | 0.63474 | 0.218272 |
| 50% | 0.704678 | 0.691129 | 0.287187 | 0.220125 | 0.18809 | 0.625 | 0.709677 | 0.240577 |
| 75% | 0.812865 | 0.731754 | 0.31462 | 0.255015 | 0.225333 | 0.6875 | 0.85 | 0.270247 |
| max | 0.883041 | 0.83871 | 0.52381 | 0.479116 | 0.423077 | 0.875 | 0.967742 | 0.520237 |


In [26]:
file_path = 'results/rus_svm_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### ANN

###### ANOVA

In [27]:
ann_anova_sc_search = HyperoptCV(estimator=estim_pipeline['ann_anova_sc_pipe'], 
                                hyper_space=dict(ann_feat_sel_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [28]:
%%time
best_trial = ann_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [03:47<00:00,  2.17s/it, best loss: 0.31728751324576765]
CPU times: user 1min 1s, sys: 2.88 s, total: 1min 4s
Wall time: 3min 47s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=150,
                                                  score_func=<function f_classif at 0x7f47b28e0400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      RandomUnderSampler(random_state=69,
                                                         replacement=False,
                                                         sampling_strategy=0...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
 

In [29]:
best_trial['result']['params']

{'clf__alpha': 0.00015117148382438425,
 'clf__batch_size': 140,
 'clf__hidden_layer_sizes': (25,),
 'clf__learning_rate_init': 0.012822228449225715,
 'dim_reducer__k': 250,
 're_sample__sampling_strategy': 0.3333333333333333}

In [30]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.684561 | 0.682712 | 0.285145 | 0.212896 | 0.199683 | 0.622721 | 0.691137 | 0.313345 |
| std | 0.139495 | 0.065031 | 0.059942 | 0.071412 | 0.067856 | 0.183428 | 0.170055 | 0.202978 |
| min | 0.321637 | 0.539919 | 0.2 | 0.08414 | 0.115385 | 0.235294 | 0.258065 | 0.001407 |
| 25% | 0.583333 | 0.641641 | 0.247432 | 0.16606 | 0.157895 | 0.5 | 0.578603 | 0.141287 |
| 50% | 0.687135 | 0.689315 | 0.273828 | 0.209346 | 0.184524 | 0.625 | 0.693548 | 0.253265 |
| 75% | 0.78655 | 0.726714 | 0.311985 | 0.241631 | 0.228077 | 0.75 | 0.824194 | 0.454065 |
| max | 0.888889 | 0.825806 | 0.434783 | 0.37976 | 0.411765 | 0.9375 | 0.935484 | 0.71042 |


In [31]:
file_path = 'results/rus_ann_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [32]:
ann_sc_pca_search = HyperoptCV(estimator=estim_pipeline['ann_sc_pca_pipe'], 
                                hyper_space=dict(ann_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [33]:
%%time
best_trial = ann_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [31:16<00:00, 16.58s/it, best loss: 0.34365018852115625]
CPU times: user 1min 13s, sys: 3.12 s, total: 1min 16s
Wall time: 31min 16s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=150, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomUnderSampler(random_s...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [34]:
best_trial['result']['params']

{'clf__alpha': 0.020998685122069965,
 'clf__batch_size': 80,
 'clf__hidden_layer_sizes': (50,),
 'clf__learning_rate_init': 0.010261410386712157,
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 0.25}

In [35]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.693333 | 0.65635 | 0.268916 | 0.18584 | 0.193224 | 0.5575 | 0.707426 | 0.2885086 |
| std | 0.149819 | 0.06667 | 0.055234 | 0.067506 | 0.066059 | 0.185139 | 0.182494 | 0.3246532 |
| min | 0.263158 | 0.522984 | 0.166667 | 0.070523 | 0.107143 | 0.1875 | 0.193548 | 3.911994e-07 |
| 25% | 0.640351 | 0.610484 | 0.229823 | 0.134417 | 0.147794 | 0.418199 | 0.636364 | 0.01893532 |
| 50% | 0.719298 | 0.649156 | 0.258342 | 0.181125 | 0.173163 | 0.5625 | 0.732258 | 0.1531315 |
| 75% | 0.812865 | 0.712702 | 0.30303 | 0.230688 | 0.228632 | 0.6875 | 0.859677 | 0.4973549 |
| max | 0.888889 | 0.785081 | 0.391304 | 0.326964 | 0.4 | 0.9375 | 0.941935 | 0.9828282 |
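
The `Thres` column suggests that a decision threshold is tuned on the predicted probabilities for each fold rather than fixed at 0.5, which would explain why the scorers are built with `needs_proba=True`. The helpers `opt_f1_score` and `opt_mcc_score` come from `bio_dl_utils` and are not shown here, so the following is only a hypothetical sketch of a threshold-optimised F1 score, not the notebook's implementation. The function names and the toy data are made up for illustration.

```python
def f1_at_threshold(y_true, y_prob, thr):
    """F1 score when the positive class is predicted for y_prob >= thr."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= thr)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= thr)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < thr)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob):
    """Try every distinct predicted probability as a candidate threshold
    and return (best_f1, threshold). Hypothetical helper."""
    best = (0.0, 1.0)
    for thr in sorted(set(y_prob)):
        score = f1_at_threshold(y_true, y_prob, thr)
        if score > best[0]:
            best = (score, thr)
    return best


# Toy, made-up data: 2 events among 8 samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.12, 0.20, 0.30, 0.45, 0.40, 0.60]
f1, thr = best_f1_threshold(y_true, y_prob)
print(f1, thr)
```

With an event fraction of about 9.5%, an optimised threshold well below 0.5 is unsurprising. The very wide spread of `Thres` in this run (from about 4e-07 to 0.98) indicates that the network's probability outputs are poorly calibrated across folds.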


In [36]:
file_path = 'results/rus_ann_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [37]:
ann_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['ann_sc_kpca_pipe'], 
                                hyper_space=dict(ann_feat_ext_space, **rus_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [38]:
%%time
best_trial = ann_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [09:04<00:00,  5.44s/it, best loss: 0.3297474186155399]
CPU times: user 1min 1s, sys: 2.66 s, total: 1min 3s
Wall time: 9min 4s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [39]:
best_trial['result']['params']

{'clf__alpha': 0.00017780552652702216,
 'clf__batch_size': 20,
 'clf__hidden_layer_sizes': (75,),
 'clf__learning_rate_init': 0.022808100391533744,
 'dim_reducer__n_components': 150,
 're_sample__sampling_strategy': 0.25}

In [40]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.72269 | 0.670253 | 0.288381 | 0.212992 | 0.215446 | 0.559853 | 0.739605 | 0.164094 |
| std | 0.124285 | 0.070582 | 0.069722 | 0.081705 | 0.085251 | 0.176331 | 0.151725 | 0.098807 |
| min | 0.315789 | 0.537051 | 0.193103 | 0.079694 | 0.108527 | 0.1875 | 0.258065 | 0.014521 |
| 25% | 0.666667 | 0.632661 | 0.245077 | 0.16657 | 0.160995 | 0.4375 | 0.670968 | 0.126802 |
| 50% | 0.725146 | 0.676149 | 0.270552 | 0.193833 | 0.185737 | 0.575368 | 0.745161 | 0.142502 |
| 75% | 0.815789 | 0.714415 | 0.311683 | 0.244171 | 0.265476 | 0.6875 | 0.845161 | 0.167078 |
| max | 0.912281 | 0.856048 | 0.571429 | 0.525365 | 0.526316 | 0.875 | 0.974194 | 0.726844 |


In [41]:
file_path = 'results/rus_ann_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

#### Over-sampling

In [21]:
from imblearn.over_sampling import RandomOverSampler

# Define the re-sampling method along with the hyper-parameter space
ros = RandomOverSampler(random_state=69)
ros_space = {'re_sample__sampling_strategy': hp.choice('sampling_strategy', [1, 1/2, 1/3, 1/4])}

# Update Pipelines
create_pipelines(dict_pipe=estim_pipeline, re_sampler=ros)
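
For a binary problem, a float `sampling_strategy` in imbalanced-learn is the desired ratio of minority to majority samples after re-sampling, so `1` balances the classes and `1/4` leaves roughly one event per four non-events. The sketch below imitates that behaviour with the standard library only. It is an illustration, not imbalanced-learn's implementation, and `random_over_sample` is a hypothetical helper. The class counts (81 events, 774 non-events) follow from the event fraction and sample count reported earlier for the full dataset.

```python
import random


def random_over_sample(labels, ratio, seed=69):
    """Duplicate minority-class indices at random (with replacement) until
    n_minority / n_majority reaches `ratio`. Illustrative helper only."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    target = int(ratio * len(majority))
    n_extra = max(0, target - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra


# Class counts matching the full dataset: 81 events, 774 non-events.
labels = [1] * 81 + [0] * 774
for ratio in (1, 1/2, 1/3, 1/4):
    idx = random_over_sample(labels, ratio)
    n_pos = sum(labels[i] for i in idx)
    print(ratio, n_pos, len(idx) - n_pos)
```

Inside an imbalanced-learn pipeline the sampler is applied only when fitting, so the counts seen during cross-validation are those of each training split rather than of the full dataset.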

##### RF

###### ANOVA

In [45]:
rf_anova_sc_search = HyperoptCV(estimator=estim_pipeline['rf_anova_sc_pipe'], 
                                hyper_space=dict(rf_feat_sel_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [46]:
%%time
best_trial = rf_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [12:38<00:00,  6.98s/it, best loss: 0.3077235207866137]
CPU times: user 51.9 s, sys: 2.45 s, total: 54.3 s
Wall time: 12min 38s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=250,
                                                  score_func=<function f_classif at 0x7f47b28e0400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      RandomOverSampler(random_state=69,
                                                        sampling_strategy=1)),
                                     ('clf',
                                      RandomFor...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, ne

In [47]:
best_trial['result']['params']

{'clf__max_depth': 70,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.1,
 'clf__n_estimators': 300,
 'dim_reducer__k': 300,
 're_sample__sampling_strategy': 0.25}

In [48]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.678713 | 0.692276 | 0.278092 | 0.205847 | 0.192439 | 0.626176 | 0.683982 | 0.206607 |
| std | 0.124875 | 0.065255 | 0.049412 | 0.064848 | 0.058235 | 0.185588 | 0.154012 | 0.060886 |
| min | 0.421053 | 0.506048 | 0.164706 | 0.02226 | 0.101449 | 0.1875 | 0.383117 | 0.113508 |
| 25% | 0.589181 | 0.651411 | 0.248067 | 0.176762 | 0.152386 | 0.453125 | 0.581588 | 0.165416 |
| 50% | 0.666667 | 0.6991 | 0.276596 | 0.209314 | 0.180461 | 0.6875 | 0.664516 | 0.187053 |
| 75% | 0.766082 | 0.73256 | 0.304748 | 0.245788 | 0.225 | 0.75 | 0.785484 | 0.258245 |
| max | 0.877193 | 0.816532 | 0.4 | 0.333678 | 0.368421 | 0.882353 | 0.941935 | 0.348811 |


In [49]:
file_path = 'results/ros_rf_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [50]:
rf_sc_pca_search = HyperoptCV(estimator=estim_pipeline['rf_sc_pca_pipe'], 
                                hyper_space=dict(rf_feat_ext_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [51]:
%%time
best_trial = rf_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [38:29<00:00, 14.34s/it, best loss: 0.3517438638211883]
CPU times: user 1min, sys: 2.9 s, total: 1min 3s
Wall time: 38min 29s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=150, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomOverSampler(random_st...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [52]:
best_trial['result']['params']

{'clf__max_depth': 50,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.1,
 'clf__n_estimators': 700,
 'dim_reducer__n_components': 200,
 're_sample__sampling_strategy': 0.3333333333333333}

In [53]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.704912 | 0.648256 | 0.264626 | 0.185152 | 0.199569 | 0.541838 | 0.722174 | 0.185202 |
| std | 0.146541 | 0.079916 | 0.054291 | 0.071247 | 0.084613 | 0.201085 | 0.179328 | 0.036978 |
| min | 0.140351 | 0.441935 | 0.076923 | -0.005505 | 0.093168 | 0.0625 | 0.058065 | 0.066306 |
| 25% | 0.656433 | 0.602502 | 0.242424 | 0.14687 | 0.160858 | 0.418199 | 0.653226 | 0.170097 |
| 50% | 0.725146 | 0.646662 | 0.270209 | 0.187034 | 0.184413 | 0.545956 | 0.735484 | 0.182354 |
| 75% | 0.78655 | 0.703931 | 0.292753 | 0.234298 | 0.210978 | 0.6875 | 0.827922 | 0.20117 |
| max | 0.906433 | 0.803226 | 0.428571 | 0.39086 | 0.545455 | 0.9375 | 0.987013 | 0.2966 |


In [54]:
file_path = 'results/ros_rf_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [55]:
rf_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['rf_sc_kpca_pipe'], 
                                hyper_space=dict(rf_feat_ext_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [57]:
%%time
best_trial = rf_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [21:18<00:00, 10.06s/it, best loss: 0.3402348505384558]
CPU times: user 54.3 s, sys: 2.6 s, total: 56.9 s
Wall time: 21min 18s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [58]:
best_trial['result']['params']

{'clf__max_depth': 90,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.2,
 'clf__n_estimators': 700,
 'dim_reducer__n_components': 150,
 're_sample__sampling_strategy': 1}

In [59]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.692047 | 0.659765 | 0.267242 | 0.187933 | 0.197292 | 0.559706 | 0.706202 | 0.406887 |
| std | 0.148247 | 0.062003 | 0.054997 | 0.071126 | 0.080277 | 0.207221 | 0.182519 | 0.045592 |
| min | 0.315789 | 0.517953 | 0.181818 | 0.045428 | 0.114504 | 0.1875 | 0.251613 | 0.31611 |
| 25% | 0.602339 | 0.624798 | 0.225893 | 0.136705 | 0.142857 | 0.384191 | 0.606452 | 0.375975 |
| 50% | 0.710526 | 0.660685 | 0.271977 | 0.190742 | 0.181831 | 0.5625 | 0.729032 | 0.407045 |
| 75% | 0.830409 | 0.699798 | 0.297917 | 0.225488 | 0.209978 | 0.6875 | 0.866883 | 0.435894 |
| max | 0.912281 | 0.782258 | 0.410256 | 0.373908 | 0.555556 | 0.9375 | 0.974194 | 0.494709 |


In [60]:
file_path = 'results/ros_rf_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### LR

###### ANOVA

In [61]:
log_reg_anova_sc_search = HyperoptCV(estimator=estim_pipeline['log_reg_anova_sc_pipe'], 
                                hyper_space=dict(log_reg_feat_sel_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [62]:
%%time
best_trial = log_reg_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [18:52<00:00,  7.07s/it, best loss: 0.3029773836220706]
CPU times: user 52.4 s, sys: 2.83 s, total: 55.2 s
Wall time: 18min 52s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=200,
                                                  score_func=<function f_classif at 0x7f47b28e0400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      RandomOverSampler(random_state=69,
                                                        sampling_strategy=1)),
                                     ('clf',
                                      LogisticR...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, ne

In [63]:
best_trial['result']['params']

{'clf__C': 0.0008899666034336689,
 'clf__max_iter': 10000000.0,
 'clf__penalty': 'l2',
 'dim_reducer__k': 350,
 're_sample__sampling_strategy': 0.25}

In [64]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.666901 | 0.697023 | 0.282791 | 0.212394 | 0.190323 | 0.65875 | 0.667879 | 0.186245 |
| std | 0.117203 | 0.06082 | 0.052405 | 0.064666 | 0.057811 | 0.152517 | 0.142197 | 0.062628 |
| min | 0.409357 | 0.556855 | 0.20339 | 0.091485 | 0.117117 | 0.235294 | 0.367742 | 0.099132 |
| 25% | 0.592105 | 0.663407 | 0.24734 | 0.163577 | 0.151057 | 0.5625 | 0.575806 | 0.140055 |
| 50% | 0.652047 | 0.698185 | 0.276545 | 0.214265 | 0.173342 | 0.6875 | 0.640783 | 0.166131 |
| 75% | 0.763158 | 0.736593 | 0.314967 | 0.249434 | 0.215675 | 0.75 | 0.778666 | 0.211176 |
| max | 0.877193 | 0.839113 | 0.409091 | 0.369189 | 0.333333 | 0.875 | 0.948052 | 0.400811 |


In [65]:
file_path = 'results/ros_lr_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [66]:
log_reg_sc_pca_search = HyperoptCV(estimator=estim_pipeline['log_reg_sc_pca_pipe'], 
                                hyper_space=dict(log_reg_feat_ext_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [68]:
%%time
best_trial = log_reg_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [1:12:17<00:00, 43.25s/it, best loss: 0.3080619409053945]
CPU times: user 56.9 s, sys: 2.77 s, total: 59.6 s
Wall time: 1h 12min 17s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=350, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomOverSampler(random_st...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [69]:
best_trial['result']['params']

{'clf__C': 0.006178318151879314,
 'clf__max_iter': 100000.0,
 'clf__penalty': 'l1',
 'dim_reducer__n_components': 300,
 're_sample__sampling_strategy': 0.25}

In [70]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.659532 | 0.691938 | 0.279486 | 0.20742 | 0.194953 | 0.648529 | 0.660761 | 0.206646 |
| std | 0.145801 | 0.059406 | 0.051682 | 0.065584 | 0.078777 | 0.169103 | 0.176217 | 0.059609 |
| min | 0.274854 | 0.556452 | 0.183486 | 0.052341 | 0.107527 | 0.25 | 0.206452 | 0.09436 |
| 25% | 0.583333 | 0.661895 | 0.244654 | 0.154249 | 0.14537 | 0.5625 | 0.570968 | 0.173993 |
| 50% | 0.692982 | 0.703629 | 0.280065 | 0.210978 | 0.179744 | 0.6875 | 0.693548 | 0.1991 |
| 75% | 0.752924 | 0.732661 | 0.307692 | 0.257448 | 0.215127 | 0.75 | 0.767742 | 0.229403 |
| max | 0.906433 | 0.791129 | 0.387097 | 0.359313 | 0.555556 | 0.941176 | 0.974026 | 0.392206 |


In [71]:
file_path = 'results/ros_lr_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [72]:
log_reg_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['log_reg_sc_kpca_pipe'], 
                                hyper_space=dict(log_reg_feat_ext_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [73]:
%%time
best_trial = log_reg_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [16:35<00:00,  8.50s/it, best loss: 0.31834321323837456]
CPU times: user 1min 1s, sys: 2.82 s, total: 1min 3s
Wall time: 16min 35s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [74]:
best_trial['result']['params']

{'clf__C': 1.1370513943378124,
 'clf__max_iter': 10000000.0,
 'clf__penalty': 'l2',
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 0.25}

In [75]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

|  | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.634386 | 0.681657 | 0.264821 | 0.188067 | 0.175218 | 0.659779 | 0.631584 | 0.177715 |
| std | 0.138243 | 0.062244 | 0.046519 | 0.063077 | 0.051926 | 0.167882 | 0.167112 | 0.034987 |
| min | 0.269006 | 0.5375 | 0.180328 | 0.044754 | 0.10219 | 0.25 | 0.206452 | 0.102662 |
| 25% | 0.540936 | 0.645238 | 0.233835 | 0.15626 | 0.141484 | 0.5625 | 0.51461 | 0.15472 |
| 50% | 0.654971 | 0.686895 | 0.267662 | 0.188183 | 0.16784 | 0.6875 | 0.660222 | 0.173186 |
| 75% | 0.719298 | 0.72127 | 0.292017 | 0.225445 | 0.19185 | 0.761029 | 0.733871 | 0.195163 |
| max | 0.871345 | 0.803629 | 0.421053 | 0.356339 | 0.363636 | 0.941176 | 0.922581 | 0.27962 |


In [76]:
file_path = 'results/ros_lr_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### SVM

###### ANOVA

In [77]:
svm_anova_sc_search = HyperoptCV(estimator=estim_pipeline['svm_anova_sc_pipe'], 
                                hyper_space=dict(svm_feat_sel_space, **ros_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=1)

In [78]:
%%time
best_trial = svm_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [1:28:07<00:00, 51.06s/it, best loss: 0.2993629032258065]
CPU times: user 2h 54min 33s, sys: 1min 24s, total: 2h 55min 57s
Wall time: 1h 28min 7s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=350,
                                                  score_func=<function f_classif at 0x7f47b28e0400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      RandomOverSampler(random_state=69,
                                                        sampling_strategy=0.3333333333333333))...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer(opt_precision_score,
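
The search above builds its `hyper_space` with `dict(svm_feat_sel_space, **ros_space)`, which merges the model's search space with the re-sampler's. The snippet below is a minimal illustration of that idiom only; `model_space` and `resampler_space` are made-up placeholders, not the spaces defined earlier in this notebook.

```python
# Hypothetical stand-ins for the real search spaces (values are placeholders).
model_space = {"clf__C": "loguniform(...)", "dim_reducer__k": [150, 300, 350]}
resampler_space = {"re_sample__sampling_strategy": [1, 1/2, 1/3, 1/4]}

# dict(a, **b) copies `a`, then adds the entries of `b`.
# Keys of `b` must be strings, and `b` wins if a key appears in both.
merged = dict(model_space, **resampler_space)

# Equivalent spelling with dictionary unpacking.
merged_alt = {**model_space, **resampler_space}

print(sorted(merged))
```

Neither input dictionary is modified, so the same `ros_space` can safely be reused for every pipeline.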

In [79]:
best_trial['result']['params']

{'clf__C': 0.4435748143532496,
 'clf__gamma': 0.00010571037988085505,
 'clf__max_iter': 100000.0,
 'dim_reducer__k': 350,
 're_sample__sampling_strategy': 0.3333333333333333}
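
For a binary problem, a float `sampling_strategy` in imbalanced-learn is the desired minority-to-majority ratio after re-sampling. The sketch below works out roughly how many minority samples that implies for one training split. The helper `n_synthetic_needed` is hypothetical, and the class counts are an approximation derived from the 855 samples and 81 events reported earlier, not values read from the actual folds.

```python
def n_synthetic_needed(n_minority, n_majority, ratio):
    """Minority samples to add so that minority/majority is about `ratio`."""
    return max(0, int(n_majority * ratio - n_minority))

# Approximate training part of one stratified 5-fold split
# (855 samples, 81 events -> about 65 events and 619 non-events in training).
n_pos, n_neg = 65, 619

for ratio in (1/4, 1/3, 1/2, 1):
    extra = n_synthetic_needed(n_pos, n_neg, ratio)
    print(f"ratio={ratio:.3f}: add {extra}, minority becomes {n_pos + extra}")
```

With the selected ratio of 1/3, the minority class in such a split would grow from about 65 to about 206 samples.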

In [80]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.715789,0.700637,0.30357,0.234896,0.222296,0.611912,0.726801,0.256317
std,0.108785,0.06176,0.06545,0.079905,0.091334,0.155198,0.13223,0.148631
min,0.48538,0.568145,0.197531,0.079348,0.123077,0.235294,0.458065,0.106409
25%,0.643275,0.666734,0.262085,0.195215,0.161949,0.5,0.640323,0.156302
50%,0.69883,0.702623,0.298833,0.230788,0.195054,0.625,0.709677,0.197198
75%,0.811404,0.738609,0.325979,0.265348,0.252914,0.738971,0.835484,0.324671
max,0.900585,0.84879,0.473684,0.416313,0.5,0.875,0.96129,0.720578
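
The `Thres` column, together with scorers such as `opt_mcc_score` built with `needs_proba=True`, suggests that a decision threshold is chosen from the predicted probabilities instead of being fixed at 0.5. The implementation of those helpers is not shown in this notebook, so the following is only an illustrative sketch of one possible approach: scan candidate cut-offs and keep the one with the highest MCC. The names `mcc` and `best_mcc_threshold` and the toy data are assumptions.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when a marginal total is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_mcc_threshold(y_true, y_prob):
    """Try each distinct predicted probability as cut-off and keep the best MCC."""
    best_t, best_score = 0.5, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        score = mcc(tp, fp, tn, fn)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy, made-up data with a rare positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.40]
threshold, score = best_mcc_threshold(y_true, y_prob)
print(threshold, round(score, 3))
```

On imbalanced data the best cut-off typically falls well below 0.5, which is in line with the mean thresholds of roughly 0.2 to 0.3 reported in the tables of this section.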


In [81]:
file_path = 'results/ros_svm_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [82]:
# Search SVM hyper-parameters, the number of PCA components and the over-sampling ratio
svm_sc_pca_search = HyperoptCV(
    estimator=estim_pipeline['svm_sc_pca_pipe'],
    hyper_space=dict(svm_feat_ext_space, **ros_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [83]:
%%time
best_trial = svm_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [47:28<00:00, 35.16s/it, best loss: 0.3161337884127259] 
CPU times: user 59.9 s, sys: 3.13 s, total: 1min 3s
Wall time: 47min 28s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=300, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomOverSampler(random_st...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [84]:
best_trial['result']['params']

{'clf__C': 0.0394993205122062,
 'clf__gamma': 0.00010007342141313602,
 'clf__max_iter': 1000000.0,
 'dim_reducer__n_components': 300,
 're_sample__sampling_strategy': 0.5}

In [85]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.676374,0.683866,0.278871,0.202346,0.186766,0.628824,0.681411,0.182694
std,0.113891,0.065804,0.056964,0.075991,0.05355,0.149137,0.136457,0.118921
min,0.350877,0.539113,0.177778,0.03778,0.10084,0.3125,0.309677,0.03601
25%,0.608187,0.644355,0.24,0.157296,0.146315,0.537684,0.584835,0.112658
50%,0.684211,0.686081,0.26748,0.202366,0.165612,0.625,0.683871,0.142428
75%,0.770468,0.721371,0.314967,0.258212,0.223245,0.75,0.787097,0.234987
max,0.853801,0.812903,0.392857,0.348495,0.32,0.882353,0.890323,0.560509
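
Comparing the progress-bar output with the summary tables, the reported `best loss` matches one minus the mean test AUC over the 50 folds for both SVM runs so far. This is an observation from the printed numbers, not a statement about how `HyperoptCV` is implemented. A quick arithmetic check:

```python
# (best loss from the progress bar, mean AUC from res.describe()), as printed above.
runs = {
    "svm_anova": (0.2993629032258065, 0.700637),
    "svm_pca": (0.3161337884127259, 0.683866),
}

gaps = {}
for name, (best_loss, mean_auc) in runs.items():
    # The table rounds the mean AUC to six decimals, so allow a tiny gap.
    gaps[name] = abs((1.0 - mean_auc) - best_loss)
    print(f"{name}: 1 - mean AUC = {1.0 - mean_auc:.6f}, best loss = {best_loss:.6f}")
```

Read this way, a lower loss in the progress bar corresponds directly to a higher cross-validated AUC.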


In [86]:
file_path = 'results/ros_svm_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [87]:
# Search SVM hyper-parameters, the number of kernel-PCA components and the over-sampling ratio
svm_sc_kpca_search = HyperoptCV(
    estimator=estim_pipeline['svm_sc_kpca_pipe'],
    hyper_space=dict(svm_feat_ext_space, **ros_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [88]:
%%time
best_trial = svm_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [24:27<00:00, 13.06s/it, best loss: 0.31165946671923916]
CPU times: user 55 s, sys: 2.86 s, total: 57.9 s
Wall time: 24min 27s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [89]:
best_trial['result']['params']

{'clf__C': 2.6843225952993026,
 'clf__gamma': 0.05211313725687281,
 'clf__max_iter': 10000000.0,
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 0.5}

In [90]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.664327,0.688341,0.273912,0.20072,0.181809,0.645294,0.666227,0.229302
std,0.115046,0.067301,0.051489,0.066784,0.048442,0.168983,0.141064,0.090872
min,0.403509,0.525806,0.181818,0.058345,0.115942,0.25,0.354839,0.108248
25%,0.590643,0.64712,0.238934,0.15765,0.151772,0.5625,0.574194,0.168176
50%,0.681287,0.69496,0.266194,0.196039,0.16784,0.6875,0.687097,0.213813
75%,0.730994,0.726512,0.303138,0.250766,0.203784,0.75,0.748387,0.257168
max,0.853801,0.83629,0.418605,0.356499,0.333333,0.882353,0.915584,0.585677
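
Several entries in the table above take "round" values, for example a minimum sensitivity of 0.25 and a maximum of 0.882353. This follows from the small number of events per test fold. The sketch below derives the fold composition from the 855 samples and 81 events reported earlier, assuming the stratified folds are as balanced as possible.

```python
from fractions import Fraction

n_samples, n_events, n_splits = 855, 81, 5

fold_size = n_samples // n_splits        # test samples per fold
events_lo = n_events // n_splits         # fewest events in a test fold
events_hi = -(-n_events // n_splits)     # most events in a test fold (ceiling division)

print(fold_size, events_lo, events_hi)

# Sensitivity can only take values k / (events in the fold), so it moves in coarse steps.
print(float(Fraction(4, events_lo)))     # 4 of 16 events detected
print(float(Fraction(15, events_hi)))    # 15 of 17 events detected
```

With only 16 or 17 events per fold, one additional detected event shifts sensitivity by about 6 percentage points, which explains much of the fold-to-fold spread in `Sens`, `Prec` and `F-1`.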


In [91]:
file_path = 'results/ros_svm_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### ANN

###### ANOVA

In [22]:
# Search ANN hyper-parameters, the number of ANOVA-selected features and the over-sampling ratio
ann_anova_sc_search = HyperoptCV(
    estimator=estim_pipeline['ann_anova_sc_pipe'],
    hyper_space=dict(ann_feat_sel_space, **ros_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [25]:
%%time
best_trial = ann_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [08:57<00:00,  4.62s/it, best loss: 0.33191271347248563]
CPU times: user 1min 2s, sys: 3.07 s, total: 1min 5s
Wall time: 8min 57s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=300,
                                                  score_func=<function f_classif at 0x7f3c33d89400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      RandomOverSampler(random_state=69,
                                                        sampling_strategy=0.3333333333333333))...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer(opt_precision_score,

In [26]:
best_trial['result']['params']

{'clf__alpha': 0.0003170875375418915,
 'clf__batch_size': 20,
 'clf__hidden_layer_sizes': (50,),
 'clf__learning_rate_init': 8.855940259821627e-05,
 'dim_reducer__k': 300,
 're_sample__sampling_strategy': 0.3333333333333333}

In [27]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.694854,0.668087,0.278055,0.201227,0.209103,0.573309,0.707667,0.238906
std,0.152389,0.071172,0.058817,0.073657,0.085227,0.194114,0.185498,0.153885
min,0.315789,0.546371,0.188406,0.070383,0.106557,0.1875,0.258065,0.013625
25%,0.581871,0.614748,0.229885,0.146441,0.141484,0.445772,0.570475,0.109145
50%,0.72807,0.661867,0.262626,0.196429,0.186147,0.575368,0.754839,0.212967
75%,0.823099,0.713306,0.321935,0.255009,0.248106,0.701287,0.857834,0.343992
max,0.906433,0.844355,0.439024,0.378575,0.5,0.875,0.967742,0.640787


In [28]:
file_path = 'results/ros_ann_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)
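
The cell that turns `scores` into a table is repeated unchanged for every model. One way to factor it out is a small helper that maps column labels to score keys. This is only a sketch: `COLUMNS` and `collect_scores` are hypothetical names, and `scores` is assumed to be a mapping with the `test_*` keys used in the cells above.

```python
# Column label -> key in the `scores` mapping, in the order used by the cells above.
COLUMNS = {
    "AUC": "test_auc", "ACC": "test_acc", "Sens": "test_sens", "Spec": "test_spec",
    "Prec": "test_prec", "F-1": "test_f1", "MCC": "test_mcc", "Thres": "test_thres",
}

def collect_scores(scores):
    """Return {column label: per-fold values}, ready to pass to pd.DataFrame."""
    return {label: list(scores[key]) for label, key in COLUMNS.items()}

# Made-up two-fold scores, only to show the call.
fake_scores = {key: [0.1, 0.2] for key in COLUMNS.values()}
table = collect_scores(fake_scores)
print(list(table))
```

In the notebook this would be used as `res = pd.DataFrame(collect_scores(best_trial['result']['score']))`.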

###### PCA

In [29]:
# Search ANN hyper-parameters, the number of PCA components and the over-sampling ratio
ann_sc_pca_search = HyperoptCV(
    estimator=estim_pipeline['ann_sc_pca_pipe'],
    hyper_space=dict(ann_feat_ext_space, **ros_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [30]:
%%time
best_trial = ann_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [30:09<00:00, 19.10s/it, best loss: 0.36149737548978533]
CPU times: user 1min 9s, sys: 2.98 s, total: 1min 12s
Wall time: 30min 9s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=200, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      RandomOverSampler(random_st...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_scorer

In [31]:
best_trial['result']['params']

{'clf__alpha': 0.04837385908790528,
 'clf__batch_size': 20,
 'clf__hidden_layer_sizes': (50,),
 'clf__learning_rate_init': 0.001726574572895734,
 'dim_reducer__n_components': 200,
 're_sample__sampling_strategy': 1}

In [32]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.634152,0.638503,0.255073,0.16843,0.173336,0.613309,0.636297,0.058079
std,0.166901,0.070931,0.058409,0.075432,0.060296,0.186234,0.200845,0.110091
min,0.251462,0.499194,0.179104,0.053904,0.105634,0.1875,0.180645,2.4e-05
25%,0.50731,0.590625,0.208448,0.10935,0.128205,0.5,0.48628,0.002635
50%,0.669591,0.629032,0.241582,0.153259,0.159365,0.625,0.680645,0.012406
75%,0.782164,0.678411,0.277083,0.206888,0.210714,0.761029,0.81129,0.071231
max,0.883041,0.832315,0.412698,0.371403,0.318182,0.941176,0.954839,0.55151


In [33]:
file_path = 'results/ros_ann_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [34]:
# Search ANN hyper-parameters, the number of kernel-PCA components and the over-sampling ratio
ann_sc_kpca_search = HyperoptCV(
    estimator=estim_pipeline['ann_sc_kpca_pipe'],
    hyper_space=dict(ann_feat_ext_space, **ros_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [35]:
%%time
best_trial = ann_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [13:38<00:00, 10.68s/it, best loss: 0.3265333916557823]
CPU times: user 1min 5s, sys: 2.91 s, total: 1min 8s
Wall time: 13min 38s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [36]:
best_trial['result']['params']

{'clf__alpha': 6.842149165139765e-05,
 'clf__batch_size': 110,
 'clf__hidden_layer_sizes': (100,),
 'clf__learning_rate_init': 0.00018101273539469022,
 'dim_reducer__n_components': 250,
 're_sample__sampling_strategy': 1}

In [37]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.71345,0.673467,0.277433,0.201924,0.201109,0.567574,0.728622,0.46085
std,0.113832,0.07078,0.056465,0.067613,0.06611,0.1831,0.140504,0.040286
min,0.339181,0.504435,0.111111,0.086497,0.112,0.0625,0.283871,0.314326
25%,0.654971,0.623555,0.244558,0.157284,0.164311,0.4375,0.642993,0.443136
50%,0.730994,0.673178,0.273619,0.193052,0.187953,0.5625,0.754043,0.470019
75%,0.799708,0.721304,0.311765,0.24677,0.21841,0.6875,0.836824,0.485911
max,0.906433,0.827016,0.416667,0.360702,0.5,0.875,0.993548,0.562949


In [38]:
file_path = 'results/ros_ann_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)
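
Each run saves its per-fold scores to a CSV file with an `AUC` column, so the runs can be ranked afterwards without repeating the search. The sketch below uses only the standard library and temporary demo files with made-up values; `mean_auc_by_file` is a hypothetical helper, and pandas would work equally well.

```python
import csv
import os
import tempfile
from statistics import mean

def mean_auc_by_file(paths):
    """Return [(path, mean AUC)] sorted from best to worst."""
    rows = []
    for path in paths:
        with open(path, newline="") as handle:
            aucs = [float(row["AUC"]) for row in csv.DictReader(handle)]
        rows.append((path, mean(aucs)))
    return sorted(rows, key=lambda item: item[1], reverse=True)

# Demo with two temporary files holding invented AUC values.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"a.csv": [0.70, 0.72], "b.csv": [0.65, 0.66]}
    paths = []
    for name, aucs in demo.items():
        path = os.path.join(tmp, name)
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["AUC"])
            writer.writerows([[a] for a in aucs])
        paths.append(path)
    ranking = mean_auc_by_file(paths)

print([os.path.basename(p) for p, _ in ranking])
```

For the real comparison, pass the `results/..._disc_pfi_lung_ml_100_iter_rep_kfold.csv` paths written by the cells above.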

#### SMOTE

In [39]:
from imblearn.over_sampling import SMOTE

# Define the re-sampling method along with the hyper-parameter space
smo = SMOTE(random_state=69)
smo_space = {'re_sample__sampling_strategy': hp.choice('sampling_strategy', [1, 1/2, 1/3, 1/4]),
             're_sample__k_neighbors': hp.choice('k_neighbors', [3, 5, 7, 9])}

# Update Pipelines
create_pipelines(dict_pipe=estim_pipeline, re_sampler=smo)
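
Unlike random over-sampling, which duplicates existing minority samples, SMOTE creates new ones by interpolating between a minority sample and one of its `k_neighbors` nearest minority neighbours. The pure-Python sketch below shows that single interpolation step on toy 2-D points. It is a simplified illustration, not imbalanced-learn's implementation, and `smote_point` is a hypothetical name.

```python
import math
import random

def smote_point(sample, minority, k, rng):
    """Interpolate between `sample` and one of its k nearest minority neighbours."""
    others = [m for m in minority if m is not sample]
    neighbours = sorted(others, key=lambda m: math.dist(sample, m))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(69)
# Toy minority class: three nearby points and one distant point.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
new_point = smote_point(minority[0], minority, k=2, rng=rng)
print(new_point)
```

With `k=2` the distant point `[5.0, 5.0]` is never used as a neighbour, so the synthetic sample stays close to the local cluster. Larger `k_neighbors` values, as explored in `smo_space`, allow interpolation towards more distant minority samples.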

##### RF

###### ANOVA

In [40]:
# Search RF hyper-parameters, the number of ANOVA-selected features and the SMOTE settings
rf_anova_sc_search = HyperoptCV(
    estimator=estim_pipeline['rf_anova_sc_pipe'],
    hyper_space=dict(rf_feat_sel_space, **smo_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [41]:
%%time
best_trial = rf_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [18:17<00:00, 14.21s/it, best loss: 0.310500301880283] 
CPU times: user 55.7 s, sys: 2.49 s, total: 58.2 s
Wall time: 18min 17s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=150,
                                                  score_func=<function f_classif at 0x7f3c33d89400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=3, n_jobs=None,
                                            random_state=69,
                                            sampling_strategy=0.25))...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec'

In [42]:
best_trial['result']['params']

{'clf__max_depth': 90,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.05,
 'clf__n_estimators': 700,
 'dim_reducer__k': 350,
 're_sample__k_neighbors': 3,
 're_sample__sampling_strategy': 0.25}

In [43]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.671579,0.6895,0.275097,0.201249,0.188489,0.630074,0.676019,0.197537
std,0.11975,0.062129,0.049534,0.0627,0.05901,0.167768,0.146919,0.053716
min,0.432749,0.543548,0.184615,0.055906,0.106796,0.235294,0.406452,0.124027
25%,0.562865,0.658669,0.241474,0.165868,0.145138,0.5625,0.548387,0.154961
50%,0.684211,0.697177,0.276393,0.202587,0.179563,0.6875,0.683871,0.185999
75%,0.75731,0.728831,0.293034,0.231421,0.203061,0.75,0.767742,0.223617
max,0.877193,0.821371,0.409091,0.338513,0.368421,0.882353,0.948052,0.32912


In [44]:
file_path = 'results/smo_rf_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [45]:
# Search RF hyper-parameters, the number of PCA components and the SMOTE settings
rf_sc_pca_search = HyperoptCV(
    estimator=estim_pipeline['rf_sc_pca_pipe'],
    hyper_space=dict(rf_feat_ext_space, **smo_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [46]:
%%time
best_trial = rf_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [35:24<00:00, 28.37s/it, best loss: 0.3501951009142661]
CPU times: user 1min 9s, sys: 3.07 s, total: 1min 12s
Wall time: 35min 24s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=250, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=9, n_jobs=...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_score

In [47]:
best_trial['result']['params']

{'clf__max_depth': 70,
 'clf__min_samples_leaf': 0.03,
 'clf__min_samples_split': 0.05,
 'clf__n_estimators': 700,
 'dim_reducer__n_components': 150,
 're_sample__k_neighbors': 9,
 're_sample__sampling_strategy': 0.25}

In [48]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.692047,0.649805,0.265353,0.180719,0.193532,0.551618,0.706652,0.192657
std,0.130029,0.056493,0.057354,0.075037,0.082245,0.166671,0.157921,0.02766
min,0.374269,0.541129,0.166667,0.040562,0.110092,0.1875,0.318182,0.127043
25%,0.615497,0.614869,0.225893,0.128284,0.142857,0.4375,0.606452,0.176189
50%,0.701754,0.645993,0.253466,0.167259,0.168239,0.5625,0.718454,0.190891
75%,0.782164,0.683972,0.294949,0.227942,0.21022,0.6875,0.804839,0.204809
max,0.906433,0.805242,0.421053,0.356339,0.5,0.882353,0.974194,0.280693


In [49]:
file_path = 'results/smo_rf_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [50]:
# Search RF hyper-parameters, the number of kernel-PCA components and the SMOTE settings
rf_sc_kpca_search = HyperoptCV(
    estimator=estim_pipeline['rf_sc_kpca_pipe'],
    hyper_space=dict(rf_feat_ext_space, **smo_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [51]:
%%time
best_trial = rf_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [22:56<00:00, 10.86s/it, best loss: 0.3350161783188349]
CPU times: user 56.3 s, sys: 2.59 s, total: 58.9 s
Wall time: 22min 56s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [52]:
best_trial['result']['params']

{'clf__max_depth': 70,
 'clf__min_samples_leaf': 0.06,
 'clf__min_samples_split': 0.05,
 'clf__n_estimators': 700,
 'dim_reducer__n_components': 150,
 're_sample__k_neighbors': 9,
 're_sample__sampling_strategy': 0.3333333333333333}

In [53]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.67345,0.664984,0.272772,0.195398,0.196024,0.604706,0.680834,0.244622
std,0.144058,0.061112,0.050089,0.065457,0.087175,0.167204,0.174234,0.031487
min,0.292398,0.525,0.198675,0.074083,0.111111,0.1875,0.225806,0.175443
25%,0.599415,0.618944,0.238812,0.151921,0.147528,0.5,0.58871,0.227512
50%,0.692982,0.673992,0.262821,0.191908,0.176242,0.625,0.695769,0.245021
75%,0.776316,0.711794,0.312297,0.239371,0.233865,0.6875,0.793548,0.262056
max,0.912281,0.779985,0.4,0.36151,0.666667,0.9375,0.987013,0.322776


In [54]:
file_path = 'results/smo_rf_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### LR

###### ANOVA

In [55]:
# Search LR hyper-parameters, the number of ANOVA-selected features and the SMOTE settings
log_reg_anova_sc_search = HyperoptCV(
    estimator=estim_pipeline['log_reg_anova_sc_pipe'],
    hyper_space=dict(log_reg_feat_sel_space, **smo_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [56]:
%%time
best_trial = log_reg_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [16:44<00:00, 17.14s/it, best loss: 0.30422328051455183]
CPU times: user 52.2 s, sys: 2.64 s, total: 54.8 s
Wall time: 16min 44s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=300,
                                                  score_func=<function f_classif at 0x7f3c33d89400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=3, n_jobs=None,
                                            random_state=69,
                                            sampling_strategy=0.5)),...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec'

In [57]:
best_trial['result']['params']

{'clf__C': 0.0005736647676464367,
 'clf__max_iter': 10000000.0,
 'clf__penalty': 'l2',
 'dim_reducer__k': 300,
 're_sample__k_neighbors': 7,
 're_sample__sampling_strategy': 0.25}

In [58]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.689123,0.695777,0.285947,0.214618,0.196055,0.631103,0.695238,0.201226
std,0.103359,0.060969,0.050094,0.063678,0.059634,0.152144,0.126582,0.055301
min,0.415205,0.537903,0.193548,0.078875,0.111111,0.235294,0.380645,0.119364
25%,0.602339,0.657157,0.258065,0.175988,0.160494,0.537684,0.580645,0.162387
50%,0.704678,0.705691,0.276759,0.215871,0.178632,0.625,0.706452,0.191742
75%,0.758772,0.734778,0.31184,0.251084,0.219322,0.75,0.779032,0.220714
max,0.883041,0.815726,0.421053,0.356339,0.368421,0.882353,0.954545,0.397979


In [59]:
file_path = 'results/smo_lr_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [60]:
# Search LR hyper-parameters, the number of PCA components and the SMOTE settings
log_reg_sc_pca_search = HyperoptCV(
    estimator=estim_pipeline['log_reg_sc_pca_pipe'],
    hyper_space=dict(log_reg_feat_ext_space, **smo_space), cv=outer_cv_split,
    scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n,
    random_seed=random_state, parallel=8)

In [61]:
%%time
best_trial = log_reg_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [1:02:10<00:00, 24.40s/it, best loss: 0.31918762783705856]
CPU times: user 56.3 s, sys: 2.39 s, total: 58.7 s
Wall time: 1h 2min 10s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=150, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=7, n_jobs=...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_score

In [62]:
best_trial['result']['params']

{'clf__C': 0.007568457206222195,
 'clf__max_iter': 10000000.0,
 'clf__penalty': 'l1',
 'dim_reducer__n_components': 150,
 're_sample__k_neighbors': 7,
 're_sample__sampling_strategy': 0.3333333333333333}

In [63]:
# Per-fold test scores of the best trial (5 folds x 10 repeats = 50 rows)
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'],
                    'ACC': scores['test_acc'],
                    'Sens': scores['test_sens'],
                    'Spec': scores['test_spec'],
                    'Prec': scores['test_prec'],
                    'F-1': scores['test_f1'],
                    'MCC': scores['test_mcc'],
                    'Thres': scores['test_thres']})
res.describe()

statistic,ACC,AUC,F-1,MCC,Prec,Sens,Spec,Thres
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.677193,0.680812,0.281449,0.204892,0.19473,0.619632,0.683169,0.259168
std,0.144974,0.059027,0.056382,0.06943,0.065726,0.165829,0.174664,0.074565
min,0.245614,0.537097,0.188679,0.087904,0.104895,0.3125,0.174194,0.089232
25%,0.596491,0.637555,0.236689,0.15056,0.142534,0.515625,0.594512,0.215077
50%,0.710526,0.686895,0.269967,0.196367,0.18809,0.625,0.709677,0.25377
75%,0.783626,0.721774,0.309148,0.249468,0.216615,0.701287,0.816862,0.30254
max,0.894737,0.794118,0.4375,0.379435,0.4375,0.9375,0.941935,0.425802
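
The `Thres` column suggests that each fold's F-1, MCC and related scores are computed at a tuned probability threshold rather than at 0.5. `opt_f1_score` lives in `bio_dl_utils` and its implementation is not shown here, so the following is only a sketch of one plausible approach: try every predicted probability as a cut-off and keep the one with the highest F-1. The function names and the toy data are hypothetical.

```python
def f1_at_threshold(y_true, y_prob, thres):
    """F-1 when samples with probability >= thres are predicted as events."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= thres)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= thres)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < thres)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, y_prob):
    """Return (threshold, F-1) for the candidate cut-off with the highest F-1."""
    candidates = sorted(set(y_prob))
    scored = [(f1_at_threshold(y_true, y_prob, t), t) for t in candidates]
    # max() keeps the first maximum, i.e. the lowest threshold among ties.
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1

# Toy, imbalanced example (hypothetical labels and predicted probabilities).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

thres, f1 = best_f1_threshold(y_true, y_prob)
print(f"best threshold = {thres}, F-1 = {f1:.3f}")
```

On this toy data the best cut-off is 0.35, well below 0.5, which mirrors the low mean thresholds reported in the table for this imbalanced task.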


In [64]:
file_path = 'results/smo_lr_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [65]:
log_reg_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['log_reg_sc_kpca_pipe'], 
                                hyper_space=dict(log_reg_feat_ext_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [66]:
%%time
best_trial = log_reg_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [12:54<00:00,  7.99s/it, best loss: 0.321220883954755]
CPU times: user 58.8 s, sys: 2.62 s, total: 1min 1s
Wall time: 12min 54s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [67]:
best_trial['result']['params']

{'clf__C': 0.5951167885875537,
 'clf__max_iter': 100000.0,
 'clf__penalty': 'l2',
 'dim_reducer__n_components': 150,
 're_sample__k_neighbors': 3,
 're_sample__sampling_strategy': 0.25}

In [68]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.625965 | 0.678779 | 0.255769 | 0.180033 | 0.166364 | 0.665809 | 0.621883 | 0.190615 |
| std | 0.111984 | 0.058144 | 0.038083 | 0.051047 | 0.04074 | 0.162393 | 0.13773 | 0.025509 |
| min | 0.421053 | 0.560081 | 0.178571 | 0.059641 | 0.114583 | 0.125 | 0.374194 | 0.146547 |
| 25% | 0.533626 | 0.642475 | 0.233091 | 0.155523 | 0.138907 | 0.5625 | 0.51129 | 0.171922 |
| 50% | 0.622807 | 0.686304 | 0.262461 | 0.18751 | 0.161395 | 0.6875 | 0.619355 | 0.185441 |
| 75% | 0.716374 | 0.714513 | 0.276841 | 0.206806 | 0.179444 | 0.764706 | 0.732258 | 0.202731 |
| max | 0.894737 | 0.789919 | 0.357143 | 0.285195 | 0.333333 | 0.875 | 0.974194 | 0.288389 |
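
The `best loss` shown in the progress bar appears to be one minus the mean test AUC over the 50 folds, which would be consistent with AUC being the model-selection metric. This is an inference from the printed numbers, since the internals of `HyperoptCV` are not shown. A quick check using values copied from this notebook's outputs:

```python
# (best loss from the progress bar, mean test AUC from res.describe())
reported = {
    "LR + KPCA":   (0.321220883954755, 0.678779),
    "SVM + ANOVA": (0.2965335949629119, 0.703466),
    "SVM + PCA":   (0.3263955309396487, 0.673604),
    "SVM + KPCA":  (0.2992618534217205, 0.700738),
    "ANN + ANOVA": (0.3269799033983094, 0.673020),
    "ANN + PCA":   (0.35255223144976466, 0.647448),
    "ANN + KPCA":  (0.3404023756130017, 0.659598),
}

# Gap between (1 - mean AUC) and the reported loss; the displayed means are
# rounded to six decimals, so gaps below 1e-6 count as agreement.
gaps = {name: abs((1.0 - auc) - loss) for name, (loss, auc) in reported.items()}

for name, (loss, auc) in reported.items():
    print(f"{name:12s} 1 - AUC = {1 - auc:.6f}  loss = {loss:.6f}  gap = {gaps[name]:.1e}")
```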


In [69]:
file_path = 'results/smo_lr_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### SVM

###### ANOVA

In [70]:
svm_anova_sc_search = HyperoptCV(estimator=estim_pipeline['svm_anova_sc_pipe'], 
                                hyper_space=dict(svm_feat_sel_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=1)

In [71]:
%%time
best_trial = svm_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [1:23:18<00:00, 66.77s/it, best loss: 0.2965335949629119]
CPU times: user 2h 49min 59s, sys: 1min 24s, total: 2h 51min 24s
Wall time: 1h 23min 18s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=300,
                                                  score_func=<function f_classif at 0x7f3c33d89400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=7, n_jobs=None,
                                            random_state=69,
                                            sampling_strategy=1)),
                                     ('...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_p

In [72]:
best_trial['result']['params']

{'clf__C': 1.3027563859589923,
 'clf__gamma': 0.00010053145282036389,
 'clf__max_iter': 10000.0,
 'dim_reducer__k': 350,
 're_sample__k_neighbors': 7,
 're_sample__sampling_strategy': 0.3333333333333333}

In [73]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.691579 | 0.703466 | 0.293708 | 0.226564 | 0.206938 | 0.642574 | 0.696812 | 0.227405 |
| std | 0.108159 | 0.060444 | 0.054278 | 0.066669 | 0.077977 | 0.151602 | 0.132341 | 0.161551 |
| min | 0.467836 | 0.57621 | 0.202532 | 0.087638 | 0.121212 | 0.235294 | 0.43871 | 0.088515 |
| 25% | 0.618421 | 0.671371 | 0.253976 | 0.17696 | 0.159048 | 0.5625 | 0.609677 | 0.140095 |
| 50% | 0.687135 | 0.701626 | 0.284161 | 0.220142 | 0.181667 | 0.636029 | 0.687097 | 0.169199 |
| 75% | 0.76462 | 0.74096 | 0.311352 | 0.261254 | 0.229297 | 0.75 | 0.780645 | 0.243124 |
| max | 0.900585 | 0.869355 | 0.440678 | 0.415473 | 0.5 | 0.941176 | 0.961039 | 0.851694 |
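
As a reminder of what the columns mean, the sketch below computes accuracy, sensitivity, specificity, precision, F-1 and MCC from the four confusion-matrix counts using their standard definitions. The counts are illustrative only, chosen to resemble one test fold of this task (855 / 5 = 171 samples, about 16 events); they are not taken from any actual fold, and the helper names are hypothetical.

```python
from math import sqrt

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": safe_div(tp + tn, tp + fp + tn + fn),
        "Sens": safe_div(tp, tp + fn),            # recall on the event class
        "Spec": safe_div(tn, tn + fp),            # recall on the non-event class
        "Prec": safe_div(tp, tp + fp),
        "F-1": safe_div(2 * tp, 2 * tp + fp + fn),
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }

# Illustrative counts only: 171 samples, 16 of them events.
metrics = binary_metrics(tp=10, fp=45, tn=110, fn=6)
for name, value in metrics.items():
    print(f"{name:5s} {value:.6f}")
```

With so few events per fold, a handful of false positives is enough to pull precision below 0.2 even when sensitivity and specificity both look reasonable, which matches the pattern in the tables.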


In [74]:
file_path = 'results/smo_svm_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [75]:
svm_sc_pca_search = HyperoptCV(estimator=estim_pipeline['svm_sc_pca_pipe'], 
                                hyper_space=dict(svm_feat_ext_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [76]:
%%time
best_trial = svm_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [37:50<00:00, 29.68s/it, best loss: 0.3263955309396487] 
CPU times: user 1min 1s, sys: 2.93 s, total: 1min 3s
Wall time: 37min 50s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=250, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=3, n_jobs=...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_score

In [77]:
best_trial['result']['params']

{'clf__C': 0.00048211741877555764,
 'clf__gamma': 0.00011058567403620571,
 'clf__max_iter': 100000.0,
 'dim_reducer__n_components': 150,
 're_sample__k_neighbors': 5,
 're_sample__sampling_strategy': 0.25}

In [78]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.683158 | 0.673604 | 0.282696 | 0.20638 | 0.212054 | 0.596838 | 0.692294 | 0.251902 |
| std | 0.143356 | 0.05884 | 0.071893 | 0.088447 | 0.118649 | 0.171224 | 0.173906 | 0.12703 |
| min | 0.350877 | 0.529435 | 0.192771 | 0.071205 | 0.115789 | 0.294118 | 0.290323 | 0.093001 |
| 25% | 0.570175 | 0.637601 | 0.238812 | 0.149006 | 0.147608 | 0.5 | 0.559677 | 0.158131 |
| 50% | 0.652047 | 0.670069 | 0.263483 | 0.192442 | 0.16447 | 0.625 | 0.658065 | 0.21942 |
| 75% | 0.802632 | 0.715524 | 0.30704 | 0.231337 | 0.222222 | 0.6875 | 0.833871 | 0.297953 |
| max | 0.929825 | 0.826613 | 0.5 | 0.499358 | 0.75 | 0.9375 | 0.987097 | 0.605055 |
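
Each summary table has a `count` of 50 because the outer split is `RepeatedStratifiedKFold(n_repeats=10, n_splits=5)`, i.e. 5 folds repeated 10 times. The sketch below is a simplified pure-Python stand-in for that splitter (not scikit-learn's algorithm) that illustrates the resulting fold count and composition. The class counts (81 events, 774 non-events) are derived from the 855 samples and the event fraction reported earlier; the function name is hypothetical.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(y, n_splits=5, n_repeats=10, seed=23):
    """Yield (repeat, fold, test_indices), keeping class proportions per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    rng = random.Random(seed)
    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0
        for label in sorted(by_class):
            members = by_class[label][:]
            rng.shuffle(members)
            # Deal indices round-robin, continuing where the previous class
            # stopped, so fold sizes stay balanced overall.
            for pos, idx in enumerate(members):
                folds[(pos + offset) % n_splits].append(idx)
            offset += len(members)
        for fold_id, test_idx in enumerate(folds):
            yield rep, fold_id, sorted(test_idx)

# Labels with the class counts assumed above: 81 events, 774 non-events.
y = [1] * 81 + [0] * 774
splits = list(repeated_stratified_kfold(y))

fold_sizes = sorted({len(test) for _, _, test in splits})
events_per_fold = sorted({sum(y[i] for i in test) for _, _, test in splits})
print(len(splits), fold_sizes, events_per_fold)
```

Test folds of 171 samples with 16 or 17 events also explain why sensitivity takes values such as 0.9375 (15/16) or 0.941176 (16/17) in the tables.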


In [79]:
file_path = 'results/smo_svm_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [80]:
svm_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['svm_sc_kpca_pipe'], 
                                hyper_space=dict(svm_feat_ext_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [81]:
%%time
best_trial = svm_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [20:52<00:00, 21.62s/it, best loss: 0.2992618534217205]
CPU times: user 55.9 s, sys: 2.54 s, total: 58.5 s
Wall time: 20min 52s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [82]:
best_trial['result']['params']

{'clf__C': 0.03121330748848976,
 'clf__gamma': 0.008341103607021559,
 'clf__max_iter': 10000000.0,
 'dim_reducer__n_components': 250,
 're_sample__k_neighbors': 7,
 're_sample__sampling_strategy': 0.3333333333333333}

In [83]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.674152 | 0.700738 | 0.289543 | 0.221789 | 0.199501 | 0.658897 | 0.675652 | 0.300594 |
| std | 0.128812 | 0.068652 | 0.06489 | 0.078078 | 0.074939 | 0.168071 | 0.155856 | 0.100925 |
| min | 0.380117 | 0.529032 | 0.148148 | 0.079456 | 0.121212 | 0.125 | 0.322581 | 0.139051 |
| 25% | 0.609649 | 0.653594 | 0.241641 | 0.162194 | 0.152868 | 0.5625 | 0.598052 | 0.238595 |
| 50% | 0.684211 | 0.688903 | 0.268468 | 0.196377 | 0.174457 | 0.6875 | 0.680645 | 0.272133 |
| 75% | 0.752924 | 0.742641 | 0.331395 | 0.279006 | 0.21721 | 0.761029 | 0.764118 | 0.326036 |
| max | 0.900585 | 0.864113 | 0.484848 | 0.43013 | 0.470588 | 0.941176 | 0.954545 | 0.597029 |


In [84]:
file_path = 'results/smo_svm_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

##### ANN

###### ANOVA

In [85]:
ann_anova_sc_search = HyperoptCV(estimator=estim_pipeline['ann_anova_sc_pipe'], 
                                hyper_space=dict(ann_feat_sel_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [86]:
%%time
best_trial = ann_anova_sc_search.model_selection(X, y)

100%|██████████| 100/100 [08:23<00:00,  4.17s/it, best loss: 0.3269799033983094]
CPU times: user 1min 4s, sys: 2.91 s, total: 1min 7s
Wall time: 8min 23s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('dim_reducer',
                                      SelectKBest(k=250,
                                                  score_func=<function f_classif at 0x7f3c33d89400>)),
                                     ('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=9, n_jobs=None,
                                            random_state=69,
                                            sampling_strategy=0.3333...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec'

In [87]:
best_trial['result']['params']

{'clf__alpha': 2.6196908117454092e-06,
 'clf__batch_size': 110,
 'clf__hidden_layer_sizes': (50,),
 'clf__learning_rate_init': 6.37391696496336e-05,
 'dim_reducer__k': 300,
 're_sample__k_neighbors': 9,
 're_sample__sampling_strategy': 1}

In [88]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.676257 | 0.67302 | 0.27317 | 0.198958 | 0.204128 | 0.602868 | 0.684024 | 0.389912 |
| std | 0.139352 | 0.067506 | 0.055578 | 0.076294 | 0.109712 | 0.189692 | 0.170054 | 0.197696 |
| min | 0.339181 | 0.543952 | 0.191781 | 0.059668 | 0.113208 | 0.1875 | 0.277419 | 0.102853 |
| 25% | 0.557018 | 0.622681 | 0.224359 | 0.145142 | 0.136755 | 0.445772 | 0.540323 | 0.242653 |
| 50% | 0.690058 | 0.662903 | 0.26087 | 0.170403 | 0.178211 | 0.647059 | 0.683871 | 0.33191 |
| 75% | 0.782164 | 0.719254 | 0.311688 | 0.261159 | 0.21511 | 0.75 | 0.810411 | 0.523435 |
| max | 0.918129 | 0.824597 | 0.413793 | 0.362433 | 0.75 | 0.9375 | 0.993548 | 0.915404 |


In [89]:
file_path = 'results/smo_ann_anova_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### PCA

In [90]:
ann_sc_pca_search = HyperoptCV(estimator=estim_pipeline['ann_sc_pca_pipe'], 
                                hyper_space=dict(ann_feat_ext_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [91]:
%%time
best_trial = ann_sc_pca_search.model_selection(X, y)

100%|██████████| 100/100 [27:36<00:00, 14.14s/it, best loss: 0.35255223144976466]
CPU times: user 1min 10s, sys: 3.07 s, total: 1min 13s
Wall time: 27min 36s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      PCA(copy=True, iterated_power='auto',
                                          n_components=150, random_state=None,
                                          svd_solver='auto', tol=0.0,
                                          whiten=False)),
                                     ('re_sample',
                                      SMOTE(k_neighbors=3, n_jobs=...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),
                    'prec': make_score

In [92]:
best_trial['result']['params']

{'clf__alpha': 0.00029753448642246114,
 'clf__batch_size': 20,
 'clf__hidden_layer_sizes': (25,),
 'clf__learning_rate_init': 0.028406988095464734,
 'dim_reducer__n_components': 150,
 're_sample__k_neighbors': 3,
 're_sample__sampling_strategy': 1}

In [93]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.701871 | 0.647448 | 0.268124 | 0.183086 | 0.192546 | 0.544118 | 0.718538 | 0.03704428 |
| std | 0.12436 | 0.053692 | 0.050916 | 0.064599 | 0.067314 | 0.156868 | 0.151422 | 0.1350759 |
| min | 0.444444 | 0.535905 | 0.193548 | 0.067743 | 0.119048 | 0.235294 | 0.4 | 7.732033e-18 |
| 25% | 0.618421 | 0.608669 | 0.227875 | 0.13711 | 0.141997 | 0.4375 | 0.62721 | 9.030469e-08 |
| 50% | 0.716374 | 0.644179 | 0.255556 | 0.168717 | 0.173163 | 0.5 | 0.744344 | 0.000109214 |
| 75% | 0.812865 | 0.682689 | 0.306856 | 0.230796 | 0.228632 | 0.641544 | 0.849277 | 0.00536738 |
| max | 0.894737 | 0.75 | 0.4 | 0.333678 | 0.444444 | 0.875 | 0.967532 | 0.8331336 |


In [94]:
file_path = 'results/smo_ann_pca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)

###### KPCA

In [95]:
ann_sc_kpca_search = HyperoptCV(estimator=estim_pipeline['ann_sc_kpca_pipe'], 
                                hyper_space=dict(ann_feat_ext_space, **smo_space), cv=outer_cv_split, 
                                scoring=eval_metric, opt_metric=model_sel_metric, n_iter=n, 
                                random_seed=random_state, parallel=8)

In [96]:
%%time
best_trial = ann_sc_kpca_search.model_selection(X, y)

100%|██████████| 100/100 [12:01<00:00,  6.21s/it, best loss: 0.3404023756130017]
CPU times: user 1min 3s, sys: 2.59 s, total: 1min 5s
Wall time: 12min 1s


HyperoptCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=23),
           estimator=Pipeline(memory=None,
                              steps=[('scaler',
                                      StandardScaler(copy=True, with_mean=True,
                                                     with_std=True)),
                                     ('dim_reducer',
                                      KernelPCA(alpha=1.0, coef0=1, copy_X=True,
                                                degree=3, eigen_solver='auto',
                                                fit_inverse_transform=False,
                                                gamma=None, kernel='rbf',
                                                kernel_params=None,
                                                max_iter=Non...
                    'auc': 'roc_auc',
                    'f1': make_scorer(opt_f1_score, needs_proba=True),
                    'mcc': make_scorer(opt_mcc_score, needs_proba=True),

In [97]:
best_trial['result']['params']

{'clf__alpha': 0.0799461480511143,
 'clf__batch_size': 20,
 'clf__hidden_layer_sizes': (50,),
 'clf__learning_rate_init': 0.01641955427558416,
 'dim_reducer__n_components': 200,
 're_sample__k_neighbors': 3,
 're_sample__sampling_strategy': 0.3333333333333333}

In [98]:
scores = best_trial['result']['score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']
                   })
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.68807 | 0.659598 | 0.267289 | 0.188462 | 0.196344 | 0.570956 | 0.700451 | 0.223808 |
| std | 0.13073 | 0.055588 | 0.051327 | 0.060443 | 0.073401 | 0.18125 | 0.161404 | 0.084783 |
| min | 0.415205 | 0.559677 | 0.181818 | 0.087638 | 0.119266 | 0.125 | 0.376623 | 0.10802 |
| 25% | 0.577485 | 0.615649 | 0.230972 | 0.146307 | 0.145388 | 0.477941 | 0.562903 | 0.159351 |
| 50% | 0.669591 | 0.660265 | 0.26015 | 0.182447 | 0.16784 | 0.575368 | 0.683871 | 0.204627 |
| 75% | 0.811404 | 0.693244 | 0.276658 | 0.217235 | 0.222321 | 0.75 | 0.856211 | 0.270012 |
| max | 0.894737 | 0.772984 | 0.439024 | 0.372146 | 0.384615 | 0.8125 | 0.974194 | 0.453962 |


In [99]:
file_path = 'results/smo_ann_kpca_disc_pfi_lung_ml_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)
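
To compare the SMOTE-based models at a glance, the sketch below collects the mean and standard deviation of the test AUC copied from the `res.describe()` outputs in this part of the notebook and ranks them. It is only a convenience summary: the differences between models are small relative to the fold-to-fold spread, so the ordering should not be read as a significance statement.

```python
# (mean test AUC, std of test AUC) over the 50 outer folds, copied from the outputs.
auc_summary = {
    "LR + PCA":    (0.680812, 0.059027),
    "LR + KPCA":   (0.678779, 0.058144),
    "SVM + ANOVA": (0.703466, 0.060444),
    "SVM + PCA":   (0.673604, 0.058840),
    "SVM + KPCA":  (0.700738, 0.068652),
    "ANN + ANOVA": (0.673020, 0.067506),
    "ANN + PCA":   (0.647448, 0.053692),
    "ANN + KPCA":  (0.659598, 0.055588),
}

# Rank models by mean AUC, highest first.
ranked = sorted(auc_summary.items(), key=lambda item: item[1][0], reverse=True)

for name, (mean_auc, std_auc) in ranked:
    print(f"{name:12s} AUC = {mean_auc:.3f} +/- {std_auc:.3f}")
```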