# **Pre-processing**

In this notebook, I perform the pre-processing for the project. I first scale the features and then attempt feature selection with the Boruta method, to pick out the proteins and peptides that matter most for predicting the UPDRS scores in the training dataset. There are 234 patients in the training dataset, with protein/peptide expression data at different months. There are protein abundance values for 227 proteins and peptide abundance measurements for 671 peptides; these were merged in the previous notebook. Because the provided test dataset contains only two patients and predictions are made through an API system in Kaggle, we split the original training dataset into training and test sets in an 80/20 ratio and use this held-out test set for predictions.


In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import plotly.express as px
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

Let us import the datasets we will be working on: the cleaned and wrangled protein/peptide dataset and the clinical dataset.

In [63]:
Training_protein_peptide_final=pd.read_csv("Protein_peptide_training_final.csv")
Training_clinical=pd.read_csv('Training_clinical_cleaned.csv')

Visualising the loaded datasets

In [64]:
Training_protein_peptide_final.head()

Unnamed: 0,visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,10053_0,13.152328,18.617988,0.0,0.0,12.803843,11.286465,16.340874,13.88356,17.352311,...,17.625951,0.0,22.069672,16.241585,19.153322,16.227046,16.669826,19.01624,0.0,12.815243
1,10053_12,13.353174,18.732598,0.0,0.0,0.0,0.0,17.588693,13.882175,17.325692,...,17.616901,0.0,22.254002,15.165272,18.44007,16.49057,16.911275,18.791961,15.58877,14.628719
2,10053_18,13.692147,18.952724,12.799071,14.582007,0.0,11.21232,16.948846,13.991664,17.35902,...,17.75191,0.0,22.371027,15.251778,18.920042,15.947719,16.969566,18.771544,15.676979,14.374204
3,10138_12,13.621159,18.915847,13.161929,14.730974,14.458028,12.554565,17.254078,15.735196,17.638302,...,17.523148,13.20361,21.895146,15.557054,18.325455,16.454783,16.987753,19.074915,16.002679,13.269854
4,10138_24,13.551131,18.994072,12.135232,14.069265,14.829346,11.380001,17.205803,15.675574,17.878027,...,17.653594,12.635979,21.747882,16.09475,18.922123,16.304196,16.770548,0.0,15.798107,12.259476


In [65]:
Training_clinical.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,median_updrs_1,median_updrs_2,median_updrs_3,median_updrs_4
0,55_0,55,0,10.0,6.0,15.0,0.0,,,3.0,7.0,7.0,24.0,0.0
1,55_3,55,3,10.0,7.0,25.0,0.0,,3.0,3.0,7.0,7.0,24.0,0.0
2,55_6,55,6,8.0,10.0,34.0,0.0,,3.0,3.0,7.0,7.0,24.0,0.0
3,55_9,55,9,8.0,9.0,30.0,0.0,On,3.0,3.0,7.0,7.0,24.0,0.0
4,55_12,55,12,10.0,10.0,41.0,0.0,On,3.0,3.0,7.0,7.0,24.0,0.0


Drop the median UPDRS score columns from the training clinical dataset

In [66]:
Training_clinical=Training_clinical.drop(columns=["median_updrs_1","median_updrs_2","median_updrs_3","median_updrs_4"])

Combine the clinical and the protein/peptide abundance datasets

In [67]:
training_protein_peptide_clinical = Training_clinical.merge(Training_protein_peptide_final, on="visit_id",how="inner")
training_protein_peptide_clinical.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,55_0,55,0,10.0,6.0,15.0,0.0,,,3.0,...,17.61797,14.009505,21.861462,16.705821,19.147352,17.000913,17.339528,18.73828,15.498388,13.86287
1,55_6,55,6,8.0,10.0,34.0,0.0,,3.0,3.0,...,17.384303,13.688119,21.974045,16.79087,18.973823,16.659439,17.141778,18.804645,15.289432,14.337615
2,55_12,55,12,10.0,10.0,41.0,0.0,On,3.0,3.0,...,17.822347,14.125559,22.384201,16.827318,19.441143,17.063216,17.471699,18.786771,15.739915,14.414758
3,55_36,55,36,17.0,18.0,51.0,0.0,On,6.0,3.0,...,17.499425,14.181502,21.34281,16.472578,19.373398,16.972453,17.635945,18.927584,15.688051,13.770426
4,942_6,942,6,8.0,2.0,21.0,0.0,,3.0,3.0,...,17.787966,12.643811,0.0,15.813065,18.87553,16.287734,16.281602,19.128931,15.550921,13.936095
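Because the merge is an inner join, visits that appear in only one of the two tables are dropped without any message (for patient 55, months 3 and 9 are present in the clinical table above but absent from the merged one). A minimal sketch of how such rows could be counted is shown below. It uses small made-up frames rather than the real data, and the names `toy_clinical`, `toy_proteins` and `audit` are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the clinical and protein/peptide tables.
toy_clinical = pd.DataFrame(
    {"visit_id": ["55_0", "55_3", "55_6"], "updrs_1": [10.0, 10.0, 8.0]}
)
toy_proteins = pd.DataFrame(
    {"visit_id": ["55_0", "55_6", "942_6"], "O00391": [13.4, 13.6, 13.4]}
)

# The inner join keeps only visits present in both tables.
inner = toy_clinical.merge(toy_proteins, on="visit_id", how="inner")

# An outer join with indicator=True labels every row by where it came from,
# which makes the rows lost by the inner join visible.
audit = toy_clinical.merge(toy_proteins, on="visit_id", how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)
```

Rows labelled `left_only` are clinical visits without abundance data, and rows labelled `right_only` are abundance measurements without clinical scores.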


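We have to fill in upd23b_clinical_state_on_medication, visit_month_diff and visit_month_diff_min, as they contain many missing values. The medication state gets an explicit "Unknown" category, and the two visit-interval columns are filled with 0.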

In [68]:
training_protein_peptide_clinical["upd23b_clinical_state_on_medication"] = training_protein_peptide_clinical.upd23b_clinical_state_on_medication.fillna("Unknown")

In [69]:
training_protein_peptide_clinical["visit_month_diff_min"] = training_protein_peptide_clinical["visit_month_diff_min"].fillna(0)

In [70]:
training_protein_peptide_clinical["visit_month_diff"] = training_protein_peptide_clinical["visit_month_diff"].fillna(0)
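The three fills above could also be written as a single `fillna` call with a per-column dictionary, followed by a check that nothing is left missing. The sketch below runs on a few made-up rows and assumes the same fill values as the cells above; `toy` and `fill_values` are illustrative names.

```python
import pandas as pd

# Made-up rows with the same three columns and the same kinds of gaps.
toy = pd.DataFrame(
    {
        "upd23b_clinical_state_on_medication": [None, "On", None],
        "visit_month_diff": [None, 3.0, 6.0],
        "visit_month_diff_min": [3.0, None, 3.0],
    }
)

# One fill value per column, matching the cells above.
fill_values = {
    "upd23b_clinical_state_on_medication": "Unknown",
    "visit_month_diff": 0,
    "visit_month_diff_min": 0,
}
toy_filled = toy.fillna(value=fill_values)

# Total number of missing cells that remain after filling.
remaining = int(toy_filled.isna().sum().sum())
print(remaining)
```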

In [71]:
training_protein_peptide_clinical=training_protein_peptide_clinical.set_index("visit_id")

In [72]:
training_protein_peptide_clinical.head()

Unnamed: 0_level_0,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,O00391,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
55_0,55,0,10.0,6.0,15.0,0.0,Unknown,0.0,3.0,13.458189,...,17.61797,14.009505,21.861462,16.705821,19.147352,17.000913,17.339528,18.73828,15.498388,13.86287
55_6,55,6,8.0,10.0,34.0,0.0,Unknown,3.0,3.0,13.684266,...,17.384303,13.688119,21.974045,16.79087,18.973823,16.659439,17.141778,18.804645,15.289432,14.337615
55_12,55,12,10.0,10.0,41.0,0.0,On,3.0,3.0,13.89724,...,17.822347,14.125559,22.384201,16.827318,19.441143,17.063216,17.471699,18.786771,15.739915,14.414758
55_36,55,36,17.0,18.0,51.0,0.0,On,6.0,3.0,13.72396,...,17.499425,14.181502,21.34281,16.472578,19.373398,16.972453,17.635945,18.927584,15.688051,13.770426
942_6,942,6,8.0,2.0,21.0,0.0,Unknown,3.0,3.0,13.453618,...,17.787966,12.643811,0.0,15.813065,18.87553,16.287734,16.281602,19.128931,15.550921,13.936095


In [73]:
X_train, X_test, y_train, y_test = train_test_split(training_protein_peptide_clinical.drop(columns=['updrs_1',"updrs_2","updrs_3","updrs_4","patient_id"]), 
                                                    training_protein_peptide_clinical[['updrs_1',"updrs_2","updrs_3","updrs_4"]], test_size=0.2, 
                                                    random_state=47)
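The split above is made per visit, so visits from the same patient can end up in both the training and the test set. If that is a concern, one possible alternative (not what this notebook does) is to split by patient with scikit-learn's `GroupShuffleSplit`. The sketch below uses a made-up table, and `toy`, `train_patients` and `test_patients` are illustrative names.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up visits: four patients with two or three visits each.
toy = pd.DataFrame(
    {
        "patient_id": [55, 55, 55, 942, 942, 1517, 1517, 1923, 1923, 1923],
        "visit_month": [0, 6, 12, 6, 12, 0, 6, 0, 6, 12],
    }
)

# Hold out roughly 20% of the patients (not 20% of the visits).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=47)
train_idx, test_idx = next(splitter.split(toy, groups=toy["patient_id"]))

train_patients = set(toy.iloc[train_idx]["patient_id"].tolist())
test_patients = set(toy.iloc[test_idx]["patient_id"].tolist())

# No patient appears on both sides of the split.
print(train_patients & test_patients)
```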

In [74]:
X_train.head()

Unnamed: 0_level_0,visit_month,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,O00391,O00533,O00584,O14498,O14773,O14791,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
29313_0,0,Unknown,0.0,3.0,13.220174,18.5884,14.838032,15.022116,13.636115,11.00195,...,17.876049,13.071402,21.503168,16.54588,20.763162,16.232138,16.920271,18.832402,15.659307,13.748727
31693_0,0,Unknown,0.0,6.0,13.653203,19.658056,15.326166,15.252122,14.965234,10.915446,...,17.828868,13.437687,22.547289,14.876354,17.764217,16.858479,17.399036,19.13048,15.642531,0.0
60803_60,60,Off,6.0,3.0,12.653765,17.914087,14.103706,14.043933,13.353202,11.83639,...,17.697348,12.824863,21.412552,17.043177,20.172799,15.834881,16.949622,17.750831,15.4444,15.187797
26809_12,12,Off,3.0,3.0,0.0,18.593989,14.589113,14.545139,13.63178,12.638535,...,17.633598,13.61471,21.799722,15.933923,19.60209,16.160192,16.839438,18.608183,0.0,0.0
47881_12,12,Unknown,3.0,3.0,12.644102,17.922747,14.432385,13.873579,13.946312,12.015903,...,17.905593,12.33758,22.302692,17.705909,20.248884,16.073623,16.969173,19.125328,15.324806,14.741815


In [75]:
X_test.head()

Unnamed: 0_level_0,visit_month,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,O00391,O00533,O00584,O14498,O14773,O14791,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
33558_0,0,Unknown,0.0,12.0,13.730268,19.353597,14.942556,14.939992,14.021882,11.829303,...,17.877692,13.642514,21.482976,16.65486,20.072452,16.794949,17.157475,18.661869,15.830377,14.188032
25750_12,12,Unknown,6.0,6.0,13.606798,19.550739,14.385087,15.055956,14.070297,0.0,...,18.021338,12.383594,21.362309,15.333515,18.603576,16.244882,17.193997,18.365618,14.825172,14.334196
10718_0,0,Unknown,0.0,6.0,13.223305,19.063121,14.183232,14.278573,12.748321,10.372016,...,18.002645,12.482286,0.0,15.606955,19.721917,16.194607,17.08457,19.322976,15.71709,15.165221
64674_6,6,Unknown,3.0,3.0,0.0,18.066704,14.399218,13.702898,13.889352,11.12974,...,17.325235,11.961254,22.123804,16.081791,18.638125,15.485722,16.652999,19.487777,15.908536,14.460616
48780_60,60,Unknown,12.0,12.0,13.336283,19.34718,15.213233,14.98237,14.20477,11.142203,...,17.948846,13.163373,21.989282,15.318077,17.674234,16.453091,17.376515,19.168525,15.368213,14.43872


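Because upd23b_clinical_state_on_medication is a categorical variable, we have to encode it, and we should scale our continuous features. Both the encoder and the scaler are fitted on the training set only and then applied to the test set.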

In [76]:
numeric_columns = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_columns = X_train.select_dtypes(include=['object', 'bool']).columns

# Select the categorical columns of the train and test sets
X_train_cat = X_train[categorical_columns]
X_test_cat = X_test[categorical_columns]
ohe = OneHotEncoder()
# Fit the encoder on the training categorical features and transform them;
# the test set is transformed with the same fitted encoder in the next cell
X_train_encode = ohe.fit_transform(X_train_cat)

In [77]:
X_test_encode =  ohe.transform(X_test_cat)

In [79]:
# Convert the sparse one-hot matrices to dataframes
# (get_feature_names_out replaces get_feature_names, which newer scikit-learn releases no longer provide)
columns = ohe.get_feature_names_out(X_train_cat.columns)
X_train_processed = pd.DataFrame(X_train_encode.todense(), columns=columns)
X_test_processed = pd.DataFrame(X_test_encode.todense(), columns=columns)


In [115]:
# Instantiate the standard scaler
ss = StandardScaler()
# Convert the continuous feature values to floats
X_train_cont = X_train[numeric_columns].astype(float)
X_test_cont = X_test[numeric_columns].astype(float)
# Fit the scaler on the training continuous features, then transform train and test
X_train_scaled = ss.fit_transform(X_train_cont)
X_test_scaled = ss.transform(X_test_cont)
# Concatenate the scaled and encoded dataframes column-wise.
# The scaled columns keep integer labels (0, 1, 2, ...); they are mapped back
# to feature names through X_train_cont.columns after feature selection.
X_train_a2 = pd.concat([pd.DataFrame(X_train_scaled), X_train_processed], axis=1)
X_test_a2 = pd.concat([pd.DataFrame(X_test_scaled), X_test_processed], axis=1)




In [157]:
X_train_a2.shape

(854, 1201)

In [159]:
X_test_a2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1191,1192,1193,1194,1195,1196,1197,upd23b_clinical_state_on_medication_Off,upd23b_clinical_state_on_medication_On,upd23b_clinical_state_on_medication_Unknown
0,-1.166054,-1.311332,1.865668,0.714109,0.601383,0.340042,0.357781,0.275834,0.50705,0.812566,...,0.580694,0.563234,0.430082,0.273973,0.008785,0.378718,0.529754,0.0,0.0,1.0
1,-0.646794,-0.017424,0.120563,0.694335,0.800473,0.019936,0.386786,0.2897,-2.361684,1.377609,...,0.333493,0.071108,-0.153935,0.287953,-0.173125,0.136415,0.554662,0.0,0.0,1.0
2,-1.166054,-1.311332,0.120563,0.632919,0.308037,-0.095972,0.192346,-0.088897,0.153642,0.114111,...,0.384649,0.445792,-0.207314,0.246065,0.414733,0.35141,0.696273,0.0,0.0,1.0
3,-0.906424,-0.664378,-0.751989,-1.484781,-0.698226,0.02805,0.048358,0.237879,0.337399,-0.579597,...,0.473482,0.082683,-0.959952,0.080865,0.515928,0.397558,0.576204,0.0,0.0,1.0
4,1.430244,1.276484,1.865668,0.651013,0.594903,0.495469,0.368381,0.328211,0.340421,0.590105,...,0.330605,-0.240255,0.067124,0.357819,0.319894,0.267314,0.572473,0.0,0.0,1.0


We have to apply the Boruta feature selection method separately for each of the four UPDRS scores.
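As background, Boruta judges each real feature against shuffled "shadow" copies of the features: a feature is only worth keeping if its importance beats what shuffled noise achieves. The sketch below shows a single round of that comparison on synthetic data using only scikit-learn. It is a simplified illustration of the idea, not BorutaPy's procedure, which repeats the comparison over many iterations and applies a statistical test. All names and values in it are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_features = 300, 6

# Synthetic data: only the first two features drive the target.
X = rng.normal(size=(n_samples, n_features))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Shadow features: every column shuffled independently, which breaks any
# relationship with y while keeping each column's distribution.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
X_aug = np.hstack([X, X_shadow])

forest = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
forest.fit(X_aug, y)

importances = forest.feature_importances_
real_imp = importances[:n_features]
shadow_imp = importances[n_features:]

# A real feature scores a "hit" when it beats the best shadow feature.
hits = real_imp > shadow_imp.max()
print(hits)
```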

In [97]:

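model = RandomForestRegressor(n_estimators=1000, max_depth=5, random_state=42)

# The np.int, np.float and np.bool aliases were removed in NumPy 1.24, but older
# BorutaPy releases still use them, so they are restored here as a workaround.
np.int = np.int32
np.float = np.float64
np.bool = np.bool_

# Initialize Boruta; n_estimators='auto' lets Boruta set the number of trees
feat_selector = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto'
)

# Run Boruta for updrs_1
# N.B.: X and y must be numpy arrays
feat_selector.fit(np.array(X_train_a2), np.array(y_train.updrs_1))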



Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	6
Rejected: 	1195
Iteration: 	9 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	10 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	11 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	12 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	13 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	14 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	15 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1195
Iteration: 	16 / 100
Confirmed: 	2

BorutaPy(estimator=RandomForestRegressor(max_depth=5, n_estimators=69,
                                         random_state=RandomState(MT19937) at 0x7FE500CC3840),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7FE500CC3840, verbose=2)

In [169]:
features_UPDRS1=X_train_a2.columns[feat_selector.support_]


In [170]:
features_UPDRS1.tolist()

[2, 25, 89, 229, 839, 'upd23b_clinical_state_on_medication_Unknown']

In [171]:
pd.Series(features_UPDRS1).to_csv('features_UPDRS1', header=False, index=False)

In [131]:
features_UPDRS1=X_train_cont.columns[[2, 25, 89, 229, 839]].tolist()+['upd23b_clinical_state_on_medication_Unknown']

In [132]:
features_UPDRS1

['visit_month_diff_min',
 'P01008',
 'P05155',
 'Q9Y6R7',
 'NLREGTC(UniMod_4)PEAPTDEC(UniMod_4)KPVK',
 'upd23b_clinical_state_on_medication_Unknown']
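The integer positions above were copied by hand from the Boruta output. A possible way to avoid retyping them is to build one index of all feature names in the same column order as the matrix passed to Boruta and select from it with the boolean mask. The sketch below uses a short made-up name list and a made-up mask standing in for `feat_selector.support_`; `all_names` and `support` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up names, in the order used for the concatenated matrix:
# continuous columns first, one-hot columns last.
cont_names = pd.Index(["visit_month", "visit_month_diff", "visit_month_diff_min", "O00391"])
ohe_names = pd.Index(
    [
        "upd23b_clinical_state_on_medication_Off",
        "upd23b_clinical_state_on_medication_On",
        "upd23b_clinical_state_on_medication_Unknown",
    ]
)
all_names = cont_names.append(ohe_names)

# Hypothetical boolean mask with one entry per column, standing in for support_.
support = np.array([False, False, True, False, False, False, True])

selected = all_names[support].tolist()
print(selected)
```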

In [134]:

np.int = np.int32
np.float = np.float64
np.bool = np.bool_
feat_selector_1 = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto' # let Boruta set the number of trees
)

# Run Boruta for updrs_2
# N.B.: X and y must be numpy arrays
feat_selector_1.fit(np.array(X_train_a2), np.array(y_train.updrs_2))

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	1193
Iteration: 	9 / 100
Confirmed: 	4
Tentative: 	4
Rejected: 	1193
Iteration: 	10 / 100
Confirmed: 	4
Tentative: 	4
Rejected: 	1193
Iteration: 	11 / 100
Confirmed: 	4
Tentative: 	4
Rejected: 	1193
Iteration: 	12 / 100
Confirmed: 	5
Tentative: 	3
Rejected: 	1193
Iteration: 	13 / 100
Confirmed: 	5
Tentative: 	3
Rejected: 	1193
Iteration: 	14 / 100
Confirmed: 	5
Tentative: 	3
Rejected: 	1193
Iteration: 	15 / 100
Confirmed: 	5
Tentative: 	3
Rejected: 	1193
Iteration: 	16 / 100
Confirmed: 	5

BorutaPy(estimator=RandomForestRegressor(max_depth=5, n_estimators=80,
                                         random_state=RandomState(MT19937) at 0x7FE500CC3840),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7FE500CC3840, verbose=2)

In [166]:
features_UPDRS2=X_train_a2.columns[feat_selector_1.support_]

In [167]:
features_UPDRS2

Index([2, 37, 68, 163, 296, 395, 1057,
       'upd23b_clinical_state_on_medication_Unknown'],
      dtype='object')

In [168]:
pd.Series(features_UPDRS2).to_csv('features_UPDRS2', header=False, index=False)

In [139]:
features_UPDRS2=X_train_cont.columns[[2,37, 68, 163, 296, 395, 1057]].tolist()+['upd23b_clinical_state_on_medication_Unknown']

In [140]:
features_UPDRS2

['visit_month_diff_min',
 'P01594',
 'P02766',
 'P43121',
 'ASNLESGVPSR',
 'DSGRDYVSQFEGSALGK',
 'TPLGDTTHTC(UniMod_4)PR',
 'upd23b_clinical_state_on_medication_Unknown']

In [143]:

np.int = np.int32
np.float = np.float64
np.bool = np.bool_
feat_selector_2 = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto' # let Boruta set the number of trees
)

# Run Boruta for updrs_3
# N.B.: X and y must be numpy arrays
feat_selector_2.fit(np.array(X_train_a2), np.array(y_train.updrs_3))


Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	15
Rejected: 	1186
Iteration: 	9 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	10 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	11 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	12 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	13 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	14 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	15 / 100
Confirmed: 	5
Tentative: 	10
Rejected: 	1186
Iteration: 	16 / 100
Confi

BorutaPy(estimator=RandomForestRegressor(max_depth=5, n_estimators=109,
                                         random_state=RandomState(MT19937) at 0x7FE500CC3840),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7FE500CC3840, verbose=2)

In [163]:
features_UPDRS3=X_train_a2.columns[feat_selector_2.support_]
print(features_UPDRS3)

Index([                                            2,
                                                  55,
                                                 530,
                                                 552,
                                                 728,
                                                 825,
                                                1040,
                                                1157,
                                                1184,
                                                1186,
           'upd23b_clinical_state_on_medication_Off',
            'upd23b_clinical_state_on_medication_On',
       'upd23b_clinical_state_on_medication_Unknown'],
      dtype='object')


In [165]:
pd.Series(features_UPDRS3).to_csv('features_UPDRS3', header=False, index=False)

In [145]:
# Use the positions Boruta reported for updrs_3 above (not the updrs_2 positions)
features_UPDRS3=X_train_cont.columns[[2, 55, 530, 552, 728, 825, 1040, 1157, 1184, 1186]].tolist()+['upd23b_clinical_state_on_medication_Off','upd23b_clinical_state_on_medication_On','upd23b_clinical_state_on_medication_Unknown']
print(features_UPDRS3)


In [152]:
np.int = np.int32
np.float = np.float64
np.bool = np.bool_

feat_selector_3 = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto' # let Boruta set the number of trees
)

# Run Boruta for updrs_4
# N.B.: X and y must be numpy arrays
feat_selector_3.fit(np.array(X_train_a2), np.array(y_train.updrs_4))



Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1201
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	4
Rejected: 	1197
Iteration: 	9 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	10 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	11 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	12 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	13 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	14 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	15 / 100
Confirmed: 	1
Tentative: 	3
Rejected: 	1197
Iteration: 	16 / 100
Confirmed: 	1

BorutaPy(estimator=RandomForestRegressor(max_depth=5, n_estimators=56,
                                         random_state=RandomState(MT19937) at 0x7FE500CC3840),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7FE500CC3840, verbose=2)

In [161]:
features_UPDRS4=X_train_a2.columns[feat_selector_3.support_]
print(features_UPDRS4)

Index([881, 'upd23b_clinical_state_on_medication_Unknown'], dtype='object')


In [162]:
pd.Series(features_UPDRS4).to_csv('features_UPDRS4', header=False, index=False)

In [155]:
features_UPDRS4=X_train_cont.columns[[881]].tolist()+['upd23b_clinical_state_on_medication_Unknown']
print(features_UPDRS4)

['QKVEPLRAELQEGAR', 'upd23b_clinical_state_on_medication_Unknown']


### **The important features selected by BorutaPy will now be used to predict the UPDRS scores**

Save the scaled training and test dataset files

In [160]:
X_train_a2.to_csv("X_train.csv",index=True)
X_test_a2.to_csv("X_test.csv",index=True)
y_train.to_csv("y_train.csv",index=True)
y_test.to_csv("y_test.csv",index=True)
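One detail to keep in mind when these files are loaded again: the scaled columns carry integer labels, and CSV headers are always read back as strings. The sketch below shows the round trip on a tiny made-up frame written to a temporary directory; the file name and values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with integer column labels plus one named one-hot column.
scaled = pd.DataFrame(
    {
        0: [0.5, -1.25],
        1: [1.0, 0.0],
        "upd23b_clinical_state_on_medication_On": [1.0, 0.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_train_toy.csv"
    scaled.to_csv(path, index=True)
    # index_col=0 restores the saved index instead of adding an "Unnamed: 0" column.
    reloaded = pd.read_csv(path, index_col=0)

# The integer labels 0 and 1 come back as the strings "0" and "1".
print(list(reloaded.columns))
```

Selecting a column by position after reloading therefore needs `reloaded["0"]` or `reloaded.iloc[:, 0]` rather than `reloaded[0]`.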