# Run Machine Learning Pipeline
----

### Import Packages
The machine learning code has been packaged into a single module called <code>ML_Pipeline</code> saved in **ml_functions**. 

<div class="alert alert-block alert-info">
<b>Note:</b> Be sure to specify the path to <code>ml_functions</code> in line 3 of the cell below.
</div>

In [1]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
import sys
sys.path.insert(1,'C:/ArrhythmiaPredictionProject/ml_functions/') # include path to ml_functions 
import ML_Pipeline as ml
import warnings 
warnings.simplefilter('ignore')

%reload_ext autoreload
%autoreload 2


### Specify Input Requirements
<code>ML_Pipeline</code> creates an object from the following inputs:
- data filename
- label
- feature set 
- seed (optional)

The seed is optional; it defaults to 134556, the value used for this project.

In [2]:
# Input Data 
filename = 'C:/ArrhythmiaPredictionProject/Manuscript_Data/Population_Feature_Outputs.csv' # file with data 
features = ['Vrest', 'Upstroke', 'Vpeak', 'APD20', 'APD40', 'APD50', 'APD90',
       'TriAP','DCai', 'Capeak', 'CaD50', 'CaD90', 'TriCa', 'dCa']
label = 'IKrBlock_Label'
seed = 134556

### Specify Data Output Requirements
<code>ML_Pipeline</code> has the option to save the results of each classifier to an Excel file if <code>flag_save</code> is set to <code>True</code>. 

In [3]:
# Output Data 
flag_save = False 
output_name = 'Test' # filename 

### Create ML Object based on set input requirements

In [4]:
# Create a "ML task" object 
ML_Task = ml.ML_Pipeline(filename,label,features,seed)

### Convert the data from the .csv file to a pandas DataFrame

In [8]:
df = ML_Task.convert_df()
df.head()

| | Vrest | Upstroke | Vpeak | APD20 | APD40 | APD50 | APD90 | TriAP | DCai | Capeak | ... | CaD90_HighCao | TriCa_HighCao | dCa_HighCao | Threshold | MaxSlope | CurveFit | NormMinMax | IKrBlock_Label | ICaLIncrease_Label | Inject_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -87.908 | 62.981 | 38.188 | 114.59 | 162.79 | 173.78 | 209.39 | 35.602 | 0.000224 | 0.000317 | ... | 501.716312 | 302.314174 | 9.6e-05 | 33.4425 | 2.1 | -2.99e-07 | 0.326 | 0 | 1 | 0 |
| 1 | -88.1 | 115.51 | 38.15 | 111.77 | 163.86 | 182.39 | 246.29 | 63.901 | 0.000506 | 0.000547 | ... | 298.714962 | 200.657809 | 6.2e-05 | 28.6405 | 1.26 | -2.52e-07 | 0.277 | 1 | 1 | 1 |
| 2 | -88.084 | 144.09 | 40.885 | 146.75 | 219.42 | 239.23 | 303.08 | 63.854 | 0.000321 | 0.000398 | ... | 428.460444 | 272.328933 | 8.8e-05 | 31.513125 | 2.05 | -3.94e-07 | 0.272 | 1 | 1 | 1 |
| 3 | -88.045 | 94.84 | 37.837 | 127.34 | 201.69 | 219.06 | 276.38 | 57.324 | 0.00024 | 0.000318 | ... | 445.118633 | 272.452332 | 8.2e-05 | 30.6985 | 1.76 | -3.32e-07 | 0.275 | 1 | 1 | 0 |
| 4 | -87.963 | 104.73 | 31.902 | 163.96 | 258.12 | 287.74 | 344.58 | 56.842 | 0.000408 | 0.000477 | ... | 434.446389 | 297.796318 | 8e-05 | 29.841 | 2.25 | -4.38e-07 | 0.236 | 1 | 1 | 1 |


### Split the data into training and testing sets (we used a 90-10 split)

In [9]:
ML_Task.train_test_data = train_test_split(df[features], df[label], random_state = seed, test_size = 0.10) 
print(ML_Task.train_test_data[0].head()) # training data
# print(ML_Task.train_test_data[1].head()) # testing data
# print(ML_Task.train_test_data[2].head()) # training y labels
# print(ML_Task.train_test_data[3].head()) # testing y labels

      Vrest  Upstroke   Vpeak    APD20   APD40   APD50   APD90   TriAP  \
21  -88.054   167.250  39.068  128.580  184.03  196.58  237.04  40.466   
319 -87.921   190.620  31.453  125.020  150.26  160.30  199.85  39.548   
724 -88.028   271.590  41.874   90.029  157.01  175.74  242.93  67.186   
939 -87.881   203.390  37.301  187.580  263.61  283.38  341.11  57.730   
880 -87.694    71.313  38.365  180.790  258.39  282.18  368.85  86.669   

         DCai    Capeak   CaD50   CaD90   TriCa       dCa  
21   0.000299  0.000403  203.35  495.39  292.04  0.000106  
319  0.000452  0.000534  176.30  456.08  279.78  0.000084  
724  0.000310  0.000385  196.62  495.76  299.14  0.000077  
939  0.000408  0.000502  251.36  586.47  335.10  0.000099  
880  0.000399  0.000512  258.39  617.56  359.17  0.000118  
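As a rough illustration of the split above, the sketch below applies the same `train_test_split` call to a synthetic table and checks two things: the 90-10 sizes, and that reusing the seed reproduces the same training rows. The column names `f1`, `f2`, `f3` and the random values are placeholders, not project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 134556  # seed value quoted in this notebook

# Synthetic stand-in for the feature table (hypothetical values, not project data)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.integers(0, 2, size=1000), name="label")

# Same call pattern as above; the result is [X_train, X_test, y_train, y_test]
split_a = train_test_split(X, y, random_state=seed, test_size=0.10)
split_b = train_test_split(X, y, random_state=seed, test_size=0.10)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)  # 900 training rows and 100 testing rows

# Re-running with the same seed selects the same rows in the same order
same_rows = X_train.index.equals(split_b[0].index)
print(same_rows)
```

Unpacking the four pieces into named variables, as done here, can be easier to read than indexing `train_test_data[0]` through `train_test_data[3]`.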


### Tune and evaluate 8 machine learning classifiers  
We first use the 90% training data to tune the hyperparameters of each algorithm with <code>GridSearchCV</code> from <code>sklearn</code>. We then use the tuned model to evaluate performance on the 10% unseen testing data. Data are normalized with <code>MinMaxScaler</code> for the artificial neural network (ANN) and with <code>StandardScaler</code> for the remaining classifiers.

In [10]:
# run the 8 machine learning classifiers  
LR = ML_Task.run_LR()
SVM = ML_Task.run_SVM()
KNN = ML_Task.run_KNN()
RF = ML_Task.run_RF()
NB = ML_Task.run_NB()
GB = ML_Task.run_GB()
XGB = ML_Task.run_XGB()
ANN = ML_Task.run_ANN()

------------------------------------------
LR Best Params: 
{'C': 100}
0.7944444444444445
------------------------------------------
SVM Best Params: 
{'C': 50, 'gamma': 0.01, 'kernel': 'rbf'}
0.7300000000000001
------------------------------------------
KNN Best Params: 
{'n_neighbors': 19}
0.5460565476190476
------------------------------------------
Random forest Best Params:
{'bootstrap': True, 'criterion': 'entropy', 'max_depth': 100, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 8, 'n_estimators': 200}
0.668204365079365
------------------------------------------
Gradient Boosting Best Params:
{'learning_rate': 0.5, 'max_depth': 200, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 8, 'n_estimators': 200}
0.6288888888888888
------------------------------------------
XGBoost Best Params: 
{'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 100, 'min_child_weight': 8, 'n_estimators': 100, 'subsample': 0.8}
0.6622222222222222
-----------------------------
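The internals of the `run_*` methods are not shown in this notebook, so the following is only a minimal sketch of the tune-then-evaluate pattern described above, using logistic regression on synthetic data. The parameter grid, the 5-fold setting, the `roc_auc` scoring choice and the pipeline step names are assumptions made for illustration, not the settings used by `ML_Pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

seed = 134556

# Synthetic binary-classification data (stand-in, not the project data)
X, y = make_classification(n_samples=500, n_features=14, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=seed, test_size=0.10
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; the grids used by ML_Pipeline are not shown here
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)  # tuning uses the training data only

print(search.best_params_)
print(search.best_score_)  # mean cross-validated ROC AUC of the best setting

# Evaluate the refit best model on the held-out data (ROC AUC, per `scoring`)
test_auc = search.score(X_test, y_test)
print(test_auc)
```

Swapping `StandardScaler` for `MinMaxScaler` in the first pipeline step would give the scaling described for the ANN.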

### Combine data from each classifier into a single table

In [11]:
# extract the results of each classifier and output them as a table
classifiers = [LR[1], SVM[1], NB, XGB[1], GB[1], RF[1], KNN[1], ANN[1]]
keys = ['LR','SVM','NB','XGB','GB','RF','KNN','ANN']
combined_ML_data = ML_Task.create_table(classifiers,keys)

result_table = combined_ML_data[0]
metrics = combined_ML_data[1]
rocs = combined_ML_data[2]
prob = combined_ML_data[3]
pred = combined_ML_data[4]
conf_matrix = combined_ML_data[5]
result_table

| Classifier | Accuracy | Threshold | Sensitivity | Specificity | NPV | PPV | AUC |
|---|---|---|---|---|---|---|---|
| LR | 0.71 | 0.515326 | 0.649123 | 0.790698 | 0.62963 | 0.804348 | 0.759282 |
| SVM | 0.71 | 0.511086 | 0.631579 | 0.813953 | 0.625 | 0.818182 | 0.764586 |
| NB | 0.56 | 0.681326 | 0.368421 | 0.813953 | 0.492958 | 0.724138 | 0.586291 |
| XGB | 0.68 | 0.545012 | 0.649123 | 0.72093 | 0.607843 | 0.755102 | 0.70461 |
| GB | 0.7 | 0.100666 | 0.807018 | 0.55814 | 0.685714 | 0.707692 | 0.698898 |
| RF | 0.67 | 0.517771 | 0.578947 | 0.790698 | 0.586207 | 0.785714 | 0.677275 |
| KNN | 0.63 | 0.473684 | 0.684211 | 0.55814 | 0.571429 | 0.672414 | 0.636883 |
| ANN | 0.69 | 0.897536 | 0.736842 | 0.627907 | 0.642857 | 0.724138 | 0.668707 |
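To make the columns concrete, the sketch below computes the standard confusion-matrix metrics from raw counts. The LR row is consistent with 37 true positives, 20 false negatives, 34 true negatives and 9 false positives among 100 test samples; these counts are inferred from the reported values, not printed by the pipeline, and the helper `metrics_from_counts` is a hypothetical name introduced here.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "Accuracy": (tp + tn) / (tp + fn + tn + fp),
        "Sensitivity": tp / (tp + fn),  # true positive rate
        "Specificity": tn / (tn + fp),  # true negative rate
        "NPV": tn / (tn + fn),          # negative predictive value
        "PPV": tp / (tp + fp),          # positive predictive value
    }

# Counts inferred from the LR row above (an assumption, not pipeline output)
lr = metrics_from_counts(tp=37, fn=20, tn=34, fp=9)
for name, value in lr.items():
    print(f"{name}: {value:.6f}")
```

The Threshold and AUC columns cannot be recovered from counts alone, since they depend on the predicted probabilities.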


### Plot results and highlight the best classifier for this particular task

In [None]:
ML_Task.plotting(rocs,result_table)
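The body of `ML_Task.plotting` is not shown, so the following is only a sketch of how ROC curves for several classifiers can be drawn and the one with the highest AUC picked out. The two models and their scores are synthetic, and treating "best" as "highest AUC" is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)

# Hypothetical scores for two made-up classifiers (not project results)
scores = {
    "model_a": y_true * 0.6 + rng.normal(0, 0.4, size=200),  # more informative
    "model_b": y_true * 0.2 + rng.normal(0, 0.4, size=200),  # less informative
}

aucs = {}
fig, ax = plt.subplots()
for name, s in scores.items():
    fpr, tpr, _ = roc_curve(y_true, s)
    aucs[name] = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")

# Assumed selection rule: the classifier with the largest AUC
best = max(aucs, key=aucs.get)

ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title(f"ROC curves (highest AUC: {best})")
ax.legend()
```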

### Save results to an Excel File

In [None]:
# save the results 
ML_Task.flag_save = flag_save # the default is set to True
filename = output_name + '_' + label
ML_Task.save_data(filename,combined_ML_data,keys)