# Baseline Machine Learning models

This is the first script: it applies a set of ML methods with default parameters in order to obtain baseline results. We will use:
1. Nearest Neighbors
2. Linear SVM
3. RBF SVM
4. Gaussian Process
5. AdaBoost
6. Naive Bayes
7. Quadratic Discriminant Analysis
8. Neural Net
9. Decision Tree
10. Random Forest
11. XGBoost

*Note: More advanced hyperparameter search will be done in future scripts!*

In [1]:
import numpy as np
import pandas as pd

import math
import time
import os

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score, roc_auc_score, f1_score, recall_score, precision_score

from xgboost import XGBClassifier

Let's define the folder with datasets and the list with dataset files:

In [4]:
# dataset folder
WorkingFolder  = './datasets/'
BasicMLResults = 'ML_basic.csv' # a file with all the statistics for the ML models

# define the list of datasets (manually or by script)
listFiles_tr = ['test_tr.csv','test2_tr.csv'] # manually set list of training files
#listFiles_tr = [col for col in os.listdir(WorkingFolder) if ('_MA_tr.' in col)]

listFiles_ts = ['test_ts.csv','test2_ts.csv'] # manually set list of test files
#listFiles_ts = [col for col in os.listdir(WorkingFolder) if ('_MA_ts.' in col)]

# Split details
seed = 0          # for reproducibility

# output variable
outVar = 'Lij'    

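# class weighting: balanced datasets have an equal number of examples in all classes
# NOTE: the loop below builds an explicit {0: w0, 1: w1} dictionary instead of using this variable
class_weight = "balanced" # use None for balanced datasets!

# remember to check whether the datasets are balanced!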
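
The commented-out lines in the cell above build the two file lists by scanning the folder. Since `os.listdir` returns names in arbitrary order and the main loop pairs the lists by index, the training and test lists could end up misaligned. A minimal sketch of one way to keep them paired, assuming test files differ from training files only by the `_tr.` / `_ts.` tag (the helper `paired_files` and the demo file names are hypothetical):

```python
import os
import tempfile

def paired_files(folder, tr_tag="_tr.", ts_tag="_ts."):
    """Return sorted (train, test) file-name pairs found in `folder`."""
    names = set(os.listdir(folder))
    pairs = []
    for tr_name in sorted(n for n in names if tr_tag in n):
        ts_name = tr_name.replace(tr_tag, ts_tag)
        if ts_name in names:  # keep only complete train/test pairs
            pairs.append((tr_name, ts_name))
    return pairs

# Demo on a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as folder:
    for name in ["b_tr.csv", "a_ts.csv", "a_tr.csv", "b_ts.csv", "c_tr.csv"]:
        open(os.path.join(folder, name), "w").close()
    pairs = paired_files(folder)

print(pairs)  # c_tr.csv is skipped because it has no matching test file
```

The training and test lists can then be unpacked from the pairs, so position `f` in one list always corresponds to position `f` in the other.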

In this first step, we run a baseline set of models with default (or fixed, untuned) parameters. In the next step, a grid search will be used to find better models. The training and test datasets already exist as files from the previous steps.
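
Before running the models, it may help to see what a "balanced" class weighting means numerically. scikit-learn documents the `"balanced"` heuristic as `n_samples / (n_classes * class_count)`; the sketch below reproduces that formula in plain Python for the training class counts reported in the output (147 vs 52). The helper name `balanced_class_weights` is hypothetical and not part of any library.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts taken from the training set reported in the output: 147 vs 52
labels = [0] * 147 + [1] * 52
weights = balanced_class_weights(labels)
print(weights)  # the minority class (1) receives the larger weight
```

Under this heuristic the rarer class gets the larger weight, which is the usual direction when the goal is to compensate for class imbalance.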

In [5]:
print('-> Basic Machine Learning ...')
basicResults = [] #list with results

MLnames = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
          "AdaBoost", "Naive Bayes", "QDA", "Neural Net",
           "Decision Tree", "Random Forest", "XGBoost"]

# class_weight={0: w0, 1: w1}
#for each dataset
for f in range(len(listFiles_tr)):
    newFile_tr = listFiles_tr[f]
    newFile_ts = listFiles_ts[f]
    
    # read training set as dataframe
    # print('---> Reading data:', newFile_tr, '...')
    df_tr = pd.read_csv(os.path.join(WorkingFolder, newFile_tr))
    X_tr = df_tr.drop(outVar, axis = 1) # remove output variable from input features
    y_tr = df_tr[outVar]                # get only the output variable
        
    # read test set as dataframe
    # print('---> Reading data:', newFile_ts, '...')
    df_ts = pd.read_csv(os.path.join(WorkingFolder, newFile_ts))
    X_ts = df_ts.drop(outVar, axis = 1) # remove output variable from input features
    y_ts = df_ts[outVar]                # get only the output variable
            
    # get only array data for train
    X_tr_data = X_tr.values # get values of features
    y_tr_data = y_tr.values # get output values
    # get only array data for test
    X_ts_data = X_ts.values # get values of features
    y_ts_data = y_ts.values # get output values
        
    # default weights are equal for the positive and negative classes = 1 (1:1)
    # i.e. we assume a balanced dataset
    # NOTE: sample_weight in fit() sets the importance of each point, not the class balance!
    w0 = w1 = 1

    # check ratio for the classes (no of examples for each class)
    # Class counts
    target_count = y_tr.value_counts()
    nClass0 = target_count[0]
    nClass1 = target_count[1]
    print('class 0 =', nClass0, 'class 1 =', nClass1)

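    # calculate the weight ratio
    # NOTE: the larger weight goes to the MAJORITY class here; to compensate for
    # imbalance with class_weight, the minority class is usually the one upweighted
    if nClass0 > nClass1: # if class 0 has more examples
        w0 = round(nClass0/nClass1, 4) # round to 4 decimals only!
        w1 = 1
    elif nClass0 < nClass1: # if class 1 has more examples
        w0 = 1
        w1 = round(nClass1/nClass0, 4)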

    print('Ratio 1/0', w1, ':', w0)
    
    print('**************************************')
    print('Dataset MLmethod ACC AUROC f1-score recall prec')
    
    # we are using scale_pos_weight (XGBoost) for unbalanced datasets!
    MLclassifiers = [
        KNeighborsClassifier(3),
        SVC(kernel="linear", C=0.025, class_weight={0: w0, 1: w1}, random_state=seed),
        SVC(gamma=2, C=1, class_weight={0: w0, 1: w1}, random_state=seed),
        GaussianProcessClassifier(1.0 * RBF(1.0), random_state = seed),
        AdaBoostClassifier(random_state = seed),
        GaussianNB(),
        QuadraticDiscriminantAnalysis(),
        MLPClassifier(random_state = seed),
        DecisionTreeClassifier(random_state = seed, class_weight={0: w0, 1: w1}),
        RandomForestClassifier(random_state = seed, class_weight={0: w0, 1: w1}, n_jobs=-1),
        # scale_pos_weight: weight ratio negative / positive class
        # NOTE: int() truncates the ratio (e.g. 2.8269 -> 2; a ratio below 1 -> 0)
        XGBClassifier(nthread=-1,
                      scale_pos_weight= int(w0/w1),
                      seed=seed)]

    
    # for each ML method
    for i in range(len(MLnames)):
        iMLnames       = MLnames[i]
        iMLclassifier  = MLclassifiers[i]
        
        #### ADD checking if files exists, do not calculate again ???
        #### ADD time for each transformation and print it ???
        
        # TRY - EXCEPT for possible errors
        
        # training model
        iMLclassifier.fit(X_tr_data, y_tr_data)
        
        # scores for test set
        # score = iMLclassifier.score(X_ts_data, y_ts_data)
        
        # make predictions using test set
        y_pred = iMLclassifier.predict(X_ts_data)

        # calculate the statistics for the model (test ACC, AUROC, f1, recall, prec, confusion matrix)
        ac    = accuracy_score(y_ts_data, y_pred)
        # NOTE: AUROC is computed from hard class predictions, not scores or probabilities
        auroc = roc_auc_score(y_ts_data, y_pred, average="weighted", sample_weight=None)
        f1    = f1_score(y_ts_data, y_pred, pos_label=1, average="binary",
                         sample_weight=None, labels=None)
        # average=None returns one value per class: [class 0, class 1]
        recall= recall_score(y_ts_data, y_pred, average=None)
        prec  = precision_score(y_ts_data, y_pred, average= None)
        # cm = confusion_matrix(y_ts_data, y_pred)
        
        print("%s %s %4.2f %4.2f %4.2f %s %s" % (newFile_tr, iMLnames, ac, auroc, f1, recall, prec))
        basicResults.append((newFile_tr, iMLnames, ac, auroc, f1, recall, prec))

# save the results
# create a dataframe
df_BasicMLResults = pd.DataFrame(basicResults, columns=['Dataset','MLmethod','ACC',
                                                        'AUROC','f1-score','recall','prec'])
print('---> Saving results ...')
df_BasicMLResults.to_csv(BasicMLResults, index=False)
print('! Please find your ML results in:', BasicMLResults)
print('Done!')

-> Basic Machine Learning ...
class 0 = 147 class 1 = 52
Ratio 1/0 1 : 2.8269
**************************************
Dataset MLmethod ACC AUROC f1-score recall prec
test_tr.csv Nearest Neighbors 0.83 0.67 0.50 [0.94736842 0.4       ] [0.85714286 0.66666667]
test_tr.csv Linear SVM 0.79 0.50 0.00 [1. 0.] [0.79166667 0.        ]
test_tr.csv RBF SVM 0.96 0.90 0.89 [1.  0.8] [0.95 1.  ]
test_tr.csv Gaussian Process 0.79 0.50 0.00 [1. 0.] [0.79166667 0.        ]
test_tr.csv AdaBoost 0.96 0.90 0.89 [1.  0.8] [0.95 1.  ]
test_tr.csv Naive Bayes 0.21 0.50 0.34 [0. 1.] [0.         0.20833333]
test_tr.csv QDA 0.38 0.61 0.40 [0.21052632 1.        ] [1.   0.25]
test_tr.csv Neural Net 0.88 0.70 0.57 [1.  0.4] [0.86363636 1.        ]
test_tr.csv Decision Tree 0.96 0.90 0.89 [1.  0.8] [0.95 1.  ]
test_tr.csv Random Forest 0.96 0.90 0.89 [1.  0.8] [0.95 1.  ]
test_tr.csv XGBoost 1.00 1.00 1.00 [1. 1.] [1. 1.]
class 0 = 147 class 1 = 52
Ratio 1/0 1 : 2.8269
**************************************
Dataset MLmethod ACC AUROC f1-score recall prec
test2_tr.csv Nearest Neighbors 0.79 0.73 0.60 [0.85714286 0.6       ] [0.85714286 0.6       ]
test2_tr.csv Linear SVM 0.74 0.50 0.00 [1. 0.] [0.73684211 0.        ]
test2_tr.csv RBF SVM 0.95 0.90 0.89 [1.  0.8] [0.93333333 1.        ]
test2_tr.csv Gaussian Process 0.74 0.50 0.00 [1. 0.] [0.73684211 0.        ]
test2_tr.csv AdaBoost 0.84 0.76 0.67 [0.92857143 0.6       ] [0.86666667 0.75      ]
test2_tr.csv Naive Bayes 0.68 0.66 0.50 [0.71428571 0.6       ] [0.83333333 0.42857143]
test2_tr.csv QDA 0.26 0.50 0.42 [0. 1.] [0.         0.26315789]
test2_tr.csv Neural Net 0.84 0.70 0.57 [1.  0.4] [0.82352941 1.        ]
test2_tr.csv Decision Tree 0.95 0.90 0.89 [1.  0.8] [0.93333333 1.        ]
test2_tr.csv Random Forest 1.00 1.00 1.00 [1. 1.] [1. 1.]
test2_tr.csv XGBoost 0.89 0.86 0.80 [0.92857143 0.8       ] [0.92857143 0.8       ]
---> Saving results ...
Done!
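
One detail worth keeping in mind when reading the AUROC column: the script passes hard class predictions to `roc_auc_score`, and for binary 0/1 predictions the area under the ROC curve reduces to the mean of the two per-class recalls (balanced accuracy). The plain-Python sketch below checks this on hypothetical predictions whose counts (18 TN, 1 FP, 3 FN, 2 TP) were chosen to be consistent with the first Nearest Neighbors row. The helper functions are illustrative and are not scikit-learn's implementation.

```python
def recall_for(y_true, y_pred, label):
    """Fraction of examples of class `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total

def auroc(y_true, scores):
    """AUROC as P(random positive outranks random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: 19 negatives, 5 positives
y_true = [0] * 19 + [1] * 5
y_pred = [0] * 18 + [1] + [1] * 2 + [0] * 3  # 18 TN, 1 FP, 2 TP, 3 FN

balanced_acc = (recall_for(y_true, y_pred, 0) + recall_for(y_true, y_pred, 1)) / 2
auroc_hard = auroc(y_true, y_pred)
print(round(balanced_acc, 4), round(auroc_hard, 4))  # the two values coincide
```

To compare models by how well they rank examples, one would typically pass continuous scores instead, for example from `predict_proba` or `decision_function` where the estimator provides them.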


These baseline results give a first idea of which ML models are promising. Use the more advanced scripts to search for better parameters for the best models obtained here.
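
As a rough idea of what such a follow-up search could look like, the sketch below runs scikit-learn's `GridSearchCV` over a small Random Forest grid. It uses synthetic data from `make_classification` rather than the datasets above, and the grid values, scoring choice and fold count are illustrative assumptions, not the settings of the later scripts.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (NOT the datasets used above)
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.75, 0.25], random_state=0)

# Illustrative grid only; real searches usually cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="roc_auc",  # ranks candidates by cross-validated AUROC
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Here `scoring="roc_auc"` makes the search use predicted scores rather than hard labels, and `cv=3` keeps the example fast; both would need adjusting for small or strongly imbalanced datasets.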