# Machine Learning Baseline Generator

This is the first script in the series; it runs a set of ML methods with default parameters to obtain baseline results. We will use the following classifiers, all from *sklearn* except XGBClassifier, which is provided by the separate xgboost package:

1. KNeighborsClassifier - Nearest Neighbors
2. LinearSVC - Linear Support Vector Machine (SVM)
3. LogisticRegression - Logistic Regression
4. SVC - Support Vector Machine (SVM) with Radial Basis Functions (RBF)
5. AdaBoostClassifier - AdaBoost
6. GaussianNB - Gaussian Naive Bayes
7. MLPClassifier - Multi-Layer Perceptron (MLP) / Neural Networks
8. DecisionTreeClassifier - Decision Trees
9. RandomForestClassifier - Random Forest
10. GradientBoostingClassifier - Gradient Boosting
11. BaggingClassifier - ensemble meta-estimator Bagging
12. XGBClassifier - XGBoost

*Note: More advanced hyperparameter search will be done in future scripts!*
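As a rough illustration of what "default parameters" means here, the sketch below instantiates the scikit-learn classifiers from the list with their library defaults. It is not the `baseline_classifiers` function from `ds_utils.py`; the dictionary name `default_classifiers` is hypothetical, and XGBClassifier is left out because it requires the separate xgboost package.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

seed = 0  # fixed seed so that stochastic estimators are reproducible

# Hypothetical name: maps a label to an estimator built with default settings.
default_classifiers = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LinearSVC": LinearSVC(random_state=seed),
    "LogisticRegression": LogisticRegression(random_state=seed),
    "SVC": SVC(kernel="rbf", random_state=seed),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=seed),
    "GaussianNB": GaussianNB(),
    "MLPClassifier": MLPClassifier(random_state=seed),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
    "RandomForestClassifier": RandomForestClassifier(random_state=seed),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=seed),
    "BaggingClassifier": BaggingClassifier(random_state=seed),
}

# Show which estimator class sits behind each label.
for name, clf in default_classifiers.items():
    print(name, "->", type(clf).__name__)
```

Calling `get_params()` on any of these estimators lists the default values that a later hyperparameter search would vary.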

In [1]:
# import scripts
import os  # os.listdir is used below; imported explicitly rather than relying on the star import
from ds_utils import *
# import warnings
# warnings.filterwarnings("ignore")

This notebook walks through the procedure used by the Machine Learning baseline generator.
The first step is to create a list of training and test datasets, which are then used to fit the most commonly used algorithms and estimate their performance. To get a broad picture, several metrics are computed for every model.

### Step 1 - List of datasets and classifiers
As a first step, let's define a list of datasets.

In [2]:
# dataset folder
WorkingFolder  = './datasets/'
# BasicMLResults = 'ML_basic.csv' # a file with all the statistics for ML models

# Split details
seed = 0          # for reproducibility

# output variable
outVar = 'Lij'

# class weighting: "balanced" for unbalanced datasets (unequal number of examples per class)
class_weight = "balanced" # use None for balanced datasets!


# set list of train and test files

listFiles_tr = [col for col in os.listdir(WorkingFolder) if ('tr' in col)
                and (col[:5] != 'o.ds_') ]

listFiles_ts = [col for col in os.listdir(WorkingFolder) if ('ts' in col)
                and (col[:5] != 'o.ds_') ]

# Check the list of files to process
print('* Files to use for ML baseline generator:')
print('Training sets:\n', listFiles_tr)
print('Test sets:\n',listFiles_ts)

* Files to use for ML baseline generator:
Training sets:
 ['fs.rf.m.ds_MA_tr.csv', 'fs.rf.s.ds_MA_tr.csv', 'm.ds_MA_tr.csv', 'pca0.99.m.ds_MA_tr.csv', 'pca0.99.o.ds_MA_tr.csv', 'pca0.99.s.ds_MA_tr.csv', 's.ds_MA_tr.csv']
Test sets:
 ['fs.rf.m.ds_MA_ts.csv', 'fs.rf.s.ds_MA_ts.csv', 'm.ds_MA_ts.csv', 'pca0.99.m.ds_MA_ts.csv', 'pca0.99.o.ds_MA_ts.csv', 'pca0.99.s.ds_MA_ts.csv', 's.ds_MA_ts.csv']
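The substring test `'tr' in col` used above also matches any file name that merely contains "tr" somewhere, and the two lists only line up if every training file has a test counterpart at the same position. A stricter alternative is sketched below: it matches on the `_tr.csv` suffix and pairs each training file with its `_ts.csv` counterpart. The helper name `pair_train_test` and the suffix convention are assumptions based on the listing above, and the sketch does not reproduce the `'o.ds_'` prefix exclusion.

```python
def pair_train_test(file_names, tr_suffix="_tr.csv", ts_suffix="_ts.csv"):
    """Return (train, test) pairs for training files that have a matching test file."""
    available = set(file_names)
    pairs = []
    for name in sorted(file_names):
        if not name.endswith(tr_suffix):
            continue  # not a training file under the assumed naming scheme
        candidate = name[: -len(tr_suffix)] + ts_suffix
        if candidate in available:
            pairs.append((name, candidate))
    return pairs


# Illustrative file names; in the notebook these would come from os.listdir(WorkingFolder).
files = ["m.ds_MA_tr.csv", "m.ds_MA_ts.csv", "s.ds_MA_tr.csv",
         "s.ds_MA_ts.csv", "notes_tr.txt"]
pairs = pair_train_test(files)
print(pairs)
```

Training files without a test counterpart are silently skipped here; raising an error instead may be preferable when every dataset is expected to come as a pair.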


With the list of datasets defined, we can call the baseline_generator function, which builds a dataframe with the performance of every combination of dataset and algorithm. Remember that the baseline uses a set of algorithms that includes basic, complex and ensemble methods. To see which algorithms and parameters are used, call the baseline_classifiers function from the ds_utils.py script. Note also that, for the algorithms that accept class weights, the weights are calculated by the set_weights function, which relies on sklearn's class_weight.compute_class_weight. More information can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html.
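For reference, the "balanced" heuristic documented for compute_class_weight assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below re-implements that formula with the standard library only, on synthetic labels. The function name `balanced_class_weights` is hypothetical and this is not the `set_weights` function from `ds_utils.py`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Synthetic labels: 75 examples of class 0 and 25 of class 1.
y = [0] * 75 + [1] * 25
weights = balanced_class_weights(y)
print(weights)
```

The minority class receives the larger weight. Assuming the same formula was applied, the weights printed later in this notebook (about 0.679 for class 0 and 1.894 for class 1) would correspond to roughly 74% of the training examples belonging to class 0.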

Let's verify the balance of the classes and create the classifier definitions:

In [3]:
# Call baseline_classifiers to see which algorithms and parameters are being used. Note that
# baseline_classifiers() needs a y_tr_data argument to run; any y_tr_data will do if you just want
# to see the output.

y_tr_data = datasets_parser(listFiles_tr[0], listFiles_ts[0],outVar=outVar, WorkingFolder=WorkingFolder)[1]
ML_methods_used = baseline_classifiers(y_tr_data)
ML_methods_used

class weights {0: 0.6792961543919398, 1: 1.894341115947764}
**************************************


[KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_jobs=None, n_neighbors=5, p=2,
            weights='uniform'),
 LinearSVC(C=1.0, class_weight={0: 0.6792961543919398, 1: 1.894341115947764},
      dual=True, fit_intercept=True, intercept_scaling=1,
      loss='squared_hinge', max_iter=5000, multi_class='ovr', penalty='l2',
      random_state=0, tol=0.0001, verbose=0),
 LogisticRegression(C=1.0,
           class_weight={0: 0.6792961543919398, 1: 1.894341115947764},
           dual=False, fit_intercept=True, intercept_scaling=1,
           max_iter=500, multi_class='warn', n_jobs=None, penalty='l2',
           random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
           warm_start=False),
 SVC(C=1.0, cache_size=200,
   class_weight={0: 0.6792961543919398, 1: 1.894341115947764}, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
   max_iter=-1, probability=True, random_state=0, shrinking=True, 

### Step 2 - Baseline generator for all datasets and ML methods
With this settled, the next step is to call the baseline_generator function, which takes as arguments the lists of training and test sets defined previously. This function calculates several metrics for each train-test pair and creates a dataframe with all the performances. The final dataframe is sorted by AUC value, so the first row corresponds to the algorithm and dataset that achieve the best performance.
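The general shape of such a loop can be sketched as follows. This is not the `baseline_generator` implementation, which lives in `ds_utils.py` and is not shown here. The data are synthetic, only three classifiers are used, and computing AUC from predicted probabilities is an assumption, since the source does not say how the scores are obtained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for one train/test pair (not the notebook's data).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

classifiers = [
    LogisticRegression(max_iter=500, class_weight="balanced", random_state=0),
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    GradientBoostingClassifier(random_state=0),
]

rows = []
for clf in classifiers:
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_ts)[:, 1]  # probability of the positive class
    rows.append({
        "dataset": "synthetic",
        "classifier": type(clf).__name__,
        "AUC": roc_auc_score(y_ts, scores),
        "Accuracy": accuracy_score(y_ts, clf.predict(X_ts)),
    })

# Best AUC first, mirroring the sort order described above.
rows.sort(key=lambda r: r["AUC"], reverse=True)
for row in rows:
    print(row)
```

The list of dictionaries can be passed directly to `pandas.DataFrame` if a dataframe is preferred. Estimators without `predict_proba`, such as LinearSVC, would need `decision_function` scores instead.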

For each dataset and method we print here only the test **AUC** and **Accuracy**. The remaining statistics are saved to a local CSV file:

In [None]:
baseline = baseline_generator(listFiles_tr, listFiles_ts, outVar, WorkingFolder)
baseline

-> Generating Basic Machine Learning baseline...
-> Dataset: ./datasets/ fs.rf.m.ds_MA_tr.csv ...
class weights {0: 0.6792961543919398, 1: 1.894341115947764}
**************************************
* Classifier: KNeighborsClassifier ...
[0.7100568273922878, 0.7904]
* Classifier: LinearSVC ...
[0.7565345441158424, 0.7575]
* Classifier: LogisticRegression ...
[0.7292373979084006, 0.7901]
* Classifier: SVC ...
[0.7206875341820325, 0.7954]
* Classifier: AdaBoostClassifier ...
[0.7103583110250595, 0.7998]
* Classifier: GaussianNB ...
[0.6859775265919168, 0.7415]
* Classifier: MLPClassifier ...
[0.6754528693212376, 0.7904]
* Classifier: DecisionTreeClassifier ...
[0.6930495145204516, 0.7447]
* Classifier: RandomForestClassifier ...
[0.729241942887789, 0.7744]
* Classifier: GradientBoostingClassifier ...
[0.7032529932476757, 0.8067]
* Classifier: BaggingClassifier ...
[0.6832755363454427, 0.7751]
* Classifier: XGBClassifier ...
[0.6562912362192438, 0.7779]
-> Dataset: ./datasets/ fs.rf.s.ds_MA



[0.7659502264157232, 0.7563]
* Classifier: LogisticRegression ...
[0.7597584494954315, 0.76]
* Classifier: SVC ...
[0.7333437867859269, 0.7917]
* Classifier: AdaBoostClassifier ...
[0.7081532385250634, 0.7982]
* Classifier: GaussianNB ...
[0.6751006334186274, 0.7462]
* Classifier: MLPClassifier ...
[0.6882454713067877, 0.7885]
* Classifier: DecisionTreeClassifier ...
[0.7425678224549251, 0.7694]
* Classifier: RandomForestClassifier ...
[0.7481164847917414, 0.7804]
* Classifier: GradientBoostingClassifier ...
[0.7000601452272415, 0.802]
* Classifier: BaggingClassifier ...
[0.707380592029015, 0.7876]
* Classifier: XGBClassifier ...
[0.6478943867989558, 0.7801]
-> Dataset: ./datasets/ m.ds_MA_tr.csv ...
class weights {0: 0.6792961543919398, 1: 1.894341115947764}
**************************************
* Classifier: KNeighborsClassifier ...
[0.6923829175434689, 0.7879]
* Classifier: LinearSVC ...
[0.7996459461056344, 0.7935]
* Classifier: LogisticRegression ...
[0.784102874093466, 0.7863]
* Classifier: SVC ...
[0.7360798643778151, 0.7979]
* Classifier: AdaBoostClassifier ...
[0.7133716323596472, 0.7942]
* Classifier: GaussianNB ...
[0.6555572220479979, 0.6341]
* Classifier: MLPClassifier ...
[0.7292972401370159, 0.7935]
* Classifier: DecisionTreeClassifier ...
[0.6866426085757702, 0.756]
* Classifier: RandomForestClassifier ...
[0.7078373624575612, 0.786]
* Classifier: GradientBoostingClassifier ...
[0.7222661570229779, 0.8067]
* Classifier: BaggingClassifier ...
[0.7063928165085771, 0.7845]
* Classifier: XGBClassifier ...
[0.6877000737801655, 0.781]
-> Dataset: ./datasets/ pca0.99.m.ds_MA_tr.csv ...
class weights {0: 0.6792961543919398, 1: 1.894341115947764}
**************************************
* Classifier: KNeighborsClassifier ...
[0.6947553967842757, 0.7892]
* Classifier: LinearSVC ...
[0.7522562035181171, 0.7719]
* Classifier: LogisticRegression ...
[0.7522562035181171, 0.7719]
* Classifier: SVC ...


According to the previous results, the minitrain.csv dataset tends to achieve better performance than s.ds_MA_tr.csv. In addition, the Gradient Boosting Classifier is the method that achieves the best performance, so it is probably a good candidate for the minitrain.csv dataset. We could try several parameter combinations for that dataset and algorithm with a grid search strategy. Before going any further, though, we can plot the ROC curves for this baseline to get a graphical comparison of the methods used.

### Step 3 - AUROCs for the best dataset
Plot the ROC curves for the baseline above. We can do this by calling ROC_baseline_plot, which uses a single pair of training and test datasets. Since we concluded that minitrain.csv is the dataset with the best performance, we use it for this purpose.
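As a reminder of what a ROC curve contains, the sketch below computes the (false positive rate, true positive rate) points by sweeping the decision threshold over a handful of made-up scores, then integrates them with the trapezoid rule to obtain the AUC. It is a hand-rolled illustration using only the standard library, not the code behind `ROC_baseline_plot`. The function names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up binary labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
auc_value = auc_trapezoid(points)
print(points)
print(auc_value)  # 0.75
```

Plotting one such curve per classifier on the same axes gives the kind of visual comparison described above: curves closer to the top-left corner indicate better ranking of the positive class.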

In [None]:
# train and test sets used for ROC_baseline_plot
newFile_tr = 'minitrain.csv' # new training data 
newFile_ts = 'minitest.csv' # new testing data
roc_curve_baseline = ROC_baseline_plot(newFile_tr, newFile_ts)

With the previous plot we can confirm once again that GradientBoostingClassifier is a good candidate to optimize for this dataset.

In another notebook we will analyze how to look for a good combination of parameters for a set of chosen algorithms.