In [1]:
import numpy as np
import pandas as pd
import neuroqwerty as nq #neuroqwerty is a .py file created by me
import random

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, auc
import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')

# Introduction to Parkinson's Disease and Starter Code

Parkinson's disease(PD) is a neurodegenerative disorder related to movement. This means that as time continues, symptoms will worsen. Symptoms of people with PD include tremors, stiffness, difficulty speaking, loss of balance and difficulty initiating movement. 

Typically, PD is diagnosed by a clinician who looks at symptoms, medical history and tests to reach a diagnosis. However, utilizing a classifier that relies on keyboard typing data allows for the early detection of PD without a clinician. This could revolutionize the way that diagnoses are made. The best course of action is to use the classifier as a first step for early detection and then seeking out a doctor to properly diagnose it. A clinician with expertise in PD will not only diagnose but also find the best treatments for the individual. 

In [2]:
myClass = nq.neuroqwerty()
cs1Pd = pd.read_csv('MIT-CSXPD_v2/MIT-CS1PD/GT_DataPD_MIT-CS1PD.csv')
cs2Pd = pd.read_csv('MIT-CSXPD_v2/MIT-CS2PD/GT_DataPD_MIT-CS2PD.csv')

In [3]:
cs1Pd = myClass.multiplecsv_to_df_two_files(cs1Pd, 'MIT-CSXPD_v2/MIT-CS1PD/data_MIT-CS1PD/', 'file_1', 'file_2') 
cs1Pd = cs1Pd.drop(columns=['file_2'])

cs2Pd = myClass.multiplecsv_to_df_left_right(cs2Pd, 'MIT-CSXPD_v2/MIT-CS2PD/data_MIT-CS2PD/')

combine = [cs1Pd, cs2Pd]
total_df =  pd.concat(combine, ignore_index=True)

Most of the libraries can be used by you, the user, with the exception of *neuroqwerty* which is a .py file, created by me, that contains functions to parse through each of the 85 people's individual dataset, extract key information, and place that information into a singular dataset. 
Since there were two experiments done, two different datasets were created called *cs1Pd* and *cs2Pd*. These two datasets were then combined into one dataset called *total_df* 

If you're curious about the dataset, it can be found here at: https://physionet.org/content/nqmitcsxpd/1.0.0/MIT-CS1PD/#files-panel 

In [4]:
X = total_df.drop(columns=['pID', 'gt', 'updrs108', 'nqScore', 'file_1', 'afTap' , 'sTap'])
y = total_df['gt']

The data is then divided into X(features) and y(labels). X has 67 features. X contains information from the keyboard typing dataset only. Focusing on typing data allows the classifier to be made available to the public as long as there is a keyboard. This is why features like 'updrs108', 'afTap' and 'sTap' that evaluate PD in a clinical setting were removed. 

# Exploration of different classifiers with different techniques 

Here we explore 11 different classifiers along with different techniques to see the wall time(how long a cell takes to run) and the performance metrics(measured using accuracy and AUC). This exploration will lead to finding the best classifier based on these metrics. 

Here we explore 11 different classifiers:
1. GaussianNB: stands for Gaussian Naive Bayes
2. SVC: stands for Support Vector Classification
3. NuSVC: stands for Nu-support Vector Classification
4. KNeigborsClassifier: stands for K-Nearest Neighbors
5. MLPClassifier: stands for Multi-Layer Perceptron
6. DecisionTreeClassifier 
7. RandomForestClassifier 
8. LogisticRegression 
9. AdaBoostClassifier 
10. GradientBoostingClassifier
11. XGBClassifier: stands for Extreme Gradient Boosting 

Different techniques explored were: 
- Standardization: scaling so that features are have a mean of 0 and a standard deviation of 1
- Feature selection: selecting the best 20 features using SelectKBest
- Bagging (or booststrap aggregation): ensemble learning technique that trains multiple models on bootstrapped data independently, generates predictions for each individual model, and then aggregates the predictions by using a majority voting approach to get the final prediction

### Baseline models: exploring models without any additional techniques 

For each individual classifier(e.g. DecisionTree), the dataset was split into training and testing sets 150 different times. Each time, the accuracy and AUC was found. After 150 times, the average of the accuracy and AUC was calculated, as well as the median of the accuracy. 

For the code below, I took inspiration from a Kaggle notebook on prediction of Alzheimer's. The link to the notebook is: https://www.kaggle.com/code/gallo33henrique/ml-alzheimer-predict/notebook  

Side note: the code here is very similar to the code used in the next sections

In [5]:
%%time
models = [GaussianNB(),
          SVC(probability=True),
          NuSVC(probability = True, nu = .30),
          KNeighborsClassifier(),
          MLPClassifier(), 
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100), 
          LogisticRegression(),
          AdaBoostClassifier(),
          GradientBoostingClassifier(),
          xgb.XGBClassifier()
         ]

# List to store metrics for each model
metricas = []
num = 150
# Evaluate each model
for model in models:

    acc_list = []
    auc_list = []
    for i in range(num): 
        rand_ = random.randint(0, 10000) 
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rand_)
        
        # Create the classifier
        classifier = model

        # Train the classifier
        classifier.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = classifier.predict(X_test)
        y_proba = classifier.predict_proba(X_test)[:,1]

        # Calculate accuracy
        acc_val = accuracy_score(y_test, y_pred)

        #Calculate AUC 
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        auc_val = auc(fpr, tpr)
        
        acc_list.append(acc_val)
        auc_list.append(auc_val)

    acc_avg = np.mean(acc_list)
    acc_median = np.median(acc_list)
    auc_avg = np.mean(auc_list)
    
    # Extract metrics of interest from the report
    metrics = {"Model": type(model).__name__,
               "Average Accuracy": acc_avg,
               "Median Accuracy": acc_median,
               "AUC Average": auc_avg
              }
    metricas.append(metrics)

# Convert the list of dictionaries into a DataFrame
df_metricas = pd.DataFrame(metricas)

# Function to highlight the maximum value in each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Apply the highlighting function
df_metricas_styled = df_metricas.style.apply(highlight_max, subset=['Average Accuracy', 'Median Accuracy', 
                                                                   'AUC Average'])

CPU times: user 2min 38s, sys: 54.3 s, total: 3min 33s
Wall time: 1min 3s


In [6]:
df_metricas_styled

Unnamed: 0,Model,Average Accuracy,Median Accuracy,AUC Average
0,GaussianNB,0.651818,0.636364,0.687849
1,SVC,0.412727,0.409091,0.547627
2,NuSVC,0.458485,0.454545,0.473096
3,KNeighborsClassifier,0.425455,0.409091,0.420915
4,MLPClassifier,0.530909,0.545455,0.550304
5,DecisionTreeClassifier,0.606667,0.590909,0.61162
6,RandomForestClassifier,0.666061,0.681818,0.738963
7,LogisticRegression,0.552121,0.545455,0.58609
8,AdaBoostClassifier,0.675455,0.681818,0.741397
9,GradientBoostingClassifier,0.64697,0.636364,0.733407


It takes 1 minute with 3 seconds to run the baseline models. 

The best classifier is AdaBoost with an average accuracy of .675 and an average AUC of .741

### Standardization: exploring models with standardization only 

Standardization is applied to each feature where each feature is scaled to have a mean of 0 and a standard deviation of 1.

In [7]:
%%time

models = [GaussianNB(),
          SVC(probability=True),
          NuSVC(probability = True, nu = .30),
          KNeighborsClassifier(),
          MLPClassifier(), 
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100), 
          LogisticRegression(),
          AdaBoostClassifier(), 
          GradientBoostingClassifier(),
          xgb.XGBClassifier()
         ]

# List to store metrics for each model
metricas_standardized = []
num = 150
# Evaluate each model
for model in models:
    acc_list = []
    auc_list = []
    for i in range(num): 
        rand_ = random.randint(0, 10000) 
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rand_)
        
        #Standardize
        scaler = StandardScaler()
        X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
        X_test = pd.DataFrame(scaler.fit_transform(X_test), columns=X_test.columns)
        
        # Create the classifier
        classifier = model

        # Train the classifier
        classifier.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = classifier.predict(X_test)
        y_proba = classifier.predict_proba(X_test)[:,1]

        # Calculate accuracy
        acc_val = accuracy_score(y_test, y_pred)

        #Calculate AUC 
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        auc_val = auc(fpr, tpr)
                
        acc_list.append(acc_val)
        auc_list.append(auc_val)
        
    acc_avg = np.mean(acc_list)
    acc_median = np.median(acc_list)
    auc_avg = np.mean(auc_list)
    # Extract metrics of interest from the report
    metrics_standardized = {"Model": type(model).__name__,
               "Average Accuracy": acc_avg,
               "Median Accuracy": acc_median,
               "AUC Average": auc_avg
              }
    metricas_standardized.append(metrics_standardized)

# Convert the list of dictionaries into a DataFrame
df_metricas_standard = pd.DataFrame(metricas_standardized)

# Function to highlight the maximum value in each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Apply the highlighting function
df_metricas_styled_standardized = df_metricas_standard.style.apply(highlight_max, subset=['Average Accuracy', 'Median Accuracy', 
                                                                   'AUC Average'])

CPU times: user 2min 57s, sys: 57.5 s, total: 3min 55s
Wall time: 1min 6s


In [8]:
df_metricas_styled_standardized

Unnamed: 0,Model,Average Accuracy,Median Accuracy,AUC Average
0,GaussianNB,0.645758,0.636364,0.649816
1,SVC,0.625455,0.636364,0.648241
2,NuSVC,0.622424,0.636364,0.637466
3,KNeighborsClassifier,0.58,0.590909,0.628848
4,MLPClassifier,0.63,0.636364,0.690782
5,DecisionTreeClassifier,0.601212,0.590909,0.607433
6,RandomForestClassifier,0.676667,0.681818,0.74229
7,LogisticRegression,0.654545,0.659091,0.72796
8,AdaBoostClassifier,0.633333,0.636364,0.691412
9,GradientBoostingClassifier,0.650303,0.636364,0.72221


It takes 1 minute with 6 seconds to run this code. 

The best classifier based on accuracy is random forest with an average accuracy of .677 and an average AUC of .742

When comparing the standardized models to the baseline models, the models that improved significantly were SVC, NuSVC, KNeighbors, MLP and LogisticRegression. Standardization is seen as a good preprocessing step before training a model, which is definitely true for k-nearest neighbors, SVM's, and neural networks. Although it's good to standardize data, there was a classifier that performed significantly worse. That classifier was AdaBoost. 

### Feature Selection: exploring models with the 20 best features

Each classifier uses SelectKBest to find the 20 best features. 

In [9]:
%%time

models = [GaussianNB(),
          SVC(probability=True),
          NuSVC(probability = True, nu = .30),
          KNeighborsClassifier(),
          MLPClassifier(), 
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100), 
          LogisticRegression(),
          AdaBoostClassifier(),
          GradientBoostingClassifier(),
          xgb.XGBClassifier()
         ]

# List to store metrics for each model
metricas_feat = []
num = 150
# Evaluate each model
for model in models:
    acc_list = []
    auc_list = []
    for i in range(num): 
        rand_ = random.randint(0, 10000) 
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rand_)
        
        # Select the 20 best features for the training and testing set
        selector = SelectKBest(f_classif, k=20)
        X_train = selector.fit_transform(X_train, y_train)
        X_test = selector.transform(X_test)
        
        # Create the classifier
        classifier = model

        # Train the classifier
        classifier.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = classifier.predict(X_test)
        y_proba = classifier.predict_proba(X_test)[:,1]

        # Calculate accuracy
        acc_val = accuracy_score(y_test, y_pred)

        #Calculate AUC 
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        auc_val = auc(fpr, tpr)
                
        acc_list.append(acc_val)
        auc_list.append(auc_val)

    acc_avg = np.mean(acc_list)
    acc_median = np.median(acc_list)
    auc_avg = np.mean(auc_list)
    # Extract metrics of interest from the report
    metrics_feat = {"Model": type(model).__name__,
               "Average Accuracy": acc_avg,
               "Median Accuracy": acc_median,
               "AUC Average": auc_avg
              }
    metricas_feat.append(metrics_feat)

# Convert the list of dictionaries into a DataFrame
df_metricas_feat = pd.DataFrame(metricas_feat)

# Function to highlight the maximum value in each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Apply the highlighting function
df_metricas_styled_feat = df_metricas_feat.style.apply(highlight_max, subset=['Average Accuracy', 'Median Accuracy', 
                                                                   'AUC Average'])

CPU times: user 57.8 s, sys: 2 s, total: 59.8 s
Wall time: 39.3 s


In [10]:
df_metricas_styled_feat

Unnamed: 0,Model,Average Accuracy,Median Accuracy,AUC Average
0,GaussianNB,0.664545,0.636364,0.690468
1,SVC,0.444848,0.431818,0.480101
2,NuSVC,0.571818,0.590909,0.60012
3,KNeighborsClassifier,0.478182,0.5,0.496843
4,MLPClassifier,0.597576,0.590909,0.620693
5,DecisionTreeClassifier,0.630606,0.636364,0.634653
6,RandomForestClassifier,0.696667,0.681818,0.772682
7,LogisticRegression,0.599697,0.590909,0.653705
8,AdaBoostClassifier,0.697879,0.727273,0.76122
9,GradientBoostingClassifier,0.678182,0.681818,0.753826


It takes 39.3 seconds to run this code. This is slightly faster than the previous two explorations because the previous explorations relied on all 67 features, whereas, this one relies on 20 features.

In terms of accuracy, the best classifier is AdaBoost with an average accuracy of .698. In terms of AUC, the best classifier is random forest with an an AUC of .773.

When comparing the 20-best features models to the baseline models, most models improved. The only decrease was for the support vector classifier in which the average AUC decreased from 0.548 to .480, but the average accuracy increased from .413 to .445. 

### Exploring classifiers with standardization,  feature selection and bagging

The next two explorations will look at bagging which is an excessively timely procedure. Because of this, instead of looking at each classifier 150 times like we previously did, we will look at each classifier 75 times.

For this section, each classifier uses all three techniques: standardization, feature selection and bagging. 

In [11]:
%%time

models = [GaussianNB(),
          SVC(probability=True),
          NuSVC(probability = True, nu = .30),
          KNeighborsClassifier(),
          MLPClassifier(), 
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100), 
          LogisticRegression(),
          AdaBoostClassifier(), 
          GradientBoostingClassifier(),
          xgb.XGBClassifier()
         ]

# List to store metrics for each model
metricas_stand_feat_bag = []
num = 75
# Evaluate each model
for model in models:
    acc_list = []
    auc_list = []
    for i in range(num): 
        rand_ = random.randint(0, 10000) 
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rand_)
        
        #Standardize
        scaler = StandardScaler()
        X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
        X_test = pd.DataFrame(scaler.fit_transform(X_test), columns=X_test.columns)
        
        # Select the 20 best features for the training and testing set
        selector = SelectKBest(f_classif, k=20)
        X_train = selector.fit_transform(X_train, y_train)
        X_test = selector.transform(X_test)
        
        # Create the base classifier
        base_classifier = model

        # Number of base models (iterations)
        n_estimators = 75
        
        # Create the Bagging classifier
        bagging_classifier = BaggingClassifier(base_estimator=base_classifier, n_estimators=n_estimators)

        # Train the Bagging classifier
        bagging_classifier.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = bagging_classifier.predict(X_test)
        y_proba = bagging_classifier.predict_proba(X_test)[:,1]

        # Calculate accuracy
        acc_val = accuracy_score(y_test, y_pred)

        #Calculate AUC 
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        auc_val = auc(fpr, tpr)
                
        acc_list.append(acc_val)
        auc_list.append(auc_val)

    acc_avg = np.mean(acc_list)
    acc_median = np.median(acc_list)
    auc_avg = np.mean(auc_list)
    
    # Extract metrics of interest from the report
    metrics_stand_feat_bag = {"Model": type(model).__name__,
               "Average Accuracy": acc_avg,
               "Median Accuracy": acc_median,
               "AUC Average": auc_avg
              }
    metricas_stand_feat_bag.append(metrics_stand_feat_bag)

# Convert the list of dictionaries into a DataFrame
df_metricas_stand_feat_bag = pd.DataFrame(metricas_stand_feat_bag)

# Function to highlight the maximum value in each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Apply the highlighting function
df_metricas_styled_stand_feat_bag = df_metricas_stand_feat_bag.style.apply(highlight_max, subset=['Average Accuracy', 'Median Accuracy', 
                                                                   'AUC Average'])

CPU times: user 29min 44s, sys: 1min, total: 30min 45s
Wall time: 21min 19s


In [12]:
df_metricas_styled_stand_feat_bag

Unnamed: 0,Model,Average Accuracy,Median Accuracy,AUC Average
0,GaussianNB,0.646667,0.681818,0.689175
1,SVC,0.613333,0.636364,0.698191
2,NuSVC,0.644242,0.636364,0.746021
3,KNeighborsClassifier,0.616364,0.590909,0.70456
4,MLPClassifier,0.677576,0.681818,0.758131
5,DecisionTreeClassifier,0.689091,0.681818,0.741349
6,RandomForestClassifier,0.679394,0.681818,0.763058
7,LogisticRegression,0.672727,0.681818,0.747257
8,AdaBoostClassifier,0.700606,0.681818,0.776442
9,GradientBoostingClassifier,0.658788,0.681818,0.744806


It takes 21 minutes with 19 seconds to run this code. Bagging is a computationaly costly procedure.  

The best classifier based on accuracy is AdaBoost with an average accuracy of .701 and an average AUC of .776

When comparing the average AUC of the baseline classifier with those of the classifiers with standardization, feature selection and bagging, most of the models improved. The AUC for gaussian naive bayes stayed roughly the same. When comparing the average accuracy, most of the classifiers increased. The naive bayes decreased slightly from .652 to .647.

### Exploring classifiers with feature selection and bagging

Each classifier uses two techniques: feature selection and bagging

In [13]:
%%time
models = [GaussianNB(),
          SVC(probability=True),
          NuSVC(probability = True, nu = .30),
          KNeighborsClassifier(),
          MLPClassifier(), 
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100), 
          LogisticRegression(),
          AdaBoostClassifier(),
          GradientBoostingClassifier(),
          xgb.XGBClassifier()
         ]

# List to store metrics for each model
metricas_feat_bag = []
num = 75
# Evaluate each model
for model in models:
    acc_list = []
    auc_list = []
    for i in range(num): 
        rand_ = random.randint(0, 10000) 
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rand_)
        
        # Select the 20 best features for the training and testing set
        selector = SelectKBest(f_classif, k=20)
        X_train = selector.fit_transform(X_train, y_train)
        X_test = selector.transform(X_test)
        
        # Create the base classifier
        base_classifier = model

        # Number of base models (iterations)
        n_estimators = 75
        
        # Create the Bagging classifier
        bagging_classifier = BaggingClassifier(base_estimator=base_classifier, n_estimators=n_estimators)

        # Train the Bagging classifier
        bagging_classifier.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = bagging_classifier.predict(X_test)
        y_proba = bagging_classifier.predict_proba(X_test)[:,1]

        # Calculate accuracy
        acc_val = accuracy_score(y_test, y_pred)

        #Calculate AUC 
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        auc_val = auc(fpr, tpr)
                
        acc_list.append(acc_val)
        auc_list.append(auc_val)

    acc_avg = np.mean(acc_list)
    acc_median = np.median(acc_list)
    auc_avg = np.mean(auc_list)
    
    # Extract metrics of interest from the report
    metrics_feat_bag = {"Model": type(model).__name__,
               "Average Accuracy": acc_avg,
               "Median Accuracy": acc_median,
               "AUC Average": auc_avg
              }
    metricas_feat_bag.append(metrics_feat_bag)

# Convert the list of dictionaries into a DataFrame
df_metricas_feat_bag = pd.DataFrame(metricas_feat_bag)

# Function to highlight the maximum value in each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Apply the highlighting function
df_metricas_styled_feat_bag = df_metricas_feat_bag.style.apply(highlight_max, subset=['Average Accuracy', 'Median Accuracy', 
                                                                   'AUC Average'])

CPU times: user 33min 26s, sys: 4min 39s, total: 38min 5s
Wall time: 22min 17s


In [14]:
df_metricas_styled_feat_bag

Unnamed: 0,Model,Average Accuracy,Median Accuracy,AUC Average
0,GaussianNB,0.667879,0.681818,0.680312
1,SVC,0.433333,0.409091,0.463362
2,NuSVC,0.539394,0.545455,0.609905
3,KNeighborsClassifier,0.466061,0.454545,0.472609
4,MLPClassifier,0.599394,0.590909,0.656381
5,DecisionTreeClassifier,0.690909,0.681818,0.748693
6,RandomForestClassifier,0.689091,0.681818,0.748436
7,LogisticRegression,0.589091,0.590909,0.644987
8,AdaBoostClassifier,0.72303,0.727273,0.798983
9,GradientBoostingClassifier,0.688485,0.681818,0.764994


It took 22 minutes with 17 seconds to run this code.

The best classifier is AdaBoost with an average accuracy of .723 and an average AUC of .799

When comparing the average AUC of the baseline classifier with those of the classifiers with feature selection and bagging, most models improved. The support vector's average AUC decreased from .548 to .463, and the naive bayes classifier stayed roughly the same. When comparing the average accuracy, all of the classifiers increased. 

### Key Takeaways

- Standardization: 
 - allows features to be scaled so that they have similar ranges in values 
 - improves accuracy when using logistic regression, support vector machines, neural networks and k-nearest neighbors classifiers
- Feature selection:
 - cuts down on time to run because there is less features
 - improves accuracy by choosing the most important features for the model while discarding the least important ones
- Bagging(or bootstrap aggregation): 
 - computationally expensive procedure 
 - improves accuracy by aggregating multiple individual models where each model has bootstrapped data
 - although costly, the best classifier uses bagging 
- Best classifier for this dataset:
 - was AdaBoost with both feature selection and bagging 
 
