# Project: Confused Person EEG Brainwave

The purpose of this project is to try to build an automated system that analyzes a brainwave signal and classifies it, in order to determine whether the person from whom the signal was recorded was in a confused mental state at that moment.

We want to find out whether the brainwave signals contain patterns that link particular frequencies to the mental state we are studying. Which frequencies affect the confusion state the most?

Furthermore, we want to build a classifier that decides whether a brainwave signal sample belongs to a confused person or not. Our question is “Does this signal come from a confused mental state?”

To do this, we are going to use unsupervised and supervised Machine Learning techniques.

## 1) Data Loading

First, we need to load the dataset: the EEG brainwave samples and the demographic information of the participants.  
The data is stored in the files "data/EEG_data.csv" and "data/datasets_106_24522_demographic_info.csv". Both files are in CSV format. Each file has a header line with the feature names, and values are separated by commas (‘,’).

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [2]:
df_samples = pd.read_csv("data/EEG_data.csv")
df_people = pd.read_csv("data/datasets_106_24522_demographic_info.csv")

FileNotFoundError: [Errno 2] File data/EEG_data.csv does not exist: 'data/EEG_data.csv'

##  2) Data Exploration
Let's explore our datasets

### 2.1 Data Structure
We need to understand how data is structured before doing our analysis.

What are the features? How is the data structured? How large is the dataset? What types of features are there?

In [3]:
print(f"EEG Brainwave samples shape: {df_samples.shape}")
print(f"People info: {df_people.shape}")

NameError: name 'df_samples' is not defined

In [4]:
df_samples.head(5)

NameError: name 'df_samples' is not defined

In [5]:
df_people.head(5)

NameError: name 'df_people' is not defined

In [6]:
df_samples.info()

NameError: name 'df_samples' is not defined

In [7]:
df_people.info()

NameError: name 'df_people' is not defined

#### Data structure
Every sample is associated with the person from whom it was recorded. The people data contains age, ethnicity and sex. The two datasets are linked by the subject identifier ('SubjectID' in the samples dataset, 'subject ID' in the demographic one), a unique identifier for the 10 students. Its values range from 0 to 9 (Student 1 – 10). As we can see, there are 12811 EEG samples and 10 people. Every feature in the samples dataset is numerical. The demographic dataset, instead, has two categorical features: sex and ethnicity. Sex is a binary feature: ‘M’ stands for “Male” and ‘F’ for “Female”. Ethnicity has more than two values: strings referring to the student's origin.

### 2.2 Joining datasets
We don't want to study the two datasets separately, so we have to join them. To do this, we rely on their relational link: the samples dataset contains a foreign key that refers to the primary key of the people dataset. The key column is 'SubjectID' in the samples dataset and 'subject ID' in the people dataset.
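
As a minimal sketch of this kind of join, the toy frames below (invented values, not taken from the real files; the `signal` column is hypothetical) are merged on differently named key columns. When `left_on` and `right_on` differ, `pd.merge` keeps both key columns, which is why one of them can be dropped afterwards.

```python
import pandas as pd

# Toy data, invented for illustration only
samples = pd.DataFrame({"SubjectID": [0, 0, 1], "signal": [56.0, 40.0, 47.0]})
people = pd.DataFrame({"subject ID": [0, 1], " age": [25, 24]})

# Inner join: each sample row gets the demographic columns of its subject
joined = pd.merge(left=samples, right=people, left_on="SubjectID", right_on="subject ID")

# Both key columns survive the merge, so one of them is redundant
joined = joined.drop(columns="SubjectID")
print(joined)
```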

In [8]:
#Dataframes Inner Join
EGG_dataset = pd.merge(left=df_samples, right=df_people, left_on='SubjectID', right_on='subject ID')
#duplicated column drop
EGG_dataset.drop(columns='SubjectID', inplace=True)

NameError: name 'df_samples' is not defined

In [9]:
print(f"Complete dataset shape: {EGG_dataset.shape}")

NameError: name 'EGG_dataset' is not defined

In [10]:
EGG_dataset.head(5)

NameError: name 'EGG_dataset' is not defined

### 2.3 Searching for missing values
The algorithms do not accept missing data, so we have to check for it. If the data contains NaN values, we have to impute them. Columns or rows with a high number of NaNs will be dropped.

In [11]:
EGG_dataset.isnull().sum()

NameError: name 'EGG_dataset' is not defined

There are no missing values: great!
If a new signal contains missing values, we should make sure to impute them. For this reason, we build an imputation function that fills NaNs with the mean value of the k nearest neighbours.

In [12]:
from sklearn.impute import KNNImputer

def impute_dataset(df):
    imputer = KNNImputer(missing_values=np.nan)
    ds_idxs = df.index
    ds_cols = df.columns 
    df = pd.DataFrame(imputer.fit_transform(df), index=ds_idxs, columns=ds_cols)
    return df
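
To make the behaviour concrete, here is a small sketch on an invented array. It assumes `n_neighbors=2` for readability (the function above relies on the scikit-learn default); each NaN is replaced by the mean of that feature over the nearest rows that have a value for it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries (values invented for illustration)
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# n_neighbors=2 is an assumption made for this small example
imputer = KNNImputer(missing_values=np.nan, n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Observed entries are unchanged; only the NaNs are filled
print(X_filled)
```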

##  3) Feature transformation

### 3.1 Encoding categorical features
The algorithms require numerical data. Thus, we have to look for categorical features and encode them.

In [13]:
import numbers

def encode_categorical_features(df):
    '''
    This function encodes features with non-numerical values.
    Features with two values are encoded as 0 and 1 (binary).
    Features with more than two non-numerical values are one-hot encoded with dummies.
    '''
    to_binaries = []
    to_encode = []
    
    for feature in df.columns:
        values = df[feature].unique()
        values = [x for x in values if not pd.isnull(x)]
        if not all(isinstance(value, numbers.Number) for value in values):
            if len(values) == 2:
                to_binaries.append(feature)
            else:
                to_encode.append(feature)

    for binary in to_binaries:
        values = df[binary].unique()
        values = [x for x in values if not pd.isnull(x)]
        df[binary] = df[binary].map(lambda x: 0 if x == values[0] else 1 if x == values[1] else np.nan)

    df = pd.get_dummies(df, columns=to_encode)
    
    return df
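
A quick sketch of the two encodings on an invented frame: a two-valued column is mapped to 0/1 and a multi-valued column is expanded with `pd.get_dummies`. The column names and values are hypothetical.

```python
import pandas as pd

# Invented demographic-like data, for illustration only
toy = pd.DataFrame({"sex": ["M", "F", "M"],
                    "origin": ["A", "B", "C"],
                    "age": [25, 24, 31]})

# Two distinct values: map the first one seen to 0 and the other to 1
first, second = toy["sex"].unique()
toy["sex"] = toy["sex"].map({first: 0, second: 1})

# More than two values: one indicator column per category
toy = pd.get_dummies(toy, columns=["origin"])
print(toy.columns.tolist())
```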

In [14]:
encoded_df = encode_categorical_features(EGG_dataset)

NameError: name 'EGG_dataset' is not defined

In [15]:
encoded_df.head(5)

NameError: name 'encoded_df' is not defined

### 3.2 Feature selection
The dataset may contain features that are useless for our problem. Training our models on such features may distort the results. Furthermore, the more features we have, the more computationally expensive our processes will be.

There are two features that we don't need. The first is 'VideoID'. This information relates to the context of the experiment and carries no information about the EEG brainwave itself. For this reason, we drop it. Another useless feature is 'predefinedlabel'. It indicates which confusion state the experiment conductor expected before running the test. We need 'user-definedlabeln' instead, because that is the label indicating whether a signal corresponds to a confusion state.

In [16]:
selected_df = encoded_df.drop(columns=['VideoID', 'predefinedlabel'])

NameError: name 'encoded_df' is not defined

In [17]:
selected_df.head(5)

NameError: name 'selected_df' is not defined

What about sex, age and ethnicity? Are these features useful for our purpose? Brains may work in similar but different ways in males and females. Age should affect brain activity too. Ethnicity? We cannot tell without substantial domain knowledge. However, we can study how much these features contribute to the variance by applying PCA, and then make a better-informed evaluation.

Moreover, what about the subject ID? This feature identifies the person who generated the recorded brainwave. It is clearly not useful for our purpose, but let's keep it for now. In order to find outliers, we want to look at the data and see whether some students' samples are distant from the others. For this reason, we will drop the feature later.

### 3.3 Feature scaling
Before we apply dimensionality reduction and unsupervised techniques to the data, we need to perform feature scaling. This way, the principal component vectors are not influenced by the natural differences in scale between features.
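
Standardization replaces every value x with z = (x - mean) / std, where mean and std are computed per feature. A minimal NumPy sketch on invented values:

```python
import numpy as np

# Two invented features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# z = (x - mean) / std, computed column by column
# (population standard deviation, ddof=0, as StandardScaler uses)
mean = X.mean(axis=0)
std = X.std(axis=0)
Z = (X - mean) / std

print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

Because every column is rescaled, identifier and label columns also change their values when the whole frame is scaled: values such as 0 and 1 become other floating-point numbers.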

In [18]:
from sklearn.preprocessing import StandardScaler

def scale_dataset(df, scaler=None):
    ds_idxs = df.index
    ds_cols = df.columns
    
    if scaler is None:
        scaler = StandardScaler()
        scaler = scaler.fit(df.values)
        
    df = pd.DataFrame(scaler.transform(df.values), index=ds_idxs, columns=ds_cols)
    return df, scaler

In [19]:
scaled_df, scaler = scale_dataset(selected_df)

NameError: name 'selected_df' is not defined

In [20]:
scaled_df.head(5)

NameError: name 'scaled_df' is not defined

### 3.4 Principal Component Analysis (PCA)
The data is ready for PCA. First, we apply PCA without specifying how many components we want. This way, we can see the variance explained by each component. After that, we can make an informed decision about the features.
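
As a rough sketch of what the explained variance tells us, the snippet below fits a PCA on synthetic data (generated here, not the EEG dataset) in which one column is almost a copy of another. The ratios are sorted in decreasing order and sum to 1 when all components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: the third column nearly duplicates the first one
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

pca_toy = PCA()  # keep all components
pca_toy.fit(X)

ratios = pca_toy.explained_variance_ratio_
print(np.round(ratios, 4))    # per-component share of the variance
print(np.cumsum(ratios)[-1])  # cumulative share over all components
```

The last component carries almost no variance here, which signals that one of the three columns is redundant.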

In [21]:
from sklearn.decomposition import PCA

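def do_pca(df, n_components=None, pca=None):
    if pca is None:
        # No fitted PCA supplied: create one and fit it on the data
        # (n_components=None keeps all components)
        pca = PCA(n_components=n_components)
        df_reduced = pca.fit_transform(df)
    else:
        # Reuse an already fitted PCA without refitting it
        df_reduced = pca.transform(df)
    return pca, df_reduced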

In [22]:
def pca_variance_plot(variance): # function inspired by one used in an exercise from the lessons
    n_components = len(variance)
    idxs = np.arange(n_components)
 
    plt.figure(figsize=(20, 10))
    ax = plt.subplot(111)
    cumvals = np.cumsum(variance)
    ax.bar(idxs, variance)
    ax.plot(idxs, cumvals)
 
    ax.xaxis.set_tick_params(width=2)
    ax.yaxis.set_tick_params(width=5, length=20)
 
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained")
    plt.title('Explained Variance Per Principal Component')

In [23]:
pca, dataset_reduct = do_pca(scaled_df)

NameError: name 'scaled_df' is not defined

In [24]:
pca_variance_plot(pca.explained_variance_ratio_)

NameError: name 'pca' is not defined

In [25]:
def pca_results(full_dataset, pca): # This function was taken from an exercise of the PCA lessons (in helper_functions.py)
    '''
    Create a DataFrame of the PCA results
    Includes dimension feature weights and explained variance
    Visualizes the PCA results
    '''
    # Dimension indexing
    dimensions = ['{}'.format(i) for i in range(1,len(pca.components_)+1)]
    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns = full_dataset.keys())
    components.index = dimensions
    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions
    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize = (14,8))
    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar');
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)
    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Expl Var\n          %.4f"%(ev))
    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)

In [26]:
pca_res = pca_results(scaled_df, pca)

NameError: name 'scaled_df' is not defined

In [27]:
pca_res = pd.DataFrame(pca_res)
display(pca_res)

NameError: name 'pca_res' is not defined

As we can see above, the features describing the people have little impact on the dataset. For this reason, we can drop them directly, without using PCA for feature extraction. We just drop the useless columns.

In [28]:
reduced_df = scaled_df.drop(columns=[' age', ' gender', ' ethnicity_Bengali',' ethnicity_English', ' ethnicity_Han Chinese'])

NameError: name 'scaled_df' is not defined

##  4) Searching for outliers
The dataset may contain some outlier students. Every student generated many samples, and it is possible that some students labeled themselves as confused or not confused incorrectly. Now, we look at the data to find students whose features show mean values very distant from those of the others.
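
One possible way to make this comparison explicit (a sketch, not the procedure used in the cells below, which compare the per-student means by eye) is to compute the mean per student and measure how far each one lies from the median of the student means. The data here is invented for illustration.

```python
import pandas as pd

# Invented samples: student 2 has clearly shifted values
toy = pd.DataFrame({
    "subject ID": [0, 0, 1, 1, 2, 2, 3, 3],
    "feature":    [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 1.0, 0.8],
})

# Mean of each feature per student
per_student = toy.groupby("subject ID").agg("mean")

# Absolute distance of every student's mean from the median student
distance = (per_student - per_student.median()).abs()
print(distance.sort_values("feature", ascending=False))
```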

In [29]:
data_subset = {}
for v in reduced_df['user-definedlabeln'].unique():
    data_subset[v] = reduced_df[reduced_df['user-definedlabeln'] == v]

NameError: name 'reduced_df' is not defined

In [30]:
print(data_subset.keys())

dict_keys([])


In [31]:
data_subset[0.975097266665175].groupby(['subject ID']).agg('mean')

KeyError: 0.975097266665175

In [32]:
data_subset[-1.0255387171989436].groupby(['subject ID']).agg('mean')

KeyError: -1.0255387171989436

We can note that students 3 (scaled subject ID -0.868121) and 7 (scaled subject ID 0.527912) have mean values that are very distant from those of the other students for many features, both for confused and not confused samples. We have to reject these samples. After that, we can drop the 'subject ID' column.

In [33]:
reduced_df['subject ID'].unique()

NameError: name 'reduced_df' is not defined

In [34]:
reduced_df = reduced_df[reduced_df['subject ID'] != -0.8681212082604366]

NameError: name 'reduced_df' is not defined

In [35]:
reduced_df = reduced_df[reduced_df['subject ID'] != 0.5279122818574891]

NameError: name 'reduced_df' is not defined

In [36]:
reduced_df['subject ID'].unique()

NameError: name 'reduced_df' is not defined

In [37]:
reduced_df.drop(columns=['subject ID'], inplace=True)

NameError: name 'reduced_df' is not defined

In [38]:
reduced_df.head(5)

NameError: name 'reduced_df' is not defined

##  5) Preprocessing function
Let's build a function with the pre-processing steps, according to the previous data exploration and analysis.

In [39]:
from sklearn.impute import KNNImputer
import numbers

def impute_dataset(df):
    imputer = KNNImputer(missing_values=np.nan)
    ds_idxs = df.index
    ds_cols = df.columns 
    df = pd.DataFrame(imputer.fit_transform(df), index=ds_idxs, columns=ds_cols)
    return df

def remove_outliers(EGG_data_df):
    EGG_data_df = EGG_data_df[EGG_data_df['SubjectID'] != 2] #remove outlier student 3
    EGG_data_df = EGG_data_df[EGG_data_df['SubjectID'] != 6] #remove outlier student 7
    return EGG_data_df

def preprocess_data(EGG_data_df):
    
    EGG_data_df = impute_dataset(EGG_data_df)
    EGG_data_df = remove_outliers(EGG_data_df)
    EGG_data_df = EGG_data_df.drop(columns=['VideoID', 'predefinedlabel', 'SubjectID'])  # avoid an in-place drop on a filtered slice
    return EGG_data_df
    

In [40]:
dataset = pd.read_csv("data/EEG_data.csv")
dataset = preprocess_data(dataset)

FileNotFoundError: [Errno 2] File data/EEG_data.csv does not exist: 'data/EEG_data.csv'

In [41]:
dataset.head(5)

NameError: name 'dataset' is not defined

##  5) Shuffle and Split Data
The data is ready for supervised classification. Let's shuffle the samples and split them into training and testing data.

In [42]:
# Import train_test_split
from sklearn.model_selection import train_test_split

y = dataset['user-definedlabeln']
X = dataset.drop(columns=['user-definedlabeln'])

# Split the features and labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

NameError: name 'dataset' is not defined

##  6) Model training, tuning and selection
A supervised model has to be selected as the final model. In order to do this, we will train four different models and evaluate them. After that we will choose the best one. Every model will be tuned using different hyperparameters.

### 6.1 Algorithms selection

- **Decision Trees** 

The **Decision Trees** model is used for classification problems. This algorithm analyses the data features and estimates how much information each feature provides for making a prediction. The model can make very accurate predictions, because features are combined according to their information gain. This type of model also has a drawback: it tends to overfit a lot, so we have to use it with care or combine it with an ensemble method.

- **Ensemble Methods** 

Ensemble methods solve classification problems by combining different classification models. They can be used for real-world applications, especially when the data is not linearly separable. We can make a prediction using different models combined with an ensemble method, such as AdaBoost, Bagging or Random Forest. Furthermore, they can produce models that fit the data well because they find a good compromise between bias and variance. We will use **AdaBoost** and **XGBoost** (Extreme Gradient Boosting).

- **Support Vector Machine**

**Support Vector Machine** can make good classification predictions by defining models with a well-fitted boundary. The quality of this boundary comes from the use of the margin. The model is highly flexible thanks to its parameters. The algorithm can work on different types of data through its kernels, such as linear, polynomial and RBF. Moreover, an SVM model can be tuned to decide how strict it has to be: with the C parameter we define how much weight we assign to the classification error relative to the margin error.

### 6.2 Creating a Training and Tuning Pipeline
#### 6.2.1 Creating a tuning function and an evaluation function
In order to evaluate the models, we have to create a training and tuning pipeline: every model is trained and tuned.
Thus, we have to define functions able to:
 - Fit the learner to the sampled training data and record the training time.
 - Get the best estimator from grid search.
 - Perform predictions on the test data X_test, and also on the first 300 training points.
 - Record the total prediction time.
 - Calculate the accuracy score for both the training subset and the testing set.
 - Calculate the F-score for both the training subset and the testing set.
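
For reference, the F-score used in this pipeline is the F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The code in this section uses beta = 0.5, which weighs precision more than recall. A plain-Python sketch on invented labels (the helper name `fbeta` is hypothetical):

```python
def fbeta(y_true, y_pred, beta=0.5):
    """F-beta score for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: the score is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented labels: precision = 2/3, recall = 1/2
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]
score = fbeta(y_true, y_pred, beta=0.5)
print(score)
```

With beta below 1 the score sits closer to the precision than the plain F1 score would.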


In [43]:
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV


def get_best_estimator(learner, hyperparameters_combinations, X_train, y_train):
    '''
    This function takes a classifier and a combination of parameters.
    It returns a model tuned by the hyperparameters combination.
    '''
    #Get a scorer for Grid Search
    scorer = make_scorer(fbeta_score, beta=0.5)
    #Perform grid search on the classifier using 'scorer' as the scoring method 
    grid_obj = GridSearchCV(learner, hyperparameters_combinations, scoring=scorer)
    #Fit the grid search object to the training data and find the optimal parameters
    grid_fit = grid_obj.fit(X_train, y_train)
    # Get the best estimator
    learner = grid_fit.best_estimator_
    print(f"Best params: {grid_fit.best_params_}")
    #return
    return learner
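
A small usage sketch of the same grid-search pattern on synthetic data. The dataset (from `make_classification`) and the tiny grid are assumptions made to keep the example fast; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the EEG data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Same kind of scorer as above: F-beta with beta=0.5
toy_scorer = make_scorer(fbeta_score, beta=0.5)

# Deliberately small grid so the search finishes quickly
toy_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), toy_grid, scoring=toy_scorer)
search.fit(X_toy, y_toy)

print(search.best_params_)
best_tree = search.best_estimator_  # refitted on all of X_toy with the best parameters
```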

In [44]:
from time import time

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: labels training set
       - X_test: features testing set
       - y_test: labels testing set
    '''
    results = {}
    
    #Fit the learner to the training data again in order to record the training time
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time() # Get end time
    
    #Calculate the training time
    results['train_time'] = end - start
        
    # Get the predictions on the test set(X_test),
    #then get predictions on the first 300 training samples(X_train)
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    if learner.__class__.__name__ == 'XGBClassifier':
        predictions_test = [round(value) for value in predictions_test]
        predictions_train = [round(value) for value in predictions_train]
    end = time() # Get end time
    
    #Calculate the total prediction time
    results['pred_time'] = end - start
   
    # Compute accuracy on the first 300 training samples which is y_train[:300]
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
   
    #Compute accuracy on test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    
    #Compute F-score on the the first 300 training samples using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    
    #Compute F-score on the test set which is y_test
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
        
    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    print('Classifier {} Accuracy {} f_score {}'.format(learner.__class__.__name__, results['acc_test'], results['f_test']))
          
    # Return the results
    return results

#### 6.2.2 Classifiers creation and tuning
Now the functions are created. We have to run the pipeline on the four selected classifiers. First, we obtain the best estimator for every algorithm. After that, we train and test each of them in order to evaluate their performance.

In [45]:
#Import the models from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
import xgboost as xgb

#Initialize classifiers
clf_A = DecisionTreeClassifier(random_state = 0)
parameters_A = {'max_depth': list(range(4, 20, 2)),
                'min_samples_split': list(range(2, 15, 2)),
                'min_samples_leaf': list(range(1,  20, 2))
               }
clf_B = AdaBoostClassifier(random_state = 0)
parameters_B = {'algorithm':['SAMME','SAMME.R'],
                'n_estimators':[10, 40, 60, 100, 120, 130, 140]
               }
clf_C = SVC(random_state = 0)
parameters_C = {'kernel': ['rbf'],
                'C': list(np.arange(0.5, 1.5, 0.1)),
                'gamma': ['scale', 'auto']
               }
clf_D = xgb.XGBClassifier(seed=0)
parameters_D = {'base_score': list(np.arange(0.2, 0.5, 0.1)),
                'n_estimators': [10, 40, 60, 100, 120, 130, 140],
                'objective': ['binary:logistic']}


#Collect results on the learners
best_estimators = {}
for clf, params in [(clf_A, parameters_A), (clf_B, parameters_B), (clf_C, parameters_C), (clf_D, parameters_D)]:
    print(f'searching for best estimator: classifier {clf.__class__.__name__}')
    best_estimators[clf.__class__.__name__] = get_best_estimator(clf, params, X_train, y_train)

searching for best estimator: classifier DecisionTreeClassifier


NameError: name 'X_train' is not defined

#### 6.2.3 Classifiers evaluation metrics collection
We have the best estimator for every algorithm we decided to use. Now it is time to evaluate them. Let's train the models on different percentages of the data. We'll evaluate the accuracy score, F-score and execution time of every model using the training and testing datasets. Looking at these results, we will be able to identify the best model according to its performance, accuracy and bias-variance balance.

In [46]:
#Calculate the number of samples for 1%, 25%, 50%, and 100% of the training data
samples_100 = len(y_train)
samples_50 = int((len(y_train)/100)*50)
samples_25 = int((len(y_train)/100)*25)
samples_1 = int((len(y_train)/100)*1)

#Collect results on the learners
results = {}
for clf, params in [(best_estimators['DecisionTreeClassifier'], parameters_A),
                    (best_estimators['AdaBoostClassifier'], parameters_B),
                    (best_estimators['SVC'], parameters_C),
                    (best_estimators['XGBClassifier'], parameters_D)
                   ]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_25, samples_50, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_test, y_test)

NameError: name 'y_train' is not defined

#### 6.2.4 Classifiers evaluation
We want to evaluate the results obtained above. Thus, we build a function that displays the models' metrics, so that we can easily compare them in a graph visualization.

In [47]:
import matplotlib.pyplot as pl
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score, accuracy_score

#taken from a notebook used in the lessons
def evaluate(results):
    """
    Visualization code to display results of various learners.
    
    inputs:
      - results: a dictionary mapping each learner name to the statistic results from 'train_predict()',
        keyed by training set size index (0 to 3)
    """
  
    # Create figure
    fig, ax = pl.subplots(2, 3, figsize = (19,10))

    # Constants
    bar_width = 0.22
    colors = ['#A00000','#00A0A0','#00A000', '#F5B041']
    
    # Loop to plot six panels of data
    for k, learner in enumerate(results.keys()):
        for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']):
            for i in np.arange(4):
                
                # Creative plot code
                ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k])
                ax[j//3, j%3].set_xticks([0.30, 1.30, 2.30, 3.30])
                ax[j//3, j%3].set_xticklabels(["1%", "25%", "50%", "100%"])
                ax[j//3, j%3].set_xlabel("Training Set Size")
                ax[j//3, j%3].set_xlim((-0.1, 4))
    
    # Add unique y-labels
    ax[0, 0].set_ylabel("Time (in seconds)")
    ax[0, 1].set_ylabel("Accuracy Score")
    ax[0, 2].set_ylabel("F-score")
    ax[1, 0].set_ylabel("Time (in seconds)")
    ax[1, 1].set_ylabel("Accuracy Score")
    ax[1, 2].set_ylabel("F-score")
    
    # Add titles
    ax[0, 0].set_title("Model Training")
    ax[0, 1].set_title("Accuracy Score on Training Subset")
    ax[0, 2].set_title("F-score on Training Subset")
    ax[1, 0].set_title("Model Predicting")
    ax[1, 1].set_title("Accuracy Score on Testing Set")
    ax[1, 2].set_title("F-score on Testing Set")
    
    
    # Set y-limits for score panels
    ax[0, 1].set_ylim((0, 1))
    ax[0, 2].set_ylim((0, 1))
    ax[1, 1].set_ylim((0, 1))
    ax[1, 2].set_ylim((0, 1))

    # Create patches for the legend
    patches = []
    for i, learner in enumerate(results.keys()):
        patches.append(mpatches.Patch(color = colors[i], label = learner))
    pl.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \
               loc = 'upper center', borderaxespad = 0., ncol = 4, fontsize = 'x-large')
    
    # Aesthetics
    pl.suptitle("Performance Metrics for Four Supervised Learning Models", fontsize = 16, x = 0.63, y = 1.05)
    # Tune the subplot layout
    # Refer - https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.subplots_adjust.html for more details on the arguments
    pl.subplots_adjust(left = 0.125, right = 1.2, bottom = 0.1, top = 0.9, wspace = 0.2, hspace = 0.3) 
    pl.tight_layout()
    pl.show()

In [48]:
evaluate(results)

NameError: name 'results' is not defined

### 6.3 Model selection
Looking at these graphs, we can clearly identify the models with the highest scores:

- The execution-time metric highlights a model that seems to be much more computationally expensive than the others: SVM. On the other hand, one algorithm turns out to be really fast: Decision Trees. AdaBoost and XGBoost still have a good execution time, so if they perform better on the other metrics, we should not worry much about the time comparison.

- To look for models with a bad bias-variance balance, we have to look at the trends of the other metrics. An underfitting model shows training and testing scores converging to a low value. There aren’t underfitting models here. An overfitting model, instead, shows training and testing scores diverging as the data size increases. There aren’t overfitting models here.

- The accuracy and F-score metrics highlight that the XGBoost model has the highest values. AdaBoost and Decision Trees are very similar to each other. SVM seems to be the worst model.

According to our metrics, we select the XGBoost model as the one we want to use, with an accuracy of about 67%. XGBoostClassifier: {‘Accuracy’: 0.674305216967333, ‘f_score’: 0.6878541076487251}

In [49]:
selected_model = best_estimators['XGBClassifier']

KeyError: 'XGBClassifier'

## 7) Visualize Feature importance
One of the purposes of our project is to understand which features are most related to detecting confusion in brainwaves. When applying PCA to the data, we could see which features had the highest variance. Now that we have trained a supervised model, we can see how relevant each feature is for the final classification and, as a consequence, which features matter most in the confused-state brainwave. Let's build a function that displays this ranking.
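
The ranking itself is just a descending sort of the importance vector. A minimal sketch with invented feature names and weights:

```python
import numpy as np

# Hypothetical importances, shaped like the feature_importances_ array
# that a fitted tree-based model exposes
names = np.array(["f0", "f1", "f2", "f3"])
importances = np.array([0.10, 0.45, 0.05, 0.40])

# argsort is ascending, so reverse it to get the most important first
order = np.argsort(importances)[::-1]
top = names[order[:2]]
cumulative = np.cumsum(importances[order])

print(top)         # two most important features
print(cumulative)  # cumulative weight along the ranking
```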

In [50]:
def feature_plot(importances, X_train, y_train):
    
    # Display the five most important features
    indices = np.argsort(importances)[::-1]
    columns = X_train.columns.values[indices[:5]]
    values = importances[indices][:5]

    # Create the plot
    fig = pl.figure(figsize = (9,5))
    pl.title("Normalized Weights for First Five Most Predictive Features", fontsize = 16)
    pl.bar(np.arange(5), values, width = 0.6, align="center", color = '#00A000', \
          label = "Feature Weight")
    pl.bar(np.arange(5) - 0.3, np.cumsum(values), width = 0.2, align = "center", color = '#00A0A0', \
          label = "Cumulative Feature Weight")
    pl.xticks(np.arange(5), columns)
    pl.xlim((-0.5, 4.5))
    pl.ylabel("Weight", fontsize = 12)
    pl.xlabel("Feature", fontsize = 12)
    
    pl.legend(loc = 'upper center')
    pl.tight_layout()
    pl.show()  

In [51]:
feature_plot(best_estimators['AdaBoostClassifier'].feature_importances_, X_train, y_train)

KeyError: 'AdaBoostClassifier'