# Determining the best classifier for the POI in the Enron case

## Data Cleaning

**Loading the data**

In [1]:
import pickle
with open("final_project_dataset.pkl", "rb") as data_file:  # pickles should be opened in binary mode
    data_dict = pickle.load(data_file)

#### Features and number of features

In [2]:
all_features = []
c = 0
for key in data_dict:
    if c < 1:
        for feature in data_dict[key]:
            all_features.append(feature)
        c += 1
print "Features: \n{}".format(all_features)
print "Number of features: {}".format(len(all_features))

Features: 
['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']
Number of features: 21
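
The counter in the loop above only serves to stop after the first record. A more direct way to get the same list is to take one record and read its keys. The sketch below does that and also checks that every record exposes the same features. It uses a two-person toy dictionary (`toy_dict`, with hypothetical names and values) instead of the real pickle, and is written to run under Python 3.

```python
# Toy stand-in for data_dict (hypothetical names and values)
toy_dict = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 50, "poi": True},
}

# Take the first record and read its keys
first_record = next(iter(toy_dict.values()))
feature_names = sorted(first_record)

# Confirm that every record exposes the same set of features
same_keys = all(set(rec) == set(first_record) for rec in toy_dict.values())

print("Features: {}".format(feature_names))
print("Number of features: {}".format(len(feature_names)))
print("All records share these features: {}".format(same_keys))
```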


**Removal of "TOTAL" and "THE TRAVEL AGENCY IN THE PARK"**

"TOTAL" is the summary row of the financial data rather than a person; its removal was discussed in class.

In [3]:
data_dict.pop("TOTAL")

{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}

Looking through the dictionary, I noticed that "THE TRAVEL AGENCY IN THE PARK" is not a person, so I removed it:

In [4]:
data_dict.pop('THE TRAVEL AGENCY IN THE PARK')

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 362096,
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 362096,
 'total_stock_value': 'NaN'}

After seeing how other people have treated the Enron data, I found that LOCKHART EUGENE E has no data at all, so that record is removed here, too.

In [5]:
data_dict.pop('LOCKHART EUGENE E') # no data

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}
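
Records like this one can also be found programmatically rather than by inspection. The sketch below lists everyone whose features, apart from the `poi` label, are all `'NaN'`. `find_empty_records` is a hypothetical helper, and the dictionary is a toy stand-in with made-up names, so this only illustrates the idea (Python 3).

```python
def find_empty_records(data, label_key="poi"):
    """Return the names whose every feature except the label is 'NaN'."""
    empty = []
    for name, record in data.items():
        values = [v for k, v in record.items() if k != label_key]
        if values and all(v == "NaN" for v in values):
            empty.append(name)
    return sorted(empty)

# Toy stand-in for data_dict (hypothetical names and values)
toy_dict = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "PERSON C": {"salary": "NaN", "bonus": 50, "poi": True},
}

empty_names = find_empty_records(toy_dict)
print(empty_names)
```

Run against the real `data_dict`, a check like this would be one way to confirm the LOCKHART EUGENE E observation independently.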

**Fixing some data points**

Also from other people's work, I found out that two data points contain errors. Here, I manually fix the records of two people whose total_payments and total_stock_value didn't add up. The corrected values were taken from the provided PDF file (enron61702insiderpay.pdf).

In [6]:
data_dict['BELFER ROBERT']['director_fees'] = 102500
data_dict['BELFER ROBERT']['exercised_stock_options'] = 0
data_dict['BELFER ROBERT']['expenses'] = 3285
data_dict['BELFER ROBERT']['restricted_stock_deferred'] = -44093
data_dict['BELFER ROBERT']['total_payments'] = 3285
data_dict['BELFER ROBERT']['deferred_income'] = -102500
data_dict['BELFER ROBERT']['deferral_payments'] = 0
data_dict['BELFER ROBERT']['restricted_stock'] = 44093
data_dict['BELFER ROBERT']['total_stock_value'] = 0

data_dict['BHATNAGAR SANJAY']['director_fees'] = 0
data_dict['BHATNAGAR SANJAY']['exercised_stock_options'] = 15456290
data_dict['BHATNAGAR SANJAY']['expenses'] = 137864
data_dict['BHATNAGAR SANJAY']['other'] = 0
data_dict['BHATNAGAR SANJAY']['restricted_stock'] = 2604490
data_dict['BHATNAGAR SANJAY']['restricted_stock_deferred'] = -2604490
data_dict['BHATNAGAR SANJAY']['total_payments'] = 137864
data_dict['BHATNAGAR SANJAY']['total_stock_value'] = 15456290
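
A consistency check of this kind can be sketched as follows. It assumes that `total_payments` is the sum of the nine payment columns and `total_stock_value` the sum of the three stock columns. That grouping is my reading of the PDF layout, not something stated in the notebook. `totals_match` is a hypothetical helper, missing fields are treated as 0, and the record holds only the BELFER ROBERT values set in the cell above (Python 3).

```python
# Assumed grouping of the financial columns (not confirmed by the notebook)
PAYMENT_FIELDS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                  "deferral_payments", "loan_advances", "other", "expenses",
                  "director_fees"]
STOCK_FIELDS = ["exercised_stock_options", "restricted_stock",
                "restricted_stock_deferred"]

def as_number(value):
    """Treat the dataset's 'NaN' placeholder as zero."""
    return 0 if value == "NaN" else value

def totals_match(record):
    """Return (payments_ok, stock_ok) for one person's record."""
    payments = sum(as_number(record.get(f, "NaN")) for f in PAYMENT_FIELDS)
    stock = sum(as_number(record.get(f, "NaN")) for f in STOCK_FIELDS)
    return (payments == as_number(record.get("total_payments", "NaN")),
            stock == as_number(record.get("total_stock_value", "NaN")))

# Corrected BELFER ROBERT values from the cell above
belfer_fixed = {
    "director_fees": 102500, "exercised_stock_options": 0, "expenses": 3285,
    "restricted_stock_deferred": -44093, "total_payments": 3285,
    "deferred_income": -102500, "deferral_payments": 0,
    "restricted_stock": 44093, "total_stock_value": 0,
}
belfer_check = totals_match(belfer_fixed)

# Hypothetical broken variant, to show what a failed check looks like
broken = dict(belfer_fixed, total_payments=102500)
broken_check = totals_match(broken)

print(belfer_check, broken_check)
```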

**Derivation of new features, "fraction_to_poi", "fraction_from_poi"**

These were discussed in the class. 

In [7]:
def computeFraction(poi_messages, all_messages):
    """ given the number of messages to/from a POI (numerator)
        and the number of all messages to/from a person (denominator),
        return the fraction of messages to/from that person
        that are from/to a POI
    """
    # Both counts must be present; a zero denominator is treated as "no data".
    if poi_messages != 'NaN' and all_messages != 'NaN' and all_messages != 0:
        fraction = float(poi_messages) / float(all_messages)
    else:
        fraction = 0
    return fraction
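
To make the edge cases concrete: a fraction like this is only meaningful when both counts are present and the denominator is non-zero. The sketch below uses a hypothetical stand-alone helper, `safe_fraction`, and made-up message counts to show the value returned in each case (Python 3).

```python
def safe_fraction(poi_messages, all_messages):
    """Fraction of messages involving a POI, or 0.0 when it cannot be computed."""
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return float(poi_messages) / float(all_messages)

# (poi_messages, all_messages) pairs with hypothetical counts
examples = [(10, 200), ("NaN", 200), (10, "NaN"), ("NaN", "NaN"), (0, 0)]
results = [safe_fraction(p, a) for p, a in examples]

for (p, a), r in zip(examples, results):
    print("{!r:>6} / {!r:<6} -> {}".format(p, a, r))
```

Returning 0 for missing data conflates "no e-mail data" with "never wrote to a POI"; that is a simplification worth keeping in mind when interpreting the two derived features.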

In [8]:
for name in data_dict:
    data_point = data_dict[name]
    
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction(from_poi_to_this_person, to_messages)
    
    data_point["fraction_from_poi"] = fraction_from_poi
    
    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )
    
    data_point["fraction_to_poi"] = fraction_to_poi

Checking whether the new features were added:

In [9]:
all_features = []
c = 0
for key in data_dict:
    if c < 1:
        for feature in data_dict[key]:
            all_features.append(feature)
        c += 1

In [10]:
all_features

['to_messages',
 'deferral_payments',
 'expenses',
 'poi',
 'deferred_income',
 'email_address',
 'long_term_incentive',
 'fraction_from_poi',
 'restricted_stock_deferred',
 'shared_receipt_with_poi',
 'loan_advances',
 'from_messages',
 'other',
 'director_fees',
 'bonus',
 'total_stock_value',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'restricted_stock',
 'salary',
 'total_payments',
 'fraction_to_poi',
 'exercised_stock_options']

In [11]:
len(all_features)

23

**Removal of Some Features**

Because of the new features above, it makes sense to remove the features they were derived from. "email_address" is removed because it is a string and cannot be used in the numeric numpy arrays needed for the analysis. "poi" is removed for the moment because it needs to be the first element in the list, as required by the function used to create the numpy arrays (featureFormat).

In [12]:
features_remove = ["poi", "email_address", "from_poi_to_this_person", "from_this_person_to_poi", "from_messages", "to_messages"]

In [13]:
features_list = ["poi"]
for feature in all_features:
    if feature not in features_remove:
        features_list.append(feature)
print "features_list = {}".format(features_list)

features_list = ['poi', 'deferral_payments', 'expenses', 'deferred_income', 'long_term_incentive', 'fraction_from_poi', 'restricted_stock_deferred', 'shared_receipt_with_poi', 'loan_advances', 'other', 'director_fees', 'bonus', 'total_stock_value', 'restricted_stock', 'salary', 'total_payments', 'fraction_to_poi', 'exercised_stock_options']


In [14]:
len(features_list)

18

**Extracting features and labels from dataset for local testing**

In [15]:
from feature_format import featureFormat, targetFeatureSplit

In [16]:
my_dataset = data_dict
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

## Exploring a variety of classifiers

Here, I explore various classifiers by looping over them and collecting the scores for each classifier in a dictionary.

But before this can be done, the data is split into training and testing sets. As hinted in the provided tester.py, splitting the data is done using StratifiedShuffleSplit to account for the fact that the number of one class (i.e. POI) is a lot lower than the other (non-POI). 

**Splitting the features to test and train, converting to numpy arrays**

In [17]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
features = np.array(features)
labels = np.array(labels)
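cv = StratifiedShuffleSplit(n_splits=1000, random_state=42)
# Each iteration overwrites the previous split, so only the last of the
# 1000 train/test splits is kept for the analysis below.
for train_idx, test_idx in cv.split(features, labels):
    features_train, features_test = features[train_idx], features[test_idx]
    labels_train, labels_test = labels[train_idx], labels[test_idx]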

In [18]:
len(features_train)

128

In [19]:
len(labels_train)

128

In [20]:
len(features_test)

15

In [21]:
len(labels_test)

15
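
With a single 128/15 split, every score that follows rests on 15 test points, so one misclassified person moves precision or recall a lot. One alternative, sketched below, is to fit and score on every split and average the results. This is not what the notebook does; the data are synthetic (`X`, `y` are made up), GaussianNB is just a placeholder classifier, and the `zero_division` argument assumes a reasonably recent scikit-learn (Python 3).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data: 120 negatives and 20 positives, 3 features
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(120, 3)),
               rng.normal(1.5, 1.0, size=(20, 3))])
y = np.array([0] * 120 + [1] * 20)

cv = StratifiedShuffleSplit(n_splits=50, test_size=0.1, random_state=42)
precisions, recalls = [], []
for train_idx, test_idx in cv.split(X, y):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    # zero_division=0 reports 0 (without a warning) when nothing is predicted positive
    precisions.append(precision_score(y[test_idx], pred, zero_division=0))
    recalls.append(recall_score(y[test_idx], pred, zero_division=0))

mean_precision = float(np.mean(precisions))
mean_recall = float(np.mean(recalls))
print("mean precision: {:.3f}, mean recall: {:.3f}".format(mean_precision, mean_recall))
```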

**Importing modules**

In [22]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, precision_score

**Feature scaling**

In [23]:
min_max_scaler = MinMaxScaler()
features_train_minmax = min_max_scaler.fit_transform(features_train)
features_test_minmax = min_max_scaler.transform(features_test)
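
The arithmetic behind min-max scaling is simple enough to show by hand: each feature is mapped to (x - min) / (max - min), with min and max taken from the training data only, and the same two numbers are then reused for the test data. The sketch below is a plain-Python illustration with hypothetical salary values, not scikit-learn's implementation (Python 3).

```python
def fit_min_max(column):
    """Learn the scaling range from the training values."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Rescale values with a range learned elsewhere."""
    if hi == lo:
        # Constant feature: map everything to 0.0 in this sketch
        return [0.0 for _ in column]
    return [(x - lo) / float(hi - lo) for x in column]

# Hypothetical salary values
train_salary = [200000, 300000, 1000000]
test_salary = [250000, 1200000]

lo, hi = fit_min_max(train_salary)
train_scaled = apply_min_max(train_salary, lo, hi)
test_scaled = apply_min_max(test_salary, lo, hi)

print(train_scaled)
print(test_scaled)
```

Note that a test value larger than anything seen in training scales to more than 1, which is expected when the range is learned from the training set alone.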

**Feature Selection Using SelectKBest**

In [24]:
select = SelectKBest()
select.fit(features_train_minmax, labels_train)
features_train_minmax_skb = select.transform(features_train_minmax)
features_test_minmax_skb = select.transform(features_test_minmax)

In [163]:
select.scores_

array([  2.63967964e-01,   5.89873442e+00,   5.27911024e+00,
         2.61127433e+00,   2.34933761e+00,   7.93834480e-01,
         5.49532138e+00,   1.97148818e-01,   1.41634132e-02,
         1.96160845e+00,   1.11294792e+01,   1.46898644e+01,
         6.57270161e+00,   1.11962683e+01,   2.77163445e+00,
         8.24301644e+00,   1.37141611e+01])

In [146]:
select.pvalues_

array([  6.08306946e-01,   1.65631782e-02,   2.32316086e-02,
         1.08607636e-01,   1.27843297e-01,   3.74641359e-01,
         2.06317493e-02,   6.57793758e-01,   9.05456857e-01,
         1.63799239e-01,   1.11679789e-03,   1.99026034e-04,
         1.15300149e-02,   1.08049888e-03,   9.84329051e-02,
         4.80039891e-03,   3.17064113e-04])

In [148]:
features_list

['poi',
 'deferral_payments',
 'expenses',
 'deferred_income',
 'long_term_incentive',
 'fraction_from_poi',
 'restricted_stock_deferred',
 'shared_receipt_with_poi',
 'loan_advances',
 'other',
 'director_fees',
 'bonus',
 'total_stock_value',
 'restricted_stock',
 'salary',
 'total_payments',
 'fraction_to_poi',
 'exercised_stock_options']

In [151]:
feature_importance = zip(features_list[1:], select.scores_)

In [154]:
feature_importance

[('deferral_payments', 0.26396796392070854),
 ('expenses', 5.8987344206928967),
 ('deferred_income', 5.2791102364434765),
 ('long_term_incentive', 2.611274332028505),
 ('fraction_from_poi', 2.3493376092169673),
 ('restricted_stock_deferred', 0.79383448019177838),
 ('shared_receipt_with_poi', 5.495321379933408),
 ('loan_advances', 0.19714881780250351),
 ('other', 0.014163413206360534),
 ('director_fees', 1.9616084482924003),
 ('bonus', 11.129479151071294),
 ('total_stock_value', 14.689864354826563),
 ('restricted_stock', 6.5727016131222769),
 ('salary', 11.196268305382173),
 ('total_payments', 2.7716344531988995),
 ('fraction_to_poi', 8.2430164382598381),
 ('exercised_stock_options', 13.714161147390753)]

In [159]:
from operator import itemgetter
sorted(feature_importance, key=itemgetter(1), reverse=True)

[('total_stock_value', 14.689864354826563),
 ('exercised_stock_options', 13.714161147390753),
 ('salary', 11.196268305382173),
 ('bonus', 11.129479151071294),
 ('fraction_to_poi', 8.2430164382598381),
 ('restricted_stock', 6.5727016131222769),
 ('expenses', 5.8987344206928967),
 ('shared_receipt_with_poi', 5.495321379933408),
 ('deferred_income', 5.2791102364434765),
 ('total_payments', 2.7716344531988995),
 ('long_term_incentive', 2.611274332028505),
 ('fraction_from_poi', 2.3493376092169673),
 ('director_fees', 1.9616084482924003),
 ('restricted_stock_deferred', 0.79383448019177838),
 ('deferral_payments', 0.26396796392070854),
 ('loan_advances', 0.19714881780250351),
 ('other', 0.014163413206360534)]
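
Since `SelectKBest()` was created with its defaults, it keeps the `k=10` features with the highest scores (computed with `f_classif`). The sketch below reproduces that cut by hand from the scores listed above, rounded to three decimals, to show which features the classifiers actually see and which are dropped (Python 3).

```python
# Scores copied from the output above, rounded to three decimals
scores = {
    "deferral_payments": 0.264, "expenses": 5.899, "deferred_income": 5.279,
    "long_term_incentive": 2.611, "fraction_from_poi": 2.349,
    "restricted_stock_deferred": 0.794, "shared_receipt_with_poi": 5.495,
    "loan_advances": 0.197, "other": 0.014, "director_fees": 1.962,
    "bonus": 11.129, "total_stock_value": 14.690, "restricted_stock": 6.573,
    "salary": 11.196, "total_payments": 2.772, "fraction_to_poi": 8.243,
    "exercised_stock_options": 13.714,
}

k = 10  # SelectKBest's default
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:k]
dropped = ranked[k:]

print("Selected:", selected)
print("Dropped:", dropped)
```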

**Classifiers to be explored**

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html


In [99]:
names = ["K Nearest Neighbors", "RBF SVM", "Linear SVM", "Decision Tree", "Naive Bayes", 
         "AdaBoost", "Random Forest", "Logistic Regression"]

In [100]:
classifiers = [KNeighborsClassifier(), SVC(kernel='rbf', random_state=42), \
                SVC(kernel="linear", random_state=42), DecisionTreeClassifier(random_state=42), \
                GaussianNB(), AdaBoostClassifier(random_state=42), RandomForestClassifier(random_state=42), \
                LogisticRegression(random_state=42)]

**Iteration over various classifiers**

In [27]:
test_scores = []
confusion_matrices = []
precisionscores = []
recallscores = []
for name, clf in zip(names, classifiers):
    clf.fit(features_train_minmax_skb, labels_train)
    score =  round(clf.score(features_test_minmax_skb, labels_test), 3)
    test_scores.append(score)
    pred = clf.predict(features_test_minmax_skb)
    conf_mat = confusion_matrix(labels_test, pred)
    confusion_matrices.append(conf_mat)
    precisionscore = round(precision_score(labels_test, pred), 3)
    precisionscores.append(precisionscore)
    recallscore = round(recall_score(labels_test, pred), 3)
    recallscores.append(recallscore) 



In [28]:
import pprint
print "Feature Selection: SelectKBest"
print "Test Scores, Precision Scores, Recall Scores:"
pprint.pprint(zip(names, test_scores, precisionscores, recallscores))

Feature Selection: SelectKBest
Test Scores, Precision Scores, Recall Scores:
[('Nearest Neighbors', 0.867, 0.0, 0.0),
 ('RBF SVM', 0.867, 0.0, 0.0),
 ('Linear SVM', 0.867, 0.0, 0.0),
 ('Decision Tree', 0.733, 0.25, 0.5),
 ('Naive Bayes', 0.867, 0.5, 1.0),
 ('AdaBoost', 0.933, 0.667, 1.0),
 ('Random Forest', 0.933, 1.0, 0.5),
 ('Logistic Regression', 0.933, 1.0, 0.5)]


In [29]:
print "Confusion matrices: "
pprint.pprint(zip(names, confusion_matrices))

Confusion matrices: 
[('Nearest Neighbors', array([[13,  0],
       [ 2,  0]])),
 ('RBF SVM', array([[13,  0],
       [ 2,  0]])),
 ('Linear SVM', array([[13,  0],
       [ 2,  0]])),
 ('Decision Tree', array([[10,  3],
       [ 1,  1]])),
 ('Naive Bayes', array([[11,  2],
       [ 0,  2]])),
 ('AdaBoost', array([[12,  1],
       [ 0,  2]])),
 ('Random Forest', array([[13,  0],
       [ 1,  1]])),
 ('Logistic Regression', array([[13,  0],
       [ 1,  1]]))]
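
The accuracy, precision, and recall values can be read directly off these matrices. scikit-learn lays a binary confusion matrix out as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class. The sketch below recomputes the three scores for two of the matrices above; `scores_from_matrix` is a hypothetical helper, and it returns 0.0 when a ratio is undefined, which matches the value scikit-learn reports (with a warning) in that case (Python 3).

```python
def scores_from_matrix(matrix):
    """Return (accuracy, precision, recall) rounded to three decimals."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / float(total)
    # Precision is undefined when nothing is predicted positive
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    # Recall is undefined when there are no true positives to find
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return round(accuracy, 3), round(precision, 3), round(recall, 3)

adaboost_scores = scores_from_matrix([[12, 1], [0, 2]])
svm_scores = scores_from_matrix([[13, 0], [2, 0]])

print("AdaBoost:", adaboost_scores)
print("RBF SVM: ", svm_scores)
```

The second case shows why accuracy alone is misleading here: a classifier that never predicts a POI still scores 0.867 on this test set.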


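**Dimensionality Reduction Using PCA**

Here I chain the scaler, PCA, and the classifier in a Pipeline, so that each step is fit on the training set and then applied unchanged to the test set.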

In [30]:
from sklearn.pipeline import Pipeline

In [31]:
test_scores_pca = []
confusion_matrices_pca = []
precisionscores_pca = []
recallscores_pca = []
for name, clf in zip(names, classifiers):
    pipe = Pipeline([("scaler", MinMaxScaler()), ("pca", PCA(random_state=42)), ("clf", clf)])
    pipe.fit(features_train, labels_train)

    score =  round(pipe.score(features_test, labels_test), 3)
    test_scores_pca.append(score)

    pred = pipe.predict(features_test)
    
    conf_mat = confusion_matrix(labels_test, pred)
    confusion_matrices_pca.append(conf_mat)
    
    precisionscore = round(precision_score(labels_test, pred), 3)
    precisionscores_pca.append(precisionscore)
    
    recallscore = round(recall_score(labels_test, pred), 3)
    recallscores_pca.append(recallscore) 

In [32]:
import pprint
print "Dimensionality Reduction: PCA"
print "Test Scores, Precision Scores, Recall Scores:"
pprint.pprint(zip(names, test_scores_pca, precisionscores_pca, recallscores_pca))

Dimensionality Reduction: PCA
Test Scores, Precision Scores, Recall Scores:
[('Nearest Neighbors', 0.933, 1.0, 0.5),
 ('RBF SVM', 0.867, 0.0, 0.0),
 ('Linear SVM', 0.867, 0.0, 0.0),
 ('Decision Tree', 0.733, 0.25, 0.5),
 ('Naive Bayes', 0.867, 0.5, 1.0),
 ('AdaBoost', 0.867, 0.5, 0.5),
 ('Random Forest', 0.8, 0.0, 0.0),
 ('Logistic Regression', 0.867, 0.0, 0.0)]


**Testing whether my code above for SelectKBest is consistent with using Pipeline:**

I repeated the SelectKBest analysis above using a Pipeline to see whether it gives the same results. It's a good test of whether I am doing things correctly.

In [33]:
test_scores_skb = []
confusion_matrices_skb = []
precisionscores_skb = []
recallscores_skb = []
for name, clf in zip(names, classifiers):
    pipe_skb = Pipeline([("scaler", MinMaxScaler()), ("skb", SelectKBest()), ("clf", clf)])
    pipe_skb.fit(features_train, labels_train)

    score =  round(pipe_skb.score(features_test, labels_test), 3)
    test_scores_skb.append(score)
    
    pred = pipe_skb.predict(features_test)

    conf_mat = confusion_matrix(labels_test, pred)
    confusion_matrices_skb.append(conf_mat)
    
    precisionscore = round(precision_score(labels_test, pred), 3)
    precisionscores_skb.append(precisionscore)
    
    recallscore = round(recall_score(labels_test, pred), 3)
    recallscores_skb.append(recallscore) 

In [35]:
import pprint
print "Feature Selection: SelectKBest"
print "Test Scores, Precision Scores, Recall Scores:"
pprint.pprint(zip(names, test_scores_skb, precisionscores_skb, recallscores_skb))

Feature Selection: SelectKBest
Test Scores, Precision Scores, Recall Scores:
[('Nearest Neighbors', 0.867, 0.0, 0.0),
 ('RBF SVM', 0.867, 0.0, 0.0),
 ('Linear SVM', 0.867, 0.0, 0.0),
 ('Decision Tree', 0.733, 0.25, 0.5),
 ('Naive Bayes', 0.867, 0.5, 1.0),
 ('AdaBoost', 0.933, 0.667, 1.0),
 ('Random Forest', 0.933, 1.0, 0.5),
 ('Logistic Regression', 0.933, 1.0, 0.5)]


Results are the same. 

Below, I attempt to create dictionaries of results instead so I can convert them into pandas dataframes which are easier to look at than the lists above. 

In [101]:
pca = {}
for name, clf in zip(names, classifiers):
    pipe = Pipeline([("scaler", MinMaxScaler()), ("pca", PCA(random_state=42)), ("clf", clf)])
    pipe.fit(features_train, labels_train)
    
    pca_scores = {}

    score =  round(pipe.score(features_test, labels_test), 3)
    pca_scores["Accuracy score"] = score

    pred = pipe.predict(features_test)
    
    conf_mat = confusion_matrix(labels_test, pred)
    pca_scores["Confusion matrix"] = conf_mat
    
    precisionscore = round(precision_score(labels_test, pred), 3)
    pca_scores["Precision score"] = precisionscore
    
    recallscore = round(recall_score(labels_test, pred), 3)
    pca_scores["Recall score"] = recallscore
    
    pca[name] = pca_scores

In [102]:
pprint.pprint(pca)

{'AdaBoost': {'Accuracy score': 0.867,
              'Confusion matrix': array([[12,  1],
       [ 1,  1]]),
              'Precision score': 0.5,
              'Recall score': 0.5},
 'Decision Tree': {'Accuracy score': 0.733,
                   'Confusion matrix': array([[10,  3],
       [ 1,  1]]),
                   'Precision score': 0.25,
                   'Recall score': 0.5},
 'K Nearest Neighbors': {'Accuracy score': 0.933,
                         'Confusion matrix': array([[13,  0],
       [ 1,  1]]),
                         'Precision score': 1.0,
                         'Recall score': 0.5},
 'Linear SVM': {'Accuracy score': 0.867,
                'Confusion matrix': array([[13,  0],
       [ 2,  0]]),
                'Precision score': 0.0,
                'Recall score': 0.0},
 'Logistic Regression': {'Accuracy score': 0.867,
                         'Confusion matrix': array([[13,  0],
       [ 2,  0]]),
                         'Precision score': 0.0,
               

In [103]:
selectkbest = {}

for name, clf in zip(names, classifiers):
    pipe_skb = Pipeline([("scaler", MinMaxScaler()), ("skb", SelectKBest()), ("clf", clf)])
    pipe_skb.fit(features_train, labels_train)
    
    skb_scores = {}

    score =  round(pipe_skb.score(features_test, labels_test), 3)
    skb_scores["Accuracy score"] = score
    
    pred = pipe_skb.predict(features_test)

    conf_mat = confusion_matrix(labels_test, pred)
    skb_scores["Confusion matrix"] = conf_mat
    
    precisionscore = round(precision_score(labels_test, pred), 3)
    skb_scores["Precision score"] = precisionscore
    
    recallscore = round(recall_score(labels_test, pred), 3)
    skb_scores["Recall score"] = recallscore
    
    selectkbest[name] = skb_scores

In [104]:
pprint.pprint(selectkbest)

{'AdaBoost': {'Accuracy score': 0.933,
              'Confusion matrix': array([[12,  1],
       [ 0,  2]]),
              'Precision score': 0.667,
              'Recall score': 1.0},
 'Decision Tree': {'Accuracy score': 0.733,
                   'Confusion matrix': array([[10,  3],
       [ 1,  1]]),
                   'Precision score': 0.25,
                   'Recall score': 0.5},
 'K Nearest Neighbors': {'Accuracy score': 0.867,
                         'Confusion matrix': array([[13,  0],
       [ 2,  0]]),
                         'Precision score': 0.0,
                         'Recall score': 0.0},
 'Linear SVM': {'Accuracy score': 0.867,
                'Confusion matrix': array([[13,  0],
       [ 2,  0]]),
                'Precision score': 0.0,
                'Recall score': 0.0},
 'Logistic Regression': {'Accuracy score': 0.933,
                         'Confusion matrix': array([[13,  0],
       [ 1,  1]]),
                         'Precision score': 1.0,
             

**Converting the dictionaries to pandas dataframe to easily see the scores**

In [105]:
import pandas as pd

In [106]:
pd.DataFrame.from_dict(selectkbest)

| | AdaBoost | Decision Tree | K Nearest Neighbors | Linear SVM | Logistic Regression | Naive Bayes | RBF SVM | Random Forest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy score | 0.933 | 0.733 | 0.867 | 0.867 | 0.933 | 0.867 | 0.867 | 0.933 |
| Confusion matrix | [[12, 1], [0, 2]] | [[10, 3], [1, 1]] | [[13, 0], [2, 0]] | [[13, 0], [2, 0]] | [[13, 0], [1, 1]] | [[11, 2], [0, 2]] | [[13, 0], [2, 0]] | [[13, 0], [1, 1]] |
| Precision score | 0.667 | 0.25 | 0 | 0 | 1 | 0.5 | 0 | 1 |
| Recall score | 1 | 0.5 | 0 | 0 | 0.5 | 1 | 0 | 0.5 |


In [107]:
pd.DataFrame.from_dict(pca)

| | AdaBoost | Decision Tree | K Nearest Neighbors | Linear SVM | Logistic Regression | Naive Bayes | RBF SVM | Random Forest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy score | 0.867 | 0.733 | 0.933 | 0.867 | 0.867 | 0.867 | 0.867 | 0.8 |
| Confusion matrix | [[12, 1], [1, 1]] | [[10, 3], [1, 1]] | [[13, 0], [1, 1]] | [[13, 0], [2, 0]] | [[13, 0], [2, 0]] | [[11, 2], [0, 2]] | [[13, 0], [2, 0]] | [[12, 1], [2, 0]] |
| Precision score | 0.5 | 0.25 | 1 | 0 | 0 | 0.5 | 0 | 0 |
| Recall score | 0.5 | 0.5 | 0.5 | 0 | 0 | 1 | 0 | 0 |


From the results above, the best classifiers are:

SelectKBest feature selection:
- AdaBoost
- Naive Bayes
- Logistic Regression
- Random Forest

PCA for dimensionality reduction:
- K Nearest Neighbors
- Naive Bayes
- AdaBoost
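
One way to make this shortlist reproducible is to rank the classifiers by F1, the harmonic mean of precision and recall. F1 is my choice of criterion for this sketch, not necessarily the one used to draw up the lists above, although it does yield the same names. The precision and recall pairs are copied from the two tables (Python 3).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def rank_by_f1(results):
    """Return (name, f1) pairs, best first; ties are broken alphabetically."""
    scored = [(name, round(f1(p, r), 3)) for name, (p, r) in results.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# (precision, recall) per classifier, copied from the tables above
skb_results = {
    "AdaBoost": (0.667, 1.0), "Decision Tree": (0.25, 0.5),
    "K Nearest Neighbors": (0.0, 0.0), "Linear SVM": (0.0, 0.0),
    "Logistic Regression": (1.0, 0.5), "Naive Bayes": (0.5, 1.0),
    "RBF SVM": (0.0, 0.0), "Random Forest": (1.0, 0.5),
}
pca_results = {
    "AdaBoost": (0.5, 0.5), "Decision Tree": (0.25, 0.5),
    "K Nearest Neighbors": (1.0, 0.5), "Linear SVM": (0.0, 0.0),
    "Logistic Regression": (0.0, 0.0), "Naive Bayes": (0.5, 1.0),
    "RBF SVM": (0.0, 0.0), "Random Forest": (0.0, 0.0),
}

skb_ranked = rank_by_f1(skb_results)
pca_ranked = rank_by_f1(pca_results)

print("SelectKBest:", skb_ranked[:4])
print("PCA:        ", pca_ranked[:3])
```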

Decision Tree could also be used, but its parameters would probably need to be optimized first.

The next step is to optimize parameters. I chose to use the console for this, as it is faster to run the poi_id.py and tester.py files and obtain results that way.
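
For orientation, a parameter search of this kind could look like the sketch below, which tunes a scaler, SelectKBest, and Decision Tree pipeline with `GridSearchCV`. This is not the contents of poi_id.py: the data are synthetic, and the parameter grid, the F1 scoring, and the number of splits are all assumptions made for illustration (Python 3).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data: 120 negatives and 20 positives, 4 features
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(120, 4)),
               rng.normal(1.5, 1.0, size=(20, 4))])
y = np.array([0] * 120 + [1] * 20)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("skb", SelectKBest()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# Step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "skb__k": [1, 2, 4],
    "clf__min_samples_split": [2, 5, 10],
}

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

Searching over the whole pipeline, rather than over the classifier alone, keeps the scaling and feature selection inside each cross-validation split.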