## Identify Fraud from Enron Email

<li>Author: Jubin Soni</li>
<li>Data Analyst Nanodegree Machine Learning Project</li>
<li><a href='http://htmlpreview.github.io/?https://github.com/jubins/DAND-Nanodegree/blob/master/dandp7-ML-Identify-Fraud-From-Enron-Emails/MachineLearningProject_JubinSoni.html'>GitHub Link</a></li>

### Overview

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.

The purpose of this project is to use machine learning to identify Enron employees who may have committed fraud, based on the public Enron financial and email dataset. I carried out the full process end to end, investigating the data through a machine learning lens.

### Dataset

The data is combined with a hand-generated list of persons of interest (POIs) in the fraud case, that is, individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

### Task 1: Select what features you'll use.

In [1]:
# %load poi_id.py
#!/usr/bin/python

import sys, os
import pickle

#Build the path to the tools folder in a platform-independent way
ospath = os.path.join(os.getcwd(), 'tools')
sys.path.append(ospath)

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'other',
 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages',
 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person']

### Load the dictionary containing the dataset
#Pickle files must be opened in binary mode
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)



In [2]:
import pandas as pd
import numpy as np

enron_data = pd.DataFrame.from_dict(data_dict, orient='index')

In [3]:
enron_data.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,1729541.0,4175000.0,126027.0,1407.0,-126027.0,1729541,...,,2195.0,152.0,65.0,False,,-3081055.0,304805.0,phillip.allen@enron.com,47.0
BADUM JAMES P,,,178980.0,182466,257817.0,,,,,257817,...,,,,,False,,,,,
BANNANTINE JAMES M,477.0,566.0,,916197,4046157.0,,1757552.0,465.0,-560222.0,5243487,...,,29.0,864523.0,0.0,False,,-5104.0,,james.bannantine@enron.com,39.0
BAXTER JOHN C,267102.0,,1295738.0,5634343,6680544.0,1200000.0,3942714.0,,,10623258,...,,,2660303.0,,False,,-1386055.0,1586055.0,,
BAY FRANKLIN R,239671.0,,260455.0,827696,,400000.0,145796.0,,-82782.0,63014,...,,,69.0,,False,,-201641.0,,frank.bay@enron.com,


In [4]:
print ("There are total {} people in the dataset.".format(enron_data.shape[0]))
print("Out of which {} are POI and {} are Non-POI.".format(enron_data.poi.value_counts()[True],
                                                          enron_data.poi.value_counts()[False]))
print("Total number of email plus financial features are {}.".format(enron_data.columns.shape[0]-1))
print("Label is 'poi' column.")

There are total 146 people in the dataset.
Out of which 18 are POI and 128 are Non-POI.
Total number of email plus financial features are 20.
Label is 'poi' column.


The Enron dataset is messy and has many missing values (NaN). Almost all of the features have missing values, and some features have more than 50% of their values missing, as the frequency of NaN (the freq column) in the table below shows.
I have converted NaN to 0 to make all the values numeric, so that the machine learning algorithms can be trained later.
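
The same idea can be sketched without pandas. The block below is only an illustration on made-up records shaped like `data_dict` (missing values stored as the string 'NaN'); the helper names `missing_counts` and `nan_to_zero` are hypothetical and not part of the project code.

```python
# Toy records in the same shape as data_dict: missing values are the string 'NaN'.
# The names and numbers below are made up for illustration.
toy_data = {
    "PERSON A": {"salary": 200000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 150000, "bonus": 50000, "poi": False},
}

def missing_counts(records):
    """Count how many records have 'NaN' for each feature."""
    counts = {}
    for features in records.values():
        for name, value in features.items():
            counts[name] = counts.get(name, 0) + (value == "NaN")
    return counts

def nan_to_zero(records):
    """Return a copy of the records with every 'NaN' replaced by 0."""
    return {
        person: {k: (0 if v == "NaN" else v) for k, v in features.items()}
        for person, features in records.items()
    }

print(missing_counts(toy_data))
cleaned = nan_to_zero(toy_data)
print(cleaned["PERSON B"])
```

Counting before replacing matters: once NaN becomes 0, a missing salary can no longer be told apart from a genuine zero.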

In [5]:
enron_data.describe().transpose()

Unnamed: 0,count,unique,top,freq
salary,146,95,,51
to_messages,146,87,,60
deferral_payments,146,40,,107
total_payments,146,126,,21
exercised_stock_options,146,102,,44
bonus,146,42,,64
restricted_stock,146,98,,36
shared_receipt_with_poi,146,84,,60
restricted_stock_deferred,146,19,,128
total_stock_value,146,125,,20


In [6]:
enron_data.replace(to_replace='NaN', value=0, inplace=True)
enron_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
salary,146.0,365811.4,2203575.0,0.0,0.0,210596.0,270850.5,26704229.0
to_messages,146.0,1221.589,2226.771,0.0,0.0,289.0,1585.75,15149.0
deferral_payments,146.0,438796.5,2741325.0,-102500.0,0.0,0.0,9684.5,32083396.0
total_payments,146.0,4350622.0,26934480.0,0.0,93944.75,941359.5,1968286.75,309886585.0
exercised_stock_options,146.0,4182736.0,26070400.0,0.0,0.0,608293.5,1714220.75,311764000.0
bonus,146.0,1333474.0,8094029.0,0.0,0.0,300000.0,800000.0,97343619.0
restricted_stock,146.0,1749257.0,10899950.0,-2604490.0,8115.0,360528.0,814528.0,130322299.0
shared_receipt_with_poi,146.0,692.9863,1072.969,0.0,0.0,102.5,893.5,5521.0
restricted_stock_deferred,146.0,20516.37,1439661.0,-7576788.0,0.0,0.0,0.0,15456290.0
total_stock_value,146.0,5846018.0,36246810.0,-44093.0,228869.5,965955.0,2319991.25,434509511.0


### Task 2: Remove outliers
Visualization is one of the most powerful tools for finding outliers. Plotting salary against bonus reveals one outlier immediately: 'TOTAL'. The spreadsheet summed all the data points into this row, so it needs to be taken out. On closer examination, I found one more entry that is not the name of a real person, 'THE TRAVEL AGENCY IN THE PARK', and dropped it from the dataset as well. Entries for which all features are 'NaN' are also dropped from the dataset.
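
In this notebook the 'TOTAL' row was spotted visually. As a complementary check, and only as a sketch on made-up salaries, a spreadsheet total can also be flagged numerically because its value equals the sum of every other row. The helper `find_aggregate_rows` is hypothetical.

```python
# Made-up salaries; 'TOTAL' is the spreadsheet's column sum, not a person.
salaries = {"PERSON A": 200000, "PERSON B": 300000, "PERSON C": 250000, "TOTAL": 750000}

def find_aggregate_rows(values):
    """Return keys whose value equals the sum of all other values."""
    grand_total = sum(values.values())
    return [name for name, v in values.items() if v != 0 and v == grand_total - v]

print(find_aggregate_rows(salaries))
```

A check like this would not catch 'THE TRAVEL AGENCY IN THE PARK', which still needs a manual look at the index names.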

In [7]:
#Importing plotly
from plotly import tools
from plotly import plotly
from plotly import graph_objs

#Setting plotly API credentials
tools.set_credentials_file(username='jubinsoni', api_key='yKCkLUthlyqn7oXWf4U2')

#Making scatterplot before the 'TOTAL' outlier removal
with_outlier_total = graph_objs.Scatter(x = enron_data['salary'],
                           y = enron_data['bonus'],
                           text = enron_data.index,
                           mode = 'markers')

#Removing the outlier
enron_data.drop(labels=['TOTAL'], inplace=True)

#Making scatterplot after the 'TOTAL' outlier removal
without_outlier_total = graph_objs.Scatter(x = enron_data['salary'],
                                    y = enron_data['bonus'],
                                    text = enron_data.index,
                                    mode = 'markers')


#Layout the plots together side by side
fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Before outlier TOTAL removal', 'After outlier TOTAL removal'))
fig.append_trace(with_outlier_total, 1, 1)
fig.append_trace(without_outlier_total, 1, 2)
fig['layout']['xaxis1'].update(title='salary')
fig['layout']['xaxis2'].update(title='salary')
fig['layout']['yaxis1'].update(title='bonus')
fig['layout']['yaxis2'].update(title='bonus')
plotly.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [8]:
#Making scatterplot before the 'TRAVEL AGENCY' outlier removal
with_outlier_travel = graph_objs.Scatter(x = enron_data['salary'],
                                  y = enron_data['bonus'],
                                  text = enron_data.index,
                                  mode = 'markers')
#drop
enron_data.drop(labels=['THE TRAVEL AGENCY IN THE PARK'], axis=0, inplace=True)

#Making scatterplot after the 'TRAVEL AGENCY' outlier removal
without_outlier_travel = graph_objs.Scatter(x = enron_data['salary'],
                                  y = enron_data['bonus'],
                                  text = enron_data.index,
                                  mode = 'markers')


#Layout the plots together side by side
fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Before outlier TRAVEL AGENCY removal', 'After outlier TRAVEL AGENCY removal'))
fig.append_trace(with_outlier_travel, 1, 1)
fig.append_trace(without_outlier_travel, 1, 2)
fig['layout']['xaxis1'].update(title='salary')
fig['layout']['xaxis2'].update(title='salary')
fig['layout']['yaxis1'].update(title='bonus')
fig['layout']['yaxis2'].update(title='bonus')
plotly.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



#### Splitting the dataset into training and test
A classic mistake is to evaluate the performance of an algorithm on the same dataset it was trained on. This makes the algorithm appear to perform better than it actually does, and leaves us with no idea how it performs on unseen data.

It is essential practice in data mining to keep a subset of the data as holdout (test) data. We train our model on the training data and examine its generalization performance on the test data: we hide the labels of the target variable in the test data, let the model predict them, and then compare the predicted values with the hidden true values. We can also use a more sophisticated holdout procedure called cross-validation. In the later sections of this report, I have used a variation of cross-validation called StratifiedShuffleSplit, which creates randomly chosen training and test sets multiple times and averages the results over all tests.
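
To make the "classic mistake" concrete, here is a small sketch on made-up points using a hand-written 1-nearest-neighbor rule (not the scikit-learn classifiers used below). Scored on its own training data, each point is its own nearest neighbor, so the training accuracy is perfect regardless of how noisy the labels are.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Predict the label of the single closest training point (1-NN)."""
    distances = [sum((a - b) ** 2 for a, b in zip(p, query)) for p in train_points]
    return train_labels[distances.index(min(distances))]

def accuracy(true_labels, predicted):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

# Made-up two-feature points; the last training label is deliberately noisy.
train_x = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1), (2.9, 3.3), (1.1, 1.2)]
train_y = [0, 0, 1, 1, 1]
test_x = [(1.0, 1.1), (3.1, 3.0), (1.15, 1.15)]
test_y = [0, 1, 0]

train_acc = accuracy(train_y, [nearest_neighbor_predict(train_x, train_y, p) for p in train_x])
test_acc = accuracy(test_y, [nearest_neighbor_predict(train_x, train_y, p) for p in test_x])
print(train_acc, test_acc)
```

The gap between the two numbers is exactly what a holdout set is meant to reveal.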

In [9]:
### Store to my_dataset for easy export below.
my_dataset = enron_data.to_dict('index')

#Initial list of features

data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

#Splitting dataset into training and test
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)

#### Implementing a few classifiers before Feature Selection and Feature Engineering

In [10]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb = nb.fit(features_train, labels_train)
nb_labels_predicted = nb.predict(features_test)
nb_accuracy = accuracy_score(labels_test, nb_labels_predicted)
print("NaiveBayes accuracy score: {}.".format(nb_accuracy))

from sklearn.svm import SVC
svm = SVC()
svm = svm.fit(features_train, labels_train)
svm_labels_predicted = svm.predict(features_test)
svm_accuracy = accuracy_score(labels_test, svm_labels_predicted)
print("SVM accuracy score: {}.".format(svm_accuracy))

from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier()
adaboost = adaboost.fit(features_train, labels_train)
adaboost_labels_predicted = adaboost.predict(features_test)
adaboost_accuracy = accuracy_score(labels_test, adaboost_labels_predicted)
print("AdaBoost accuracy score: {}.".format(adaboost_accuracy))

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn = knn.fit(features_train, labels_train)
knn_labels_predicted = knn.predict(features_test)
knn_accuracy = accuracy_score(labels_test, knn_labels_predicted)
print("KNearestNeighbors accuracy score: {}.".format(knn_accuracy))

NaiveBayes accuracy score: 0.883720930233.
SVM accuracy score: 0.883720930233.
AdaBoost accuracy score: 0.790697674419.
KNearestNeighbors accuracy score: 0.883720930233.


### Task 3: Create new feature(s)
#### Feature Engineering
In Task 1, I selected some features of interest. In Task 2, I removed the outliers and implemented a few classifiers based on these initially selected features. In this task I first focus on Feature Engineering and then on Feature Selection.
Feature engineering involves using human intuition to hypothesize which feature might contain a pattern that can be exploited using machine learning, coding up the new feature, visualizing it, and repeating the process. Our hypothesis here is: "POIs send email to each other at a higher rate than to Non-POIs". I coded up two new features: the fraction of emails this person receives from POIs (fraction_from_poi) and the fraction of emails this person sends to POIs (fraction_to_poi).

In [11]:
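#Creating new feature(s): POI message counts as a share of this person's 'to_messages'
enron_data['fraction_from_poi'] = enron_data['from_poi_to_this_person'].divide(enron_data['to_messages'], fill_value=0)
#Note: sent mail is also divided by 'to_messages' here; dividing by 'from_messages' would give the share of sent mail
enron_data['fraction_to_poi'] = enron_data['from_this_person_to_poi'].divide(enron_data['to_messages'], fill_value=0)

#Replacing NaN (produced by 0/0 divisions) in new features with 0
enron_data['fraction_from_poi'] = enron_data['fraction_from_poi'].fillna(0)
enron_data['fraction_to_poi'] = enron_data['fraction_to_poi'].fillna(0)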
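
The ratio itself can be sketched as a small helper that handles missing or zero totals explicitly. This is an illustration only: `fraction` is a hypothetical name, and the denominator is passed in so the same helper would also cover a variant that divides sent mail by `from_messages`.

```python
def fraction(poi_messages, all_messages):
    """Share of messages that involve a POI; 0.0 when data is missing or the total is 0."""
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return poi_messages / float(all_messages)

# Counts for ALLEN PHILLIP K, taken from the enron_data.head() output above.
print(fraction(47, 2902))      # from_poi_to_this_person / to_messages
print(fraction(65, 2902))      # from_this_person_to_poi / to_messages, as in the cell above
print(fraction("NaN", "NaN"))  # missing email data falls back to 0.0
```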

In [12]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.layouts import row
from bokeh.models import ColumnDataSource

output_notebook()

colormap = {False: 'blue', True: 'red'}
colors = [colormap[x] for x in enron_data['poi']]

labelmap = {False: 'Non-POI', True: 'POI'}
labels = [labelmap[x] for x in enron_data['poi']]

source = ColumnDataSource(dict(x1 = enron_data['from_poi_to_this_person'],
                                y1 = enron_data['from_this_person_to_poi'],
                                x2 = enron_data['fraction_from_poi'],
                                y2 = enron_data['fraction_to_poi'],
                                color = colors,
                                label = labels))

#Before feature engineering
s1 = figure(plot_width=450, plot_height=400)
s1.xaxis.axis_label = "no. of emails from POI to this person"
s1.yaxis.axis_label = "no. of emails from this person to POI"
s1.circle('x1', 'y1', size=10, alpha=0.5, color='color', legend='label', source=source)

#After feature engineering
s2 = figure(plot_width=450, plot_height=400)
s2.xaxis.axis_label = 'fraction of emails this person gets from POI'
s2.yaxis.axis_label = 'fraction of emails this person sends to POI'
s2.circle('x2', 'y2', size=10, alpha=0.5, color='color', legend='label', source=source)

show(row(s1, s2))

#### Feature Selection
The goal is to have the minimum number of features that can capture the trends and patterns in the data, so I have not used features that do not give any information. A machine learning algorithm is only going to be as good as the features we put into it. It is critical that the methodology used for feature selection is scientific and exhaustive, without room for intuition.

First I manually removed features that have more than 50% NaN values, then I performed SelectKBest on the remaining features.

In [13]:
### Store to my_dataset for easy export below.
my_dataset = enron_data.to_dict('index')

from sklearn.feature_selection import SelectKBest, f_classif

features_list = ['poi', 'salary', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'other',
 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages',
 'from_this_person_to_poi', 'long_term_incentive', 'from_poi_to_this_person']

data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

#Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(features, labels)

#Get the ANOVA F-score for each feature (selector.pvalues_ holds the corresponding p-values)
scores = selector.scores_

#Bokeh Barplots
from bokeh.charts import Bar, show

data = {'scores': scores, 'features': features_list[1:]}

bar = Bar(data, label='features', values='scores', title='Select K Best',
         legend = None, plot_width=850, plot_height=450)

show(bar)

In [14]:
enron_data.shape

(144, 23)

Univariate feature selection works by selecting the best features based on univariate statistical tests. I used it as a preprocessing step to feed the best features into all the classifiers.

SelectKBest removes all but the k highest-scoring features. I chose k = 5 because there are too many features, like loan_advances, other, restricted_stock, shared_receipt_with_poi, from_this_person_to_poi, from_poi_to_this_person and from_messages; adding too many features will not add much value to our estimator, while a few good features will. So I finally selected the 5 best features ('bonus', 'exercised_stock_options', 'salary', 'total_stock_value', 'total_payments') and the two engineered features ('fraction_from_poi', 'fraction_to_poi') as the final features for classification. Among the 5 best features from SelectKBest, I chose 'total_payments' over 'long_term_incentive' because the long_term_incentive feature has 79 NULL/0 values out of a total of 144, which does not look right, while the total_payments feature has 121 complete values.
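
The scoring behind `f_classif` is a one-way ANOVA F statistic computed per feature. The sketch below re-implements that statistic by hand on made-up data, purely to show what "highest scoring" means; `anova_f_score` and `select_k_best` are hypothetical helpers, not scikit-learn functions.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one feature across the label groups."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the group means around the overall mean
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the values around their own group mean
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))

def select_k_best(features, labels, k):
    """Return the names of the k features with the highest F score."""
    scores = {name: anova_f_score(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up data: 'bonus' separates the two classes, 'noise' does not.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "bonus": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5],
    "noise": [3.0, 5.0, 4.0, 4.0, 3.0, 5.0],
}
print(select_k_best(features, labels, k=1))
```

A feature scores high when the class means are far apart relative to the spread inside each class, which is why each feature is judged on its own and interactions between features are not considered.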

### Task 4: Try a variety of classifiers

In [15]:
### Extract features and labels from dataset for local testing
features_list = ['poi', 'bonus', 'exercised_stock_options', 'salary', 'total_stock_value',
                 'total_payments', 'fraction_from_poi', 'fraction_to_poi']

data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

#Separating training and test dataset
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)
from sklearn.metrics import accuracy_score, classification_report

#### Naive Bayes

In [16]:
from sklearn.naive_bayes import GaussianNB
#I have specified prior probabilities of 0.3 (Non-POI) and 0.7 (POI) because we have an imbalanced distribution of POI/Non-POI values.
nb = GaussianNB(priors=[0.3, 0.7])
nb = nb.fit(features_train, labels_train)
nb_labels_predicted = nb.predict(features_test)

nb_accuracy = accuracy_score(labels_test, nb_labels_predicted)
nb_classification_report = classification_report(labels_test, nb_labels_predicted)

print("NaiveBayes accuracy score: {}.".format(nb_accuracy))
print("NaiveBayes classification report:\n{}.".format(nb_classification_report))

NaiveBayes accuracy score: 0.883720930233.
NaiveBayes classification report:
             precision    recall  f1-score   support

        0.0       0.97      0.89      0.93        38
        1.0       0.50      0.80      0.62         5

avg / total       0.92      0.88      0.89        43
.


#### SVM

In [17]:
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=0.1, degree=3)
svm = svm.fit(features_train, labels_train)
svm_labels_predicted = svm.predict(features_test)

svm_accuracy = accuracy_score(labels_test, svm_labels_predicted)
svm_classification_report = classification_report(labels_test, svm_labels_predicted)

print("SVM accuracy score: {}.".format(svm_accuracy))
print("SVM classification report:\n{}.".format(svm_classification_report))

SVM accuracy score: 0.883720930233.
SVM classification report:
             precision    recall  f1-score   support

        0.0       0.88      1.00      0.94        38
        1.0       0.00      0.00      0.00         5

avg / total       0.78      0.88      0.83        43
.



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



#### AdaBoost

In [18]:
from sklearn.ensemble import AdaBoostClassifier
adab = AdaBoostClassifier(learning_rate=0.9, algorithm='SAMME.R', n_estimators=100, random_state=42)
adab = adab.fit(features_train, labels_train)
adab_labels_predicted = adab.predict(features_test)

adab_accuracy = accuracy_score(labels_test, adab_labels_predicted)
adab_classification_report = classification_report(labels_test, adab_labels_predicted)

print("AdaBoost accuracy score: {}.".format(adab_accuracy))
print("AdaBoost classification report:\n{}.".format(adab_classification_report))

AdaBoost accuracy score: 0.860465116279.
AdaBoost classification report:
             precision    recall  f1-score   support

        0.0       0.90      0.95      0.92        38
        1.0       0.33      0.20      0.25         5

avg / total       0.83      0.86      0.84        43
.


#### K Nearest Neighbors

In [19]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn = knn.fit(features_train, labels_train)
knn_labels_predicted = knn.predict(features_test)

knn_accuracy = accuracy_score(labels_test, knn_labels_predicted)
knn_classification_report = classification_report(labels_test, knn_labels_predicted)

print("KNearestNeighbors accuracy score: {}.".format(knn_accuracy))
print("KNearestNeighbors classification report:\n{}.".format(knn_classification_report))

KNearestNeighbors accuracy score: 0.906976744186.
KNearestNeighbors classification report:
             precision    recall  f1-score   support

        0.0       0.90      1.00      0.95        38
        1.0       1.00      0.20      0.33         5

avg / total       0.92      0.91      0.88        43
.


In [20]:
pd.options.display.max_colwidth = 0
data = {'Before Feature Engineering and Selection': {'NaiveBayes Classifier Accuracy': 0.8837,
                                                     'SVM Classifier Accuracy': 0.8837,
                                                     'AdaBoost Classifier Accuracy': 0.7906,
                                                     'KNearestNeighbors Classifier Accuracy': 0.8837},
        'After Feature Engineering and Selection': {'NaiveBayes Classifier Accuracy': 0.8837,
                                                   'SVM Classifier Accuracy': 0.8837,
                                                   'AdaBoost Classifier Accuracy': 0.8605,
                                                   'KNearestNeighbors Classifier Accuracy': 0.9069}}

accuracy_comparison = pd.DataFrame(data, columns=['Before Feature Engineering and Selection', 'After Feature Engineering and Selection'])

accuracy_comparison

Unnamed: 0,Before Feature Engineering and Selection,After Feature Engineering and Selection
AdaBoost Classifier Accuracy,0.7906,0.8605
KNearestNeighbors Classifier Accuracy,0.8837,0.9069
NaiveBayes Classifier Accuracy,0.8837,0.8837
SVM Classifier Accuracy,0.8837,0.8837


### Task 5: Pick and Tune your classifier

#### Picking an algorithm
Different algorithms were tried using the 5 best features ('bonus', 'exercised_stock_options', 'salary', 'total_stock_value', 'total_payments') and the engineered features ('fraction_to_poi', 'fraction_from_poi').

Below is a comparison of the algorithms implemented using the above features:

In [21]:
pd.options.display.max_colwidth = 0

data = {'Algorithms':['GaussianNaiveBayes Classifier',
                      'SupportVectorMachines Classifier',
                      'AdaBoost Classifier',
                      'KNearestNeighbors Classifier'],
       'Parameters': ["priors=[0.3, 0.7]",
                     "kernel='rbf', C=0.1, degree=3",
                      "learning_rate=0.9, n_estimators=100, random_state=42",
                      "n_neighbors=3"
                     ],
       'Accuracy': [0.8837, 0.8837, 0.8605, 0.9069],
       'Precision': [0.92, 0.78, 0.83, 0.89],
        'Recall': [0.88, 0.88, 0.86, 0.891],
        'F1':[0.89, 0.83, 0.84, 0.90]
       }

algorithms = pd.DataFrame(data, columns=['Algorithms', 'Parameters', 'Accuracy', 'Precision', 'Recall', 'F1'])
algorithms

Unnamed: 0,Algorithms,Parameters,Accuracy,Precision,Recall,F1
0,GaussianNaiveBayes Classifier,"priors=[0.3, 0.7]",0.8837,0.92,0.88,0.89
1,SupportVectorMachines Classifier,"kernel='rbf', C=0.1, degree=3",0.8837,0.78,0.88,0.83
2,AdaBoost Classifier,"learning_rate=0.9, n_estimators=100, random_state=42",0.8605,0.83,0.86,0.84
3,KNearestNeighbors Classifier,n_neighbors=3,0.9069,0.89,0.891,0.9


Since there is little to tune in the GaussianNaiveBayes classifier, it is unlikely to improve further, so I did not choose it. For SVM, I had already chosen the 'rbf' kernel with C=0.1 (degree=3 only affects the 'poly' kernel and is ignored by 'rbf'); choosing a different kernel or C value did not make much difference, so I did not choose it either. For AdaBoost, I also tried the 'SAMME' algorithm and could not get an accuracy higher than 0.90. I finally selected the KNearestNeighbors classifier because it works well with numerical features and has several parameters I can tune to improve performance.

#### Hyperparameter optimization / Tuning
Parameter tuning means selecting a good, robust set of parameters for an algorithm to optimize its performance. Default parameters are not tailored to a particular dataset and may result in poor performance. Scikit-learn provides two methods for hyperparameter optimization: GridSearchCV and RandomizedSearchCV.

I used GridSearchCV to do an exhaustive search over different parameters and find the best combination.
1. I used 'f1' as the scoring parameter to guide the search, so that both False Positives and False Negatives are penalized. Accuracy is not a good metric to evaluate here, as we can see above that many classifiers have almost the same accuracy.
2. In the 'cv' parameter, I passed a cross-validation object (StratifiedShuffleSplit) so that the search results are validated on splits that preserve the class balance of my dataset.
3. For my final KNearestNeighbors classifier I tuned the parameters as shown below:

In [22]:
#Specifying parameters of the algorithm
clf_params = {'n_neighbors': [3, 5, 7, 9, 12],
             'weights': ['uniform', 'distance'],
             'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto'],
             'p': [1, 2]}

#Specify algorithm
knn = KNeighborsClassifier()

#GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
cv = StratifiedShuffleSplit(n_splits= 100, random_state= 42)
clf = GridSearchCV(knn, param_grid = clf_params, cv = cv, scoring = 'f1')
clf.fit(features, labels)

#pick a winner
best_clf = clf.best_estimator_
print(best_clf)


F-score is ill-defined and being set to 0.0 due to no predicted samples.



KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=1,
           weights='uniform')
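
To get a feel for how much work this search does, the grid can be enumerated by hand. This sketch only counts candidates; it does not train anything, and it assumes the same `clf_params` grid and `n_splits=100` as the cell above.

```python
from itertools import product

# Same grid as clf_params in the cell above.
clf_params = {
    "n_neighbors": [3, 5, 7, 9, 12],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute", "auto"],
    "p": [1, 2],
}

names = list(clf_params)
combinations = [dict(zip(names, values)) for values in product(*clf_params.values())]

n_splits = 100  # from StratifiedShuffleSplit(n_splits=100)
print(len(combinations))             # candidate parameter settings
print(len(combinations) * n_splits)  # cross-validation fits during the search
print(combinations[0])
```

Every candidate is fitted once per split, which is why an exhaustive grid grows expensive quickly as parameters are added.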


After running GridSearchCV on the KNearestNeighbors classifier with 100 re-shuffling and splitting iterations of StratifiedShuffleSplit, I used the selected parameters to check whether they affect performance.

Analyzing the parameters of the classifier selected by GridSearchCV:
- n_neighbors = 3 is the number of neighbors considered by the KNN algorithm. This is the value I used initially; the scikit-learn default is 5.
- algorithm = 'ball_tree' uses a BallTree, which is designed for fast generalized N-point problems. The default is 'auto'.
- leaf_size = 30 is passed to the BallTree and affects the speed of tree construction and queries. This is the default value.
- metric = 'minkowski' is the distance metric, which is the default; its behavior depends on p.
- p = 1 is the power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance; the default p = 2 gives euclidean_distance.
- All other values are defaults.
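
The role of `p` and `n_neighbors` can be illustrated with a brute-force sketch on made-up points. It is not the scikit-learn implementation: the `ball_tree` setting only speeds up the neighbor lookup, while the hypothetical `knn_predict` below simply sorts all distances.

```python
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_points, train_labels, query, n_neighbors=3, p=1):
    """Uniform-weight majority vote among the n_neighbors closest points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: minkowski(train_points[i], query, p))
    votes = Counter(train_labels[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]

print(minkowski((0, 0), (3, 4), p=1))  # 7.0
print(minkowski((0, 0), (3, 4), p=2))  # 5.0

# Made-up points: two clusters with labels 0 and 1.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (4, 4)))
```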

In [23]:
#Final features list (the same features used in Task 4)
features_list = ['poi', 'bonus', 'exercised_stock_options', 'salary', 'total_stock_value',
                 'total_payments', 'fraction_from_poi', 'fraction_to_poi']

#Converting the features into vectors
data = featureFormat(my_dataset, features_list, sort_keys=True)

#Splitting the features and labels
labels, features = targetFeatureSplit(data)

#Creating separate training and test sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)

#Initializing the KNN Classifier on the tuned parameters
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=1,
           weights='uniform')

#Training the classifier
knn = knn.fit(features_train, labels_train)

#Predicting the labels
knn_labels_predicted = knn.predict(features_test)

#Calculating the accuracy, precision, recall and f1 scores
knn_accuracy = accuracy_score(labels_test, knn_labels_predicted)
knn_classification_report = classification_report(labels_test, knn_labels_predicted)

print("After Tuning:")
print("K Nearest Neighbors accuracy score: {}.".format(knn_accuracy))
print("K Nearest Neighbors classification report:\n{}.".format(knn_classification_report))

After Tuning:
K Nearest Neighbors accuracy score: 0.906976744186.
K Nearest Neighbors classification report:
             precision    recall  f1-score   support

        0.0       0.90      1.00      0.95        38
        1.0       1.00      0.20      0.33         5

avg / total       0.92      0.91      0.88        43
.


### Validate and Evaluate

Validation is the strategy for evaluating the performance of the model on unseen data. I used a variation of cross-validation called StratifiedShuffleSplit, which creates randomly chosen training and test sets multiple times and averages the results over all tests. The data is first shuffled and then split into a pair of training and test sets. Stratification ensures that the training and test splits have a class distribution (POI : Non-POI) that represents the overall data, which suits our case because of the class imbalance (18 POI vs 128 Non-POI).
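
The idea of a stratified shuffle split can be sketched in plain Python. This is a simplified illustration, not scikit-learn's algorithm: the hypothetical `stratified_shuffle_split` below shuffles each class separately and rounds the per-class test size, and the label list is made up to roughly match the class balance of this dataset.

```python
import random

def stratified_shuffle_split(labels, n_splits, test_size, seed):
    """Yield (train_idx, test_idx) pairs that keep each class's share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = int(round(len(shuffled) * test_size))
            test_idx.extend(shuffled[:n_test])
            train_idx.extend(shuffled[n_test:])
        yield sorted(train_idx), sorted(test_idx)

# Made-up labels: 18 positives and 126 negatives.
labels = [1] * 18 + [0] * 126
for train_idx, test_idx in stratified_shuffle_split(labels, n_splits=3, test_size=0.1, seed=42):
    n_poi_test = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_poi_test)
```

Every test set contains the same number of positives, so no split ends up without a POI to evaluate on, which a plain random split of such a small dataset cannot guarantee.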

Along with Accuracy I used Precision, Recall and F1-score for model evaluation. The Accuracy of my KNearestNeighbors classifier after tuning was 90.69%. For imbalanced classes like ours, Precision and Recall are common measures.
- Precision is the number of True Positives divided by the sum of True Positives and False Positives, that is, the share of people flagged as POI who really are POI. Low precision indicates a large number of False Positives.
- Recall is the number of True Positives divided by the sum of True Positives and False Negatives, that is, the share of actual POIs in the test cases that are flagged. Low recall indicates many False Negatives. The support-weighted average Precision of my KNearestNeighbors classifier is 92% and Recall is 91%; for the POI class alone, Precision is 100% and Recall is 20%.
- F1-score conveys a balance between precision and recall; it is their harmonic mean. My weighted-average F1-score is 88%.
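
These definitions can be checked against the tuned KNN report above. The sketch below assumes the counts implied by that report (43 test rows, 5 of them POI, 1 POI found, no Non-POI flagged) and recomputes both the POI row and the 'avg / total' row; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts implied by the tuned KNN report above.
poi = precision_recall_f1(tp=1, fp=0, fn=4)        # POI as the positive class
non_poi = precision_recall_f1(tp=38, fp=4, fn=0)   # Non-POI as the positive class
support = {"poi": 5, "non_poi": 38}

# The 'avg / total' row weights each class by its support.
total = sum(support.values())
weighted = [
    (p * support["poi"] + n * support["non_poi"]) / total
    for p, n in zip(poi, non_poi)
]
print([round(x, 2) for x in poi])       # [1.0, 0.2, 0.33]
print([round(x, 2) for x in weighted])  # [0.92, 0.91, 0.88]
```

The weighted averages are dominated by the 38 Non-POI rows, so they stay high even though only one of the five POIs in the test set is found.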

In [24]:
pd.options.display.max_colwidth = 0

knn_data = {'KNearestNeighbors Classifier':['Before Hyperparameter Optimization',
                                        'After Hyperparameter Optimization'],
       'Parameters': ["algorithm='ball_tree', n_neighbors=5, p=2",
                     "algorithm='ball_tree', n_neighbors=3, p=1"],
       'Accuracy': [0.9069, 0.9069],
       'Precision': [0.89, 0.92],
        'Recall': [0.891, 0.91],
        'F1':[0.90, 0.88]
       }

knn_comparison = pd.DataFrame(knn_data, columns=['KNearestNeighbors Classifier', 'Parameters', 'Accuracy', 'Precision', 'Recall', 'F1'])
knn_comparison

Unnamed: 0,KNearestNeighbors Classifier,Parameters,Accuracy,Precision,Recall,F1
0,Before Hyperparameter Optimization,"algorithm='ball_tree', n_neighbors=5, p=2",0.9069,0.89,0.891,0.9
1,After Hyperparameter Optimization,"algorithm='ball_tree', n_neighbors=3, p=1",0.9069,0.92,0.91,0.88


The accuracy of our KNN classifier has not changed, but the weighted Precision rose from 0.89 to 0.92 and the weighted Recall from 0.891 to 0.91 (roughly 2 to 3 percentage points), while F1 moved from 0.90 to 0.88. On this split, GridSearchCV therefore gave a modest improvement in Precision and Recall for our classifier.

### Task 6: Dump your classifier, dataset, and features_list so anyone can check your results

#### Algorithm performance

In [25]:
### You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=1,
           weights='uniform')

dump_classifier_and_data(clf=knn, dataset=my_dataset, feature_list=features_list)

#### Testing on tester.py

In [26]:
from tester import test_classifier
test_classifier(clf=knn, dataset=my_dataset, feature_list=features_list)

KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=1,
           weights='uniform')
	Accuracy: 0.88227	Precision: 0.60227	Recall: 0.34450	F1: 0.43830	F2: 0.37675
	Total predictions: 15000	True positives:  689	False positives:  455	False negatives: 1311	True negatives: 12545
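
The figures reported by tester.py follow directly from the four counts it prints. As a sketch, they can be recomputed by hand; `f_beta` is a hypothetical helper using the standard F-beta formula, where F2 weights recall more heavily than precision.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Counts printed by tester.py above.
tp, fp, fn, tn = 689, 455, 1311, 12545

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(accuracy, 5), round(precision, 5), round(recall, 5))
print(round(f_beta(precision, recall, 1), 5), round(f_beta(precision, recall, 2), 5))
```

Here Precision and Recall refer to the POI class only, pooled over all of tester.py's splits, which is why they are much lower than the support-weighted averages from the single train/test split.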



### References

1. <a href = 'https://www.udacity.com/course/intro-to-machine-learning--ud120'> Udacity Machine Learning </a>
2. <a href = 'https://en.wikipedia.org/wiki/Enron'> Enron Wiki </a>
3. <a href = 'http://scikit-learn.org/stable'> Scikit-Learn </a>
4. <a href = 'http://bokeh.pydata.org/en/latest/docs/user_guide/annotations.html#legends'> Bokeh JS </a>

#### Thank you