# Machine Learning to Identify Fraud in Enron Emails

## Jean Phillips
## 8/8/2021

## Introduction / Project Overview

Enron was a major energy corporation in the early 2000s, but it went bankrupt amid widespread financial fraud. Following the investigation into Enron, thousands of emails and other otherwise confidential records were made public. That information forms the foundation of this dataset, which includes financial information and emails for many Enron employees.

The provided dataset is in the form of a dictionary, where each key-value pair in the dictionary corresponds to one person. The dictionary key is the person's name, and the value is another dictionary, which contains the names of all the features and their values for that person. The features in the data fall into three major types, namely financial features, email features and POI labels.
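
As a minimal sketch of that structure, the snippet below builds a tiny dictionary in the same shape and reads values from it. The names and numbers are invented for illustration and are not taken from the real dataset; missing values are written as the string 'NaN', as in the project data.

```python
# Hypothetical two-person sample in the same shape as the project data:
# {person_name: {feature_name: value}}. All names and values are invented.
sample_dict = {
    "DOE JANE": {"salary": 250000, "bonus": "NaN", "poi": False},
    "ROE RICHARD": {"salary": "NaN", "bonus": 500000, "poi": True},
}

# Look up one feature for one person.
print(sample_dict["DOE JANE"]["salary"])

# Collect the names of everyone labelled as a person of interest.
poi_names = [name for name, feats in sample_dict.items() if feats["poi"]]
print(poi_names)
```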

The goal of this project is to use machine learning to create an algorithm to identify potential persons of interest (POI) in the dataset who may have contributed to the fraud at Enron. This algorithm should have a precision and recall of at least 0.3 per the project rubric specifications.

Machine learning is well suited to this type of task because it can detect patterns in a dataset like this one far more efficiently than a human could.

## Data Exploration

The dataset for this project is provided to students as "final_project_dataset.pkl." Printing some basic information about the dataset shows the following: 
- There are 146 rows in the data
- There are 21 features in the data, ranging from financial information such as salary and bonus, to personal information such as email address
- Of the 146 rows in the data, 18 represent persons of interest and 128 represent non-persons of interest
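
The class imbalance matters when reading the metrics later in this report. As a quick worked example using only the counts above, a rule that labels everyone as a non-POI would already be correct for 128 of 146 rows, which is one reason precision and recall are more informative than accuracy here.

```python
# Counts taken from the bullet list above (before any cleaning).
total_rows = 146
poi_rows = 18
non_poi_rows = total_rows - poi_rows

# Share of rows that are persons of interest.
poi_share = poi_rows / total_rows

# Accuracy of a trivial "always predict non-POI" rule.
baseline_accuracy = non_poi_rows / total_rows

print("POI share: %.3f" % poi_share)
print("Always-non-POI accuracy: %.3f" % baseline_accuracy)
```

Such a rule would have a recall of zero, since it never flags a POI.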

There are three different groupings of data in the dataset, namely payment, stock and email information. These groupings can be broken down as follows:
- Payment: salary, bonus, long_term_incentive, deferred_income, deferral_payments, loan_advances, other, expenses, director_fees, and total_payments
- Stock: exercised_stock_options, restricted_stock, restricted_stock_deferred, total_stock_value
- Email information: from_poi_to_this_person, shared_receipt_with_poi, from_this_person_to_poi, to_messages, and from_messages

In [43]:
import pickle
from pprint import pprint as pp
# load pickle file to Python dict to show length and features of dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
print("Number of rows:", len(data_dict))
features = []
poi_count = 0
for k in data_dict.keys():
    for d in data_dict[k]:
        if d not in features:
            features.append(d)
    if data_dict[k]['poi']:
        poi_count += 1
print("Data features:")
pp(features)
print("Number of features:", len(features))
print("Number of POI: ", poi_count)
print("Non-PoIs: ", len(data_dict) - poi_count)

Number of rows: 146
Data features:
['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'loan_advances',
 'bonus',
 'email_address',
 'restricted_stock_deferred',
 'deferred_income',
 'total_stock_value',
 'expenses',
 'from_poi_to_this_person',
 'exercised_stock_options',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'long_term_incentive',
 'shared_receipt_with_poi',
 'restricted_stock',
 'director_fees']
Number of features: 21
Number of POI:  18
Non-PoIs:  128


I realized very quickly that there would not be much information to glean from the "email_address" column, as it was typically just the individual's name reformatted as an email address. For that reason, this column was removed from the data.

Following that cleaning, I decided to investigate missing values in the dataset, which are denoted by the string "NaN" (not a number).

In [44]:
#remove email addresses
for k in data_dict.keys():
  if 'email_address' in data_dict[k].keys():
    data_dict[k].pop('email_address', 0)

print("\nCurrent features in the dataset:", len(features)-1)
missing_values = {'salary' : 0,
                  'to_messages' : 0,
                  'deferral_payments' : 0,
                  'total_payments' : 0,
                  'loan_advances' : 0,
                  'bonus' : 0,
                  'restricted_stock_deferred' : 0,
                  'deferred_income' : 0,
                  'total_stock_value' : 0,
                  'expenses' : 0,
                  'from_poi_to_this_person' : 0,
                  'exercised_stock_options' : 0,
                  'from_messages' : 0,
                  'other' : 0,
                  'from_this_person_to_poi' : 0,
                  'poi' : 0,
                  'long_term_incentive' : 0,
                  'shared_receipt_with_poi' : 0,
                  'restricted_stock' : 0,
                  'director_fees' : 0}
for k in data_dict.keys():
    for d in data_dict[k]:
        if data_dict[k][d] == 'NaN':
            missing_values[d] += 1
print("\n'NaN' (missing) values in the dataset, by feature:")
pp(missing_values)


Current features in the dataset: 20

'NaN' (missing) values in the dataset, by feature:
{'bonus': 64,
 'deferral_payments': 107,
 'deferred_income': 97,
 'director_fees': 129,
 'exercised_stock_options': 44,
 'expenses': 51,
 'from_messages': 60,
 'from_poi_to_this_person': 60,
 'from_this_person_to_poi': 60,
 'loan_advances': 142,
 'long_term_incentive': 80,
 'other': 53,
 'poi': 0,
 'restricted_stock': 36,
 'restricted_stock_deferred': 128,
 'salary': 51,
 'shared_receipt_with_poi': 60,
 'to_messages': 60,
 'total_payments': 21,
 'total_stock_value': 20}


The only feature with no missing values is "poi", which makes sense because this label was created and added to the dataset by Udacity to classify the data.

The number of missing values varies widely from feature to feature. It is interesting to note that the email-related columns (from_messages, from_poi_to_this_person, from_this_person_to_poi, shared_receipt_with_poi, to_messages) all share the same count of 60 missing values. Looking at the dataset, it becomes apparent that a row has either all of its email-associated data or none of it, so this pattern makes sense.
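
One way to confirm that all-or-nothing pattern is to count, per row, how many of the five email features are missing. The helper below is a sketch: `email_missing_profile` is a name introduced here, and the two sample rows are invented so the snippet runs on its own. Against the project data it would be called with `data_dict`.

```python
from collections import Counter

EMAIL_FEATURES = ["from_messages", "from_poi_to_this_person",
                  "from_this_person_to_poi", "shared_receipt_with_poi",
                  "to_messages"]

def email_missing_profile(data):
    """Count rows by how many email features are 'NaN' (0 to 5)."""
    profile = Counter()
    for feats in data.values():
        n_missing = sum(1 for f in EMAIL_FEATURES
                        if feats.get(f, "NaN") == "NaN")
        profile[n_missing] += 1
    return profile

# Invented sample: one row with all email data, one row with none.
sample = {
    "A": {f: 10 for f in EMAIL_FEATURES},
    "B": {f: "NaN" for f in EMAIL_FEATURES},
}
profile = email_missing_profile(sample)
print(dict(profile))
```

If the pattern holds, only the keys 0 and 5 appear in the result.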

I decided to take a look at rows that were missing more than 15 of the 20 remaining features, that is, rows with fewer than 5 non-missing values. These rows are not likely to offer much predictive value given how much data they are missing.

In [45]:
# counting missing ('NaN') values per row; with 20 features, more than
# 15 missing values means fewer than 5 nonmissing values
empty_features = {}
for k in data_dict.keys():
  empty_features[k] = sum(1 for d in data_dict[k] if data_dict[k][d] == 'NaN')
count = 0
print("Records with less than 5 nonmissing values:")
for k in sorted(empty_features):
  if empty_features[k] > 15:
    print(k)
    for d in data_dict[k]:
      if data_dict[k][d] != 'NaN':
        print("  %s : %s" % (d, data_dict[k][d]))
    count += 1
print("Number of records with less than 5 nonmissing values:")
print(" ",count)

Records with less than 5 nonmissing values:
CHRISTODOULOU DIOMEDES
  total_stock_value : 6077885
  exercised_stock_options : 5127155
  poi : False
  restricted_stock : 950730
CLINE KENNETH W
  restricted_stock_deferred : -472568
  total_stock_value : 189518
  poi : False
  restricted_stock : 662086
GILLIS JOHN
  total_stock_value : 85641
  exercised_stock_options : 9803
  poi : False
  restricted_stock : 75838
GRAMM WENDY L
  total_payments : 119292
  poi : False
  director_fees : 119292
LOCKHART EUGENE E
  poi : False
SAVAGE FRANK
  total_payments : 3750
  deferred_income : -121284
  poi : False
  director_fees : 125034
SCRIMSHAW MATTHEW
  total_stock_value : 759557
  exercised_stock_options : 759557
  poi : False
THE TRAVEL AGENCY IN THE PARK
  total_payments : 362096
  other : 362096
  poi : False
WAKEHAM JOHN
  total_payments : 213071
  expenses : 103773
  poi : False
  director_fees : 109298
WHALEY DAVID A
  total_stock_value : 98718
  exercised_stock_options : 98718
  poi : False

What is striking about the above results is that the row for "LOCKHART EUGENE E" is missing values for everything except "poi", so it makes sense to remove this row. Additionally, "THE TRAVEL AGENCY IN THE PARK" appears to be a business rather than a person, so it will be excluded as well. The other rows will be kept even though they may represent outliers, as they may still hold valuable predictive information for the algorithm we will build.

## Data Cleaning

As discussed in the previous section, there are issues with the data. The rows for "LOCKHART EUGENE E" as well as for "THE TRAVEL AGENCY IN THE PARK" will be removed, as "LOCKHART EUGENE E" has no data and "THE TRAVEL AGENCY IN THE PARK" represents a non-human entity. Another row that sticks out is the "TOTAL" row. This is an aggregate row comprised of financial data. There is little predictive value here and it is a major outlier, so it will also be removed. 

In [46]:
print("'TOTAL':")
pp(data_dict['TOTAL'])

'TOTAL':
{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}


In [47]:
#removing eugene lockhart row
data_dict.pop('LOCKHART EUGENE E', 0)
# remove the  aggregate row TOTAL
data_dict.pop('TOTAL', 0)
# remove the non-person row THE TRAVEL AGENCY IN THE PARK
data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0)
print("Rows remaining in the dataset:",len(data_dict))

Rows remaining in the dataset: 143


Another way to investigate the dataset and assess it for issues is to look at the financial data. One check is to add together the various financial entries and compare that information to total_payments. It would make sense for the aggregated total to equal total_payments and would bear further investigating if it did not.

In [48]:
print("Problems with 'total_payments' :", end='')
payment_financial_features = ['salary',
                              'bonus',
                              'long_term_incentive',
                              'expenses',
                              'director_fees',
                              'other',
                              'loan_advances',
                              'deferred_income',
                              'deferral_payments']
problem_entries = {}
# Iterate over each row, check sum of financial features against total_payments,add rows with issues to problem_entries
for k in data_dict.keys():
  total_payments_check = 0
  for d in data_dict[k]:
    if d in payment_financial_features and data_dict[k][d] != 'NaN':
      total_payments_check += data_dict[k][d]
  if data_dict[k]['total_payments'] != 'NaN' and \
                        total_payments_check != data_dict[k]['total_payments']:
    problem_entries[k] = data_dict[k]
if len(problem_entries):
  print("found.")
  print("Records with problems related to 'total_payments' found: ")
  pp(problem_entries)
else:
  print("None.")

Problems with 'total_payments' :found.
Records with problems related to 'total_payments' found: 
{'BELFER ROBERT': {'bonus': 'NaN',
                   'deferral_payments': -102500,
                   'deferred_income': 'NaN',
                   'director_fees': 3285,
                   'exercised_stock_options': 3285,
                   'expenses': 'NaN',
                   'from_messages': 'NaN',
                   'from_poi_to_this_person': 'NaN',
                   'from_this_person_to_poi': 'NaN',
                   'loan_advances': 'NaN',
                   'long_term_incentive': 'NaN',
                   'other': 'NaN',
                   'poi': False,
                   'restricted_stock': 'NaN',
                   'restricted_stock_deferred': 44093,
                   'salary': 'NaN',
                   'shared_receipt_with_poi': 'NaN',
                   'to_messages': 'NaN',
                   'total_payments': 102500,
                   'total_stock_value': -44093},
 'BHATNA

The values for total_payments for individuals Belfer and Bhatnagar do not appear to be correct. In comparing the values seen here with the information provided in the enron61702insiderpay.pdf document, it appears that these values have been shifted, Belfer to the right and Bhatnagar to the left. This was corrected by recreating entries for these two individuals with the correct data as found within the enron61702insiderpay.pdf document.

In [49]:
# records for Belfer corrected based on data found in the PDF file
belfer_corrected = {'bonus': 'NaN',
                    'deferral_payments': 0,                  
                    'deferred_income': -102500,              
                    'director_fees': 102500,                 
                    'exercised_stock_options': 0,            
                    'expenses': 3285,                        
                    'from_messages': 'NaN',
                    'from_poi_to_this_person': 'NaN',
                    'from_this_person_to_poi': 'NaN',
                    'loan_advances': 'NaN',
                    'long_term_incentive': 'NaN',
                    'other': 'NaN',
                    'poi': False,
                    'restricted_stock': 44093,                
                    'restricted_stock_deferred': -44093,      
                    'salary': 'NaN',
                    'shared_receipt_with_poi': 'NaN',
                    'to_messages': 'NaN',
                    'total_payments': 3285,                   
                    'total_stock_value': 0}                   

# similar correction for Bhatnagar from data found in pdf file
bhatnagar_corrected = {'bonus': 'NaN',
                       'deferral_payments': 'NaN',
                       'deferred_income': 'NaN',
                       'director_fees': 0,                    
                       'exercised_stock_options': 15456290,   
                       'expenses': 137864,                    
                       'from_messages': 29,
                       'from_poi_to_this_person': 0,
                       'from_this_person_to_poi': 1,
                       'loan_advances': 'NaN',
                       'long_term_incentive': 'NaN',
                       'other': 0,                            
                       'poi': False,
                       'restricted_stock': 2604490,           
                       'restricted_stock_deferred': -2604490, 
                       'salary': 'NaN',
                       'shared_receipt_with_poi': 463,
                       'to_messages': 523,
                       'total_payments': 137864,              
                       'total_stock_value': 15456290}         

data_dict['BELFER ROBERT'] = belfer_corrected
data_dict['BHATNAGAR SANJAY'] = bhatnagar_corrected

With those entries corrected, we can re-run the total_payments check to confirm that no further issues remain; the output below shows that none do.

No further cleaning or removal of outliers was undertaken. 

In [50]:
# Repeating check to verify changes
print("Second check for problems with 'total_payments' :", end='')
problem_entries = {}
for k in data_dict.keys():
  total_payments_check = 0
  for d in data_dict[k]:
    if d in payment_financial_features and data_dict[k][d] != 'NaN':
      total_payments_check += data_dict[k][d]
  if data_dict[k]['total_payments'] != 'NaN' and \
    total_payments_check != data_dict[k]['total_payments']:
    problem_entries[k] = data_dict[k]

if len(problem_entries):
  print("found.")
  print("Records with problems related to 'total_payments' found:")
  pp(problem_entries)
else:
  print("None.")

Second check for problems with 'total_payments' :None.
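
The same comparison could in principle be applied to the stock features. The sketch below generalizes the check into a function. `find_total_mismatches` is a hypothetical helper, and it assumes that total_stock_value is meant to equal the sum of exercised_stock_options, restricted_stock, and restricted_stock_deferred; that holds for the two corrected records above but has not been verified here for every row. The sample reuses the corrected Belfer stock values plus one invented, deliberately inconsistent record.

```python
def find_total_mismatches(data, component_keys, total_key):
    """Return names whose non-missing components do not sum to the total."""
    mismatches = []
    for name, feats in data.items():
        total = feats.get(total_key, "NaN")
        if total == "NaN":
            continue  # nothing to compare against
        component_sum = sum(feats[k] for k in component_keys
                            if feats.get(k, "NaN") != "NaN")
        if component_sum != total:
            mismatches.append(name)
    return mismatches

stock_features = ["exercised_stock_options", "restricted_stock",
                  "restricted_stock_deferred"]

sample = {
    # Stock values from the corrected Belfer record above.
    "BELFER ROBERT": {"exercised_stock_options": 0,
                      "restricted_stock": 44093,
                      "restricted_stock_deferred": -44093,
                      "total_stock_value": 0},
    # Invented record whose total is deliberately wrong.
    "EXAMPLE PERSON": {"exercised_stock_options": 100,
                       "restricted_stock": "NaN",
                       "restricted_stock_deferred": "NaN",
                       "total_stock_value": 250},
}
print(find_total_mismatches(sample, stock_features, "total_stock_value"))
```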


## Feature Creation

Having investigated the quantity of NaN values in the dataset and having seen the wide variability of these values across rows, it seemed prudent to focus on the email data since that was most consistent across the dataset. For that data, rows are either completely NaN or completely full of integer data. For that reason, ratios based on the email data were chosen for new feature creation:
- ratio of emails sent to PoIs versus emails sent in general:
    - to_poi_from_messages_ratio = from_this_person_to_poi / from_messages
- ratio of emails received from PoIs versus emails received in general:
    - from_poi_to_messages_ratio = from_poi_to_this_person / to_messages
- ratio of emails having shared receipt with PoI versus emails received in general:
    - shared_receipt_to_messages_ratio = shared_receipt_with_poi / to_messages
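
Each ratio follows the same rule: divide when both counts are present, otherwise keep the 'NaN' marker. A compact sketch of that rule is shown here. `safe_ratio` is a hypothetical helper name, and the guard against a zero denominator is an added assumption that the notebook's own code does not include.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 'NaN' if the ratio is undefined."""
    if numerator == "NaN" or denominator == "NaN":
        return "NaN"
    if denominator == 0:
        return "NaN"  # assumption: treat a zero denominator as missing
    return numerator / denominator

# Counts for HANNON KEVIN P as printed in this report:
# 21 of his 32 sent emails went to POIs.
print(safe_ratio(21, 32))

# A missing count propagates as 'NaN'.
print(safe_ratio("NaN", 32))
```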

In [59]:
#code to create new features
for k in data_dict.keys():
  # flags: True when the corresponding value is present (not 'NaN')
  from_messages  = data_dict[k]['from_messages'] != 'NaN'
  to_messages    = data_dict[k]['to_messages'] != 'NaN'
  to_poi         = data_dict[k]['from_this_person_to_poi'] != 'NaN'
  from_poi       = data_dict[k]['from_poi_to_this_person'] != 'NaN'
  shared_receipt = data_dict[k]['shared_receipt_with_poi'] != 'NaN'

  # ratio of emails sent to PoIs to emails sent in general:
  # to_poi_from_messages_ratio = from_this_person_to_poi / from_messages
  if to_poi and from_messages:
    data_dict[k]['to_poi_from_messages_ratio'] = \
       data_dict[k]['from_this_person_to_poi'] / data_dict[k]['from_messages']
  else:
    data_dict[k]['to_poi_from_messages_ratio'] = 'NaN'

  # ratio of emails received from PoIs to emails received in general:
  # from_poi_to_messages_ratio = from_poi_to_this_person / to_messages
  if from_poi and to_messages:
    data_dict[k]['from_poi_to_messages_ratio'] = \
          data_dict[k]['from_poi_to_this_person'] / data_dict[k]['to_messages']
  else:
    data_dict[k]['from_poi_to_messages_ratio'] = 'NaN'
  
  # ratio of emails having shared receipt with PoIs to emails received in general:
  # shared_receipt_to_messages_ratio = shared_receipt_with_poi / to_messages
  if shared_receipt and to_messages:
    data_dict[k]['shared_receipt_to_messages_ratio'] = \
       data_dict[k]['shared_receipt_with_poi'] / data_dict[k]['to_messages']
  else:
    data_dict[k]['shared_receipt_to_messages_ratio'] = 'NaN'


In [60]:
for k in data_dict.keys():
  print(k)
  print(" to", data_dict[k]['to_messages'])
  print(" from", data_dict[k]['from_messages'])
  print(" to_poi", data_dict[k]['from_this_person_to_poi'])
  print(" to poi/from",data_dict[k]['to_poi_from_messages_ratio'])
  print(" from_poi", data_dict[k]['from_poi_to_this_person'])
  print(" from poi/to",data_dict[k]['from_poi_to_messages_ratio'])
  print(" shared",data_dict[k]['shared_receipt_with_poi'])
  print(" shared/to ",data_dict[k]['shared_receipt_to_messages_ratio'])

METTS MARK
 to 807
 from 29
 to_poi 1
 to poi/from 0.034482758620689655
 from_poi 38
 from poi/to 0.04708798017348203
 shared 702
 shared/to  0.8698884758364313
BAXTER JOHN C
 to NaN
 from NaN
 to_poi NaN
 to poi/from NaN
 from_poi NaN
 from poi/to NaN
 shared NaN
 shared/to  NaN
ELLIOTT STEVEN
 to NaN
 from NaN
 to_poi NaN
 to poi/from NaN
 from_poi NaN
 from poi/to NaN
 shared NaN
 shared/to  NaN
CORDES WILLIAM R
 to 764
 from 12
 to_poi 0
 to poi/from 0.0
 from_poi 10
 from poi/to 0.013089005235602094
 shared 58
 shared/to  0.07591623036649214
HANNON KEVIN P
 to 1045
 from 32
 to_poi 21
 to poi/from 0.65625
 from_poi 32
 from poi/to 0.03062200956937799
 shared 1035
 shared/to  0.9904306220095693
MORDAUNT KRISTINA M
 to NaN
 from NaN
 to_poi NaN
 to poi/from NaN
 from_poi NaN
 from poi/to NaN
 shared NaN
 shared/to  NaN
MEYER ROCKFORD G
 to 232
 from 28
 to_poi 0
 to poi/from 0.0
 from_poi 0
 from poi/to 0.0
 shared 22
 shared/to  0.09482758620689655
MCMAHON JEFFREY
 to 2355
 from 48

Looking through the values for these new ratios shown above, they do generally make sense, falling between 0 and 1. With the addition of these new ratio features, the number of features in our dataset is now 23.

In [61]:
features_list = ['poi',
                 'salary',
                 'bonus',
                 'long_term_incentive',
                 'expenses',
                 'director_fees',
                 'other',
                 'loan_advances',
                 'deferred_income',
                 'deferral_payments',
                 'total_payments',
                 'restricted_stock_deferred',
                 'exercised_stock_options',
                 'restricted_stock',
                 'total_stock_value',
                 'from_messages',
                 'to_messages',
                 'from_poi_to_this_person',
                 'from_this_person_to_poi',
                 'shared_receipt_with_poi',
                 'from_poi_to_messages_ratio',
                 'to_poi_from_messages_ratio',
                 'shared_receipt_to_messages_ratio']

print(len(features_list))


23


## Initial Algorithm Development 

Initially, I decided to evaluate the differences between the DecisionTreeClassifier, KNeighborsClassifier (K Nearest Neighbors), and GaussianNB (Gaussian Naive Bayes) algorithms. In reviewing tester.py, I noted that its test_classifier() function applies stratified shuffle-split cross-validation (StratifiedShuffleSplit), fitting and predicting on each split in turn. It made the most sense to follow that function's approach so that my algorithm of choice would be judged by the same grading metrics: accuracy, precision, recall, F1, and F2, each calculated from counts of true and false positives and negatives summed across 1000 training-testing splits of the dataset. In effect, this builds a single confusion matrix pooled over all splits.

In [62]:
from feature_format import featureFormat, targetFeatureSplit

In [63]:
from sklearn.neighbors         import KNeighborsClassifier
from sklearn.tree              import DecisionTreeClassifier
from sklearn.naive_bayes       import GaussianNB
from sklearn.ensemble          import AdaBoostClassifier
from sklearn.model_selection   import StratifiedShuffleSplit

# Function definition for classifier testing, validation, evaluation
def classifier_test(clf, dataset, feature_list, folds = 1000):
  data = featureFormat(dataset, feature_list, sort_keys = True)
  labels, features = targetFeatureSplit(data)
  cv = StratifiedShuffleSplit(n_splits=folds, random_state = 42)
  true_neg  = 0
  false_neg = 0
  true_pos  = 0
  false_pos = 0
  for train_idx, test_idx in cv.split(features, labels):
    features_train = []
    labels_train   = []
    features_test  = []
    labels_test    = []
    for ii in train_idx:
      features_train.append(features[ii])
      labels_train.append(labels[ii])
    for jj in test_idx:
      features_test.append(features[jj])
      labels_test.append(labels[jj])

    # fit the classifier using training set, and test on test set
    clf.fit(features_train, labels_train)
    predictions = clf.predict(features_test)
    for prediction, truth in zip(predictions, labels_test):
      if prediction == 0 and truth == 0:
        true_neg += 1
      elif prediction == 0 and truth == 1:
        false_neg += 1
      elif prediction == 1 and truth == 0:
        false_pos += 1
      elif prediction == 1 and truth == 1:
        true_pos += 1
      else:
        print("Warning: Found a predicted label not == 0 or 1.")
        print("All predictions should take value 0 or 1.")
        print("Evaluating performance for processed predictions:")
        break
  try:
    total_pred = true_neg + false_neg + false_pos + true_pos
    accuracy = 1.0 * (true_pos + true_neg) / total_pred
    precision = 1.0 * true_pos / (true_pos + false_pos)
    recall = 1.0 * true_pos / (true_pos + false_neg)
    f1 = 2.0 * true_pos / (2 * true_pos + false_pos + false_neg)
    f2 = (1 + 2.0 * 2.0) * precision * recall / (4 * precision + recall)
    print(clf)
    print("  Predictions: %d" % total_pred)
    print("  Accuracy: %.5f\n  Precision: %.5f  Recall: %.5f" % \
          (accuracy, precision, recall))
    print("  F1: %.5f  F2: %.5f" % (f1, f2), "\n")
  except ZeroDivisionError:
    print("Performance calculations failed.")
    print("Precision or recall may be undefined (no true positives).")

# Iteration over a list of classifiers
classifiers = [KNeighborsClassifier(),
               DecisionTreeClassifier(),
               GaussianNB()]

print("Trying several classifiers with default settings for comparison...\n")
for classifier in classifiers:
  classifier_test(classifier, data_dict, features_list)

Trying several classifiers with default settings for comparison...

KNeighborsClassifier()
  Predictions: 15000
  Accuracy: 0.87513
  Precision: 0.61609  Recall: 0.16850
  F1: 0.26463  F2: 0.19715 

DecisionTreeClassifier()
  Predictions: 15000
  Accuracy: 0.80320
  Precision: 0.26224  Recall: 0.26250
  F1: 0.26237  F2: 0.26245 

GaussianNB()
  Predictions: 15000
  Accuracy: 0.76353
  Precision: 0.24564  Recall: 0.37350
  F1: 0.29637  F2: 0.33828 



With the default settings, the above results were produced. KNeighborsClassifier ranked highest in accuracy at 0.87513 and precision at 0.61609, but lowest in recall at 0.16850. DecisionTreeClassifier performed slightly better than GaussianNB in accuracy (0.80320) and precision (0.26224), but worse in recall (0.26250). GaussianNB had the best recall at 0.37350, but the lowest accuracy (0.76353) and precision (0.24564). Because the DecisionTreeClassifier is not seeded, its figures vary slightly from run to run.
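
The F1 and F2 scores in the output follow from precision (P) and recall (R) through the general F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). As a worked check, the sketch below recomputes both scores from the rounded GaussianNB precision and recall printed above; `f_beta` is a helper name introduced for this example.

```python
def f_beta(precision, recall, beta):
    """General F-beta score; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rounded GaussianNB values from the output above.
precision, recall = 0.24564, 0.37350

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print("F1: %.5f  F2: %.5f" % (f1, f2))
```

The results agree with the reported F1 of 0.29637 and F2 of 0.33828 to within rounding.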

With those results in mind, I decided to pursue more extensive testing and parameter tuning for the DecisionTreeClassifier and KNeighborsClassifier. Despite its relatively high recall score, GaussianNB was discarded fairly early in the process because it offers few parameters to tune. KNeighborsClassifier was discarded later, after additional tuning of feature selection and parameter settings could not produce sufficient recall.

## Investigating Feature Importance

After testing a few different classifiers, I was interested in taking a deeper look at feature importance. Several different methods were used to examine feature importance before settling on 'mutual information' values derived from sklearn.feature_selection.mutual_info_classif(). I found that using this produced meaningful increases in algorithm performance. 

In [64]:
# for feature and label extraction
features_list = ['poi',
                 'salary',
                 'bonus',
                 'long_term_incentive',
                 'expenses',
                 'director_fees',
                 'other',
                 'loan_advances',
                 'deferred_income',
                 'deferral_payments',
                 'total_payments',
                 'restricted_stock_deferred',
                 'exercised_stock_options',
                 'restricted_stock',
                 'total_stock_value',
                 'from_messages',
                 'to_messages',
                 'from_poi_to_this_person',
                 'from_this_person_to_poi',
                 'shared_receipt_with_poi',
                 'from_poi_to_messages_ratio',
                 'to_poi_from_messages_ratio',
                 'shared_receipt_to_messages_ratio']

# Extracting features and labels from dataset for local testing
data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

from sklearn.feature_selection import mutual_info_classif
print("\nFeature importance by mutual_info_classif:")
print(" ('mutual info' with regard to 'poi' target)")
# sorting feature names by magnitude of mutual information with 'poi'
mutual_info = sorted(zip(list(mutual_info_classif(features, labels)),
                         features_list[1:]), reverse = True)
for i in range(len(mutual_info)):
  print(" ", i+1, "- '%s'" % mutual_info[i][1],
        "\n        %.5f"   % mutual_info[i][0])


Feature importance by mutual_info_classif:
 ('mutual info' with regard to 'poi' target)
  1 - 'bonus' 
        0.08911
  2 - 'expenses' 
        0.08035
  3 - 'other' 
        0.06814
  4 - 'to_poi_from_messages_ratio' 
        0.05753
  5 - 'shared_receipt_with_poi' 
        0.05629
  6 - 'restricted_stock' 
        0.03974
  7 - 'total_stock_value' 
        0.03778
  8 - 'from_poi_to_messages_ratio' 
        0.03150
  9 - 'exercised_stock_options' 
        0.02915
  10 - 'shared_receipt_to_messages_ratio' 
        0.02590
  11 - 'salary' 
        0.02473
  12 - 'from_this_person_to_poi' 
        0.02328
  13 - 'loan_advances' 
        0.01632
  14 - 'total_payments' 
        0.01453
  15 - 'to_messages' 
        0.00780
  16 - 'deferred_income' 
        0.00739
  17 - 'restricted_stock_deferred' 
        0.00510
  18 - 'from_messages' 
        0.00106
  19 - 'long_term_incentive' 
        0.00064
  20 - 'from_poi_to_this_person' 
        0.00000
  21 - 'director_fees' 
        0.000

The top five features by mutual_info_classif score were selected as the features for the final algorithm. The scores are estimates that vary slightly between runs, so the exact order can change. In the run shown above, those features are, in order of importance:
- bonus
- expenses
- other
- to_poi_from_messages_ratio
- shared_receipt_with_poi
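
Selecting the top five amounts to sorting the (score, name) pairs and keeping the first k names. The sketch below does that with a subset of the scores printed above; since those scores come from a single run, the result illustrates that run only.

```python
# A subset of the (score, feature) pairs from the output above.
scores = [(0.03974, "restricted_stock"),
          (0.08911, "bonus"),
          (0.05629, "shared_receipt_with_poi"),
          (0.08035, "expenses"),
          (0.06814, "other"),
          (0.05753, "to_poi_from_messages_ratio"),
          (0.03778, "total_stock_value")]

k = 5
# Sort by score, highest first, and keep the k best feature names.
top_k = [name for score, name in sorted(scores, reverse=True)[:k]]

# 'poi' goes first, matching features_list, where the label is the first entry.
selected_features = ["poi"] + top_k
print(selected_features)
```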

I found it interesting to note that one of the newly created features, to_poi_from_messages_ratio, made it into the final feature selection.

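It is also important to note that no feature scaling was performed here. Scaling is not required for the DecisionTreeClassifier, because a decision tree splits on one feature at a time using thresholds, so rescaling a feature does not change the splits it can find.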

## Algorithm Tuning Discussion

Tuning the parameters of a classification algorithm is the iterative process of changing the classifier's settings with the intent to increase performance. That improvement can take the form of shorter runtime or higher evaluation metrics such as accuracy, precision, recall, and composites like F1 and F2. Poor parameter tuning can result in needlessly long runtimes or poorly optimized results, which is why parameter tuning matters.

Algorithm tuning is generally an iterative process, and that was the process I followed here. I first sought to maximize precision and recall by using GridSearchCV with SelectKBest and DecisionTreeClassifier, optimizing parameters for maximum F1.

After several attempts, I found the best performance by using mutual information as the feature selection metric along with information gain as the splitting criterion. To limit runtime, I kept those two choices fixed and then optimized 'k' in SelectKBest and 'min_samples_split' in DecisionTreeClassifier.
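
To get a feel for the cost of this search, the sketch below enumerates the parameter grid used in this report's grid search with `itertools.product`. With 7 values for each of the two parameters and 10-fold cross-validation, the search fits 490 pipelines, plus one final refit on the full data when GridSearchCV's default `refit=True` is left in place.

```python
from itertools import product

# Same candidate values as the grid searched in this report.
grid_params = {"skb__k": (3, 4, 5, 6, 7, 8, 9),
               "clf__min_samples_split": (3, 4, 5, 6, 7, 8, 9)}
cv_folds = 10

# Every combination of parameter values is one candidate pipeline.
names = sorted(grid_params)
candidates = [dict(zip(names, values))
              for values in product(*(grid_params[n] for n in names))]

n_fits = len(candidates) * cv_folds
print("Candidates:", len(candidates))
print("Cross-validation fits:", n_fits)
print("First candidate:", candidates[0])
```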

In [65]:
from sklearn.feature_selection import mutual_info_classif, SelectKBest

In [66]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline        import Pipeline

# Using mutual information as feature selection metric
selector = SelectKBest(mutual_info_classif)
# Using information (entropy) gain as splitting criterion
classifier = DecisionTreeClassifier(criterion = 'entropy')

tune_pipe = Pipeline(steps=[('skb', selector),
                            ('clf', classifier)])

# Optimizing number of features and minimum number of samples for splitting
grid_params = {'skb__k' : (3, 4, 5, 6, 7, 8, 9),
                'clf__min_samples_split' : (3, 4, 5, 6, 7, 8, 9)}

print("Trying GridSearchCV with")
pp(tune_pipe)
print("over parameters:")
pp(grid_params)

# Optimizing for maximized F1 in order to maximize precision and recall
grid = GridSearchCV(tune_pipe, grid_params, scoring = 'f1', cv = 10,
                    n_jobs = -1)
grid.fit(features, labels)

print("\nResulting 'best' parameters for maximizing 'f1':")
pp(grid.best_params_)

# sorting features by paired mutual information scores
grid_ftrs = sorted(zip(list(grid.best_estimator_.named_steps['skb'].scores_),
                             features_list[1:]), reverse = True)
# creating list to pass to k-fold testing method
best_features = ['poi']
print("\nFeatures used:")
for i in range(grid.best_params_['skb__k']):
  best_features.append(grid_ftrs[i][1])
  print(" ", i+1, "- '%s'" % grid_ftrs[i][1],
        "\n        %.5f"   % grid_ftrs[i][0])
print('')

# Testing tuned parameters with 1000-split cross validation
classifier_test(grid.best_estimator_.named_steps['clf'],data_dict,
                best_features)

Trying GridSearchCV with
Pipeline(steps=[('skb',
                 SelectKBest(score_func=<function mutual_info_classif at 0x0000017BC8386CA0>)),
                ('clf', DecisionTreeClassifier(criterion='entropy'))])
over parameters:
{'clf__min_samples_split': (3, 4, 5, 6, 7, 8, 9),
 'skb__k': (3, 4, 5, 6, 7, 8, 9)}

Resulting 'best' parameters for maximizing 'f1':
{'clf__min_samples_split': 4, 'skb__k': 5}

Features used:
  1 - 'expenses' 
        0.08268
  2 - 'bonus' 
        0.07398
  3 - 'other' 
        0.06814
  4 - 'to_poi_from_messages_ratio' 
        0.05794
  5 - 'shared_receipt_with_poi' 
        0.04604

DecisionTreeClassifier(criterion='entropy', min_samples_split=4)
  Predictions: 13000
  Accuracy: 0.84262
  Precision: 0.48793  Recall: 0.46500
  F1: 0.47619  F2: 0.46941 



As shown in the above code, the classifier with the best parameters found by GridSearchCV and the top-scoring features were passed to the previously defined 1000-split cross validation testing function in order to observe performance results.

Multiple executions of this code resulted in slightly different values for 'k' and 'min_samples_split', and in slightly different selected features. This is likely not a cause for concern, given the inherent variation in mutual information scores and the variation possible in the DecisionTreeClassifier due to its node-splitting method.
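The run-to-run variation described above can be reduced by fixing the random seeds. The following is a minimal sketch, not the code used in this project: it assumes synthetic stand-in data from `make_classification`, an arbitrary seed value, and a hypothetical helper named `build_pipeline`. Both `mutual_info_classif` and `DecisionTreeClassifier` accept a `random_state` argument.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # arbitrary illustrative value (assumption)

# Synthetic stand-in data, NOT the Enron dataset
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88, 0.12], random_state=SEED)


def build_pipeline(seed):
    """Hypothetical helper: same pipeline shape as above, with fixed seeds."""
    # partial() pins the random_state that mutual_info_classif uses internally
    scorer = partial(mutual_info_classif, random_state=seed)
    return Pipeline(steps=[
        ('skb', SelectKBest(scorer, k=5)),
        ('clf', DecisionTreeClassifier(criterion='entropy',
                                       min_samples_split=5,
                                       random_state=seed)),
    ])


first = build_pipeline(SEED).fit(X, y)
second = build_pipeline(SEED).fit(X, y)

# Boolean masks of the features kept by SelectKBest in each fit
first_mask = first.named_steps['skb'].get_support()
second_mask = second.named_steps['skb'].get_support()

same_features = bool((first_mask == second_mask).all())
same_predictions = bool((first.predict(X) == second.predict(X)).all())
print("Same features selected:", same_features)
print("Same predictions:", same_predictions)
```

Fixing the seeds makes a single run repeatable, but it does not show that the chosen parameters are stable, so comparing results across several seeds is still worthwhile.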

Across runs, performance was most often maximized with 'min_samples_split' set to 5 and 'k' set to 5, using the features listed above. Given that information, I elected to manually test those values with the testing function.

In [67]:
# features (apart from 'poi') that were most frequently top-ranked in mutual information
manual_features = ['poi',
                   'expenses',
                   'bonus',
                   'other',
                   'to_poi_from_messages_ratio',
                   'shared_receipt_with_poi']

# parameter settings most frequently resulting in highest precision and recall
clf = DecisionTreeClassifier(criterion = 'entropy',
                             min_samples_split = 5)

print("Trying DecisionTreeClassifier with parameter settings and feature")
print("  selection based on 'best' of varying results from optimization...")
print("  (features *reliably* top-ranked by 'mutual information' with 'poi')")
print("Features used:")
pp(manual_features[1:])
classifier_test(clf, data_dict, manual_features)

# Dumping classifier, dataset, features list, and running tester.py for final test
import tester
print("Testing final classifier via tester.py...")
tester.dump_classifier_and_data(clf, data_dict, manual_features)
tester.main()

Trying DecisionTreeClassifier with parameter settings and feature
  selection based on 'best' of varying results from optimization...
  (features *reliably* top-ranked by 'mutual information' with 'poi')
Features used:
['expenses',
 'bonus',
 'other',
 'to_poi_from_messages_ratio',
 'shared_receipt_with_poi']
DecisionTreeClassifier(criterion='entropy', min_samples_split=5)
  Predictions: 13000
  Accuracy: 0.84369
  Precision: 0.49153  Recall: 0.46400
  F1: 0.47737  F2: 0.46926 

Testing final classifier via tester.py...
DecisionTreeClassifier(criterion='entropy', min_samples_split=5)
	Accuracy: 0.84515	Precision: 0.49651	Recall: 0.46200	F1: 0.47863	F2: 0.46851
	Total predictions: 13000	True positives:  924	False positives:  937	False negatives: 1076	True negatives: 10063



## Conclusion

As shown in the above code samples, the algorithm I ultimately chose was a DecisionTreeClassifier. This algorithm was set to use entropy (as information gain) for the splitting criterion as well as a minimum of 5 samples for splitting of internal nodes. Based on investigation of feature importance via mutual information, the algorithm was applied to a subset of features, namely 'expenses', 'bonus', 'other', 'to_poi_from_messages_ratio', and 'shared_receipt_with_poi'.

In machine learning, validation is a method of testing a model’s performance by splitting the data into training and testing data. The model is trained with the training dataset and tested with the testing dataset, thereby maintaining independence between the two. The performance of the algorithm can therefore be validated when the model is applied to the testing dataset. Without proper validation, overfitting may go unnoticed: accuracy and other metrics may appear higher than they would be on any independent data, because the classifier simply recognizes feature values it already saw during training.
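The gap between training and testing performance can be illustrated with a small sketch. It assumes synthetic stand-in data from `make_classification` rather than the Enron dataset, and an unrestricted decision tree chosen only to make memorization easy to see. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Enron dataset
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88, 0.12], random_state=0)

# Hold out 30% of the records; stratify to keep the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# An unrestricted tree can keep splitting until it memorizes the training data
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data seen in training
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

print("Training accuracy: %.3f" % train_acc)
print("Testing accuracy:  %.3f" % test_acc)
```

Only the score on the held-out data is a meaningful estimate of how the classifier would perform on records it has not seen.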

This project employed cross validation in a manner consistent with that applied by tester.py. The dataset was split repeatedly by StratifiedShuffleSplit, with 1000 as the default value for its n_splits parameter; this produces randomized, stratified train/test splits rather than strict k-fold partitions. Subsequently, performance metrics were calculated from the totals of all predictions, true positives, false positives, true negatives, and false negatives from the classifier across those splits.
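The validation scheme described above can be sketched as follows. This is an approximation under stated assumptions, not the code in tester.py: the data is synthetic, `pooled_counts` is a hypothetical helper, the test size of 10% is an assumed value, and only 100 splits are used to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Enron dataset
X, y = make_classification(n_samples=146, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.88, 0.12],
                           random_state=0)


def pooled_counts(clf, X, y, n_splits=100, test_size=0.1, seed=42):
    """Hypothetical helper: pool confusion counts over all splits."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                      random_state=seed)
    counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for train_idx, test_idx in splitter.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        truth = y[test_idx]
        counts['tp'] += int(np.sum((pred == 1) & (truth == 1)))
        counts['fp'] += int(np.sum((pred == 1) & (truth == 0)))
        counts['fn'] += int(np.sum((pred == 0) & (truth == 1)))
        counts['tn'] += int(np.sum((pred == 0) & (truth == 0)))
    return counts


clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=5,
                             random_state=0)
counts = pooled_counts(clf, X, y)
total = sum(counts.values())

# Metrics are computed once from the pooled totals, not averaged per split
flagged = counts['tp'] + counts['fp']
actual = counts['tp'] + counts['fn']
precision = counts['tp'] / flagged if flagged else 0.0
recall = counts['tp'] / actual if actual else 0.0

print("Total predictions:", total)
print("Precision: %.5f  Recall: %.5f" % (precision, recall))
```

Pooling the counts before computing precision and recall avoids undefined per-split values when a small test set happens to contain no positive predictions.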

Accuracy of the final classifier was generally around 0.84, with precision of about 0.49 and recall of about 0.46. In plain terms, this means that the algorithm devised here correctly predicts a record's status as POI or non-POI around 84% of the time. The precision of 0.49 means that about 49% of the records flagged as POIs were actual POIs, and the recall of 0.46 means that about 46% of the actual POIs were correctly flagged. Both values exceed the 0.3 required by the project rubric.
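As a worked check, the reported metrics can be recomputed from the confusion counts printed by tester.py above. The sketch below uses the standard definitions of accuracy, precision, recall, and the F-beta score; the helper name `f_beta` is illustrative.

```python
# Confusion counts taken from the tester.py output above
tp, fp, fn, tn = 924, 937, 1076, 10063
total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of all predictions that were correct
precision = tp / (tp + fp)     # share of flagged records that were POIs
recall = tp / (tp + fn)        # share of actual POIs that were flagged


def f_beta(p, r, beta):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


f1 = f_beta(precision, recall, 1)  # weights precision and recall equally
f2 = f_beta(precision, recall, 2)  # weights recall more heavily

print("Total predictions: %d" % total)
print("Accuracy: %.5f" % accuracy)
print("Precision: %.5f  Recall: %.5f" % (precision, recall))
print("F1: %.5f  F2: %.5f" % (f1, f2))
```

These values match the tester.py line: accuracy 0.84515, precision 0.49651, recall 0.46200, F1 0.47863, and F2 0.46851.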