Enron Submission Free-Response Questions
1. Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]

The goal of this project is to identify persons of interest (POIs) within the Enron Corpus, one of the largest publicly available datasets of email conversations.

Machine learning is well suited to this type of project because:

- The dataset is massive and has many possible dimensions. Processing it and extracting useful information or trends would be nearly impossible for a person to do by hand, and difficult even with traditional programming methods
- The dataset raises many potential questions. It can be beneficial to use a machine learning algorithm to identify points of interest before digging deeper with a more focused study
- Machine learning allows you to pivot relatively quickly on new information by tweaking the parameters of your algorithm instead of completely refactoring the logic in existing code

This dataset has 3289 data points. There are 146 executives, and only 21 are initially identified as persons of interest. I decided to use 8 features in total for my classifier in an attempt to minimize (but not entirely eliminate) missing values; some of the features I selected still have missing values.
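
Counts of this kind can be gathered with a few lines of Python. The following is a minimal sketch, assuming the dataset is a dict of per-person dicts with a boolean `poi` field and the string "NaN" for missing values; the records and the `summarize` helper are invented for illustration and are not taken from the Enron data.

```python
# Toy stand-in for data_dict; names and values are invented for illustration.
toy_data = {
    "PERSON A": {"poi": True, "bonus": 500000, "expenses": "NaN"},
    "PERSON B": {"poi": False, "bonus": "NaN", "expenses": 1200},
    "PERSON C": {"poi": False, "bonus": 75000, "expenses": 300},
}

def summarize(data):
    """Return (number of people, number of POIs, number of non-missing feature values)."""
    n_people = len(data)
    n_poi = sum(1 for record in data.values() if record["poi"])
    n_values = sum(
        1
        for record in data.values()
        for name, value in record.items()
        if name != "poi" and value != "NaN"
    )
    return n_people, n_poi, n_values

n_people, n_poi, n_values = summarize(toy_data)
print(n_people, n_poi, n_values)
```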

While doing initial data analysis I identified some odd outliers, and I removed them entirely so that they are never considered by my machine learning algorithm.

- TOTAL seemed to refer to something other than an individual (maybe the department the executive belonged to?)
- Several 'executive' data points had only NULL/NaN values. I excluded all of these from consideration by popping them from the dictionary.
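
The removal of all-NaN entries can be automated rather than done by name. This is a hedged sketch, assuming missing values are stored as the string "NaN" and the label lives under the key `poi`; the helper name `drop_empty_records` and the toy records are hypothetical.

```python
def drop_empty_records(data, label_key="poi"):
    """Remove entries whose every feature (other than the label) is "NaN".

    Returns the list of removed names so the cleanup can be reviewed.
    """
    empty = [
        name
        for name, record in data.items()
        if all(value == "NaN" for key, value in record.items() if key != label_key)
    ]
    for name in empty:
        data.pop(name)
    return empty

# Toy records, invented for illustration.
toy_data = {
    "PERSON A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
}
removed = drop_empty_records(toy_data)
print(removed)
```

Entries such as TOTAL are not empty, so a check like this would not catch them; those still need to be removed explicitly.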

2. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]

I loaded the dataset into a pandas DataFrame to get a better look at the available features. After some exploration I selected features based on the following criteria:

- Is there enough data available? (majority of the data is not NULL)
- Did the feature (in my mind) have any correlation with the analysis I was trying to perform?
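
The first criterion can be checked mechanically. Below is a minimal sketch that computes, for each feature, the fraction of records with a real value; it assumes missing values are stored as the string "NaN", and the 0.5 cutoff, the helper name, and the toy records are assumptions for illustration.

```python
def feature_coverage(data):
    """Map each feature name to the fraction of records where it is not "NaN"."""
    counts = {}
    for record in data.values():
        for name, value in record.items():
            present, total = counts.get(name, (0, 0))
            counts[name] = (present + (value != "NaN"), total + 1)
    return {name: present / float(total) for name, (present, total) in counts.items()}

# Toy records, invented for illustration.
toy_data = {
    "PERSON A": {"bonus": 100, "loan_advances": "NaN"},
    "PERSON B": {"bonus": 200, "loan_advances": "NaN"},
    "PERSON C": {"bonus": "NaN", "loan_advances": 5000},
    "PERSON D": {"bonus": 300, "loan_advances": "NaN"},
}
coverage = feature_coverage(toy_data)

# Keep only features where the majority of values are present.
kept = sorted(name for name, fraction in coverage.items() if fraction > 0.5)
print(coverage, kept)
```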

Using this as a guide I selected several new features to include alongside the initial features in the starter code. The features I selected had specific reasons for being chosen, which I outline next to them. I wanted to focus very intently on conversations between known and potential persons of interest and any trend that could be found in their bonus and expenses in my attempts to identify more persons of interest.

- bonus: a higher-than-average bonus could indicate that the person was involved in the fraud
- expenses: unusually high expenses could point to another person involved in the fraud
- from_poi_to_this_person: frequent messages received directly from confirmed POIs could indicate that the person is a POI themselves
- from_this_person_to_poi: as above, for messages this person sent to confirmed POIs
- shared_receipt_with_poi: as above, for messages this person received alongside confirmed POIs

After testing these features in step 3 I realized that I had cut the feature list down too far to get good precision out of the algorithms. I began adding features back one at a time and re-testing the algorithm after each addition. Some features drastically reduced precision and recall (restricted_stock) while others incrementally improved them. Once all the features were added back in, I checked the feature importances to see which features had the most impact on the decision tree:

Tree Feature Importances:

bonus : 0.2611
long_term_incentive : 0.2129
deferred_income : 0.1373
exercised_stock_options : 0.1175
expenses : 0.1051
deferral_payments : 0.0764
total_payments : 0.0594
shared_receipt_with_poi : 0.0304
salary : 0.0000
from_poi_to_this_person : 0.0000
from_this_person_to_poi : 0.0000
loan_advances : 0.0000
restricted_stock_deferred : 0.0000
director_fees : 0.0000
total_stock_value : 0.0000
custom_ratio_to_poi : 0.0000
custom_ratio_from_poi : 0.0000

Using this information I cut down the features list again and ran another test on the Decision Tree, which pushed the precision up past the necessary threshold.

- Accuracy: 0.82733	
- Precision: 0.35804	
- Recall: 0.37200
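
The pruning step can be sketched as a simple filter over the importance values listed earlier. Keeping only features with a non-zero importance reproduces the eight features that appear in `features_list` in the code further down; the helper name `select_features` and the cutoff argument are assumptions for illustration, and only a few of the zero-importance features are included here.

```python
# Importance values copied from the listing above (zero-importance entries abbreviated).
importances = {
    "bonus": 0.2611,
    "long_term_incentive": 0.2129,
    "deferred_income": 0.1373,
    "exercised_stock_options": 0.1175,
    "expenses": 0.1051,
    "deferral_payments": 0.0764,
    "total_payments": 0.0594,
    "shared_receipt_with_poi": 0.0304,
    "salary": 0.0,
    "from_poi_to_this_person": 0.0,
    "from_this_person_to_poi": 0.0,
    "custom_ratio_to_poi": 0.0,
    "custom_ratio_from_poi": 0.0,
}

def select_features(importances, min_importance=0.0):
    """Return feature names whose importance exceeds the cutoff, most important first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > min_importance]

# The label must stay first in the list.
features_list = ["poi"] + select_features(importances)
print(features_list)
```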

3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”] What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric item: “tune the algorithm”] What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric item: “validation strategy”]

I chose a Decision Tree as my first algorithm. I started by adjusting the min_samples_split parameter, raising it above the default in an attempt to prevent overfitting. After a few test runs I also changed max_depth and then random_state to see if I could push precision higher, and got it up to ~.25. When I couldn't squeeze any better performance out of the algorithm I tried the default settings and found that they yielded a better result. I am glad that I fell back to the defaults after attempting to change the parameters manually: it showed me how clicking through parameter values without a plan can hurt your final metrics. I believe that my overzealous cutting of features, and focusing so closely on two parts of the data that I assumed would be good for classification, may have eliminated helpful features, so I decided to return to feature selection after testing a second algorithm.
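
Manual tweaking of this kind can be replaced by a systematic search over parameter combinations. The sketch below is a plain-Python grid search, not the approach used in this project: `grid_search` and `toy_evaluate` are hypothetical names, and the scorer is a stand-in for "train a tree with these settings and return its cross-validated score".

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    """Hypothetical scorer; a real one would fit and validate a classifier."""
    distance = abs(params["min_samples_split"] - 4) + abs(params["max_depth"] - 3)
    return 1.0 / (1 + distance)

param_grid = {"min_samples_split": [2, 4, 8], "max_depth": [3, 5, 10]}
best_params, best_score = grid_search(param_grid, toy_evaluate)
print(best_params, best_score)
```

scikit-learn provides `GridSearchCV` in `sklearn.model_selection`, which performs the same kind of exhaustive search with cross-validation built in.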

Prior to going back to feature selection, the best Decision Tree result I could get was the following:

- Accuracy: 0.77862	
- Precision: 0.28995	
- Recall: 0.30300

The second algorithm I tried was KMeans clustering, because I knew from the start how many categories each person could fall into (2). Knowing the number of groups made tuning easy: I set the number of clusters to 2. If this parameter were configured improperly it would likely have a severely detrimental effect on the precision of the algorithm, or it would misclassify people. I did check how changing the number of clusters affected the algorithm, just to see what behavior occurred, but it did not give me any significant insights into the data.
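
Because KMeans is unsupervised, the cluster ids it assigns (0 and 1) carry no inherent meaning, so scoring them directly against POI labels depends on which cluster happens to receive which id. A hedged sketch of one way to align the two is to map each cluster to the majority label of its members; the helper name and the toy arrays below are invented for illustration, and in practice the mapping should be learned on training data only.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, labels):
    """Map each cluster id to the most common true label among its members."""
    votes = {}
    for cluster, label in zip(cluster_ids, labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Toy cluster assignments and true labels, invented for illustration.
cluster_ids = [1, 1, 1, 0, 0, 1]
labels = [0, 0, 0, 1, 1, 1]

mapping = map_clusters_to_labels(cluster_ids, labels)
predictions = [mapping[cluster] for cluster in cluster_ids]
print(mapping, predictions)
```

With heavily imbalanced classes both clusters may map to the majority label, which is one reason a clustering algorithm can struggle as a classifier on this kind of data.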

Prior to going back to feature selection, the best KMeans result I could get was the following:

- Accuracy: 0.73054	
- Precision: 0.21480	
- Recall: 0.28300

After removing two features that were shown to be unimportant (to_messages and from_messages) I ran the Decision Tree algorithm again. It showed a very slight improvement in precision, but accuracy and recall dropped.

- Accuracy: 0.76500	
- Precision: 0.29124	
- Recall: 0.28600

Going back to feature selection and going through the process of adding features back in slowly, I finally came to a feature list that yielded a precision and recall score above the necessary threshold, but just barely.

- Accuracy: 0.82733	
- Precision: 0.35804	
- Recall: 0.37200

Those same features seemed to have a negative effect on the KMeans algorithm, however.

- Accuracy: 0.81427	
- Precision: 0.20227	
- Recall: 0.13350

4. Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

For my final chosen algorithm (Decision Tree) I was able to get an average performance of 0.35804 Precision and 0.37200 Recall.

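Precision is the number of correctly predicted positive classifications divided by the total number of positive predictions the classifier made. Of the people my classifier flags as persons of interest, about 36% actually are.

Recall is the number of correctly predicted positive classifications divided by the number of positive classifications that **should** have been identified. My classifier finds about 37% of the actual persons of interest.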
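
Both metrics can be recomputed from the confusion counts in the tester output for the Decision Tree (744 true positives, 1334 false positives, 1256 false negatives, 11666 true negatives). The function name below is an assumption for illustration; the formulas are the standard definitions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy, f1) from confusion-matrix counts."""
    precision = tp / float(tp + fp)   # share of flagged people who are real POIs
    recall = tp / float(tp + fn)      # share of real POIs that were flagged
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

precision, recall, accuracy, f1 = classification_metrics(tp=744, fp=1334, fn=1256, tn=11666)
print("{:.5f} {:.5f} {:.5f} {:.5f}".format(precision, recall, accuracy, f1))
# prints "0.35804 0.37200 0.82733 0.36488"
```

Accuracy is much higher than either precision or recall because most people in the dataset are not POIs, so a classifier is rewarded heavily for correct negative predictions.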

In [80]:
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")
sys.path.append(r"C:\Users\Evernite.Evernite-NPC\Desktop\WGU C753/tools/")
import pandas as pd
import numpy as np

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

import tester

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.cluster import KMeans

from sklearn.metrics import classification_report

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi','bonus','long_term_incentive','deferred_income','exercised_stock_options','expenses',
                 'deferral_payments','total_payments','shared_receipt_with_poi']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

# Task 2: Remove outliers
# Found these during exploration of the data: entries that are empty
# or that do not refer to an individual person
data_dict.pop('LOCKHART EUGENE E', 0)
data_dict.pop('TOTAL', 0)
data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0)

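# Task 3: Create new feature(s)
# custom_ratio_to_poi: from_this_person_to_poi divided by to_messages
# custom_ratio_from_poi: from_poi_to_this_person divided by from_messages
# Both default to 0.0 when either value is 0 or "NaN"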
for val in data_dict.values():
    from_this_person_to_poi = val['from_this_person_to_poi']
    to_messages = val['to_messages']
    from_poi_to_this_person = val['from_poi_to_this_person']
    from_messages = val['from_messages']
    
    val["custom_ratio_to_poi"] = (float(from_this_person_to_poi) / float(to_messages) 
                           if to_messages not in [0, "NaN"] and 
                              from_this_person_to_poi not in [0, "NaN"] 
                           else 0.0)
    val["custom_ratio_from_poi"] = (float(from_poi_to_this_person) / float(from_messages) 
                           if from_messages not in [0, "NaN"] and 
                              from_poi_to_this_person not in [0, "NaN"] 
                           else 0.0)

# Append the new features to the feature list
features_list.append('custom_ratio_to_poi')
features_list.append('custom_ratio_from_poi')

### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# Split into training and testing sets for fitting and scoring
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=42)
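
With only a small number of POIs in the data, a plain random split can leave very few positive examples in the test set. The following is a hedged sketch of a stratified split in plain Python, which keeps the POI proportion roughly equal in both parts; the function name and the toy labels are assumptions for illustration, not part of the project code.

```python
import random

def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_indices, test_indices) with each class split at roughly test_size."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(round(len(indices) * test_size))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels, invented for illustration: 6 positives among 36 records.
labels = [1] * 6 + [0] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.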

In [42]:
poi = pd.DataFrame.from_dict(data_dict, orient='index')

In [43]:
poi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 143 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 23 columns):
to_messages                  143 non-null object
deferral_payments            143 non-null object
custom_ratio_to_poi          143 non-null float64
expenses                     143 non-null object
poi                          143 non-null bool
deferred_income              143 non-null object
email_address                143 non-null object
long_term_incentive          143 non-null object
restricted_stock_deferred    143 non-null object
shared_receipt_with_poi      143 non-null object
loan_advances                143 non-null object
from_messages                143 non-null object
other                        143 non-null object
custom_ratio_from_poi        143 non-null float64
director_fees                143 non-null object
bonus                        143 non-null object
total_stock_value            143 non-null object
from_poi_to_this_person      143 non-null object

In [None]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

In [81]:
# Decision Tree
clf_tree = tree.DecisionTreeClassifier()
clf_tree.fit(features_train, labels_train)
tester.dump_classifier_and_data(clf_tree, my_dataset, features_list)
tester.main();

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
	Accuracy: 0.82733	Precision: 0.35804	Recall: 0.37200	F1: 0.36488	F2: 0.36912
	Total predictions: 15000	True positives:  744	False positives: 1334	False negatives: 1256	True negatives: 11666



In [82]:
# Get the feature importances of the DecisionTree Classifier
tree_feature_importances = (clf_tree.feature_importances_)
tree_features = zip(tree_feature_importances, features_list[1:])
tree_features = sorted(tree_features, key= lambda x:x[0], reverse=True)

# Display the feature names and importance values
print('Tree Feature Importances:\n')
for i in range(len(features_list) - 1):
    print('{} : {:.4f}'.format(tree_features[i][1], tree_features[i][0]))

Tree Feature Importances:

shared_receipt_with_poi : 0.3591
deferred_income : 0.2029
deferral_payments : 0.1426
total_payments : 0.1242
expenses : 0.1119
exercised_stock_options : 0.0594
bonus : 0.0000
long_term_incentive : 0.0000
custom_ratio_to_poi : 0.0000
custom_ratio_from_poi : 0.0000


In [83]:
# KMeans clustering (unsupervised: the labels passed to fit are ignored)
clf_km = KMeans(n_clusters=2)
clf_km.fit(features_train, labels_train)
tester.dump_classifier_and_data(clf_km, my_dataset, features_list)
tester.main();

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
	Accuracy: 0.81427	Precision: 0.20227	Recall: 0.13350	F1: 0.16084	F2: 0.14324
	Total predictions: 15000	True positives:  267	False positives: 1053	False negatives: 1733	True negatives: 11947

