`/final_project/`:

- `poi_id.py`: Main file. Runs final feature selection, feature scaling, various classifiers (optional) and their results. Finally, dumps classifier, dataset and feature list so anyone can check the results.
- `tester.py`: Functions for validation and evaluation of the classifier, dumping and loading of pickle files.
- `my_classifier.pkl`: Pickle file for the final classifier from `poi_id.py`.
- `my_dataset.pkl`: Pickle file for the final dataset from `poi_id.py`.
- `my_feature_list.pkl`: Pickle file for the final feature list from `poi_id.py`.

`/tools/`:

- `feature_format.py`: Functions to convert data from dictionary format into numpy arrays and separate the target label from the features, to make the data suitable for machine learning processes.
- `feature_creation.py`: Functions for creating two new features, `'poi_email_ratio'` and `'exercised_stock_ratio'`.
- `select_k_best.py`: Function for selecting the k best features using scikit-learn's `SelectKBest`, sorted in descending order of score.
- `visualize.py`: Function for drawing plots of any two features, colored by POI and non-POI.
Enron's financial scandal in 2001 led to the creation of a very valuable dataset for machine learning, one where algorithms were trained and tested to find fraudulent employees, or persons-of-interest (POIs). In this project, a merged dataset of financial and email data will be used to go through the entire machine learning process. First, the dataset will be manually explored to find outliers and trends and generally understand the data we're working with. Certain useful financial or email-based features will be chosen (manually and automatically using sklearn functions), ensemble features will be created from those available, and appropriate feature scaling will be applied. Then, numerous algorithms with parameter tuning will be trained and tested on the data, with the results of a Decision Tree Classifier, an ensemble classifier named AdaBoost, and a Nearest Neighbors classifier being presented. The detailed results of the final algorithm, a Decision Tree Classifier, are shown. The validation and evaluation metrics are presented, along with the reasoning behind their choice and why they matter. Finally, other ideas involving feature selection, feature scaling, other algorithms and usage of email texts are discussed.
In late 2001, Enron, an American energy company, filed for bankruptcy after one of the largest financial scandals in corporate history. After the company's collapse, over 600,000 emails generated by 158 Enron employees - now known as the Enron Corpus - were acquired by the Federal Energy Regulatory Commission during its investigation. The data was then uploaded online, and since then, a number of people and organizations have graciously prepared, cleaned and organized the dataset that is available to the public today (a few years later, financial data of top Enron executives were released following their trial).
Today, the Enron Corpus is the largest and one of the only publicly available mass collections of real emails easily accessible for study. This excerpt from an article in MIT Technology Review summarizes the value of such a dataset:
This corpus is valuable to computer scientists and social-network theorists in ways that the e-mails’ authors and recipients never could have intended. Because it is a rich example of how real people in a real organization use e-mail—full of mundane lunch plans, boring meeting notes, embarrassing flirtations that revealed at least one extramarital affair, and the damning missives that spelled out corruption—it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.
The aim of this project is to apply machine learning techniques to build a predictive model that identifies Enron employees that may have committed fraud based on their financial and email data.
The boffins at Udacity have combined important aspects of the financial and email data to create one consolidated dataset that will be used for creating the predictive model. The dataset has 14 financial features (salary, bonus, etc.), 6 email features (to and from messages, etc.) and a boolean label that denotes whether a person is a person-of-interest (POI) or not (established from credible news sources). It is these features that will be explored, cleaned, and then put through various machine learning algorithms, before the algorithms are finally tuned and checked for accuracy (precision and recall). The objective is to get a precision and recall score of at least 0.3.
Machine learning is incredibly effective when it comes to making predictions from data, especially large amounts of it. The Enron Corpus, after cleaning, has around 500,000 emails, and attempting to identify fraudulent employees by manually foraging through half a million emails is a daunting task at best. Designing and implementing a machine learning algorithm can significantly reduce the legwork; making sure the algorithm produces accurate results is challenging, but ultimately worth the effort.
`explore_enron_data.py` has the code that explores the dataset and provides the following information.
The dataset is a dictionary of dictionaries, with the keys being 146 Enron employees, and their values being a dictionary of features. The features available for each person are:
['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']
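For reference, here is a minimal sketch of loading and poking at the dataset (the pickle path below is an assumption based on the standard Udacity layout and may differ locally; `explore_enron_data.py` does this in more detail):

```python
# Minimal sketch: load the project dataset and inspect its structure.
# The path below is an assumption (standard Udacity layout) and may differ.
import pickle

with open("final_project/final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

print(len(data_dict))                    # 146 people
print(list(data_dict["LAY KENNETH L"]))  # the 21 feature names per person

# Count persons of interest
poi_count = sum(1 for person in data_dict.values() if person["poi"])
print("POIs:", poi_count)                # 18
```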
In the features, the boolean `poi` denotes whether the person is a person of interest or not. The `Poi_count` function shows that there are 18 such people in the dataset, and the aim of this project is to find distinguishing features that set these people apart from the others.

Of course, not everyone has data for each feature, and missing data is denoted by `'NaN'`. The `NaN_count` function prints a dictionary sorted in descending order. Using the results of the function, as well as going through the names, I came across the following outliers:
- `LOCKHART EUGENE E`: No data available on this person.
- `THE TRAVEL AGENCY IN THE PARK`: Not a person/employee associated with Enron.
- `TOTAL`: Summation of everyone's data - likely part of a spreadsheet. Found after visualizing financial features and finding an extreme outlier.
I will remove these outliers (and any others I find) after I complete feature selection.
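A sketch of that removal step might look like this (continuing from the `data_dict` loaded above):

```python
# Remove the three identified outlier records from the dataset.
# pop() with a default makes the call harmless if a key is already gone.
outliers = ["LOCKHART EUGENE E", "THE TRAVEL AGENCY IN THE PARK", "TOTAL"]
for key in outliers:
    data_dict.pop(key, None)

print(len(data_dict))  # 143 records remain
```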
The first thing to do is simply to create plots of two features and visualize them, look for trends and generally get an idea of the variation in the data. The `DrawPlot` function in `visualize.py` does exactly that. Here are a few plots that it produced:
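(For reference, a minimal sketch of what a `DrawPlot`-style helper could look like; the actual function in `visualize.py` may differ in its details.)

```python
# Scatter any two features against each other, coloring POIs red and
# non-POIs blue, skipping missing ('NaN') values.
import matplotlib.pyplot as plt

def draw_plot(data_dict, feature_x, feature_y):
    for person in data_dict.values():
        x, y = person[feature_x], person[feature_y]
        if x == "NaN" or y == "NaN":
            continue
        plt.scatter(x, y, color="red" if person["poi"] else "blue")
    plt.xlabel(feature_x)
    plt.ylabel(feature_y)
    plt.show()

draw_plot(data_dict, "salary", "bonus")
```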
There are lots of features available to play with, but as with all lists of features, not all of them are useful in predicting the target variable. My first step is to check each feature's ability to clearly differentiate between POI and non-POI. To do this, I'm going to use scikit-learn's `SelectKBest` algorithm, which will give me a score for each feature based on its ability to identify the target variable.

In `select_k_best.py`, the `Select_K_Best` function returns an array of k tuples in descending order of score (a sketch of this step follows the list of scores below). Running this will show the most useful features and the not-so-useful ones. Running it over all features gives:
[('exercised_stock_options', 25.097541528735491),
('total_stock_value', 24.467654047526398),
('bonus', 21.060001707536571),
('salary', 18.575703268041785),
('deferred_income', 11.595547659730601),
('long_term_incentive', 10.072454529369441),
('restricted_stock', 9.3467007910514877),
('total_payments', 8.8667215371077752),
('shared_receipt_with_poi', 8.7464855321290802),
('loan_advances', 7.2427303965360181),
('expenses', 6.2342011405067401),
('from_poi_to_this_person', 5.3449415231473374),
('other', 4.204970858301416),
('from_this_person_to_poi', 2.4265081272428781),
('director_fees', 2.1076559432760908),
('to_messages', 1.6988243485808501),
('deferral_payments', 0.2170589303395084),
('from_messages', 0.16416449823428736),
('restricted_stock_deferred', 0.06498431172371151)]
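A rough sketch of how this ranking can be produced (`featureFormat` and `targetFeatureSplit` are the helpers from `tools/feature_format.py`; the actual `Select_K_Best` in `select_k_best.py` may differ in details):

```python
# Rank features by their SelectKBest (ANOVA F-value) score, highest first.
from sklearn.feature_selection import SelectKBest, f_classif
from feature_format import featureFormat, targetFeatureSplit  # tools/ on path

def select_k_best(data_dict, features_list, k):
    # features_list starts with 'poi', the target label
    data = featureFormat(data_dict, features_list)
    labels, features = targetFeatureSplit(data)

    selector = SelectKBest(f_classif, k=k)
    selector.fit(features, labels)

    scored = zip(features_list[1:], selector.scores_)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```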
From this I can see that stock based features lead the way, and the leading features are all financial based - the email features are mostly among the bottom half.
The first thing I notice is that `'other'` is not very useful and also ambiguous, so it's not a feature I'm going to add to the list.
Also, in order to make the email features more effective, I'm going to create a new feature called `'poi_email_ratio'`, which is the sum of the ratio of emails sent to POIs over all emails sent and the ratio of emails received from POIs over all emails received. Essentially, if every email a person X sent/received was to/from a POI, then their value would be 2. And if they sent/received 0 emails to/from a POI, then their value would be 0.
Since stocks are seemingly very good indicators, another feature called `'exercised_stock_ratio'` will be created - the ratio of exercised stock options to total stock value. The rationale behind this is that the executives of Enron (and the main POIs) all had huge stock options, and since they knew the company was going to go bankrupt, they exercised those options.
In `feature_creation.py`, the `CreatePoiEmailRatio` function creates the `'poi_email_ratio'` feature and the `CreateExercisedStockRatio` function creates the `'exercised_stock_ratio'` feature. Testing these features on a few people (`test_people` in the file), however, led to a few complications, namely with the `CreatePoiEmailRatio` function.
As explained above, if all of a person's emails sent/received were to/from POIs, then the value should be 2 (i.e. the maximum value). However, Mark Frevert has a value of 11.5, and Kenneth Lay has a value of 3.4, above the designed maximum. Checking the message totals for these two people (see `explore_enron_data.py`), I noticed that the `'from_poi_to_this_person'` total exceeded the `'from_messages'` total (and hence it must be possible for the `'from_this_person_to_poi'` total to exceed the `'to_messages'` total). This led me to conclude that the from and to message totals must not include the emails sent to/received from POIs, and I'll have to recalibrate the function to add them.
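A sketch of the corrected logic (the actual `CreatePoiEmailRatio` in `feature_creation.py` may differ in its details):

```python
# Add a 'poi_email_ratio' entry to every person: the share of sent emails that
# went to POIs plus the share of received emails that came from POIs, with the
# POI counts added back into the totals as described above.
def create_poi_email_ratio(data_dict):
    for person in data_dict.values():
        to_poi = person["from_this_person_to_poi"]
        from_poi = person["from_poi_to_this_person"]
        sent = person["from_messages"]
        received = person["to_messages"]

        if "NaN" in (to_poi, from_poi, sent, received):
            person["poi_email_ratio"] = "NaN"
            continue

        sent_total = sent + to_poi            # totals recalibrated to
        received_total = received + from_poi  # include the POI emails
        if sent_total == 0 or received_total == 0:
            person["poi_email_ratio"] = "NaN"
            continue

        person["poi_email_ratio"] = (to_poi / float(sent_total)
                                     + from_poi / float(received_total))
```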
After fixing the error, I ran the `Select_K_Best` function again, and `'poi_email_ratio'` came 5th with a score of 16.2, but unfortunately `'exercised_stock_ratio'` didn't work at all, coming dead last with a score nearly twice as bad as the one before it.
The final 14 features to be used for the machine learning process are:
Feature | Score |
---|---|
exercised_stock_options | 25.10 |
total_stock_value | 24.47 |
bonus | 21.06 |
salary | 18.58 |
poi_email_ratio | 16.24 |
deferred_income | 11.60 |
long_term_incentive | 10.07 |
restricted_stock | 9.35 |
total_payments | 8.87 |
shared_receipt_with_poi | 8.75 |
loan_advances | 7.24 |
expenses | 6.23 |
from_poi_to_this_person | 5.34 |
from_this_person_to_poi | 2.43 |
A lot of the features have vastly different scales. Financial features such as salary range from below ten thousand up to just above a million dollars, whereas email features like to and from messages max out at around a few thousand. The newly created `'poi_email_ratio'` has a maximum of two. To make sure the scales are comparable, I'm going to run sklearn's `MinMaxScaler()` over all the features.
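Something along these lines, where `features` is the array coming out of `featureFormat`/`targetFeatureSplit`:

```python
# Rescale every feature column to the [0, 1] range.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
features = scaler.fit_transform(features)
```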
The next step in the process is to try out a number of algorithms to see which ones do a good job of correctly predicting (or not making an error with) POIs. The algorithms I'm going to try out are:

- `DecisionTreeClassifier`: A standard decision tree classifier
- `AdaBoost`: An ensemble decision tree classifier
- `Linear Regression`: A standard linear regression classifier
- `K-Nearest Neighbors`: A proximity based classifier
- `SVM`: A heavy but robust multi-dimensional classifier
I will both manually tune the classifiers with different parameters (and feature lists) and use Pipelines to run PCA and GridSearch to come up with the best precision and recall results.
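A rough sketch of the Pipeline/GridSearch side of that (the parameter grid, scoring choice and cross-validation defaults here are illustrative, not the exact configuration in `poi_id.py`):

```python
# Chain PCA and a decision tree, then search over the number of components
# and min_samples_split, optimizing for F1 (balances precision and recall).
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("pca", PCA()),
    ("tree", DecisionTreeClassifier()),
])

param_grid = {
    "pca__n_components": [2, 4, 6, 8, 10, 12],
    "tree__min_samples_split": [2, 3, 4, 5, 10],
}

grid = GridSearchCV(pipeline, param_grid, scoring="f1")
grid.fit(features, labels)
print(grid.best_params_, grid.best_score_)
```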
Here are the results of using the `DecisionTreeClassifier` while changing the `min_samples_split` parameter over the full feature list (15 features):
Min Samples Split | Precision | Recall |
---|---|---|
10 | 0.273 | 0.204 |
5 | 0.294 | 0.266 |
4 | 0.294 | 0.270 |
3 | 0.296 | 0.274 |
2 | 0.288 | 0.283 |
Next, let's check what happens when I choose just the top 8 features (almost half the full list):
Min Samples Split | Precision | Recall |
---|---|---|
10 | 0.317 | 0.245 |
5 | 0.342 | 0.299 |
4 | 0.335 | 0.297 |
3 | 0.349 | 0.312 |
2 | 0.345 | 0.341 |
All the precision scores are above 0.3 now, and all the recall scores are in and around that region. Clearly, reducing the feature list to just the top half of the most important features has slightly improved the classifier's ability to predict POIs.
Now, let's make a similar table, but this time with PCA.
I'm going to pick a `min_samples_split` value of 3, and change the number of components in the Principal Component Analysis. This will let me compare manually reducing the number of features against running dimensionality reduction over the full feature list.
No. of Components | Precision | Recall |
---|---|---|
12 | 0.274 | 0.287 |
10 | 0.298 | 0.317 |
8 | 0.283 | 0.275 |
6 | 0.284 | 0.269 |
4 | 0.301 | 0.284 |
2 | 0.196 | 0.196 |
These results give me the impression that PCA isn't particularly useful in this case, and it seems that manually picking the best features is giving better scores.
Since I got better results when reducing the number of features, I thought I'd go a little further. I decided to try just 2 features: the best financial feature, `'exercised_stock_options'`, and the best email feature, `'poi_email_ratio'`. These are the results while changing `min_samples_split`:
Min Samples Split | Precision | Recall | F1 Score |
---|---|---|---|
2 | 0.505 | 0.465 | 0.485 |
3 | 0.586 | 0.460 | 0.515 |
4 | 0.596 | 0.462 | 0.521 |
5 | 0.601 | 0.461 | 0.520 |
6 | 0.623 | 0.456 | 0.526 |
7 | 0.654 | 0.455 | 0.536 |
8 | 0.654 | 0.452 | 0.534 |
9 | 0.659 | 0.451 | 0.535 |
10 | 0.673 | 0.442 | 0.534 |
15 | 0.742 | 0.414 | 0.531 |
20 | 0.714 | 0.312 | 0.434 |
The AdaBoost classifier is an ensemble method that fits a sequence of weak decision tree estimators, reweighting the data at each iteration so that later estimators focus on the samples earlier ones got wrong.
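For reference, a minimal sketch of the AdaBoost setup (a simple hold-out split is used here purely for illustration; the reported numbers come from the StratifiedShuffleSplit-based tester described later):

```python
# Fit an AdaBoost classifier; n_estimators is the parameter varied below.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

clf = AdaBoostClassifier(n_estimators=100)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
```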
Results with the full feature list, changing the number of estimators (`n_estimators`):
No. of Estimators | Precision | Recall |
---|---|---|
10 | 0.356 | 0.227 |
25 | 0.354 | 0.268 |
50 | 0.404 | 0.306 |
75 | 0.446 | 0.317 |
100 | 0.461 | 0.315 |
Results with top 8 features:
No. of Estimators | Precision | Recall |
---|---|---|
10 | 0.357 | 0.228 |
25 | 0.356 | 0.269 |
50 | 0.355 | 0.291 |
75 | 0.364 | 0.308 |
100 | 0.371 | 0.322 |
These results show that the Adaboost Classifier performs better than the No-PCA and PCA Decision Tree Classifiers (same feature lists), but it is much, much more computationally expensive, with the 100 estimators taking a few minutes to run.
Results while changing the number of nearest neighbors:
No. of Neighbors | Precision | Recall |
---|---|---|
1 | 0.157 | 0.131 |
2 | 0.276 | 0.066 |
3 | 0.503 | 0.245 |
4 | 0.590 | 0.107 |
5 | 0.639 | 0.168 |
6 | 1.000 | 0.004 |
7 | 0.897 | 0.018 |
K-Nearest neighbors gives incredibly precise predictions as the value increases past 5, but at the cost of recall. The reason for the near-perfect precision is that it makes very few POI predictions in the first place, but when it does, it's accurate. For example, for 6 neighbors, just 8 POI predictions were made, with all 8 being true positives. While this precision seems compelling, it comes at the cost of recall, which is astoundingly low because when one makes very few POI predictions (that is, a lot of the predictions are that each person is not a POI), one lets a lot of actual POIs through. For this reason, K-Nearest Neighbors is not an appropriate classifier.
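For completeness, the corresponding sketch for the K-Nearest Neighbors runs (again illustrative only, reusing the hold-out split from the AdaBoost sketch; `n_neighbors` is the parameter varied in the table):

```python
# Fit a K-Nearest Neighbors classifier on the scaled features.
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)  # very few POI predictions at this setting
```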
The algorithm I've chosen is a Decision Tree Classifier with a `min_samples_split` value of 8 and a feature list of `['poi', 'exercised_stock_options', 'poi_email_ratio']`.
The feature list looks a bit empty considering there are so many to play with, but through experimentation I've noticed this produces the best precision and recall scores. The visualization of the two features also produces a less overplotted graph, which is good for decision tree classifiers (see below).
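A sketch of how this final configuration ends up in `poi_id.py` before dumping (`dump_classifier_and_data` is assumed to be the helper provided by `tester.py`; `my_dataset` is `data_dict` with the engineered features already added):

```python
# Final classifier, feature list and dataset, dumped so tester.py can
# reproduce the results below.
from sklearn.tree import DecisionTreeClassifier
from tester import dump_classifier_and_data

features_list = ["poi", "exercised_stock_options", "poi_email_ratio"]
clf = DecisionTreeClassifier(min_samples_split=8)

my_dataset = data_dict  # data_dict with 'poi_email_ratio' already created
dump_classifier_and_data(clf, my_dataset, features_list)
```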
These are the evaluation metric results from `tester.py`:
Evaluation Metric | Score |
---|---|
Accuracy | 0.869 |
Precision | 0.654 |
Recall | 0.455 |
F1 | 0.536 |
F2 | 0.483 |
Total Predictions | 12000 |
True Positives | 907 |
False Positives | 483 |
True Negatives | 9517 |
False Negatives | 1093 |
And here is the visualization:
Certain validation and evaluation metrics are used to get the algorithm performance scores above, and now I'm going to go into a little more detail on what they are and why they're important or right for this data.
The first validation step is to separate the dataset into training and testing sets. Why is this important? If an algorithm learns from and tests on the same data, the model will just repeat the target labels and likely get a perfect score. The algorithm will overfit to that particular data, and should it be tested on any new, unseen data (say another company's data after similar circumstances), it will likely perform very poorly in its predictions.
To evaluate performance on independent data and combat overfitting, sklearn has several helper functions under cross-validation. The one used in this project is `StratifiedShuffleSplit`, which is particularly suitable for this dataset. Since there is a large imbalance in the number of POIs and non-POIs (very few POIs in the entire dataset), it's possible that in a random split there won't be many POI labels to either train or validate on. What `StratifiedShuffleSplit` does is ensure that the percentage of target labels is approximately the same in both the training and validation sets as it is in the complete dataset.
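A minimal sketch of that validation scheme with a recent scikit-learn API (the number of splits and test size here are illustrative; `tester.py`'s own loop may be structured differently):

```python
# Repeatedly split the data while preserving the POI/non-POI proportion in
# every training and test split, then aggregate predictions across folds.
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_idx, test_idx in sss.split(features, labels):
    features_train = [features[i] for i in train_idx]
    features_test = [features[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    # fit the classifier on the training split, accumulate its predictions
    # on the test split into the running confusion counts
```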
The most common measure of an algorithm's performance is its accuracy score - the number of predicted labels it got correct over all predicted labels. We'll see, however, that this isn't a suitable metric for this kind of dataset. When there's a large imbalance in the data labels, then a high accuracy score can be achieved by simply predicting all labels to be the most common label in the dataset. In this context, it would be predicting all labels to be non-POI, which will still result in an accuracy score above 80% despite not actually predicting anything.
In these situations, a much better evaluation metric for the algorithm is the precision score and the recall score (or the F1 score, which is a weighted average of the two). The precision score is the ratio of true positives over the sum of true and false positives, while the recall score is the ratio of true positives over the sum of true positives and false negatives. In the context of the Enron data, the precision score measures the algorithm's ability to correctly identify a POI when it predicts one, while the recall score measures its ability to correctly flag a person who actually is a POI. The difference will be clearer once the true positives, false positives, true negatives and false negatives are defined in this context:
- True positive: when a label is a POI and the algorithm predicts it is a POI.
- False positive: when a label is not a POI and the algorithm predicts it is a POI.
- True negative: when a label is not a POI and the algorithm predicts a non-POI.
- False negative: when a label is a POI and the algorithm predicts a non-POI.
So, the precision score is how often the algorithm is getting the prediction of POI right, whereas the recall score is, given that the label is a POI, how often the algorithm predicts it.
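As a quick check, plugging the confusion counts from the `tester.py` table above into these definitions reproduces the reported scores (up to rounding):

```python
# Precision and recall from the confusion counts reported by tester.py.
tp, fp, fn = 907, 483, 1093

precision = tp / float(tp + fp)  # 907 / 1390  ≈ 0.65
recall = tp / float(tp + fn)     # 907 / 2000  ≈ 0.45
```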
For feature scaling, I applied a `MinMaxScaler` to all the features, but towards the end of the project I wondered if there are other methods I could have used. A `MinMaxScaler` just shrinks or expands the data along a scale, and it's useful when feature scales differ by a huge amount, but I passed up the opportunity to transform the data into different shapes. In particular, I could have transformed the data to make the distributions Gaussian and seen what the data looked like when normally distributed. I suspect the algorithms would behave differently, and that would open up a lot more options (at the very least, it would lead to interesting outcomes).
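One hedged possibility for that idea, not tried in this project: a newer scikit-learn transformer such as `PowerTransformer` could be swapped in for `MinMaxScaler` to push each feature towards a Gaussian shape.

```python
# Map each feature towards a normal distribution instead of just rescaling it.
from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer(method="yeo-johnson")
features_gaussian = transformer.fit_transform(features)
```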
The other aspect that was not touched in this project, primarily because it's beyond the scope of the analysis, was actually analyzing the text in the emails. The text learning mini-project provided an entry point into the world of natural language processing, but I suspect that using that data to help predict POIs is a much more challenging task. Nonetheless, it is interesting to think about, and hopefully soon there will be a stage where I'm skilled enough in machine learning to include that realm into my investigation.