# Enron Submission Free-Response Questions

A critical part of machine learning is making sense of your analysis process and communicating it to others. The questions below will help us understand your decision-making process and allow us to give feedback on your project. Please answer each question; your answers should be about 1-2 paragraphs per question. If you find yourself writing much more than that, take a step back and see if you can simplify your response!

When your evaluator looks at your responses, he or she will use a specific list of rubric items to assess your answers. Here is the link to that rubric: [Link] Each question has one or more specific rubric items associated with it, so before you submit an answer, take a look at that part of the rubric. If your response does not meet expectations for all rubric points, you will be asked to revise and resubmit your project. Make sure that your responses are detailed enough that the evaluator will be able to understand the steps you took and your thought processes as you went through the data analysis.
Once you’ve submitted your responses, your coach will take a look and may ask a few more focused follow-up questions on one or more of your answers.  

We can’t wait to see what you’ve put together for this project!



** QUESTION 1 ** 

Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]


** ANSWER **

The goal of this project is to develop a predictive model that identifies Persons of Interest (POIs) in the Enron fraud case. Supervised machine learning is useful here because an algorithm can extract patterns from the many available features of the known cases and use them to make predictions for new cases. It also allows training and testing on separate subsets of the data, which gives some confidence that the model would hold for cases outside this dataset.

The dataset provided contains financial information (14 features) and e-mail communication information (6 features) from 146 Enron employees, along with the label classifying the known POIs (18 POIs in the data set). 

*OUTLIERS*

I spotted 2 keys that needed to be removed: *TOTAL* and *THE TRAVEL AGENCY IN THE PARK*. These keys are clearly not persons and would only distort the analysis.
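The removal step can be sketched as follows. This is only an illustration: the toy `data_dict` and the helper `drop_keys` are hypothetical stand-ins, and the values are made up rather than taken from the real dataset.

```python
# Toy stand-in for the project's data dictionary (name -> feature dict).
# The values below are invented for illustration only.
data_dict = {
    "DOE JOHN": {"salary": 100000, "poi": False},
    "TOTAL": {"salary": 99999999, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "poi": False},
}

# The two keys identified above as not being persons.
NON_PERSON_KEYS = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]


def drop_keys(records, keys):
    """Return a copy of `records` without the given keys.

    Keys that are not present are silently ignored, and the input
    dictionary is left untouched.
    """
    return {name: feats for name, feats in records.items() if name not in keys}


cleaned = drop_keys(data_dict, NON_PERSON_KEYS)
print(sorted(cleaned))
```

Building a filtered copy instead of deleting in place keeps the original dictionary available for later comparison.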

I also noticed that *LOCKHART EUGENE E* has only 0s or *NaN* values, so this record should be removed as well. I opted not to remove it manually and simply let the *features_format* function do that (through its default option *remove_all_zeroes=True*).
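A record like this can also be found programmatically. The sketch below assumes that missing values are stored as the string `"NaN"` and that the label is stored under the key `poi`. The helper `has_no_data` and the toy records are hypothetical.

```python
def has_no_data(features, label_key="poi", missing="NaN"):
    """Return True when every non-label value is missing or zero."""
    return all(
        value == missing or value == 0
        for key, value in features.items()
        if key != label_key
    )


# Invented toy records; the real dataset has many more features per person.
records = {
    "DOE JOHN": {"salary": 100000, "bonus": "NaN", "poi": False},
    "EMPTY PERSON": {"salary": "NaN", "bonus": 0, "poi": False},
}

empty_records = [name for name, feats in records.items() if has_no_data(feats)]
print(empty_records)
```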

I also noticed that Kenneth Lay has an extremely large value for *total payments*, so I considered removing him or his value for total payments. However, he is a genuine data point rather than an error, and the model should be able to identify him. Therefore I left the record unchanged, but decided to keep an eye on the *total payments* feature.



** QUESTION 2** 

What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]


** ANSWER ** 

I initially counted the number of *NaN* values in each feature (broken down by POIs and non-POIs) and decided to drop any feature with less than 50% valid points. That left the following 14 fields (13 features plus the *poi* label):

['salary', 'to_messages', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'from_poi_to_this_person'].
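The 50% rule described above can be sketched as below. It assumes missing values are stored as the string `"NaN"`. The helper names and the toy records are hypothetical, with invented values.

```python
MISSING = "NaN"  # assumed marker for missing values


def valid_share(records, feature):
    """Fraction of records with a non-missing value for `feature`."""
    values = [feats.get(feature, MISSING) for feats in records.values()]
    return sum(v != MISSING for v in values) / len(values)


def valid_counts_by_poi(records, feature):
    """Count valid values separately for POIs (True) and non-POIs (False)."""
    counts = {True: 0, False: 0}
    for feats in records.values():
        if feats.get(feature, MISSING) != MISSING:
            counts[bool(feats["poi"])] += 1
    return counts


def features_to_keep(records, features, min_valid=0.5):
    """Keep features whose share of valid values is at least `min_valid`."""
    return [f for f in features if valid_share(records, f) >= min_valid]


# Invented toy records for illustration.
records = {
    "A": {"poi": True, "salary": 1, "bonus": 5, "loan_advances": "NaN"},
    "B": {"poi": False, "salary": 2, "bonus": "NaN", "loan_advances": "NaN"},
    "C": {"poi": False, "salary": 3, "bonus": 7, "loan_advances": 9},
    "D": {"poi": False, "salary": "NaN", "bonus": "NaN", "loan_advances": "NaN"},
}

kept = features_to_keep(records, ["salary", "bonus", "loan_advances"])
print(kept)
print(valid_counts_by_poi(records, "bonus"))
```

In this toy data `loan_advances` is valid for only one of four records, so it falls below the threshold and is dropped.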

After that I decided to convert all the to/from message features into shares of the total to/from messages. The rationale was that I did not expect the *to/from_messages* features to be informative on their own, and the to/from POI counts in absolute terms ignore the relative effect (the same count means more for someone who sends or receives few messages). I therefore converted them all to shares.
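A minimal sketch of this conversion follows. The choice of denominators is an assumption, since the text does not spell it out: messages sent to POIs are divided by `from_messages`, while messages received from POIs and shared receipts are divided by `to_messages`. The helpers `share` and `add_share_features` are hypothetical, and missing values are assumed to be stored as the string `"NaN"`.

```python
MISSING = "NaN"  # assumed marker for missing values


def share(part, total):
    """Return part / total, or the missing marker when it cannot be computed."""
    if part == MISSING or total == MISSING or total == 0:
        return MISSING
    return part / total


def add_share_features(record):
    """Return a copy of `record` with the three share features added."""
    out = dict(record)
    # Assumed denominators: sent messages for the "to POI" count,
    # received messages for the "from POI" and shared-receipt counts.
    out["share_from_this_person_to_poi"] = share(
        record["from_this_person_to_poi"], record["from_messages"]
    )
    out["share_from_poi_to_this_person"] = share(
        record["from_poi_to_this_person"], record["to_messages"]
    )
    out["share_shared_receipt_with_poi"] = share(
        record["shared_receipt_with_poi"], record["to_messages"]
    )
    return out


# Invented example record.
example = add_share_features({
    "to_messages": 200,
    "from_messages": 50,
    "from_this_person_to_poi": 5,
    "from_poi_to_this_person": 20,
    "shared_receipt_with_poi": 100,
})
print(example["share_from_this_person_to_poi"])
print(example["share_shared_receipt_with_poi"])
```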

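I decided not to do any further re-scaling, since most of the classifiers I planned to test are not sensitive to it. My initial plan was to test and select among Naive Bayes, SVM and Adaboost. Gaussian Naive Bayes and Adaboost (with its default decision-tree base estimator) are not affected by the scale of the features. SVM is the exception: it is sensitive to feature scale, so its results on unscaled features should be read with that caveat.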

I then plotted all variables, grouped by POI vs. non-POI.

As mentioned above, I noticed that Kenneth Lay has an extremely large outlier value for *total payments*.

From an initial visual inspection, the features *exercised_stock_options*, *salary* and *total_stock_value* appeared to be potentially good predictors.

On the other hand, *expenses* and the 3 share variables created just above (*share_shared_receipt_with_poi*, *share_from_this_person_to_poi*, *share_from_poi_to_this_person*) did not appear to be good predictors.



** QUESTION 3** 

What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]


** ANSWER **

I initially compared 4 algorithms with their default parameters:
* Naive Bayes
* SVM
* Decision Tree
* Adaboost

This was the initial performance:
GaussianNB(priors=None)
	Accuracy: 0.84707	Precision: 0.37105	Recall: 0.21150	F1: 0.26943	F2: 0.23140
	Total predictions: 15000	True positives:  423	False positives:  717	False negatives: 1577	True negatives: 12283

Got a divide by zero when trying out: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Precision or recall may be undefined due to a lack of true positive predicitons.

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.81073	Precision: 0.30370	Recall: 0.32450	F1: 0.31375	F2: 0.32011
	Total predictions: 15000	True positives:  649	False positives: 1488	False negatives: 1351	True negatives: 11512

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.85473	Precision: 0.44526	Recall: 0.36400	F1: 0.40055	F2: 0.37779
	Total predictions: 15000	True positives:  728	False positives:  907	False negatives: 1272	True negatives: 12093
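For reference, the reported scores follow directly from the confusion counts in each output line. The sketch below uses the standard definitions of accuracy, precision, recall, F1 and F2. The helper `classification_metrics` is a hypothetical name, not part of the project's testing script, and the counts plugged in are those from the Adaboost line above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    Raises ZeroDivisionError when there are no positive predictions
    (tp + fp == 0) or no actual positives (tp + fn == 0), in which case
    precision or recall is undefined.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # F-beta with beta = 1 and beta = 2, written in terms of the counts.
        "f1": 2 * tp / (2 * tp + fp + fn),
        "f2": 5 * tp / (5 * tp + 4 * fn + fp),
    }


# Confusion counts taken from the Adaboost output line.
adaboost = classification_metrics(tp=728, fp=907, fn=1272, tn=12093)
for name, value in adaboost.items():
    print(f"{name}: {value:.5f}")
```

Recall is the share of true POIs that the classifier flags, and precision is the share of flagged people who are actually POIs. F2 weights recall more heavily than precision.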


Adaboost clearly had the best performance, so I opted to drop the other algorithms and pursue tuning of Adaboost.

First I fine-tuned the features. The initial feature importances were:

* salary: 0.12
* total_payments: 0.1
* exercised_stock_options: 0.12
* bonus: 0.08
* restricted_stock: 0.12
* total_stock_value: 0.0
* expenses: 0.08
* other: 0.24
* share_shared_receipt_with_poi: 0.06
* share_from_this_person_to_poi: 0.06
* share_from_poi_to_this_person: 0.02

** QUESTION 4** 

What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]


** ANSWER **



** QUESTION 5** 

What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]


** ANSWER **

** QUESTION 6** 

Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]


** ANSWER **