# Enron Submission Free-Response Questions

A critical part of machine learning is making sense of your analysis process and communicating it to others. The questions below will help us understand your decision-making process and allow us to give feedback on your project. Please answer each question; your answers should be about 1-2 paragraphs per question. If you find yourself writing much more than that, take a step back and see if you can simplify your response!

When your evaluator looks at your responses, he or she will use a specific list of rubric items to assess your answers. Here is the link to that rubric: [Link] Each question has one or more specific rubric items associated with it, so before you submit an answer, take a look at that part of the rubric. If your response does not meet expectations for all rubric points, you will be asked to revise and resubmit your project. Make sure that your responses are detailed enough that the evaluator will be able to understand the steps you took and your thought processes as you went through the data analysis.
Once you’ve submitted your responses, your coach will take a look and may ask a few more focused follow-up questions on one or more of your answers.  

We can’t wait to see what you’ve put together for this project!



** QUESTION 1 ** 

Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]


** ANSWER **

The goal of this project is to develop a predictive model that identifies Persons of Interest (POIs) in the Enron fraud case. Supervised machine learning is useful here because an algorithm can extract patterns from the many available features of the labeled cases and use them to make predictions for new cases. Training and testing on separate subsets of the data also gives some confidence that the model would hold for cases outside this dataset.

The dataset provided contains financial information (14 features) and e-mail communication information (6 features) from 146 Enron employees, along with the label classifying the known POIs (18 POIs in the data set). 

*OUTLIERS*

I spotted 3 keys that needed to be removed. Two of them, *TOTAL* and *THE TRAVEL AGENCY IN THE PARK*, are clearly not persons and would only distort the analysis.

The third, *LOCKHART EUGENE E*, has only 0s or *NaN* values, so I removed him as well. I did this manually, although it would not be strictly necessary when using the *features_format* function, since by default (option *remove_all_zeroes=True*) this person would be dropped. However, for consistency and for the initial analysis, it is cleaner to remove him explicitly.

I also noticed that Kenneth Lay has an extremely large value for *total payments*, so I considered removing him or that value. However, he is a genuine data point rather than an error, and the model should be able to identify him. Therefore I did not remove him.
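
The removal steps above can be sketched as follows. This is a minimal illustration on a toy dictionary: the variable `data_dict`, the helper `is_empty_record` and all feature values are assumptions made for the example, not the project's actual data or code.

```python
# Toy stand-in for the project's data dictionary (person name -> feature dict).
# All values are illustrative only.
data_dict = {
    "LAY KENNETH L": {"salary": 1000, "total_payments": 500000, "poi": True},
    "TOTAL": {"salary": 9999, "total_payments": 999999, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "total_payments": 362, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "total_payments": "NaN", "poi": False},
}


def is_empty_record(features):
    """Return True when every feature except the label is 0 or the string 'NaN'."""
    return all(value in (0, "NaN") for key, value in features.items() if key != "poi")


# Keys that are not persons are removed by name.
for key in ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]:
    data_dict.pop(key, None)

# Records without any information are detected and removed.
empty_keys = [name for name, feats in data_dict.items() if is_empty_record(feats)]
for key in empty_keys:
    data_dict.pop(key)

print(sorted(data_dict))
```

Large but genuine values, such as the one for Kenneth Lay, are left in place by this sketch.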

** QUESTION 2** 

What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]


** ANSWER ** 

I tried and compared two processes. 

*First Process*

I initially listed the number of *NaN* values in each feature (broken down by POIs and non-POIs) and decided to drop any feature that had fewer than 50% valid points. That left the following 14 entries (13 features plus the *poi* label):

['poi','salary', 'to_messages', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'from_poi_to_this_person'].
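
As a rough sketch of this filtering step, the snippet below counts *NaN* values per feature and keeps the features with at least 50% valid points. The helper names (`valid_fraction`, `nan_counts_by_label`, `features_to_keep`) and the toy records are hypothetical; missing values are assumed to be stored as the string `'NaN'`.

```python
def valid_fraction(data, feature):
    """Share of records whose value for `feature` is not the string 'NaN'."""
    values = [record.get(feature, "NaN") for record in data.values()]
    return sum(value != "NaN" for value in values) / len(values)


def nan_counts_by_label(data, feature):
    """Number of 'NaN' values for `feature`, split by POI (True) and non-POI (False)."""
    counts = {True: 0, False: 0}
    for record in data.values():
        if record.get(feature, "NaN") == "NaN":
            counts[bool(record["poi"])] += 1
    return counts


def features_to_keep(data, candidates, min_valid=0.5):
    """Keep the candidate features with at least `min_valid` valid points."""
    return [f for f in candidates if valid_fraction(data, f) >= min_valid]


# Toy records, for illustration only.
toy = {
    "A": {"poi": True, "salary": 100, "bonus": 10, "loan_advances": "NaN"},
    "B": {"poi": False, "salary": 200, "bonus": "NaN", "loan_advances": "NaN"},
    "C": {"poi": False, "salary": "NaN", "bonus": 30, "loan_advances": 5},
    "D": {"poi": False, "salary": 400, "bonus": "NaN", "loan_advances": "NaN"},
}

kept = features_to_keep(toy, ["salary", "bonus", "loan_advances"])
print(kept, nan_counts_by_label(toy, "loan_advances"))
```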

After that I decided to convert all the POI-related to/from message features into shares of the total to/from messages. The rationale was that I did not expect *to_messages* and *from_messages* to be informative on their own, while the POI-related counts in absolute terms ignore how many messages each person exchanged overall. Therefore I re-scaled them to shares and removed the absolute counts.

The updated list with 12 entries (11 features plus *poi*) was:
['poi', 'salary', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'total_stock_value', 'expenses', 'other', 'share_shared_receipt_with_poi', 'share_from_this_person_to_poi', 'share_from_poi_to_this_person']
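
A share feature of this kind could be computed as in the sketch below. The pairing of each count with a denominator (`from_messages` or `to_messages`) is an assumption made for illustration, as are the helper names `share` and `add_share_features`.

```python
def share(numerator, denominator):
    """Return numerator / denominator, or 'NaN' if either is missing or the total is 0."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return "NaN"
    return numerator / denominator


# Assumed mapping: new feature -> (absolute count, total used as denominator).
SHARE_SPECS = {
    "share_from_this_person_to_poi": ("from_this_person_to_poi", "from_messages"),
    "share_from_poi_to_this_person": ("from_poi_to_this_person", "to_messages"),
    "share_shared_receipt_with_poi": ("shared_receipt_with_poi", "to_messages"),
}


def add_share_features(record):
    """Add the share features to one person's feature dict and return it."""
    for new_name, (count_name, total_name) in SHARE_SPECS.items():
        record[new_name] = share(record.get(count_name, "NaN"), record.get(total_name, "NaN"))
    return record


# Toy record, for illustration only.
person = add_share_features({
    "from_this_person_to_poi": 10,
    "from_messages": 40,
    "from_poi_to_this_person": 5,
    "to_messages": 100,
    "shared_receipt_with_poi": "NaN",
})
print(person["share_from_this_person_to_poi"])
```

Missing inputs stay missing, so a person without e-mail data does not receive an artificial share of 0.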

I decided not to scale the features, since I was not planning to use a classifier that is sensitive to scaling. My initial plan was to test and select among Naive Bayes, SVM, Decision Tree and AdaBoost; of these, only SVM is sensitive to feature scaling. I therefore decided to revisit scaling only if I chose to investigate SVM further, which was not the case.

After tuning and training the classifier, the performance in *tester.py* was:
    
    Accuracy: 0.87187	Precision: 0.52664	Recall: 0.38550	F1: 0.44515	F2: 0.40733

I compared this with the performance after manually removing the 3 features with near 0 importance, but performance dropped:

    Accuracy: 0.86840	Precision: 0.50866	Recall: 0.38200	F1: 0.43632	F2: 0.40202


*Second Process*

In the second process I used recursive feature elimination (RFE) to select features. I started from all features, including the 3 additional features I computed, and used RFE to select the 11 best (since there were 11 features + poi in the process above).
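
For readers unfamiliar with RFE, a minimal sketch with scikit-learn follows. It runs on synthetic data with hypothetical feature names, and the choice of a decision tree as the ranking estimator is an assumption; the estimator actually used in the project is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 143 rows and 20 hypothetical features (not the Enron data).
X, y = make_classification(n_samples=143, n_features=20, n_informative=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# RFE repeatedly fits the estimator and discards the least important feature
# until only n_features_to_select features remain.
selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=11)
selector.fit(X, y)

# support_ is a boolean mask over the input features.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selected)
```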

I tested this selection with *tester.py* and obtained clearly superior performance. I noticed again that 3 features had near 0 importance, so I repeated the procedure to select the 8 best features.

With the list:

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'from_poi_to_this_person', 'share_shared_receipt_with_poi', 'share_from_this_person_to_poi']

Performance improved (slightly) again and was:

    Accuracy: 0.87907	Precision: 0.56522	Recall: 0.40300	F1: 0.47052	F2: 0.42754

I noticed again that 1 feature had near 0 importance. I tried removing one more feature via RFE, but performance dropped.

Lastly I manually removed the feature with near 0 importance (*from_poi_to_this_person*) and compared the performance.

With the list:

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'share_shared_receipt_with_poi', 'share_from_this_person_to_poi']

Performance improved and was:

    Accuracy: 0.88780	Precision: 0.61444	Recall: 0.42550	F1: 0.50281	F2: 0.45338

*Final selection*

I compared the best-performing classifiers from the two processes above and selected the better one: the 8 best features selected via RFE, with the feature with near 0 importance removed.

**List:**

['poi', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'other', 'share_shared_receipt_with_poi', 'share_from_this_person_to_poi']

**Performance:**

    Accuracy: 0.88780	Precision: 0.61444	Recall: 0.42550	F1: 0.50281	F2: 0.45338
    
**Feature importances:**
* exercised_stock_options: 0.13
* restricted_stock: 0.07
* shared_receipt_with_poi: 0.07
* expenses: 0.2
* other: 0.13
* share_shared_receipt_with_poi: 0.2
* share_from_this_person_to_poi: 0.2






** QUESTION 3** 

What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]


** ANSWER **

I initially compared 4 algorithms with their default parameters:
* Naive Bayes
* SVM
* Decision Tree
* Adaboost

These were the initial performances:

GaussianNB(priors=None)

	Accuracy: 0.84707	Precision: 0.37105	Recall: 0.21150	F1: 0.26943	F2: 0.23140    
	Total predictions: 15000	True positives:  423	False positives:  717	False negatives: 1577	True negatives: 12283

Got a divide by zero when trying out: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Precision or recall may be undefined due to a lack of true positive predicitons.

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
            
	Accuracy: 0.81073	Precision: 0.30370	Recall: 0.32450	F1: 0.31375	F2: 0.32011
	Total predictions: 15000	True positives:  649	False positives: 1488	False negatives: 1351	True negatives: 11512

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
          
	Accuracy: 0.85473	Precision: 0.44526	Recall: 0.36400	F1: 0.40055	F2: 0.37779
	Total predictions: 15000	True positives:  728	False positives:  907	False negatives: 1272	True negatives: 12093


AdaBoost clearly had the best performance, so I dropped the other algorithms and went on to tune AdaBoost.

In the final model, as indicated in *QUESTION 2*, performance was:

    Accuracy: 0.88780	Precision: 0.61444	Recall: 0.42550	F1: 0.50281	F2: 0.45338


** QUESTION 4** 

What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]


** ANSWER **

Tuning the parameters of an algorithm is the *art* of choosing the parameter values that let the model reach its best performance. If you do not do this well, your model will underperform: it may fit the data too loosely or too closely, and you will get worse results than the algorithm could deliver.

In this case I optimized AdaBoost using sklearn's built-in GridSearchCV. In particular, I ran the grid search over

*learning_rate*

[0.1, 0.2, 0.3, 0.5, 0.7, **0.9**, 1, 2,3, 5, 10] 

and *n_estimators*

[1,5,8,10,12, 14, **15**, 16,50,100,1000, 2000]. 

Moreover, I used *F1* as the scoring metric and 10 folds for the stratified cross-validation in the grid search.

The results indicated **0.9** and **15** as the best parameters for *learning_rate* and *n_estimators*, respectively.

I obtained this tuning using the initial selection of features. Later, with my final selection, I tried to re-tune, but the performance did not improve, therefore I kept the choices above.
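
The grid search described above could look roughly like the sketch below. It runs on synthetic data (18 positives out of 143 rows, mirroring the class balance only) and uses a reduced grid so that it finishes quickly; the data, the random seeds and the shortened value lists are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data with the same class balance as the cleaned dataset.
rng = np.random.default_rng(42)
y = np.array([1] * 18 + [0] * 125)
X = rng.normal(size=(143, 7)) + 1.5 * y[:, None]  # shift the positive class

# Reduced grid; the full lists of values are given in the text above.
param_grid = {
    "learning_rate": [0.5, 0.9, 1.0],
    "n_estimators": [10, 15, 16],
}

# F1 as the scoring metric, with 10 stratified folds.
search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=10),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The best combination found on synthetic data says nothing about the Enron data; the sketch only shows the mechanics.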

** QUESTION 5** 

What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]


** ANSWER **

Validation is the process of verifying that your model works well (i.e., has good performance) outside the training set. The classic mistake is to overfit, for example by evaluating the model on the same data used to train it. The model then has very good performance on the training set, but performs poorly once tested on out-of-sample or new data. In other words, the model knows the training set very well but is not capable of handling new situations.

In my case I used 2 approaches for cross-validation. First and foremost, I used the *tester.py* code to measure performance. This code applies a stratified shuffle split with many re-samplings: it repeatedly picks a random subset of the data within strata (i.e., guaranteeing that POIs are included) for training and tests performance on the left-over data.
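
The idea of repeated stratified random splits can be sketched in plain Python as below. This is not the code in *tester.py* nor scikit-learn's implementation; the function name `stratified_shuffle_splits`, the test fraction and the seed are assumptions chosen for the illustration.

```python
import random


def stratified_shuffle_splits(labels, n_splits, test_fraction, seed=42):
    """Yield (train_idx, test_idx) pairs in which every class appears in the test part."""
    rng = random.Random(seed)
    # Group the row indices by class label (the strata).
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_splits):
        train, test = [], []
        # Shuffle each stratum separately and cut off its share for testing.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)


# 18 POIs and 125 non-POIs, as in the cleaned dataset.
labels = [1] * 18 + [0] * 125
splits = list(stratified_shuffle_splits(labels, n_splits=5, test_fraction=0.1))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Because each stratum is split separately, no test set ends up without POIs, which would make precision and recall undefined for that split.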

The other, related approach was the cross-validation built into GridSearchCV. While doing the grid search, this function applies stratified K-fold cross-validation, for which I selected 10 folds. It chooses the best parameters by dividing the sample into stratified folds, training on one part and computing performance on the held-out part.

** QUESTION 6** 

Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]


** ANSWER **

The final model had the performance:

    Accuracy: 0.88780	Precision: 0.61444	Recall: 0.42550	F1: 0.50281	F2: 0.45338
	Total predictions: 15000	True positives:  851	False positives:  534	False negatives: 1149	True negatives: 12466
    
*Accuracy*:

Accuracy is the rate at which the model gives right predictions, i.e., it compares the predicted label with the real label. Here it means that the model is right 88.8% of the time. It is a general performance metric that is quite informative when the labels are close to an even split. In this case, however, there are only 18 POIs in the 143 valid observations. That means, for example, that if you classified all persons as non-POIs, your accuracy would be 87.4%. This model does better than that!
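
The baseline figure follows directly from the class counts. A short check, using only numbers reported in this document:

```python
n_valid = 143  # observations left after removing the 3 keys
n_poi = 18     # known POIs

# Accuracy of a "model" that labels everyone as non-POI.
baseline_accuracy = (n_valid - n_poi) / n_valid
model_accuracy = 0.88780  # accuracy reported by tester.py for the final model

print(f"baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}")
```

Such a baseline never identifies a single POI, which is why accuracy alone is a weak guide for this dataset.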

*Precision*:

Precision is the rate at which POI identifications are correct, i.e., once a person is classified as a POI, this is correct 61% of the time. This gives an indication of how much "trust" you can place in a positive result from the model. It is the ratio of true positives to all positive classifications.

*Recall*:

Recall is the rate at which POIs are "caught" by the model, i.e., when a person is a POI, the model correctly labels her as a POI about 43% of the time. This gives an indication of the power of the model. It is the ratio of true positives to all actual POIs (true positives plus false negatives).

*F1*:

F1 combines Precision and Recall. It is their harmonic mean, a way of averaging the two that is high only when both are reasonably high.
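
As a cross-check, the metrics can be recomputed from the confusion counts reported for the final model. The helper `classification_metrics` is a hypothetical name introduced for this sketch; the counts are the ones printed by *tester.py* above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and F2 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f_beta(beta):
        # Weighted harmonic mean of precision and recall; beta > 1 favors recall.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }


metrics = classification_metrics(tp=851, fp=534, fn=1149, tn=12466)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

The values agree with the reported Accuracy, Precision, Recall, F1 and F2 to five decimal places.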
