### Items to include in submission:

#### Code/Classifier

When making your classifier, you will create three pickle files (my_dataset.pkl, my_classifier.pkl, my_feature_list.pkl). The project evaluator will test these using the tester.py script. You are encouraged to use this script before submitting to gauge if your performance is good enough. You should also include your modified poi_id.py file in case of any issues with running your code or to verify what is reported in your question responses (see next paragraph). Notably, we should be able to run poi_id.py to generate the three pickle files that reflect your final algorithm, without needing to modify the script in any way. If you have intermediate code that you would like to provide as supplemental materials, it is encouraged for you to save them in files separate from poi_id.py. If you do so, be sure to provide a readme file that explains what each file is for.
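
As a rough illustration of what "three pickle files" means in practice, the sketch below writes a classifier, a dataset, and a feature list to the file names given above using the standard `pickle` module. The function name `dump_submission` and the placeholder objects are assumptions made for this example; the starter code ships its own helper for this step, which should be preferred for a real submission.

```python
import os
import pickle
import tempfile


def dump_submission(clf, dataset, feature_list, out_dir="."):
    """Write the three pickle files named in the submission instructions."""
    outputs = {
        "my_classifier.pkl": clf,
        "my_dataset.pkl": dataset,
        "my_feature_list.pkl": feature_list,
    }
    for filename, obj in outputs.items():
        with open(os.path.join(out_dir, filename), "wb") as f:
            pickle.dump(obj, f)
    return sorted(outputs)


# Placeholder objects: a real submission would pass a fitted classifier.
example_clf = {"note": "placeholder for a fitted classifier"}
example_dataset = {"DOE JANE": {"poi": False, "salary": 100000}}
example_features = ["poi", "salary"]

out_dir = tempfile.mkdtemp()
written = dump_submission(example_clf, example_dataset, example_features, out_dir)
```

Writing to a temporary directory keeps the example from overwriting real submission files.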

#### Documentation of Your Work

Document the work you've done by answering (in about a paragraph each) the questions found here. You can write your answers in a PDF, text/markdown file, HTML, or similar format. The responses in your documentation should allow a reviewer to understand and follow the steps you took in your project and to verify your understanding of the methods you have performed.

#### Text File Listing Your References

A list of websites, books, forums, blog posts, GitHub repositories, etc. that you referred to or used in this submission (add N/A if you did not use such resources). Please carefully read the following statement and include it in your document: “I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from websites, books, forums, blog posts, GitHub repositories, etc.”

### Project Specifics
In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we've combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

The starter code can be found in the final_project directory of the codebase that you downloaded for use with the mini-projects. Some relevant files: 

**poi_id.py** : Starter code for the POI identifier; you will write your analysis here. You will also submit a version of this file for your evaluator to verify your algorithm and results.

**final_project_dataset.pkl** : The dataset for the project, more details below. 

**tester.py** : When you turn in your analysis for evaluation by Udacity, you will submit the algorithm, dataset and list of features that you use (these are created automatically in poi_id.py). The evaluator will then use this code to test your result, to make sure we see performance that’s similar to what you report. You don’t need to do anything with this code, but we provide it for transparency and for your reference. 

**emails_by_address** : This directory contains many text files, each of which contains all the messages to or from a particular email address. It is for your reference, if you want to create more advanced features based on the details of the emails dataset. You do not need to process the e-mail corpus in order to complete the project.

### Steps to Success

We will provide you with starter code that reads in the data, takes your features of choice, then puts them into a numpy array, which is the input form that most sklearn functions assume. Your job is to engineer the features, pick and tune an algorithm, and to test and evaluate your identifier. Several of the mini-projects were designed with this final project in mind, so be on the lookout for ways to use the work you’ve already done.

As preprocessing to this project, we've combined the Enron email and financial data into a dictionary, where each key-value pair in the dictionary corresponds to one person. The dictionary key is the person's name, and the value is another dictionary, which contains the names of all the features and their values for that person. The features in the data fall into three major types, namely financial features, email features and POI labels.

**financial features:** ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)

**email features:** ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'poi', 'shared_receipt_with_poi'] (units are generally number of email messages; notable exception is ‘email_address’, which is a text string)

**POI label:** [‘poi’] (boolean, represented as integer)
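
To make the nested-dictionary layout concrete, here is a minimal sketch that walks a tiny dataset of that shape and turns selected features into one numeric row per person. The two people and their values are made up, the helper name `to_rows` is hypothetical, and the assumption that missing values are stored as the string `'NaN'` is stated in the code.

```python
# Hypothetical two-person excerpt in the shape described above.
data_dict = {
    "DOE JANE": {"poi": True, "salary": 250000, "bonus": "NaN",
                 "email_address": "jane.doe@example.com"},
    "ROE RICHARD": {"poi": False, "salary": "NaN", "bonus": 50000,
                    "email_address": "NaN"},
}


def to_rows(data, feature_names):
    """Return one list of values per person, in sorted name order.

    Assumes missing values are stored as the string 'NaN'; they are
    returned as None so the caller can decide how to handle them.
    """
    rows = []
    for name in sorted(data):
        person = data[name]
        row = []
        for feature in feature_names:
            value = person.get(feature, "NaN")
            row.append(None if value == "NaN" else float(value))
        rows.append(row)
    return rows


rows = to_rows(data_dict, ["poi", "salary", "bonus"])
# rows[0] belongs to "DOE JANE": [1.0, 250000.0, None]
```

Text features such as `email_address` are left out of the row because they cannot be converted to numbers.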

You are encouraged to make, transform or rescale new features from the starter features. If you do this, you should store the new feature to my_dataset, and if you use the new feature in the final algorithm, you should also add the feature name to my_feature_list, so your evaluator can access it during testing. For a concrete example of a new feature that you could add to the dataset, refer to the lesson on Feature Selection.
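
The bookkeeping described above might look like the following sketch. The feature name `bonus_to_salary_ratio` and the two sample people are hypothetical and chosen only to show the two steps: store the new value in `my_dataset`, then add its name to `my_feature_list`.

```python
import copy

# Made-up sample in the same shape as the project dictionary.
data_dict = {
    "DOE JANE": {"poi": True, "salary": 200000, "bonus": 100000},
    "ROE RICHARD": {"poi": False, "salary": "NaN", "bonus": 50000},
}
features_list = ["poi", "salary", "bonus"]

# Work on a copy so the original dictionary stays untouched.
my_dataset = copy.deepcopy(data_dict)

NEW_FEATURE = "bonus_to_salary_ratio"  # hypothetical feature name
for person in my_dataset.values():
    salary, bonus = person["salary"], person["bonus"]
    if "NaN" in (salary, bonus) or salary == 0:
        person[NEW_FEATURE] = "NaN"  # cannot be computed for this person
    else:
        person[NEW_FEATURE] = bonus / salary

# Register the feature name so it is available at test time.
my_feature_list = features_list + [NEW_FEATURE]
```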

In addition, we advise that you keep notes as you work through the project. As part of your project submission, you will compose answers to a series of questions (given on the next page) to understand your approach towards different aspects of the analysis. Your thought process is, in many ways, more important than your final project and we will be trying to probe your thought process in these questions.

Kyle Shannon 1/18/16 - Project 5 - Enron POI Predictive Model

### Files:

**poi_id.py:** Main file where I construct my pipeline classifier: MinMax/Scaler/PCA/Classifiers etc.

**data_shape.py:** Python script to remove outliers, print useful information about the data set, and add features.

**tester.py:** File provided by Udacity to test the algorithm and print useful information about its performance.

**three pickle files:** files generated by poi_id.py which tester.py uses. The pickle files are essentially my classifier or pipeline along with the transformed data set and feature list.

**references.txt:** Text file with a list of references I used.

**ml_results.txt:** I sent output of tester.py to here so I could save all of my ML algorithm attempts.

### Understanding the Dataset and Question

#### Data Exploration & Outlier Investigation

I started by printing a range of information about the data set and noticed several data points that could be considered outliers. The obvious ones were financial outliers such as Lay and Skilling. However, it was important to keep these people because they were POIs and the data set was already heavily class-imbalanced. There were three outliers I did remove:
	
    del data_dict['TOTAL']  # Not a person, so it should be removed.
    del data_dict['THE TRAVEL AGENCY IN THE PARK']  # Not identified as a POI and had little financial data.
    del data_dict['LOCKHART EUGENE E']  # All data points were NaN and they were not a POI.
    
Other interesting information I found included:

- Number of People under Investigation: 143
- Number of Data Points: 3718 (this was after outliers were removed and I added new features, derived by: num_of_people * num_of_features)
- Number of Features: 26 (I added new features)
- Number of POIs: 18
- Percentage of data points as NaNs: 35%
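
The counts above follow from simple arithmetic on the dictionary: 143 people times 26 features gives 3718 data points. A minimal sketch of how such a summary could be computed is shown below; the function name `summarize` and the toy data are assumptions for illustration and are not the code in data_shape.py.

```python
def summarize(data):
    """Count people, features, data points, and POIs in a nested dataset."""
    feature_names = set()
    for person in data.values():
        feature_names.update(person)  # adds this person's feature names
    n_people = len(data)
    n_features = len(feature_names)
    return {
        "people": n_people,
        "features": n_features,
        "data_points": n_people * n_features,
        # Assumes 'poi' is stored as a Python boolean.
        "pois": sum(1 for person in data.values() if person.get("poi") is True),
    }


# Made-up two-person data set.
toy = {
    "DOE JANE": {"poi": True, "salary": 250000, "bonus": "NaN"},
    "ROE RICHARD": {"poi": False, "salary": "NaN", "bonus": 50000},
}
summary = summarize(toy)
```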

I also created a new dict that provided a key as feature and value as % of NaNs in data set: 

    {'to_messages': '39.86%', 'deferral_payments': '73.43%', 'expenses': '34.27%', 'poi_email_reciept_interaction': '0.00%', 'poi': '0.00%', 'deferred_income': '66.43%', 'email_address': '22.38%', 'from_poi_to_this_person': '39.86%', 'restricted_stock_deferred': '88.11%', 'shared_receipt_with_poi': '39.86%', 'loan_advances': '97.90%', 'from_messages': '39.86%', 'other': '36.36%', 'from_this_person_to_poi_fraction': '0.00%', 'director_fees': '88.81%', 'salary': '34.27%', 'bonus': '43.36%', 'total_stock_value': '12.59%', 'poi_email_interaction': '0.00%', 'from_this_person_to_poi': '39.86%', 'restricted_stock': '23.78%', 'adj_compensation': '0.00%', 'total_payments': '13.99%', 'long_term_incentive': '54.55%', 'from_poi_to_this_person_fraction': '0.00%', 'exercised_stock_options': '29.37%'}
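
A dictionary of this shape could be produced along the following lines. This is a sketch under two assumptions: missing values are stored as the string `'NaN'`, and every person carries every feature, so each percentage is taken relative to the number of people. The function name and the toy data are hypothetical.

```python
def nan_percent_by_feature(data):
    """Map each feature name to the share of people whose value is 'NaN'."""
    n_people = len(data)
    counts = {}
    for person in data.values():
        for feature, value in person.items():
            counts[feature] = counts.get(feature, 0) + (1 if value == "NaN" else 0)
    return {
        feature: "{:.2f}%".format(100.0 * count / n_people)
        for feature, count in counts.items()
    }


# Made-up four-person data set.
toy = {
    "A": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "B": {"poi": True, "salary": 1, "bonus": "NaN"},
    "C": {"poi": False, "salary": 2, "bonus": "NaN"},
    "D": {"poi": False, "salary": 3, "bonus": 4},
}
nan_table = nan_percent_by_feature(toy)
```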

The engineered features contain no NaNs, and the original email count features are each about 40% NaN. Most of the NaNs are in the financial data, especially 'restricted_stock_deferred', 'loan_advances', and 'director_fees', each of which is more than 88% NaN.

### Optimize Feature Selection/Engineering

At first I considered ways to impute values for the NaNs, for example using regression. But as I was staring at the financial PDF, it hit me: a NaN means there was no financial data to report. If 'salary' was NaN, that was because the person did not receive a salary, so I should not impute one. NaNs are therefore turned into 0. Phew...
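
A minimal sketch of that zero-fill, assuming missing values are stored as the string `'NaN'`. The helper name `fill_financial_nans` is hypothetical; the feature names are the financial features listed earlier in this document.

```python
FINANCIAL_FEATURES = (
    "salary", "deferral_payments", "total_payments", "loan_advances",
    "bonus", "restricted_stock_deferred", "deferred_income",
    "total_stock_value", "expenses", "exercised_stock_options", "other",
    "long_term_incentive", "restricted_stock", "director_fees",
)


def fill_financial_nans(data, features=FINANCIAL_FEATURES):
    """Return a copy of data with 'NaN' financial values replaced by 0.

    Non-financial features (for example 'email_address') are left alone,
    and features a person does not have are not added.
    """
    filled = {}
    for name, person in data.items():
        filled[name] = dict(person)  # shallow copy of one person's features
        for feature in features:
            if filled[name].get(feature) == "NaN":
                filled[name][feature] = 0
    return filled


# Made-up example record.
toy = {"DOE JANE": {"salary": "NaN", "bonus": 5, "email_address": "NaN"}}
filled = fill_financial_nans(toy)
```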

#### Create new features

I created 5 new features:
        
1. **'poi_email_interaction'** - My rationale here was that people who received emails from POIs probably also responded to POIs, and vice versa. So I decided to combine emails from POIs and emails to POIs into one POI interaction feature.

2. **'poi_email_reciept_interaction'** - For this feature I took poi_email_interaction and multiplied it by the number of receipts shared with POIs. My reasoning was that people who shared the greatest number of receipts with POIs knew the POIs very well and may have been involved in the fraud as well. Obviously this created huge number ranges, but I think that once I normalize or scale the data using either StandardScaler or MinMaxScaler, this feature will be helpful.

3. **'adj_compensation'** - I created this feature by adding up the financial features I thought might make up an employee's total compensation from Enron. I added: 'salary', 'total_payments', 'exercised_stock_options', 'bonus', 'long_term_incentive', and 'total_stock_value'.

4. **'from_poi_to_this_person_fraction'** - This feature and the following one are the fractions of emails from and to POIs relative to the overall emails received and sent. There may be some interesting interactions. Perhaps someone sent only a few emails to POIs but also sent few emails overall; this person would have a high ratio compared to someone who spammed emails all day and happened to send a lot to POIs as well.

5. **'from_this_person_to_poi_fraction'** - See the previous feature description above.

Example of code used to create features:
        
    # Fragment from a loop (not shown) that supplies key, value, and the feature dict v.
    compensation_keys = ('salary', 'total_payments', 'exercised_stock_options',
                         'bonus', 'long_term_incentive', 'total_stock_value')
    if key in compensation_keys and value != 'NaN':
        v['adj_compensation'] += value
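
The four email-based features described in the list above could be derived along these lines. This is a sketch of one reading of those descriptions, not the code in data_shape.py: it assumes the "from" fraction is taken over 'to_messages', the "to" fraction over 'from_messages', and that 'NaN' counts as 0. The helper names are hypothetical.

```python
def num(value):
    """Treat the string 'NaN' as 0, matching the zero-fill decision above."""
    return 0 if value == "NaN" else value


def fraction(part, whole):
    """Return part / whole, or 0.0 when the denominator is missing or zero."""
    part, whole = num(part), num(whole)
    return part / whole if whole else 0.0


def add_email_features(person):
    """Add the four email-based features to one person's feature dict."""
    from_poi = num(person["from_poi_to_this_person"])
    to_poi = num(person["from_this_person_to_poi"])
    shared = num(person["shared_receipt_with_poi"])
    person["poi_email_interaction"] = from_poi + to_poi
    person["poi_email_reciept_interaction"] = (from_poi + to_poi) * shared
    person["from_poi_to_this_person_fraction"] = fraction(
        from_poi, person["to_messages"])
    person["from_this_person_to_poi_fraction"] = fraction(
        to_poi, person["from_messages"])
    return person


# Made-up example record.
example = add_email_features({
    "from_poi_to_this_person": 10, "from_this_person_to_poi": 5,
    "shared_receipt_with_poi": 100, "to_messages": 200, "from_messages": 50,
})
```

Guarding the denominator keeps people with no recorded emails from raising a division error; they simply get a fraction of 0.0.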

#### Intelligently select features

... collinearity of features scale bak...

#### Properly scale features

MinMaxScaler etc.? Ref: article from the machine learning Python book guy's blog.
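
Because 'poi_email_reciept_interaction' is a product of counts, its range is far larger than that of the other email features. The pure-Python sketch below shows the min-max transformation on made-up values; it is the same linear rescaling to [0, 1] that scikit-learn's MinMaxScaler applies per feature with its default range. The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


interaction = [0, 1500, 6000]  # made-up feature values
scaled = min_max_scale(interaction)
```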

### Pick and Tune an Algorithm

#### Pick an algorithm 

#### Tune an algorithm

### Validate and Evaluate

#### Usage of Evaluation Metrics

#### Validation Strategy

#### Algorithm Performance

### References

1. https://www.oreilly.com/learning/handling-missing-data (imputation)
2. http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues (PCA)