## Sidenotes (definitions, code snippets, resources, etc.)
- Original in `nd_machine_learning/nd_ml_course_code/projects/boston_housing/`
- Symlinked to `intro_to_ml/ud120-projects/evaluation/` for lesson 14 in Intro to ML course.

### Python
- Note on data structure: list
    - empty list has a truth value of false
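
A minimal sketch of the note above (the variable and function names are made up for illustration). It shows that truthiness depends on whether the list has elements, not on what those elements are:

```python
empty = []
non_empty = [0]  # truthy, even though its only element is itself falsy


def describe(items):
    """Return 'empty' or 'non-empty' based on the list's truth value."""
    return "non-empty" if items else "empty"


print(bool(empty))      # False
print(bool(non_empty))  # True
print(describe(empty))  # empty
```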
    
### Further Reading
- Definition of [Coefficient of Determination](http://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination) in Stat Trek's Statistics and Probability Dictionary
- Investigate meaning of `# %%writefile new_enron_feature.py` inserted at top of edited studentMain.py module

### LaTeX
To display LaTeX equations from Python (in an IPython/Jupyter session):
```python
from IPython.display import display, Math, Latex
display(Math(
    r'F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k x} dx'))
```

# Evaluation Metrics
## Classification vs Regression Metrics
- In classification we want to see how often a model correctly or incorrectly identifies a new example, whereas in regression we are usually more interested in how far the model's prediction is from the true value.
- _Classification metrics_ include: accuracy, precision, recall, and F-score.
    - Computed on a set of data held out for testing, these metrics show which points were classified correctly and which were not.
- _Regression metrics_ include: mean absolute error and mean squared error.

## Classification Metrics
### Accuracy Score
__definition:__ 

$\text{accuracy} = \frac{\text{no. of data points labeled correctly}}{\text{all data points}} = \frac{\text{true positives} + \text{true negatives}}{\text{all data points}}$
- Accuracy is the proportion of items classified (labeled) correctly.
- Most basic and common classification metric
- Default metric used by the `clf.score()` method of sklearn classifiers
- Can also use `sklearn.metrics.`[__`accuracy_score()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
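
As a rough illustration of the definition, accuracy can be computed by hand in a few lines. This is a plain-Python sketch (written for Python 3) rather than sklearn's implementation, and the labels are made up for the example:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 6 of the 8 predictions agree with the truth.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75
```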

#### Shortcomings of accuracy measurement:
- Not good for skewed classes (i.e. most of the data under one label)
    - a model that always guesses the more common label still scores high accuracy, and the rare label has too few examples for the measurement to be trustworthy
- Not suited to particular labeling requirements, i.e. cases where you might want the model to err toward one label over the other.
    - different performance metrics can focus on different types of errors (false positives, false negatives).
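
The skewed-class problem is easy to reproduce. The sketch below (Python 3) uses an invented test set with 95 negatives and 5 positives. A "model" that always predicts the majority label scores 95% accuracy while finding none of the positives:

```python
# Hypothetical skewed test set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority label.
y_pred = [0] * len(y_true)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)         # 0.95
print(positives_found)  # 0
```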
    
### Confusion Matrices: understanding precision and recall 
- Mathematical representation of classification errors by type
- Example of Confusion Matrix analysis with Decision Tree:
![confusion matrix example](evaluation_images/confusion_matrix_example.png)
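
For a binary problem, the four cells of a confusion matrix can be counted directly. The sketch below uses made-up labels; the function name and the `positive` parameter are illustrative choices, not part of any library:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif t == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```

Precision and recall are both ratios built from these counts.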

### Precision Score
$\text{Precision(x)} = \frac{\text{data points correctly labeled as x}}{ \text{total data points predicted x}} = \frac{\text{true positives}}{ \text{true positives + false positives}}$
- in sklearn: `sklearn.metrics.`[__`precision_score()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)
- Measures confidence in model's positive predictions for a specific label.
    - i.e. a high precision relates to a low false positive rate
- Gives the probability that a positive prediction of label x for a test data point is accurate.
- Answers the question: of all the predicted positives, how many are accurate?
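
A hand-rolled version of the precision formula, as a sketch in Python 3 with invented labels. Returning 0.0 when there are no positive predictions is a choice made for this example, since the ratio is undefined in that case:

```python
def precision(y_true, y_pred, positive=1):
    """true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if p == positive and t == positive)
    predicted_positive = sum(1 for p in y_pred if p == positive)
    # Undefined when nothing is predicted positive; 0.0 is an assumption here.
    return tp / predicted_positive if predicted_positive else 0.0


# Hypothetical labels: 5 positive predictions, 3 of them correct.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
print(precision(y_true, y_pred))  # 0.6
```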

### Recall Score
$\text{Recall(x)} = \frac{\text{data points correctly labeled as x}}{ \text{total data points actually x}} = \frac{\text{true positives}}{ \text{true positives + false negatives}}$
- in sklearn: `sklearn.metrics.`[__`recall_score()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
- Measures how completely the model finds the data points that actually belong to a specific label.
    - i.e. a high recall relates to a low false negative rate
- Gives the probability that a test data point whose true label is x is predicted as x.
- Answers the question: of all the actual positives, how many did the model find?
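
The recall formula can be sketched the same way (Python 3, invented labels). The denominator counts the points that are actually positive, so each missed positive (false negative) lowers the score. Returning 0.0 when there are no actual positives is an assumption made for this example:

```python
def recall(y_true, y_pred, positive=1):
    """true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if p == positive and t == positive)
    actual_positive = sum(1 for t in y_true if t == positive)
    # Undefined when there are no actual positives; 0.0 is an assumption here.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical labels: 4 actual positives, 3 of them found by the model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
print(recall(y_true, y_pred))  # 0.75
```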

### F1 Score
$F_{1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
- in sklearn: `sklearn.metrics.`[__`f1_score()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
- Combines precision and recall relative to a specific positive class.
- Can be interpreted as the harmonic mean of precision and recall; an F1 score reaches its best value at 1 and its worst at 0.
- Precision and recall contribute equally to the F1 score.
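
A small sketch of the F1 formula, which is the harmonic mean of precision and recall. The input values are invented, and returning 0.0 when both inputs are 0 is a choice made for this example:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero; an assumption for this sketch
    return 2 * precision * recall / (precision + recall)


print(round(f1(0.6, 0.75), 3))  # 0.667
# A lopsided pair is pulled toward the lower value:
# the plain average of 1.0 and 0.1 would be 0.55.
print(round(f1(1.0, 0.1), 3))   # 0.182
```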

## Regression Metrics
- Regression scoring functions typically return values between 0 and 1, like classification metrics, which makes them easier to interpret than raw error measurements (a very poor model can score below 0).
    - Covered briefly for use in Boston Housing project
- Regression error functions measure how far a regression model's predictions are from the true values. In contrast to scoring functions, a lower value indicates better performance.

### Mean Absolute Error
- in sklearn: `sklearn.metrics.`[__`mean_absolute_error()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html)
- The average absolute distance between each predicted value and its true value (these differences are a.k.a. _residual errors_).
- The _absolute_ value of each error is used so that predictions above and below the true values do not cancel each other out (similar to the problem seen when using different methods to measure variance).
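
The cancellation problem can be seen with a few made-up numbers. In this sketch the signed residuals average to zero even though three of the four predictions are wrong, while the mean absolute error reports the typical miss:

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values chosen so that the signed errors cancel.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [4.0, 3.0, 2.0, 8.0]

residuals = [t - p for t, p in zip(y_true, y_pred)]  # [-1.0, 2.0, 0.0, -1.0]
print(sum(residuals) / len(residuals))  # 0.0
print(mae(y_true, y_pred))              # 1.0
```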

### Mean Squared Error
- in sklearn: `sklearn.metrics.`[__`mean_squared_error()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
- One of the most common metrics for measuring regression model performance.
- In contrast with absolute error, each residual error is squared.
- Advantages of squaring each error:
    - error terms are non-negative
    - larger errors are emphasized over smaller errors
    - the equation is differentiable, so minimum and maximum values can be found with calculus, which often leads to better computational efficiency.
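
To illustrate how squaring emphasizes larger errors, the sketch below compares two invented sets of predictions. Both have the same mean absolute error, but the one with a single large miss has a much higher mean squared error:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 10.0, 10.0, 10.0]
pred_even = [11.0, 9.0, 11.0, 9.0]       # four errors of size 1
pred_outlier = [10.0, 10.0, 10.0, 14.0]  # one error of size 4

print(mae(y_true, pred_even), mse(y_true, pred_even))        # 1.0 1.0
print(mae(y_true, pred_outlier), mse(y_true, pred_outlier))  # 1.0 4.0
```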


### R2 Score
- in sklearn: `sklearn.metrics.`[__`r2_score()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
- Default scoring method for regression learners in sklearn (returned by their `score()` method)
- Computes the _coefficient of determination_ of the predictions with respect to the true values.

**Definition of [Coefficient of Determination](http://stattrek.com/statistics/dictionary.aspx?definition=coefficient_of_determination):** (from Stat Trek)

- The coefficient of determination (denoted by R<sup>2</sup>) is a key output of regression analysis. 
- It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
    - The coefficient of determination is the square of the correlation (r) between predicted y scores and actual y scores; thus, it ranges from 0 to 1.
    - With linear regression, the coefficient of determination is also equal to the square of the correlation between x and y scores.

Meaning of return value:
- A model with an R<sup>2</sup> of 1 perfectly predicts the target variable, whereas a model with an R<sup>2</sup> of 0 does no better than one that always predicts the mean of the target variable.
- Any value between 0 and 1 indicates what fraction of the variance in the **target variable** can be explained by the **features** using this model.
- *A model can be given a negative R<sup>2</sup> as well, which indicates that the model is worse than one that naively predicts the mean of the target variable.*
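
The three cases can be checked with a hand-written R<sup>2</sup>, computed as one minus the ratio of the residual sum of squares to the total sum of squares. This is a sketch with invented data, not sklearn's implementation, and it assumes the true values are not all identical (otherwise the denominator is zero):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0  (perfect predictions)
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (always predicts the mean)
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than predicting the mean)
```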

Formula of the Coefficient of Determination from sklearn's [User Guide](http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score)
![formula of the coefficient of determination](evaluation_images/formula_coefficient_of_determination.png)
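
For reference, the coefficient of determination is commonly written as below. This is a transcription in standard notation, assuming $n$ samples with true values $y_i$, predictions $\hat{y}_i$, and mean true value $\bar{y}$; the indexing used in the User Guide may differ.

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$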



### Explained Variance Score
- in sklearn: `sklearn.metrics.`[__`explained_variance_score()`__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html)
- Covered in more detail later in nanodegree


## Mini-project! Applying Metrics to Your POI Identifier
[not labeled mini-project in course]

Go back to your code from the last lesson, where you built a simple first iteration of a POI identifier using a decision tree and one feature. Copy the POI identifier that you built into the skeleton code in evaluation/evaluate_poi_identifier.py. Recall that at the end of that project, your identifier had an accuracy (on the test set) of 0.724. Not too bad, right? Let’s dig into your predictions a little more carefully.

From Python 3.3 forward, the order in which dictionary keys are processed is randomized each time the code is run. This causes compatibility problems with the graders and project code, which were run under Python 2.7. To correct for this, add the following argument to the `featureFormat` call on line 25 of `evaluate_poi_identifier.py`:

`sort_keys = '../tools/python2_lesson14_keys.pkl'`

This opens a file in the tools folder that contains the Python 2 key order.

In [1]:
# Final model from L13
from evaluate_poi_identifier import *

### it's all yours from here forward!
from sklearn.cross_validation import train_test_split  # removed in scikit-learn 0.20; use sklearn.model_selection there
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "Accuracy score: ", accuracy_score(labels_test, pred)
print "No. POIs predicted in test set: ", len([x for x in pred if x == 1])
print "No. of true positives: ", (
    len([i for i, j in zip(labels_test, pred) if i == 1 and j == 1]))


ImportError: No module named evaluate_poi_identifier

As you may now see, having imbalanced classes like we have in the Enron dataset (many more non-POIs than POIs) introduces some special challenges, namely that you can just guess the more common class label for every point, not a very insightful strategy, and still get pretty good accuracy!

Precision and recall can help illuminate your performance better. Use the precision_score and recall_score available in sklearn.metrics to compute those quantities.

What’s the precision?

In [2]:
from sklearn.metrics import precision_score, recall_score
print "Precision score: ", precision_score(labels_test, pred)
print "Recall score: ", recall_score(labels_test, pred)

Precision score: 

NameError: name 'labels_test' is not defined