# Exploring precision and recall

The goal of this second notebook is to understand precision-recall in the context of classifiers.

 * Use Amazon review data in its entirety.
 * Train a logistic regression model.
 * Explore various evaluation metrics: accuracy, confusion matrix, precision, recall.
 * Explore how various metrics can be combined to produce a cost of making an error.
 * Explore precision and recall curves.

In [None]:
from __future__ import division
import numpy as np

# Load amazon review dataset

In [None]:
import sframe
products = sframe.SFrame('amazon_baby.gl/')

# Perform text cleaning

We start by removing punctuation, so that words "cake." and "cake!" are counted as the same word.

- Write a function **remove_punctuation** that strips punctuation from a line of text
- Apply this function to every element in the **review** column of **products**, and save the result to a new column **review_clean**.

Refer to your tool's manual for string processing capabilities. Python lets us express the operation in a succinct way, as follows:

In [None]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)

**Aside**. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach to punctuation would preserve phrases such as "I'd", "would've", "hadn't" and so forth. See [this page](https://eventing.coursera.org/api/redirectStrict/soEljgG6OcsgBWLliUWFMna7QX7UIZXzBs9aTCpBXapwN4_qOufcBys-2_OZnZ1OUlmIV7QZ3RkoGcA3n6HgSg.2tMyXMxSchsxcMIkZa0TGg.VQ59UujyUbQzDQdgmfuqeRu6067Cewe6f92q4DXuG2ytCfp1z6XyJjyFUweoodOvTyFjDKslLmEE7Ayvea5sgzduvhsWAujrghlYsPP_PL2wB-I19o-2fSlmTUEpkRkOMJTKdUHBfA8GqOJVT73fCBDOs-kf7RZLuev56CPUv3TNc0q5P4LsUabNp0l_Fwrh7fcPw-sagzUpOuCdSRKmmdhqbguItmEazLTItO98w9HA1s1Q5-zrwMgrguYaNOIsA9omsTNrnxhq8Mlgd-7j4dC2KNOM2cLM6i2rHs5LvkeCb_GEbagJXelQey9fdqx8sk9nwG4Z_MmUN192PtXsExvL6vNYvMcOXwsL1AaI0PI) for an example of smart handling of punctuation.
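
As a rough illustration of the aside above, the sketch below keeps apostrophes that sit inside a word and strips all other punctuation. The helper name `remove_punctuation_keep_contractions` is hypothetical and not part of the assignment; it is one simple regex-based option, not the approach used in the linked page.

```python
import re

def remove_punctuation_keep_contractions(text):
    # Hypothetical helper, shown for illustration only.
    # Step 1: drop everything except word characters, whitespace and apostrophes.
    # Note that \w also matches the underscore, so underscores are kept.
    text = re.sub(r"[^\w\s']", "", text)
    # Step 2: drop apostrophes that are not surrounded by word characters on
    # both sides (for example, single quotes placed around a word).
    return re.sub(r"(?<!\w)'|'(?!\w)", "", text)

cleaned = remove_punctuation_keep_contractions("I'd buy this 'cake' again, hadn't you?")
print(cleaned)
```

The rest of this notebook continues to use the simpler **remove_punctuation** defined earlier.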

# Extract Sentiments

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment. In SFrame, for instance, this is a single filtering expression:

In [None]:
products = products[products['rating'] != 3]

Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the **rating** column. In SFrame, you would use apply():

In [None]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

Now, we can see that the dataset contains an extra column called **sentiment** which is either positive (+1) or negative (-1).

In [None]:
products

## Split data into training and test sets

We split the data 80-20: 80% goes into the training set and 20% into the test set.

In [None]:
train_data_s, test_data_s = products.random_split(.8, seed=1)

In [None]:
train_data = train_data_s.to_dataframe()
test_data = test_data_s.to_dataframe()

# Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as **bag-of-words features**. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

- Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
- Compute the occurrences of the words in each review and collect them into a row vector.
- Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix **train_matrix**.
- Using the same mapping between words and columns, convert the test data into a sparse matrix **test_matrix**.

The following cell uses CountVectorizer in scikit-learn. Notice the **token_pattern** argument in the constructor.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# Use this token pattern to keep single-letter words
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

Keep in mind that the test data must be transformed in the same way as the training data.
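
To make the "same mapping" requirement concrete, here is a minimal pure-Python sketch of the steps listed above. It is an illustration under simplifying assumptions: rows are dense lists rather than a sparse matrix, words are split on whitespace without lowercasing, and the helper names `learn_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter

def learn_vocabulary(train_reviews):
    # Map each distinct training word to a column index (sorted for reproducibility).
    words = sorted({word for review in train_reviews for word in review.split()})
    return {word: column for column, word in enumerate(words)}

def to_count_vector(review, vocabulary):
    # Dense row of word counts; words missing from the vocabulary are skipped.
    row = [0] * len(vocabulary)
    for word, count in Counter(review.split()).items():
        if word in vocabulary:
            row[vocabulary[word]] = count
    return row

toy_train = ["great great stroller", "bad stroller"]
toy_test = ["great crib"]   # "crib" never appears in the training reviews

vocabulary = learn_vocabulary(toy_train)
train_rows = [to_count_vector(r, vocabulary) for r in toy_train]
test_rows = [to_count_vector(r, vocabulary) for r in toy_test]
print(vocabulary)
print(train_rows)
print(test_rows)
```

The test review is encoded with the columns learned from the training reviews, so the unseen word "crib" contributes nothing to its row.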

In [None]:
vectorizer.vocabulary_

# Train a sentiment classifier with logistic regression

Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the `LogisticRegression` class (from `sklearn.linear_model`) and then call the method fit() to train the classifier. This model should use the sparse word count matrix (**train_matrix**) as features and the column **sentiment** of **train_data** as the target. Use the default values for other parameters. Call this model **model**.

In [None]:
from sklearn.linear_model import LogisticRegression 
model = LogisticRegression()
model.fit(train_matrix, train_data['sentiment'])

# Model Evaluation

We will explore the advanced model evaluation concepts that were discussed in the lectures.

## Accuracy

One performance metric we will use for our more advanced exploration is accuracy, which we have seen many times in past assignments. Recall that the accuracy is given by

$$
\text{accuracy} = \frac{\text{# correctly classified data points}}{\text{# total data points}}
$$

Compute the accuracy on the test set using your tool of choice. If you are using scikit-learn, you can use the pre-defined method accuracy_score:

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true=test_data['sentiment'], y_pred=model.predict(test_matrix))
print "Test Accuracy: %s" % accuracy
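
For intuition, the accuracy formula can also be evaluated by hand on a handful of labels. The sketch below uses made-up toy labels and a hypothetical helper `accuracy_by_hand`; it is not meant to replace accuracy_score.

```python
def accuracy_by_hand(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

toy_true = [+1, +1, -1, +1, -1]
toy_pred = [+1, -1, -1, +1, +1]
toy_accuracy = accuracy_by_hand(toy_true, toy_pred)   # 3 of 5 labels match
print(toy_accuracy)
```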

## Baseline: Majority class prediction

Recall from an earlier assignment that we used the **majority class classifier** as a baseline (i.e., reference) model for a point of comparison with a more sophisticated classifier. The majority class classifier predicts the majority class for all data points.

Typically, a good model should beat the majority class classifier. Since the majority class in this dataset is the positive class (i.e., there are more positive than negative reviews), the accuracy of the majority class classifier is simply the fraction of positive reviews in the test set:



In [None]:
baseline = len(test_data[test_data['sentiment'] == 1])/len(test_data)
print "Baseline accuracy (majority class classifier): %s" % baseline
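
The cell above relies on the positive class being the majority. A slightly more general sketch, assuming only a list of labels, finds the majority label first; `majority_class_accuracy` is a hypothetical helper and the labels are toy values.

```python
from collections import Counter

def majority_class_accuracy(labels):
    # most_common(1) returns a one-element list [(label, count)]
    # for the most frequent label.
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / float(len(labels))

toy_labels = [+1, +1, +1, -1, +1, -1, +1, +1]
toy_majority, toy_baseline = majority_class_accuracy(toy_labels)
print(toy_majority, toy_baseline)
```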

**Quiz question**: Using accuracy as the evaluation metric, was our **logistic regression model** better than the baseline (majority class classifier)?

## Confusion Matrix

The accuracy, while convenient, does not tell the whole story. For a fuller picture, we turn to the **confusion matrix**. In the case of binary classification, the confusion matrix is a 2-by-2 matrix laying out correct and incorrect predictions made in each label as follows:
```
              +---------------------------------------------+
              |                Predicted label              |
              +----------------------+----------------------+
              |          (+1)        |         (-1)         |
+-------+-----+----------------------+----------------------+
| True  |(+1) | # of true positives  | # of false negatives |
| label +-----+----------------------+----------------------+
|       |(-1) | # of false positives | # of true negatives  |
+-------+-----+----------------------+----------------------+
```
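
The four cells of the table can be tallied directly from paired true and predicted labels. The following is a minimal sketch on toy labels, with +1 assumed to be the positive class as in this notebook; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive=+1):
    # Tally each (true label, predicted label) pair into one of the four cells.
    counts = {'TP': 0, 'FN': 0, 'FP': 0, 'TN': 0}
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts['TP' if p == positive else 'FN'] += 1
        else:
            counts['FP' if p == positive else 'TN'] += 1
    return counts

toy_true = [+1, +1, -1, -1, +1, -1]
toy_pred = [+1, -1, +1, -1, +1, -1]
toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)
```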

Using your tool, print out the confusion matrix for a classifier. For instance, scikit-learn provides the method confusion_matrix for this purpose:

In [None]:
from sklearn.metrics import confusion_matrix
cmat = confusion_matrix(y_true=test_data['sentiment'],
                        y_pred=model.predict(test_matrix),
                        labels=model.classes_)    # use the same order of class as the LR model.
print ' target_label | predicted_label | count '
print '--------------+-----------------+-------'
# Print out the confusion matrix.
# NOTE: Your tool may arrange entries in a different order. Consult appropriate manuals.
for i, target_label in enumerate(model.classes_):
    for j, predicted_label in enumerate(model.classes_):
        print '{0:^13} | {1:^15} | {2:5d}'.format(target_label, predicted_label, cmat[i,j])


**IMPORTANT**. In one way or another, make sure to print out the predicted label and the true label for each and every entry of the confusion matrix. This way, we don't mistake one type of mistake for another. The cell above produces the following output:

```
target_label | predicted_label | count 
--------------+-----------------+-------
     -1       |       -1        |  3787
     -1       |        1        |  1454
      1       |       -1        |   805
      1       |        1        | 27290
```

**Quiz Question**: How many predicted values in the **test set** are **false positives**?

In [None]:
cmat[0,1]

## Computing the cost of mistakes


Put yourself in the shoes of a manufacturer that sells a baby product on Amazon.com and wants to monitor the product's reviews in order to respond to complaints. Even a few negative reviews may generate a lot of bad publicity about the product. So you don't want to miss any reviews with negative sentiments; you'd rather put up with false alarms about potentially negative reviews than miss negative reviews entirely. Because the positive class here is a positive review, a missed negative review is a false positive, while a false alarm is a false negative. In other words, **false positives cost more than false negatives**. (It may be the other way around for other scenarios, but let's stick with the manufacturer's scenario for now.)

Suppose you know the costs involved in each kind of mistake: 
1. \$100 for each false positive.
2. \$1 for each false negative.
3. Correctly classified reviews incur no cost.

**Quiz Question**: Given the stipulation, what is the cost associated with the logistic regression classifier's performance on the **test set**?

In [None]:
cmat[0,1]*100 + cmat[1,0]*1
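
The same arithmetic can be wrapped in a small function so that other cost assumptions are easy to try. This sketch plugs in the counts from the sample confusion-matrix output shown earlier (1454 false positives, 805 false negatives); your own run may produce different counts, and `cost_of_mistakes` is a hypothetical helper.

```python
def cost_of_mistakes(false_positives, false_negatives,
                     cost_per_fp=100, cost_per_fn=1):
    # Correct predictions are assumed to cost nothing.
    return false_positives * cost_per_fp + false_negatives * cost_per_fn

# Counts copied from the sample output above; substitute cmat[0,1] and cmat[1,0].
total_cost = cost_of_mistakes(false_positives=1454, false_negatives=805)
print(total_cost)
```

With these unit costs, the false positives account for almost all of the total, which is why the rest of the notebook concentrates on reducing them.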

## Precision and Recall

You may not have exact dollar amounts for each kind of mistake. Instead, you may simply prefer to reduce the percentage of false positives to be less than, say, 3.5% of all positive predictions. This is where **precision** comes in:

$$
[\text{precision}] = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all data points with positive predictions}]} = \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false positives}]}
$$

So to keep the percentage of false positives below 3.5% of positive predictions, we must raise the precision to 96.5% or higher. 

**First**, let us compute the precision of the logistic regression classifier on the **test_data**. Scikit-learn provides a predefined method for computing precision. (Consult appropriate manuals if you are using other tools.)

In [None]:
from sklearn.metrics import precision_score
precision = precision_score(y_true=test_data['sentiment'], 
                            y_pred=model.predict(test_matrix))
print "Precision on test data: %s" % precision

**Quiz Question**: Out of all reviews in the **test set** that are predicted to be positive, what fraction of them are **false positives**? (Round to the second decimal place e.g. 0.25)

In [None]:
1 - precision
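
As a cross-check, precision can be computed straight from the confusion-matrix counts. The sketch below uses the counts from the sample output shown earlier (27290 true positives, 1454 false positives), which may differ from your run; `precision_from_counts` is a hypothetical helper.

```python
def precision_from_counts(true_positives, false_positives):
    # Share of positive predictions that are actually positive.
    return true_positives / float(true_positives + false_positives)

tp, fp = 27290, 1454   # from the sample confusion-matrix output above
precision_check = precision_from_counts(tp, fp)
false_positive_share = 1 - precision_check
print(round(precision_check, 4), round(false_positive_share, 4))

# Is the share of false positives already below the 3.5% target?
print(false_positive_share < 0.035)
```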

**Quiz Question:** Based on what we learned in lecture, if we wanted to reduce this fraction of false positives to be below 3.5%, we would: (see the quiz)

A complementary metric is **recall**, which measures the ratio between the number of true positives and that of (ground-truth) positive reviews:

$$
[\text{recall}] = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all positive data points}]} = \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false negatives}]}
$$

Let us compute the recall on the **test_data**. Scikit-learn provides a predefined method for computing recall as well. (Consult appropriate manuals if you are using other tools.)

In [None]:
from sklearn.metrics import recall_score
recall = recall_score(y_true=test_data['sentiment'],
                      y_pred=model.predict(test_matrix))
print "Recall on test data: %s" % recall

In [None]:
recall_basic_classifier = recall_score(y_true=test_data['sentiment'],
                      y_pred=[1]*len(test_data))
print recall_basic_classifier

**Quiz Question**: What fraction of the positive reviews in the **test set** were correctly predicted as positive by the classifier?

**Quiz Question**: What is the recall value for a classifier that predicts **+1** for all data points in the **test_data**?
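
Recall can be cross-checked from counts in the same way. This sketch again assumes the sample confusion-matrix output shown earlier (27290 true positives, 805 false negatives) and uses a hypothetical helper `recall_from_counts`. It also illustrates why a classifier that predicts +1 everywhere has no false negatives.

```python
def recall_from_counts(true_positives, false_negatives):
    # Share of truly positive data points that were predicted positive.
    return true_positives / float(true_positives + false_negatives)

recall_check = recall_from_counts(true_positives=27290, false_negatives=805)
print(round(recall_check, 4))

# Predicting +1 for every review turns all 27290 + 805 positive reviews into
# true positives and leaves no false negatives.
recall_all_positive = recall_from_counts(true_positives=27290 + 805, false_negatives=0)
print(recall_all_positive)
```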

# Precision-recall tradeoff

In this part, we will explore the trade-off between precision and recall discussed in the lecture.  We first examine what happens when we use a different threshold value for making class predictions.  We then explore a range of threshold values and plot the associated precision-recall curve.  


## Varying the threshold

False positives are costly in our example, so we may want to be more conservative about making positive predictions. To achieve this, instead of thresholding class probabilities at 0.5, we can choose a higher threshold. 

Write a function called `apply_threshold` that accepts two things
* `probabilities` (an array of probability values, such as the NumPy array returned by predict_proba())
* `threshold` (a float between 0 and 1).

The function should return an array, where each element is set to +1 if the corresponding probability is greater than or equal to `threshold` and to -1 otherwise.

In [None]:
def apply_threshold(probabilities, threshold):
    ### YOUR CODE GOES HERE
    # +1 if >= threshold and -1 otherwise.
    return np.array([+1 if p>=threshold else -1 for p in probabilities]) 

Using the **model** you trained, compute the class probability values P(y=+1|x,w) for the data points in the **test_data**. Then use thresholds set at 0.5 (default) and 0.9 to make predictions from these probability values.

Note. If you are using scikit-learn, make sure to use the **predict_proba()** function, not decision_function(). Also, note that predict_proba() returns the probability values for both classes -1 and +1, in the order given by **model.classes_**. So make sure to extract the second column, which corresponds to the class +1.

In [None]:
probabilities = model.predict_proba(test_matrix)[:,1]

In [None]:
predictions_with_default_threshold = apply_threshold(probabilities, 0.5)
predictions_with_high_threshold = apply_threshold(probabilities, 0.9)

In [None]:
print "Number of positive predicted reviews (threshold = 0.5): %s" % (predictions_with_default_threshold == 1).sum()
print "Number of positive predicted reviews (threshold = 0.9): %s" % (predictions_with_high_threshold == 1).sum()

**Quiz Question**: What happens to the number of positive predicted reviews as the threshold is increased from 0.5 to 0.9?

## Exploring the associated precision and recall as the threshold varies

By changing the probability threshold, it is possible to influence precision and recall. Compute precision and recall for threshold values 0.5 and 0.9.

In [None]:
# Threshold = 0.5
precision_with_default_threshold = precision_score(test_data['sentiment'],
                                        predictions_with_default_threshold)

recall_with_default_threshold = recall_score(test_data['sentiment'],
                                        predictions_with_default_threshold)

# Threshold = 0.9
precision_with_high_threshold = precision_score(test_data['sentiment'],
                                        predictions_with_high_threshold)
recall_with_high_threshold = recall_score(test_data['sentiment'],
                                        predictions_with_high_threshold)

In [None]:
print "Precision (threshold = 0.5): %s" % precision_with_default_threshold
print "Recall (threshold = 0.5)   : %s" % recall_with_default_threshold

In [None]:
print "Precision (threshold = 0.9): %s" % precision_with_high_threshold
print "Recall (threshold = 0.9)   : %s" % recall_with_high_threshold

**Quiz Question (variant 1)**: Does the **precision** increase with a higher threshold?

**Quiz Question (variant 2)**: Does the **recall** increase with a higher threshold?
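
The effect of the threshold is easy to see on a handful of made-up probabilities. The sketch below is a toy illustration only: the labels and probabilities are invented, and `apply_threshold_list` and `precision_recall` are hypothetical pure-Python stand-ins for the notebook's functions.

```python
def apply_threshold_list(probabilities, threshold):
    # +1 if the probability is at least the threshold, -1 otherwise.
    return [+1 if p >= threshold else -1 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    # Assumes at least one positive prediction and one positive label.
    return tp / float(tp + fp), tp / float(tp + fn)

toy_true  = [+1,   +1,   +1,   -1,   +1,   -1,   -1]
toy_probs = [0.97, 0.92, 0.80, 0.85, 0.60, 0.55, 0.20]

results = {}
for threshold in (0.5, 0.9):
    preds = apply_threshold_list(toy_probs, threshold)
    results[threshold] = (preds.count(+1),) + precision_recall(toy_true, preds)
    print(threshold, results[threshold])
```

On this toy data the higher threshold yields fewer positive predictions, with higher precision and lower recall.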

## Precision-recall curve

Now, we will explore a range of threshold values, compute the precision and recall scores for each, and then plot the precision-recall curve. Use 100 equally spaced values between 0.5 and 1. In Python, we run

In [None]:
threshold_values = np.linspace(0.5, 1, num=100)
print threshold_values

For each of the values of threshold, we compute the precision and recall scores.

In [None]:
precision_all = []
recall_all = []

print model.classes_
probabilities = model.predict_proba(test_matrix)[:,1]
# print probabilities
for threshold in threshold_values:
    predictions = apply_threshold(probabilities, threshold)
    
    precision = precision_score(test_data['sentiment'], predictions)
    recall = recall_score(test_data['sentiment'], predictions)
    
    precision_all.append(precision)
    recall_all.append(recall)

For each of the values of threshold, we first obtain class predictions using that threshold and then compute the precision and recall scores. Save the precision scores and recall scores to lists **precision_all** and **recall_all**, respectively.
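
One corner case is worth keeping in mind: at a threshold very close to 1, there may be no positive predictions at all, in which case precision is undefined (0 divided by 0). To my knowledge scikit-learn's precision_score warns and reports 0 in that situation by default, so treat the last points of the sweep with care. The pure-Python sketch below, on invented toy data with a hypothetical helper `sweep_thresholds`, marks such cases with `None` instead.

```python
def sweep_thresholds(y_true, probabilities, thresholds):
    rows = []
    for threshold in thresholds:
        preds = [+1 if p >= threshold else -1 for p in probabilities]
        tp = sum(1 for t, p in zip(y_true, preds) if t == +1 and p == +1)
        fp = sum(1 for t, p in zip(y_true, preds) if t == -1 and p == +1)
        fn = sum(1 for t, p in zip(y_true, preds) if t == +1 and p == -1)
        # Precision is undefined when nothing is predicted positive;
        # recall is undefined when there are no positive labels.
        precision = tp / float(tp + fp) if (tp + fp) > 0 else None
        recall = tp / float(tp + fn) if (tp + fn) > 0 else None
        rows.append((threshold, precision, recall))
    return rows

toy_true  = [+1, +1, -1, +1, -1]
toy_probs = [0.99, 0.90, 0.70, 0.60, 0.30]
toy_rows = sweep_thresholds(toy_true, toy_probs, [0.5, 0.75, 1.0])
for row in toy_rows:
    print(row)
```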

Let's plot the precision-recall curve to visualize the precision-recall tradeoff as we vary the threshold. Implement the function **plot_pr_curve** that generates a connected scatter plot from the lists of precision and recall scores. The function would be implemented in matplotlib as follows; for other tools, consult appropriate manuals.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
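def plot_pr_curve(precision, recall, title):
    # Plot recall (y-axis) against precision (x-axis) as a connected line.
    plt.rcParams['figure.figsize'] = 7, 5
    plt.locator_params(axis = 'x', nbins = 5)
    # Specify the colour once: the original 'b-' format string conflicted with color=.
    plt.plot(precision, recall, '-', linewidth=4.0, color = '#B0017F')
    plt.title(title)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.rcParams.update({'font.size': 16})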

In [None]:
# print recall_all

In [None]:
# print precision_all

Once the function plot_pr_curve is complete, plot the precision-recall curve for the test set by running

In [None]:
plot_pr_curve(precision_all, recall_all, 'Precision recall curve (all)')

You should obtain a connected scatter plot that looks like this figure:

**Quiz Question**: Among all the threshold values tried, what is the **smallest** threshold value that achieves a precision of 96.5% or better? Round your answer to 3 decimal places.

In [None]:
zip(threshold_values, precision_all)
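
Rather than reading the answer off the printed pairs, the first qualifying threshold can be picked out programmatically. This is a sketch on invented values, assuming the thresholds are sorted in increasing order; `smallest_threshold_reaching` is a hypothetical helper.

```python
def smallest_threshold_reaching(thresholds, precisions, target):
    # Thresholds are assumed to be sorted in increasing order, so the first
    # match is the smallest threshold that reaches the target precision.
    for threshold, precision in zip(thresholds, precisions):
        if precision is not None and precision >= target:
            return threshold
    return None   # no threshold reaches the target

toy_thresholds = [0.50, 0.60, 0.70, 0.80, 0.90]
toy_precisions = [0.940, 0.950, 0.966, 0.960, 0.980]
best = smallest_threshold_reaching(toy_thresholds, toy_precisions, 0.965)
print(best)
```

In the notebook the same helper could be called with `threshold_values` and `precision_all`. Because precision need not increase monotonically with the threshold, the helper returns the first threshold that qualifies, not the last one that fails.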

**Quiz Question**: Using `threshold` = 0.98, how many **false negatives** do we get on the **test_data**?

In [None]:
cmat_high = confusion_matrix(y_true=test_data['sentiment'],
                        y_pred=apply_threshold(probabilities, 0.98),
                        labels=model.classes_)

In [None]:
cmat_high.tolist()

In [None]:
print ' target_label | predicted_label | count '
print '--------------+-----------------+-------'
# Print out the confusion matrix.
# NOTE: Your tool may arrange entries in a different order. Consult appropriate manuals.
for i, target_label in enumerate(model.classes_):
    for j, predicted_label in enumerate(model.classes_):
        print '{0:^13} | {1:^15} | {2:5d}'.format(target_label, predicted_label, cmat_high[i,j])

The count in the row with true label +1 and predicted label -1 is the number of false negatives (i.e., the number of positive reviews we would look at unnecessarily) that we have to deal with using this classifier.

# Evaluating specific search terms

So far, we looked at the number of false positives for the **entire test set**. In this section, let's select reviews using a specific search term and optimize the precision on these reviews only. After all, a manufacturer would be interested in tuning the false positive rate just for their products (the reviews they want to read) rather than that of the entire set of products on Amazon.

## Precision-Recall on all baby related items

From the **test set**, select all the reviews for products whose **name** contains the word 'baby' (ignoring case).

In [None]:
# na=False treats products with a missing name as non-matches
baby_reviews = test_data[test_data['name'].str.contains('baby', case=False, na=False)]

In [None]:
baby_reviews

In [None]:
baby_reviews_matrix = vectorizer.transform(baby_reviews['review_clean'])
# Inspect the shape instead of calling .toarray(), which would build a large dense array
baby_reviews_matrix.shape

Now, let's predict the probability of classifying these reviews as positive:

In [None]:
probabilities = model.predict_proba(baby_reviews_matrix)
probabilities

Let's plot the precision-recall curve for the **baby_reviews** dataset.

**First**, let's consider the following `threshold_values` ranging from 0.5 to 1:

In [None]:
threshold_values = np.linspace(0.5, 1, num=100)

**Second**, as we did above, let's compute precision and recall for each value in `threshold_values` on the **baby_reviews** dataset.  Complete the code block below.

In [None]:
precision_all = []
recall_all = []

for threshold in threshold_values:
    
    # Make predictions. Use the `apply_threshold` function 
    ## YOUR CODE HERE 
    predictions = apply_threshold(probabilities[:,1], threshold)

    # Calculate the precision.
    # YOUR CODE HERE
    precision = precision_score(baby_reviews['sentiment'], predictions)
    
    # YOUR CODE HERE
    recall = recall_score(baby_reviews['sentiment'], predictions)
    
    # Append the precision and recall scores.
    precision_all.append(precision)
    recall_all.append(recall)

**Quiz Question**: Among all the threshold values tried, what is the **smallest** threshold value that achieves a precision of 96.5% or better for the reviews in **baby_reviews**? Round your answer to 3 decimal places.

In [None]:
zip(threshold_values, precision_all)

**Quiz Question:** Is this threshold value smaller or larger than the threshold used for the entire dataset to achieve the same specified precision of 96.5%?

**Finally**, let's plot the precision recall curve.

In [None]:
plot_pr_curve(precision_all, recall_all, "Precision-Recall (Baby)")