To evaluate the performance of a classification model, sklearn provides a function called classification_report. It returns a report of per-class scores, including precision, recall, and F1. This notebook explains what those scores mean.

I will use an example to explain this:
    
We have a corpus containing 100 sentences, including 20 ironic sentences and 80 non-ironic sentences. Our purpose is to classify the ironic and non-ironic sentences. The model predicts that there are 10 ironic sentences and 90 non-ironic sentences. In the following report, class "1" means ironic texts, and class "-1" means non-ironic texts.

<img src ='report.png'>
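As a rough sketch of how such a report is produced, the snippet below calls classification_report on made-up labels that match the counts in the example (20 ironic and 80 non-ironic sentences, with 10 predicted ironic and 90 predicted non-ironic). The split into 5 correct and 5 incorrect ironic predictions is an assumption for illustration only, so the printed scores will differ from those in the image above.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 1 = ironic, -1 = non-ironic.
# 20 ironic sentences followed by 80 non-ironic ones.
y_true = [1] * 20 + [-1] * 80

# Assumed predictions: 5 ironic sentences are found and 15 are missed,
# while 5 non-ironic sentences are wrongly flagged as ironic.
y_pred = [1] * 5 + [-1] * 15 + [1] * 5 + [-1] * 75

# Human-readable table with one row per class.
print(classification_report(y_true, y_pred))

# The same scores as a nested dict, keyed by the class label as a string.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["1"]["precision"], report["1"]["recall"])
```

The dict form is convenient when the scores need to be compared or logged rather than just read.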


First, we will introduce the concepts used to calculate the scores.

- TP (True Positive): A correctly predicted positive value, meaning that the actual class is yes and the predicted class is also yes. E.g. the actual class says the sentence is ironic and the predicted class says the same thing.

- TN (True Negative): A correctly predicted negative value, meaning that the actual class is no and the predicted class is also no. E.g. the actual class says the sentence is non-ironic and the predicted class says the same thing.
    
False positives and false negatives occur when the predicted class contradicts the actual class.
    
- FP (False Positive): The actual class is no but the predicted class is yes. E.g. the actual class says the sentence is non-ironic but the predicted class says it is ironic.

- FN (False Negative): The actual class is yes but the predicted class is no. E.g. the actual class says the sentence is ironic but the predicted class says it is non-ironic.
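The four counts can be tallied directly from paired labels. The sketch below treats ironic ("1") as the positive class and uses hypothetical labels that match the corpus sizes in the example. The individual predictions are assumed, not taken from the report.

```python
# Hypothetical labels: 1 = ironic (positive), -1 = non-ironic (negative).
y_true = [1] * 20 + [-1] * 80
y_pred = [1] * 5 + [-1] * 15 + [1] * 5 + [-1] * 75

pairs = list(zip(y_true, y_pred))

# Each count compares the actual class with the predicted class.
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == -1 and predicted == -1)
fp = sum(1 for actual, predicted in pairs if actual == -1 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == -1)

print(tp, tn, fp, fn)  # 5 75 5 15
```

Note that tp + fn is the number of truly ironic sentences (20) and tp + fp is the number of sentences predicted ironic (10).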

##### Precision 
Precision is the ratio of correctly predicted positive observations to all predicted positive observations. The question this metric answers is: of all the sentences labeled as ironic, how many are actually ironic? High precision means there are few false positives among the predicted positives. The precision of class "1" is 0.37, which means that of the 10 sentences our model identified as ironic, 37% are truly ironic. The precision of class "-1" is 0.75, which means that of the 90 sentences the model labeled as non-ironic, 75% are truly non-ironic.

Precision = TP / (TP + FP)
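As a small worked example of this formula, the snippet below uses hypothetical counts (not the ones behind the report). It also shows that the report computes precision once per class, treating each class in turn as the "positive" one.

```python
# Hypothetical counts with ironic ("1") as the positive class.
tp, fp = 5, 5      # 10 sentences predicted ironic, 5 of them correctly
precision_ironic = tp / (tp + fp)
print(precision_ironic)  # 0.5

# For class "-1", non-ironic plays the role of "positive":
# 90 sentences predicted non-ironic, 75 of them correctly.
tp_neg, fp_neg = 75, 15
precision_non_ironic = tp_neg / (tp_neg + fp_neg)
print(round(precision_non_ironic, 2))  # 0.83
```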

##### Recall
Recall is the ratio of correctly predicted positive observations to all observations whose actual class is yes. The question recall answers is: of all the sentences that are truly ironic, how many did the model label as ironic? The recall of class "1" is 0.23, which means that of the 20 truly ironic sentences in our corpus, 23% were labeled correctly. The recall of class "-1" is 0.85, which means that of the 80 truly non-ironic sentences in our corpus, 85% were labeled correctly.

Recall = TP / (TP + FN)
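The recall formula can be checked the same way. The counts below are hypothetical and chosen only to match the corpus sizes in the example (20 ironic, 80 non-ironic).

```python
# Hypothetical counts with ironic ("1") as the positive class.
tp, fn = 5, 15     # 20 truly ironic sentences, 5 of them found
recall_ironic = tp / (tp + fn)
print(recall_ironic)  # 0.25

# For class "-1": 80 truly non-ironic sentences, 75 of them found.
tp_neg, fn_neg = 75, 5
recall_non_ironic = tp_neg / (tp_neg + fn_neg)
print(recall_non_ironic)  # 0.9375
```

The denominator here is the number of sentences that actually belong to the class, whereas for precision it is the number of sentences predicted to belong to it.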

##### F1 Score
F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it’s better to look at both Precision and Recall.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
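To illustrate the formula, the sketch below plugs in hypothetical precision and recall values (not those from the report) and compares F1 with the plain arithmetic average. F1 is pulled toward the lower of the two scores, so a model cannot get a high F1 by doing well on only one of them.

```python
# Hypothetical scores for one class.
precision = 0.5
recall = 0.25

f1 = 2 * (recall * precision) / (recall + precision)
simple_average = (precision + recall) / 2

print(round(f1, 4))      # 0.3333
print(simple_average)    # 0.375
```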

Going back to the report, we can say that the model is more precise at detecting non-ironic text (precision of "-1" is 0.75, versus 0.37 for "1") and that it finds most of the non-ironic text in our corpus (recall of "-1" is 0.85).

Reference:
http://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/