# Accuracy Measures for Supervised Machine Learning

Previously we used a single overall accuracy measure to evaluate our machine learning algorithm. It is also helpful to calculate more fine-grained scores for each category. We can do this using precision, recall, and the F1 score.

The following text is taken from [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall).

In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).

In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly) whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C).
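
To make the point about a score of 1.0 concrete, here is a minimal sketch on made-up labels (not the poem data used later). The helper name `precision_recall` is hypothetical and written only for this illustration. The classifier labels a single item as "review" and gets it right, so precision is perfect even though most true "review" items are missed.

```python
# Toy example with made-up labels (not the poem data).
true_labels = ["review", "review", "review", "review", "random", "random"]
pred_labels = ["review", "random", "random", "random", "random", "random"]

def precision_recall(true, pred, positive):
    """Compute precision and recall for one class from paired label lists."""
    pairs = list(zip(true, pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

precision, recall = precision_recall(true_labels, pred_labels, "review")
print(precision, recall)  # prints: 1.0 0.25
```

Every item labeled "review" really is a review (precision 1.0), but only one of the four actual reviews was found (recall 0.25).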

Usually, precision and recall scores are not discussed in isolation. Instead, either values for one measure are compared for a fixed level at the other measure (e.g. precision at a recall level of 0.75) or both are combined into a single measure. Examples for measures that are a combination of precision and recall are the F-measure (the weighted harmonic mean of precision and recall).


The equations for Precision, Recall, and the F1 Score:

![alt text](https://www.safaribooksonline.com/library/view/python-data-analysis/9781785282287/graphics/B04223_10_02.jpg)
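
In case the image does not load, the same equations can be written out from the definitions quoted above. Here TP, FP, and FN are assumed to denote the counts of true positives, false positives, and false negatives for a single class.

```latex
\text{Precision} = \frac{TP}{TP + FP}
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```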


Finally, you can print out the *confusion matrix* to see exactly which labels are most "confusing" for the algorithm, or how many documents are being misclassified for each label.

### Key Terms

* *precision*
    * the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class).
* *recall*
    * Recall is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).
* *F1 score*
    * The harmonic mean of precision and recall.
* [*Confusion Matrix*](https://en.wikipedia.org/wiki/Confusion_matrix)
    * In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).
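
As a rough illustration of the confusion matrix definition, the sketch below builds one by hand from made-up labels (not the poem data). It assumes the convention where rows are actual classes and columns are predicted classes.

```python
from collections import Counter

# Made-up labels for illustration only.
true_labels = ["review", "review", "review", "random", "random", "random"]
pred_labels = ["review", "random", "review", "random", "random", "review"]
labels = ["review", "random"]

# Count how often each (actual, predicted) pair occurs.
pair_counts = Counter(zip(true_labels, pred_labels))

# Rows are actual classes, columns are predicted classes.
matrix = [[pair_counts[(actual, predicted)] for predicted in labels]
          for actual in labels]

for actual, row in zip(labels, matrix):
    print(actual, row)
```

The diagonal holds the correctly labeled items; the off-diagonal cells show which class was mistaken for which.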
    

### 0. Reading in and pre-processing data

First reproduce the classifier introduced on 3.22:

In [1]:
#first, import the necessary modules
import pandas
import numpy as np
#scikit-learn is a huge library. We import what we need.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC #Linear Support Vector Classifier
from sklearn.naive_bayes import MultinomialNB #Naive Bayes classifier
from sklearn.neighbors import KNeighborsClassifier #nearest neighbors classifier
from sklearn.metrics import accuracy_score #to assess the accuracy of the algorithm
#to compute cross validation for assessment purposes
#(the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import cross_val_score



In [2]:
#read our texts and turn them into lists
import os
review_path = '../data/poems/reviewed/'
random_path = '../data/poems/random/'
review_files = os.listdir(review_path)
random_files = os.listdir(random_path)

review_texts = [open(review_path+file_name).read() for file_name in review_files]
random_texts = [open(random_path+file_name).read() for file_name in random_files]

review_texts[0] #notice the strange output here. These poems are saved in a bag of words format

"the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the

In [3]:
#transform and concat these lists into a Pandas dataframe
df1 = pandas.DataFrame(review_texts, columns = ['body'])
df1['label'] = "review"
df2 = pandas.DataFrame(random_texts, columns = ['body'])
df2['label'] = "random"
df = pandas.concat([df1,df2])
df

|     | body | label |
| --- | --- | --- |
| 0 | the the the the the the the the the the the th... | review |
| 1 | the the the the the the the the the the the th... | review |
| 2 | the the the the the the the the the the the th... | review |
| 3 | the the the the the the the the the the the th... | review |
| 4 | the the the the the the the the the the the th... | review |
| 5 | the the the the the the the the the the the th... | review |
| 6 | the the the the the the the the the the the th... | review |
| 7 | the the the the the the the the the the the th... | review |
| 8 | the the the the the the the the the the the th... | review |
| 9 | the the the the the the the the the the the th... | review |


In [4]:
##EX: Output some summary statistics for this dataframe. How many poems with the review label, and how many with the random label?
##What is the total number of words in each category? What is the average number of words per poem in each category?

print(df['label'].value_counts())

random    360
review    360
Name: label, dtype: int64


In [5]:
df['tokens'] = df['body'].str.split()
df['tokens'] = df['tokens'].str.len()
grouped = df.groupby('label')
print(grouped['tokens'].sum())
print(grouped['tokens'].mean())

label
random    7069809
review    8260352
Name: tokens, dtype: int64
label
random    19638.358333
review    22945.422222
Name: tokens, dtype: float64


### 1. Divide data into training and test sets

First we need to create a training set and a test set. We'll train on the first 500 poems, and test the accuracy on the rest.
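
Slicing a shuffled dataframe, as done below, usually leaves the two classes slightly unbalanced across the splits. As an alternative sketch (not the approach this notebook uses), scikit-learn's `train_test_split` can shuffle and split in one call, and its `stratify` option keeps the class proportions the same in both parts. The texts and labels here are made-up stand-ins for the poem data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the poem texts and their labels.
texts = ["poem %d" % i for i in range(20)]
labels = ["review"] * 10 + ["random"] * 10

# Hold out 30% of the items; stratify keeps both classes equally represented.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0
)

print(Counter(train_labels))
print(Counter(test_labels))
```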

In [7]:
#randomize our rows
df = df.sample(720, random_state=0)
df

|     | body | label | tokens |
| --- | --- | --- | --- |
| 176 | the the the the the the the the the the the th... | review | 11264 |
| 143 | the the the the the the the the the the the th... | review | 7241 |
| 52 | the the the the the the the the the the the th... | random | 22119 |
| 55 | the the the the the the the the the the the th... | random | 14812 |
| 146 | the the the the the the the the the the the th... | review | 9181 |
| 322 | the the the the the the the the the the the th... | random | 8759 |
| 130 | the the the the the the the the the the the th... | review | 3446 |
| 318 | the the the the the the the the the the the th... | random | 1438 |
| 72 | the the the the the the the the the the the th... | random | 8509 |
| 68 | the the the the the the the the the the the th... | review | 32233 |


In [8]:
#create two new dataframes
df_train = df[:500]
df_test = df[500:]
print(df_test['label'].value_counts())
df_train['label'].value_counts()

review    112
random    108
Name: label, dtype: int64


random    252
review    248
Name: label, dtype: int64

### 2. Supervised Machine Learning Classification

Next we need to create a document-term matrix (DTM) from the poems, and an array containing the classification label for each poem (for us, this column is called 'label').

In [9]:
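#transform the 'body' column into a document term matrix
#two vectorizers are defined, but only countvec is used below
#(with binary=True it records whether a word is present, not how often it occurs)
tfidfvec = TfidfVectorizer(stop_words = 'english', min_df = 1, binary=True)
countvec = CountVectorizer(stop_words = 'english', min_df = 1, binary=True)

#learn the vocabulary from the training set only, then apply it to the test set
training_dtm_tf = countvec.fit_transform(df_train.body)
test_dtm_tf = countvec.transform(df_test.body)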

#create an array for labels
training_labels = df_train.label
test_labels = df_test.label
test_labels.value_counts()

review    112
random    108
Name: label, dtype: int64

In [10]:
#define a container for our chosen algorithm, in this case multinomial naive bayes
nb = MultinomialNB()

#fit a model on our training set
nb.fit(training_dtm_tf, training_labels)

#predict the labels on the test set using the trained model
predictions_nb = nb.predict(test_dtm_tf) 
predictions_nb

array(['random', 'review', 'review', 'review', 'review', 'review',
       'review', 'random', 'random', 'random', 'random', 'review',
       'review', 'review', 'review', 'random', 'random', 'review',
       'review', 'random', 'random', 'review', 'review', 'random',
       'review', 'random', 'random', 'random', 'review', 'review',
       'review', 'review', 'random', 'random', 'review', 'review',
       'random', 'review', 'review', 'review', 'review', 'random',
       'review', 'review', 'random', 'review', 'random', 'review',
       'review', 'review', 'random', 'random', 'random', 'random',
       'random', 'review', 'review', 'review', 'review', 'random',
       'random', 'review', 'review', 'random', 'random', 'random',
       'review', 'review', 'review', 'random', 'random', 'random',
       'review', 'review', 'review', 'review', 'random', 'random',
       'review', 'random', 'review', 'review', 'random', 'review',
       'random', 'random', 'review', 'review', 'review', 'rand

### 3. Accuracy, Precision, Recall, and F1

We can use the built-in function "accuracy_score" to calculate the accuracy of our classifier. For binary and multiclass classification, which is our case, this function returns the fraction of test items whose predicted label exactly matches the true label. It compares the set of predicted labels (the labels the algorithm assigned to the test set) to the true labels for the test set.

In [11]:
accuracy_score(test_labels, predictions_nb) #argument order is (true labels, predicted labels)

0.74090909090909096
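
For single-label data like ours, accuracy is simply the share of items whose predicted label equals the true label. A minimal sketch on made-up labels follows; the helper `accuracy` is hypothetical and only mirrors that definition, it is not scikit-learn's implementation.

```python
def accuracy(true, pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(true, pred) if t == p)
    return matches / len(true)

# Made-up labels: three of the four predictions are correct.
true_labels = ["review", "random", "review", "random"]
pred_labels = ["review", "review", "review", "random"]

print(accuracy(true_labels, pred_labels))  # prints: 0.75
```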

We can also import precision, recall, and F1 to calculate these measures.

The syntax here is (test label array, predicted array, options...). We will set the 'average' option to None, so we can see the precision and recall for each category, and we'll set the labels option to our label array, so we know which score applies to which label.

In [13]:
#import from sklearn.metrics
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

In [14]:
#precision
precision_score(test_labels, predictions_nb, labels=['random', 'review'], average=None)

array([ 0.76842105,  0.72      ])

In [15]:
recall_score(test_labels, predictions_nb, labels=['random', 'review'], average=None)

array([ 0.67592593,  0.80357143])

In [16]:
f1_score(test_labels, predictions_nb, labels=['random', 'review'], average=None)

array([ 0.71921182,  0.75949367])
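
If calling the three functions separately feels repetitive, scikit-learn also provides `classification_report`, which returns precision, recall, and F1 for each label as one formatted string. The sketch below uses made-up labels rather than the poem predictions.

```python
from sklearn.metrics import classification_report

# Made-up true and predicted labels for illustration only.
true_labels = ["review", "review", "random", "random", "review", "random"]
pred_labels = ["review", "random", "random", "random", "review", "review"]

# labels fixes the row order of the report.
report = classification_report(true_labels, pred_labels, labels=["random", "review"])
print(report)
```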

### 4. Confusion Matrix

Finally, we can print out the confusion matrix to see if the algorithm is confusing two categories.

In [17]:
confusion_matrix(test_labels, predictions_nb, labels=['random', 'review'])

array([[73, 35],
       [22, 90]])
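
The scores reported earlier can be recovered from this matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, and copies the four counts from the output above.

```python
labels = ["random", "review"]
# Counts copied from the confusion matrix output: rows = true, columns = predicted.
cm = [[73, 35],
      [22, 90]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

scores = {}
for i, label in enumerate(labels):
    tp = cm[i][i]                                      # correctly labeled as this class
    fp = sum(cm[r][i] for r in range(len(labels))) - tp  # other class labeled as this class
    fn = sum(cm[i]) - tp                               # this class labeled as the other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    scores[label] = (precision, recall, f1)
    print(label, round(precision, 4), round(recall, 4), round(f1, 4))

print("accuracy", round(accuracy, 4))
```

Reading the off-diagonal cells: 35 "random" poems were labeled "review", and 22 "review" poems were labeled "random".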