#### Accuracy and error types

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [18]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()))

Number of mislabeled points out of a total 5572 points : 604


In [5]:
target.sum()


747

In [20]:
sms_raw['y_pred'] = y_pred

In [21]:
sms_raw['y_pred'].value_counts()

False    5319
True      253
Name: y_pred, dtype: int64

##### Success Rate
Now we have our model as well as our returned predictions.

The first thing to note is which data are directly comparable for model evaluation: our target and y_pred variables. **Target holds the actual outcomes, whether each message was spam or ham. The y_pred holds the predicted outcomes from our classifier.** Both are **ordered arrays** with one result for each row of the dataframe. When the two agree, our model successfully predicted whether a given message was spam or ham. When they disagree, our model was incorrect.

The **most basic measure of success, then, is how often our model was correct. This is called the accuracy.** It's a metric you've seen before as it was our method of evaluation in the past lesson, but translated from a count to a rate or percentage.

Go ahead and calculate it in the cell below. If you're stuck look back at the previous lesson. If you haven't yet, make your own copy of this notebook to work with locally so you don't lose your work.


In [23]:
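# Note: this divides the error count by the number of spam messages (604 / 747),
# so it is not the accuracy; accuracy divides by the total number of messages.
(target != y_pred).sum()/target.sum()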

0.8085676037483266

In [26]:
# Accuracy (success rate) as a percentage:
# (total rows - incorrect predictions) / total rows * 100
(data.shape[0]- ((target != y_pred).sum()))/data.shape[0]*100

89.16008614501077
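
As a cross-check on the cell above, accuracy can be computed either from paired labels or from the error count. The sketch below is plain Python; the helper name `accuracy` and the toy label lists are illustrative assumptions, while 5572 and 604 are the totals printed earlier in this notebook.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the prediction matches the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Toy labels (hypothetical): True = spam, False = ham.
toy_actual = [True, False, False, True, False]
toy_predicted = [True, False, True, False, False]
toy_accuracy = accuracy(toy_actual, toy_predicted)  # 3 of 5 match

# Same idea from counts: the totals reported by the model cell earlier.
total_points = 5572
mislabeled = 604
notebook_accuracy = (total_points - mislabeled) / total_points

print(toy_accuracy)
print(round(notebook_accuracy * 100, 2))
```

With the fitted model in memory, `(target == y_pred).mean()` should give the same rate in one step.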

Success rate is a popular way to evaluate a model, and it is what most people get excited about when discussing a model. However, for a data scientist, success rate is usually not sufficient. There are several reasons for this, but we'll mention two of them here.

Firstly, **not all errors are created equal.** Think of the situation we're currently working with: a spam filter. Are all types of errors equal here? Certainly not! If you were using this to remove messages from your inbox, letting in a spam message is not nearly as egregious as throwing out a real (and quite possibly very important) message. Knowing more about the kinds of errors you're generating can therefore be incredibly useful.
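
One way to make that asymmetry concrete is to weight each kind of error by a cost. The sketch below is only an illustration: the cost values and the two filters are made-up assumptions, not measurements from this lesson.

```python
# Hypothetical costs: losing a real message hurts more than seeing one spam.
COST_LOST_HAM = 10.0     # assumed cost of filtering out a real message
COST_MISSED_SPAM = 1.0   # assumed cost of letting one spam message through

def total_cost(lost_ham, missed_spam):
    """Weighted cost of a filter's errors under the assumed costs above."""
    return lost_ham * COST_LOST_HAM + missed_spam * COST_MISSED_SPAM

# Two made-up filters with the same number of errors (100 each),
# and therefore the same accuracy on the same data.
cost_a = total_cost(lost_ham=10, missed_spam=90)
cost_b = total_cost(lost_ham=90, missed_spam=10)

print(cost_a, cost_b)
```

Both filters would report identical accuracy, yet under these assumed costs the second is far more expensive, which is why the type of error matters.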

Secondly, **understanding how your model is failing can be key to improving it**. If a certain outcome is not being predicted accurately you may want to focus on engineering more features to identify that outcome.

##### Confusion matrix
The next level of analysis of your classifier is often something called a Confusion Matrix. This is a matrix that **shows the count of each possible combination of target and prediction.** So in our case, it will show the counts for when a message was ham and we predicted ham, when a message was ham and we predicted spam, when a message was spam and we predicted ham, and when a message was spam and we predicted spam.

scikit-learn has a built-in confusion matrix function, so let's quickly import that and generate one here.

In [27]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[4770,   55],
       [ 549,  198]], dtype=int64)

4770: ham, pred ham
55: ham, pred spam **failed to predict ham**
549: spam, pred ham **failed to predict spam**
198: spam, pred spam

The columns are the predictions and the rows are the actual values.
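
Reading the cells by position is easy to get wrong, so it can help to unpack them into named variables. A minimal sketch, using the matrix printed above as a plain nested list (the variable names are illustrative; with the NumPy array returned by scikit-learn, `confusion_matrix(target, y_pred).ravel()` yields the same four values in this order):

```python
# Rows are actual (ham, spam); columns are predicted (ham, spam).
matrix = [[4770, 55],
          [549, 198]]

# Unpack row by row: first row is actual ham, second row is actual spam.
(true_neg, false_pos), (false_neg, true_pos) = matrix

total = true_neg + false_pos + false_neg + true_pos
errors = false_pos + false_neg

print(total, errors)
```

The totals match the 5572 messages and 604 mislabeled points reported by the model cell.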

So what do we learn?

We learn the majority of our error is coming from times where we failed to identify a spam message. 549 of our 604 errors are from failing to identify spam. So we need to get a little bit better at identifying spam messages.

But before we move on or iterate on the model, let's talk about some key terms that you may run into when thinking about this kind of matrix.

Let's assume our goal is to identify spam (rather than identify ham).

Firstly, when we talk about **errors in a binary classifier** (where there are only two outcomes) we're generally referring to **two kinds of errors. A false positive is when we identify something as spam that is not.** In this case we had 55 of these. This is sometimes also called a **"Type I Error" or a "false alarm".**

A **false negative is therefore when we mistakenly identify something as not spam when it is.** We had 549 of these. This is also called a **"Type II Error" or a "miss".**
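
The naming can be sketched as a small function that labels a single message, treating spam as the positive class. The function name `error_type` and its arguments are illustrative assumptions, not part of any library.

```python
def error_type(is_spam, predicted_spam):
    """Name the outcome for one message, with spam as the positive class."""
    if predicted_spam and is_spam:
        return "true positive"
    if predicted_spam and not is_spam:
        return "false positive"   # Type I error, false alarm
    if not predicted_spam and is_spam:
        return "false negative"   # Type II error, miss
    return "true negative"

# A ham message flagged as spam, then a spam message that slipped through.
print(error_type(is_spam=False, predicted_spam=True))
print(error_type(is_spam=True, predicted_spam=False))
```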

This also brings us to a conversation of sensitivity vs specificity.

**Sensitivity is the percentage of positives correctly identified**, in our case 198/747 or 27%. This shows how good we are at catching positives, or how sensitive our model is to identifying positives.

**Specificity is just the opposite, the percentage of negatives correctly identified,** 4770/4825 or 99%.

Again this confirms that we're not great at identifying spam, though we do label ham quite accurately. You should get familiar with these terms, as in practice they will often be used with little explanation and you will be expected to understand them.
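
Both rates follow directly from the four confusion matrix counts. A minimal sketch using the counts from the matrix above (the variable names are illustrative):

```python
# Counts from the confusion matrix, with spam as the positive class.
true_neg, false_pos = 4770, 55
false_neg, true_pos = 549, 198

# Sensitivity: share of actual spam that was caught (198 / 747).
sensitivity = true_pos / (true_pos + false_neg)

# Specificity: share of actual ham that was left alone (4770 / 4825).
specificity = true_neg / (true_neg + false_pos)

print(f"sensitivity: {sensitivity:.1%}")
print(f"specificity: {specificity:.1%}")
```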

Trying to create the confusion matrix by hand:

In [28]:
target.sum()

747

In [29]:
sms_raw['y_pred'].value_counts()

False    5319
True      253
Name: y_pred, dtype: int64

In [46]:
(target != y_pred).value_counts()

False    4968
True      604
Name: spam, dtype: int64

In [32]:
target.value_counts()

False    4825
True      747
Name: spam, dtype: int64

In [45]:
# Count true positives: rows that are actually spam and were predicted spam.
# Both columns are boolean, so combine them directly rather than comparing
# against the string 'True' (which never matches and selects no rows).
(sms_raw['spam'] & sms_raw['y_pred']).sum()

198

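# From Stack Overflow: a toy multi-class example.
# Names end in _demo so the model's y_pred is not overwritten.
y_actu_demo = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred_demo = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu_demo, y_pred_demo)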
From the Data to Fish tutorial: https://datatofish.com/confusion-matrix-python/

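# Sample code from the tutorial (names chosen so that `data` and the
# scikit-learn `confusion_matrix` function above are not overwritten):
import pandas as pd

sample_data = {'y_Predicted': [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0],
               'y_Actual':    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
               }

df = pd.DataFrame(sample_data, columns=['y_Actual','y_Predicted'])
print(df)

# Apply pd.crosstab: rows are actual labels, columns are predicted labels.
sample_confusion = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
print(sample_confusion)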

In [62]:
#false = 0 = ham; true = 1 = spam
sms_raw['y_actual'] = sms_raw['spam'].map({False: 0, True: 1})

In [65]:
#false = 0 = ham; true = 1 = spam
sms_raw['y_predicted'] = sms_raw['y_pred'].map({False: 0, True: 1})

In [66]:
# Use a new name so the scikit-learn confusion_matrix function is not shadowed.
confusion_df = pd.crosstab(sms_raw['y_actual'], sms_raw['y_predicted'], rownames=['Actual'], colnames=['Predicted'])
print(confusion_df)

Predicted     0    1
Actual              
0          4770   55
1           549  198
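
The same table can be built without pandas by counting label pairs directly. A minimal sketch in plain Python: the helper name `confusion_counts` is an illustrative assumption, and the toy labels are the sample lists from the tutorial snippet above.

```python
def confusion_counts(actual, predicted):
    """Return a 2x2 nested list: rows are actual (0, 1), columns are predicted (0, 1)."""
    counts = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        counts[int(a)][int(p)] += 1
    return counts

# Toy labels: 0 = ham, 1 = spam.
toy_actual =    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
toy_predicted = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0]

toy_matrix = confusion_counts(toy_actual, toy_predicted)
for row in toy_matrix:
    print(row)
```

Because Python treats `False` and `True` as 0 and 1 under `int()`, the same helper should also accept the boolean `spam` and `y_pred` columns converted to lists.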


In [54]:
sms_raw.head()

| | spam | message | click | offer | winner | buy | free | cash | urgent | allcaps | y_pred | y_actual | y_predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | Go until jurong point, crazy.. Available only ... | False | False | False | False | False | False | False | False | False | | |
| 1 | False | Ok lar... Joking wif u oni... | False | False | False | False | False | False | False | False | False | | |
| 2 | True | Free entry in 2 a wkly comp to win FA Cup fina... | False | False | False | False | False | False | False | False | False | | |
| 3 | False | U dun say so early hor... U c already then say... | False | False | False | False | False | False | False | False | False | | |
| 4 | False | Nah I don't think he goes to usf, he lives aro... | False | False | False | False | False | False | False | False | False | | |
