# Evaluation of Naive Bayes

In this notebook, we will predict which *Simpsons* character speaks a given line of dialogue.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

## Pre-processing ##


First, let's start with the code that generates a document-feature matrix.

In [3]:
df = pd.read_csv('simpsons.csv')
df = df.loc[(df['raw_character_text'] == 'Lisa Simpson') | (df['raw_character_text'] == 'Bart Simpson')]
text = df['spoken_words'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the vectorizer on the words from the dialogue text
docu_feat = vect.transform(text) # make a matrix
df.head()

    raw_character_text                         spoken_words
1         Lisa Simpson               Where's Mr. Bergstrom?
3         Lisa Simpson           That life is worth living.
7         Bart Simpson       Victory party under the slide!
9         Lisa Simpson        Mr. Bergstrom! Mr. Bergstrom!
11        Lisa Simpson  Do you know where I could find him?
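
To make the term concrete, here is a minimal pure-Python sketch of what a document-feature matrix holds: one row per line of dialogue, one column per vocabulary word, and counts in the cells. The tokenizer (lowercased runs of letters) and the names `docs`, `vocab`, and `matrix` are assumptions for illustration; `CountVectorizer` applies its own tokenization rules, removes stop words when asked, and stores the result as a sparse matrix.

```python
import re
from collections import Counter

# Three lines taken from the table above.
docs = [
    "Where's Mr. Bergstrom?",
    "Victory party under the slide!",
    "Mr. Bergstrom! Mr. Bergstrom!",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row has as many entries as there are vocabulary words, and most of them are zero, which is why the real matrix is stored in sparse form.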


## Building the model ##

Now we will use the multinomial Naive Bayes classifier from `sklearn`.

In [4]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df['raw_character_text'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model X=features, y=character
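
For intuition about what `MultinomialNB` estimates during `fit`, the following is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0`. The toy training lines, the labels, and the function names are made up for illustration, and the code ignores everything that makes the real implementation efficient.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-made training data (not from simpsons.csv).
train = [
    ("eat my shorts man", "Bart"),
    ("hey man cool", "Bart"),
    ("i love the saxophone", "Lisa"),
    ("the environment matters", "Lisa"),
]

def fit(examples, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-probabilities."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = sorted({w for text, _ in examples for w in text.split()})

    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood

def predict_proba(text, log_prior, log_likelihood):
    """Return P(class | text); words outside the vocabulary are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    # Normalize in log space (log-sum-exp) to avoid numerical underflow.
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: math.exp(s - norm) for c, s in scores.items()}

log_prior, log_likelihood = fit(train)
probs = predict_proba("hey man", log_prior, log_likelihood)
print(probs)
```

The smoothing term keeps words that never occur in one class from forcing that class's probability to zero.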

## Evaluating the model ##

In [5]:

#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.6323432343234323

The accuracy is 63.2%, which is not great considering there are only two categories. What if we guessed the same category all the time?

In [6]:
df['raw_character_text'].value_counts(normalize=True)

Bart Simpson    0.544954
Lisa Simpson    0.455046
Name: raw_character_text, dtype: float64
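
Always guessing the most frequent class is right exactly as often as that class occurs, so the shares above give the baseline directly. A small sketch of the comparison, using the numbers printed above (the variable names are illustrative, and the shares are those of the whole dataset rather than of the test set alone):

```python
# Values copied from the outputs above.
accuracy = 0.6323432343234323  # nb.score(X_test, y_test)
class_shares = {"Bart Simpson": 0.544954, "Lisa Simpson": 0.455046}

# Always guessing the most frequent class gets that class's share right.
majority_class = max(class_shares, key=class_shares.get)
baseline = class_shares[majority_class]

improvement = (accuracy - baseline) * 100  # in percentage points
print(f"Baseline ({majority_class}): {baseline:.1%}")
print(f"Model: {accuracy:.1%}, improvement: {improvement:.1f} percentage points")
```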

So we're doing about 8.7 percentage points better than if we guessed Bart all the time. Not great, but that's to be expected with such short lines of dialogue. Let's create a confusion matrix.

In [7]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Bart', 'Lisa'], columns=['Bart pred', 'Lisa pred'])
cm

      Bart pred  Lisa pred
Bart       3251        904
Lisa       1881       1539


Bart has more lines than Lisa, which is how I figured out the ordering of the labels (`confusion_matrix` sorts the labels, so 'Bart Simpson' comes first). However, you can always get the order from the `.classes_` attribute of the fitted model.

In [8]:
nb.classes_

array(['Bart Simpson', 'Lisa Simpson'], dtype='<U12')

Let's calculate precision and recall. Remember: precision is the proportion of "Bart" predictions that are actually "Bart". Recall is the proportion of actual "Bart" lines that are predicted as "Bart".

In [9]:
print(f"The precision for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}")
print(f"The recall for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}")
print(f"The precision for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")


The precision for Bart is: 0.6334762275915822
The recall for Bart is: 0.782430806257521
The precision for Lisa is: 0.6299631600491199
The recall for Lisa is: 0.45
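
The same quantities can be written as a small helper that works directly on the counts of the confusion matrix. This is a sketch using the counts shown above; the function name `precision_recall_f1` is an assumption, and F1 (the harmonic mean of precision and recall) is included as a common single-number summary of the two.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the confusion matrix above (rows = actual, columns = predicted).
bart_as_bart, bart_as_lisa = 3251, 904
lisa_as_bart, lisa_as_lisa = 1881, 1539

# For each class, false positives come from the other row and
# false negatives from the other column.
bart = precision_recall_f1(tp=bart_as_bart, fp=lisa_as_bart, fn=bart_as_lisa)
lisa = precision_recall_f1(tp=lisa_as_lisa, fp=bart_as_lisa, fn=lisa_as_bart)

print("Bart: precision={:.2f}, recall={:.2f}, f1={:.2f}".format(*bart))
print("Lisa: precision={:.2f}, recall={:.2f}, f1={:.2f}".format(*lisa))
```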


We could do the same much more concisely using the built-in function `classification_report`.

In [10]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p, labels=nb.classes_)) #pass the class labels by keyword; recent scikit-learn versions no longer accept them positionally

              precision    recall  f1-score   support

Bart Simpson       0.63      0.78      0.70      4155
Lisa Simpson       0.63      0.45      0.52      3420

    accuracy                           0.63      7575
   macro avg       0.63      0.62      0.61      7575
weighted avg       0.63      0.63      0.62      7575
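
The last two rows of the report average the per-class scores in two ways: `macro avg` is the plain mean over the classes, while `weighted avg` weights each class by its support. A minimal sketch of that arithmetic for recall, using unrounded values derived from the confusion matrix above (the variable names are illustrative):

```python
# Per-class recall and support, from the confusion matrix above.
recall = {"Bart Simpson": 3251 / 4155, "Lisa Simpson": 1539 / 3420}
support = {"Bart Simpson": 4155, "Lisa Simpson": 3420}

total = sum(support.values())

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro avg recall:    {macro_recall:.2f}")
print(f"weighted avg recall: {weighted_recall:.2f}")
```

Note that the support-weighted recall is the same number as the overall accuracy, because both count the correctly classified lines and divide by the total.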



## Predicting probabilities instead of classes ##

We can predict the class probabilities for a line of dialogue using the method `.predict_proba`. Let's try it on the first line.

In [11]:
print(df.iloc[0,1])
print(nb.predict_proba(X[0]))

Where's Mr. Bergstrom?
[[0.02823148 0.97176852]]


This gives us an array of shape 1x2: one row for the line and one column per class, in the order of `nb.classes_` (you can see the two dimensions from the double square brackets). To look at more lines, we need to index into this two-dimensional array.

In [12]:
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"Line: {i}. {df.iloc[i,1]}")
    print(f"Bart: {prob[0,0]}, Lisa: {prob[0,1]}")
    


Line: 0. Where's Mr. Bergstrom?
Bart: 0.02823148173303606, Lisa: 0.9717685182669633
Line: 1. That life is worth living.
Bart: 0.5962751907882123, Lisa: 0.4037248092117888
Line: 2. Victory party under the slide!
Bart: 0.8162185640113552, Lisa: 0.18378143598864466
Line: 3. Mr. Bergstrom! Mr. Bergstrom!
Bart: 0.0007086004249150008, Lisa: 0.9992913995750851
Line: 4. Do you know where I could find him?
Bart: 0.5373212010811312, Lisa: 0.4626787989188686
Line: 5. The train, how like him... traditional, yet environmentally sound.
Bart: 0.08794218252630742, Lisa: 0.9120578174736946
Line: 6. I see he touched you, too.
Bart: 0.5219607655971505, Lisa: 0.47803923440285034
Line: 7. Hey, thanks for your vote, man.
Bart: 0.9547850467242326, Lisa: 0.0452149532757656
Line: 8. Well, you got that right. Thanks for your vote, girls.
Bart: 0.7793934992698764, Lisa: 0.22060650073012048
Line: 9. Well, don't sweat it. Just so long as a couple of people did... right, Milhouse?
Bart: 0.8914004903805847, Lisa: 0.

Some things to note here that the algorithm might have picked up (for fans of the Simpsons):
* Lines 1, 2, 4, 6 provide too little information
* Lines 0, 3: Bergstrom is Lisa's substitute teacher
* Line 5: might be because of the complex words (and Lisa cares about the environment, too)
* Line 7: 'Hey' and especially 'man' make this a very Bart-thing to say.
* Line 8: Don't know, maybe 'girls'?
* Line 9: Milhouse is Bart's friend (though also in love with Lisa, so it's close)
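
One way to make "too little information" operational is to flag lines whose most likely class has a probability below some cut-off. The sketch below applies this to the probabilities printed above, rounded to three decimals; line 9 is left out because its output is incomplete. The cut-off of 0.6 is an arbitrary assumption, not something the model provides.

```python
# P(Bart) for lines 0-8, rounded from the output above.
p_bart = [0.028, 0.596, 0.816, 0.001, 0.537, 0.088, 0.522, 0.955, 0.779]

THRESHOLD = 0.6  # arbitrary cut-off chosen for illustration

uncertain = []
for line_no, p in enumerate(p_bart):
    confidence = max(p, 1 - p)  # probability of the predicted class
    if confidence < THRESHOLD:
        uncertain.append(line_no)

print(f"Lines with confidence below {THRESHOLD}: {uncertain}")
```

With this cut-off, lines 1, 4, and 6 are flagged, while line 2 is not: the model is fairly confident that it belongs to Bart.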

If you want to check out which features predict the classes strongly, see the attribute `feature_log_prob_` of the Naive Bayes model. It holds the log probability of each feature given each class, with one row per class.

## Finding the most defining words ##

In [14]:
nb.feature_log_prob_

array([[ -9.76409138, -10.16955648, -10.86270367, ...,  -9.25326575,
        -10.86270367, -10.86270367],
       [-10.77645316, -10.77645316, -10.77645316, ..., -10.77645316,
        -10.77645316, -10.77645316]])
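
The raw array is hard to read on its own. Row `i` of `feature_log_prob_` belongs to class `nb.classes_[i]`, so the difference between the two rows indicates which class a word leans towards. The sketch below shows the ranking step on a handful of made-up log probabilities; the words and values are hypothetical placeholders, not taken from the fitted model. In the notebook, the two rows would come from `nb.feature_log_prob_` and the words from the vectorizer's vocabulary (for instance via `vect.get_feature_names_out()` in recent scikit-learn versions).

```python
# Hypothetical log P(word | class) values, for illustration only.
words = ["man", "dude", "saxophone", "bergstrom", "school"]
log_prob_bart = [-5.2, -6.0, -10.5, -10.8, -6.1]
log_prob_lisa = [-7.9, -9.1, -7.0, -6.5, -6.0]

# Positive difference: the word is more typical of Bart; negative: of Lisa.
diffs = {w: b - l for w, b, l in zip(words, log_prob_bart, log_prob_lisa)}

# Sort from most Bart-like to most Lisa-like.
ranked = sorted(diffs, key=diffs.get, reverse=True)
most_bart = ranked[:2]
most_lisa = ranked[-2:][::-1]

print("Most Bart-like:", most_bart)
print("Most Lisa-like:", most_lisa)
```

Words with a difference close to zero, such as "school" in this toy example, say little about who is speaking.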