# Evaluation of Naive Bayes

In this notebook, we will predict which character speaks a line of dialogue from *The Simpsons*.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

First, let's start with the code to generate a document-feature matrix:

In [2]:
df = pd.read_csv('simpsons.csv')
df = df.loc[(df['raw_character_text'] == 'Lisa Simpson') | (df['raw_character_text'] == 'Bart Simpson')]
text = df['spoken_words'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the vectorizer on the lines of dialogue, so it learns the vocabulary
docu_feat = vect.transform(text) # make a matrix
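
To see what a document-feature matrix looks like, here is a minimal sketch on three toy lines. The lines and the names `toy_lines`, `toy_vect` and `toy_matrix` are made up for illustration and are not part of the Simpsons data. Stop words are left in here so that every word stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy lines, invented for illustration (not taken from simpsons.csv)
toy_lines = ["eat my shorts", "love my saxophone", "eat eat eat"]

toy_vect = CountVectorizer()                     # no stop words here, so nothing is dropped
toy_matrix = toy_vect.fit_transform(toy_lines)   # sparse matrix: one row per line, one column per word

vocab = sorted(toy_vect.vocabulary_)  # the columns are in alphabetical order of the words
dense = toy_matrix.toarray()          # dense copy, only sensible for tiny examples

print(vocab)
print(dense)
```

Each row counts how often each word of the vocabulary occurs in that line, so the last row has a 3 in the `eat` column and zeros elsewhere.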


Now, we will use the Naive Bayes classifier from `sklearn`. First, a quick look at the filtered data.

In [29]:
df

Unnamed: 0,raw_character_text,spoken_words
1,Lisa Simpson,Where's Mr. Bergstrom?
3,Lisa Simpson,That life is worth living.
7,Bart Simpson,Victory party under the slide!
9,Lisa Simpson,Mr. Bergstrom! Mr. Bergstrom!
11,Lisa Simpson,Do you know where I could find him?
13,Lisa Simpson,"The train, how like him... traditional, yet en..."
15,Lisa Simpson,"I see he touched you, too."
17,Bart Simpson,"Hey, thanks for your vote, man."
19,Bart Simpson,"Well, you got that right. Thanks for your vote..."
21,Bart Simpson,"Well, don't sweat it. Just so long as a couple..."


In [3]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df['raw_character_text'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model X=features, y=character

#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)


0.63260726072607265
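
Because `train_test_split` shuffles the rows randomly, the score will differ a little every time the cell is run. If you want the same split on every run, one option is to pass a `random_state`. The sketch below uses toy data, and the seed 42 is an arbitrary choice, not something used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the document-feature matrix and the labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(['Bart', 'Lisa'] * 5)

# The same random_state gives the same split on every call (42 is arbitrary)
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[1].shape)  # X_test: 3 of the 10 rows end up in the test set
```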

In [17]:
y_test_p

array(['Lisa Simpson', 'Bart Simpson', 'Lisa Simpson', ..., 'Lisa Simpson',
       'Lisa Simpson', 'Bart Simpson'], 
      dtype='<U12')

The accuracy is 63.3%, which is not great considering there are only two categories. What if we guessed the same category all the time?

In [4]:
df['raw_character_text'].value_counts(normalize=True)

Bart Simpson    0.544954
Lisa Simpson    0.455046
Name: raw_character_text, dtype: float64
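
The same "always guess the biggest class" baseline can also be expressed as a model, which is handy if you want to score it on the same test set. This is a sketch with `DummyClassifier` on invented toy labels (6 Bart, 4 Lisa), not on the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels, invented for illustration: 6 lines by Bart, 4 by Lisa
y_toy = np.array(['Bart Simpson'] * 6 + ['Lisa Simpson'] * 4)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline.predict(X_toy[:1]))
print(baseline_acc)
```

The accuracy of this baseline is simply the share of the majority class.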

So we're doing about 8.8 percentage points better (63.3% versus 54.5%) than if we guessed Bart all the time. Not great, but that's to be expected with such short lines of dialogue. Let's create a confusion matrix.

In [26]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Bart', 'Lisa'], columns=['Bart pred', 'Lisa pred'])
cm

Unnamed: 0,Bart pred,Lisa pred
Bart,3235,862
Lisa,1921,1557


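Bart has more lines than Lisa, which is how I figured out the ordering of the labels (by default, `confusion_matrix` puts the labels in sorted order, with true labels in the rows and predictions in the columns). However, you can also always get them using the `.classes_` attribute of the fitted model.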

In [22]:
nb.classes_

array(['Bart Simpson', 'Lisa Simpson'], 
      dtype='<U12')
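
If you would rather not rely on the default ordering, `confusion_matrix` accepts a `labels` argument that fixes the order of rows and columns. A small sketch on invented toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels, invented for illustration
y_true = ['Bart', 'Bart', 'Lisa', 'Lisa', 'Lisa']
y_pred = ['Bart', 'Lisa', 'Lisa', 'Lisa', 'Bart']

# Rows are the true labels, columns the predictions, both in the order given by labels
cm_bart_first = confusion_matrix(y_true, y_pred, labels=['Bart', 'Lisa'])
cm_lisa_first = confusion_matrix(y_true, y_pred, labels=['Lisa', 'Bart'])

print(cm_bart_first)
print(cm_lisa_first)
```

The counts are the same in both matrices, only their positions change with the label order.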

Let's calculate precision and recall. Remember: precision is the proportion of the "Bart" predictions that is actually "Bart". Recall is the proportion of real "Bart" that is predicted as "Bart".

In [12]:
print(f"The precision for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}")
print(f"The recall for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}")
print(f"The precision for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")


The precision for Bart is: 0.6428288431061807
The recall for Bart is: 0.7800480769230769
The precision for Lisa is: 0.6379105658884052
The recall for Lisa is: 0.4720351390922401
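
As a cross-check on hand-computed values like these, `sklearn` also has `precision_score` and `recall_score`. With string labels you have to say which class counts as "positive" via `pos_label`. The labels below are invented toy data.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for illustration
y_true = ['Bart', 'Bart', 'Bart', 'Lisa', 'Lisa']
y_pred = ['Bart', 'Bart', 'Bart', 'Bart', 'Lisa']

# 4 lines are predicted as Bart, 3 of them really are Bart
bart_precision = precision_score(y_true, y_pred, pos_label='Bart')
# all 3 real Bart lines are predicted as Bart
bart_recall = recall_score(y_true, y_pred, pos_label='Bart')
# the single Lisa prediction is correct
lisa_precision = precision_score(y_true, y_pred, pos_label='Lisa')
# only 1 of the 2 real Lisa lines is predicted as Lisa
lisa_recall = recall_score(y_true, y_pred, pos_label='Lisa')

print(bart_precision, bart_recall, lisa_precision, lisa_recall)
```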


We can get the same numbers with much less code using the built-in function `classification_report`:

In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p, labels=nb.classes_)) #we pass the class labels from nb.classes_ to fix the order of the rows

              precision    recall  f1-score   support

Bart Simpson       0.64      0.78      0.70      4160
Lisa Simpson       0.64      0.47      0.54      3415

 avg / total       0.64      0.64      0.63      7575
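
The `f1-score` column is the harmonic mean of precision and recall. As a worked example, we can recompute it from the rounded values in the report above. The helper `f1` is just for illustration, and because the inputs are rounded, the result only agrees with the report to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall taken from the report above
bart_f1 = f1(0.64, 0.78)
lisa_f1 = f1(0.64, 0.47)

print(round(bart_f1, 2), round(lisa_f1, 2))
```

Lisa's F1 is pulled down by her low recall, since the harmonic mean stays close to the smaller of the two values.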



We can predict the probabilities of a line of dialogue using the method `.predict_proba`. Let's try it on a line of dialogue.

In [31]:
print(df.iloc[0,1])
print(nb.predict_proba(X[0]))

Where's Mr. Bergstrom?
[[ 0.02634007  0.97365993]]


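This gives us an array of shape 1x2 (you can tell it is two-dimensional from the double square brackets). The first column is the probability for Bart and the second for Lisa, in the order of `nb.classes_`. Let's check out more lines. To get at the individual probabilities, we need to index the array with both a row and a column.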

In [34]:
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"Line: {i}. {df.iloc[i,1]}")
    print(f"Bart: {prob[0,0]}, Lisa: {prob[0,1]}")

Line: 0. Where's Mr. Bergstrom?
Bart: 0.0263400706807323, Lisa: 0.9736599293192669
Line: 1. That life is worth living.
Bart: 0.8048921614350071, Lisa: 0.1951078385649927
Line: 2. Victory party under the slide!
Bart: 0.4878858829660796, Lisa: 0.5121141170339188
Line: 3. Mr. Bergstrom! Mr. Bergstrom!
Bart: 0.000606422812669835, Lisa: 0.9993935771873295
Line: 4. Do you know where I could find him?
Bart: 0.5531864614509548, Lisa: 0.4468135385490454
Line: 5. The train, how like him... traditional, yet environmentally sound.
Bart: 0.0866691154438012, Lisa: 0.913330884556198
Line: 6. I see he touched you, too.
Bart: 0.268789237936734, Lisa: 0.7312107620632654
Line: 7. Hey, thanks for your vote, man.
Bart: 0.9548914489348257, Lisa: 0.0451085510651731
Line: 8. Well, you got that right. Thanks for your vote, girls.
Bart: 0.8303941841995993, Lisa: 0.16960581580039805
Line: 9. Well, don't sweat it. Just so long as a couple of people did... right, Milhouse?
Bart: 0.7309511834503885, Lisa: 0.2690488

Some things the algorithm might have picked up here (for fans of *The Simpsons*):
* Lines 1, 2, 4, 6 provide too little information
* Lines 0, 3: Bergstrom is Lisa's substitute teacher
* Line 5: might be because of the complex words (and Lisa cares about the environment, too)
* Line 7: 'Hey' and especially 'man' make this a very Bart-thing to say.
* Line 8: Don't know, maybe 'girls'?
* Line 9: Milhouse is Bart's friend (though also in love with Lisa, so it's close)
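
To score a line that is not in the data, it has to go through the same fitted vectorizer first (`transform` only, no new `fit`), so that its columns match the ones the model was trained on. Here is a self-contained sketch on a tiny invented corpus; the training lines and the names `toy_vect` and `toy_nb` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set, standing in for the Simpsons lines
train_lines = ["eat my shorts man", "cowabunga man",
               "saxophone practice today", "save the environment"]
train_labels = ["Bart Simpson", "Bart Simpson", "Lisa Simpson", "Lisa Simpson"]

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(train_lines), train_labels)

# New line: only transform it, so the columns match the training vocabulary.
# Words the vectorizer has never seen ("hey") are simply ignored.
new_X = toy_vect.transform(["hey man"])
new_prob = toy_nb.predict_proba(new_X)

print(dict(zip(toy_nb.classes_, new_prob[0])))
```

Only 'man' counts here, and it occurs in Bart's toy lines but not in Lisa's, so the model leans towards Bart.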

If you want to check out which features predict the classes strongly, see the attribute `feature_log_prob_` of the Naive Bayes model.