# Simpsons: evaluation

In this notebook, we will predict which character speaks a given line of dialogue from *The Simpsons*.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

## Pre-processing ##


First, let's generate a document-feature matrix from the lines spoken by Lisa and Bart.

In [2]:
df = pd.read_csv('simpsons.csv')
df = df.loc[df['raw_character_text'].isin(['Lisa Simpson', 'Bart Simpson'])] # keep only the lines of Lisa and Bart
text = df['spoken_words'].values.astype('U') # take the text from the df and convert it to Unicode strings
vect = CountVectorizer(stop_words='english') # create the CountVectorizer object, with English stop words removed
vect = vect.fit(text) # learn the vocabulary from the lines of dialogue
docu_feat = vect.transform(text) # build the document-feature matrix
df.head()

Unnamed: 0,raw_character_text,spoken_words
1,Lisa Simpson,Where's Mr. Bergstrom?
3,Lisa Simpson,That life is worth living.
7,Bart Simpson,Victory party under the slide!
9,Lisa Simpson,Mr. Bergstrom! Mr. Bergstrom!
11,Lisa Simpson,Do you know where I could find him?
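
To make the idea of a document-feature matrix concrete, here is a minimal sketch on three made-up lines (they are not taken from `simpsons.csv`, and the names `toy_lines`, `toy_vect` and `toy_matrix` are only for illustration). Each row is a line of dialogue, each column is a word, and each cell counts how often the word occurs in the line.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up lines, used only for illustration.
toy_lines = ["Eat my shorts", "Eat your vegetables", "My shorts are the best"]

toy_vect = CountVectorizer()                     # no stop-word removal, so every token stays visible
toy_matrix = toy_vect.fit_transform(toy_lines)   # sparse matrix: one row per line, one column per word

print(sorted(toy_vect.vocabulary_))              # the column labels, in column order
print(toy_matrix.toarray())                      # dense view of the word counts
```

The real matrix in this notebook works the same way, only with thousands of rows and columns, which is why it is stored in sparse form.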


## Building the model ##

Now, we will use the Naïve Bayes classifier from `sklearn`.

In [3]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df['raw_character_text'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model X=features, y=character
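
The split above is random and not seeded, so the numbers in this notebook can shift slightly from run to run. If you want reproducible results, one option is to pass `random_state`, and optionally `stratify` to keep the Bart/Lisa proportions equal in both parts. The sketch below uses made-up data (`X_toy`, `y_toy`) and is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 6 of class "Bart" and 4 of class "Lisa".
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["Bart"] * 6 + ["Lisa"] * 4)

# random_state fixes the shuffle; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)
print((y_te == "Bart").sum(), (y_te == "Lisa").sum())
```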

## Evaluating the model ##

In [4]:

#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.642904290429043

The accuracy is 64.3%, which is not great considering there are only two categories. What if we guessed the same category all the time?

In [5]:
df['raw_character_text'].value_counts(normalize=True)

Bart Simpson    0.544954
Lisa Simpson    0.455046
Name: raw_character_text, dtype: float64
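
The same "always guess the biggest class" baseline can also be expressed as a model, using scikit-learn's `DummyClassifier`. This is a sketch on made-up labels (`y_toy`) with roughly the same imbalance as our data, not a rerun on the real data set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the same imbalance as the data (55% Bart).
y_toy = np.array(["Bart Simpson"] * 11 + ["Lisa Simpson"] * 9)
X_toy = np.zeros((len(y_toy), 1))     # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
print(baseline.predict(X_toy[:3]))    # always the majority class
print(baseline.score(X_toy, y_toy))   # accuracy equals the share of the majority class
```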

Always guessing Bart would be right about 54.5% of the time, so we're doing about 9.8 percentage points better than that baseline. Not great, but that's to be expected with such short lines of dialogue. Let's create a confusion matrix.

In [6]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Bart', 'Lisa'], columns=['Bart pred', 'Lisa pred'])
cm

Unnamed: 0,Bart pred,Lisa pred
Bart,3267,911
Lisa,1794,1603


`confusion_matrix` puts the labels in sorted order, so "Bart Simpson" comes before "Lisa Simpson". The row totals agree with this: Bart has more lines than Lisa. You can always check the order using the `.classes_` attribute of the fitted model.

In [7]:
nb.classes_

array(['Bart Simpson', 'Lisa Simpson'], dtype='<U12')

Let's calculate precision and recall. Remember: precision is the proportion of the "Bart" predictions that is actually "Bart". Recall is the proportion of real "Bart" that is predicted as "Bart".

In [8]:
print(f"The precision for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}") # true Bart / everything predicted as Bart (column 0 of the confusion matrix)
print(f"The recall for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}") # true Bart / all real Bart lines (row 0)
print(f"The precision for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")


The precision for Bart is: 0.6455245998814464
The recall for Bart is: 0.7819530876017233
The precision for Lisa is: 0.637629276054097
The recall for Lisa is: 0.47188695908154255
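
The four calculations follow one pattern: the diagonal of the confusion matrix divided by the column totals gives precision, and divided by the row totals gives recall. As a sketch, here is the same computation for both classes at once with NumPy, using the counts from the confusion matrix shown earlier (the name `cm_counts` is only for illustration).

```python
import numpy as np

# Counts copied from the confusion matrix (rows = true class, columns = predicted class).
cm_counts = np.array([[3267, 911],
                      [1794, 1603]])

correct = np.diag(cm_counts)                 # correctly classified lines per class
precision = correct / cm_counts.sum(axis=0)  # divide by column totals (all predictions of a class)
recall = correct / cm_counts.sum(axis=1)     # divide by row totals (all true lines of a class)
accuracy = correct.sum() / cm_counts.sum()   # share of all lines that were classified correctly

for name, p, r in zip(["Bart", "Lisa"], precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```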


We can get the same numbers more concisely using the built-in function `classification_report`.

In [13]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

Bart Simpson       0.65      0.78      0.71      4178
Lisa Simpson       0.64      0.47      0.54      3397

    accuracy                           0.64      7575
   macro avg       0.64      0.63      0.62      7575
weighted avg       0.64      0.64      0.63      7575



* The precision for "Bart Simpson" is 0.65, which means that out of 100 lines that are predicted to be Bart's, only about 65 are actually Bart's.
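
The report also has an `f1-score` column, which combines precision and recall into one number: their harmonic mean. As a small sketch, we can recompute it from the precision and recall values printed earlier in this notebook (the dictionary names are only for illustration).

```python
# Precision and recall per character, as printed earlier in this notebook.
scores = {
    "Bart Simpson": (0.6455245998814464, 0.7819530876017233),
    "Lisa Simpson": (0.637629276054097, 0.47188695908154255),
}

f1_scores = {}
for name, (precision, recall) in scores.items():
    # F1 is the harmonic mean of precision and recall.
    f1_scores[name] = 2 * precision * recall / (precision + recall)
    print(f"{name}: f1 = {f1_scores[name]:.2f}")
```

Because the harmonic mean is pulled towards the lower of the two values, Lisa's low recall drags her F1 score down more than a simple average would.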

## Predicting probabilities instead of classes ##

Instead of a single predicted class, we can get the probability of each class for a line of dialogue using the method `.predict_proba`. Let's try it on a line of dialogue.

In [18]:
print(df.iloc[0,1])
print(nb.predict_proba(X[0]))

Where's Mr. Bergstrom?
[[0.02764282 0.97235718]]


This gives us a two-dimensional array of shape 1x2 (you can see this from the double square brackets): one row for the line, and one column per class, in the order of `nb.classes_`. Let's check out more lines. To get the individual numbers, we index into the array with a row and a column, as in `prob[0,0]`.

In [12]:
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"Line: {i}. {df.iloc[i,1]}")
    print(f"Bart: {prob[0,0]}, Lisa: {prob[0,1]}")
    


Line: 0. Where's Mr. Bergstrom?
Bart: 0.02823148173303606, Lisa: 0.9717685182669633
Line: 1. That life is worth living.
Bart: 0.5962751907882123, Lisa: 0.4037248092117888
Line: 2. Victory party under the slide!
Bart: 0.8162185640113552, Lisa: 0.18378143598864466
Line: 3. Mr. Bergstrom! Mr. Bergstrom!
Bart: 0.0007086004249150008, Lisa: 0.9992913995750851
Line: 4. Do you know where I could find him?
Bart: 0.5373212010811312, Lisa: 0.4626787989188686
Line: 5. The train, how like him... traditional, yet environmentally sound.
Bart: 0.08794218252630742, Lisa: 0.9120578174736946
Line: 6. I see he touched you, too.
Bart: 0.5219607655971505, Lisa: 0.47803923440285034
Line: 7. Hey, thanks for your vote, man.
Bart: 0.9547850467242326, Lisa: 0.0452149532757656
Line: 8. Well, you got that right. Thanks for your vote, girls.
Bart: 0.7793934992698764, Lisa: 0.22060650073012048
Line: 9. Well, don't sweat it. Just so long as a couple of people did... right, Milhouse?
Bart: 0.8914004903805847, Lisa: 0.

Some things to note here that the algorithm might have picked up (for fans of the Simpsons):
* Lines 1, 2, 4, 6 provide too little information
* Lines 0, 3: Bergstrom is Lisa's substitute teacher
* Line 5: might be because of the complex words (and Lisa cares about the environment, too)
* Line 7: 'Hey' and especially 'man' make this a very Bart-thing to say.
* Line 8: Don't know, maybe 'girls'?
* Line 9: Milhouse is Bart's friend and they regularly speak to each other
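
To see how these probabilities relate to the class predictions we used before, here is a sketch on a tiny made-up data set (the lines, labels and `toy_` names are invented for illustration). The predicted class is simply the class with the highest probability in each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training lines, used only for illustration.
toy_lines = ["eat my shorts dude", "ay caramba dude", "I love school", "school is neat"]
toy_labels = ["Bart", "Bart", "Lisa", "Lisa"]

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(toy_lines), toy_labels)

new_X = toy_vect.transform(["dude school is neat", "eat my shorts"])
proba = toy_nb.predict_proba(new_X)          # shape (2, 2): one row per line, one column per class

# The predicted class is the one with the highest probability in each row.
by_hand = toy_nb.classes_[np.argmax(proba, axis=1)]
print(proba.round(3))
print(by_hand, toy_nb.predict(new_X))
```

Working with the probabilities rather than the classes lets you see how confident the model is, for example to set aside lines where both probabilities are close to 0.5.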

**This is where the exercise ends. The rest is extra material.**


## Finding the most predictive features ##

If you want to check out which words/features predict the classes strongly, see the attribute `feature_log_prob_` of the Naive Bayes model. This array contains the (modelled) probability of each word, given that the line is Bart's or Lisa's. Since any single word accounts for only a tiny share of all the words a character says, these probabilities tend to be very small. That's why they are stored as logarithms.

In [50]:
nb.feature_log_prob_

array([[ -9.76115564, -10.16662075, -10.85976793, ...,  -9.25033002,
        -10.85976793, -10.85976793],
       [-10.09127332, -10.7844205 ,  -8.83851035, ..., -10.7844205 ,
        -10.7844205 , -10.09127332]])

In [51]:
probs = nb.feature_log_prob_ # the log probabilities from the fitted model
probs_df = pd.DataFrame(probs) # a dataframe is more readable; the first row is for Bart, the second for Lisa, and the columns are the words
probs_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14248,14249,14250,14251,14252,14253,14254,14255,14256,14257
0,-9.761156,-10.166621,-10.859768,-10.166621,-10.859768,-10.166621,-10.859768,-9.761156,-10.166621,-10.859768,...,-10.859768,-10.859768,-10.859768,-10.166621,-10.166621,-10.859768,-10.166621,-9.25033,-10.859768,-10.859768
1,-10.091273,-10.784421,-8.83851,-10.784421,-10.091273,-10.784421,-10.784421,-10.784421,-10.784421,-10.091273,...,-9.685808,-10.784421,-9.398126,-10.784421,-10.784421,-10.091273,-10.784421,-10.784421,-10.784421,-10.091273


Let's transpose (switch rows and columns) so that it's easier to read. Let's also add labels for Bart & Lisa and the word labels. We'll get the latter from the method *.get_feature_names_out()* of our CountVectorizer object *vect* (in scikit-learn versions before 1.0, this method is called *.get_feature_names()*).

In [52]:
probs_df = probs_df.transpose()
probs_df.index = vect.get_feature_names_out()
probs_df.columns = ["Bart", "Lisa"]
probs_df

Unnamed: 0,Bart,Lisa
000,-9.761156,-10.091273
007,-10.166621,-10.784421
10,-10.859768,-8.838510
1000,-10.166621,-10.784421
10201,-10.859768,-10.091273
...,...,...
zur,-10.859768,-10.091273
zz,-10.166621,-10.784421
zzzapp,-9.250330,-10.784421
ãªtre,-10.859768,-10.784421


Now let's sort in descending order to find the most probable words for each character. First for Bart:

In [54]:
probs_df = probs_df.sort_values("Bart", ascending=False)
probs_df.head(20)

Unnamed: 0,Bart,Lisa
dad,-4.390518,-4.083689
,-4.6021,-4.540254
hey,-4.679751,-5.678475
don,-4.766198,-4.725297
oh,-4.784422,-4.854831
just,-4.817135,-4.748939
ll,-4.932842,-5.174949
like,-5.001835,-5.025519
mom,-5.039685,-4.706778
ve,-5.139456,-5.182302


In [55]:
probs_df = probs_df.sort_values("Lisa", ascending=False)
probs_df.head(20)

Unnamed: 0,Bart,Lisa
dad,-4.390518,-4.083689
bart,-6.205808,-4.470872
,-4.6021,-4.540254
mom,-5.039685,-4.706778
don,-4.766198,-4.725297
just,-4.817135,-4.748939
oh,-4.784422,-4.854831
like,-5.001835,-5.025519
know,-5.155985,-5.12146
ll,-4.932842,-5.174949


Note that these have some overlap. That's because these words occur in both of their lines. If we want to know the most *distinctive* features, we have to subtract the numbers from each other<sup>*</sup>. Let's do that and sort.

\* <sub>For the mathematically inclined: you might be wondering why we subtract rather than divide (as Naive Bayes is based on the ratio of the two probabilities). First of all, well spotted! The reason is that we're using log probabilities. A division is a subtraction in log units. If this is way over your head, don't worry - it's not that important.</sub>

In [56]:
probs_df["diff"] = probs_df["Bart"] - probs_df["Lisa"]
probs_df = probs_df.sort_values("diff")
probs_df.head(20)

Unnamed: 0,Bart,Lisa,diff
lu,-10.859768,-8.011832,-2.847936
public,-10.859768,-8.145363,-2.714405
bergstrom,-10.859768,-8.219471,-2.640297
puzzle,-10.859768,-8.386525,-2.473243
smithers,-10.859768,-8.481835,-2.377933
professor,-10.859768,-8.481835,-2.377933
congratulations,-10.859768,-8.481835,-2.377933
bleeding,-10.859768,-8.481835,-2.377933
buddhist,-10.859768,-8.481835,-2.377933
gem,-10.859768,-8.587196,-2.272572


Some very Lisa-like words here, like "kitty", "neat", "buddhist", "bergstrom".

Now for Bart, sort descending:

In [58]:
probs_df = probs_df.sort_values("diff", ascending=False)
probs_df.head(20)

Unnamed: 0,Bart,Lisa,diff
lis,-5.954493,-10.091273,4.13678
ay,-7.333407,-10.784421,3.451013
carumba,-6.967948,-10.091273,3.123326
crap,-7.815245,-10.784421,2.969175
dude,-7.815245,-10.784421,2.969175
nah,-8.087179,-10.784421,2.697241
crappy,-8.151718,-10.784421,2.632703
gettin,-8.220711,-10.784421,2.56371
sis,-8.374861,-10.784421,2.409559
loser,-8.374861,-10.784421,2.409559


This list is even clearer: "lis" (Lisa), "ay", "carumba", "crap", "dude" are instantly recognizable as Bart's.
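
Because `diff` is a difference of log probabilities, taking its exponential turns it back into a ratio of probabilities. The sketch below does this for two words, with the log probabilities copied from the tables above (the dictionary names are only for illustration).

```python
import math

# Log probabilities copied from the tables above.
log_p = {
    "lis":       {"Bart": -5.954493,  "Lisa": -10.091273},
    "bergstrom": {"Bart": -10.859768, "Lisa": -8.219471},
}

ratios = {}
for word, lp in log_p.items():
    diff = lp["Bart"] - lp["Lisa"]        # log P(word | Bart) - log P(word | Lisa)
    ratios[word] = math.exp(diff)         # equals P(word | Bart) / P(word | Lisa)
    print(f"{word}: diff = {diff:.3f}, Bart/Lisa ratio = {ratios[word]:.2f}")
```

A ratio far above 1 marks a word that the model considers typical for Bart, and a ratio far below 1 marks a word typical for Lisa.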

We might exclude some very infrequent words from this analysis. That exercise is left up to the reader.
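
One possible starting point (an assumption on our part, there are other ways) is the `min_df` parameter of `CountVectorizer`, which drops words that occur in fewer than a given number of lines. The sketch below uses made-up lines; for the real data you would refit `vect` with a suitable `min_df` and then rebuild the matrix and the model.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up lines, used only for illustration.
toy_lines = ["ay caramba", "eat my shorts", "eat my shorts man", "my shorts"]

all_words = CountVectorizer().fit(toy_lines)
frequent_words = CountVectorizer(min_df=2).fit(toy_lines)   # keep words found in at least 2 lines

print(sorted(all_words.vocabulary_))        # every word
print(sorted(frequent_words.vocabulary_))   # only words that occur in 2 or more lines
```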

