# Simpsons

In this notebook, we will predict which character (Bart or Lisa) speaks a given line of dialogue from *The Simpsons*.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

First, let's generate a document-feature matrix from the lines spoken by Lisa and Bart:

In [3]:
df = pd.read_csv('simpsons.csv')
df = df.loc[df['raw_character_text'].isin(['Lisa Simpson', 'Bart Simpson'])] # keep only Lisa's and Bart's lines
text = df['spoken_words'].values.astype('U') # dialogue text as Unicode strings (missing values become the string 'nan')
vect = CountVectorizer(stop_words='english') # create the CountVectorizer, dropping English stop words
vect = vect.fit(text) # learn the vocabulary from the dialogue text
docu_feat = vect.transform(text) # build the sparse document-feature matrix: one row per line, one column per word
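
To make the idea of a document-feature matrix concrete, here is a simplified pure-Python sketch of the counting step. The toy lines and the two-word stop list are invented for illustration. `CountVectorizer` itself differs in its details: it tokenizes with a regular expression, uses a much longer English stop-word list, and returns a sparse matrix.

```python
from collections import Counter

# Toy lines invented for illustration; they are not taken from simpsons.csv
toy_lines = ["eat my shorts", "my saxophone", "eat the donut"]

# Tiny stand-in stop-word list; sklearn's 'english' list is much longer
stop_words = {"my", "the"}

# Tokenize: lowercase, split on whitespace, drop stop words
tokens = [
    [word for word in line.lower().split() if word not in stop_words]
    for line in toy_lines
]

# The vocabulary defines the columns, sorted alphabetically
vocab = sorted({word for doc in tokens for word in doc})

# One row per line, one count per vocabulary word
matrix = [[Counter(doc)[word] for word in vocab] for doc in tokens]

print(vocab)
for row in matrix:
    print(row)
```

Each row is one line of dialogue and each column is one word, so most entries are zero. This is why the real matrix is stored in sparse form.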


Now, we will use the Naïve Bayes classifier from `sklearn`.

In [14]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB() # create the model
X = docu_feat # the document-feature matrix is the X matrix
y = df['raw_character_text'] # the y vector holds the speaker of each line

# hold out 30% of the lines for testing (no random_state is set, so the split and the score vary between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

nb = nb.fit(X_train, y_train) # fit the model: X = word counts, y = character

# Evaluate the model
y_test_p = nb.predict(X_test) # predicted speaker for each test line
nb.score(X_test, y_test) # mean accuracy on the test set


0.63749174917491747

The accuracy is 63.7%, which is not great considering there are only two categories. What if we always guessed the same character?

In [15]:
df['raw_character_text'].value_counts(normalize=True)

Bart Simpson    0.544954
Lisa Simpson    0.455046
Name: raw_character_text, dtype: float64
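
The comparison against a constant guess can be worked out directly from the two outputs above. This is a minimal sketch that copies those numbers by hand. Note one assumption: the class shares come from the full dataset, while the accuracy was measured on the 30% test split, so the baseline is only approximate.

```python
# Class shares and model accuracy copied from the outputs above
class_shares = {"Bart Simpson": 0.544954, "Lisa Simpson": 0.455046}
model_accuracy = 0.63749174917491747

# Always guessing the most frequent character is right as often as that character appears
majority_label = max(class_shares, key=class_shares.get)
baseline_accuracy = class_shares[majority_label]

# Gain of the classifier over the constant guess, in percentage points
improvement = (model_accuracy - baseline_accuracy) * 100

print(f"Always guessing {majority_label}: {baseline_accuracy:.1%}")
print(f"Naive Bayes gain over baseline: {improvement:.1f} percentage points")
```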

So we're doing about 9.3 percentage points better than if we always guessed Bart. Not great, but that's to be expected with such short lines of dialogue.
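
Accuracy is a single number and does not show which character gets misclassified. A confusion matrix breaks the predictions down by true and predicted speaker. The sketch below builds one by hand on hypothetical labels invented for illustration. In the notebook, `confusion_matrix(y_test, y_test_p)` from `sklearn.metrics` (already imported) produces a table with the same layout: rows are true labels, columns are predicted labels.

```python
from collections import Counter

# Hypothetical true and predicted speakers, invented for illustration
y_true = ["Bart Simpson", "Bart Simpson", "Lisa Simpson", "Lisa Simpson", "Bart Simpson"]
y_pred = ["Bart Simpson", "Lisa Simpson", "Lisa Simpson", "Bart Simpson", "Bart Simpson"]

labels = sorted(set(y_true) | set(y_pred))

# Count how often each (true, predicted) pair occurs
pair_counts = Counter(zip(y_true, y_pred))

# Rows are the true character, columns are the predicted character
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

for label, row in zip(labels, matrix):
    print(f"{label:<14}", row)

# Accuracy is the diagonal (correct guesses) divided by the number of lines
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
print(accuracy)
```

The off-diagonal entries show the two kinds of error: Bart's lines attributed to Lisa and Lisa's lines attributed to Bart.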