# Text Modeling

In [1]:
import pandas as pd

In [8]:
df = pd.read_csv('simpsons.csv')
df.head(10)

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...
5,Martin Prince,I don't think there's anything left to say.
6,Edna Krabappel-Flanders,Bart?
7,Bart Simpson,Victory party under the slide!
8,,
9,Lisa Simpson,Mr. Bergstrom! Mr. Bergstrom!


In [14]:
df[(df["raw_character_text"] == "Lisa Simpson") | (df["raw_character_text"] == "Bart Simpson")]


Unnamed: 0,raw_character_text,spoken_words
1,Lisa Simpson,Where's Mr. Bergstrom?
3,Lisa Simpson,That life is worth living.
7,Bart Simpson,Victory party under the slide!
9,Lisa Simpson,Mr. Bergstrom! Mr. Bergstrom!
11,Lisa Simpson,Do you know where I could find him?
...,...,...
158299,Lisa Simpson,Can we have wine?
158301,Lisa Simpson,Can I have wine?
158303,Lisa Simpson,Does Bart have to be there?
158305,Lisa Simpson,Can we do it this week?


To read the text and use it for our analysis, we need an object from `sklearn` called a `CountVectorizer`. Essentially, it builds a dictionary (a vocabulary) from a collection of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words. I also pass a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.
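To make the lowercasing and stop-word removal concrete, here is a minimal sketch on two made-up lines (`toy_lines` and `toy_vect` are hypothetical names, not part of the Simpsons data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented lines of dialogue, purely for illustration
toy_lines = ["Where is the Saxophone?", "Eat my shorts, man!"]

toy_vect = CountVectorizer(stop_words='english')
toy_vect.fit(toy_lines)

# vocabulary_ maps every kept token to its column index;
# sorting the keys gives the words in column order
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)
```

The printed vocabulary should contain lowercased words such as `saxophone` and `shorts`, while stop words such as `the` are left out.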

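We will need to convert the text to Unicode strings, a standard text format, so that every entry (including missing values such as row 8 above) is a string the vectorizer can process. We do so by using `.values.astype('U')`.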
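One side effect worth knowing about: a missing value is converted to the literal string `'nan'`. The sketch below uses a small invented Series (`toy` is a hypothetical name) to show this, together with one possible alternative that replaces missing values with an empty string first. Whether that alternative is preferable here is a judgment call, not something the original analysis does.

```python
import numpy as np
import pandas as pd

# A small object-dtype Series with one missing value, like row 8 of the data
toy = pd.Series(["Bart?", np.nan, "Victory party under the slide!"], dtype=object)

# Direct conversion: the missing value becomes the text 'nan'
as_unicode = toy.values.astype('U')
print(as_unicode)

# Possible alternative: fill missing values with '' before converting
cleaned = toy.fillna('').values.astype('U')
print(cleaned)
```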

In [16]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df['spoken_words'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the dialogue text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (use get_feature_names() on scikit-learn older than 1.0)
#feature_names
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 38778 words in the vocabulary. A selection: ['abreast', 'abridged', 'abridging', 'abroad', 'abs', 'absa', 'absconded', 'absence', 'absent', 'absentee', 'abso', 'absolut', 'absolute', 'absolutely', 'absolution', 'absolve', 'absolved', 'absorb', 'absorbativity', 'absorbed']



Now that we have the dictionary, we can count the occurrences of each word in each line of dialogue. This way, we can create a document-feature matrix, with documents (lines of dialogue) in the rows, and features (words) in the columns.

In [18]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix


In [22]:
print(docu_feat[0:500,0:500]) #Let's print a little part of the matrix: the first 500 documents & words

  (48, 425)	1
  (53, 474)	1
  (289, 468)	1
  (387, 5)	1
  (444, 277)	1
  (486, 401)	1
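The matrix is stored in sparse form, so the printout lists only the nonzero cells as `(row, column)  count`. For example, `(48, 425)  1` means that document 48 contains word number 425 once. A minimal sketch on three invented documents (hypothetical names `toy_lines`, `toy_matrix`) shows how this relates to an ordinary dense table:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented documents; no stop-word list, to keep the example tiny
toy_lines = ["party party slide", "wine", "slide"]

toy_vect = CountVectorizer()
toy_matrix = toy_vect.fit_transform(toy_lines)  # fit and transform in one step

print(toy_matrix.shape)  # (number of documents, number of words)
print(toy_matrix.nnz)    # number of nonzero cells actually stored
print(toy_matrix)        # sparse view: only the nonzero cells

# Dense view: one row per document, one column per word (alphabetical order)
dense = toy_matrix.toarray()
print(dense)
```

Converting to dense is fine for a toy example, but for the full matrix it would create a very large array that is almost entirely zeros.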


# Naive Bayes

In [25]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

First, let's start with the code to generate a document-feature matrix.

In [26]:
df = pd.read_csv('simpsons.csv')
df = df.loc[(df['raw_character_text'] == 'Lisa Simpson') | (df['raw_character_text'] == 'Bart Simpson')]
text = df['spoken_words'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the dialogue text
docu_feat = vect.transform(text) # make a matrix

Now, we will use the Naïve Bayes classifier from `sklearn`.

In [30]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df['raw_character_text'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model X=features, y=character
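To see what the classifier does with word counts, here is a minimal sketch on four invented lines (the names `toy_lines`, `toy_labels`, and `toy_nb` are hypothetical). With default settings, `MultinomialNB` estimates how often each word occurs per class, adds one to every count as smoothing, and combines those estimates with the class frequencies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: two lines per character
toy_lines = ["eat my shorts", "cowabunga dude",
             "jazz saxophone solo", "saxophone practice"]
toy_labels = ["Bart", "Bart", "Lisa", "Lisa"]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_lines)
toy_nb = MultinomialNB().fit(toy_X, toy_labels)

# Words that are not in the vocabulary ("more") are simply ignored
new_X = toy_vect.transform(["more saxophone"])
toy_pred = toy_nb.predict(new_X)
toy_proba = toy_nb.predict_proba(new_X)  # columns follow toy_nb.classes_

print(toy_nb.classes_, toy_pred, toy_proba)
```

Here "saxophone" appears twice in Lisa's lines and never in Bart's, so the smoothed word probabilities are 3/14 for Lisa against 1/14 for Bart, which gives Lisa a probability of 0.75.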




# Evaluation of Notebook

In [31]:
#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.6300990099009901

The accuracy is 63.0%, which is not great considering there are only two categories. What if we guessed the same category all the time?

In [32]:
df['raw_character_text'].value_counts(normalize=True) #guess bart everytime

Bart Simpson    0.544954
Lisa Simpson    0.455046
Name: raw_character_text, dtype: float64

So we're doing about 8.5 percentage points better than the 54.5% we would get by guessing Bart all the time. Not great, but that's to be expected with such short lines of dialogue.

In [34]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Bart', 'Lisa'], columns=['Bart predicted', 'Lisa predicted'])
cm

Unnamed: 0,Bart predicted,Lisa predicted
Bart,3200,868
Lisa,1934,1573
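As a sanity check, the accuracy can be recomputed from the four counts in the confusion matrix above. The sketch below copies those counts into a hypothetical array `cm_values`. It also computes Bart's share of the test set, which differs slightly from the share in the full data because the split is random.

```python
import numpy as np

# Counts copied from the confusion matrix (rows: actual, columns: predicted)
cm_values = np.array([[3200, 868],
                      [1934, 1573]])

total = cm_values.sum()

# Correct predictions sit on the diagonal
accuracy = np.trace(cm_values) / total

# Share of test lines that really are Bart's (the first row)
bart_share_test = cm_values[0].sum() / total

print(f"Accuracy: {accuracy:.4f}")
print(f"Share of Bart in the test set: {bart_share_test:.4f}")
```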


Bart has more lines than Lisa, which is how I figured out the ordering of the labels. However, `confusion_matrix` orders its rows and columns by sorted label, so you can also always get the order from the `.classes_` attribute of the fitted model.

In [35]:
nb.classes_

array(['Bart Simpson', 'Lisa Simpson'], dtype='<U12')
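If you would rather not rely on the default ordering, `confusion_matrix` accepts a `labels` argument that fixes the order of rows and columns. A minimal sketch with invented predictions (`toy_true` and `toy_pred` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels for five lines
toy_true = ["Lisa", "Bart", "Bart", "Lisa", "Bart"]
toy_pred = ["Lisa", "Bart", "Lisa", "Bart", "Bart"]

# Default: labels in sorted order, so Bart comes first
default_cm = confusion_matrix(toy_true, toy_pred)

# Explicit order: Lisa first, then Bart
flipped_cm = confusion_matrix(toy_true, toy_pred, labels=["Lisa", "Bart"])

print(default_cm)
print(flipped_cm)
```

Passing the same list to `labels` and to the `index`/`columns` of the DataFrame keeps the table headers and the counts in sync.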

Let's calculate precision and recall. Remember: precision is the proportion of the "Bart" predictions that are actually "Bart". Recall is the proportion of real "Bart" lines that are predicted as "Bart".

In [36]:
print(f"The precision for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}")
print(f"The recall for Bart is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}")
print(f"The precision for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for Lisa is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")

The precision for Bart is: 0.6232956758862486
The recall for Bart is: 0.7866273352999017
The precision for Lisa is: 0.6444080294961082
The recall for Lisa is: 0.44853150841174794


We could get the same numbers with much less code using the built-in function `classification_report`:

In [37]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p, labels=nb.classes_)) #recent scikit-learn versions require labels as a keyword argument; nb.classes_ fixes the row order

              precision    recall  f1-score   support

Bart Simpson       0.62      0.79      0.70      4068
Lisa Simpson       0.64      0.45      0.53      3507

    accuracy                           0.63      7575
   macro avg       0.63      0.62      0.61      7575
weighted avg       0.63      0.63      0.62      7575
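The f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the confusion-matrix counts shown earlier, as a check on the report. The helper `f1` is a hypothetical function written for this illustration, not part of `sklearn`.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the confusion-matrix counts:
# Bart: 3200 correct, 5134 predicted as Bart, 4068 actual Bart lines
# Lisa: 1573 correct, 2441 predicted as Lisa, 3507 actual Lisa lines
bart_f1 = f1(3200 / 5134, 3200 / 4068)
lisa_f1 = f1(1573 / 2441, 1573 / 3507)

# Macro average: plain mean over the classes
macro_f1 = (bart_f1 + lisa_f1) / 2

# Weighted average: mean weighted by the number of lines per class (support)
weighted_f1 = (bart_f1 * 4068 + lisa_f1 * 3507) / 7575

print(round(bart_f1, 2), round(lisa_f1, 2),
      round(macro_f1, 2), round(weighted_f1, 2))
```

Rounded to two decimals, these values agree with the f1-score column of the report.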

