# Simpsons

In this notebook, we will pre-process lines of dialogue from *The Simpsons*.

In [1]:
import pandas as pd

First, let's read in the data file.

In [2]:
df = pd.read_csv('simpsons.csv')
df.head()

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...

Next, we keep only the lines spoken by Lisa Simpson and Bart Simpson.


In [3]:
df = df[df['raw_character_text'].isin(['Lisa Simpson', 'Bart Simpson'])]
df.head()

Unnamed: 0,raw_character_text,spoken_words
1,Lisa Simpson,Where's Mr. Bergstrom?
3,Lisa Simpson,That life is worth living.
7,Bart Simpson,Victory party under the slide!
9,Lisa Simpson,Mr. Bergstrom! Mr. Bergstrom!
11,Lisa Simpson,Do you know where I could find him?


To read the text and use it for our analysis, we need a class from `sklearn` called `CountVectorizer`. Essentially, it builds a dictionary (a vocabulary) from a collection of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words. We pass it a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.

`CountVectorizer` expects strings, so we first convert the text column to a NumPy array of Unicode strings. We do so by using `.values.astype('U')`.
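
As a small illustration of what this conversion does, the sketch below applies `.values.astype('U')` to a hypothetical two-row Series (the sample values and variable names are made up for this example). One side effect worth knowing, which is NumPy behaviour rather than something specific to this data set: a missing value becomes the literal string `'nan'`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one line of dialogue and one missing value
sample = pd.Series(["Where's Mr. Bergstrom?", np.nan], dtype=object)

# Object array -> fixed-width NumPy Unicode array
as_unicode = sample.values.astype('U')

print(as_unicode.dtype.kind)  # 'U' marks a NumPy Unicode string dtype
print(as_unicode[1])          # the missing value is now the text 'nan'
```

If that placeholder token is unwanted, one option is to call `.fillna('')` on the column before converting.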

In [3]:
from sklearn.feature_extraction.text import CountVectorizer  # the CountVectorizer class

text = df['spoken_words'].values.astype('U')  # take the dialogue from df and convert it to Unicode strings
vect = CountVectorizer(stop_words='english')  # create the vectorizer, with English stop words
vect = vect.fit(text)  # learn the vocabulary from the lines of dialogue
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vect.get_feature_names_out().tolist()  # the words in the vocabulary, as a plain list
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 38778 words in the vocabulary. A selection: ['abreast', 'abridged', 'abridging', 'abroad', 'abs', 'absa', 'absconded', 'absence', 'absent', 'absentee', 'abso', 'absolut', 'absolute', 'absolutely', 'absolution', 'absolve', 'absolved', 'absorb', 'absorbativity', 'absorbed']
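
To see lowercasing, tokenization and stop-word removal on a small scale, here is a minimal sketch on two made-up sentences. The sentences and the `toy_` names are assumptions for illustration only; they do not come from the data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with mixed case and punctuation
toy_lines = ["The cow ate my shorts!", "My SHORTS, the cow..."]

toy_vect = CountVectorizer(stop_words='english').fit(toy_lines)

# vocabulary_ maps each kept token to its column index
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)
```

Stop words such as "the" and "my" do not appear in the result, and "SHORTS" and "shorts" collapse into a single lowercase entry.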


Now that we have the dictionary, we can count the occurrences of each word in each line of dialogue. This way, we can create a document-feature matrix, with documents (lines of dialogue) in the rows and features (words) in the columns.

In [4]:
docu_feat = vect.transform(text)  # build the sparse document-feature matrix
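
A minimal sketch of what a document-feature matrix looks like, using three made-up documents instead of the Simpsons data (all names here are hypothetical). Converting to a dense table is harmless at this toy size.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents
toy_docs = ["cow cow shorts", "shorts", "cow donut"]

toy_vect = CountVectorizer()
toy_matrix = toy_vect.fit_transform(toy_docs)  # sparse, shape (documents, features)

# Order the words by column index so they line up with the matrix columns
toy_columns = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
toy_table = pd.DataFrame(toy_matrix.toarray(), columns=toy_columns)
print(toy_table)
```

Each row is one document, each column one word, and each cell the number of times that word occurs in that document.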

In [10]:
print(docu_feat[0:500,0:500])

  (0, 699)	1
  (0, 9612)	1
  (0, 19886)	1
  (0, 20463)	1
  (0, 22700)	1
  (0, 22934)	1
  (0, 30619)	1
  (0, 34380)	1
  (1, 3212)	1
  (1, 22309)	1
  (2, 9337)	1
  (2, 9351)	1
  (2, 9963)	1
  (2, 18870)	1
  (2, 19584)	1
  (2, 19739)	1
  (2, 25556)	1
  (2, 33326)	1
  (2, 33756)	1
  (2, 33962)	1
  (2, 34940)	1
  (3, 19686)	1
  (3, 19906)	1
  (3, 38115)	1
  (4, 5416)	1
  :	:
  (38769, 16064)	1
  (38769, 19308)	1
  (38769, 21836)	1
  (38770, 15125)	1
  (38770, 21136)	1
  (38771, 22619)	1
  (38772, 6239)	1
  (38772, 14530)	1
  (38772, 21333)	1
  (38772, 36529)	1
  (38773, 15681)	1
  (38773, 23654)	1
  (38774, 21842)	1
  (38775, 22619)	1
  (38776, 14388)	1
  (38776, 14993)	1
  (38776, 15104)	1
  (38776, 18870)	1
  (38776, 19465)	1
  (38776, 20867)	1
  (38776, 24661)	1
  (38776, 25615)	1
  (38776, 29449)	1
  (38777, 14443)	1
  (38777, 14717)	1
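
Each printed line above has the form `(row, column)` followed by the stored count; cells that are zero are not stored at all. The sketch below shows the same idea with `scipy.sparse` on a tiny hand-made matrix (the values are invented for illustration).

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small matrix that is mostly zeros
dense = np.array([[0, 2, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
sparse = csr_matrix(dense)

# Only the non-zero cells are stored: collect them as (row, column, value)
rows, cols = sparse.nonzero()
triples = [(int(r), int(c), int(sparse[r, c])) for r, c in zip(rows, cols)]
print(triples)

# Fraction of cells that actually hold a value
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"{sparse.nnz} stored values, density {density:.2f}")
```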


For illustration only, we will turn the first 50 rows of the sparse matrix into a regular (dense) matrix. **This is NOT recommended during actual analysis**, because a dense matrix stores every zero explicitly.

In [12]:
# Slice the sparse matrix BEFORE converting it: docu_feat.toarray()[0:50] first builds the full
# dense array of shape (158314, 38778), which needs 45.7 GiB and raises a MemoryError.
# reset_index(drop=True) aligns the row labels, because otherwise we end up with a bunch of NaNs.
dense_sample = pd.DataFrame(docu_feat[0:50].toarray(), columns=feature_names)
df_words = pd.concat([df[0:50].reset_index(drop=True), dense_sample], axis=1)
df_words.head(5)
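
The 45.7 GiB figure for the full dense matrix can be checked by hand: a dense `int64` array needs 8 bytes per cell, whether or not the cell is zero. A quick back-of-the-envelope calculation, using the shape reported in the `MemoryError`:

```python
n_rows, n_cols = 158314, 38778   # shape reported in the MemoryError
bytes_per_cell = 8               # int64

# Total bytes divided by 2**30 gives GiB
dense_gib = n_rows * n_cols * bytes_per_cell / 2**30
print(f"{dense_gib:.1f} GiB")
```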