# Analysis of scripts by clustering text

We'll use the Natural Language Toolkit (NLTK) to further explore scripts from The Simpsons. First we need to clean up the text and extract keywords; then we can use other natural language processing methods to visualize our results.

In [1]:
# imports
import bs4
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import urllib
import time
%matplotlib inline
import psycopg2
from sqlalchemy import create_engine
import nltk
import re

## Connect to PostgreSQL database of Simpsons scripts

In [2]:
# connect to PostgreSQL
dbname = 'simpsonsscripts'
username = 'hsf001'

con = psycopg2.connect(database=dbname, user=username)

# note: SQLAlchemy 1.4+ requires the 'postgresql://' scheme instead of 'postgres://'
engine = create_engine('postgres://%s@localhost/%s' % (username, dbname))
print(engine.url)

postgres://hsf001@localhost/simpsonsscripts


## Pull scripts from one season

We'll focus on one season first and pull the text from Season 8 scripts.

In [3]:
# scripts from season 8

sql_query = """
SELECT ep.number, ep.name, scripts.text
FROM episodes ep  
  LEFT JOIN scripts ON ep.url = scripts.url
WHERE ep.season='8'
"""

sql_out = pd.read_sql_query(sql_query,con)
sql_out['number']=sql_out['number'].astype(int)
sql_out.sort_values(by='number', inplace=True)

textList = sql_out['text'].tolist()

## TF-IDF for one season of the Simpsons

Let's try using a common analysis method, term frequency - inverse document frequency (TF-IDF), and look at what words or phrases might be most representative of an episode. We'll do this for the episodes from season 8 following [this document clustering guide](http://brandonrose.org/clustering).
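To make the idea concrete, here is a minimal sketch of the textbook TF-IDF weighting, `tf * log(N / df)`, on a made-up three-document corpus. The documents and the helper name `tf_idf` are assumptions for illustration only. scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens
docs = [
    ["homer", "eats", "donut"],
    ["homer", "drinks", "beer"],
    ["bart", "rides", "skateboard"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    # terms shared across documents (here 'homer') score lower than unique ones
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

A term that appears in many documents gets a small inverse document frequency, so it contributes little to any one document's profile.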

### Data processing

First, we will need to process our data by formatting the text and removing meaningless words. 

### Stopwords

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are common words such as articles, pronouns, and prepositions that carry little meaning on their own. The right list can vary with the application; here we use the [NLTK](http://www.nltk.org/) list of English stop words.
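As a rough sketch of what stop word removal does, the snippet below filters a tokenized sentence against a tiny hand-written stop list. The set `toy_stopwords` and the helper `remove_stopwords` are assumptions made so the example runs without downloading anything; the NLTK list is much longer.

```python
# A tiny hand-written stop list, used here only so the example runs without
# downloading the NLTK corpus; the real NLTK list is much longer
toy_stopwords = {"i", "a", "and", "have", "the", "my"}

def remove_stopwords(tokens, stop_set):
    """Keep only the tokens that are not in the stop set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_set]

tokens = ["I", "have", "a", "wife", "and", "kids"]
kept = remove_stopwords(tokens, toy_stopwords)
print(kept)  # ['wife', 'kids']
```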

In [4]:
# load NLTK's English stop words as variable stopwords
#nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


### Stemming

[Stemming](https://en.wikipedia.org/wiki/Stemming) reduces words to their root form. For example, 'running' and 'run' share the root 'run'. We'll use the [Snowball Stemmer](http://snowballstem.org/) from NLTK.

In [5]:
# load NLTK's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

### Tokenizing

[Tokenizing](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is used to separate individual words in a string of text. For example, `hello world` is a string that can be broken into 2 tokens `hello` and `world` with a space delimiter. We'll use the [tokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize) from NLTK.

We'll follow the guide and create 2 functions to tokenize and stem the scripts.

- `tokenize_and_stem`: split a script into a list of words and stem each word
- `tokenize_only`: split a script into a list of lowercase words

These functions can be used to create a dictionary that will allow us to use stems for an algorithm but later convert the stems back to their full words for presentation purposes.
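The stem-to-word lookup can be sketched with plain Python containers. The parallel lists `stems` and `words` below are hypothetical stand-ins for the outputs of the two functions, and `display_word` is an illustrative helper, not part of the guide.

```python
from collections import Counter, defaultdict

# Hypothetical parallel lists: stems[i] is the stem of words[i]
stems = ["run", "run", "run", "alien", "alien"]
words = ["running", "runs", "running", "aliens", "alien"]

# Count which full words were seen for each stem
words_by_stem = defaultdict(Counter)
for stem, word in zip(stems, words):
    words_by_stem[stem][word] += 1

def display_word(stem):
    """Return the most frequent full word observed for a stem."""
    return words_by_stem[stem].most_common(1)[0][0]

print(display_word("run"))
```

Choosing the most frequent full word is one possible convention; any word seen for the stem would serve for presentation.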

In [6]:
# define functions that tokenize the text they are passed; the first also stems each token and returns the list of stems

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

We can apply both functions to every entry in the list of scripts to build 2 vocabularies: one stemmed and one only tokenized. First, let's check the tokenizer on a single entry.

In [7]:
# the Punkt tokenizer models need to be downloaded once before tokenizing
#nltk.download('punkt')
testLine = textList[3]
out = nltk.sent_tokenize(testLine)
print(out)
for sent in nltk.sent_tokenize(testLine):
    print(sent)
    for word in nltk.word_tokenize(sent):
        print(word.lower())

['Homer Simpson:  (HORRIFIED) Oh my God... space aliens!', "Don't eat me!", 'I have a wife and kids!', 'Eat them!']
Homer Simpson:  (HORRIFIED) Oh my God... space aliens!
homer
simpson
:
(
horrified
)
oh
my
god
...
space
aliens
!
Don't eat me!
do
n't
eat
me
!
I have a wife and kids!
i
have
a
wife
and
kids
!
Eat them!
eat
them
!


In [32]:
# do not run this - it will take very long
# (textList[1] is a single string, so this loop visits one character at a
# time instead of one script entry at a time; the next cell iterates over
# the whole list)

totalvocab_stemmed = []
totalvocab_tokenized = []
#for i in textList:
for i in textList[1]:
    allwords_stemmed = tokenize_and_stem(i) # tokenize and stem each item
    totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list

    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)


In [8]:
import time

start = time.time()

testtexts = textList

totalvocab_stemmed = []
totalvocab_tokenized = []
for i in testtexts:
    allwords_stemmed = tokenize_and_stem(i) # tokenize and stem each script entry
    totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list

    allwords_tokenized = tokenize_only(i) # tokenize only, keeping the full words
    totalvocab_tokenized.extend(allwords_tokenized)

end = time.time()

print('Time elapsed: ' + str(round(end - start, 4)) + ' seconds')

Time elapsed: 4.4004 seconds


In [9]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 88134 items in vocab_frame


In [10]:
display(vocab_frame.head())

|          | words    |
|----------|----------|
| dole     | dole     |
| campaign | campaign |
| stop     | stop     |
| ext      | ext      |
| dole     | dole     |


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(testtexts) # fit the vectorizer to the scripts

print(tfidf_matrix.shape) # (number of documents, number of terms)

CPU times: user 3.05 s, sys: 20.7 ms, total: 3.07 s
Wall time: 3.07 s
(6690, 2)


In [12]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out()

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
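For intuition, cosine distance between two TF-IDF rows can be sketched by hand as one minus the cosine of the angle between them. The vectors and the helper `cosine_distance` below are made up for illustration. Unlike scikit-learn's `cosine_similarity`, this sketch does not handle all-zero vectors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up term-weight vectors for three documents
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 0.0, 4.0]   # same direction as doc_a
doc_c = [0.0, 3.0, 0.0]   # shares no terms with doc_a

print(cosine_distance(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

Documents that use the same terms in the same proportions have a distance near 0, regardless of length, while documents with no terms in common have a distance of 1.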

In [14]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

CPU times: user 232 ms, sys: 12 ms, total: 244 ms
Wall time: 251 ms


In [15]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# save the fitted model, then reload it from the pickle;
# once the model has been saved, the dump line can be commented out
joblib.dump(km,  'doc_cluster.pkl')

km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
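A quick way to inspect the result is to count how many documents fall into each cluster. The sketch below uses a hypothetical label list, `example_clusters`, standing in for `km.labels_.tolist()`; the values are made up and are not output from this notebook.

```python
from collections import Counter

# Hypothetical cluster labels, standing in for km.labels_.tolist()
example_clusters = [0, 2, 2, 1, 0, 2, 4, 3, 2]

# Count how many documents were assigned to each cluster label
cluster_sizes = Counter(example_clusters)
for label, size in sorted(cluster_sizes.items()):
    print(f"cluster {label}: {size} documents")
```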
