# Topic Modeling - Stanley Tucci Movie Plots

## First Round

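This round fits a Latent Dirichlet Allocation (LDA) model with gensim. LDA is a probabilistic approach to topic modeling on text data: it treats each plot summary as a mixture of topics and each topic as a distribution over words.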
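
To make the "probabilistic" part concrete, here is a minimal sketch of the generative story LDA assumes, written with the standard library only. The toy topics, the `alpha` value, and the helper name `sample_document` are illustrative assumptions, not part of this notebook's data or of gensim's API.

```python
import random

def sample_document(topics, alpha, n_words, rng):
    """Draw one document from the LDA generative story."""
    # Per-document topic mixture: a symmetric Dirichlet draw,
    # built by normalizing independent gamma variates.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in topics]
    total = sum(gammas)
    mixture = [g / total for g in gammas]

    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        topic = rng.choices(topics, weights=mixture, k=1)[0]
        vocab, probs = zip(*topic.items())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return mixture, words

# Hypothetical toy topics (word -> probability); not fitted to the Tucci plots.
toy_topics = [
    {"games": 0.5, "district": 0.3, "capitol": 0.2},
    {"beast": 0.6, "castle": 0.4},
]
mixture, words = sample_document(toy_topics, alpha=0.5, n_words=8, rng=random.Random(0))
print(mixture)
print(words)
```

Fitting LDA runs this story in reverse: given only the word counts, it estimates the topics and the per-document mixtures.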

In [11]:
# Read in the document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('Data/Tucci-dtm.pkl')
#data = data.transpose()
data.head()


Movie Title,aa,aaron,abandon,abandoned,abandoning,abandons,abducted,abducting,abductions,abducts,...,zetajones,zeus,zeuss,zimsky,zola,zolas,zone,zoneone,zones,zoo
Prizzi's Honor,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Who's That Girl (1987 film),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Monkey Shines,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Slaves of New York,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Fear, Anxiety & Depression",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

In [8]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Movie Title,Prizzi's Honor,Who's That Girl (1987 film),Monkey Shines,Slaves of New York,"Fear, Anxiety & Depression",Quick Change,Men of Respect,Billy Bathgate (film),In the Soup,Beethoven (film),...,Submission (2017 film),Show Dogs,Patient Zero (film),A Private War,Night Hunter (2018 film),Worth (film),Supernova (2020 film),The King's Man,Jolt (film),Moonfall (upcoming film)
aa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaron,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandoned,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandoning,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
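
As a rough picture of what this conversion produces: a gensim corpus is an iterable of documents, each a list of `(term_id, count)` pairs, and `Sparse2Corpus` by default treats each column of the sparse matrix as one document. The sketch below mimics that shape in plain Python on a toy matrix. The helper `columns_to_bow` and the toy counts are assumptions for illustration, not gensim's implementation.

```python
def columns_to_bow(term_doc_rows):
    """Turn a dense term-document matrix (rows = terms, columns = documents)
    into one list of (term_id, count) pairs per document, skipping zeros."""
    n_docs = len(term_doc_rows[0])
    corpus = []
    for doc in range(n_docs):
        bow = [
            (term_id, row[doc])
            for term_id, row in enumerate(term_doc_rows)
            if row[doc] != 0
        ]
        corpus.append(bow)
    return corpus

# Hypothetical 3-term, 2-document matrix.
toy_tdm = [
    [2, 0],  # term 0
    [0, 1],  # term 1
    [1, 3],  # term 2
]
toy_corpus = columns_to_bow(toy_tdm)
print(toy_corpus)  # [[(0, 2), (2, 1)], [(1, 1), (2, 3)]]
```

This is why the matrix is transposed first: with terms as rows, each column (movie) becomes one document in the corpus.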

In [14]:
# Gensim also requires a dictionary mapping each term's position in the term-document matrix to the term itself
with open("Data/cv.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = {index: term for term, index in cv.vocabulary_.items()}
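
The inversion matters because scikit-learn's `CountVectorizer.vocabulary_` maps term to column index, while gensim wants index to term. A small sketch on a hypothetical vocabulary (the terms and indices below are made up for illustration):

```python
# Hypothetical vocabulary in the same shape as CountVectorizer.vocabulary_.
toy_vocabulary = {"abandon": 0, "zoo": 2, "beast": 1}

# Flip it so each matrix position points back to its term.
toy_id2word = {index: term for term, index in toy_vocabulary.items()}
print(toy_id2word[1])  # beast

# Sanity check: every index from 0 to n-1 should appear exactly once,
# otherwise topic words would be mislabeled.
assert sorted(toy_id2word) == list(range(len(toy_vocabulary)))
```

The same check against the number of rows in `tdm` is a cheap way to confirm the pickled vectorizer matches the matrix being modeled.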

In [17]:

# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=6, passes=15)
lda.print_topics()

[(0,
  '0.006*"nick" + 0.005*"john" + 0.005*"viktor" + 0.005*"jiro" + 0.005*"olive" + 0.005*"susie" + 0.004*"new" + 0.004*"harvey" + 0.004*"home" + 0.004*"demarco"'),
 (1,
  '0.007*"slevin" + 0.006*"celine" + 0.005*"robert" + 0.005*"film" + 0.005*"goodkat" + 0.005*"ali" + 0.005*"sabine" + 0.004*"ella" + 0.003*"boss" + 0.003*"allan"'),
 (2,
  '0.006*"harry" + 0.006*"jim" + 0.005*"belle" + 0.004*"beast" + 0.004*"morgan" + 0.004*"ben" + 0.003*"mimi" + 0.003*"beethoven" + 0.003*"fiona" + 0.003*"tells"'),
 (3,
  '0.006*"cole" + 0.006*"sullivan" + 0.005*"rogers" + 0.004*"gittens" + 0.004*"eddie" + 0.004*"assange" + 0.003*"film" + 0.003*"jimmy" + 0.003*"daniel" + 0.003*"gwen"'),
 (4,
  '0.012*"katniss" + 0.007*"peeta" + 0.006*"district" + 0.004*"harry" + 0.004*"peabody" + 0.004*"team" + 0.004*"games" + 0.004*"sherman" + 0.004*"capitol" + 0.003*"optimus"'),
 (5,
  '0.004*"andy" + 0.004*"rodney" + 0.004*"bigweld" + 0.004*"marisa" + 0.004*"jack" + 0.004*"constantine" + 0.003*"soups" + 0.003*"dom
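
The topic strings above are easier to compare once split into `(word, weight)` pairs. The helper below is a small sketch using a regular expression; `parse_topic` is a hypothetical name, and the example string is copied from the start of topic 4 above. A term cut off mid-word has no closing quote and is simply skipped.

```python
import re

def parse_topic(topic_string):
    """Split a gensim-style topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

example = '0.012*"katniss" + 0.007*"peeta" + 0.006*"district"'
print(parse_topic(example))
# [('katniss', 0.012), ('peeta', 0.007), ('district', 0.006)]
```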

## Adjectives Only

The plot descriptions of his movies contain a lot of character names, which makes sense. I considered keeping only nouns and adjectives, but names are nouns too, so the topics would still be hard to tell apart without already knowing the movies. This round keeps adjectives only.

In [18]:
# Let's create a function to pull out adjectives from a string of text
from nltk import word_tokenize, pos_tag

def adj(text):
    '''Given a string of text, tokenize the text and pull out only the adjectives.'''
    # Penn Treebank adjective tags all start with 'JJ' (JJ, JJR, JJS)
    is_adj = lambda pos: pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    adjectives = [word for (word, pos) in pos_tag(tokenized) if is_adj(pos)]
    return ' '.join(adjectives)
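
The filtering step can be checked without NLTK by running it on tokens that are already tagged. In the sketch below the tags are hand-assigned for illustration (an assumption, not real tagger output) and `keep_adjectives` is a hypothetical helper. It shows why matching on the `JJ` prefix drops proper nouns such as character names, which carry `NNP`.

```python
def keep_adjectives(tagged_tokens):
    """Keep words whose Penn Treebank tag is JJ, JJR, or JJS."""
    return [word for word, tag in tagged_tokens if tag.startswith("JJ")]

# Hand-tagged example sentence; tags are illustrative, not tagger output.
tagged = [
    ("Katniss", "NNP"),
    ("is", "VBZ"),
    ("reluctant", "JJ"),
    ("but", "CC"),
    ("braver", "JJR"),
    ("than", "IN"),
    ("most", "JJS"),
]
print(" ".join(keep_adjectives(tagged)))  # reluctant braver most
```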

In [19]:
# Apply the adj function to the plot text to keep only adjectives
# Assumes data_clean (cleaned plot text with a 'transcript' column) was created in an earlier step
data_adj = pd.DataFrame(data_clean.transcript.apply(adj))
data_adj

NameError: name 'data_clean' is not defined