## Exploratory Data Analysis

Scraping the movie datasets relied on an assumption: that the plot synopsis found on Wikipedia accurately represents the theme and action of each movie. Because of this, I am going to perform some exploratory data analysis to get a better feel for the data.

We want to find some obvious patterns within the data before using ML algorithms to find hidden patterns. For text data, this is usually done with basic descriptive statistics, such as looking at the most common words.

In [4]:
# Read in the document-term matrix
import pandas as pd

data = pd.read_pickle('Data/Tucci-dtm.pkl')
data = data.transpose()
data.head()

Movie Title,Prizzi's Honor,Who's That Girl (1987 film),Monkey Shines,Slaves of New York,"Fear, Anxiety & Depression",Quick Change,Men of Respect,Billy Bathgate (film),In the Soup,Beethoven (film),...,Submission (2017 film),Show Dogs,Patient Zero (film),A Private War,Night Hunter (2018 film),Worth (film),Supernova (2020 film),The King's Man,Jolt (film),Moonfall (upcoming film)
aa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaron,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandoned,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandoning,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
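
As an illustration of the kind of basic descriptive statistics mentioned above, here is a minimal sketch on a toy matrix. It assumes the same layout as the transposed `data` (terms as rows, movies as columns); the name `toy_dtm`, the movie titles, and the counts are made up for the example and are not taken from the real dataset.

```python
import pandas as pd

# Toy document-term matrix (hypothetical values): terms as rows, movies as columns,
# mirroring the layout of `data` after the transpose.
toy_dtm = pd.DataFrame(
    {"Movie A": [2, 0, 1], "Movie B": [1, 3, 0], "Movie C": [0, 0, 4]},
    index=["film", "wedding", "dog"],
)

# Total number of term occurrences in each movie description
total_words = toy_dtm.sum(axis=0)

# Number of distinct terms that appear at least once in each movie description
unique_words = (toy_dtm > 0).sum(axis=0)

summary = pd.DataFrame({"total_words": total_words, "unique_words": unique_words})
print(summary)
```

The same two lines could be applied to `data` directly, giving a quick view of which synopses are unusually short or use an unusually narrow vocabulary.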


## Most Common Words

One of the exciting things about using plot descriptions is that the information should be tailored to each individual movie. However, I was worried that filler words used to describe movies in general, such as 'film', 'directed', and 'cast', might keep showing up. I wanted to see whether there were common words that did little to distinguish one movie from another and were instead repeated across descriptions. To do this, I took a look at the most common words in each movie description.

In [6]:
# Find the 30 most frequent words in each movie description
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

top_dict

{"Prizzi's Honor": [('charley', 13),
  ('maerose', 6),
  ('irene', 6),
  ('don', 4),
  ('man', 4),
  ('marksie', 3),
  ('woman', 3),
  ('wife', 3),
  ('charleys', 3),
  ('father', 3),
  ('california', 2),
  ('estranged', 2),
  ('wedding', 2),
  ('york', 2),
  ('casino', 2),
  ('marry', 2),
  ('hit', 2),
  ('family', 2),
  ('fact', 2),
  ('money', 2),
  ('angelo', 2),
  ('ends', 2),
  ('business', 2),
  ('prizzi', 2),
  ('doesnt', 2),
  ('nevada', 2),
  ('eduardo', 2),
  ('acting', 2),
  ('new', 2),
  ('dominic', 2)],
 "Who's That Girl (1987 film)": [('nikki', 12),
  ('loudon', 11),
  ('worthington', 5),
  ('mr', 5),
  ('shes', 4),
  ('murray', 4),
  ('bus', 4),
  ('finds', 3),
  ('box', 3),
  ('johnny', 3),
  ('bell', 2),
  ('pick', 2),
  ('love', 2),
  ('woman', 2),
  ('arrested', 2),
  ('wedding', 2),
  ('john', 2),
  ('cougar', 2),
  ('years', 2),
  ('thieves', 2),
  ('men', 2),
  ('theft', 2),
  ('catch', 2),
  ('philadelphia', 2),
  ('vanishes', 1),
  ('intelligent', 1),
  ('new',

In [7]:
from collections import Counter

# Let's first pull out the top 30 words for each movie into one flat list
words = []
for movie in data.columns:
    top = [word for (word, count) in top_dict[movie]]
    for t in top:
        words.append(t)
        
words

['charley',
 'maerose',
 'irene',
 'don',
 'man',
 'marksie',
 'woman',
 'wife',
 'charleys',
 'father',
 'california',
 'estranged',
 'wedding',
 'york',
 'casino',
 'marry',
 'hit',
 'family',
 'fact',
 'money',
 'angelo',
 'ends',
 'business',
 'prizzi',
 'doesnt',
 'nevada',
 'eduardo',
 'acting',
 'new',
 'dominic',
 'nikki',
 'loudon',
 'worthington',
 'mr',
 'shes',
 'murray',
 'bus',
 'finds',
 'box',
 'johnny',
 'bell',
 'pick',
 'love',
 'woman',
 'arrested',
 'wedding',
 'john',
 'cougar',
 'years',
 'thieves',
 'men',
 'theft',
 'catch',
 'philadelphia',
 'vanishes',
 'intelligent',
 'new',
 'plans',
 'make',
 'realizes',
 'ella',
 'allan',
 'allans',
 'melanie',
 'geoffrey',
 'house',
 'romantic',
 'having',
 'attempts',
 'head',
 'kills',
 'monkeys',
 'surgery',
 'bird',
 'pentobarbitone',
 'relationship',
 'violent',
 'experimental',
 'despite',
 'takes',
 'condition',
 'music',
 'pet',
 'neck',
 'wiseman',
 'summons',
 'dr',
 'linda',
 'tape',
 'arrives',
 'eleanor',
 '

In [8]:
# Let's aggregate this list and identify the most common words along with how many movies' top-30 lists they occur in
Counter(words).most_common()

[('film', 18),
 ('new', 17),
 ('man', 14),
 ('tells', 14),
 ('stanley', 11),
 ('day', 10),
 ('father', 9),
 ('family', 9),
 ('finds', 9),
 ('love', 9),
 ('life', 9),
 ('tucci', 9),
 ('time', 9),
 ('begins', 8),
 ('directed', 8),
 ('wife', 7),
 ('city', 7),
 ('work', 7),
 ('home', 7),
 ('daughter', 7),
 ('later', 7),
 ('goes', 7),
 ('world', 7),
 ('woman', 6),
 ('york', 6),
 ('relationship', 6),
 ('takes', 6),
 ('jack', 6),
 ('gets', 6),
 ('away', 6),
 ('escape', 6),
 ('killed', 6),
 ('american', 6),
 ('fantasies', 6),
 ('play', 6),
 ('wedding', 5),
 ('money', 5),
 ('having', 5),
 ('head', 5),
 ('fashion', 5),
 ('group', 5),
 ('old', 5),
 ('son', 5),
 ('kill', 5),
 ('court', 5),
 ('dog', 5),
 ('return', 5),
 ('police', 5),
 ('just', 5),
 ('festival', 5),
 ('fantasizes', 5),
 ('decides', 5),
 ('returns', 5),
 ('affair', 5),
 ('gun', 5),
 ('killing', 5),
 ('mr', 4),
 ('men', 4),
 ('make', 4),
 ('despite', 4),
 ('music', 4),
 ('friend', 4),
 ('way', 4),
 ('boss', 4),
 ('billy', 4),
 ('boy'

Through this I found that some words, like 'film' and 'directed', were repeated across movies. However, looking at the list alone, there was no clear enough cut-off for me to add new stop words to my corpus. This can be looked at further later.
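
One possible way to make the cut-off explicit later is to flag words by the share of movies whose top-word list contains them. The sketch below is not something run in this notebook. It uses a hypothetical miniature of `top_dict` (`toy_top_dict`), a hypothetical helper `candidate_stop_words`, and an arbitrary `min_share` threshold, all of which are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical miniature of `top_dict`: movie title -> list of (word, count) pairs.
toy_top_dict = {
    "Movie A": [("film", 3), ("charley", 5), ("new", 2)],
    "Movie B": [("film", 2), ("nikki", 4), ("new", 1)],
    "Movie C": [("film", 4), ("allan", 6), ("monkeys", 2)],
    "Movie D": [("ella", 3), ("new", 2), ("directed", 1)],
}


def candidate_stop_words(top_dict, min_share=0.5):
    """Return words found in the top lists of at least `min_share` of the movies."""
    n_movies = len(top_dict)
    # Count each word at most once per movie
    movie_counts = Counter(
        word for pairs in top_dict.values() for word in {w for w, _ in pairs}
    )
    return sorted(
        word for word, n in movie_counts.items() if n / n_movies >= min_share
    )


# The 0.75 threshold is an arbitrary choice for this toy data, not a recommendation.
candidates = candidate_stop_words(toy_top_dict, min_share=0.75)
print(candidates)
```

On the real `top_dict`, the threshold would have to be chosen by inspecting the resulting list, since a word like 'new' may be filler in one synopsis and part of 'New York' in another.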