# Louis George    

## EDA of Scraped Script and Score Data

In [269]:
import numpy as np
import pandas as pd

import re
import json
import spacy

Read in and inspect the script file:

In [195]:
df = pd.read_csv('data/scripts_upto_all.csv', index_col='Unnamed: 0')

In [196]:
df.head(2)

Unnamed: 0,titles,scripts,genres
0,10 Things I Hate About You,TEN THINGS I HA...,"['Action', 'Adventure', 'Animation', 'Comedy',..."
1,12,\n \n\n\n\n\nCUT FROM BLACK\n\nTITLE: FIN\n...,"['Action', 'Adventure', 'Animation', 'Comedy',..."


In [197]:
df.shape

(1210, 3)

In [198]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1210 entries, 0 to 1209
Data columns (total 3 columns):
titles     1210 non-null object
scripts    1145 non-null object
genres     1210 non-null object
dtypes: object(3)
memory usage: 37.8+ KB


In [199]:
df.isna().sum()

titles      0
scripts    65
genres      0
dtype: int64

In [200]:
df[df['scripts'].isna()]

Unnamed: 0,titles,scripts,genres
17,48 Hrs.,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
20,8 Mile,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
22,9,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
29,A.I.,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
122,Back to the Future,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
...,...,...,...
1121,Troy,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
1136,Unforgiven,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
1142,Valentine's Day,,"['Action', 'Adventure', 'Animation', 'Comedy',..."
1146,Vertigo,,"['Action', 'Adventure', 'Animation', 'Comedy',..."


After checking some of these titles, I found that the link to the script loads a PDF or another document type, which my scraping script currently can't handle. Time allowing, I may revisit this.     

Because of the way my scraping script grabbed the genres, it also picked up a table with links to all 18 of the site's genres. This table was always read before the script's own genres, so taking everything after the first 18 entries should leave each script's respective genres.

In [201]:
# Split each genre string on whitespace and drop the first 18 tokens (the site-wide genre links).
df['genres'] = df['genres'].apply(lambda g: g.split()[18:])

In [202]:
df.head()

Unnamed: 0,titles,scripts,genres
0,10 Things I Hate About You,TEN THINGS I HA...,"['Comedy',, 'Romance']]"
1,12,\n \n\n\n\n\nCUT FROM BLACK\n\nTITLE: FIN\n...,['Comedy']]
2,12 and Holding,\n \n \n ...,['Drama']]
3,12 Monkeys,TWELVE MONKEYS\n \n An orig...,"['Drama',, 'Sci-Fi',, 'Thriller']]"
4,12 Years a Slave,12 YEARS A SLAVE\...,['Drama']]


That column needs some cleaning, but it shouldn't be too bad: I will simply remove all non-alphabetic characters.    

I will now drop the 65 movies whose scripts I wasn't able to obtain, and clean up the genres.

In [203]:
df = df.dropna().reset_index()

In [264]:
# Keep only the alphabetic characters of every genre token, e.g. "'Sci-Fi'," -> "SciFi".
df['genres'] = df['genres'].apply(
    lambda genres: [re.sub('[^A-Za-z]', '', g) for g in genres])
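
An alternative to splitting on whitespace and then stripping punctuation is to parse each stored string back into a list first. The sketch below assumes that each raw `genres` value is the string form of a Python list (as the `df.head()` output suggests) and that the first 18 entries are always the site-wide genre links. `extract_script_genres` is a hypothetical helper name, and the sample string is made up for illustration.

```python
import ast

N_SITE_GENRES = 18  # site-wide genre links scraped ahead of each script's own genres


def extract_script_genres(raw, n_skip=N_SITE_GENRES):
    """Parse a stringified list and drop the leading site-wide genre links."""
    parsed = ast.literal_eval(raw)  # "['A', 'B']" -> ['A', 'B']
    return list(parsed[n_skip:])


# Made-up sample: 18 placeholder site genres followed by the script's own two genres.
sample = str([f"Genre{k}" for k in range(18)] + ["Comedy", "Romance"])
print(extract_script_genres(sample))  # ['Comedy', 'Romance']
```

Applied to the raw column (in place of the split-and-clean steps above) with `df['genres'].apply(extract_script_genres)`, this would keep hyphenated genres such as 'Sci-Fi' intact.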

I need to change the format of all titles of the form "Title, The" and "Title: sub title".    

It turns out that the lookup handles the vast majority of titles of the form "Title: sub title" properly, and only a handful fail. For that reason I am going to let them go, as there doesn't seem to be an immediately obvious way to catch the ones that fail while leaving alone the ones that succeed (some titles rely on both the title and the subtitle).   

I could handle this dynamically when querying: try the whole title and, if that doesn't return the correct result, retry with the partial title. I will do this if time permits.
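
The retry idea could look roughly like the sketch below. It is only an illustration: `lookup_with_fallback` is a hypothetical helper, `fetch` stands in for whatever function performs the actual query, and "failure" here simply means that nothing came back. Checking that a returned match is the *correct* movie is left out.

```python
def lookup_with_fallback(title, fetch):
    """Try the full title first; if that fails, retry with the part before ':'.

    `fetch` is a hypothetical callable that takes a title string and returns
    a parsed response dict, or None when nothing was found.
    """
    result = fetch(title)
    if result is not None:
        return result
    if ":" in title:
        main_title = title.split(":", 1)[0].strip()
        return fetch(main_title)
    return None


# Stand-in for the real query function, for illustration only.
fake_catalogue = {"Evil Dead II": {"Title": "Evil Dead II"}}
print(lookup_with_fallback("Evil Dead II: Dead by Dawn", fake_catalogue.get))
```

Titles that already resolve with their full "Title: sub title" form are untouched, because the partial title is only tried after the full one has failed.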

In [218]:
for i in range(df.shape[0]):
    if re.search("The$", df['titles'][i]):
        n_title = "The " + re.split(", ", df['titles'][i])[0]
        df['titles'][i] = n_title

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
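
The warning above comes from the chained assignment `df['titles'][i] = n_title`. One way to avoid it is to compute each new title with a plain function and assign the whole column in one step. The sketch below is an assumption-laden alternative, not the code that was run: `move_trailing_article` is a hypothetical helper, and it strips only a final ", The" rather than splitting at the first comma.

```python
import re


def move_trailing_article(title):
    """Turn 'Abyss, The' into 'The Abyss'; leave all other titles unchanged."""
    match = re.fullmatch(r"(.*), The", title)
    if match:
        return "The " + match.group(1)
    return title


print(move_trailing_article("Crow: City of Angels, The"))  # The Crow: City of Angels
print(move_trailing_article("12 Monkeys"))                 # 12 Monkeys
```

Assigning the result with `df['titles'] = df['titles'].map(move_trailing_article)` writes to the DataFrame itself rather than to a temporary copy, so pandas has nothing to warn about.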


Now I save the titles as a CSV, to cross-reference against the OMDb API and get the scores.

In [219]:
df['titles'].to_csv('NLP_Movie_Scripts/movie_titles.csv', header=True)

### Tokenizing the Scripts

The vectorizer takes in a tokenizer, which is where spaCy comes in. The vectorized output, a much wider DataFrame, is then used in the models.
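
To make the idea concrete: a tokenizer in this sense is just a callable that maps one script string to a list of tokens. The sketch below is a simplified, regex-based stand-in (not spaCy, and not the tokenizer used in this notebook), with a `collections.Counter` playing the role of the vectorizer's per-document counts. `simple_tokenizer` is a hypothetical name.

```python
import re
from collections import Counter


def simple_tokenizer(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


script_snippet = "CUT FROM BLACK\n\nTITLE: FIN"
tokens = simple_tokenizer(script_snippet)
print(tokens)           # ['cut', 'from', 'black', 'title', 'fin']
print(Counter(tokens))  # token -> count, i.e. one row of a bag-of-words matrix
```

A callable with this signature can be passed to scikit-learn's `CountVectorizer` or `TfidfVectorizer` through the `tokenizer` argument, assuming that is the kind of vectorizer meant here.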

In [273]:
nlp = spacy.load("en_core_web_sm")

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [271]:
!python -m spacy validate


[+] Loaded compatibility table

[i] spaCy installation: C:\Users\louis\Anaconda3\lib\site-packages\spacy

TYPE      NAME             MODEL            VERSION      
package   en-core-web-sm   en_core_web_sm   2.2.5     [+]
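
`spacy validate` reports the model as installed, yet `spacy.load` could not find it. One possible explanation (an assumption, not something the output above confirms) is that `!python` resolves to a different interpreter or environment than the one running the notebook kernel. A quick check from inside the kernel, using only the standard library, might look like this; `module_visible` is a hypothetical helper.

```python
import importlib.util
import sys


def module_visible(name):
    """Return True if the running interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


print(sys.executable)                    # interpreter behind this kernel
print(module_visible("en_core_web_sm"))  # False would mean the kernel cannot see the model package
```

If this prints `False`, installing the model with the kernel's own interpreter (`python -m spacy download en_core_web_sm`, run with the executable printed above) and restarting the kernel is one thing to try.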



### Cleaning the scores

In [86]:
df_scores1 = pd.read_csv('data/movie_info1.csv').drop('Unnamed: 0', axis=1)

In [87]:
df_scores1.head()

Unnamed: 0,titles,info
0,10 Things I Hate About You,"{""Title"":""10 Things I Hate About You"",""Year"":""..."
1,12,"{""Title"":""12"",""Year"":""2007"",""Rated"":""PG-13"",""R..."
2,12 and Holding,"{""Title"":""12 and Holding"",""Year"":""2005"",""Rated..."
3,12 Monkeys,"{""Title"":""12 Monkeys"",""Year"":""1995"",""Rated"":""R..."
4,12 Years a Slave,"{""Title"":""12 Years a Slave"",""Year"":""2013"",""Rat..."


In [100]:
df_scores1.isna().any()

titles    False
info      False
dtype: bool

In [234]:
temp = json.loads(df_scores1['info'][17])

In [235]:
temp

{'Title': '50-50',
 'Year': '2011',
 'Rated': 'N/A',
 'Released': 'N/A',
 'Runtime': '15 min',
 'Genre': 'Short, Crime, Drama, Romance',
 'Director': 'Megan Riakos',
 'Writer': 'Megan Riakos',
 'Actors': 'Jessica McNamee, Oliver Ackland, Drew Pearson, Les Chantery',
 'Plot': "Follows Nellie Cameron, a real life prostitute famous on the streets of Sydney in the 1920s and 30s. Nellie is enjoying the perks of her world, the drugs, the money, the fame but in her game she can't afford love.",
 'Language': 'English',
 'Country': 'Australia',
 'Awards': '1 win.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BZjM0NjhkNTItYjI0Yi00YmEzLTk0MzUtMzk1MGM5ZTVlODU4XkEyXkFqcGdeQXVyMjgxMzAxNQ@@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '6.1/10'}],
 'Metascore': 'N/A',
 'imdbRating': '6.1',
 'imdbVotes': '34',
 'imdbID': 'tt1833204',
 'Type': 'movie',
 'DVD': 'N/A',
 'BoxOffice': 'N/A',
 'Production': 'N/A',
 'Website': 'N/A',
 'Response': 'True'}

In [136]:
type(temp['Ratings'][0]['Source'])

str

In [223]:
# Object-dtype columns, because the raw values are strings such as '7.3/10' and '68%'.
for col in ['IMDb_score', 'RT_score', 'Meta_score', 'box_office']:
    df_scores1[col] = pd.Series(np.nan, index=df_scores1.index, dtype=object)

In [229]:
# Column that stores the value from each OMDb rating source.
source_cols = {'Internet Movie Database': 'IMDb_score',
               'Rotten Tomatoes': 'RT_score',
               'Metacritic': 'Meta_score'}

for i in range(df_scores1.shape[0]):
    temp = json.loads(df_scores1['info'][i])
    # "Movie not found!" responses have no 'Ratings' or 'BoxOffice'; those rows stay NaN.
    for j in temp.get('Ratings', []):
        if j['Source'] in source_cols:
            df_scores1.loc[i, source_cols[j['Source']]] = j['Value']
    df_scores1.loc[i, 'box_office'] = temp.get('BoxOffice', np.nan)

In [230]:
df_scores1.head()

Unnamed: 0,titles,info,IMDb_score,RT_score,Meta_score,box_office
0,10 Things I Hate About You,"{""Title"":""10 Things I Hate About You"",""Year"":""...",7.3/10,68%,70/100,
1,12,"{""Title"":""12"",""Year"":""2007"",""Rated"":""PG-13"",""R...",7.7/10,76%,72/100,
2,12 and Holding,"{""Title"":""12 and Holding"",""Year"":""2005"",""Rated...",7.5/10,73%,65/100,
3,12 Monkeys,"{""Title"":""12 Monkeys"",""Year"":""1995"",""Rated"":""R...",8.0/10,89%,74/100,
4,12 Years a Slave,"{""Title"":""12 Years a Slave"",""Year"":""2013"",""Rat...",8.1/10,95%,96/100,"$50,628,650"
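
The scores are still strings in three different formats ('7.3/10', '68%', '70/100'), and the box office value carries a dollar sign and thousands separators. One possible next cleaning step, sketched under the assumption that these are the only formats present, is to convert them to numbers. `score_to_fraction` and `dollars_to_int` are hypothetical helper names.

```python
import math


def score_to_fraction(value):
    """Convert '7.3/10', '70/100' or '68%' to a float in [0, 1]; NaN otherwise."""
    if not isinstance(value, str):
        return math.nan
    value = value.strip()
    try:
        if value.endswith('%'):
            return float(value[:-1]) / 100
        if '/' in value:
            numerator, denominator = value.split('/', 1)
            return float(numerator) / float(denominator)
    except (ValueError, ZeroDivisionError):
        pass
    return math.nan


def dollars_to_int(value):
    """Convert '$50,628,650' to 50628650; NaN for missing or 'N/A' values."""
    if not isinstance(value, str) or not value.startswith('$'):
        return math.nan
    try:
        return int(value[1:].replace(',', ''))
    except ValueError:
        return math.nan


print(score_to_fraction('68%'), dollars_to_int('$50,628,650'))
```

Putting all three score sources on the same 0 to 1 scale makes them directly comparable; each helper could be applied column-wise with `Series.map`.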


In [233]:
df_scores1[df_scores1['RT_score'].isna()]

Unnamed: 0,titles,info,IMDb_score,RT_score,Meta_score,box_office
9,187,"{""Title"":""187"",""Year"":""2016"",""Rated"":""N/A"",""Re...",,,,
17,50-50,"{""Title"":""50-50"",""Year"":""2011"",""Rated"":""N/A"",""...",6.1/10,,,
28,"Abyss, The","{""Title"":""Abyss: The Greatest Proposal Ever"",""...",,,,
30,Adaptation,"{""Title"":""Adaptation"",""Year"":""2019–"",""Rated"":""...",,,,
31,"Addams Family, The","{""Response"":""False"",""Error"":""Movie not found!""}",,,,
...,...,...,...,...,...,...
484,"Happy Birthday, Wanda June","{""Title"":""Happy Birthday, Wanda June"",""Year"":""...",6.1/10,,,
488,Harold and Kumar Go to White Castle,"{""Response"":""False"",""Error"":""Movie not found!""}",,,,
489,"Haunting, The","{""Title"":""Paranormal Haunting: The Curse of th...",2.0/10,,,
495,"Hebrew Hammer, The","{""Response"":""False"",""Error"":""Movie not found!""}",,,,


In [214]:
for i in df_scores1[df_scores1['IMDb_score'].isna()]['titles']:
    if re.search(':', i):
        print(i)

Airplane 2: The Sequel
American Shaolin: King of Kickboxers II
Boondock Saints 2: All Saints Day
Crow: City of Angels, The
Evil Dead II: Dead by Dawn
Hellboy 2: The Golden Army


In [216]:
list(df_scores1[df_scores1['IMDb_score'].isna()]['titles'])

['187',
 'Abyss, The',
 'Adaptation',
 'Addams Family, The',
 'Adjustment Bureau, The',
 'Adventures of Buckaroo Banzai Across the Eighth Dimension, The',
 'Airplane 2: The Sequel',
 'American President, The',
 'American Shaolin: King of Kickboxers II',
 'Amityville Asylum, The',
 'Anniversary Party, The',
 'Apartment, The',
 'Avengers, The',
 'Avengers, The (2012)',
 "Avventura, L' (The Adventure)",
 'Bachelor Party, The',
 'Back-up Plan, The',
 'Batman 2',
 'Battle of Algiers, The',
 'Battle of Shaker Heights, The',
 'Believer, The',
 'Best Exotic Marigold Hotel, The',
 'Big Blue, The',
 'Big Lebowski, The',
 'Big Sick, The',
 'Big White, The',
 'Black Dahlia, The',
 'Blast from the Past, The',
 'Blind Side, The',
 'Bling Ring, The',
 'Book of Eli, The',
 'Boondock Saints 2: All Saints Day',
 'Boondock Saints, The',
 'Bourne Supremacy, The',
 'Boxtrolls, The',
 'Breakfast Club, The',
 'Brothers Bloom, The',
 'Change-Up, The',
 'Cincinnati Kid, The',
 'Cooler, The',
 'Crow Salvation, 