# Movie Analysis: Obtaining Data 

## About:
In this notebook I import the external datasets I gathered, load them into tables to sanity check them, and then save them as CSV files for further data scrubbing:

1. Money
2. Actors
3. Keywords

In [1]:
# imports for entire data gathering phase
import pandas as pd 
from bs4 import BeautifulSoup
import os

## 1. Money:
This dataset was scraped from the IMDbPro website. It sits behind authentication (really good authentication; the site is owned by Amazon.com), and even though I have an account, I was unable to set up a normal scraping script.

I ended up logging into the site and running a script in the developer console to scroll to the bottom (the content is lazy loaded, and scrolling down by hand for an hour was not going to happen), then saving the page content once it reached the bottom.

In [2]:
def read_html(file_path):
    """Takes a saved html page and returns the result items that hold all the targeted info
    
    Arguments:
        file_path (string): path to find the saved html page

    Returns:
        result (bs4.element.ResultSet): or None if no results were found
    """
    # parse the saved html page
    with open(file_path, 'r') as f_html:
        html = BeautifulSoup(f_html)
    # locate the results container; bail out early if the page has no results list
    container = html.find('div', id='results')
    if container is None or container.ul is None:
        return None
    # pull out only the top-level <li> elements we want
    result = container.ul.find_all('li', recursive=False)
    return result if result else None

In [3]:
# setup paths for importing raw html
movie_html_a_path = os.path.join(os.pardir, 'data', 'external', 'movie_money_1m_plus.htm')
movie_html_b_path = os.path.join(os.pardir, 'data', 'external', 'movie_money_1m_minus.htm')

In [4]:
money_a = read_html(movie_html_a_path)
money_b = read_html(movie_html_b_path)

In [5]:
# sanity check the results
print('type of container 1: {},\ntype of container 2: {}'.format(type(money_a), type(money_b)))
print('size of container 1: {},\nsize of container 2: {}'.format(len(money_a), len(money_b)))

type of container 1: <class 'bs4.element.ResultSet'>,
type of container 2: <class 'bs4.element.ResultSet'>
size of container 1: 7851,
size of container 2: 6851


In [6]:
# if they both look okay, go ahead and join them, then check the new length
money_a.extend(money_b)
len(money_a)

14702

In [7]:
# container parser
def money_parser(soup):
    """Parses through an array of soup objects and takes out the relevant info
    
    Arguments:
        soup (bs4.element.ResultSet): Chunk to search through
    
    Returns:
        results (List): A list of dictionaries 
    """
    results = []
    for title in soup:
        result = {
            'imdb_id': title.find('span', class_='display-title').a.get('href')[27:36] if title.find('span', class_='display-title') else None,
            'title': title.find('span', class_='display-title').a.get_text() if title.find('span', class_='display-title') else None,
            'year': title.find('span', class_='year').get_text()[1:-1] if title.find('span', class_='year') else None,
            'director': title.find('span', class_='display-name').a.get_text() if title.find('span', class_='display-name') else None,
            'production_co': title.find('span', class_='display-company').a.get_text() if title.find('span', class_='display-company') else None,
            'region_code': title.find('span', class_='region_code').get_text().strip() if title.find('span', class_='region_code') else None,
            'rank': title.find('span', class_='ranking').get_text().strip() if title.find('span', class_='ranking') else None,
            'budget_usd': title.find('span', class_='budget_usd').get_text().strip() if title.find('span', class_='budget_usd') else None,
            'us_gross': title.find('span', class_='us_gross').get_text().strip() if title.find('span', class_='us_gross') else None
        }
        results.append(result)
    return results

In [8]:
parsed_money = money_parser(money_a)

### Explore and make sure we have what we were looking for

In [9]:
# load up new dataset with the parsed info
money_test_df = pd.DataFrame.from_dict(parsed_money)
money_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14702 entries, 0 to 14701
Data columns (total 9 columns):
imdb_id          14700 non-null object
title            14700 non-null object
year             14685 non-null object
director         14671 non-null object
production_co    14345 non-null object
region_code      13447 non-null object
rank             14700 non-null object
budget_usd       14700 non-null object
us_gross         14700 non-null object
dtypes: object(9)
memory usage: 1.0+ MB


In [10]:
# look at the description
money_test_df.describe()

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
count,14700,14700,14685,14671,14345,13447,14700.0,14700.0,14700
unique,14560,14204,105,6952,6232,90,14549.0,807.0,1370
top,tt0041959,Hamlet,2017,Woody Allen,Universal Pictures,[US],,,$1.1MM
freq,2,4,604,42,400,8886,12.0,6704.0,145


In [11]:
# look at some samples
money_test_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
10201,tt2024519,The Broken Circle Breakdown,2012,Felix van Groeningen,Menuet Producties,[BE],10053,,$175K
2721,tt0280460,The Banger Sisters,2002,Bob Dolman,Fox Searchlight Pictures,[US],19898,$10MM,$30MM
3220,tt0387514,Prime,2005,Ben Younger,Prime Film Productions LLC,[US],13311,$22MM,$23MM


In [12]:
# test for na values
money_test_df['us_gross'].isna().sum()

2

Looks like a pretty solid dataset. I will still have to format the currencies in the scrubbing phase. Let me save it to a CSV for the data scrubbing process:

In [13]:
# file out path. Going to put it in interim for now. After it's scrubbed it will live in processed.
out_path = os.path.join(os.pardir, 'data', 'interim', 'money.csv')
money_test_df.to_csv(out_path, index=False)

In [14]:
money_df = pd.read_csv(out_path)
money_df['imdb_id'].sample(50)

8131     tt6752848
9782     tt0120775
13037    tt6874254
2372     tt0079948
519      tt0120347
12043    tt0309600
9488     tt0476884
11719    tt3186318
723      tt1059786
8273     tt1385869
9461     tt6692354
9530     tt0097662
6602     tt0080116
4448     tt0160429
2290     tt0137523
12632    tt1230130
10383    tt1646975
14249    tt0100275
11238    tt0975684
6116     tt0102493
12398    tt0498351
3904     tt0055928
10871    tt3282712
1865     tt0057193
14277    tt6251666
6175     tt4080728
3762     tt0271367
11780    tt3451230
10982    tt2184331
8533     tt1139800
13605    tt7765120
11545    tt3142232
11891    tt0870089
14312    tt1603489
5050     tt0077275
14592    tt0120376
957      tt0077766
4209     tt2097307
11581    tt2023690
7132     tt5342904
113      tt0295297
11538    tt8844204
2232     tt0481536
11683    tt6764122
3463     tt0082558
13242    tt2085888
7292     tt1017456
6353     tt3416744
7624     tt0120865
976      tt0093565
Name: imdb_id, dtype: object
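
The currency formatting deferred to the scrubbing phase could look something like the sketch below. It is only an illustration, not the code used later in the project: `parse_money` is a hypothetical helper, and the suffix meanings (`K` for thousands, `MM` for millions) are assumptions inferred from sample values such as `$175K` and `$1.1MM`.

```python
import re

# Assumed multipliers, inferred from the sample values; not confirmed by the source site.
_MULTIPLIERS = {'': 1, 'K': 1000, 'MM': 1000000}
_MONEY_RE = re.compile(r'^\$([\d,]*\.?\d+)\s*(K|MM)?$')


def parse_money(value):
    """Convert a string such as '$1.1MM' to whole dollars (int).

    Returns None for missing values, empty strings, or anything
    that does not match the assumed '$<number><suffix>' pattern.
    """
    if not isinstance(value, str):
        return None
    match = _MONEY_RE.match(value.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    # round() hides float noise such as 1.1 * 1000000 = 1100000.0000000002
    return round(float(number.replace(',', '')) * _MULTIPLIERS[suffix or ''])


print(parse_money('$175K'))   # 175000
print(parse_money('$1.1MM'))  # 1100000
print(parse_money(''))        # None
```

Applied to the table, this would be something like `money_df['us_gross'].map(parse_money)`. Returning `None` for unparseable values keeps the empty budget strings visible as missing data instead of silently turning them into zeros.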

## 2. Actors:
Actors were also taken from the people section of the IMDbPro website, in the same manner.

In [15]:
# setup paths for importing raw html
actors_html_path = os.path.join(os.pardir, 'data', 'external', 'actors_0-10k.htm')
actors_html_b_path = os.path.join(os.pardir, 'data', 'external', 'actors_10k-20k.htm')
actors_html_c_path = os.path.join(os.pardir, 'data', 'external', 'actors_20k-30k.htm')

In [16]:
# use our function to grab results div
actors_a = read_html(actors_html_path)
actors_b = read_html(actors_html_b_path)
actors_c = read_html(actors_html_c_path)

In [17]:
# sanity check the results
print('type of container 1: {},\ntype of container 2: {},\ntype of container 3: {}'.format(type(actors_a), type(actors_b), type(actors_c)))
print('size of container 1: {},\nsize of container 2: {},\nsize of container 3: {}'.format(len(actors_a), len(actors_b), len(actors_c)))

type of container 1: <class 'bs4.element.ResultSet'>,
type of container 2: <class 'bs4.element.ResultSet'>,
type of container 3: <class 'bs4.element.ResultSet'>
size of container 1: 9726,
size of container 2: 851,
size of container 3: 626


Add them together:

In [18]:
actors_a.extend(actors_b)
actors_a.extend(actors_c)

Check the size:

In [19]:
len(actors_a)

11203

In [20]:
def parse_actors(soup):
    """Parses through an array of soup objects and takes out the relevant info
    
    Arguments:
        soup (bs4.element.ResultSet): Chunk to search through
    
    Returns:
        results (List): A list of dictionaries 
    """
    actors = []
    for actor_html in soup:
        actor = {
            'name': actor_html.find('span', class_='display-name').a.get_text() if actor_html.find('span', class_='display-name') else None,
            'year': actor_html.find('span', class_='year').get_text().strip() if actor_html.find('span', class_='year') else None,
            'rank': actor_html.find('span', class_='ranking').get_text().strip() if actor_html.find('span', class_='ranking') else None,
            'age': actor_html.find('span', class_='age_rank').get_text().strip() if actor_html.find('span', class_='age_rank') else None,
            'height': actor_html.find('span', class_='height').get_text().strip() if actor_html.find('span', class_='height') else None
        }
        actors.append(actor)
    return actors

In [21]:
parsed_actors = parse_actors(actors_a)

In [22]:
# load into df
actors_df = pd.DataFrame.from_dict(parsed_actors)

Checking out the result

In [23]:
actors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11203 entries, 0 to 11202
Data columns (total 5 columns):
name      11200 non-null object
year      11198 non-null object
rank      11200 non-null object
age       11200 non-null object
height    11200 non-null object
dtypes: object(5)
memory usage: 437.7+ KB


In [24]:
actors_df.sample(5)

Unnamed: 0,name,year,rank,age,height
1140,Teresa Ruiz,(2018-2020),1157,,"5' 3½"""
9294,Henry Simmons,(2014),9628,,"6' 4"""
3562,Cerina Vincent,(2002),3653,,"5' 7"""
4094,Chris Tucker,(2001),4202,48.0,"6' 1"""
6570,Yasiin Bey,(2003),6753,46.0,"5' 9"""


In [25]:
actors_df.describe()

Unnamed: 0,name,year,rank,age,height
count,11200,11198,11200,11200.0,11200.0
unique,10645,1347,10654,129.0,122.0
top,James Murray,(2019),1130,,
freq,3,512,2,3195.0,1455.0


In [26]:
actors_df.isna().sum()

name      3
year      5
rank      3
age       3
height    3
dtype: int64

Looks pretty good to me. Let's write it to a CSV and move on.

In [27]:
# file out path. Going to put it in interim for now. After it's scrubbed it will live in processed.
out_path = os.path.join(os.pardir, 'data', 'interim', 'actors.csv')
actors_df.to_csv(out_path, index=False)

In [28]:
actors_df = pd.read_csv(out_path)
actors_df.head()

Unnamed: 0,name,year,rank,age,height
0,Max von Sydow,(1929–2020),1,90.0,"6' 4"""
1,Ana de Armas,(2017),2,,"5' 6¼"""
2,Iliza Shlesinger,(2020),3,,"5' 5"""
3,Mark Wahlberg,(2010),4,48.0,"5' 8"""
4,Tom Hanks,(2000),5,63.0,"6' 0"""


## 3. Keywords
The keywords data was gathered from the TheMovieDB.com site using the script in src/data/keyword_builder.

The script creates a list of IMDb IDs and queries the API for each ID. If it finds the ID, it queries the API once more to get the list of keywords for that ID.

After the script runs through all the IDs, it exports the database to a CSV and saves it in /data/raw.
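
The two-step lookup described above can be sketched roughly as follows. This is not the actual contents of src/data/keyword_builder, which is not shown here. The function names, the `fetch` parameter, and the `api_key` argument are hypothetical, and the endpoint paths (`/find/{imdb_id}` and `/movie/{id}/keywords`) reflect my understanding of version 3 of the TMDb API, so check them against the current API docs before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for version 3 of the TMDb API
API_ROOT = 'https://api.themoviedb.org/3'


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def keywords_for_imdb_id(imdb_id, api_key, fetch=fetch_json):
    """Return the keyword names for an IMDb ID, or None if no movie matches."""
    # first request: look the movie up by its IMDb ID
    query = urllib.parse.urlencode({'api_key': api_key, 'external_source': 'imdb_id'})
    found = fetch('{}/find/{}?{}'.format(API_ROOT, imdb_id, query))
    movies = found.get('movie_results') or []
    if not movies:
        return None
    # second request: ask for the keywords of the matched movie
    query = urllib.parse.urlencode({'api_key': api_key})
    payload = fetch('{}/movie/{}/keywords?{}'.format(API_ROOT, movies[0]['id'], query))
    return [keyword['name'] for keyword in payload.get('keywords', [])]
```

Passing the fetcher in as an argument makes it possible to try the logic with a stub that returns canned dictionaries, without touching the network or needing an API key.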

## Finished
That is it for pulling in new data. We should have enough to answer our questions.