## Web scraping: Movies

#### Scrape information from boxofficemojo.com to collect a list of features for the [top 1000 movies by worldwide box office](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/).

#### I start out by scraping the first page, which is limited to just the top 200 movies.

In [586]:
from bs4 import BeautifulSoup
import requests

In [587]:
url = 'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/'

response = requests.get(url)
response.status_code

200

In [588]:
page = response.text
soup = BeautifulSoup(page)
print(soup)  # use soup.prettify() instead for indented output

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<title dir="ltr">Top Lifetime Grosses - Box Office Mojo</title><meta content="Top Lifetime Grosses" name="title"/>
<meta content="Box Office Mojo" property="og:site_name"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://m.media-amazon.com/images/G/01/boxofficemojo/v2/favicon._CB448965889_.ico" rel="icon" type="image/x-icon"/>
<link href="https://images-na.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|012LjolmrML.css,51AZ-Jz5kmL.css,51IB+wfP8qL.css,01evdoiemkL.css,01K+Ps1DeEL.css,01Vctty9pOL.css,314djKvMsUL.css,01ZTetsDh7L.css,11cMnOipjJL.css,01pbA9Lg3yL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,11KLBtpWIAL.css,11nWWh1kQdL.css,11M4Xw

In [589]:
table = soup.find('table')
table

<table class="a-bordered a-horizontal-stripes a-size-base a-span12 mojo-body-table mojo-table-annotated"><tr><th class="a-text-right mojo-field-type-rank a-nowrap"><span title="Rank">Rank</span>
</th><th class="a-text-left mojo-field-type-title a-nowrap"><span title="Title">Title</span>
</th><th class="a-text-right mojo-field-type-money a-nowrap"><span title="Worldwide Lifetime Gross">Worldwide Lifetime Gross</span>
</th><th class="a-text-right mojo-field-type-money a-nowrap"><span title="Domestic Lifetime Gross">Domestic Lifetime Gross</span>
</th><th class="a-text-right mojo-field-type-percent a-nowrap"><span title="Domestic %">Domestic %</span>
</th><th class="a-text-right mojo-field-type-money a-nowrap"><span title="Foreign Lifetime Gross">Foreign Lifetime Gross</span>
</th><th class="a-text-right mojo-field-type-percent a-nowrap"><span title="Foreign %">Foreign %</span>
</th><th class="a-text-left mojo-field-type-year a-nowrap"><span title="Year">Year</span>
</th></tr><tr><td clas

In [590]:
rows = table.find_all('tr')  # first row is the header, the rest are movies
rows[1]

<tr><td class="a-text-right mojo-header-column mojo-truncate mojo-field-type-rank">1</td><td class="a-text-left mojo-field-type-title"><a class="a-link-normal" href="/title/tt4154796/?ref_=bo_cso_table_1">Avengers: Endgame</a></td><td class="a-text-right mojo-field-type-money">$2,797,800,564</td><td class="a-text-right mojo-field-type-money">$858,373,000</td><td class="a-text-right mojo-field-type-percent">30.7%</td><td class="a-text-right mojo-field-type-money">$1,939,427,564</td><td class="a-text-right mojo-field-type-percent">69.3%</td><td class="a-text-left mojo-field-type-year"><a class="a-link-normal" href="/year/2019/?ref_=bo_cso_table_1">2019</a></td></tr>

In [591]:
rows[1].find_all('td')

[<td class="a-text-right mojo-header-column mojo-truncate mojo-field-type-rank">1</td>,
 <td class="a-text-left mojo-field-type-title"><a class="a-link-normal" href="/title/tt4154796/?ref_=bo_cso_table_1">Avengers: Endgame</a></td>,
 <td class="a-text-right mojo-field-type-money">$2,797,800,564</td>,
 <td class="a-text-right mojo-field-type-money">$858,373,000</td>,
 <td class="a-text-right mojo-field-type-percent">30.7%</td>,
 <td class="a-text-right mojo-field-type-money">$1,939,427,564</td>,
 <td class="a-text-right mojo-field-type-percent">69.3%</td>,
 <td class="a-text-left mojo-field-type-year"><a class="a-link-normal" href="/year/2019/?ref_=bo_cso_table_1">2019</a></td>]

In [592]:
rows[1].find_all('td')[1].find('a')['href']

'/title/tt4154796/?ref_=bo_cso_table_1'

In [593]:
rows[1].find('a')

<a class="a-link-normal" href="/title/tt4154796/?ref_=bo_cso_table_1">Avengers: Endgame</a>

In [594]:
top_movies_test = {}

for row in rows[1:201]:
    items = row.find_all('td')
    link_class = items[1].find('a')
    title = link_class.text
    slug = link_class['href']
    top_movies_test[title] = [slug] + [i.text for i in items]
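
One thing to watch in the loop above: the dictionary is keyed by title, so two rows sharing a title would overwrite each other. A minimal sketch of an alternative key follows, assuming every slug has the form `/title/tt<digits>/...` as in the rows shown earlier. The helper `title_id_from_slug` and the sample rows are hypothetical and only illustrate the idea.

```python
import re

# Assumption: every slug looks like '/title/tt<digits>/?ref_=...', as in the rows above.
TITLE_ID = re.compile(r'/title/(tt\d+)/')

def title_id_from_slug(slug):
    """Return the 'tt...' identifier embedded in a slug, or None if absent (hypothetical helper)."""
    match = TITLE_ID.search(slug)
    return match.group(1) if match else None

# Two made-up rows that share a title but point at different pages.
sample_rows = [
    ('Example Remake', '/title/tt0000001/?ref_=bo_cso_table_10'),
    ('Example Remake', '/title/tt0000002/?ref_=bo_cso_table_20'),
]

by_title = {title: slug for title, slug in sample_rows}   # second row replaces the first
by_id = {title_id_from_slug(slug): (title, slug) for title, slug in sample_rows}

print(len(by_title), len(by_id))
```

Keying by the identifier keeps both rows, while keying by title keeps only the last one.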

In [595]:
import pandas as pd

In [596]:
worldwide_movies = pd.DataFrame(top_movies_test).T
worldwide_movies.columns = ['url slug', 'ranking', 'title', 'worldwide gross', 'domestic gross', 'domestic %', 'foreign gross', 'foreign %', 'year']
worldwide_movies  # double-check that all 200 movies from the first page are present

Unnamed: 0,url slug,ranking,title,worldwide gross,domestic gross,domestic %,foreign gross,foreign %,year
Avengers: Endgame,/title/tt4154796/?ref_=bo_cso_table_1,1,Avengers: Endgame,"$2,797,800,564","$858,373,000",30.7%,"$1,939,427,564",69.3%,2019
Avatar,/title/tt0499549/?ref_=bo_cso_table_2,2,Avatar,"$2,790,439,092","$760,507,625",27.2%,"$2,029,931,467",72.8%,2009
Titanic,/title/tt0120338/?ref_=bo_cso_table_3,3,Titanic,"$2,195,169,869","$659,363,944",30%,"$1,535,805,925",70%,1997
Star Wars: Episode VII - The Force Awakens,/title/tt2488496/?ref_=bo_cso_table_4,4,Star Wars: Episode VII - The Force Awakens,"$2,068,224,036","$936,662,225",45.3%,"$1,131,561,811",54.7%,2015
Avengers: Infinity War,/title/tt4154756/?ref_=bo_cso_table_5,5,Avengers: Infinity War,"$2,048,359,754","$678,815,482",33.1%,"$1,369,544,272",66.9%,2018
...,...,...,...,...,...,...,...,...,...
The Revenant,/title/tt1663202/?ref_=bo_cso_table_196,196,The Revenant,"$532,950,503","$183,637,894",34.5%,"$349,312,609",65.5%,2015
The Meg,/title/tt4779682/?ref_=bo_cso_table_197,197,The Meg,"$530,259,473","$145,443,742",27.4%,"$384,815,731",72.6%,2016
Ralph Breaks the Internet,/title/tt5848272/?ref_=bo_cso_table_198,198,Ralph Breaks the Internet,"$529,323,962","$201,091,711",38%,"$328,232,251",62%,2018
Hotel Transylvania 3: Summer Vacation,/title/tt5220122/?ref_=bo_cso_table_199,199,Hotel Transylvania 3: Summer Vacation,"$528,583,774","$167,510,016",31.7%,"$361,073,758",68.3%,2018


I now repeat the process above, condensed into a single function, for movies 201-1000 on the worldwide box office list.

In [597]:
def gather_movies(url):
    '''Repeat the process above in one step and return a dictionary of movies keyed by title.'''
        
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    
    table = soup.find('table')
    rows = [row for row in table.find_all('tr')]

    top_movies = {}
    
    for row in rows[1:201]:
        items = row.find_all('td')
        link_class = items[1].find('a')
        title = link_class.text
        slug = link_class['href']
        top_movies[title] = [slug] + [i.text for i in items]
        
    return top_movies

#Create 5 separate dictionaries based off the 5 Box Office Mojo web pages of movies with Top Lifetime Grosses

dict1_to_200 = gather_movies('https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/')
dict201_to_400 = gather_movies('https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=200')
dict401_to_600 = gather_movies('https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=400')
dict601_to_800 = gather_movies('https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=600')
dict801_to_1000 = gather_movies('https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=800')
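
The five URLs above differ only in their `offset` value, so they could also be generated rather than typed out. The sketch below assumes the chart keeps serving 200 movies per page. `chart_page_urls` is a hypothetical helper, not part of the original notebook.

```python
BASE_URL = 'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/'

def chart_page_urls(total=1000, page_size=200):
    """Build the chart URL for each page of `page_size` movies (hypothetical helper)."""
    urls = []
    for offset in range(0, total, page_size):
        # The first page is requested without a query string, as in the cell above.
        urls.append(BASE_URL if offset == 0 else f'{BASE_URL}?offset={offset}')
    return urls

for url in chart_page_urls():
    print(url)

# With gather_movies defined as above, the pages could then be combined in one pass:
# all_movies = {}
# for url in chart_page_urls():
#     all_movies.update(gather_movies(url))
```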

In [598]:
#Combine the 5 dictionaries just created into a single dictionary.

dict1_to_1000 = {**dict1_to_200, **dict201_to_400, **dict401_to_600, **dict601_to_800, **dict801_to_1000}

In [599]:
#Turn that newly-created giant dictionary into a data frame.

top1000 = pd.DataFrame(dict1_to_1000).T
top1000.columns = ['url_slug', 'ranking', 'title', 'worldwide_gross', 'domestic_gross', 'domestic_%', 'foreign_gross', 'foreign_%', 'year']
top1000

Unnamed: 0,url_slug,ranking,title,worldwide_gross,domestic_gross,domestic_%,foreign_gross,foreign_%,year
Avengers: Endgame,/title/tt4154796/?ref_=bo_cso_table_1,1,Avengers: Endgame,"$2,797,800,564","$858,373,000",30.7%,"$1,939,427,564",69.3%,2019
Avatar,/title/tt0499549/?ref_=bo_cso_table_2,2,Avatar,"$2,790,439,092","$760,507,625",27.2%,"$2,029,931,467",72.8%,2009
Titanic,/title/tt0120338/?ref_=bo_cso_table_3,3,Titanic,"$2,195,169,869","$659,363,944",30%,"$1,535,805,925",70%,1997
Star Wars: Episode VII - The Force Awakens,/title/tt2488496/?ref_=bo_cso_table_4,4,Star Wars: Episode VII - The Force Awakens,"$2,068,224,036","$936,662,225",45.3%,"$1,131,561,811",54.7%,2015
Avengers: Infinity War,/title/tt4154756/?ref_=bo_cso_table_5,5,Avengers: Infinity War,"$2,048,359,754","$678,815,482",33.1%,"$1,369,544,272",66.9%,2018
...,...,...,...,...,...,...,...,...,...
Hellboy II: The Golden Army,/title/tt0411477/?ref_=bo_cso_table_195,1010,Hellboy II: The Golden Army,"$168,319,243","$75,986,503",45.1%,"$92,332,740",54.9%,2008
Insidious: The Last Key,/title/tt5726086/?ref_=bo_cso_table_196,1011,Insidious: The Last Key,"$167,885,588","$67,745,330",40.4%,"$100,140,258",59.6%,2018
Unstoppable,/title/tt0477080/?ref_=bo_cso_table_198,1013,Unstoppable,"$167,805,466","$81,562,942",48.6%,"$86,242,524",51.4%,2010
Three Men and a Baby,/title/tt0094137/?ref_=bo_cso_table_199,1014,Three Men and a Baby,"$167,780,960","$167,780,960",100%,$0,-,1987


#### I'm removing the columns that I don't care about:
- **title** is already in the index
- **ranking** is not consistent in the table; some numbers are skipped for no apparent reason. I can sort the dataframe by **worldwide_gross** at any point instead.
- **foreign_gross**, **foreign_%**, and **domestic_%** are all components of, or derived from, **worldwide_gross**, the dependent variable I'm trying to predict.

In [600]:
top1000 = top1000.drop(['ranking','title','domestic_%','foreign_gross','foreign_%'], axis=1)
top1000

Unnamed: 0,url_slug,worldwide_gross,domestic_gross,year
Avengers: Endgame,/title/tt4154796/?ref_=bo_cso_table_1,"$2,797,800,564","$858,373,000",2019
Avatar,/title/tt0499549/?ref_=bo_cso_table_2,"$2,790,439,092","$760,507,625",2009
Titanic,/title/tt0120338/?ref_=bo_cso_table_3,"$2,195,169,869","$659,363,944",1997
Star Wars: Episode VII - The Force Awakens,/title/tt2488496/?ref_=bo_cso_table_4,"$2,068,224,036","$936,662,225",2015
Avengers: Infinity War,/title/tt4154756/?ref_=bo_cso_table_5,"$2,048,359,754","$678,815,482",2018
...,...,...,...,...
Hellboy II: The Golden Army,/title/tt0411477/?ref_=bo_cso_table_195,"$168,319,243","$75,986,503",2008
Insidious: The Last Key,/title/tt5726086/?ref_=bo_cso_table_196,"$167,885,588","$67,745,330",2018
Unstoppable,/title/tt0477080/?ref_=bo_cso_table_198,"$167,805,466","$81,562,942",2010
Three Men and a Baby,/title/tt0094137/?ref_=bo_cso_table_199,"$167,780,960","$167,780,960",1987


#### I now want to add columns from information that exists on the individual page of each movie and not in the main table:
- Domestic Distributor
- Budget
- Release Date
- MPAA rating
- Running Time
- Genres
- (and Title, to eventually match with the existing Title in the index)

#### The following cells experiment with the Avengers: Endgame page to confirm how each field can be scraped.

In [601]:
url_avengers = 'https://www.boxofficemojo.com/title/tt4154796/?ref_=bo_cso_table_1'

#Request HTML and parse
response_avengers = requests.get(url_avengers)
page_avengers = response_avengers.text
soup = BeautifulSoup(page_avengers)

In [602]:
#Title
soup.find('title').text.split('-')[0].strip()

'Avengers: Endgame'
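
Note that `split('-')[0]` keeps only the text before the first hyphen, so a title such as "Star Wars: Episode VII - The Force Awakens" would come back as "Star Wars: Episode VII". A minimal sketch of a stricter alternative follows. It assumes every page title ends with the suffix " - Box Office Mojo", as the chart page's title does. `clean_page_title` is a hypothetical helper.

```python
SITE_SUFFIX = ' - Box Office Mojo'  # assumed suffix of every <title> tag on the site

def clean_page_title(raw_title):
    """Strip the site suffix while keeping hyphens inside the movie title (hypothetical helper)."""
    raw_title = raw_title.strip()
    if raw_title.endswith(SITE_SUFFIX):
        raw_title = raw_title[:-len(SITE_SUFFIX)]
    return raw_title.strip()

print(clean_page_title('Avengers: Endgame - Box Office Mojo'))
print(clean_page_title('Star Wars: Episode VII - The Force Awakens - Box Office Mojo'))
```

If the suffix is missing, the function returns the stripped input unchanged rather than guessing where the title ends.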

In [603]:
#Distributor
soup.find(text='Domestic Distributor').findNext().text.split('See')[0]

'Walt Disney Studios Motion Pictures'

In [604]:
#Budget
soup.find(text='Budget').findNext(class_='money').text.replace('$','').replace(',','')

'356000000'

In [605]:
#Release Month
soup.find(text='Earliest Release Date').findNext().text.split()[0]

'April'

In [606]:
#MPAA Rating
soup.find(text='MPAA').findNext().text

'PG-13'

In [607]:
#Runtime in Minutes
rt = soup.find(text='Running Time').findNext().text.split()
int(rt[0]) * 60 + int(rt[2])

181
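
The positional lookup `rt[2]` raises an IndexError whenever the running-time string has fewer than three tokens. A sketch of a more tolerant parser follows, assuming the page writes running times in the form "3 hr 1 min" and that either part may be absent. `runtime_to_minutes` is a hypothetical helper.

```python
import re

def runtime_to_minutes(text):
    """Convert strings such as '3 hr 1 min' to minutes; return None if no number is found."""
    hours = re.search(r'(\d+)\s*hr', text)
    minutes = re.search(r'(\d+)\s*min', text)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

print(runtime_to_minutes('3 hr 1 min'))
print(runtime_to_minutes('2 hr'))
```

Returning `None` instead of raising lets a caller keep the rest of a movie's details when only the running time is unusable.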

In [608]:
#Genre
soup.find(text='Genres').findNext().text.split()

['Action', 'Adventure', 'Drama', 'Sci-Fi']

#### I'm now going to create a function to find all those elements at once.

In [609]:
def get_movie_details(slug):
    '''
    Gather additional movie details from the Box Office Mojo individual movie pages.
    Return as a dictionary.
    '''
    
    #Start with the main BOM URL that all the subpages live under
    domain = 'https://www.boxofficemojo.com'
    
    #Create full URL to scrape
    url = domain + slug
    
    #Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    
    column_headers = ['movie_title', 'distributor', 'budget', 'release_month', 'rating', 'runtime','genre']
    
    #Copy formulas I used for Avengers: Endgame above
    
    #Title
    movie_title = soup.find('title').text.split('-')[0].strip()
    #Distributor
    distributor = soup.find(text='Domestic Distributor').findNext().text.split('See')[0]
    #Budget
    budget = soup.find(text='Budget').findNext(class_='money').text.replace('$','').replace(',','')
    #Release Month
    release_month = soup.find(text='Earliest Release Date').findNext().text.split()[0]
    #MPAA rating
    rating = soup.find(text='MPAA').findNext().text
    #Runtime in minutes
    rt = soup.find(text='Running Time').findNext().text.split()
    runtime = int(rt[0]) * 60 + int(rt[2])
    #Genre
    genre = soup.find(text='Genres').findNext().text.split()
    
    #Create dictionary with all elements above
    movie_details = dict(zip(column_headers, [movie_title, distributor, budget, release_month, rating, runtime, genre]))
    
    return movie_details

#### I now want to use the function above to create a dictionary for the movie page details.

- Some of the movies were throwing AttributeErrors because not all the page data was available. For example, Avengers: Infinity War didn't have a budget listed.
- Additionally, some lookups on fields that definitely existed were throwing IndexErrors, so I wrapped each call in a try/except. The loop then runs through the whole list without stopping, and any movie that raises either error is skipped.

In [610]:
top1000_info = []

for slug in top1000.url_slug[0:500]:
    try:
        top1000_info.append(get_movie_details(slug))
    except (AttributeError, IndexError):
        pass

In [611]:
for slug in top1000.url_slug[500:]:  # start at 500 so the movie at position 500 is not skipped
    try:
        top1000_info.append(get_movie_details(slug))
    except (AttributeError, IndexError):
        pass
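
Because the loops above use `pass`, there is no record of which movies were dropped. The sketch below is one possible variation that keeps the failing slugs for later inspection. `collect_details` and `fake_fetch` are hypothetical; `fake_fetch` only stands in for `get_movie_details` so the example runs without network access.

```python
def collect_details(slugs, fetch):
    """Call fetch(slug) for each slug; return (details, failed_slugs)."""
    details, failed = [], []
    for slug in slugs:
        try:
            details.append(fetch(slug))
        except (AttributeError, IndexError) as err:
            # Keep the slug and the error type so skipped movies can be inspected later.
            failed.append((slug, type(err).__name__))
    return details, failed

# Stand-in for get_movie_details so the sketch runs offline.
def fake_fetch(slug):
    if 'missing' in slug:
        raise AttributeError('field not found on page')
    return {'movie_title': slug}

details, failed = collect_details(['/title/a/', '/title/missing/', '/title/b/'], fake_fetch)
print(len(details), failed)
```

With the real function, the call would presumably look like `collect_details(top1000.url_slug, get_movie_details)`.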

In [612]:
top1000_info

[{'movie_title': 'Avengers: Endgame',
  'distributor': 'Walt Disney Studios Motion Pictures',
  'budget': '356000000',
  'release_month': 'April',
  'rating': 'PG-13',
  'runtime': 181,
  'genre': ['Action', 'Adventure', 'Drama', 'Sci-Fi']},
 {'movie_title': 'Avatar',
  'distributor': 'Twentieth Century Fox',
  'budget': '237000000',
  'release_month': 'December',
  'rating': 'PG-13',
  'runtime': 162,
  'genre': ['Action', 'Adventure', 'Fantasy', 'Sci-Fi']},
 {'movie_title': 'Titanic',
  'distributor': 'Paramount Pictures',
  'budget': '200000000',
  'release_month': 'December',
  'rating': 'PG-13',
  'runtime': 194,
  'genre': ['Drama', 'Romance']},
 {'movie_title': 'Star Wars: Episode VII',
  'distributor': 'Walt Disney Studios Motion Pictures',
  'budget': '245000000',
  'release_month': 'December',
  'rating': 'PG-13',
  'runtime': 138,
  'genre': ['Action', 'Adventure', 'Sci-Fi']},
 {'movie_title': 'Jurassic World',
  'distributor': 'Universal Pictures',
  'budget': '150000000',


In [613]:
top1000_info_df = pd.DataFrame(top1000_info)
top1000_info_df.set_index('movie_title', inplace=True)

top1000_info_df

movie_title,distributor,budget,release_month,rating,runtime,genre
Avengers: Endgame,Walt Disney Studios Motion Pictures,356000000,April,PG-13,181,"[Action, Adventure, Drama, Sci-Fi]"
Avatar,Twentieth Century Fox,237000000,December,PG-13,162,"[Action, Adventure, Fantasy, Sci-Fi]"
Titanic,Paramount Pictures,200000000,December,PG-13,194,"[Drama, Romance]"
Star Wars: Episode VII,Walt Disney Studios Motion Pictures,245000000,December,PG-13,138,"[Action, Adventure, Sci-Fi]"
Jurassic World,Universal Pictures,150000000,June,PG-13,124,"[Action, Adventure, Sci-Fi]"
...,...,...,...,...,...,...
Seven Pounds,Sony Pictures Entertainment (SPE),55000000,December,PG-13,123,[Drama]
Dodgeball,Twentieth Century Fox,20000000,June,PG-13,92,"[Comedy, Sport]"
Insidious: The Last Key,Universal Pictures,10000000,January,PG-13,103,"[Horror, Mystery, Thriller]"
Unstoppable,Twentieth Century Fox,100000000,November,PG-13,98,"[Action, Thriller]"


#### I now want to merge the two dataframes.

In [614]:
top1000 = pd.merge(top1000, top1000_info_df, left_index=True, right_index=True)
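
`pd.merge` performs an inner join by default, so any chart title without an exactly matching scraped title drops out of the result. A small check along these lines can list the titles that would be lost. It is only a sketch using plain Python sets, and `unmatched_titles` is a hypothetical helper.

```python
def unmatched_titles(chart_titles, page_titles):
    """Return chart titles with no exact match among the scraped page titles, in sorted order."""
    return sorted(set(chart_titles) - set(page_titles))

# Titles taken from the outputs shown earlier in this notebook.
chart_titles = ['Avengers: Endgame', 'Star Wars: Episode VII - The Force Awakens']
page_titles = ['Avengers: Endgame', 'Star Wars: Episode VII']

print(unmatched_titles(chart_titles, page_titles))
```

Run before the merge cell, the inputs would be the two dataframe indexes, `top1000.index` and `top1000_info_df.index`.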

In [615]:
top1000.head()

Unnamed: 0,url_slug,worldwide_gross,domestic_gross,year,distributor,budget,release_month,rating,runtime,genre
"10,000 BC",/title/tt0443649/?ref_=bo_cso_table_161,"$269,784,201","$94,784,201",2008,Warner Bros.,105000000,March,PG-13,109,"[Action, Adventure, Drama, Fantasy, History]"
12 Years a Slave,/title/tt2024544/?ref_=bo_cso_table_65,"$187,733,202","$56,671,993",2013,Fox Searchlight Pictures,20000000,October,R,134,"[Biography, Drama, History]"
1917,/title/tt8579674/?ref_=bo_cso_table_125,"$384,788,959","$159,227,644",2019,Universal Pictures,95000000,December,R,119,"[Drama, War]"
2 Fast 2 Furious,/title/tt0322259/?ref_=bo_cso_table_60,"$236,350,661","$127,154,901",2003,Universal Pictures,76000000,June,PG-13,107,"[Action, Crime, Thriller]"
2012,/title/tt1190080/?ref_=bo_cso_table_93,"$791,217,826","$166,112,167",2009,Sony Pictures Entertainment (SPE),200000000,November,PG-13,158,"[Action, Adventure, Sci-Fi]"


#### Additional cleaning is required to convert all the dollar amounts and the release months and years to numerical values.

In [616]:
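# regex=False treats "$" as a literal character rather than the regex end-of-string anchor
top1000['worldwide_gross'] = top1000['worldwide_gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
top1000['domestic_gross'] = top1000['domestic_gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False)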

In [617]:
top1000['worldwide_gross'] = pd.to_numeric(top1000['worldwide_gross'])
top1000['domestic_gross'] = pd.to_numeric(top1000['domestic_gross'])
top1000['year'] = pd.to_numeric(top1000['year'])
top1000['budget'] = pd.to_numeric(top1000['budget'])

months = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
top1000.release_month = top1000.release_month.map(months)
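
The same cleaning steps can be sketched on single values, which makes them easy to test in isolation. `parse_money` is a hypothetical helper, and `MONTH_NUMBERS` is an alternative to the hand-written `months` dictionary that assumes the default English locale.

```python
import calendar

def parse_money(text):
    """Convert a string such as '$2,797,800,564' to an int."""
    return int(text.replace('$', '').replace(',', ''))

# calendar.month_name holds '' at index 0, then 'January'..'December' in the default English locale.
MONTH_NUMBERS = {name: number for number, name in enumerate(calendar.month_name) if name}

print(parse_money('$2,797,800,564'), MONTH_NUMBERS['April'])
```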

In [618]:
top1000.sort_values('worldwide_gross', ascending=False, inplace=True)
top1000.head()

Unnamed: 0,url_slug,worldwide_gross,domestic_gross,year,distributor,budget,release_month,rating,runtime,genre
Avengers: Endgame,/title/tt4154796/?ref_=bo_cso_table_1,2797800564,858373000,2019,Walt Disney Studios Motion Pictures,356000000,4,PG-13,181,"[Action, Adventure, Drama, Sci-Fi]"
Avatar,/title/tt0499549/?ref_=bo_cso_table_2,2790439092,760507625,2009,Twentieth Century Fox,237000000,12,PG-13,162,"[Action, Adventure, Fantasy, Sci-Fi]"
Titanic,/title/tt0120338/?ref_=bo_cso_table_3,2195169869,659363944,1997,Paramount Pictures,200000000,12,PG-13,194,"[Drama, Romance]"
Jurassic World,/title/tt0369610/?ref_=bo_cso_table_6,1670401444,652270625,2015,Universal Pictures,150000000,6,PG-13,124,"[Action, Adventure, Sci-Fi]"
The Avengers,/title/tt0848228/?ref_=bo_cso_table_8,1518815515,623357910,2012,Walt Disney Studios Motion Pictures,220000000,4,PG-13,143,"[Action, Adventure, Sci-Fi]"


#### I save the dataframe as a pickle file to use in another notebook for my regression analysis.

In [619]:
import pickle  # optional: DataFrame.to_pickle below is a pandas method and does not need this import

In [620]:
top1000.to_pickle('top1000.pkl')