## Web Scraping & API for Microsoft Movie Project

## Overview

We reviewed the four sites recommended for the project: IMDB, Box Office Mojo, TMDB, and Rotten Tomatoes.  This initial review helped us formulate a better understanding of the available data and how we could obtain it from the websites.  We also started to discuss why Microsoft would want to enter the movie business and what information would be most relevant to them.

***Box Office Mojo Web Scraping***

We found that it was easy to scrape a table of the 1,000 highest grossing films of all time from Box Office Mojo.  Using BeautifulSoup, we were able to obtain key financial information as well as the IMDB ID for each movie on the list.

***TMDB API***

Then we used the IMDB IDs to make a series of API calls to TMDB to obtain details for each movie on the list, including genre, production companies, release date, etc.

Once we gathered this raw data into our Jupyter Notebook, we turned it into a dataframe and exported it as a CSV so we would have a copy of the data for future use.


In [28]:
import pandas as pd # data analysis
import requests # get url
from bs4 import BeautifulSoup as bs # data scraping
import matplotlib.pyplot as plt # Data visualisation
import datetime # Check week number
import time # Delay for iterated API requests
import config #config file to protect API key

## Web Scraping Iteration 1

Below is our first HTML grab from Box Office Mojo.  The website contains a list of the 1,000 highest grossing movies of all time spread over 5 different web pages.

In [3]:
url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW&ref_=bo_cso_ac"
r = requests.get(url)
soup = bs(r.content, "html.parser")
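print(soup.prettify)  # no parentheses: this prints the bound method's repr, which embeds the raw HTML; soup.prettify() would return indented markup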

<bound method Tag.prettify of <!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<title dir="ltr">Top Lifetime Grosses - Box Office Mojo</title><meta content="Top Lifetime Grosses" name="title"/>
<meta content="Box Office Mojo" property="og:site_name"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://m.media-amazon.com/images/G/01/boxofficemojo/v2/favicon._CB448965889_.ico" rel="icon" type="image/x-icon"/>
<link href="https://images-na.ssl-images-amazon.com/images/I/51tax7M48-L._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01VszOUTO6L.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.cs



The code block below takes the HTML 'soup' obtained from the Box Office Mojo webpage and pulls the specific data points we need for our analysis.  The data was organized in a table on the webpage, so we looped through the soup to find all the "tr" tags for table rows and all the "td" tags for the specific table elements.  We made a list of all the relevant information in each row.  We also isolated the IMDB ID, which was located in a URL for each movie.  We extracted each ID and appended it to the list for the movie it identified.  The result was a list of 200 lists, one for each movie.

We then repeated this same process for each of the 5 web pages to make 5 separate lists, each containing 200 lists of movie data.
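The row-parsing steps described above could be wrapped in one helper instead of being copied into five cells. The sketch below is only an illustration: `page_urls`, `parse_rows`, and `sample_html` are hypothetical names, the sample markup is hand-written rather than real scraped HTML, and it assumes that `offset=0` returns the first page. Unlike the cells in this notebook, it skips header rows instead of keeping them as empty lists.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/"


def page_urls(pages=5, page_size=200):
    """Build one chart URL per page; the offset steps by page_size."""
    return [f"{BASE_URL}?area=XWW&offset={i * page_size}" for i in range(pages)]


def parse_rows(html):
    """Return one list per data row: the cell texts plus the IMDB ID, if any."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for record in soup.find_all("tr"):
        row = [cell.text for cell in record.find_all("td")]
        if not row:  # header rows contain only <th> cells, so no <td> text
            continue
        for a in record.find_all("a", href=True):
            if a["href"].startswith("/title/"):
                # "/title/tt4154796/..." -> "tt4154796"
                row.append(a["href"].split("/")[2])
        rows.append(row)
    return rows


# Tiny hand-written stand-in for one chart page (not real scraped markup).
sample_html = """
<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td><a href="/title/tt4154796/?ref_=x">Avengers: Endgame</a></td></tr>
</table>
"""

urls = page_urls()
rows = parse_rows(sample_html)
print(urls[1])
print(rows)
```

With a live connection, something like `for url in page_urls(): all_rows += parse_rows(requests.get(url).content)` would then collect every page in a single loop.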


In [4]:

mov_list1 = []
for record in soup.findAll("tr"):
   # print(record.findAll("td"))
    row = [x.text for x in record.findAll("td")]
    #print(row)
    for a in record.findAll('a', href=True):
        if a['href'].startswith('/title/'):
            row.append(a['href'][7:16])
    mov_list1.append(row)
print(mov_list1)
        


[[], ['1', 'Avengers: Endgame', '$2,797,800,564', '$858,373,000', '30.7%', '$1,939,427,564', '69.3%', '2019', 'tt4154796'], ['2', 'Avatar', '$2,790,439,000', '$760,507,625', '27.2%', '$2,029,931,375', '72.8%', '2009', 'tt0499549'], ['3', 'Titanic', '$2,194,439,542', '$659,363,944', '30%', '$1,535,075,598', '70%', '1997', 'tt0120338'], ['4', 'Star Wars: Episode VII - The Force Awakens', '$2,068,223,624', '$936,662,225', '45.3%', '$1,131,561,399', '54.7%', '2015', 'tt2488496'], ['5', 'Avengers: Infinity War', '$2,048,359,754', '$678,815,482', '33.1%', '$1,369,544,272', '66.9%', '2018', 'tt4154756'], ['6', 'Jurassic World', '$1,670,400,637', '$652,270,625', '39%', '$1,018,130,012', '61%', '2015', 'tt0369610'], ['7', 'The Lion King', '$1,656,943,394', '$543,638,043', '32.8%', '$1,113,305,351', '67.2%', '2019', 'tt6105098'], ['8', 'The Avengers', '$1,518,812,988', '$623,357,910', '41%', '$895,455,078', '59%', '2012', 'tt0848228'], ['9', 'Furious 7', '$1,515,047,671', '$353,007,020', '23.3%'

We also isolated the IMDB IDs separately so we could easily make the API calls to TMDB.

In [5]:
movie_ids1 = []
for a in soup.findAll('a', href=True):
    if a['href'].startswith('/title/'):
        movie_ids1.append(a['href'][7:16])
print(len(movie_ids1))

200
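The slice `[7:16]` assumes every ID is exactly nine characters long. A slightly more defensive option, sketched here with a hypothetical helper `imdb_id_from_href` and made-up example hrefs, is to match the ID with a regular expression so that longer IDs are not truncated.

```python
import re

# "tt" followed by one or more digits, directly after the "/title/" prefix.
IMDB_ID = re.compile(r"^/title/(tt\d+)")


def imdb_id_from_href(href):
    """Return the IMDB ID embedded in a /title/ link, or None if there is none."""
    match = IMDB_ID.match(href)
    return match.group(1) if match else None


# Example hrefs written by hand for illustration.
nine_chars = imdb_id_from_href("/title/tt4154796/?ref_=x")
longer_id = imdb_id_from_href("/title/tt10872600/")
not_a_title = imdb_id_from_href("/chart/ww_top_lifetime_gross/")
print(nine_chars, longer_id, not_a_title)
```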


## Web Scraping Iteration 2

In [6]:
url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW&offset=200"
r = requests.get(url)
soup2 = bs(r.content, "html.parser")
print(soup2.prettify)

<bound method Tag.prettify of <!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<title dir="ltr">Top Lifetime Grosses - Box Office Mojo</title><meta content="Top Lifetime Grosses" name="title"/>
<meta content="Box Office Mojo" property="og:site_name"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://m.media-amazon.com/images/G/01/boxofficemojo/v2/favicon._CB448965889_.ico" rel="icon" type="image/x-icon"/>
<link href="https://images-na.ssl-images-amazon.com/images/I/51tax7M48-L._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01VszOUTO6L.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.cs

In [7]:
mov_list2 = []
for record in soup2.findAll("tr"):
   # print(record.findAll("td"))
    row = [x.text for x in record.findAll("td")]
    #print(row)
    for a in record.findAll('a', href=True):
        if a['href'].startswith('/title/'):
            row.append(a['href'][7:16])
    mov_list2.append(row)
print(mov_list2)

[[], ['201', 'Dunkirk', '$526,940,665', '$189,740,665', '36%', '$337,200,000', '64%', '2017', 'tt5013056'], ['202', 'Godzilla', '$524,976,069', '$200,676,069', '38.2%', '$324,300,000', '61.8%', '2014', 'tt0831387'], ['203', 'Sherlock Holmes', '$524,028,679', '$209,028,679', '39.9%', '$315,000,000', '60.1%', '2009', 'tt0988045'], ['204', 'Meet the Fockers', '$522,657,936', '$279,261,160', '53.4%', '$243,396,776', '46.6%', '2004', 'tt0290002'], ['205', 'How to Train Your Dragon: The Hidden World', '$521,799,505', '$160,799,505', '30.8%', '$361,000,000', '69.2%', '2019', 'tt2386490'], ['206', 'WALL·E', '$521,311,860', '$223,808,164', '42.9%', '$297,503,696', '57.1%', '2008', 'tt0910970'], ['207', 'Kung Fu Panda 3', '$521,170,825', '$143,528,619', '27.5%', '$377,642,206', '72.5%', '2016', 'tt2267968'], ['208', 'Terminator 2: Judgment Day', '$520,884,847', '$205,881,154', '39.5%', '$315,003,693', '60.5%', '1991', 'tt0103064'], ['209', 'Ant-Man', '$519,311,965', '$180,202,163', '34.7%', '$33

In [8]:
movie_ids2 = []
for a in soup2.findAll('a', href=True):
    if a['href'].startswith('/title/'):
        movie_ids2.append(a['href'][7:16])
print(movie_ids2)

['tt5013056', 'tt0831387', 'tt0988045', 'tt0290002', 'tt2386490', 'tt0910970', 'tt2267968', 'tt0103064', 'tt0478970', 'tt2709692', 'tt0099653', 'tt0103639', 'tt2357291', 'tt0332452', 'tt0120363', 'tt0892769', 'tt0117998', 'tt1623205', 'tt0800320', 'tt4777008', 'tt3450958', 'tt0356910', 'tt0808151', 'tt1291150', 'tt0315327', 'tt0126029', 'tt1436562', 'tt0120815', 'tt1318514', 'tt0099785', 'tt0086190', 'tt0367594', 'tt2510894', 'tt0097576', 'tt2126355', 'tt7349950', 'tt1772341', 'tt0073195', 'tt1119646', 'tt1490017', 'tt4701182', 'tt1408101', 'tt0133093', 'tt0100405', 'tt0317219', 'tt0172495', 'tt0376994', 'tt0465234', 'tt2872732', 'tt0117060', 'tt0416449', 'tt0325710', 'tt7362036', 'tt0240772', 'tt1014738', 'tt0800369', 'tt0213149', 'tt0120855', 'tt3783958', 'tt0120912', 'tt0440963', 'tt0209163', 'tt1231580', 'tt1707386', 'tt0070047', 'tt0107614', 'tt1340138', 'tt0101414', 'tt0803096', 'tt1517451', 'tt1485796', 'tt0181852', 'tt5884052', 'tt0246460', 'tt5113040', 'tt0162222', 'tt2231461'

## Web Scraping Iteration 3

In [9]:
url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=400&area=XWW"
r = requests.get(url)
soup3 = bs(r.content, "html.parser")
print(soup3.prettify)

<bound method Tag.prettify of <!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<title dir="ltr">Top Lifetime Grosses - Box Office Mojo</title><meta content="Top Lifetime Grosses" name="title"/>
<meta content="Box Office Mojo" property="og:site_name"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://m.media-amazon.com/images/G/01/boxofficemojo/v2/favicon._CB448965889_.ico" rel="icon" type="image/x-icon"/>
<link href="https://images-na.ssl-images-amazon.com/images/I/51tax7M48-L._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01VszOUTO6L.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.cs

In [10]:
mov_list3 = []
for record in soup3.findAll("tr"):
   # print(record.findAll("td"))
    row = [x.text for x in record.findAll("td")]
    #print(row)
    for a in record.findAll('a', href=True):
        if a['href'].startswith('/title/'):
            row.append(a['href'][7:16])
    mov_list3.append(row)
print(mov_list3)

[[], ['402', 'Rush Hour 2', '$347,325,802', '$226,164,286', '65.1%', '$121,161,516', '34.9%', '2001', 'tt0266915'], ['403', 'Trolls', '$346,864,462', '$153,707,064', '44.3%', '$193,157,398', '55.7%', '2016', 'tt1679335'], ['404', 'xXx: Return of Xander Cage', '$346,118,277', '$44,898,413', '13%', '$301,219,864', '87%', '2017', 'tt1293847'], ['405', 'Pocahontas', '$346,079,773', '$141,579,773', '40.9%', '$204,500,000', '59.1%', '1995', 'tt0114148'], ['406', 'How the Grinch Stole Christmas', '$345,141,403', '$260,044,825', '75.3%', '$85,096,578', '24.7%', '2000', 'tt0170016'], ['407', 'Star Trek Beyond', '$343,471,816', '$158,848,340', '46.2%', '$184,623,476', '53.8%', '2016', 'tt2660888'], ['408', 'Alvin and the Chipmunks: Chipwrecked', '$342,695,435', '$133,110,742', '38.8%', '$209,584,693', '61.2%', '2011', 'tt1615918'], ['409', 'Wanted', '$342,463,063', '$134,508,551', '39.3%', '$207,954,512', '60.7%', '2008', 'tt0493464'], ['410', 'The Flintstones', '$341,631,208', '$130,531,208', '

In [11]:
movie_ids3 = []
for a in soup3.findAll('a', href=True):
    if a['href'].startswith('/title/'):
        movie_ids3.append(a['href'][7:16])
print(movie_ids3)

['tt0266915', 'tt1679335', 'tt1293847', 'tt0114148', 'tt0170016', 'tt2660888', 'tt1615918', 'tt0493464', 'tt0109813', 'tt6644200', 'tt0461770', 'tt0327084', 'tt1253863', 'tt0112462', 'tt0473075', 'tt0096874', 'tt0421715', 'tt1397514', 'tt0117500', 'tt2034800', 'tt3110958', 'tt0120667', 'tt0087469', 'tt0120347', 'tt1979388', 'tt2294449', 'tt0212338', 'tt0096438', 'tt0947798', 'tt6966692', 'tt0177971', 'tt0090555', 'tt0458352', 'tt0114369', 'tt6146586', 'tt2446042', 'tt0116583', 'tt1067106', 'tt2279373', 'tt0108052', 'tt1001526', 'tt0104714', 'tt1142988', 'tt0955308', 'tt0361748', 'tt0115433', 'tt3065204', 'tt0093010', 'tt0938283', 'tt1457767', 'tt1041829', 'tt0268978', 'tt0086960', 'tt0118571', 'tt1764651', 'tt0371606', 'tt0119822', 'tt0338348', 'tt4046784', 'tt2592614', 'tt0145660', 'tt4116284', 'tt0496806', 'tt0970866', 'tt0397892', 'tt0117438', 'tt0163187', 'tt0878804', 'tt8946378', 'tt7830888', 'tt2582846', 'tt0454921', 'tt1446192', 'tt0075860', 'tt0299658', 'tt5140878', 'tt2543472'

## Web Scraping Iteration 4

In [12]:
url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW&offset=600"
r = requests.get(url)
soup4 = bs(r.content, "html.parser")
print(soup4.prettify)

<bound method Tag.prettify of <!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<title dir="ltr">Top Lifetime Grosses - Box Office Mojo</title><meta content="Top Lifetime Grosses" name="title"/>
<meta content="Box Office Mojo" property="og:site_name"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://m.media-amazon.com/images/G/01/boxofficemojo/v2/favicon._CB448965889_.ico" rel="icon" type="image/x-icon"/>
<link href="https://images-na.ssl-images-amazon.com/images/I/51tax7M48-L._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01VszOUTO6L.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.cs

In [13]:
mov_list4 = []
for record in soup4.findAll("tr"):
   # print(record.findAll("td"))
    row = [x.text for x in record.findAll("td")]
    #print(row)
    for a in record.findAll('a', href=True):
        if a['href'].startswith('/title/'):
            row.append(a['href'][7:16])
    mov_list4.append(row)
print(mov_list4)

[[], ['605', 'Wild Hogs', '$253,625,427', '$168,273,550', '66.4%', '$85,351,877', '33.6%', '2007', 'tt0486946'], ['606', 'High School Musical 3', '$252,909,177', '$90,559,416', '35.8%', '$162,349,761', '64.2%', '2008', 'tt0962726'], ['607', 'Hercules', '$252,712,101', '$99,112,101', '39.2%', '$153,600,000', '60.8%', '1997', 'tt0119282'], ['608', 'X-Men: Dark Phoenix', '$252,442,974', '$65,845,974', '26.1%', '$186,597,000', '73.9%', '2019', 'tt6565702'], ['609', 'True Grit', '$252,276,927', '$171,243,005', '67.9%', '$81,033,922', '32.1%', '2010', 'tt1403865'], ['610', 'Bean', '$251,212,670', '$45,319,423', '18%', '$205,893,247', '82%', '1997', 'tt0118689'], ['611', 'American Hustle', '$251,171,807', '$150,117,807', '59.8%', '$101,054,000', '40.2%', '2013', 'tt1800241'], ['612', 'Enemy of the State', '$250,849,789', '$111,549,836', '44.5%', '$139,299,953', '55.5%', '1998', 'tt0120660'], ['613', "You've Got Mail", '$250,821,495', '$115,821,495', '46.2%', '$135,000,000', '53.8%', '1998', '

In [14]:
movie_ids4 = []
for a in soup4.findAll('a', href=True):
    if a['href'].startswith('/title/'):
        movie_ids4.append(a['href'][7:16])
print(movie_ids4)

['tt0486946', 'tt0962726', 'tt0119282', 'tt6565702', 'tt1403865', 'tt0118689', 'tt1800241', 'tt0120660', 'tt0128853', 'tt0449010', 'tt0328880', 'tt0120746', 'tt0298130', 'tt0185937', 'tt0217869', 'tt5273624', 'tt0109686', 'tt6823368', 'tt2191701', 'tt2452042', 'tt0099088', 'tt0068646', 'tt1192628', 'tt0109831', 'tt0119094', 'tt3949660', 'tt1077368', 'tt0286716', 'tt1021867', 'tt0314331', 'tt2120120', 'tt1267297', 'tt0120812', 'tt0373051', 'tt1815862', 'tt0104257', 'tt0844471', 'tt0298203', 'tt1528854', 'tt1234721', 'tt6751668', 'tt0116213', 'tt2316204', 'tt3263904', 'tt2094766', 'tt0389860', 'tt0099423', 'tt1855325', 'tt0092493', 'tt1691917', 'tt3104988', 'tt0118688', 'tt1320261', 'tt0187078', 'tt0076666', 'tt1045658', 'tt0322259', 'tt0347149', 'tt4846340', 'tt0212720', 'tt0097165', 'tt3079380', 'tt0163651', 'tt1605630', 'tt0142342', 'tt2974918', 'tt2084970', 'tt0970416', 'tt0328828', 'tt1568346', 'tt0467406', 'tt1024648', 'tt0453451', 'tt0116209', 'tt0105417', 'tt8350360', 'tt0360486'

## Web Scraping Iteration 5

In [15]:
url = "https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW&offset=800"
r = requests.get(url)
soup5 = bs(r.content, "html.parser")
print(soup5.prettify)

<bound method Tag.prettify of <!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<title dir="ltr">Top Lifetime Grosses - Box Office Mojo</title><meta content="Top Lifetime Grosses" name="title"/>
<meta content="Box Office Mojo" property="og:site_name"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://m.media-amazon.com/images/G/01/boxofficemojo/v2/favicon._CB448965889_.ico" rel="icon" type="image/x-icon"/>
<link href="https://images-na.ssl-images-amazon.com/images/I/51tax7M48-L._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01VszOUTO6L.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01ruG+gDPFL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.cs

In [16]:
mov_list5 = []
for record in soup5.findAll("tr"):
   # print(record.findAll("td"))
    row = [x.text for x in record.findAll("td")]
    #print(row)
    for a in record.findAll('a', href=True):
        if a['href'].startswith('/title/'):
            row.append(a['href'][7:16])
    mov_list5.append(row)
print(mov_list5)

[[], ['808', 'Garfield', '$203,172,417', '$75,369,589', '37.1%', '$127,802,828', '62.9%', '2004', 'tt0356634'], ['809', 'Little Women', '$203,127,575', '$107,727,575', '53%', '$95,400,000', '47%', '2019', 'tt3281548'], ['810', 'The Addams Family', '$203,044,905', '$100,044,905', '49.3%', '$103,000,000', '50.7%', '2019', 'tt1620981'], ['811', 'Patch Adams', '$202,292,902', '$135,026,902', '66.8%', '$67,266,000', '33.2%', '1998', 'tt0129290'], ['812', 'Teenage Mutant Ninja Turtles', '$201,965,915', '$135,265,915', '67%', '$66,700,000', '33%', '1990', 'tt0100758'], ['813', 'Kindergarten Cop', '$201,957,688', '$91,457,688', '45.3%', '$110,500,000', '54.7%', '1990', 'tt0099938'], ['814', 'Straight Outta Compton', '$201,634,991', '$161,197,785', '80%', '$40,437,206', '20%', '2015', 'tt1398426'], ['815', '21 Jump Street', '$201,585,328', '$138,447,667', '68.7%', '$63,137,661', '31.3%', '2012', 'tt1232829'], ['816', 'Valkyrie', '$201,545,517', '$83,077,833', '41.2%', '$118,467,684', '58.8%', '

In [17]:
movie_ids5 = []
for a in soup5.findAll('a', href=True):
    if a['href'].startswith('/title/'):
        movie_ids5.append(a['href'][7:16])
print(movie_ids5)

['tt0356634', 'tt3281548', 'tt1620981', 'tt0129290', 'tt0100758', 'tt0099938', 'tt1398426', 'tt1232829', 'tt0985699', 'tt0400717', 'tt0239395', 'tt0099810', 'tt1854564', 'tt0313737', 'tt1245526', 'tt0120632', 'tt0395699', 'tt0343660', 'tt1386703', 'tt8755316', 'tt1649419', 'tt2459022', 'tt4575576', 'tt1351685', 'tt2398241', 'tt0338459', 'tt0258000', 'tt2203939', 'tt0111282', 'tt0442933', 'tt1606389', 'tt0305224', 'tt0942385', 'tt5580390', 'tt0107798', 'tt3691740', 'tt2066051', 'tt2361509', 'tt6398184', 'tt0312004', 'tt0377981', 'tt0164184', 'tt0217505', 'tt4591310', 'tt1179904', 'tt0455944', 'tt3513498', 'tt0101272', 'tt0398165', 'tt0349205', 'tt3766354', 'tt0164052', 'tt0114069', 'tt0111070', 'tt6324278', 'tt7713068', 'tt8239806', 'tt0120902', 'tt3890264', 'tt0095956', 'tt0359950', 'tt0077766', 'tt2024544', 'tt0113277', 'tt1396218', 'tt0391198', 'tt0762107', 'tt9426210', 'tt0119314', 'tt0358273', 'tt0283426', 'tt1144884', 'tt0230011', 'tt0454848', 'tt0120484', 'tt0970179', 'tt4765284'

## Combining the Lists

Once we had a list of 200 movies from each webpage, we combined the lists together into a list of all 1,000 highest grossing movies of all time.

We checked the list and discovered that there was an empty list at the beginning of each list of 200 lists.  We decided to remove those once we turned the list into a dataframe.

In [18]:
highest_grossing = mov_list1 + mov_list2 + mov_list3 + mov_list4 + mov_list5

In [19]:
print(len(highest_grossing))

1005
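The count of 1,005 is the 1,000 movie rows plus one empty list per page (5 × 201). As an alternative to removing those rows later in the dataframe, they could be filtered out while the lists are combined. The sketch below uses shortened stand-in rows and a hypothetical `page_lists` variable in place of `mov_list1` through `mov_list5`.

```python
from itertools import chain

# Shortened stand-ins for the per-page lists; each starts with an empty row.
page_lists = [
    [[], ["1", "Avengers: Endgame", "tt4154796"]],
    [[], ["201", "Dunkirk", "tt5013056"]],
]

# Flatten the pages and keep only non-empty rows (an empty list is falsy).
combined = [row for row in chain.from_iterable(page_lists) if row]
print(len(combined))
```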


## Column Names for Data Frame

We grabbed the header names to use as the column names for the dataframe.

In [20]:
for header in soup.findAll('th'):
    print(header.text)

Rank

Title

Worldwide Lifetime Gross

Domestic Lifetime Gross

Domestic %

Foreign Lifetime Gross

Foreign %

Year



In [21]:
headers = ['rank', 'title', 'worldwide_lifetime_gross', 'domestic_lifetime_gross', 'domestic_per', 'foreign_lifetime_gross', 'foreign_per', 'year', 'imdb_id']
print(headers)



['rank', 'title', 'worldwide_lifetime_gross', 'domestic_lifetime_gross', 'domestic_per', 'foreign_lifetime_gross', 'foreign_per', 'year', 'imdb_id']
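The column names above were typed by hand. If the header text were to be converted automatically, one possible approach is sketched below; `to_snake_case` is a hypothetical helper, and mapping `%` to `per` is simply chosen to match the names used in this notebook.

```python
import re


def to_snake_case(label):
    """Lowercase a header label and join its words with underscores."""
    label = label.strip().lower().replace("%", "per")
    return re.sub(r"[^a-z0-9]+", "_", label).strip("_")


# Header text as printed by the cell above.
scraped = [
    "Rank", "Title", "Worldwide Lifetime Gross", "Domestic Lifetime Gross",
    "Domestic %", "Foreign Lifetime Gross", "Foreign %", "Year",
]

# The IMDB ID is not a table column on the page, so it is added by hand.
headers = [to_snake_case(h) for h in scraped] + ["imdb_id"]
print(headers)
```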


In [22]:
mojo_df = pd.DataFrame(highest_grossing, columns=headers)
mojo_df

Unnamed: 0,rank,title,worldwide_lifetime_gross,domestic_lifetime_gross,domestic_per,foreign_lifetime_gross,foreign_per,year,imdb_id
0,,,,,,,,,
1,1,Avengers: Endgame,"$2,797,800,564","$858,373,000",30.7%,"$1,939,427,564",69.3%,2019,tt4154796
2,2,Avatar,"$2,790,439,000","$760,507,625",27.2%,"$2,029,931,375",72.8%,2009,tt0499549
3,3,Titanic,"$2,194,439,542","$659,363,944",30%,"$1,535,075,598",70%,1997,tt0120338
4,4,Star Wars: Episode VII - The Force Awakens,"$2,068,223,624","$936,662,225",45.3%,"$1,131,561,399",54.7%,2015,tt2488496
...,...,...,...,...,...,...,...,...,...
1000,1010,Magic Mike,"$167,739,368","$113,721,571",67.8%,"$54,017,797",32.2%,2012,tt1915581
1001,1011,Alexander,"$167,298,192","$34,297,191",20.5%,"$133,001,001",79.5%,2004,tt0346491
1002,1012,Up in the Air,"$166,842,739","$83,823,381",50.2%,"$83,019,358",49.8%,2009,tt1193138
1003,1013,Nutty Professor II: The Klumps,"$166,339,890","$123,309,890",74.1%,"$43,030,000",25.9%,2000,tt0144528


## Remove Extra Empty Rows

We removed the extra empty rows that we discovered when analyzing the list of 1,000 movies.

In [23]:
clean_mojo_df = mojo_df.dropna(axis=0, how='all')  # drop rows where every column is missing

In [24]:
clean_mojo_df.reset_index  # no-op as written: without parentheses the method is never called, so the original index is kept
clean_mojo_df

Unnamed: 0,rank,title,worldwide_lifetime_gross,domestic_lifetime_gross,domestic_per,foreign_lifetime_gross,foreign_per,year,imdb_id
1,1,Avengers: Endgame,"$2,797,800,564","$858,373,000",30.7%,"$1,939,427,564",69.3%,2019,tt4154796
2,2,Avatar,"$2,790,439,000","$760,507,625",27.2%,"$2,029,931,375",72.8%,2009,tt0499549
3,3,Titanic,"$2,194,439,542","$659,363,944",30%,"$1,535,075,598",70%,1997,tt0120338
4,4,Star Wars: Episode VII - The Force Awakens,"$2,068,223,624","$936,662,225",45.3%,"$1,131,561,399",54.7%,2015,tt2488496
5,5,Avengers: Infinity War,"$2,048,359,754","$678,815,482",33.1%,"$1,369,544,272",66.9%,2018,tt4154756
...,...,...,...,...,...,...,...,...,...
1000,1010,Magic Mike,"$167,739,368","$113,721,571",67.8%,"$54,017,797",32.2%,2012,tt1915581
1001,1011,Alexander,"$167,298,192","$34,297,191",20.5%,"$133,001,001",79.5%,2004,tt0346491
1002,1012,Up in the Air,"$166,842,739","$83,823,381",50.2%,"$83,019,358",49.8%,2009,tt1193138
1003,1013,Nutty Professor II: The Klumps,"$166,339,890","$123,309,890",74.1%,"$43,030,000",25.9%,2000,tt0144528
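The output above still starts at index 1 because the index was never actually reset. If a clean 0-based index is wanted, dropping the empty rows and resetting the index can be chained, as in this minimal sketch on a three-row stand-in (the `rows`, `df`, and `clean` names are hypothetical and the columns are shortened).

```python
import pandas as pd

rows = [
    [],  # empty row left over from the table header
    ["1", "Avengers: Endgame", "tt4154796"],
    ["2", "Avatar", "tt0499549"],
]
df = pd.DataFrame(rows, columns=["rank", "title", "imdb_id"])

# Drop rows where every value is missing, then renumber from 0.
# drop=True discards the old index instead of keeping it as a column.
clean = df.dropna(how="all").reset_index(drop=True)
print(clean)
```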


## IMDB IDs for the TMDB API Calls

We checked the length of each list of IMDB IDs to make sure it was accurate.  Then we combined the lists and used the list of 1,000 IDs to make the API call for more details from TMDB.

In [25]:
print(len(movie_ids1))
print(len(movie_ids2))
print(len(movie_ids3))
print(len(movie_ids4))
print(len(movie_ids5))

200
200
200
200
200


In [26]:
all_movie_ids = movie_ids1 + movie_ids2 + movie_ids3 + movie_ids4 + movie_ids5
print(all_movie_ids)

['tt4154796', 'tt0499549', 'tt0120338', 'tt2488496', 'tt4154756', 'tt0369610', 'tt6105098', 'tt0848228', 'tt2820852', 'tt4520988', 'tt2395427', 'tt1825683', 'tt1201607', 'tt2527336', 'tt4881806', 'tt2294629', 'tt2771200', 'tt3606756', 'tt4630562', 'tt1300854', 'tt2293640', 'tt3498820', 'tt1477834', 'tt0167260', 'tt6320628', 'tt4154664', 'tt1399103', 'tt1074638', 'tt2109248', 'tt1345836', 'tt7286456', 'tt1979376', 'tt2527338', 'tt0435761', 'tt0383574', 'tt3748528', 'tt6139732', 'tt1298650', 'tt3469046', 'tt0107290', 'tt2277860', 'tt0120915', 'tt1014759', 'tt2948356', 'tt0903624', 'tt0468569', 'tt0241527', 'tt0926084', 'tt1690953', 'tt0110357', 'tt3040964', 'tt2283362', 'tt0449088', 'tt1170358', 'tt2310332', 'tt0167261', 'tt0373889', 'tt0266543', 'tt0417741', 'tt0298148', 'tt1727824', 'tt0330373', 'tt0413300', 'tt0120737', 'tt1080016', 'tt2379713', 'tt2250912', 'tt0295297', 'tt1667889', 'tt2709768', 'tt2975590', 'tt7131870', 'tt0121766', 'tt1951264', 'tt3896198', 'tt2096673', 'tt1270797'

## API Calls to TMDB

We went to the TMDB website and created a developer account, which gave us an API key.  We then looked through their API documentation to identify the specific GET request URL we needed to obtain the information we wanted.  We iterated through our list of IMDB IDs, substituting each ID into the URL to request the record for that movie and appending the returned data to a list.  We also added 0.5 seconds of wait time between requests so we wouldn't raise any red flags or give TMDB a reason to block us.

The result was a list of dictionaries, which we turned into a dataframe.

In [29]:
import json
import time
import requests

List_of_responses = []
# After collecting the full data set, we limited this loop to the first 5 IDs
# so the notebook could be rerun quickly if necessary.
for imdb_id in all_movie_ids[0:5]:
    response = requests.get("https://api.themoviedb.org/3/movie/" + imdb_id + "?api_key=" + config.key + "&language=en-US")
    data = response.json()
    time.sleep(.5)  # pause between requests
    List_of_responses.append(data)
print(List_of_responses)
    

[{'adult': False, 'backdrop_path': '/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg', 'belongs_to_collection': {'id': 86311, 'name': 'The Avengers Collection', 'poster_path': '/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg', 'backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg'}, 'budget': 356000000, 'genres': [{'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 28, 'name': 'Action'}], 'homepage': 'https://www.marvel.com/movies/avengers-endgame', 'id': 299534, 'imdb_id': 'tt4154796', 'original_language': 'en', 'original_title': 'Avengers: Endgame', 'overview': "After the devastating events of Avengers: Infinity War, the universe is in ruins due to the efforts of the Mad Titan, Thanos. With the help of remaining allies, the Avengers must assemble once more in order to undo Thanos' actions and restore order to the universe once and for all, no matter what consequences may be in store.", 'popularity': 38.57, 'poster_path': '/or06FN3Dka5tukK1e9sl16pB3iy.jpg', 'production_companies': [{'id': 4
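Some requests come back with an error payload rather than movie details, so it can help to separate those while looping. The sketch below is one possible arrangement, not the code used for this project: `tmdb_movie_url`, `fetch_all`, and `fake_get_json` are hypothetical names, the network call is replaced by a stand-in function so the example runs offline, and it assumes that error payloads carry a `status_code` key, as in the failed rows of this data set.

```python
import time
from urllib.parse import urlencode


def tmdb_movie_url(imdb_id, api_key):
    """Build the movie-details URL used in this notebook for one IMDB ID."""
    query = urlencode({"api_key": api_key, "language": "en-US"})
    return f"https://api.themoviedb.org/3/movie/{imdb_id}?{query}"


def fetch_all(imdb_ids, api_key, get_json, delay=0.5):
    """Call get_json(url) per ID; return (found payloads, IDs that failed)."""
    found, missing = [], []
    for imdb_id in imdb_ids:
        data = get_json(tmdb_movie_url(imdb_id, api_key))
        if "status_code" in data:  # assumed marker of an error payload
            missing.append(imdb_id)
        else:
            found.append(data)
        time.sleep(delay)  # pause between requests
    return found, missing


def fake_get_json(url):
    """Offline stand-in for the HTTP call; 'tt0000000' is a made-up failing ID."""
    if "tt0000000" in url:
        return {"status_code": 34,
                "status_message": "The resource you requested could not be found."}
    return {"imdb_id": url.split("/movie/")[1].split("?")[0]}


found, missing = fetch_all(["tt4154796", "tt0000000"], "DUMMY_KEY",
                           fake_get_json, delay=0)
print(found, missing)
```

For real requests, `get_json` could be `lambda url: requests.get(url).json()`, with `config.key` passed as the API key.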

We grabbed the keys from the first dictionary in the list to use as our column titles.

In [57]:
List_of_responses[0].keys()

dict_keys(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count'])

We checked our list of dictionaries to make sure it had the number of records we wanted.

In [43]:
print(len(List_of_responses))

1000


We turned the list of dictionaries into a dataframe and started to look around.

In [54]:
tmdb_df = pd.DataFrame(List_of_responses)
tmdb_df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,status_code,status_message
0,False,/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",356000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 878, ...",https://www.marvel.com/movies/avengers-endgame,299534.0,tt4154796,en,Avengers: Endgame,...,181.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Part of the journey is the end.,Avengers: Endgame,False,8.3,11403.0,,
1,False,/aHcth2AXzZSjhX7JYh7ie73YVNc.jpg,"{'id': 87096, 'name': 'Avatar Collection', 'po...",237000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.avatarmovie.com/,19995.0,tt0499549,en,Avatar,...,162.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Enter the World of Pandora.,Avatar,False,7.4,20393.0,,
2,False,/xqQztbT6KlPLQLlRtNHoXivEMZA.jpg,,200000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,597.0,tt0120338,en,Titanic,...,194.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Nothing on Earth could come between them.,Titanic,False,7.8,16026.0,,
3,False,/c2Ax8Rox5g6CneChwy1gmu4UbSb.jpg,"{'id': 10, 'name': 'Star Wars Collection', 'po...",245000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.starwars.com/films/star-wars-episod...,140607.0,tt2488496,en,Star Wars: The Force Awakens,...,136.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Every generation has a story.,Star Wars: The Force Awakens,False,7.4,13866.0,,
4,False,/bOGkgRGdhrBYJSLpXaxhXVstddV.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",300000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",https://www.marvel.com/movies/avengers-infinit...,299536.0,tt4154756,en,Avengers: Infinity War,...,149.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,An entire universe. Once and for all.,Avengers: Infinity War,False,8.3,16902.0,,


We identified 4 rows with empty cells.  For these rows TMDB returned an error message instead of movie details (row 113 below, for example, has status code 34, "The resource you requested could not be found.").  We attributed this to the `language=en-US` parameter at the end of the request URL, since the films in these rows weren't in English.  We decided to drop these rows since we wanted to focus on the English-language market anyway.

In [None]:
# Show the rows that came back without an IMDB ID (the failed requests).
tmdb_df[tmdb_df['imdb_id'].isnull()]

In [113]:
tmdb_df.iloc[112:115]

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,status_code,status_message
112,False,/6fX7NF6IUJCTVssei7Shgl9J6LL.jpg,,175000000.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://disney.go.com/disneypictures/up/,14160.0,tt1049413,en,Up,...,96.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Up,False,7.9,13666.0,,
113,,,,,,,,,,,...,,,,,,,,,34.0,The resource you requested could not be found.
114,False,/cQPbUnglzgIE6XDnh1dHnHKB26Y.jpg,,105000000.0,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",http://gravitymovie.warnerbros.com,49047.0,tt1454468,en,Gravity,...,91.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Don't Let Go,Gravity,False,7.2,10673.0,,


In [66]:
clean_tmdb_df = tmdb_df.drop(['adult', 'backdrop_path', 'homepage', 'id', 'original_title', 'overview', 'poster_path', 'revenue', 'status',
'spoken_languages',
'tagline',
'title',
'vote_average',
'vote_count',
'status_code',
'status_message', 'video'], axis=1)

In [67]:
clean_tmdb_df.columns

Index(['belongs_to_collection', 'budget', 'genres', 'imdb_id',
       'original_language', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'runtime'],
      dtype='object')

In [68]:
clean_tmdb_df.head()

Unnamed: 0,belongs_to_collection,budget,genres,imdb_id,original_language,popularity,production_companies,production_countries,release_date,runtime
0,"{'id': 86311, 'name': 'The Avengers Collection...",356000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 878, ...",tt4154796,en,38.57,"[{'id': 420, 'logo_path': '/hUzeosd33nzE5MCNsZ...","[{'iso_3166_1': 'US', 'name': 'United States o...",2019-04-24,181.0
1,"{'id': 87096, 'name': 'Avatar Collection', 'po...",237000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",tt0499549,en,29.738,"[{'id': 444, 'logo_path': '/42UPdZl6B2cFXgNUAS...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-12-10,162.0
2,,200000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",tt0120338,en,26.449,"[{'id': 4, 'logo_path': '/fycMZt242LVjagMByZOL...","[{'iso_3166_1': 'US', 'name': 'United States o...",1997-11-18,194.0
3,"{'id': 10, 'name': 'Star Wars Collection', 'po...",245000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",tt2488496,en,28.812,"[{'id': 1634, 'logo_path': None, 'name': 'True...","[{'iso_3166_1': 'US', 'name': 'United States o...",2015-12-15,136.0
4,"{'id': 86311, 'name': 'The Avengers Collection...",300000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",tt4154756,en,84.768,"[{'id': 420, 'logo_path': '/hUzeosd33nzE5MCNsZ...","[{'iso_3166_1': 'US', 'name': 'United States o...",2018-04-25,149.0


In [114]:
clean_tmdb_df.dropna(axis=0, how="all", inplace=True)

In [115]:
clean_tmdb_df.iloc[112:115]

Unnamed: 0,belongs_to_collection,budget,genres,imdb_id,original_language,popularity,production_companies,production_countries,release_date,runtime
112,,175000000.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",tt1049413,en,28.29,"[{'id': 3, 'logo_path': '/1TjvGVDMYsj6JBxOAkUH...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-05-28,96.0
114,,105000000.0,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",tt1454468,en,16.315,"[{'id': 7470, 'logo_path': None, 'name': 'Espe...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2013-10-03,91.0
115,"{'id': 131295, 'name': 'Captain America Collec...",170000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",tt1843866,en,15.084,"[{'id': 420, 'logo_path': '/hUzeosd33nzE5MCNsZ...","[{'iso_3166_1': 'US', 'name': 'United States o...",2014-03-20,136.0
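
`how="all"` removes a row only when every column in it is missing. That is why the failed lookups disappear here: once `status_code` and `status_message` are dropped, those rows have nothing left. The sketch below contrasts it with `how="any"` on a toy frame; `toy_df` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): row 1 is entirely empty, row 2 is only missing a budget.
toy_df = pd.DataFrame({
    "imdb_id": ["tt0000001", np.nan, "tt0000003"],
    "budget": [100.0, np.nan, np.nan],
})

# how="all" drops only the fully empty row.
all_dropped = toy_df.dropna(axis=0, how="all")

# how="any" also drops the row with a single missing value.
any_dropped = toy_df.dropna(axis=0, how="any")

print(all_dropped.index.tolist())  # [0, 2]
print(any_dropped.index.tolist())  # [0]
```

The index labels are not renumbered after the drop, which matches the jump from 112 to 114 in the output above. Chaining `.reset_index(drop=True)` would renumber them if consecutive labels are preferred.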


## Merging the Two Dataframes into a Master Dataframe

We merged the two dataframes on `imdb_id` so the scraped Box Office Mojo dataframe is on the left and the details obtained from the TMDB API are on the right.

In [116]:
master_df = clean_mojo_df.merge(clean_tmdb_df, on='imdb_id', how='outer')
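
Because an outer merge keeps rows that appear in only one of the two dataframes, it can be useful to check which rows actually matched. The sketch below uses the `indicator=True` option of `DataFrame.merge` on two toy frames; `toy_mojo_df`, `toy_tmdb_df`, and their contents are hypothetical stand-ins for `clean_mojo_df` and `clean_tmdb_df`.

```python
import pandas as pd

# Toy stand-ins (assumption): the second film has no TMDB details.
toy_mojo_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["Film A", "Film B", "Film C"],
})
toy_tmdb_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000003"],
    "runtime": [120.0, 95.0],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
checked_df = toy_mojo_df.merge(
    toy_tmdb_df, on="imdb_id", how="outer", indicator=True
)

# Rows that did not find a partner in the other dataframe.
unmatched = checked_df[checked_df["_merge"] != "both"]
print(unmatched["imdb_id"].tolist())  # ['tt0000002']
```

If only films with TMDB details are wanted in the master dataframe, `how='inner'` would exclude the unmatched rows instead of keeping them with empty cells.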

In [117]:
master_df

Unnamed: 0,rank,title,worldwide_lifetime_gross,domestic_lifetime_gross,domestic_per,foreign_lifetime_gross,foreign_per,year,imdb_id,belongs_to_collection,budget,genres,original_language,popularity,production_companies,production_countries,release_date,runtime
0,1,Avengers: Endgame,"$2,797,800,564","$858,373,000",30.7%,"$1,939,427,564",69.3%,2019,tt4154796,"{'id': 86311, 'name': 'The Avengers Collection...",356000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 878, ...",en,38.570,"[{'id': 420, 'logo_path': '/hUzeosd33nzE5MCNsZ...","[{'iso_3166_1': 'US', 'name': 'United States o...",2019-04-24,181.0
1,2,Avatar,"$2,790,439,000","$760,507,625",27.2%,"$2,029,931,375",72.8%,2009,tt0499549,"{'id': 87096, 'name': 'Avatar Collection', 'po...",237000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,29.738,"[{'id': 444, 'logo_path': '/42UPdZl6B2cFXgNUAS...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-12-10,162.0
2,3,Titanic,"$2,194,439,542","$659,363,944",30%,"$1,535,075,598",70%,1997,tt0120338,,200000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",en,26.449,"[{'id': 4, 'logo_path': '/fycMZt242LVjagMByZOL...","[{'iso_3166_1': 'US', 'name': 'United States o...",1997-11-18,194.0
3,4,Star Wars: Episode VII - The Force Awakens,"$2,068,223,624","$936,662,225",45.3%,"$1,131,561,399",54.7%,2015,tt2488496,"{'id': 10, 'name': 'Star Wars Collection', 'po...",245000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,28.812,"[{'id': 1634, 'logo_path': None, 'name': 'True...","[{'iso_3166_1': 'US', 'name': 'United States o...",2015-12-15,136.0
4,5,Avengers: Infinity War,"$2,048,359,754","$678,815,482",33.1%,"$1,369,544,272",66.9%,2018,tt4154756,"{'id': 86311, 'name': 'The Avengers Collection...",300000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",en,84.768,"[{'id': 420, 'logo_path': '/hUzeosd33nzE5MCNsZ...","[{'iso_3166_1': 'US', 'name': 'United States o...",2018-04-25,149.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1010,Magic Mike,"$167,739,368","$113,721,571",67.8%,"$54,017,797",32.2%,2012,tt1915581,"{'id': 328247, 'name': 'Magic Mike Collection'...",7000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",en,11.408,"[{'id': 34981, 'logo_path': None, 'name': 'Iro...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-06-28,110.0
996,1011,Alexander,"$167,298,192","$34,297,191",20.5%,"$133,001,001",79.5%,2004,tt0346491,,155000000.0,"[{'id': 10752, 'name': 'War'}, {'id': 36, 'nam...",en,12.523,"[{'id': 54997, 'logo_path': None, 'name': 'WR ...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2004-11-21,175.0
997,1012,Up in the Air,"$166,842,739","$83,823,381",50.2%,"$83,019,358",49.8%,2009,tt1193138,,25000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",en,13.775,"[{'id': 32157, 'logo_path': '/nBD7uzlUlpivUpOo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-09-05,110.0
998,1013,Nutty Professor II: The Klumps,"$166,339,890","$123,309,890",74.1%,"$43,030,000",25.9%,2000,tt0144528,"{'id': 86028, 'name': 'The Nutty Professor Col...",84000000.0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 35, 'na...",en,11.835,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKX...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-07-27,106.0


In [119]:
master_df.columns

Index(['rank', 'title', 'worldwide_lifetime_gross', 'domestic_lifetime_gross',
       'domestic_per', 'foreign_lifetime_gross', 'foreign_per', 'year',
       'imdb_id', 'belongs_to_collection', 'budget', 'genres',
       'original_language', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'runtime'],
      dtype='object')

## CSV Export

We exported the dataframe into a .csv file so we could access the data in the future.

In [122]:
# pandas writes the file directly, so the csv module is not needed
master_df.to_csv('clean_master_movie_mod1.csv')
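
One detail worth knowing about the export: by default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference with `index=False`. It writes to in-memory buffers rather than to disk, and `toy_df` is an invented example, not the project data.

```python
import io

import pandas as pd

# Toy data (assumption) standing in for master_df.
toy_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "runtime": [120.0, 95.0],
})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'imdb_id', 'runtime']
print(cols_without_index)  # ['imdb_id', 'runtime']
```

Either form works for keeping a copy of the data. If the index is written, passing `index_col=0` to `read_csv` restores it on reload.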