# Project #2

## Web Scraping

### Websites to consider scraping
* Top Worldwide Box Office by year [BOM](https://www.boxofficemojo.com/year/world/?ref_=bo_nb_in_tab)
* Chinese top grossing films from [BOM](https://www.boxofficemojo.com/weekend/by-year/2019/?area=CN)
* American ragings from [Rotton Tomatoes](https://www.rottentomatoes.com/)
* Chinese film titles from [atmovies](http://www.atmovies.com.tw/movie/)
* Chinese ratings from [douban](https://www.douban.com/)
* Top grossing films in China [wiki](https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_China)  
* Mainland China box office ranking [endata.com.cn](http://www.endata.com.cn/BoxOffice/BO/History/Movie/Alltimedomestic.html)
* Top grossing directors worldwide [wiki](https://en.wikipedia.org/wiki/List_of_highest-grossing_film_directors)
* Top grossing lead actors worldwide [the-numbers](https://www.the-numbers.com/box-office-star-records/international/lifetime-acting/top-grossing-leading-stars)


Articles for context  
 
* [China Film Insider](http://chinafilminsider.com/category/box-office/)
* [Movie ratings](https://www.makeuseof.com/tag/best-movie-ratings-sites/#:~:text=IMDb%20is%20great%20for%20seeing%20what%20general%20audiences%20think%20of%20a%20movie.&text=Rotten%20Tomatoes%20offers%20the%20best,you%20should%20use%20Rotten%20Tomatoes.)
* [Global films](https://qz.com/quartzy/1373949/hollywood-movies-are-truly-global-now/)
* [Global film industryu](https://en.wikipedia.org/wiki/Film_industry#:~:text=The%20worldwide%20theatrical%20market%20had,Africa%20with%20US%249.5%20billion.)
* [What makes a film successful in China?](https://www.greenhassonjanks.com/blog/what-makes-a-film-successful-in-china)
* [China box office stats](https://variety.com/2020/film/news/china-box-office-2019-review-ne-zha-wandering-earth-avengers-1203455038/)  
* [Marketing to China != marketing to HK](https://jingdaily.com/hong-kong-and-mainland-market-differences/) 

### Clickpath to get all BMO info

1. [YEAR Worldwide Box Office](https://www.boxofficemojo.com/year/world/2020/)
    * Get links to movie pages for all international movies in a year
    * NEXT: Click on a movie name to reach...
2. [Movie Page w/ international revenues](https://www.boxofficemojo.com/releasegroup/gr3944174085/?ref_=bo_ydw_table_1)
    * Get 
        * Domestic, International, Worldwide and Chinese gross
    * NEXT: click on 'Title Summary' to reach...
3. [Movie Page w/ film information](https://www.boxofficemojo.com/title/tt1502397/?ref_=bo_gr_ti)
    * Get 
        * Distributor
        * Domestic opening
        * Budget, Release date
        * MPAA rating
        * Running time
        * Genres
    * NEXT: click on 'Cast and Crew' to get...
4. [Movie page w/ Cast/crew](https://www.boxofficemojo.com/title/tt1502397/credits/?ref_=bo_tt_tab#tabs)
    * Get
        * Crew (directors, writers, producers, etc)
        * Cast

### 1. Getting Links from Year Worldwide Box Office pages  
#### 2020

In [1]:
# Necessary imports
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import seaborn as sns
import matplotlib.pyplot as plt
import dateutil.parser
import re

%matplotlib inline


Generating list of URLs

In [4]:
# grabbing links to movie pages for each movie listed in WW box office in last 20 years (1 page/year)

list_links = []

for year in range(2000, 2021):                                            # choosing date range
    link = "https://www.boxofficemojo.com/year/world/{}/".format(year)    # making list of yearly ww box office list pages
    list_links.append(link)                                               


In [5]:
# getting soup for each list page and parsing out soups for tables with links for each page

rows = []    # will have all movie page links

for url in list_links:
    response = requests.get(url)                                                   # reading each list page (21 total)
    print(url, response.status_code)                                               # checkign response is in 200s for each URL, prints below
    page = response.text
    soup = BeautifulSoup(page, "html5lib")                                         # getting soup for each list page
    row = soup.find_all('td', class_='a-text-left mojo-field-type-release_group')  # narrowing soups down to only sections with movie page links 
    rows.append(row)                                                               # adding each soup to list, makes list of lists, each list featuring all links from a single year

https://www.boxofficemojo.com/year/world/2000/ 200
https://www.boxofficemojo.com/year/world/2001/ 200
https://www.boxofficemojo.com/year/world/2002/ 200
https://www.boxofficemojo.com/year/world/2003/ 200
https://www.boxofficemojo.com/year/world/2004/ 200
https://www.boxofficemojo.com/year/world/2005/ 200
https://www.boxofficemojo.com/year/world/2006/ 200
https://www.boxofficemojo.com/year/world/2007/ 200
https://www.boxofficemojo.com/year/world/2008/ 200
https://www.boxofficemojo.com/year/world/2009/ 200
https://www.boxofficemojo.com/year/world/2010/ 200
https://www.boxofficemojo.com/year/world/2011/ 200
https://www.boxofficemojo.com/year/world/2012/ 200
https://www.boxofficemojo.com/year/world/2013/ 200
https://www.boxofficemojo.com/year/world/2014/ 200
https://www.boxofficemojo.com/year/world/2015/ 200
https://www.boxofficemojo.com/year/world/2016/ 200
https://www.boxofficemojo.com/year/world/2017/ 200
https://www.boxofficemojo.com/year/world/2018/ 200
https://www.boxofficemojo.com/y

In [42]:
type(rows[0])  # list of lists

bs4.element.ResultSet

In [56]:
# taking each soup of tables and pulling 

mp_links = {}                                   # setting empty dict for key=name val=link
# main_site = 'https://www.boxofficemojo.com'    

for soup in rows:
    for line in soup:                           # nested loop bc list of soups, each soup with many lines, iterates through each line of each soup to pull out a link from the line of soup element
        
        link = line.find('a')    
        title, url = link.text, link['href']
        mp_links[title] = url

In [44]:
len(mp_links)    # this many URLs

12344

In [92]:
mp_links['The Lion King'] # example of dict entry for each movie

'/releasegroup/gr403788293/?ref_=bo_ydw_table_2'

In [10]:
# later, to get values to list:
# x = list(mp_links.values())

### 2. Getting gross from first movie page

#### 2a. Testing for a single movie page

In [171]:
response = requests.get('https://www.boxofficemojo.com/releasegroup/gr3844755973/?ref_=bo_ydw_table_53')                                                 

In [172]:
response.status_code

200

In [173]:
page = response.text

In [174]:
soup = BeautifulSoup(page, "html5lib") 

In [175]:
# title

title = soup.find('h1', class_='a-size-extra-large').text
title

'The Angry Birds Movie 2'

In [176]:
# domestic gross

dtg = soup.find(class_='a-section a-spacing-none mojo-performance-summary-table').find_all('span', class_='money')[0].text
dtg

'$41,667,116'

In [177]:
# international gross (when ddtg==itg, change itg to Nan)

itg = soup.find(class_='a-section a-spacing-none mojo-performance-summary-table').find_all('span', class_='money')[1].text
itg

'$106,124,931'

In [178]:
# worldwide gross - DO NOT COLLECT, JUST SUM ITG+DTG (when ddtg==itg, change itg to Nan)

wtg = soup.find(class_='a-section a-spacing-none mojo-performance-summary-table').find_all('span', class_='money')[2].text
wtg

'$147,792,047'

In [179]:
# china release date

cn_release = soup.find('td', text='China').findNext().findNext().text
cn_release

'Aug 16, 2019'

In [180]:
# china opening sales

cn_opening = soup.find('td', text='China').findNext().findNext().findNext().text
cn_opening

'$9,238,982'

In [181]:
# china gross

ctg = soup.find('td', text='China').findNext().findNext().findNext().findNext().findNext().text
ctg

'$19,655,468'

In [182]:
# domestic release date

dom_release = soup.find('td', text='Domestic').findNext().findNext().text
dom_release

'Aug 13, 2019'

In [183]:
# domestic opening sales

dom_opening = soup.find('td', text='Domestic').findNext().findNext().findNext().text
dom_opening

'$10,354,073'

In [184]:
# LINK TO TITLE SUMMARY PAGE

summary_link_stub = soup.find('a', class_='a-link-normal mojo-title-link refiner-display-highlight')['href']
summary_link_stub

'/title/tt6095472/?ref_=bo_gr_ti'

#### 2b. Functions to clean up strings

In [37]:
def money_to_int(moneystring):
    if moneystring:
        moneystring = (str(moneystring).strip().replace('$', '').replace(',', '').replace('-', '').replace('\n', '').replace('–', ''))
        if len(moneystring) > 0:
            return int(float(moneystring))
        else: 
            return None
    else:
        return None

def to_date(datestring):
    if datestring:
        date = dateutil.parser.parse(datestring)
        return date
    else:
        return None
    
def runtime_to_minutes(runtimestring):
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except:
        return None

def clean_text_string(text_string):
    if text_string:
        text_string = text_string.strip().replace('\n', '').replace('See full company information', '')
        return text_string
    else:
        return None

#### 2c. Extend to all moviepages

In [444]:
def get_ctg(soup):
    
    '''Takes a soup and returns the chinese gross if listed on first movie page or None if nothing is found.'''
    
    obj = soup.find('td', text='China')
    
    if not obj: 
        return None

    gross = obj.findNext().findNext().findNext().findNext().findNext()
    
    if gross:
        return gross.text 
    else:
        return None
    

def get_itg(soup):
    '''Takes a soup and returns the international gross if listed on first movie page or None if nothing is found.'''

    itg = soup.find(class_='a-section a-spacing-none mojo-performance-summary-table').find_all('span', class_='money')[1].text
    if itg != dtg:
        return itg
    else:
        return None
        
    
def get_opening_sales(soup, region):
    
    '''Takes a soup and a region (str either 'Domestic' of 'China') and
    returns the regional opening sales from movie page or None if nothing is found.'''
    
    obj = soup.find('td', text=region)
    
    if not obj: 
        return None

    sales = obj.findNext().findNext().findNext()
    
    if sales:
        return sales.text 
    else:
        return None

    
def get_release_date(soup, region):
    
    '''Takes a soup and a region (str either 'Domestic' of 'China') and
    returns the regional release from movie page or None if nothing is found.'''
    
    obj = soup.find('td', text=region)
    
    if not obj: 
        return None

    release = obj.findNext().findNext()
    
    if release:
        return release.text 
    else:
        return None

def get_summary_page_link(soup):
    '''Takes a soup and returns the SUMMARY page link stub if listed on FIRST movie page or None if nothing is found.'''

    obj = soup.find('a', class_='a-link-normal mojo-title-link refiner-display-highlight')
    
    if not obj:
        return None
    
    page_link = obj['href']
    
    if page_link:
        return page_link
    else:
        return None
    

def get_cast_page_link(soup):
    '''Takes a soup and returns the CAST page link stub if listed on SECOND movie page or None if nothing is found.'''

    obj = soup.find('a', class_='a-size-base a-link-normal mojo-navigation-tab', text="Cast and Crew")
    
    if not obj:
        return None
    
    page_link = obj['href']
    
    if page_link:
        return page_link
    else:
        return None

In [443]:
soup.find('a', class_='a-size-base a-link-normal mojo-navigation-tab', text="Cast and Crew")['href']

'/title/tt0120755/credits/?ref_=bo_tt_tab#tabs'

#### 2c. Extend to all moviepages

In [255]:
def get_movie_dict_1(link):
    '''
    From BoxOfficeMojo link stub, request movie html, parse with BeautifulSoup, and
    collect 
        'movie_title', 
        'china_total_gross', 
        'domestic_toal_gross', 
        'international_total_gross', 
        'china_opening_sales',
        'domestic_opening_sales', 
        'china_release_date', 
        'dom_release_date', 
        'summary_link_stub'
    Return information as a dictionary.
    '''
    
    base_url = 'https://www.boxofficemojo.com'
    
    #Create full url to scrape
    url = base_url + link
    
    # Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html5lib")

    
    headers = ['movie_title', 'china_total_gross', 'domestic_toal_gross', 'international_total_gross', 'china_opening_sales', 'domestic_opening_sales', 'china_release_date', 'dom_release_date', 'summary_link_stub']
    
    # Get title
    title_string = soup.find('h1', class_='a-size-extra-large').text
    title = title_string.split('-')[0].strip()

    # Get China gross
    raw_ctg = get_ctg(soup)
    ctg = money_to_int(raw_ctg)
    
    # Get domestic gross
    raw_dtg = soup.find(class_='a-section a-spacing-none mojo-performance-summary-table').find_all('span', class_='money')[0].text
    dtg = money_to_int(raw_dtg)

    # Get international gross (WHEN CLEANING, IF DTG==ITG, ITG SHOULD BE NAN)
    raw_itg = get_itg(soup)
    itg = money_to_int(raw_itg)
    
    # Get Worldwide gross - REMOVED FROM COLLECTION, CALC FROM DF LATER
#     raw_wtg = soup.find(class_='a-section a-spacing-none mojo-performance-summary-table').find_all('span', class_='money')[2].text
#     wtg = money_to_int(raw_wtg)
    
    # Get China opening sales
    raw_cn_opening = get_opening_sales(soup, "China")
    cn_opening = money_to_int(raw_cn_opening)
    
    # Get Domestic opening sales
    raw_dom_opening = get_opening_sales(soup, "Domestic")
    dom_opening = money_to_int(raw_dom_opening)

    # Get China release date
    raw_cn_release = get_release_date(soup, "China")
    cn_release = to_date(raw_cn_release)

    # Get Domestic release date
    raw_dom_release = get_release_date(soup, "Domestic")
    dom_release = to_date(raw_dom_release)

    # Get link to summary page
    summary_link_stub = get_summary_page_link(soup)
    
    #Create movie dictionary and return
    movie_dict = dict(zip(headers, [title,
                                ctg,
                                dtg,
                                itg, 
                                cn_opening,
                                dom_opening,
                                cn_release,
                                dom_release,
                                summary_link_stub]))

    return movie_dict

In [256]:
def scrape_mp_1(links, empty_list):
    for link in links:
        empty_list.append(get_movie_dict_1(link))

In [257]:
# calculating time to run scraper
 
len(mp_links.values())                 # number of links to scrape, 12344 total
lap_time = 26                          # timed function to scrape 100 sites (1 'lap'), took 26 seconds
laps = len(mp_links.values()) / 100    # number of laps (of 100 sites) to scrape all sites in list 
total_sec = lap_time * laps            # number of sec to scrape all sites
total_min = total_sec / 60             # number of minutes to scape all sites
total_min

53.49066666666667

In [284]:
# RUNNING SCRAPER ON FIRST ROUND OF MOVIE PAGES - FIRST ATTEMPT

mp_info_list_1 = []      # ONLY RUN THIS LINE THE FIRST TIME USING --- DO NOT USE IN SUBSEQUENT ATTEMPS WHEN PICKIN UP HALFWAY DOWN LIST

scrape_mp_1(mp_links.values(), mp_info_list_1)

ValueError: could not convert string to float: 'Latest Updates:            News |            Daily |            Weekend |            All Time |            International |            ShowdownsHelp            BoxOfficeMojo.com by IMDbPro  an            IMDb            company.                    © IMDb.com Inc. or its affiliates. All rights reserved.            Box Office Mojo and IMDb are trademarks or registered trademarks of IMDb.com Inc. or its affiliates.            Conditions of Use             and             Privacy Policy            under which this service is provided to you.'

In [None]:
mp_info_1 = pd.DataFrame(mp_info_list_1)  #convert list of dict to df
mp_info_1.set_index('movie_title', inplace=True)


In [261]:
# scraping stopped after 4829 sites (~40% through) bc missing China Opening sales data on page
#############################################################################################################
# NOTE WHEN CLEANING DATA, IF CHINA GROSS BUT NO CHINA OPENING, CHINA OPENING IS GROSS AND THERE IS NO GROSS#
#############################################################################################################
len(mp_info_list_1)

4829

In [285]:
# last movie scraped on first pass

mp_info_list_1[4828]

{'movie_title': 'Star Trek',
 'china_total_gross': 8916639,
 'domestic_toal_gross': 257730019,
 'international_total_gross': 127950427,
 'china_opening_sales': None,
 'domestic_opening_sales': 75204289,
 'china_release_date': datetime.datetime(2009, 5, 15, 0, 0),
 'dom_release_date': datetime.datetime(2009, 5, 8, 0, 0),
 'summary_link_stub': '/title/tt0796366/?ref_=bo_gr_ti'}

In [286]:
# making list of all links

mp_links_list = list(mp_links.values())

# checking that list order lines up with dict order (also checked 1 before)
mp_links_list[4828]

'/releasegroup/gr2599703045/?ref_=bo_ydw_table_13'

In [287]:
# new list with only remaining titles to scrape
# SKIPPED PROBLEMATIC ENTRY (STAR WARS 2009, MONSTERS VS ALIENS, TERMINATOR SALVATION)

mp_links_list2 = mp_links_list[4831:]
len(mp_links_list2)

7513

In [288]:
# RUNNING SCRAPER ON FIRST ROUND OF MOVIE PAGES - SECOND ATTEMPT

scrape_mp_1(mp_links_list2, mp_info_list_1)

ValueError: could not convert string to float: 'Latest Updates:            News |            Daily |            Weekend |            All Time |            International |            ShowdownsHelp            BoxOfficeMojo.com by IMDbPro  an            IMDb            company.                    © IMDb.com Inc. or its affiliates. All rights reserved.            Box Office Mojo and IMDb are trademarks or registered trademarks of IMDb.com Inc. or its affiliates.            Conditions of Use             and             Privacy Policy            under which this service is provided to you.'

In [289]:
len(mp_info_list_1)

5909

In [299]:
mp_links_list3 = mp_links_list[(5912):]    # skipped 5909-11
len(mp_links_list3)

6432

In [295]:
# RUNNING SCRAPER ON FIRST ROUND OF MOVIE PAGES - THIRD ATTEMPT

scrape_mp_1(mp_links_list3, mp_info_list_1)

ValueError: could not convert string to float: 'Latest Updates:            News |            Daily |            Weekend |            All Time |            International |            ShowdownsHelp            BoxOfficeMojo.com by IMDbPro  an            IMDb            company.                    © IMDb.com Inc. or its affiliates. All rights reserved.            Box Office Mojo and IMDb are trademarks or registered trademarks of IMDb.com Inc. or its affiliates.            Conditions of Use             and             Privacy Policy            under which this service is provided to you.'

In [296]:
len(mp_info_list_1)

10977

In [306]:
mp_links_list4 = mp_links_list[(10980):]    # skipped 10977-80
len(mp_links_list4)

1364

In [307]:
# RUNNING SCRAPER ON FIRST ROUND OF MOVIE PAGES - FOURTH ATTEMPT

scrape_mp_1(mp_links_list4, mp_info_list_1)

In [308]:
len(mp_info_list_1)

12344

In [356]:
# CONVERTING MP1 DATA TO DF AND SAVING AS CSV

mp1_df = pd.DataFrame(mp_info_list_1)

mp1_df.to_csv('mp1_df.csv')

In [357]:
# filtering to only display movies that were screened in china

len(mp1_df[mp1_df.china_total_gross > 0].sort_values(by='china_total_gross', ascending=False))

838

Scraping 20 years only yielded 832 that were screened in China.

**Running back another 5 years (1995-99) to try to get up to 1000**

In [335]:
# grabbing links to movie pages for each movie listed in WW box office from 1995-1999 (1 page/year)

list_links2 = []

for year in range(1995, 2000):                                            # choosing date range
    link = "https://www.boxofficemojo.com/year/world/{}/".format(year)    # making list of yearly ww box office list pages
    list_links2.append(link)                                               


In [337]:
rows2 = []    # will have all movie page links

for url in list_links2:
    response = requests.get(url)                                                   # reading each list page (21 total)
    print(url, response.status_code)                                               # checkign response is in 200s for each URL, prints below
    page = response.text
    soup = BeautifulSoup(page, "html5lib")                                         # getting soup for each list page
    row = soup.find_all('td', class_='a-text-left mojo-field-type-release_group')  # narrowing soups down to only sections with movie page links 
    rows2.append(row)                                                               # adding each soup to list, makes list of lists, each list featuring all links from a single year

https://www.boxofficemojo.com/year/world/1995/ 200
https://www.boxofficemojo.com/year/world/1996/ 200
https://www.boxofficemojo.com/year/world/1997/ 200
https://www.boxofficemojo.com/year/world/1998/ 200
https://www.boxofficemojo.com/year/world/1999/ 200


In [340]:
mp_links2 = {}                                   # setting empty dict for key=name val=link
# main_site = 'https://www.boxofficemojo.com'    

for soup in rows2:
    for line in soup:                           # nested loop bc list of soups, each soup with many lines, iterates through each line of each soup to pull out a link from the line of soup element
        
        link = line.find('a')    
        title, url = link.text, link['href']
        mp_links2[title] = url

In [341]:
len(mp_links2)

1612

In [342]:
scrape_mp_1(mp_links2.values(), mp_info_list_1)

In [343]:
len(mp_info_list_1)

13956

In [344]:
len(mp1_df[mp1_df.china_total_gross > 0].sort_values(by='china_total_gross', ascending=False))

832

In [348]:
# Adding 1994 since that was year first hollywood action movie screened in china

list_links3 = []

for year in range(1994, 1995):                                            # choosing date range
    link = "https://www.boxofficemojo.com/year/world/{}/".format(year)    # making list of yearly ww box office list pages
    list_links3.append(link)                                               


In [350]:
rows3 = []    # will have all movie page links

for url in list_links3:
    response = requests.get(url)                                                   # reading each list page (21 total)
    print(url, response.status_code)                                               # checkign response is in 200s for each URL, prints below
    page = response.text
    soup = BeautifulSoup(page, "html5lib")                                         # getting soup for each list page
    row = soup.find_all('td', class_='a-text-left mojo-field-type-release_group')  # narrowing soups down to only sections with movie page links 
    rows3.append(row)                                                               # adding each soup to list, makes list of lists, each list featuring all links from a single year

mp_links3 = {}                                   # setting empty dict for key=name val=link
# main_site = 'https://www.boxofficemojo.com'    

for soup in rows3:
    for line in soup:                           # nested loop bc list of soups, each soup with many lines, iterates through each line of each soup to pull out a link from the line of soup element
        
        link = line.find('a')    
        title, url = link.text, link['href']
        mp_links3[title] = url

https://www.boxofficemojo.com/year/world/1994/ 200


In [352]:
len(mp_links3)

255

In [353]:
scrape_mp_1(mp_links3.values(), mp_info_list_1)

In [359]:
# updating CSV with second and third round of scraping

mp1_df = pd.DataFrame(mp_info_list_1)

mp1_df.to_csv('mp1_df.csv')

### 3. Scraping movie pages with film info 

Page that displays after clicking "title summary"  

#### 3a. Getting links to scrape

In [365]:
mp1_df = pd.read_csv('mp1_df.csv', parse_dates=['china_release_date', 'dom_release_date'])

In [375]:
mp1_df.drop("Unnamed: 0", axis=1, inplace=True)

In [383]:
cn_only_mp1_df = mp1_df[mp1_df.china_total_gross > 0]

In [386]:
# still only 838 movies that came to china, but that's just what we'll have to work with!

len(cn_only_mp1_df)

838

In [394]:
# list of link stubs to each movies SECOND movie page (after clicking on 'Title Summary')

mp2_link_stubs_list = (cn_only_mp1_df
                       .summary_link_stub
                       .to_list())

#### 3b. Testing for a single movie page

In [395]:
response = requests.get('https://www.boxofficemojo.com/title/tt0120755/?ref_=bo_tt_ti')                                                 

In [396]:
response.status_code

200

In [397]:
page = response.text

In [398]:
soup = BeautifulSoup(page, "html5lib") 

In [399]:
# title

title = soup.find('h1', class_='a-size-extra-large').text
title

'Mission: Impossible II (2000)'

In [433]:
get_movie_value(soup, 'Cast and Crew')

'All-Time Rankings'

In [404]:
soup.find_all('span', class_='a-section a-spacing-none mojo-summary-values mojo-hidden-from-mobile')

[]

In [405]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from Box Office Mojo HTML
    
    Takes a string attribute of a movie on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_element = obj.findNext()
    
    if next_element:
        return next_element.text 
    else:
        return None

In [415]:
# Domestic distributor 

dom_distributor = get_movie_value(soup, 'Distributor')
dom_distributor

'Paramount PicturesSee full company information\n        \n    '

In [420]:
clean_text_string(dom_distributor)

'Paramount Pictures'

In [421]:
# Budget

budget = get_movie_value(soup, 'Budget')
budget

'$125,000,000'

In [422]:
# MPAA

mpaa = get_movie_value(soup, 'MPAA')
mpaa

'PG-13'

In [424]:
# Running time

running_time = get_movie_value(soup, 'Running Time')
running_time

'2 hr 3 min'

In [425]:
# Genres - MUST SPLIT

genres = get_movie_value(soup, 'Genres')
genres

'Action\n    \n        Adventure\n    \n        Thriller'

In [431]:
clean_text_string(genres).split()

['Action', 'Adventure', 'Thriller']

#### 3c. Extend to all moviepages

In [447]:
def get_movie_dict_2(link):
    '''
    From BoxOfficeMojo link stub, request movie html, parse with BeautifulSoup, and
    collect 
        'movie_title', 
        'dom_distributor', 
        'budget', 
        'mpaa', 
        'running_time',
        'genres', 
    Return information as a dictionary.
    '''
    
    base_url = 'https://www.boxofficemojo.com'
    
    #Create full url to scrape
    url = base_url + link
    
    # Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html5lib")

    
    headers = ['movie_title', 'dom_distributor', 'budget', 'mpaa', 'running_time', 'genres', 'cast_link_stub']
    
    # Get title
    title_string = soup.find('h1', class_='a-size-extra-large').text
    title = title_string.split('-')[0].strip()

    # Get domestic distributor
    raw_dom_distributor = get_movie_value(soup, 'Distributor')
    dom_distributor = clean_text_string(raw_dom_distributor)
    
    # Get budget
    raw_budget = get_movie_value(soup, 'Budget')
    budget = money_to_int(raw_budget)

    # Get mpaa rating
    mpaa = get_movie_value(soup, 'MPAA')
    
    # Get running time
    raw_running_time = get_movie_value(soup, 'Running Time')
    running_time = runtime_to_minutes(raw_running_time)
    
    # Get genres
    raw_genres = get_movie_value(soup, 'Genres')
    genres = clean_text_string(raw_genres).split()

    # Get link to cast page
    cast_link_stub = get_cast_page_link(soup)
    
    #Create movie dictionary and return
    movie_dict = dict(zip(headers, [title,
                                dom_distributor,
                                budget,
                                mpaa, 
                                running_time,
                                genres,
                                cast_link_stub]))

    return movie_dict

In [449]:
def scrape_mp_2(links, empty_list):
    for link in links:
        empty_list.append(get_movie_dict_2(link))

In [451]:
# RUNNING SCRAPER ON SECOND ROUND OF MOVIE PAGES - FIRST ATTEMPT

mp_info_list_2 = []      # ONLY RUN THIS LINE THE FIRST TIME USING --- DO NOT USE IN SUBSEQUENT ATTEMPS WHEN PICKIN UP HALFWAY DOWN LIST

scrape_mp_2(mp2_link_stubs_list, mp_info_list_2)

In [453]:
len(mp_info_list_2)

838

In [454]:
# CONVERTING MP2 DATA TO DF AND SAVING AS CSV

mp2_df = pd.DataFrame(mp_info_list_2)

mp2_df.to_csv('mp2_df.csv')

### 4. Getting cast info from movie page 3

Page that displays after clicking "cast and crew"  

#### 4a. Getting links to scrape

In [763]:
mp2_df = pd.read_csv('mp2_df.csv')

In [764]:
mp2_df.drop("Unnamed: 0", axis=1, inplace=True)

In [460]:
mp2_df.head()

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub
0,Mission: Impossible II (2000),Paramount Pictures,125000000.0,PG-13,123.0,"['Action', 'Adventure', 'Thriller']",/title/tt0120755/credits/?ref_=bo_tt_tab#tabs
1,Gladiator (2000),DreamWorks Distribution,103000000.0,R,155.0,"['Action', 'Adventure', 'Drama']",/title/tt0172495/credits/?ref_=bo_tt_tab#tabs
2,What Women Want (2011),China Lion Film Distribution,,,116.0,"['Comedy', 'Romance', 'Sci-Fi']",/title/tt1667150/credits/?ref_=bo_tt_tab#tabs
3,Charlie's Angels (2019),Sony Pictures Releasing,48000000.0,PG-13,118.0,"['Action', 'Adventure', 'Comedy']",/title/tt5033998/credits/?ref_=bo_tt_tab#tabs
4,Southpaw (2015),The Weinstein Company,30000000.0,R,124.0,"['Action', 'Drama', 'Sport']",/title/tt1798684/credits/?ref_=bo_tt_tab#tabs


In [466]:
mp3_link_stubs_list = (mp2_df
                       .cast_link_stub
                       .to_list())

#### 4b. Testing for a single movie page

In [485]:
response = requests.get('https://www.boxofficemojo.com/title/tt0120755/credits/?ref_=bo_tt_tab#tabs')                                                 

In [486]:
response.status_code

200

In [487]:
page = response.text

In [494]:
soup = BeautifulSoup(page, "html5lib") 

In [495]:
# title

title = soup.find('h1', class_='a-size-extra-large').text
title

'Mission: Impossible II (2000)'

In [504]:
# director

director = soup.find('td', text='Director').previous_sibling.text

In [505]:
clean_text_string(director)

'John Woo'

In [597]:
def get_director(soup):
    
    obj = soup.find('td', text='Director')
    
    if not obj:
        return None
    
    prev_element = obj.previous_sibling
    
    if prev_element:
        return prev_element.text
    
    else:
        return None

In [640]:
# principal_cast

actors_soup = soup.find('table', {"id": "principalCast"}).find_all('td')

actor_list = []

for row in actors_soup:
    actors_string = row.text
    clean_string = clean_text_string(actors_string)
    actor_list.append(clean_string)
    cast_list = actor_list[0::2]                     # removes character names and leaves only actor names
    
cast_list

['Tom Cruise', 'Dougray Scott', 'Thandie Newton', 'Ving Rhames']

In [670]:
def get_cast(soup):
    
    obj = soup.find('table', {"id": "principalCast"}).find_all('td')
    
    if not obj:
        return None
    
    actor_list = []
    
    if obj:
        for row in obj:
            actors_string = row.text

            clean_string = clean_text_string(actors_string)

            actor_list.append(clean_string)

            clean_list = actor_list[0::2]     # removes character names and leaves only actor names

            final_set = set(clean_list)       # removes duplicate actor names
            
            final_set = str(final_set).strip().replace('{', '').replace('}', '').replace('\'','')

        return str(final_set)           # convert back to list
        
    else: 
        return None


In [684]:
# to get list out of this cell later and separate

test = get_cast(soup)
print(test)
print('\n')
print(test.split(','))


Dougray Scott, Tom Cruise, Thandie Newton, Ving Rhames


['Dougray Scott', ' Tom Cruise', ' Thandie Newton', ' Ving Rhames']


#### c. Extend to all moviepages

In [651]:
def get_movie_dict_3(link):
    '''
    From BoxOfficeMojo link stub, request movie html, parse with BeautifulSoup, and
    collect 
        'movie_title', 
        'director', 
        'principal_cast' 
    Return information as a dictionary.
    '''
    
    base_url = 'https://www.boxofficemojo.com'
    
    #Create full url to scrape
    url = base_url + link
    
    # Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html5lib")

    
    headers = ['movie_title', 'director', 'principal_cast']
    
    # Get title
    title_string = soup.find('h1', class_='a-size-extra-large').text
    title = title_string.split('-')[0].strip()

    # Get director
    raw_director = get_director(soup)
    director = clean_text_string(raw_director)
 
    # Get principal cast
    principal_cast = get_cast(soup)

    #Create movie dictionary and return
    movie_dict = dict(zip(headers, [title,
                                director,
                                principal_cast]))

    return movie_dict

In [599]:
def scrape_mp_3(links, empty_list):
    for link in links:
        empty_list.append(get_movie_dict_3(link))

In [685]:
# RUNNING SCRAPER ON THIRD ROUND OF MOVIE PAGES - FIRST ATTEMPT

mp_info_list_3 = []      # ONLY RUN THIS LINE THE FIRST TIME USING --- DO NOT USE IN SUBSEQUENT ATTEMPS WHEN PICKIN UP HALFWAY DOWN LIST

scrape_mp_3(mp3_link_stubs_list, mp_info_list_3)

AttributeError: 'NoneType' object has no attribute 'find_all'

In [653]:
len(mp_info_list_3)

547

In [686]:
mp3_link_stubs_list_2 = mp3_link_stubs_list[548:]
len(mp3_link_stubs_list_2)

290

In [687]:
# RUNNING SCRAPER ON THIRD ROUND OF MOVIE PAGES - SECOND ATTEMPT (skipping 547)

scrape_mp_3(mp3_link_stubs_list_2, mp_info_list_3)

In [688]:
len(mp_info_list_3)

837

In [689]:
# CONVERTING MP3 DATA TO DF AND SAVING AS CSV

mp3_df = pd.DataFrame(mp_info_list_3)

mp3_df.to_csv('mp3_df.csv')

### 5. Merging all three BOM dfs

#### 5a. cleaning principal cast

In [765]:
mp3_df = pd.read_csv('mp3_df.csv')

In [766]:
mp3_df.drop("Unnamed: 0", axis=1, inplace=True)

In [695]:
mp3_df.head(1)

Unnamed: 0,movie_title,director,principal_cast
0,Mission: Impossible II (2000),John Woo,"Dougray Scott, Tom Cruise, Thandie Newton, Vin..."


In [693]:
mp2_df.head(1)

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub
0,Mission: Impossible II (2000),Paramount Pictures,125000000.0,PG-13,123.0,"['Action', 'Adventure', 'Thriller']",/title/tt0120755/credits/?ref_=bo_tt_tab#tabs


In [716]:
mp1_df.head(1)

Unnamed: 0,movie_title,china_total_gross,domestic_toal_gross,international_total_gross,china_opening_sales,domestic_opening_sales,china_release_date,dom_release_date,summary_link_stub
0,Mission: Impossible II,3453141.0,215409889,330978219,,57845297.0,2000-08-21,2000-05-24,/title/tt0120755/?ref_=bo_gr_ti


In [726]:
# testing how to make movie_title uniform across dfs

' '.join(mp2_df.movie_title[0].split()[:-1])

'Mission: Impossible II'

In [767]:
# cleaning mp2 and mp3 movie_title 

mp2_df.movie_title = mp2_df.movie_title.apply(lambda row: ' '.join(row.split()[:-1]))
mp3_df.movie_title = mp3_df.movie_title.apply(lambda row: ' '.join(row.split()[:-1]))

In [770]:
mp3_df.head(3)

Unnamed: 0,movie_title,director,principal_cast
0,Mission: Impossible II,John Woo,"Dougray Scott, Tom Cruise, Thandie Newton, Vin..."
1,Gladiator,Ridley Scott,"Connie Nielsen, Oliver Reed, Russell Crowe, Jo..."
2,What Women Want,Daming Chen,"Li Yuan, Julian Chen, Andy Lau, Li Gong"


#### 5b. Merging DFs

In [771]:
merged_1_2 =  pd.merge(mp2_df, mp1_df, how='left', on='movie_title')

In [772]:
len(merged_1_2)

907

In [774]:
merged_1_2.head()

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub,china_total_gross,domestic_toal_gross,international_total_gross,china_opening_sales,domestic_opening_sales,china_release_date,dom_release_date,summary_link_stub
0,Mission: Impossible II,Paramount Pictures,125000000.0,PG-13,123.0,"['Action', 'Adventure', 'Thriller']",/title/tt0120755/credits/?ref_=bo_tt_tab#tabs,3453141.0,215409889.0,330978219.0,,57845297.0,2000-08-21,2000-05-24,/title/tt0120755/?ref_=bo_gr_ti
1,Gladiator,DreamWorks Distribution,103000000.0,R,155.0,"['Action', 'Adventure', 'Drama']",/title/tt0172495/credits/?ref_=bo_tt_tab#tabs,3376447.0,187705427.0,272878533.0,,34819017.0,2000-08-05,2000-05-05,/title/tt0172495/?ref_=bo_gr_ti
2,What Women Want,China Lion Film Distribution,,,116.0,"['Comedy', 'Romance', 'Sci-Fi']",/title/tt1667150/credits/?ref_=bo_tt_tab#tabs,10288154.0,123526.0,11714651.0,,53224.0,2011-02-03,2011-02-03,/title/tt1667150/?ref_=bo_gr_ti
3,Charlie's Angels,Sony Pictures Releasing,48000000.0,PG-13,118.0,"['Action', 'Adventure', 'Comedy']",/title/tt5033998/credits/?ref_=bo_tt_tab#tabs,10800219.0,17803077.0,55476811.0,7639741.0,8351109.0,2019-11-15,2019-11-15,/title/tt5033998/?ref_=bo_gr_ti
4,Southpaw,The Weinstein Company,30000000.0,R,124.0,"['Action', 'Drama', 'Sport']",/title/tt1798684/credits/?ref_=bo_tt_tab#tabs,1168081.0,52421953.0,39548874.0,722844.0,16701294.0,2016-09-02,2015-07-24,/title/tt1798684/?ref_=bo_gr_ti


In [775]:
merged_1_2.columns

Index(['movie_title', 'dom_distributor', 'budget', 'mpaa', 'running_time',
       'genres', 'cast_link_stub', 'china_total_gross', 'domestic_toal_gross',
       'international_total_gross', 'china_opening_sales',
       'domestic_opening_sales', 'china_release_date', 'dom_release_date',
       'summary_link_stub'],
      dtype='object')

In [776]:
merged_final_df =  pd.merge(merged_1_2, mp3_df, how='left', on='movie_title')

In [873]:
merged_final_df.head()

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub,china_total_gross,domestic_toal_gross,international_total_gross,china_opening_sales,domestic_opening_sales,china_release_date,dom_release_date,summary_link_stub,director,principal_cast
0,Mission: Impossible II,Paramount Pictures,125000000.0,PG-13,123.0,"['Action', 'Adventure', 'Thriller']",/title/tt0120755/credits/?ref_=bo_tt_tab#tabs,3453141.0,215409889.0,330978219.0,,57845297.0,2000-08-21,2000-05-24,/title/tt0120755/?ref_=bo_gr_ti,John Woo,"Dougray Scott, Tom Cruise, Thandie Newton, Vin..."
1,Gladiator,DreamWorks Distribution,103000000.0,R,155.0,"['Action', 'Adventure', 'Drama']",/title/tt0172495/credits/?ref_=bo_tt_tab#tabs,3376447.0,187705427.0,272878533.0,,34819017.0,2000-08-05,2000-05-05,/title/tt0172495/?ref_=bo_gr_ti,Ridley Scott,"Connie Nielsen, Oliver Reed, Russell Crowe, Jo..."
2,What Women Want,China Lion Film Distribution,,,116.0,"['Comedy', 'Romance', 'Sci-Fi']",/title/tt1667150/credits/?ref_=bo_tt_tab#tabs,10288154.0,123526.0,11714651.0,,53224.0,2011-02-03,2011-02-03,/title/tt1667150/?ref_=bo_gr_ti,Daming Chen,"Li Yuan, Julian Chen, Andy Lau, Li Gong"
3,Charlie's Angels,Sony Pictures Releasing,48000000.0,PG-13,118.0,"['Action', 'Adventure', 'Comedy']",/title/tt5033998/credits/?ref_=bo_tt_tab#tabs,10800219.0,17803077.0,55476811.0,7639741.0,8351109.0,2019-11-15,2019-11-15,/title/tt5033998/?ref_=bo_gr_ti,Elizabeth Banks,"Naomi Scott, Elizabeth Banks, Kristen Stewart,..."
4,Southpaw,The Weinstein Company,30000000.0,R,124.0,"['Action', 'Drama', 'Sport']",/title/tt1798684/credits/?ref_=bo_tt_tab#tabs,1168081.0,52421953.0,39548874.0,722844.0,16701294.0,2016-09-02,2015-07-24,/title/tt1798684/?ref_=bo_gr_ti,Antoine Fuqua,"Rachel McAdams, Jake Gyllenhaal, Oona Laurence..."


In [611]:
mp3_df.head()

Unnamed: 0.1,Unnamed: 0,movie_title,director,principal_cast
0,0,Mission: Impossible II (2000),John Woo,"['Ryan Gosling', 'Thandie Newton', 'Bruce Will..."
1,1,Gladiator (2000),Ridley Scott,"['Ryan Gosling', 'Thandie Newton', 'Bruce Will..."
2,2,What Women Want (2011),Daming Chen,"['Ryan Gosling', 'Thandie Newton', 'Bruce Will..."
3,3,Charlie's Angels (2019),Elizabeth Banks,"['Ryan Gosling', 'Thandie Newton', 'Bruce Will..."
4,4,Southpaw (2015),Antoine Fuqua,"['Ryan Gosling', 'Thandie Newton', 'Bruce Will..."


In [779]:
.columns

Index(['movie_title', 'dom_distributor', 'budget', 'mpaa', 'running_time',
       'genres', 'cast_link_stub', 'china_total_gross', 'domestic_toal_gross',
       'international_total_gross', 'china_opening_sales',
       'domestic_opening_sales', 'china_release_date', 'dom_release_date',
       'summary_link_stub', 'director', 'principal_cast'],
      dtype='object')

In [889]:
len(merged_final_df)

799

In [866]:
merged_final_df.dropna(subset=['china_total_gross'], inplace=True)

In [888]:
# some rows have duplicates wiht different release dates, and various gross

merged_final_df[merged_final_df.movie_title.duplicated()]

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub,china_total_gross,domestic_toal_gross,international_total_gross,china_opening_sales,domestic_opening_sales,china_release_date,dom_release_date,summary_link_stub,director,principal_cast


In [None]:
# removing various dupes

In [825]:
merged_final_df.drop(range(1146, 1151), inplace=True)

In [887]:
merged_final_df.drop(987, inplace=True)

In [886]:
merged_final_df[merged_final_df.movie_title == 'Tall Tales']

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub,china_total_gross,domestic_toal_gross,international_total_gross,china_opening_sales,domestic_opening_sales,china_release_date,dom_release_date,summary_link_stub,director,principal_cast


In [895]:
merged_final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 799 entries, 0 to 1160
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   movie_title                799 non-null    object        
 1   dom_distributor            791 non-null    object        
 2   budget                     427 non-null    float64       
 3   mpaa                       611 non-null    object        
 4   running_time               781 non-null    float64       
 5   genres                     799 non-null    object        
 6   cast_link_stub             799 non-null    object        
 7   china_total_gross          799 non-null    float64       
 8   domestic_toal_gross        799 non-null    float64       
 9   international_total_gross  799 non-null    float64       
 10  china_opening_sales        408 non-null    float64       
 11  domestic_opening_sales     769 non-null    float64       
 12  china_r

In [896]:
# CONVERTING FINAL DATA TO DF AND SAVING AS CSV

merged_final_df.to_csv('merged_final_df.csv')

In [902]:
# Movies that made more money in China than USA

len(merged_final_df[merged_final_df.china_total_gross > merged_final_df.domestic_toal_gross])

295

In [913]:
# Movies that did better in China than USA & were released in countries other than US and China (have greater int'l audience)

len(merged_final_df[(merged_final_df.china_total_gross > merged_final_df.domestic_toal_gross) & (merged_final_df.china_total_gross != merged_final_df.international_total_gross)].sort_values(by='domestic_toal_gross', ascending=False))

278

In [916]:
# Movies where international sales ONLY come from china

len(merged_final_df[merged_final_df.china_total_gross == merged_final_df.international_total_gross])

20

In [918]:
merged_final_df[merged_final_df.china_total_gross == merged_final_df.international_total_gross]

Unnamed: 0,movie_title,dom_distributor,budget,mpaa,running_time,genres,cast_link_stub,china_total_gross,domestic_toal_gross,international_total_gross,china_opening_sales,domestic_opening_sales,china_release_date,dom_release_date,summary_link_stub,director,principal_cast
616,The Ark of Mr Chow,China Lion Film Distribution,,,106.0,['Comedy'],/title/tt4727756/credits/?ref_=bo_tt_tab#tabs,7840000.0,54075.0,7840000.0,,22583.0,2015-06-19,2015-06-19,/title/tt4727756/?ref_=bo_gr_ti,Yang Xiao,"Yuexin Wang, Dongyu Zhou, Honglei Sun, Zijian ..."
621,A Fool,China Lion Film Distribution,,,103.0,"['Comedy', 'Drama']",/title/tt3856504/credits/?ref_=bo_tt_tab#tabs,2059608.0,8212.0,2059608.0,2040784.0,5741.0,2015-11-19,2015-11-27,/title/tt3856504/?ref_=bo_gr_ti,Jianbin Chen,"Xuebing Wang, Jianbin Chen, Kim Scar, Qinqin J..."
737,Phantom of the Theatre,Well Go USA Entertainment,,,103.0,"['Drama', 'Mystery', 'Romance', 'Thriller']",/title/tt5639650/credits/?ref_=bo_tt_tab#tabs,13482888.0,43955.0,13482888.0,7986290.0,21001.0,2016-04-29,2016-05-06,/title/tt5639650/?ref_=bo_gr_ti,Wai-Man Yip,"Simon Yam, Gangshan Jing, Ruby Lin, Tony Yo-ni..."
752,Papa,Jampa Films,,,106.0,"['Comedy', 'Drama', 'Family']",/title/tt4694440/credits/?ref_=bo_tt_tab#tabs,1329356.0,26677.0,1329356.0,920639.0,14644.0,2016-03-18,2016-03-18,/title/tt4694440/?ref_=bo_gr_ti,Xiao Zheng,"Yu Xia, Zi Yang, David Wu, Zuer Song"
753,Kaili Blues,Grasshopper Film,,,113.0,"['Drama', 'Mystery']",/title/tt4613272/credits/?ref_=bo_tt_tab#tabs,903072.0,32164.0,903072.0,494049.0,4164.0,2016-07-15,2016-05-20,/title/tt4613272/?ref_=bo_gr_ti,Bi Gan,"Feiyang Luo, Yongzhong Chen, Linyan Liu, Yue Guo"
843,This Is Not What I Expected,Well Go USA Entertainment,,,106.0,"['Comedy', 'Drama', 'Romance']",/title/tt6772874/credits/?ref_=bo_tt_tab#tabs,30658945.0,337670.0,30658945.0,12646812.0,135252.0,2017-04-27,2017-05-05,/title/tt6772874/?ref_=bo_gr_ti,Derek Hui,"Takeshi Kaneshiro, Ming Xi, Yi-zhou Sun, Dongy..."
849,Extraordinary Mission,Crimson Forest,23000000.0,,117.0,"['Action', 'Crime']",/title/tt6690310/credits/?ref_=bo_tt_tab#tabs,22703590.0,54174.0,22703590.0,,28649.0,2017-03-31,2017-04-07,/title/tt6690310/?ref_=bo_gr_ti,Alan Mak,"Yueting Lang, Yanhui Wang, Yihong Duan, Xuan H..."
860,God of War,Well Go USA Entertainment,,,128.0,"['Action', 'History']",/title/tt6083388/credits/?ref_=bo_tt_tab#tabs,9506524.0,53000.0,9506524.0,4155029.0,23912.0,2017-05-27,2017-06-02,/title/tt6083388/?ref_=bo_gr_ti,Gordon Chan,"Regina Wan, Keisuke Koide, Wenzhuo Zhao, Sammo..."
869,In Harm's Way,Shout! Factory,,,97.0,"['History', 'Romance', 'War']",/title/tt5759434/credits/?ref_=bo_tt_tab#tabs,4447734.0,4447734.0,4447734.0,1928669.0,,2017-11-10,2018-11-02,/title/tt5759434/?ref_=bo_gr_ti,Bille August,"Emile Hirsch, Cary Woodworth, Shaoqun Yu, Yife..."
873,Soul on a String,Film Movement,,,142.0,['Drama'],/title/tt5974624/credits/?ref_=bo_tt_tab#tabs,487625.0,3669.0,487625.0,38877.0,1355.0,2017-08-04,2017-05-19,/title/tt5974624/?ref_=bo_gr_ti,Yang Zhang,"Jinpa, Quniciren, Siano Dudiom Zahi"


In [919]:
# CONVERTING  C L E A N  FINAL DATA TO DF AND SAVING AS CSV

merged_final_df.to_csv('merged_final_df_clean.csv')

## Scraping list of top directors

In [3]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_highest-grossing_film_directors')
page = response.text
soup = BeautifulSoup(page, "html5lib")


In [16]:
# top directors worldwide names

names = soup.find('table').find_all('span')

In [24]:
director_name_list = []

for name in names:
    name = name.text
    name = name.strip()
    if name in director_name_list:
        pass
    else:
        director_name_list.append(name)

list(set(director_name_list))

['Michael Bay',
 'Steven Spielberg',
 'James Cameron',
 'Russo brothers',
 'David Yates',
 'Christopher Nolan',
 'Jon Favreau',
 'Peter Jackson',
 'J. J. Abrams',
 'Tim Burton']

In [25]:
director_name_list

['Steven Spielberg',
 'Russo brothers',
 'Peter Jackson',
 'Michael Bay',
 'James Cameron',
 'David Yates',
 'Christopher Nolan',
 'J. J. Abrams',
 'Tim Burton',
 'Jon Favreau']

In [40]:
lifetime_gross_raw = soup.find('table').find_all('td', attrs={'style':'text-align:right;'})

lifetime_gross_list = []

for gross in lifetime_gross_raw:
    gross = money_to_int(gross.text)
    if name in lifetime_gross_list:
        pass
    else:
        lifetime_gross_list.append(gross)

list(lifetime_gross_list)

[10548456861,
 6844248566,
 6546042615,
 6443668117,
 6235731293,
 6020939913,
 4704255828,
 4625988452,
 4412653899,
 4333849545]

In [41]:
director_dict = dict(zip(director_name_list, lifetime_gross_list))
director_dict

{'Steven Spielberg': 10548456861,
 'Russo brothers': 6844248566,
 'Peter Jackson': 6546042615,
 'Michael Bay': 6443668117,
 'James Cameron': 6235731293,
 'David Yates': 6020939913,
 'Christopher Nolan': 4704255828,
 'J. J. Abrams': 4625988452,
 'Tim Burton': 4412653899,
 'Jon Favreau': 4333849545}

In [50]:
director_df = pd.DataFrame(list(director_dict.items()),columns = ['Name','Lifetime Gross']) 

In [51]:
director_df

Unnamed: 0,Name,Lifetime Gross
0,Steven Spielberg,10548456861
1,Russo brothers,6844248566
2,Peter Jackson,6546042615
3,Michael Bay,6443668117
4,James Cameron,6235731293
5,David Yates,6020939913
6,Christopher Nolan,4704255828
7,J. J. Abrams,4625988452
8,Tim Burton,4412653899
9,Jon Favreau,4333849545


In [53]:
# SAVING AS CSV

director_df.to_csv('director_df.csv', index=False)

## Scrapign top international stars

In [106]:
response = requests.get('https://www.the-numbers.com/box-office-star-records/international/lifetime-acting/top-grossing-leading-stars')
page = response.text
soup = BeautifulSoup(page, "html5lib")


In [91]:
# rank

rank = soup.find('td', class_='data').text

In [87]:
# name

actor_name = soup.find('td', class_='data').next_sibling.next_sibling.text

'Robert Downey, Jr.'

In [88]:
# lifetime int'l gross

lifetime_gross = soup.find('td', class_='data').next_sibling.next_sibling.next_sibling.next_sibling.text

In [89]:
# number of movies

num_films = soup.find('td', class_='data').next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.text

In [90]:
# avg box office per movie

per_movie = soup.find('td', class_='data').next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.text

In [117]:
# pulling table from single page

headers = ['rank', 'actor_name', 'lifetime_gross', 'num_films', 'per_movie']
actor_data = []

for obj in soup.find_all('td', class_='data'):

    rank = obj.text
    
    actor_name = obj.next_sibling.next_sibling.text
    
    raw_lifetime_gross = obj.next_sibling.next_sibling.next_sibling.next_sibling.text
    lifetime_gross = money_to_int(raw_lifetime_gross)
    
    num_films = obj.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.text
    
    raw_per_movie = obj.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.text
    per_movie = money_to_int(raw_per_movie)

    actor_dict = dict(zip(headers, [rank,
                                    actor_name,
                                    lifetime_gross,
                                    num_films,
                                    per_movie]))

    actor_data.append(actor_dict)

In [122]:
# Converting to DF adn saving as CSV

actor_df = pd.DataFrame(actor_data)

actor_df.to_csv('actor_df.csv', index=False)