# IMDB Film Data Scraper Function
## Web Scraping Development

## Objectives
* Apply the earlier web scraping development work to build a film data scraper function

In [1]:
# Install packages, if necessary:
# pip install requests
# pip install beautifulsoup4

# Load libraries and URL:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
import psycopg2
import getpass
import sqlalchemy as sa
# import numpy as np
# import seaborn as sns

## Summarizing Currency Conversion Function
The usd_conversion.py function is a utility that automatically converts reported financial values to USD so they can be compared. It is intended to support the IMDB_film_data.py function.

### UX Summary:
* When budget and box office data are scraped, the currency conversion function automatically detects the reported currency and converts the values to USD
* Function returns the submitted financial value converted to USD, or `None` if the value cannot be parsed as a number

### Assumptions:
* Financial value submitted has either a `$` prefix (treated as USD) or a three-character currency prefix, e.g. EUR, which determines the exchange rate used
* Exchange rate used is the latest rate available; historical rates and inflation are not taken into account
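
The prefix-based detection described above can be sketched without any network call. This is only an illustration: `split_currency`, `to_usd`, and `example_rates` are hypothetical names, the rates are made-up numbers rather than real quotes, and the dictionary is assumed to hold units of each currency per 1 USD. Unlike the function in this notebook, the sketch also strips whitespace after the currency code and returns `None` for an unknown code.

```python
def split_currency(native_value):
    """Split a reported value such as 'GBP1,000,000' into (code, amount)."""
    text = native_value.strip()
    if text.startswith('$'):
        code, digits = 'USD', text[1:]
    else:
        code, digits = text[:3], text[3:]
    digits = digits.replace(',', '').strip()
    # Only plain whole-number strings are accepted, as in the original function
    amount = float(digits) if digits.isnumeric() else None
    return code, amount


def to_usd(native_value, rates):
    """Convert a reported value to USD using a {code: units per USD} mapping."""
    code, amount = split_currency(native_value)
    if amount is None:
        return None
    if code == 'USD':
        return amount
    rate = rates.get(code)  # None when the currency code is unknown
    if rate is None:
        return None
    return amount / rate


# Assumed example rates (units of currency per 1 USD), not real quotes
example_rates = {'GBP': 0.5, 'JPY': 100.0}

print(to_usd('$15,000,000', example_rates))
print(to_usd('GBP1,000,000', example_rates))
print(to_usd('XYZ500', example_rates))
```

Passing the rates in as an argument keeps the parsing logic testable offline; the API lookup can then happen once, outside the function.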

### Acknowledgment:
This function relies on the foreign exchange rates API created by Madis Väin (Ref. https://exchangeratesapi.io).

In [2]:
# Exchange rate function to convert budget and box office values to USD
# Function can be modified to convert to other available currencies

def usd_conversion(native_value):
    # Call Exchange Rates API to look up latest USD exchange rates
    r_usd = 'https://api.exchangeratesapi.io/latest?base=USD'
    usd_response = requests.get(r_usd)
    rates = usd_response.json()
    
    # Parse reported value to determine currency used and remove currency code from string
    native_value = native_value.strip()
    if native_value[0] == '$':
        num_value = native_value.replace('$','')
        exchange_rate = 1
    else:
        currency = native_value[:3]
        exchange_rate = rates['rates'][currency]
        num_value = native_value[3:]
    num_value = num_value.replace(',','')
    if num_value.isnumeric():
        usd_value = float(num_value) / exchange_rate
    else:
        usd_value = None
    return usd_value

## Summarizing Film Data Scraper
The IMDB_film_data.py function processes the input film href to scrape film, IMDB rating, budget, and box office data from its respective IMDB film page.

This function has hardcoded parameters to scrape the specific information required for this project.  Code comments have been provided to permit others to modify the scraper to fit different needs.

### Film href:
The function uses a film href as an input.  This is intentional, as the IMDB_filmo_scraper.py function scrapes filmography titles and the respective IMDB film href.

For example, the following is the full IMDB URL for the film 'Top Gun':
```
https://www.imdb.com/title/tt0092099/
```
To use the IMDB_film_data.py function, the following IMDB film href is used:
```
/title/tt0092099/
```
The full command used is:
```
imdb_film_data('/title/tt0092099/')
```
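
As a small illustration of the href convention, the sketch below extracts the href from a full URL and rebuilds the URL from it. `title_href` and `full_url` are hypothetical helper names, and the `tt` plus digits pattern is an assumption based on the example above rather than a documented IMDB rule.

```python
import re
from urllib.parse import urlparse

IMDB_BASE = 'https://www.imdb.com'
# Assumed shape of a film href, inferred from the Top Gun example
TITLE_HREF = re.compile(r'^/title/tt\d+/$')


def title_href(url):
    """Return the film href ('/title/tt.../') contained in a full IMDB URL."""
    path = urlparse(url).path  # drops the scheme, host, and any query string
    if not path.endswith('/'):
        path += '/'
    if TITLE_HREF.match(path) is None:
        raise ValueError('not an IMDB title URL: ' + url)
    return path


def full_url(href):
    """Rebuild the full URL the same way the scraper does."""
    return IMDB_BASE + href


href = title_href('https://www.imdb.com/title/tt0092099/')
print(href)
print(full_url(href))
```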

### UX Summary:
* User provides the href that completes the IMDB title URL
* Function retrieves the following data:
    * Title as 'title'
    * Year as 'year'
    * IMDB rating as 'imdb_rating'
    * Number of IMDB ratings submitted as 'imdb_qty'
    * Budget as 'budget'
    * Opening Weekend as 'gross_wknd'
    * Domestic Gross as 'gross_domestic'
    * Worldwide Gross as 'gross_ww'
* Function returns a list of IMDB film data values, in the order shown above and preceded by the input href

### Assumptions:
* Hardcoded HTML tags assume HTML structure will not change for life of project
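
Once the heading text has been extracted from the page, separating the title from the year is plain string handling. The sketch below uses a regular expression instead of the length-based slicing in the function that follows, so it is an alternative technique rather than the notebook's own method. `split_title_year` is a hypothetical helper, and the heading format (title, non-breaking space, year in parentheses) is an assumption about the page text.

```python
import re

# Title, optional whitespace, optional four-digit year in parentheses
HEADING = re.compile(r'^(?P<title>.*?)\s*(?:\((?P<year>\d{4})\))?\s*$')


def split_title_year(heading_text):
    """Split 'Top Gun\xa0(1986) ' into ('Top Gun', 1986)."""
    text = heading_text.replace('\xa0', ' ').strip()
    match = HEADING.match(text)
    if match is None:
        # Unexpected format (e.g. embedded newlines): keep the text, no year
        return text, None
    title = match.group('title').strip()
    year_text = match.group('year')
    year = int(year_text) if year_text else None
    return title, year


print(split_title_year('Top Gun\xa0(1986) '))
print(split_title_year('Ballers\xa0'))
```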

In [3]:
# IMDB_film_data.py
# Ref. Development - IMDB Film Details Jupyter Notebook for more details
# Input: imdb_film_data(href_film)
# Output: filmdata
# Where filmdata is a list with href_film, title, year, imdb_rating, imdb_qty,
#                               budget, gross_wknd, gross_domestic, gross_ww

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
from usd_conversion import usd_conversion

def imdb_film_data(href_film):
    # Append film_href input to full IMDB URL
    url = 'https://www.imdb.com' + href_film
    
    # Parse IMDB URL with BeautifulSoup
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Retrieve film title and release year data
    # Create variables title, year
    imdbFilmdata = soup.find('div', class_ = 'title_wrapper')
    title_year = imdbFilmdata.h1.text
    if imdbFilmdata.h1.span is None:
        title_year = title_year.replace('\xa0', ' ')
        title = title_year.strip()
        year = None
    else:
        yearbrackets = imdbFilmdata.h1.span.text
        title = title_year[:-len(yearbrackets)-2]
        yearstr = yearbrackets[1:len(yearbrackets)-1]
        year = int(yearstr)
    
    # Retrieve IMDB Rating and number of ratings submitted data
    # Create variables imdb_rating, imdb_qty
    imdbRatingdata = soup.find('div', class_ = 'imdbRating')
    if imdbRatingdata is None:
        imdb_rating = None
        imdb_qty = None
    else:
        str_imdb_rating = imdbRatingdata.strong.text
        str_imdb_qty = imdbRatingdata.a.text
        imdb_rating = float(str_imdb_rating)
        str_imdb_qty = str_imdb_qty.replace(',','')
        imdb_qty = float(str_imdb_qty)

    # Retrieve budget data
    # Create variable budget
    budgetTag = soup.find('h4', text = re.compile('^Budg'))
    if budgetTag is None:
        budget = None
    else:
        str_budget = budgetTag.next_sibling
        budget = usd_conversion(str_budget)

    # Retrieve box office data including opening weekend, gross domestic, and worldwide gross values
    # Create variables gross_wknd, gross_domestic, gross_ww
    gross_wkndTag = soup.find('h4', text = re.compile('^Opening Weekend'))
    if gross_wkndTag is None:
        gross_wknd = None
    else:
        str_gross_wknd = gross_wkndTag.next_sibling
        gross_wknd = usd_conversion(str_gross_wknd)
    
    gross_domesticTag = soup.find('h4', text = re.compile('^Gross '))
    if gross_domesticTag is None:
        gross_domestic = None
    else:
        str_gross_domestic = gross_domesticTag.next_sibling
        gross_domestic = usd_conversion(str_gross_domestic)
    
    gross_wwTag = soup.find('h4', text = re.compile('^Cumulative Worldwide Gross'))
    if gross_wwTag is None:
        gross_ww = None
    else:
        str_gross_ww = gross_wwTag.next_sibling
        gross_ww = usd_conversion(str_gross_ww)

    # Return list of film data in prescribed order
    filmdata = [href_film, title, year, imdb_rating, imdb_qty, budget, gross_wknd, gross_domestic, gross_ww]
    return filmdata


### Example: Using IMDB_film_data.py to scrape data for 'Top Gun'
To scrape film data for the movie 'Top Gun', enter the following command:

In [4]:
# Example: Top Gun
imdb_film_data('/title/tt0092099/')

['/title/tt0092099/',
 'Top Gun',
 1986,
 6.9,
 285540.0,
 15000000.0,
 8193052.0,
 179800601.0,
 356830601.0]
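
The returned list is positional, so it can help to pair each value with its field name before using it. The sketch below does this for the Top Gun output shown above; `FIELDS` and `record` are hypothetical names, and the field order is taken from the comment at the top of IMDB_film_data.py.

```python
# Field names in the order the function returns them
FIELDS = ['href_film', 'title', 'year', 'imdb_rating', 'imdb_qty',
          'budget', 'gross_wknd', 'gross_domestic', 'gross_ww']

# Values copied from the Top Gun output above
top_gun = ['/title/tt0092099/', 'Top Gun', 1986, 6.9, 285540.0,
           15000000.0, 8193052.0, 179800601.0, 356830601.0]

record = dict(zip(FIELDS, top_gun))

# Worldwide gross as a multiple of the reported budget
ratio = record['gross_ww'] / record['budget']
print(record['title'], record['year'], round(ratio, 1))
```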

## Summarizing Filmography Data Scraper
The IMDB_filmo_scraper.py function processes the input actor or actress href to scrape the actor or actress's full name, date of birth, and filmography. For the filmography, only films in which the actor or actress has an acting credit are scraped. The resulting table includes each film's title and IMDB film href; the hrefs are then passed one at a time to the IMDB_film_data.py function to gather film, IMDB rating, budget, and box office data.

This function has hardcoded parameters to scrape the specific information required for this project. Code comments have been provided to permit others to modify the scraper to fit different needs.

### Actor or actress href:
The function uses an actor or actress href as an input.  This is intentional to follow IMDB convention of using href values when listing films, actors, or actresses.

For example, the following is the full IMDB URL for actress Charlize Theron:
```
https://www.imdb.com/name/nm0000234/
```
To use the IMDB_filmo_scraper.py function, the following IMDB actress href is used:
```
/name/nm0000234/
```
The full command used is:
```
imdb_filmo_scraper('/name/nm0000234/')
```

### UX Summary:
* User provides the href that completes the IMDB actor or actress URL
* Function retrieves the following data:
    * All films with Actor credits
    * Associated href for each film to be used in IMDB film data scraper
* Function returns:
    * filmspd, a DataFrame with all IMDB film data from the IMDB_film_data.py function for each film
    * actorpd, a DataFrame with the actor or actress's full name, and date of birth

### Assumptions:
* Hardcoded HTML tags assume HTML structure will not change for life of project

In [5]:
# IMDB_filmo_scraper.py
# Input: imdb_filmo_scraper(href_actor)
# Output: filmspd, actorpd
# Where filmspd is a DataFrame with data outputs from IMDB_film_data.py
#       actorpd is a DataFrame with href_actor, fullname, dob data

import datetime
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
from IMDB_film_data import imdb_film_data

def imdb_filmo_scraper(href_actor):
#     # Function timing
#     filmostart = pd.Timestamp.now()
#     print('IMDB Filmography Time Start!')
    
    # Append href input to full IMDB URL
    imdbUrl = 'https://www.imdb.com' + href_actor

    # Parse URL with BeautifulSoup
    r = requests.get(imdbUrl)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Retrieve Actor or Actress name
    imdbNameData = soup.find('td', class_ = 'name-overview-widget__section')
    imdbName = imdbNameData.find('span', class_ = 'itemprop')
    fullname = imdbName.text
    
    # Retrieve birthday
    imdbBirthdaytag = soup.find('div', id = 'name-born-info')
    imdbBirthday = imdbBirthdaytag.find('time')
    str_dob = imdbBirthday.get('datetime')
    dob = datetime.datetime.strptime(str_dob, '%Y-%m-%d')

    # Build a one-row DataFrame for the actor data
    actorpd = pd.DataFrame({"href_actor": [href_actor],
                            "fullname": [fullname],
                            "dob": [dob]})

    # Retrieve filmography data
    # Revised for Actor and Actress credits
    films = soup.find_all('div', id = re.compile('^act'))
    
#     # Time milestone - Site parse time
#     siteparsetime = pd.Timestamp.now()
#     print('Site parsed - Time elapsed:')
#     print(siteparsetime - filmostart)

#     startfilmoprocessing = pd.Timestamp.now()

    # Build a list of rows, one per film, from what the IMDB film data function returns, then convert it to a DataFrame
    filmsarray = []
    for film in films:
        href_film = film.a.get('href')
        film_row = imdb_film_data(href_film)
        film_row.insert(0, href_actor)
        key_filmact = href_film + href_actor
        film_row.insert(0, key_filmact)
        filmsarray.append(film_row)
        
#         # Time milestone - Each film iteration in array
#         print('Film processed:')
#         print(pd.Timestamp.now())
        
    filmspd = pd.DataFrame(filmsarray, columns = ['key_filmact',
                                                  'href_actor',
                                                  'href_film',
                                                  'title',
                                                  'year',
                                                  'imdb_rating',
                                                  'imdb_qty',
                                                  'budget',
                                                  'gross_wknd',
                                                  'gross_domestic',
                                                  'gross_ww'
                                                  ])
#     # Time milestone - Filmography processing done
#     print('Total time:')
#     print(pd.Timestamp.now() - filmostart)
    return filmspd, actorpd
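
The row-building loop can be exercised without network access by swapping in a stub for the film data function. In the sketch below, `fake_film_data` and `build_rows` are hypothetical names and the stub's values are placeholders; only the column order and the key construction (film href followed by actor href) mirror the function above.

```python
COLUMNS = ['key_filmact', 'href_actor', 'href_film', 'title', 'year',
           'imdb_rating', 'imdb_qty', 'budget', 'gross_wknd',
           'gross_domestic', 'gross_ww']


def fake_film_data(href_film):
    """Stand-in for imdb_film_data; returns placeholder values."""
    return [href_film, 'Example Title', 2000, 7.0, 100.0,
            None, None, None, None]


def build_rows(href_actor, film_hrefs, fetch=fake_film_data):
    """Prefix each film's data with the composite key and the actor href."""
    rows = []
    for href_film in film_hrefs:
        key_filmact = href_film + href_actor
        rows.append([key_filmact, href_actor] + fetch(href_film))
    return rows


rows = build_rows('/name/nm0000234/', ['/title/tt0092099/'])
print(dict(zip(COLUMNS, rows[0])))
```

Because the key concatenates both hrefs, the same film scraped for two different actors produces two distinct keys.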


### Example: Using IMDB_filmo_scraper.py to scrape data for Charlize Theron
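To scrape film data for actress Charlize Theron, enter the command shown commented out below. The cells that follow run the scraper for Dwayne Johnson and Dave Bautista instead.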

In [21]:
# # Example: Charlize Theron
# imdb_filmo_scraper('/name/nm0000234/')

# Example: Dwayne 'The Rock' Johnson
filmframe, actorframe = imdb_filmo_scraper('/name/nm0425005/')

In [27]:
filmframe[filmframe.columns[3:]]

|  | title | year | imdb_rating | imdb_qty | budget | gross_wknd | gross_domestic | gross_ww |
|---|---|---|---|---|---|---|---|---|
| 0 | Big Trouble in Little China |  |  |  |  |  |  |  |
| 1 | Doc Savage |  |  |  |  |  |  |  |
| 2 | San Andreas 2 |  |  |  |  |  |  |  |
| 3 | The King |  |  |  |  |  |  |  |
| 4 | Young Rock |  |  |  |  |  |  |  |
| 5 | Black Adam | 2021.0 |  |  |  |  |  |  |
| 6 | Red Notice | 2020.0 |  |  |  |  |  |  |
| 7 | Jungle Cruise | 2021.0 |  |  |  |  |  |  |
| 8 | Jumanji: The Next Level | 2019.0 | 6.7 | 162623.0 | 125000000.0 | 59251543.0 | 316831246.0 | 796576000.0 |
| 9 | Ballers |  | 7.6 | 36823.0 |  |  |  |  |


In [28]:
filmframe['imdb_rating'].mean()

6.140677966101694

In [29]:
filmframe['imdb_rating'].max()

8.1

In [36]:
filmframe['gross_ww'].max()

1515048151.0
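
The `.mean()` and `.max()` results above are computed over populated values only, since pandas skips missing values in these reductions by default. The sketch below illustrates this on three rows copied from the first table; `sample` is a hypothetical frame, and the non-mutating `sort_values` call is shown as an alternative to `inplace=True`, not as the notebook's own approach.

```python
import pandas as pd

# Three rows copied from the filmography table above
sample = pd.DataFrame({
    'title': ['Black Adam', 'Jumanji: The Next Level', 'Ballers'],
    'imdb_rating': [None, 6.7, 7.6],
    'gross_ww': [None, 796576000.0, None],
})

# Missing ratings are skipped, so the mean covers two titles, not three
mean_rating = sample['imdb_rating'].mean()
rated = sample['imdb_rating'].count()

# sort_values returns a new frame and places missing values last by default
top = sample.sort_values(by='gross_ww', ascending=False).head(1)
top_title = top['title'].iloc[0]

print(rated, round(mean_rating, 2), top_title)
```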

In [38]:
filmframe.sort_values(by=['imdb_rating'], inplace=True, ascending=False)
filmframe[filmframe.columns[3:]]

|  | title | year | imdb_rating | imdb_qty | budget | gross_wknd | gross_domestic | gross_ww |
|---|---|---|---|---|---|---|---|---|
| 41 | Family Guy |  | 8.1 | 297834.0 |  |  |  |  |
| 62 | That '70s Show |  | 8.1 | 150397.0 |  |  |  |  |
| 43 | Saturday Night Live |  | 8.1 | 41632.0 |  |  |  |  |
| 35 | Transformers Prime |  | 7.9 | 5176.0 |  |  |  |  |
| 59 | Star Trek: Voyager |  | 7.8 | 56720.0 |  |  |  |  |
| 21 | WWE Monday Night RAW |  | 7.8 | 7924.0 |  |  |  |  |
| 19 | Moana | 2016.0 | 7.6 | 259229.0 | 150000000.0 | 56631401.0 | 248757044.0 | 690860500.0 |
| 9 | Ballers |  | 7.6 | 36823.0 |  |  |  |  |
| 34 | Fast Five | 2011.0 | 7.3 | 346454.0 | 125000000.0 | 86198765.0 | 209837675.0 | 626137700.0 |
| 10 | WWF SmackDown! |  | 7.3 | 5368.0 |  |  |  |  |


In [39]:
filmframe.sort_values(by=['gross_ww'], inplace=True, ascending=False)
filmframe[filmframe.columns[3:]]

|  | title | year | imdb_rating | imdb_qty | budget | gross_wknd | gross_domestic | gross_ww |
|---|---|---|---|---|---|---|---|---|
| 25 | Fast & Furious 7 | 2015.0 | 7.1 | 352670.0 | 190000000.0 | 147187040.0 | 353007020.0 | 1515048000.0 |
| 18 | The Fate of the Furious | 2017.0 | 6.7 | 198569.0 | 250000000.0 | 98786705.0 | 226008385.0 | 1236005000.0 |
| 15 | Jumanji: Welcome to the Jungle | 2017.0 | 6.9 | 306570.0 | 90000000.0 | 36169328.0 | 404540171.0 | 962102200.0 |
| 8 | Jumanji: The Next Level | 2019.0 | 6.7 | 162623.0 | 125000000.0 | 59251543.0 | 316831246.0 | 796576000.0 |
| 27 | Furious 6 | 2013.0 | 7.1 | 362468.0 | 160000000.0 | 97375245.0 | 238679850.0 | 788679800.0 |
| 11 | Fast & Furious Presents: Hobbs & Shaw | 2019.0 | 6.4 | 156992.0 | 200000000.0 | 60038950.0 | 173956935.0 | 759056900.0 |
| 19 | Moana | 2016.0 | 7.6 | 259229.0 | 150000000.0 | 56631401.0 | 248757044.0 | 690860500.0 |
| 34 | Fast Five | 2011.0 | 7.3 | 346454.0 | 125000000.0 | 86198765.0 | 209837675.0 | 626137700.0 |
| 24 | San Andreas | 2015.0 | 6.0 | 206784.0 | 110000000.0 | 54588173.0 | 155190832.0 | 473990800.0 |
| 57 | The Mummy Returns | 2001.0 | 6.3 | 296005.0 | 98000000.0 | 68139035.0 | 202019785.0 | 443280900.0 |


In [40]:
# Example: Dave Bautista
filmframe, actorframe = imdb_filmo_scraper('/name/nm1176985/')

In [41]:
filmframe[filmframe.columns[3:]]

|  | title | year | imdb_rating | imdb_qty | budget | gross_wknd | gross_domestic | gross_ww |
|---|---|---|---|---|---|---|---|---|
| 0 | Groove Tails |  |  |  |  |  |  |  |
| 1 | The Killer's Game |  |  |  |  |  |  |  |
| 2 | Guardians of the Galaxy Vol. 3 | 2021.0 |  |  |  |  |  |  |
| 3 | Army of the Dead |  |  |  | 70000000.0 |  |  |  |
| 4 | Dune | 2020.0 |  |  |  |  |  |  |
| 5 | Room 104 |  | 6.1 | 4671.0 |  |  |  |  |
| 6 | Princess Bride |  | 8.1 | 155.0 |  |  |  |  |
| 7 | My Spy | 2020.0 | 6.3 | 15126.0 |  |  |  | 5804624.0 |
| 8 | Escape Plan: The Extractors | 2019.0 | 4.4 | 10349.0 |  |  |  | 1766092.0 |
| 9 | What We Do in the Shadows |  | 8.5 | 28768.0 |  |  |  |  |


In [42]:
filmframe.sort_values(by=['imdb_rating'], inplace=True, ascending=False)
filmframe[filmframe.columns[3:]]

|  | title | year | imdb_rating | imdb_qty | budget | gross_wknd | gross_domestic | gross_ww |
|---|---|---|---|---|---|---|---|---|
| 9 | What We Do in the Shadows |  | 8.5 | 28768.0 |  |  |  |  |
| 68 | WWE SmackDown! Here Comes the Pain | 2003.0 | 8.5 | 472.0 |  |  |  |  |
| 10 | Avengers: Endgame | 2019.0 | 8.4 | 746253.0 | 356000000.0 | 357115007.0 | 858373000.0 | 2797801000.0 |
| 23 | Disneyland Resort: Guardians of the Galaxy - M... | 2017.0 | 8.4 | 276.0 |  |  |  |  |
| 19 | Avengers: Infinity War | 2018.0 | 8.4 | 791266.0 | 321000000.0 | 257698183.0 | 678815482.0 | 2048360000.0 |
| 71 | OVW: Wrestling's Future Stars | 2002.0 | 8.2 | 17.0 |  |  |  |  |
| 50 | Chuck |  | 8.2 | 125250.0 |  |  |  |  |
| 41 | WWE 2K14 | 2013.0 | 8.1 | 756.0 |  |  |  |  |
| 6 | Princess Bride |  | 8.1 | 155.0 |  |  |  |  |
| 21 | Blade Runner 2049 | 2017.0 | 8.0 | 438078.0 | 150000000.0 | 32753122.0 | 92054159.0 | 259239700.0 |


In [43]:
filmframe.sort_values(by=['imdb_rating'], inplace=True, ascending=False)
filmframe[filmframe.columns[3:]]

|  | title | year | imdb_rating | imdb_qty | budget | gross_wknd | gross_domestic | gross_ww |
|---|---|---|---|---|---|---|---|---|
| 9 | What We Do in the Shadows |  | 8.5 | 28768.0 |  |  |  |  |
| 68 | WWE SmackDown! Here Comes the Pain | 2003.0 | 8.5 | 472.0 |  |  |  |  |
| 10 | Avengers: Endgame | 2019.0 | 8.4 | 746253.0 | 356000000.0 | 357115007.0 | 858373000.0 | 2797801000.0 |
| 23 | Disneyland Resort: Guardians of the Galaxy - M... | 2017.0 | 8.4 | 276.0 |  |  |  |  |
| 19 | Avengers: Infinity War | 2018.0 | 8.4 | 791266.0 | 321000000.0 | 257698183.0 | 678815482.0 | 2048360000.0 |
| 71 | OVW: Wrestling's Future Stars | 2002.0 | 8.2 | 17.0 |  |  |  |  |
| 50 | Chuck |  | 8.2 | 125250.0 |  |  |  |  |
| 6 | Princess Bride |  | 8.1 | 155.0 |  |  |  |  |
| 41 | WWE 2K14 | 2013.0 | 8.1 | 756.0 |  |  |  |  |
| 21 | Blade Runner 2049 | 2017.0 | 8.0 | 438078.0 | 150000000.0 | 32753122.0 | 92054159.0 | 259239700.0 |


In [44]:
filmframe['imdb_rating'].mean()

6.568656716417909