# IMDB Film Data Scraper Function
## Web Scraping Development

## Objectives
* To consolidate the web scraping development work into a film data scraper function

In [1]:
# Install packages, if necessary:
# pip install requests
# pip install beautifulsoup4

# Load libraries and URL:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
import psycopg2
import getpass
import sqlalchemy as sa
# import numpy as np
# import seaborn as sns

## Summarizing Currency Conversion Function
The usd_conversion.py function is a utility to automatically convert reported financial values to USD for comparison.  This function is intended to support the IMDB_film_data.py function.

### UX Summary:
* When budget and box office data are scraped, the currency conversion function detects the reported currency and converts the value to USD
* Function returns the submitted financial value converted to USD, or None if the value cannot be parsed as a number

### Assumptions:
* Financial value submitted has a three-character currency prefix, e.g. EUR, which determines the exchange rate used; values prefixed with $ are treated as USD
* Exchange rate used is the latest rate available; historical rates and inflation are not taken into account
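
The prefix assumption can be illustrated with a small, network-free sketch. The helper name `split_currency` is hypothetical and not part of usd_conversion.py; it only shows how a scraped string might separate into a currency code and a numeric amount.

```python
def split_currency(native_value):
    """Split a scraped value such as 'EUR1,000,000' into (code, amount).

    Hypothetical helper for illustration; a leading '$' is treated as USD.
    Returns (code, None) when the remainder is not a whole number.
    """
    value = native_value.strip()
    if value.startswith('$'):
        code, digits = 'USD', value[1:]
    else:
        code, digits = value[:3], value[3:]
    # Drop thousands separators before checking for digits
    digits = digits.replace(',', '').strip()
    amount = float(digits) if digits.isdigit() else None
    return code, amount


print(split_currency('$15,000,000'))
print(split_currency('EUR1,000,000'))
```

Keeping the parsing separate from the rate lookup makes it possible to check the prefix logic without calling the exchange rate API.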

### Acknowledgment:
This function relies on the foreign exchange rates API created by Madis Väin (Ref. https://exchangeratesapi.io).

In [2]:
# Exchange rate function to convert budget and box office values to USD
# Function can be modified to convert to other available currencies

def usd_conversion(native_value):
    # Call Exchange Rates API to look up latest USD exchange rates
    r_usd = 'https://api.exchangeratesapi.io/latest?base=USD'
    usd_response = requests.get(r_usd)
    rates = usd_response.json()
    
    # Parse reported value to determine currency used and remove currency code from string
    native_value = native_value.strip()
    if native_value[0] == '$':
        num_value = native_value.replace('$','')
        exchange_rate = 1
    else:
        currency = native_value[:3]
        exchange_rate = rates['rates'][currency]
        num_value = native_value[3:]
    num_value = num_value.replace(',','')
    # Rates are quoted per 1 USD, so dividing by the rate converts to USD
    if num_value.isnumeric():
        usd_value = float(num_value) / exchange_rate
    else:
        usd_value = None
    return usd_value
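
Because the rates are requested with a USD base, each rate is the amount of foreign currency per 1 USD, so the conversion divides by the rate. The sketch below shows that arithmetic with a made-up rate table (the figures are illustrative, not real quotes) and a hypothetical `to_usd` helper, so it runs without calling the API.

```python
# Illustrative rate table in the shape of the API's 'rates' field.
# The numbers are made up for the example, not real exchange rates.
SAMPLE_RATES = {'EUR': 0.8, 'GBP': 0.5}


def to_usd(amount, currency, rates):
    """Convert an amount in `currency` to USD using USD-based rates."""
    if currency == 'USD':
        return amount
    # rates[currency] is units of `currency` per 1 USD
    return amount / rates[currency]


print(to_usd(1000.0, 'GBP', SAMPLE_RATES))     # 1,000 GBP at 0.5 GBP per USD
print(to_usd(1000000.0, 'EUR', SAMPLE_RATES))  # 1,000,000 EUR at 0.8 EUR per USD
```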

## Summarizing Film Data Scraper
The IMDB_film_data.py function processes the input film href to scrape film, IMDB rating, budget, and box office data from its respective IMDB film page.

This function has hardcoded parameters to scrape the specific information required for this project.  Code comments have been provided to permit others to modify the scraper to fit different needs.

### Film href:
The function uses a film href as its input.  This is intentional: the IMDB_filmo_scraper.py function scrapes filmography titles together with each film's IMDB href, so those hrefs can be passed directly to this function.

For example, the following is the full IMDB URL for the film 'Top Gun':
```
https://www.imdb.com/title/tt0092099/
```
To use the IMDB_film_data.py function, the following IMDB film href is used:
```
/title/tt0092099/
```
The full command used is:
```
imdb_film_data('/title/tt0092099/')
```
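
When only the full IMDB URL is at hand, the href can be recovered from its path component. A minimal sketch using the standard library; `href_from_url` and `IMDB_BASE` are hypothetical names, not part of the project code.

```python
from urllib.parse import urlparse

IMDB_BASE = 'https://www.imdb.com'


def href_from_url(url):
    """Return the path portion of a URL, e.g. '/title/tt0092099/'."""
    return urlparse(url).path


href = href_from_url('https://www.imdb.com/title/tt0092099/')
print(href)
# Joining the base and the href rebuilds the full URL, as the scraper does
print(IMDB_BASE + href)
```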

### UX Summary:
* User provides the film href, which the function appends to the IMDB base URL to form the full title URL
* Function retrieves the following data:
    * Title as 'title'
    * Year as 'year'
    * IMDB rating as 'imdb_rating'
    * Number of IMDB ratings submitted as 'imdb_qty'
    * Budget as 'budget'
    * Opening Weekend as 'gross_wknd'
    * Domestic Gross as 'gross_domestic'
    * Worldwide Gross as 'gross_ww'
* Function returns a list of IMDB film data values, with the input href as the first element

### Assumptions:
* Hardcoded HTML tags assume the IMDB page structure will not change for the life of the project
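
The title and year are scraped from a single heading string. As a sketch of the split, assuming a heading of the form 'Top Gun (1986)' with a non-breaking space before the bracketed year, the hypothetical `split_title_year` helper below uses a regular expression; the scraper itself slices the string by the length of the year span instead.

```python
import re


def split_title_year(heading):
    """Split 'Title (YYYY)' into (title, year); year is None if absent."""
    # Normalize non-breaking spaces and surrounding whitespace first
    text = heading.replace('\xa0', ' ').strip()
    match = re.match(r'^(.*?)\s*\((\d{4})\)$', text)
    if match is None:
        return text, None
    return match.group(1), int(match.group(2))


print(split_title_year('Top Gun\xa0(1986) '))
print(split_title_year('Untitled Project'))
```

Returning None for a missing year mirrors how the scraper handles headings without a year span.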

In [3]:
# IMDB_film_data.py
# Ref. Development - IMDB Film Details Jupyter Notebook for more details
# Input: imdb_film_data(href_film)
# Output: filmdata
# Where filmdata is a list with href_film, title, year, imdb_rating, imdb_qty,
#                               budget, gross_wknd, gross_domestic, gross_ww

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
from usd_conversion import usd_conversion

def imdb_film_data(href_film):
    # Append film_href input to full IMDB URL
    url = 'https://www.imdb.com' + href_film
    
    # Parse IMDB URL with BeautifulSoup
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Retrieve film title and release year data
    # Create variables title, year
    imdbFilmdata = soup.find('div', class_ = 'title_wrapper')
    title_year = imdbFilmdata.h1.text
    if imdbFilmdata.h1.span is None:
        title_year = title_year.replace(u'\xa0',u' ')
        title = title_year.strip()
        year = None
    else:
        yearbrackets = imdbFilmdata.h1.span.text
        title = title_year[:-len(yearbrackets)-2]
        yearstr = yearbrackets[1:len(yearbrackets)-1]
        year = int(yearstr)
    
    # Retrieve IMDB Rating and number of ratings submitted data
    # Create variables imdb_rating, imdb_qty
    imdbRatingdata = soup.find('div', class_ = 'imdbRating')
    if imdbRatingdata is None:
        imdb_rating = None
        imdb_qty = None
    else:
        str_imdb_rating = imdbRatingdata.strong.text
        str_imdb_qty = imdbRatingdata.a.text
        imdb_rating = float(str_imdb_rating)
        str_imdb_qty = str_imdb_qty.replace(',','')
        imdb_qty = float(str_imdb_qty)

    # Retrieve budget data
    # Create variable budget
    budgetTag = soup.find('h4', text = re.compile('^Budg'))
    if budgetTag is None:
        budget = None
    else:
        str_budget = budgetTag.next_sibling
        budget = usd_conversion(str_budget)

    # Retrieve box office data including opening weekend, gross domestic, and worldwide gross values
    # Create variables gross_wknd, gross_domestic, gross_ww
    gross_wkndTag = soup.find('h4', text = re.compile('^Opening Weekend'))
    if gross_wkndTag is None:
        gross_wknd = None
    else:
        str_gross_wknd = gross_wkndTag.next_sibling
        gross_wknd = usd_conversion(str_gross_wknd)
    
    gross_domesticTag = soup.find('h4', text = re.compile('^Gross '))
    if gross_domesticTag is None:
        gross_domestic = None
    else:
        str_gross_domestic = gross_domesticTag.next_sibling
        gross_domestic = usd_conversion(str_gross_domestic)
    
    gross_wwTag = soup.find('h4', text = re.compile('^Cumulative Worldwide Gross'))
    if gross_wwTag is None:
        gross_ww = None
    else:
        str_gross_ww = gross_wwTag.next_sibling
        gross_ww = usd_conversion(str_gross_ww)

    # Return list of film data in prescribed order
    filmdata = [href_film, title, year, imdb_rating, imdb_qty, budget, gross_wknd, gross_domestic, gross_ww]
    return filmdata


### Example: Using IMDB_film_data.py to scrape data for 'Top Gun'
To scrape film data for the movie 'Top Gun', enter the following command:

In [4]:
# Example: Top Gun
imdb_film_data('/title/tt0092099/')

['/title/tt0092099/',
 'Top Gun',
 1986,
 6.9,
 285594.0,
 15000000.0,
 8193052.0,
 179800601.0,
 356830601.0]
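
The returned list is positional, so pairing it with field names can make downstream use less error-prone. A small sketch using the values from the 'Top Gun' output; the `FILM_FIELDS` tuple is an assumption that mirrors the order documented in the function header.

```python
# Field names in the order documented for the filmdata list (assumed constant)
FILM_FIELDS = ('href_film', 'title', 'year', 'imdb_rating', 'imdb_qty',
               'budget', 'gross_wknd', 'gross_domestic', 'gross_ww')

# Values copied from the 'Top Gun' example output
top_gun = ['/title/tt0092099/', 'Top Gun', 1986, 6.9, 285594.0,
           15000000.0, 8193052.0, 179800601.0, 356830601.0]

# Pair each value with its field name
film = dict(zip(FILM_FIELDS, top_gun))
print(film['title'], film['year'])
print(film['gross_ww'])
```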

## Summarizing Filmography Data Scraper
The IMDB_filmo_scraper.py function processes the input actor or actress href to scrape the actor or actress's full name, date of birth, and filmography.  For the filmography, only films in which the actor or actress has an acting credit are scraped.  The resulting table includes each film's title and IMDB film href; the function then iterates over the hrefs with the IMDB_film_data.py function to gather film, IMDB rating, budget, and box office data.

This function has hardcoded parameters to scrape the specific information required for this project.  Code comments have been provided to permit others to modify the scraper to fit different needs.

### Actor or actress href:
The function uses an actor or actress href as an input.  This is intentional to follow IMDB convention of using href values when listing films, actors, or actresses.

For example, the following is the full IMDB URL for actress Charlize Theron:
```
https://www.imdb.com/name/nm0000234/
```
To use the IMDB_filmo_scraper.py function, the following IMDB actress href is used:
```
/name/nm0000234/
```
The full command used is:
```
imdb_filmo_scraper('/name/nm0000234/')
```

### UX Summary:
* User provides the actor or actress href, which the function appends to the IMDB base URL to form the full name URL
* Function retrieves the following data:
    * All films with Actor credits
    * Associated href for each film to be used in IMDB film data scraper
* Function returns:
    * filmspd, a DataFrame with the IMDB film data returned by the IMDB_film_data.py function for each film
    * actorpd, a DataFrame with the actor or actress's href, full name, and date of birth

### Assumptions:
* Hardcoded HTML tags assume the IMDB page structure will not change for the life of the project
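
Each filmography row is keyed by concatenating the film href and the actor href, which produces values such as '/title/tt8390502//name/nm0000234/'. A sketch of that row assembly with a stubbed film lookup; `fake_film_data` is a stand-in for imdb_film_data so the example runs offline, and its return values are placeholders.

```python
def fake_film_data(href_film):
    # Stand-in for imdb_film_data: returns a list that starts with the href.
    # The title and year here are placeholders, not scraped values.
    return [href_film, 'Placeholder Title', None]


def build_rows(href_actor, film_hrefs, film_lookup):
    """Prefix each film row with the composite key and the actor href."""
    rows = []
    for href_film in film_hrefs:
        row = film_lookup(href_film)
        # Composite key: film href followed by actor href
        key_filmact = href_film + href_actor
        rows.append([key_filmact, href_actor] + row)
    return rows


rows = build_rows('/name/nm0000234/', ['/title/tt8390502/'], fake_film_data)
print(rows[0][:3])
```

Passing the lookup function as an argument is a design choice for this sketch only; it lets the row assembly be checked without any network requests.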

In [5]:
# IMDB_filmo_scraper.py
# Input: imdb_filmo_scraper(href_actor)
# Output: filmspd, actorpd
# Where filmspd is a DataFrame with data outputs from IMDB_film_data.py
#       actorpd is a DataFrame with href_actor, fullname, dob data

import datetime
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
from IMDB_film_data import imdb_film_data

def imdb_filmo_scraper(href_actor):
#     # Function timing
#     filmostart = pd.Timestamp.now()
#     print('IMDB Filmography Time Start!')
    
    # Append href input to full IMDB URL
    imdbUrl = 'https://www.imdb.com' + href_actor

    # Parse URL with BeautifulSoup
    r = requests.get(imdbUrl)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Retrieve Actor or Actress name
    imdbNameData = soup.find('td', class_ = 'name-overview-widget__section')
    imdbName = imdbNameData.find('span', class_ = 'itemprop')
    fullname = imdbName.text
    
    # Retrieve birthday
    imdbBirthdaytag = soup.find('div', id = 'name-born-info')
    imdbBirthday = imdbBirthdaytag.find('time')
    str_dob = imdbBirthday.get('datetime')
    dob = datetime.datetime.strptime(str_dob, '%Y-%m-%d')

    # Create single-row DataFrame of actor data
    actorpd = pd.DataFrame({"href_actor" : [href_actor],
                            "fullname" : [fullname],
                            "dob" : [dob]})

    # Retrieve filmography data
    # Revised for Actor and Actress credits
    films = soup.find_all('div', id = re.compile('^act'))
    
#     # Time milestone - Site parse time
#     siteparsetime = pd.Timestamp.now()
#     print('Site parsed - Time elapsed:')
#     print(siteparsetime - filmostart)

#     startfilmoprocessing = pd.Timestamp.now()

    # Create array, append what each href returns from IMDB Film Data function, then convert array to DataFrame 
    filmsarray = []
    for film in films:
        href_film = film.a.get('href')
        film_row = imdb_film_data(href_film)
        film_row.insert(0, href_actor)
        key_filmact = href_film + href_actor
        film_row.insert(0, key_filmact)
        filmsarray.append(film_row)
        
#         # Time milestone - Each film iteration in array
#         print('Film processed:')
#         print(pd.Timestamp.now())
        
    filmspd = pd.DataFrame(filmsarray, columns = ['key_filmact',
                                                  'href_actor',
                                                  'href_film',
                                                  'title',
                                                  'year',
                                                  'imdb_rating',
                                                  'imdb_qty',
                                                  'budget',
                                                  'gross_wknd',
                                                  'gross_domestic',
                                                  'gross_ww'
                                                  ])
#     # Time milestone - Filmography processing done
#     print('Total time:')
#     print(pd.Timestamp.now() - filmostart)
    return filmspd, actorpd


### Example: Using IMDB_filmo_scraper.py to scrape data for Charlize Theron
To scrape film data for actress Charlize Theron, enter the following command:

In [6]:
# Example: Charlize Theron
imdb_filmo_scraper('/name/nm0000234/')

(                           key_filmact        href_actor           href_film  \
 0    /title/tt8390502//name/nm0000234/  /name/nm0000234/   /title/tt8390502/   
 1    /title/tt5433138//name/nm0000234/  /name/nm0000234/   /title/tt5433138/   
 2    /title/tt7556122//name/nm0000234/  /name/nm0000234/   /title/tt7556122/   
 3   /title/tt12607768//name/nm0000234/  /name/nm0000234/  /title/tt12607768/   
 4    /title/tt6394270//name/nm0000234/  /name/nm0000234/   /title/tt6394270/   
 ..                                 ...               ...                 ...   
 58   /title/tt0120373//name/nm0000234/  /name/nm0000234/   /title/tt0120373/   
 59   /title/tt0119302//name/nm0000234/  /name/nm0000234/   /title/tt0119302/   
 60   /title/tt0117887//name/nm0000234/  /name/nm0000234/   /title/tt0117887/   
 61   /title/tt0115438//name/nm0000234/  /name/nm0000234/   /title/tt0115438/   
 62   /title/tt0109415//name/nm0000234/  /name/nm0000234/   /title/tt0109415/   
 
                          

## Summarizing Cloud Uploader
The IMDB_cloud_upload.py function upserts the input DataFrames into a Postgres database.

This function has hardcoded database parameters specific to this project.  Code comments have been provided to permit others to modify the uploader to fit different needs.

### UX Summary:
* Inputs are two DataFrames, filmpd and actorpd, in the format of the filmspd and actorpd outputs (Ref. IMDB_filmo_scraper.py)
* User provides the Postgres database password when prompted
* Function upserts to the following tables:
    * db_actor
    * db_film
    * db_filmcredits
* Function prints a progress message after each step and returns no value

### Assumptions:
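* Target tables db_actor, db_film, and db_filmcredits already exist, with unique constraints on href_actor, href_film, and key_filmact respectively, as required by the ON CONFLICT clauses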
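
The upload follows a stage-then-merge pattern: rows are loaded into a temporary table and then merged into the permanent table with `INSERT ... ON CONFLICT`. The sketch below reproduces that pattern with Python's built-in sqlite3 module instead of Postgres so it can run without a server. The table contents are invented, the column set is reduced for brevity, and SQLite needs the `WHERE true` clause to parse an upsert that selects from another table (this assumes SQLite 3.24 or later).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE db_film (href_film TEXT PRIMARY KEY, title TEXT, imdb_rating REAL)")
conn.execute("CREATE TABLE temp_table_film (href_film TEXT, title TEXT, imdb_rating REAL)")

# Existing row in the permanent table (invented values)
conn.execute("INSERT INTO db_film VALUES ('/title/tt0000001/', 'Old Title', 5.0)")

# Staged rows: one updates the existing film, one is new
conn.executemany(
    "INSERT INTO temp_table_film VALUES (?, ?, ?)",
    [('/title/tt0000001/', 'New Title', 6.5),
     ('/title/tt0000002/', 'Second Film', 7.0)],
)

# Merge staged rows: insert new films, overwrite fields of existing ones
conn.execute("""
    INSERT INTO db_film (href_film, title, imdb_rating)
    SELECT href_film, title, imdb_rating FROM temp_table_film WHERE true
    ON CONFLICT (href_film) DO UPDATE SET
        title = excluded.title,
        imdb_rating = excluded.imdb_rating
""")

merged = conn.execute("SELECT * FROM db_film ORDER BY href_film").fetchall()
print(merged)
```

Replacing `DO UPDATE SET ...` with `DO NOTHING` gives the insert-only behavior used for the actor and film credit tables.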

In [None]:
# IMDB_cloud_upload.py
# Hard-coded example to upsert provided DataFrames in format of filmpd and actorpd, (Ref. IMDB_filmo_scraper.py)

# Declare libraries
import pandas as pd
import psycopg2
import getpass
import sqlalchemy as sa
from IMDB_filmo_scraper import imdb_filmo_scraper
from IMDB_film_data import imdb_film_data
from usd_conversion import usd_conversion

def imdb_cloud_upload(filmpd, actorpd):
    # Postgres database credentials - Hardcoded
    # Upload DataFrame to Google Cloud Platform
    # Google Cloud Platform Parameters
    db_user = "postgres"
    db_host = "34.95.29.176"
    db_name = "postgres"

    # User inputs password and creates SQLAlchemy engine
    db_pass = getpass.getpass("Password:")
    conn = sa.create_engine("postgresql://" + db_user + ":" + db_pass + "@" + db_host + "/" + db_name)
    print("Connected to Google Cloud Platform.")
    
    # Clear and create temp_table_film
    print("Creating temp_table_film.")
    conn.execute(
        "DROP TABLE IF EXISTS temp_table_film"
    )
    conn.execute(
        "CREATE TABLE temp_table_film (key_filmact varchar PRIMARY KEY, href_actor varchar, href_film varchar, title varchar, year integer, imdb_rating numeric, imdb_qty numeric, budget numeric, gross_wknd numeric, gross_domestic numeric, gross_ww numeric)"
    )
    print("Created temp_table_film.")
    
    # Clear and create temp_table_actor
    print("Creating temp_table_actor.")
    conn.execute(
        "DROP TABLE IF EXISTS temp_table_actor"
    )
    conn.execute(
        "CREATE TABLE temp_table_actor (href_actor varchar PRIMARY KEY, fullname varchar NOT NULL, dob DATE)"
    )
    
    # Populate temp_table_film with filmpd
    filmpd.to_sql("temp_table_film", conn, index = False, if_exists = 'append')
    print("Populated temp_table_film.")
    
    # Populate temp_table_actor with actorpd
    actorpd.to_sql("temp_table_actor", conn, index = False, if_exists = 'append')
    print("Populated temp_table_actor.")

    # Merge temp_table_film into db_film
    conn.execute(
        sa.text("""\
            INSERT INTO db_film (href_film, title, year, imdb_rating, imdb_qty, budget, gross_wknd, gross_domestic, gross_ww)
            SELECT href_film, title, year, imdb_rating, imdb_qty, budget, gross_wknd, gross_domestic, gross_ww FROM temp_table_film
            ON CONFLICT (href_film) DO UPDATE SET (title, year, imdb_rating, imdb_qty, budget, gross_wknd, gross_domestic, gross_ww) = (EXCLUDED.title, EXCLUDED.year, EXCLUDED.imdb_rating, EXCLUDED.imdb_qty, EXCLUDED.budget, EXCLUDED.gross_wknd, EXCLUDED.gross_domestic, EXCLUDED.gross_ww)
            """
        )
    )
    print("Upsert to db_film complete.")

    # Merge temp_table_actor into db_actor
    conn.execute(
        sa.text("""\
            INSERT INTO db_actor (href_actor, fullname, dob)
            SELECT href_actor, fullname, dob FROM temp_table_actor
            ON CONFLICT (href_actor) DO NOTHING
            """
        )
    )
    print("Upsert to db_actor complete.")

    # Merge temp_table_film into db_filmcredits
    conn.execute(
        sa.text("""\
            INSERT INTO db_filmcredits (key_filmact, href_film, href_actor)
            SELECT key_filmact, href_film, href_actor FROM temp_table_film
            ON CONFLICT (key_filmact) DO NOTHING
            """
        )
    )
    print("Upsert to db_filmcredits complete.")
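
One caveat with building the connection string by concatenation is that a password containing characters such as '@' or '/' would break the URL. A hedged sketch of escaping the password with the standard library's `quote_plus` before assembling the URL; `build_pg_url` and the sample credentials are hypothetical and not part of IMDB_cloud_upload.py.

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, database):
    """Assemble a postgresql:// URL with the password URL-escaped."""
    return 'postgresql://{}:{}@{}/{}'.format(
        user, quote_plus(password), host, database)


# Sample values only; the host is a documentation placeholder
url = build_pg_url('postgres', 'p@ss/word', 'db.example.com', 'postgres')
print(url)
```

The resulting string could then be passed to `sa.create_engine` in place of the concatenated URL.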
