# Web Scraping Development
## IMDB Film Details

## Objectives
* Explore how to scrape general film information, as well as budget and box office data from IMDB pages
* Create a currency conversion function to parse and process scraped budget and box office data
* Create a function to summarize IMDB film detail scraping


In [99]:
# Install packages, if necessary:
# pip install requests
# pip install beautifulsoup4

## Exploring IMDB movie pages
Before functions can be defined, we need to understand how IMDB web pages are organized.  HTML code will be parsed using BeautifulSoup to extract the information we are looking for as strings.  Strings will then be processed into floats or integers as necessary, and normalized to USD for comparison.

In [11]:
# Load libraries and URL:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
# import numpy as np
# import seaborn as sns

# Example URL is for Top Gun:
# Separate href from full URL for integration with filmography scraper:
href = '/title/tt0092099/'
url = 'https://www.imdb.com' + href

# Load URL and confirm success by printing first 100 characters:
r = requests.get(url)
print(r.content[:100])

# Parse HTML with BeautifulSoup:
soup = BeautifulSoup(r.content, 'html.parser')

b'\n\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/'


In [63]:
# Prettify HTML for manual review, if necessary
# print(soup.prettify())

In [70]:
# Within 'soup', find movie title data including title and year:
imdbFilmdata = soup.find('div', class_ = 'title_wrapper')
title_year = imdbFilmdata.h1.text
yearbrackets = imdbFilmdata.h1.span.text

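# Slice the title out of the heading text by dropping the year span
# plus the two extra characters around it (one before, one after):
title = title_year[:-len(yearbrackets)-2]

# Strip the surrounding brackets from the year and convert to integer:
yearstr = yearbrackets[1:len(yearbrackets)-1]
year = int(yearstr)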

# Verify parsing:
print("Title (Year):", title_year)
print("Title:", title)
print("Year:", year)

# Within 'soup', find the first <div> tag with 'class' attribute "imdbRating":
imdbRatingdata = soup.find('div', class_ = 'imdbRating')
str_imdbRating = imdbRatingdata.strong.text
str_imdbRatingQty = imdbRatingdata.a.text

# Convert strings to float and int:
imdbRating = float(str_imdbRating)
str_imdbRatingQty = str_imdbRatingQty.replace(',','')
imdbRatingQty = int(str_imdbRatingQty)

# Verify variable types:
# print("Year:", type(year))
# print("IMDB Rating:", type(imdbRating))
# print("IMDB Rating Quantity:", type(imdbRatingQty))

# Verify values:
print("IMDB Rating:", imdbRating)
print("From", imdbRatingQty, "ratings")

Title (Year): Top Gun (1986) 
Title: Top Gun
Year: 1986
IMDB Rating: 6.9
From 284517 ratings
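
The slice offsets used above depend on the exact characters surrounding the year in the heading text. As an alternative, a regular expression can split the heading. The following is a minimal sketch that works on a plain string rather than a live page; `split_title_year` is a hypothetical helper name and is not part of the original notebook.

```python
import re

# Hypothetical helper: split heading text such as "Top Gun\xa0(1986) "
# into a title string and an integer year (None when no year is present).
def split_title_year(heading_text):
    # Normalize non-breaking spaces and trim surrounding whitespace
    text = heading_text.replace('\xa0', ' ').strip()
    # Look for a trailing four-digit year in parentheses
    match = re.search(r'\s*\((\d{4})\)$', text)
    if match is None:
        return text, None
    return text[:match.start()], int(match.group(1))

print(split_title_year('Top Gun\xa0(1986) '))
print(split_title_year('Luna Park'))
```

Returning `None` for a missing year mirrors how missing fields are handled later in this notebook.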


## Parsing titleDetails
Only take the necessary data.  In this case, focus is on:
- Budget
- Opening weekend
- Gross USA
- Cumulative worldwide gross

Note: Returned values are strings and need to be formatted before calculations.  This example also assumes USD values.

In [90]:
# budgetTag = soup.find('h4', text = 'Budget:')
budgetTag = soup.find('h4', text = re.compile('^Budg'))
str_budgetVal = budgetTag.next_sibling
budgetVal = str_budgetVal.replace('$','')
budgetVal = budgetVal.replace(',','')
budgetVal = int(budgetVal)
print(budgetTag.text, budgetVal)

openingTag = soup.find('h4', text = re.compile('^Opening Weekend'))
str_openingVal = openingTag.next_sibling
openingVal = str_openingVal.replace('$','')
openingVal = openingVal.replace(',','')
openingVal = int(openingVal)
print(openingTag.text, openingVal)

domesticTag = soup.find('h4', text = re.compile('^Gross '))
str_domesticVal = domesticTag.next_sibling
domesticVal = str_domesticVal.replace('$','')
domesticVal = domesticVal.replace(',','')
domesticVal = int(domesticVal)
print(domesticTag.text, domesticVal)

worldwideTag = soup.find('h4', text = re.compile('^Cumulative Worldwide Gross'))
str_worldwideVal = worldwideTag.next_sibling
worldwideVal = str_worldwideVal.replace('$','')
worldwideVal = worldwideVal.replace(',','')
worldwideVal = int(worldwideVal)
print(worldwideTag.text, worldwideVal)

Budget: 15000000
Opening Weekend USA: 8193052
Gross USA: 179800601
Cumulative Worldwide Gross: 356830601
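
The four blocks above repeat the same clean-up steps, so they could be folded into one helper. This is a minimal sketch, assuming USD-only values as in this example; `parse_usd_amount` is a hypothetical name, and the sample strings are stand-ins for the scraped `next_sibling` text.

```python
# Hypothetical helper: turn a scraped string such as " $15,000,000\n" into an int.
def parse_usd_amount(raw):
    # Drop surrounding whitespace, the dollar sign, and thousands separators
    cleaned = raw.strip().replace('$', '').replace(',', '')
    # Return None rather than raising when the remainder is not a plain number
    return int(cleaned) if cleaned.isdigit() else None

for sample in [' $15,000,000\n', '$8,193,052', 'N/A']:
    print(repr(sample), '->', parse_usd_amount(sample))
```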


In [94]:
# Putting it together into a sample DataFrame:
# Assemble all key values into an array:
# Title, Year, IMDB Rating, # of Ratings, Budget, Opening Weekend, Gross Domestic, Cumulative Gross
# title, year, imdbRating, imdbRatingQty, budgetVal, openingVal, domesticVal, worldwideVal

filmdata = []
filmdata.append([title, year, imdbRating, imdbRatingQty, budgetVal, openingVal, domesticVal, worldwideVal])

# Convert to DataFrame:
pdfilmdata = pd.DataFrame(filmdata, columns = ['Title',
                                               'Year',
                                               'IMDB_Rating',
                                               'IMDB_Ratings',
                                               'Budget',
                                               'Opening_Weekend',
                                               'Domestic_Gross',
                                               'Worldwide_Gross'
                                              ])
pdfilmdata

| | Title | Year | IMDB_Rating | IMDB_Ratings | Budget | Opening_Weekend | Domestic_Gross | Worldwide_Gross |
|---|---|---|---|---|---|---|---|---|
| 0 | Top Gun | 1986 | 6.9 | 284517 | 15000000 | 8193052 | 179800601 | 356830601 |


## Currency Conversion Feature
Film production is global and USD is not the only currency used in IMDB.  Since data can be reported in different currencies, additional code must be able to determine what currency is being reported, and to apply the correct exchange rate to convert all values to USD.

Before proceeding, a currency conversion function must be defined to support data parsing and processing.

### Exchange Rate API
Credit: https://exchangeratesapi.io/
The Exchange Rates API is an open source service for current and historical foreign exchange rates.  A basic USD-based call is used to populate the exchange rate table.

USD: https://api.exchangeratesapi.io/latest?base=USD

In [25]:
# Use requests to call the API and parse the JSON response
r_usd = 'https://api.exchangeratesapi.io/latest?base=USD'
usd_response = requests.get(r_usd)
rates = usd_response.json()
print(rates)
currency = 'EUR'
rates['rates'][currency]

{'rates': {'CAD': 1.3420055134, 'HKD': 7.7513783598, 'ISK': 135.9407305307, 'PHP': 49.3762922123, 'DKK': 6.4126464507, 'HUF': 298.9145416954, 'CZK': 22.6292212267, 'GBP': 0.7838128877, 'RON': 4.1630771881, 'SEK': 8.8464851826, 'IDR': 14629.5658166782, 'INR': 74.8328738801, 'BRL': 5.2357856651, 'RUB': 71.8416609235, 'HRK': 6.4757064094, 'JPY': 106.2715368711, 'THB': 31.7203652653, 'CHF': 0.9243625086, 'EUR': 0.8614748449, 'MYR': 4.2644727774, 'BGN': 1.6848725017, 'TRY': 6.8483804273, 'CNY': 7.0169710544, 'NOK': 9.213731909, 'NZD': 1.5080978635, 'ZAR': 16.7427636113, 'USD': 1.0, 'MXN': 22.4676085458, 'SGD': 1.3855099931, 'AUD': 1.4107512061, 'ILS': 3.4150585803, 'KRW': 1203.3339076499, 'PLN': 3.794452102}, 'base': 'USD', 'date': '2020-07-24'}


0.8614748449
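
With `base=USD`, each rate is the amount of foreign currency that one US dollar buys, so a native amount is divided by its rate to obtain USD. The sketch below illustrates this with rates copied from the response above; `SAMPLE_RATES` and `native_to_usd` are hypothetical names used only for this illustration.

```python
# Rates copied from the sample API response above (units of currency per 1 USD)
SAMPLE_RATES = {'USD': 1.0, 'GBP': 0.7838128877, 'EUR': 0.8614748449}

def native_to_usd(amount, currency, rates=SAMPLE_RATES):
    # Dividing by "currency units per USD" yields USD
    return amount / rates[currency]

gbp_budget = 6000000
usd_budget = native_to_usd(gbp_budget, 'GBP')
print(round(usd_budget, 2))

# Multiplying by the same rate converts back to the native amount
print(round(usd_budget * SAMPLE_RATES['GBP'], 2))
```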

In [60]:
# Create exchange rate function to determine currency, look up exchange rate, and perform conversion to USD
def perf_usd_conversion(native_value):
    # Call Exchange Rates API to look up latest USD exchange rates
    r_usd = 'https://api.exchangeratesapi.io/latest?base=USD'
    usd_response = requests.get(r_usd)
    rates = usd_response.json()
    
    # Parse reported value to determine currency used and remove currency code from string
    native_value = native_value.strip()
    if native_value[0] == '$':
        num_value = native_value.replace('$','')
        exchange_rate = 1
    else:
        currency = native_value[:3]
        exchange_rate = rates['rates'][currency]
        num_value = native_value[3:]
    num_value = num_value.replace(',','')
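    # Rates are quoted per 1 USD, so divide the native amount by the rate;
    # anything that is not a plain number is reported as None
    if num_value.isnumeric():
        usd_value = float(num_value) / exchange_rate
    else:
        usd_value = None
    return usd_value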
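
As written, `perf_usd_conversion` requests the rate table on every call, and it raises an error for an empty string or for a currency code missing from the table. One possible variation is sketched below, under the assumption that the rate table is fetched once and passed in. `to_usd` and `SAMPLE_RATES` are hypothetical names, and the rates are copied from the sample response above.

```python
import re

# Rates copied from the sample response above; in practice this would be
# the 'rates' dictionary fetched once from the API (assumption).
SAMPLE_RATES = {'USD': 1.0, 'GBP': 0.7838128877, 'EUR': 0.8614748449}

def to_usd(native_value, rates):
    # Remove surrounding whitespace and thousands separators
    cleaned = native_value.strip().replace(',', '')
    # Expect either "$<digits>" or "<3-letter code><digits>"
    match = re.fullmatch(r'(\$|[A-Z]{3})(\d+)', cleaned)
    if match is None:
        return None
    prefix, digits = match.groups()
    rate = 1.0 if prefix == '$' else rates.get(prefix)
    # Unknown currency codes yield None instead of raising KeyError
    if rate is None:
        return None
    return float(digits) / rate

print(to_usd(' GBP6,0,0,0,0,0,0     ', SAMPLE_RATES))
print(to_usd('$15,000,000', SAMPLE_RATES))
print(to_usd('XYZ100', SAMPLE_RATES))
```

Passing the table as an argument also makes the function testable without network access.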

## Example: Convert a given value to test ability to detect currency
We can pass in a specific currency string to check that the conversion function picks the correct exchange rate.  The deliberately messy input below also exercises the strip() and replace() calls that clean up the string.

In [67]:
perf_usd_conversion(' GBP6,0,0,0,0,0,0     ')

7654888.168024696

## Example: Processing budget and box office data for Snatch (2000)
Snatch reported its budget in GBP but its box office numbers in USD.  We use this movie to test the currency exchange function.


In [44]:
def imdb_film_data_test(href):
    # Append href input to full IMDB URL
    url = 'https://www.imdb.com' + href
    
    # Parse URL with BeautifulSoup
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Retrieve budget data
    budgetTag = soup.find('h4', text = re.compile('^Budg'))
    if budgetTag is None:
        budgetVal = None
    else:
        str_budgetVal = budgetTag.next_sibling
        budgetVal = perf_usd_conversion(str_budgetVal)

    # Retrieve box office data
    openingTag = soup.find('h4', text = re.compile('^Opening Weekend'))
    if openingTag is None:
        openingVal = None
    else:
        str_openingVal = openingTag.next_sibling
        openingVal = perf_usd_conversion(str_openingVal)

    domesticTag = soup.find('h4', text = re.compile('^Gross '))
    if domesticTag is None:
        domesticVal = None
    else:
        str_domesticVal = domesticTag.next_sibling
        domesticVal = perf_usd_conversion(str_domesticVal)

    worldwideTag = soup.find('h4', text = re.compile('^Cumulative Worldwide Gross'))
    if worldwideTag is None:
        worldwideVal = None
    else:
        str_worldwideVal = worldwideTag.next_sibling
        worldwideVal = perf_usd_conversion(str_worldwideVal)

    # Return list of film data in prescribed order
    filmdata = [budgetVal, openingVal, domesticVal, worldwideVal]
    return filmdata

In [69]:
# Example: Snatch
# Budget reported in GBP, opening weekend has extra characters that could break the function
imdb_film_data_test('/title/tt0208092/')

[7654888.168024696, 27932.0, 30328156.0, 83557872.0]

## Convert to Function
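Combine the steps above into a single function.  `if` statements handle pages with missing fields (no year, rating, budget, or box office data) by returning `None` for those values.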

In [46]:
def imdb_film_data(href):
    # Append href input to full IMDB URL
    url = 'https://www.imdb.com' + href
    
    # Parse URL with BeautifulSoup
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Retrieve film data
    imdbFilmdata = soup.find('div', class_ = 'title_wrapper')
    title_year = imdbFilmdata.h1.text
    if imdbFilmdata.h1.span is None:
        # No year span: clean up the heading text and leave the year empty
        title_year = title_year.replace('\xa0', ' ')
        title = title_year.strip()
        year = None
    else:
        yearbrackets = imdbFilmdata.h1.span.text
        title = title_year[:-len(yearbrackets)-2]
        yearstr = yearbrackets[1:len(yearbrackets)-1]
        year = int(yearstr)

    # Retrieve IMDB Rating data
    imdbRatingdata = soup.find('div', class_ = 'imdbRating')
    if imdbRatingdata is None:
        imdbRating = None
        imdbRatingQty = None
    else:
        str_imdbRating = imdbRatingdata.strong.text
        str_imdbRatingQty = imdbRatingdata.a.text
        imdbRating = float(str_imdbRating)
        str_imdbRatingQty = str_imdbRatingQty.replace(',','')
        imdbRatingQty = int(str_imdbRatingQty)

    # Retrieve budget data
    budgetTag = soup.find('h4', text = re.compile('^Budg'))
    if budgetTag is None:
        budgetVal = None
    else:
        str_budgetVal = budgetTag.next_sibling
        budgetVal = perf_usd_conversion(str_budgetVal)

    # Retrieve box office data
    openingTag = soup.find('h4', text = re.compile('^Opening Weekend'))
    if openingTag is None:
        openingVal = None
    else:
        str_openingVal = openingTag.next_sibling
        openingVal = perf_usd_conversion(str_openingVal)

    domesticTag = soup.find('h4', text = re.compile('^Gross '))
    if domesticTag is None:
        domesticVal = None
    else:
        str_domesticVal = domesticTag.next_sibling
        domesticVal = perf_usd_conversion(str_domesticVal)

    worldwideTag = soup.find('h4', text = re.compile('^Cumulative Worldwide Gross'))
    if worldwideTag is None:
        worldwideVal = None
    else:
        str_worldwideVal = worldwideTag.next_sibling
        worldwideVal = perf_usd_conversion(str_worldwideVal)

#     # Retrieve budget data
#     budgetTag = soup.find('h4', text = re.compile('^Budg'))
#     if budgetTag is None:
#         budgetVal = None
#         pass
#     else:
#         str_budgetVal = budgetTag.next_sibling
#         budgetVal = str_budgetVal.replace('$','')
#         budgetVal = budgetVal.replace(',','')
#         budgetVal = int(budgetVal)

#     # Retrieve box office data
#     openingTag = soup.find('h4', text = re.compile('^Opening Weekend'))
#     if openingTag is None:
#         openingVal = None
#         pass
#     else:
#         str_openingVal = openingTag.next_sibling
#         openingVal = str_openingVal.replace('$','')
#         openingVal = openingVal.replace(',','')
#         openingVal = int(openingVal)
    
#     domesticTag = soup.find('h4', text = re.compile('^Gross '))
#     if domesticTag is None:
#         domesticVal = None
#         pass
#     else:
#         str_domesticVal = domesticTag.next_sibling
#         domesticVal = str_domesticVal.replace('$','')
#         domesticVal = domesticVal.replace(',','')
#         domesticVal = int(domesticVal)
    
#     worldwideTag = soup.find('h4', text = re.compile('^Cumulative Worldwide Gross'))
#     if worldwideTag is None:
#         worldwideVal = None
#         pass
#     else:
#         str_worldwideVal = worldwideTag.next_sibling
#         worldwideVal = str_worldwideVal.replace('$','')
#         worldwideVal = worldwideVal.replace(',','')
#         worldwideVal = int(worldwideVal)
    
#     # Basic:
#     # Append data to table and convert to DataFrame
#     # This method returns a DataFrame and will prove inefficient with the filmography function!
#     filmdata = []
#     filmdata.append([title, year, imdbRating, imdbRatingQty, budgetVal, openingVal, domesticVal, worldwideVal])
# #     pdfilmdata = pd.DataFrame(filmdata)
#     pdfilmdata = pd.DataFrame(filmdata, columns = ['Title',
#                                                'Year',
#                                                'IMDB_Rating',
#                                                'IMDB_Ratings',
#                                                'Budget',
#                                                'Opening_Weekend',
#                                                'Domestic_Gross',
#                                                'Worldwide_Gross'
#                                               ])
#     return pdfilmdata

    # Advanced:
    # Return list of film data in prescribed order
    filmdata = [title, year, imdbRating, imdbRatingQty, budgetVal, openingVal, domesticVal, worldwideVal]
    return filmdata

In [70]:
# Test cases:

# Example: Mission: Impossible 7
# No score, budget, box office
# imdb_film_data('/title/tt9603212/')

# Example: Luna Park
# No score, year, budget, box office
# imdb_film_data('/title/tt1123441/')

# Example: Edge of Tomorrow
# imdb_film_data('/title/tt1631867/')

# Example: Moments
# Invalid budget
imdb_film_data('/title/tt3359412/')

['Moments', 2013, 6.9, 100, 127.58146946707826, None, None, None]
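
Because `imdb_film_data` returns a plain list in a fixed order, each value can be paired with its column name before building a table. This is a small sketch using the column names from the earlier DataFrame example and the row returned above; `COLUMNS` and `to_record` are hypothetical names.

```python
COLUMNS = ['Title', 'Year', 'IMDB_Rating', 'IMDB_Ratings',
           'Budget', 'Opening_Weekend', 'Domestic_Gross', 'Worldwide_Gross']

def to_record(filmdata):
    # Guard against rows that do not match the prescribed order and length
    if len(filmdata) != len(COLUMNS):
        raise ValueError('expected %d values, got %d' % (len(COLUMNS), len(filmdata)))
    return dict(zip(COLUMNS, filmdata))

row = ['Moments', 2013, 6.9, 100, 127.58146946707826, None, None, None]
record = to_record(row)
print(record['Title'], record['Budget'], record['Worldwide_Gross'])
```

A list of such dictionaries can be passed to `pd.DataFrame` in one step once all films have been scraped.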