# Oscar Data Web Scrape - Web Data Science
### Jack Stein

The goal of this workbook is to create a dataset of the Oscars' Best Picture nominees and winners since 2000. To do this, we will use Selenium to scrape the [Oscars Database](https://awardsdatabase.oscars.org/) and [Rotten Tomatoes](https://www.rottentomatoes.com), and we will call the [The Movie Database (TMDB)](https://developer.themoviedb.org/reference/intro/getting-started) API.

With this data we should hopefully have enough factors to figure out what makes a movie an Oscar winner! There is definitely room for this dataset to grow to include the actors, directors, and other information, but this is a great place to start.

Import libraries

In [1]:
import pandas as pd
import requests, json
from bs4 import BeautifulSoup
from urllib.parse import quote
import numpy as np

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

pd.options.display.max_columns = 200
pd.options.display.float_format = '{:,}'.format

### Step 1
The first step is to download all the nominees and winners for the 'Best Picture' award at the Oscars. Because the Oscars database is only reachable through a search form, we use Selenium to set the search filters and submit the search.

In [2]:
# Using Selenium for Oscars Database
oscars_url = "https://awardsdatabase.oscars.org/search"
chrome_service = ChromeService(ChromeDriverManager().install())

chrome_options = webdriver.ChromeOptions()
# chrome_options.page_load_strategy = "eager"
chrome_options.add_argument('--headless')

# Launch the driver
driver = webdriver.Chrome(options = chrome_options, service=chrome_service)
driver.get(oscars_url)

# Step through and click on each filter
# select award category
element1=driver.find_element(By.XPATH, "//*[@id='basicsearch']/div/div[1]/div[2]/div/span/div/button")
element1.click()
element2=driver.find_element(By.XPATH, '//*[@id="basicsearch"]/div/div[1]/div[2]/div/span/div/ul/li[22]/a/label/input')
element2.click()
element3=driver.find_element(By.XPATH, "//*[@id='basicsearch']/div/div[1]/div[2]/div/span/div/button")
element3.click()
# select winners only
# element4 = driver.find_element(By.XPATH, "//*[@id='BasicSearchView_IsWinnersOnly']")
# element4.click()
element5 = driver.find_element(By.XPATH, '//*[@id="basicsearch"]/div/div[2]/div[2]/div/div[1]/span/div/button')
element5.click()
#select from year
element6 = driver.find_element(By.XPATH, '//*[@id="basicsearch"]/div/div[2]/div[2]/div/div[1]/span/div/ul/li[25]/a/label/input')
element6.click()
#select to year
element7 = driver.find_element(By.XPATH, '//*[@id="basicsearch"]/div/div[2]/div[2]/div/div[2]/span/div/button')
element7.click()
element8 = driver.find_element(By.XPATH, '//*[@id="basicsearch"]/div/div[2]/div[2]/div/div[2]/span/div/ul/li[3]/a/label/input')
element8.click()
# search
element9 = driver.find_element(By.XPATH, '//*[@id="btnbasicsearch"]')
element9.click()

# Get source
oscar_raw = driver.page_source.encode('utf-8')

# Convert to Soup
oscar_soup = BeautifulSoup(oscar_raw, "html.parser")

# Quit the driver
driver.quit()

The following function parses the HTML returned by the database search and extracts the fields we want to keep. We then load the result into `oscar_df`.

In [3]:
def parseOscars(soup):
    # Each result group holds one award year
    groups = soup.find_all("div",class_='awards-result-chron result-group group-awardcategory-chron')
    oscar_list = []
    for group in groups:
        links = group.find_all("a",class_="nominations-link")
        details = group.find_all("div",class_='result-details awards-result-actingorsimilar')
        # links[0] is the year label; nominees start at links[2] as (movie, producers) pairs
        n_nominees = (len(links) - 2) // 2

        # winner check: the winning entry carries a star icon
        winner = None
        for i in range(n_nominees):
            if details[i].find("span", class_="glyphicon glyphicon-star") is not None:
                winner = i

        os_yr = links[0].text[:4]
        os_ayr = links[0].text[5:]
        for i in range(n_nominees):
            oscar_dict = {}
            oscar_dict['year'] = os_yr
            oscar_dict['award year'] = os_ayr
            oscar_dict['movie'] = links[2*i + 2].text
            # drop the trailing 11 characters of the producer credit
            oscar_dict['producers'] = links[2*i + 3].text[:-11]
            oscar_dict['winner'] = (i == winner)
            oscar_list.append(oscar_dict)

    return oscar_list

oscar_list = parseOscars(oscar_soup)
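
The index arithmetic in `parseOscars` is easier to follow on a plain list. The sketch below assumes, as the function does, that each year's `nominations-link` texts arrive as the year label, one further header link, and then alternating movie and producer entries. The helper name `pair_nominees`, the `"BEST PICTURE"` header text, and the `", Producers"` suffix are assumptions made for illustration, not values confirmed from the live page.

```python
SUFFIX = ", Producers"  # assumed trailing text on each producer credit (11 characters)

def pair_nominees(link_texts):
    """Split one year's link texts into year, ceremony, and (movie, producers) pairs."""
    year_label = link_texts[0]        # e.g. "2000 (73rd)"
    nominee_texts = link_texts[2:]    # skip the two header links
    movies = nominee_texts[0::2]      # even positions: movie titles
    credits = nominee_texts[1::2]     # odd positions: producer credits
    producers = [c[:-len(SUFFIX)] if c.endswith(SUFFIX) else c for c in credits]
    return year_label[:4], year_label[5:], list(zip(movies, producers))

# Hand-written sample standing in for the scraped link texts
sample = [
    "2000 (73rd)", "BEST PICTURE",
    "Chocolat", "David Brown, Kit Golden and Leslie Holleran, Producers",
    "Gladiator", "Douglas Wick, David Franzoni and Branko Lustig, Producers",
]
year, ceremony, nominees = pair_nominees(sample)
print(year, ceremony, nominees)
```

Slicing with a step of 2 does the same job as the `2*i + 2` and `2*i + 3` offsets, and `zip` quietly drops a trailing movie that has no producer entry.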

In [4]:
oscar_df = pd.DataFrame(oscar_list)
oscar_df.head(15)

Unnamed: 0,year,award year,movie,producers,winner
0,2000,(73rd),Chocolat,"David Brown, Kit Golden and Leslie Holleran",False
1,2000,(73rd),"Crouching Tiger, Hidden Dragon","Bill Kong, Hsu Li Kong and Ang Lee",False
2,2000,(73rd),Erin Brockovich,"Danny DeVito, Michael Shamberg and Stacey Sher",False
3,2000,(73rd),Gladiator,"Douglas Wick, David Franzoni and Branko Lustig",True
4,2000,(73rd),Traffic,"Edward Zwick, Marshall Herskovitz and Laura Bi...",False
5,2001,(74th),A Beautiful Mind,Brian Grazer and Ron Howard,True
6,2001,(74th),Gosford Park,"Robert Altman, Bob Balaban and David Levy",False
7,2001,(74th),In the Bedroom,"Graham Leader, Ross Katz and Todd Field",False
8,2001,(74th),The Lord of the Rings: The Fellowship of the Ring,"Peter Jackson, Fran Walsh and Barrie M. Osborne",False
9,2001,(74th),Moulin Rouge,"Martin Brown, Baz Luhrmann and Fred Baron",False


### Step 2
The next step is to pull data from The Movie Database (TMDB) API. I have already registered for API access and saved the API token below. We need it to make our API calls.

In [5]:
a_token = ""
headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {a_token}"
    }
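
Rather than pasting the token into the notebook, one option is to read it from an environment variable. This is only a minimal sketch: the variable name `TMDB_TOKEN` and the helper `build_headers` are assumptions, not something the TMDB API prescribes.

```python
import os

def build_headers(token):
    """Return the request headers used for TMDB calls."""
    if not token:
        raise ValueError("No TMDB token found; set the TMDB_TOKEN environment variable.")
    return {
        "accept": "application/json",
        "Authorization": f"Bearer {token}",
    }

# Placeholder so the sketch runs on its own; a real token would be set in the shell.
os.environ.setdefault("TMDB_TOKEN", "example-token")
headers = build_headers(os.environ.get("TMDB_TOKEN"))
```

Failing early on an empty token gives a clearer message than an authorization error from the first API call.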

This function makes an API request for each movie in our Oscars dataframe and returns the ID that TMDB has assigned to that movie. We assume that the first movie returned by the search is the correct one.

In [6]:
def getTMBD_id(row):
    search_encoded = quote(row["movie"])

    # Search the nomination year first, then the following year
    for year_ in (int(row["year"]), int(row["year"]) + 1):
        url = f"https://api.themoviedb.org/3/search/movie?query={search_encoded}&include_adult=false&language=en-US&page=1&year={year_}"
        response = requests.get(url, headers=headers).json()

        # Assume the first result is the right movie
        if len(response["results"]) >= 1:
            return response["results"][0]["id"]

    # Neither year returned a result
    raise ValueError(f"No TMDB match found for {row['movie']} ({row['year']})")

oscar_df["TMDB_id"] = oscar_df.apply(getTMBD_id, axis=1)
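
The search-then-retry logic can be checked without touching the network by passing the search step in as a function. In this sketch `build_search_url`, `first_match_id`, `fake_search`, and the sample ID are hypothetical stand-ins; only the query parameters mirror the URL used above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.themoviedb.org/3/search/movie"

def build_search_url(title, year):
    """Build the search URL; urlencode escapes the title for us."""
    params = {
        "query": title,
        "include_adult": "false",
        "language": "en-US",
        "page": 1,
        "year": year,
    }
    # quote_via=quote encodes spaces as %20, matching urllib.parse.quote
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"

def first_match_id(title, year, search):
    """Try the given year, then the next one; return the first result's id or None."""
    for candidate_year in (int(year), int(year) + 1):
        results = search(build_search_url(title, candidate_year))
        if results:
            return results[0]["id"]
    return None

def fake_search(url):
    # Stand-in for requests.get(url, headers=headers).json()["results"]:
    # pretends the film is only listed under 2001.
    return [{"id": 123}] if "year=2001" in url else []

movie_id = first_match_id("Moulin Rouge", "2000", fake_search)
print(movie_id)
```

Swapping `fake_search` for a function that performs the real request leaves the retry logic unchanged.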

In [7]:
oscar_df.head(5)

Unnamed: 0,year,award year,movie,producers,winner,TMDB_id
0,2000,(73rd),Chocolat,"David Brown, Kit Golden and Leslie Holleran",False,392
1,2000,(73rd),"Crouching Tiger, Hidden Dragon","Bill Kong, Hsu Li Kong and Ang Lee",False,146
2,2000,(73rd),Erin Brockovich,"Danny DeVito, Michael Shamberg and Stacey Sher",False,462
3,2000,(73rd),Gladiator,"Douglas Wick, David Franzoni and Branko Lustig",True,98
4,2000,(73rd),Traffic,"Edward Zwick, Marshall Herskovitz and Laura Bi...",False,1900


This function fetches the details for one movie from TMDB and, similar in purpose to `parseOscars`, returns them as a dictionary. After collecting the details for every movie, we concatenate them with `oscar_df`.


In [8]:
def movieDetails(row):
    movie_id_query = row["TMDB_id"]

    url = f"https://api.themoviedb.org/3/movie/{movie_id_query}?language=en-US"
    # Reuse the headers defined alongside the API token
    response = requests.get(url, headers=headers).json()

    detail_d = {}
    detail_d["imdb_id"] = response["imdb_id"]
    detail_d["budget"] = response["budget"]
    detail_d["revenue"] = response["revenue"]
    detail_d["popularity"] = response["popularity"]
    # Keep only the first listed genre
    detail_d["genre"] = response["genres"][0]["name"]
    detail_d["adult"] = response["adult"]
    # Store the production company names as a list
    detail_d["production_companies"] = [c["name"] for c in response["production_companies"]]
    detail_d["release_date"] = response["release_date"]
    detail_d["vote_average"] = response["vote_average"]
    detail_d["runtime(mins)"] = response["runtime"]
    return detail_d

# result_type='expand' turns each returned dictionary into a row of columns
details_df = oscar_df.apply(movieDetails,axis=1,result_type='expand')
oscar_df = pd.concat([oscar_df,details_df],axis=1)
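
Indexing the response directly raises a `KeyError` or `IndexError` when a field is missing or a movie has no genres listed. A more forgiving variant, sketched here on a hand-written dictionary rather than a real API response, falls back to `None`. The name `extract_details` and the sample values are illustrative only.

```python
def extract_details(response):
    """Pull a subset of fields from a TMDB-style movie dictionary."""
    genres = response.get("genres") or []
    companies = response.get("production_companies") or []
    return {
        "imdb_id": response.get("imdb_id"),
        "budget": response.get("budget"),
        # First listed genre, or None when the list is empty
        "genre": genres[0]["name"] if genres else None,
        "production_companies": [c["name"] for c in companies],
        "runtime(mins)": response.get("runtime"),
    }

# Made-up response with no genres and no runtime
sample_response = {
    "imdb_id": "tt0000000",
    "budget": 1000,
    "genres": [],
    "production_companies": [{"name": "Example Films"}],
}
details = extract_details(sample_response)
print(details)
```

Missing values then show up as empty cells in the dataframe instead of stopping the whole `apply` call partway through.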

In [9]:
oscar_df.head()

Unnamed: 0,year,award year,movie,producers,winner,TMDB_id,imdb_id,budget,revenue,popularity,genre,adult,production_companies,release_date,vote_average,runtime(mins)
0,2000,(73rd),Chocolat,"David Brown, Kit Golden and Leslie Holleran",False,392,tt0241303,25000000,152500343,16.956,Comedy,False,"[Fat Free, Miramax, David Brown Productions]",2000-12-22,7.0,121
1,2000,(73rd),"Crouching Tiger, Hidden Dragon","Bill Kong, Hsu Li Kong and Ang Lee",False,146,tt0190332,17000000,213525736,24.728,Adventure,False,"[China Film Co-Production Corporation, Columbi...",2000-07-06,7.41,120
2,2000,(73rd),Erin Brockovich,"Danny DeVito, Michael Shamberg and Stacey Sher",False,462,tt0195685,52000000,256271286,23.903,Drama,False,[Jersey Films],2000-03-17,7.42,131
3,2000,(73rd),Gladiator,"Douglas Wick, David Franzoni and Branko Lustig",True,98,tt0172495,103000000,465361176,63.604,Action,False,"[Universal Pictures, Scott Free Productions, R...",2000-05-04,8.21,155
4,2000,(73rd),Traffic,"Edward Zwick, Marshall Herskovitz and Laura Bi...",False,1900,tt0181865,48000000,207515725,22.661,Thriller,False,"[USA Films, Compulsion Inc., Initial Entertain...",2000-12-27,7.051,147


### Step 3
Now we will scrape [Rotten Tomatoes](https://www.rottentomatoes.com) and collect the critic and audience scores, along with the rating counts that accompany them. I had to manually change some of the movie names to match the Rotten Tomatoes URLs.

NOTE: This takes a while to run, because a new headless browser is launched for every movie.

In [10]:
def rottenManualfixes(movie):
    errorMovies={
        "moulin_rouge":"moulin_rouge_2001",
        "the_pianist":"pianist",
        "crash":'1144992-crash',
        "the_reader":"reader",
        "up_in_the_air":"up_in_the_air_2009",
        "the_tree_of_life":"the_tree_of_life_2011",
        "gravity":"gravity_2013",
        "the_wolf_of_wall_street":"the_wolf_of_wall_street_2013",
        "whiplash":"whiplash_2014",
        "room":"room_2015",
        "spotlight":"spotlight_2015",
        "arrival":"arrival_2016",
        "fences":"fences_2016",
        "lion":"lion_2016",
        "moonlight":"moonlight_2016",
        "black_panther":"black_panther_2018",
        "vice":"vice_2018",
        "joker":"joker_2019",
        "little_women":"little_women_2019",
        "parasite":"parasite_2019",
        "1917":"1917_2019",
        "the_father":"the_father_2021",
        "dune":"dune_2021",
        "tár":"tar_2022"
    }
    # Fall back to the unchanged name when no manual fix is listed
    return errorMovies.get(movie, movie)
    
def rottenTomatoesInfo(myarray):
    loopmax = len(myarray)
    critic_ratings = []
    audience_ratings = []
    for x in range(0,loopmax):
        # Print the position of the movie currently being searched for
        print(f"Searching for movie number {x+1} of {loopmax}",end="\r")

        # Build the URL slug: lower-case, strip punctuation, underscores for spaces
        movie = myarray[x].lower()
        movie = movie.replace(",","")
        movie = movie.replace(":","")
        movie = movie.replace("...","_")
        movie = movie.replace(" ","_")
        movie = rottenManualfixes(movie)
#         print(movie)
    
        
        chrome_service = ChromeService(ChromeDriverManager().install())
        chrome_options = webdriver.ChromeOptions()
        chrome_options.page_load_strategy = "eager"
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome(options = chrome_options, service=chrome_service)
        driver.get(f'https://www.rottentomatoes.com/m/{movie}')
        _raw = driver.page_source.encode('utf-8')
        _soup = BeautifulSoup(_raw, "html.parser")
        driver.quit()

        #CRITICS
#         critic_ratings = []
        if (len(_soup.find_all('score-details-critics-deprecated')) != 0):
            critics_details = _soup.find('score-details-critics-deprecated')

            critic_ratings_d = {}
            critic_ratings_d['average_rating_c'] = float(critics_details['averagerating'])
            critic_ratings_d['liked_count_c'] = int(critics_details['likedcount'])
            critic_ratings_d['not_liked_count_c'] = int(critics_details['notlikedcount'])
            critic_ratings_d['rating_count_c'] = int(critics_details['ratingcount'])
            critic_ratings_d['state_c'] = critics_details['state']
            critic_ratings_d['value_c'] = int(critics_details['value'])

    #         print(critic_ratings_d)
            critic_ratings.append(critic_ratings_d)
        else:
            critic_ratings_d = {}
            critic_ratings_d['average_rating_c'] = np.nan
            critic_ratings_d['liked_count_c'] = np.nan
            critic_ratings_d['not_liked_count_c'] = np.nan
            critic_ratings_d['rating_count_c'] = np.nan
            critic_ratings_d['state_c'] = np.nan
            critic_ratings_d['value_c'] = np.nan

    #         print(critic_ratings_d)
            critic_ratings.append(critic_ratings_d)
        
        # AUDIENCE
#         audience_ratings = []
        if (len(_soup.find_all('score-details-audience-deprecated')) != 0):
            audience_details = _soup.find('score-details-audience-deprecated')

            audience_ratings_d = {}
            audience_ratings_d['average_rating_a'] = float(audience_details['averagerating'])
            audience_ratings_d['liked_count_a'] = int(audience_details['likedcount'])
            audience_ratings_d['not_liked_count_a'] = int(audience_details['notlikedcount'])
            audience_ratings_d['rating_count_a'] = int(audience_details['ratingcount'])
            audience_ratings_d['review_count_a'] = int(audience_details['reviewcount'])
            audience_ratings_d['value_a'] = int(audience_details['value'])
            
    #         print(audience_ratings_d)
            audience_ratings.append(audience_ratings_d)
        else:
            # No audience block on the page: record missing values
            audience_ratings_d = {}
            audience_ratings_d['average_rating_a'] = np.nan
            audience_ratings_d['liked_count_a'] = np.nan
            audience_ratings_d['not_liked_count_a'] = np.nan
            audience_ratings_d['rating_count_a'] = np.nan
            audience_ratings_d['review_count_a'] = np.nan
            audience_ratings_d['value_a'] = np.nan
            
    #         print(audience_ratings_d)
            audience_ratings.append(audience_ratings_d)
        
    return critic_ratings, audience_ratings
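
The title clean-up at the top of the loop can be pulled out into a small function so it can be tested on its own. This sketch repeats the same four replacements in the same order. The name `to_slug` is hypothetical, and the result still needs `rottenManualfixes` for titles whose page uses a different slug.

```python
def to_slug(title):
    """Lower-case a title and apply the replacements used in rottenTomatoesInfo."""
    slug = title.lower()
    for old, new in ((",", ""), (":", ""), ("...", "_"), (" ", "_")):
        slug = slug.replace(old, new)
    return slug

print(to_slug("Crouching Tiger, Hidden Dragon"))  # crouching_tiger_hidden_dragon
print(to_slug("The Lord of the Rings: The Fellowship of the Ring"))
```

Other punctuation, such as a full stop or an apostrophe, passes through unchanged, which is one reason some titles need a manual entry.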

In [11]:
critic_ratings, audience_ratings = rottenTomatoesInfo(oscar_df["movie"].values)

Searching for movie number 171 of 171
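
The critic and audience branches of `rottenTomatoesInfo` differ only in their field names and suffix, so the missing-element case can be handled in one place. A rough sketch follows, using a plain dictionary in place of the BeautifulSoup tag (both support `tag[name]` lookups). The helper `read_scores`, the shortened field list, and the sample values are assumptions for illustration.

```python
import math

def read_scores(tag, fields, suffix):
    """Convert tag attributes to numbers, or NaN for every field if the tag is absent."""
    scores = {}
    for name, attr, cast in fields:
        scores[f"{name}_{suffix}"] = cast(tag[attr]) if tag is not None else math.nan
    return scores

# (column name, HTML attribute, conversion) for a few of the audience fields
AUDIENCE_FIELDS = [
    ("average_rating", "averagerating", float),
    ("liked_count", "likedcount", int),
    ("value", "value", int),
]

found = read_scores(
    {"averagerating": "4.0", "likedcount": "35771", "value": "83"},
    AUDIENCE_FIELDS,
    "a",
)
missing = read_scores(None, AUDIENCE_FIELDS, "a")
print(found)
print(missing)
```

Because `find` returns `None` when an element is not on the page, the same call covers both the found and the missing case, and both produce the same set of columns.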

Now that we have our lists of ratings, we turn each one into its own dataframe, join the two side by side, and then join the result onto `oscar_df`.

In [12]:
critic_ratings_df = pd.DataFrame(critic_ratings)
audience_ratings_df = pd.DataFrame(audience_ratings)
rotten_df = pd.concat([critic_ratings_df,audience_ratings_df],axis=1)
rotten_df.head(3)

Unnamed: 0,average_rating_c,liked_count_c,not_liked_count_c,rating_count_c,state_c,value_c,average_rating_a,liked_count_a,not_liked_count_a,rating_count_a,review_count_a,value_a
0,6.0,75.0,44.0,119.0,fresh,63.0,4.0,35771,7232,329145,13002,83
1,8.7,166.0,4.0,170.0,certified-fresh,98.0,4.1,29180,4725,427892,13700,86
2,7.5,127.0,23.0,150.0,certified-fresh,85.0,3.9,13393,3174,418943,8239,81


In [13]:
oscar_df = pd.concat([oscar_df,rotten_df],axis=1)
# save to CSV so I do not have to rerun the scrape every time
oscar_df.to_csv("oscars.csv")

Now we have successfully collected data from three different sources and combined them into one dataset!