In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from io import StringIO
import requests
import re

# An Analysis of Movie Performance

In this part, you’ll gather data about popular movies and award winners. The goal is to build a dataset that you’ll later use to analyze what makes a movie successful and how awards and box office performance relate to one another.

### Part 1: Data Gathering
1. Scrape Best Picture Data.  
    * Scrape the [Best Picture Wikipedia page](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture).  
    * Extract for each year:  
        * Year  
        * Film Title  
        * Winner (Yes/No)  
    * Data cleaning tips:  
        * Ensure that year and film title columns are clean and consistent (no footnotes, parentheses, etc.).
    * Save the results as best_picture.csv.  

In [2]:
# read in website for webscraping
URL = 'https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture'

headers = {
    "User-Agent": "MyPythonScraper"
}

response = requests.get(URL, headers=headers)

In [3]:
# check for connection
response.status_code

200

In [4]:
# parse the response with an explicit parser so the result does not depend on which parsers are installed
movies_soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
# locate tables in HTML
movies_soup.find_all('table', attrs={'class': 'wikitable'})[0]

<table class="wikitable sortable sticky-header" style="font-size:1.00em; line-height:1.5em;">
<tbody><tr bgcolor="#bebebe">
<th width="5%">Year of Film Release
</th>
<th width="40%">Film
</th>
<th width="55%">Film Studio
</th></tr>
<tr style="background:#FAEB86">
<th rowspan="3"><a href="/wiki/1928_in_film" title="1928 in film">1927/28</a><br/><span style="font-size: 85%;"><a href="/wiki/1st_Academy_Awards" title="1st Academy Awards">(1st)</a></span>
</th>
<td><i><b><a href="/wiki/Wings_(1927_film)" title="Wings (1927 film)">Wings</a></b></i>
</td>
<td><b><a href="/wiki/Famous_Players%E2%80%93Lasky" title="Famous Players–Lasky">Famous Players–Lasky</a> <span style="font-size: 85%;">(<a href="/wiki/Lucien_Hubbard" title="Lucien Hubbard">Lucien Hubbard</a>, <a href="/wiki/Jesse_L._Lasky" title="Jesse L. Lasky">Jesse L. Lasky</a>, <a href="/wiki/B._P._Schulberg" title="B. P. Schulberg">B. P. Schulberg</a>, &amp; <a href="/wiki/Adolph_Zukor" title="Adolph Zukor">Adolph Zukor</a>, producers

In [6]:
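# display the tables that we are scraping
from IPython.display import HTML
# join the tables into one HTML string instead of rendering the list's repr
table_html = ''.join(str(table) for table in movies_soup.find_all('table', attrs={'class': 'wikitable'}))
HTML(table_html)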

Year of Film Release,Film,Film Studio
1927/28 (1st),Wings,"Famous Players–Lasky (Lucien Hubbard, Jesse L. Lasky, B. P. Schulberg, & Adolph Zukor, producers)"
1927/28 (1st),7th Heaven,"Fox (William Fox, producer)"
1927/28 (1st),The Racket,"The Caddo Company (Howard Hughes, producer)"
1928/29 (2nd) [a],,
1928/29 (2nd) [a],The Broadway Melody,"Metro-Goldwyn-Mayer (Irving Thalberg & Lawrence Weingarten, producers)"
1928/29 (2nd) [a],Alibi,"Feature Productions (Roland West, producer)"
1928/29 (2nd) [a],The Hollywood Revue,"Metro-Goldwyn-Mayer (Irving Thalberg & Harry Rapf, producers)"
1928/29 (2nd) [a],In Old Arizona,"Fox (Winfield Sheehan, producer)"
1928/29 (2nd) [a],The Patriot,Paramount Famous Lasky

Year of Film Release,Film,Film Studio/Producer(s)
1929/30 (3rd),All Quiet on the Western Front,"Universal (Carl Laemmle Jr., producer)"
1929/30 (3rd),The Big House,"Cosmopolitan (Irving Thalberg, producer)"
1929/30 (3rd),Disraeli,"Warner Bros. (Jack L. Warner & Darryl F. Zanuck, producers)"
1929/30 (3rd),The Divorcee,"Metro-Goldwyn-Mayer (Robert Z. Leonard, producer)"
1929/30 (3rd),The Love Parade,"Paramount Famous Lasky (Ernst Lubitsch, producer)"
1930/31 (4th),,
1930/31 (4th),Cimarron,"RKO Radio (William LeBaron, producer)"
1930/31 (4th),East Lynne,Fox
1930/31 (4th),The Front Page,"The Caddo Company (Howard Hughes & Lewis Milestone, producers)"
1930/31 (4th),Skippy,"Paramount Publix (Jesse L. Lasky, B. P. Schulberg, & Adolph Zukor, producers)"

Year of Film Release,Film,Film Studio
1940 (13th),Rebecca,"Selznick International Pictures (David O. Selznick, producer)"
1940 (13th),"All This, and Heaven Too",Warner Bros.
1940 (13th),Foreign Correspondent,Walter Wanger (production company)
1940 (13th),The Grapes of Wrath,20th Century-Fox
1940 (13th),The Great Dictator,Charles Chaplin Productions
1940 (13th),Kitty Foyle,RKO Radio
1940 (13th),The Letter,Warner Bros.
1940 (13th),The Long Voyage Home,Argosy-Wanger
1940 (13th),Our Town,Sol Lesser (production company)
1940 (13th),The Philadelphia Story,Metro-Goldwyn-Mayer

Year of Film Release,Film,Film Studio/Producer(s)
1950 (23rd),All About Eve,"20th Century-Fox (Darryl F. Zanuck, producer)"
1950 (23rd),Born Yesterday,Columbia
1950 (23rd),Father of the Bride,Metro-Goldwyn-Mayer
1950 (23rd),King Solomon's Mines,Metro-Goldwyn-Mayer
1950 (23rd),Sunset Boulevard,Paramount
1951 (24th),,
1951 (24th),An American in Paris,Arthur Freed
1951 (24th),Decision Before Dawn,Anatole Litvak and Frank McCarthy
1951 (24th),A Place in the Sun,George Stevens
1951 (24th),Quo Vadis,Sam Zimbalist

Year of Film Release,Film,Producer(s)
1960 (33rd),The Apartment,Billy Wilder
1960 (33rd),The Alamo,John Wayne
1960 (33rd),Elmer Gantry,Bernard Smith
1960 (33rd),Sons and Lovers,Jerry Wald
1960 (33rd),The Sundowners,Fred Zinnemann
1961 (34th),,
1961 (34th),West Side Story,Robert Wise
1961 (34th),Fanny,Joshua Logan
1961 (34th),The Guns of Navarone,Carl Foreman
1961 (34th),The Hustler,Robert Rossen

Year of Film Release,Film,Producer(s)
1970 (43rd),Patton,Frank McCarthy
1970 (43rd),Airport,Ross Hunter
1970 (43rd),Five Easy Pieces,Bob Rafelson and Richard Wechsler
1970 (43rd),Love Story,Howard G. Minsky
1970 (43rd),M*A*S*H,Ingo Preminger
1971 (44th),,
1971 (44th),The French Connection,Philip D'Antoni
1971 (44th),A Clockwork Orange,Stanley Kubrick
1971 (44th),Fiddler on the Roof,Norman Jewison
1971 (44th),The Last Picture Show,Stephen J. Friedman

Year of Film Release,Film,Producer(s)
1980 (53rd),Ordinary People,Ronald L. Schwary
1980 (53rd),Coal Miner's Daughter,Bernard Schwartz
1980 (53rd),The Elephant Man,Jonathan Sanger
1980 (53rd),Raging Bull,Irwin Winkler and Robert Chartoff
1980 (53rd),Tess,Claude Berri and Timothy Burrill
1981 (54th),,
1981 (54th),Chariots of Fire,David Puttnam
1981 (54th),Atlantic City,Denis Héroux
1981 (54th),On Golden Pond,Bruce Gilbert
1981 (54th),Raiders of the Lost Ark,Frank Marshall

Year of Film Release,Film,Producer(s)
1990 (63rd),Dances With Wolves,Jim Wilson and Kevin Costner
1990 (63rd),Awakenings,Walter F. Parkes and Lawrence Lasker
1990 (63rd),Ghost,Lisa Weinstein
1990 (63rd),The Godfather Part III,Francis Ford Coppola
1990 (63rd),Goodfellas,Irwin Winkler
1991 (64th),,
1991 (64th),The Silence of the Lambs,"Edward Saxon, Kenneth Utt, and Ron Bozman"
1991 (64th),Beauty and the Beast,Don Hahn
1991 (64th),Bugsy,"Mark Johnson, Barry Levinson and Warren Beatty"
1991 (64th),JFK,A. Kitman Ho and Oliver Stone

Year of Film Release,Film,Producer(s)
2000 (73rd),Gladiator,"Douglas Wick, David Franzoni, and Branko Lustig"
2000 (73rd),Chocolat,"David Brown, Kit Golden, and Leslie Holleran"
2000 (73rd),"Crouching Tiger, Hidden Dragon","William Kong, Hsu Li-kong, and Ang Lee"
2000 (73rd),Erin Brockovich,"Danny DeVito, Michael Shamberg, and Stacey Sher"
2000 (73rd),Traffic,"Edward Zwick, Marshall Herskovitz, and Laura Bickford"
2001 (74th),,
2001 (74th),A Beautiful Mind,Brian Grazer and Ron Howard
2001 (74th),Gosford Park,"Robert Altman, Bob Balaban, and David Levy"
2001 (74th),In the Bedroom,"Graham Leader, Ross Katz, and Todd Field"
2001 (74th),The Lord of the Rings: The Fellowship of the Ring,"Peter Jackson, Fran Walsh, and Barrie M. Osborne"

Year of Film Release,Film,Producer(s)
2010 (83rd),The King's Speech,"Iain Canning, Emile Sherman, and Gareth Unwin"
2010 (83rd),Black Swan,"Scott Franklin, Mike Medavoy, and Brian Oliver"
2010 (83rd),The Fighter,"David Hoberman, Todd Lieberman, and Mark Wahlberg"
2010 (83rd),Inception,Christopher Nolan and Emma Thomas
2010 (83rd),The Kids Are All Right,"Gary Gilbert, Jeff Levy-Hinte, and Celine Rattray"
2010 (83rd),127 Hours,"Danny Boyle, John Smithson, and Christian Colson"
2010 (83rd),The Social Network,"Dana Brunetti, Ceán Chaffin, Michael De Luca, and Scott Rudin"
2010 (83rd),Toy Story 3,Darla K. Anderson
2010 (83rd),True Grit,"Joel Coen, Ethan Coen, and Scott Rudin"
2010 (83rd),Winter's Bone,Alix Madigan and Anne Rosellini

Year of Film Release,Film,Producer(s)
2020 (93rd),Nomadland,"Frances McDormand, Peter Spears, Mollye Asher, Dan Janvey, and Chloé Zhao"
2020 (93rd),The Father,"David Parfitt, Jean-Louis Livi, and Philippe Carcassonne"
2020 (93rd),Judas and the Black Messiah,"Shaka King, Charles D. King, and Ryan Coogler"
2020 (93rd),Mank,"Ceán Chaffin, Eric Roth, and Douglas Urbanski"
2020 (93rd),Minari,Christina Oh
2020 (93rd),Promising Young Woman,"Ben Browning, Ashley Fox, Emerald Fennell, and Josey McNamara"
2020 (93rd),Sound of Metal,Bert Hamelinck and Sacha Ben Harroche
2020 (93rd),The Trial of the Chicago 7,Marc Platt and Stuart M. Besser
2021 (94th),CODA,"Philippe Rousselet, Fabrice Gianfermi, and Patrick Wachsberger"
2021 (94th),Belfast,"Laura Berwick, Kenneth Branagh, Becca Kovacik, and Tamar Thomas"

Record,Producer,Film,Age
Oldest winner,Saul Zaentz,The English Patient,"76 years, 24 days"
Oldest nominee,Clint Eastwood,American Sniper,"84 years, 229 days"
Youngest winner,Carl Laemmle Jr.,All Quiet on the Western Front,"22 years, 191 days"
Youngest nominee,Carl Laemmle Jr.,All Quiet on the Western Front,"22 years, 144 days"

Production company/distributor,Nominations,Wins
Columbia Pictures,56,12
United Artists,48,12
Paramount Pictures,22,11
Universal Pictures,37,10
Metro-Goldwyn-Mayer,40,9
Warner Bros. Pictures,28,9
20th Century Fox,64,8
Fox Searchlight Pictures,23,5
Miramax Films,21,4
DreamWorks,15,4


In [7]:
# parse the tables into a list of DataFrames and display the first one
pd.read_html(StringIO(str(movies_soup.find_all('table', attrs={'class': 'wikitable'}))))[0]

Unnamed: 0,Year of Film Release,Film,Film Studio
0,1927/28 (1st),Wings,"Famous Players–Lasky (Lucien Hubbard, Jesse L...."
1,1927/28 (1st),7th Heaven,"Fox (William Fox, producer)"
2,1927/28 (1st),The Racket,"The Caddo Company (Howard Hughes, producer)"
3,1928/29 (2nd) [a],,
4,1928/29 (2nd) [a],The Broadway Melody,Metro-Goldwyn-Mayer (Irving Thalberg & Lawrenc...
5,1928/29 (2nd) [a],Alibi,"Feature Productions (Roland West, producer)"
6,1928/29 (2nd) [a],The Hollywood Revue,Metro-Goldwyn-Mayer (Irving Thalberg & Harry R...
7,1928/29 (2nd) [a],In Old Arizona,"Fox (Winfield Sheehan, producer)"
8,1928/29 (2nd) [a],The Patriot,Paramount Famous Lasky


In [8]:
# concatenate all tables into one dataframe
tables = movies_soup.find_all('table', attrs={'class': 'wikitable'})

all_movies_df = pd.concat(
    [pd.read_html(StringIO(str(table)))[0] 
     for table in tables], ignore_index=True)
all_movies_df.head()


Unnamed: 0,Year of Film Release,Film,Film Studio,Film Studio/Producer(s),Producer(s),Record,Producer,Age,Production company/distributor,Nominations,Wins
0,1927/28 (1st),Wings,"Famous Players–Lasky (Lucien Hubbard, Jesse L....",,,,,,,,
1,1927/28 (1st),7th Heaven,"Fox (William Fox, producer)",,,,,,,,
2,1927/28 (1st),The Racket,"The Caddo Company (Howard Hughes, producer)",,,,,,,,
3,1928/29 (2nd) [a],,,,,,,,,,
4,1928/29 (2nd) [a],The Broadway Melody,Metro-Goldwyn-Mayer (Irving Thalberg & Lawrenc...,,,,,,,,


In [9]:
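# collect every link title from the highlighted (winner) rows;
# besides film titles, the list also picks up years, studios, and producers
movie_wins_html = movies_soup.find_all('tr', attrs={'style': 'background:#FAEB86'})
winning_movies = re.findall(r'title="(.+?)"', str(movie_wins_html))
# drop parenthesized disambiguators such as "(1927 film)"
winning_movies_final = [re.sub(r'\(.*?\)', '', m).strip() for m in winning_movies]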
winning_movies_final[:10]

['1928 in film',
 '1st Academy Awards',
 'Wings',
 'Famous Players–Lasky',
 'Lucien Hubbard',
 'Jesse L. Lasky',
 'B. P. Schulberg',
 'Adolph Zukor',
 'The Broadway Melody',
 'Metro-Goldwyn-Mayer']
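Because the list above mixes film titles with years, studios, and producers, a stricter alternative is to walk the table rows and record the year, the film, and the highlight flag together. The following is only a sketch under assumptions read off the page source shown earlier: the year sits in a `th` cell that spans several rows, the film is the first `td` of each row, and winner rows carry the `#FAEB86` background. The function name `best_picture_rows` is hypothetical, and the page layout may change.

```python
import re
from bs4 import BeautifulSoup


def best_picture_rows(html):
    """Return (year, film, is_winner) tuples from Best Picture style tables."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="wikitable"):
        current_year = None
        for tr in table.find_all("tr"):
            # A year cell spans several rows, so remember the last one seen.
            th = tr.find("th")
            if th is not None:
                match = re.search(r"\d{4}", th.get_text())
                if match:
                    current_year = match.group()
            # Assumption: the film title is the first data cell of the row.
            td = tr.find("td")
            film = td.get_text(" ", strip=True) if td is not None else ""
            if not film or current_year is None:
                continue
            # Assumption: winner rows carry the highlight colour from the page source.
            is_winner = "FAEB86" in (tr.get("style") or "").upper()
            rows.append((current_year, film, is_winner))
    return rows
```

Keeping the year next to the title means a later nominee that reuses a winner's title is not marked as a winner, which matching on title alone cannot guarantee.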

In [10]:
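# add a Winner column: 'Yes' if the film title appears in 'winning_movies_final', otherwise 'No'
# note: matching on title alone also flags a nominee that shares its title with a winner from another year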
all_movies_df['Winner'] = all_movies_df['Film'].isin(winning_movies_final)
all_movies_df['Winner'] = all_movies_df['Winner'].map({True : 'Yes', False : 'No'})
all_movies_df.head()

Unnamed: 0,Year of Film Release,Film,Film Studio,Film Studio/Producer(s),Producer(s),Record,Producer,Age,Production company/distributor,Nominations,Wins,Winner
0,1927/28 (1st),Wings,"Famous Players–Lasky (Lucien Hubbard, Jesse L....",,,,,,,,,Yes
1,1927/28 (1st),7th Heaven,"Fox (William Fox, producer)",,,,,,,,,No
2,1927/28 (1st),The Racket,"The Caddo Company (Howard Hughes, producer)",,,,,,,,,No
3,1928/29 (2nd) [a],,,,,,,,,,,No
4,1928/29 (2nd) [a],The Broadway Melody,Metro-Goldwyn-Mayer (Irving Thalberg & Lawrenc...,,,,,,,,,Yes


In [11]:
# display column names for column drop
all_movies_df.columns

Index(['Year of Film Release', 'Film', 'Film Studio',
       'Film Studio/Producer(s)', 'Producer(s)', 'Record', 'Producer', 'Age',
       'Production company/distributor', 'Nominations', 'Wins', 'Winner'],
      dtype='object')

In [12]:
# drop unnecessary columns and NaN values
columns_to_drop = ['Film Studio', 'Film Studio/Producer(s)', 'Producer(s)', 'Record', 'Producer',
                   'Age', 'Production company/distributor', 'Nominations', 'Wins']
all_movies_dropped_columns = all_movies_df.drop(columns=columns_to_drop)
all_movies_dropped_columns_rename = all_movies_dropped_columns.rename(columns={'Year of Film Release': 'Year'})
dropna_df = all_movies_dropped_columns_rename.dropna().copy()
dropna_df.tail(2)

Unnamed: 0,Year,Film,Winner
689,2024 (97th),The Substance,No
690,2024 (97th),Wicked,No


In [16]:
# clean year values in year column
dropna_df['Year'] = dropna_df['Year'].str.extract(r"(\d{4})")
movies = dropna_df

In [14]:
movies

Unnamed: 0,Year,Film,Winner
0,1927,Wings,Yes
1,1927,7th Heaven,No
2,1927,The Racket,No
4,1928,The Broadway Melody,Yes
5,1928,Alibi,No
...,...,...,...
686,2024,Emilia Pérez,No
687,2024,I'm Still Here,No
688,2024,Nickel Boys,No
689,2024,The Substance,No
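The task also asks for footnote-free titles and a saved CSV, which the cells above do not cover yet. A minimal sketch of the title cleanup follows, assuming footnotes appear as bracketed markers such as `[a]` or `[12]`. The helper name `clean_title` is hypothetical. Parentheses are deliberately left alone, since some titles legitimately contain them.

```python
import re


def clean_title(title):
    """Strip bracketed footnote markers such as [a] or [12] and tidy whitespace."""
    without_notes = re.sub(r"\[[^\]]*\]", "", title)
    # Collapse any runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", without_notes).strip()
```

Applied to the notebook's data, this could look like `movies['Film'] = movies['Film'].map(clean_title)` followed by `movies.to_csv('best_picture.csv', index=False)` to write the requested file.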


2. Gather Movie Data via TMDB API  
    a. Set up the API    
    * Create a free [TMDB account](https://developer.themoviedb.org/docs/getting-started)  
    * Generate an API key and review their documentation, especially:  
        * /discover/movie  
        * /movie/{movie_id}  
        * /search/movie  
    b. Collect top movies (2015-2024)  
    For each year from 2015 to 2024:  
        * Query TMDB for the top 100 movies (by vote count).  
        * For each movie, gather:  
            * Title  
            * Release Year  
            * Genre(s)  
            * Vote Average  
            * Vote Count  
            * Budget  
            * Revenue  
            * TMDB ID  
        * Store all results in a single DataFrame and export to movies_2015_2024.csv.
        * Hint: TMDB rate limits are generous for free accounts, but you should pause between requests (e.g., time.sleep(0.25)). 
        * Some Oscar films may not appear in the top 100 by vote count. For any missing, use the /search/movie endpoint to add it.  
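The collection steps above could be sketched roughly as follows. This is not a reference implementation: it assumes a v3 API key passed as the `api_key` query parameter, 20 results per `/discover/movie` page, and the field names `genres`, `budget`, and `revenue` in the `/movie/{movie_id}` response. The helper names (`discover_params`, `flatten_movie`, `fetch_top_movies`, `missing_titles`) are hypothetical, so check them against the TMDB documentation before relying on them.

```python
import time

TMDB_BASE = "https://api.themoviedb.org/3"


def discover_params(api_key, year, page):
    """Query parameters for one page of /discover/movie, sorted by vote count."""
    return {
        "api_key": api_key,
        "primary_release_year": year,
        "sort_by": "vote_count.desc",
        "page": page,
    }


def flatten_movie(details):
    """Reduce a /movie/{movie_id} response to the columns the task lists."""
    release_date = details.get("release_date") or ""
    year = release_date[:4]
    return {
        "tmdb_id": details.get("id"),
        "title": details.get("title"),
        "release_year": int(year) if year.isdigit() else None,
        "genres": ", ".join(g["name"] for g in details.get("genres", [])),
        "vote_average": details.get("vote_average"),
        "vote_count": details.get("vote_count"),
        "budget": details.get("budget"),
        "revenue": details.get("revenue"),
    }


def fetch_top_movies(api_key, year, n=100, pause=0.25):
    """Fetch the n most-voted movies of a year, one detail request per movie."""
    import requests  # imported here so the helpers above also work offline

    rows, page = [], 1
    while len(rows) < n:
        resp = requests.get(f"{TMDB_BASE}/discover/movie",
                            params=discover_params(api_key, year, page), timeout=30)
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break
        for item in results[: n - len(rows)]:
            detail = requests.get(f"{TMDB_BASE}/movie/{item['id']}",
                                  params={"api_key": api_key}, timeout=30)
            detail.raise_for_status()
            rows.append(flatten_movie(detail.json()))
            time.sleep(pause)  # pause between requests, as the hint suggests
        page += 1
    return rows


def missing_titles(oscar_titles, tmdb_titles):
    """Oscar titles absent from the TMDB results (case-insensitive match)."""
    seen = {t.casefold() for t in tmdb_titles}
    return [t for t in oscar_titles if t.casefold() not in seen]
```

The rows for all years can be passed to `pd.DataFrame(...)` and written with `to_csv('movies_2015_2024.csv', index=False)`. Titles returned by `missing_titles` are the candidates to look up through `/search/movie`.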

**Optional Extension: Actors and Actresses** 

1. Scrape Wikipedia for Best Actor and Best Actress Data
    * Scrape the following Wikipedia pages:  
        * [Best Actor](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actor)
        * [Best Actress](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress)
    * Each page contains tables of winners and nominees by year.
    * Extract the following columns:  
        * Year
        * Actor/Actress Name
        * Film Title
        * Winner (Yes/No)
    * Data cleaning tips:  
        * Remove footnote markers from names and movie titles.
        * Ensure that you save just the release year (e.g., 2009 instead of 2009 (82nd)).
        * Store the cleaned data as two csv files:  
            * best_actor.csv
            * best_actress.csv  

2. Collect Actor and Actress Filmographies  
    Using the data from your actor and actress CSVs:  
    * Search TMDB for each recent performer (using /search/person). Note: start with 2015-2024; if time allows, go back further.
    * For each person, retrieve their movie credits using /person/{person_id}/movie_credits.  
    * Extract relevant fields for each movie, such as:  
        * Actor/Actress Name  
        * Movie Title  
        * Character Name (optional)  
        * Release Year  
        * Movie ID
    * Combine all filmographies into one file, actor_filmography.csv
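One possible shape for the filmography step is sketched below. It assumes that `/search/person` accepts a `query` parameter, that the first search hit is the intended performer, and that `/person/{person_id}/movie_credits` returns a `cast` list with `title`, `character`, `release_date`, and `id` fields. The names `flatten_credits` and `fetch_filmography` are hypothetical helpers.

```python
import time

TMDB_BASE = "https://api.themoviedb.org/3"


def flatten_credits(person_name, credits):
    """Turn a movie_credits response into one row per acting credit."""
    rows = []
    for credit in credits.get("cast", []):
        release_date = credit.get("release_date") or ""
        rows.append({
            "name": person_name,
            "movie_title": credit.get("title"),
            "character": credit.get("character"),
            "release_year": release_date[:4] or None,
            "movie_id": credit.get("id"),
        })
    return rows


def fetch_filmography(api_key, person_name, pause=0.25):
    """Look up a performer by name and return their flattened movie credits."""
    import requests  # imported here so flatten_credits also works offline

    search = requests.get(f"{TMDB_BASE}/search/person",
                          params={"api_key": api_key, "query": person_name}, timeout=30)
    search.raise_for_status()
    matches = search.json().get("results", [])
    if not matches:
        return []
    time.sleep(pause)
    person_id = matches[0]["id"]  # assumption: the first hit is the right person
    credits = requests.get(f"{TMDB_BASE}/person/{person_id}/movie_credits",
                           params={"api_key": api_key}, timeout=30)
    credits.raise_for_status()
    return flatten_credits(person_name, credits.json())
```

Concatenating the rows for every performer and writing them with `pd.DataFrame(rows).to_csv('actor_filmography.csv', index=False)` would produce the combined file. Common names may need a manual check, because the first search hit is not always the right person.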
