# An Analysis of Movie Performance

In this part, you’ll gather data about popular movies and award winners. The goal is to build a dataset that you’ll later use to analyze what makes a movie successful and how awards and box office performance relate to one another.

In [1]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json
import time

### Part 1: Data Gathering
1. Scrape Best Picture Data.  
    * Scrape the [Best Picture Wikipedia page](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture).  
    * Extract for each Movie:  
        * Year  
        * Film Title  
        * Winner (Yes/No)  
    * Data cleaning tips:  
        * Ensure that year and film title columns are clean and consistent (no footnotes, parentheses, etc.).
        * Save the results as best_picture.csv.  
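
The cleaning tips above can be sketched as a small helper. This is a minimal sketch, assuming footnote markers appear as bracketed tags such as `[a]` or `[12]`; `clean_cell` and its `drop_trailing_parens` flag are hypothetical names, and the sample strings are made up for illustration.

```python
import re

FOOTNOTE = re.compile(r"\[[^\]]*\]")               # bracketed markers such as [a] or [12]
TRAILING_PARENS = re.compile(r"\s*\([^)]*\)\s*$")  # a parenthetical at the very end


def clean_cell(text, drop_trailing_parens=False):
    """Return table-cell text without footnote markers or stray whitespace."""
    text = FOOTNOTE.sub("", text)
    if drop_trailing_parens:
        text = TRAILING_PARENS.sub("", text)
    # Collapse runs of whitespace (including non-breaking spaces) into single spaces
    return " ".join(text.split())


print(clean_cell("The Racket [a]"))
print(clean_cell("Some Film (note)", drop_trailing_parens=True))
```

Dropping parentheticals is left optional because some titles legitimately contain parentheses.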

In [2]:
URL = 'https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture'

headers = {
    "User-Agent": "Movie_Agent"
}

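resp = requests.get(URL, headers=headers)
resp.raise_for_status()  # Stop early if the page could not be fetched
soup = BeautifulSoup(resp.text, 'html.parser')  # Name the parser explicitly instead of letting bs4 guess
#soup.prettify()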

In [4]:
winners = soup.find_all('tr', attrs={'style' : 'background:#FAEB86'})
winning_titles = [winner.td.text.strip() for winner in winners]

In [5]:
# Get all wikitables
all_tables = soup.find_all('table', attrs={'class' : 'wikitable'})

# Find just the tables with movie data by filtering on the 'Year of Film Release' column header
# This excludes the 2 wikitables at the very bottom of the page ('Age superlatives' and 'Production companies and distributors with multiple nominations and wins')
movie_tables = [table for table in all_tables if 'Year of Film Release' in table.find('tr').text]

In [66]:
# Create an empty list to store one dictionary per movie, e.g. [{'Title': 'Wings', 'Awards_Year': '1928', 'Winner': 'Yes'}]
movie_info = []

# Iterate through all tables to extract movie data
for table in movie_tables:

    # Find all 'tr' ('table row') tags, skipping the first one since it only contains column headers
    rows = table.find_all('tr')[1:]

    # Iterate through all rows of the table to find the year, title, and winner status
    for row in rows:

        # A 'th' ('table header') cell marks the start of a new year; keep it for the rows that follow
        if row.th is not None:
            awards_year = row.th.a.text

            # For split years such as '1927/28', keep only the later year ('1928')
            if '/' in awards_year:
                awards_year = awards_year[:2] + awards_year[-2:]

        # Get the movie title from the first 'td' cell, if the row has one
        title = row.td.text.strip() if row.td is not None else ''

        # Winning rows are highlighted with a yellow background
        winner = 'Yes' if row.get('style') == 'background:#FAEB86' else 'No'

        # If this row has a movie title, append the movie info to the movie_info list
        if title != '':
            movie_info.append({'Title': title, 'Awards_Year': awards_year, 'Winner': winner})
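
The winner check above compares the whole `style` string exactly, so it would miss a row whose style is written with extra spaces, a trailing semicolon, or lowercase hex digits. The following is a minimal sketch of a more tolerant check, assuming the highlight color stays `#FAEB86`; `is_winner_style` is a hypothetical helper, not part of the notebook.

```python
WINNER_COLOR = "#faeb86"  # highlight color used in the cell above, lowercased


def is_winner_style(style):
    """Return True if an inline style string mentions the winner highlight color."""
    if not style:
        return False
    # Remove whitespace and lowercase so 'background: #FAEB86;' also matches
    normalized = "".join(style.split()).lower()
    return WINNER_COLOR in normalized


print(is_winner_style("background:#FAEB86"))
print(is_winner_style(None))
```

Inside the loop it could be used as `winner = 'Yes' if is_winner_style(row.get('style')) else 'No'`.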

In [77]:
# Convert movie_info to a pandas DataFrame
movie_info_df = pd.DataFrame(movie_info)
movie_info_df.tail(20)

Unnamed: 0,Title,Awards_Year,Winner
591,Oppenheimer,2023,Yes
592,American Fiction,2023,No
593,Anatomy of a Fall,2023,No
594,Barbie,2023,No
595,The Holdovers,2023,No
596,Killers of the Flower Moon,2023,No
597,Maestro,2023,No
598,Past Lives,2023,No
599,Poor Things,2023,No
600,The Zone of Interest,2023,No


In [78]:
movie_info_df.to_csv('best_picture.csv', index=False)

2. Gather Movie Data via TMDB API  
    a. Set up the API    
    * Create a free [TMDB account](https://developer.themoviedb.org/docs/getting-started)  
    * Generate an API key and review their documentation, especially:  
        * /discover/movie  
        * /movie/{movie_id}  
        * /search/movie  
    b. Collect top movies (2015-2024)  
    For each year from 2015 to 2024:  
        * Query TMDB for the top 100 movies (by vote count).  
        * For each movie, gather:  
            * Title  
            * Release Year  
            * Genre(s)  
            * Vote Average  
            * Vote Count  
            * Budget  
            * Revenue  
            * TMDB ID  
        * Store all results in a single DataFrame and export to movies_2015_2024.csv.
        * Hint: TMDB rate limits are generous for free accounts, but you should pause between requests (e.g., time.sleep(0.25)). 
        * Some Oscar films may not appear in the top 100 by vote count. For any missing, use the /search/movie endpoint to add it.  
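
The requests described above can be planned before any network call is made. This is a minimal sketch, assuming 20 results per page (so five pages per year give the top 100) and assuming the query parameter names shown here match the TMDB documentation; `discover_params` and `search_params` are hypothetical helper names, and `"YOUR_API_KEY"` is a placeholder.

```python
def discover_params(year, page, api_key):
    """Query parameters for one page of /discover/movie, most-voted first."""
    return {
        "primary_release_year": year,
        "sort_by": "vote_count.desc",
        "page": page,
        "api_key": api_key,
    }


def search_params(title, year, api_key):
    """Query parameters for looking up one film by title with /search/movie."""
    return {"query": title, "primary_release_year": year, "api_key": api_key}


# One parameter set per (year, page) pair: 10 years x 5 pages
planned = [
    discover_params(year, page, "YOUR_API_KEY")
    for year in range(2015, 2025)
    for page in range(1, 6)
]
print(len(planned))  # 50
```

Each dictionary can be passed as `params=` to `requests.get`, with a short `time.sleep` between calls.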


In [9]:
with open('api_movies.json') as fi:
    credentials = json.load(fi)
    
key = credentials['key']

In [10]:
endpoint = "https://api.themoviedb.org/3/discover/movie"

movies = []

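for year in range(2015, 2025):
    # Each page returns 20 results, so pages 1-5 give the top 100 movies per year
    for page in range(1, 6):
        params = {
            'primary_release_year': year,
            'sort_by': 'vote_count.desc',
            'page': page,  # Without this, every request returns page 1
            'api_key': key
        }

        response = requests.get(endpoint, params=params)
        res = response.json()['results']

        for movie in res:
            movies.append(movie)

        # Pause briefly between requests
        time.sleep(0.25)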

In [54]:
#print(movies)

In [20]:
len(movies)

1000

In [21]:
movie_titles = []
for movie in movies:
    movie_titles.append(movie['title'])  

In [22]:
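import datetime

# Collect the raw release dates ('YYYY-MM-DD' strings)
release_year = []
for movie in movies:
    release_year.append(movie['release_date'])

# Convert each date string to its year once, after all dates have been collected
years = [datetime.datetime.strptime(date_str, "%Y-%m-%d").year for date_str in release_year]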
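
`strptime` raises an error on an empty or malformed date, which would stop the whole conversion. The following is a minimal sketch of a tolerant version, assuming a `release_date` could be missing or empty for some entries; `parse_release_year` is a hypothetical helper name.

```python
import datetime


def parse_release_year(date_str):
    """Return the year from a 'YYYY-MM-DD' string, or None if it cannot be parsed."""
    try:
        return datetime.datetime.strptime(date_str, "%Y-%m-%d").year
    except (TypeError, ValueError):
        return None


print(parse_release_year("2015-04-22"))  # 2015
print(parse_release_year(""))            # None
```

With this helper, the list could be built as `years = [parse_release_year(m.get('release_date')) for m in movies]`.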

In [30]:
movie_ids = []

for movie in movies:
    movie_ids.append(movie['id'])

# Test the movie details endpoint on the first movie only, to inspect the response structure
for movie_id in movie_ids[0:1]:
    endpoint = f'https://api.themoviedb.org/3/movie/{movie_id}'
    # Define params
    params = {
        'api_key': key,
    }

    # Get response
    response = requests.get(endpoint, params=params)
    res = response.json()
    print(res)

    time.sleep(0.25)

{'adult': False, 'backdrop_path': '/kIBK5SKwgqIIuRKhhWrJn3XkbPq.jpg', 'belongs_to_collection': {'id': 86311, 'name': 'The Avengers Collection', 'poster_path': '/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg', 'backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg'}, 'budget': 365000000, 'genres': [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}], 'homepage': 'https://www.marvel.com/movies/avengers-age-of-ultron', 'id': 99861, 'imdb_id': 'tt2395427', 'origin_country': ['US'], 'original_language': 'en', 'original_title': 'Avengers: Age of Ultron', 'overview': 'When Tony Stark tries to jumpstart a dormant peacekeeping program, things go awry and Earth’s Mightiest Heroes are put to the ultimate test as the fate of the planet hangs in the balance. As the villainous Ultron emerges, it is up to The Avengers to stop him from enacting his terrible plans, and soon uneasy alliances and unexpected action pave the way for an epic and unique global adventure.', 'p

In [31]:
res['genres']

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 878, 'name': 'Science Fiction'}]

In [32]:
len(movie_ids)

1000

In [45]:
movie_ids = []
genre_names = []
budgets = []
revenues = []

for movie in movies:
    movie_ids.append(movie['id'])

# Request the full details for each movie to get its genres, budget, and revenue
for movie_id in movie_ids:
    endpoint = f'https://api.themoviedb.org/3/movie/{movie_id}'
    # Define params
    params = {
        'api_key': key,
    }

    # Get response
    response = requests.get(endpoint, params=params)
    res = response.json()

    # Extract genre names, budget, and revenue
    genre_names.append([genre['name'] for genre in res['genres']])
    budgets.append(res['budget'])
    revenues.append(res['revenue'])

    # Sleep before the next API call
    time.sleep(0.25)
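
The loop above indexes `res['genres']`, `res['budget']`, and `res['revenue']` directly, so a response without those keys (for example, an error payload) would raise a `KeyError`. This is a minimal sketch of a more defensive extraction, assuming the response shape printed earlier in this notebook; `extract_details` and its output keys are hypothetical names.

```python
def extract_details(details):
    """Pick the fields of interest out of one /movie/{movie_id} response."""
    return {
        "tmdb_id": details.get("id"),
        "genres": [genre["name"] for genre in details.get("genres", [])],
        "budget": details.get("budget"),
        "revenue": details.get("revenue"),
    }


# Trimmed-down copy of the values shown for the first movie in this notebook
sample = {
    "id": 99861,
    "budget": 365000000,
    "revenue": 1405403694,
    "genres": [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"},
               {"id": 878, "name": "Science Fiction"}],
}
print(extract_details(sample))
print(extract_details({}))  # Missing fields become None or an empty list
```

Collecting one dictionary per movie also keeps the ID, genres, budget, and revenue aligned in a single record instead of four parallel lists.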

In [48]:
vote_avg = [d['vote_average'] for d in movies]
vote_avg[:5]

[7.271, 7.627, 7.91, 6.7, 7.69]

In [50]:
vote_ct = [d['vote_count'] for d in movies]
vote_ct[:10]

[23847, 23503, 22917, 21094, 20579, 20416, 20067, 18777, 17383, 14806]

In [82]:
movies_tmdb = pd.DataFrame(
    # naming the new columns and inputting the list data from collections
    {'movie_titles': movie_titles,
     'vote_avg': vote_avg,
     'vote_count': vote_ct,
     'movie_ids' : movie_ids,
     'genre' : genre_names,
     'Years' : years,
     'Budget': budgets,
     'Revenue': revenues
     
     })
movies_tmdb

Unnamed: 0,movie_titles,vote_avg,vote_count,movie_ids,genre,Years,Budget,Revenue
0,Avengers: Age of Ultron,7.271,23847,99861,"[Action, Adventure, Science Fiction]",2015,365000000,1405403694
1,Mad Max: Fury Road,7.627,23503,76341,"[Action, Adventure, Science Fiction]",2015,150000000,378858340
2,Inside Out,7.910,22917,150540,"[Animation, Family, Adventure, Drama, Comedy]",2015,175000000,857611174
3,Jurassic World,6.700,21094,135397,"[Action, Adventure, Science Fiction, Thriller]",2015,150000000,1671537444
4,The Martian,7.690,20579,286217,"[Drama, Adventure, Science Fiction]",2015,108000000,631058917
...,...,...,...,...,...,...,...,...
995,Nosferatu,6.693,3227,426063,"[Horror, Fantasy]",2024,50000000,181764515
996,Bad Boys: Ride or Die,7.346,3110,573435,"[Action, Comedy, Crime, Thriller, Adventure]",2024,100000000,404547819
997,A Quiet Place: Day One,6.685,3016,762441,"[Horror, Science Fiction, Thriller]",2024,67000000,261907653
998,Sonic the Hedgehog 3,7.600,3007,939243,"[Action, Science Fiction, Comedy, Family]",2024,122000000,492162604


In [83]:
movies_tmdb.to_csv('movies_2015_2024.csv', index=False)

In [84]:
best_picture = pd.read_csv('best_picture.csv')
oscar_winners_2015_2024 = best_picture[(best_picture['Awards_Year'] >= 2015) & (best_picture['Winner'] == 'Yes')]
oscar_winners_2015_2024

Unnamed: 0,Title,Awards_Year,Winner
520,Spotlight,2015,Yes
528,Moonlight,2016,Yes
537,The Shape of Water,2017,Yes
546,Green Book,2018,Yes
554,Parasite,2019,Yes
563,Nomadland,2020,Yes
571,CODA,2021,Yes
581,Everything Everywhere All at Once,2022,Yes
591,Oppenheimer,2023,Yes
601,Anora,2024,Yes


In [85]:
for title in oscar_winners_2015_2024['Title']:
    if title in movies_tmdb['movie_titles'].values:
        print(f'{title} is already in Dataframe.')
    else:
        print(f'{title} is Not in Dataframe.')

Spotlight is Not in Dataframe.
Moonlight is Not in Dataframe.
The Shape of Water is already in Dataframe.
Green Book is already in Dataframe.
Parasite is already in Dataframe.
Nomadland is Not in Dataframe.
CODA is Not in Dataframe.
Everything Everywhere All at Once is already in Dataframe.
Oppenheimer is already in Dataframe.
Anora is Not in Dataframe.
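
The check above can be turned into a list of titles that still need a `/search/movie` lookup. This is a minimal sketch, assuming a case-insensitive title comparison is good enough; `missing_titles` is a hypothetical helper, and the sample lists are only illustrative.

```python
def missing_titles(wanted, available):
    """Return the titles in `wanted` that have no case-insensitive match in `available`."""
    known = {title.strip().casefold() for title in available}
    return [title for title in wanted if title.strip().casefold() not in known]


winners = ["Spotlight", "Parasite", "CODA"]
collected = ["parasite", "Oppenheimer"]
print(missing_titles(winners, collected))  # ['Spotlight', 'CODA']
```

Both arguments accept any iterable of strings, so `oscar_winners_2015_2024['Title']` and `movies_tmdb['movie_titles']` can be passed directly. Matching on title alone can confuse different films that share a name, so comparing the year as well is worth considering.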


**Optional Extension: Actors and Actresses** 

1. Scrape Wikipedia for Best Actor and Best Actress Data
    * Scrape the following Wikipedia pages:  
        * [Best Actor](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actor)
        * [Best Actress](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress)
    * Each page contains tables of winners and nominees by year.
    * Extract the following columns:  
        * Year
        * Actor/Actress Name
        * Film Title
        * Winner (Yes/No)
    * Data cleaning tips:  
        * Remove footnote markers from names and movie titles.
        * Ensure that you save just the release year (e.g., 2009 instead of 2009 (82nd)).
        * Store the cleaned data as two csv files:  
            * best_actor.csv
            * best_actress.csv  
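
The year-cleaning tip above can be sketched with a regular expression. This is a minimal sketch, assuming the release year is the first four-digit number in the cell; `release_year_from_cell` is a hypothetical helper name.

```python
import re


def release_year_from_cell(cell_text):
    """Return the first four-digit year in a table cell as an int, or None."""
    match = re.search(r"\b(\d{4})\b", cell_text)
    return int(match.group(1)) if match else None


print(release_year_from_cell("2009 (82nd)"))    # 2009
print(release_year_from_cell("1927/28 (1st)"))  # 1927
```

For split years such as `1927/28` this keeps the first year, whereas the Best Picture code earlier in this notebook keeps the later one, so one convention should be chosen and applied to all files.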

2. Collect Actor and Actress Filmographies  
    Using the data from your actor and actress CSVs:  
    * Search TMDB for each recent performer (using /search/person). Note: you can start with 2015-2024 initially, but, if time allows, you can go back even further.
    * For each person, retrieve their movie credits using /person/{person_id}/movie_credits.  
    * Extract relevant fields for each movie, such as:  
        * Actor/Actress Name  
        * Movie Title  
        * Character Name (optional)  
        * Release Year  
        * Movie ID
    * Combine all filmographies into one file, actor_filmography.csv
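
The field extraction described above can be sketched as a function that flattens one credits response into rows. This is a minimal sketch, assuming the response has a `cast` list whose entries carry `id`, `title`, `character`, and `release_date`; `filmography_rows`, the output column names, and the sample data are all hypothetical.

```python
def filmography_rows(person_name, credits):
    """Flatten the `cast` list of a movie_credits response into one row per film."""
    rows = []
    for role in credits.get("cast", []):
        release_date = role.get("release_date") or ""
        year_part = release_date[:4]
        rows.append({
            "Name": person_name,
            "Movie_Title": role.get("title"),
            "Character": role.get("character"),
            # Keep only the year part of 'YYYY-MM-DD'; empty dates become None
            "Release_Year": int(year_part) if year_part.isdigit() else None,
            "Movie_ID": role.get("id"),
        })
    return rows


# Made-up response fragment, shaped like the fields this sketch assumes
sample = {"cast": [
    {"id": 1, "title": "Example Film", "character": "Lead", "release_date": "2016-05-01"},
    {"id": 2, "title": "Undated Film", "character": "", "release_date": ""},
]}
for row in filmography_rows("Example Performer", sample):
    print(row)
```

Rows from every performer can be gathered into one list, passed to `pd.DataFrame`, and saved with `to_csv('actor_filmography.csv', index=False)`.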