#### Question 2: Gathering Movie Data via TMDB API

* Set up the API
    * Create a free TMDB account
    * Generate an API key and review their documentation, especially:
        * /discover/movie: https://developer.themoviedb.org/reference/discover-movie
        * /movie/{movie_id}: https://developer.themoviedb.org/reference/movie-details
        * /search/movie: https://developer.themoviedb.org/reference/search-movie
* Collect top movies (2015-2024)
    * For each year from 2015 to 2024:
        * Query TMDB for the top 100 movies (by vote count).
        * For each movie, gather:
            * Title
            * Release Year
            * Genre(s)
            * Vote Average
            * Vote Count
            * Budget
            * Revenue
            * TMDB ID
* Store all results in a single DataFrame and export to movies_2015_2024.csv.
* Hint: TMDB rate limits are generous for free accounts, but you should pause between requests (eg. time.sleep(0.25)).
* Some Oscar films may not appear in the top 100 by vote count. For any missing, use the /search/movie endpoint to add it.

In [67]:
import requests
import json
import time
import pandas as pd

In [43]:
# Load API key from keys file
with open('keys.json') as fi:
    credentials = json.load(fi)

api_key = credentials['api_key']

In [44]:
endpoint = 'https://api.themoviedb.org/3/discover/movie'

movie_data = []

# Iterate through all years between 2015 and 2024
for year in range(2015,2025):
    
    # Each page contains 20 results, so we need to iterate through 5 pages to get 100 results
    for page in range(1,6):

        # Define params
        params = {
            'api_key': api_key,
            'primary_release_year': year,
            'sort_by': 'vote_count.desc',
            'page': page
        }
    
        # Get response
        response = requests.get(endpoint, params = params)
        res = response.json()['results']
        for movie in res:
            movie_data.append(movie)

        # Sleep before next API call 
        time.sleep(0.25)

In [76]:
movie_titles = []
release_years = []
genres = []
vote_averages = []
vote_counts = []
budgets = []
revenues = []
tmdb_ids = []

for movie in movie_data:
    movie_titles.append(movie['title'])
    release_years.append(movie['release_date'][:4])
    vote_averages.append(movie['vote_average'])
    vote_counts.append(movie['vote_count'])
    tmdb_ids.append(movie['id'])
    
for movie_id in tmdb_ids:
    # Use the movie ids to search for budget, revenue, and genre information
    endpoint = f'https://api.themoviedb.org/3/movie/{movie_id}'
    # Define params
    params = {
        'api_key': api_key,
    }
    # Get response
    response = requests.get(endpoint, params = params)
    res = response.json()
    # Extract budget, revenue, and genres
    budgets.append(res['budget'])
    revenues.append(res['revenue'])
    genres.append([genre['name'] for genre in res['genres']])
    # Sleep before next API call
    time.sleep(0.25)

In [96]:
# Convert movie data to a pandas DataFrame
tmdb_movie_data_df = pd.DataFrame({'Movie_Title': movie_titles, 'Release_Year': release_years, 'Genre': genres, 'Vote_Average': vote_averages, 'Vote_Count': vote_counts, 'Budget': budgets, 'Revenue': revenues, 'TMDB_ID': tmdb_ids})
tmdb_movie_data_df

Unnamed: 0,Movie_Title,Release_Year,Genre,Vote_Average,Vote_Count,Budget,Revenue,TMDB_ID
0,Avengers: Age of Ultron,2015,"[Action, Adventure, Science Fiction]",7.271,23837,365000000,1405403694,99861
1,Mad Max: Fury Road,2015,"[Action, Adventure, Science Fiction]",7.627,23495,150000000,378858340,76341
2,Inside Out,2015,"[Animation, Family, Adventure, Drama, Comedy]",7.900,22908,175000000,857611174,150540
3,Jurassic World,2015,"[Action, Adventure, Science Fiction, Thriller]",6.699,21092,150000000,1671537444,135397
4,The Martian,2015,"[Drama, Adventure, Science Fiction]",7.690,20571,108000000,631058917,286217
...,...,...,...,...,...,...,...,...
995,Miller's Girl,2024,"[Thriller, Drama, Romance]",6.348,880,4000000,1430985,1026436
996,Love Lies Bleeding,2024,"[Crime, Romance, Thriller, Drama]",6.572,856,10000000,12778225,948549
997,I'm Still Here,2024,"[Drama, History]",7.952,857,1480000,36109482,1000837
998,Arthur the King,2024,"[Adventure, Drama]",7.579,854,19000000,40829138,618588


In [93]:
# Import list of 2015 - 2024 Oscar winners from best_picture.csv
best_picture = pd.read_csv('../data/best_picture.csv', dtype={'Awards_Year': int})
oscar_winners = best_picture[(best_picture['Winner'] == 'Yes') & (best_picture['Awards_Year'].isin(range(2015,2025)))]['Title']

In [104]:
titles = []
years = []
genres = []
vote_averages = []
vote_counts = []
budgets = []
revenues = []
ids = []

# See if any Oscar winners are missing from the DataFrame. For any missing, use the /search/movie endpoint to add them
for winner in oscar_winners:
    if winner not in tmdb_movie_data_df['Movie_Title']:
        print(f'{winner} is not in the DataFrame! Adding it to the DataFrame now.')
        
        # Search the TMDB search-movie endpoint for the movie title
        endpoint = 'https://api.themoviedb.org/3/search/movie'
        
        # Define params
        params = {
            'api_key': api_key,
            'query': winner
        }
        
        # Get response
        response = requests.get(endpoint, params = params)
        res = response.json()['results'][0]
        
        # Collect movie info 
        titles.append(res['title'])
        years.append(res['release_date'][:4])
        ids.append(res['id'])
        vote_averages.append(res['vote_average'])
        vote_counts.append(res['vote_count'])

        # Use the movie id to search for budget, revenue, and genre information
        endpoint = f'https://api.themoviedb.org/3/movie/{res['id']}'
        
        # Define params
        params = {
            'api_key': api_key,
        }
        
        # Get response
        response = requests.get(endpoint, params = params)
        res = response.json()
        
        # Extract budget, revenue, and genres
        budgets.append(res['budget'])
        revenues.append(res['revenue'])
        genres.append([genre['name'] for genre in res['genres']])
        
        # Sleep before next API call
        time.sleep(0.25)

# Create a DataFrame of all the new rows 
new_rows = pd.DataFrame({'Movie_Title': titles, 'Release_Year': years, 'Genre': genres, 'Vote_Average': vote_averages, 'Vote_Count': vote_counts, 'Budget': budgets, 'Revenue': revenues, 'TMDB_ID': ids})
# Concatenate the new rows with the existing DataFrame
tmdb_movie_data_df = pd.concat([tmdb_movie_data_df, new_rows], ignore_index=True)

Spotlight is not in the DataFrame! Adding it to the DataFrame now.
Moonlight is not in the DataFrame! Adding it to the DataFrame now.
The Shape of Water is not in the DataFrame! Adding it to the DataFrame now.
Green Book is not in the DataFrame! Adding it to the DataFrame now.
Parasite is not in the DataFrame! Adding it to the DataFrame now.
Nomadland is not in the DataFrame! Adding it to the DataFrame now.
CODA is not in the DataFrame! Adding it to the DataFrame now.
West Side Story is not in the DataFrame! Adding it to the DataFrame now.
Everything Everywhere All at Once is not in the DataFrame! Adding it to the DataFrame now.
All Quiet on the Western Front is not in the DataFrame! Adding it to the DataFrame now.
Oppenheimer is not in the DataFrame! Adding it to the DataFrame now.
Anora is not in the DataFrame! Adding it to the DataFrame now.


In [105]:
tmdb_movie_data_df.tail(10)

Unnamed: 0,Movie_Title,Release_Year,Genre,Vote_Average,Vote_Count,Budget,Revenue,TMDB_ID
1002,The Shape of Water,2017,"[Drama, Fantasy, Romance]",7.2,12522,19500000,195300000,399055
1003,Green Book,2018,"[Drama, Comedy, History]",8.225,12363,23000000,321752656,490132
1004,Parasite,1982,"[Horror, Science Fiction]",4.8,81,800000,7000000,48311
1005,Nomadland,2021,[Drama],7.183,3190,5000000,39458207,581734
1006,CODA,2021,"[Drama, Music, Romance]",7.903,2425,10000000,1905058,776503
1007,West Side Story,2021,"[Drama, Romance, Crime]",6.956,1689,100000000,76016171,511809
1008,Everything Everywhere All at Once,2022,"[Action, Adventure, Science Fiction]",7.73,7430,25000000,139200000,545611
1009,All Quiet on the Western Front,2022,"[War, History, Drama]",7.722,4370,20000000,0,49046
1010,Oppenheimer,2023,"[Drama, History]",8.046,10929,100000000,952000000,872585
1011,Anora,2024,"[Drama, Comedy, Romance]",7.054,2722,6000000,56286295,1064213
