#### Question 2: Gathering Movie Data via TMDB API

* Set up the API
    * Create a free TMDB account
    * Generate an API key and review their documentation, especially:
        * /discover/movie: https://developer.themoviedb.org/reference/discover-movie
        * /movie/{movie_id}: https://developer.themoviedb.org/reference/movie-details
        * /search/movie: https://developer.themoviedb.org/reference/search-movie
* Collect top movies (2015-2024)
    * For each year from 2015 to 2024:
        * Query TMDB for the top 100 movies (by vote count).
        * For each movie, gather:
            * Title
            * Release Year
            * Genre(s)
            * Vote Average
            * Vote Count
            * Budget
            * Revenue
            * TMDB ID
* Store all results in a single DataFrame and export to movies_2015_2024.csv.
* Hint: TMDB rate limits are generous for free accounts, but you should still pause between requests (e.g., time.sleep(0.25)).
* Some Oscar films may not appear in the top 100 by vote count. For any that are missing, use the /search/movie endpoint to add them.

In [1]:
import requests
import json
import time
import pandas as pd
import datetime
from pathlib import Path 

In [2]:
# Load API key from keys file
with open('keys.json') as fi:
    credentials = json.load(fi)

api_key = credentials['api_key']

In [3]:
endpoint = 'https://api.themoviedb.org/3/discover/movie'

movie_titles = []
release_dates = []
vote_averages = []
vote_counts = []
tmdb_ids = []

# Iterate through all years between 2015 and 2024
for year in range(2015, 2025):

    # Each page contains 20 results, so we need to iterate through 5 pages to get 100 results
    for page in range(1, 6):

        # Define params
        params = {
            'api_key': api_key,
            'primary_release_year': year,
            'sort_by': 'vote_count.desc',
            'page': page
        }
    
        # Get response and fail loudly on HTTP errors (e.g. an invalid API key)
        response = requests.get(endpoint, params=params, timeout=10)
        response.raise_for_status()
        res = response.json()['results']

        # Extract movie title, release date, vote average, and vote count for each movie
        for movie in res:
            movie_titles.append(movie['title'])
            release_dates.append(movie['release_date'])
            vote_averages.append(movie['vote_average'])
            vote_counts.append(movie['vote_count'])
            tmdb_ids.append(movie['id'])

        # Sleep before next API call 
        time.sleep(0.25)

# Extract just the years from the release dates (None if TMDB returned an empty date)
years = [datetime.datetime.strptime(date_str, "%Y-%m-%d").year if date_str else None
         for date_str in release_dates]

In [4]:
budgets = []
revenues = []
genres = [] 

# Use the movie IDs to look up budget, revenue, and genre information via /movie/{movie_id}
for movie_id in tmdb_ids:
    
    endpoint = f'https://api.themoviedb.org/3/movie/{movie_id}'
    
    # Define params
    params = {
        'api_key': api_key,
    }
    
    # Get response and fail loudly on HTTP errors
    response = requests.get(endpoint, params=params, timeout=10)
    response.raise_for_status()
    res = response.json()
    
    # Extract budget, revenue, and genres
    budgets.append(res['budget'])
    revenues.append(res['revenue'])
    genres.append([genre['name'] for genre in res['genres']])
    
    # Sleep before next API call
    time.sleep(0.25)
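
The details loop makes about 1,000 requests, so a single rate-limited or failed call can derail the whole run. One possible way to harden it is sketched below. The helper names `fetch_json` and `extract_details` are hypothetical, `session` is assumed to be any object with a requests-style `get` method (such as `requests.Session()`), and the sketch assumes that a rate-limited response uses HTTP status 429 with an optional `Retry-After` header given in seconds.

```python
import time


def fetch_json(session, url, params, max_retries=3, pause=0.25):
    """GET `url` and return the decoded JSON body, retrying on HTTP 429.

    Hypothetical helper: `session` is anything with a requests-style
    `get` method, for example `requests.Session()`.
    """
    for attempt in range(max_retries):
        response = session.get(url, params=params, timeout=10)
        if response.status_code == 429:
            # Rate limited: use the server-suggested delay if present
            # (assumed to be in seconds), otherwise back off a bit more
            # on each attempt.
            delay = float(response.headers.get("Retry-After", pause * (attempt + 1)))
            time.sleep(delay)
            continue
        # Surface 401/404/5xx here instead of a KeyError further down
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")


def extract_details(details):
    """Pull budget, revenue, and genre names out of a /movie/{movie_id} payload."""
    return {
        "Budget": details.get("budget"),
        "Revenue": details.get("revenue"),
        "Genre": [genre["name"] for genre in details.get("genres", [])],
    }


# Example usage (requires network access and a valid key):
#   import requests
#   session = requests.Session()
#   url = f"https://api.themoviedb.org/3/movie/{movie_id}"
#   row = extract_details(fetch_json(session, url, {"api_key": api_key}))
```

Passing the session in as an argument keeps the helper easy to exercise with a stand-in object, and using `.get` means a payload with a missing field yields `None` or an empty list rather than an exception.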

In [9]:
# Convert movie data to a pandas DataFrame
tmdb_movie_data_df = pd.DataFrame({'Movie_Title': movie_titles, 
                                   'Release_Year': years, 
                                   'Genre': genres, 
                                   'Vote_Average': vote_averages, 
                                   'Vote_Count': vote_counts, 
                                   'Budget': budgets, 
                                   'Revenue': revenues, 
                                   'TMDB_ID': tmdb_ids})

In [10]:
# Import list of 2015 - 2024 Oscar winners from best_picture.csv
best_picture = pd.read_csv('../data/best_picture.csv')
oscar_winners_2015_2024 = best_picture[(best_picture['Awards_Year'] >= 2015) & (best_picture['Winner'] == 'Yes')]

In [13]:
# See if any of the 2015 - 2024 Oscar winners are missing from the TMDB movie data
for title in oscar_winners_2015_2024['Title']:
    if title not in tmdb_movie_data_df['Movie_Title'].values:
        print(f'{title} is NOT in the tmdb_movie_data_df DataFrame!')
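
The loop above only reports missing winners; the assignment also asks for them to be added via /search/movie. A minimal sketch of that lookup follows. The function name `find_missing_movie` is hypothetical, and `session` is assumed to be a requests-style object such as `requests.Session()`. The `year` filter is left optional because an awards year in best_picture.csv may not equal the film's release year.

```python
def find_missing_movie(session, api_key, title, year=None):
    """Return a row dict for the best /search/movie hit for `title`, or None."""
    params = {"api_key": api_key, "query": title}
    if year is not None:
        params["primary_release_year"] = year

    response = session.get(
        "https://api.themoviedb.org/3/search/movie", params=params, timeout=10
    )
    response.raise_for_status()
    results = response.json().get("results", [])

    # Prefer an exact (case-insensitive) title match; otherwise take the first hit
    exact = [m for m in results if m.get("title", "").lower() == title.lower()]
    candidates = exact or results
    if not candidates:
        return None

    best = candidates[0]
    release_date = best.get("release_date") or ""
    return {
        "Movie_Title": best.get("title"),
        "Release_Year": int(release_date[:4]) if release_date else None,
        "Vote_Average": best.get("vote_average"),
        "Vote_Count": best.get("vote_count"),
        "TMDB_ID": best.get("id"),
    }


# Example usage (requires network access and a valid key):
#   import requests
#   row = find_missing_movie(requests.Session(), api_key, "CODA")
```

Search results do not include budget, revenue, or genre names, so each recovered `TMDB_ID` still needs a /movie/{movie_id} call before the row is appended to `tmdb_movie_data_df`.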

In [8]:
# Write the tmdb_movie_data_df DataFrame to a CSV file in the data folder
filepath = Path('../data/movies_2015_2024.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
# index=False keeps the DataFrame's row index out of the file
tmdb_movie_data_df.to_csv(filepath, index=False)
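
Because the `Genre` column holds Python lists, `to_csv` writes each one as its string form (for example `"['Action', 'Drama']"`), and reading the file back yields strings rather than lists. The sketch below shows one way to undo that on load. The helper name `parse_genres` is hypothetical.

```python
import ast


def parse_genres(cell):
    """Turn a stringified list such as "['Action', 'Drama']" back into a list."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    # literal_eval only evaluates Python literals, so it is safe for file data
    return ast.literal_eval(cell)


# Example usage when reloading the export:
#   movies = pd.read_csv('../data/movies_2015_2024.csv',
#                        converters={'Genre': parse_genres})
```

An alternative is to store genres as a single delimited string (such as `'Action|Drama'`) before exporting, which is easier to consume from tools other than Python.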