#### Question 2: Gathering Movie Data via the TMDB API

* Set up the API
    * Create a free TMDB account
    * Generate an API key and review their documentation, especially:
        * /discover/movie: https://developer.themoviedb.org/reference/discover-movie
        * /movie/{movie_id}: https://developer.themoviedb.org/reference/movie-details
        * /search/movie: https://developer.themoviedb.org/reference/search-movie
* Collect top movies (2015-2024)
    * For each year from 2015 to 2024:
        * Query TMDB for the top 100 movies (by vote count).
        * For each movie, gather:
            * Title
            * Release Year
            * Genre(s)
            * Vote Average
            * Vote Count
            * Budget
            * Revenue
            * TMDB ID
* Store all results in a single DataFrame and export to movies_2015_2024.csv.
* Hint: TMDB rate limits are generous for free accounts, but you should still pause between requests (e.g., time.sleep(0.25)).
* Some Oscar films may not appear in the top 100 by vote count. For any that are missing, use the /search/movie endpoint to add them.
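
The pause suggested in the hint can be wrapped in a small helper. The sketch below is one possible approach and not part of the assignment: `fetch_with_pause` and its parameters are hypothetical names, and it assumes the server signals throttling with HTTP status 429. The `fetch` argument is any zero-argument callable that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_pause(fetch, pause=0.25, max_retries=3, sleep=time.sleep):
    """Call fetch(), pausing afterwards and retrying while throttled.

    fetch: zero-argument callable returning a response-like object
           with a status_code attribute (hypothetical interface).
    Assumes throttling is reported as HTTP 429.
    """
    for attempt in range(max_retries + 1):
        response = fetch()
        if response.status_code != 429:
            # Normal case: pause briefly so the next call is spaced out
            sleep(pause)
            return response
        # Throttled: wait longer on each attempt (pause, 2*pause, 4*pause, ...)
        sleep(pause * 2 ** attempt)
    raise RuntimeError(f'Still throttled after {max_retries} retries')
```

With requests, a call might look like `fetch_with_pause(lambda: requests.get(endpoint, params=params, timeout=10))`.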

In [28]:
import requests
import json
import time
import pandas as pd
import datetime
from pathlib import Path 

In [17]:
# Load API key from keys file
with open('keys.json') as fi:
    credentials = json.load(fi)

api_key = credentials['api_key']

In [18]:
endpoint = 'https://api.themoviedb.org/3/discover/movie'

movie_titles = []
release_dates = []
vote_averages = []
vote_counts = []
tmdb_ids = []

# Iterate through all years between 2015 and 2024
for year in range(2015,2025):
    
    # Each page contains 20 results, so we need to iterate through 5 pages to get 100 results
    for page in range(1,6):

        # Define params
        params = {
            'api_key': api_key,
            'primary_release_year': year,
            'sort_by': 'vote_count.desc',
            'page': page
        }
    
        # Get response; fail fast on HTTP errors instead of a KeyError on 'results'
        response = requests.get(endpoint, params=params, timeout=10)
        response.raise_for_status()
        res = response.json()['results']
        for movie in res:
            movie_titles.append(movie['title'])
            release_dates.append(movie['release_date'])
            vote_averages.append(movie['vote_average'])
            vote_counts.append(movie['vote_count'])
            tmdb_ids.append(movie['id'])

        # Sleep before next API call 
        time.sleep(0.25)

years = [datetime.datetime.strptime(date_str, "%Y-%m-%d").year for date_str in release_dates]
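
The year extraction above relies on every `release_date` being a well-formed `YYYY-MM-DD` string. The run shown here did not hit a problem, but a more defensive variant is sketched below, assuming a date could occasionally be empty or malformed. `release_year` is a hypothetical helper name.

```python
import datetime


def release_year(date_str):
    """Return the year from a 'YYYY-MM-DD' string, or None if missing or malformed."""
    if not date_str:
        return None
    try:
        return datetime.datetime.strptime(date_str, '%Y-%m-%d').year
    except ValueError:
        return None


print(release_year('2015-04-22'))  # 2015
print(release_year(''))            # None
```

It could replace the list comprehension as `years = [release_year(d) for d in release_dates]`, leaving `None` where no year is available.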

In [21]:
budgets = []
revenues = []
genres = [] 
    
for movie_id in tmdb_ids:
    # Use the movie ids to search for budget, revenue, and genre information
    endpoint = f'https://api.themoviedb.org/3/movie/{movie_id}'
    # Define params
    params = {
        'api_key': api_key,
    }
    # Get response; fail fast on HTTP errors
    response = requests.get(endpoint, params=params, timeout=10)
    response.raise_for_status()
    res = response.json()
    # Extract budget, revenue, and genres
    budgets.append(res['budget'])
    revenues.append(res['revenue'])
    genres.append([genre['name'] for genre in res['genres']])
    # Sleep before next API call
    time.sleep(0.25)
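
The extraction step can also be written so that a missing key does not stop the loop. The sketch below uses a hypothetical helper, `extract_details`, and assumes that a budget or revenue of 0 means "not reported" rather than a true zero. That assumption is worth checking against the TMDB documentation before relying on it.

```python
def extract_details(res):
    """Pull budget, revenue, and genre names from a /movie/{movie_id} response dict."""
    # Assumption: 0 means "not reported", so it is mapped to None
    budget = res.get('budget') or None
    revenue = res.get('revenue') or None
    # Tolerate a missing or null 'genres' field
    genres = [genre['name'] for genre in (res.get('genres') or [])]
    return budget, revenue, genres


# Example using values from the Mad Max: Fury Road row shown in this notebook
sample = {
    'budget': 150000000,
    'revenue': 378858340,
    'genres': [{'name': 'Action'}, {'name': 'Adventure'}, {'name': 'Science Fiction'}],
}
print(extract_details(sample))
```

Mapping unknown amounts to `None` keeps them out of later averages, where a literal 0 would drag the result down.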

In [22]:
# Convert movie data to a pandas DataFrame
tmdb_movie_data_df = pd.DataFrame({
    'Movie_Title': movie_titles,
    'Release_Year': years,
    'Genre': genres,
    'Vote_Average': vote_averages,
    'Vote_Count': vote_counts,
    'Budget': budgets,
    'Revenue': revenues,
    'TMDB_ID': tmdb_ids,
})
tmdb_movie_data_df

,Movie_Title,Release_Year,Genre,Vote_Average,Vote_Count,Budget,Revenue,TMDB_ID
0,Avengers: Age of Ultron,2015,"[Action, Adventure, Science Fiction]",7.271,23847,365000000,1405403694,99861
1,Mad Max: Fury Road,2015,"[Action, Adventure, Science Fiction]",7.627,23503,150000000,378858340,76341
2,Inside Out,2015,"[Animation, Family, Adventure, Drama, Comedy]",7.910,22917,175000000,857611174,150540
3,Jurassic World,2015,"[Action, Adventure, Science Fiction, Thriller]",6.700,21094,150000000,1671537444,135397
4,The Martian,2015,"[Drama, Adventure, Science Fiction]",7.690,20579,108000000,631058917,286217
...,...,...,...,...,...,...,...,...
995,Miller's Girl,2024,"[Thriller, Drama, Romance]",6.348,880,4000000,1430985,1026436
996,I'm Still Here,2024,"[Drama, History]",7.951,861,1480000,36109482,1000837
997,Love Lies Bleeding,2024,"[Crime, Romance, Thriller, Drama]",6.572,856,10000000,12778225,948549
998,Arthur the King,2024,"[Adventure, Drama]",7.577,855,19000000,40829138,618588


In [23]:
# Import list of 2015 - 2024 Oscar winners from best_picture.csv
best_picture = pd.read_csv('../data/best_picture.csv', dtype={'Awards_Year': int})
oscar_winners = best_picture[(best_picture['Winner'] == 'Yes') & (best_picture['Awards_Year'].isin(range(2015,2025)))]['Title']
oscar_winners

520                            Spotlight
528                            Moonlight
537                   The Shape of Water
546                           Green Book
554                             Parasite
563                            Nomadland
571                                 CODA
580                      West Side Story
581    Everything Everywhere All at Once
582       All Quiet on the Western Front
591                          Oppenheimer
601                                Anora
Name: Title, dtype: object

In [26]:
for winner in oscar_winners:
    if winner not in tmdb_movie_data_df['Movie_Title'].values:
        print(f'{winner} is not in the DataFrame!')
    else:
        print(f'{winner} is in the DataFrame already.')

Spotlight is in the DataFrame already.
Moonlight is in the DataFrame already.
The Shape of Water is in the DataFrame already.
Green Book is in the DataFrame already.
Parasite is in the DataFrame already.
Nomadland is in the DataFrame already.
CODA is in the DataFrame already.
West Side Story is in the DataFrame already.
Everything Everywhere All at Once is in the DataFrame already.
All Quiet on the Western Front is in the DataFrame already.
Oppenheimer is in the DataFrame already.
Anora is in the DataFrame already.
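
Every winner was already present in this run, so the /search/movie fallback was not needed. Had a title been missing, a lookup along the following lines could have been used. This is a minimal sketch: `search_movie` and its `http_get` parameter are hypothetical names, and taking the first result as the match is an assumption that should be checked by eye, since different films can share a title.

```python
SEARCH_ENDPOINT = 'https://api.themoviedb.org/3/search/movie'


def search_movie(title, api_key, year=None, http_get=None):
    """Return the first /search/movie result for title, or None if nothing matched."""
    if http_get is None:
        # Imported lazily so a stand-in can be supplied when working offline
        import requests
        http_get = requests.get

    params = {'api_key': api_key, 'query': title}
    if year is not None:
        # Narrow the search to films first released in the given year
        params['primary_release_year'] = year

    response = http_get(SEARCH_ENDPOINT, params=params, timeout=10)
    response.raise_for_status()
    results = response.json().get('results', [])
    # Assumption: the first result is the intended film
    return results[0] if results else None
```

The returned dictionary includes an `id` field, which can be passed to /movie/{movie_id} to fetch budget, revenue, and genres in the same way as for the other rows.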


In [29]:
# Write the tmdb_movie_data_df DataFrame to a csv file in the data folder
filepath = Path('../data/movies_2015_2024.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
# The row index is written as an unnamed first column; pass index=False to omit it
tmdb_movie_data_df.to_csv(filepath)
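
One consequence of exporting to CSV is that the Genre column, which holds Python lists, is stored as text such as `"['Drama', 'History']"`. When the file is read back, those cells are plain strings. The sketch below shows one way to restore them, using a hypothetical helper named `parse_genres`.

```python
import ast


def parse_genres(cell):
    """Turn a stringified list such as "['Drama', 'History']" back into a list."""
    if isinstance(cell, list):
        return cell
    if not isinstance(cell, str):
        # Covers missing values such as NaN
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


print(parse_genres("['Drama', 'History']"))  # ['Drama', 'History']
```

After `pd.read_csv`, it could be applied with `df['Genre'] = df['Genre'].apply(parse_genres)`.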