# Part 1: Data Gathering
<b>Gather Movie Data via TMDB API.</b>
<br>
<br>
a. Set up the API
<br>Create a free TMDB account.
<br>Generate an API key and review the TMDB documentation, especially:
* /discover/movie
* /movie/{movie_id}
* /search/movie

b. Collect top movies (2015-2024)
<br>For each year from 2015 to 2024: Query TMDB for the top 100 movies (by vote count).
<br>For each movie, gather:
* Title
* Release Year
* Genre(s)
* Vote Average
* Vote Count
* Budget
* Revenue
* TMDB ID
<br>Store all results in a single DataFrame and export to movies_2015_2024.csv.
<br>Hint: TMDB rate limits are generous for free accounts, but you should still pause between requests (e.g. time.sleep(0.25)).
<br>Some Oscar films may not appear in the top 100 by vote count. For any that are missing, use the /search/movie endpoint to add them.
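
The fallback for missing Oscar films could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. `pick_search_match` and `search_movie` are hypothetical helper names, and `http_get` is assumed to be a callable that takes the same keyword arguments as `requests.get`. The matching rule is also an assumption: same release year, exact title preferred, highest vote count as the tie-breaker. `query` and `primary_release_year` are query parameters of /search/movie.

```python
def pick_search_match(results, title, year):
    """Return the most plausible match from /search/movie results, or None."""
    wanted = title.strip().casefold()

    # Keep only results released in the requested year.
    candidates = [
        movie for movie in results
        if str(movie.get("release_date") or "").startswith(str(year))
    ]

    # Prefer exact (case-insensitive) title matches when there are any.
    exact = [
        movie for movie in candidates
        if (movie.get("title") or "").strip().casefold() == wanted
    ]
    pool = exact or candidates

    # Break ties by vote count; an empty pool yields None.
    return max(pool, key=lambda movie: movie.get("vote_count") or 0, default=None)


def search_movie(title, year, http_get, base_url, headers):
    """Query /search/movie and return the best match (hypothetical helper)."""
    params = {"query": title, "primary_release_year": year}
    response = http_get(url=base_url + "/search/movie", headers=headers, params=params)
    response.raise_for_status()
    return pick_search_match(response.json().get("results", []), title, year)
```

With the variables defined in the cells below, a call might look like `search_movie(title, year, requests.get, base_url, headers)`. A `None` result means the film needs a manual lookup.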

In [1]:
# IMPORT BUILT-IN LIBRARIES
import ast
import json
import re
import time

# IMPORT 3RD-PARTY LIBRARIES
import pandas as pd
import requests

In [2]:
# SET BASE URL
base_url = "https://api.themoviedb.org/3"

# SET ENDPOINTS
auth_endpoint = "/authentication"
discover_endpoint = "/discover/movie"
movie_endpoint = "/movie" # {movie_id}
search_endpoint = "/search/movie"

# FETCH API KEY
with open("../config/api_key.txt") as file:
    api_key = ast.literal_eval(file.read())

# SET HEADERS
headers = {
    "Authorization": f"Bearer {api_key['access_token']}",
    "accept": "application/json"
}

# SEND REQUEST TO VALIDATE AUTHENTICATION
response = requests.get(url=base_url+auth_endpoint, headers=headers)

print(response.text)

{"success":true}


In [3]:
# INITIALIZE MOVIE DICT
movie_dict = {}

# LOOP THE DESIRED RANGE OF YEARS
for year in range(2015, 2025):
    print(f"YEAR: {year}")

    # INITIALIZE MOVIE LIST
    movie_list = []
    
    # ITERATE THE PAGES TO GET 100 MOVIES PER YEAR
    for page_number in range(1, 10):
        
        print(f"LIST LENGTH: {len(movie_list)}")
        
        # IF THE LENGTH OF THE MOVIE LIST IS LESS THAN 100, GET NEXT PAGE
        if len(movie_list) < 100:

            print(f"PAGE: {page_number}")
            
            # SET DISCOVER PARAMS
            params = {
                "sort_by": "vote_count.desc",
                "primary_release_year": year,
                "page": page_number
            }
            
            # SEND REQUEST TO DISCOVER ENDPOINT
            response = requests.get(url=base_url+discover_endpoint, headers=headers, params=params)

            # IF SUCCESSFUL REQUEST...
            if response.status_code == 200:

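                # APPEND RESULTS TO MOVIE LIST
                json_body = response.json()
                movie_list.extend(json_body.get("results", []))

            # IF FAILED REQUEST...
            else:

                # RAISE EXCEPTION WITH THE STATUS CODE AND RESPONSE BODY
                raise RuntimeError(f"TMDB request failed ({response.status_code}): {response.text}")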

            # WAIT BETWEEN REQUESTS
            time.sleep(0.25)

        # IF THE LENGTH OF THE MOVIE LIST IS 100 OR GREATER, STOP REQUESTING PAGES
        else:
            break

    # STORE AT MOST 100 MOVIES FOR THE YEAR, EVEN IF THE PAGE LIMIT WAS REACHED FIRST
    movie_dict[year] = movie_list[:100]

YEAR: 2015
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH: 80
PAGE: 5
LIST LENGTH: 100
YEAR: 2016
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH: 80
PAGE: 5
LIST LENGTH: 100
YEAR: 2017
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH: 80
PAGE: 5
LIST LENGTH: 100
YEAR: 2018
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH: 80
PAGE: 5
LIST LENGTH: 100
YEAR: 2019
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH: 80
PAGE: 5
LIST LENGTH: 100
YEAR: 2020
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH: 80
PAGE: 5
LIST LENGTH: 100
YEAR: 2021
LIST LENGTH: 0
PAGE: 1
LIST LENGTH: 20
PAGE: 2
LIST LENGTH: 40
PAGE: 3
LIST LENGTH: 60
PAGE: 4
LIST LENGTH:
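
The paging logic in the cell above can also be written as a small reusable function. This is a minimal sketch under stated assumptions: `collect_top_movies` and `fetch_page` are hypothetical names, and `fetch_page(year, page_number)` is assumed to return the `results` list of one /discover/movie page. The sketch stops as soon as the target is reached or a page comes back empty, and it trims the list to exactly `target` entries.

```python
import time


def collect_top_movies(fetch_page, year, target=100, max_pages=10, pause=0.25):
    """Collect up to `target` movies for one year, page by page."""
    movies = []
    for page_number in range(1, max_pages + 1):
        results = fetch_page(year, page_number)

        # An empty page means there is nothing further to fetch.
        if not results:
            break
        movies.extend(results)

        # Stop once enough movies have been collected.
        if len(movies) >= target:
            break

        # Pause between requests to stay well within the rate limit.
        time.sleep(pause)

    # Trim in case the last page overshot the target.
    return movies[:target]
```

Passing the page fetcher in as an argument keeps the paging logic independent of `requests`, so it can be exercised with a stub before any real API call is made.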

In [57]:
# CREATE A DATAFRAME FROM THE MOVIE DICT
movie_df = (
    pd.DataFrame(data=list(movie_dict.items()), columns=['Year', 'Data'])
    .explode("Data")
    .reset_index(drop=True)
)

In [63]:
# CREATE A DATAFRAME FROM THE DATA (ONE COLUMN PER KEY IN EACH MOVIE DICT)
data_df = pd.DataFrame(data=movie_df['Data'].tolist())

In [73]:
# MERGE DATAFRAMES
movie_df = movie_df.merge(right=data_df, left_index=True, right_index=True)

In [75]:
movie_df.head(2)

Unnamed: 0,Year,Data,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,2015,"{'adult': False, 'backdrop_path': '/kIBK5SKwgq...",False,/kIBK5SKwgqIIuRKhhWrJn3XkbPq.jpg,"[28, 12, 878]",99861,en,Avengers: Age of Ultron,When Tony Stark tries to jumpstart a dormant p...,11.8411,/4ssDuvEDkSArWEdyBl2X5EHvYKU.jpg,2015-04-22,Avengers: Age of Ultron,False,7.271,23846
1,2015,"{'adult': False, 'backdrop_path': '/gqrnQA6Xpp...",False,/gqrnQA6Xppdl8vIb2eJc58VC1tW.jpg,"[28, 12, 878]",76341,en,Mad Max: Fury Road,An apocalyptic story set in the furthest reach...,10.6392,/hA2ple9q4qnwxp3hKVNhroipsir.jpg,2015-05-13,Mad Max: Fury Road,False,7.627,23502


In [47]:
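# FIELDS TO KEEP FOR THE FINAL DATASET
# title
# release_year (DERIVED FROM release_date)
# genre_ids
# vote_average
# vote_count
# budget (NOT IN THE /discover/movie RESULTS; REQUIRES /movie/{movie_id})
# revenue (NOT IN THE /discover/movie RESULTS; REQUIRES /movie/{movie_id})
# id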
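
The /discover/movie results above contain no budget or revenue, so those two fields have to come from /movie/{movie_id}. The following is a minimal sketch of that step, not a definitive implementation. `extract_movie_fields` and `fetch_movie_records` are hypothetical helper names, `http_get` is assumed to take the same keyword arguments as `requests.get`, and the output column names are illustrative.

```python
import time


def extract_movie_fields(details):
    """Reduce a /movie/{movie_id} response to the eight fields listed above."""
    release_date = details.get("release_date") or ""
    year_text = release_date[:4]
    return {
        "title": details.get("title"),
        # release_date is a "YYYY-MM-DD" string; keep only the year.
        "release_year": int(year_text) if year_text.isdigit() else None,
        # The details response lists genres as {"id": ..., "name": ...} objects.
        "genres": ", ".join(genre["name"] for genre in details.get("genres") or []),
        "vote_average": details.get("vote_average"),
        "vote_count": details.get("vote_count"),
        "budget": details.get("budget"),
        "revenue": details.get("revenue"),
        "tmdb_id": details.get("id"),
    }


def fetch_movie_records(movie_ids, http_get, base_url, headers, pause=0.25):
    """Fetch details for each TMDB ID and return one flat record per movie."""
    records = []
    for movie_id in movie_ids:
        response = http_get(url=f"{base_url}/movie/{movie_id}", headers=headers)
        response.raise_for_status()
        records.append(extract_movie_fields(response.json()))

        # Pause between requests, as suggested in the task hint.
        time.sleep(pause)
    return records
```

With the notebook's variables, `records = fetch_movie_records(movie_df['id'], requests.get, base_url, headers)` followed by `pd.DataFrame(records).to_csv("movies_2015_2024.csv", index=False)` would be one way to produce the export described in the task.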