## Description

#### Purpose: To obtain TMDB IDs for all movies on TMDB with release dates from 2010 up to (but not including) 2023.

#### Output: `2.1.1_TMDB_IDs_2010_2023.csv`

This notebook contains the functions needed to retrieve TMDB IDs for all movies on TMDB released in a given year. The code first retrieves the total number of pages returned by TMDB's "Discover" query, which finds films matching parameters such as language and release year; here the query is restricted to non-adult movies whose original language is English. To avoid timeout issues, the code halves the total page count and rounds up. It then requests that many pages sorted by release date in ascending order and the same number in descending order, so each pass covers (the ceiling of) half of the year's pages. Because the two passes can overlap, for example on the middle page when the page count is odd, the two ID lists are combined into a set to remove duplicates. This process is repeated for every desired year, and the resulting list of TMDB IDs is written to a CSV file.
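
The page-splitting and de-duplication logic can be illustrated offline with toy data. The sketch below is only an illustration: the helper names `ceil_half` and `merge_ids`, the page size of 2, and the ten made-up IDs are assumptions for the example, not part of the notebook or of the TMDB API.

```python
def ceil_half(total_pages):
    """Ceiling of total_pages / 2 using integer arithmetic only."""
    return -(total_pages // -2)


def merge_ids(ids_asc, ids_desc):
    """Union of the two ID lists with duplicates removed (sorted for display)."""
    return sorted(set(ids_asc) | set(ids_desc))


# Hypothetical year with 5 pages of 2 results each, IDs 1..10 in release order
all_ids = list(range(1, 11))
page_size = 2
total_pages = 5

half = ceil_half(total_pages)                  # 3 pages per pass
ids_asc = all_ids[: half * page_size]          # first 3 pages, ascending
ids_desc = all_ids[::-1][: half * page_size]   # first 3 pages, descending

overlap = sorted(set(ids_asc) & set(ids_desc))  # IDs seen in both passes
merged = merge_ids(ids_asc, ids_desc)

print(overlap)  # [5, 6]
print(merged)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With an odd page count, the middle page is fetched by both passes, which is why the union step is needed.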

Movies released in 2013-2023 will make up the dataset used to build the model. Movies released in 2010-2012 will be used for feature engineering within that dataset.

In [1]:
# Install Library for TMDB API
#!pip install tmdbv3api

In [2]:
from tmdbv3api import TMDb
from tmdbv3api import Movie
from tmdbv3api.exceptions import TMDbException
import random
import pandas as pd
import matplotlib.pyplot as plt
import requests
import json
tmdb = TMDb()
tmdb.api_key = ' '  # API key redacted

In [3]:
# Function to get the total number of pages and the total number of results that TMDB reports
# Filtering to non-adult, English-language movies with primary release year in {year}
def get_number_pages_and_results_year(year):
    url = f"https://api.themoviedb.org/3/discover/movie?include_adult=false&include_video=false&page=1&language=en-US&primary_release_year={year}&sort_by=primary_release_date.asc&with_original_language=en"

    headers = {
        "accept": "application/json",
        "Authorization": "Bearer ' '"
            ## token redacted
    }

    response = requests.get(url, headers=headers)
    page = json.loads(response.text)

    pages = page["total_pages"]
    results = page["total_results"]

    return pages, results

In [4]:
# Function to get TMDB pages in descending order of release date
def get_desc_by_year(year, num_pages):
    data = []
    for i in range(num_pages):
        url = f"https://api.themoviedb.org/3/discover/movie?include_adult=false&include_video=false&page={i+1}&language=en-US&primary_release_year={year}&sort_by=primary_release_date.desc&with_original_language=en"

        headers = {
            "accept": "application/json",
            "Authorization": "Bearer ' '"
                ## token redacted
        }

        response = requests.get(url, headers=headers)
        page = json.loads(response.text)
        data.append(page)
    return data

In [5]:
# Function to get TMDB pages in ascending order of release date
def get_asc_by_year(year, num_pages):
    data = []
    for i in range(num_pages):
        url = f"https://api.themoviedb.org/3/discover/movie?include_adult=false&include_video=false&page={i+1}&language=en-US&primary_release_year={year}&sort_by=primary_release_date.asc&with_original_language=en"

        headers = {
            "accept": "application/json",
            "Authorization": "Bearer ' '"
                ## token redacted
        }

        response = requests.get(url, headers=headers)
        page = json.loads(response.text)
        data.append(page)
    return data
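
The ascending and descending functions differ only in the `sort_by` value of the URL. As a possible simplification, the sketch below builds the same Discover URL from a parameter dictionary using only the standard library. The helper name `build_discover_url` and its `descending` flag are hypothetical and do not appear in the notebook; the query parameters are copied from the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(year, page, descending=False):
    """Build the Discover URL used in this notebook for one page of one year."""
    direction = "desc" if descending else "asc"
    params = {
        "include_adult": "false",
        "include_video": "false",
        "page": page,
        "language": "en-US",
        "primary_release_year": year,
        "sort_by": f"primary_release_date.{direction}",
        "with_original_language": "en",
    }
    # urlencode keeps the insertion order of the dictionary
    return f"{BASE_URL}?{urlencode(params)}"


url_asc = build_discover_url(2010, 1)
url_desc = build_discover_url(2010, 1, descending=True)
print(url_asc)
print(url_desc)
```

A single fetch function could then take the direction as an argument instead of duplicating the request loop.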

In [6]:
# Function to get TMDB IDs
def get_movie_ids_by_year(year):
    
    # Get total number of pages
    pages, results = get_number_pages_and_results_year(year)
    
    # Find the (ceiling of) half the number of pages
    ceil_pages = -(pages // -2)

    # Get results when sorted by ascending order of release
    data_asc = get_asc_by_year(year, ceil_pages)

    # Get results when sorted by descending order of release
    data_desc = get_desc_by_year(year, ceil_pages)

    # Collect the IDs from every descending and ascending page
    ids_desc = [movie["id"] for page in data_desc for movie in page["results"]]
    ids_asc = [movie["id"] for page in data_asc for movie in page["results"]]
    
    # Combine IDs and remove duplicates that may have gotten captured in both ascending and descending
    ids = ids_asc + ids_desc
    ids = list(set(ids))
    print(f"We wanted {results}, and we got {len(ids)} for year {year}")
    
    return ids   

In [7]:
id_list = []
for yr in range(2010, 2023):  # 2010 through 2022 inclusive
    id_list += get_movie_ids_by_year(yr)

We wanted 7918, and we got 7881 for year 2010
We wanted 8740, and we got 8740 for year 2011
We wanted 9561, and we got 9561 for year 2012
We wanted 10973, and we got 10973 for year 2013
We wanted 11711, and we got 11711 for year 2014
We wanted 11768, and we got 11768 for year 2015
We wanted 11994, and we got 11981 for year 2016
We wanted 13499, and we got 13499 for year 2017
We wanted 13732, and we got 13732 for year 2018
We wanted 15298, and we got 15296 for year 2019
We wanted 16362, and we got 16362 for year 2020
We wanted 16844, and we got 16844 for year 2021
We wanted 15691, and we got 15691 for year 2022
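
To summarize how complete the retrieval was, the counts printed above can be tallied. The snippet below is a small sketch that copies the numbers from that output by hand; the variable names are illustrative and it is not part of the original notebook.

```python
# (year, total_results reported by TMDB, unique IDs retrieved), copied from the output above
counts = [
    (2010, 7918, 7881), (2011, 8740, 8740), (2012, 9561, 9561),
    (2013, 10973, 10973), (2014, 11711, 11711), (2015, 11768, 11768),
    (2016, 11994, 11981), (2017, 13499, 13499), (2018, 13732, 13732),
    (2019, 15298, 15296), (2020, 16362, 16362), (2021, 16844, 16844),
    (2022, 15691, 15691),
]

# Years where fewer IDs were retrieved than TMDB reported
missing = {year: wanted - got for year, wanted, got in counts if wanted != got}

total_wanted = sum(wanted for _, wanted, _ in counts)
total_got = sum(got for _, _, got in counts)
shortfall = total_wanted - total_got

print(missing)    # {2010: 37, 2016: 13, 2019: 2}
print(shortfall)  # 52
```

Only three of the thirteen years fall short of the reported totals, by 52 IDs out of 164,091 overall.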


In [9]:
df = pd.DataFrame({'ids':id_list})
df.to_csv("./Outputs/2.1.1_TMDB_IDs_2010_2023.csv")