# TMDB API PULL
**Author:** Michael McCann <br>
**Last Updated:** 13 MAY 2022

<u>Overview:</u> <br>
The [IMDB data](https://github.com/msmccann10/PP-movie-database-and-analysis/blob/main/01_IMDB_Data_Loading_and_Processing.ipynb) does not include financial information (budget/revenue) or MPAA certification. [The Movie Database (TMDB)](https://www.themoviedb.org/) contains this information and is accessible via API. We will call the TMDB API to pull this additional data for our database.
 
<u>Tasks:</u><br> 
- Make folders/location to save data
- Build and run the API calls to TMDB
- Remove unwanted files

<u>Other:</u><br>
TMDB API: https://developers.themoviedb.org/4/getting-started/authorization <br>
TMDB API Package: https://github.com/celiao/tmdbsimple

## Imports

In [4]:
import pandas as pd
import numpy as np
import json, os, math, time

import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

## Get API Key, Instantiate API

In [5]:
# Get filepath for keys
file = "../../.secret/tmdb_api.json"

# load keys as login
with open(file, 'r') as f:
    login = json.load(f)

# Instantiate API
tmdb.API_KEY = login['api-key']
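
The cell above expects a JSON file with an `api-key` field. As a minimal sketch of that layout, the helper below loads the key and fails loudly if the field is missing. The name `load_api_key`, the temporary file, and the placeholder key are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile


def load_api_key(path, field="api-key"):
    """Return the key stored under `field` in the JSON file at `path`."""
    with open(path, "r") as f:
        login = json.load(f)
    # Fail early with a clear message instead of a bare KeyError later on
    if not login.get(field):
        raise KeyError(f"'{field}' is missing or empty in {path}")
    return login[field]


# Demonstration with a throwaway file and a placeholder (not a real credential)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tmdb_api.json")
    with open(demo_path, "w") as f:
        json.dump({"api-key": "PLACEHOLDER_KEY"}, f)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

In the notebook itself, the returned value would be assigned to `tmdb.API_KEY`.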

## Functions

### Movie Ratings Function

In [6]:
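def movie_rating(movie_id):
    """Return TMDB info for a movie, plus its US certification when one is listed."""
    movie = tmdb.Movies(movie_id)
    info = movie.info()
    releases = movie.releases()

    # Copy the US certification into the info dict.
    # Movies with no US entry get no 'certification' key.
    for c in releases['countries']:
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']
    return info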
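
The certification lookup can be tried without calling the API. The sketch below runs the same country match on a hand-written dictionary that mimics the `countries` / `iso_3166_1` / `certification` keys used above. The helper name `us_certification` and the sample values are made up for illustration. Unlike `movie_rating`, it returns the first match and returns `None` when no entry exists.

```python
def us_certification(releases, country="US"):
    """Return the certification for `country`, or None if there is no entry."""
    for entry in releases.get("countries", []):
        if entry.get("iso_3166_1") == country:
            return entry.get("certification")
    return None


# Hypothetical payload, written by hand for illustration only
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "15"},
        {"iso_3166_1": "US", "certification": "R"},
    ]
}

info = {"title": "Example Movie"}
info["certification"] = us_certification(sample_releases)
print(info)
```

Always setting the key, even to `None`, would give every record the same fields, which is one possible design choice and not what the notebook currently does.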

### Write JSON function

In [7]:
def write_json(new_data, filename):
    """Add new_data to the JSON list stored in filename."""
    with open(filename, 'r+') as file:
        # Load the existing data
        file_data = json.load(file)
        # Extend with a list of records, otherwise append a single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file before rewriting it
        file.seek(0)
        # Convert back to JSON and drop any leftover bytes
        json.dump(file_data, file)
        file.truncate()
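
A quick round trip shows how the append pattern behaves. This is a self-contained sketch: `append_json` is a stand-in with the same load, extend-or-append, rewrite logic, and the file name and ids are invented for the demonstration. The file must already contain a JSON list, which is why it is seeded first.

```python
import json
import os
import tempfile


def append_json(new_data, filename):
    """Add one record (dict) or several records (list) to a JSON list on disk."""
    with open(filename, "r+") as f:
        file_data = json.load(f)
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        f.seek(0)          # rewrite from the start of the file
        json.dump(file_data, f)
        f.truncate()       # remove leftover bytes if the new text is shorter


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_requests.json")
    # Seed the file with a placeholder record, as the API loop does
    with open(path, "w") as f:
        json.dump([{"imdb_id": 0}], f)

    append_json({"imdb_id": "tt0000001"}, path)
    append_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], path)

    with open(path) as f:
        stored = json.load(f)

print([record["imdb_id"] for record in stored])
```

Because the whole file is read and rewritten on every call, this pattern gets slower as the file grows. It is simple and keeps partial results on disk if a run is interrupted.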

## Set Variables

In [8]:
# Create or confirm folder for data
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

# Read in title basics from IMDB
basics = pd.read_csv('data/title_basics.csv.gz')

# Set Years we are interested in
YEARS_TO_GET = list(range(2000,2022))

## API Call

In [6]:
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position = 0):    
    JSON_FILE = f'{FOLDER}tmdb_api_requests_{YEAR}.json'
    
    # If JSON_FILE does not exist yet, seed it with a placeholder record
    if not os.path.isfile(JSON_FILE):
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)
    
    # get movie ids ('tconst') from our IMDB basics dataframe
    df = basics.loc[basics['startYear'] == YEAR].copy()
    movie_ids = df['tconst'].copy()
    
    # Check for previous pulls 
    previous_df = pd.read_json(JSON_FILE)
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]
    
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            temp = movie_rating(movie_id)
            write_json(temp, JSON_FILE)
            # Brief pause between requests
            time.sleep(0.02)
        except Exception:
            # Skip any id whose request fails and move on
            continue
    
    # Save the results out to a zipped csv                
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)

YEARS:   0%|          | 0/22 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/207 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/241 [00:00<?, ?it/s]

Movies from 2002:   0%|          | 0/1505 [00:00<?, ?it/s]

Movies from 2003:   0%|          | 0/1630 [00:00<?, ?it/s]

Movies from 2004:   0%|          | 0/1832 [00:00<?, ?it/s]

Movies from 2005:   0%|          | 0/2124 [00:00<?, ?it/s]

Movies from 2006:   0%|          | 0/2347 [00:00<?, ?it/s]

Movies from 2007:   0%|          | 0/2483 [00:00<?, ?it/s]

Movies from 2008:   0%|          | 0/2823 [00:00<?, ?it/s]

Movies from 2009:   0%|          | 0/3450 [00:00<?, ?it/s]

Movies from 2010:   0%|          | 0/3750 [00:00<?, ?it/s]

Movies from 2011:   0%|          | 0/4143 [00:00<?, ?it/s]

Movies from 2012:   0%|          | 0/4426 [00:00<?, ?it/s]

Movies from 2013:   0%|          | 0/4620 [00:00<?, ?it/s]

Movies from 2014:   0%|          | 0/4771 [00:00<?, ?it/s]

Movies from 2015:   0%|          | 0/4934 [00:00<?, ?it/s]

Movies from 2016:   0%|          | 0/5147 [00:00<?, ?it/s]

Movies from 2017:   0%|          | 0/5509 [00:00<?, ?it/s]

Movies from 2018:   0%|          | 0/5629 [00:00<?, ?it/s]

Movies from 2019:   0%|          | 0/5676 [00:00<?, ?it/s]

Movies from 2020:   0%|          | 0/4766 [00:00<?, ?it/s]

Movies from 2021:   0%|          | 0/4700 [00:00<?, ?it/s]
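
The loop above resumes by skipping ids already stored and silently continues past any error. The sketch below separates that logic from the API so it can be tested, and records which ids failed instead of discarding the exception. It uses plain Python sets in place of the pandas `isin` filter. `fetch_missing`, `fake_fetch`, and the ids are hypothetical names for illustration.

```python
import time


def fetch_missing(movie_ids, fetched_ids, fetch, pause=0.0):
    """Fetch every id not in `fetched_ids`; return (results, failures)."""
    done = set(fetched_ids)
    results, failures = [], {}
    for movie_id in movie_ids:
        if movie_id in done:
            continue  # already pulled in an earlier run
        try:
            results.append(fetch(movie_id))
        except Exception as exc:
            # Keep the reason so failed ids can be reviewed or retried later
            failures[movie_id] = repr(exc)
        time.sleep(pause)
    return results, failures


def fake_fetch(movie_id):
    """Stand-in for an API call; fails for one id to show the error path."""
    if movie_id == "tt0000003":
        raise LookupError("not found")
    return {"imdb_id": movie_id}


results, failures = fetch_missing(
    ["tt0000001", "tt0000002", "tt0000003"],
    fetched_ids=[0, "tt0000001"],
    fetch=fake_fetch,
)
print(results)
print(list(failures))
```

In the notebook, `fetch` would be `movie_rating` and `pause` would be the `0.02` second delay.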

In [7]:
pd.read_csv(f'{FOLDER}final_tmdb_data_2002.csv.gz')

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0096056,0.0,/95U3MUDXu4xSCmVLtWgargRipDi.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,109809.0,en,Crime and Punishment,...,0.0,126.0,"[{'english_name': 'Polish', 'iso_639_1': 'pl',...",Released,,Crime and Punishment,0.0,5.5,11.0,
2,tt0118926,0.0,/p3BzCgX1gDIPdWfuFqRHIe52Ynf.jpg,,0.0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,20689.0,en,The Dancer Upstairs,...,5227348.0,132.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"An honest man caught in a world of intrigue, p...",The Dancer Upstairs,0.0,6.3,50.0,R
3,tt0119980,0.0,,,0.0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,563364.0,en,Random Shooting in LA,...,0.0,91.0,[],Released,,Random Shooting in LA,0.0,0.0,0.0,
4,tt0120679,0.0,/s04Ds4xbJU7DzeGVyamccH4LoxF.jpg,,12000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",https://www.miramax.com/movie/frida,1360.0,en,Frida,...,56298474.0,123.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Prepare to be seduced.,Frida,0.0,7.5,1720.0,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1213,tt5802260,0.0,,,0.0,[],,890955.0,zh,殺人計畫,...,0.0,0.0,[],Released,,殺人計畫,0.0,0.0,0.0,
1214,tt6449044,0.0,/a9pkw8stijESGx1flSGPqcXLkHu.jpg,"{'id': 957260, 'name': 'The Conman Collection'...",0.0,"[{'id': 35, 'name': 'Comedy'}]",,314105.0,cn,賭俠2002,...,0.0,97.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,,The Conman 2002,0.0,6.0,2.0,
1215,tt6694126,0.0,/sXjVpTZyDvwzPVZve3AmyCUBeHk.jpg,,0.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,819174.0,fa,عروس خوش‌قدم,...,0.0,101.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa'...",Released,,The Lucky Bride,0.0,0.0,0.0,
1216,tt8302928,0.0,,,0.0,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",,866533.0,el,Movie Toons: Treasure Island,...,0.0,0.0,[],Released,,Movie Toons: Treasure Island,0.0,0.0,0.0,


## Remove Files
The JSON files have been converted to csv.gz files, so they now only take up extra space and can be deleted.

**Note:** This section is commented out to keep anyone from deleting files they might want to keep. Only run if you are willing/able to remove the JSON files.

In [14]:
## os.listdir(FOLDER)

## Loop through and remove JSON files created above.
# for YEAR in YEARS_TO_GET:
#     if os.path.exists(f'{FOLDER}tmdb_api_requests_{YEAR}.json'):
#         os.remove(f'{FOLDER}tmdb_api_requests_{YEAR}.json')
#     else:
#         print (f"file {FOLDER}tmdb_api_requests_{YEAR}.json does not exist")
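
One cautious way to run this cleanup is to preview what would be deleted first. The sketch below is an illustration under that assumption: the `remove_request_files` helper and its `dry_run` flag are hypothetical additions, and the demonstration works inside a temporary folder so no real data is touched.

```python
import os
import tempfile


def remove_request_files(folder, years, dry_run=True):
    """Return the yearly request files found; delete them unless dry_run is True."""
    matched = []
    for year in years:
        path = os.path.join(folder, f"tmdb_api_requests_{year}.json")
        if os.path.exists(path):
            matched.append(path)
            if not dry_run:
                os.remove(path)
    return matched


with tempfile.TemporaryDirectory() as tmp:
    # Create two empty stand-in files; 2002 is deliberately missing
    for year in (2000, 2001):
        open(os.path.join(tmp, f"tmdb_api_requests_{year}.json"), "w").close()

    preview = remove_request_files(tmp, range(2000, 2003))  # nothing deleted
    still_there = len(os.listdir(tmp))

    deleted = remove_request_files(tmp, range(2000, 2003), dry_run=False)
    remaining = len(os.listdir(tmp))

print(len(preview), still_there, len(deleted), remaining)
```

With `dry_run=True` as the default, an accidental run only lists files, which serves the same purpose as keeping the cell commented out.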