# Part 2: Extract from TMDB API

## Methods

- Load the cleaned title basics data from Part 1.
- Use the `tconst` values to extract movie data from The Movie Database (TMDB) API.
- Save the results in separate files by `startYear` so each file stays under the Git file size limit.
- Combine all years into one final CSV, then perform some quick EDA to get an overview of the collected data (next notebook: Part 2B).
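
The final combining step could look roughly like the sketch below. It assumes the per-year files follow the `final_tmdb_data_{YEAR}.csv.gz` naming used later in this notebook; the function name `combine_year_files` is hypothetical and not part of the project code.

```python
import glob
import os

import pandas as pd


def combine_year_files(folder="Data/", pattern="final_tmdb_data_*.csv.gz"):
    """Concatenate every per-year CSV found in `folder` into one DataFrame.

    Hypothetical helper: the file pattern is taken from the notebook,
    everything else is an illustrative assumption.
    """
    # Sort so the years are stacked in a predictable order
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder!r}")
    # pandas infers gzip compression from the .gz extension
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)
```

Calling `combine_year_files("Data/")` and then `to_csv(..., compression="gzip", index=False)` on the result would give a single combined file; the output filename is left to the reader.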

# Loading the Data 

In [None]:
%load_ext autoreload
%autoreload 2
import project_functions as pf


In [1]:
# Install tmdbsimple (only need to run once)
# !pip install tmdbsimple

In [2]:
import json
with open('/Users/codingdojo/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
## Display the keys of the loaded dict
login.keys()

dict_keys(['api-key'])

In [3]:
import tmdbsimple as tmdb
tmdb.API_KEY =  login['api-key']

In [18]:
# !pip install tzlocal

Defaulting to user installation because normal site-packages is not writeable


In [19]:
import os
import pandas as pd
import time
import datetime as dt
import tzlocal

from tqdm.notebook import tqdm_notebook
FOLDER = "Data/"
# os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['title_basics_cleaned.csv.gz',
 'title_ratings_cleaned.csv.gz',
 'tmdb_api_results_2000.json',
 'title_akas_cleaned.csv.gz']

In [6]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv('Data/title_basics_cleaned.csv.gz')
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112878 entries, 0 to 112877
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          112878 non-null  object 
 1   titleType       112878 non-null  object 
 2   primaryTitle    112878 non-null  object 
 3   originalTitle   112878 non-null  object 
 4   isAdult         112878 non-null  int64  
 5   startYear       112878 non-null  float64
 6   runtimeMinutes  112878 non-null  int64  
 7   genres          112878 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 6.9+ MB


|   | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres |
|---|--------|-----------|--------------|---------------|---------|-----------|----------------|--------|
| 0 | tt0035423 | movie | Kate & Leopold | Kate & Leopold | 0 | 2001.0 | 118 | Comedy,Fantasy,Romance |
| 1 | tt0062336 | movie | The Tango of the Widower and Its Distorting Mi... | El Tango del Viudo y Su Espejo Deformante | 0 | 2020.0 | 70 | Drama |
| 2 | tt0069049 | movie | The Other Side of the Wind | The Other Side of the Wind | 0 | 2018.0 | 122 | Drama |
| 3 | tt0088751 | movie | The Naked Monster | The Naked Monster | 0 | 2005.0 | 100 | Comedy,Horror,Sci-Fi |
| 4 | tt0093119 | movie | Grizzly II: Revenge | Grizzly II: The Predator | 0 | 2020.0 | 74 | Horror,Music,Thriller |


### Functions for API Extraction

In [7]:
def get_movie_with_rating(movie_id):
    """Retrieve the dictionary of movie data using tmdbsimple, including the MPAA rating.
    Args:
        movie_id (str): ID of the movie to retrieve (from IMDB's tconst column)
        
    Returns:
        dict: dictionary of movie.info() + the certification from movie.releases"""
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    # save the .info .releases dictionaries
    movie_info = movie.info()
    releases = movie.releases()
    # Loop through countries in releases
    for c in releases['countries']:
        # if the country abbreviation==US
        if c['iso_3166_1' ] =='US':
            ## save a "certification" key in the info dict with the certification
            movie_info['certification'] = c['certification']
    return movie_info
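
The certification lookup in `get_movie_with_rating` can be tried without any API calls. The sketch below is a minimal illustration: the payload shape (`countries`, `iso_3166_1`, `certification`) comes from the keys the function reads, but the sample values and the helper name `extract_us_certification` are made up. Unlike the loop above, it returns the first US entry and falls back to `None`, whereas the original keeps the last US entry and leaves the `certification` key out entirely when there is no US release.

```python
def extract_us_certification(releases):
    """Return the US certification from a releases-style dict, or None.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            # Treat an empty certification string as missing
            return country.get("certification") or None
    return None


# Made-up sample payload mirroring the keys read by get_movie_with_rating
sample = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12A"},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}
print(extract_us_certification(sample))  # PG-13
```

Returning `None` explicitly may make the missing case easier to spot later in a DataFrame, though that is a design preference rather than something the notebook requires.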

In [9]:
## Define start years
YEARS_TO_GET = list(range(2000,2023))#[2000,2001]
YEARS_TO_GET

[2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021,
 2022]

> Note: I should test whether skipping the per-movie save to disk dramatically speeds up the API extraction.
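
One way to run that test offline is sketched below. It times the two saving strategies with a fake fetch function in place of the API, so it only measures the disk-write overhead. `fake_fetch`, `append_to_json_file`, and both timing functions are hypothetical names; `append_to_json_file` is a guess at what a helper like `pf.write_json` might do, since `project_functions` is not shown here.

```python
import json
import os
import tempfile
import time


def fake_fetch(i):
    """Stand-in for an API call; returns a small dict without any network use."""
    return {"imdb_id": f"tt{i:07d}", "budget": i}


def append_to_json_file(record, path):
    """Read the JSON list at `path`, append `record`, and rewrite the file."""
    with open(path, "r+") as f:
        data = json.load(f)
        data.append(record)
        f.seek(0)
        json.dump(data, f)
        f.truncate()


def time_save_each(n, path):
    """Time n records when every record is written to disk immediately."""
    with open(path, "w") as f:
        json.dump([], f)
    start = time.perf_counter()
    for i in range(n):
        append_to_json_file(fake_fetch(i), path)
    return time.perf_counter() - start


def time_append_to_list(n, path):
    """Time n records when they are collected in memory and saved once."""
    start = time.perf_counter()
    rows = [fake_fetch(i) for i in range(n)]
    with open(path, "w") as f:
        json.dump(rows, f)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "results.json")
    print(f"save each movie: {time_save_each(200, target):.4f} s")
    print(f"append to list:  {time_append_to_list(200, target):.4f} s")
```

In the real loop, network latency and the `time.sleep` call may dominate the runtime, so these numbers only bound the cost of the disk writes and say nothing about the overall speedup.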

In [15]:
## Setting booleans to control which method of saving data
ADVANCED_MOVIE_ID_WORKFLOW=False
ERROR_LOG = "tmdb_api_errors.json"
with open(ERROR_LOG,'w') as f:
    json.dump([{'movie_id':'','error':''}],f)

In [None]:
# Start of OUTER loop
if ADVANCED_MOVIE_ID_WORKFLOW==True:
    print('- Since ADVANCED_MOVIE_ID_WORKFLOW is True, will save each movie to disk.')
else:
    print('- Since ADVANCED_MOVIE_ID_WORKFLOW is False, will append each movie to list.')

now = dt.datetime.now(tz=tzlocal.get_localzone())
print(f"\n[i] API Calls started on {now.strftime('%m-%d-%Y')} @ {now.strftime('%I:%M:%S %p')}")



for YEAR in tqdm_notebook(YEARS_TO_GET,desc='YEARS',
                          position=0):
    

    #Saving new year as the current df
    df = basics.loc[ basics['startYear']==YEAR].copy()
    # saving movie ids to list
    movie_ids = df['tconst'].copy()#.to_list()
    
    
    ## if saving each movie to disk
    if ADVANCED_MOVIE_ID_WORKFLOW:

        #Defining the JSON file to store results for year
        JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'

        # Check if file exists
        file_exists = os.path.isfile(JSON_FILE)


        # If it does not exist: create it
        if not file_exists:
            # Save a list holding one placeholder dict with just "imdb_id" to the new JSON file.
            with open(JSON_FILE,'w') as f:
                json.dump([{'imdb_id':0}],f)


        # Load existing data from json into a dataframe called "previous_df"
        previous_df = pd.read_json(JSON_FILE)


        # filter out any ids that are already in the JSON_FILE
        movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    ## if appending to a list instead of saving .json
    else:
        ## Make list to append data to instead of saving to file
        year_data = []
        ## use all movie_ids from basics
        movie_ids_to_get = movie_ids
        
        
        
    # Get each movie id from the list
    # INNER Loop
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        # Attempt to retrieve the data for the movie id
        try:
            temp = get_movie_with_rating(movie_id)  # This uses the pre-made function
            # Append/extend results to existing file using a pre-made function
            
            ## Save to json_file if advanced workflow
            if ADVANCED_MOVIE_ID_WORKFLOW:
                
                pf.write_json(temp,JSON_FILE)
                
            else:
                year_data.append(temp)
                
            
            # Short 10 ms sleep to prevent overwhelming server
            time.sleep(0.01)

        # If it fails,  append error message with id to json
        except Exception as e:
            temp = {'movie_id':movie_id,'error':str(e)}
            pf.write_json(temp, ERROR_LOG)
            continue

    ## FINAL SAVING OF YEAR DATA
    if ADVANCED_MOVIE_ID_WORKFLOW:
        final_year_df = pd.read_json(JSON_FILE)
        
    else:
        final_year_df = pd.DataFrame(year_data)
    
    ## save compressed csv
    csv_fname = f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz"
    final_year_df.to_csv(csv_fname, compression="gzip", index=False)
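
The resume step in the advanced workflow (skipping IDs that are already in the year's JSON file) can be checked on toy data. The sketch below reuses the variable names from the loop above, but the IDs are invented for illustration.

```python
import pandas as pd

# Invented IDs standing in for one year's tconst values
movie_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"], name="tconst")

# Mimics a previously saved JSON file: the placeholder row plus one fetched movie
previous_df = pd.DataFrame({"imdb_id": [0, "tt0000002"]})

# Keep only the IDs that have not been saved yet
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]
print(movie_ids_to_get.tolist())  # ['tt0000001', 'tt0000003']
```

The placeholder `0` never matches a `tconst` string, so it does not affect the filter.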


- Since ADVANCED_MOVIE_ID_WORKFLOW is False, will append each movie to list.

[i] API Calls started on 05-05/21/22-2022 @ 05:32:13 PM


YEARS:   0%|          | 0/23 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/1725 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/1859 [00:00<?, ?it/s]

Movies from 2002:   0%|          | 0/1922 [00:00<?, ?it/s]

Movies from 2003:   0%|          | 0/2123 [00:00<?, ?it/s]

Movies from 2004:   0%|          | 0/2435 [00:00<?, ?it/s]

Movies from 2005:   0%|          | 0/2837 [00:00<?, ?it/s]

Movies from 2006:   0%|          | 0/3224 [00:00<?, ?it/s]

Movies from 2007:   0%|          | 0/3486 [00:00<?, ?it/s]

Movies from 2008:   0%|          | 0/4118 [00:00<?, ?it/s]

Movies from 2009:   0%|          | 0/4932 [00:00<?, ?it/s]

Movies from 2010:   0%|          | 0/5418 [00:00<?, ?it/s]

Movies from 2011:   0%|          | 0/5991 [00:00<?, ?it/s]

Movies from 2012:   0%|          | 0/6469 [00:00<?, ?it/s]

Movies from 2013:   0%|          | 0/6798 [00:00<?, ?it/s]

Movies from 2014:   0%|          | 0/6962 [00:00<?, ?it/s]

Movies from 2015:   0%|          | 0/7055 [00:00<?, ?it/s]

Movies from 2016:   0%|          | 0/7234 [00:00<?, ?it/s]

- Analysis continued in "Part 2B - EDA Overview of TMDB Data.ipynb"