# Part 2: Extract from TMDB API

## Methods

- Load in the cleaned title basics data from part 1. 
- Use the `tconst` values to extract movie data from The Movie Database's API.
- Keep the data in separate files by `startYear` so that each file stays under the git file size limit.
- Combine all years into one final CSV, then perform some quick EDA to get an overview of the data collected (next notebook: Part 2B).

# Loading the Data 

In [1]:
%load_ext autoreload
%autoreload 2
import project_functions as pf


In [2]:
# Install tmdbsimple (only need to run once)
# !pip install tmdbsimple

In [3]:
import json
with open('/Users/codingdojo/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
## Display the keys of the loaded dict
login.keys()

dict_keys(['api-key', 'api-token'])

In [4]:
import tmdbsimple as tmdb
tmdb.API_KEY =  login['api-key']

In [5]:
# !pip install tzlocal

In [6]:
import os
import pandas as pd
import time
import datetime as dt
import tzlocal

from tqdm.notebook import tqdm_notebook
FOLDER = "Data/"
# os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['final_tmdb_data_2006.csv.gz',
 'final_tmdb_data_2018.csv.gz',
 'final_tmdb_data_2014.csv.gz',
 'final_tmdb_data_2022.csv.gz',
 'title_basics_cleaned.csv.gz',
 'combined_tmdb_api_data.csv.gz',
 '.DS_Store',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2016.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'title_ratings_cleaned.csv.gz',
 'tmdb_api_results_2000.json',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2007.csv.gz',
 'final_tmdb_data_2015.csv.gz',
 'For Tableau',
 'final_tmdb_data_2021.csv.gz',
 'title_crew.csv.gz',
 'final_tmdb_data_2017.csv.gz',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2005.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 '.ipynb_checkpoints',
 'final_tmdb_data_2013.csv.gz',
 'name_basics.csv.gz',
 'title_principals.csv.gz',
 'title_akas_cleaned.csv.gz',
 'final_tmdb_data_2011.csv.gz',
 'final_tm

In [7]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv('Data/title_basics_cleaned.csv.gz')
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127375 entries, 0 to 127374
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          127375 non-null  object 
 1   titleType       127375 non-null  object 
 2   primaryTitle    127375 non-null  object 
 3   originalTitle   127375 non-null  object 
 4   isAdult         127375 non-null  int64  
 5   startYear       127375 non-null  float64
 6   runtimeMinutes  127375 non-null  int64  
 7   genres          127375 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 7.8+ MB


|   | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres |
|---|--------|-----------|--------------|---------------|---------|-----------|----------------|--------|
| 0 | tt0035423 | movie | Kate & Leopold | Kate & Leopold | 0 | 2001.0 | 118 | Comedy,Fantasy,Romance |
| 1 | tt0062336 | movie | The Tango of the Widower and Its Distorting Mi... | El tango del viudo y su espejo deformante | 0 | 2020.0 | 70 | Drama |
| 2 | tt0069049 | movie | The Other Side of the Wind | The Other Side of the Wind | 0 | 2018.0 | 122 | Drama |
| 3 | tt0070596 | movie | Socialist Realism | El realismo socialista | 0 | 2023.0 | 78 | Drama |
| 4 | tt0082328 | movie | Embodiment of Evil | Encarnação do Demônio | 0 | 2008.0 | 94 | Horror |


In [8]:
basics[basics['primaryTitle'].str.contains("The Marvels")]

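|   | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres |
|---|--------|-----------|--------------|---------------|---------|-----------|----------------|--------|
| 20994 | tt10676048 | movie | The Marvels | The Marvels | 0 | 2023.0 | 105 | Action,Adventure,Fantasy |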


### Functions for API Extraction

In [9]:
def get_movie_with_rating(movie_id):
    """Retrieve the dictionary of movie data using tmdbsimple, including the MPAA rating.

    Args:
        movie_id (str): id of the movie to retrieve (from IMDB's tconst column)

    Returns:
        dict: dictionary of movie.info() + the US certification from movie.releases()
    """
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    # Save the .info() and .releases() dictionaries
    movie_info = movie.info()
    releases = movie.releases()
    # Loop through countries in releases
    for c in releases['countries']:
        # If the country abbreviation is US
        if c['iso_3166_1'] == 'US':
            # Save a "certification" key in the info dict with the certification
            # (the key is only added when a US release is present)
            movie_info['certification'] = c['certification']
    return movie_info
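
The certification lookup inside `get_movie_with_rating` can be tried without calling the API. The sketch below assumes the `releases` dictionary has the shape the loop above relies on (a `countries` list whose items carry `iso_3166_1` and `certification` keys). The helper name `extract_us_certification` and the sample data are hypothetical, made up for illustration.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification, or None if there is none."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US" and country.get("certification"):
            return country["certification"]
    return None


# Made-up sample shaped like the dictionary the loop above iterates over
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12A"},
        {"iso_3166_1": "US", "certification": ""},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}

print(extract_us_certification(sample_releases))    # PG-13
print(extract_us_certification({"countries": []}))  # None
```

Unlike the loop in `get_movie_with_rating`, which keeps the last US entry it sees (even an empty one), this variant returns the first non-empty US certification. Which behaviour is preferable depends on how the data is used later.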

In [10]:
## Define start years
# YEARS_TO_GET = list(range(2000,2023))#[2000,2001]
YEARS_TO_GET =  sorted(basics['startYear'].unique())
YEARS_TO_GET

[2000.0,
 2001.0,
 2002.0,
 2003.0,
 2004.0,
 2005.0,
 2006.0,
 2007.0,
 2008.0,
 2009.0,
 2010.0,
 2011.0,
 2012.0,
 2013.0,
 2014.0,
 2015.0,
 2016.0,
 2017.0,
 2018.0,
 2019.0,
 2020.0,
 2021.0,
 2022.0,
 2023.0]

> Note: I should test whether skipping the per-movie save to disk dramatically speeds up API extraction.
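
One way to run that test without touching the API is to time both strategies on dummy records. The sketch below uses made-up records and a temporary file. `save_each_to_disk` and `append_to_list` are hypothetical names, and the disk variant assumes each save rewrites the full JSON list, which may not match how `pf.write_json` actually works.

```python
import json
import os
import tempfile
import time


def save_each_to_disk(records, path):
    """Rewrite a JSON list file after every record (assumed disk-first behaviour)."""
    saved = []
    for rec in records:
        saved.append(rec)
        with open(path, "w") as f:
            json.dump(saved, f)
    return saved


def append_to_list(records):
    """Keep everything in memory and write nothing inside the loop."""
    saved = []
    for rec in records:
        saved.append(rec)
    return saved


# Dummy records standing in for API responses
records = [{"imdb_id": f"demo_{i}", "budget": i} for i in range(500)]
path = os.path.join(tempfile.mkdtemp(), "timing_demo.json")

start = time.perf_counter()
on_disk = save_each_to_disk(records, path)
disk_seconds = time.perf_counter() - start

start = time.perf_counter()
in_memory = append_to_list(records)
memory_seconds = time.perf_counter() - start

print(f"disk-first: {disk_seconds:.4f}s, list-append: {memory_seconds:.4f}s")
```

These timings only measure local overhead. In the real loop, the network latency of each request may dominate either way, so the comparison is a starting point rather than an answer.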

In [11]:
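## Boolean controlling how data is saved:
## True = write each movie to disk as it arrives, False = collect in a list first
SAVE_TO_DISK_FIRST=False
ERROR_LOG = "tmdb_api_errors.json"
## Start the error log with one empty record (same keys as the records appended later)
with open(ERROR_LOG,'w') as f:
    json.dump([{'movie_id':'','error':''}],f)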
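
The extraction loop appends records to JSON files through `pf.write_json`, whose source is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming the file holds a single JSON list (as the `ERROR_LOG` file created above does). `append_to_json_list` is a hypothetical stand-in, not the actual `project_functions` code.

```python
import json
import os
import tempfile


def append_to_json_list(new_data, filename):
    """Append a record (or extend with a list of records) to a JSON list file."""
    # Start from an empty list if the file does not exist yet
    if os.path.isfile(filename):
        with open(filename, "r") as f:
            file_data = json.load(f)
    else:
        file_data = []

    if isinstance(new_data, list):
        file_data.extend(new_data)
    else:
        file_data.append(new_data)

    # Rewrite the whole file; simple, but the cost grows with the file size
    with open(filename, "w") as f:
        json.dump(file_data, f)


# Usage with a throwaway file and made-up ids
demo_path = os.path.join(tempfile.mkdtemp(), "demo_errors.json")
append_to_json_list({"movie_id": "demo_1", "error": "example error"}, demo_path)
append_to_json_list({"movie_id": "demo_2", "error": "example error"}, demo_path)

with open(demo_path) as f:
    print(len(json.load(f)))  # 2
```

Because a helper written this way rereads and rewrites the full file on every call, it would get slower as a year's file grows. Whether the real `pf.write_json` behaves like this is an assumption.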

In [12]:
## Reversing order of YEARS_TO_GET
YEARS_TO_GET = YEARS_TO_GET[::-1]
YEARS_TO_GET

[2023.0,
 2022.0,
 2021.0,
 2020.0,
 2019.0,
 2018.0,
 2017.0,
 2016.0,
 2015.0,
 2014.0,
 2013.0,
 2012.0,
 2011.0,
 2010.0,
 2009.0,
 2008.0,
 2007.0,
 2006.0,
 2005.0,
 2004.0,
 2003.0,
 2002.0,
 2001.0,
 2000.0]

In [13]:
import os
# Report which saving method is active
if SAVE_TO_DISK_FIRST:
    print('- Since SAVE_TO_DISK_FIRST is True, will save each movie to disk.')
else:
    print('- Since SAVE_TO_DISK_FIRST is False, will append each movie to list.')

now = dt.datetime.now(tz=tzlocal.get_localzone())
# %d is the day of the month (%D would expand to a full mm/dd/yy date)
print(f"\n[i] API Calls started on {now.strftime('%m-%d-%Y')} @ {now.strftime('%I:%M:%S %p')}")



for YEAR in tqdm_notebook(YEARS_TO_GET,desc='YEARS',
                          position=0):
    

    #Saving new year as the current df
    df = basics.loc[ basics['startYear']==YEAR].copy()
    # saving movie ids to list
    movie_ids = df['tconst'].copy()#.to_list()
    
    
    ## if saving each movie to disk
    if SAVE_TO_DISK_FIRST:

        # Define the JSON file that stores results for this year.
        # YEAR is a float (e.g. 2023.0), so cast to int to keep ".0" out of the filename.
        JSON_FILE = f'{FOLDER}tmdb_api_results_{int(YEAR)}.json'

        # Check if file exists
        file_exists = os.path.isfile(JSON_FILE)


        # If it does not exist: create it
        if not file_exists:
            # Save a placeholder list with a single "imdb_id" entry to the new json file.
            with open(JSON_FILE,'w') as f:
                json.dump([{'imdb_id':0}],f)


        # Load existing data from json into a dataframe called "previous_df"
        previous_df = pd.read_json(JSON_FILE)


        # filter out any ids that are already in the JSON_FILE
        movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    ## if appending to a list instead of saving .json
    else:
        ## Make list to append data to instead of saving to file
        year_data = []
        ## use all movie_ids from basics
        movie_ids_to_get = movie_ids
        
        
        
    # INNER loop: iterate over the movie ids still to retrieve
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        # Attempt to retrieve the data for the movie id
        try:
            temp = get_movie_with_rating(movie_id)  # uses the function defined above

            ## Save to JSON_FILE if saving each movie to disk
            if SAVE_TO_DISK_FIRST:
                
                pf.write_json(temp,JSON_FILE)
                
            else:
                year_data.append(temp)
                
            
            # Short 10 ms sleep to prevent overwhelming server
            time.sleep(0.01)

        # If it fails,  append error message with id to json
        except Exception as e:
            temp = {'movie_id':movie_id,'error':str(e)}
            pf.write_json(temp, ERROR_LOG)
            continue

    ## FINAL SAVING OF YEAR DATA
    if SAVE_TO_DISK_FIRST:
        final_year_df = pd.read_json(JSON_FILE)
        
    else:
        final_year_df = pd.DataFrame(year_data)
    
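    ## save compressed csv (cast the float YEAR to int so the name matches
    ## the existing final_tmdb_data_<year>.csv.gz files)
    csv_fname = f"{FOLDER}final_tmdb_data_{int(YEAR)}.csv.gz"
    final_year_df.to_csv(csv_fname, compression="gzip", index=False)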


- Since SAVE_TO_DISK_FIRST is False, will append each movie to list.

[i] API Calls started on 11-22-2023 @ 09:08:33 PM


YEARS:   0%|          | 0/24 [00:00<?, ?it/s]

Movies from 2023.0:   0%|          | 0/5537 [00:00<?, ?it/s]

Movies from 2022.0:   0%|          | 0/6941 [00:00<?, ?it/s]

Movies from 2021.0:   0%|          | 0/6989 [00:00<?, ?it/s]

Movies from 2020.0:   0%|          | 0/7071 [00:00<?, ?it/s]

Movies from 2019.0:   0%|          | 0/8139 [00:00<?, ?it/s]

Movies from 2018.0:   0%|          | 0/7899 [00:00<?, ?it/s]

Movies from 2017.0:   0%|          | 0/7833 [00:00<?, ?it/s]

Movies from 2016.0:   0%|          | 0/7416 [00:00<?, ?it/s]

Movies from 2015.0:   0%|          | 0/7234 [00:00<?, ?it/s]

Movies from 2014.0:   0%|          | 0/7169 [00:00<?, ?it/s]

KeyboardInterrupt: 
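
The run above was interrupted partway through. With `SAVE_TO_DISK_FIRST=False`, the records for the year in progress are held only in the `year_data` list and are not written to a file until that year finishes. The disk-first branch instead skips ids that are already saved in the year's JSON file. A minimal sketch of that filtering step in plain Python, using made-up ids and a hypothetical `ids_still_needed` helper:

```python
import json
import os
import tempfile


def ids_still_needed(all_ids, json_file):
    """Return the ids from `all_ids` that are not yet stored in `json_file`."""
    if not os.path.isfile(json_file):
        return list(all_ids)
    with open(json_file) as f:
        saved = json.load(f)
    done = {rec.get("imdb_id") for rec in saved}
    return [movie_id for movie_id in all_ids if movie_id not in done]


# Made-up ids: two were saved before the interruption, one is still missing
json_file = os.path.join(tempfile.mkdtemp(), "tmdb_api_results_demo.json")
with open(json_file, "w") as f:
    json.dump([{"imdb_id": "demo_1"}, {"imdb_id": "demo_2"}], f)

print(ids_still_needed(["demo_1", "demo_2", "demo_3"], json_file))  # ['demo_3']
```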

- Analysis continued in "Part 2B - EDA Overview of TMDB Data.ipynb"

### Future To Do:
- Save an entry containing just the movie id when the API returns no information for it.
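
A minimal sketch of this to-do item, assuming the fetch function raises an exception when TMDB has no record for an id. `fetch_or_placeholder` and the fake fetcher are hypothetical; in this notebook the fetcher would be `get_movie_with_rating`.

```python
def fetch_or_placeholder(movie_id, fetch):
    """Return fetch(movie_id), or a record holding only the id if the fetch fails."""
    try:
        return fetch(movie_id)
    except Exception:
        # Keeping the id lets a later run tell "not found" apart from "not tried yet"
        return {"imdb_id": movie_id}


def fake_fetch(movie_id):
    """Stand-in for the API call: knows exactly one made-up id."""
    known = {"demo_1": {"imdb_id": "demo_1", "title": "Demo Movie"}}
    if movie_id not in known:
        raise LookupError(f"no data for {movie_id}")
    return known[movie_id]


results = [fetch_or_placeholder(mid, fake_fetch) for mid in ["demo_1", "demo_2"]]
print(results)
# [{'imdb_id': 'demo_1', 'title': 'Demo Movie'}, {'imdb_id': 'demo_2'}]
```

Catching every exception also turns network failures into placeholders, so a real version would probably want to catch only the "not found" error and keep logging the rest to `ERROR_LOG`.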