In [16]:
from tqdm.notebook import tqdm_notebook
import pandas as pd
import os, json, math, time
import tmdbsimple as tmdb
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['title_basics.csv.gz', 'title_akas.csv.gz', 'title_ratings.csv.gz']

Define Your Functions
You should ultimately put any custom functions at the top of your notebook. You can write them where you first use them in your project, but once the functions are completed and tested, move their definitions to the top of your notebook, right after your package imports.

You will need your function to get the movie rating from the prior lesson, as well as the new function below: write_json. This is a modified version of a function from https://www.geeksforgeeks.org/append-to-json-file-using-python/. Notice that the original source link is included in the function's docstring to give proper credit to the original authors.

In [2]:
def write_json(new_data, filename): 
    """Appends a record or a list of records (new_data) to a json file (filename). 
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""  
    
    with open(filename,'r+') as file:
        # First we load the existing data (a list of records).
        file_data = json.load(file)
        ## Choose extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file so the dump overwrites the old contents.
        file.seek(0)
        # Convert back to json.
        json.dump(file_data, file)
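
As a quick check of the extend-versus-append behavior described above, here is a minimal sketch that runs write_json against a throwaway file. The temporary path and the sample records are hypothetical and exist only for illustration; the function body is restated so the cell runs on its own.

```python
import json
import os
import tempfile


def write_json(new_data, filename):
    """Append a record, or extend with a list of records, in a JSON file holding a list."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)


# Hypothetical throwaway file, seeded with a single placeholder record.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w") as f:
    json.dump([{"imdb_id": 0}], f)

# A single dict is appended as one record.
write_json({"imdb_id": "tt0000001", "budget": 100}, demo_path)
# A list of dicts extends the file by one record per item.
write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_path)

with open(demo_path) as f:
    demo_records = json.load(f)

print(len(demo_records))
```

Because the stored list only ever grows, rewriting from position 0 fully replaces the previous contents.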

Load in the Title Basics data
You need to read in the filtered dataframe you created based on the specification of Project 3 Part 1.

You will be filtering out the movies for each year inside the loop, so we will need this loaded and ready to be filtered.

In [5]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv("Data/title_basics.csv.gz")

In [6]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
4,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008,,94,Horror


Create Required Lists for the Loop
Define a list of the Years to Extract from the API

We have data from 2000 - 2020 available. If we just want results for the first two years, we will create a YEARS_TO_GET list that only contains those 2 years (for now). This will control our outer loop.

In [7]:
YEARS_TO_GET = [2000,2001]

Define an errors list

We will want to be able to save the ids and error messages for any movie that causes an error. To do so, we will want to create an empty errors list before our loops that we can append to later.

In [8]:
errors = [ ]

Start OUTER loop

Set up Progress Bar
We want to keep track of our progress and ensure our calls are working. The progress bar wraps the iterable in the for statement of the loop, so it advances once per iteration. Note that this loop iterates through each year defined in the YEARS_TO_GET variable.



In [17]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    time.sleep(.2) 

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Ultimately we will be creating a loop, but let's explore each piece of the code:

1. Select a JSON_FILE filename to save the results in progress.
2. Check if the file exists.
   If no: create the JSON file using "with open", containing a single placeholder record with the key "imdb_id".
   If yes: do nothing.

First, define the file path and name. We are going to have multiple files, since we are creating a separate file for each year. The code below builds the path from the FOLDER we defined above and names the file based on the current year.

In [21]:
#Defining the JSON file to store results for year
JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'

In [22]:
# Check if file exists
file_exists = os.path.isfile(JSON_FILE)

In [23]:
file_exists

False

In [24]:
# If it does not exist: create it
if not file_exists:
    # Save a list holding one placeholder record with just "imdb_id" to the new json file.
    with open(JSON_FILE,'w') as f:
        json.dump([{'imdb_id':0}],f)

In [26]:
file_exists

False
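
Note that file_exists still shows False above because the variable was assigned before the file was created; it is not updated automatically. The following minimal sketch illustrates this with a hypothetical temporary path (not the notebook's actual JSON_FILE) and re-runs the check after the file is written.

```python
import json
import os
import tempfile

# Hypothetical path inside a fresh temporary folder, used only for this demo.
JSON_FILE = os.path.join(tempfile.mkdtemp(), "tmdb_api_results_demo.json")

# The check is evaluated once, at this moment.
file_exists = os.path.isfile(JSON_FILE)

if not file_exists:
    # Seed the file with a list holding one placeholder record.
    with open(JSON_FILE, "w") as f:
        json.dump([{"imdb_id": 0}], f)

# The old variable is stale; a fresh check reflects the new file.
exists_after = os.path.isfile(JSON_FILE)
print(file_exists, exists_after)
```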

Define/filter the IDs to call
We are going to break up our title_basics data by year, so we will define a new dataframe for each year. The rows selected depend on the current value of YEAR. Keeping YEAR as a variable, rather than hard-coding a value, makes the code easier to read and reproduce.

In [27]:
#Saving new year as the current df
df = basics.loc[ basics['startYear']==YEAR].copy()
# saving movie ids to list
movie_ids = df['tconst'].copy()

Check for and remove any previously downloaded Movie id's
You may remember from our lesson on efficient API calls that we are going to build in some safeguards when looping through multiple calls.

1. Load in any existing API results with pd.read_json.
2. Check to see if any of the movie_ids to get are already in the JSON file.
3. Keep only the movies that are missing from the JSON file to use in the loop.

The code loads any existing information from the JSON file into a dataframe called "previous_df". It starts with only the placeholder record, but as you iterate through the loop, it will hold more and more results.

In [28]:
# Load existing data from json into a dataframe called "previous_df"
previous_df = pd.read_json(JSON_FILE)

Check for and filter out movie IDs that already exist

The next line of code will prevent you from wasting API calls on data you already have. Note that it is defining the ids you are calling in such a way that it excludes any ids that are already present in the previous_df. You may recall that this will also allow you to "pick up where you left off" if your API call gets interrupted.

In [29]:
# filter out any ids that are already in the JSON_FILE
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

Now we have defined the "movie_ids_to_get". It includes the ids from our dataframe in the year we are seeking, and it excludes any that we have already made calls for.

We will use this list for our inner loop of API calls.
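
To make the filtering step concrete, here is a small sketch with made-up ids. The three tconst values and the contents of previous_df are hypothetical; only the isin pattern mirrors the notebook.

```python
import pandas as pd

# Hypothetical ids for one year, standing in for df['tconst'].
movie_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"])

# Hypothetical earlier results: the placeholder row plus one movie already saved.
previous_df = pd.DataFrame({"imdb_id": [0, "tt0000002"]})

# isin marks ids that were already retrieved; ~ inverts the mask to keep the rest.
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]

remaining = movie_ids_to_get.tolist()
print(remaining)
```

Only the ids not yet present in the saved results remain, which is what lets an interrupted run resume without repeating calls.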

Start INNER Loop
Now that we have the filtered list of movie_ids_to_get for the current year, we will create an inner loop to iterate through the movie_ids_to_get. For each ID, we will:

1. Retrieve the movie info from the TMDB API.
2. Append the movie_info dictionary to our JSON_FILE.
3. Wait 20 ms to avoid overwhelming the API.

Iterate through the list of Movie IDs and make the calls
The code below relies on the function you wrote in the previous lesson that made API calls and added the certification to the .info results. Here this function is named "get_movie_with_rating". Make sure you have the function from the earlier lesson in the code file before you plan to call on it! This loop also uses the function above (write_json) to extend/append the results to the .json file. Make sure both functions are defined in your code file before you try to call them!

Since some movies exist in IMDB's title basics dataset (our DataFrame) that do not exist within the database for TMDB's API, we will get an error whenever we attempt to retrieve a movie id that TMDB does not have in its database.

To get around this, we will use a try and except statement inside our inner loop. We will try to retrieve and save the data for the current movie_id, but if we get an error, we will save the movie_id and error message in our errors list and move on to the next id.

In [30]:
    # INNER loop: iterate over the movie ids still needed for this year
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)  
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)
            
        except Exception as e:
            errors.append([movie_id, e])

Movies from 2001:   0%|          | 0/1584 [00:00<?, ?it/s]
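
The try/except pattern can be tried without any API access. In this sketch, fake_get_movie is a hypothetical stand-in for get_movie_with_rating that deliberately fails for one id, and results is a plain list used in place of the JSON file.

```python
errors = []
results = []


def fake_get_movie(movie_id):
    """Hypothetical stand-in for get_movie_with_rating; fails for one id."""
    if movie_id == "tt0000002":
        raise ValueError("movie not found")
    return {"imdb_id": movie_id}


for movie_id in ["tt0000001", "tt0000002", "tt0000003"]:
    try:
        results.append(fake_get_movie(movie_id))
    except Exception as e:
        # Storing the message as text keeps the errors list easy to inspect or save.
        errors.append([movie_id, str(e)])

print(len(results), errors)
```

One failing id does not stop the loop; the remaining ids are still processed. If every id ends up in errors, it may be worth inspecting the first message, since a missing function or API key would fail on every call.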

After the Inner Loop
Once the inner loop through the current movie_ids_to_get has finished, we will have all of our results for that year in our JSON_FILE. We now want to save them in a smaller file format.
Save the year's results as csv.gz file

Once all of the API calls for the current year are made, open that year's .json file with pd.read_json and save it as a compressed csv (".csv.gz") to save space. This is done after the inner loop but still within the outer loop, so each year gets its own file.

In [31]:
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)
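
As a sanity check on the compressed output, the sketch below writes a small dataframe to a .csv.gz file and reads it back. The sample rows and the temporary output path are hypothetical; pd.read_csv infers gzip compression from the file extension.

```python
import os
import tempfile

import pandas as pd

# Hypothetical results for one year.
final_year_df = pd.DataFrame(
    {"imdb_id": ["tt0000001", "tt0000002"], "budget": [100, 0]}
)

# Hypothetical output location in a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), "final_tmdb_data_demo.csv.gz")
final_year_df.to_csv(out_path, compression="gzip", index=False)

# Reading back confirms the file round-trips with the same rows and columns.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```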

After Your Inner & Outer Loop
Print a message reporting back the number of movie ids that caused an error.


In [32]:
print(f"- Total errors: {len(errors)}")

- Total errors: 1584
