Planning

Before jumping into the code, outline in plain language what you are trying to do: you have to understand the task yourself before you can ask the computer to do it. This week introduced some new code that you may still be getting used to, so this lesson walks you through the task. We will go through the individual pieces of code, but for the project you will need to put it all together, in a logical order, with correct formatting! There will be an OUTER and an INNER loop: a loop within a loop!

Our goals are to:

- Determine where to save our results and in what file format.
- Decide what subset of movies to retrieve (based on years).
- Develop code to make API calls for our existing IMDB IDs with the INNER loop.
- Organize the output by year into separate .json files using the OUTER loop.

So let's consider what our process looks like:

BEFORE the loops:
- Designate a folder to save your information.
- Define the years you wish to retrieve.
- Define any custom functions you will use.

Create an OUTER loop for each year, with a progress bar using tqdm_notebook:

1. Define a JSON_FILE filename to save the results in progress.
   - Check if the file exists.
     - If no: create the JSON file using with open, containing a single placeholder record that has just the key "imdb_id".
     - If yes: do nothing.
2. Define/filter the movie IDs you want to retrieve (those that belong to the year being retrieved).
3. Check for and remove any previously downloaded movie IDs to prevent duplicate API calls.
   - Load any existing/previous results with pd.read_json.
   - Check whether any of the movie IDs to get are already in the JSON file.
   - Keep only the movies that are missing from the JSON file to use in the loop.
4. Create an INNER loop to make API calls for each ID in the YEAR specified in the outer loop. For each ID:
   - Load the results so far from the JSON file as a list.
   - Retrieve the data for the current ID from the API as a dictionary of results.
   - Append the new results to the list from the JSON file.
   - Save the updated JSON file back to disk.
5. After the inner loop, save the final results for that year as a csv.gz file with the year in the filename.
6. Then, the outer loop repeats for the remaining years.
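The plan above can be rehearsed end to end with toy data before any real API calls are made. The following is a minimal sketch: fake_api_call, toy_basics, the file names, and the temporary folder are all hypothetical stand-ins, and only the control flow mirrors the outline.

```python
import json
import os
import tempfile

import pandas as pd


# Hypothetical stand-in for the real API call: returns a dict of results.
def fake_api_call(movie_id):
    return {"imdb_id": movie_id, "title": f"Movie {movie_id}"}


# Toy stand-in for the filtered title basics table.
toy_basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "startYear": [2000, 2000, 2001],
})

folder = tempfile.mkdtemp()  # throwaway folder for this sketch
years_to_get = [2000, 2001]

for year in years_to_get:  # OUTER loop: one pass per year
    json_file = os.path.join(folder, f"results_{year}.json")

    # Create the file with a placeholder record if it does not exist yet.
    if not os.path.isfile(json_file):
        with open(json_file, "w") as f:
            json.dump([{"imdb_id": 0}], f)

    # IDs for this year, minus any that are already saved.
    movie_ids = toy_basics.loc[toy_basics["startYear"] == year, "tconst"]
    previous = pd.read_json(json_file)
    ids_to_get = movie_ids[~movie_ids.isin(previous["imdb_id"])]

    for movie_id in ids_to_get:  # INNER loop: one pass per movie
        with open(json_file) as f:
            results = json.load(f)
        results.append(fake_api_call(movie_id))
        with open(json_file, "w") as f:
            json.dump(results, f)

    # Save the year's results as a compressed CSV.
    csv_file = os.path.join(folder, f"final_{year}.csv.gz")
    pd.read_json(json_file).to_csv(csv_file, compression="gzip", index=False)

saved_2000 = pd.read_csv(os.path.join(folder, "final_2000.csv.gz"))
```

Because the file is rewritten after every movie, an interrupted run loses at most one result, and rerunning the loop skips everything already saved. Note that the placeholder record with imdb_id 0 ends up in the saved CSV and may need to be dropped later.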



In [14]:
import os, time, json
import pandas as pd
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

# Folder where all results will be saved; create it if it does not exist
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['title_akas.csv.gz', 'title_basics.csv.gz', 'title_ratings.csv.gz']
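The cells in this lesson do not show an API key being set, but tmdbsimple needs one assigned to tmdb.API_KEY before any request will succeed. A minimal sketch of one way to keep the key out of the notebook follows; the file name tmdb_api.json, the "api-key" field, and the load_api_key helper are assumptions for illustration, not part of the project files.

```python
import json
import os
import tempfile


def load_api_key(path, field="api-key"):
    """Read a key from a JSON file so it never appears in the notebook."""
    with open(path) as f:
        return json.load(f)[field]


# Demonstration with a throwaway file and a fake key (hypothetical values).
demo_path = os.path.join(tempfile.mkdtemp(), "tmdb_api.json")
with open(demo_path, "w") as f:
    json.dump({"api-key": "FAKE_KEY_FOR_DEMO"}, f)

api_key = load_api_key(demo_path)

# With tmdbsimple imported as tmdb, the key would then be assigned like this:
# tmdb.API_KEY = api_key
```

In a real project the JSON file would live outside the repository (or be listed in .gitignore) so the key is never committed.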

In [12]:
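def get_movie_with_rating(movie_id):
    """Returns the TMDB info for a movie, adding its US certification if one exists."""
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    # Save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()
    # Loop through the countries in releases
    for c in releases['countries']:
        # If the country abbreviation is US, save its certification
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']
    return info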
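The certification lookup inside the function can be tried without calling the API. This is a small sketch using a hand-made dictionary shaped like the releases result the function iterates over; extract_us_certification and mock_releases are hypothetical names, and the certification values are invented for the demonstration.

```python
def extract_us_certification(releases):
    """Return the US certification from a releases-style dict, or None."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            return country.get("certification")
    return None


# Hypothetical data shaped like the dict the function above loops over.
mock_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12A"},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}

us_cert = extract_us_certification(mock_releases)
no_cert = extract_us_certification({"countries": []})
```

Unlike the loop in get_movie_with_rating, which keeps the last US entry it sees, this sketch returns the first one. It also makes the no-match case explicit: when a movie has no US entry, the original function simply leaves the 'certification' key out of info, so that column will have missing values later.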

In [2]:
def write_json(new_data, filename):
    """Appends a record or a list of records (new_data) to a json file (filename).
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""

    with open(filename, 'r+') as file:
        # First we load the existing data (a list of records).
        file_data = json.load(file)
        ## Choose extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file so the new contents overwrite the old.
        file.seek(0)
        # Convert back to json and write it out.
        json.dump(file_data, file)
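To see the load, append, and rewrite pattern in action, here is a self-contained sketch that runs against a throwaway file. append_to_json is a hypothetical variant of the function above, not a replacement for it; the one deliberate difference is the added file.truncate() call, which guards against leftover bytes if the new contents were ever shorter than the old.

```python
import json
import os
import tempfile


def append_to_json(new_data, filename):
    """Hypothetical variant of write_json: load, extend or append, rewrite."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)   # list of records: add each one
        else:
            file_data.append(new_data)   # single record: add it as one item
        file.seek(0)                     # go back to the start of the file
        json.dump(file_data, file)
        file.truncate()                  # drop anything left past the new end


# Start from the same placeholder file the lesson creates.
demo_file = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_file, "w") as f:
    json.dump([{"imdb_id": 0}], f)

append_to_json({"imdb_id": "tt0000001"}, demo_file)
append_to_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_file)

with open(demo_file) as f:
    contents = json.load(f)
```

After the two calls, contents holds the placeholder plus three records, which shows both the append branch and the extend branch.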

**Load in the Title Basics data**

You need to read in the filtered dataframe you created according to the specifications of Project 3 Part 1.

You will be filtering out the movies for each year inside the loop, so we will need this loaded and ready to be filtered.

In [7]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv('Data/title_basics.csv.gz')

**Create Required Lists for the Loop**

Define a list of the years to extract from the API

We have data from 2000 to 2020 available. To get results for just the first two years, we create a YEARS_TO_GET list that contains only those two years (for now). This list controls our outer loop.

In [8]:
YEARS_TO_GET = [2000,2001]
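When it is time to retrieve more years, the list does not need to be typed out by hand. A brief sketch, where ALL_YEARS and FIRST_TWO are hypothetical names used only for illustration:

```python
# All years from 2000 through 2020 inclusive (range excludes its stop value).
ALL_YEARS = list(range(2000, 2021))

# The first two years only, matching the list defined above.
FIRST_TWO = ALL_YEARS[:2]
```

Starting with a short list keeps the first run quick, and because finished years are skipped by the duplicate check, the list can be widened later without repeating work.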

**Define an errors list**

We want to be able to save the ID and error message for any movie that causes an error. To do so, we create an empty errors list before our loops that we can append to later.

In [10]:
errors = []

In [16]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    #Defining the JSON file to store results for year
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    ## Check if JSON_FILE exists
    file_exists = os.path.isfile(JSON_FILE)
    
    # If it does not exist: create it
    if not file_exists:
        # Save a list holding one placeholder record with just "imdb_id" to the new json file.
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)
    
    # Filter the movies for the current year into df
    df = basics.loc[basics['startYear'] == YEAR].copy()
    # Save the movie ids as a Series
    movie_ids = df['tconst'].copy()
    
    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)
    
    # filter out any ids that are already in the JSON_FILE
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]
    
    # Loop over each movie id that still needs to be retrieved
    # INNER loop
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)  
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)
            
        except Exception as e:
            errors.append([movie_id, e])
    
    # Reload the year's full results and save them as a compressed csv
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/1432 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/1548 [00:00<?, ?it/s]

In [17]:
print(f"- Total errors: {len(errors)}")

- Total errors: 2980
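The total of 2980 equals 1432 + 1548, the number of movies attempted across both years, so every request in this run raised an exception. The errors list stores each message, so the next step is to look at what they say. A minimal sketch of summarizing a list in the same [movie_id, exception] shape follows; demo_errors and its contents are invented for the demonstration and are not the real errors from this run.

```python
import pandas as pd

# Hypothetical errors list in the same [movie_id, exception] shape used above.
demo_errors = [
    ["tt0000001", ValueError("example failure A")],
    ["tt0000002", ValueError("example failure A")],
    ["tt0000003", KeyError("example failure B")],
]

errors_df = pd.DataFrame(demo_errors, columns=["movie_id", "error"])

# Pull out the exception class name and its message as plain strings.
errors_df["error_type"] = errors_df["error"].apply(lambda e: type(e).__name__)
errors_df["message"] = errors_df["error"].apply(str)

# Count how many times each kind of error occurred.
error_counts = errors_df["error_type"].value_counts()
```

If nearly all rows share one error type and message, the cause is probably something global (for example configuration or connectivity) rather than individual movie IDs.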
