# Part 2: Extract from TMDB

### Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.

### Source of financial data
TMDB (https://www.themoviedb.org/)

* Your stakeholder wants you to extract the budget, revenue, and MPAA Rating (G/PG/PG-13/R), which is also called "Certification".
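
As a rough sketch of what the extraction is aiming for, the requested fields can be picked out of a movie record once it has been retrieved. The `sample_info` dictionary below is a hand-written placeholder, not real API output, and `pick_financials` is a hypothetical helper name. The `revenue` and `certification` key names are assumptions based on how the rest of this notebook uses the record.

```python
def pick_financials(info):
    """Return only the fields the stakeholder asked for.

    Missing keys come back as None instead of raising a KeyError.
    """
    wanted = ("imdb_id", "budget", "revenue", "certification")
    return {key: info.get(key) for key in wanted}


# Hand-written placeholder record (values are illustrative, not real data)
sample_info = {
    "imdb_id": "tt0000000",
    "title": "Placeholder Movie",
    "budget": 1000,
    "revenue": 2500,
}

print(pick_financials(sample_info))
```

Using `.get` keeps the sketch tolerant of records that have no US certification, which can happen when a movie has no US release entry.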




In [1]:
import pandas as pd

In [2]:
!pip install tmdbsimple



In [3]:
import os, time, json
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['.ipynb_checkpoints',
 'Data',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'TMDB api.ipynb',
 'tmdb_api_results_2000.json']

In [4]:
import json
with open('/Users/Jacob/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)

login.keys()

dict_keys(['Client-ID', 'API Key'])

In [5]:
import tmdbsimple as tmdb
tmdb.API_KEY =  login['API Key']

### Specifications - Financial Data

* Your stakeholder would like you to extract and save the results for movies that meet all of the criteria established in part 1 of the project (You should already have a filtered dataframe saved from part one as a csv.gz file)

* As a proof-of-concept, they requested you perform a test extraction of movies that started in 2000 or 2001

* Each year should be saved as a separate .csv.gz file


#### Hint: Use the two custom functions from the lessons (Intro to TMDB API, and Efficient TMDB API Calls). Be sure to define these functions prior to calling them in your code!

* One function will add the certification (MPAA Rating) to movie.info

* The other function will help you append/extend a JSON file with Python



In [6]:
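def get_movie_with_rating(movie_id):
    """Return a movie's TMDB info dict with its US certification added."""
    # Get the movie object for the given id
    movie = tmdb.Movies(movie_id)

    # Save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()

    # Loop through the countries in releases
    for c in releases['countries']:
        # If the country abbreviation is US, save its certification
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']
    return info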

In [7]:
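def write_json(new_data, filename):
    """Appends a list of records (new_data) to a json file (filename).
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""

    with open(filename, 'r+') as file:
        # Load the existing data into a Python object
        file_data = json.load(file)

        # Extend with a list of records, or append a single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)

        # Move back to the start of the file and overwrite its contents
        file.seek(0)
        json.dump(file_data, file)
        # Drop any leftover bytes in case the new content is shorter
        file.truncate()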
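
To see the append behaviour without calling the API, the same pattern can be exercised on a throwaway file. This is a minimal sketch: `append_records` is a hypothetical standalone variant of `write_json`, and the ids are placeholders. It calls `file.truncate()` after writing as a defensive step, which only matters if the rewritten content is ever shorter than what was there before.

```python
import json
import os
import tempfile


def append_records(new_data, filename):
    """Load the JSON list in `filename`, add `new_data`, and rewrite the file."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)   # list of records: add each one
        else:
            file_data.append(new_data)   # single record: add it as one item
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


# Seed a temporary file with the same placeholder record the notebook uses
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w") as f:
    json.dump([{"imdb_id": 0}], f)

# Append one record, then a list of two records (placeholder ids)
append_records({"imdb_id": "tt0000001"}, demo_path)
append_records([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_path)

with open(demo_path) as f:
    records = json.load(f)
print(len(records))
```

Because the whole file is re-read and rewritten on every call, the cost grows with the file size. That is acceptable for a proof-of-concept of a few thousand movies, but it is a trade-off worth knowing about.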

### Confirm Your API Function works.

In order to ensure your function for extracting movie data from TMDB is working, test your function on these 2 movie ids: tt0848228 ("The Avengers") and tt0332280 ("The Notebook"). Make sure that your function runs without error and that it returns the correct movie's data for both test ids.
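
One way to make the "correct movie" check explicit is to compare the `imdb_id` echoed back in the record with the id that was requested. The sketch below uses stand-in dictionaries containing only the two fields needed, and `check_movie` is a hypothetical helper name.

```python
def check_movie(info, expected_imdb_id):
    """Return True when `info` describes the requested movie.

    Compares the `imdb_id` field in the record with the id that was requested.
    """
    return info.get("imdb_id") == expected_imdb_id


# Stand-in records containing only the fields needed for the check
avengers = {"imdb_id": "tt0848228", "original_title": "The Avengers"}
notebook = {"imdb_id": "tt0332280", "original_title": "The Notebook"}

print(check_movie(avengers, "tt0848228"))
print(check_movie(notebook, "tt0848228"))
```

In the notebook itself, the first argument would be the dictionary returned by `get_movie_with_rating`.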

#### Hint: Ideally you can organize the code segments from the previous lesson to create an outer and inner loop, but if you get stuck, you can complete 1 year at a time.

* Once you have retrieved and saved the final results to 2 separate .csv.gz files, move on to a new Exploratory Data Analysis notebook to explore the following questions.




In [8]:
test = get_movie_with_rating("tt0848228")
test

{'adult': False,
 'backdrop_path': '/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg',
 'belongs_to_collection': {'id': 86311,
  'name': 'The Avengers Collection',
  'poster_path': '/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg',
  'backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg'},
 'budget': 220000000,
 'genres': [{'id': 878, 'name': 'Science Fiction'},
  {'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'}],
 'homepage': 'https://www.marvel.com/movies/the-avengers',
 'id': 24428,
 'imdb_id': 'tt0848228',
 'original_language': 'en',
 'original_title': 'The Avengers',
 'overview': 'When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international peacekeeping agency known as S.H.I.E.L.D., finds himself in need of a team to pull the world back from the brink of disaster. Spanning the globe, a daring recruitment effort begins!',
 'popularity': 118.586,
 'poster_path': '/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg',
 'production_companies': [{'id': 420,
   'logo_path

In [9]:
test = get_movie_with_rating("tt0332280")
test

{'adult': False,
 'backdrop_path': '/qom1SZSENdmHFNZBXbtJAU0WTlC.jpg',
 'belongs_to_collection': None,
 'budget': 29000000,
 'genres': [{'id': 10749, 'name': 'Romance'}, {'id': 18, 'name': 'Drama'}],
 'homepage': 'http://www.newline.com/properties/notebookthe.html',
 'id': 11036,
 'imdb_id': 'tt0332280',
 'original_language': 'en',
 'original_title': 'The Notebook',
 'overview': "An epic love story centered around an older man who reads aloud to a woman with Alzheimer's. From a faded notebook, the old man's words bring to life the story about a couple who is separated by World War II, and is then passionately reunited, seven years later, after they have taken different paths.",
 'popularity': 66.459,
 'poster_path': '/rNzQyW4f8B8cQeg7Dgj3n6eT5k9.jpg',
 'production_companies': [{'id': 12,
   'logo_path': '/5ThIuO93vsk47oexKTSdfKEr7EC.png',
   'name': 'New Line Cinema',
   'origin_country': 'US'},
  {'id': 1565, 'logo_path': None, 'name': 'Avery Pix', 'origin_country': 'US'},
  {'id': 26

In [10]:
errors = [ ]

In [11]:
print(f"- Number of errors: {len(errors)}")
errors

- Number of errors: 0


[]

In [12]:
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['.ipynb_checkpoints',
 'Data',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'TMDB api.ipynb',
 'tmdb_api_results_2000.json']

In [13]:
basics = pd.read_csv(r'\Users\Jacob\Documents\GitHub\0424 d e\Project-3\Data\title_basics.csv.gz')

In [14]:
YEARS_TO_GET = [2000,2001]

In [15]:
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    # Define the JSON file that stores this year's results
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    file_exists = os.path.isfile(JSON_FILE)

    # If it does not exist yet, create it with a placeholder record
    if not file_exists:
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)

    # Select the movie ids for this year
    df = basics.loc[basics['startYear'] == YEAR].copy()
    movie_ids = df['tconst'].copy()

    # Load the results saved so far and skip ids already retrieved
    previous_df = pd.read_json(JSON_FILE)
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)
            # Append/extend results to existing file using a pre-made function
            write_json(temp, JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)

        except Exception as e:
            # Record the failed id and the exception, then continue
            errors.append([movie_id, e])

    # Save this year's results as a compressed csv
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/1461 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/1578 [00:00<?, ?it/s]

In [16]:
# YEAR = 2000

### Deliverables
After you have joined the TMDB results into one dataframe in the EDA notebook:

* Save a final merged .csv.gz of all of the TMDB API data
* The file name should be "tmdb_results_combined.csv.gz"
* Make sure this is pushed to your GitHub repository along with all of your code
 * One code file for API calls
 * One code file for EDA

* Submit the link
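
The merge step described in the deliverables could look roughly like the following. This is a sketch under stated assumptions: it expects per-year files named `final_tmdb_data_{year}.csv.gz` in a single folder, `combine_years` is a hypothetical helper name, and the demo writes tiny placeholder files to a temporary folder instead of using real TMDB data.

```python
import os
import tempfile

import pandas as pd


def combine_years(folder, years, out_name="tmdb_results_combined.csv.gz"):
    """Concatenate the per-year files and save one combined .csv.gz."""
    frames = [
        pd.read_csv(os.path.join(folder, f"final_tmdb_data_{year}.csv.gz"))
        for year in years
    ]
    combined = pd.concat(frames, ignore_index=True)
    out_path = os.path.join(folder, out_name)
    combined.to_csv(out_path, compression="gzip", index=False)
    return combined


# Demo with tiny placeholder files in a temporary folder (not real TMDB data)
demo_folder = tempfile.mkdtemp()
for year, ids in [(2000, ["tt0000001"]), (2001, ["tt0000002", "tt0000003"])]:
    pd.DataFrame({"imdb_id": ids}).to_csv(
        os.path.join(demo_folder, f"final_tmdb_data_{year}.csv.gz"),
        compression="gzip",
        index=False,
    )

combined = combine_years(demo_folder, [2000, 2001])
print(combined.shape)
```

Note that the per-year JSON files are seeded with a placeholder record whose `imdb_id` is 0, so the real per-year files will likely contain that row. Filtering it out before analysis is probably worthwhile.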