# Intro

Jon Messier

2/20/2023

---

**Business Problem**

For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

# Part 1

For Part 1 of the project, you will be creating your project repository, loading the official IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.

**Getting Started Tips:**

- Please make sure to read the following lesson ["Getting Started - Project 3"](https://login.codingdojo.com/m/376/12528/88061) for additional tips and directions!

**The Data**

IMDB provides several files with varied information for movies, TV shows, made-for-TV movies, etc.

- Overview/Data Dictionary: https://www.imdb.com/interfaces/
- Downloads page: https://datasets.imdbws.com/

From their previous research, the stakeholder decided to focus on the following files:

- title.basics.tsv.gz
- title.ratings.tsv.gz
- title.akas.tsv.gz


**Specifications**

Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime
- Include only full-length movies (titleType = "movie")
- Include only fictional movies (not from the documentary genre)
- Include only movies that were released 2000 - 2021 (include 2000 and 2021)
- Include only movies that were released in the United States


**Deliverable**

After filtering out movies that do not meet the stakeholder's specifications:

- Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
- Save each dataframe as a compressed csv file in the "Data/" folder inside your repository.
- Commit your changes to your repository in GitHub Desktop and Publish repository / Push Changes.
- Submit the link to your repository



## Class/Data imports

In [None]:
import pandas as pd
import numpy as np

In [None]:
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'

## Data Inspection and cleanup

### AKAs
- [x]  Replace "\N" with np.nan
- [x] Keep only US movies.

In [None]:
df_akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
df_akas.info()
df_akas.head()

In [None]:
#replace \N with np.nan
df_akas = df_akas.replace({'\\N':np.nan})
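As an alternative to replacing `\N` after loading, pandas can treat it as missing while parsing. The sketch below uses a tiny in-memory TSV (made-up rows, not real IMDB data) to illustrate the `na_values` argument of `pd.read_csv`; it is offered as an option, not as the approach this notebook takes.

```python
import io
import pandas as pd

# Tiny stand-in for an IMDB TSV file (hypothetical rows); "\N" marks a missing value.
sample_tsv = "titleId\tregion\tlanguage\ntt0000001\tUS\t\\N\ntt0000002\t\\N\ten\n"

# na_values tells the parser to read the literal string "\N" as NaN on load,
# so no separate .replace() pass is needed afterwards.
sample_akas = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count the missing values per column.
print(sample_akas.isna().sum())
```

The same argument could be passed when reading the real files from `akas_url`, `basics_url`, and `ratings_url`.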

In [None]:
#Keep only US movies
df_akas=df_akas.loc[df_akas['region']=="US"]

### Basics
- [x] Replace "\N" with np.nan
- [x] Eliminate movies that are null for runtimeMinutes
- [x] Eliminate movies that are null for genre
- [x] Keep only titleType == "movie"
- [x] Keep startYear 2000-2021
- [x] Eliminate movies that include "Documentary" in genre (see tip below)
- [x] Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)

In [None]:
df_basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
df_basics.info()
df_basics.head()

In [None]:
#replace \N with np.nan
df_basics = df_basics.replace({'\\N':np.nan})
df_basics.info()

In [None]:
#drop null runtimes
df_basics = df_basics.dropna(axis=0, subset='runtimeMinutes')
df_basics.info()

In [None]:
#drop null genre
df_basics = df_basics.dropna(axis=0,subset = 'genres')
df_basics.info()

In [None]:
#keep titletype=movie
df_basics=df_basics[df_basics['titleType']=='movie']
df_basics.info()

In [None]:
#drop null startYears
df_basics = df_basics.dropna(axis=0,subset = 'startYear')
df_basics.info()

In [None]:
#convert year to an int
df_basics['startYear']=df_basics['startYear'].astype('int')

#Keep only the movies released 2000-2021 (inclusive), per the specifications
df_basics=df_basics.loc[(df_basics["startYear"]>= 2000) 
                        & (df_basics["startYear"]<= 2021)]
df_basics['startYear'].describe()

In [None]:
# Exclude movies that are included in the documentary category.
is_documentary = df_basics['genres'].str.contains('documentary',case=False)
df_basics = df_basics[~is_documentary]
df_basics.head()
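The documentary filter above relies on a case-insensitive substring match over the comma-separated `genres` string. A minimal sketch on a toy dataframe (made-up rows) shows the behavior; the `na=False` argument is an addition here, useful if the null-genre rows had not already been dropped.

```python
import pandas as pd

# Toy stand-in for title.basics (hypothetical rows).
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Comedy,Drama", "Documentary,History", "Action"],
})

# case=False matches "Documentary" regardless of capitalization;
# na=False counts a missing genre as "not a documentary" instead of returning NaN.
is_doc = toy["genres"].str.contains("documentary", case=False, na=False)

# Invert the mask to keep only the non-documentary rows.
fiction = toy[~is_doc]
print(fiction["tconst"].tolist())
```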

In [None]:
# Filter the basics table down to US titles by checking each tconst
# against the titleId values in the (US-only) AKAs dataframe
keepers =df_basics['tconst'].isin(df_akas['titleId'])
df_basics = df_basics[keepers]
df_basics.head()
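Filtering one dataframe by the keys of another can be sketched on toy data as follows. The rows are made up; the point is that `isin` builds a boolean mask, so repeated `titleId` values in the AKAs table do not duplicate rows in the result the way a merge could.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical rows).
toy_basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                           "primaryTitle": ["A", "B", "C"]})
# tt1 appears twice, as a title with several US alternate names would.
toy_akas = pd.DataFrame({"titleId": ["tt1", "tt1", "tt3"],
                         "region": ["US", "US", "US"]})

# True where the basics id appears anywhere in the AKAs ids.
keep = toy_basics["tconst"].isin(toy_akas["titleId"])
us_only = toy_basics[keep]
print(us_only)
```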

### Ratings

- [x]   Replace "\N" with np.nan
- [x]   Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)

In [None]:
df_ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
df_ratings.info()
df_ratings.head()

In [None]:
#replace \N with np.nan
df_ratings = df_ratings.replace({'\\N':np.nan})

In [None]:
# Filter the ratings table down to US titles by checking each tconst
# against the titleId values in the (US-only) AKAs dataframe
keepers =df_ratings['tconst'].isin(df_akas['titleId'])
df_ratings = df_ratings[keepers]
df_ratings.head()

## Data File storage

In [None]:
# example making new folder with os
import os
os.makedirs('Data/',exist_ok=True) 
# Confirm folder created
os.listdir("Data/")

In [None]:
## Save each dataframe to its own compressed csv file.
df_basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)
df_ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)
df_akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)

In [None]:
# Open saved file and preview again
df_basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
df_basics.head()

In [None]:
# Open saved file and preview again
df_akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
df_akas.head()

In [None]:
# Open saved file and preview again
df_ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
df_ratings.head()

## File .info() summary

In [None]:
df_basics.info()

In [None]:
df_ratings.info()

In [None]:
df_akas.info()

# Part 2a.
### Your Stakeholder Wants More Data!

-   After investigating the preview of your data from Part 1, your stakeholder realized that there is no financial information included in the IMDB data (e.g. budget or revenue).
    -   This is a major roadblock and must be addressed before you can determine which movies are successful.

-  Your stakeholder identified The Movie Database (TMDB) as a great source of financial data (https://www.themoviedb.org/). Thankfully, TMDB offers a free API for programmatic access to their data!

-  Your stakeholder wants you to extract the budget, revenue, and MPAA Rating (G/PG/PG-13/R), which is also called "Certification".

-   **Note: this process can take a long time and may need to run overnight.**

---

### Specifications - Financial Data

- Your stakeholder would like you to extract and save the results for movies that meet all of the criteria established in part 1 of the project (You should already have a filtered dataframe saved from part one as a csv.gz file)

-  As a proof-of-concept, they requested you perform a test extraction of movies that started in 2000 or 2001

-   Each year should be saved as a separate .csv.gz file

**Hint: Use the two custom functions from the lessons (Intro to TMDB API, and Efficient TMDB API Calls). Be sure to define these functions prior to calling them in your code!**

-  One function will add the certification (MPAA Rating) to movie.info
-  The other function will help you append/extend a JSON file with Python

**Confirm Your API Function works.**

In order to ensure your function for extracting movie data from TMDB is working, test your function on these 2 movie ids: tt0848228 ("The Avengers") and tt0332280 ("The Notebook"). Make sure that your function runs without error and that it returns the correct movie's data for both test ids.

**Hint: Ideally you can organize the code segments from the previous lesson to create an outer and inner loop, but if you get stuck, you can complete 1 year at a time.**

-  Once you have retrieved and saved the final results to 2 separate .csv.gz files, move on to a new Exploratory Data Analysis notebook to explore the following questions.


## Class Import, API connection
Import key classes, connect to the TMDB API, and load local data

In [None]:
import tmdbsimple as tmdb
import pandas as pd
from tqdm.notebook import tqdm_notebook
import os, time, json

with open('/Users/jonme/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
## Display the keys of the loaded dict
login.keys()

In [None]:
tmdb.API_KEY =  login['api-key']

## Custom Functions

### `get_movie_with_rating`

In [None]:
def get_movie_with_rating(movie_id):
    movie = tmdb.Movies(movie_id)
    
    info = movie.info()
    
    releases = movie.releases()
    
    for c in releases['countries']:
        if c['iso_3166_1']=='US':
            info['certification']=c['certification']
    return info     
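The certification lookup can be checked offline by isolating it from the API call. The helper below, `extract_us_certification`, is a hypothetical name introduced for illustration and is not part of tmdbsimple; it assumes the same dictionary shape the function above reads (`countries`, `iso_3166_1`, `certification`), and the sample data is made up. Unlike the loop above, it skips empty certification strings and returns `None` when no US entry has one.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification in a releases dict, else None.

    Hypothetical helper: assumes the dict has a "countries" list whose items
    carry "iso_3166_1" and "certification" keys.
    """
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US" and country.get("certification"):
            return country["certification"]
    return None


# Made-up sample shaped like the releases data used above.
sample_releases = {"countries": [
    {"iso_3166_1": "GB", "certification": "12A"},
    {"iso_3166_1": "US", "certification": ""},
    {"iso_3166_1": "US", "certification": "PG-13"},
]}

print(extract_us_certification(sample_releases))
```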

In [None]:
#test get_movie... function with Avengers "tt0848228"
test = get_movie_with_rating("tt0848228") 
test

### `write_json`

In [None]:
def write_json(new_data, filename): 
    """Appends a record or a list of records (new_data) to a json file (filename). 
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""  
    
    with open(filename,'r+') as file:
        # First we load the existing data (a list of records).
        file_data = json.load(file)
        ## Choose extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file before rewriting it.
        file.seek(0)
        # Convert back to json.
        json.dump(file_data, file)
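A quick offline check of the append-or-extend behavior, using a temporary file instead of the project's Data/ folder. The helper is repeated so the block runs on its own, and the file name `demo.json` is arbitrary.

```python
import json
import os
import tempfile


def write_json(new_data, filename):
    """Append a record, or extend with a list of records, in a JSON list file."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.json")
    # Seed the file with the same placeholder record the extraction loop uses.
    with open(path, "w") as f:
        json.dump([{"imdb_id": 0}], f)

    write_json({"imdb_id": "tt0848228"}, path)    # single dict -> append
    write_json([{"imdb_id": "tt0332280"}], path)  # list of dicts -> extend

    with open(path) as f:
        records = json.load(f)

print([r["imdb_id"] for r in records])
```

Rewriting from position 0 without truncating is safe here only because the file never shrinks; each call writes at least as many bytes as were there before.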

## Load existing data
Load in the Title Basics data

You need to read in the filtered dataframe you created based on the specification of project 3 Part 1.

You will be filtering out the movies for each year inside the loop, so we will need this loaded and ready to be filtered.


In [None]:
#Check for DATA folder
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

In [None]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv('Data/title_basics.csv.gz')

**Create Required Lists for the Loop**

Define a list of the Years to Extract from the API

We have data from 2000 - 2020 available, and the list below covers all of those years. If we just want results for the first two years (the proof-of-concept extraction), we can set YEARS_TO_GET to [2000, 2001] instead. This list will control our outer loop.


In [None]:
YEARS_TO_GET = list(range(2000,2021))

**Define an errors list**

We want to save the IDs and error messages for any movie that causes an error. To do so, we create an empty errors list before our loops that we can append to later.


In [None]:
errors = [ ]

### Write JSON data to CSV files

In [None]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    # Define the JSON file that stores the results for this year
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    # Check if JSON_FILE exists
    file_exists = os.path.isfile(JSON_FILE)
    print(f'{JSON_FILE} status {file_exists}')
    # If it does not exist: create it
    if not file_exists:
        # Save a list holding one placeholder record with just "imdb_id"
        with open(JSON_FILE,'w') as f:
            json.dump([{'imdb_id':0}],f)

    # Select this year's movies. This runs whether or not the file already
    # existed, so an interrupted extraction can resume where it stopped.
    df = basics.loc[basics['startYear']==YEAR].copy()
    # Save the movie ids
    movie_ids = df['tconst'].copy()

    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)

    # Filter out any ids that are already in the JSON_FILE
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    # INNER loop over the movie ids that still need to be retrieved
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)

        except Exception as e:
            errors.append([movie_id, e])

    # Convert this year's JSON results to a compressed csv
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz",
                         compression="gzip", index=False)
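The resume step inside the loop (skipping ids already stored in the year's JSON file) can be illustrated without any API calls. The ids below are made up; `previous_df` mimics a results file that already holds the placeholder record plus two retrieved movies.

```python
import pandas as pd

# Ids selected for one year (hypothetical values).
movie_ids = pd.Series(["tt1", "tt2", "tt3", "tt4"])

# Stand-in for the JSON results file: placeholder record plus two finished ids.
previous_df = pd.DataFrame({"imdb_id": [0, "tt1", "tt3"]})

# Keep only the ids that have not been retrieved yet.
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]
print(movie_ids_to_get.tolist())
```

Because the filter is recomputed every time the loop runs, restarting after an interruption only requests the movies that are still missing.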

# Part 2b. Exploratory DATA Analysis

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

## Load .csv.gz
- [ ] Load in your csv.gz's of results for each year extracted.

- [ ]    Concatenate the data into 1 dataframe for the remainder of the analysis.

In [10]:
df = pd.read_csv('Data/final_tmdb_data_2000.csv.gz')
df.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.115,2133.0,PG


In [29]:
#df2 = pd.DataFrame()
df = pd.DataFrame()
#Load files to a variable
for i in range(2000,2021):
   f = f'Data/final_tmdb_data_{i}.csv.gz'
   #with open(f'Data/tmdb_api_results_{i}.json', 'r') as f:
   df = pd.concat([df,pd.read_csv(f)])
    #df2 = pd.concat([df2,pd.read_json(f)])

df.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.115,2133.0,PG
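An alternative to growing a dataframe inside the loop is to collect the yearly frames in a list and concatenate once. The sketch below writes two small made-up files to a temporary folder so it runs on its own; the file-name pattern matches the one used above, and `ignore_index=True` is an optional choice that gives the combined frame a fresh 0..n-1 index instead of repeating each file's row numbers.

```python
import os
import tempfile
import pandas as pd

years = (2000, 2001)

with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny stand-in result files (hypothetical rows).
    for year in years:
        toy = pd.DataFrame({"imdb_id": [f"tt{year}a", f"tt{year}b"],
                            "budget": [0.0, 1.0]})
        toy.to_csv(os.path.join(tmp, f"final_tmdb_data_{year}.csv.gz"),
                   compression="gzip", index=False)

    # Read every yearly file into a list of dataframes.
    frames = [pd.read_csv(os.path.join(tmp, f"final_tmdb_data_{year}.csv.gz"))
              for year in years]

# Concatenate once, with a fresh running index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```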


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27967 entries, 0 to 3891
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                27967 non-null  object 
 1   adult                  27946 non-null  float64
 2   backdrop_path          19967 non-null  object 
 3   belongs_to_collection  1709 non-null   object 
 4   budget                 27946 non-null  float64
 5   genres                 27946 non-null  object 
 6   homepage               7116 non-null   object 
 7   id                     27946 non-null  float64
 8   original_language      27946 non-null  object 
 9   original_title         27946 non-null  object 
 10  overview               27376 non-null  object 
 11  popularity             27946 non-null  float64
 12  poster_path            26829 non-null  object 
 13  production_companies   27946 non-null  object 
 14  production_countries   27946 non-null  object 
 15  rel

## EDA 
1.   How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
  -  Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.
2.   How many movies are there in each of the certification categories (G/PG/PG-13/R)?
3.  What is the average revenue per certification category?
4.  What is the average budget per certification category?





In [6]:
#1 How many movies have valid financial info
len(df.loc[(df["budget"]>0) | (df['revenue']>0)])

5534
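The "valid financial information" mask can be sanity-checked on toy data (made-up values). Comparisons against NaN evaluate to False, so rows with missing budget and revenue, such as the placeholder record, fall out of the mask along with the rows where both values are 0.

```python
import numpy as np
import pandas as pd

# Toy rows: both zero, budget only, revenue only, both missing.
toy = pd.DataFrame({"budget": [0.0, 100.0, 0.0, np.nan],
                    "revenue": [0.0, 0.0, 50.0, np.nan]})

# True when at least one of the two values is greater than zero.
has_financials = (toy["budget"] > 0) | (toy["revenue"] > 0)
n_valid = int(has_financials.sum())
print(n_valid)
```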

In [17]:
#Drop movies with null budget or revenue
df.dropna(subset=["budget","revenue"],inplace=True)
#Drop zero values
df = df.loc[(df["budget"]>0) | (df['revenue']>0)]
#check the resize
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5534 entries, 1 to 3881
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                5534 non-null   object 
 1   adult                  5534 non-null   float64
 2   backdrop_path          4544 non-null   object 
 3   belongs_to_collection  671 non-null    object 
 4   budget                 5534 non-null   float64
 5   genres                 5534 non-null   object 
 6   homepage               2181 non-null   object 
 7   id                     5534 non-null   float64
 8   original_language      5534 non-null   object 
 9   original_title         5534 non-null   object 
 10  overview               5493 non-null   object 
 11  popularity             5534 non-null   float64
 12  poster_path            5409 non-null   object 
 13  production_companies   5534 non-null   object 
 14  production_countries   5534 non-null   object 
 15  rele

In [18]:
#2 How many movies are in each certification category
df["certification"].value_counts()

R          1084
PG-13       674
NR          382
PG          269
G            57
NC-17        10
Unrated       1
Name: certification, dtype: int64

In [22]:
#3 What is the average revenue per certification category?
df.groupby('certification')["revenue"].mean()

certification
G          7.037925e+07
NC-17      1.407171e+06
NR         6.072150e+06
PG         1.450376e+08
PG-13      1.291000e+08
R          3.549841e+07
Unrated    0.000000e+00
Name: revenue, dtype: float64

In [23]:
#4 What is the average budget per certification category?
df.groupby('certification')["budget"].mean()

certification
G          2.375433e+07
NC-17      1.186300e+06
NR         2.679198e+06
PG         4.195983e+07
PG-13      4.018244e+07
R          1.518844e+07
Unrated    2.600000e+02
Name: budget, dtype: float64

# Part 2: Deliverables

After you have joined the tmdb results into 1 dataframe in the EDA Notebook,

- Save a final merged .csv.gz of all of the tmdb api data
- The file name should be "tmdb_results_combined.csv.gz"
- Make sure this is pushed to your github repository along with all of your code
  - One code file for API calls
  - One code file for EDA
- Submit the link

In [31]:
df.to_csv("Data/tmdb_results_combined.csv.gz",
                         compression="gzip", index=False)