# EDA - Extract from TMDB (Core)

## Deliverables

- Save a final merged .csv.gz of all of the tmdb api data
- The file name should be "tmdb_results_combined.csv.gz"
- Make sure this is pushed to your github repository along with all of your code
- One code file for API calls
- One code file for EDA

## Tasks

1. Load in your csv.gz's of results for each year extracted.
> - Concatenate the data into 1 dataframe for the remainder of the analysis.
2. Once you have your data from the API, they would like you to perform some light EDA to show:
> 1. How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
>> - Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.
> 2. How many movies are there in each of the certification categories (G/PG/PG-13/R)?
> 3. What is the average revenue per certification category?
> 4. What is the average budget per certification category?

## Imports

In [2]:
import pandas as pd
import numpy as np
import json
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

## Functions

## Load & Inspect the Data

In [3]:
#load the 2000 data created from the API calls section
df_2000 = pd.read_csv("Data/final_tmdb_data_2000.csv.gz", low_memory = False)

In [4]:
#Load the 2001 data created from the API calls section
df_2001 = pd.read_csv("Data/final_tmdb_data_2001.csv.gz", low_memory = False)

In [5]:
df_2001.head(2)

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0035423,0.0,/hfeiSfWYujh6MKhtGTXyK3DD4nN.jpg,,48000000.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 14, ...",,11232.0,en,Kate & Leopold,...,76019048.0,118.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"If they lived in the same century, they'd be p...",Kate & Leopold,0.0,6.33,1195.0,PG-13


In [6]:
df_2000.head(2)

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,


In [7]:
df_2000.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1241 entries, 0 to 1240
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                1241 non-null   object 
 1   adult                  1240 non-null   float64
 2   backdrop_path          690 non-null    object 
 3   belongs_to_collection  116 non-null    object 
 4   budget                 1240 non-null   float64
 5   genres                 1240 non-null   object 
 6   homepage               63 non-null     object 
 7   id                     1240 non-null   float64
 8   original_language      1240 non-null   object 
 9   original_title         1240 non-null   object 
 10  overview               1219 non-null   object 
 11  popularity             1240 non-null   float64
 12  poster_path            1119 non-null   object 
 13  production_companies   1240 non-null   object 
 14  production_countries   1240 non-null   object 
 15  rele

In [8]:
df_2001.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                1337 non-null   object 
 1   adult                  1336 non-null   float64
 2   backdrop_path          720 non-null    object 
 3   belongs_to_collection  92 non-null     object 
 4   budget                 1336 non-null   float64
 5   genres                 1336 non-null   object 
 6   homepage               108 non-null    object 
 7   id                     1336 non-null   float64
 8   original_language      1336 non-null   object 
 9   original_title         1336 non-null   object 
 10  overview               1308 non-null   object 
 11  popularity             1336 non-null   float64
 12  poster_path            1201 non-null   object 
 13  production_companies   1336 non-null   object 
 14  production_countries   1336 non-null   object 
 15  rele

## Combine the Data

In [9]:
#Concatenate both data sources into one
merged_df = pd.concat((df_2000, df_2001), axis = 0)
merged_df.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,"{'id': 1131062, 'name': 'Wong Kar-Wai’s Love T...",150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,14204632.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.109,2229.0,PG


In [10]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2578 entries, 0 to 1336
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                2578 non-null   object 
 1   adult                  2576 non-null   float64
 2   backdrop_path          1410 non-null   object 
 3   belongs_to_collection  208 non-null    object 
 4   budget                 2576 non-null   float64
 5   genres                 2576 non-null   object 
 6   homepage               171 non-null    object 
 7   id                     2576 non-null   float64
 8   original_language      2576 non-null   object 
 9   original_title         2576 non-null   object 
 10  overview               2527 non-null   object 
 11  popularity             2576 non-null   float64
 12  poster_path            2320 non-null   object 
 13  production_companies   2576 non-null   object 
 14  production_countries   2576 non-null   object 
 15  rele

## EDA

**1.) How many movies had at least some valid financial information (values > 0 for budget OR revenue)?** <br>
*Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.*

In [11]:
#dropping rows where the movie has a 0 for budget and revenue
merged_df_filtered = merged_df[merged_df['budget'] != 0]

In [12]:
#dropping rows where the movie has a 0 for budget and revenue
merged_df_filtered = merged_df[merged_df['revenue'] != 0]

In [13]:
print(f'There are {merged_df_filtered.shape[0]} movies that have at least some form of valid financial information')

There are 450 movies that have at least some form of valid financial information


**2.) How many movies are there in each of the certification categories (G/PG/PG-13/R)?**

In [14]:
# Group the DataFrame by "Certification" and calculate the sum
movies_per_certification = merged_df_filtered.groupby('certification').size()

# Display the resulting DataFrame
print(movies_per_certification)

certification
G         15
NR        12
PG        31
PG-13    123
R        170
dtype: int64


- Based on grouping by certification category, R movies account for the largest number

**3.) What is the average revenue per certification category?**

In [15]:
# Group the DataFrame by "Certification" and find the average revenue for each
average_revenue_certification = merged_df_filtered.groupby('certification')['revenue'].mean()

# Display the resulting DataFrame
print(average_revenue_certification)

certification
G        1.173648e+08
NR       1.358396e+07
PG       1.249603e+08
PG-13    1.057456e+08
R        4.430799e+07
Name: revenue, dtype: float64


- Based on grouping by certification category, R movies have made the most revenue

**4.) What is the average budget per certification category?**

In [16]:
# Group the DataFrame by "Certification" and find the average budget for each
average_budget_certification = merged_df_filtered.groupby('certification')['budget'].mean()

# Display the resulting DataFrame
print(average_budget_certification)

certification
G        3.813333e+07
NR       7.891667e+06
PG       5.029830e+07
PG-13    4.435282e+07
R        2.248008e+07
Name: budget, dtype: float64


- Based on grouping by certification category, NR movies seems to have the highest budget

### Export to csv.gz

In [17]:
## Save current unfiltered dataframe to file.
merged_df.to_csv("Data/tmdb_results_combined.csv.gz",compression='gzip',index=False)