# Successful Movies
---

## Project 3 - Part 2 Extract from TMDB (Core)

## EDA

* ### ***What Makes a Movie Successful?***

* ### Ingrid Arbieto Nelson

## Business Problem
> *For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful, and will provide recommendations to the stakeholder on how to make a successful movie.*

Over the course of this project, you will:

* **Part 1**: Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.
* **Part 2**: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
* **Part 3**: Construct and export a MySQL database using your data.
* **Part 4**: Apply hypothesis testing to explore what makes a movie successful.
* **Part 5** (Optional): Produce a Linear Regression model to predict movie performance.

<img src ="Images/theater.png">

### Exploratory Data Analysis
* Load in your csv.gz's of results for each year extracted.
  * Concatenate the data into 1 dataframe for the remainder of the analysis.
* Once you have your data from the API, they would like you to perform some light EDA to show:
  1. How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
    * Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.
  2. How many movies are there in each of the certification categories (G/PG/PG-13/R)?
  3. What is the average revenue per certification category?
  4. What is the average budget per certification category?

### Deliverables
After you have joined the TMDB results into one dataframe in the EDA notebook:

* Save a final merged .csv.gz of all of the TMDB API data *(the file name should be "tmdb_results_combined.csv.gz")*
* Make sure this is pushed to your GitHub repository along with all of your code
  * One code file for API calls
  * One code file for EDA
* Submit the link



## EDA Code Section

### Imports

In [1]:
import pandas as pd

### Load TMDB Movie Files

In [2]:
tmdbdata_2000 = pd.read_csv('Data/final_tmdb_data_2000.csv.gz')
tmdbdata_2000.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.114,2188.0,PG


In [3]:
# drop the placeholder first row (imdb_id 0, all other fields NaN) from the 2000 movies
tmdbdata_2000.drop([0],axis=0, inplace=True)
tmdbdata_2000.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.114,2188.0,PG
5,tt0118852,0.0,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,0.0,99.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,0.0,6.8,49.0,R


In [4]:
tmdbdata_2001 = pd.read_csv('Data/final_tmdb_data_2001.csv.gz')
tmdbdata_2001.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0035423,0.0,/hfeiSfWYujh6MKhtGTXyK3DD4nN.jpg,,48000000.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 14, ...",,11232.0,en,Kate & Leopold,...,76019048.0,118.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"If they lived in the same century, they'd be p...",Kate & Leopold,0.0,6.325,1185.0,PG-13
2,tt0114447,0.0,,,0.0,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",,151007.0,en,The Silent Force,...,0.0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,They left him for dead... They should have fin...,The Silent Force,0.0,5.0,3.0,
3,tt0118589,0.0,/9NZAirJahVilTiDNCHLFcdkwkiy.jpg,,22000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10402, 'n...",,10696.0,en,Glitter,...,5271666.0,104.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"In music she found her dream, her love, herself.",Glitter,0.0,4.536,124.0,PG-13
4,tt0118652,0.0,/mWxJEFRMvkG4UItYJkRDMgWQ08Y.jpg,,1000000.0,"[{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n...",,17140.0,en,The Attic Expeditions,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,His search for peace of mind... will leave his...,The Attic Expeditions,0.0,5.1,29.0,R


In [5]:
# drop the placeholder first row (imdb_id 0, all other fields NaN) from the 2001 movies
tmdbdata_2001.drop([0],axis=0, inplace=True)
tmdbdata_2001.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
1,tt0035423,0.0,/hfeiSfWYujh6MKhtGTXyK3DD4nN.jpg,,48000000.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 14, ...",,11232.0,en,Kate & Leopold,...,76019048.0,118.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"If they lived in the same century, they'd be p...",Kate & Leopold,0.0,6.325,1185.0,PG-13
2,tt0114447,0.0,,,0.0,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",,151007.0,en,The Silent Force,...,0.0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,They left him for dead... They should have fin...,The Silent Force,0.0,5.0,3.0,
3,tt0118589,0.0,/9NZAirJahVilTiDNCHLFcdkwkiy.jpg,,22000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10402, 'n...",,10696.0,en,Glitter,...,5271666.0,104.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"In music she found her dream, her love, herself.",Glitter,0.0,4.536,124.0,PG-13
4,tt0118652,0.0,/mWxJEFRMvkG4UItYJkRDMgWQ08Y.jpg,,1000000.0,"[{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n...",,17140.0,en,The Attic Expeditions,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,His search for peace of mind... will leave his...,The Attic Expeditions,0.0,5.1,29.0,R
5,tt0119004,0.0,/7xrlSPGDO4CDT6IHTctDlkYxTzw.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,37857.0,en,Don's Plum,...,6297.0,89.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Don's Plum,0.0,5.4,66.0,
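
Both yearly files begin with a placeholder row whose `imdb_id` is `0` and whose other fields are empty. The sketch below shows one way to drop such rows by value instead of by position, assuming the placeholder is always written as `0`. The helper name `load_tmdb_year` and the three-column sample file are hypothetical and only for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd


def load_tmdb_year(path):
    """Read one yearly TMDB results file and drop placeholder rows.

    Hypothetical helper: assumes placeholder rows carry imdb_id == "0".
    """
    df = pd.read_csv(path)  # gzip compression is inferred from the .gz suffix
    is_placeholder = df["imdb_id"].astype(str) == "0"
    return df.loc[~is_placeholder].reset_index(drop=True)


# Tiny stand-in for a yearly file (values taken from the rows shown above).
csv_text = (
    "imdb_id,budget,revenue\n"
    "0,,\n"
    "tt0113026,10000000.0,0.0\n"
    "tt0118694,150000.0,12854953.0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_tmdb_data.csv.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(csv_text)
    sample = load_tmdb_year(sample_path)

print(sample)
```

Filtering on the value keeps working if a placeholder row ever appears somewhere other than index 0, which `drop([0])` would miss.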


In [6]:
tmdbdata_2000.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1234 entries, 1 to 1234
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                1234 non-null   object 
 1   adult                  1234 non-null   float64
 2   backdrop_path          679 non-null    object 
 3   belongs_to_collection  113 non-null    object 
 4   budget                 1234 non-null   float64
 5   genres                 1234 non-null   object 
 6   homepage               64 non-null     object 
 7   id                     1234 non-null   float64
 8   original_language      1234 non-null   object 
 9   original_title         1234 non-null   object 
 10  overview               1213 non-null   object 
 11  popularity             1234 non-null   float64
 12  poster_path            1112 non-null   object 
 13  production_companies   1234 non-null   object 
 14  production_countries   1234 non-null   object 
 15  rele

In [7]:
tmdbdata_2001.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1328 entries, 1 to 1328
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                1328 non-null   object 
 1   adult                  1328 non-null   float64
 2   backdrop_path          712 non-null    object 
 3   belongs_to_collection  93 non-null     object 
 4   budget                 1328 non-null   float64
 5   genres                 1328 non-null   object 
 6   homepage               109 non-null    object 
 7   id                     1328 non-null   float64
 8   original_language      1328 non-null   object 
 9   original_title         1328 non-null   object 
 10  overview               1299 non-null   object 
 11  popularity             1328 non-null   float64
 12  poster_path            1193 non-null   object 
 13  production_companies   1328 non-null   object 
 14  production_countries   1328 non-null   object 
 15  rele

### Combine 2000 & 2001 Movies

In [8]:
## concatenate the tmdb 2000 & 2001 movies
combined_tmdb = pd.concat([tmdbdata_2000, tmdbdata_2001],
                      ignore_index=True)
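
If more years are extracted later, the same concatenation can be written over a collection of frames. This is a minimal sketch on small stand-in frames; the `source_year` column is an assumption added for illustration and is not part of the TMDB data.

```python
import pandas as pd

# Stand-in frames keyed by year; real frames would come from the yearly csv.gz files.
frames_by_year = {
    2000: pd.DataFrame(
        {"imdb_id": ["tt0113026", "tt0118694"], "budget": [10000000.0, 150000.0]}
    ),
    2001: pd.DataFrame({"imdb_id": ["tt0035423"], "budget": [48000000.0]}),
}

# Tag each frame with the year it came from, then stack them with a fresh index.
combined = pd.concat(
    [df.assign(source_year=year) for year, df in frames_by_year.items()],
    ignore_index=True,
)

# A movie should appear in only one yearly file, so IDs are expected to be unique.
ids_are_unique = bool(combined["imdb_id"].is_unique)
print(combined)
print("unique imdb_id values:", ids_are_unique)
```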

In [9]:
combined_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2562 entries, 0 to 2561
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                2562 non-null   object 
 1   adult                  2562 non-null   float64
 2   backdrop_path          1391 non-null   object 
 3   belongs_to_collection  206 non-null    object 
 4   budget                 2562 non-null   float64
 5   genres                 2562 non-null   object 
 6   homepage               173 non-null    object 
 7   id                     2562 non-null   float64
 8   original_language      2562 non-null   object 
 9   original_title         2562 non-null   object 
 10  overview               2512 non-null   object 
 11  popularity             2562 non-null   float64
 12  poster_path            2305 non-null   object 
 13  production_companies   2562 non-null   object 
 14  production_countries   2562 non-null   object 
 15  rele

### 1. How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
* Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.

In [10]:
budget_plus = combined_tmdb['budget'] > 0

In [11]:
revenue_plus = combined_tmdb['revenue'] > 0

In [12]:
budget_df = combined_tmdb.loc[budget_plus,:]

In [13]:
revenue_df = combined_tmdb.loc[revenue_plus,:]

In [14]:
budget_df['imdb_id'].count()

545

In [16]:
revenue_df['imdb_id'].count()

446

In [17]:
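# NOTE: adding the two counts includes movies with BOTH budget > 0 and revenue > 0 twice,
# so this total is an upper bound on the "budget OR revenue" count, not the count itself.
total_count = budget_df['imdb_id'].count() + revenue_df['imdb_id'].count()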
total_count

991
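
Summing the budget count and the revenue count counts a movie twice when it has both values. A sketch of the "budget OR revenue" count with a single boolean mask is shown below, using ten rows from the `head()` outputs above as sample data. The names `either`, `n_either`, and `valid` are illustrative, and the numbers apply to this sample only, not to the full dataset.

```python
import pandas as pd

# Budget and revenue for the ten movies shown in the head() outputs above.
movies = pd.DataFrame(
    {
        "imdb_id": [
            "tt0113026", "tt0113092", "tt0116391", "tt0118694", "tt0118852",
            "tt0035423", "tt0114447", "tt0118589", "tt0118652", "tt0119004",
        ],
        "budget": [10000000.0, 0.0, 0.0, 150000.0, 0.0,
                   48000000.0, 0.0, 22000000.0, 1000000.0, 0.0],
        "revenue": [0.0, 0.0, 0.0, 12854953.0, 0.0,
                    76019048.0, 0.0, 5271666.0, 0.0, 6297.0],
    }
)

budget_plus = movies["budget"] > 0
revenue_plus = movies["revenue"] > 0

# Element-wise OR: True when a movie has a budget, a revenue, or both.
either = budget_plus | revenue_plus

n_either = int(either.sum())
n_both = int((budget_plus & revenue_plus).sum())
n_summed = int(budget_plus.sum() + revenue_plus.sum())

print("budget OR revenue:", n_either)
print("sum of the two counts:", n_summed, "(overcounts by", n_both, ")")

# Rows to keep for the remaining questions (drops movies with 0 for both).
valid = movies.loc[either].copy()
```

The same mask also gives the filtered frame that the task asks to use for the later visualizations.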

### 2. How many movies are there in each of the certification categories (G/PG/PG-13/R)?

In [18]:
combined_tmdb.groupby('certification')['imdb_id'].count()

certification
G           24
NC-17        6
NR          73
PG          63
PG-13      182
R          466
Unrated      1
Name: imdb_id, dtype: int64
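
`groupby` leaves out rows whose certification is missing, so the counts above do not add up to the number of movies. The sketch below counts the missing values as well and restricts the result to the four requested categories. It uses the certifications of the ten sample rows shown earlier, and the variable names are illustrative.

```python
import pandas as pd

# Certifications of the ten movies shown in the head() outputs above.
movies = pd.DataFrame(
    {
        "imdb_id": [
            "tt0113026", "tt0113092", "tt0116391", "tt0118694", "tt0118852",
            "tt0035423", "tt0114447", "tt0118589", "tt0118652", "tt0119004",
        ],
        "certification": [None, None, None, "PG", "R",
                          "PG-13", None, "PG-13", "R", None],
    }
)

# Count every value, including missing certifications.
all_counts = movies["certification"].value_counts(dropna=False)
n_missing = int(movies["certification"].isna().sum())

# Keep only the categories the stakeholder asked about, in a fixed order.
requested = ["G", "PG", "PG-13", "R"]
requested_counts = (
    movies["certification"].value_counts().reindex(requested, fill_value=0)
)

print(all_counts)
print(requested_counts)
```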

### 3. What is the average revenue per certification category?

In [19]:
combined_tmdb.groupby('certification')['revenue'].mean()

certification
G          7.218533e+07
NC-17      0.000000e+00
NR         2.232979e+06
PG         6.191177e+07
PG-13      7.146544e+07
R          1.614397e+07
Unrated    0.000000e+00
Name: revenue, dtype: float64

### 4. What is the average budget per certification category?

In [20]:
combined_tmdb.groupby('certification')['budget'].mean()

certification
G          2.383333e+07
NC-17      0.000000e+00
NR         1.467673e+06
PG         2.490472e+07
PG-13      3.094592e+07
R          9.700224e+06
Unrated    0.000000e+00
Name: budget, dtype: float64
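
The averages above are computed on every row of `combined_tmdb`, including movies with 0 for both budget and revenue, which the task asks to exclude. A sketch of the filtered version follows, computing both averages in one pass on a handful of the sample rows shown earlier. The resulting numbers describe the sample only.

```python
import pandas as pd

# Seven of the sample movies shown above (certification, budget, revenue).
movies = pd.DataFrame(
    {
        "certification": [None, "PG", "R", "PG-13", "PG-13", "R", None],
        "budget": [10000000.0, 150000.0, 0.0, 48000000.0,
                   22000000.0, 1000000.0, 0.0],
        "revenue": [0.0, 12854953.0, 0.0, 76019048.0,
                    5271666.0, 0.0, 6297.0],
    }
)

# Keep movies with at least one positive financial value.
has_financials = (movies["budget"] > 0) | (movies["revenue"] > 0)

# Mean budget and revenue per certification; rows without a certification
# are left out by groupby.
averages = (
    movies.loc[has_financials]
    .groupby("certification")[["budget", "revenue"]]
    .mean()
)

print(averages)
```

Zeros that remain after this filter (for example a known budget with revenue recorded as 0) still pull the means down, so treating 0 as missing before averaging is another option worth considering.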

## All TMDB Movie Data

In [21]:
title_basics = pd.read_csv('Data/title_basics.csv.gz')
title_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,126,Drama


In [23]:
## example tmdb data combined & basics
merge_tmdbdata_basics = pd.merge(combined_tmdb, title_basics, left_on='imdb_id', right_on='tconst', how='inner')
merge_tmdbdata_basics.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres_x,homepage,id,original_language,original_title,...,vote_count,certification,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres_y
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,22.0,,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,86,"Musical,Romance"
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,8.0,,tt0113092,movie,For the Cause,For the Cause,0,2000.0,100,"Action,Adventure,Drama"
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,1.0,,tt0116391,movie,Gang,Gang,0,2000.0,167,"Action,Crime,Drama"
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,2188.0,PG,tt0118694,movie,In the Mood for Love,Fa yeung nin wah,0,2000.0,98,"Drama,Romance"
4,tt0118852,0.0,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,49.0,R,tt0118852,movie,Chinese Coffee,Chinese Coffee,0,2000.0,99,Drama


In [24]:
# drop duplicate column merged on
merge_tmdbdata_basics = merge_tmdbdata_basics.drop(columns=['tconst'])
merge_tmdbdata_basics.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres_x,homepage,id,original_language,original_title,...,vote_average,vote_count,certification,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres_y
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,5.5,22.0,,movie,The Fantasticks,The Fantasticks,0,2000.0,86,"Musical,Romance"
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,5.1,8.0,,movie,For the Cause,For the Cause,0,2000.0,100,"Action,Adventure,Drama"
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,4.0,1.0,,movie,Gang,Gang,0,2000.0,167,"Action,Crime,Drama"
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,8.114,2188.0,PG,movie,In the Mood for Love,Fa yeung nin wah,0,2000.0,98,"Drama,Romance"
4,tt0118852,0.0,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,6.8,49.0,R,movie,Chinese Coffee,Chinese Coffee,0,2000.0,99,Drama
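
An inner merge silently drops TMDB rows that have no match in `title_basics`. The sketch below shows one way to make that visible, using the `validate` and `indicator` parameters of `pd.merge` on small stand-in frames. The ID `tt9999999` is a made-up value used only to produce an unmatched row.

```python
import pandas as pd

tmdb = pd.DataFrame(
    {
        "imdb_id": ["tt0113026", "tt0118694", "tt9999999"],  # last ID is hypothetical
        "budget": [10000000.0, 150000.0, 0.0],
    }
)
basics = pd.DataFrame(
    {"tconst": ["tt0113026", "tt0118694"], "startYear": [2000.0, 2000.0]}
)

# validate raises MergeError if either key column contains duplicates;
# indicator adds a _merge column telling where each row came from.
checked = pd.merge(
    tmdb,
    basics,
    left_on="imdb_id",
    right_on="tconst",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# IDs present in the TMDB data but missing from title_basics.
unmatched = checked.loc[checked["_merge"] == "left_only", "imdb_id"].tolist()

# Equivalent of the inner merge, with the helper columns removed.
merged = checked.loc[checked["_merge"] == "both"].drop(columns=["tconst", "_merge"])

print("unmatched:", unmatched)
print(merged)
```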


In [25]:
title_ratings = pd.read_csv('Data/title_ratings.csv.gz')
title_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1966
1,tt0000002,5.8,264
2,tt0000005,6.2,2608
3,tt0000006,5.2,181
4,tt0000007,5.4,816


In [26]:
## example tmdb data combined & basics & ratings
merge_tmdbbasics_ratings = pd.merge(merge_tmdbdata_basics, title_ratings, left_on='imdb_id', right_on='tconst', how='inner')
merge_tmdbbasics_ratings.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres_x,homepage,id,original_language,original_title,...,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres_y,tconst,averageRating,numVotes
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,movie,The Fantasticks,The Fantasticks,0,2000.0,86,"Musical,Romance",tt0113026,5.6,1398
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,movie,For the Cause,For the Cause,0,2000.0,100,"Action,Adventure,Drama",tt0113092,3.4,838
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,movie,Gang,Gang,0,2000.0,167,"Action,Crime,Drama",tt0116391,6.2,260
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,movie,In the Mood for Love,Fa yeung nin wah,0,2000.0,98,"Drama,Romance",tt0118694,8.1,155839
4,tt0118852,0.0,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,movie,Chinese Coffee,Chinese Coffee,0,2000.0,99,Drama,tt0118852,7.1,4466


In [27]:
# drop duplicate column merged on
merge_tmdbbasics_ratings = merge_tmdbbasics_ratings.drop(columns=['tconst'])
merge_tmdbbasics_ratings.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres_x,homepage,id,original_language,original_title,...,certification,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres_y,averageRating,numVotes
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,,movie,The Fantasticks,The Fantasticks,0,2000.0,86,"Musical,Romance",5.6,1398
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,,movie,For the Cause,For the Cause,0,2000.0,100,"Action,Adventure,Drama",3.4,838
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,,movie,Gang,Gang,0,2000.0,167,"Action,Crime,Drama",6.2,260
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,PG,movie,In the Mood for Love,Fa yeung nin wah,0,2000.0,98,"Drama,Romance",8.1,155839
4,tt0118852,0.0,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,R,movie,Chinese Coffee,Chinese Coffee,0,2000.0,99,Drama,7.1,4466


In [28]:
title_akas = pd.read_csv('Data/title_akas.csv.gz')
title_akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [29]:
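## example tmdb data combined & basics & ratings & akas
## note: title_akas can hold several rows per title, so this merge can repeat a movie
## (tt0113092 appears twice in the output below)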
merge_tmdbbasicsrate_akas = pd.merge(merge_tmdbbasics_ratings, title_akas, left_on='imdb_id', right_on='titleId', how='inner')
merge_tmdbbasicsrate_akas.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres_x,homepage,id,original_language,original_title,...,averageRating,numVotes,titleId,ordering,title_y,region,language,types,attributes,isOriginalTitle
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,5.6,1398,tt0113026,12,The Fantasticks,US,,imdbDisplay,,0.0
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,3.4,838,tt0113092,17,Final Encounter,US,,dvd,,0.0
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,3.4,838,tt0113092,2,For the Cause,US,,imdbDisplay,,0.0
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,6.2,260,tt0116391,4,Gang,US,,imdbDisplay,,0.0
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,8.1,155839,tt0118694,51,In the Mood for Love,US,,imdbDisplay,,0.0


In [30]:
# drop duplicate column merged on
merge_tmdbbasicsrate_akas = merge_tmdbbasicsrate_akas.drop(columns=['titleId'])
merge_tmdbbasicsrate_akas.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres_x,homepage,id,original_language,original_title,...,genres_y,averageRating,numVotes,ordering,title_y,region,language,types,attributes,isOriginalTitle
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,"Musical,Romance",5.6,1398,12,The Fantasticks,US,,imdbDisplay,,0.0
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,"Action,Adventure,Drama",3.4,838,17,Final Encounter,US,,dvd,,0.0
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,"Action,Adventure,Drama",3.4,838,2,For the Cause,US,,imdbDisplay,,0.0
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,"Action,Crime,Drama",6.2,260,4,Gang,US,,imdbDisplay,,0.0
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,"Drama,Romance",8.1,155839,51,In the Mood for Love,US,,imdbDisplay,,0.0
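
Because `title_akas` can list several titles for one movie, the merged frame can contain the same `imdb_id` more than once (tt0113092 appears twice above). The sketch below measures that on stand-in data and shows one possible way to return to one row per movie. Keeping the `imdbDisplay` row is an assumption made for illustration, not a rule taken from the project brief.

```python
import pandas as pd

movies = pd.DataFrame(
    {"imdb_id": ["tt0113026", "tt0113092"], "budget": [10000000.0, 0.0]}
)
akas = pd.DataFrame(
    {
        "titleId": ["tt0113026", "tt0113092", "tt0113092"],
        "title": ["The Fantasticks", "Final Encounter", "For the Cause"],
        "types": ["imdbDisplay", "dvd", "imdbDisplay"],
    }
)

merged = pd.merge(
    movies, akas, left_on="imdb_id", right_on="titleId", how="inner"
).drop(columns=["titleId"])

# Rows beyond the first occurrence of each movie.
repeated_rows = int(merged["imdb_id"].duplicated().sum())

# One possible rule (an assumption): keep the imdbDisplay title for each movie.
one_per_movie = (
    merged.loc[merged["types"] == "imdbDisplay"]
    .drop_duplicates(subset="imdb_id", keep="first")
    .reset_index(drop=True)
)

print("repeated rows:", repeated_rows)
print(one_per_movie)
```

Repeated rows matter for any later aggregation: a sum or mean of `budget` or `revenue` over the merged frame would weight movies by their number of alternate titles.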


In [31]:
# info on all merged df
merge_tmdbbasicsrate_akas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3169 entries, 0 to 3168
Data columns (total 42 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                3169 non-null   object 
 1   adult                  3169 non-null   float64
 2   backdrop_path          1837 non-null   object 
 3   belongs_to_collection  349 non-null    object 
 4   budget                 3169 non-null   float64
 5   genres_x               3169 non-null   object 
 6   homepage               245 non-null    object 
 7   id                     3169 non-null   float64
 8   original_language      3169 non-null   object 
 9   original_title         3169 non-null   object 
 10  overview               3122 non-null   object 
 11  popularity             3169 non-null   float64
 12  poster_path            2891 non-null   object 
 13  production_companies   3169 non-null   object 
 14  production_countries   3169 non-null   object 
 15  rele

In [32]:
# write merged results to final combined file
merge_tmdbbasicsrate_akas.to_csv("Data/tmdb_results_combined.csv.gz", compression="gzip", index=False)
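
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below does this in a temporary directory with a small stand-in frame, so it does not touch the real `Data/` folder.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the final merged frame.
combined = pd.DataFrame(
    {
        "imdb_id": ["tt0113026", "tt0118694"],
        "budget": [10000000.0, 150000.0],
        "revenue": [0.0, 12854953.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "tmdb_results_combined.csv.gz")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    combined.to_csv(out_path, compression="gzip", index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == combined.shape
same_columns = reloaded.columns.tolist() == combined.columns.tolist()
print("shape preserved:", same_shape)
print("columns preserved:", same_columns)
```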