# Part 3 - ETL & MySQL Database (Revisited)

- 11/22/22
- Revisiting to normalize more tables for Tableau viz.

## Introduction

Your stakeholder wants you to take the data you have been cleaning and collecting in Parts 1 & 2 of the project and create a MySQL database for them.
<ul>
<li>
<p>You should normalize the tables as best you can before adding them to your new database.</p>
<ul>
<li>Note: an important exception to their request is that they would like you to keep all of the data from the TMDB API in 1 table together (even though it will not be perfectly normalized).</li>
<li>You only need to keep the imdb_id, revenue, budget, and certification columns.</li>
</ul>
</li>
</ul>
<h4>Required Transformation steps:</h4>
<ul>
<li>
<p><strong>Normalize Genre:</strong></p>
<ul>
<li>Convert the single string of genres from title basics into 2 new tables.
<ol>
<li>
<p><code>title_genres</code>:  with the columns:</p>
<ul>
<li>tconst</li>
<li>genre_id</li>
</ul>
</li>
<li>
<p><code>genres</code>:</p>
<ul>
<li>genre_id</li>
<li>genre_name</li>
</ul>
</li>
</ol>
</li>
</ul>
</li>
<li>
<p><strong>Discard unnecessary information:</strong></p>
<ul>
<li>For the <code>title basics</code> table, drop the following columns:
<ul>
<li>"original_title" (we will use the primary title column instead)</li>
<li>"isAdult" ("Adult" will show up in the genres so this is redundant information).</li>
<li>"titleType" (every row will be a movie).</li>
<li>"genres" and other variants of genre (genre is now represented in the 2 new tables described above).</li>
</ul>
</li>
<li>Do not include the <code>title_akas</code> table in your SQL database.
<ul>
<li>You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4>MySQL Database Requirements</h4>
<ul>
<li>
<p>Use sqlalchemy with pandas to execute your SQL queries inside your notebook.</p>
</li>
<li>
<p>Create a new database on your MySQL server and call it "movies".</p>
</li>
<li>
<p>Make sure to have the following tables in your "movies" database:</p>
<ul>
<li><code>title_basics</code></li>
<li><code>title_ratings</code></li>
<li><code>title_genres</code></li>
<li><code>genres</code></li>
<li><code>tmdb_data</code></li>
</ul>
</li>
<li>
<p>Make sure to set a Primary Key for each table.</p>
</li>
<li>
<p>After creating each table, show the first 5 rows of that table using a SQL query.</p>
</li>
<li>
<p>Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.</p></li>
</ul>
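
To make the target layout concrete, here is a minimal sketch of three of the required tables with their primary keys and the join that links a title to its genres. It is only an illustration: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in so it runs without a server, whereas the assignment itself calls for MySQL through sqlalchemy. The column types and the composite key on `title_genres` are assumptions, and the `sqlite_master` query plays the role that `SHOW TABLES` would play in MySQL.

```python
import sqlite3

# SQLite (in memory) is a stand-in so the sketch runs without a server;
# the assignment itself targets MySQL via sqlalchemy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (
    tconst TEXT PRIMARY KEY,
    primaryTitle TEXT,
    startYear REAL,
    runtimeMinutes INTEGER
);
CREATE TABLE genres (
    genre_id INTEGER PRIMARY KEY,
    genre_name TEXT
);
-- Assumed composite key: one row per (title, genre) pair
CREATE TABLE title_genres (
    tconst TEXT,
    genre_id INTEGER,
    PRIMARY KEY (tconst, genre_id)
);
""")

# A single title with its three genres
conn.execute("INSERT INTO title_basics VALUES ('tt0035423', 'Kate & Leopold', 2001.0, 118)")
conn.executemany("INSERT INTO genres VALUES (?, ?)",
                 [(5, "Comedy"), (10, "Fantasy"), (19, "Romance")])
conn.executemany("INSERT INTO title_genres VALUES (?, ?)",
                 [("tt0035423", 5), ("tt0035423", 10), ("tt0035423", 19)])

# Join title -> bridge table -> genre names
rows = conn.execute("""
    SELECT b.primaryTitle, g.genre_name
    FROM title_basics AS b
    JOIN title_genres AS tg ON tg.tconst = b.tconst
    JOIN genres AS g ON g.genre_id = tg.genre_id
    ORDER BY g.genre_name
""").fetchall()

# SQLite counterpart of listing the tables in the database
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(rows)
print(tables)
```

The bridge table stores only keys, so a genre name is written once in `genres` no matter how many titles use it.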


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os,json
import missingno as ms

plt.rcParams['figure.facecolor'] = 'white'

pd.set_option('display.max_columns',100)

In [2]:
## Folder to save IMDB files
FOLDER = "Data/"
# os.makedirs(FOLDER,exist_ok=True)
sorted(os.listdir(FOLDER))

['.DS_Store',
 '.ipynb_checkpoints',
 'For Tableau',
 'combined_tmdb_api_data.csv.gz',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'final_tmdb_data_2003.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'final_tmdb_data_2005.csv.gz',
 'final_tmdb_data_2006.csv.gz',
 'final_tmdb_data_2007.csv.gz',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2011.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'final_tmdb_data_2013.csv.gz',
 'final_tmdb_data_2014.csv.gz',
 'final_tmdb_data_2015.csv.gz',
 'final_tmdb_data_2016.csv.gz',
 'final_tmdb_data_2017.csv.gz',
 'final_tmdb_data_2018.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'final_tmdb_data_2021.csv.gz',
 'final_tmdb_data_2022.csv.gz',
 'title_akas_cleaned.csv.gz',
 'title_basics_cleaned.csv.gz',
 'title_ratings_cleaned.csv.gz',
 'tmdb_api_results_2000.json']

## Extract: Loading Previous Data

### Title Basics

In [3]:
## title basics
basics = pd.read_csv(f'{FOLDER}title_basics_cleaned.csv.gz',low_memory=False)
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116404 entries, 0 to 116403
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          116404 non-null  object 
 1   titleType       116404 non-null  object 
 2   primaryTitle    116404 non-null  object 
 3   originalTitle   116404 non-null  object 
 4   isAdult         116404 non-null  int64  
 5   startYear       116404 non-null  float64
 6   runtimeMinutes  116404 non-null  int64  
 7   genres          116404 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 7.1+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020.0,70,Drama
2,tt0067683,movie,Workers '71: Nothing About Us Without Us,Robotnicy 1971 - Nic o nas bez nas,0,2006.0,47,Documentary
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"


### Title Ratings

In [4]:
## title ratings
ratings = pd.read_csv(f"{FOLDER}title_ratings_cleaned.csv.gz",low_memory=False)
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87817 entries, 0 to 87816
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         87817 non-null  object 
 1   averageRating  87817 non-null  float64
 2   numVotes       87817 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 2.0+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,85245
1,tt0062336,6.4,163
2,tt0067683,6.4,90
3,tt0069049,6.7,7435
4,tt0088751,5.2,327


### TMDB API Results

- Students should have combined the years 2000 and 2001 as part of Part 2. I forgot to do so, so I am doing it here.

In [5]:
## INSTRUCTOR CODE (not taught to students)
import glob
q = f"{FOLDER}final*.csv.gz"
files = glob.glob(q)
files

['Data/final_tmdb_data_2006.csv.gz',
 'Data/final_tmdb_data_2018.csv.gz',
 'Data/final_tmdb_data_2014.csv.gz',
 'Data/final_tmdb_data_2022.csv.gz',
 'Data/final_tmdb_data_2008.csv.gz',
 'Data/final_tmdb_data_2016.csv.gz',
 'Data/final_tmdb_data_2004.csv.gz',
 'Data/final_tmdb_data_2020.csv.gz',
 'Data/final_tmdb_data_2000.csv.gz',
 'Data/final_tmdb_data_2012.csv.gz',
 'Data/final_tmdb_data_2010.csv.gz',
 'Data/final_tmdb_data_2002.csv.gz',
 'Data/final_tmdb_data_2019.csv.gz',
 'Data/final_tmdb_data_2007.csv.gz',
 'Data/final_tmdb_data_2015.csv.gz',
 'Data/final_tmdb_data_2021.csv.gz',
 'Data/final_tmdb_data_2017.csv.gz',
 'Data/final_tmdb_data_2009.csv.gz',
 'Data/final_tmdb_data_2005.csv.gz',
 'Data/final_tmdb_data_2001.csv.gz',
 'Data/final_tmdb_data_2013.csv.gz',
 'Data/final_tmdb_data_2011.csv.gz',
 'Data/final_tmdb_data_2003.csv.gz']

In [6]:
## INSTRUCTOR CODE (not taught to students)
df = pd.concat([pd.read_csv(f, lineterminator='\n') for f in files])
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 78397 entries, 0 to 1516
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  78397 non-null  bool   
 1   backdrop_path          43654 non-null  object 
 2   belongs_to_collection  3943 non-null   object 
 3   budget                 78397 non-null  int64  
 4   genres                 78397 non-null  object 
 5   homepage               21370 non-null  object 
 6   id                     78397 non-null  int64  
 7   imdb_id                78397 non-null  object 
 8   original_language      78397 non-null  object 
 9   original_title         78397 non-null  object 
 10  overview               76518 non-null  object 
 11  popularity             78397 non-null  float64
 12  poster_path            68663 non-null  object 
 13  production_companies   78397 non-null  object 
 14  production_countries   78397 non-null  object 
 15  rel

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, polic...",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Whispers from a Shallow Grave,False,2.0,2,
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",,58520,tt0197633,en,Live Freaky Die Freaky,It is the year 3069 and the Earth has been rob...,1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Live Freaky Die Freaky,False,3.8,8,NR
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, '...",,459563,tt0204250,en,Death of a Saleswoman,Top-ranking RubberTubber saleswoman Agatha J. ...,0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Gir...","[{'iso_3166_1': 'US', 'name': 'United States o...",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,6 housewives. 1 dead body. A trunk full of pla...,Death of a Saleswoman,False,3.0,1,
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",http://www.universalstudiosentertainment.com/c...,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans ca...",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKX...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, acc...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FW...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Henry Perkins accidentally trades briefcases w...,Funny Money,False,5.4,39,R


In [7]:
## Dropping placeholder rows where imdb_id is '0'
df = df.loc[ df['imdb_id']!='0']
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, polic...",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Whispers from a Shallow Grave,False,2.0,2,
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",,58520,tt0197633,en,Live Freaky Die Freaky,It is the year 3069 and the Earth has been rob...,1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Live Freaky Die Freaky,False,3.8,8,NR
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, '...",,459563,tt0204250,en,Death of a Saleswoman,Top-ranking RubberTubber saleswoman Agatha J. ...,0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Gir...","[{'iso_3166_1': 'US', 'name': 'United States o...",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,6 housewives. 1 dead body. A trunk full of pla...,Death of a Saleswoman,False,3.0,1,
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",http://www.universalstudiosentertainment.com/c...,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans ca...",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKX...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, acc...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FW...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Henry Perkins accidentally trades briefcases w...,Funny Money,False,5.4,39,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1512,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu",Topics about female sexuality are growing in p...,1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cat...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,
1513,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa'...",Released,,"Tokyo, Non-Stop",False,0.0,0,
1514,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zim...,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Muga...",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrb...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2003-08-28,0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Zimbabwe Countdown,False,5.0,4,
1515,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,A Documentary Movie About Card Counting In The...,0.600,,[],[],2003-01-01,0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,The Hot Shoe,False,0.0,0,


In [8]:
df = df.reset_index(drop=True)
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, polic...",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Whispers from a Shallow Grave,False,2.0,2,
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",,58520,tt0197633,en,Live Freaky Die Freaky,It is the year 3069 and the Earth has been rob...,1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Live Freaky Die Freaky,False,3.8,8,NR
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, '...",,459563,tt0204250,en,Death of a Saleswoman,Top-ranking RubberTubber saleswoman Agatha J. ...,0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Gir...","[{'iso_3166_1': 'US', 'name': 'United States o...",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,6 housewives. 1 dead body. A trunk full of pla...,Death of a Saleswoman,False,3.0,1,
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",http://www.universalstudiosentertainment.com/c...,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans ca...",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKX...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, acc...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FW...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Henry Perkins accidentally trades briefcases w...,Funny Money,False,5.4,39,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78392,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu",Topics about female sexuality are growing in p...,1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cat...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,
78393,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa'...",Released,,"Tokyo, Non-Stop",False,0.0,0,
78394,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zim...,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Muga...",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrb...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2003-08-28,0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Zimbabwe Countdown,False,5.0,4,
78395,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,A Documentary Movie About Card Counting In The...,0.600,,[],[],2003-01-01,0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,The Hot Shoe,False,0.0,0,


# Transform

## basics

- normalize and separate genre
- drop the following columns:
    - "originalTitle" (we will use the primary title column instead)
    - "isAdult" ("Adult" will show up in the genres so this is redundant information).
    - "titleType" (every row will be a movie).
    - "genres" and other variants of genre (genre is now represented in the 2 new tables described above).

In [9]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116404 entries, 0 to 116403
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          116404 non-null  object 
 1   titleType       116404 non-null  object 
 2   primaryTitle    116404 non-null  object 
 3   originalTitle   116404 non-null  object 
 4   isAdult         116404 non-null  int64  
 5   startYear       116404 non-null  float64
 6   runtimeMinutes  116404 non-null  int64  
 7   genres          116404 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 7.1+ MB


In [10]:
cols_to_drop = ['originalTitle','isAdult','titleType'] #endYear not req'd
basics = basics.drop(columns=cols_to_drop)
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama
2,tt0067683,Workers '71: Nothing About Us Without Us,2006.0,47,Documentary
3,tt0069049,The Other Side of the Wind,2018.0,122,Drama
4,tt0088751,The Naked Monster,2005.0,100,"Comedy,Horror,Sci-Fi"
...,...,...,...,...,...
116399,tt9915436,Vida em Movimento,2019.0,70,Documentary
116400,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy"
116401,tt9916170,The Rehearsal,2019.0,51,Drama
116402,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller"


### Normalizing Genre

In [11]:
## genres has no missing values (see basics.info() above), so no fill is needed
## Create a new column with the single-string genres as a list of strings
basics['genres_split'] = basics['genres'].str.split(',')
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance","[Comedy, Fantasy, Romance]"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,[Drama]
2,tt0067683,Workers '71: Nothing About Us Without Us,2006.0,47,Documentary,[Documentary]
3,tt0069049,The Other Side of the Wind,2018.0,122,Drama,[Drama]
4,tt0088751,The Naked Monster,2005.0,100,"Comedy,Horror,Sci-Fi","[Comedy, Horror, Sci-Fi]"
...,...,...,...,...,...,...
116399,tt9915436,Vida em Movimento,2019.0,70,Documentary,[Documentary]
116400,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy","[Comedy, Drama, Fantasy]"
116401,tt9916170,The Rehearsal,2019.0,51,Drama,[Drama]
116402,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"


In [12]:
## Explode the dataframe to make each genre into a separate row
exploded_genres = basics.explode('genres_split')
exploded_genres

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Romance
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,Drama
2,tt0067683,Workers '71: Nothing About Us Without Us,2006.0,47,Documentary,Documentary
...,...,...,...,...,...,...
116402,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Action
116402,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Adventure
116402,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Thriller
116403,tt9916362,Coven,2020.0,92,"Drama,History",Drama


#### Create `title_genres` table
- tconst : Genre_ID

In [13]:
## Save just tconst and genres_split as new df
title_genres = exploded_genres[['tconst','genres_split']].copy()
title_genres.head()

Unnamed: 0,tconst,genres_split
0,tt0035423,Comedy
0,tt0035423,Fantasy
0,tt0035423,Romance
1,tt0062336,Drama
2,tt0067683,Documentary


#### Replacing Text Genres with Integer IDs

In [14]:
## Getting unique list of genres
unique_genres = sorted(title_genres['genres_split'].unique())
unique_genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [15]:
## Make a dictionary with the unique genres as the keys and the new integer ids as values
genre_id_map = dict(zip(unique_genres, range(len(unique_genres))))
genre_id_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Documentary': 7,
 'Drama': 8,
 'Family': 9,
 'Fantasy': 10,
 'Game-Show': 11,
 'History': 12,
 'Horror': 13,
 'Music': 14,
 'Musical': 15,
 'Mystery': 16,
 'News': 17,
 'Reality-TV': 18,
 'Romance': 19,
 'Sci-Fi': 20,
 'Short': 21,
 'Sport': 22,
 'Talk-Show': 23,
 'Thriller': 24,
 'War': 25,
 'Western': 26}

#### Replacing Values in title_genres Table with Genre ID

In [16]:
## Use .map or .replace with our genre_id_map dictionary
title_genres['Genre_ID'] = title_genres['genres_split'].replace(genre_id_map)

## Drop the original genre column
title_genres.drop(columns=['genres_split'],inplace=True)
title_genres

Unnamed: 0,tconst,Genre_ID
0,tt0035423,5
0,tt0035423,10
0,tt0035423,19
1,tt0062336,8
2,tt0067683,7
...,...,...
116402,tt9916190,0
116402,tt9916190,2
116402,tt9916190,24
116403,tt9916362,8
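
The cell comment mentions that either `.map` or `.replace` can apply the dictionary. A small sketch of one practical difference, using a hypothetical three-genre mapping and a deliberately unmapped label: `.map` returns `NaN` for any value missing from the dictionary, which makes gaps easy to detect before the ids are written to a key column, whereas `.replace` would leave the unmatched text in place.

```python
import pandas as pd

# Hypothetical small mapping and data, for illustration only
genre_id_demo = {"Action": 0, "Comedy": 1, "Drama": 2}
genres_split_demo = pd.Series(["Drama", "Comedy", "Film-Noir"])  # "Film-Noir" is not in the mapping

# .map looks each value up in the dictionary; missing keys become NaN
ids = genres_split_demo.map(genre_id_demo)

# Labels that failed to map can be listed before loading anything into SQL
unmapped = genres_split_demo[ids.isna()].unique().tolist()

# Once the gaps are handled, the remaining ids can be cast to integers
clean_ids = ids.dropna().astype(int).tolist()
print(unmapped, clean_ids)
```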


#### Create `genres` table
- Genre : Genre_ID

In [17]:
## Manually make a dataframe with named cols from the .keys and .values
genre_lookup = pd.DataFrame({'Genre_Name': genre_id_map.keys(),
                             'Genre_ID':genre_id_map.values()})
genre_lookup.head()

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


In [18]:
## Dropping original genre columns 
basics = basics.drop(columns=['genres','genres_split'])
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70
2,tt0067683,Workers '71: Nothing About Us Without Us,2006.0,47
3,tt0069049,The Other Side of the Wind,2018.0,122
4,tt0088751,The Naked Monster,2005.0,100
...,...,...,...,...
116399,tt9915436,Vida em Movimento,2019.0,70
116400,tt9915872,The Last White Witch,2019.0,97
116401,tt9916170,The Rehearsal,2019.0,51
116402,tt9916190,Safeguard,2020.0,95
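
Before the original genre strings are discarded, it can be worth confirming that the two new tables carry the same information. The following is a minimal sketch on three rows copied from the preview above; the `_demo` names and the two checks are illustrative assumptions, not part of the notebook. It verifies that each (tconst, Genre_ID) pair is unique and that joining the tables rebuilds the original comma-separated strings.

```python
import pandas as pd

# Toy stand-in for the title basics table (rows copied from the preview above)
basics_demo = pd.DataFrame({
    "tconst": ["tt0035423", "tt0062336", "tt0088751"],
    "genres": ["Comedy,Fantasy,Romance", "Drama", "Comedy,Horror,Sci-Fi"],
})

# One row per (title, genre) pair
pairs = (basics_demo.assign(genre=basics_demo["genres"].str.split(","))
         .explode("genre")
         .reset_index(drop=True))

# Lookup table: alphabetical genre names with integer ids
names = sorted(pairs["genre"].unique())
genre_to_id = dict(zip(names, range(len(names))))
genres_demo = pd.DataFrame({"Genre_Name": names,
                            "Genre_ID": list(genre_to_id.values())})

# Bridge table with only the two key columns
title_genres_demo = pd.DataFrame({
    "tconst": pairs["tconst"],
    "Genre_ID": pairs["genre"].map(genre_to_id),
})

# Check 1: each (tconst, Genre_ID) pair appears only once
has_duplicates = title_genres_demo.duplicated().any()

# Check 2: joining the two tables rebuilds the original genre strings
rebuilt = (title_genres_demo.merge(genres_demo, on="Genre_ID")
           .sort_values(["tconst", "Genre_Name"])
           .groupby("tconst")["Genre_Name"].agg(",".join))
round_trip_ok = rebuilt.eq(basics_demo.set_index("tconst")["genres"]).all()
print(has_duplicates, round_trip_ok)
```

The round-trip comparison relies on the genre strings in these sample rows being listed alphabetically; data with a different ordering would need both sides sorted before comparing.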


## Normalizing TMDB API Data

In [19]:
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, polic...",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Whispers from a Shallow Grave,False,2.0,2,
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",,58520,tt0197633,en,Live Freaky Die Freaky,It is the year 3069 and the Earth has been rob...,1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Live Freaky Die Freaky,False,3.8,8,NR
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, '...",,459563,tt0204250,en,Death of a Saleswoman,Top-ranking RubberTubber saleswoman Agatha J. ...,0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Gir...","[{'iso_3166_1': 'US', 'name': 'United States o...",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,6 housewives. 1 dead body. A trunk full of pla...,Death of a Saleswoman,False,3.0,1,
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",http://www.universalstudiosentertainment.com/c...,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans ca...",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKX...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, acc...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FW...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Henry Perkins accidentally trades briefcases w...,Funny Money,False,5.4,39,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78392,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu",Topics about female sexuality are growing in p...,1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cat...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,
78393,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa'...",Released,,"Tokyo, Non-Stop",False,0.0,0,
78394,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zim...,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Muga...",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrb...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2003-08-28,0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Zimbabwe Countdown,False,5.0,4,
78395,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,A Documentary Movie About Card Counting In The...,0.600,,[],[],2003-01-01,0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,The Hot Shoe,False,0.0,0,


### Clean & Normalize Production Companies

In [20]:
test = df.iloc[3]['production_companies']
test

"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]"

In [21]:
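## Function to get just the company names as a list 
import json
def get_company_name(x):
    try:
        # Quote the bare None so that json can read it as a string
        x = x.replace('None','"None"')
        x = x.replace('\\','')
        
        # Remove the apostrophe before any "s" (e.g. possessives)
        x = x.replace("'s",'s')
        # Switch the known keys to double quotes
        for key in ['id','logo_path','name','origin_country']:
            x = x.replace(f"'{key}'",f'"{key}"')
            
        # Switch every remaining single quote to a double quote
        x = x.replace("'",'"')
        
        x = json.loads(x)

        companies = []
        for company in x:
            companies.append(company['name'])
        return companies
    except Exception:
        # Rows that cannot be parsed are printed for inspection and return None
        print(x)
        return None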

In [22]:
get_company_name(test)

['Universal Pictures', 'Strike Entertainment', 'Hit & Run Productions']

In [23]:
# json.loads(test)

In [24]:
## Use our function to create the new list column (to be exploded)
df['production_company_list'] = df['production_companies'].fillna('[]').apply(get_company_name)

[{"id": 431, "logo_path": "None", "name": "The Donners" Company", "origin_country": "US"}]
[{"id": 94, "logo_path": "/huC7HqorvUThGIrENrbcHmQVUA0.png", "name": "ARTE France Cinéma", "origin_country": "FR"}, {"id": 122638, "logo_path": "None", "name": "République de Côte d"Ivoire", "origin_country": ""}, {"id": 13542, "logo_path": "/sJ0R9LDMGR5ildYS4GMhQcbKJVD.png", "name": "CNRS Images", "origin_country": "FR"}, {"id": 76921, "logo_path": "/2r3bPkN23MRdAD1iaTKVfK8NtEQ.png", "name": "Les Films d"ici", "origin_country": "FR"}, {"id": 2326, "logo_path": "None", "name": "Autonomous", "origin_country": ""}, {"id": 122639, "logo_path": "None", "name": "Éliane de Latour Production", "origin_country": ""}, {"id": 310, "logo_path": "/4Gp5A4YdjPBfsHT47ngHMaXWmb9.png", "name": "CNC", "origin_country": "FR"}, {"id": 1249, "logo_path": "None", "name": "Les Films du Lendemain", "origin_country": "FR"}, {"id": 122640, "logo_path": "None", "name": "Cognition", "origin_country": ""}, {"id": 122641, "lo

[{"id": 157118, "logo_path": "None", "name": "Ali n" Productions", "origin_country": ""}, {"id": 20219, "logo_path": "None", "name": "Les Films du Nouveau Monde", "origin_country": "FR"}, {"id": 72352, "logo_path": "None", "name": "Unité de Production", "origin_country": "FR"}]
[{"id": 126286, "logo_path": "None", "name": "Heure d"Eté Productions", "origin_country": ""}]
[{"id": 87045, "logo_path": "/6WLvW3VlxfXOPukpeEPYnjdFtwH.png", "name": "Propagate Content", "origin_country": "US"}, {"id": 159529, "logo_path": "None", "name": "Stardust Frames", "origin_country": ""}, {"id": 159530, "logo_path": "None", "name": "The Players" Tribune", "origin_country": ""}]
[{"id": 7573, "logo_path": "None", "name": "Les Films de l"Astrophore", "origin_country": "FR"}]
[{"id": 121251, "logo_path": "None", "name": "Les Films d"Argile", "origin_country": "FR"}, {"id": 16691, "logo_path": "None", "name": "Shellac Films", "origin_country": "FR"}]
[{"id": 87045, "logo_path": "/6WLvW3VlxfXOPukpeEPYnjdFtwH

In [25]:
df['production_company_list'].isna().sum() / len(df)

0.006262994757452453

In [26]:
edge_case = df.loc[df['production_companies'].str.contains("Donner"), 'production_companies'].index
edge_case

Int64Index([   66,   466,  1023,  8292, 14047, 14362, 16771, 17416, 28453,
            41839, 57045, 62114, 62179, 65211, 67553, 68679, 73284, 77021,
            77054],
           dtype='int64')

In [27]:
donner_ids = [ 66,  466, 1023,  298,   42,  357,   76,  721,   18, 2975,  452,
              47,  112,  125,  482,  150,  221,  141,  174]
donner_ids

[66,
 466,
 1023,
 298,
 42,
 357,
 76,
 721,
 18,
 2975,
 452,
 47,
 112,
 125,
 482,
 150,
 221,
 141,
 174]

In [28]:
df.loc[donner_ids[0], 'production_companies']

'[{\'id\': 431, \'logo_path\': None, \'name\': "The Donners\' Company", \'origin_country\': \'US\'}]'

In [29]:
get_company_name(df.loc[donner_ids[0], 'production_companies'])

[{"id": 431, "logo_path": "None", "name": "The Donners" Company", "origin_country": "US"}]


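- Issue with company names that include an apostrophe: the blanket `'` to `"` replacement also converts the apostrophe inside the name, which closes the JSON string early, so `json.loads` fails and the function returns None (about 0.6% of rows, per the NaN ratio above).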
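
The apostrophe problem comes from converting a Python-style literal into JSON by string replacement. A minimal sketch of one alternative, assuming the column holds stringified Python lists of dicts as in the outputs above: `ast.literal_eval` from the standard library reads that syntax directly, so no quote rewriting is needed. `parse_company_names` is a hypothetical helper name, not the notebook's `get_company_name`.

```python
import ast


def parse_company_names(raw):
    """Return the company names from a stringified list of dicts, or None on failure."""
    try:
        # literal_eval understands single quotes, double quotes and None,
        # so names such as "The Donners' Company" survive unchanged.
        records = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal (for example NaN or a truncated string).
        return None
    return [record["name"] for record in records]


# The edge case printed above, copied from row 66.
edge = """[{'id': 431, 'logo_path': None, 'name': "The Donners' Company", 'origin_country': 'US'}]"""
print(parse_company_names(edge))
```

Applied with `df['production_companies'].fillna('[]').apply(parse_company_names)`, this would fill the same column as before; whether it recovers every failing row in this dataset has not been checked here.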

In [30]:
pd.set_option('display.max_colwidth',300)

In [31]:
df[df['production_company_list'].isna()]['production_companies']

66                                                                                                                                                                                                                          [{'id': 431, 'logo_path': None, 'name': "The Donners' Company", 'origin_country': 'US'}]
207      [{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrbcHmQVUA0.png', 'name': 'ARTE France Cinéma', 'origin_country': 'FR'}, {'id': 122638, 'logo_path': None, 'name': "République de Côte d'Ivoire", 'origin_country': ''}, {'id': 13542, 'logo_path': '/sJ0R9LDMGR5ildYS4GMhQcbKJVD.png', 'name': 'CNRS Images'...
399      [{'id': 83, 'logo_path': '/9OQ0rm55xtlgX7KcAKMUePJSrQc.png', 'name': 'France 2 Cinéma', 'origin_country': 'FR'}, {'id': 4045, 'logo_path': None, 'name': 'Cinegram', 'origin_country': 'GR'}, {'id': 23495, 'logo_path': None, 'name': 'Titania Produzioni', 'origin_country': 'IT'}, {'id': 98757, 'log...
466      [{'id': 1088, 'logo_path': '/9WOE5AQUXbOtLU6GTwfjS8OMF0v.png', '

#### Explode & Separate

In [32]:
df_explode = df.explode('production_company_list')
df_explode

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, police track the photographer who last saw her. At his trial, the model herself tells her side of the story.",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Whispers from a Shallow Grave,False,2.0,2,,
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10402, 'name': 'Music'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 27, 'name': 'Horror'}]",,58520,tt0197633,en,Live Freaky Die Freaky,"It is the year 3069 and the Earth has been robbed of all its natural resources, destroyed by war and ozone depletion.",1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Live Freaky Die Freaky,False,3.8,8,NR,
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, 'name': 'Comedy'}]",,459563,tt0204250,en,Death of a Saleswoman,"Top-ranking RubberTubber saleswoman Agatha J. Ruby was brutally beaten and shot on her morning jog by an unknown assailant. Over six thousand dollars worth of her best plastic storage ware was stolen from her trunk. Nearly all 200 residents in her hometown of Mametville, Washington knew her pers...",0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Girl Next Door Productions', 'origin_country': ''}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,6 housewives. 1 dead body. A trunk full of plastic storage wear.,Death of a Saleswoman,False,3.0,1,,Girl Next Door Productions
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,Universal Pictures
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,Strike Entertainment
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78392,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu","Topics about female sexuality are growing in popularity. Magazines and talk shows all discuss it. Yet a fair percentage of women are said to suffer from female sexual dysfunction. While male sexual problems have traditionally received the most publicity, only recently has research begun into the...",1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cats & Dogs Films', 'origin_country': 'FR'}, {'id': 7106, 'logo_path': None, 'name': 'Sylicone', 'origin_country': 'FR'}, {'id': 201, 'logo_path': '/6UIpEURdjnmcJPwgTDRzVRuwADr.png', 'name': 'ARTE', 'origin_country': 'FR'}, {'id': 639, 'logo_path': '/fN...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr', 'name': 'Français'}]",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,,Angoa-Agicoa
78393,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}, {'id': 18, 'name': 'Drama'}]",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,,"Tokyo, Non-Stop",False,0.0,0,,
78394,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zimbabwe-countdown_documentary,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Mugabe has become synonymous with bad government and misrule. But where did it all go wrong?",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrbcHmQVUA0.png', 'name': 'ARTE France Cinéma', 'origin_country': 'FR'}]","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'ZW', 'name': 'Zimbabwe'}]",2003-08-28,0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Zimbabwe Countdown,False,5.0,4,,ARTE France Cinéma
78395,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,"A Documentary Movie About Card Counting In The Casino Game of 21. The director used $5,000 of the movie's budget to learn the counting methods, and put them to the test on screen.",0.600,,[],[],2003-01-01,0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,The Hot Shoe,False,0.0,0,,


In [33]:
## save unique companies
unique_companies = sorted(df_explode['production_company_list'].dropna().unique())
len(unique_companies)

39872

In [34]:
company_map = dict(zip(unique_companies, range(len(unique_companies))))
company_map

{'# Andrea Sperling Productions': 0,
 '#Beardforce Films': 1,
 '#Sinning Works': 2,
 '#littlesecretfilm': 3,
 '(주)라인필름': 4,
 '(주)유비컬쳐': 5,
 '(주)유비콘텐츠': 6,
 '+Magenta': 7,
 '.40 Caliber': 8,
 '.Mille et Une. Films.': 9,
 '.hack Conglomerate': 10,
 '0': 11,
 '01 Distribution': 12,
 '011 Productions': 13,
 '011Films': 14,
 '02 Filmes': 15,
 '0708 Films': 16,
 '1 Bullet in the Gun Productions': 17,
 '1 Media Productions': 18,
 '1 Production Film': 19,
 '1+1 Production': 20,
 '1-2-3 Production': 21,
 '1.85 Films': 22,
 '10 Below Films': 23,
 '10 Cates Pictures': 24,
 '10 East Pictures': 25,
 '10 West Studios': 26,
 '10-13 Productions': 27,
 '10.80 Films': 28,
 '100': 29,
 '100 Bares': 30,
 '100 Feet Productions': 31,
 '100 Year Films': 32,
 '100 doo': 33,
 '100% Entertainment': 34,
 '100% Halal': 35,
 '1000 Volt': 36,
 '100th Monkey Films': 37,
 '101 Films': 38,
 '101 Studios': 39,
 '1019 Entertainment': 40,
 '1019 Pictures': 41,
 '102 Distribution': 42,
 '1024 FILM': 43,
 '1031 Films': 44,

In [35]:
df_title_companies = df_explode[['imdb_id','production_company_list']]
df_title_companies

Unnamed: 0,imdb_id,production_company_list
0,tt0144280,
1,tt0197633,
2,tt0204250,Girl Next Door Productions
3,tt0206634,Universal Pictures
3,tt0206634,Strike Entertainment
...,...,...
78392,tt7098636,Angoa-Agicoa
78393,tt7246872,
78394,tt7637550,ARTE France Cinéma
78395,tt9414698,


In [36]:
df_title_companies['company_id'] = df_title_companies['production_company_list'].map(company_map)
df_title_companies

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_title_companies['company_id'] = df_title_companies['production_company_list'].map(company_map)


Unnamed: 0,imdb_id,production_company_list,company_id
0,tt0144280,,
1,tt0197633,,
2,tt0204250,Girl Next Door Productions,14402.0
3,tt0206634,Universal Pictures,36827.0
3,tt0206634,Strike Entertainment,33356.0
...,...,...,...
78392,tt7098636,Angoa-Agicoa,2167.0
78393,tt7246872,,
78394,tt7637550,ARTE France Cinéma,961.0
78395,tt9414698,,
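
The warning above appears because `df_title_companies` was created by selecting columns from `df_explode`, so pandas cannot tell whether the assignment is meant to reach the parent frame. A minimal sketch of one way to avoid it, assuming an explicit `.copy()` is acceptable here. The toy frame and map are illustrative stand-ins, not the notebook's data. Dropping rows without a company before mapping is a further assumption; it keeps `company_id` an integer rather than the float column shown above.

```python
import pandas as pd

# Toy stand-ins for df_explode and company_map (values are illustrative only).
df_explode = pd.DataFrame({
    "imdb_id": ["tt0204250", "tt0206634", "tt0206634", "tt0144280"],
    "production_company_list": [
        "Girl Next Door Productions",
        "Universal Pictures",
        "Strike Entertainment",
        None,
    ],
})
unique_companies = sorted(df_explode["production_company_list"].dropna().unique())
company_map = dict(zip(unique_companies, range(len(unique_companies))))

# Take an independent copy so the later assignment cannot be mistaken
# for a write into a slice of df_explode.
df_title_companies = (
    df_explode[["imdb_id", "production_company_list"]]
    .dropna(subset=["production_company_list"])  # titles with no company get no join row
    .copy()
)
df_title_companies["company_id"] = df_title_companies["production_company_list"].map(company_map)
print(df_title_companies)
```

If titles without a company must stay in the join table, the `dropna` step can be removed; `company_id` then stays a float column because of the missing values.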


In [37]:
df_company = pd.DataFrame({'company_id':company_map.values(), 'company_name':company_map.keys()})
df_company

Unnamed: 0,company_id,company_name
0,0,# Andrea Sperling Productions
1,1,#Beardforce Films
2,2,#Sinning Works
3,3,#littlesecretfilm
4,4,(주)라인필름
...,...,...
39867,39867,타임스토리그룹
39868,39868,파파스필름
39869,39869,플랫폼픽쳐스
39870,39870,한맥문화


In [38]:
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, police track the photographer who last saw her. At his trial, the model herself tells her side of the story.",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Whispers from a Shallow Grave,False,2.0,2,,[]
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10402, 'name': 'Music'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 27, 'name': 'Horror'}]",,58520,tt0197633,en,Live Freaky Die Freaky,"It is the year 3069 and the Earth has been robbed of all its natural resources, destroyed by war and ozone depletion.",1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Live Freaky Die Freaky,False,3.8,8,NR,[]
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, 'name': 'Comedy'}]",,459563,tt0204250,en,Death of a Saleswoman,"Top-ranking RubberTubber saleswoman Agatha J. Ruby was brutally beaten and shot on her morning jog by an unknown assailant. Over six thousand dollars worth of her best plastic storage ware was stolen from her trunk. Nearly all 200 residents in her hometown of Mametville, Washington knew her pers...",0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Girl Next Door Productions', 'origin_country': ''}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,6 housewives. 1 dead body. A trunk full of plastic storage wear.,Death of a Saleswoman,False,3.0,1,,[Girl Next Door Productions]
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,"[Universal Pictures, Strike Entertainment, Hit & Run Productions]"
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}]",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, accidentally trades briefcases with another man, to find out that there's a million dollars inside. Henry tells his unsuspecting wife of their new-found fortune, but she doesn't embrace it as well as he does. Soon they're joined by their best friends, a...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FWE Picture Company', 'origin_country': ''}, {'id': 30584, 'logo_path': '/7s0nJTTztkOVeIgavfOwx36EBtj.png', 'name': 'Thinkfactory Media', 'origin_country': 'US'}, {'id': 110623, 'logo_path': None, 'name': 'Tobebo Filmproduktions', 'origin_country': ''}...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'RO', 'name': 'Romania'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,"Henry Perkins accidentally trades briefcases with another man, to find out that there's five million dollars inside...",Funny Money,False,5.4,39,R,"[FWE Picture Company, Thinkfactory Media, Tobebo Filmproduktions, Castel Film]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78392,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu","Topics about female sexuality are growing in popularity. Magazines and talk shows all discuss it. Yet a fair percentage of women are said to suffer from female sexual dysfunction. While male sexual problems have traditionally received the most publicity, only recently has research begun into the...",1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cats & Dogs Films', 'origin_country': 'FR'}, {'id': 7106, 'logo_path': None, 'name': 'Sylicone', 'origin_country': 'FR'}, {'id': 201, 'logo_path': '/6UIpEURdjnmcJPwgTDRzVRuwADr.png', 'name': 'ARTE', 'origin_country': 'FR'}, {'id': 639, 'logo_path': '/fN...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr', 'name': 'Français'}]",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,,"[Cats & Dogs Films, Sylicone, ARTE, DR TV, Planète+, SBS TV, CNC, PROCIREP, Angoa-Agicoa]"
78393,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}, {'id': 18, 'name': 'Drama'}]",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,,"Tokyo, Non-Stop",False,0.0,0,,[]
78394,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zimbabwe-countdown_documentary,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Mugabe has become synonymous with bad government and misrule. But where did it all go wrong?",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrbcHmQVUA0.png', 'name': 'ARTE France Cinéma', 'origin_country': 'FR'}]","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'ZW', 'name': 'Zimbabwe'}]",2003-08-28,0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Zimbabwe Countdown,False,5.0,4,,[ARTE France Cinéma]
78395,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,"A Documentary Movie About Card Counting In The Casino Game of 21. The director used $5,000 of the movie's budget to learn the counting methods, and put them to the test on screen.",0.600,,[],[],2003-01-01,0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,The Hot Shoe,False,0.0,0,,[]


In [39]:
# # using regular expressions to extract just the name
# import re
# exp= r"\'name\'\:.?\'(\w*.*?)\'"
# re.findall(exp, test)

In [40]:
# def get_prod_company_names(x):
#     if x=='[]':
#         return ["MISSING"]
    
#     exp= r"\'name\'\:.?\'(\w*.*?)\'"
#     companies = re.findall(exp, x)
#     return companies

### 📌[TO DO] Clean and Normalize Belongs to collection

In [41]:

df[~df['belongs_to_collection'].isna()]

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list
15,False,/tuu79CwctvIo4RUg7E3FGkGIg4I.jpg,"{'id': 141084, 'name': ""Happily N'Ever After Collection"", 'poster_path': '/aPjLYxKIBys6dlxXMZrDi8VowQK.jpg', 'backdrop_path': '/fw6YfVOmK0tVtaSUGb3GvLzEmvK.jpg'}",47000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}, {'id': 14, 'name': 'Fantasy'}]",http://www.happilyneverafterthefilm.com/,5393,tt0308353,en,Happily N'Ever After,"An alliance of evil-doers, led by Frieda, looks to take over Fairy Tale Land. But when Ella realizes her stepmother is out to ruin her storybook existence, she takes a dramatic turn and blossoms into the leader of the resistance effort.",10.113,/MjfHDN5WrV6RBVhCvfdlPpt7gH.jpg,"[{'id': 1718, 'logo_path': None, 'name': 'Odyssey', 'origin_country': 'US'}, {'id': 1632, 'logo_path': '/cisLn1YAUuptXVBa0xjq7ST9cH0.png', 'name': 'Lionsgate', 'origin_country': 'US'}, {'id': 87859, 'logo_path': None, 'name': 'BAF Berlin Animation Film', 'origin_country': ''}, {'id': 149640, 'lo...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2007-01-05,38085778,87.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,Fairy Tale Endings Aren't What They Used To Be.,Happily N'Ever After,False,4.9,358,PG,"[Odyssey, Lionsgate, BAF Berlin Animation Film, BFC Berliner Film Companie, Vanguard Animation]"
18,False,/sd4xN5xi8tKRPrJOWwNiZEile7f.jpg,"{'id': 87118, 'name': 'Cars Collection', 'poster_path': '/ooYvY9DMEdUHH2dOPHbZmyfMENy.jpg', 'backdrop_path': '/A8DqaTGwZ8iCEjWMNRsZumzfKLw.jpg'}",120000000,"[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://disney.go.com/disneyvideos/animatedfilms/cars/,920,tt0317219,en,Cars,"Lightning McQueen, a hotshot rookie race car driven to succeed, discovers that life is about the journey, not the finish line, when he finds himself unexpectedly detoured in the sleepy Route 66 town of Radiator Springs. On route across the country to the big Piston Cup Championship in California...",80.250,/qa6HCwP4Z15l3hpsASz3auugEW6.jpg,"[{'id': 3, 'logo_path': '/1TjvGVDMYsj6JBxOAkUHpPEwLf7.png', 'name': 'Pixar', 'origin_country': 'US'}, {'id': 2, 'logo_path': '/wdrCwmRnLFJhEoH8GSfymY85KHT.png', 'name': 'Walt Disney Pictures', 'origin_country': 'US'}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-06-08,461983149,117.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}, {'english_name': 'Italian', 'iso_639_1': 'it', 'name': 'Italiano'}, {'english_name': 'Japanese', 'iso_639_1': 'ja', 'name': '日本語'}, {'english_name': 'Yiddish', 'iso_639_1': 'yi', 'name': ''}]",Released,Ahhh... it's got that new movie smell.,Cars,False,6.9,11611,G,"[Pixar, Walt Disney Pictures]"
19,False,/mOyF7rHVLF6uLbyb18FWTmcrjJe.jpg,"{'id': 87359, 'name': 'Mission: Impossible Collection', 'poster_path': '/geEjCGfdmRAA1skBPwojcdvnZ8A.jpg', 'backdrop_path': '/jYl0UuJFcmhymv9ZNO14lPLDY1Z.jpg'}",150000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}]",,956,tt0317919,en,Mission: Impossible III,"Retired from active duty to train new IMF agents, Ethan Hunt is called back into action to confront sadistic arms dealer, Owen Davian. Hunt must try to protect his girlfriend while working with his new team to complete the mission.",30.946,/5l0hS4A119jCA1A02XsFZHq1uRD.jpg,"[{'id': 4, 'logo_path': '/gz66EfNoYPqHTYI4q9UEN4CbHRc.png', 'name': 'Paramount', 'origin_country': 'US'}, {'id': 44, 'logo_path': None, 'name': 'Cruise/Wagner Productions', 'origin_country': ''}, {'id': 658, 'logo_path': None, 'name': 'MI 3 Film', 'origin_country': ''}, {'id': 2269, 'logo_path':...","[{'iso_3166_1': 'CN', 'name': 'China'}, {'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-05-03,397850012,126.0,"[{'english_name': 'German', 'iso_639_1': 'de', 'name': 'Deutsch'}, {'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}, {'english_name': 'Italian', 'iso_639_1': 'it', 'name': 'Italiano'}, {'english_name': 'Mandarin', 'iso_639_1': 'zh', 'name': '普通话'}, {'english_name': 'Czech', 'iso...",Released,The mission begins 05:05:06.,Mission: Impossible III,False,6.7,5390,PG-13,"[Paramount, Cruise/Wagner Productions, MI 3 Film, China Film Co-Production Corporation, The Fourth Production Company Film Group, China Film Group Corporation, Cinecittá Studios, Studio Babelsberg]"
34,False,/isH5BjYSpAZrNbrDFBfqYWouqfY.jpg,"{'id': 85817, 'name': 'Arthur and the Invisibles Collection', 'poster_path': '/2zhBMVy8E5fMnixTQUFVHcVncv6.jpg', 'backdrop_path': '/qzi1GxLQp8fk5DBmJ2c8lohhlAr.jpg'}",86000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 16, 'name': 'Animation'}, {'id': 10751, 'name': 'Family'}]",,9992,tt0344854,en,Arthur et les Minimoys,"Arthur is a spirited ten-year old whose parents are away looking for work, whose eccentric grandfather has been missing for several years, and who lives with his grandmother in a country house that, in two days, will be repossessed, torn down, and turned into a block of flats unless Arthur's gra...",23.025,/gH27K0S7BiZCsNMxzOQTacG9IZp.jpg,"[{'id': 104, 'logo_path': '/9aotxauvc9685tq9pTcRJszuT06.png', 'name': 'Canal+', 'origin_country': 'FR'}, {'id': 854, 'logo_path': None, 'name': 'Sofica EuropaCorp', 'origin_country': ''}, {'id': 2525, 'logo_path': None, 'name': 'Avalanche Productions', 'origin_country': 'FR'}, {'id': 6896, 'logo...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2006-12-13,107944236,94.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,Adventure awaits in your own backyard.,Arthur and the Invisibles,False,6.4,2464,PG,"[Canal+, Sofica EuropaCorp, Avalanche Productions, EuropaCorp]"
36,False,/8eRscFbRYl681zDfkjv1jjW1KAA.jpg,"{'id': 8537, 'name': 'Superman Collection', 'poster_path': '/fgTdu62v1jjNDPh68V9TJZMgXbo.jpg', 'backdrop_path': '/la0EhOrgVAiY8sf3ILz3wgoaNXH.jpg'}",270000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}]",https://www.warnerbros.com/movies/superman-returns,1452,tt0348150,en,Superman Returns,"Superman returns to discover his 5-year absence has allowed Lex Luthor to walk free, and that those he was closest to felt abandoned and have moved on. Luthor plots his ultimate revenge that could see millions killed and change the face of the planet forever, as well as ridding himself of the Ma...",24.129,/385XwTQZDpRX2d3kxtnpiLrjBXw.jpg,"[{'id': 429, 'logo_path': '/2Tc1P3Ac8M479naPp1kYT3izLS5.png', 'name': 'DC Comics', 'origin_country': 'US'}, {'id': 923, 'logo_path': '/5UQsZrfbfG2dYJbx8DxfoTr2Bvu.png', 'name': 'Legendary Pictures', 'origin_country': 'US'}, {'id': 9168, 'logo_path': '/hQNXrYIuyLRxMSBuGHk0GX7CCBS.png', 'name': 'B...","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-06-28,391081192,154.0,"[{'english_name': 'German', 'iso_639_1': 'de', 'name': 'Deutsch'}, {'english_name': 'French', 'iso_639_1': 'fr', 'name': 'Français'}, {'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,You’ll Believe A Man Can Fly Again!,Superman Returns,False,5.7,3453,PG-13,"[DC Comics, Legendary Pictures, Bad Hat Harry Productions, Warner Bros. Pictures, Peters Entertainment, Red Sun Productions Pty. Ltd.]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78301,False,/bIlgVsRkASYjLh5BABmAMAMeHji.jpg,"{'id': 39199, 'name': 'Detective Conan Collection', 'poster_path': '/1wBfr532NOQK68wlo5ApjCmiQIB.jpg', 'backdrop_path': '/9bogrpii4e61SR6a9qLHow7I46U.jpg'}",0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 16, 'name': 'Animation'}]",https://www.tms-e.co.jp/global/alltitles/conan/087307.html,39202,tt1133935,ja,名探偵コナン 迷宮の十字路,"Mouri Kogoro is called to a special case in the ancient capital of Kyoto. There, Conan meets Heiji and they team up once again to solve the case, recover the stolen Healing Buddha statue, and even discover the identity of Heiji's first love.",8.046,/lzPnHHhYWSMGwBLTDCIT6sgehF5.jpg,"[{'id': 7164, 'logo_path': '/vHsaqTkQ2ZTbdqrCA272YahoVlg.png', 'name': 'TMS Entertainment', 'origin_country': 'JP'}]","[{'iso_3166_1': 'JP', 'name': 'Japan'}]",2003-04-19,0,110.0,"[{'english_name': 'Japanese', 'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Detective Conan: Crossroad in the Ancient Capital,False,7.0,107,PG-13,[TMS Entertainment]
78349,False,/dGTQNe1jVYn3zhLsW2PP3PpLaKo.jpg,"{'id': 977715, 'name': 'Ultraman Cosmos Collection', 'poster_path': None, 'backdrop_path': None}",0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.m-78.jp/,44045,tt2243192,ja,ウルトラマンコスモスVSウルトラマンジャスティス THE FINAL BATTLE,"Once again, Musashi has encounter Ultraman Cosmos. Now Ultraman Cosmos has gained some new modes to defeat Scropis. Musashi has joined Team Sea to save the Blue Area. But there is mysterious evil monster, Sandros, who wants to destroy the Earth. Now it is up Musashi and Ultraman Cosmos to defeat...",2.661,/dEQIkE7S3KCCIE2Mjl0WMdBBkhk.jpg,"[{'id': 5905, 'logo_path': '/A9A3W8VyqnznQOf70ROYtwskK4A.png', 'name': 'Tsuburaya Productions', 'origin_country': 'JP'}, {'id': 192, 'logo_path': '/93eAKlnQecDdMtucfOprzcjMw9V.png', 'name': 'Shochiku', 'origin_country': 'JP'}]","[{'iso_3166_1': 'JP', 'name': 'Japan'}]",2003-08-02,0,77.0,"[{'english_name': 'Japanese', 'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Ultraman Cosmos vs. Ultraman Justice: The Final Battle,False,3.5,2,,"[Tsuburaya Productions, Shochiku]"
78374,False,,"{'id': 221910, 'name': 'Raped by an Angel Collection', 'poster_path': '/7nQI5UN2UOwvb5jXPQTE13qaXaC.jpg', 'backdrop_path': None}",0,"[{'id': 53, 'name': 'Thriller'}]",,221913,tt3533690,cn,强奸5广告诱惑,"Aaron is in his twenties, thin and silent. No-one notices that he is psychologically deviant. He is obsessed with three female advertisement models: Judy, from a wedding ad, Pauline, presenting shower gel, and Yen, an advertising girl for soda. He imagines that the women will surrender to him on...",1.042,/myE5lwfRSzevyavMNOmgbJm3jD6.jpg,[],[],2003-01-01,0,88.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}]",Released,,Raped by an Angel 5: Advertising Temptations,False,1.0,1,,[]
78376,False,,"{'id': 863888, 'name': 'Twin Dragon Encounter Collection', 'poster_path': None, 'backdrop_path': None}",0,"[{'id': 28, 'name': 'Action'}]",,850062,tt3679304,en,The Real Twin Dragons,"A true, one of a kind story that is totally unique in the film world.",1.091,/kcWKNgsptWfQukYU38lYoxXOcKD.jpg,[],[],2003-09-03,0,78.0,[],Released,You can't fight the government ... no one told the twins !,The Real Twin Dragons,False,0.0,0,,[]


In [42]:
test = df.iloc[19]['belongs_to_collection']
test

"{'id': 87359, 'name': 'Mission: Impossible Collection', 'poster_path': '/geEjCGfdmRAA1skBPwojcdvnZ8A.jpg', 'backdrop_path': '/jYl0UuJFcmhymv9ZNO14lPLDY1Z.jpg'}"

In [43]:
## Function to get the collection name(s) from the stringified dict
import json
def get_collection(x):
    try:
        ## make the Python-style repr parseable as JSON
        x = x.replace('None','"None"')
        x = x.replace('\\','')

        ## drop possessive apostrophes so they are not mistaken for quotes
        x = x.replace("'s",'s')
        for key in ['id','poster_path','name','backdrop_path']:
            x = x.replace(f"'{key}'",f'"{key}"')

        x = x.replace("'",'"')

        x = json.loads(x)
    ## NaN inputs have no .replace (AttributeError); unparseable strings raise ValueError
    except (AttributeError, ValueError):
        return None
    if isinstance(x,dict):
        collections = x['name']

    elif isinstance(x,list):
        collections = [collection['name'] for collection in x]
    else:
        collections = "MISSING"
    return collections


In [44]:
get_collection(test)

'Mission: Impossible Collection'

In [45]:
# json.loads(test)

In [46]:
## Use our function to extract the collection name into a new column
df['collection'] = df['belongs_to_collection'].apply(get_collection)

In [47]:
df['collection'].isna().sum() / len(df)

0.9500363534318915

In [48]:
df['belongs_to_collection'].isna().sum() / len(df)

0.9497047080883197
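The two ratios above differ slightly: `belongs_to_collection` has 3943 non-null values, but only 3917 names come out of `get_collection`, so 26 strings appear to fail the quote-replacement step and return `None`. Because the column holds Python dict reprs rather than JSON, the standard library's `ast.literal_eval` can parse them directly. The following is a minimal sketch of that alternative, not the approach used in this notebook; `parse_collection_name` is a hypothetical helper name, and the `None` values in the sample string are shortened stand-ins for the real paths.

```python
import ast

def parse_collection_name(raw):
    """Return the collection name from a stringified dict, or None."""
    # NaN (a float) and other non-string values mean "no collection"
    if not isinstance(raw, str):
        return None
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    if isinstance(parsed, dict):
        return parsed.get('name')
    return None

sample = "{'id': 87359, 'name': 'Mission: Impossible Collection', 'poster_path': None, 'backdrop_path': None}"
print(parse_collection_name(sample))
```

Since no quotes are rewritten, names containing apostrophes are kept intact rather than altered by the `"'s"` replacement.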

In [49]:
df['collection'].value_counts()

Detective Conan Collection            18
Doraemon Collection                   16
Troublesome Night Collection          15
One Piece Collection                  13
Madea - Collection                    12
                                      ..
Çakallarla Dans [Seri]                 1
Pizza                                  1
美景之屋（系列）                               1
Bloodbath at Pinky High Collection     1
Twin Dragon Encounter Collection       1
Name: collection, Length: 1943, dtype: int64

#### Explode & Separate

In [50]:
df['collection']

0        None
1        None
2        None
3        None
4        None
         ... 
78392    None
78393    None
78394    None
78395    None
78396    None
Name: collection, Length: 78397, dtype: object

In [51]:
# df_explode = df.explode('collecti')
# df_explode

In [52]:
## save unique collections
unique_collections = sorted(df['collection'].dropna().unique())
len(unique_collections)

1943

In [53]:
collection_map = dict(zip(unique_collections, range(len(unique_collections))))
collection_map

{'#TemanTapiMenikah': 0,
 '... Cameras Collection': 1,
 '.hack Collection': 2,
 '10 giorni... - Collezione': 3,
 '100 Girls Collection': 4,
 '100% Coco Collection': 5,
 '101 Dalmatians (Live-Action) Collection': 6,
 '12 Dogs of Christmas Collection': 7,
 '12 Rounds Collection': 8,
 '12 Westerns in 12 Months Collection': 9,
 '14+ (Коллекция)': 10,
 '1920 Collection': 11,
 '1981 Collection': 12,
 '2 Days In... Collection': 13,
 '2001 Maniacs Collection': 14,
 '20th Century Boys Collection': 15,
 '28 Days/Weeks Later Collection': 16,
 '3 A.M. Collection': 17,
 '3 Braves Universe': 18,
 '3 Little Pigs collection': 19,
 '30 Days of Night Collection': 20,
 '300 Collection': 21,
 '365 Days Collection': 22,
 '47 Meters Down Collection': 23,
 '47 Ronin Collection': 24,
 '64 Collection': 25,
 '666 Collection': 26,
 '8 свиданий (Коллекция)': 27,
 'A California Christmas Collection': 28,
 'A Chinese Odyssey Collection': 29,
 'A Christmas Prince Collection': 30,
 'A Doggone Collection': 31,
 'A Dog

#### ~~WAIT!~~ Nevermind

In [54]:
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list,collection
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, police track the photographer who last saw her. At his trial, the model herself tells her side of the story.",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Whispers from a Shallow Grave,False,2.0,2,,[],
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10402, 'name': 'Music'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 27, 'name': 'Horror'}]",,58520,tt0197633,en,Live Freaky Die Freaky,"It is the year 3069 and the Earth has been robbed of all its natural resources, destroyed by war and ozone depletion.",1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Live Freaky Die Freaky,False,3.8,8,NR,[],
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, 'name': 'Comedy'}]",,459563,tt0204250,en,Death of a Saleswoman,"Top-ranking RubberTubber saleswoman Agatha J. Ruby was brutally beaten and shot on her morning jog by an unknown assailant. Over six thousand dollars worth of her best plastic storage ware was stolen from her trunk. Nearly all 200 residents in her hometown of Mametville, Washington knew her pers...",0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Girl Next Door Productions', 'origin_country': ''}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,6 housewives. 1 dead body. A trunk full of plastic storage wear.,Death of a Saleswoman,False,3.0,1,,[Girl Next Door Productions],
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,"[Universal Pictures, Strike Entertainment, Hit & Run Productions]",
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}]",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, accidentally trades briefcases with another man, to find out that there's a million dollars inside. Henry tells his unsuspecting wife of their new-found fortune, but she doesn't embrace it as well as he does. Soon they're joined by their best friends, a...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FWE Picture Company', 'origin_country': ''}, {'id': 30584, 'logo_path': '/7s0nJTTztkOVeIgavfOwx36EBtj.png', 'name': 'Thinkfactory Media', 'origin_country': 'US'}, {'id': 110623, 'logo_path': None, 'name': 'Tobebo Filmproduktions', 'origin_country': ''}...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'RO', 'name': 'Romania'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,"Henry Perkins accidentally trades briefcases with another man, to find out that there's five million dollars inside...",Funny Money,False,5.4,39,R,"[FWE Picture Company, Thinkfactory Media, Tobebo Filmproduktions, Castel Film]",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78392,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu","Topics about female sexuality are growing in popularity. Magazines and talk shows all discuss it. Yet a fair percentage of women are said to suffer from female sexual dysfunction. While male sexual problems have traditionally received the most publicity, only recently has research begun into the...",1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cats & Dogs Films', 'origin_country': 'FR'}, {'id': 7106, 'logo_path': None, 'name': 'Sylicone', 'origin_country': 'FR'}, {'id': 201, 'logo_path': '/6UIpEURdjnmcJPwgTDRzVRuwADr.png', 'name': 'ARTE', 'origin_country': 'FR'}, {'id': 639, 'logo_path': '/fN...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr', 'name': 'Français'}]",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,,"[Cats & Dogs Films, Sylicone, ARTE, DR TV, Planète+, SBS TV, CNC, PROCIREP, Angoa-Agicoa]",
78393,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}, {'id': 18, 'name': 'Drama'}]",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,,"Tokyo, Non-Stop",False,0.0,0,,[],
78394,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zimbabwe-countdown_documentary,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Mugabe has become synonymous with bad government and misrule. But where did it all go wrong?",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrbcHmQVUA0.png', 'name': 'ARTE France Cinéma', 'origin_country': 'FR'}]","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'ZW', 'name': 'Zimbabwe'}]",2003-08-28,0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Zimbabwe Countdown,False,5.0,4,,[ARTE France Cinéma],
78395,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,"A Documentary Movie About Card Counting In The Casino Game of 21. The director used $5,000 of the movie's budget to learn the counting methods, and put them to the test on screen.",0.600,,[],[],2003-01-01,0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,The Hot Shoe,False,0.0,0,,[],


In [55]:
df_title_collection = df[['imdb_id','collection']]
df_title_collection

Unnamed: 0,imdb_id,collection
0,tt0144280,
1,tt0197633,
2,tt0204250,
3,tt0206634,
4,tt0244521,
...,...,...
78392,tt7098636,
78393,tt7246872,
78394,tt7637550,
78395,tt9414698,


In [56]:
df_title_collection.isna().sum()

imdb_id           0
collection    74480
dtype: int64

In [57]:
df_title_collection = df_title_collection.dropna().copy()  # explicit copy avoids SettingWithCopyWarning in the next cell
df_title_collection.isna().sum()

imdb_id       0
collection    0
dtype: int64

In [58]:
df_title_collection['collection_id'] = df_title_collection['collection'].map(collection_map)
df_title_collection


Unnamed: 0,imdb_id,collection,collection_id
18,tt0317219,Cars Collection,274
19,tt0317919,Mission: Impossible Collection,963
34,tt0344854,Arthur and the Invisibles Collection,131
36,tt0348150,Superman Collection,1389
49,tt0362120,Scary Movie Collection,1256
...,...,...,...
78301,tt1133935,Detective Conan Collection,391
78349,tt2243192,Ultraman Cosmos Collection,1778
78374,tt3533690,Raped by an Angel Collection,1187
78376,tt3679304,Twin Dragon Encounter Collection,1773


In [59]:
df_collection = pd.DataFrame({'collection_id':collection_map.values(), 'collection_name':collection_map.keys()})
df_collection

Unnamed: 0,collection_id,collection_name
0,0,#TemanTapiMenikah
1,1,... Cameras Collection
2,2,.hack Collection
3,3,10 giorni... - Collezione
4,4,100 Girls Collection
...,...,...
1938,1938,달마야 시리즈
1939,1939,독전 시리즈
1940,1940,동갑내기 과외하기 시리즈
1941,1941,반드시 크게 들을 것
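The lookup/junction split built above can be sketched in plain Python to make the round trip explicit. The rows below are hypothetical stand-ins for `df_title_collection` (the `imdb_id` values are made up); the point is only that replacing each name with its integer id loses no information as long as the lookup table is kept.

```python
# Hypothetical (imdb_id, collection) pairs standing in for df_title_collection
rows = [
    ("tt0000001", "Cars Collection"),
    ("tt0000002", "Superman Collection"),
    ("tt0000003", "Cars Collection"),
]

# Lookup table: one integer id per distinct name, assigned in sorted order
names = sorted({name for _, name in rows})
collection_map = {name: idx for idx, name in enumerate(names)}

# Junction table: swap each name for its id
title_collection = [(imdb_id, collection_map[name]) for imdb_id, name in rows]

# Reversing the map recovers the original names, so nothing was lost
id_to_name = {idx: name for name, idx in collection_map.items()}
restored = [(imdb_id, id_to_name[cid]) for imdb_id, cid in title_collection]

print(title_collection)
```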


# LOAD

## Connect to MySQL

In [60]:
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
from sqlalchemy.types import String, Text, Float, Integer, Boolean


In [61]:
## Getting mysql server password
import json
with open('/Users/codingdojo/.secret/mysql.json') as f:
    login = json.load(f)

login.keys()

dict_keys(['user', 'password'])

In [62]:
## create a new movies database
# connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/movies"
connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/movies-viz"
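One caveat with building the URL by string formatting: if the password contains characters such as `@`, `:` or `/`, the URL is parsed incorrectly unless the password is percent-encoded first. A minimal sketch using the standard library follows; the credentials shown are hypothetical placeholders, not the contents of the real `mysql.json`.

```python
from urllib.parse import quote_plus

# Hypothetical credentials; the real ones come from the JSON file loaded above
login = {"user": "root", "password": "p@ss:word/1"}

# Percent-encode the password so '@', ':' and '/' do not break the URL
password = quote_plus(login["password"])
connect_str = f"mysql+pymysql://{login['user']}:{password}@localhost/movies-viz"
print(connect_str)
```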

In [63]:
## Check if database exists, if not, create it
if not database_exists(connect_str):
    print("Creating the database.")
    create_database(connect_str)
else:
    print('The database already exists.')

The database already exists.


In [64]:
## create engine
engine = create_engine(connect_str)

In [65]:
## Test your connection by listing the existing tables (tables from an earlier run may already be present)
q = """SHOW TABLES;"""
tables = pd.read_sql(q, engine)
tables

Unnamed: 0,Tables_in_movies-viz
0,genres
1,production_companies
2,title_basics
3,title_genres
4,title_production_company
5,title_ratings
6,tmdb


### Saving `title_basics` to db

In [66]:
def get_schema(table,debug=False):
    ## save pandas dtypes in list, make empty dict
    dtypes = table.dtypes
    schema = {}
    
    # for each column
    for col in dtypes.index:
        ## print info if in debug mode
        if debug:
            print(f"{col} = {dtypes.loc[col]}")

        ## if its a string column (object)
        if dtypes.loc[col]=='object':
            
            ## Fill null values and make sure whole column is str
            data = table[col].fillna('').astype(str)
            
            ## get len first
            len_str = data.map(len).max()
            
            ## if the string is shorter than 21845 characters use String
            # (21845 = 65535 // 3: MySQL's 65,535-byte row-size limit at 3 bytes per utf8 character)
            if len_str < 21845:
                schema[col] = String( len_str + 1)
                
            ## If longer use Text
            else:
                schema[col] = Text(len_str+1)
        
        # if float make Float
        elif dtypes.loc[col] == 'float':
            schema[col] = Float()

        ## if int make Integer
        elif dtypes.loc[col] == 'int':
            schema[col] = Integer()
            
        ## if bool make Boolean
        elif dtypes.loc[col] == 'bool':
            schema[col] = Boolean()
            
    return schema


In [67]:
basics_schema = get_schema(basics)
basics_schema

{'tconst': String(length=11),
 'primaryTitle': String(length=243),
 'startYear': Float(),
 'runtimeMinutes': Integer()}
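`get_schema` sizes each `String` column with Python's `len`, which counts characters rather than bytes. MySQL's `VARCHAR(n)` length is also expressed in characters, so this sizing should hold for non-Latin titles as well. A small illustration of the difference, using one of the collection names that appears earlier in this notebook:

```python
name = "달마야 시리즈"  # a collection name from the lookup table

n_chars = len(name)                  # what get_schema measures
n_bytes = len(name.encode("utf-8"))  # what the text occupies as UTF-8

print(n_chars, n_bytes)
```

The byte count only matters for limits that are defined in bytes, such as the overall row size.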

In [68]:
# ## saving text length
# key_len = basics['tconst'].map(len).max()
# title_len = basics['primaryTitle'].map(len).max()
# key_len, title_len

In [69]:
# basics_schema = {
#     "tconst": String(key_len+1), 
#     "primaryTitle": Text(title_len+1),
#     'startYear':Float(),
#     'runtimeMinutes':Integer()
#     }
# basics_schema

In [70]:
## Saving basics as table with schema,
basics.to_sql('title_basics',engine,dtype=basics_schema,if_exists='replace',index=False)

## setting tconst as the primary key
engine.execute('ALTER TABLE title_basics ADD PRIMARY KEY (`tconst`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x290dd7b20>

In [71]:
## query first rows 
q = """SELECT * FROM title_basics LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mirror,2020.0,70
2,tt0067683,Workers '71: Nothing About Us Without Us,2006.0,47
3,tt0069049,The Other Side of the Wind,2018.0,122
4,tt0088751,The Naked Monster,2005.0,100


### Saving Genre Tables

#### Saving genre_lookup

In [72]:
genre_schema = get_schema(genre_lookup)
genre_schema

{'Genre_Name': String(length=12), 'Genre_ID': Integer()}

In [73]:
genre_lookup.head()

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


In [74]:
## Primary key is Genre_ID
genre_lookup.to_sql('genres',engine,index=False,dtype=genre_schema,if_exists='replace')

engine.execute('ALTER TABLE genres ADD PRIMARY KEY (`Genre_ID`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x292d797f0>

In [75]:
q = """SELECT * FROM genres LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


#### Saving `title_genres` table

In [76]:
## NO PRIMARY KEY - tconst repeats once per genre, so it has DUPLICATE VALUES
title_genres.to_sql('title_genres',engine,index=False,
                    if_exists='replace' )

# engine.execute('ALTER TABLE title_genres ADD PRIMARY KEY (`tconst`);')

207423
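Although `tconst` alone repeats, each (`tconst`, `Genre_ID`) pair should be unique, so a composite primary key is one option for a junction table like this. The sketch below demonstrates the idea with the standard library's `sqlite3` in place of MySQL so that it runs without a server; it is an illustration, not what this notebook does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE title_genres (
        tconst   TEXT    NOT NULL,
        Genre_ID INTEGER NOT NULL,
        PRIMARY KEY (tconst, Genre_ID)
    )
    """
)

# The same tconst may repeat as long as the (tconst, Genre_ID) pair is new
pairs = [("tt0035423", 5), ("tt0035423", 10), ("tt0035423", 19)]
conn.executemany("INSERT INTO title_genres VALUES (?, ?)", pairs)

# An exact duplicate pair is rejected by the composite key
try:
    conn.execute("INSERT INTO title_genres VALUES (?, ?)", ("tt0035423", 5))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM title_genres").fetchone()[0]
print(row_count, duplicate_rejected)
```

In MySQL the equivalent statement would be along the lines of ``ALTER TABLE title_genres ADD PRIMARY KEY (`tconst`, `Genre_ID`);``, which assumes `tconst` was saved as a `VARCHAR` (for example by passing a `dtype` schema to `to_sql`) rather than as `TEXT`.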

In [77]:
q = """SELECT * FROM title_genres LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,tconst,Genre_ID
0,tt0035423,5
1,tt0035423,10
2,tt0035423,19
3,tt0062336,8
4,tt0067683,7


### Saving `title_ratings`


In [78]:
# ratings_schema = {'tconst':String(key_len+1), 
#                  'averageRating':Float(),
#                  'numVotes':Integer()}#get_schema(ratings)
ratings_schema = get_schema(ratings)
ratings_schema

{'tconst': String(length=11), 'averageRating': Float(), 'numVotes': Integer()}

In [79]:
ratings.to_sql('title_ratings',engine,if_exists='replace',index=False,
              dtype=ratings_schema)
engine.execute("ALTER TABLE title_ratings ADD PRIMARY KEY (`tconst`)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x292d7d4c0>

In [80]:
q = """SELECT * FROM title_ratings LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,85245
1,tt0062336,6.4,163
2,tt0067683,6.4,90
3,tt0069049,6.7,7435
4,tt0088751,5.2,327


### Saving TMDB API Data

In [81]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78397 entries, 0 to 78396
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   adult                    78397 non-null  bool   
 1   backdrop_path            43654 non-null  object 
 2   belongs_to_collection    3943 non-null   object 
 3   budget                   78397 non-null  int64  
 4   genres                   78397 non-null  object 
 5   homepage                 21370 non-null  object 
 6   id                       78397 non-null  int64  
 7   imdb_id                  78397 non-null  object 
 8   original_language        78397 non-null  object 
 9   original_title           78397 non-null  object 
 10  overview                 76518 non-null  object 
 11  popularity               78397 non-null  float64
 12  poster_path              68663 non-null  object 
 13  production_companies     78397 non-null  object 
 14  production_countries  

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list,collection
0,False,,,100000,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, police track the photographer who last saw her. At his trial, the model herself tells her side of the story.",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1997-01-01,0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Whispers from a Shallow Grave,False,2.0,2,,[],
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10402, 'name': 'Music'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 27, 'name': 'Horror'}]",,58520,tt0197633,en,Live Freaky Die Freaky,"It is the year 3069 and the Earth has been robbed of all its natural resources, destroyed by war and ozone depletion.",1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-31,0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Live Freaky Die Freaky,False,3.8,8,NR,[],
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, 'name': 'Comedy'}]",,459563,tt0204250,en,Death of a Saleswoman,"Top-ranking RubberTubber saleswoman Agatha J. Ruby was brutally beaten and shot on her morning jog by an unknown assailant. Over six thousand dollars worth of her best plastic storage ware was stolen from her trunk. Nearly all 200 residents in her hometown of Mametville, Washington knew her pers...",0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Girl Next Door Productions', 'origin_country': ''}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-18,0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,6 housewives. 1 dead body. A trunk full of plastic storage wear.,Death of a Saleswoman,False,3.0,1,,[Girl Next Door Productions],
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,"[Universal Pictures, Strike Entertainment, Hit & Run Productions]",
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}]",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, accidentally trades briefcases with another man, to find out that there's a million dollars inside. Henry tells his unsuspecting wife of their new-found fortune, but she doesn't embrace it as well as he does. Soon they're joined by their best friends, a...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FWE Picture Company', 'origin_country': ''}, {'id': 30584, 'logo_path': '/7s0nJTTztkOVeIgavfOwx36EBtj.png', 'name': 'Thinkfactory Media', 'origin_country': 'US'}, {'id': 110623, 'logo_path': None, 'name': 'Tobebo Filmproduktions', 'origin_country': ''}...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'RO', 'name': 'Romania'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-01,0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,"Henry Perkins accidentally trades briefcases with another man, to find out that there's five million dollars inside...",Funny Money,False,5.4,39,R,"[FWE Picture Company, Thinkfactory Media, Tobebo Filmproduktions, Castel Film]",


### Saving Production Companies

In [82]:
df_title_companies

Unnamed: 0,imdb_id,production_company_list,company_id
0,tt0144280,,
1,tt0197633,,
2,tt0204250,Girl Next Door Productions,14402.0
3,tt0206634,Universal Pictures,36827.0
3,tt0206634,Strike Entertainment,33356.0
...,...,...,...
78392,tt7098636,Angoa-Agicoa,2167.0
78393,tt7246872,,
78394,tt7637550,ARTE France Cinéma,961.0
78395,tt9414698,,


#### `title_company`

In [83]:
table_title_company = df_title_companies.drop(columns=['production_company_list']) 
table_title_company.isna().sum()

imdb_id           0
company_id    31660
dtype: int64

In [84]:
table_title_company = table_title_company.dropna()
table_title_company['company_id'] = table_title_company['company_id'].astype(int)
table_title_company

Unnamed: 0,imdb_id,company_id
2,tt0204250,14402
3,tt0206634,36827
3,tt0206634,33356
3,tt0206634,16040
4,tt0244521,11845
...,...,...
78392,tt7098636,6068
78392,tt7098636,26073
78392,tt7098636,2167
78394,tt7637550,961


In [85]:
## saving text length
key_len = table_title_company['imdb_id'].map(len).max()

## saving schema
api_data_schema = {'imdb_id':String(key_len+1)}
api_data_schema

{'imdb_id': String(length=11)}

In [86]:
# saving the title-to-company junction table
## No primary key: an imdb_id repeats once per production company
table_title_company.to_sql('title_production_company',engine, 
                           index=False,dtype=api_data_schema, if_exists='replace')


106553

#### `companies`

In [87]:
table_companies = df_company.copy()
table_companies

Unnamed: 0,company_id,company_name
0,0,# Andrea Sperling Productions
1,1,#Beardforce Films
2,2,#Sinning Works
3,3,#littlesecretfilm
4,4,(주)라인필름
...,...,...
39867,39867,타임스토리그룹
39868,39868,파파스필름
39869,39869,플랫폼픽쳐스
39870,39870,한맥문화


In [88]:
# saving the company lookup table
## company_id uniquely identifies each row
table_companies.to_sql('production_companies',engine, index=False,dtype=get_schema(table_companies), if_exists='replace')


39872

### `title_collections`

In [89]:
df_title_collection

Unnamed: 0,imdb_id,collection,collection_id
18,tt0317219,Cars Collection,274
19,tt0317919,Mission: Impossible Collection,963
34,tt0344854,Arthur and the Invisibles Collection,131
36,tt0348150,Superman Collection,1389
49,tt0362120,Scary Movie Collection,1256
...,...,...,...
78301,tt1133935,Detective Conan Collection,391
78349,tt2243192,Ultraman Cosmos Collection,1778
78374,tt3533690,Raped by an Angel Collection,1187
78376,tt3679304,Twin Dragon Encounter Collection,1773


In [90]:
table_title_collection = df_title_collection.drop(columns=['collection']) 
table_title_collection.isna().sum()

imdb_id          0
collection_id    0
dtype: int64

In [91]:
table_title_collection = table_title_collection.dropna()
table_title_collection['collection_id'] = table_title_collection['collection_id'].astype(int)
table_title_collection

Unnamed: 0,imdb_id,collection_id
18,tt0317219,274
19,tt0317919,963
34,tt0344854,131
36,tt0348150,1389
49,tt0362120,1256
...,...,...
78301,tt1133935,391
78349,tt2243192,1778
78374,tt3533690,1187
78376,tt3679304,1773


In [96]:
# ## saving text length
# key_len = table_title_collection['imdb_id'].map(len).max()
# ## saving schema
# api_data_schema = {'imdb_id':String(key_len+1)}
# api_data_schema
schema = get_schema(table_title_collection)
schema

{'imdb_id': String(length=11), 'collection_id': Integer()}

In [98]:
# saving the title-to-collection junction table
## No primary key is set here
table_title_collection.to_sql('title_collection',engine, index=False,dtype=schema, if_exists='replace')


3917

### `collections`

In [99]:
table_collections = df_collection
table_collections

Unnamed: 0,collection_id,collection_name
0,0,#TemanTapiMenikah
1,1,... Cameras Collection
2,2,.hack Collection
3,3,10 giorni... - Collezione
4,4,100 Girls Collection
...,...,...
1938,1938,달마야 시리즈
1939,1939,독전 시리즈
1940,1940,동갑내기 과외하기 시리즈
1941,1941,반드시 크게 들을 것


In [100]:
# saving the collection lookup table
table_collections.to_sql('collections',engine, index=False,
                         dtype=get_schema(table_collections),
                         if_exists='replace')

1943
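Once both tables are saved, the collection name for a title is recovered by joining the junction table to the lookup table on `collection_id`. The sketch below shows the shape of that query using `sqlite3` in place of the MySQL server, loaded with two rows taken from the tables above; the table and column names match the ones saved here, and everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE collections (collection_id INTEGER PRIMARY KEY, collection_name TEXT);
    CREATE TABLE title_collection (imdb_id TEXT, collection_id INTEGER);
    INSERT INTO collections VALUES (274, 'Cars Collection'), (963, 'Mission: Impossible Collection');
    INSERT INTO title_collection VALUES ('tt0317219', 274), ('tt0317919', 963);
    """
)

# Join the junction table to the lookup table to get names back
query = """
    SELECT tc.imdb_id, c.collection_name
    FROM title_collection AS tc
    JOIN collections AS c ON c.collection_id = tc.collection_id
    ORDER BY tc.imdb_id
"""
joined = conn.execute(query).fetchall()
print(joined)
```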

### TMDB API Data 

~~Instructor Note: Saving 2 versions of API data: MVP and AAB~~

In [101]:
## Convert financial cols to float (workaround for an error on save, possibly values too large for a MySQL INT)
df['revenue'] = df['revenue'].astype(float)
df['budget'] = df['budget'].astype(float)
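One plausible reason for the cast, offered as an assumption rather than a confirmed diagnosis: a signed 32-bit `INT` column tops out at 2,147,483,647, so any revenue above that cannot be stored in it. The check below uses a hypothetical figure, not a value from the dataset.

```python
INT_MAX = 2**31 - 1       # upper bound of a signed 32-bit INT
BIGINT_MAX = 2**63 - 1    # upper bound of a signed 64-bit BIGINT

revenue = 3_000_000_000   # hypothetical box-office figure, not from the dataset

fits_int = revenue <= INT_MAX
fits_bigint = revenue <= BIGINT_MAX
print(fits_int, fits_bigint)
```

If that is the cause, mapping the column to SQLAlchemy's `BigInteger` type in the schema would be an alternative that keeps the values as integers.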

In [102]:
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list,collection
0,False,,,100000.0,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, police track the photographer who last saw her. At his trial, the model herself tells her side of the story.",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1997-01-01,0.0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Whispers from a Shallow Grave,False,2.0,2,,[],
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10402, 'name': 'Music'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 27, 'name': 'Horror'}]",,58520,tt0197633,en,Live Freaky Die Freaky,"It is the year 3069 and the Earth has been robbed of all its natural resources, destroyed by war and ozone depletion.",1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-31,0.0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Live Freaky Die Freaky,False,3.8,8,NR,[],
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0.0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, 'name': 'Comedy'}]",,459563,tt0204250,en,Death of a Saleswoman,"Top-ranking RubberTubber saleswoman Agatha J. Ruby was brutally beaten and shot on her morning jog by an unknown assailant. Over six thousand dollars worth of her best plastic storage ware was stolen from her trunk. Nearly all 200 residents in her hometown of Mametville, Washington knew her pers...",0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Girl Next Door Productions', 'origin_country': ''}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-18,0.0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,6 housewives. 1 dead body. A trunk full of plastic storage wear.,Death of a Saleswoman,False,3.0,1,,[Girl Next Door Productions],
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464.0,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,"[Universal Pictures, Strike Entertainment, Hit & Run Productions]",
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}]",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, accidentally trades briefcases with another man, to find out that there's a million dollars inside. Henry tells his unsuspecting wife of their new-found fortune, but she doesn't embrace it as well as he does. Soon they're joined by their best friends, a...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FWE Picture Company', 'origin_country': ''}, {'id': 30584, 'logo_path': '/7s0nJTTztkOVeIgavfOwx36EBtj.png', 'name': 'Thinkfactory Media', 'origin_country': 'US'}, {'id': 110623, 'logo_path': None, 'name': 'Tobebo Filmproduktions', 'origin_country': ''}...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'RO', 'name': 'Romania'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-01,0.0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,"Henry Perkins accidentally trades briefcases with another man, to find out that there's five million dollars inside...",Funny Money,False,5.4,39,R,"[FWE Picture Company, Thinkfactory Media, Tobebo Filmproduktions, Castel Film]",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78392,False,/8AiCju3Guhxiq9Sznj8f7aJZofS.jpg,,0.0,"[{'id': 99, 'name': 'Documentary'}]",http://icarusfilms.com/if-cli,387250,tt7098636,fr,"Le Clitoris, ce cher inconnu","Topics about female sexuality are growing in popularity. Magazines and talk shows all discuss it. Yet a fair percentage of women are said to suffer from female sexual dysfunction. While male sexual problems have traditionally received the most publicity, only recently has research begun into the...",1.400,/aWHfVqrTlEkTjUmyiMWYjyd9gsW.jpg,"[{'id': 92745, 'logo_path': None, 'name': 'Cats & Dogs Films', 'origin_country': 'FR'}, {'id': 7106, 'logo_path': None, 'name': 'Sylicone', 'origin_country': 'FR'}, {'id': 201, 'logo_path': '/6UIpEURdjnmcJPwgTDRzVRuwADr.png', 'name': 'ARTE', 'origin_country': 'FR'}, {'id': 639, 'logo_path': '/fN...","[{'iso_3166_1': 'FR', 'name': 'France'}]",2004-01-16,0.0,60.0,"[{'english_name': 'French', 'iso_639_1': 'fr', 'name': 'Français'}]",Released,,The Clitoris: Forbidden Pleasure,False,8.0,2,,"[Cats & Dogs Films, Sylicone, ARTE, DR TV, Planète+, SBS TV, CNC, PROCIREP, Angoa-Agicoa]",
78393,False,,,0.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}, {'id': 18, 'name': 'Drama'}]",,640680,tt7246872,en,Tokyo Bedoone Tavaghof,,0.600,/764JbpPIGeApSheIj0v5WvAUpuk.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2003-09-10,0.0,98.0,"[{'english_name': 'Persian', 'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,,"Tokyo, Non-Stop",False,0.0,0,,[],
78394,False,,,0.0,"[{'id': 99, 'name': 'Documentary'}]",https://www.michaelraeburn.com/site/web/mr/zimbabwe-countdown_documentary,515137,tt7637550,en,Zimbabwe Countdown,"Once touted as Zimbabwe's saviour, Robert Mugabe has become synonymous with bad government and misrule. But where did it all go wrong?",0.711,/vnpAEhWBDhf7I6m4XZqj3Vf27px.jpg,"[{'id': 94, 'logo_path': '/huC7HqorvUThGIrENrbcHmQVUA0.png', 'name': 'ARTE France Cinéma', 'origin_country': 'FR'}]","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'ZW', 'name': 'Zimbabwe'}]",2003-08-28,0.0,55.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Zimbabwe Countdown,False,5.0,4,,[ARTE France Cinéma],
78395,False,,,0.0,"[{'id': 99, 'name': 'Documentary'}]",,713681,tt9414698,en,The Hot Shoe,"A Documentary Movie About Card Counting In The Casino Game of 21. The director used $5,000 of the movie's budget to learn the counting methods, and put them to the test on screen.",0.600,,[],[],2003-01-01,0.0,96.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,The Hot Shoe,False,0.0,0,,[],


In [103]:
## Keeping some columns
cols_to_keep = ['imdb_id','budget','revenue','certification',
                'original_language','release_date','popularity','vote_average',
               'vote_count']
mvp = df[cols_to_keep]
mvp

Unnamed: 0,imdb_id,budget,revenue,certification,original_language,release_date,popularity,vote_average,vote_count
0,tt0144280,100000.0,0.0,,en,1997-01-01,0.714,2.0,2
1,tt0197633,0.0,0.0,NR,en,2006-01-31,1.957,3.8,8
2,tt0204250,0.0,0.0,,en,2006-09-18,0.676,3.0,1
3,tt0206634,76000000.0,70595464.0,R,en,2006-09-22,18.314,7.6,5794
4,tt0244521,0.0,0.0,R,en,2006-01-01,4.218,5.4,39
...,...,...,...,...,...,...,...,...,...
78392,tt7098636,0.0,0.0,,fr,2004-01-16,1.400,8.0,2
78393,tt7246872,0.0,0.0,,en,2003-09-10,0.600,0.0,0
78394,tt7637550,0.0,0.0,,en,2003-08-28,0.711,5.0,4
78395,tt9414698,0.0,0.0,,en,2003-01-01,0.600,0.0,0


In [104]:
# table_production_companies = df_companies

In [105]:
mvp.isna().sum()

imdb_id                  0
budget                   0
revenue                  0
certification        61436
original_language        0
release_date          1955
popularity               0
vote_average             0
vote_count               0
dtype: int64

In [106]:
mvp

Unnamed: 0,imdb_id,budget,revenue,certification,original_language,release_date,popularity,vote_average,vote_count
0,tt0144280,100000.0,0.0,,en,1997-01-01,0.714,2.0,2
1,tt0197633,0.0,0.0,NR,en,2006-01-31,1.957,3.8,8
2,tt0204250,0.0,0.0,,en,2006-09-18,0.676,3.0,1
3,tt0206634,76000000.0,70595464.0,R,en,2006-09-22,18.314,7.6,5794
4,tt0244521,0.0,0.0,R,en,2006-01-01,4.218,5.4,39
...,...,...,...,...,...,...,...,...,...
78392,tt7098636,0.0,0.0,,fr,2004-01-16,1.400,8.0,2
78393,tt7246872,0.0,0.0,,en,2003-09-10,0.600,0.0,0
78394,tt7637550,0.0,0.0,,en,2003-08-28,0.711,5.0,4
78395,tt9414698,0.0,0.0,,en,2003-01-01,0.600,0.0,0


In [107]:
# ## saving text length
# key_len = mvp['imdb_id'].map(len).max()
# cert_len = mvp['certification'].fillna('').map(len).max()
# lang_len = mvp['original_language'].map(len).max()

# key_len, cert_len,lang_len

In [108]:
# ## saving schema
# api_data_schema = {'imdb_id':String(key_len+1), 
#                  'budget':Float(),
#                    'revenue':Float(),
#                  'certification':Text(cert_len+1),
#                   'original_language':Text(lang_len+1)}
api_data_schema = get_schema(mvp.fillna('MISSING'))
api_data_schema

{'imdb_id': String(length=11),
 'budget': Float(),
 'revenue': Float(),
 'certification': String(length=32),
 'original_language': String(length=3),
 'release_date': String(length=11),
 'popularity': Float(),
 'vote_average': Float(),
 'vote_count': Integer()}
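The `get_schema` helper used above is defined elsewhere in the notebook, so its exact logic is not shown here. As a rough illustration of the idea behind the commented-out cells (longest string length plus one, floats and integers mapped directly), a minimal sketch might look like the following. The name `infer_sql_schema`, its `padding` parameter, and the demo rows are assumptions made for this example, not the notebook's actual implementation.

```python
import pandas as pd
from sqlalchemy.types import Float, Integer, String


def infer_sql_schema(frame, padding=1):
    """Hypothetical helper: map each column to a SQLAlchemy type.

    Text columns get String(longest value + padding); missing values
    are treated as empty strings when measuring length.
    """
    schema = {}
    for col in frame.columns:
        series = frame[col]
        if pd.api.types.is_integer_dtype(series):
            schema[col] = Integer()
        elif pd.api.types.is_float_dtype(series):
            schema[col] = Float()
        else:
            max_len = int(series.fillna("").astype(str).str.len().max())
            schema[col] = String(max_len + padding)
    return schema


# Small demo frame (made-up subset of columns, for illustration only)
demo = pd.DataFrame({
    "imdb_id": ["tt0144280", "tt0206634"],
    "budget": [100000.0, 76000000.0],
    "certification": [None, "R"],
    "vote_count": [2, 5794],
})
demo_schema = infer_sql_schema(demo)
```

The resulting dictionary can be passed to `to_sql` through its `dtype` argument, in the same way `api_data_schema` is used in this notebook.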

In [109]:
mvp = mvp.drop_duplicates()
mvp

Unnamed: 0,imdb_id,budget,revenue,certification,original_language,release_date,popularity,vote_average,vote_count
0,tt0144280,100000.0,0.0,,en,1997-01-01,0.714,2.0,2
1,tt0197633,0.0,0.0,NR,en,2006-01-31,1.957,3.8,8
2,tt0204250,0.0,0.0,,en,2006-09-18,0.676,3.0,1
3,tt0206634,76000000.0,70595464.0,R,en,2006-09-22,18.314,7.6,5794
4,tt0244521,0.0,0.0,R,en,2006-01-01,4.218,5.4,39
...,...,...,...,...,...,...,...,...,...
78392,tt7098636,0.0,0.0,,fr,2004-01-16,1.400,8.0,2
78393,tt7246872,0.0,0.0,,en,2003-09-10,0.600,0.0,0
78394,tt7637550,0.0,0.0,,en,2003-08-28,0.711,5.0,4
78395,tt9414698,0.0,0.0,,en,2003-01-01,0.600,0.0,0
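Note that `drop_duplicates()` with no arguments removes only rows that are identical in every column, so on its own it does not guarantee that `imdb_id` is unique, which a primary key requires. A minimal sketch of a check that could be run before adding the key is shown below; the helper name `duplicated_keys` and the demo rows are hypothetical.

```python
import pandas as pd


def duplicated_keys(frame, key):
    """Return every row whose key value appears more than once."""
    return frame[frame.duplicated(subset=[key], keep=False)]


# Made-up rows: the first two share an imdb_id but differ in budget
demo = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000001", "tt0000002"],
    "budget": [0.0, 100.0, 0.0],
})

deduped = demo.drop_duplicates()                # keeps all 3 rows (none are fully identical)
conflicts = duplicated_keys(deduped, "imdb_id")  # the 2 rows that share a key

# One possible resolution (an assumption): keep the first row per key
one_per_key = deduped.drop_duplicates(subset=["imdb_id"], keep="first")
```

If `conflicts` is empty for the real data, the `ALTER TABLE ... ADD PRIMARY KEY` statement should not fail on duplicate keys.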


In [110]:
mvp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 78397 entries, 0 to 78396
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   imdb_id            78397 non-null  object 
 1   budget             78397 non-null  float64
 2   revenue            78397 non-null  float64
 3   certification      16961 non-null  object 
 4   original_language  78397 non-null  object 
 5   release_date       76442 non-null  object 
 6   popularity         78397 non-null  float64
 7   vote_average       78397 non-null  float64
 8   vote_count         78397 non-null  int64  
dtypes: float64(4), int64(1), object(4)
memory usage: 6.0+ MB


In [111]:
# saving mvp version
## Primary key is imdb_id
mvp.to_sql('tmdb',engine, index=False,dtype=api_data_schema, if_exists='replace')
engine.execute('ALTER TABLE tmdb ADD PRIMARY KEY (`imdb_id`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x2928e9b50>

In [112]:
q = """SELECT * FROM tmdb LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,imdb_id,budget,revenue,certification,original_language,release_date,popularity,vote_average,vote_count
0,tt0035423,48000000.0,76019000.0,PG-13,en,2001-12-25,9.183,6.3,1112
1,tt0062336,0.0,0.0,,es,2020-02-21,1.4,5.3,3
2,tt0069049,12000000.0,0.0,R,en,2018-11-02,5.155,6.7,155
3,tt0088751,350000.0,0.0,,en,2005-04-22,2.19,3.4,5
4,tt0093119,7500000.0,0.0,,en,2020-02-17,9.226,4.4,14


In [113]:
df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,production_company_list,collection
0,False,,,100000.0,[],,30356,tt0144280,en,Whispers from a Shallow Grave,"When a beautiful photo model disappears, police track the photographer who last saw her. At his trial, the model herself tells her side of the story.",0.714,/cjRqAh3u2Z0MBx9GtT1k7Z9732w.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1997-01-01,0.0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Whispers from a Shallow Grave,False,2.0,2,,[],
1,False,/1hGrzTajJ5JC6VfPzrUj8Gj5t5k.jpg,,0.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10402, 'name': 'Music'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 27, 'name': 'Horror'}]",,58520,tt0197633,en,Live Freaky Die Freaky,"It is the year 3069 and the Earth has been robbed of all its natural resources, destroyed by war and ozone depletion.",1.957,/qt7IVXzGca5wS68WZ6PBWHkTcMz.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-31,0.0,75.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,,Live Freaky Die Freaky,False,3.8,8,NR,[],
2,False,/qKiOZMJH9Riqlk2SkUOgSMZHHJu.jpg,,0.0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 35, 'name': 'Comedy'}]",,459563,tt0204250,en,Death of a Saleswoman,"Top-ranking RubberTubber saleswoman Agatha J. Ruby was brutally beaten and shot on her morning jog by an unknown assailant. Over six thousand dollars worth of her best plastic storage ware was stolen from her trunk. Nearly all 200 residents in her hometown of Mametville, Washington knew her pers...",0.676,/tIfllOpkhZl5sbqZd1X77iiaJDj.jpg,"[{'id': 90905, 'logo_path': None, 'name': 'Girl Next Door Productions', 'origin_country': ''}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-18,0.0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,6 housewives. 1 dead body. A trunk full of plastic storage wear.,Death of a Saleswoman,False,3.0,1,,[Girl Next Door Productions],
3,False,/z8FzomPxfCb2nv827VWLSI6oYiB.jpg,,76000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}, {'id': 878, 'name': 'Science Fiction'}]",http://www.universalstudiosentertainment.com/children-of-men/,9693,tt0206634,en,Children of Men,"In 2027, in a chaotic world in which humans can no longer procreate, a former activist agrees to help transport a miraculously pregnant woman to a sanctuary at sea, where her child's birth may help scientists save the future of humankind.",18.314,/k9IAS4TehZFcKi4HVByxZNPfqex.jpg,"[{'id': 33, 'logo_path': '/8lvHyhjr8oUKOOy2dKXoALWKdp0.png', 'name': 'Universal Pictures', 'origin_country': 'US'}, {'id': 655, 'logo_path': None, 'name': 'Strike Entertainment', 'origin_country': ''}, {'id': 11429, 'logo_path': None, 'name': 'Hit & Run Productions', 'origin_country': ''}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-09-22,70595464.0,109.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,The future's a thing of the past.,Children of Men,False,7.6,5794,R,"[Universal Pictures, Strike Entertainment, Hit & Run Productions]",
4,False,/4W53mm2nvOtiOuPNW2oiBm9pmUZ.jpg,,0.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'name': 'Crime'}]",,9925,tt0244521,en,Funny Money,"Henry Perkins, a mild-mannered accountant, accidentally trades briefcases with another man, to find out that there's a million dollars inside. Henry tells his unsuspecting wife of their new-found fortune, but she doesn't embrace it as well as he does. Soon they're joined by their best friends, a...",4.218,/oG3jWUZiDqqrdk3oKwPEkuwoN89.jpg,"[{'id': 110622, 'logo_path': None, 'name': 'FWE Picture Company', 'origin_country': ''}, {'id': 30584, 'logo_path': '/7s0nJTTztkOVeIgavfOwx36EBtj.png', 'name': 'Thinkfactory Media', 'origin_country': 'US'}, {'id': 110623, 'logo_path': None, 'name': 'Tobebo Filmproduktions', 'origin_country': ''}...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'RO', 'name': 'Romania'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",2006-01-01,0.0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]",Released,"Henry Perkins accidentally trades briefcases with another man, to find out that there's five million dollars inside...",Funny Money,False,5.4,39,R,"[FWE Picture Company, Thinkfactory Media, Tobebo Filmproduktions, Castel Film]",


In [114]:
# # saving AAB version
# ## Primary key is imdb_id
# df.to_sql('tmdb_data_aab',engine, index=False,dtype=api_data_schema, if_exists='replace')
# engine.execute('ALTER TABLE tmdb_data_aab ADD PRIMARY KEY (`imdb_id`);')

### Troubleshoot: Error with AAB Version

- Saving the AAB version with `to_sql` raised an error.
- Built a loop that adds one column at a time and reruns the command that errored, to find which columns cause the failure.
- Used a function to rebuild the schema programmatically on each pass (suspecting the schema was part of the issue, though it may have been fine).

## Bookmark: Need to Fix Title/Original Title

- Non-Latin characters (seen mostly in Asian-language titles) in `title`/`original_title` seem to be the issue. Need to identify and remove rows with incompatible characters.
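One way to locate such rows is to test each title against the character set the database is expected to use. The sketch below assumes the failure comes from the table or connection character set; the notebook does not state which one was in effect, so `latin-1` is only a placeholder, and `unencodable_chars` is a hypothetical helper name.

```python
def unencodable_chars(text, encoding="latin-1"):
    """Return the characters in `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# Demo strings taken from values that appear in the TMDB data above
samples = ["Children of Men", "Français", "فارسی"]

# Keep only the strings that contain at least one unencodable character
flagged = {s: unencodable_chars(s) for s in samples if unencodable_chars(s)}
```

On a DataFrame, the same function could be applied with something like `df['title'].map(unencodable_chars).map(len) > 0` to build a boolean mask of affected rows. Changing the database character set (for example to `utf8mb4`) may be an alternative to removing rows, but that has not been tested here.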

In [115]:
len(df)

78397

In [116]:
# # identifying incompatible rows
# bad_titles = (df['original_title']!=df['title']) &\
#                 (df['original_language']!='en') &\
#                ~df['spoken_languages'].str.contains('english',case=False)
# df[bad_titles]

In [117]:
# df = df[~bad_titles]
# df

In [118]:
# df['revenue'] = df['revenue'].astype(float)

In [119]:
# df = df.reset_index(drop=True)
# df

In [120]:
# schema = get_schema(df)
# schema['title'] = Text()
# schema['original_title'] = Text()

In [121]:
# schema

### Dropping Titles (they can be recovered by joining with the title_basics table)

In [122]:
# df_for_db = df.drop(columns=['title','original_title'])
# df_for_db

In [123]:
# df_for_db.to_sql('tmdb_data_aab',engine, index=False, 
#                             if_exists='replace',dtype=get_schema(df_for_db))
# #                           method='multi')

In [124]:
# ## loop through adding more columns and remove what errors
# good_cols = [*cols_to_keep]
# bad_cols = []
# all_cols = df.drop(columns=cols_to_keep).columns

# for col in all_cols:
#     print(f"- Adding {col}")
#     try: 
#         cols_to_try = [*good_cols, col]
#         df_filtered = df[ cols_to_try]
#         schema= get_schema(df_filtered)
#         ## Primary key is imdb_id
#         df_filtered.to_sql('tmdb_data_aab',engine, index=False,
#                             if_exists='replace',#dtype=schema,
#                           method='multi')
        
#         # append col name to good_cols if no error
#         good_cols.append(col)
#     except Exception as e:
#         print("   - ERROR")
#         bad_cols.append({col:e})
# bad_cols    

In [125]:
# engine.execute('ALTER TABLE tmdb_data_aab ADD PRIMARY KEY (`imdb_id`);')

In [126]:
# q = """SELECT * FROM tmdb_data_aab LIMIT 5"""
# pd.read_sql(q,engine)

# Final Tables Check

In [127]:
## Confirm that all of the expected tables now exist in the database
q = """SHOW TABLES;"""
pd.read_sql(q, engine)

Unnamed: 0,Tables_in_movies-viz
0,collections
1,genres
2,production_companies
3,title_basics
4,title_collection
5,title_genres
6,title_production_company
7,title_ratings
8,tmdb
