# Part 3 - ETL & MySQL Database (Revisited) v2

- **02/10/23**


<blockquote class="alert alert-info" style="font-size:1.8rem; line-height:1.3em;">
<p>⭐️ <strong>Revisiting to normalize more tables for Tableau Dashboard.</strong></p>
<ul>
<li>There are 3 new sections to this notebook/workflow:<ol>
<li><a href="#new" rel="nofollow">Click here</a> for cleaning and normalizing "production_companies" and "belongs_to_collection" columns from TMDB API.</li>
<li><a href="#new2" rel="nofollow">Click here</a> for saving the new tables to MySQL DB</li>
<li><a href="#new-imdb" rel="nofollow">Click here</a> for saving the new title_crew, title_principals, and name_basics tables from imdb.</li>
</ol>
</li>
</ul>
</blockquote>

<hr>

## Introduction

Your stakeholder wants you to take the data you have been cleaning and collecting in Parts 1 &amp; 2 of the project and use it to create a MySQL database for them.
<p>You should normalize the tables as best you can before adding them to your new database.</p>
<ul>
<li>Note: an important exception to their request is that they would like you to keep all of the data from the TMDB API in 1 table together (even though it will not be perfectly normalized).
<ul>
<li>You only need to keep the imdb_id, revenue, budget, and certification columns.</li>
</ul>
</li>
</ul>
<h4>Required Transformation steps:</h4>
<ul>
<li>
<p><strong>Normalize Genre:</strong></p>
<ul>
<li>Convert the single string of genres from title basics into 2 new tables.
<ol>
<li>
<p><code>title_genres</code>:  with the columns:</p>
<ul>
<li>tconst</li>
<li>genre_id</li>
</ul>
</li>
<li>
<p><code>genres</code>:</p>
<ul>
<li>genre_id</li>
<li>genre_name</li>
</ul>
</li>
</ol>
</li>
</ul>
</li>
<li>
<p><strong>Discard unnecessary information:</strong></p>
<ul>
<li>For the <code>title basics</code> table, drop the following columns:
<ul>
<li>"original_title" (we will use the primary title column instead)</li>
<li>"isAdult" ("Adult" will show up in the genres so this is redundant information).</li>
<li>"titleType" (every row will be a movie).</li>
<li>"genres" and other variants of genre (genre is now represented in the 2 new tables described above).</li>
</ul>
</li>
<li>Do not include the <code>title_akas</code> table in your SQL database.
<ul>
<li>You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4>MySQL Database Requirements</h4>
<ul>
<li>
<p>Use sqlalchemy with pandas to execute your SQL queries inside your notebook.</p>
</li>
<li>
<p>Create a new database on your MySQL server and call it "movies".</p>
</li>
<li>
<p>Make sure to have the following tables in your "movies" database:</p>
<ul>
<li><code>title_basics</code></li>
<li><code>title_ratings</code></li>
<li><code>title_genres</code></li>
<li><code>genres</code></li>
<li><code>tmdb_data</code></li>
</ul>
</li>
<li>
<p>Make sure to set a Primary Key for each table.</p>
</li>
<li>
<p>After creating each table, show the first 5 rows of that table using a SQL query.</p>
</li>
<li>
<p>Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.</p></li>
</ul>
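
The primary-key, preview, and table-listing requirements can be rehearsed without a MySQL server. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL, so `SHOW TABLES` is replaced by a query against `sqlite_master`, and the three-row `genres` frame is sample data typed in by hand.

```python
import sqlite3
import pandas as pd

# Hand-typed sample rows (assumption: stands in for the real genres table)
genres = pd.DataFrame({"genre_id": [0, 1, 2],
                       "genre_name": ["Action", "Adult", "Adventure"]})

# In-memory SQLite database used as a stand-in for the MySQL server
conn = sqlite3.connect(":memory:")

# Create the table with its primary key first, then append the rows
conn.execute("CREATE TABLE genres (genre_id INTEGER PRIMARY KEY, genre_name TEXT)")
genres.to_sql("genres", conn, if_exists="append", index=False)

# Show the first 5 rows with a SQL query
preview = pd.read_sql("SELECT * FROM genres LIMIT 5;", conn)

# SQLite equivalent of MySQL's SHOW TABLES
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
print(preview)
print(tables)
```

On MySQL the same pattern would go through a sqlalchemy engine, with `SHOW TABLES;` as the final query.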


## Updated Code/Functions

- Added a more streamlined way to add new tables to the database.
- Added a `DB_NAME` variable to avoid overwriting previous databases.

<div class="alert alert-danger">
    <blockquote style="font-weight:bold;"> Note: update the <code>DB_NAME</code> variable immediately below this cell to change the name of the exported database.</blockquote>
</div>



In [1]:
# VAR FOR CHANGING MYSQL DB NAME TO AVOID OVERWRITING PREVIOUS
DB_NAME = "movies-db"
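
Because this name contains a hyphen, MySQL statements that mention it need backticks around it. A minimal sketch of how `DB_NAME` might feed a connection string and a `CREATE DATABASE` statement follows; the username, password, host, and the `pymysql` driver are all assumptions for illustration, not values from this project.

```python
from urllib.parse import quote_plus

DB_NAME = "movies-db"

# Hypothetical credentials (assumptions, replace with your own)
username = "root"
password = "p@ssw0rd"
host = "localhost"

# Special characters in the password must be URL-encoded for the connection string
connection_str = (
    f"mysql+pymysql://{username}:{quote_plus(password)}@{host}/{DB_NAME}"
)

# A hyphenated identifier must be wrapped in backticks inside SQL statements
create_db_sql = f"CREATE DATABASE IF NOT EXISTS `{DB_NAME}`;"

print(connection_str)
print(create_db_sql)
```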

In [2]:
from sqlalchemy.types import String, Text, Float, Integer, Boolean

def get_schema(table,debug=False):
    ## save pandas dtypes in list, make empty dict
    dtypes = table.dtypes
    schema = {}
    
    # for each column
    for col in dtypes.index:
        ## print info if in debug mode
        if debug:
            print(f"{col} = {dtypes.loc[col]}")

        ## if its a string column (object)
        if dtypes.loc[col]=='object':
            
            ## Fill null values and make sure whole column is str
            data = table[col].fillna('').astype(str)
            
            ## get len first
            len_str = data.map(len).max()
            
            ## if the string is shorter than 21845 characters use String
            # (21845 = 65535 // 3: a MySQL row holds at most 65,535 bytes and
            #  utf8 (utf8mb3) needs up to 3 bytes per character)
            if len_str < 21845:
                schema[col] = String( len_str + 1)
                
            ## If longer use Text
            else:
                schema[col] = Text(len_str+1)
        
        # if float make Float
        elif dtypes.loc[col] == 'float':
            schema[col] = Float()

        ## if int make Integer
        elif dtypes.loc[col] == 'int':
            schema[col] = Integer()
            
        ## if bool make Boolean
        elif dtypes.loc[col] == 'bool':
            schema[col] = Boolean()
            
    return schema
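
The function above compares dtypes to the strings `'int'` and `'float'`, which may behave differently depending on the platform and NumPy version. As a hedged alternative, the sketch below makes the same column-by-column decision with the helpers in `pandas.api.types`. It returns plain type-name strings instead of sqlalchemy objects so it runs without a database driver; `describe_sql_types` and the demo frame are hypothetical names introduced here.

```python
import pandas as pd
from pandas.api import types as ptypes

def describe_sql_types(table):
    """Map each column to the name of the SQL type get_schema would choose."""
    kinds = {}
    for col in table.columns:
        dtype = table[col].dtype
        if ptypes.is_bool_dtype(dtype):
            kinds[col] = "Boolean"
        elif ptypes.is_integer_dtype(dtype):
            kinds[col] = "Integer"
        elif ptypes.is_float_dtype(dtype):
            kinds[col] = "Float"
        elif ptypes.is_object_dtype(dtype) or ptypes.is_string_dtype(dtype):
            # Longest string in the column, with nulls treated as empty strings
            max_len = table[col].fillna("").astype(str).map(len).max()
            kinds[col] = f"String({max_len + 1})"
    return kinds

# Small demo frame using two rows from the ratings preview
demo = pd.DataFrame({"tconst": ["tt0035423", "tt0062336"],
                     "averageRating": [6.4, 6.4],
                     "numVotes": [88417, 185]})
result = describe_sql_types(demo)
print(result)
```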


# Original Project Part 3

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os,json
import missingno as ms

plt.rcParams['figure.facecolor'] = 'white'

pd.set_option('display.max_columns',100)

In [4]:
## Folder to save IMDB files
FOLDER = "Data/"
# os.makedirs(FOLDER,exist_ok=True)
sorted(os.listdir(FOLDER))

['.DS_Store',
 '.ipynb_checkpoints',
 'For Tableau',
 'combined_tmdb_api_data.csv.gz',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'final_tmdb_data_2003.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'final_tmdb_data_2005.csv.gz',
 'final_tmdb_data_2006.csv.gz',
 'final_tmdb_data_2007.csv.gz',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2011.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'final_tmdb_data_2013.csv.gz',
 'final_tmdb_data_2014.csv.gz',
 'final_tmdb_data_2015.csv.gz',
 'final_tmdb_data_2016.csv.gz',
 'final_tmdb_data_2017.csv.gz',
 'final_tmdb_data_2018.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'final_tmdb_data_2021.csv.gz',
 'final_tmdb_data_2022.csv.gz',
 'final_tmdb_data_2023.csv.gz',
 'final_tmdb_data_2024.csv.gz',
 'modeling',
 'name_basics.csv.gz',
 'processed-nlp-data.joblib',
 'testing-data-tf',
 'title_akas_cleaned.csv.gz

## Extract: Loading Previous Data

### Title Basics

In [5]:
## title basics
basics = pd.read_csv(f'{FOLDER}title_basics_cleaned.csv.gz',low_memory=False)
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128875 entries, 0 to 128874
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          128875 non-null  object 
 1   titleType       128875 non-null  object 
 2   primaryTitle    128875 non-null  object 
 3   originalTitle   128875 non-null  object 
 4   isAdult         128875 non-null  int64  
 5   startYear       128875 non-null  float64
 6   runtimeMinutes  128875 non-null  int64  
 7   genres          128875 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 7.9+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
3,tt0070596,movie,Socialist Realism,El realismo socialista,0,2023.0,78,Drama
4,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008.0,94,Horror


### Title Ratings

In [6]:
## title ratings
ratings = pd.read_csv(f"{FOLDER}title_ratings_cleaned.csv.gz",low_memory=False)
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98163 entries, 0 to 98162
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         98163 non-null  object 
 1   averageRating  98163 non-null  float64
 2   numVotes       98163 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 2.2+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,88417
1,tt0062336,6.4,185
2,tt0069049,6.7,7939
3,tt0070596,7.5,52
4,tt0082328,5.9,1765


### TMDB API Results

- Students should have combined the years 2000 and 2001 as part of Part 2. I forgot to do so, so I am doing it here.

In [7]:
import json
from pprint import pprint

with open("config/filepaths.json") as f:
    FPATHS = json.load(f)
pprint(FPATHS['data'])

{'app': {'movie-data_csv': 'app-assets/movie-data-streamlit.csv'},
 'ml-nlp': {'reviews-with-target_json': 'Data-NLP/modeling/processed-nlp-reviews-for-ml.json',
            'test_joblib': 'Data-NLP/modeling/testing-data.joblib',
            'train_joblib': 'Data-NLP/modeling/training-data.joblib'},
 'ml-tabular': {'movie-info-with-ml-target_json': 'Data/modeling/processed-movie-data-for-ml.json',
                'test_joblib': 'Data/modeling/testing-data.joblib',
                'train_joblib': 'Data/modeling/training-data.joblib'},
 'nn': {'test_dir': 'Data/modeling/testing-data-tf/',
        'train_dir': 'Data/modeling/training-data-tf/'},
 'raw': {'cleaned-tmdb-movie-info-tmdb_csv': 'Data-NLP/tmdb-movie-info.csv.gz',
         'combined-tmdb-movie-data_csv': 'Data/combined_tmdb_api_data.csv.gz',
         'movie-reviews-tmdb_csv': 'Data-NLP/tmdb-movie-reviews.csv.gz',
         'processed-reviews-spacy_joblib': 'Data-NLP/processed-nlp-reviews.joblib',
         'processed-reviews-spacy

In [8]:
# # ## INSTRUCTOR CODE (not taught to students)
# # import glob
# # q = f"{FOLDER}final*.csv.gz"
# # files = glob.glob(q)
# df = pd.read_csv('Data/combined_tmdb_api_data.csv.gz')
# # files

In [9]:
fpath_movie_info = FPATHS['data']['raw']['cleaned-tmdb-movie-info-tmdb_csv']

In [10]:
df = pd.read_csv(fpath_movie_info, low_memory=False, lineterminator='\n')
df

Unnamed: 0,belongs_to_collection,budget,id,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,certification,Release Year,genre_names,production_companies_names,collection_name,production_countries_iso,production_countries_name,spoken_language_iso,poster_path,adult,backdrop_path,homepage
0,False,10000000,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way throu...,2.559,2000-09-22,0,86,Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,,2000.0,Comedy; Music; Romance,Sullivan Street Productions; Michael Ritchie P...,,US,United States of America,en,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,
1,False,0,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two co...,3.393,2000-11-15,0,100,Released,The ultimate showdown on a forbidden planet.,For the Cause,False,4.958,12,,2000.0,Science Fiction,Dimension Films; Grand Design Entertainment; M...,,US,United States of America,en,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,False,,
2,False,0,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends...",2.748,2000-04-14,0,165,Released,,Gang,False,5.000,2,,2000.0,Drama; Action; Crime,,,IN,India,hi,/dYcuiiBDpPUvCcPbiWdH4REjGn3.jpg,False,/krEZg9tb6blhc7sV6Us2ZGQ0gA.jpg,
3,False,150000,843,tt0118694,cn,花樣年華,Two neighbors become intimate after discoverin...,27.940,2000-09-29,14204632,99,Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,False,8.109,2433,PG,2000.0,Drama; Romance,Block 2 Pictures; Orly Films; Jet Tone Films; ...,,FR; HK; NL; CN,France; Hong Kong; Netherlands; China,cn; fr; es,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,
4,False,0,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Gree...",5.495,2000-09-02,0,99,Released,There's a fine line between friendship and bet...,Chinese Coffee,False,6.600,56,R,2000.0,Drama,Chal Productions; The Shooting Gallery,,US,United States of America,en,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,False,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93054,False,0,1167732,tt8736506,en,To Die Alone,After suffering a terrible injury while hiking...,5.474,2024-02-10,0,84,In Production,,To Die Alone,False,0.000,0,NR,2024.0,Thriller; Drama,Glass Creek Films; Charming Stranger Films,,US,United States of America,en,/d7rHwkRlCHhVGBTQQkvGALeC5oz.jpg,False,/dyQvD0BDlWk187fDBmJTU2uUVGH.jpg,https://www.todiealonefilm.com
93055,False,0,560016,tt9214772,en,Monkey Man,An unlikely hero emerges from prison to take o...,7.925,2024-01-22,0,0,Post Production,,Monkey Man,False,0.000,0,,2024.0,Thriller; Drama,Thunder Road; 87Eleven; Bron Studios; Creative...,,CA; US,Canada; United States of America,en; hi,,False,,
93056,False,0,1163894,tt9357860,en,The Hopeful,Aboard a steamship sailing across the Atlantic...,2.594,2024-02-14,0,90,Post Production,The end of the world is just the beginning.,The Hopeful,False,0.000,0,,2024.0,Drama,Advent Hope Productions; Hope Studios; Kyle Po...,,CA,Canada,en,/9zRA1Vefx1gJdt6fKPDt4JJ6FGC.jpg,False,,
93057,False,0,1184693,tt9680562,mr,सत्यशोधक,The film follows the life of social reformer a...,4.889,2024-01-05,0,145,Released,HE BRINGS THE TORCH OF REVOLUTION TO ENLIGHTEN...,Satyashodhak,False,0.000,0,,2024.0,Drama; History,Samata Films; Abhita Films Production Pvt. Ltd...,,IN,India,mr,/jzQPp0oSYPgu8crP3aq8EIjQ6Ti.jpg,False,/C28T7GAlCJQFWVDB4vUV8eZyf9.jpg,


In [11]:
(df['imdb_id']!='0').all()

True

In [12]:
# ## Dropping Placeholder rows with id=0
# df = df.loc[ df['imdb_id']!='0']
# df

In [13]:
# df = df.reset_index(drop=True)
# df

## Transform

### basics

- Normalize and separate genre.
- Drop the following columns:
    - "original_title" (we will use the primary title column instead).
    - "isAdult" ("Adult" will show up in the genres so this is redundant information).
    - "titleType" (every row will be a movie).
    - "genres" and other variants of genre (genre is now represented in the 2 new tables described above).

In [14]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128875 entries, 0 to 128874
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          128875 non-null  object 
 1   titleType       128875 non-null  object 
 2   primaryTitle    128875 non-null  object 
 3   originalTitle   128875 non-null  object 
 4   isAdult         128875 non-null  int64  
 5   startYear       128875 non-null  float64
 6   runtimeMinutes  128875 non-null  int64  
 7   genres          128875 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 7.9+ MB


In [15]:
cols_to_drop = ['originalTitle','isAdult','titleType'] #endYear not req'd
basics = basics.drop(columns=cols_to_drop)
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama
3,tt0070596,Socialist Realism,2023.0,78,Drama
4,tt0082328,Embodiment of Evil,2008.0,94,Horror
...,...,...,...,...,...
128870,tt9915436,Vida em Movimento,2019.0,70,Documentary
128871,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy"
128872,tt9916170,The Rehearsal,2019.0,51,Drama
128873,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller"


### Normalizing Genre

In [16]:
## No missing genres to fill: basics.info() above shows no nulls in 'genres'
## Create a new column with the single-string genres as a list of strings
basics['genres_split'] = basics['genres'].str.split(',')
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance","[Comedy, Fantasy, Romance]"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,[Drama]
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama,[Drama]
3,tt0070596,Socialist Realism,2023.0,78,Drama,[Drama]
4,tt0082328,Embodiment of Evil,2008.0,94,Horror,[Horror]
...,...,...,...,...,...,...
128870,tt9915436,Vida em Movimento,2019.0,70,Documentary,[Documentary]
128871,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy","[Comedy, Drama, Fantasy]"
128872,tt9916170,The Rehearsal,2019.0,51,Drama,[Drama]
128873,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"


In [17]:
## Explode the dataframe to make each genre into a separate row
exploded_genres = basics.explode('genres_split')
exploded_genres

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Romance
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,Drama
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama,Drama
...,...,...,...,...,...,...
128873,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Action
128873,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Adventure
128873,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Thriller
128874,tt9916362,Coven,2020.0,92,"Drama,History",Drama


#### Create `title_genres` Table
- tconst : Genre_ID

In [18]:
## Save just tconst and genres_split as new df
title_genres = exploded_genres[['tconst','genres_split']].copy()
title_genres.head()

Unnamed: 0,tconst,genres_split
0,tt0035423,Comedy
0,tt0035423,Fantasy
0,tt0035423,Romance
1,tt0062336,Drama
2,tt0069049,Drama


#### Replacing Text Genres with Integer IDs

In [19]:
## Getting unique list of genres
unique_genres = sorted(title_genres['genres_split'].unique())
unique_genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [20]:
## Make a dictionary with the unique genres as the keys and the new integer ids as the values
genre_id_map = dict(zip(unique_genres, range(len(unique_genres))))
genre_id_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Documentary': 7,
 'Drama': 8,
 'Family': 9,
 'Fantasy': 10,
 'Game-Show': 11,
 'History': 12,
 'Horror': 13,
 'Music': 14,
 'Musical': 15,
 'Mystery': 16,
 'News': 17,
 'Reality-TV': 18,
 'Romance': 19,
 'Sci-Fi': 20,
 'Sport': 21,
 'Talk-Show': 22,
 'Thriller': 23,
 'War': 24,
 'Western': 25}

#### Replacing Values in title_genres Table with Genre ID

In [21]:
## Use .map or .replace with our genre_id_map dictionary
title_genres['Genre_ID'] = title_genres['genres_split'].replace(genre_id_map)

## Drop the original genre column
title_genres.drop(columns=['genres_split'],inplace=True)
title_genres

Unnamed: 0,tconst,Genre_ID
0,tt0035423,5
0,tt0035423,10
0,tt0035423,19
1,tt0062336,8
2,tt0069049,8
...,...,...
128873,tt9916190,0
128873,tt9916190,2
128873,tt9916190,23
128874,tt9916362,8
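
Since each `tconst` now appears once per genre, `tconst` alone cannot serve as this table's primary key; the (`tconst`, `Genre_ID`) pair is the natural candidate. A quick check on a few rows copied from the output above, offered as a sketch rather than a required step (`title_genres_demo` is a hypothetical name):

```python
import pandas as pd

# First rows of the title_genres output shown above
title_genres_demo = pd.DataFrame({
    "tconst": ["tt0035423", "tt0035423", "tt0035423", "tt0062336"],
    "Genre_ID": [5, 10, 19, 8],
})

# tconst repeats, so it is not unique on its own
tconst_is_unique = title_genres_demo["tconst"].is_unique

# The (tconst, Genre_ID) pair has no duplicates, so it can act as a composite key
pair_is_unique = not title_genres_demo.duplicated(subset=["tconst", "Genre_ID"]).any()

print(tconst_is_unique, pair_is_unique)
```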


#### Create `genres` table
- Genre : Genre_ID

In [22]:
## Manually make dataframe with named cols from the .keys and .values
genre_lookup = pd.DataFrame({'Genre_Name': genre_id_map.keys(),
                             'Genre_ID':genre_id_map.values()})
genre_lookup.head()

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4
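
One way to sanity-check the normalization is to join the two new tables back together and confirm the original genre strings can be rebuilt. The sketch below repeats the split, explode, and map steps on a two-row sample; all `*_demo` names are hypothetical, and the rebuild relies on the original strings listing genres alphabetically, as the sample rows do.

```python
import pandas as pd

# Two rows taken from the basics preview
basics_demo = pd.DataFrame({
    "tconst": ["tt0035423", "tt0082328"],
    "genres": ["Comedy,Fantasy,Romance", "Horror"],
})

# Split the genre string and give each genre its own row
exploded = basics_demo.assign(
    genres_split=basics_demo["genres"].str.split(",")
).explode("genres_split")

# Integer id for each unique genre, in alphabetical order
unique_genres = sorted(exploded["genres_split"].unique())
genre_id_map = {name: idx for idx, name in enumerate(unique_genres)}

# Joiner table and lookup table
title_genres_demo = pd.DataFrame({
    "tconst": exploded["tconst"].values,
    "Genre_ID": exploded["genres_split"].map(genre_id_map).values,
})
genre_lookup_demo = pd.DataFrame({
    "Genre_Name": list(genre_id_map.keys()),
    "Genre_ID": list(genre_id_map.values()),
})

# Join back and rebuild one comma-separated string per title
rejoined = title_genres_demo.merge(genre_lookup_demo, on="Genre_ID")
rebuilt = (rejoined.sort_values(["tconst", "Genre_ID"])
                   .groupby("tconst")["Genre_Name"]
                   .agg(",".join))
print(rebuilt)
```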


In [23]:
## Dropping original genre columns 
basics = basics.drop(columns=['genres','genres_split'])
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70
2,tt0069049,The Other Side of the Wind,2018.0,122
3,tt0070596,Socialist Realism,2023.0,78
4,tt0082328,Embodiment of Evil,2008.0,94
...,...,...,...,...
128870,tt9915436,Vida em Movimento,2019.0,70
128871,tt9915872,The Last White Witch,2019.0,97
128872,tt9916170,The Rehearsal,2019.0,51
128873,tt9916190,Safeguard,2020.0,95





#   ⭐️ **[NEW] Normalizing More TMDB API Data**
<a id="new"></a>



In [24]:
# df

## ⭐️Clean & Normalize Production Companies


#### Handling `production_companies` column
- Each entry in the column is a string-version of a list of dictionaries. 
    - For each row, we need to save just the `name` key of every company from the list of dictionaries as a list of names.
        - There may be multiple companies for each movie, just like the genre column. 
    - We will then split/normalize the companies just like we did with genres.
    
- Tips: 
    - 1) Write a function that takes one cell from the 'production_companies' column at a time and returns a list of every "name" from the list of dictionaries. 
    
    - 2) To deal with null values, try replacing them with a string version of an empty list.
```
df['production_companies'] = df['production_companies'].fillna('[]')
```

In [25]:
# slicing out a test entry
test = df.iloc[3]['production_companies_names']
test

'Block 2 Pictures; Orly Films; Jet Tone Films; Paradis Films; Fortissimo Films'

In [26]:
# ## Function to get just the company names as a list 
# import json
# def get_company_name(x):
#     try:
#         x = x.replace('None','"None"')
#         x = x.replace('\\','')
        
#         x = x.replace("'s",'s')
#         for key in ['id','logo_path','name','origin_country']:
#             x = x.replace(f"'{key}'",f'"{key}"')
            
#         x = x.replace("'",'"')
        
#         x = json.loads(x)

#         companies = []
#         for company in x:
#             companies.append(company['name'])
#         return companies
#     except:
#         print(x)

In [27]:
# # testing the function
# get_company_name(test)

In [28]:
## Split the semicolon-separated company names into a list for each movie
## (splitting on ";" alone leaves a leading space on every name after the first)
df['production_company_list'] = df['production_companies_names'].str.split(";")
df['production_company_list']

0        [Sullivan Street Productions,  Michael Ritchie...
1        [Dimension Films,  Grand Design Entertainment,...
2                                                      NaN
3        [Block 2 Pictures,  Orly Films,  Jet Tone Film...
4                [Chal Productions,  The Shooting Gallery]
                               ...                        
93054        [Glass Creek Films,  Charming Stranger Films]
93055    [Thunder Road,  87Eleven,  Bron Studios,  Crea...
93056    [Advent Hope Productions,  Hope Studios,  Kyle...
93057    [Samata Films,  Abhita Films Production Pvt. L...
93058             [Atomic Monster,  Blumhouse Productions]
Name: production_company_list, Length: 93059, dtype: object
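
The doubled spaces in the lists above come from splitting on `";"` while the names are separated by `"; "`, so every name after the first keeps a leading space. A small sketch of one way to strip that whitespace after splitting, using sample strings from this dataset (the variable names are hypothetical):

```python
import pandas as pd

names = pd.Series(
    ["Block 2 Pictures; Orly Films; Jet Tone Films",
     None,
     "Chal Productions; The Shooting Gallery"],
    dtype=object,
)

# Splitting on ";" alone keeps the space that follows each separator
naive = names.str.split(";")

# Strip each name after splitting; null rows are passed through unchanged
clean = naive.apply(
    lambda lst: [name.strip() for name in lst] if isinstance(lst, list) else lst
)

print(naive[0])
print(clean[0])
```

Without this step, the same company can appear under two different keys (with and without the leading space) when unique names are collected.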

In [29]:
# checking % of null values.
df['production_company_list'].isna().sum() / len(df)

0.363489829033194

- Some company names caused errors. Investigating below:

In [30]:
# edge_case = df.loc[df['production_companies'].fillna('').str.contains("Donner"),'production_companies'].index#.iloc[0]
# edge_case

In [31]:
# donner_ids = [ 66,  466, 1023,  298,   42,  357,   76,  721,   18, 2975,  452,
#               47,  112,  125,  482,  150,  221,  141,  174]
# donner_ids

In [32]:
# ## Checking an example edge case
# df.loc[donner_ids[0], 'production_companies']

In [33]:
# ## testing function on edge cases.
# get_company_name(df.loc[donner_ids[0], 'production_companies'])

- 🚨 **Issue with company names that include an apostrophe.**
    - Did not find a solution that did not involve regular expressions.
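
One option that may avoid both regular expressions and manual quote replacement is `ast.literal_eval` from the standard library, which parses a string holding a Python literal (including `None` and names containing apostrophes). This is a sketch under the assumption that each raw cell is the `str()` of a list of dictionaries; `get_company_names` and the two sample companies are hypothetical.

```python
import ast
import pandas as pd

def get_company_names(cell):
    """Return the list of 'name' values from one stringified list of dicts."""
    if pd.isna(cell):
        return []
    companies = ast.literal_eval(cell)
    return [company["name"] for company in companies]

# Made-up sample cell, including a name with an apostrophe and a None value
raw = str([
    {"id": 1, "logo_path": None, "name": "Example's Pictures", "origin_country": "US"},
    {"id": 2, "logo_path": "/x.png", "name": "Sample Films", "origin_country": "US"},
])

names = get_company_names(raw)
print(names)
```

`ast.literal_eval` only evaluates literals, so it is safer than `eval` on data pulled from an API.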

In [34]:
pd.set_option('display.max_colwidth',300)

In [35]:
## additional investigation of companies that error'd
# df[df['production_company_list'].isna()]['production_companies']

#### Explode & Separate Production Companies

In [36]:
df_explode = df.explode('production_company_list')
df_explode

Unnamed: 0,belongs_to_collection,budget,id,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,certification,Release Year,genre_names,production_companies_names,collection_name,production_countries_iso,production_countries_name,spoken_language_iso,poster_path,adult,backdrop_path,homepage,production_company_list
0,False,10000000,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way through a forbidden romance and a dangerous travelling carnival.,2.559,2000-09-22,0,86,Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,,2000.0,Comedy; Music; Romance,"Sullivan Street Productions; Michael Ritchie Productions; Radmin Company, The; United Artists",,US,United States of America,en,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,Sullivan Street Productions
0,False,10000000,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way through a forbidden romance and a dangerous travelling carnival.,2.559,2000-09-22,0,86,Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,,2000.0,Comedy; Music; Romance,"Sullivan Street Productions; Michael Ritchie Productions; Radmin Company, The; United Artists",,US,United States of America,en,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,Michael Ritchie Productions
0,False,10000000,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way through a forbidden romance and a dangerous travelling carnival.,2.559,2000-09-22,0,86,Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,,2000.0,Comedy; Music; Romance,"Sullivan Street Productions; Michael Ritchie Productions; Radmin Company, The; United Artists",,US,United States of America,en,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,"Radmin Company, The"
0,False,10000000,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way through a forbidden romance and a dangerous travelling carnival.,2.559,2000-09-22,0,86,Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,,2000.0,Comedy; Music; Romance,"Sullivan Street Productions; Michael Ritchie Productions; Radmin Company, The; United Artists",,US,United States of America,en,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,United Artists
1,False,0,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two colonies of humans have been sent to a far away world. The colonists create a new civilization but have inherited the worst traits of their forebears and war breaks out leaving the new cities in ruins. It falls to a bunch of teens to bring peace to the...,3.393,2000-11-15,0,100,Released,The ultimate showdown on a forbidden planet.,For the Cause,False,4.958,12,,2000.0,Science Fiction,Dimension Films; Grand Design Entertainment; Martien Holdings A.V.V.; Nu Image,,US,United States of America,en,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,False,,,Dimension Films
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93057,False,0,1184693,tt9680562,mr,सत्यशोधक,The film follows the life of social reformer and writer Mahatma Jyotiba Phule.,4.889,2024-01-05,0,145,Released,HE BRINGS THE TORCH OF REVOLUTION TO ENLIGHTEN HIS PEOPLE,Satyashodhak,False,0.000,0,,2024.0,Drama; History,Samata Films; Abhita Films Production Pvt. Ltd.; Sound Ideaz Academy; Afterplay Studio,,IN,India,mr,/jzQPp0oSYPgu8crP3aq8EIjQ6Ti.jpg,False,/C28T7GAlCJQFWVDB4vUV8eZyf9.jpg,,Abhita Films Production Pvt. Ltd.
93057,False,0,1184693,tt9680562,mr,सत्यशोधक,The film follows the life of social reformer and writer Mahatma Jyotiba Phule.,4.889,2024-01-05,0,145,Released,HE BRINGS THE TORCH OF REVOLUTION TO ENLIGHTEN HIS PEOPLE,Satyashodhak,False,0.000,0,,2024.0,Drama; History,Samata Films; Abhita Films Production Pvt. Ltd.; Sound Ideaz Academy; Afterplay Studio,,IN,India,mr,/jzQPp0oSYPgu8crP3aq8EIjQ6Ti.jpg,False,/C28T7GAlCJQFWVDB4vUV8eZyf9.jpg,,Sound Ideaz Academy
93057,False,0,1184693,tt9680562,mr,सत्यशोधक,The film follows the life of social reformer and writer Mahatma Jyotiba Phule.,4.889,2024-01-05,0,145,Released,HE BRINGS THE TORCH OF REVOLUTION TO ENLIGHTEN HIS PEOPLE,Satyashodhak,False,0.000,0,,2024.0,Drama; History,Samata Films; Abhita Films Production Pvt. Ltd.; Sound Ideaz Academy; Afterplay Studio,,IN,India,mr,/jzQPp0oSYPgu8crP3aq8EIjQ6Ti.jpg,False,/C28T7GAlCJQFWVDB4vUV8eZyf9.jpg,,Afterplay Studio
93058,False,15000000,1072342,tt9682428,en,Night Swim,"Forced into early retirement by a degenerative illness, former baseball player Ray Waller moves into a new house with his wife and two children. He hopes that the backyard swimming pool will be fun for the kids and provide physical therapy for himself. However, a dark secret from the home's past...",168.637,2024-01-03,30600000,98,Released,Everything you fear is under the surface.,Night Swim,False,5.095,41,PG-13,2024.0,Horror; Thriller,Atomic Monster; Blumhouse Productions,,US,United States of America,en,/gSkfBGdxdialBMM7P02V4hcI6Ij.jpg,False,/klb3yODwKMRle19EcvdeSatEZL7.jpg,https://www.nightswim.movie,Atomic Monster


In [37]:
## save unique companies
unique_companies = sorted(df_explode['production_company_list'].dropna().unique())
len(unique_companies)

58037

In [38]:
## create integer ids for companies
company_map = dict(zip(unique_companies, range(len(unique_companies))))
company_map

{' "DIA" Productions GmbH & Co. KG': 0,
 ' "Weathering With You" Film Partners': 1,
 ' #littlesecretfilm': 2,
 ' (Colossal) Pictures': 3,
 ' (주)유비컬쳐': 4,
 ' (주)유비콘텐츠': 5,
 ' +Magenta': 6,
 ' .40 Caliber': 7,
 ' .Mille et Une. Films.': 8,
 ' .hack Conglomerate': 9,
 ' 010 Films': 10,
 ' 011 Productions': 11,
 ' 1 Production Film': 12,
 ' 1 Production Film Company': 13,
 ' 1-2-3 Production': 14,
 ' 1.85 Films': 15,
 ' 10 Cates Pictures': 16,
 ' 10 East Pictures': 17,
 ' 10 West Studios': 18,
 ' 10.80 Films': 19,
 ' 10/09 Films': 20,
 ' 100': 21,
 ' 100 Bares': 22,
 ' 100 Chapters': 23,
 ' 100 Feet Productions': 24,
 ' 100 doo': 25,
 ' 100% Halal': 26,
 ' 100% Synthetic Films': 27,
 ' 1000 Volt': 28,
 ' 101 Films': 29,
 ' 101 Films International': 30,
 ' 101 Studios': 31,
 ' 1011 Film': 32,
 ' 1014 Pictures': 33,
 ' 1019 Entertainment': 34,
 ' 1019 Pictures': 35,
 ' 102 Distribution': 36,
 ' 104 Films': 37,
 ' 108 Media': 38,
 ' 1091': 39,
 ' 1091 Media': 40,
 ' 1091 Pictures': 41,
 ' 10t

In [39]:
## Select the id and company columns to make the joiner table
## (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
df_title_companies = df_explode[['imdb_id','production_company_list']].copy()
df_title_companies

Unnamed: 0,imdb_id,production_company_list
0,tt0113026,Sullivan Street Productions
0,tt0113026,Michael Ritchie Productions
0,tt0113026,"Radmin Company, The"
0,tt0113026,United Artists
1,tt0113092,Dimension Films
...,...,...
93057,tt9680562,Abhita Films Production Pvt. Ltd.
93057,tt9680562,Sound Ideaz Academy
93057,tt9680562,Afterplay Studio
93058,tt9682428,Atomic Monster


In [40]:
## add integer ids to table
df_title_companies['company_id'] = df_title_companies['production_company_list'].map(company_map)
df_title_companies

Unnamed: 0,imdb_id,production_company_list,company_id
0,tt0113026,Sullivan Street Productions,53370.0
0,tt0113026,Michael Ritchie Productions,16176.0
0,tt0113026,"Radmin Company, The",20260.0
0,tt0113026,United Artists,26222.0
1,tt0113092,Dimension Films,35941.0
...,...,...,...
93057,tt9680562,Abhita Films Production Pvt. Ltd.,738.0
93057,tt9680562,Sound Ideaz Academy,23117.0
93057,tt9680562,Afterplay Studio,924.0
93058,tt9682428,Atomic Monster,30877.0


In [41]:
df_company = pd.DataFrame({'company_id':company_map.values(), 'company_name':company_map.keys()})
df_company

Unnamed: 0,company_id,company_name
0,0,"""DIA"" Productions GmbH & Co. KG"
1,1,"""Weathering With You"" Film Partners"
2,2,#littlesecretfilm
3,3,(Colossal) Pictures
4,4,(주)유비컬쳐
...,...,...
58032,58032,오피오티픽쳐스
58033,58033,유비네트워크
58034,58034,코라필름
58035,58035,플랫폼픽쳐스


In [42]:
# df

## ⭐️Clean & Normalize ~~Belongs to~~ collection

In [43]:
#looking for non-null entries to use as test
df[~df['collection_name'].isna()]

Unnamed: 0,belongs_to_collection,budget,id,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,certification,Release Year,genre_names,production_companies_names,collection_name,production_countries_iso,production_countries_name,spoken_language_iso,poster_path,adult,backdrop_path,homepage,production_company_list
5,True,15000000,16225,tt0119273,en,Heavy Metal 2000,"Upon discovery of a shard of what could be the Loc-Nar, a miner named Tyler becomes possessed with an insatiable hunger for power and a thirst for immortality. On his way to the planet of youth, Tyler wipes out most of a space colony and kidnaps a beautiful young woman. His only mistake is that ...",13.224,2000-04-19,0,88,Released,not to survive the fight in the external world without inner strength,Heavy Metal 2000,False,6.077,196,R,2000.0,Action; Adventure; Animation; Fantasy; Science Fiction,Helkon Media; Das Werk Productions; Columbia Pictures; CinéGroupe,Heavy Metal Collection,CA; DE,Canada; Germany,en,/mzOgKYOXiuwE5DIfagFs34bOb3J.jpg,False,/f5C03doOWiauu37bToKXtpgP5bS.jpg,,"[Helkon Media, Das Werk Productions, Columbia Pictures, CinéGroupe]"
14,True,45000000,7443,tt0120630,en,Chicken Run,"The creators of Wallace & Gromit bring you an exciting and original story about a group of chickens determined to fly the coop–even if they can’t fly! It’s hardly poultry in motion when Rocky attempts to teach Ginger and her feathered friends to fly…but, with teamwork, determination and a little...",91.080,2000-06-23,224834564,84,Released,There's nothing more determined than poultry with a plan.,Chicken Run,False,6.800,4704,G,2000.0,Animation; Comedy; Family,Aardman; Pathé; DreamWorks Pictures,Chicken Run Collection,GB; US; FR,United Kingdom; United States of America; France,en,/oYbVT9e0k2ZSrRhDSCw2Yqshe1n.jpg,False,/t7aJ7hqsMLEoqBJk7yAXxcAVvcz.jpg,,"[Aardman, Pathé, DreamWorks Pictures]"
18,True,125000000,955,tt0120755,en,Mission: Impossible II,"With computer genius Luther Stickell at his side and a beautiful thief on his mind, agent Ethan Hunt races across Australia and Spain to stop a former IMF agent from unleashing a genetically engineered biological weapon called Chimera. This mission, should Hunt choose to accept it, plunges him i...",43.305,2000-05-24,546400000,123,Released,Expect the impossible again.,Mission: Impossible II,False,6.123,6299,,2000.0,Adventure; Action; Thriller,Paramount; Cruise/Wagner Productions,Mission: Impossible Collection,US,United States of America,en,/1VMWLpk9VXyYcEZ8w3uUhp0OF1v.jpg,False,/24DZfupDlhXeTchmcOkoGRhP5Vg.jpg,,"[Paramount, Cruise/Wagner Productions]"
19,True,75000000,36657,tt0120903,en,X-Men,"Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers.",2.067,2000-07-13,296339527,104,Released,Trust a few. Fear the rest.,X-Men,False,6.996,10708,PG-13,2000.0,Adventure; Action; Science Fiction,Bad Hat Harry Productions; 20th Century Fox; Marvel Entertainment Group,X-Men Collection,US,United States of America,en,/bRDAc4GogyS9ci3ow7UnInOcriN.jpg,False,/yGxMIIDSUvaXUjjJuIYZi40Jt3J.jpg,https://www.20thcenturystudios.com/movies/x-men,"[Bad Hat Harry Productions, 20th Century Fox, Marvel Entertainment Group]"
21,True,100000000,11688,tt0120917,en,The Emperor's New Groove,"Kuzco is a self-centered emperor who summons Pacha from a village and to tell him that his home will be destroyed to make room for Kuzco's new summer home. Kuzco's advisor, Yzma, tries to poison Kuzco and accidentally turns him into a llama, who accidentally ends up in Pacha's village. Pacha off...",102.103,2000-12-15,169327687,78,Released,It's all about... ME!,The Emperor's New Groove,False,7.549,6253,G,2000.0,Adventure; Animation; Comedy; Family; Fantasy,Walt Disney Pictures; Walt Disney Feature Animation,The Emperor's New Groove Collection,US,United States of America,en,/wwbgkXQBEKtnyIJapk6gUgWkVw8.jpg,False,/gGxXXbiaivFvuMhBRmA14tuJiRe.jpg,,"[Walt Disney Pictures, Walt Disney Feature Animation]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92926,True,0,1096342,tt26546123,en,Beautiful Wedding,"In the aftermath of Beautiful Disaster, Abby and Travis wake after a crazy night in Vegas as accidental newlyweds! With the mob on their heels, they flee to Mexico for a wild, weird honeymoon—but are they in for another disaster?",16.253,2024-01-24,0,100,Post Production,,Beautiful Wedding,False,0.000,0,R,2024.0,Comedy; Drama; Romance,Voltage Pictures; Wattpad,Beautiful Disaster Collection,US,United States of America,en,/yA9CO79V0lr9IWEmqQdp2ozvYQE.jpg,False,/ik7weMW5cize6A1ApJ9dXHG6QQn.jpg,,"[Voltage Pictures, Wattpad]"
92932,True,0,1214314,tt27110516,en,One More Shot,"Following the attack on the black site in Poland, Navy SEAL Jake Harris is ordered to escort terrorist suspect Amin Mansur to Washington D.C. for interrogation. Before the prisoner transfer process is complete, though, the airport is attacked by a group of heavily armed, well-trained mercenaries.",147.998,2024-01-12,0,102,Released,,One More Shot,False,6.987,38,R,2024.0,Action; Thriller,Signature Films; Signature Entertainment; Lipsync Productions; Sky Cinema; Richmond Pictures,One Shot Collection,GB; US,United Kingdom; United States of America,en,/nQ1BQg4yMdlYSHvHZgwladzy7EF.jpg,False,/ay0PJQZizDXk0pzhoGX4v7K9h7A.jpg,,"[Signature Films, Signature Entertainment, Lipsync Productions, Sky Cinema, Richmond Pictures]"
92951,True,0,1139566,tt28105944,es,A través de tu mirada,Raquel and Ares face the challenges that arise in this last chapter of their relationship and their individual lives.,47.383,2024-02-23,0,105,Post Production,,Through My Window: Looking at You,False,0.000,0,,2024.0,Romance; Drama; Comedy,Nostromo Pictures,Through My Window Collection,ES,Spain,es,/yIR8mobHcGaIIFvhTAONfQBy0nB.jpg,False,/lwcTBhlfLOUFjg8elSBDmFK5W0h.jpg,https://www.netflix.com/title/81484069,[Nostromo Pictures]
92968,True,0,1155089,tt29195117,en,Justice League: Crisis on Infinite Earths Part One,"Death is coming. Worse than death: oblivion. Not just for our Earth, but for everyone, everywhere, in every universe! Against this ultimate destruction, the mysterious Monitor has gathered the greatest team of Super Heroes ever assembled. But what can the combined might of Superman, Wonder Woman...",400.226,2024-01-09,0,93,Released,"Worlds will live, worlds will die and nothing will ever be the same.",Justice League: Crisis on Infinite Earths Part One,False,7.744,78,PG-13,2024.0,Animation; Science Fiction; Action,DC Entertainment; Warner Bros. Animation,Justice League (Tomorrowverse) Collection,US,United States of America,en,/zR6C66EDklgTPLHRSmmMt5878MR.jpg,False,/rVJfabCz1ViynQCEz54MRqdZig1.jpg,https://www.warnerbros.com/movies/justice-league-crisis-infinite-earths-part-one,"[DC Entertainment, Warner Bros. Animation]"


In [44]:
# # slicing out a test entry from collections
# test = df.iloc[19]['belongs_to_collection']
# test

In [45]:
# # slicing out a test entry from collections
# test2 = df.loc[16]['belongs_to_collection']
# test2

> Unlike production companies, this column holds a single dictionary rather than a list of dictionaries, since a movie can belong to only one collection.
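
Both shapes (a single dictionary or a list of dictionaries) can be handled by one small helper. The sketch below is an alternative to the string-replacement approach in the commented-out cells: it assumes the raw TMDB value is a stringified Python dict or list and parses it with `ast.literal_eval`. The name `extract_collection_name` and the sample value (including its id) are hypothetical and do not come from this notebook.

```python
import ast

def extract_collection_name(raw):
    """Return the collection name(s) from a stringified dict/list, else None."""
    if not isinstance(raw, str):
        return None  # NaN, booleans, and other non-string values
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None  # not a valid Python literal
    if isinstance(parsed, dict):
        return parsed.get('name')
    if isinstance(parsed, list):
        return [item.get('name') for item in parsed if isinstance(item, dict)]
    return None

# Made-up example value in the assumed raw format
single = "{'id': 1, 'name': \"The Emperor's New Groove Collection\", 'poster_path': None}"
print(extract_collection_name(single))
```

Because `literal_eval` understands single quotes, apostrophes inside double-quoted names, and `None`, no manual quote rewriting should be needed under these assumptions.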


In [46]:
# df[df['belongs_to_collection'].fillna("").str.contains('N\\ ')]

In [47]:
# ## Function to get just the collection name(s)
# import json

# def get_collection(x):
#     try:
#         x = x.replace('None','"None"')
#         x = x.replace('\\','')

#         x = x.replace("'s",'s')
#         for key in ['id','poster_path','name','backdrop_path']:
#             x = x.replace(f"'{key}'",f'"{key}"')

#         x = x.replace("'",'"')

#         x = json.loads(x)
#     except:
#         return None
    
#     if isinstance(x,dict):
#         collections= x['name']
        
#     elif isinstance(x,list):
#         collections = []
#         for collection in x:
#             collections.append(collection['name'])
#     else:
#         collections = "MISSING"
#     return collections


In [48]:
# ## ADDED 01/10/23 FOR CONFIRMATION ON TIPS TO ADD TO INSTRUCTIONS
# def test_collection(x):
#     try:
#         x = x.replace('None','"None"')
#         x = x.replace('\\','')

#         x = x.replace("'s",'s')
#         for key in ['id','poster_path','name','backdrop_path']:
#             x = x.replace(f"'{key}'",f'"{key}"')

#         x = x.replace("'",'"')

#         x = json.loads(x)
#     except:
#         return "ERROR"
    
#     if isinstance(x,dict):
# #         collections= x['name']
#         return "dict"
        
#     elif isinstance(x,list):
#         return "list"
# #         collections = []
# #         for collection in x:
# #             collections.append(collection['name'])
#     else:
# #         collections = 
#         return "MISSING"


In [49]:
# # test function
# test_collection(test2)

In [50]:
# df['collection_dtype'] = df['belongs_to_collection'].map(test_collection)
# df['collection_dtype'].value_counts(dropna=False, normalize=True)

In [51]:
## Use the pre-extracted collection_name column (get_collection above is no longer applied)
df['collection'] = df['collection_name']  # previously: df['belongs_to_collection'].apply(get_collection)

In [52]:
# check % null values in new column
df['collection'].isna().sum() / len(df)

0.9439280456484596

In [53]:
# compare % null values from original column (now a True/False flag, so it has no nulls)
df['belongs_to_collection'].isna().sum() / len(df)

0.0

In [54]:
df['collection'].value_counts()

Exhibition on Screen Collection             23
Detective Conan Collection                  23
Doraemon Collection                         18
Super Sentai Collection                     15
Troublesome Night Collection                15
                                            ..
The Wailer Collection                        1
The Corpse Grinders Collection               1
Spicy Beauty Queen of Bangkok Collection     1
The Lion Roars Collection                    1
PVCU                                         1
Name: collection, Length: 2567, dtype: int64

### Normalizing/Separating Collections

> Explode not needed since column does not have lists.
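
Since each movie has at most one collection, the normalization reduces to two steps: build a name-to-id lookup, then map each non-null title onto it. A minimal plain-Python sketch of that idea follows; the rows are toy values (one `imdb_id` is made up) and the variable names are illustrative only.

```python
# Toy (imdb_id, collection) pairs; None marks a movie with no collection.
rows = [
    ('tt0120630', 'Chicken Run Collection'),
    ('tt0113026', None),
    ('tt0120903', 'X-Men Collection'),
    ('tt0000001', 'Chicken Run Collection'),  # made-up id for illustration
]

# Lookup table: one integer id per unique collection name
unique_names = sorted({name for _, name in rows if name is not None})
collection_map = {name: idx for idx, name in enumerate(unique_names)}

# Joiner table: one row per title that belongs to a collection
title_collection = [(imdb_id, collection_map[name])
                    for imdb_id, name in rows if name is not None]

# Lookup rows in (collection_id, collection_name) order
collections = [(idx, name) for name, idx in collection_map.items()]

print(title_collection)
print(collections)
```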

In [55]:
df['collection']

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
93054    NaN
93055    NaN
93056    NaN
93057    NaN
93058    NaN
Name: collection, Length: 93059, dtype: object

In [56]:
df['collection'].value_counts()

Exhibition on Screen Collection             23
Detective Conan Collection                  23
Doraemon Collection                         18
Super Sentai Collection                     15
Troublesome Night Collection                15
                                            ..
The Wailer Collection                        1
The Corpse Grinders Collection               1
Spicy Beauty Queen of Bangkok Collection     1
The Lion Roars Collection                    1
PVCU                                         1
Name: collection, Length: 2567, dtype: int64

In [57]:
## save unique collections
unique_collections = sorted(df['collection'].dropna().unique())
len(unique_collections)

2567

In [58]:
## create an integer index map for collection name
collection_map = dict(zip(unique_collections, range(len(unique_collections))))
collection_map

{'#TemanTapiMenikah': 0,
 '... Cameras Collection': 1,
 '..... Is Gone': 2,
 '.hack Collection': 3,
 '10 Days of a Man Collection': 4,
 '10/31 Collection': 5,
 '100 Candles Game Collection': 6,
 '100 Girls/Women Collection': 7,
 '100% Coco Collection': 8,
 '101 Dalmatians (Live-Action) Collection': 9,
 '12 Dogs of Christmas Collection': 10,
 '12 Rounds Collection': 11,
 '14+ (Коллекция)': 12,
 '18+ - Collection': 13,
 '1920 Collection': 14,
 '1981 Collection': 15,
 '2 Days In... Collection': 16,
 '2001 Maniacs Collection': 17,
 '20th Century Boys Collection': 18,
 '28 Days/Weeks Later Collection': 19,
 '3 A.M. Collection': 20,
 '3 Braves Universe': 21,
 '3 Little Pigs collection': 22,
 '3 Way Collection': 23,
 '30 Days of Night Collection': 24,
 '300 Collection': 25,
 '365 Days Collection': 26,
 '4 Horror Tales Collection': 27,
 '4 Kings': 28,
 '4-2-4 Collection': 29,
 '47 Meters Down Collection': 30,
 '47 Ronin Collection': 31,
 '50 Foot Cam Girl': 32,
 '60 Seconds to Die Collection':

In [59]:
df.head()

Unnamed: 0,belongs_to_collection,budget,id,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,certification,Release Year,genre_names,production_companies_names,collection_name,production_countries_iso,production_countries_name,spoken_language_iso,poster_path,adult,backdrop_path,homepage,production_company_list,collection
0,False,10000000,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way through a forbidden romance and a dangerous travelling carnival.,2.559,2000-09-22,0,86,Released,Try to remember the first time magic happened,The Fantasticks,False,5.5,22,,2000.0,Comedy; Music; Romance,"Sullivan Street Productions; Michael Ritchie Productions; Radmin Company, The; United Artists",,US,United States of America,en,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,"[Sullivan Street Productions, Michael Ritchie Productions, Radmin Company, The, United Artists]",
1,False,0,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two colonies of humans have been sent to a far away world. The colonists create a new civilization but have inherited the worst traits of their forebears and war breaks out leaving the new cities in ruins. It falls to a bunch of teens to bring peace to the...,3.393,2000-11-15,0,100,Released,The ultimate showdown on a forbidden planet.,For the Cause,False,4.958,12,,2000.0,Science Fiction,Dimension Films; Grand Design Entertainment; Martien Holdings A.V.V.; Nu Image,,US,United States of America,en,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,False,,,"[Dimension Films, Grand Design Entertainment, Martien Holdings A.V.V., Nu Image]",
2,False,0,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends jeopardise their personal lives with point of no return.",2.748,2000-04-14,0,165,Released,,Gang,False,5.0,2,,2000.0,Drama; Action; Crime,,,IN,India,hi,/dYcuiiBDpPUvCcPbiWdH4REjGn3.jpg,False,/krEZg9tb6blhc7sV6Us2ZGQ0gA.jpg,,,
3,False,150000,843,tt0118694,cn,花樣年華,Two neighbors become intimate after discovering that their spouses are having an affair with one another.,27.94,2000-09-29,14204632,99,Released,"Feel the heat, keep the feeling burning, let the sensation explode.",In the Mood for Love,False,8.109,2433,PG,2000.0,Drama; Romance,Block 2 Pictures; Orly Films; Jet Tone Films; Paradis Films; Fortissimo Films,,FR; HK; NL; CN,France; Hong Kong; Netherlands; China,cn; fr; es,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,"[Block 2 Pictures, Orly Films, Jet Tone Films, Paradis Films, Fortissimo Films]",
4,False,0,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Greenwich Village writer, is fired from his job as restaurant doorman, he calls on friend and mentor Jake, ostensibly to collect a long-standing debt.",5.495,2000-09-02,0,99,Released,There's a fine line between friendship and betrayal.,Chinese Coffee,False,6.6,56,R,2000.0,Drama,Chal Productions; The Shooting Gallery,,US,United States of America,en,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,False,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,"[Chal Productions, The Shooting Gallery]",


#### Make title-collection table

In [60]:
df_title_collection = df[['imdb_id','collection']]
df_title_collection

Unnamed: 0,imdb_id,collection
0,tt0113026,
1,tt0113092,
2,tt0116391,
3,tt0118694,
4,tt0118852,
...,...,...
93054,tt8736506,
93055,tt9214772,
93056,tt9357860,
93057,tt9680562,


In [61]:
df_title_collection.isna().sum()

imdb_id           0
collection    87841
dtype: int64

> We do not want to store null values in this table in our database, so drop the null values.

In [62]:
## Drop null values
df_title_collection = df_title_collection.dropna()
df_title_collection.isna().sum()

imdb_id       0
collection    0
dtype: int64

In [63]:
# Add the integer ids for collections (assign returns a new DataFrame, avoiding SettingWithCopyWarning)
df_title_collection = df_title_collection.assign(
    collection_id=df_title_collection['collection'].map(collection_map))
df_title_collection


Unnamed: 0,imdb_id,collection,collection_id
5,tt0119273,Heavy Metal Collection,853
14,tt0120630,Chicken Run Collection,392
18,tt0120755,Mission: Impossible Collection,1284
19,tt0120903,X-Men Collection,2473
21,tt0120917,The Emperor's New Groove Collection,2023
...,...,...,...
92926,tt26546123,Beautiful Disaster Collection,238
92932,tt27110516,One Shot Collection,1422
92951,tt28105944,Through My Window Collection,2274
92968,tt29195117,Justice League (Tomorrowverse) Collection,1013


In [64]:
## Saving the collection map dictionary as separate table
df_collection = pd.DataFrame({'collection_id':collection_map.values(), 
                              'collection_name':collection_map.keys()})
df_collection

Unnamed: 0,collection_id,collection_name
0,0,#TemanTapiMenikah
1,1,... Cameras Collection
2,2,..... Is Gone
3,3,.hack Collection
4,4,10 Days of a Man Collection
...,...,...
2562,2562,黒い乙女
2563,2563,몽정기 시리즈
2564,2564,반드시 크게 들을 것
2565,2565,식객 시리즈


> [Click here](#new2) to jump to loading the new tables into the MySQL database.

# LOAD

## Connect to Amazon RDS (MySQL)

In [65]:
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
from sqlalchemy.types import *
from urllib.parse import quote_plus


In [66]:
quote_plus("tr@ing ")

'tr%40ing+'
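
The cell above shows `quote_plus` percent-encoding characters that would otherwise be misread inside a connection URL (an `@` in a password, for example, would look like the user/host separator). A short sketch of the round trip, using a made-up password and hypothetical host and database names:

```python
from urllib.parse import quote_plus, unquote_plus

raw_password = "p@ss:w/rd"   # made-up password with URL-special characters
safe_password = quote_plus(raw_password)

# Hypothetical user, host, and database names, for illustration only
url = f"mysql+pymysql://user:{safe_password}@example-host:3306/example_db"

print(safe_password)
print(unquote_plus(safe_password) == raw_password)
```

Only the password is encoded here, matching the connection cell below; a username containing special characters would presumably need the same treatment.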

In [67]:
## Getting mysql server password
import json
with open('/Users/codingdojo/.secret/aws-personal.json') as f:
    login = json.load(f)

login.keys()

dict_keys(['username', 'password', 'port', 'host', 'database'])

In [68]:
## Build the connection string for the movies database on Amazon RDS
# connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/movies"
host = login['host']
port = login['port']
password = quote_plus(login['password'])
username = login['username']
db_name = login['database']
connect_str = f"mysql+pymysql://{username}:{password}@{host}:{port}/{db_name}"
# connect_str

In [69]:
# ## create a new movies database
# # connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/movies"
# connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/{DB_NAME}"

In [70]:
database_exists(connect_str)

True

In [71]:
## Check if database exists, if not, create it
if not database_exists(connect_str):
    print("Creating the database.")
    create_database(connect_str)
else:
    print('The database already exists.')

The database already exists.


In [72]:
## create engine
engine = create_engine(connect_str)
conn = engine.connect()

In [73]:
## Test your connection by listing existing tables (none on a first run; here an earlier run already created some)
q = """SHOW TABLES;"""
tables = pd.read_sql(q, conn)
tables

Unnamed: 0,Tables_in_movies-db
0,collections
1,genres
2,name_basics
3,name_characters
4,name_knownForTitles
5,name_professions
6,production_companies
7,reviews
8,title_basics
9,title_collection


### Saving `title_basics` to db

In [74]:
basics_schema = get_schema(basics)
basics_schema

{'tconst': String(length=11),
 'primaryTitle': String(length=209),
 'startYear': Float(),
 'runtimeMinutes': Integer()}
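
`get_schema` is defined earlier in the notebook and is not shown in this section. The output above is consistent with text columns being sized to the longest value plus one character (the same `key_len+1` rule used later for `imdb_id`), but that is an inference. The plain-Python sketch below only illustrates that sizing rule: `infer_sql_types` is a hypothetical helper that returns type names as strings rather than SQLAlchemy type objects, and it may differ from the real function.

```python
def infer_sql_types(rows):
    """Map each column name to a SQL type name based on the Python values seen."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, []).append(value)

    schema = {}
    for col, values in columns.items():
        present = [v for v in values if v is not None]
        if present and all(isinstance(v, str) for v in present):
            # size text columns to the longest value plus one character
            schema[col] = f"VARCHAR({max(len(v) for v in present) + 1})"
        elif present and all(isinstance(v, int) and not isinstance(v, bool)
                             for v in present):
            schema[col] = "INTEGER"
        else:
            # toy fallback for floats, mixed, or empty columns
            schema[col] = "FLOAT"
    return schema

# Toy rows shaped like title_basics
rows = [
    {'tconst': 'tt0113026', 'startYear': 2000.0, 'runtimeMinutes': 86},
    {'tconst': 'tt26546123', 'startYear': 2024.0, 'runtimeMinutes': 100},
]
print(infer_sql_types(rows))
```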

In [None]:
## Saving basics as a table with the schema
basics.to_sql('title_basics',conn,dtype=basics_schema,if_exists='replace',index=False)

## setting tconst as the primary key
conn.execute('ALTER TABLE title_basics ADD PRIMARY KEY (`tconst`);')

In [None]:
## query first rows 
q = """SELECT * FROM title_basics LIMIT 5"""
pd.read_sql(q,conn)

### Saving Genre Tables

#### Saving genre_lookup

In [None]:
genre_schema = get_schema(genre_lookup)
genre_schema

In [None]:
genre_lookup.head()

In [None]:
## Primary key is Genre_ID
genre_lookup.to_sql('genres',conn,index=False,  if_exists='replace')

conn.execute('ALTER TABLE genres ADD PRIMARY KEY (`Genre_ID`);')

In [None]:
q = """SELECT * FROM genres LIMIT 5"""
pd.read_sql(q,conn)

#### Saving `title_genres` table

In [None]:
## NO PRIMARY KEY - tconst has duplicate values (one row per title-genre pair)
title_genres.to_sql('title_genres',conn,index=False,
                    if_exists='replace' )

# conn.execute('ALTER TABLE title_genres ADD PRIMARY KEY (`tconst`);')

In [None]:
q = """SELECT * FROM title_genres LIMIT 5"""
pd.read_sql(q,conn)

### Saving `title_ratings`


In [None]:
# ratings_schema = {'tconst':String(key_len+1), 
#                  'averageRating':Float(),
#                  'numVotes':Integer()}#get_schema(ratings)
ratings_schema = get_schema(ratings)
ratings_schema

In [None]:
ratings.to_sql('title_ratings',conn,if_exists='replace',index=False,
              dtype=ratings_schema)
conn.execute("ALTER TABLE title_ratings ADD PRIMARY KEY (`tconst`)")

In [None]:
q = """SELECT * FROM title_ratings LIMIT 5"""
pd.read_sql(q,conn)

### Saving TMDB API Data

In [None]:
df.info()
df.head()

# ⭐️ **Saving New TMDB API Tables**
<a id='new2'></a>


### Saving Production Company Tables

- Tables to save:
    - title_companies 
    - companies

In [None]:
df_title_companies

#### `title_company`

In [None]:
table_title_company = df_title_companies.drop(columns=['production_company_list']) 
table_title_company.isna().sum()

In [None]:
table_title_company = table_title_company.dropna()
table_title_company['company_id'] = table_title_company['company_id'].astype(int)
table_title_company

In [None]:
## saving text length
key_len = table_title_company['imdb_id'].map(len).max()

## saving schema
api_data_schema = {'imdb_id':String(key_len+1)}
api_data_schema

In [None]:
# saving mvp version
## No primary key: imdb_id repeats (one row per title-company pair)
table_title_company.to_sql('title_production_company',conn, 
                           index=False,dtype=api_data_schema, 
                           if_exists='replace')

#### `companies`

In [None]:
table_companies = df_company.copy()
table_companies

In [None]:
# saving mvp version
## company_id is unique per company; api_data_schema only describes imdb_id, which this table does not have
table_companies.to_sql('production_companies',conn, index=False, if_exists='replace')


### Saving Collections Tables


- Tables to save:
    - title_collection
    - collections

#### `title_collection`

In [None]:
df_title_collection

In [None]:
table_title_collection = df_title_collection.drop(columns=['collection']) 
table_title_collection.isna().sum()

In [None]:
table_title_collection = table_title_collection.dropna()
table_title_collection['collection_id'] = table_title_collection['collection_id'].astype(int)
table_title_collection

In [None]:
# ## saving text length
# key_len = table_title_collection['imdb_id'].map(len).max()
# ## saving schema
# api_data_schema = {'imdb_id':String(key_len+1)}
# api_data_schema
schema = get_schema(table_title_collection)
schema

In [None]:
# saving mvp version
## imdb_id is unique here because each movie belongs to at most one collection
table_title_collection.to_sql('title_collection',conn, index=False,dtype=schema, if_exists='replace')


#### `collections`

In [None]:
table_collections = df_collection
table_collections

In [None]:
# saving mvp version
table_collections.to_sql('collections',conn, index=False,
                         dtype=get_schema(table_collections),
                         if_exists='replace')

## Original TMDB API Data Tables (Stack 4)

In [None]:
# df['budget']

In [None]:
## Convert financial cols to numeric; values that cannot be parsed become NaN
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')#.astype(float)
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')#.astype(float)
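
The effect of `errors='coerce'` is easiest to see on a toy column. The sketch below uses made-up values and assumes the financial columns may contain non-numeric strings; it is not taken from the actual data.

```python
import pandas as pd

# Made-up budget values: two parseable entries and one that is not
raw_budget = pd.Series(['15000000', '0', 'unknown'])

budget = pd.to_numeric(raw_budget, errors='coerce')

print(budget.tolist())      # the unparseable entry becomes NaN
print(budget.isna().sum())  # number of coerced values
```

Comparing the null count before and after the conversion is one way to tell whether any values were silently coerced.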

In [None]:
## New 01/22/24 cleaning certification
df['certification'] = df['certification'].str.strip()
df['certification'].value_counts(dropna=False)

In [None]:
df

In [None]:
## Keeping some columns
cols_to_keep = ['imdb_id','id',#New
                'budget','revenue','certification',
                'original_title', #New
                'poster_path','tagline',
                'original_language','release_date','popularity','vote_average',
               'vote_count']
mvp = df[cols_to_keep]
mvp

In [None]:
mvp.isna().sum()

In [None]:
mvp

In [None]:
## get schema
api_data_schema = get_schema(mvp)#.fillna('MISSING'))
api_data_schema

In [None]:
api_data_schema

In [None]:
mvp.duplicated(subset=['id']).sum()

In [None]:
mvp = mvp.drop_duplicates(subset=['id'])
mvp.duplicated(subset=['id']).sum()

## 👉🚨 Revisit dropping duplicates if final db is missing data
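
Before dropping, it can help to look at the duplicated groups to see what would be lost. A minimal sketch on made-up rows, assuming the same `imdb_id` and `id` column names as `mvp` (`mvp_demo` is a hypothetical stand-in):

```python
import pandas as pd

# Made-up rows: two TMDB ids share one imdb_id
mvp_demo = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000001', 'tt0000002'],
    'id': [101, 102, 103],
    'revenue': [0, 5000, 0],
})

# keep=False flags every member of a duplicate group, not just the repeats
dupes = mvp_demo[mvp_demo.duplicated(subset=['imdb_id'], keep=False)]
print(dupes)

# drop_duplicates keeps the first row of each group by default
deduped = mvp_demo.drop_duplicates(subset=['imdb_id'])
print(deduped['id'].tolist())
```

In this toy case the default keeps the row with zero revenue and discards the one with data, which is the kind of loss the heading warns about; sorting before dropping might be one way to control which row survives.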

In [None]:
mvp.duplicated(subset=['imdb_id']).sum()

In [None]:
mvp = mvp.drop_duplicates(subset=['imdb_id'])
mvp

In [None]:
mvp.info()

In [None]:
# saving mvp version
## Primary key is the TMDB id
mvp.to_sql('tmdb',conn, index=False,dtype=api_data_schema, if_exists='replace')
conn.execute('ALTER TABLE tmdb ADD PRIMARY KEY (`id`);') # New 01-2024 - changing to id

In [None]:
q = """SELECT * FROM tmdb LIMIT 5"""
pd.read_sql(q,conn)

In [None]:
df.head()

# Final Tables Check

In [None]:
## Confirm that all of the expected tables now exist
q = """SHOW TABLES;"""
pd.read_sql(q, conn)

# ⭐️ **New: Exporting IMDB crew/principals/names**
<a id='new-imdb'></a>

## Connect to MySQL

In [None]:
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
from sqlalchemy.types import *


In [None]:
# ## Getting mysql server password
# import json
# with open('/Users/codingdojo/.secret/mysql.json') as f:
#     login = json.load(f)

# login.keys()

In [None]:
# ## create a new movies database
# # connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/movies"
# connect_str = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/{DB_NAME}"

In [None]:
# ## Check if database exists, if not, create it
# if database_exists(connect_str) == False:
#     print("Creating the database.")
#     create_database(connect_str)
# else:
#     print('The database already exists.')

In [None]:
## create engine
engine = create_engine(connect_str)
conn = engine.connect()

In [None]:
## Test your connection by listing the tables that already exist
q = """SHOW TABLES;"""
tables = pd.read_sql(q, conn)
tables

## title_principals

In [None]:
title_princ = pd.read_csv('Data/title_principals.csv.gz')
title_princ

### Normalization Required:

- `table_title_principals`:
    - keep columns:
        - tconst
        - ordering
        - nconst
        - category
        - job
- `table_name_characters`:
    - make new df with copy of nconst & characters
    - explode characters column
    - **not going to bother normalizing character names**
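
The plan for the characters table can be sketched without pandas. The example assumes, as the cells below do, that each `characters` value is a JSON list stored as a string and that `\N` marks a missing value; the `nconst` ids and character names are made up.

```python
import json

# Made-up (nconst, characters) pairs in the same shape as the IMDb column
rows = [
    ('nm0000001', '["Hero","Narrator"]'),
    ('nm0000002', '\\N'),               # marker for a missing value
    ('nm0000003', '["Villain"]'),
]

name_characters = []
for nconst, raw in rows:
    if raw == '\\N':
        continue                        # skip missing entries
    for character in json.loads(raw):   # each string holds a JSON list
        name_characters.append((nconst, character))

print(name_characters)
```

The result has one row per name-character pair, which is what `explode` produces in the pandas version.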

### `table_name_chars`

In [None]:
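name_chars = title_princ[['nconst','characters']]
## drop rows where characters is the missing-value marker \N; copy so later edits do not warn
name_chars = name_chars[~name_chars['characters'].str.contains(r'\\N')].copy()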
name_chars

In [None]:
import json
name_chars['characters'] = name_chars['characters'].apply(json.loads)
name_chars

In [None]:
table_name_chars = name_chars.explode('characters')
table_name_chars

In [None]:
schema = get_schema(table_name_chars)
schema

In [None]:
# saving mvp version
## No primary key: nconst repeats (one row per name-character pair)
table_name_chars.to_sql('name_characters',conn, index=False,
                        dtype=schema, if_exists='replace')


In [None]:
## query first rows 
q = """SELECT * FROM name_characters LIMIT 5"""
pd.read_sql(q,conn)

### `table_title_principals`

In [None]:
title_princ = title_princ.drop(columns='characters')
title_princ

In [None]:
schema = get_schema(title_princ)
title_princ.to_sql('title_principals',conn, index=False,
                        dtype=schema, if_exists='replace')


In [None]:
## query first rows 
q = """SELECT * FROM title_principals LIMIT 5"""
pd.read_sql(q,conn)

## name_basics

In [None]:
names = pd.read_csv("Data/name_basics.csv.gz")
names

### Normalization Required:
-  Make `table_name_basics` using:
    - nconst
    - primaryName
    - birthYear
    - deathYear

- `table_name_professions`:
    - nconst
    - primaryProfession
        - split and explode profession
        
- **maybe?** `table_name_knownfortitles`:
    - nconst
    - knownForTitles
        - split and explode knownForTitles

In [None]:
table_name_basics = names[['nconst','primaryName','birthYear','deathYear']]
table_name_basics

###  `table_name_basics`

In [None]:
schema = get_schema(table_name_basics)
schema

In [None]:
## Saving basics as table with schema,
table_name_basics.to_sql('name_basics',conn,dtype=schema,
                         if_exists='replace',index=False)

## setting nconst as the primary key
conn.execute('ALTER TABLE name_basics ADD PRIMARY KEY (`nconst`);')

In [None]:
## query first rows 
q = """SELECT * FROM name_basics LIMIT 5"""
pd.read_sql(q,conn)

### `table_name_professions`

In [None]:
table_name_profs = names[['nconst','primaryProfession']].copy()  # copy avoids SettingWithCopyWarning
table_name_profs['primaryProfession_list'] = table_name_profs['primaryProfession'].str.split(',')
table_name_profs

In [None]:
table_name_profs = table_name_profs.explode('primaryProfession_list')[['nconst','primaryProfession_list']]
table_name_profs

In [None]:
table_name_profs = table_name_profs.rename({'primaryProfession_list':'profession'},axis=1)
table_name_profs

In [None]:
schema = get_schema(table_name_profs)
table_name_profs.to_sql('name_professions',conn, index=False,
                        dtype=schema, if_exists='replace')


In [None]:
## query first rows 
q = """SELECT * FROM name_professions LIMIT 5"""
pd.read_sql(q,conn)

### `table_name_knownfortitles`

In [None]:
name_known = names[['nconst','knownForTitles']].copy()  # copy avoids SettingWithCopyWarning
name_known['knownForTitles'] = name_known['knownForTitles'].str.split(',')
name_known

In [None]:
table_name_known = name_known.explode('knownForTitles')
table_name_known

In [None]:
schema = get_schema(table_name_known)
table_name_known.to_sql('name_knownForTitles',conn, index=False,
                        dtype=schema, if_exists='replace')


In [None]:
## query first rows 
q = """SELECT * FROM name_knownForTitles LIMIT 5"""
pd.read_sql(q,conn)

## title_crew

In [None]:
title_crew = pd.read_csv('Data/title_crew.csv.gz')
title_crew

- APPROACH:
    - create a melted, then exploded, DataFrame
    - final table will have:
        - `tconst`
        - `nconst`
        - `role` (director/writer)
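
The approach above can be tried on a toy frame first. The sketch uses made-up ids and adds one optional step that is not in the cells below: dropping rows whose value is the `\N` missing marker before splitting.

```python
import pandas as pd

# Made-up crew rows in the same shape as title_crew
crew_demo = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002'],
    'directors': ['nm0000001', 'nm0000002,nm0000003'],
    'writers': ['nm0000004', '\\N'],   # '\\N' marks a missing value
})

# Wide -> long: one row per (tconst, role)
crew_long = crew_demo.melt(id_vars='tconst', var_name='role', value_name='nconst')
crew_long['role'] = crew_long['role'].replace({'directors': 'director',
                                               'writers': 'writer'})

# Optional: drop missing markers, then split the id strings and explode
crew_long = crew_long[crew_long['nconst'] != '\\N']
crew_long = crew_long.assign(nconst=crew_long['nconst'].str.split(','))
crew_long = crew_long.explode('nconst').reset_index(drop=True)

print(crew_long)
```

Each output row is a single `tconst`, `role`, `nconst` triple, matching the final table described in the list.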

In [None]:
## Melt the title_crew table using tconst as the id_vars
title_crew_melt = pd.melt(title_crew, id_vars='tconst', 
                         var_name='role', value_name='nconst')

## Replacing the original plural "directors" and "writers" column names with singular role labels
roles_to_replace = {'directors':'director', 'writers':'writer'}
title_crew_melt['role'] = title_crew_melt['role'].replace(roles_to_replace)
title_crew_melt

In [None]:
title_crew_melt['nconst'] =  title_crew_melt['nconst'].str.split(',')
title_crew_melt

In [None]:
table_title_crew = title_crew_melt.explode('nconst')
table_title_crew

In [None]:
schema = get_schema(table_title_crew)
table_title_crew.to_sql('title_crew',conn, index=False,
                        dtype=schema, if_exists='replace')


In [None]:
## query first rows 
q = """SELECT * FROM title_crew LIMIT 5"""
pd.read_sql(q,conn)

# Final Check

In [None]:
q = """SHOW TABLES"""
pd.read_sql(q,conn)

# [New 01/2024] Adding Reviews to SQL Database

In [None]:
# import glob
# review_files = sorted(glob.glob("Data-NLP/*review*.json"))
# # review_files = df['reviews_file'].dropna().to_list()
# # review_files = reviews_files_df['filename'].to_list()
# len(review_files)

In [None]:
# # Loading all reviews from json 
# loaded = []

# for file in review_files:
#     temp = pd.read_json(file)
#     loaded.append(temp)

# reviews = pd.concat(loaded)
reviews = pd.read_csv("Data-NLP/combined-tmdb-movie-reviews.csv.gz")

reviews.info()

In [None]:
reviews.duplicated().sum() 

In [None]:
reviews = reviews.drop_duplicates()
reviews

In [None]:
import missingno
missingno.matrix(reviews)

In [None]:
reviews.duplicated('review_id').sum()

In [None]:
# Drop rows with no review text and rename author_rating to rating
reviews = reviews.dropna(subset=['review_text'])
reviews = reviews.rename({'author_rating':'rating'},axis=1)
reviews

In [None]:
reviews.duplicated('review_id').sum()
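
If the count above is non-zero, one option is to keep a single row per `review_id` before saving. This is only a sketch on toy data, and it assumes `review_id` is meant to identify each review uniquely, which the notebook does not state; the values below are invented.

```python
import pandas as pd

# Toy reviews table (invented values) with one repeated review_id
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r2", "r3"],
    "review_text": ["Great", "Okay", "Okay (edited)", None],
})

# Count rows whose review_id already appeared earlier
n_dupe_ids = toy_reviews.duplicated("review_id").sum()

# Drop rows without text, then keep the first row for each review_id
toy_clean = (toy_reviews
             .dropna(subset=["review_text"])
             .drop_duplicates(subset="review_id", keep="first"))

# Confirm the ID column is now unique
print(n_dupe_ids, toy_clean["review_id"].is_unique)
```

Whether to keep the first or last duplicate depends on how the reviews were collected, so `keep="first"` here is an arbitrary choice.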

In [None]:
missingno.matrix(reviews)

In [None]:
# fpath_reviews = 

In [None]:
## Prepare for MySQL

In [None]:
schema = get_schema(reviews)
schema

In [None]:
reviews.to_sql('reviews',conn, index=False,
                        dtype=schema, if_exists='replace')

# Test Database

In [None]:
q = """SHOW TABLES"""
pd.read_sql(q, conn)

In [None]:
q = """DESCRIBE reviews"""
pd.read_sql(q, conn)

In [None]:
q = """DESCRIBE title_basics"""
pd.read_sql(q, conn)

In [None]:
q = """DESCRIBE tmdb"""
pd.read_sql(q, conn)

In [None]:
# q = """SELECT * FROM title_basics tb
# JOIN tmdb ON tb.tconst """
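
One way a join test like this could look is sketched below. It runs against a throwaway in-memory SQLite database rather than the MySQL connection so that it is self-contained. The join condition `tb.tconst = tmdb.imdb_id` is an assumption based on the `imdb_id` column kept in the `tmdb` table, and the `primaryTitle` column name and all row values are hypothetical.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database standing in for the real MySQL connection
demo_conn = sqlite3.connect(":memory:")

# Toy tables (invented rows; column names are assumptions)
pd.DataFrame({"tconst": ["tt001", "tt002"],
              "primaryTitle": ["Movie A", "Movie B"]}
             ).to_sql("title_basics", demo_conn, index=False)
pd.DataFrame({"imdb_id": ["tt001"], "revenue": [100.0],
              "budget": [50.0], "certification": ["PG"]}
             ).to_sql("tmdb", demo_conn, index=False)

# Inner join: only titles that also have a row in tmdb are returned
demo_q = """
SELECT tb.tconst, tb.primaryTitle, tmdb.revenue, tmdb.budget, tmdb.certification
FROM title_basics AS tb
JOIN tmdb ON tb.tconst = tmdb.imdb_id
"""
joined = pd.read_sql(demo_q, demo_conn)
print(joined)
demo_conn.close()
```

Comparing the row count of the joined result with the row count of `tmdb` is a quick way to spot IDs that fail to match.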

In [None]:
# q =  """select """