# Project 3 - Part 3: MySQL

# Specifications - Database
    
Your stakeholder wants you to take the data you cleaned and collected in Parts 1 & 2 of the project and create a MySQL database from it.

Specifically, they want the data from the following files included in your database:
1. Title Basics:
    1. Movie ID (tconst)
    2. Primary Title
    3. Start Year
    4. Runtime (in Minutes)
    5. Genres
2. Title Ratings
    1. Movie ID (tconst)
    2. Average Movie Rating
    3. Number of Votes
3. The TMDB API Results (multiple files)
    1. Movie ID
    2. Revenue
    3. Budget
    4. Certification (MPAA Rating)
    
You should normalize the tables as best you can before adding them to your new database.

* Note: an important exception to their request is that they would like you to keep all of the data from the TMDB API in 1 table together (even though it will not be perfectly normalized).
* You only need to keep the imdb_id, revenue, budget, and certification columns

In [1]:
import pandas as pd
import numpy as np

pd.options.mode.chained_assignment = None  # default='warn'

## Databases

In [2]:
basics = pd.read_csv('Data/title_basics.csv.gz')
basics_copy = basics.copy() # for later use
basics.info()
basics.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86756 entries, 0 to 86755
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          86756 non-null  object 
 1   titleType       86756 non-null  object 
 2   primaryTitle    86756 non-null  object 
 3   originalTitle   86756 non-null  object 
 4   isAdult         86756 non-null  int64  
 5   startYear       86756 non-null  int64  
 6   endYear         0 non-null      float64
 7   runtimeMinutes  86756 non-null  int64  
 8   genres          86756 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 6.0+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0043139,movie,Life of a Beijing Policeman,Wo zhe yi bei zi,0,2013,,120,"Drama,History"
2,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


In [3]:
ratings = pd.read_csv('Data/title_ratings.csv.gz')
ratings.info()
ratings.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 501007 entries, 0 to 501006
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         501007 non-null  object 
 1   averageRating  501007 non-null  float64
 2   numVotes       501007 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 11.5+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1978
1,tt0000002,5.8,265
2,tt0000005,6.2,2621
3,tt0000006,5.1,182
4,tt0000007,5.4,821


In [4]:
tmdb = pd.read_csv('Data/tmdb_results_combined.csv.gz')
tmdb.info()
tmdb.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2574 entries, 0 to 2573
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                2574 non-null   object 
 1   adult                  2574 non-null   float64
 2   backdrop_path          1408 non-null   object 
 3   belongs_to_collection  208 non-null    object 
 4   budget                 2574 non-null   float64
 5   genres                 2574 non-null   object 
 6   homepage               172 non-null    object 
 7   id                     2574 non-null   float64
 8   original_language      2574 non-null   object 
 9   original_title         2574 non-null   object 
 10  overview               2524 non-null   object 
 11  popularity             2574 non-null   float64
 12  poster_path            2316 non-null   object 
 13  production_companies   2574 non-null   object 
 14  production_countries   2574 non-null   object 
 15  rele

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,"{'id': 1131062, 'name': 'Wong Kar-Wai’s Love T...",150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,14204632.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.112,2221.0,PG
4,tt0118852,0.0,/vceiGZ3uavAEHlTA7v0GjQsGVKe.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,0.0,99.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,0.0,6.8,50.0,R
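
The note in the specifications asks for only `imdb_id`, `revenue`, `budget`, and `certification` from the TMDB results. A minimal sketch of that selection follows. It uses a two-row stand-in frame with values copied from the preview above rather than the real file, and the `popularity` values are made up for illustration. The name `tmdb_data` is the table name from the database requirements.

```python
import pandas as pd

# Stand-in for the loaded `tmdb` frame: two rows from the preview, plus one extra column.
tmdb = pd.DataFrame({
    "imdb_id": ["tt0113026", "tt0118694"],
    "budget": [10000000.0, 150000.0],
    "revenue": [0.0, 14204632.0],
    "certification": [None, "PG"],
    "popularity": [1.0, 2.0],  # made-up values; only here to show a column being left out
})

# Keep just the four columns the stakeholder asked for.
keep = ["imdb_id", "revenue", "budget", "certification"]
tmdb_data = tmdb[keep].copy()
print(tmdb_data.columns.tolist())
```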


# Required Transformation Steps for Title Basics:
    
## A. Normalize Genre:
- Convert the single string of genres from title basics into 2 new tables:
1. title_genres, with the columns:
    - tconst
    - genre_id
2. genres, with the columns:
    - genre_id
    - genre_name

In [5]:
basics['genre_split'] = basics['genres'].str.split(',')
basics.sample(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre_split
36815,tt1674768,movie,An American in Texas,An American in Texas,0,2017,,97,"Crime,Drama,Music","[Crime, Drama, Music]"
18912,tt1124037,movie,Free State of Jones,Free State of Jones,0,2016,,139,"Action,Biography,Drama","[Action, Biography, Drama]"
59149,tt3563156,movie,Anjaan,Anjaan,0,2014,,166,"Action,Crime,Thriller","[Action, Crime, Thriller]"


In [6]:
exploded_genres = basics.explode('genre_split')
exploded_genres.sample(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre_split
39924,tt1825022,movie,Helpless Stones,Koroishi,0,2010,,61,Fantasy,Fantasy
68052,tt5135646,movie,'Twas the Night Before Christmas,'Twas The Night Before Christmas,0,2014,,75,"Comedy,Family",Family
46914,tt2210657,movie,Highway,Highway,0,2012,,80,Drama,Drama


In [7]:
unique_genres = sorted(exploded_genres['genre_split'].unique())
unique_genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

### Title Genre

In [8]:
title_genres = exploded_genres[['tconst', 'genre_split']].copy()
title_genres.sample(3)

Unnamed: 0,tconst,genre_split
12543,tt0847050,Music
33793,tt1545304,Romance
67604,tt5068162,Drama


In [9]:
# Making the genre mapper dictionary
genre_ints = range(len(unique_genres))
genre_map = dict(zip(unique_genres, genre_ints))
genre_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Drama': 7,
 'Family': 8,
 'Fantasy': 9,
 'Game-Show': 10,
 'History': 11,
 'Horror': 12,
 'Music': 13,
 'Musical': 14,
 'Mystery': 15,
 'News': 16,
 'Reality-TV': 17,
 'Romance': 18,
 'Sci-Fi': 19,
 'Short': 20,
 'Sport': 21,
 'Talk-Show': 22,
 'Thriller': 23,
 'War': 24,
 'Western': 25}

In [10]:
# map each genre name to its integer genre_id (the string column is dropped in the next cell)
title_genres['genre_id'] = title_genres['genre_split'].map(genre_map)
title_genres.sample(3)

Unnamed: 0,tconst,genre_split,genre_id
17357,tt10843814,Horror,12
77673,tt7197132,Drama,7
17749,tt1093369,Horror,12


In [11]:
# save copy for genre table
genres = title_genres.copy()

title_genres = title_genres.drop(columns = 'genre_split')
title_genres.sample(3)

Unnamed: 0,tconst,genre_id
29354,tt14182984,8
56055,tt3150114,4
31576,tt1485749,24


### Genres

In [12]:
genres = genres.drop(columns = 'tconst')
genres = genres.rename(columns = {'genre_split' : 'genre_name'})
genres.sample(3)

Unnamed: 0,genre_name,genre_id
72312,Adventure,2
32141,Family,8
14457,Comedy,5
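
As built above, `genres` still has one row per title-genre pair, so `genre_id` repeats and could not serve as a primary key. One way to get a lookup table with one row per genre is to build it directly from the mapper dictionary. The sketch below uses a three-genre stand-in for `genre_map`, with the same ids as the notebook:

```python
import pandas as pd

# Stand-in for the full genre_map built earlier (same ids as in the notebook).
genre_map = {"Action": 0, "Adult": 1, "Adventure": 2}

# One row per genre: the dictionary keys become names, the values become ids.
genres = pd.DataFrame({
    "genre_name": list(genre_map.keys()),
    "genre_id": list(genre_map.values()),
})
print(genres)
```

Calling `drop_duplicates()` on the existing `genres` frame should give the same set of rows, just in a different order.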


## B. Discard unnecessary information:
1. For the title basics table, drop the following columns:
    - "originalTitle" (we will use the primary title column instead).
    - "isAdult" ("Adult" will show up in the genres so this is redundant information).
    - "titleType" (every row will be a movie).
    - "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
2. Do not include the title_akas table in your SQL database.
    - You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.

<details>
    <summary> <b> Click "here" for Specification Instructions </b> </summary>


Specifically, they want the data from the following files included in your database:
1. Title Basics:
    1. Movie ID (tconst)
    2. Primary Title
    3. Start Year
    4. Runtime (in Minutes)
    5. Genres
2. Title Ratings
    1. Movie ID (tconst)
    2. Average Movie Rating
    3. Number of Votes
3. The TMDB API Results (multiple files)
    1. Movie ID
    2. Revenue
    3. Budget
    4. Certification (MPAA Rating)
</details>






### Title_Basics

In [14]:
basics_copy

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0043139,movie,Life of a Beijing Policeman,Wo zhe yi bei zi,0,2013,,120,"Drama,History"
2,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
...,...,...,...,...,...,...,...,...,...
86751,tt9914942,movie,Life Without Sara Amat,La vida sense la Sara Amat,0,2019,,74,Drama
86752,tt9915872,movie,The Last White Witch,My Girlfriend is a Wizard,0,2019,,97,"Comedy,Drama,Fantasy"
86753,tt9916170,movie,The Rehearsal,O Ensaio,0,2019,,51,Drama
86754,tt9916190,movie,Safeguard,Safeguard,0,2020,,95,"Action,Adventure,Thriller"
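
`basics_copy` still carries every original column. Step B's drops could look like the sketch below, shown on a one-row stand-in frame copied from the preview. Dropping `endYear` is an assumption: step B does not name it, but the column is entirely null and is not among the fields the stakeholder listed. `title_basics` is the table name from the database requirements.

```python
import pandas as pd

# One-row stand-in for basics_copy, copied from the preview above.
basics_copy = pd.DataFrame({
    "tconst": ["tt0035423"],
    "titleType": ["movie"],
    "primaryTitle": ["Kate & Leopold"],
    "originalTitle": ["Kate & Leopold"],
    "isAdult": [0],
    "startYear": [2001],
    "endYear": [float("nan")],
    "runtimeMinutes": [118],
    "genres": ["Comedy,Fantasy,Romance"],
})

# Columns named in step B, plus endYear (assumption: entirely null and
# not among the columns the stakeholder asked for).
drop_cols = ["originalTitle", "isAdult", "titleType", "genres", "endYear"]
title_basics = basics_copy.drop(columns=drop_cols)
print(title_basics.columns.tolist())
```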


    
    
    
# MySQL Database Requirements
    
1. Use sqlalchemy with Pandas to execute your SQL queries inside your notebook.

2. Create a new database on your MySQL server and call it "movies."

3. Make sure to have the following tables in your "movies" database:

    - title_basics
    - title_ratings
    - title_genres
    - genres
    - tmdb_data
    
4. Make sure to set a Primary Key for each table that isn't a joiner table (e.g., title_genres is a joiner table).

5. After creating each table, show the first 5 rows of that table using a SQL query.

6. Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.
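
The requirements above could be approached roughly as in the sketch below. Everything specific in it is an assumption rather than part of the assignment: the PyMySQL driver, the `localhost` host, the placeholder credentials, and the helper names `mysql_url`, `key_length`, `connect`, and `load_table`. The SQLAlchemy imports sit inside the functions that need them, so the pure helpers can be tried without a running server.

```python
from urllib.parse import quote

import pandas as pd


def mysql_url(user, password, host="localhost", database="movies"):
    """Build a SQLAlchemy URL for the PyMySQL driver (driver choice is an assumption)."""
    # Percent-encode the password so characters such as "@" or "/" do not break the URL.
    return f"mysql+pymysql://{user}:{quote(password, safe='')}@{host}/{database}"


def key_length(df, column):
    """Length of the longest value in a text column, used to size a VARCHAR key."""
    return int(df[column].astype(str).str.len().max())


def connect(user, password):
    """Create the "movies" database if it is missing and return an engine bound to it."""
    from sqlalchemy import create_engine, text

    server = create_engine(mysql_url(user, password, database=""))
    with server.begin() as conn:
        conn.execute(text("CREATE DATABASE IF NOT EXISTS movies;"))
    return create_engine(mysql_url(user, password))


def load_table(df, name, engine, key=None):
    """Write df as table `name`, optionally set a primary key, and return the first 5 rows."""
    from sqlalchemy import text
    from sqlalchemy.types import String

    dtype = None
    if key is not None and not pd.api.types.is_numeric_dtype(df[key]):
        # MySQL needs a bounded VARCHAR (not TEXT) for a primary-key column.
        dtype = {key: String(key_length(df, key))}
    df.to_sql(name, engine, dtype=dtype, if_exists="replace", index=False)
    if key is not None:
        with engine.begin() as conn:
            conn.execute(text(f"ALTER TABLE `{name}` ADD PRIMARY KEY (`{key}`);"))
    return pd.read_sql(f"SELECT * FROM `{name}` LIMIT 5;", engine)


# Usage against a live server (credentials are placeholders):
# engine = connect("root", "your-password")
# load_table(title_basics, "title_basics", engine, key="tconst")
# load_table(title_genres, "title_genres", engine)  # joiner table: no primary key
# pd.read_sql("SHOW TABLES;", engine)
```

Setting the key with `ALTER TABLE` after `to_sql` is one option among several. It relies on the key column being unique, which is why the `genres` table needs one row per genre before it is loaded.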