# Specifications - Database
Your stakeholder wants you to take the data you have been cleaning and collecting in Parts 1 & 2 of the project and create a MySQL database for them.

Specifically, they want the data from the following files included in your database:

**Title Basics**
- Movie ID (tconst)
- Primary Title
- Start Year
- Runtime (in Minutes)
- Genres

**Title Ratings**
- Movie ID (tconst)
- Average Movie Rating
- Number of Votes

**The TMDB API Results (multiple files)**
- Movie ID
- Revenue
- Budget
- Certification (MPAA Rating)

You should normalize the tables as best you can before adding them to your new database.

- Note: an important exception to their request is that they would like you to keep all of the data from the TMDB API in 1 table together (even though it will not be perfectly normalized).
- You only need to keep the imdb_id, revenue, budget, and certification columns
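
As a minimal sketch of the TMDB requirement above, the snippet below trims a combined results table down to the four requested columns. The toy DataFrame, its values, and the extra `popularity` column are made up for illustration; only the four column names come from the specification. Dropping duplicate `imdb_id` rows is an assumption added here, on the grounds that the column will later serve as the primary key.

```python
import pandas as pd

# Toy stand-in for the combined TMDB API results (all values are made up).
tmdb_raw = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000001", "tt0000002"],
    "revenue": [1000.0, 1000.0, 0.0],
    "budget": [500.0, 500.0, 0.0],
    "certification": ["PG-13", "PG-13", None],
    "popularity": [1.5, 1.5, 0.2],  # illustrative extra column to be discarded
})

# Keep only the columns the stakeholder asked for.
keep_cols = ["imdb_id", "revenue", "budget", "certification"]

# Assumption: one row per movie, so imdb_id can act as the primary key later.
tmdb_data = (
    tmdb_raw[keep_cols]
    .drop_duplicates(subset="imdb_id")
    .reset_index(drop=True)
)
print(tmdb_data)
```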

**Required Transformation Steps for Title Basics**

Normalize Genre:

- Convert the single string of genres from title basics into 2 new tables.
  - title_genres, with the columns:
    - tconst
    - genre_id
  - genres, with the columns:
    - genre_id
    - genre_name

Discard unnecessary information:

- For the title basics table, drop the following columns:
  - "original_title" (we will use the primary title column instead).
  - "isAdult" ("Adult" will show up in the genres, so this is redundant information).
  - "titleType" (every row will be a movie).
  - "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
- Do not include the title_akas table in your SQL database.
  - You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.

**MySQL Database Requirements**

- Use sqlalchemy with Pandas to execute your SQL queries inside your notebook.
- Create a new database on your MySQL server and call it "movies."
- Make sure to have the following tables in your "movies" database:
  - title_basics
  - title_ratings
  - title_genres
  - genres
  - tmdb_data
- Make sure to set a Primary Key for each table that isn't a joiner table (e.g., title_genres is a joiner table).
- After creating each table, show the first 5 rows of that table using a SQL query.
- Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.

In [1]:
#imports
import json
import pandas as pd
import numpy as np
import os
import pymysql
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
pymysql.install_as_MySQLdb()

# Normalize Genre

In [2]:
basics = pd.read_csv('Data/title_basics.csv.gz')
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0043139,movie,Life of a Beijing Policeman,Wo zhe yi bei zi,0,2013,,120,"Drama,History"
2,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


## Getting a list of Unique Values

In [3]:
#create a new col with list of genres
basics['genres_split']=basics['genres'].str.split(',')

#separate genres_split to different rows
exploded_genres = basics.explode('genres_split')
exploded_genres.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance",Romance
1,tt0043139,movie,Life of a Beijing Policeman,Wo zhe yi bei zi,0,2013,,120,"Drama,History",Drama
1,tt0043139,movie,Life of a Beijing Policeman,Wo zhe yi bei zi,0,2013,,120,"Drama,History",History


In [4]:
unique_genres = sorted(exploded_genres['genres_split'].unique())
print(type(unique_genres))
unique_genres

<class 'list'>


['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']
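
The `sorted(...unique())` call above works only when every movie has a genre string. If a row has a missing `genres` value, `explode` yields a NaN, and sorting a mix of strings and NaN raises a `TypeError`. A minimal sketch of a NaN-safe variant, using a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for title basics; the None row imitates a movie with no genre listed.
toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "genres": ["Comedy,Romance", None, "Drama"],
})

# Split the comma-separated string into a list, then give each genre its own row.
toy["genres_split"] = toy["genres"].str.split(",")
toy_exploded = toy.explode("genres_split")

# Drop missing values before sorting so sorted() only ever compares strings.
toy_unique = sorted(toy_exploded["genres_split"].dropna().unique())
print(toy_unique)
```

Whether the real `title_basics` file still contains missing genres depends on the cleaning done in the earlier parts of the project, so the `dropna()` call may turn out to be a no-op.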

## Create a new title_genres table

In [5]:
title_genres = exploded_genres[['tconst', 'genres_split']].copy()
title_genres.head()

Unnamed: 0,tconst,genres_split
0,tt0035423,Comedy
0,tt0035423,Fantasy
0,tt0035423,Romance
1,tt0043139,Drama
1,tt0043139,History


## Create a genre mapper dict to replace string genres with int

In [6]:
## Making the genre mapper dictionary
genre_ints = range(len(unique_genres))
genre_map = dict(zip(unique_genres, genre_ints))
genre_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Drama': 7,
 'Family': 8,
 'Fantasy': 9,
 'Game-Show': 10,
 'History': 11,
 'Horror': 12,
 'Music': 13,
 'Musical': 14,
 'Mystery': 15,
 'News': 16,
 'Reality-TV': 17,
 'Romance': 18,
 'Sci-Fi': 19,
 'Short': 20,
 'Sport': 21,
 'Talk-Show': 22,
 'Thriller': 23,
 'War': 24,
 'Western': 25}

## Replace the string genres in title_genres with the new integer ids

In [7]:
title_genres['genre_id']=title_genres['genres_split'].map(genre_map)
title_genres = title_genres.drop(columns='genres_split')
title_genres.head()

Unnamed: 0,tconst,genre_id
0,tt0035423,5
0,tt0035423,9
0,tt0035423,18
1,tt0043139,7
1,tt0043139,11
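
Before loading a joiner table into the database, it can help to confirm that the mapping left no unmapped genres and created no duplicate (movie, genre) pairs. The check below is a sketch on a made-up toy frame; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for title_genres before the string column is dropped.
toy_title_genres = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "genres_split": ["Comedy", "Romance", "Drama"],
})

# Build the name -> integer map the same way as above: sorted names, consecutive ids.
toy_names = sorted(toy_title_genres["genres_split"].unique())
toy_map = {name: idx for idx, name in enumerate(toy_names)}
toy_title_genres["genre_id"] = toy_title_genres["genres_split"].map(toy_map)

# A NaN here would mean a genre string was missing from the map.
unmapped = int(toy_title_genres["genre_id"].isna().sum())

# A repeated (tconst, genre_id) pair would mean the same genre was listed twice for a movie.
duplicates = int(toy_title_genres.duplicated(subset=["tconst", "genre_id"]).sum())

print(f"unmapped: {unmapped}, duplicate pairs: {duplicates}")
```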


##  Convert the genre map dictionary into a dataframe

In [8]:
## Manually make df with name col from the key and id col from the value
## (column names follow the specification: genre_name and genre_id)
genres = pd.DataFrame({'genre_name': genre_map.keys(),
                       'genre_id': genre_map.values()})
genres.head()

Unnamed: 0,genre_name,genre_id
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


In [13]:
keep_these = ['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes']
basics = basics[keep_these]
basics.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001,118
1,tt0043139,Life of a Beijing Policeman,2013,120
2,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020,70
3,tt0069049,The Other Side of the Wind,2018,122
4,tt0088751,The Naked Monster,2005,100
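
MySQL cannot use a TEXT column as a primary key without a key length, so it is common to pass an explicit `dtype` mapping to `to_sql` that stores `tconst` as a fixed-maximum-length string. The sketch below derives those lengths from the data. The toy frame reuses two rows from the output above, the `+ 1` padding is an arbitrary choice, and `Integer` for the numeric columns assumes they contain no missing values.

```python
import pandas as pd
from sqlalchemy.types import Integer, String, Text

# Toy stand-in for the trimmed basics table (two rows copied from the output above).
toy_basics = pd.DataFrame({
    "tconst": ["tt0035423", "tt0043139"],
    "primaryTitle": ["Kate & Leopold", "Life of a Beijing Policeman"],
    "startYear": [2001, 2013],
    "runtimeMinutes": [118, 120],
})

# Longest value in each string column, used to size the SQL types.
key_len = int(toy_basics["tconst"].str.len().max())
title_len = int(toy_basics["primaryTitle"].str.len().max())

# String (VARCHAR) for the key column so it can become a primary key;
# Text for the free-form title; Integer assumes no missing years or runtimes.
basics_schema = {
    "tconst": String(key_len + 1),
    "primaryTitle": Text(title_len + 1),
    "startYear": Integer(),
    "runtimeMinutes": Integer(),
}
print(key_len, title_len)
```

The resulting dictionary could then be passed as `dtype=basics_schema` when calling `to_sql` on the real table.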


# Saving the MySQL tables w/ tconst as the primary key

In [14]:
#create connection string
username='root'
password='Hjtsai$$1031'
db_name='movies'  # the specification asks for a database named "movies"
connection=f'mysql+pymysql://{username}:{password}@localhost/{db_name}'

#create an engine
engine=create_engine(connection)
engine

Engine(mysql+pymysql://root:***@localhost/movies)

In [15]:
if database_exists(connection):
    print('It exists!')
else:
    create_database(connection)
    print("The database created!")

The database created!
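
Passwords containing characters such as `@` or `/` can break a connection URL unless they are percent-encoded first. The helper below is a hypothetical sketch (the function name and the example password are made up) that uses the standard library's `quote_plus` for that purpose.

```python
from urllib.parse import quote_plus


def build_connection(username, password, db_name, host="localhost"):
    """Return a mysql+pymysql URL with the password percent-encoded."""
    # quote_plus escapes characters that would otherwise be read as URL syntax.
    return f"mysql+pymysql://{username}:{quote_plus(password)}@{host}/{db_name}"


# Made-up example password containing characters that need escaping.
example_url = build_connection("root", "p@ss/word$1", "movies")
print(example_url)
```

In a shared notebook, the password could instead be read from an environment variable or a local file kept out of version control, so that it never appears in the saved output.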


In [None]:
# add tables

#read users csv
users=pd.read_csv('Data/users.csv')

print(users.head())

#add table
users.to_sql('users', engine, if_exists = 'replace')
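
The remaining requirements (a primary key on every non-joiner table, a five-row preview of each table, and a final `SHOW TABLES`) can be sketched as a handful of SQL strings. The helper names and the table-to-key mapping below are illustrative assumptions: the key columns follow the specification, with `imdb_id` assumed for `tmdb_data`. The commented lines show one way the statements might be run against the `engine` created earlier.

```python
# Table name -> primary key column. The joiner table title_genres is left out on purpose.
PRIMARY_KEYS = {
    "title_basics": "tconst",
    "title_ratings": "tconst",
    "genres": "genre_id",
    "tmdb_data": "imdb_id",  # assumed key column for the TMDB table
}


def add_primary_key_sql(table, column):
    """SQL that promotes an existing column to the table's primary key."""
    return f"ALTER TABLE `{table}` ADD PRIMARY KEY (`{column}`);"


def preview_sql(table, n=5):
    """SQL that returns the first n rows of a table."""
    return f"SELECT * FROM `{table}` LIMIT {n};"


pk_statements = [add_primary_key_sql(t, c) for t, c in PRIMARY_KEYS.items()]
for statement in pk_statements:
    print(statement)

# Possible usage once the tables exist (not executed here):
#   from sqlalchemy import text
#   with engine.begin() as conn:
#       for statement in pk_statements:
#           conn.execute(text(statement))
#   pd.read_sql(preview_sql("title_basics"), engine)
#   pd.read_sql("SHOW TABLES;", engine)
```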