# Project 2: ETL Challenge

This project was completed by Kyra, Kyna and Cindy.

This program builds a database of Netflix movies and TV shows that were nominated for (and possibly won) Golden Globe awards (1944 - 2020). It reads two CSV files (Netflix movies and TV shows as of 2019, and Golden Globe nominations and winners), merges them on title, and loads the result into a PostgreSQL database.

These were the steps taken:

Extract:

1. netflix_titles.csv - source www.kaggle.com (6234 rows) 
2. golden_globe_awards.csv - source www.kaggle.com (7992 rows) 

Transform:

1. Removed all unnecessary columns from both data sources
2. Used .isnull() to find the rows that had the title in the wrong column, then fixed them
3. Renamed some column headings
4. Dropped rows with missing values (NaN)
5. Converted the title to upper case in both files so that they could be merged on it
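
The transform steps above can be sketched on a tiny in-memory example. The column names `film`, `nominee`, and `ceremony`, the choice of sample rows, and the rule used to repair a misplaced title are assumptions for illustration only; the actual cleaning code is not part of this notebook.

```python
import pandas as pd

# Hypothetical raw Golden Globes rows. `film` is sometimes empty because the
# title was recorded under `nominee` instead (assumed layout, not confirmed).
raw = pd.DataFrame({
    "year_film": [1943, 1943, 2018],
    "ceremony": [1, 1, 76],
    "nominee": ["Katina Paxinou", "The Song Of Bernadette", "Black Panther"],
    "film": ["For Whom the Bell Tolls", None, None],
    "win": [True, True, False],
})

# 1. Drop columns that are not needed downstream.
df = raw.drop(columns=["ceremony"])

# 2. Where the title column is null, take the title from `nominee`.
missing_title = df["film"].isnull()
df.loc[missing_title, "film"] = df.loc[missing_title, "nominee"]

# 3. Rename headings to the names used in the database.
df = df.rename(columns={"film": "title"})

# 4. Drop any rows that still have missing values.
df = df.dropna()

# 5. Upper-case the title so both files can be merged on it.
df["title"] = df["title"].str.upper()

print(df["title"].tolist())
```

Upper-casing only removes differences in letter case; titles that differ in punctuation or spacing would still fail to match.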

Load: We loaded the data into a PostgreSQL database.

1. www.quickdatabasediagrams.com was used to create our ERD. We created a foreign key in the Golden Globes table to link to the Netflix table.  See QuickDBD-export.png for our database schema.

2. The Quick DBD export function was used to generate the SQL for creating our tables. See QuickDBD-export.sql for our table create statements.

3. pgAdmin was used to create our PostgreSQL database. We chose a relational database because our data is static and already structured into rows and columns.

4.  Some ideas for analysis are:

 - Which Golden Globe nominations are on Netflix? 
 - Which Golden Globe winners are on Netflix? 
 - How many Netflix shows are Golden Globe nominees or winners? 
 - What are the ratings for the nominees and winners?
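
As a rough sketch of how these questions could be answered, the example below runs them as SQL joins on the foreign key. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server. The table and column names follow the notebook below, but the `CREATE TABLE` statements are an assumption and are not the contents of QuickDBD-export.sql. The sample rows are taken from the notebook output, where every shown nomination has `win` set to False, so the winners query returns nothing for this sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE netflix (
    netflix_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    rating     TEXT
);
CREATE TABLE golden_globe (
    gg_id      INTEGER PRIMARY KEY,
    netflix_id INTEGER NOT NULL REFERENCES netflix (netflix_id),
    category   TEXT,
    win        INTEGER  -- 1 = won, 0 = nominated only (BOOLEAN in PostgreSQL)
);
""")
conn.executemany("INSERT INTO netflix VALUES (?, ?, ?)", [
    (80201906, "BLACK PANTHER", "PG-13"),
    (60031232, "MYSTIC RIVER", "R"),
    (70304990, "GOOD PEOPLE", "R"),  # on Netflix, no nomination in the sample
])
conn.executemany("INSERT INTO golden_globe VALUES (?, ?, ?, ?)", [
    (7741, 80201906, "Best Motion Picture - Drama", 0),
    (7800, 80201906, "Best Original Score - Motion Picture", 0),
    (7803, 80201906, "Best Original Song - Motion Picture", 0),
    (5852, 60031232, "Best Motion Picture - Drama", 0),
])

join = "FROM netflix n JOIN golden_globe g ON g.netflix_id = n.netflix_id"

# Which Golden Globe nominations are on Netflix?
nominated = [r[0] for r in conn.execute(
    f"SELECT DISTINCT n.title {join} ORDER BY n.title")]

# Which Golden Globe winners are on Netflix?
winners = [r[0] for r in conn.execute(
    f"SELECT DISTINCT n.title {join} WHERE g.win = 1 ORDER BY n.title")]

# How many Netflix titles have at least one nomination?
nominated_count = conn.execute(
    "SELECT COUNT(DISTINCT netflix_id) FROM golden_globe").fetchone()[0]

# What are the ratings for the nominated titles, with nomination counts?
ratings = conn.execute(
    f"SELECT n.title, n.rating, COUNT(*) {join} "
    "GROUP BY n.netflix_id, n.title, n.rating ORDER BY n.title").fetchall()

print(nominated, winners, nominated_count, ratings)
```

Against the real database the same queries could be passed to `pd.read_sql_query`, with `WHERE g.win` replacing `WHERE g.win = 1` if the column is stored as a boolean.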

In [1]:
import pandas as pd
import psycopg2 as pg
from sqlalchemy import create_engine

# Read CSV files into dataframes


In [2]:
netflix = "netflix_cleaned.csv"
gg = "globes_transformed.csv"

In [3]:
netflix_df = pd.read_csv(netflix)
gg_df = pd.read_csv(gg)

In [4]:
netflix_df.head()

Unnamed: 0,id,show_id,type,title,director,year_film,rating,duration,listed_in,description
0,0,81145628,Movie,NORM OF THE NORTH: KING SIZED ADVENTURE,"Richard Finn, Tim Maltby",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,4,80125979,Movie,#REALITYHIGH,Fernando Lebrija,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
2,6,70304989,Movie,AUTOMATA,Gabe Ibáñez,2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
3,7,80164077,Movie,FABRIZIO COPANO: SOLO PIENSO EN MI,"Rodrigo Toro, Francisco Schultz",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
4,9,70304990,Movie,GOOD PEOPLE,Henrik Ruben Genz,2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...


# Create new Netflix data with select columns rearranged and renamed.  This dataframe gets loaded into the database.

In [5]:
new_netflix_df = netflix_df[['show_id', 'title', 'type', 'director', 'year_film', 'rating', 'duration', 'listed_in', 
                        'description']].copy()
new_netflix_df = new_netflix_df.rename(columns={"show_id":"netflix_id"})
new_netflix_df.head()

Unnamed: 0,netflix_id,title,type,director,year_film,rating,duration,listed_in,description
0,81145628,NORM OF THE NORTH: KING SIZED ADVENTURE,Movie,"Richard Finn, Tim Maltby",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80125979,#REALITYHIGH,Movie,Fernando Lebrija,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
2,70304989,AUTOMATA,Movie,Gabe Ibáñez,2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
3,80164077,FABRIZIO COPANO: SOLO PIENSO EN MI,Movie,"Rodrigo Toro, Francisco Schultz",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
4,70304990,GOOD PEOPLE,Movie,Henrik Ruben Genz,2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...


# Transformation on the Golden Globe data

In [6]:
gg_df.head()

Unnamed: 0,id,year_film,year_award,ceremony,category,nominee,title,win
0,0,1943,1944,1,Best Performance by an Actress in a Supporting...,Katina Paxinou,FOR WHOM THE BELL TOLLS,True
1,1,1943,1944,1,Best Performance by an Actor in a Supporting R...,Akim Tamiroff,FOR WHOM THE BELL TOLLS,True
2,2,1943,1944,1,Best Director - Motion Picture,Henry King,THE SONG OF BERNADETTE,True
3,3,1943,1944,1,Picture,The Song Of Bernadette,THE SONG OF BERNADETTE,True
4,4,1943,1944,1,Actress In A Leading Role,Jennifer Jones,THE SONG OF BERNADETTE,True


In [8]:
# Change the datatypes from int to string because of merge/NaN effects.

gg_df['year_film'] = gg_df['year_film'].apply(str)
gg_df['year_award'] = gg_df['year_award'].apply(str)


# Create new Golden Globes dataframe with select columns rearranged.  This data is merged with the Netflix data on title to acquire the foreign key.

In [9]:
 
new_gg_df = gg_df[['id', 'title', 'year_film', 'year_award', 'category', 'nominee', 'win']].copy()
new_gg_df.head()

Unnamed: 0,id,title,year_film,year_award,category,nominee,win
0,0,FOR WHOM THE BELL TOLLS,1943,1944,Best Performance by an Actress in a Supporting...,Katina Paxinou,True
1,1,FOR WHOM THE BELL TOLLS,1943,1944,Best Performance by an Actor in a Supporting R...,Akim Tamiroff,True
2,2,THE SONG OF BERNADETTE,1943,1944,Best Director - Motion Picture,Henry King,True
3,3,THE SONG OF BERNADETTE,1943,1944,Picture,The Song Of Bernadette,True
4,4,THE SONG OF BERNADETTE,1943,1944,Actress In A Leading Role,Jennifer Jones,True


In [10]:
merge_df = pd.merge(new_netflix_df, new_gg_df, on='title', how='left').dropna()
merge_df

Unnamed: 0,netflix_id,title,type,director,year_film_x,rating,duration,listed_in,description,id,year_film_y,year_award,category,nominee,win
12,80057969,LOVE,Movie,Gaspar Noé,2015,NR,135 min,"Cult Movies, Dramas, Independent Movies",A man in an unsatisfying marriage recalls the ...,2161.0,1973,1974,Documentary,Love,False
46,80201906,BLACK PANTHER,Movie,Ryan Coogler,2018,PG-13,135 min,"Action & Adventure, Sci-Fi & Fantasy","T'Challa, the superpowered new leader of the h...",7741.0,2018,2019,Best Motion Picture - Drama,Black Panther,False
47,80201906,BLACK PANTHER,Movie,Ryan Coogler,2018,PG-13,135 min,"Action & Adventure, Sci-Fi & Fantasy","T'Challa, the superpowered new leader of the h...",7800.0,2018,2019,Best Original Score - Motion Picture,Ludwig Göransson,False
48,80201906,BLACK PANTHER,Movie,Ryan Coogler,2018,PG-13,135 min,"Action & Adventure, Sci-Fi & Fantasy","T'Challa, the superpowered new leader of the h...",7803.0,2018,2019,Best Original Song - Motion Picture,All The Stars,False
213,60031232,MYSTIC RIVER,Movie,Clint Eastwood,2003,R,138 min,"Dramas, Thrillers","Tragedy reunites childhood friends Sean, Dave ...",5852.0,2003,2004,Best Motion Picture - Drama,Mystic River,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4428,60036359,SCHINDLER'S LIST,Movie,Steven Spielberg,1993,R,195 min,"Classic Movies, Dramas",Oskar Schindler becomes an unlikely humanitari...,4643.0,1993,1994,Best Original Score - Motion Picture,John Williams,False
4487,70142436,MERLIN,TV Show,James Hawes,2012,TV-PG,5 Seasons,"British TV Shows, International TV Shows, TV A...",This family-oriented retelling of the King Art...,5289.0,1998,1999,Best Television Limited Series or Motion Pictu...,Merlin,False
4488,70142436,MERLIN,TV Show,James Hawes,2012,TV-PG,5 Seasons,"British TV Shows, International TV Shows, TV A...",This family-oriented retelling of the King Art...,5294.0,1998,1999,Best Performance by an Actress in a Limited Se...,Miranda Richardson,False
4489,70142436,MERLIN,TV Show,James Hawes,2012,TV-PG,5 Seasons,"British TV Shows, International TV Shows, TV A...",This family-oriented retelling of the King Art...,5297.0,1998,1999,Best Performance by an Actor in a Limited Seri...,Sam Neill,False
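
The effect of a left merge followed by `.dropna()` can be seen on a small example. The frames below reuse a few values from the output above; the names `left`, `right`, `left_merged`, `left_then_drop`, and `inner` are illustrative only. The sketch shows that unmatched rows receive NaN in the Golden Globes columns, which upcasts the integer `id` column to float, and that an inner merge keeps the matched titles without that upcast.

```python
import pandas as pd

left = pd.DataFrame({
    "netflix_id": [80201906, 70304990],
    "title": ["BLACK PANTHER", "GOOD PEOPLE"],
})
right = pd.DataFrame({
    "id": [7741, 7800],
    "title": ["BLACK PANTHER", "BLACK PANTHER"],
    "win": [False, False],
})

# Left merge keeps every Netflix row; GOOD PEOPLE has no nomination,
# so its `id` is NaN and the whole column becomes float64.
left_merged = pd.merge(left, right, on="title", how="left")
print(left_merged["id"].dtype)

# Dropping rows with NaN removes the unmatched titles, but the
# column stays float until it is cast back to int.
left_then_drop = left_merged.dropna()
print(left_then_drop["id"].astype(int).tolist())

# An inner merge keeps only matching titles and leaves `id` as int64.
inner = pd.merge(left, right, on="title", how="inner")
print(inner["id"].dtype)
```

The two approaches are not strictly equivalent: `.dropna()` also removes matched rows that have a missing value in any other column (for example a title with no director), whereas an inner merge would keep them.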


In [12]:
# The datatype for 'id' was changed because of the merge (NaN values) so we changed it back to int.
# Select only specific columns needed for the new Golden Globe dataframe.

merge_df['id'] = merge_df['id'].apply(int)
new_gg_df = merge_df[['id', 'netflix_id', 'title', 'year_film_y', 'year_award', 'category', 'nominee', 'win']].copy()
new_gg_df.head()

Unnamed: 0,id,netflix_id,title,year_film_y,year_award,category,nominee,win
12,2161,80057969,LOVE,1973,1974,Documentary,Love,False
46,7741,80201906,BLACK PANTHER,2018,2019,Best Motion Picture - Drama,Black Panther,False
47,7800,80201906,BLACK PANTHER,2018,2019,Best Original Score - Motion Picture,Ludwig Göransson,False
48,7803,80201906,BLACK PANTHER,2018,2019,Best Original Song - Motion Picture,All The Stars,False
213,5852,60031232,MYSTIC RIVER,2003,2004,Best Motion Picture - Drama,Mystic River,False


In [13]:
# Rename columns to match the database table

renamed_gg_df = new_gg_df.rename(columns={"id": "gg_id", "year_film_y":"year_film"})
renamed_gg_df.head()

Unnamed: 0,gg_id,netflix_id,title,year_film,year_award,category,nominee,win
12,2161,80057969,LOVE,1973,1974,Documentary,Love,False
46,7741,80201906,BLACK PANTHER,2018,2019,Best Motion Picture - Drama,Black Panther,False
47,7800,80201906,BLACK PANTHER,2018,2019,Best Original Score - Motion Picture,Ludwig Göransson,False
48,7803,80201906,BLACK PANTHER,2018,2019,Best Original Song - Motion Picture,All The Stars,False
213,5852,60031232,MYSTIC RIVER,2003,2004,Best Motion Picture - Drama,Mystic River,False


# Connect to local database

In [14]:

connection_string = "postgres:postgres@localhost:5432/netflix_gg_db"
engine = create_engine(f'postgresql://{connection_string}')
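
The connection string above hardcodes the credentials. One possible alternative, shown here as a minimal sketch, is to assemble the URL from environment variables. The helper name `build_pg_url` is hypothetical; the variable names `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, and `PGDATABASE` are borrowed from the libpq convention, and the defaults mirror the values used in this notebook.

```python
import os
from urllib.parse import quote


def build_pg_url(env=os.environ):
    """Build a PostgreSQL URL from a mapping of settings (hypothetical helper)."""
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "postgres")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "netflix_gg_db")
    # Percent-encode the parts that may contain characters such as '@' or '/'.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


print(build_pg_url({}))
# The result would then be passed on, e.g. engine = create_engine(build_pg_url())
```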

In [15]:
from sqlalchemy import inspect  # engine.table_names() was removed in SQLAlchemy 2.0
inspect(engine).get_table_names()

['netflix', 'golden_globe']

# Load the data into the database tables

In [16]:
# Load the netflix table - use 'replace' if this isn't the first time through. 
new_netflix_df.to_sql(name='netflix', con=engine, if_exists='replace', index=False)

# May need to use this statement instead
# new_netflix_df.to_sql(name='netflix', con=engine, if_exists='append', index=False)
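
The difference between `if_exists='append'` and `if_exists='replace'` matters for the table definitions. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL, with a simplified two-column `netflix` table that is an assumption rather than the project's schema. It illustrates that `append` inserts into the existing table and keeps its definition, while `replace` drops the table and recreates it from the dataframe, so constraints such as the primary key are lost. The helper `table_sql` is hypothetical.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE netflix (netflix_id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
)

df = pd.DataFrame({
    "netflix_id": [80201906, 60031232],
    "title": ["BLACK PANTHER", "MYSTIC RIVER"],
})


def table_sql(connection, name):
    """Return the CREATE TABLE statement SQLite has stored for a table."""
    row = connection.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row[0]


# 'append' inserts into the table that already exists.
df.to_sql(name="netflix", con=conn, if_exists="append", index=False)
schema_after_append = table_sql(conn, "netflix")

# 'replace' drops the table and recreates it from the dataframe's columns.
df.to_sql(name="netflix", con=conn, if_exists="replace", index=False)
schema_after_replace = table_sql(conn, "netflix")

row_count = conn.execute("SELECT COUNT(*) FROM netflix").fetchone()[0]
print("PRIMARY KEY" in schema_after_append, "PRIMARY KEY" in schema_after_replace)
```

In PostgreSQL, dropping a table that another table references through a foreign key may fail unless the dependent constraint is removed first, which is one reason `append` into the pre-created tables may be the safer choice here.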


In [18]:
netflix_query = pd.read_sql_query('Select * from netflix', con=engine)
netflix_query.head()

Unnamed: 0,netflix_id,title,type,director,year_film,rating,duration,listed_in,description
0,81145628,NORM OF THE NORTH: KING SIZED ADVENTURE,Movie,"Richard Finn, Tim Maltby",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80125979,#REALITYHIGH,Movie,Fernando Lebrija,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
2,70304989,AUTOMATA,Movie,Gabe Ibáñez,2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
3,80164077,FABRIZIO COPANO: SOLO PIENSO EN MI,Movie,"Rodrigo Toro, Francisco Schultz",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
4,70304990,GOOD PEOPLE,Movie,Henrik Ruben Genz,2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...


In [19]:
# Load the Golden Globe table
renamed_gg_df.to_sql(name='golden_globe', con=engine, if_exists='replace', index=False)

# May need to use this statement instead
# renamed_gg_df.to_sql(name='golden_globe', con=engine, if_exists='append', index=False)

In [20]:
gg_query = pd.read_sql_query('Select * from golden_globe', con=engine)
gg_query.head()

Unnamed: 0,gg_id,netflix_id,title,year_film,year_award,category,nominee,win
0,2161,80057969,LOVE,1973,1974,Documentary,Love,False
1,7741,80201906,BLACK PANTHER,2018,2019,Best Motion Picture - Drama,Black Panther,False
2,7800,80201906,BLACK PANTHER,2018,2019,Best Original Score - Motion Picture,Ludwig Göransson,False
3,7803,80201906,BLACK PANTHER,2018,2019,Best Original Song - Motion Picture,All The Stars,False
4,5852,60031232,MYSTIC RIVER,2003,2004,Best Motion Picture - Drama,Mystic River,False
