# Project 2: ETL Challenge

This project was completed by Kyra, Kyna and Cindy.

This program delivers a database of Netflix movies and TV shows that have been nominated for (and possibly won) awards at the Golden Globes (1944 - 2020). It reads in 2 CSV files (Netflix movies/TV shows in 2019, and Golden Globe nominations and winners). We loaded the 2 files into a PostgreSQL database and joined the tables by title for analysis purposes.

These were the steps taken:

Extract:

1. netflix_titles.csv - source www.kaggle.com (6234 rows) 
2. golden_globe_awards.csv - source www.kaggle.com (7992 rows) 

Transform:

1. Removed all unnecessary columns from both data sources.
2. Used `.isnull()` to find the rows that had the title in the wrong column, then fixed them.
3. Renamed some column headings.
4. Dropped rows with missing (NaN) values.
5. Set the title to upper case in both files.
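
The five transform steps above can be sketched in pandas roughly as follows. This is a minimal illustration and not the project's actual cleaning notebook: the sample rows, the raw column name `film`, the choice to drop `ceremony`, and the helper name `transform` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the raw Golden Globes data. The raw column name
# "film" and the extra all-empty row are assumptions for illustration only.
raw = pd.DataFrame({
    "year_film": [1943, 1943, 2019, 1950],
    "ceremony": [1, 1, 77, 8],
    "nominee": ["Katina Paxinou", "The Song Of Bernadette", "Tom Hanks", None],
    "film": ["For Whom the Bell Tolls", None, None, None],
    "win": [True, True, False, False],
})


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the five transform steps to a raw dataframe (sketch)."""
    out = df.drop(columns=["ceremony"])             # 1. drop an unneeded column
    missing = out["film"].isnull()                  # 2. title is in the wrong column
    out.loc[missing, "film"] = out.loc[missing, "nominee"]
    out = out.rename(columns={"film": "title"})     # 3. rename a column heading
    out = out.dropna()                              # 4. drop rows that still have NaN
    out["title"] = out["title"].str.upper()         # 5. upper-case the title
    return out.reset_index(drop=True)


cleaned = transform(raw)
print(cleaned)
```

Upper-casing the title in both files matters because the later join compares titles as exact strings.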

Load: We loaded the 2 files into a PostgreSQL database and joined the tables by title for analysis purposes.

1. www.quickdatabasediagrams.com was used to create our ERD. See QuickDBD-export.png for our database schema.

2. The QuickDBD export function was used to generate the SQL for creating our tables. See QuickDBD-export.sql for our CREATE TABLE statements.

3. pgAdmin was used to create our PostgreSQL database. We chose a relational database because our data is static and structured into rows and columns.

4. The tables can be joined by title to perform analysis such as:

 - Which Golden Globe nominations are on Netflix
 - Which Golden Globe winners are and are not on Netflix
 - How many Netflix shows are Golden Globe nominees and winners
 - What the ratings are for the nominees and winners
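
As a rough sketch of how the title join answers these questions, the example below runs the queries against an in-memory SQLite database standing in for the Postgres database. The table and column names follow the dataframes loaded later in this notebook, but the sample rows (`TITLE A`, `Nominee X`, and so on) are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix (id INTEGER, title TEXT, rating TEXT);
    CREATE TABLE golden_globe (id INTEGER, title TEXT, nominee TEXT, win BOOLEAN);
    INSERT INTO netflix VALUES (1, 'TITLE A', 'TV-MA'), (2, 'TITLE B', 'R');
    INSERT INTO golden_globe VALUES
        (10, 'TITLE A', 'Nominee X', 1),
        (11, 'TITLE A', 'Nominee Y', 0),
        (12, 'TITLE C', 'Nominee Z', 1);
""")

# Nominations whose title is on Netflix, together with the Netflix rating.
on_netflix = conn.execute("""
    SELECT g.title, g.nominee, g.win, n.rating
    FROM golden_globe AS g
    JOIN netflix AS n ON n.title = g.title
    ORDER BY g.id
""").fetchall()

# Winners with no matching Netflix title.
winners_missing = conn.execute("""
    SELECT g.title
    FROM golden_globe AS g
    LEFT JOIN netflix AS n ON n.title = g.title
    WHERE g.win = 1 AND n.title IS NULL
""").fetchall()

# Number of distinct Netflix titles with at least one nomination.
nominated_count = conn.execute("""
    SELECT COUNT(DISTINCT n.title)
    FROM netflix AS n
    JOIN golden_globe AS g ON g.title = n.title
""").fetchone()[0]

print(on_netflix)
print(winners_missing)
print(nominated_count)
```

In Postgres the `win` column is a true boolean, so the filter would be written `WHERE g.win` rather than `g.win = 1`. Joining on title alone can also match different productions that share a name; adding `year_film` to the join condition is one possible way to tighten it.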

In [2]:
import pandas as pd
import psycopg2 as pg
from sqlalchemy import create_engine

# Read CSV files into dataframes

In [30]:
netflix = "netflix_cleaned.csv"
gg = "globes_transformed.csv"

In [31]:
netflix_df = pd.read_csv(netflix)
gg_df = pd.read_csv(gg)

In [32]:
netflix_df.head()

Unnamed: 0,id,show_id,type,title,director,year_film,rating,duration,listed_in,description
0,0,81145628,Movie,NORM OF THE NORTH: KING SIZED ADVENTURE,"Richard Finn, Tim Maltby",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,4,80125979,Movie,#REALITYHIGH,Fernando Lebrija,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
2,6,70304989,Movie,AUTOMATA,Gabe Ibáñez,2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
3,7,80164077,Movie,FABRIZIO COPANO: SOLO PIENSO EN MI,"Rodrigo Toro, Francisco Schultz",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
4,9,70304990,Movie,GOOD PEOPLE,Henrik Ruben Genz,2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...


# Create new Netflix dataframe with select columns rearranged and renamed

In [33]:
new_netflix_df = netflix_df[['show_id', 'title', 'type', 'director', 'year_film', 'rating', 'duration', 'listed_in', 
                        'description']].copy()
new_netflix_df = new_netflix_df.rename(columns={"show_id":"id"})
new_netflix_df.head()

Unnamed: 0,id,title,type,director,year_film,rating,duration,listed_in,description
0,81145628,NORM OF THE NORTH: KING SIZED ADVENTURE,Movie,"Richard Finn, Tim Maltby",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80125979,#REALITYHIGH,Movie,Fernando Lebrija,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
2,70304989,AUTOMATA,Movie,Gabe Ibáñez,2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
3,80164077,FABRIZIO COPANO: SOLO PIENSO EN MI,Movie,"Rodrigo Toro, Francisco Schultz",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
4,70304990,GOOD PEOPLE,Movie,Henrik Ruben Genz,2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...


In [34]:
new_netflix_df.dtypes

id              int64
title          object
type           object
director       object
year_film       int64
rating         object
duration       object
listed_in      object
description    object
dtype: object

In [35]:
gg_df.head()

Unnamed: 0,id,year_film,year_award,ceremony,category,nominee,title,win
0,0,1943,1944,1,Best Performance by an Actress in a Supporting...,Katina Paxinou,FOR WHOM THE BELL TOLLS,True
1,1,1943,1944,1,Best Performance by an Actor in a Supporting R...,Akim Tamiroff,FOR WHOM THE BELL TOLLS,True
2,2,1943,1944,1,Best Director - Motion Picture,Henry King,THE SONG OF BERNADETTE,True
3,3,1943,1944,1,Picture,The Song Of Bernadette,THE SONG OF BERNADETTE,True
4,4,1943,1944,1,Actress In A Leading Role,Jennifer Jones,THE SONG OF BERNADETTE,True


# Create new Golden Globes dataframe with select columns rearranged

In [36]:
 
new_gg_df = gg_df[['id', 'title', 'year_film', 'year_award', 'category', 'nominee', 'win']].copy()
new_gg_df.head()

Unnamed: 0,id,title,year_film,year_award,category,nominee,win
0,0,FOR WHOM THE BELL TOLLS,1943,1944,Best Performance by an Actress in a Supporting...,Katina Paxinou,True
1,1,FOR WHOM THE BELL TOLLS,1943,1944,Best Performance by an Actor in a Supporting R...,Akim Tamiroff,True
2,2,THE SONG OF BERNADETTE,1943,1944,Best Director - Motion Picture,Henry King,True
3,3,THE SONG OF BERNADETTE,1943,1944,Picture,The Song Of Bernadette,True
4,4,THE SONG OF BERNADETTE,1943,1944,Actress In A Leading Role,Jennifer Jones,True


In [37]:
new_gg_df.dtypes

id             int64
title         object
year_film      int64
year_award     int64
category      object
nominee       object
win             bool
dtype: object

# Connect to local database

In [38]:

connection_string = "postgres:postgres@localhost:5432/netflix_gg_db"
engine = create_engine(f'postgresql://{connection_string}')
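
The connection string above hard-codes the user name and password. One possible alternative, sketched here with the standard library only, is to read the settings from environment variables and fall back to the notebook's values. The helper name `build_postgres_url` is hypothetical, and using the `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT` and `PGDATABASE` variable names is an assumption of this sketch, not something the project does.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ) -> str:
    """Build a postgresql:// URL from environment settings (hypothetical helper)."""
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "postgres")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "netflix_gg_db")
    # quote_plus escapes characters such as "@" or ":" that would break the URL.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# With no overrides this reproduces the URL used in the notebook.
print(build_postgres_url({}))
```

The returned string can be passed straight to `create_engine`.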

In [39]:
engine.table_names()

['netflix', 'golden_globe']

# Load the data into the database tables

In [40]:

# Replace the table contents on each run.
new_netflix_df.to_sql(name='netflix', con=engine, if_exists='replace', index=False)

# Use the statement below for the first run instead:
# new_netflix_df.to_sql(name='netflix', con=engine, if_exists='append', index=False)
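
The difference between `if_exists='append'` and `if_exists='replace'` can be seen in a small sketch. It uses an in-memory SQLite connection in place of the Postgres engine, and the table name `demo`, the sample rows, and the `row_count` helper are made up for the example.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres engine (assumption)
df = pd.DataFrame({"id": [1, 2], "title": ["TITLE A", "TITLE B"]})


def row_count(table: str) -> int:
    """Return the number of rows currently stored in the given table."""
    result = pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {table}", conn)
    return int(result["n"].iloc[0])


# 'append' creates the table if needed, then adds rows on every call,
# so running the same cell twice duplicates the data.
df.to_sql("demo", conn, if_exists="append", index=False)
df.to_sql("demo", conn, if_exists="append", index=False)
appended = row_count("demo")

# 'replace' drops the existing table and recreates it from the dataframe.
df.to_sql("demo", conn, if_exists="replace", index=False)
replaced = row_count("demo")

print(appended, replaced)
```

Because `replace` drops and recreates the table from the dataframe's dtypes, any keys or constraints defined by the original CREATE TABLE statements are likely to be lost; `append` keeps the existing table definition but is not safe to re-run.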


In [41]:
netflix_query = pd.read_sql_query('Select * from netflix', con=engine)
netflix_query

Unnamed: 0,id,title,type,director,year_film,rating,duration,listed_in,description
0,81145628,NORM OF THE NORTH: KING SIZED ADVENTURE,Movie,"Richard Finn, Tim Maltby",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80125979,#REALITYHIGH,Movie,Fernando Lebrija,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
2,70304989,AUTOMATA,Movie,Gabe Ibáñez,2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
3,80164077,FABRIZIO COPANO: SOLO PIENSO EN MI,Movie,"Rodrigo Toro, Francisco Schultz",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
4,70304990,GOOD PEOPLE,Movie,Henrik Ruben Genz,2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...
...,...,...,...,...,...,...,...,...,...
4254,80063224,THE GREAT BRITISH BAKING SHOW,TV Show,Andy Devonshire,2019,TV-PG,7 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
4255,80164216,MIRACULOUS: TALES OF LADYBUG & CAT NOIR,TV Show,Thomas Astruc,2018,TV-Y7,4 Seasons,"Kids' TV, TV Action & Adventure","When Paris is in peril, Marinette becomes Lady..."
4256,80115328,SACRED GAMES,TV Show,"Vikramaditya Motwane, Anurag Kashyap",2019,TV-MA,2 Seasons,"Crime TV Shows, International TV Shows, TV Dramas",A link in their pasts leads an honest cop to a...
4257,80176842,MEN ON A MISSION,TV Show,Jung-ah Im,2019,TV-14,4 Seasons,"International TV Shows, Korean TV Shows, Stand...",Male celebs play make-believe as high schooler...


In [42]:

# Append the rows to the existing table (re-running this cell adds them again).
new_gg_df.to_sql(name='golden_globe', con=engine, if_exists='append', index=False)

# Use the statement below to recreate the table if needed:
# new_gg_df.to_sql(name='golden_globe', con=engine, if_exists='replace', index=False)

In [43]:
gg_query = pd.read_sql_query('Select * from golden_globe', con=engine)
gg_query

Unnamed: 0,id,title,year_film,year_award,category,nominee,win
0,0,FOR WHOM THE BELL TOLLS,1943,1944,Best Performance by an Actress in a Supporting...,Katina Paxinou,true
1,1,FOR WHOM THE BELL TOLLS,1943,1944,Best Performance by an Actor in a Supporting R...,Akim Tamiroff,true
2,2,THE SONG OF BERNADETTE,1943,1944,Best Director - Motion Picture,Henry King,true
3,3,THE SONG OF BERNADETTE,1943,1944,Picture,The Song Of Bernadette,true
4,4,THE SONG OF BERNADETTE,1943,1944,Actress In A Leading Role,Jennifer Jones,true
...,...,...,...,...,...,...,...
7986,7986,SUCCESSION,2019,2020,Best Performance by an Actor in a Supporting R...,Kieran Culkin,false
7987,7987,FLEABAG,2019,2020,Best Performance by an Actor in a Supporting R...,Andrew Scott,false
7988,7988,BARRY,2019,2020,Best Performance by an Actor in a Supporting R...,Henry Winkler,false
7989,7989,TOM HANKS,2019,2020,Cecil B. deMille Award,Tom Hanks,false


# Analysis

In [44]:
# inspect must be imported before use; the original cell raised a NameError.
from sqlalchemy import inspect

inspector = inspect(engine)
inspector.get_table_names()

In [None]:
# Print the column names and types of both tables.
columns = inspector.get_columns('netflix')
for c in columns:
    print(c['name'], c['type'])

print('')

columns = inspector.get_columns('golden_globe')
for c in columns:
    print(c['name'], c['type'])
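
For reference, here is a self-contained sketch of the same inspection pattern that runs without a Postgres server. It assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite engine; the two table definitions are simplified stand-ins and not the project's real schema.

```python
from sqlalchemy import create_engine, inspect, text

# In-memory SQLite engine as a stand-in for the Postgres engine (assumption).
engine = create_engine("sqlite://")

# Create two simplified tables so that there is something to inspect.
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE netflix (id INTEGER, title TEXT, year_film INTEGER)"))
    conn.execute(text("CREATE TABLE golden_globe (id INTEGER, title TEXT, win BOOLEAN)"))

inspector = inspect(engine)
tables = sorted(inspector.get_table_names())

# List every column of every table with its reflected type.
for table in tables:
    for column in inspector.get_columns(table):
        print(table, column["name"], column["type"])
```

The names passed to `get_columns` must match the table names exactly as they exist in the database, which here are `netflix` and `golden_globe`.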

In [None]:
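# Find the Netflix titles in the database, ordered by title.
# No ORM session or Netflix class is defined in this notebook, and the table's
# year column is year_film, so query through the engine with pandas instead.
netflix_title = pd.read_sql_query('SELECT title, year_film FROM netflix ORDER BY title', con=engine)
netflix_title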