## 13-ETL-Project - Day 1 - A Case Study of Extract, Transform, Load

### Resources:
* [data.world](https://data.world/)
* [Kaggle](https://www.kaggle.com/)
* [Google Dataset Search](https://toolbox.google.com/datasetsearch)

# Install
* `pip install psycopg2` (see the [psycopg installation docs](http://initd.org/psycopg/docs/install.html))

# ==========================================

### 1.01 Instructor Do: ETL with Pandas (15 mins)

# Instructions:
* In pgAdmin, create a database named `customer_db`.
* Execute this SQL:

```sql
CREATE TABLE customer_name (
    id INT PRIMARY KEY,
    first_name TEXT,
    last_name TEXT
);

CREATE TABLE customer_location (
    id INT PRIMARY KEY,
    address TEXT,
    us_state TEXT
);
```

In [2]:
import pandas as pd
from sqlalchemy import create_engine

In [13]:
csv_file = "../ETL-project/Resources/cost-of-living.csv"
cost_of_living_df = pd.read_csv(csv_file)
cost_of_living_df.head()

| | Unnamed: 0 | Saint Petersburg, Russia | Istanbul, Turkey | Izmir, Turkey | Helsinki, Finland | Chisinau, Moldova | Milan, Italy | Cairo, Egypt | Banja Luka, Bosnia And Herzegovina | Baku, Azerbaijan | ... | Lviv, Ukraine | Novosibirsk, Russia | Bursa, Turkey | Brussels, Belgium | Jerusalem, Israel | Melbourne, Australia | Perth, Australia | Sydney, Australia | Alexandria, Egypt | Quito, Ecuador |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Meal, Inexpensive Restaurant | 7.34 | 4.58 | 3.06 | 12.0 | 4.67 | 15.0 | 3.38 | 3.58 | 5.27 | ... | 3.75 | 5.72 | 3.82 | 15.0 | 15.56 | 10.22 | 12.43 | 11.81 | 2.81 | 3.59 |
| 1 | Meal for 2 People, Mid-range Restaurant, Three... | 29.35 | 15.28 | 12.22 | 65.0 | 20.74 | 60.0 | 17.48 | 22.99 | 23.73 | ... | 18.76 | 22.01 | 11.47 | 60.0 | 62.24 | 49.54 | 56.55 | 54.37 | 14.06 | 31.45 |
| 2 | McMeal at McDonalds (or Equivalent Combo Meal) | 4.4 | 3.82 | 3.06 | 8.0 | 4.15 | 8.0 | 4.51 | 3.58 | 4.22 | ... | 3.56 | 3.67 | 3.06 | 8.2 | 12.97 | 7.12 | 7.32 | 7.15 | 3.38 | 5.39 |
| 3 | Domestic Beer (0.5 liter draught) | 2.2 | 3.06 | 2.29 | 6.5 | 1.04 | 5.0 | 1.69 | 1.02 | 0.84 | ... | 1.5 | 1.1 | 2.37 | 4.0 | 7.26 | 5.57 | 5.9 | 4.97 | 1.69 | 1.35 |
| 4 | Imported Beer (0.33 liter bottle) | 2.2 | 3.06 | 2.75 | 6.75 | 1.43 | 5.0 | 2.82 | 1.53 | 2.11 | ... | 1.5 | 2.2 | 3.06 | 4.0 | 7.26 | 5.57 | 5.59 | 4.97 | 2.81 | 2.7 |


In [14]:
csv_file = "../ETL-project/Resources/world-happiness-report-2019.csv"
worldHappiness_df = pd.read_csv(csv_file)
worldHappiness_df.head()

| | Country (region) | Ladder | SD of Ladder | Positive affect | Negative affect | Social support | Freedom | Corruption | Generosity | Log of GDP\nper capita | Healthy life\nexpectancy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | 1 | 4 | 41.0 | 10.0 | 2.0 | 5.0 | 4.0 | 47.0 | 22.0 | 27.0 |
| 1 | Denmark | 2 | 13 | 24.0 | 26.0 | 4.0 | 6.0 | 3.0 | 22.0 | 14.0 | 23.0 |
| 2 | Norway | 3 | 8 | 16.0 | 29.0 | 3.0 | 3.0 | 8.0 | 11.0 | 7.0 | 12.0 |
| 3 | Iceland | 4 | 9 | 3.0 | 3.0 | 1.0 | 7.0 | 45.0 | 3.0 | 15.0 | 13.0 |
| 4 | Netherlands | 5 | 1 | 12.0 | 25.0 | 15.0 | 19.0 | 12.0 | 7.0 | 12.0 | 18.0 |


In [3]:
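# Assumes customer_data_df was read from the customer CSV in an earlier cell that is not shown here.
new_customer_data_df = customer_data_df[['id', 'first_name', 'last_name']].copy()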
new_customer_data_df.head()

| | id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Benetta | Cancott |
| 1 | 2 | Lilyan | Cherry |
| 2 | 3 | Ezekiel | Benasik |
| 3 | 4 | Kennedy | Atlay |
| 4 | 5 | Sanford | Salmen |
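
The cell above relies on a `customer_data_df` that is loaded in a cell not shown here. A minimal, self-contained sketch of that extract-and-select step might look like the following. The inline CSV stands in for the real file, and the `email` column is a hypothetical extra field included only to show an unneeded column being dropped.

```python
import io

import pandas as pd

# Inline stand-in for the customer CSV; `email` is a hypothetical extra column.
sample_csv = io.StringIO(
    "id,first_name,last_name,email\n"
    "1,Benetta,Cancott,b.cancott@example.com\n"
    "2,Lilyan,Cherry,l.cherry@example.com\n"
)

# Extract: read the raw file into a DataFrame.
customer_data_df = pd.read_csv(sample_csv)

# Transform: keep only the columns that exist in the `customer_name` table.
# .copy() returns an independent DataFrame instead of a view of the original.
new_customer_data_df = customer_data_df[["id", "first_name", "last_name"]].copy()
print(new_customer_data_df)
```

With the real data, `sample_csv` would be replaced by the path to the customer CSV file.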


In [4]:
json_file = "1/01-Ins_ETL_Pandas/Resources/customer_location.json"
customer_location_df = pd.read_json(json_file)
customer_location_df.head()

| | id | address | longitude | latitude | us_state |
|---|---|---|---|---|---|
| 0 | 1 | 043 Mockingbird Place | -86.5186 | 39.1682 | Indiana |
| 1 | 2 | 4 Prentice Point | -85.0707 | 41.0938 | Indiana |
| 2 | 3 | 46 Derek Junction | -96.7776 | 32.7673 | Texas |
| 3 | 4 | 11966 Old Shore Place | -94.3567 | 39.035 | Missouri |
| 4 | 5 | 5 Evergreen Circle | -73.9772 | 40.7808 | New York |


In [5]:
new_customer_location_df = customer_location_df[["id", "address", "us_state"]].copy()
new_customer_location_df.head()

| | id | address | us_state |
|---|---|---|---|
| 0 | 1 | 043 Mockingbird Place | Indiana |
| 1 | 2 | 4 Prentice Point | Indiana |
| 2 | 3 | 46 Derek Junction | Texas |
| 3 | 4 | 11966 Old Shore Place | Missouri |
| 4 | 5 | 5 Evergreen Circle | New York |


In [12]:
# rds_connection_string = "<insert user name>:<insert password>@localhost:5432/customer_db"
# rds_connection_string = "postgres:raw123@localhost:5432/customer_db"
rds_connection_string = "postgres:____@localhost:5432/customer_db"
engine = create_engine(f'postgresql://{rds_connection_string}')

In [13]:
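from sqlalchemy import inspect

# Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in both 1.4 and 2.0.
inspect(engine).get_table_names()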

['customer_name', 'customer_location']

In [22]:
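# 'append' keeps the table (and its primary key) created in pgAdmin; 'replace' would drop and recreate it.
new_customer_data_df.to_sql(name='customer_name', con=engine, if_exists='append', index=False)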

In [23]:
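# 'append' keeps the table (and its primary key) created in pgAdmin; 'replace' would drop and recreate it.
new_customer_location_df.to_sql(name='customer_location', con=engine, if_exists='append', index=False)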
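
The `if_exists` argument of `to_sql` decides what happens when the target table already exists, and the choice affects the schema created earlier. The sketch below illustrates the difference using an in-memory SQLite database as a stand-in for Postgres (an assumption made so the example runs without a server); `primary_key_columns` is a hypothetical helper written for this illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for the Postgres `customer_db`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_name (id INT PRIMARY KEY, first_name TEXT, last_name TEXT)"
)

df = pd.DataFrame(
    {"id": [1, 2], "first_name": ["Benetta", "Lilyan"], "last_name": ["Cancott", "Cherry"]}
)


def primary_key_columns(connection, table):
    """Return the names of the columns flagged as primary key in `table`."""
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, default, pk); pk > 0 marks a key column.
    return [row[1] for row in rows if row[5] > 0]


# 'append' inserts into the existing table, so the schema is untouched.
df.to_sql(name="customer_name", con=conn, if_exists="append", index=False)
pk_after_append = primary_key_columns(conn, "customer_name")

# 'replace' drops the table and lets pandas recreate it without a primary key.
df.to_sql(name="customer_name", con=conn, if_exists="replace", index=False)
pk_after_replace = primary_key_columns(conn, "customer_name")

print(pk_after_append, pk_after_replace)
```

The same trade-off should apply to Postgres: `'replace'` is convenient for re-running a notebook, while `'append'` preserves constraints defined in the schema but raises an error if rows with an existing key are inserted again.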

In [16]:
pd.read_sql_query('select * from customer_name', con=engine).head()

| | id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Benetta | Cancott |
| 1 | 2 | Lilyan | Cherry |
| 2 | 3 | Ezekiel | Benasik |
| 3 | 4 | Kennedy | Atlay |
| 4 | 5 | Sanford | Salmen |


In [17]:
pd.read_sql_query('select * from customer_location', con=engine).head()

| | id | address | us_state |
|---|---|---|---|
| 0 | 1 | 043 Mockingbird Place | Indiana |
| 1 | 2 | 4 Prentice Point | Indiana |
| 2 | 3 | 46 Derek Junction | Texas |
| 3 | 4 | 11966 Old Shore Place | Missouri |
| 4 | 5 | 5 Evergreen Circle | New York |


# ==========================================

### 1.02 Local Data ETL

### Instructions

* Create a `customer_db` database in pgAdmin 4 then create the following two tables within:

  * A `premise` table that contains the columns `id`, `premise_name` and `county_id`.

  * A `county` table that contains the columns `id`, `county_name`, `license_count` and `county_id`.

  * Be sure to assign a primary key, as Pandas will not be able to do so.

* Perform all ETL steps in Jupyter Notebook.

* **Extraction**

  * Put each CSV into a pandas DataFrame.

* **Transform**

  * Copy only the columns needed into a new DataFrame.

  * Rename columns to fit the tables created in the database.

  * Handle any duplicates. **HINT:** some locations have the same name but each license number is unique.

  * Set index to the previously created primary key.

* **Load**

  * Create a connection to the database.

  * Check for a successful connection to the database and confirm that the tables have been created.

  * Append DataFrames to tables. Be sure to use the index set earlier.

* Confirm a successful **Load** by querying the database.

* Join the two tables and select the `id` and `premise_name` from the `premise` table and `county_name` from the `county` table.


# Solution

### schema.sql

```sql
-- Create Tables


```

### query.sql

```sql
-- Query to check successful load


```
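
As a rough illustration of the schema and the confirming join described in the instructions, the sketch below builds both tables in an in-memory SQLite database and runs the join from Python. The column names come from the instructions; the column types, the sample rows, and the choice of `county_id` as the join key are assumptions, and SQLite is only a stand-in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: column names follow the instructions, the types are guesses.
conn.executescript(
    """
    CREATE TABLE premise (
        id INT PRIMARY KEY,
        premise_name TEXT,
        county_id INT
    );
    CREATE TABLE county (
        id INT PRIMARY KEY,
        county_name TEXT,
        license_count INT,
        county_id INT
    );
    """
)

# Made-up rows, only to give the join something to return.
conn.executemany(
    "INSERT INTO premise VALUES (?, ?, ?)",
    [(101, "Corner Deli", 1), (102, "Corner Deli", 2)],
)
conn.executemany(
    "INSERT INTO county VALUES (?, ?, ?, ?)",
    [(1, "Albany", 40, 1), (2, "Bronx", 55, 2)],
)

# Join on the shared county_id and select the three requested columns.
join_query = """
    SELECT premise.id, premise.premise_name, county.county_name
    FROM premise
    INNER JOIN county ON premise.county_id = county.county_id
    ORDER BY premise.id
"""
rows = conn.execute(join_query).fetchall()
print(rows)
```

The two `CREATE TABLE` statements and the `SELECT` could be adapted for `schema.sql` and `query.sql` once the real column types are known.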

In [None]:
import pandas as pd
from sqlalchemy import create_engine

### Extract CSVs into DataFrames

In [None]:
premise_file = "1/02-Stu_ETL_Pandas_Local/Resources/LicensePremise.csv"


In [None]:
county_file = "1/02-Stu_ETL_Pandas_Local/Resources/CountyLicenseCount.csv"


### Transform premise DataFrame

In [None]:
# Create a filtered dataframe from specific columns

# Rename the column headers

# Clean the data by dropping duplicates and setting the index
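
One possible shape for the premise transform is sketched below. The raw column names (`License Serial Number`, `Premises Name`, `County ID Code`, `Zip`) and the rows are hypothetical, since the real CSV headers are not shown here; only the target names `id`, `premise_name`, and `county_id` come from the instructions.

```python
import pandas as pd

# Hypothetical raw columns and rows; the real CSV headers may differ.
premise_df = pd.DataFrame(
    {
        "License Serial Number": [101, 102, 102],
        "Premises Name": ["Corner Deli", "Corner Deli", "Corner Deli"],
        "County ID Code": [1, 2, 2],
        "Zip": ["12207", "10451", "10451"],
    }
)

# Create a filtered DataFrame from specific columns.
premise_cols = ["License Serial Number", "Premises Name", "County ID Code"]
premise_transformed = premise_df[premise_cols].copy()

# Rename the column headers to match the `premise` table.
premise_transformed = premise_transformed.rename(
    columns={
        "License Serial Number": "id",
        "Premises Name": "premise_name",
        "County ID Code": "county_id",
    }
)

# Drop duplicates on the unique license number, not on the name,
# because different premises can share a name.
premise_transformed = premise_transformed.drop_duplicates(subset="id")

# Set the index to the primary-key column.
premise_transformed = premise_transformed.set_index("id")
print(premise_transformed)
```

Deduplicating on `premise_name` instead would discard the second "Corner Deli", which is a distinct licensed premise in this made-up data.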


### Transform county DataFrame

In [None]:

# Rename the column headers

# Set index
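
The county transform could follow the same pattern. In this sketch the raw column names and rows are hypothetical, and the `id` is derived from the row position on the assumption that the file has no natural identifier of its own.

```python
import pandas as pd

# Hypothetical raw columns and rows; the real CSV headers may differ.
county_df = pd.DataFrame(
    {
        "County Name": ["Albany", "Bronx"],
        "License Count": [40, 55],
        "County ID Code": [1, 2],
    }
)

# Rename the column headers to match the `county` table.
county_transformed = county_df.rename(
    columns={
        "County Name": "county_name",
        "License Count": "license_count",
        "County ID Code": "county_id",
    }
)

# Assumption: no natural `id` exists, so the default row index is named `id`
# and will be written as the primary-key column on load.
county_transformed = county_transformed.rename_axis("id")
print(county_transformed)
```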


### Create database connection

In [None]:
# connection_string = "postgres:postgres@localhost:5432/customer_db"


In [None]:
# Confirm tables


### Load DataFrames into database
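
A hedged sketch of the load step, again using an in-memory SQLite connection in place of the Postgres engine so it can run anywhere. The table definition uses assumed column types and the rows are made up; with the real database, `con` would be the SQLAlchemy engine instead.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the Postgres database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE premise (id INT PRIMARY KEY, premise_name TEXT, county_id INT)"
)

# Made-up, already-transformed data with `id` as the index.
premise_transformed = pd.DataFrame(
    {"premise_name": ["Corner Deli", "Corner Deli"], "county_id": [1, 2]},
    index=pd.Index([101, 102], name="id"),
)

# index=True (the default) writes the named index as the `id` column,
# and 'append' adds rows to the table that already exists.
premise_transformed.to_sql(name="premise", con=conn, if_exists="append", index=True)

# Confirm the load by querying the table back.
loaded = pd.read_sql_query("SELECT * FROM premise ORDER BY id", con=conn)
print(loaded)
```

Because the index carries the name `id`, pandas uses it as the column label, which is why setting the index during the transform step matters.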

# ==========================================


### Rating Class Objectives

* Rate your understanding of each objective on a scale of 1 to 5.

In [None]:
objectives = [
    "A",
    "B",
    "C",
]
rating = []
total = 0
# Ask for a 1-5 rating per objective and keep a running total.
for objective in objectives:
    rate = input(objective + "? ")
    total += int(rate)
    rating.append(f"{objective}. ({rate}/5)")
print("=" * 96)
print("My rating today is:")
print("-" * 24)
for line in rating:
    print(line)
print("-" * 64)
# Average rating across all objectives.
print("Average: " + str(total / len(objectives)))