# Intro to ETL & ELT Data Pipelines (*and SQLAlchemy ORM*)

### ETL: Extract --> Transform --> Load
* Traditional data pipeline process
* Strict, predefined schema
* Data is typically loaded into a data warehouse or relational database after transformation
* Schema-on-write: the schema is enforced when the data is written
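
As a rough illustration of schema-on-write, the sketch below transforms records first and then loads them into a table whose structure is fixed in advance. It assumes SQLite's in-memory database as a stand-in for a warehouse; the `raw_rows` values and the `transform` helper are hypothetical and only serve the example.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [
    {"product_name": " Espresso Roast ", "price": "14.75"},
    {"product_name": "Ethiopia", "price": "21"},
]

def transform(row):
    """Clean one raw record so it fits the target schema."""
    return (row["product_name"].strip(), float(row["price"]))

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse

# Schema-on-write: structure and constraints are fixed before any data arrives.
conn.execute(
    "CREATE TABLE products ("
    " product_name TEXT NOT NULL,"
    " price REAL NOT NULL CHECK (price >= 0))"
)

# Transform first, then load only rows that fit the schema.
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [transform(r) for r in raw_rows],
)
loaded = conn.execute(
    "SELECT product_name, price FROM products ORDER BY price"
).fetchall()

# A row that violates the schema is rejected at write time.
try:
    conn.execute("INSERT INTO products VALUES (?, ?)", (None, 5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.close()
```

The point of the sketch is the ordering: cleaning happens before the insert, and the database refuses anything that does not match the declared columns and constraints.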

### ELT: Extract --> Load --> Transform
* Newer data pipeline process, particularly suited to big data
* Flexible schema that depends on the end user's specific needs
* Data is typically loaded into a data lake, then transformed by end users to fit their needs
* Schema-on-read: the schema is applied when the data is read
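
As a rough illustration of schema-on-read, the sketch below loads raw JSON text unchanged and only decides on fields and types when the data is queried. It assumes SQLite's in-memory database as a stand-in for a data lake; the `raw_events` payloads and the `read_totals` helper are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events with inconsistent fields and types.
raw_events = [
    '{"customer": "Kelly Key", "total": "18.00"}',
    '{"customer": "Clark Schroeder", "total": 14.75, "coupon": "WELCOME"}',
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a data lake

# Load first: a single untyped text column, no structure imposed on write.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in raw_events])

def read_totals(connection):
    """Schema-on-read: pick the fields and types that matter at query time."""
    totals = {}
    for (payload,) in connection.execute("SELECT payload FROM raw_events"):
        record = json.loads(payload)
        totals[record["customer"]] = float(record["total"])
    return totals

totals = read_totals(conn)
conn.close()
```

A different consumer could read the same `raw_events` rows and keep the `coupon` field instead, without any change to what was loaded.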

In [44]:
import os
from sqlalchemy import create_engine
import pandas as pd

*Get username, password, etc. from a separate file*

In [45]:
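secrets_dict = dict()

# SECRETS.txt sits one directory above the current working directory.
# os.path.join keeps the path portable across Windows, macOS, and Linux.
secrets_path = os.path.join(os.getcwd(), "..", "SECRETS.txt")

with open(secrets_path, "r") as f:
    secrets = f.readlines()

# Each line is expected to look like: PostgreSQL Username = my_user
for secret in secrets:
    secret_no_newline_char = secret.replace("\n", "")
    # Split on the first " = " only, so a value may itself contain " = "
    key_value_pair = secret_no_newline_char.split(" = ", 1)
    key = key_value_pair[0]
    value = key_value_pair[1]

    secrets_dict[key] = value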
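
The parsing loop above assumes every line is a well-formed `key = value` pair. A minimal sketch of a slightly more defensive version follows; `parse_secrets` and the example lines are hypothetical, and the choice to skip blank lines and raise on malformed ones is an assumption about how the file should be handled.

```python
def parse_secrets(lines):
    """Parse 'key = value' lines into a dict, skipping blank lines."""
    parsed = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. a trailing newline at end of file
        # partition splits on the first " = " only, so values may contain " = "
        key, separator, value = line.partition(" = ")
        if not separator:
            # Report the line number, not the content, to avoid echoing a secret
            raise ValueError(f"Line {line_number} is not in 'key = value' form")
        parsed[key] = value
    return parsed

# Hypothetical file contents, including a blank line and a value containing " = "
example_lines = [
    "PostgreSQL Username = example_user\n",
    "\n",
    "PostgreSQL Password = a = b\n",
]
example_secrets = parse_secrets(example_lines)
```

With a real file, the call would be `parse_secrets(f.readlines())` inside the `with open(...)` block.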

*Globals*

In [46]:
POSTGRES_USERNAME = secrets_dict["PostgreSQL Username"]
POSTGRES_PASSWORD = secrets_dict["PostgreSQL Password"]
POSTGRES_HOST = secrets_dict["PostgreSQL Host"]
POSTGRES_PORT = secrets_dict["PostgreSQL Port"]

*Get PostgreSQL database connection object*

In [47]:
def get_postgres_db_connection(username, password, host, port, db_name):
    """Return an open SQLAlchemy connection to db_name, or None if connecting fails."""
    connection_url = f'postgresql://{username}:{password}@{host}:{port}/{db_name}'

    try:
        engine = create_engine(connection_url) # Create an engine
        connection = engine.connect() # Connect to the database
        print("Connection successful:", connection)
    except Exception as e:
        print(f"Connection was unsuccessful.\nEXCEPTION: {e}")
        return None

    return connection
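
One caveat with building the URL by string formatting: a password containing characters such as `@`, `:` or `/` can make the URL ambiguous. A minimal sketch of one way to guard against that is to percent-encode the credentials with the standard library's `urllib.parse.quote_plus`. `build_postgres_url` and the example credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_postgres_url(username, password, host, port, db_name):
    """Build a PostgreSQL URL with the username and password percent-encoded."""
    return (
        f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )

# Hypothetical credentials; the password contains characters that are special in URLs
example_url = build_postgres_url(
    "example_user", "p@ss:word/1", "localhost", "5432", "Books"
)
```

The resulting string could be passed to `create_engine` in place of `connection_url`. SQLAlchemy 1.4 and later also provide `sqlalchemy.engine.URL.create`, which takes the parts as separate arguments and handles the escaping itself.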


## Extraction

In [48]:
books_db_name = "Books"
books_conn = get_postgres_db_connection(POSTGRES_USERNAME, POSTGRES_PASSWORD, POSTGRES_HOST, POSTGRES_PORT, books_db_name)
authors_df = pd.read_sql("SELECT * FROM myauthors", books_conn)

authors_df.head()

Connection successful: <sqlalchemy.engine.base.Connection object at 0x000001B5AFFB8E50>


| | author_id | first_name | middle_name | last_name |
|---|---|---|---|---|
| 0 | 2 | Linda | | Mul |
| 1 | 1 | Merrit | | Eric |
| 2 | 3 | Alecos | | Papadatos |
| 3 | 4 | Paul | C.van | Oorschot |
| 4 | 5 | David | | Cronin |


In [49]:
books_conn.close()

In [50]:
coffee_db_name = "COFFEE_Final_Project"
coffee_conn = get_postgres_db_connection(POSTGRES_USERNAME, POSTGRES_PASSWORD, POSTGRES_HOST, POSTGRES_PORT, coffee_db_name)
products_df = pd.read_sql("SELECT * FROM products", coffee_conn)

products_df.head()

Connection successful: <sqlalchemy.engine.base.Connection object at 0x000001B5AFFB8B50>


| | product_id | product_name | description | price | product_type_id |
|---|---|---|---|---|---|
| 0 | 1 | Brazilian - Organic | It's like Carnival in a cup. Clean and smooth. | 18.0 | 1 |
| 1 | 2 | Our Old Time Diner Blend | Our packed blend of beans that is reminiscent ... | 18.0 | 2 |
| 2 | 3 | Espresso Roast | Our house blend for a good espresso shot. | 14.75 | 3 |
| 3 | 4 | Primo Espresso Roast | Our premium single source of hand roasted beans. | 20.45 | 3 |
| 4 | 5 | Columbian Medium Roast | A smooth cup of coffee any time of day. | 15.0 | 4 |


In [51]:
expensive_products_df = pd.read_sql("SELECT * FROM products WHERE price > 20", coffee_conn)

expensive_products_df.head()

| | product_id | product_name | description | price | product_type_id |
|---|---|---|---|---|---|
| 0 | 4 | Primo Espresso Roast | Our premium single source of hand roasted beans. | 20.45 | 3 |
| 1 | 6 | Ethiopia | From the home of coffee. | 21.0 | 4 |
| 2 | 8 | Civet Cat | The most expensive coffee in the world, the ca... | 45.0 | 5 |
| 3 | 9 | Organic Decaf Blend | Our blend of hand picked organic beans that ha... | 28.0 | 6 |
| 4 | 80 | I Need My Bean! Toque | keep your head bean warm | 23.0 | 31 |


In [52]:
coffee_conn.close()

*Reusable, modular extract function*

In [53]:
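def extract_data_to_df(db_name, sql_query):
    """Run sql_query against db_name and return the result as a DataFrame, or None on failure."""
    conn = get_postgres_db_connection(POSTGRES_USERNAME, POSTGRES_PASSWORD, POSTGRES_HOST, POSTGRES_PORT, db_name)
    if conn is None:
        return None  # Connection failed; the error was already printed

    try:
        df = pd.read_sql(sql_query, conn)
    except Exception as e:
        print(f"Query failed.\nEXCEPTION: {e}")
        return None
    finally:
        conn.close()  # Always release the connection, even if the query fails

    return df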
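
The same extract pattern can also take query parameters, so that values such as a price threshold are bound by the driver rather than pasted into the SQL string. The sketch below assumes SQLite's in-memory database as a stand-in for PostgreSQL so that it runs without a server. `extract_rows` and `make_demo_connection` are hypothetical names, and the three demo rows reuse prices from the `products` output shown earlier.

```python
import sqlite3
from contextlib import closing

def make_demo_connection():
    """Create a small in-memory database (stand-in for the real PostgreSQL one)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("Espresso Roast", 14.75), ("Ethiopia", 21.0), ("Civet Cat", 45.0)],
    )
    return conn

def extract_rows(connection_factory, sql_query, params=()):
    """Run a query and return rows as dicts; the connection is always closed."""
    with closing(connection_factory()) as conn:
        conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
        cursor = conn.execute(sql_query, params)
        return [dict(row) for row in cursor.fetchall()]

# The threshold is passed as a bound parameter (?), not formatted into the string
expensive = extract_rows(
    make_demo_connection,
    "SELECT product_name, price FROM products WHERE price > ? ORDER BY price",
    (20,),
)
```

`contextlib.closing` plays the same role as a `try`/`finally` around `conn.close()`: the connection is released whether or not the query succeeds.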

In [57]:
customers_sql_query = "SELECT * FROM customers"
customers_df = extract_data_to_df(coffee_db_name, customers_sql_query)
customers_df.head()

Connection successful: <sqlalchemy.engine.base.Connection object at 0x000001B5AFE92C10>


| | customer_id | customer_name | email | reg_date | card_number | date_of_birth | gender |
|---|---|---|---|---|---|---|---|
| 0 | 0 | | | | | | |
| 1 | 3001 | Kelly Key | Venus@adipiscing.edu | 2017-01-04 | 908-424-2890 | 1950-05-29 | M |
| 2 | 3002 | Clark Schroeder | Nora@fames.gov | 2017-01-07 | 032-732-6308 | 1950-07-30 | M |
| 3 | 3003 | Elvis Cardenas | Brianna@tellus.edu | 2017-01-10 | 459-375-9187 | 1950-09-30 | M |
| 4 | 3004 | Rafael Estes | Ina@non.gov | 2017-01-13 | 576-640-9226 | 1950-12-01 | M |


## Transformation

## Loading