### **Step 1️⃣ - Import Dependencies**
This block imports the following libraries:
- `psycopg2`: connects to the PostgreSQL databases.
- `pandas`: handles data transformation.
- `dotenv`: loads database credentials from the `.env` file into environment variables.
- `os`: reads those environment variables so they can be passed to the database connection.

In [12]:
import psycopg2
import pandas as pd
from dotenv import load_dotenv
import os



### **Step 2️⃣ - Extract Data from PostgreSQL DB**

- Loads **database credentials** securely using `dotenv`.
- Defines the `fetch_data(query)` function to **connect to PostgreSQL and execute a SQL query**.
- Extracts **facility booking data** using an SQL `JOIN` statement.
- Prints the results to verify the extraction.

In [13]:
load_dotenv()

conn_params = {
    'host': os.getenv('DB_HOST'),
    'dbname': os.getenv('DB_NAME'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD')
}

def fetch_data(query):
    try:
        with psycopg2.connect(**conn_params) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT current_database();")
                db_name = cur.fetchone()[0]
                print("Connected to database:", db_name)
                cur.execute(query)
                data = cur.fetchall()
                colnames = [desc[0] for desc in cur.description]
        return colnames, data
    except psycopg2.Error as e:
        print("Error fetching data from source", e)

query = 'SELECT b.facid, f.name, b.slots FROM bookings b JOIN facilities f on b.facid = f.facid'

data = fetch_data(query)
print(data)

Connected to database: postgres
(['facid', 'name', 'slots'], [(3, 'Table Tennis', 2), (4, 'Massage Room 1', 2), (6, 'Squash Court', 2), (7, 'Snooker Table', 2), (8, 'Pool Table', 1), (8, 'Pool Table', 1), (0, 'Tennis Court 1', 3), (0, 'Tennis Court 1', 3), (4, 'Massage Room 1', 2), (4, 'Massage Room 1', 2), (4, 'Massage Room 1', 2), (6, 'Squash Court', 2), (6, 'Squash Court', 2), (6, 'Squash Court', 2), (7, 'Snooker Table', 2), (8, 'Pool Table', 1), (8, 'Pool Table', 1), (1, 'Tennis Court 2', 3), (2, 'Badminton Court', 3), (3, 'Table Tennis', 2), (3, 'Table Tennis', 2), (4, 'Massage Room 1', 2), (6, 'Squash Court', 2), (6, 'Squash Court', 2), (7, 'Snooker Table', 2), (8, 'Pool Table', 1), (0, 'Tennis Court 1', 3), (0, 'Tennis Court 1', 3), (0, 'Tennis Court 1', 3), (2, 'Badminton Court', 3), (3, 'Table Tennis', 2), (4, 'Massage Room 1', 2), (6, 'Squash Court', 2), (7, 'Snooker Table', 2), (7, 'Snooker Table', 2), (8, 'Pool Table', 1), (0, 'Tennis Court 1', 3), (0, 'Tennis Court 1', 3),
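
`fetch_data` returns a `(colnames, data)` tuple on success, but when `psycopg2.Error` is caught it prints the error and implicitly returns `None`, so any later code that indexes the result fails with a `TypeError`. The following is a minimal sketch of one way to guard against that. `result_to_dataframe` is a hypothetical helper that is not part of the notebook, and the sample rows are copied from the first rows of the output above.

```python
import pandas as pd


def result_to_dataframe(result):
    """Turn the (colnames, rows) tuple returned by fetch_data into a DataFrame.

    Hypothetical helper: raises a clear error if fetch_data returned None.
    """
    if result is None:
        raise ValueError("fetch_data returned no result; check the connection and the query")
    colnames, rows = result
    return pd.DataFrame(rows, columns=colnames)


# Sample copied from the first rows of the extraction output.
sample = (
    ['facid', 'name', 'slots'],
    [(3, 'Table Tennis', 2), (4, 'Massage Room 1', 2), (6, 'Squash Court', 2)],
)
sample_df = result_to_dataframe(sample)
print(sample_df)
```

Raising early keeps a failed extraction from surfacing later as a confusing error in the transform step.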

### **Step 3️⃣ - Transform the Extracted Data**

This block:
- Converts the extracted data into a dataframe using Pandas.
- Calculates the total booking duration (i.e. the total number of minutes each facility has been booked for), assuming that 1 slot = 1 hour (60 minutes).
- Groups the data by facility ID and facility name.
- Keeps only the relevant columns: `facility_id` and `total_booking_duration`.

In [14]:

df = pd.DataFrame(data[1], columns=['facility_id', 'facility_name', 'slots_reserved_per_booking'])
df['total_booking_duration'] = df['slots_reserved_per_booking'] * 60

grouped_df = df.groupby(['facility_id', 'facility_name']).sum()
aggredated_grouped_df = grouped_df.reset_index()

aggredated_grouped_df = aggredated_grouped_df[['facility_id', 'total_booking_duration']]


display(aggredated_grouped_df)


| | facility_id | total_booking_duration |
|---|---|---|
| 0 | 0 | 79200 |
| 1 | 1 | 76680 |
| 2 | 2 | 72540 |
| 3 | 3 | 49800 |
| 4 | 4 | 84240 |
| 5 | 5 | 13680 |
| 6 | 6 | 66240 |
| 7 | 7 | 54480 |
| 8 | 8 | 54660 |
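
To make the arithmetic concrete, here is a small worked sketch of the same transform on five rows copied from the extraction output. It keeps the notebook's assumption that 1 slot = 60 minutes. The `demo_` names are illustrative only, and `as_index=False` is used here as a shorthand for the `reset_index()` step.

```python
import pandas as pd

# Five rows copied from the extraction output: (facid, name, slots).
demo_rows = [
    (3, 'Table Tennis', 2),
    (4, 'Massage Room 1', 2),
    (8, 'Pool Table', 1),
    (8, 'Pool Table', 1),
    (0, 'Tennis Court 1', 3),
]
demo_df = pd.DataFrame(
    demo_rows,
    columns=['facility_id', 'facility_name', 'slots_reserved_per_booking'],
)

# Notebook assumption: 1 slot = 60 minutes.
demo_df['total_booking_duration'] = demo_df['slots_reserved_per_booking'] * 60

# Sum the minutes per facility, then keep only the two columns that get loaded.
demo_totals = (
    demo_df.groupby(['facility_id', 'facility_name'], as_index=False)['total_booking_duration']
    .sum()[['facility_id', 'total_booking_duration']]
)
print(demo_totals)
```

The two Pool Table bookings of one slot each add up to 120 minutes, while the single three-slot Tennis Court 1 booking gives 180 minutes.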


### **Step 4️⃣ - Load (Part 1): Set Destination Database and Table**

This block:
- Sets up connection parameters for the analytical PostgreSQL database (`etl_bites`).
- Defines a function, `execute_query_postgresql()`, to run SQL commands.
- Defines a SQL string that drops the `total_booking_duration` table if it exists and then recreates it, so each run starts from an empty table.
- Executes the SQL commands using the connection parameters and SQL query string.

In [16]:

analytical_database_conn_params = {
    'host': 'localhost',
    'dbname': 'etl_bites',
    'user': 'olikelly',
    'password': 'i_am_a_password'
}

conn_string = "dbname=etl_bites user=olikelly password=i_am_a_password host=localhost port='5432'"


def execute_query_postgresql(conn_string, query):
    try:
        # connect() must be qualified: only the psycopg2 module itself is imported.
        with psycopg2.connect(conn_string) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                conn.commit()
    except psycopg2.Error as e:
        print("Error executing Postgres query:", e)


# The table is dropped first, so it is always recreated empty on each run.
create_booking_duration_data_table = '''
DROP TABLE IF EXISTS total_booking_duration;
CREATE TABLE IF NOT EXISTS total_booking_duration (
facility_id INTEGER NOT NULL,
total_booking_duration INTEGER NOT NULL)
'''

execute_query_postgresql(conn_string, create_booking_duration_data_table)
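
The source database credentials in Step 2 come from environment variables, whereas the analytical database credentials in this step are written directly into the notebook. As a hedged sketch, the same environment-based pattern could be applied here. The function name `build_conn_params`, the `ANALYTICS_DB_` prefix and the variable names are all assumptions for illustration; they do not appear in the original notebook.

```python
import os


def build_conn_params(env=os.environ, prefix="ANALYTICS_DB_"):
    """Collect psycopg2 connection parameters from environment variables.

    The prefix and variable names are hypothetical; match them to your .env file.
    """
    suffixes = {"host": "HOST", "dbname": "NAME", "user": "USER", "password": "PASSWORD"}
    missing = [prefix + s for s in suffixes.values() if not env.get(prefix + s)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: env[prefix + s] for param, s in suffixes.items()}


# Explicit mapping with placeholder values, so the sketch runs without a .env file.
example_env = {
    "ANALYTICS_DB_HOST": "localhost",
    "ANALYTICS_DB_NAME": "etl_bites",
    "ANALYTICS_DB_USER": "example_user",
    "ANALYTICS_DB_PASSWORD": "example_password",
}
example_params = build_conn_params(example_env)
print(sorted(example_params))
```

The returned dictionary has the same keys as `analytical_database_conn_params`, so it could be unpacked into `psycopg2.connect(**params)` in the same way.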

### **Step 5️⃣ - Load (Part 2): Insert Transformed Data into PostgreSQL**

This block:
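- Defines the `insert_data()` function to insert the transformed data into the `total_booking_duration` table.
- Passes row values as query parameters (`%s` placeholders) to prevent SQL injection through the data. The table and column names are interpolated into the query string, so they must come from trusted code rather than user input.
- Commits the transaction so that the inserted rows are persisted in the database.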

In [17]:

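def insert_data(parameters, table_name, data, columns):
    # The statement is identical for every row, so build it once outside the loop.
    # Values are bound through %s placeholders; table and column names are interpolated.
    insert_query = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({', '.join(['%s'] * len(columns))});"
    try:
        with psycopg2.connect(**parameters) as conn:
            with conn.cursor() as cur:
                # Each DataFrame row is passed as the parameter tuple for one INSERT.
                for row in data.itertuples(index=False):
                    cur.execute(insert_query, row)
                    print("row inserted", row)
                conn.commit()
    except psycopg2.Error as e:
        print("Error inserting data into DB", e)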


insert_data(analytical_database_conn_params, 'total_booking_duration', aggredated_grouped_df, ['facility_id', 'total_booking_duration'])

row inserted Pandas(facility_id=0, total_booking_duration=79200)
row inserted Pandas(facility_id=1, total_booking_duration=76680)
row inserted Pandas(facility_id=2, total_booking_duration=72540)
row inserted Pandas(facility_id=3, total_booking_duration=49800)
row inserted Pandas(facility_id=4, total_booking_duration=84240)
row inserted Pandas(facility_id=5, total_booking_duration=13680)
row inserted Pandas(facility_id=6, total_booking_duration=66240)
row inserted Pandas(facility_id=7, total_booking_duration=54480)
row inserted Pandas(facility_id=8, total_booking_duration=54660)
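
Query parameters protect the inserted values, but `insert_data` still places the table and column names into the statement with an f-string. One possible safeguard, sketched below, is to check each identifier against a conservative pattern before building the query. `build_insert_query` is a hypothetical helper, and the pattern deliberately rejects anything other than plain unquoted names. psycopg2 also provides a `psycopg2.sql` module for composing identifiers, which is not shown here.

```python
import re

# Conservative pattern: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_insert_query(table_name, columns):
    """Build a parameterised INSERT statement after validating identifiers.

    Hypothetical helper: raises ValueError for any name that is not a plain identifier.
    """
    for name in [table_name, *columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe SQL identifier: {name!r}")
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders});"


demo_query = build_insert_query(
    'total_booking_duration', ['facility_id', 'total_booking_duration']
)
print(demo_query)
```

The resulting string can be passed to `cur.execute()` together with the row values, exactly as `insert_data` does.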


### **Step 6️⃣ - Identify the Top 5 Members by Booking Frequency**

This block:
- Defines a new SQL query string to fetch the most frequent bookers, using `JOIN`, `GROUP BY`, `ORDER BY` and `LIMIT`. `LIMIT 6` returns the `GUEST` placeholder plus the five most frequent named members.
- Runs the `fetch_data` function with the new SQL query string (output recorded in the comments below).


In [None]:

most_frequent_bookers = '''
SELECT m.firstname, m.surname, b.memid, COUNT(b.memid) AS total_bookings 
FROM members m 
JOIN bookings b 
ON b.memid = m.memid
GROUP BY m.firstname, m.surname, b.memid
ORDER BY total_bookings DESC
LIMIT 6;
'''

fetch_data(most_frequent_bookers)

# RESULT:
# (['firstname', 'surname', 'memid', 'total_bookings'],
#  [('GUEST', 'GUEST', 0, 883),
#   ('Tim', 'Rownam', 3, 408),
#   ('Darren', 'Smith', 1, 261),
#   ('Tracy', 'Smith', 2, 210),
#   ('Tim', 'Boothe', 8, 188),
#   ('Burton', 'Tracy', 6, 176)])


# ASSUMPTION: 'GUEST' doesn't represent a unique individual.
# Instead, it's a catch-all for anonymous or missing data.
# Including it in this ranking skews the results, making it seem
# like an actual person is responsible for the most bookings.
# Given this, it is best to exclude it. We could do this by adding a
# condition to the SQL query, e.g. 'WHERE m.memid != 0'
# (the column must be qualified because both tables have a memid column).


NameError: name 'fetch_data' is not defined
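
The exclusion of `GUEST` described in the comments can be sketched as a revised query. To keep the example runnable without the PostgreSQL source database, it uses an in-memory SQLite database from the standard library as a stand-in. The table schemas are simplified and the booking counts are made up for illustration; only the query shape is the point.

```python
import sqlite3

# In-memory SQLite stand-in for the source database (simplified, hypothetical data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE members (memid INTEGER PRIMARY KEY, firstname TEXT, surname TEXT);
CREATE TABLE bookings (bookid INTEGER PRIMARY KEY, memid INTEGER);
INSERT INTO members VALUES (0, 'GUEST', 'GUEST'), (1, 'Darren', 'Smith'), (3, 'Tim', 'Rownam');
INSERT INTO bookings (memid) VALUES (0), (0), (0), (0), (3), (3), (3), (1);
""")

# Same query as above, with GUEST (memid 0) filtered out and the limit reduced to 5.
most_frequent_members = '''
SELECT m.firstname, m.surname, b.memid, COUNT(b.memid) AS total_bookings
FROM members m
JOIN bookings b
ON b.memid = m.memid
WHERE m.memid != 0
GROUP BY m.firstname, m.surname, b.memid
ORDER BY total_bookings DESC
LIMIT 5;
'''

top_members = demo_conn.execute(most_frequent_members).fetchall()
demo_conn.close()
print(top_members)
```

With the filter in place, `LIMIT 5` is enough to return the top five named members, since the `GUEST` row no longer occupies the first position.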