## Step 1 — Extract

## What we do:
* Read the dataset (Online_Retail.csv) from disk into a pandas DataFrame.
* Remove rows missing essential values:
* InvoiceNo → needed to identify transactions.
* StockCode → product identification.
* Quantity and UnitPrice → required for sales calculations.
* InvoiceDate → needed for time-based analysis.
* Convert InvoiceDate to a proper datetime type so we can filter and group by time later.
* Remove any rows where the date could not be parsed.

## Why we do it:
* Ensures we are working only with valid, complete data before transformations.
* Makes sure the InvoiceDate column is in a format that allows filtering and aggregations.
* Avoids issues in later steps from missing or invalid values.

In [14]:
import pandas as pd
import sqlite3

# === STEP 1: EXTRACT ===
def extract(file_path):
    """
    Reads CSV into DataFrame and cleans data.
    """
    # Read CSV with encoding fix for £ symbol and other special chars
    df = pd.read_csv(file_path, encoding="ISO-8859-1")
    print(f"[Extract] Raw rows read: {len(df)}")

    # Drop rows with missing critical fields
    df = df.dropna(subset=["InvoiceNo", "StockCode", "Description", "Quantity", "InvoiceDate", "UnitPrice", "CustomerID", "Country"])

    # Convert InvoiceDate to datetime
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")
    df = df.dropna(subset=["InvoiceDate"])  # Remove bad date rows

    # Ensure numeric types
    df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
    df["UnitPrice"] = pd.to_numeric(df["UnitPrice"], errors="coerce")

    print(f"[Extract] Rows after cleaning: {len(df)}")
    return df


## Step 2 — Transform

## What we do:
* Remove invalid transactions:
* Negative or zero Quantity values.
* Zero or negative UnitPrice.
* Create a new column:
* TotalSales = Quantity * UnitPrice → This is the key sales measure.
* Filter transactions to the last year relative to 2025-08-12 (exam requirement).
* Create dimension-like tables:
* CustomerDim: unique CustomerID and Country.
* TimeDim: unique dates with TimeID, Month, Quarter, Year for time-based OLAP.
* Prepare fact table:
* SalesFact: contains CustomerID, TimeID, Quantity, and TotalSales.

## Why we do it:
* Removes bad data so our metrics are accurate.
* Adds new calculated metrics for reporting.
* Structures the data into star schema format to make OLAP queries easier in Task 3.
* Filters for recent transactions to keep analysis relevant and within the scope.

In [15]:
def transform(df):
    """
    Creates dimension and fact tables from cleaned DataFrame.
    """

    # Add TotalSales column
    df["TotalSales"] = df["Quantity"] * df["UnitPrice"]

    # Customer Dimension (ensure unique CustomerID)
    customer_dim = (
        df[["CustomerID", "Country"]]
        .drop_duplicates(subset=["CustomerID"])
        .reset_index(drop=True)
    )

    # Time Dimension (ensure unique TimeID)
    time_dim = df[["InvoiceDate"]].drop_duplicates().reset_index(drop=True)
    time_dim["TimeID"] = range(1, len(time_dim) + 1)
    time_dim["Date"] = time_dim["InvoiceDate"].dt.date
    time_dim["Month"] = time_dim["InvoiceDate"].dt.month
    time_dim["Quarter"] = time_dim["InvoiceDate"].dt.quarter
    time_dim["Year"] = time_dim["InvoiceDate"].dt.year
    time_dim = time_dim.drop(columns=["InvoiceDate"])

    # Map TimeID back to main dataframe
    df = df.merge(
        time_dim,
        left_on=df["InvoiceDate"].dt.date,
        right_on="Date",
        how="left"
    )

    # Sales Fact Table
    sales_fact = df[["CustomerID", "TimeID", "Quantity", "TotalSales"]].reset_index(drop=True)

    print(f"[Transform] CustomerDim: {len(customer_dim)} rows")
    print(f"[Transform] TimeDim: {len(time_dim)} rows")
    print(f"[Transform] SalesFact: {len(sales_fact)} rows")

    return customer_dim, time_dim, sales_fact


## Step 3 — Load

## What we do:
* Connect to a SQLite database (retail_dw.db).
* Create tables:
* CustomerDim
* TimeDim
* SalesFact
* Load the cleaned/transformed data into these tables.
* Enforce foreign key constraints to maintain referential integrity.

## Why we do it:
* Moves data into a data warehouse structure for analysis.
* Allows running SQL queries efficiently in later steps (Task 3).
* Ensures we follow proper relational database design.

In [16]:
def load(customer_dim, time_dim, sales_fact, db_name="retail_dw.db"):
    """
    Loads dimension and fact tables into SQLite database.
    Drops existing tables before inserting new data.
    """
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Drop old tables if exist
    cursor.executescript("""
    DROP TABLE IF EXISTS SalesFact;
    DROP TABLE IF EXISTS TimeDim;
    DROP TABLE IF EXISTS CustomerDim;
    """)

    # Create schema
    cursor.executescript("""
    CREATE TABLE CustomerDim (
        CustomerID INTEGER PRIMARY KEY,
        Country TEXT
    );
    CREATE TABLE TimeDim (
        TimeID INTEGER PRIMARY KEY,
        Date TEXT,
        Month INTEGER,
        Quarter INTEGER,
        Year INTEGER
    );
    CREATE TABLE SalesFact (
        SalesID INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerID INTEGER,
        TimeID INTEGER,
        Quantity INTEGER,
        TotalSales REAL,
        FOREIGN KEY (CustomerID) REFERENCES CustomerDim(CustomerID),
        FOREIGN KEY (TimeID) REFERENCES TimeDim(TimeID)
    );
    """)

    # Insert data
    customer_dim.to_sql("CustomerDim", conn, if_exists="append", index=False)
    time_dim.to_sql("TimeDim", conn, if_exists="append", index=False)
    sales_fact.to_sql("SalesFact", conn, if_exists="append", index=False)

    conn.commit()
    conn.close()
    print(f"[Load] Data loaded into {db_name}")



In [17]:
# === RUN THE FULL ETL ===
file_path = r"C:\Users\Salma\New folder\OneDrive\Desktop\DSA 2040_Practical_Exam\DSA-2040_Practical_Exam_Halima_315\Online_Retail.csv"
df_extracted = extract(file_path)
customer_dim, time_dim, sales_fact = transform(df_extracted)
load(customer_dim, time_dim, sales_fact)

[Extract] Raw rows read: 541909
[Extract] Rows after cleaning: 172782
[Transform] CustomerDim: 3125 rows
[Transform] TimeDim: 8774 rows
[Transform] SalesFact: 13746099 rows
[Load] Data loaded into retail_dw.db
