## Step 1 — Extract

## What we do:
* Read the dataset (Online Retail.csv, assumed to be the CSV inside online+retail.zip) from disk into a pandas DataFrame.
* Remove rows missing essential values:
    * InvoiceNo → needed to identify transactions.
    * StockCode → product identification.
    * Quantity and UnitPrice → required for sales calculations.
    * InvoiceDate → needed for time-based analysis.
* Convert InvoiceDate to a proper datetime type so we can filter and group by time later.
* Remove any rows where the date could not be parsed.
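
The cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the four inline rows are made-up stand-ins for the real CSV, and `raw_csv` and `essential` are hypothetical names introduced only for this example.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV; column names follow the text above.
raw_csv = io.StringIO(
    "InvoiceNo,StockCode,Quantity,UnitPrice,InvoiceDate,CustomerID,Country\n"
    "536365,85123A,6,2.55,2010-12-01 08:26:00,17850,United Kingdom\n"
    "536366,,2,1.85,2010-12-01 08:28:00,17850,United Kingdom\n"
    "536367,84879,32,1.69,not a date,13047,United Kingdom\n"
    "536368,22960,,4.25,2010-12-01 08:34:00,13047,United Kingdom\n"
)
df = pd.read_csv(raw_csv)

# Drop rows that lack any of the essential columns
essential = ["InvoiceNo", "StockCode", "Quantity", "UnitPrice", "InvoiceDate"]
df = df.dropna(subset=essential)

# errors="coerce" turns unparseable dates into NaT instead of raising
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")

# Remove the rows whose date could not be parsed
df = df.dropna(subset=["InvoiceDate"])

print(len(df), df["InvoiceDate"].dtype)
```

Coercing first and dropping afterwards keeps the two concerns separate: one bad date string does not abort the whole load, and the number of discarded rows can be checked before they are removed.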

## Why we do it:
* Ensures we are working only with valid, complete data before transformations.
* Makes sure the InvoiceDate column is in a format that allows filtering and aggregations.
* Avoids issues in later steps from missing or invalid values.

In [39]:
import pandas as pd
import zipfile

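# Path to the uploaded ZIP (adjust to where the archive actually lives;
# the path below did not exist when this cell was last run)
zip_path = "/mnt/data/online+retail.zip"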

# Inspect ZIP contents
with zipfile.ZipFile(zip_path, 'r') as z:
    print("Files in zip:", z.namelist())

# Assuming the CSV inside is named "Online Retail.csv"
csv_name = "Online Retail.csv"

# Read CSV into DataFrame
with zipfile.ZipFile(zip_path) as z:
    with z.open(csv_name) as f:
        df = pd.read_csv(f)

# Check initial rows and column info
print(df.head())
df.info()  # info() prints its summary itself and returns None


FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/online+retail.zip'

## Step 2 — Transform

## What we do:
* Remove invalid transactions:
    * Negative or zero Quantity values.
    * Zero or negative UnitPrice.
* Create a new column:
    * TotalSales = Quantity * UnitPrice → This is the key sales measure.
* Filter transactions to the last year relative to 2025-08-12 (exam requirement).
* Create dimension-like tables:
    * CustomerDim: unique CustomerID with its Country and the customer's summed TotalSales.
    * TimeDim: unique dates with TimeID, Month, Quarter, Year for time-based OLAP.
* Prepare fact table:
    * SalesFact: contains CustomerID, TimeID, Quantity, and TotalSales.
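
The row filters and the derived column described above might look like this. It is a sketch on made-up rows: `reference_date`, `window_start` and `recent` are hypothetical names, and treating the window as exclusive at the start and inclusive at the end is an assumption. If the file is the public UCI Online Retail dataset, whose invoices date from 2010 to 2011, a window ending on 2025-08-12 would select no rows, so the reference date may need adjusting.

```python
import pandas as pd

# Hypothetical sample rows; real data would come from the Extract step.
df = pd.DataFrame({
    "Quantity": [6, -1, 3, 2],
    "UnitPrice": [2.50, 4.00, 0.0, 10.00],
    "InvoiceDate": pd.to_datetime(
        ["2025-03-01", "2025-04-01", "2025-05-01", "2023-01-15"]
    ),
})

# Keep only rows with a positive quantity and a positive unit price
df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)].copy()

# Key sales measure
df["TotalSales"] = df["Quantity"] * df["UnitPrice"]

# One-year window ending at the reference date given in the text
reference_date = pd.Timestamp("2025-08-12")
window_start = reference_date - pd.DateOffset(years=1)
recent = df[
    (df["InvoiceDate"] > window_start) & (df["InvoiceDate"] <= reference_date)
]

print(recent)
```

Taking an explicit `.copy()` after the boolean filter avoids pandas' chained-assignment warning when `TotalSales` is added.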

## Why we do it:
* Removes bad data so our metrics are accurate.
* Adds new calculated metrics for reporting.
* Structures the data into star schema format to make OLAP queries easier in Task 3.
* Filters for recent transactions to keep analysis relevant and within the scope.

In [36]:
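# NOTE: df_sample is not defined in the cells shown here; it is assumed to be
# a sample of the cleaned DataFrame from Step 1 (the Load output reports 1000 fact rows).
# Add TotalSales
df_sample['TotalSales'] = df_sample['Quantity'] * df_sample['UnitPrice']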

# Customer Dimension
customer_dim = df_sample.groupby('CustomerID').agg({
    'Country': 'first',
    'TotalSales': 'sum'
}).reset_index()

# Time Dimension
time_dim = df_sample[['InvoiceDate']].drop_duplicates().reset_index(drop=True)
time_dim['TimeID'] = time_dim.index + 1
time_dim['Date'] = time_dim['InvoiceDate']
time_dim['Month'] = time_dim['InvoiceDate'].dt.month
time_dim['Quarter'] = time_dim['InvoiceDate'].dt.quarter
time_dim['Year'] = time_dim['InvoiceDate'].dt.year
time_dim = time_dim.drop(columns=['InvoiceDate'])

# Map TimeID to SalesFact
df_sample = df_sample.merge(time_dim[['Date','TimeID']], left_on='InvoiceDate', right_on='Date', how='left')
sales_fact = df_sample[['CustomerID','TimeID','Quantity','TotalSales']].copy()



## Step 3 — Load

## What we do:
* Connect to a SQLite database (retail_dw_sample.db).
* Create tables:
    * CustomerDim
    * TimeDim
    * SalesFact
* Load the cleaned/transformed data into these tables.
* Declare foreign key constraints to maintain referential integrity. SQLite enforces them only on connections where PRAGMA foreign_keys = ON has been set.
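
Because SQLite leaves foreign key enforcement off by default, it can help to confirm that the constraint is actually active. The sketch below uses an in-memory database and a trimmed-down version of the schema. The customer ID 99999 and the `rejected` flag are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Must be set per connection; enforcement is off by default in SQLite
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE CustomerDim (CustomerID INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE SalesFact (
    SalesID INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerID INTEGER,
    Quantity INTEGER,
    FOREIGN KEY (CustomerID) REFERENCES CustomerDim(CustomerID)
);
""")

conn.execute("INSERT INTO CustomerDim VALUES (12350, 'Norway')")
conn.execute("INSERT INTO SalesFact (CustomerID, Quantity) VALUES (12350, 6)")

# A fact row pointing at a customer that does not exist should be refused
try:
    conn.execute("INSERT INTO SalesFact (CustomerID, Quantity) VALUES (99999, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("Orphan row rejected:", rejected)
conn.close()
```

Without the PRAGMA line the second insert would succeed silently, leaving an orphan row in the fact table.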

## Why we do it:
* Moves data into a data warehouse structure for analysis.
* Allows running SQL queries efficiently in later steps (Task 3).
* Ensures we follow proper relational database design.

In [37]:
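# Load into SQLite
# -----------------------------
import sqlite3

db_name = "retail_dw_sample.db"
conn = sqlite3.connect(db_name)
# SQLite leaves foreign key enforcement off unless it is enabled per connection
conn.execute("PRAGMA foreign_keys = ON")
cursor = conn.cursor()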

# Drop tables if they exist
cursor.executescript("""
DROP TABLE IF EXISTS SalesFact;
DROP TABLE IF EXISTS TimeDim;
DROP TABLE IF EXISTS CustomerDim;
""")

# Create tables
cursor.executescript("""
CREATE TABLE CustomerDim (
    CustomerID INTEGER PRIMARY KEY,
    Country TEXT,
    TotalSales REAL
);

CREATE TABLE TimeDim (
    TimeID INTEGER PRIMARY KEY,
    Date TEXT,
    Month INTEGER,
    Quarter INTEGER,
    Year INTEGER
);

CREATE TABLE SalesFact (
    SalesID INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerID INTEGER,
    TimeID INTEGER,
    Quantity INTEGER,
    TotalSales REAL,
    FOREIGN KEY (CustomerID) REFERENCES CustomerDim(CustomerID),
    FOREIGN KEY (TimeID) REFERENCES TimeDim(TimeID)
);
""")

# Insert data
customer_dim.to_sql('CustomerDim', conn, if_exists='append', index=False)
time_dim.to_sql('TimeDim', conn, if_exists='append', index=False)
sales_fact.to_sql('SalesFact', conn, if_exists='append', index=False)

conn.commit()
conn.close()

print(f"[ETL] Completed: {db_name} created with:")
print("CustomerDim rows:", len(customer_dim))
print("TimeDim rows:", len(time_dim))
print("SalesFact rows:", len(sales_fact))


[ETL] Completed: retail_dw_sample.db created with:
CustomerDim rows: 744
TimeDim rows: 950
SalesFact rows: 1000


In [38]:
import sqlite3
conn = sqlite3.connect("retail_dw_sample.db")
cursor = conn.cursor()

cursor.execute("SELECT * FROM CustomerDim LIMIT 5")
print(cursor.fetchall())

cursor.execute("SELECT * FROM SalesFact LIMIT 5")
print(cursor.fetchall())

conn.close()


[(12350, 'Norway', 40.0), (12359, 'Cyprus', 7.800000000000001), (12370, 'Cyprus', 44.550000000000004), (12394, 'Belgium', 16.6), (12415, 'Australia', 282.90000000000003)]
[(1, 15034, 1, 6, 12.48), (2, 12528, 2, 12, 35.400000000000006), (3, 15111, 3, 16, 13.28), (4, 14156, 4, 2, 17.0), (5, 13802, 5, 200, 330.0)]
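
With the three tables loaded, the star schema supports roll-up queries that join the fact table to its dimensions. The example below is only an illustration of that pattern on a tiny in-memory database with made-up rows; the actual Task 3 queries are not shown in this section, and `rollup` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerDim (CustomerID INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE TimeDim (TimeID INTEGER PRIMARY KEY, Quarter INTEGER, Year INTEGER);
CREATE TABLE SalesFact (CustomerID INTEGER, TimeID INTEGER, TotalSales REAL);

INSERT INTO CustomerDim VALUES (1, 'Norway'), (2, 'Cyprus');
INSERT INTO TimeDim VALUES (1, 1, 2025), (2, 2, 2025);
INSERT INTO SalesFact VALUES (1, 1, 40.0), (1, 2, 10.0), (2, 1, 7.5), (2, 1, 2.5);
""")

# Roll up sales from individual fact rows to country, year and quarter
rollup = conn.execute("""
    SELECT c.Country, t.Year, t.Quarter, SUM(f.TotalSales) AS Sales
    FROM SalesFact AS f
    JOIN CustomerDim AS c ON c.CustomerID = f.CustomerID
    JOIN TimeDim AS t ON t.TimeID = f.TimeID
    GROUP BY c.Country, t.Year, t.Quarter
    ORDER BY c.Country, t.Year, t.Quarter
""").fetchall()
conn.close()

for row in rollup:
    print(row)
```

The same query shape should work against retail_dw_sample.db, provided the column names match the CREATE TABLE statements in the Load step.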
