# Task 2: ETL Process Implementation

This notebook implements an end‑to‑end ETL pipeline for the Online Retail dataset into a simple SQLite star schema consisting of a SalesFact table and two dimensions: CustomerDim and TimeDim. Each section is clearly labeled to align with rubric requirements.
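
To make the star-schema shape concrete, here is a minimal sketch built in an in-memory SQLite database. Only the three table names come from this notebook. Every column name, the sample rows, and the `build_demo_schema` helper are assumptions made for illustration, and the actual tables written by the pipeline may differ.

```python
import sqlite3

# Hypothetical DDL: table names are from the notebook, columns are assumed.
DDL = """
CREATE TABLE CustomerDim (
    CustomerKey INTEGER PRIMARY KEY,
    CustomerID  INTEGER,
    Country     TEXT
);
CREATE TABLE TimeDim (
    TimeKey  INTEGER PRIMARY KEY,
    FullDate TEXT,
    Month    INTEGER,
    Quarter  INTEGER,
    Year     INTEGER
);
CREATE TABLE SalesFact (
    SalesKey    INTEGER PRIMARY KEY,
    CustomerKey INTEGER REFERENCES CustomerDim(CustomerKey),
    TimeKey     INTEGER REFERENCES TimeDim(TimeKey),
    Quantity    INTEGER,
    TotalSales  REAL
);
"""


def build_demo_schema():
    """Create the assumed schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute("INSERT INTO CustomerDim VALUES (1, 17850, 'United Kingdom')")
    conn.execute("INSERT INTO TimeDim VALUES (1, '2010-12-01', 12, 4, 2010)")
    conn.executemany(
        "INSERT INTO SalesFact VALUES (?, ?, ?, ?, ?)",
        [(1, 1, 1, 6, 15.5), (2, 1, 1, 8, 27.0)],  # illustrative values only
    )
    conn.commit()
    return conn


conn = build_demo_schema()
# A typical star-schema query: aggregate the fact table, sliced by both dimensions.
rows = conn.execute(
    """
    SELECT c.Country, t.Quarter, SUM(f.TotalSales)
    FROM SalesFact AS f
    JOIN CustomerDim AS c ON c.CustomerKey = f.CustomerKey
    JOIN TimeDim AS t ON t.TimeKey = f.TimeKey
    GROUP BY c.Country, t.Quarter
    """
).fetchall()
print(rows)
conn.close()
```

The fact table holds only keys and measures, while descriptive attributes live in the dimensions, so analytical queries reduce to joins on surrogate keys followed by a `GROUP BY`.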

The pipeline reads only the provided Online Retail Excel file (`raw_data/Online Retail.xlsx`). Synthetic data generation has been removed for clarity and reproducibility.

Outline of pipeline stages: imports, parameters, Excel load, profiling, cleaning, transformations, dimension builds, fact preparation, SQLite load, orchestration function, logging and validation, and sanity queries.

# Task 2: ETL Process (Modular Version)

This notebook calls functions from `utils/etl.py` for all core logic, so the notebook itself stays minimal and modular.

In [2]:
# 1. Imports and Configuration
import sys
from pathlib import Path

# Project root detection for notebooks (Jupyter does not define __file__).
# Strategy:
# 1) Start at the current working directory.
# 2) Walk up, checking at most 10 directories, until one contains a marker
#    entry such as 'utils', 'raw_data', or '.git'.
# 3) If no marker is found, fall back to the starting directory.

CWD = Path.cwd().resolve()
markers = ["utils", "raw_data", ".git", "data_mining_notebook", "data_warehouse_notebook"]

def find_project_root(start: Path) -> Path:
    p = start
    for _ in range(10):  # bound the upward search to 10 directories
        if any((p / m).exists() for m in markers):
            return p
        if p.parent == p:
            break
        p = p.parent
    return start

ROOT = find_project_root(CWD)
sys.path.insert(0, str(ROOT))

import pandas as pd
import logging
from utils.etl import run_etl

logging.basicConfig(level=logging.INFO, format='[%(levelname)s] %(message)s')

# Parameters
FIXED_CURRENT_DATE = pd.Timestamp('2025-08-12')
DATA_DIR = (ROOT / 'raw_data').resolve()
EXCEL_PATH = DATA_DIR / 'Online Retail.xlsx'
DB_PATH = (ROOT / 'data_warehouse_notebook' / 'retail_dw.db').resolve()

print('Project ROOT:', ROOT)
print('Excel:', EXCEL_PATH)
print('DB:', DB_PATH)

Project ROOT: K:\Code Projects\DSA2040_Practical_Exam_Justice_444
Excel: K:\Code Projects\DSA2040_Practical_Exam_Justice_444\raw_data\Online Retail.xlsx
DB: K:\Code Projects\DSA2040_Practical_Exam_Justice_444\data_warehouse_notebook\retail_dw.db


In [3]:
# 2. Run ETL
counts = run_etl(EXCEL_PATH, DB_PATH, FIXED_CURRENT_DATE)
print('Row Count Summary:', counts)
# Fail loudly if any of the dimension counts came back empty.
assert all(counts[k] > 0 for k in ('customers', 'products', 'dates')), \
    'One or more dimension counts are zero'

[INFO] ETL complete: {'raw': 541909, 'cleaned': 530104, 'last_year': 509238, 'customers': 4273, 'products': 4112, 'dates': 297, 'fact': 384512}


Row Count Summary: {'raw': 541909, 'cleaned': 530104, 'last_year': 509238, 'customers': 4273, 'products': 4112, 'dates': 297, 'fact': 384512}
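
Beyond checking that the dimension counts are non-zero, the summary can be tested for internal consistency. The sketch below is not part of `utils/etl.py`. The helper name `check_count_invariants` is hypothetical, and it assumes that each stage (`raw`, `cleaned`, `last_year`, `fact`) only filters rows from the one before it, which the logged numbers are consistent with but the notebook does not state explicitly.

```python
def check_count_invariants(counts):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Assumed stage order: each stage can only drop rows from the previous one.
    stages = ["raw", "cleaned", "last_year", "fact"]
    for earlier, later in zip(stages, stages[1:]):
        if counts[later] > counts[earlier]:
            problems.append(
                f"{later} ({counts[later]}) exceeds {earlier} ({counts[earlier]})"
            )

    # The same keys the notebook's assertion inspects.
    for key in ("customers", "products", "dates"):
        if counts[key] <= 0:
            problems.append(f"{key} count is {counts[key]}")

    return problems


# Values copied from the Row Count Summary printed above.
logged_counts = {
    "raw": 541909, "cleaned": 530104, "last_year": 509238,
    "customers": 4273, "products": 4112, "dates": 297, "fact": 384512,
}
print(check_count_invariants(logged_counts))  # prints []
```

Returning a list rather than raising on the first failure makes it possible to log every violated expectation in one pass.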

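The counts returned by `run_etl` can also be cross-checked against the database file itself. The following is a minimal sketch using only the standard library. The helper name `table_row_counts` is hypothetical, and it discovers table names from `sqlite_master` rather than assuming which tables the ETL created.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    db_path = Path(db_path)
    if not db_path.exists():
        # sqlite3.connect would silently create an empty file, so check first.
        raise FileNotFoundError(f"No database at {db_path}")

    with closing(sqlite3.connect(str(db_path))) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
            if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
        ]
        # Names come from the database catalog itself, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

In this notebook it could be called as `table_row_counts(DB_PATH)` after the ETL cell. If the load writes every prepared fact row, the fact table's count should match `counts['fact']`.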