## Step 1 — Extract

## What we do:
* Read the dataset (Online Retail.xlsx, stored inside the downloaded ZIP archive) from disk into a pandas DataFrame.
* Remove rows missing essential values:
  * InvoiceNo → needed to identify transactions.
  * StockCode → needed to identify products.
  * Quantity and UnitPrice → required for sales calculations.
  * InvoiceDate → needed for time-based analysis.
* Convert InvoiceDate to a proper datetime type so we can filter and group by time later.
* Remove any rows where the date could not be parsed.
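
The cleaning steps listed above can be sketched as follows. This is a minimal illustration on a small hand-made frame, not the notebook's own cell: the helper name `clean_extract`, the `REQUIRED` list, and the sample rows are hypothetical and exist only for this example.

```python
import pandas as pd

# Columns the list above treats as essential
REQUIRED = ["InvoiceNo", "StockCode", "Quantity", "UnitPrice", "InvoiceDate"]

def clean_extract(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows missing essential values and parse InvoiceDate (hypothetical helper)."""
    out = df.dropna(subset=REQUIRED).copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], errors="coerce")
    # Remove rows whose date could not be parsed
    return out.dropna(subset=["InvoiceDate"]).reset_index(drop=True)

# Made-up rows for illustration only
raw = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", None, "A4"],
    "StockCode": ["S1", "S2", "S3", "S4"],
    "Quantity": [2, 1, 5, 3],
    "UnitPrice": [1.5, 2.0, 0.5, 4.0],
    "InvoiceDate": [
        "2011-06-01 12:05:00",
        "not a date",
        "2011-05-27 17:14:00",
        "2011-04-21 17:05:00",
    ],
})

cleaned = clean_extract(raw)
print(cleaned)
```

Here the row with a missing InvoiceNo and the row with an unparseable date are both dropped, leaving two rows with a datetime InvoiceDate column.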

## Why we do it:
* Ensures we are working only with valid, complete data before transformations.
* Makes sure the InvoiceDate column is in a format that allows filtering and aggregations.
* Avoids issues in later steps from missing or invalid values.

In [42]:
import zipfile
import pandas as pd

# Correct path to your ZIP file
zip_path = r"C:\Users\Salma\Downloads\online+retail.zip"

# Inspect the ZIP contents and read the Excel file inside it
with zipfile.ZipFile(zip_path) as z:
    print("Files in zip:", z.namelist())
    excel_name = z.namelist()[0]  # first member of the archive: the Excel workbook
    with z.open(excel_name) as f:
        df = pd.read_excel(f)

# Take a random sample of 1000 rows
df_sample = df.sample(n=1000, random_state=42)

print(df_sample.head())
print(df_sample.info())


Files in zip: ['Online Retail.xlsx']
       InvoiceNo StockCode                       Description  Quantity  \
209268    555200     71459    HANGING JAM JAR T-LIGHT HOLDER        24   
207108    554974     21128                GOLD FISHING GNOME         4   
167085    550972     21086       SET/6 RED SPOTTY PAPER CUPS         4   
471836    576652     22812  PACK 3 BOXES CHRISTMAS PANETTONE         3   
115865    546157     22180                    RETROSPOT LAMP         2   

               InvoiceDate  UnitPrice  CustomerID         Country  
209268 2011-06-01 12:05:00       0.85     17315.0  United Kingdom  
207108 2011-05-27 17:14:00       6.95     14031.0  United Kingdom  
167085 2011-04-21 17:05:00       0.65     14031.0  United Kingdom  
471836 2011-11-16 10:39:00       1.95     17198.0  United Kingdom  
115865 2011-03-10 08:40:00       9.95     13502.0  United Kingdom  
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 209268 to 352115
Data columns (total 8 columns):
 #

## Step 2 — Transform

## What we do:
* Remove invalid transactions:
  * Negative or zero Quantity values.
  * Zero or negative UnitPrice.
* Create a new column:
  * TotalSales = Quantity * UnitPrice → This is the key sales measure.
* Filter transactions to the last year relative to 2025-08-12 (exam requirement).
* Create dimension-like tables:
  * CustomerDim: unique CustomerID and Country.
  * TimeDim: unique dates with TimeID, Month, Quarter, Year for time-based OLAP.
* Prepare fact table:
  * SalesFact: contains CustomerID, TimeID, Quantity, and TotalSales.
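
The filtering and TotalSales steps in this list might look like the sketch below. It assumes InvoiceDate is already a datetime column; the helper name `transform`, its `reference_date` parameter, and the sample rows are hypothetical. Whether the one-year window keeps any rows depends on the dates in the data actually loaded.

```python
import pandas as pd

def transform(df: pd.DataFrame, reference_date: str = "2025-08-12") -> pd.DataFrame:
    """Filter invalid rows, add TotalSales, keep the year before reference_date
    (hypothetical helper, not the notebook's own code)."""
    # Keep only strictly positive quantities and prices
    out = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)].copy()
    out["TotalSales"] = out["Quantity"] * out["UnitPrice"]
    # One-year window ending at the reference date (end inclusive)
    end = pd.Timestamp(reference_date)
    start = end - pd.DateOffset(years=1)
    in_window = (out["InvoiceDate"] > start) & (out["InvoiceDate"] <= end)
    return out[in_window].reset_index(drop=True)

# Made-up rows for illustration only
rows = pd.DataFrame({
    "Quantity": [2, -1, 3, 4],
    "UnitPrice": [1.5, 2.0, 0.0, 2.5],
    "InvoiceDate": pd.to_datetime(
        ["2025-01-10", "2025-02-01", "2025-03-01", "2023-05-05"]
    ),
})

result = transform(rows)
print(result)
```

Only the first row survives: the second has a negative quantity, the third a zero price, and the fourth falls outside the one-year window.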

## Why we do it:
* Removes bad data so our metrics are accurate.
* Adds new calculated metrics for reporting.
* Structures the data into star schema format to make OLAP queries easier in Task 3.
* Filters for recent transactions to keep analysis relevant and within the scope.

In [43]:
# Add TotalSales
df_sample['TotalSales'] = df_sample['Quantity'] * df_sample['UnitPrice']

# Customer Dimension
customer_dim = df_sample.groupby('CustomerID').agg({
    'Country': 'first',
    'TotalSales': 'sum'
}).reset_index()

# Time Dimension
time_dim = df_sample[['InvoiceDate']].drop_duplicates().reset_index(drop=True)
time_dim['TimeID'] = time_dim.index + 1
time_dim['Date'] = time_dim['InvoiceDate']
time_dim['Month'] = time_dim['InvoiceDate'].dt.month
time_dim['Quarter'] = time_dim['InvoiceDate'].dt.quarter
time_dim['Year'] = time_dim['InvoiceDate'].dt.year
time_dim = time_dim.drop(columns=['InvoiceDate'])

# Map TimeID to SalesFact
df_sample = df_sample.merge(time_dim[['Date','TimeID']], left_on='InvoiceDate', right_on='Date', how='left')
sales_fact = df_sample[['CustomerID','TimeID','Quantity','TotalSales']].copy()



## Step 3 — Load

## What we do:
* Connect to a SQLite database (retail_dw_sample.db in the cell below).
* Create tables:
  * CustomerDim
  * TimeDim
  * SalesFact
* Load the cleaned/transformed data into these tables.
* Declare foreign key constraints to maintain referential integrity (SQLite checks them only when PRAGMA foreign_keys = ON is set on the connection).
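
A small sketch of how foreign key checking behaves in SQLite: the constraint is declared in the schema, but SQLite checks it only after `PRAGMA foreign_keys = ON` has been issued on the connection. The example uses an in-memory database and a trimmed-down version of the tables; the orphan CustomerID 99999 is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is enabled per connection
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE CustomerDim (CustomerID INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE SalesFact (
    SalesID INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerID INTEGER,
    Quantity INTEGER,
    FOREIGN KEY (CustomerID) REFERENCES CustomerDim(CustomerID)
);
""")
conn.execute("INSERT INTO CustomerDim VALUES (17315, 'United Kingdom')")
# This fact row has a matching parent row in CustomerDim
conn.execute("INSERT INTO SalesFact (CustomerID, Quantity) VALUES (17315, 24)")

try:
    # CustomerID 99999 has no row in CustomerDim, so the insert is rejected
    conn.execute("INSERT INTO SalesFact (CustomerID, Quantity) VALUES (99999, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM SalesFact").fetchone()[0]
print("orphan rejected:", rejected, "| fact rows:", fact_rows)
conn.close()
```

Without the PRAGMA line, the second insert would succeed and leave an orphan row in SalesFact.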

## Why we do it:
* Moves data into a data warehouse structure for analysis.
* Allows running SQL queries efficiently in later steps (Task 3).
* Ensures we follow proper relational database design.
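
As an illustration of the kind of query this structure supports, here is a hedged sketch of a roll-up of TotalSales by year and quarter. It builds a throwaway in-memory database with the same TimeDim and SalesFact columns (foreign keys omitted for brevity) and a handful of illustrative rows; the query itself is an assumption about what a later OLAP step might look like, not the notebook's Task 3 code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TimeDim (
    TimeID INTEGER PRIMARY KEY, Date TEXT,
    Month INTEGER, Quarter INTEGER, Year INTEGER
);
CREATE TABLE SalesFact (
    SalesID INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerID INTEGER, TimeID INTEGER,
    Quantity INTEGER, TotalSales REAL
);
""")

# Illustrative rows only
conn.executemany("INSERT INTO TimeDim VALUES (?, ?, ?, ?, ?)", [
    (1, "2011-06-01 12:05:00", 6, 2, 2011),
    (2, "2011-05-27 17:14:00", 5, 2, 2011),
    (3, "2011-04-21 17:05:00", 4, 2, 2011),
    (4, "2011-11-16 10:39:00", 11, 4, 2011),
    (5, "2011-03-10 08:40:00", 3, 1, 2011),
])
conn.executemany(
    "INSERT INTO SalesFact (CustomerID, TimeID, Quantity, TotalSales) VALUES (?, ?, ?, ?)",
    [
        (17315, 1, 24, 20.4),
        (14031, 2, 4, 27.8),
        (14031, 3, 4, 2.6),
        (17198, 4, 3, 5.85),
        (13502, 5, 2, 19.9),
    ],
)

# Roll up the fact table to (Year, Quarter) through the time dimension
rollup = conn.execute("""
SELECT t.Year, t.Quarter, SUM(f.TotalSales) AS Sales
FROM SalesFact AS f
JOIN TimeDim AS t ON t.TimeID = f.TimeID
GROUP BY t.Year, t.Quarter
ORDER BY t.Year, t.Quarter
""").fetchall()

for year, quarter, sales in rollup:
    print(year, f"Q{quarter}", round(sales, 2))
conn.close()
```

Joining through TimeDim keeps the date logic in one place: the fact table stores only TimeID, and the grouping level (month, quarter, year) is chosen in the query.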

In [44]:
import sqlite3

# Load into SQLite
# -----------------------------
db_name = "retail_dw_sample.db"
conn = sqlite3.connect(db_name)
cursor = conn.cursor()

# Drop tables if they exist
cursor.executescript("""
DROP TABLE IF EXISTS SalesFact;
DROP TABLE IF EXISTS TimeDim;
DROP TABLE IF EXISTS CustomerDim;
""")

# Create tables
cursor.executescript("""
CREATE TABLE CustomerDim (
    CustomerID INTEGER PRIMARY KEY,
    Country TEXT,
    TotalSales REAL
);

CREATE TABLE TimeDim (
    TimeID INTEGER PRIMARY KEY,
    Date TEXT,
    Month INTEGER,
    Quarter INTEGER,
    Year INTEGER
);

CREATE TABLE SalesFact (
    SalesID INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerID INTEGER,
    TimeID INTEGER,
    Quantity INTEGER,
    TotalSales REAL,
    FOREIGN KEY (CustomerID) REFERENCES CustomerDim(CustomerID),
    FOREIGN KEY (TimeID) REFERENCES TimeDim(TimeID)
);
""")

# Insert data
customer_dim.to_sql('CustomerDim', conn, if_exists='append', index=False)
time_dim.to_sql('TimeDim', conn, if_exists='append', index=False)
sales_fact.to_sql('SalesFact', conn, if_exists='append', index=False)

conn.commit()
conn.close()

print(f"[ETL] Completed: {db_name} created with:")
print("CustomerDim rows:", len(customer_dim))
print("TimeDim rows:", len(time_dim))
print("SalesFact rows:", len(sales_fact))


[ETL] Completed: retail_dw_sample.db created with:
CustomerDim rows: 559
TimeDim rows: 908
SalesFact rows: 1000


In [45]:
import sqlite3
conn = sqlite3.connect("retail_dw_sample.db")
cursor = conn.cursor()

cursor.execute("SELECT * FROM CustomerDim LIMIT 5")
print(cursor.fetchall())

cursor.execute("SELECT * FROM SalesFact LIMIT 5")
print(cursor.fetchall())

conn.close()


[(12353, 'Bahrain', 39.8), (12354, 'Spain', 8.5), (12360, 'Austria', 33.9), (12391, 'Cyprus', 35.400000000000006), (12397, 'Belgium', 16.6)]
[(1, 17315, 1, 24, 20.4), (2, 14031, 2, 4, 27.8), (3, 14031, 3, 4, 2.6), (4, 17198, 4, 3, 5.85), (5, 13502, 5, 2, 19.9)]
