# Day 3 In-Class Exercise: Mini-Pipeline

**Time:** 20 minutes (the five parts below are budgeted at 3 + 5 + 5 + 4 + 3 minutes)
**Grading:** Completion-based (5% of course grade)
**Goal:** Build a 3-stage pipeline with validations

---

## Your Mission

Build a data pipeline using the mini Olist dataset:
1. **Bronze:** Load raw CSV files
2. **Silver:** Clean and validate (write 2 assertions)
3. **Gold:** Create 2-3 summary metrics
4. **Document:** Write a risk note (assumptions & limitations)

**Dataset:** `../../data/day3/exercise/`
- `mini_orders.csv` (500 orders)
- `mini_customers.csv` (500 customers)
- `mini_order_items.csv` (~540 items)

---

## Submission

- Complete all TODOs
- Ensure the notebook runs end-to-end (Kernel → Restart & Run All)
- Submit on Moodle by end of class

**Let's build!** 🚀

---

In [None]:
# Setup (PROVIDED - don't modify)
import pandas as pd
import duckdb
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

con = duckdb.connect(':memory:')
print("✅ Setup complete")

---

## Part 1: Bronze Layer (3 min)

**TODO:** Load the three CSV files into bronze tables.

**Requirements:**
- Load exactly as-is (no transformations)
- Name tables: `bronze_orders`, `bronze_customers`, `bronze_items`
- Verify that row counts match the source files (500 orders, 500 customers, ~540 items)

In [None]:
# TODO: Load bronze layer

# Load orders


# Load customers


# Load order items


# Verify (print row counts)
print("Bronze layer loaded:")
# TODO: Print counts

---

## Part 2: Silver Layer - Cleaning (5 min)

**TODO:** Create clean, typed silver tables.

**Requirements:**
- Fix date types (use TRY_CAST for timestamps)
- Cast price/freight to DOUBLE
- Remove rows with NULL in critical fields (order_id, customer_id)
- Name tables: `silver_orders`, `silver_customers`, `silver_items`

In [None]:
# TODO: Create silver tables

# Clean orders
con.execute("""
    CREATE OR REPLACE TABLE silver_orders AS
    SELECT
        -- TODO: Select and cast columns
        -- Hint: TRY_CAST(order_purchase_timestamp AS TIMESTAMP)
    FROM bronze_orders
    WHERE order_id IS NOT NULL
""")

# Clean customers
con.execute("""
    CREATE OR REPLACE TABLE silver_customers AS
    SELECT
        -- TODO: Select relevant columns
    FROM bronze_customers
    WHERE customer_id IS NOT NULL
""")

# Clean order items
con.execute("""
    CREATE OR REPLACE TABLE silver_items AS
    SELECT
        -- TODO: Select and cast columns
        -- Hint: CAST(price AS DOUBLE)
    FROM bronze_items
    WHERE order_id IS NOT NULL
""")

print("Silver layer created")

---

## Part 3: Silver Layer - Validation (5 min)

**TODO:** Write 2 assertions to validate data quality.

**Requirements:**
- Assertion 1: Check primary key uniqueness (order_id in silver_orders)
- Assertion 2: Check foreign key integrity (all items have valid orders)

**Remember:** Assertions should raise errors if validation fails!

In [None]:
# TODO: Write validations

print("=== VALIDATIONS ===\n")

# Validation 1: Primary key uniqueness
# TODO: Check that order_id is unique in silver_orders
# Hint: Compare COUNT(*) to COUNT(DISTINCT order_id)


# Validation 2: Foreign key integrity
# TODO: Check that all items in silver_items have matching orders
# Hint: LEFT JOIN and check for NULLs


print("✅ All validations passed!")

---

## Part 4: Gold Layer (4 min)

**TODO:** Create 2-3 summary metrics for reporting.

**Requirements:**
- At least 2 aggregated tables
- Use JOINs and GROUP BY
- Make them business-relevant

**Suggestions:**
- Daily sales summary
- Customer summary (orders per customer)
- State-level metrics

In [None]:
# TODO: Create gold tables

# Gold table 1: Your choice
con.execute("""
    CREATE OR REPLACE TABLE gold_summary_1 AS
    SELECT
        -- TODO: Write aggregation query
    FROM silver_orders o
    INNER JOIN silver_items i ON o.order_id = i.order_id
    -- TODO: Add GROUP BY
""")

# Gold table 2: Your choice
con.execute("""
    CREATE OR REPLACE TABLE gold_summary_2 AS
    SELECT
        -- TODO: Write aggregation query
    FROM silver_customers c
    INNER JOIN silver_orders o ON c.customer_id = o.customer_id
    -- TODO: Add GROUP BY
""")

# Display results
print("Gold table 1:")
display(con.execute("SELECT * FROM gold_summary_1 LIMIT 5").df())

print("\nGold table 2:")
display(con.execute("SELECT * FROM gold_summary_2 LIMIT 5").df())

---

## Part 5: Risk Note (3 min)

**TODO:** Write 3-5 sentences documenting:
1. What assumptions did you make about the data?
2. What limitations does this analysis have?
3. What questions can this data NOT answer?

**Example:**
> "This analysis assumes all orders in the dataset are valid completed transactions. We removed rows with NULL order_ids (23 rows, 4.6% of data), which means our metrics undercount total orders. This data cannot answer questions about cancelled orders or refunds, as those are not included in the dataset. We also assume prices are in Brazilian Real (BRL) and have not been adjusted for inflation."
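
A figure like "23 rows, 4.6% of data" is easy to compute rather than estimate. The helper below is a hypothetical sketch (the name `drop_summary` is not part of the exercise); it reuses the example's illustrative numbers, and in your notebook the two counts would come from `SELECT COUNT(*)` on a bronze table and its silver counterpart.

```python
def drop_summary(rows_before: int, rows_after: int) -> str:
    """Describe how many rows a cleaning step removed, as a count and a share."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    dropped = rows_before - rows_after
    share = 100 * dropped / rows_before
    return f"{dropped} rows removed ({share:.1f}% of {rows_before})"


# Illustrative counts matching the example risk note above.
summary = drop_summary(500, 477)
print(summary)  # 23 rows removed (4.6% of 500)
```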

**Your risk note:**

### Risk Note

TODO: Write your 3-5 sentence risk note here.

(Double-click to edit this cell)

---

## ✅ Submission Checklist

Before submitting, verify:

- [ ] All TODO sections completed
- [ ] Notebook runs end-to-end (Kernel → Restart & Run All)
- [ ] All assertions pass (no errors)
- [ ] Gold tables display results
- [ ] Risk note written (3-5 sentences)
- [ ] File named: `day3_exercise_[your_name].ipynb`

**Submit on Moodle by end of class!**

---

## Done Early?

**Challenge:** Add a 3rd validation checking that all prices are positive:
```python
# "Positive" means strictly greater than zero, so zero prices also fail.
non_positive_prices = con.execute(
    "SELECT COUNT(*) FROM silver_items WHERE price <= 0"
).fetchone()[0]
assert non_positive_prices == 0, f"Found {non_positive_prices} non-positive prices!"
```

**Great work!** 🎉