# Week 2 Day 1: Data Workflow Foundations

**Goal:** Set up a clean project scaffold and produce your first processed Parquet output.

## Learning Objectives

By the end of today, you can:
- Explain **raw vs cache vs processed** and why it matters
- Scaffold a repo with a standard **data project layout**
- Read CSV with **explicit dtypes** (avoid silent inference)
- Write **Parquet** outputs to `data/processed/`
- Implement a small **schema enforcement** step (`enforce_schema`)

---

## How to Complete Exercises

1. Look for `# TODO:` comments - these mark where you need to write code
2. Replace `...` placeholders with your solution
3. Run the test cell after each exercise to verify your answer
4. Use `Ctrl+F` to search for `TODO` to find all incomplete exercises

---

## Policy: GenAI usage

- ✅ Allowed: **clarifying questions** (definitions, error explanations)
- ❌ Not allowed: generating code, writing solutions, or debugging by copy-paste
- If unsure: ask the instructor first

> **In this course:** you build skill by typing, running, breaking, and fixing.


## Setup

Run this cell first to import required libraries:

In [None]:
# Standard imports for data work
from dataclasses import dataclass
from pathlib import Path
import pandas as pd
import json
from datetime import datetime, timezone
from io import StringIO

# Check versions
print(f"pandas version: {pd.__version__}")

# Try importing pyarrow (needed for Parquet)
try:
    import pyarrow
    print(f"pyarrow version: {pyarrow.__version__}")
except ImportError:
    print("WARNING: pyarrow not installed. Run: pip install pyarrow")

print("\n✅ Setup complete!")


---

# Session 1: Offline-First Mindset + Project Layout

## Concept: raw vs cache vs processed

Offline-first projects separate data by **role**:

| Folder | Purpose | Rule |
|--------|---------|------|
| `data/raw/` | Original snapshots | **Never edit** |
| `data/cache/` | Downloaded/API responses | **Safe to delete** |
| `data/processed/` | Clean, typed outputs | **Safe to re-create** |
| `data/external/` | Reference data (lookup tables) | **Manually downloaded** |

**Why this matters:**
- You will re-run your ETL many times (debugging, adding rules, fixing bugs)
- If your pipeline depends on the internet, you lose time
- Separating data by role prevents accidents
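
The cache role can be sketched as a "fetch once, then read from disk" helper. This is a minimal illustration, not part of the course code: `load_or_fetch`, `fake_fetch`, and the file name `products_page_1.json` are hypothetical, and `fake_fetch` stands in for a real network call.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], dict]) -> dict:
    """Return the cached JSON if it exists; otherwise call fetch() and cache the result."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    payload = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(payload), encoding="utf-8")
    return payload


calls = []


def fake_fetch() -> dict:
    # Hypothetical stand-in for an API request; records that it was called
    calls.append(1)
    return {"products": ["p1", "p2"]}


with tempfile.TemporaryDirectory() as tmpdir:
    cache_file = Path(tmpdir) / "data" / "cache" / "products_page_1.json"
    first = load_or_fetch(cache_file, fake_fetch)   # calls fake_fetch and writes the cache
    second = load_or_fetch(cache_file, fake_fetch)  # served from disk, no fetch

print(f"fetch calls: {len(calls)}")  # fetch calls: 1
```

Deleting the cache file only costs one extra fetch, which is why `data/cache/` is safe to delete.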


---

## Exercise 1: Classify These Files

Put each file into the correct folder. Options: `"raw"`, `"cache"`, `"processed"`, `"external"`

1. `orders.csv` you received from a teammate
2. `users_api_page_1.json` downloaded from an endpoint
3. `orders_clean.parquet` generated by your ETL
4. `country_codes.xlsx` you manually downloaded as reference data


In [None]:
# ============================================================================
# Exercise 1: Classify files into folders
# ============================================================================

file_classifications = {
    "orders.csv": ...,              # TODO: Replace ... with "raw", "cache", "processed", or "external"
    "users_api_page_1.json": ...,   # TODO: your code here
    "orders_clean.parquet": ...,    # TODO: your code here
    "country_codes.xlsx": ...,      # TODO: your code here
}

# Print your answers
for filename, folder in file_classifications.items():
    print(f"{filename} → data/{folder}/")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
file_classifications = {
    "orders.csv": "raw",
    "users_api_page_1.json": "cache",
    "orders_clean.parquet": "processed",
    "country_codes.xlsx": "external",
}
```

**Explanation:**
- `orders.csv` is original data from a teammate → **raw** (immutable)
- `users_api_page_1.json` is downloaded from API → **cache** (can re-download)
- `orders_clean.parquet` is your output → **processed** (can recreate)
- `country_codes.xlsx` is reference data you downloaded → **external**

</details>


In [None]:
# Test Exercise 1
def test_file_classifications():
    expected = {
        "orders.csv": "raw",
        "users_api_page_1.json": "cache",
        "orders_clean.parquet": "processed",
        "country_codes.xlsx": "external",
    }
    for filename, expected_folder in expected.items():
        actual = file_classifications.get(filename)
        assert actual == expected_folder, f"{filename}: expected '{expected_folder}', got '{actual}'"
    print("✅ Exercise 1 passed: All file classifications correct!")

test_file_classifications()


---

## Concept: Idempotent Outputs

**Idempotent:** given the same inputs and config, re-running the pipeline produces the same outputs.

**Good pattern:** Overwrite `data/processed/orders.parquet` every run

**Bad pattern:** Append to `data/processed/orders.csv` every run

```python
# BAD (append) - duplicates accumulate!
df.to_csv("data/processed/orders.csv", mode="a", index=False)

# GOOD (overwrite) - same result every run
df.to_parquet("data/processed/orders.parquet", index=False)
```
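
To see the difference in numbers, the sketch below simulates three pipeline runs against temporary files. It uses CSV for both patterns so that only the write mode differs; the DataFrame and file names are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"order_id": ["A0001", "A0002"], "amount": [12.5, 8.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    append_path = Path(tmpdir) / "append.csv"
    overwrite_path = Path(tmpdir) / "overwrite.csv"

    for _ in range(3):  # simulate three pipeline runs with identical input
        # Append: write the header only when the file does not exist yet
        df.to_csv(append_path, mode="a", index=False, header=not append_path.exists())
        # Overwrite: the default mode="w" replaces the file
        df.to_csv(overwrite_path, index=False)

    rows_append = len(pd.read_csv(append_path))
    rows_overwrite = len(pd.read_csv(overwrite_path))

print(f"append: {rows_append} rows, overwrite: {rows_overwrite} rows")
# append: 6 rows, overwrite: 2 rows
```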


---

## Exercise 2: Idempotent or Not?

For each code snippet, determine if it's idempotent (safe to re-run multiple times):
- `True` = idempotent (running twice gives same result)
- `False` = NOT idempotent (running twice changes the result)

```python
# A) df.to_csv("output.csv", mode="a")
# B) df.to_parquet("output.parquet", index=False)
# C) df.to_csv("output.csv", mode="w", index=False)
# D) with open("log.txt", "a") as f: f.write(str(datetime.now()))
```


In [None]:
# ============================================================================
# Exercise 2: Identify idempotent operations
# ============================================================================

is_idempotent = {
    "A": ...,  # TODO: True or False? df.to_csv("output.csv", mode="a")
    "B": ...,  # TODO: True or False? df.to_parquet("output.parquet", index=False)
    "C": ...,  # TODO: True or False? df.to_csv("output.csv", mode="w", index=False)
    "D": ...,  # TODO: True or False? with open("log.txt", "a") as f: f.write(...)
}

for key, value in is_idempotent.items():
    if value is ...:
        print(f"Option {key}: ❓ (not answered yet)")
    else:
        status = "idempotent ✅" if value else "NOT idempotent ❌"
        print(f"Option {key}: {status}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
is_idempotent = {
    "A": False,  # mode="a" appends, so duplicates accumulate
    "B": True,   # Parquet overwrites by default
    "C": True,   # mode="w" overwrites the file
    "D": False,  # Appending timestamps changes the file each run
}
```

**Key insight:** Anything with `mode="a"` (append) is NOT idempotent because it adds data each run.

</details>


In [None]:
# Test Exercise 2
def test_idempotency():
    expected = {"A": False, "B": True, "C": True, "D": False}
    for key, expected_val in expected.items():
        actual = is_idempotent.get(key)
        assert actual == expected_val, f"Option {key}: expected {expected_val}, got {actual}"
    print("✅ Exercise 2 passed: Idempotency understanding verified!")

test_idempotency()


---

## Exercise 3: Quick Check

**Question:** What is the most common symptom of a non-idempotent pipeline?

Fill in the answer below:


In [None]:
# ============================================================================
# Exercise 3: Identify the symptom of non-idempotent pipelines
# ============================================================================

# TODO: Choose the correct answer: "a", "b", "c", or "d"
#
# a) "Files get deleted on every run"
# b) "Row counts grow every run, even when inputs did not change"
# c) "The pipeline runs faster each time"
# d) "Column names change randomly"

non_idempotent_symptom = ...  # TODO: your code here (e.g., "a", "b", "c", or "d")

print(f"Your answer: {non_idempotent_symptom}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
non_idempotent_symptom = "b"
```

**Explanation:** When you append instead of overwrite, row counts grow every run because you're adding duplicates.

</details>


In [None]:
# Test Exercise 3
def test_non_idempotent_symptom():
    assert non_idempotent_symptom == "b", f"Expected 'b', got '{non_idempotent_symptom}'"
    print("✅ Exercise 3 passed: Correct! Row counts growing is the telltale sign.")

test_non_idempotent_symptom()


---

# Session 2: Centralized Paths with `pathlib`

## The Problem

Hardcoding paths like `"../data/raw/orders.csv"` breaks when:
- You move files
- You run from a different working directory
- Someone uses Windows paths (backslashes)

## The Solution

Use `pathlib.Path` + a central `config.py`:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path
```
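
The `frozen=True` argument makes instances read-only, so a path cannot be reassigned by accident later in a script. A minimal sketch of that behaviour, using a hypothetical two-field `DemoPaths` class rather than the full `Paths`:

```python
from dataclasses import FrozenInstanceError, dataclass
from pathlib import Path


@dataclass(frozen=True)
class DemoPaths:  # hypothetical, smaller stand-in for Paths
    root: Path
    raw: Path


demo = DemoPaths(root=Path("/my/project"), raw=Path("/my/project/data/raw"))

try:
    demo.raw = Path("/somewhere/else")  # assignment on a frozen dataclass raises
    reassigned = True
except FrozenInstanceError:
    reassigned = False

print(f"reassigned: {reassigned}")  # reassigned: False
```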


---

## Exercise 4: Implement the `make_paths` Function

Complete the `make_paths` function that creates a `Paths` instance with the standard folder structure.


In [None]:
# ============================================================================
# Exercise 4: Implement make_paths function
# ============================================================================

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    """Create a Paths instance with standard folder structure.

    Args:
        root: The project root directory

    Returns:
        Paths instance with all data subdirectories
    """
    data = root / "data"

    return Paths(
        root=...,       # TODO: your code here - should be the root parameter
        raw=...,        # TODO: your code here - should be data / "raw"
        cache=...,      # TODO: your code here - should be data / "cache"
        processed=...,  # TODO: your code here - should be data / "processed"
        external=...,   # TODO: your code here - should be data / "external"
    )

# Test it
test_root = Path("/my/project")
paths = make_paths(test_root)
print(f"Root: {paths.root}")
print(f"Raw: {paths.raw}")
print(f"Processed: {paths.processed}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )
```

**Key insight:** Building paths from `Path` objects with the `/` operator handles OS separator differences automatically.

</details>
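
To check the OS-independence claim without switching machines, `pathlib` offers "pure" path classes that apply one platform's rules anywhere. The sketch below is illustrative only; the directory and file names are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same join expression, rendered with each platform's separator
posix = PurePosixPath("/home/student/demo") / "data" / "raw" / "sample.csv"
windows = PureWindowsPath("C:/Users/student/demo") / "data" / "raw" / "sample.csv"

print(posix)    # /home/student/demo/data/raw/sample.csv
print(windows)  # C:\Users\student\demo\data\raw\sample.csv

# Path objects also expose the parts you would otherwise slice out of a string
print(posix.name, posix.suffix, posix.parent.name)  # sample.csv .csv raw
```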


In [None]:
# Test Exercise 4
def test_make_paths():
    test_root = Path("/test/project")
    p = make_paths(test_root)

    assert p.root == test_root, f"root should be {test_root}, got {p.root}"
    assert p.raw == test_root / "data" / "raw", f"raw path incorrect: {p.raw}"
    assert p.cache == test_root / "data" / "cache", f"cache path incorrect: {p.cache}"
    assert p.processed == test_root / "data" / "processed", f"processed path incorrect: {p.processed}"
    assert p.external == test_root / "data" / "external", f"external path incorrect: {p.external}"

    print("✅ Exercise 4 passed: Paths dataclass implemented correctly!")

test_make_paths()


---

## Exercise 5: Path Operations

Use `pathlib.Path` to construct file paths. The `/` operator joins path components.


In [None]:
# ============================================================================
# Exercise 5: Path operations
# ============================================================================

# Given this project root:
project_root = Path("/home/student/week2-data-work")

# TODO: Create these paths using the / operator
# Hint: Use project_root / "folder" / "subfolder" / "file.ext"

raw_orders_path = ...       # TODO: your code here - should be project_root/data/raw/orders.csv
processed_orders_path = ... # TODO: your code here - should be project_root/data/processed/orders.parquet
cache_api_path = ...        # TODO: your code here - should be project_root/data/cache/api_response.json

print(f"Raw orders: {raw_orders_path}")
print(f"Processed orders: {processed_orders_path}")
print(f"Cache API: {cache_api_path}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
raw_orders_path = project_root / "data" / "raw" / "orders.csv"
processed_orders_path = project_root / "data" / "processed" / "orders.parquet"
cache_api_path = project_root / "data" / "cache" / "api_response.json"
```

</details>


In [None]:
# Test Exercise 5
def test_path_operations():
    assert raw_orders_path == project_root / "data" / "raw" / "orders.csv", f"Wrong raw path: {raw_orders_path}"
    assert processed_orders_path == project_root / "data" / "processed" / "orders.parquet", f"Wrong processed path: {processed_orders_path}"
    assert cache_api_path == project_root / "data" / "cache" / "api_response.json", f"Wrong cache path: {cache_api_path}"
    print("✅ Exercise 5 passed: Path operations correct!")

test_path_operations()


---

# Session 3: pandas I/O + Schema Basics

## The Leading-Zero Bug

pandas dtype inference can silently corrupt meaning:

```python
# ID "00123" becomes 123 (leading zeros lost forever!)
df = pd.read_csv("orders.csv")  # BAD: no dtype specified
```

## The Fix

```python
# Always specify dtypes for IDs
df = pd.read_csv("orders.csv", dtype={"user_id": "string", "order_id": "string"})
```

## Nullable dtypes

When columns have missing values, use pandas nullable dtypes:
- `"string"` instead of `object`
- `"Int64"` instead of `int64` (capital I - allows NA, so integers are not silently converted to floats)
- `"Float64"` instead of `float64` (capital F - marks missing values as `<NA>` instead of `NaN`)
- `"boolean"` instead of `bool`
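
A small sketch of why the capital-letter dtypes matter: with default inference, a single missing value turns an integer column into floats. The CSV text here is made up for illustration.

```python
from io import StringIO

import pandas as pd

csv_text = "item,quantity\na,1\nb,\nc,3\n"

inferred = pd.read_csv(StringIO(csv_text))
nullable = pd.read_csv(StringIO(csv_text), dtype={"quantity": "Int64"})

print(inferred["quantity"].dtype)  # float64 (1 became 1.0 to make room for NaN)
print(nullable["quantity"].dtype)  # Int64   (integers kept, missing shown as <NA>)
```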


---

## Sample Data

Run this cell to create sample data for the exercises:


In [None]:
# Create sample raw data (simulates what you'd have in data/raw/)
# Note: This data has intentional issues for you to handle!

orders_data = '''order_id,user_id,amount,quantity,created_at,status
A0001,0001,12.50,1,2025-12-01T10:05:00Z,Paid
A0002,0002,8.00,2,2025-12-01T11:10:00Z,paid
A0003,0003,not_a_number,1,2025-12-02T09:00:00Z,Refund
A0004,0001,25.00,,2025-12-03T14:30:00Z,PAID
A0005,0004,100.00,1,not_a_date,paid'''

users_data = '''user_id,country,signup_date
0001,SA,2025-11-15
0002,SA,2025-11-20
0003,AE,2025-11-22
0004,SA,2025-11-25'''

print("📊 Sample data created!")
print("\nOrders data (notice the issues):")
print(orders_data)
print("\n⚠️ Issues to handle:")
print("  - Row 3 (index 2): amount is 'not_a_number' (invalid)")
print("  - Row 4 (index 3): quantity is empty (missing)")
print("  - Row 5 (index 4): created_at is 'not_a_date' (invalid)")
print("  - user_id has leading zeros (0001, 0002, etc.)")


---

## Exercise 6: See the Leading-Zero Bug

First, let's see what happens when you DON'T specify dtypes:


In [None]:
# ============================================================================
# Exercise 6: Observe the leading-zero bug
# ============================================================================

# Read WITHOUT explicit dtypes (this is the BUG!)
orders_bad = pd.read_csv(StringIO(orders_data))

print("Dtypes (notice user_id is int64, not string!):")
print(orders_bad.dtypes)
print()

# TODO: Look at the user_id column - what happened to the leading zeros?
# Fill in what you observe:

user_id_first_row = orders_bad.iloc[0]["user_id"]
print(f"First user_id value: {user_id_first_row}")
print(f"Type: {type(user_id_first_row)}")

# TODO: What should this value be? What is it actually?
expected_user_id = ...  # TODO: your code here - what SHOULD the first user_id be? (as a string)
actual_user_id = ...    # TODO: your code here - what IS the first user_id? (convert to string)

print(f"\nExpected: '{expected_user_id}'")
print(f"Actual: '{actual_user_id}'")
print(f"Leading zeros preserved? {expected_user_id == actual_user_id}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
expected_user_id = "0001"
actual_user_id = str(orders_bad.iloc[0]["user_id"])  # "1" - leading zeros lost!
```

**The bug:** pandas inferred `user_id` as `int64`, so `"0001"` became `1`.

</details>


In [None]:
# Test Exercise 6
def test_leading_zero_bug():
    assert expected_user_id == "0001", f"Expected should be '0001', got '{expected_user_id}'"
    assert actual_user_id == "1", f"Actual should be '1' (the bug!), got '{actual_user_id}'"
    print("✅ Exercise 6 passed: You've identified the leading-zero bug!")

test_leading_zero_bug()


---

## Exercise 7: Read CSV with Explicit dtypes

Now fix the bug by specifying dtypes for ID columns:


In [None]:
# ============================================================================
# Exercise 7: Read CSV with explicit dtypes
# ============================================================================

# Define missing value markers
NA_VALUES = ["", "NA", "N/A", "null", "None"]

# TODO: Read orders with explicit dtypes to preserve leading zeros
orders_raw = pd.read_csv(
    StringIO(orders_data),
    dtype={
        "order_id": ...,  # TODO: your code here - what dtype preserves "A0001"?
        "user_id": ...,   # TODO: your code here - what dtype preserves "0001"?
    },
    na_values=NA_VALUES,
    keep_default_na=True,
)

print("Dtypes (should be string for IDs):")
print(orders_raw.dtypes)
print()
print("First few rows:")
print(orders_raw.head())
print()
print(f"user_id for row 0: '{orders_raw.iloc[0]['user_id']}' (should be '0001')")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
orders_raw = pd.read_csv(
    StringIO(orders_data),
    dtype={
        "order_id": "string",
        "user_id": "string",
    },
    na_values=NA_VALUES,
    keep_default_na=True,
)
```

**Key insight:** Always use `"string"` dtype for ID columns to preserve leading zeros.

</details>


In [None]:
# Test Exercise 7
def test_csv_dtypes():
    assert orders_raw["order_id"].dtype == "string", f"order_id should be string, got {orders_raw['order_id'].dtype}"
    assert orders_raw["user_id"].dtype == "string", f"user_id should be string, got {orders_raw['user_id'].dtype}"
    assert orders_raw.iloc[0]["user_id"] == "0001", f"Leading zeros not preserved! Got '{orders_raw.iloc[0]['user_id']}'"
    print("✅ Exercise 7 passed: CSV reading with proper dtypes verified!")

test_csv_dtypes()


---

## Exercise 8: Handle Invalid Numeric Values

The `amount` column has a value `"not_a_number"` that can't be converted to a number.

Use `pd.to_numeric()` with the `errors` parameter to handle this:
- `errors="raise"` - raise an exception (default)
- `errors="coerce"` - invalid values become NaN
- `errors="ignore"` - return input unchanged (deprecated since pandas 2.2; avoid in new code)


In [None]:
# ============================================================================
# Exercise 8: Convert amount column with error handling
# ============================================================================

# Try converting without error handling (this will fail!)
try:
    bad_amount = pd.to_numeric(orders_raw["amount"])
    print("Conversion succeeded (unexpected!)")
except ValueError as e:
    print(f"❌ Error (expected): {e}")

print()

# TODO: Convert amount to numeric, coercing invalid values to NaN
amount_clean = pd.to_numeric(
    orders_raw["amount"],
    errors=...  # TODO: your code here - what error handling mode converts invalid to NaN?
)

print("Cleaned amount values:")
print(amount_clean)
print()
print(f"Row 2 (was 'not_a_number'): {amount_clean.iloc[2]} (should be NaN)")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
amount_clean = pd.to_numeric(orders_raw["amount"], errors="coerce")
```

**`errors="coerce"`** means: "If you can't convert it, make it NaN instead of crashing."

</details>
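
Coercion is quiet by design, so it helps to count what was coerced. One possible approach, sketched here on a made-up Series, is to flag values that were present before conversion but missing afterwards:

```python
import pandas as pd

raw = pd.Series(["12.50", "8.00", "not_a_number", None], name="amount")
converted = pd.to_numeric(raw, errors="coerce")

# Coerced = had a value before conversion, has none after.
# Values that were already missing (None) are not counted as coerced.
coerced_mask = raw.notna() & converted.isna()

print(f"coerced values: {raw[coerced_mask].tolist()}")  # coerced values: ['not_a_number']
```

Reporting this count alongside the output makes bad input visible instead of letting it vanish into NaN.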


In [None]:
# Test Exercise 8
def test_to_numeric_coerce():
    assert pd.isna(amount_clean.iloc[2]), "Row 2 should be NaN (was 'not_a_number')"
    assert amount_clean.iloc[0] == 12.50, f"Valid value should be preserved, got {amount_clean.iloc[0]}"
    print("✅ Exercise 8 passed: pd.to_numeric with coerce works!")

test_to_numeric_coerce()


---

## Exercise 9: Implement `enforce_schema`

Create a function that enforces correct dtypes on the orders DataFrame:
- `order_id` and `user_id` → string
- `amount` → Float64 (nullable, coerces invalid values to NA)
- `quantity` → Int64 (nullable, coerces invalid values to NA)


In [None]:
# ============================================================================
# Exercise 9: Implement enforce_schema function
# ============================================================================

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce correct dtypes on orders DataFrame.

    - Converts IDs to string
    - Converts amount to Float64 (invalid values become NA)
    - Converts quantity to Int64 (invalid values become NA)

    Args:
        df: Raw orders DataFrame

    Returns:
        DataFrame with enforced dtypes
    """
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        # TODO: Convert amount to Float64, coercing errors
        amount=pd.to_numeric(df["amount"], errors=...).astype(...),  # TODO: your code here
        # TODO: Convert quantity to Int64, coercing errors
        quantity=pd.to_numeric(df["quantity"], errors=...).astype(...),  # TODO: your code here
    )

# Apply it
orders_clean = enforce_schema(orders_raw)
print("Cleaned dtypes:")
print(orders_clean.dtypes)
print()
print("Cleaned data:")
print(orders_clean)


<details>
<summary>🔍 Click to reveal solution</summary>

```python
def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
    )
```

**Key points:**
- `errors="coerce"` converts invalid values to NaN; the nullable `astype` step then stores them as `<NA>`
- `"Float64"` and `"Int64"` (capital letters) are nullable dtypes that support NA values

</details>
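
The sample data also contains an invalid `created_at` value, which `enforce_schema` as specified here does not touch. As a hedged sketch only (not part of the exercise), the same coerce pattern exists for timestamps via `pd.to_datetime`; the two-value Series below is made up to mirror the sample rows.

```python
import pandas as pd

created_at = pd.Series(["2025-12-01T10:05:00Z", "not_a_date"])

# errors="coerce" turns unparseable values into NaT; utc=True yields timezone-aware values
parsed = pd.to_datetime(created_at, errors="coerce", utc=True)

print(parsed)
print(f"unparseable timestamps: {int(parsed.isna().sum())}")  # unparseable timestamps: 1
```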


In [None]:
# Test Exercise 9
def test_enforce_schema():
    # Check dtypes
    assert orders_clean["amount"].dtype == "Float64", f"amount should be Float64, got {orders_clean['amount'].dtype}"
    assert orders_clean["quantity"].dtype == "Int64", f"quantity should be Int64, got {orders_clean['quantity'].dtype}"

    # Check that "not_a_number" became NA
    assert pd.isna(orders_clean.iloc[2]["amount"]), "Row 2 amount (was 'not_a_number') should be NA"

    # Check that empty quantity became NA
    assert pd.isna(orders_clean.iloc[3]["quantity"]), "Row 3 quantity (was empty) should be NA"

    # Check valid values preserved
    assert orders_clean.iloc[0]["amount"] == 12.5, f"Valid amounts should be preserved, got {orders_clean.iloc[0]['amount']}"
    assert orders_clean.iloc[0]["quantity"] == 1, f"Valid quantities should be preserved, got {orders_clean.iloc[0]['quantity']}"

    print("✅ Exercise 9 passed: Schema enforcement works correctly!")

test_enforce_schema()


---

## Exercise 10: Write Parquet Output

Implement a `write_parquet` function that:
1. Creates parent directories if they don't exist
2. Writes the DataFrame as Parquet (idempotent - overwrites existing file)


In [None]:
# ============================================================================
# Exercise 10: Implement write_parquet function
# ============================================================================

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    """Write DataFrame to Parquet format.

    Creates parent directories if needed. Overwrites existing file (idempotent).

    Args:
        df: DataFrame to write
        path: Output path for Parquet file
    """
    # TODO: Step 1 - Create parent directories if they don't exist
    # Hint: use path.parent.mkdir() with parents=? and exist_ok=?
    path.parent.mkdir(parents=..., exist_ok=...)  # TODO: your code here

    # TODO: Step 2 - Write to parquet without the index
    # Hint: use df.to_parquet(path, index=?)
    df.to_parquet(path, index=...)  # TODO: your code here

# Test it with a temporary directory
import tempfile
with tempfile.TemporaryDirectory() as tmpdir:
    test_path = Path(tmpdir) / "subdir" / "nested" / "orders.parquet"
    write_parquet(orders_clean, test_path)
    print(f"Wrote to: {test_path}")
    print(f"File exists: {test_path.exists()}")

    # Read it back
    df_back = pd.read_parquet(test_path)
    print(f"Rows read back: {len(df_back)}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)
```

**Key points:**
- `parents=True`: Creates all intermediate directories (e.g., subdir/nested/)
- `exist_ok=True`: Doesn't error if directory already exists
- `index=False`: Don't write pandas index as a column

</details>
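
The two `mkdir` flags can be checked in isolation. This sketch uses a temporary directory and made-up folder names `a/b`:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    nested = Path(tmpdir) / "a" / "b"

    try:
        nested.mkdir()  # fails: the parent folder "a" does not exist yet
        plain_mkdir_ok = True
    except FileNotFoundError:
        plain_mkdir_ok = False

    nested.mkdir(parents=True, exist_ok=True)  # creates a/ and a/b/
    nested.mkdir(parents=True, exist_ok=True)  # second call is a no-op, no error
    created = nested.is_dir()

print(f"plain mkdir worked: {plain_mkdir_ok}, folder exists: {created}")
# plain mkdir worked: False, folder exists: True
```

Because the second call does nothing, the directory step is itself idempotent.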


In [None]:
# Test Exercise 10
def test_write_parquet():
    import tempfile

    with tempfile.TemporaryDirectory() as tmpdir:
        # Test with deeply nested path
        test_path = Path(tmpdir) / "a" / "b" / "c" / "test.parquet"
        write_parquet(orders_clean, test_path)

        assert test_path.exists(), "Parquet file should exist"

        # Read back and verify
        df_back = pd.read_parquet(test_path)
        assert len(df_back) == len(orders_clean), f"Row count should match: {len(df_back)} vs {len(orders_clean)}"
        assert list(df_back.columns) == list(orders_clean.columns), "Columns should match"

    print("✅ Exercise 10 passed: write_parquet function works correctly!")

test_write_parquet()


---

## Exercise 11: Implement `read_parquet`

Create a simple wrapper function to read Parquet files:


In [None]:
# ============================================================================
# Exercise 11: Implement read_parquet function
# ============================================================================

def read_parquet(path: Path) -> pd.DataFrame:
    """Read Parquet file into DataFrame.

    Args:
        path: Path to Parquet file

    Returns:
        DataFrame with preserved dtypes
    """
    # TODO: your code here - use pd.read_parquet()
    return ...  # TODO: your code here

# Test it
import tempfile
with tempfile.TemporaryDirectory() as tmpdir:
    test_path = Path(tmpdir) / "test.parquet"
    write_parquet(orders_clean, test_path)

    df_loaded = read_parquet(test_path)
    print(f"Loaded {len(df_loaded)} rows")
    print(f"Dtypes preserved: {df_loaded['amount'].dtype}")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
def read_parquet(path: Path) -> pd.DataFrame:
    return pd.read_parquet(path)
```

**Why Parquet?** It preserves dtypes! When you read the file back, `amount` is still `Float64`, not `float64`.

</details>
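
For contrast, here is what a CSV round trip does to a typed column. The sketch writes to an in-memory buffer instead of a file, and the two-row DataFrame is made up for illustration.

```python
from io import StringIO

import pandas as pd

typed = pd.DataFrame({"user_id": pd.array(["0001", "0002"], dtype="string")})

buffer = StringIO()
typed.to_csv(buffer, index=False)  # CSV stores text only, no dtype information
buffer.seek(0)
reloaded = pd.read_csv(buffer)     # no dtype given, so pandas infers again

print(typed["user_id"].tolist())     # ['0001', '0002']
print(reloaded["user_id"].tolist())  # [1, 2]
print(reloaded["user_id"].dtype)     # int64
```

A CSV file cannot carry the schema, so every reader has to repeat the `dtype=` argument; a Parquet file stores the schema with the data.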


In [None]:
# Test Exercise 11
def test_read_parquet():
    import tempfile

    with tempfile.TemporaryDirectory() as tmpdir:
        test_path = Path(tmpdir) / "test.parquet"
        write_parquet(orders_clean, test_path)

        df_loaded = read_parquet(test_path)

        assert len(df_loaded) == len(orders_clean), "Row count should match"
        assert df_loaded["amount"].dtype == "Float64", f"dtype should be preserved, got {df_loaded['amount'].dtype}"

    print("✅ Exercise 11 passed: read_parquet function works correctly!")

test_read_parquet()


---

## Exercise 12: Complete ETL Pipeline

Put it all together! Complete this mini ETL pipeline:


In [None]:
# ============================================================================
# Exercise 12: Complete ETL Pipeline
# ============================================================================

def run_etl_pipeline(raw_data: str, output_path: Path) -> dict:
    """Run complete ETL pipeline.

    Args:
        raw_data: CSV string data
        output_path: Where to write the Parquet output

    Returns:
        Metadata dict with row counts and output path
    """
    # Step 1: Load raw data with explicit dtypes
    # TODO: your code here
    df = pd.read_csv(
        StringIO(raw_data),
        dtype={...},  # TODO: specify dtypes for order_id and user_id
        na_values=NA_VALUES,
        keep_default_na=True,
    )
    rows_loaded = len(df)

    # Step 2: Enforce schema
    # TODO: your code here
    df = ...  # TODO: call enforce_schema

    # Step 3: Write Parquet output
    # TODO: your code here
    ...  # TODO: call write_parquet

    # Return metadata
    return {
        "rows_loaded": rows_loaded,
        "rows_written": len(df),
        "output_path": str(output_path),
    }

# Test the pipeline
import tempfile
with tempfile.TemporaryDirectory() as tmpdir:
    out_path = Path(tmpdir) / "processed" / "orders.parquet"

    result = run_etl_pipeline(orders_data, out_path)
    print("ETL Result:")
    print(json.dumps(result, indent=2))

    # Verify
    df_check = pd.read_parquet(out_path)
    print(f"\nVerification: Loaded {len(df_check)} rows from output")


<details>
<summary>🔍 Click to reveal solution</summary>

```python
def run_etl_pipeline(raw_data: str, output_path: Path) -> dict:
    # Step 1: Load
    df = pd.read_csv(
        StringIO(raw_data),
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA_VALUES,
        keep_default_na=True,
    )
    rows_loaded = len(df)

    # Step 2: Transform
    df = enforce_schema(df)

    # Step 3: Write
    write_parquet(df, output_path)

    return {
        "rows_loaded": rows_loaded,
        "rows_written": len(df),
        "output_path": str(output_path),
    }
```

</details>
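
The metadata dict returned by the pipeline could also be saved next to the output for later inspection. The helper below is a sketch under assumptions: `write_run_meta`, the `run_at_utc` key, and the `_run_meta.json` file name are all hypothetical and not part of the course code.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_run_meta(meta: dict, path: Path) -> dict:
    """Write pipeline metadata as JSON, adding a UTC timestamp (hypothetical helper)."""
    record = {**meta, "run_at_utc": datetime.now(timezone.utc).isoformat()}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")  # overwrite, not append
    return record


with tempfile.TemporaryDirectory() as tmpdir:
    meta_path = Path(tmpdir) / "processed" / "_run_meta.json"
    example_meta = {"rows_loaded": 5, "rows_written": 5, "output_path": "orders.parquet"}
    write_run_meta(example_meta, meta_path)
    saved = json.loads(meta_path.read_text(encoding="utf-8"))

print(saved["rows_written"], saved["run_at_utc"])
```

The timestamp changes on every run, so this file records when the pipeline ran, while the Parquet output itself stays identical for identical inputs.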


In [None]:
# Test Exercise 12
def test_etl_pipeline():
    import tempfile

    with tempfile.TemporaryDirectory() as tmpdir:
        out_path = Path(tmpdir) / "processed" / "orders.parquet"
        result = run_etl_pipeline(orders_data, out_path)

        assert result["rows_loaded"] == 5, f"Should load 5 rows, got {result['rows_loaded']}"
        assert result["rows_written"] == 5, f"Should write 5 rows, got {result['rows_written']}"
        assert out_path.exists(), "Output file should exist"

        # Verify data integrity
        df = pd.read_parquet(out_path)
        assert df["amount"].dtype == "Float64", "amount dtype should be Float64"
        assert pd.isna(df.iloc[2]["amount"]), "Invalid amount should be NA"

    print("✅ Exercise 12 passed: Complete ETL pipeline works!")

test_etl_pipeline()


---

# Summary

## Key Concepts Learned

1. **Offline-first mindset**: Separate data by role (raw/cache/processed)
2. **Idempotent outputs**: Safe to re-run without duplicates (use overwrite, not append)
3. **Centralized paths**: Use `pathlib.Path` and a `config.py`
4. **Explicit dtypes**: Always specify dtypes for IDs (`"string"`)
5. **Schema enforcement**: Use `pd.to_numeric(..., errors="coerce")`
6. **Parquet format**: Preserves dtypes, fast, compact

## Project Layout Target

```
week2-data-work/
├── data/
│   ├── raw/         # immutable inputs (NEVER edit)
│   ├── cache/       # API responses (safe to delete)
│   ├── processed/   # Parquet outputs (safe to recreate)
│   └── external/    # reference data
├── src/bootcamp_data/
│   ├── __init__.py
│   ├── config.py    # Paths dataclass
│   ├── io.py        # read/write functions
│   └── transforms.py # enforce_schema
├── scripts/
│   └── run_day1_load.py
├── notebooks/
└── reports/figures/
```
```
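
The `data/` part of this layout can be created with a few lines of `pathlib`. A minimal sketch, assuming a hypothetical helper named `scaffold_data_dirs` and running against a temporary directory so nothing is written to a real project:

```python
import tempfile
from pathlib import Path

DATA_SUBDIRS = ("raw", "cache", "processed", "external")


def scaffold_data_dirs(root: Path) -> list[Path]:
    """Create data/<role>/ folders under root; safe to call repeatedly (hypothetical helper)."""
    created = []
    for name in DATA_SUBDIRS:
        folder = root / "data" / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


with tempfile.TemporaryDirectory() as tmpdir:
    folders = scaffold_data_dirs(Path(tmpdir) / "week2-data-work")
    names = sorted(p.name for p in folders if p.is_dir())

print(names)  # ['cache', 'external', 'processed', 'raw']
```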

## Exit Ticket

**Why do we keep IDs as strings and prefer Parquet for processed outputs?**

<details>
<summary>🔍 Click to reveal answer</summary>

- **IDs as strings**: Preserves leading zeros (e.g., "0001" stays "0001", not 1)
- **Parquet for processed**:
  - Preserves dtypes (no inference on reload)
  - Faster to read/write than CSV
  - Smaller file size
  - Binary format with an embedded schema = fewer parsing errors

</details>


---

## Final Verification: Run All Tests

Run this cell to verify all exercises are complete:


In [None]:
# ============================================================================
# FINAL VERIFICATION - Run all tests
# ============================================================================

print("=" * 60)
print("FINAL VERIFICATION - Week 2 Day 1")
print("=" * 60)
print()

all_passed = True
tests = [
    ("Exercise 1: File Classifications", test_file_classifications),
    ("Exercise 2: Idempotency", test_idempotency),
    ("Exercise 3: Non-idempotent Symptom", test_non_idempotent_symptom),
    ("Exercise 4: Paths Dataclass", test_make_paths),
    ("Exercise 5: Path Operations", test_path_operations),
    ("Exercise 6: Leading-Zero Bug", test_leading_zero_bug),
    ("Exercise 7: CSV Dtypes", test_csv_dtypes),
    ("Exercise 8: pd.to_numeric Coerce", test_to_numeric_coerce),
    ("Exercise 9: Schema Enforcement", test_enforce_schema),
    ("Exercise 10: Write Parquet", test_write_parquet),
    ("Exercise 11: Read Parquet", test_read_parquet),
    ("Exercise 12: ETL Pipeline", test_etl_pipeline),
]

for name, test_fn in tests:
    try:
        test_fn()
        print(f"✅ {name}")
    except AssertionError as e:
        print(f"❌ {name}")
        print(f"   Error: {e}")
        all_passed = False
    except Exception as e:
        print(f"❌ {name}")
        print(f"   Error: {type(e).__name__}: {e}")
        all_passed = False

print()
print("=" * 60)
if all_passed:
    print("🎉 ALL 12 EXERCISES COMPLETE! Great work!")
    print("You're ready for Day 2: Data Quality & Cleaning")
else:
    print("⚠️ Some exercises need work.")
    print("Use Ctrl+F to search for 'TODO' to find incomplete exercises.")
print("=" * 60)


---

## Finding Incomplete Exercises

If you need to find exercises you haven't completed yet:

1. Press `Ctrl+F` (or `Cmd+F` on Mac)
2. Search for `TODO`
3. Each `TODO` comment marks code you need to write

**Tip:** Replace `...` with your solution, then run the test cell that follows that exercise to verify.
