# Day 2, Block B: JSON → DuckDB Pipeline

**Duration:** 40-45 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Explain** why nested JSON needs normalization (tidy data principles)
2. **Normalize** nested JSON structures into multiple related tables
3. **Use pandas** to flatten one-to-many relationships
4. **Persist** normalized data to DuckDB for SQL analysis
5. **Join** normalized tables to answer business questions
6. **Validate** data quality with assertions

---


## Part 1: The Normalization Problem (⏱️ 8-10 minutes)

### Recall: Tidy Data Principles (Day 1)

On Day 1, we learned **tidy data principles:**

1. Each **variable** is a column
2. Each **observation** is a row
3. Each **type of observational unit** forms a table

**The problem:** Nested JSON violates these principles!

---


In [39]:
# Setup: Import libraries
import requests
import json
import pandas as pd
import duckdb
from pprint import pprint

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [27]:
# OPTION 1: Use live API (DEFAULT)
DUMMYJSON_URL = "https://dummyjson.com/products"

response = requests.get(DUMMYJSON_URL, params={'limit': 30}, timeout=10)
response.raise_for_status()
products_data = response.json()

print(f"✅ Fetched {len(products_data['products'])} products from API")

✅ Fetched 30 products from API


In [28]:
# OPTION 2: Use backup file (if API is down)
# Run this cell instead of the cell above if DummyJSON is unavailable.
# (Running both is harmless: this cell simply overwrites products_data with the backup.)

import json
with open('../../data/day2/block_b/products_backup.json') as f:
    products_data = json.load(f)

print(f"✅ Loaded {len(products_data['products'])} products from backup")

✅ Loaded 30 products from backup
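
The two options above can also be folded into one loader that falls back to the backup file automatically. The block below is a minimal sketch of that idea, not part of the course code: `fetch_from_api`, `load_products`, and the injectable `fetch` parameter are hypothetical names introduced for illustration.

```python
import json

import requests


def fetch_from_api(url="https://dummyjson.com/products", limit=30):
    """Fetch products from the live API (same request as OPTION 1)."""
    response = requests.get(url, params={"limit": limit}, timeout=10)
    response.raise_for_status()
    return response.json()


def load_products(backup_path, fetch=fetch_from_api):
    """Try the API first; fall back to the local backup file on a request error.

    `fetch` is a parameter so the fallback path can be exercised without a network.
    Returns (data, source), where source is "api" or "backup".
    """
    try:
        return fetch(), "api"
    except requests.RequestException as exc:
        print(f"API unavailable ({exc.__class__.__name__}); using backup file")
        with open(backup_path, encoding="utf-8") as f:
            return json.load(f), "backup"
```

In the notebook this could be called as `products_data, source = load_products('../../data/day2/block_b/products_backup.json')`, and printing `source` records which path was taken.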


### The Problem: One-to-Many Relationships

Look at a single product with its nested reviews:

---


In [29]:
# Examine one product with nested reviews
sample_product = products_data['products'][0]

print("Product structure:")
print(f"  ID: {sample_product['id']}")
print(f"  Title: {sample_product['title']}")
print(f"  Price: ${sample_product['price']}")
print(f"\n  Reviews ({len(sample_product['reviews'])} reviews):")

for i, review in enumerate(sample_product['reviews'], 1):
    print(f"    Review {i}: {review['rating']}⭐ - {review['reviewerName']}")

print("\n" + "=" * 60)
print("Problem: One product has MANY reviews.")
print("This is a ONE-TO-MANY relationship - not tidy!")
print("=" * 60)

Product structure:
  ID: 1
  Title: Essence Mascara Lash Princess
  Price: $9.99

  Reviews (3 reviews):
    Review 1: 3⭐ - Eleanor Collins
    Review 2: 4⭐ - Lucas Gordon
    Review 3: 5⭐ - Eleanor Collins

Problem: One product has MANY reviews.
This is a ONE-TO-MANY relationship - not tidy!


### Why This Matters for Business

**Questions we want to answer:**
- What's the average rating across all reviews? (need one row per review, not one row per product!)
- Which reviewers are most active? (need to count by reviewer)
- How many reviews per product? (need to aggregate reviews by product)

**We can't answer these questions cleanly with nested JSON.**

**Solution:** Create two separate tables:
1. **Products table** - One row per product (product-level attributes only)
2. **Reviews table** - One row per review (with `product_id` foreign key)

This is called **normalization** - the process of organizing data to reduce redundancy and improve integrity.
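
To make the split concrete, here is a minimal sketch on a single hand-made record. The `nested` dictionary is invented sample data that mimics the API's shape; only the key names (`id`, `title`, `price`, `reviews`, `rating`, `reviewerName`) come from the notebook.

```python
# Invented sample data shaped like one DummyJSON product
nested = {
    "id": 1,
    "title": "Sample Mascara",
    "price": 9.99,
    "reviews": [
        {"rating": 3, "reviewerName": "A. Reader"},
        {"rating": 5, "reviewerName": "B. Writer"},
    ],
}

# Products table: one row per product, product-level attributes only
product_row = {key: value for key, value in nested.items() if key != "reviews"}

# Reviews table: one row per review, carrying the product's id as a foreign key
review_rows = [
    {"product_id": nested["id"], **review} for review in nested["reviews"]
]

print(product_row)
print(review_rows)
```

One nested record becomes one product row plus two review rows, and each review row can be traced back to its product through `product_id`.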

---


## Part 2: Normalize with Pandas (⏱️ 12-15 minutes)

### Strategy: Create Multiple Tidy Tables

We'll create two tables:
1. **`products`** - Product-level attributes (id, title, price, category, brand, stock)
2. **`reviews`** - Review-level attributes (product_id, rating, comment, reviewer)

The `product_id` in the reviews table is a **foreign key** that links back to the products table.

---


### Step 1: Extract Products Table

First, let's create a DataFrame with only product-level attributes:

---


In [30]:
# Extract products (top-level attributes only)
products_list = []

for product in products_data['products']:
    products_list.append({
        'product_id': product['id'],
        'title': product['title'],
        'price': product['price'],
        'category': product['category'],
        'brand': product.get('brand', 'Unknown'),  # Safe access
        'stock': product.get('stock', 0),
        'rating': product.get('rating', None)
    })

# Create DataFrame
products_df = pd.DataFrame(products_list)

print(f"✅ Created products table: {products_df.shape[0]} rows × {products_df.shape[1]} columns")
products_df.head()

✅ Created products table: 30 rows × 7 columns


|  | product_id | title | price | category | brand | stock | rating |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Essence Mascara Lash Princess | 9.99 | beauty | Essence | 99 | 2.56 |
| 1 | 2 | Eyeshadow Palette with Mirror | 19.99 | beauty | Glamour Beauty | 34 | 2.86 |
| 2 | 3 | Powder Canister | 14.99 | beauty | Velvet Touch | 89 | 4.64 |
| 3 | 4 | Red Lipstick | 12.99 | beauty | Chic Cosmetics | 91 | 4.36 |
| 4 | 5 | Red Nail Polish | 8.99 | beauty | Nail Couture | 79 | 4.32 |


### Validation Pattern: Assert Data Quality

> **"Never assume data is clean. Prove it with assertions."**

**Why validate?**
- Catch data quality issues early (before they corrupt downstream analysis)
- Document assumptions (primary keys ARE unique)
- Fail fast with clear error messages (debugging is easier)

**Production pattern:**
```python
assert df['id_column'].is_unique, "Duplicate IDs found!"
assert df['required_column'].notna().all(), "NULL values in required field!"
```

Let's validate our products table:

---

In [31]:
# Validate products table
print("✅ Validating products table...")

# Check 1: Primary key uniqueness
assert products_df['product_id'].is_unique, \
    f"❌ Duplicate product IDs found! Expected {len(products_df)} unique IDs, found {products_df['product_id'].nunique()}"

print(f"   ✓ All {len(products_df)} product IDs are unique (PK check passed)")

# Check 2: No NULL primary keys
assert products_df['product_id'].notna().all(), \
    f"❌ NULL product IDs found! {products_df['product_id'].isna().sum()} rows have NULL IDs"

print(f"   ✓ No NULL product IDs (PK integrity check passed)")

# Check 3: Required fields are populated
required_fields = ['title', 'price', 'category']
for field in required_fields:
    null_count = products_df[field].isna().sum()
    assert null_count == 0, \
        f"❌ Found {null_count} NULL values in required field '{field}'"

print(f"   ✓ All required fields populated ({', '.join(required_fields)})")

print("\n✅ Products table validation: PASSED")

✅ Validating products table...
   ✓ All 30 product IDs are unique (PK check passed)
   ✓ No NULL product IDs (PK integrity check passed)
   ✓ All required fields populated (title, price, category)

✅ Products table validation: PASSED
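
A bare `assert` stops at the first failure. When several checks matter, it can help to collect every problem and report them together. The helper below is a sketch of that idea under stated assumptions: `validate_table` is a hypothetical name, and the small DataFrame is invented (with two deliberate defects) so the block runs on its own.

```python
import pandas as pd


def validate_table(df, key, required):
    """Return a list of human-readable problems (an empty list means the table passed)."""
    problems = []
    null_keys = int(df[key].isna().sum())
    if null_keys:
        problems.append(f"{null_keys} NULL value(s) in key '{key}'")
    if df[key].duplicated().any():
        dupes = sorted(df.loc[df[key].duplicated(), key].unique().tolist())
        problems.append(f"duplicate values in key '{key}': {dupes}")
    for col in required:
        nulls = int(df[col].isna().sum())
        if nulls:
            problems.append(f"{nulls} NULL value(s) in required column '{col}'")
    return problems


# Invented data with two deliberate defects: a repeated ID and a missing price
bad = pd.DataFrame({
    "product_id": [1, 2, 2],
    "title": ["a", "b", "c"],
    "price": [9.99, None, 4.50],
})

for problem in validate_table(bad, key="product_id", required=["title", "price"]):
    print("Problem:", problem)
```

A final `assert not problems, problems` keeps the fail-fast behaviour while still showing every issue in the error message.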


In [32]:
# Verify products table structure
print("Products table info:")
print(f"  Shape: {products_df.shape}")
print(f"  Columns: {list(products_df.columns)}")
print(f"  Data types:\n{products_df.dtypes}")
print(f"\n  Unique products: {products_df['product_id'].nunique()}")
print(f"  Categories: {products_df['category'].nunique()}")

Products table info:
  Shape: (30, 7)
  Columns: ['product_id', 'title', 'price', 'category', 'brand', 'stock', 'rating']
  Data types:
product_id      int64
title          object
price         float64
category       object
brand          object
stock           int64
rating        float64
dtype: object

  Unique products: 30
  Categories: 4


### Step 2: Extract Reviews Table (Explode One-to-Many)

Now for the tricky part: **exploding the reviews array** into separate rows.

Each review becomes its own row, with a `product_id` foreign key linking back to the product.

---


In [33]:
# Extract reviews (explode one-to-many)
reviews_list = []

for product in products_data['products']:
    product_id = product['id']
    
    # Each product can have multiple reviews
    for review in product.get('reviews', []):
        reviews_list.append({
            'product_id': product_id,  # Foreign key!
            'rating': review['rating'],
            'comment': review['comment'],
            'reviewer_name': review['reviewerName'],
            'reviewer_email': review.get('reviewerEmail', None),
            'review_date': review.get('date', None)
        })

# Create DataFrame
reviews_df = pd.DataFrame(reviews_list)

print(f"✅ Created reviews table: {reviews_df.shape[0]} rows × {reviews_df.shape[1]} columns")
reviews_df.head(10)

✅ Created reviews table: 90 rows × 6 columns


|  | product_id | rating | comment | reviewer_name | reviewer_email | review_date |
|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Would not recommend! | Eleanor Collins | eleanor.collins@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 1 | 1 | 4 | Very satisfied! | Lucas Gordon | lucas.gordon@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 2 | 1 | 5 | Highly impressed! | Eleanor Collins | eleanor.collins@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 3 | 2 | 5 | Great product! | Savannah Gomez | savannah.gomez@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 4 | 2 | 4 | Awesome product! | Christian Perez | christian.perez@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 5 | 2 | 1 | Poor quality! | Nicholas Bailey | nicholas.bailey@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 6 | 3 | 4 | Would buy again! | Alexander Jones | alexander.jones@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 7 | 3 | 5 | Highly impressed! | Elijah Cruz | elijah.cruz@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 8 | 3 | 1 | Very dissatisfied! | Avery Perez | avery.perez@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 9 | 4 | 4 | Great product! | Liam Garcia | liam.garcia@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
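
The nested loop makes every step of the explode explicit. pandas also provides `pd.json_normalize`, which can do a similar explode in one call through its `record_path` and `meta` arguments. The sketch below applies it to a small invented payload shaped like `products_data`; the renaming step is a choice made here to match the notebook's column names.

```python
import pandas as pd

# Invented payload with the same shape as products_data
payload = {
    "products": [
        {"id": 1, "title": "A", "reviews": [
            {"rating": 3, "comment": "ok", "reviewerName": "Ann"},
            {"rating": 5, "comment": "great", "reviewerName": "Bob"},
        ]},
        {"id": 2, "title": "B", "reviews": [
            {"rating": 4, "comment": "good", "reviewerName": "Cy"},
        ]},
    ]
}

reviews_alt = pd.json_normalize(
    payload["products"],
    record_path="reviews",  # the nested list to explode into rows
    meta=["id"],            # parent fields to copy onto each exploded row
).rename(columns={"id": "product_id", "reviewerName": "reviewer_name"})

print(reviews_alt)
```

The loop's `product.get('reviews', [])` tolerates products that have no `reviews` key at all. Before swapping in `json_normalize`, it is worth checking how it behaves on such records in your own data.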


### Validate Foreign Key Integrity

> **"In relational data, every FK must point to a valid PK. No orphans allowed!"**

**What is FK integrity?**
- Every `reviews.product_id` must exist in `products.product_id`
- If not, we have "orphaned" reviews pointing to non-existent products
- This breaks JOINs and corrupts analysis

**Business impact of orphaned FKs:**
- ❌ Reviews can't be matched to products (data loss)
- ❌ JOIN results are incomplete (wrong metrics)
- ❌ Stakeholders lose trust in data quality

Let's validate:

---

In [34]:
# Validate reviews table
print("✅ Validating reviews table...")

# Check 1: Foreign key integrity
valid_product_ids = set(products_df['product_id'])
orphaned_reviews = ~reviews_df['product_id'].isin(valid_product_ids)

assert not orphaned_reviews.any(), \
    f"❌ Found {orphaned_reviews.sum()} orphaned reviews! " \
    f"These reviews have product_id values that don't exist in products table."

print(f"   ✓ All {len(reviews_df)} reviews have valid product_id (FK integrity passed)")

# Check 2: No NULL foreign keys
assert reviews_df['product_id'].notna().all(), \
    f"❌ Found {reviews_df['product_id'].isna().sum()} NULL product_id values in reviews!"

print(f"   ✓ No NULL foreign keys in reviews table")

# Check 3: Rating values are in valid range
assert reviews_df['rating'].between(1, 5).all(), \
    f"❌ Found invalid ratings! Ratings must be between 1-5"

print(f"   ✓ All ratings are in valid range (1-5)")

# Bonus: Check referential integrity statistics
products_with_reviews = reviews_df['product_id'].nunique()
products_without_reviews = len(products_df) - products_with_reviews

print(f"\n📊 Referential integrity statistics:")
print(f"   Products with reviews: {products_with_reviews}/{len(products_df)} ({products_with_reviews/len(products_df)*100:.1f}%)")
print(f"   Products without reviews: {products_without_reviews} ({products_without_reviews/len(products_df)*100:.1f}%)")

print("\n✅ Reviews table validation: PASSED")

✅ Validating reviews table...
   ✓ All 90 reviews have valid product_id (FK integrity passed)
   ✓ No NULL foreign keys in reviews table
   ✓ All ratings are in valid range (1-5)

📊 Referential integrity statistics:
   Products with reviews: 30/30 (100.0%)
   Products without reviews: 0 (0.0%)

✅ Reviews table validation: PASSED
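
When the FK assertion fails, the next question is which rows are orphaned. One way to list them is a left merge with `indicator=True`, which adds a `_merge` column marking rows found only on the left side. The tables below are invented and include one deliberate orphan so there is something to find.

```python
import pandas as pd

# Invented tables: the review with product_id 99 has no matching product
products_demo = pd.DataFrame({"product_id": [1, 2]})
reviews_demo = pd.DataFrame({
    "product_id": [1, 1, 2, 99],
    "rating": [3, 4, 5, 2],
})

merged = reviews_demo.merge(
    products_demo, on="product_id", how="left", indicator=True
)

# "left_only" means the review's product_id was not found in products
orphans = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(orphans)
```

Printing the orphaned rows, rather than only counting them, usually makes the root cause (a filtered product list, a typo in an ID) much quicker to spot.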


In [35]:
# Verify reviews table structure
print("Reviews table info:")
print(f"  Shape: {reviews_df.shape}")
print(f"  Columns: {list(reviews_df.columns)}")
print(f"\n  Total reviews: {len(reviews_df)}")
print(f"  Products with reviews: {reviews_df['product_id'].nunique()}")
print(f"  Unique reviewers: {reviews_df['reviewer_name'].nunique()}")
print(f"\n  Reviews per product (sample):")
print(reviews_df.groupby('product_id').size().head(10))

Reviews table info:
  Shape: (90, 6)
  Columns: ['product_id', 'rating', 'comment', 'reviewer_name', 'reviewer_email', 'review_date']

  Total reviews: 90
  Products with reviews: 30
  Unique reviewers: 69

  Reviews per product (sample):
product_id
1     3
2     3
3     3
4     3
5     3
6     3
7     3
8     3
9     3
10    3
dtype: int64


### What We Accomplished

**Before:** One nested JSON structure with products containing review arrays

**After:** Two tidy tables!
- **Products table:** 30 rows (one per product)
- **Reviews table:** 90 rows (one per review, 3 reviews per product in this dataset)

**Key insight:** We went from 30 products with nested data → 90 separate review records.

This is **normalization** - we've separated the data by observational unit (products vs. reviews).

---


## Part 3: Load to DuckDB (⏱️ 10-12 minutes)

### Why DuckDB?

**DuckDB is perfect for analytics:**
- ✅ **Fast** - Columnar storage, optimized for analytical queries
- ✅ **SQL interface** - Use familiar SQL syntax
- ✅ **No server** - Embedded database (no setup, no configuration)
- ✅ **Works with pandas** - Seamless integration

**Think of it as:** "SQLite for analytics" or "Postgres that runs in your Python script."

---


In [36]:
# Create DuckDB connection (in-memory database)
from IPython.display import display

con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB (in-memory)")
print(f"   DuckDB version: {duckdb.__version__}")

✅ Connected to DuckDB (in-memory)
   DuckDB version: 1.4.1


In [37]:
# Write products table to DuckDB
con.execute("CREATE TABLE products AS SELECT * FROM products_df")

# Verify
row_count = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(f"✅ Created 'products' table: {row_count} rows")

✅ Created 'products' table: 30 rows


In [38]:
# Write reviews table to DuckDB
con.execute("CREATE TABLE reviews AS SELECT * FROM reviews_df")

# Verify
row_count = con.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
print(f"✅ Created 'reviews' table: {row_count} rows")

✅ Created 'reviews' table: 90 rows


In [40]:
# Verify data loaded correctly
print("Tables in database:")
display(con.execute("SHOW TABLES").df())

print("\nProducts preview:")
display(con.execute("SELECT * FROM products LIMIT 5").df())

print("\nReviews preview:")
display(con.execute("SELECT * FROM reviews LIMIT 5").df())

Tables in database:


|  | name |
|---|---|
| 0 | products |
| 1 | reviews |



Products preview:


|  | product_id | title | price | category | brand | stock | rating |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Essence Mascara Lash Princess | 9.99 | beauty | Essence | 99 | 2.56 |
| 1 | 2 | Eyeshadow Palette with Mirror | 19.99 | beauty | Glamour Beauty | 34 | 2.86 |
| 2 | 3 | Powder Canister | 14.99 | beauty | Velvet Touch | 89 | 4.64 |
| 3 | 4 | Red Lipstick | 12.99 | beauty | Chic Cosmetics | 91 | 4.36 |
| 4 | 5 | Red Nail Polish | 8.99 | beauty | Nail Couture | 79 | 4.32 |



Reviews preview:


|  | product_id | rating | comment | reviewer_name | reviewer_email | review_date |
|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Would not recommend! | Eleanor Collins | eleanor.collins@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 1 | 1 | 4 | Very satisfied! | Lucas Gordon | lucas.gordon@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 2 | 1 | 5 | Highly impressed! | Eleanor Collins | eleanor.collins@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 3 | 2 | 5 | Great product! | Savannah Gomez | savannah.gomez@x.dummyjson.com | 2025-04-30T09:41:02.053Z |
| 4 | 2 | 4 | Awesome product! | Christian Perez | christian.perez@x.dummyjson.com | 2025-04-30T09:41:02.053Z |


### Success! JSON → DuckDB

We've completed the pipeline:
1. ✅ Fetched JSON from API
2. ✅ Normalized nested structures to tidy tables
3. ✅ Loaded to DuckDB for SQL analysis

**Now we can use SQL to answer business questions!**

---


## Part 4: SQL Analysis & Joins (⏱️ 10-12 minutes)

### Business Questions We Can Now Answer

With normalized tables and SQL, we can answer questions like:
1. How many products do we have per category?
2. What's the average review rating per product?
3. Which products have the most reviews?
4. Which reviewers are most active?

Let's tackle each one:

---


### Query 1: Products by Category

**Business question:** "How is our product catalog distributed across categories? Which categories have the most products, and what's the average price point in each category?"

**Why this matters:** Understanding category distribution helps with:
- Inventory planning (which categories need more variety?)
- Pricing strategy (are we positioned as premium or budget in each category?)
- Marketing focus (which categories to promote?)

Let's use GROUP BY to aggregate products by category:

---

In [41]:
# Count products by category
print("Products by category:")
con.execute("""
    SELECT 
        category,
        COUNT(*) as product_count,
        ROUND(AVG(price), 2) as avg_price
    FROM products
    GROUP BY category
    ORDER BY product_count DESC
""").df()

Products by category:


|  | category | product_count | avg_price |
|---|---|---|---|
| 0 | groceries | 15 | 6.04 |
| 1 | beauty | 5 | 13.39 |
| 2 | fragrances | 5 | 83.99 |
| 3 | furniture | 5 | 1199.99 |


**What happened?**

The query executed three key operations:
1. **GROUP BY category** - Split our 30 products into groups (one per category)
2. **COUNT(*)** - Counted how many products are in each group
3. **AVG(price)** - Calculated average price within each group

**Business insights from the results:**
- **Groceries dominates** with 15 products (50% of catalog) at a low price point ($6.04 avg)
- **Furniture** is the premium category (only 5 products, but roughly a $1,200 avg price)
- **Beauty & Fragrances** sit in between (5 products each, averaging $13.39 and $83.99)

**Key SQL concept:** GROUP BY transforms row-level data into summary statistics. We went from 30 product rows → 4 category summary rows.
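
The same kind of summary can be produced in pandas, which is a useful cross-check on the SQL. This sketch uses named aggregation on an invented four-row table; the column names follow the notebook's `products` table, but the values are made up.

```python
import pandas as pd

# Invented data: two categories with two products each
products_demo = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "category": ["beauty", "beauty", "groceries", "groceries"],
    "price": [10.0, 20.0, 2.0, 4.0],
})

summary = (
    products_demo
    .groupby("category", as_index=False)
    .agg(
        product_count=("product_id", "count"),  # like COUNT(*) when IDs are never NULL
        avg_price=("price", "mean"),            # like AVG(price)
    )
    .round({"avg_price": 2})
    .sort_values("product_count", ascending=False)
)

print(summary)
```

If the pandas and SQL summaries disagree on real data, the usual suspects are NULL handling and rows that were dropped or duplicated during loading.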

---

### Query 2: JOIN Products + Reviews

**Business question:** "Which products have the best customer ratings? How many reviews do they have, and are highly rated products also heavily reviewed?"

**Why JOINs matter:** We have product data in one table and review data in another. To answer questions like "best-reviewed products," we need to **combine** these tables.

**This is the power of normalization:** We separated products and reviews into tidy tables. Now we can JOIN them back together to answer multi-dimensional questions.

**JOIN Strategy:**
- Use **LEFT JOIN** to keep ALL products (even those without reviews)
- This shows us which products need more review attention
- If we used INNER JOIN, we'd lose products with zero reviews

Let's combine both tables:

---

In [42]:
# Calculate average review rating per product
print("Top-rated products (by review ratings):")
con.execute("""
    SELECT 
        p.product_id,
        p.title,
        p.category,
        p.price,
        COUNT(r.rating) as review_count,
        ROUND(AVG(r.rating), 2) as avg_review_rating
    FROM products p
    LEFT JOIN reviews r ON p.product_id = r.product_id
    GROUP BY p.product_id, p.title, p.category, p.price
    ORDER BY avg_review_rating DESC
    LIMIT 10
""").df()

Top-rated products (by review ratings):


|  | product_id | title | category | price | review_count | avg_review_rating |
|---|---|---|---|---|---|---|
| 0 | 28 | Ice Cream | groceries | 5.49 | 3 | 4.67 |
| 1 | 22 | Dog Food | groceries | 10.99 | 3 | 4.67 |
| 2 | 29 | Juice | groceries | 3.99 | 3 | 4.67 |
| 3 | 7 | Chanel Coco Noir Eau De | fragrances | 129.99 | 3 | 4.67 |
| 4 | 4 | Red Lipstick | beauty | 12.99 | 3 | 4.67 |
| 5 | 8 | Dior J'adore | fragrances | 89.99 | 3 | 4.33 |
| 6 | 9 | Dolce Shine Eau de | fragrances | 69.99 | 3 | 4.33 |
| 7 | 17 | Beef Steak | groceries | 12.99 | 3 | 4.0 |
| 8 | 20 | Cooking Oil | groceries | 4.99 | 3 | 4.0 |
| 9 | 13 | Bedside Table African Cherry | furniture | 299.99 | 3 | 4.0 |


**What happened?**

This query demonstrates the complete **normalization → JOIN → aggregation** pipeline:

1. **LEFT JOIN** - Connected products table with reviews table using `product_id` as the foreign key
2. **GROUP BY** - Aggregated reviews per product (collapsed multiple reviews → one row per product)
3. **COUNT(r.rating)** - Counted how many reviews each product has
4. **AVG(r.rating)** - Calculated average rating from all reviews

**Business insights from the results:**
- All top-rated products have **exactly 3 reviews** - This is consistent (our dataset has 3 reviews per product)
- Multiple products tied at **4.67 stars** - These are our best-reviewed items
- The query shows **product_id, title, category, price** alongside review metrics

**Key SQL concept - LEFT JOIN preserves all products:**
- If a product had 0 reviews, it would still appear in results with `review_count = 0` and `avg_review_rating = NULL`
- This is different from INNER JOIN, which would exclude products without reviews entirely

**Why this matters for business:** You can now identify:
- Products with high ratings AND high review counts (social proof!)
- Products with few reviews (opportunity to gather more feedback)
- Category patterns in ratings (are beauty products rated higher than groceries?)

---

### Common Mistakes with JOINs & Aggregations

> **🚨 These mistakes will cost you hours of debugging. Learn them now!**

---

#### ❌ Mistake 1: Forgetting the JOIN Condition (Accidental CROSS JOIN)

**Wrong:**
```sql
SELECT p.title, r.rating
FROM products p, reviews r  -- Missing ON condition!
```

**What happens:** Every product × every review = 30 × 90 = **2,700 rows** instead of 90!

**✅ Correct:**
```sql
SELECT p.title, r.rating
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id  -- Explicit JOIN condition
```

---

#### ❌ Mistake 2: Using INNER JOIN When You Need LEFT JOIN

**Business question:** "Show me ALL products and their review counts (including products with zero reviews)"

**Wrong:**
```sql
SELECT p.title, COUNT(r.rating) as review_count
FROM products p
INNER JOIN reviews r ON p.product_id = r.product_id  -- ❌ Loses products without reviews!
GROUP BY p.title
```

**What happens:** Products with 0 reviews don't appear in results at all.

**✅ Correct:**
```sql
SELECT p.title, COUNT(r.rating) as review_count
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id  -- ✅ Keeps all products
GROUP BY p.title
```

**Remember:** LEFT JOIN preserves all rows from the LEFT table (products), even if there is no match in the RIGHT table (reviews).

---

#### ❌ Mistake 3: Forgetting GROUP BY Columns

**Wrong:**
```sql
SELECT p.title, p.category, COUNT(r.rating)
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
GROUP BY p.title  -- ❌ Missing p.category in GROUP BY!
```

**SQL Rule:** Every non-aggregated column in SELECT must be in GROUP BY.

**✅ Correct:**
```sql
SELECT p.title, p.category, COUNT(r.rating)
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
GROUP BY p.title, p.category  -- ✅ All non-aggregated columns included
```

---

#### ❌ Mistake 4: Using COUNT(*) Instead of COUNT(column) with LEFT JOIN

**Scenario:** Product has 0 reviews (no matching rows in reviews table).

**Query:**
```sql
SELECT p.title, COUNT(*), COUNT(r.rating)
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
GROUP BY p.title
```

**For a product with 0 reviews:**
- `COUNT(*)` = **1** (counts the product row, even though review columns are NULL)
- `COUNT(r.rating)` = **0** (counts only non-NULL values)

**✅ With LEFT JOIN, use `COUNT(column)` not `COUNT(*)`** to count actual matches!

---

### Decision Guide: Which JOIN Type to Use?

> **Ask yourself: "Do I need ALL rows from the left table, or only matches?"**

---

| Business Scenario | Use This JOIN | Why |
|-------------------|---------------|-----|
| "Show ALL products and their review counts" | **LEFT JOIN** | Preserves products without reviews |
| "Show ALL customers and their order totals" | **LEFT JOIN** | Preserves customers who haven't ordered |
| "Show ONLY products that have been reviewed" | **INNER JOIN** | Excludes products without reviews |
| "Show ONLY customers who have placed orders" | **INNER JOIN** | Excludes customers without orders |
| "Show ALL products AND all reviews (even orphaned reviews)" | **FULL OUTER JOIN** | Rare in practice |

---

### Quick Decision Tree

```
Do you need rows from the left table even if there's no match?
│
├─ YES → Use LEFT JOIN
│   Examples:
│   - "All products" (some might not have reviews)
│   - "All customers" (some might not have orders)
│
└─ NO → Use INNER JOIN
    Examples:
    - "Only reviewed products"
    - "Only customers with orders"
```

---

### In Our E-Commerce Example

**LEFT JOIN (what we used):**
```sql
-- Show ALL products and their review metrics
SELECT p.title, COUNT(r.rating) as review_count
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
GROUP BY p.title
```
**Result:** All 30 products appear. Products without reviews show `review_count = 0`.

**INNER JOIN (alternative):**
```sql
-- Show ONLY products that have been reviewed
SELECT p.title, COUNT(r.rating) as review_count
FROM products p
INNER JOIN reviews r ON p.product_id = r.product_id
GROUP BY p.title
```
**Result:** Only products with ≥1 review appear. (In our dataset, that's all 30, but in real data it might be fewer.)

---

### When to Use RIGHT JOIN

**Almost never!** RIGHT JOIN is the same as LEFT JOIN with tables swapped. Just swap the table order and use LEFT JOIN for clarity:

❌ **Confusing:**
```sql
FROM products p
RIGHT JOIN reviews r ON p.product_id = r.product_id
```

✅ **Clear:**
```sql
FROM reviews r
LEFT JOIN products p ON r.product_id = p.product_id
```

**Convention:** Always put the "main" table (the one you want to preserve) on the LEFT and use LEFT JOIN.

---

### ⏸️ Pause and Try!

**Your task:** Write a query to analyze reviewers and their rating patterns.

**Business question:** "Which reviewers have left the most reviews, and what's their average rating? Are frequent reviewers more generous or more critical?"

**Requirements:**
1. JOIN the `products` and `reviews` tables
2. GROUP BY reviewer name
3. Calculate:
   - Count of reviews per reviewer (`COUNT(r.rating)`)
   - Average rating given by each reviewer (`AVG(r.rating)`)
4. Filter to show ONLY reviewers with **2 or more reviews** (use `HAVING`)
5. Order results by review count descending
6. Limit to top 15 reviewers

**Hint structure:**
```sql
SELECT 
    r.reviewer_name,
    ??? as review_count,
    ??? as avg_rating_given
FROM ??? p
LEFT JOIN ??? r ON ???
GROUP BY ???
HAVING ??? >= 2
ORDER BY ??? DESC
LIMIT 15
```

**Replace the `???` placeholders and complete the query below:**

---

In [None]:
# Your turn! Write your query here:
#
# TODO: Replace this placeholder with your complete query
con.execute("SELECT 1 AS todo").df()  # Replace this entire query with your answer

### Key SQL Concepts Used

**LEFT JOIN:**
- Keeps ALL products (even those without reviews)
- Matches reviews where `product_id` matches
- If no reviews exist, review columns are NULL

**GROUP BY:**
- Aggregates reviews per product
- `COUNT(r.rating)` counts how many reviews each product has
- `AVG(r.rating)` calculates average rating

**Business insight:** We can now see which products have the best reviews and how many reviews they received.

---

## Summary & What We Accomplished

### The Complete Pipeline: API → DuckDB → Insights

**We demonstrated the end-to-end modern data pipeline:**

1. **Fetch** - Retrieved JSON from DummyJSON API
2. **Normalize** - Flattened nested structures (products + reviews)
3. **Persist** - Loaded tidy tables into DuckDB
4. **Analyze** - Used SQL to answer business questions
5. **Validate** - Checked data quality at each step

### Key Patterns You Learned

✅ **Normalization** - Separate observational units into different tables  
✅ **Foreign keys** - Connect related tables (`product_id` links reviews to products)  
✅ **Pandas transformation** - Use Python for complex data manipulation  
✅ **DuckDB integration** - SQL analysis on pandas DataFrames  
✅ **JOINs** - Combine tables to answer multi-dimensional questions  

### This is Homework 2!

**HW2 will ask you to:**
1. Fetch data from a different API (or JSON file)
2. Normalize nested structures into tidy tables
3. Load to DuckDB
4. Write 3-5 SQL queries to calculate business KPIs
5. Add validation assertions
6. Document your pipeline with a data dictionary

**You now have the pattern!** This notebook is your template.

---

## Where to Go Next

### Production Enhancements (Reference)

For production systems, you'd add:
- **`requests.Session()`** - Connection pooling for better performance
- **`tenacity`** - Automatic retry logic for API failures
- **Error handling** - Try/except blocks around API calls
- **Logging** - Track what happened, when, and why
- **Data validation** - More assertions (PK uniqueness, FK integrity, type checks)
- **Incremental loads** - Only fetch new data (not full refresh)
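
As a rough sketch of the first two bullets, the block below builds a `requests.Session` that reuses connections and retries transient failures. It uses urllib3's `Retry` class mounted through an `HTTPAdapter` rather than `tenacity`, purely to avoid an extra dependency. The retry count, backoff factor, and status codes are illustrative values, `build_session` is a hypothetical helper name, and the `allowed_methods` argument assumes urllib3 1.26 or newer.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=0.5):
    """Return a Session that reuses connections and retries transient HTTP errors."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,               # wait grows between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP statuses
        allowed_methods=["GET"],                     # only retry idempotent reads
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session()
# Example call (needs network access):
# response = session.get("https://dummyjson.com/products", params={"limit": 30}, timeout=10)
# response.raise_for_status()
```

The session is a drop-in replacement for the module-level `requests.get` used earlier, so the rest of the pipeline would not need to change.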

**See:** `references/api_pipeline_quick_reference.md` for production patterns.

### HW2: Due Wednesday, Oct 22 (start of class)

**Assignment:** Build a mini-pipeline similar to this notebook
- Different API/JSON source
- Normalize to 2-3 tables
- 3-5 SQL KPIs
- Validation assertions
- Data dictionary

**Instructor will present HW2 details next!**

---
