# Day 2, Block B: JSON → DuckDB Pipeline

**Duration:** 40-45 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Explain** why nested JSON needs normalization (tidy data principles)
2. **Normalize** nested JSON structures into multiple related tables
3. **Use pandas** to flatten one-to-many relationships
4. **Persist** normalized data to DuckDB for SQL analysis
5. **Join** normalized tables to answer business questions
6. **Validate** data quality with assertions

---


## Part 1: The Normalization Problem (⏱️ 8-10 minutes)

### Recall: Tidy Data Principles (Day 1)

On Day 1, we learned **tidy data principles:**

1. Each **variable** is a column
2. Each **observation** is a row
3. Each **type of observational unit** forms a table

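**The problem:** Nested JSON violates these principles! A single product record embeds a list of reviews, so two observational units (products and reviews) are packed into one structure.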

---


In [None]:
# Setup: Import libraries
import requests
import json
import pandas as pd
import duckdb
from pprint import pprint

print("✅ Libraries imported successfully")

In [None]:
# OPTION 1: Use live API (DEFAULT)
DUMMYJSON_URL = "https://dummyjson.com/products"

response = requests.get(DUMMYJSON_URL, params={'limit': 30}, timeout=10)
response.raise_for_status()
products_data = response.json()

print(f"✅ Fetched {len(products_data['products'])} products from API")

In [None]:
# OPTION 2: Use backup file (if API is down)
# Uncomment these lines and skip the cell above if DummyJSON is unavailable

# import json
# with open('../../data/day2/block_b/products_backup.json') as f:
#     products_data = json.load(f)
# 
# print(f"✅ Loaded {len(products_data['products'])} products from backup")

### The Problem: One-to-Many Relationships

Look at a single product with its nested reviews:

---


In [None]:
# Examine one product with nested reviews
sample_product = products_data['products'][0]

print("Product structure:")
print(f"  ID: {sample_product['id']}")
print(f"  Title: {sample_product['title']}")
print(f"  Price: ${sample_product['price']}")
print(f"\n  Reviews ({len(sample_product['reviews'])} reviews):")

for i, review in enumerate(sample_product['reviews'], 1):
    print(f"    Review {i}: {review['rating']}⭐ - {review['reviewerName']}")

print("\n" + "=" * 60)
print("Problem: One product has MANY reviews.")
print("This is a ONE-TO-MANY relationship - not tidy!")
print("=" * 60)

### Why This Matters for Business

**Questions we want to answer:**
- What's the average rating across all reviews? (need to average over reviews, not products!)
- Which reviewers are most active? (need to count by reviewer)
- How many reviews per product? (need to aggregate reviews by product)

**We can't answer these questions cleanly with nested JSON.**

**Solution:** Create two separate tables:
1. **Products table** - One row per product (product-level attributes only)
2. **Reviews table** - One row per review (with `product_id` foreign key)

This is called **normalization** - the process of organizing data to reduce redundancy and improve integrity.
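
To see why the unit of observation matters, here is a minimal sketch using a hypothetical two-product payload (the titles and ratings are invented for illustration, not taken from DummyJSON). It compares averaging per product with averaging over individual reviews:

```python
# Hypothetical mini-payload shaped like the nested products/reviews JSON.
nested = {
    "products": [
        {"id": 1, "title": "Product A",
         "reviews": [{"rating": 5}, {"rating": 5}, {"rating": 5}, {"rating": 5}]},
        {"id": 2, "title": "Product B",
         "reviews": [{"rating": 1}]},
    ]
}

# Per-product averages: one number per product, regardless of review count.
per_product_avg = [
    sum(r["rating"] for r in p["reviews"]) / len(p["reviews"])
    for p in nested["products"]
]
mean_of_product_means = sum(per_product_avg) / len(per_product_avg)

# Review-level average: flatten first so each review is one observation.
all_ratings = [r["rating"] for p in nested["products"] for r in p["reviews"]]
avg_over_reviews = sum(all_ratings) / len(all_ratings)

print(f"Average of per-product averages: {mean_of_product_means}")
print(f"Average across all reviews:      {avg_over_reviews}")
```

The two numbers differ because the first treats every product equally while the second treats every review equally. A separate reviews table makes the second calculation a one-line aggregate.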

---

## Part 2: Normalize with Pandas (⏱️ 12-15 minutes)

### Strategy: Create Multiple Tidy Tables

We'll create two tables:
1. **`products`** - Product-level attributes (product_id, title, price, category, brand, stock, rating)
2. **`reviews`** - Review-level attributes (product_id, rating, comment, reviewer_name, reviewer_email, review_date)

The `product_id` in the reviews table is a **foreign key** that links back to the products table.

---


### Step 1: Extract Products Table

First, let's create a DataFrame with only product-level attributes:

---


In [None]:
# Extract products (top-level attributes only)
products_list = []

for product in products_data['products']:
    products_list.append({
        'product_id': product['id'],
        'title': product['title'],
        'price': product['price'],
        'category': product['category'],
        'brand': product.get('brand', 'Unknown'),  # Safe access
        'stock': product.get('stock', 0),
        'rating': product.get('rating', None)
    })

# Create DataFrame
products_df = pd.DataFrame(products_list)

print(f"✅ Created products table: {products_df.shape[0]} rows × {products_df.shape[1]} columns")
products_df.head()

In [None]:
# Verify products table structure
print("Products table info:")
print(f"  Shape: {products_df.shape}")
print(f"  Columns: {list(products_df.columns)}")
print(f"  Data types:\n{products_df.dtypes}")
print(f"\n  Unique products: {products_df['product_id'].nunique()}")
print(f"  Categories: {products_df['category'].nunique()}")

### Step 2: Extract Reviews Table (Explode One-to-Many)

Now for the tricky part: **exploding the reviews array** into separate rows.

Each review becomes its own row, with a `product_id` foreign key linking back to the product.

---


In [None]:
# Extract reviews (explode one-to-many)
reviews_list = []

for product in products_data['products']:
    product_id = product['id']
    
    # Each product can have multiple reviews
    for review in product.get('reviews', []):
        reviews_list.append({
            'product_id': product_id,  # Foreign key!
            'rating': review['rating'],
            'comment': review['comment'],
            'reviewer_name': review['reviewerName'],
            'reviewer_email': review.get('reviewerEmail', None),
            'review_date': review.get('date', None)
        })

# Create DataFrame
reviews_df = pd.DataFrame(reviews_list)

print(f"✅ Created reviews table: {reviews_df.shape[0]} rows × {reviews_df.shape[1]} columns")
reviews_df.head(10)

In [None]:
# Verify reviews table structure
print("Reviews table info:")
print(f"  Shape: {reviews_df.shape}")
print(f"  Columns: {list(reviews_df.columns)}")
print(f"\n  Total reviews: {len(reviews_df)}")
print(f"  Products with reviews: {reviews_df['product_id'].nunique()}")
print(f"  Unique reviewers: {reviews_df['reviewer_name'].nunique()}")
print(f"\n  Reviews per product (sample):")
print(reviews_df.groupby('product_id').size().head(10))

### What We Accomplished

**Before:** One nested JSON structure with products containing review arrays

**After:** Two tidy tables!
- **Products table:** 30 rows (one per product)
- **Reviews table:** ~90 rows (one per review, multiple reviews per product)

**Key insight:** We went from 30 products with nested review arrays → 30 product rows plus ~90 separate review rows (the exact count depends on what the API returns).

This is **normalization** - we've separated the data by observational unit (products vs. reviews).
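
As an alternative to the explicit loops above, pandas ships `pd.json_normalize`, which can explode a nested list through its `record_path` and `meta` parameters. The sketch below assumes a small hypothetical payload with the same shape as `products_data['products']`; the names `sample_products`, `products_alt`, and `reviews_alt` are illustrative only.

```python
import pandas as pd

# Hypothetical payload with the same shape as products_data['products'].
sample_products = [
    {"id": 1, "title": "Product A", "price": 9.99,
     "reviews": [{"rating": 5, "comment": "Great", "reviewerName": "Ann"},
                 {"rating": 3, "comment": "Okay", "reviewerName": "Ben"}]},
    {"id": 2, "title": "Product B", "price": 19.99,
     "reviews": [{"rating": 4, "comment": "Good", "reviewerName": "Ann"}]},
]

# Products: keep only scalar, product-level fields.
products_alt = (
    pd.DataFrame(sample_products)
    .drop(columns=["reviews"])
    .rename(columns={"id": "product_id"})
)

# Reviews: one row per element of each 'reviews' list.
# meta=["id"] carries the parent product's id along as the foreign key.
reviews_alt = (
    pd.json_normalize(sample_products, record_path="reviews", meta=["id"])
    .rename(columns={"id": "product_id", "reviewerName": "reviewer_name"})
    .astype({"product_id": "int64"})  # meta columns come back as object dtype
)

print(products_alt)
print(reviews_alt)
```

The loop version is more verbose but makes the foreign key explicit and tolerates products without a `reviews` key via `.get('reviews', [])`. Depending on the pandas version, `json_normalize` may raise a `KeyError` when a record lacks the `record_path` key, so check the payload before relying on it.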

---

## Part 3: Load to DuckDB (⏱️ 10-12 minutes)

### Why DuckDB?

**DuckDB is perfect for analytics:**
- ✅ **Fast** - Columnar storage, optimized for analytical queries
- ✅ **SQL interface** - Use familiar SQL syntax
- ✅ **No server** - Embedded database (no setup, no configuration)
- ✅ **Works with pandas** - Seamless integration

**Think of it as:** "SQLite for analytics" or "Postgres that runs in your Python script."

---


In [None]:
# Create DuckDB connection (in-memory database)
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB (in-memory)")
print(f"   DuckDB version: {duckdb.__version__}")

In [None]:
# Write products table to DuckDB
con.execute("CREATE TABLE products AS SELECT * FROM products_df")

# Verify
row_count = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(f"✅ Created 'products' table: {row_count} rows")

In [None]:
# Write reviews table to DuckDB
con.execute("CREATE TABLE reviews AS SELECT * FROM reviews_df")

# Verify
row_count = con.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
print(f"✅ Created 'reviews' table: {row_count} rows")

In [None]:
# Verify data loaded correctly
print("Tables in database:")
tables = con.execute("SHOW TABLES").df()
print(tables)

print("\nProducts preview:")
print(con.execute("SELECT * FROM products LIMIT 5").df())

print("\nReviews preview:")
print(con.execute("SELECT * FROM reviews LIMIT 5").df())

### Success! JSON → DuckDB

We've completed the pipeline:
1. ✅ Fetched JSON from API
2. ✅ Normalized nested structures to tidy tables
3. ✅ Loaded to DuckDB for SQL analysis

**Now we can use SQL to answer business questions!**

---

## Part 4: SQL Analysis & Joins (⏱️ 10-12 minutes)

### Business Questions We Can Now Answer

With normalized tables and SQL, we can answer questions like:
1. How many products do we have per category?
2. What's the average review rating per product?
3. Which products have the most reviews?
4. Which reviewers are most active?

Let's tackle each one:

---


### Query 1: Products by Category

---


In [None]:
# Count products by category
result = con.execute("""
    SELECT 
        category,
        COUNT(*) as product_count,
        ROUND(AVG(price), 2) as avg_price
    FROM products
    GROUP BY category
    ORDER BY product_count DESC
""").df()

print("Products by category:")
print(result)

### Query 2: JOIN Products + Reviews

Now let's combine both tables to calculate average rating per product:

---


In [None]:
# Calculate average review rating per product
result = con.execute("""
    SELECT 
        p.product_id,
        p.title,
        p.category,
        p.price,
        COUNT(r.rating) as review_count,
        ROUND(AVG(r.rating), 2) as avg_review_rating
    FROM products p
    LEFT JOIN reviews r ON p.product_id = r.product_id
    GROUP BY p.product_id, p.title, p.category, p.price
    ORDER BY avg_review_rating DESC NULLS LAST, review_count DESC
    LIMIT 10
""").df()

print("Top-rated products (by review ratings):")
print(result)

### Key SQL Concepts Used

**LEFT JOIN:**
- Keeps ALL products (even those without reviews)
- Matches reviews where `product_id` matches
- If no reviews exist, review columns are NULL

**GROUP BY:**
- Aggregates reviews per product
- `COUNT(r.rating)` counts non-NULL ratings, so it gives each product's review count and returns 0 for products with no reviews
- `AVG(r.rating)` calculates average rating

**Business insight:** We can now see which products have the best reviews and how many reviews they received.

---

## Summary & What We Accomplished

### The Complete Pipeline: API → DuckDB → Insights

**We demonstrated the end-to-end modern data pipeline:**

1. **Fetch** - Retrieved JSON from DummyJSON API
2. **Normalize** - Flattened nested structures (products + reviews)
3. **Persist** - Loaded tidy tables into DuckDB
4. **Analyze** - Used SQL to answer business questions
5. **Validate** - Checked shapes, row counts, and table previews at each step
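
Step 5 can be made stricter with explicit assertions that stop the pipeline when something is off. This is a minimal sketch on hypothetical stand-ins for `products_df` and `reviews_df`; the 1-5 rating range is an assumption about the source data, not something the API guarantees.

```python
import pandas as pd

# Hypothetical stand-ins for products_df and reviews_df.
products_check = pd.DataFrame({"product_id": [1, 2, 3], "price": [9.99, 19.99, 4.50]})
reviews_check = pd.DataFrame({"product_id": [1, 1, 2], "rating": [5, 3, 4]})

# Primary key: product_id must be present and unique.
assert products_check["product_id"].notna().all(), "NULL product_id in products"
assert products_check["product_id"].is_unique, "Duplicate product_id in products"

# Foreign key: every review must point at an existing product.
orphans = ~reviews_check["product_id"].isin(products_check["product_id"])
assert not orphans.any(), f"{orphans.sum()} reviews reference unknown products"

# Range checks (the 1-5 rating scale is an assumption).
assert reviews_check["rating"].between(1, 5).all(), "Rating outside 1-5"
assert (products_check["price"] >= 0).all(), "Negative price"

print("✅ All validation checks passed")
```

Each assertion carries a message, so a failure says which rule broke rather than just raising a bare `AssertionError`.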

### Key Patterns You Learned

✅ **Normalization** - Separate observational units into different tables  
✅ **Foreign keys** - Connect related tables (`product_id` links reviews to products)  
✅ **Pandas transformation** - Use Python for complex data manipulation  
✅ **DuckDB integration** - SQL analysis on pandas DataFrames  
✅ **JOINs** - Combine tables to answer multi-dimensional questions  

### This is Homework 2!

**HW2 will ask you to:**
1. Fetch data from a different API (or JSON file)
2. Normalize nested structures into tidy tables
3. Load to DuckDB
4. Write 3-5 SQL queries to calculate business KPIs
5. Add validation assertions
6. Document your pipeline with a data dictionary

**You now have the pattern!** This notebook is your template.

---

## Where to Go Next

### Production Enhancements (Reference)

For production systems, you'd add:
- **`requests.Session()`** - Connection pooling for better performance
- **`tenacity`** - Automatic retry logic for API failures
- **Error handling** - Try/except blocks around API calls
- **Logging** - Track what happened, when, and why
- **Data validation** - More assertions (PK uniqueness, FK integrity, type checks)
- **Incremental loads** - Only fetch new data (not full refresh)

**See:** `references/api_pipeline_quick_reference.md` for production patterns.
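
As a rough illustration of the first three items, the sketch below builds a `requests.Session` with automatic retries. Note the substitution: instead of `tenacity`, it uses urllib3's `Retry` class mounted through `HTTPAdapter`, which is bundled with `requests`. The helper names `build_session` and `fetch_products` are hypothetical, and the retry settings are illustrative defaults rather than recommendations.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Return a Session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,                      # growing delay between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP statuses
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch_products(session: requests.Session, url: str, limit: int = 30) -> dict:
    """Fetch one page of products; raises once retries are exhausted."""
    response = session.get(url, params={"limit": limit}, timeout=10)
    response.raise_for_status()
    return response.json()


session = build_session()
print("Session configured with retries:", session.get_adapter("https://").max_retries.total)
```

In the notebook this could be used as `products_data = fetch_products(session, DUMMYJSON_URL)`, wrapped in a `try`/`except requests.RequestException` block that falls back to the backup file.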

### HW2: Due Wednesday, Oct 22 (start of class)

**Assignment:** Build a mini-pipeline similar to this notebook
- Different API/JSON source
- Normalize to 2-3 tables
- 3-5 SQL KPIs
- Validation assertions
- Data dictionary

**Instructor will present HW2 details next!**

---
