# Day 1, Block B: Window Functions Primer

**Duration:** 20 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

**Note:** This is a focused primer covering the essentials. For advanced topics like LAG(), LEAD(), and moving averages, see the **Window Functions Deep Dive** notebook.

---

## Learning Objectives

By the end of this primer, you will be able to:

1. **Explain the mental model:** Windows preserve rows; GROUP BY collapses rows
2. **Decide when to use** window functions vs GROUP BY
3. **Use ROW_NUMBER()** for "latest record per group" problems
4. **Understand basic window function syntax** (PARTITION BY, ORDER BY)

---

## 1. Introduction: The Problem Window Functions Solve

### The Challenge

You just learned GROUP BY. It's powerful:
- "Total revenue per product" ✅
- "Count of transactions per month" ✅

But GROUP BY has a **limitation:**

**Problem:** "I want to see each transaction AND the total for that product."

With GROUP BY:
- You can see the total per product (1 row per product)
- OR you can see all transactions (many rows)
- But **not both at the same time!**

GROUP BY **collapses** rows. What if you want the calculation **without collapsing**?

**Enter: Window Functions**

> **Window functions let you add calculations to your data WITHOUT collapsing rows.**

This lets you compare each row with its group total, or rank rows within a group, in a single query.
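As a minimal sketch of "each transaction AND the total for that product", the snippet below uses Python's built-in `sqlite3` module (assuming a bundled SQLite of version 3.25 or newer, which is when window functions were added) and a tiny made-up `sales` table. The table, its columns, and the `demo` connection are hypothetical stand-ins, not the Superstore data used in the rest of this notebook.

```python
import sqlite3

# Hypothetical toy table (not Superstore); a separate connection so the
# notebook's own `con` is left untouched.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (txn_id INTEGER, product TEXT, amount REAL)")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "pen", 2.0), (2, "pen", 3.0), (3, "desk", 100.0)],
)

# Every transaction row is kept; the per-product total is repeated next to it.
rows = demo.execute("""
    SELECT
        txn_id,
        product,
        amount,
        SUM(amount) OVER (PARTITION BY product) AS product_total
    FROM sales
    ORDER BY txn_id
""").fetchall()

for row in rows:
    print(row)
```

Three rows go in and three rows come out, each carrying its own amount plus the total for its product.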

---

## 2. Setup

We're using **Superstore** data because:
- Multiple orders **per customer** (great for ROW_NUMBER examples)
- 4 years of time series data
- ~10,000 rows - perfect for learning

In [1]:
# Imports
import duckdb
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported!")

✅ Libraries imported!


In [2]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

✅ Connected to DuckDB!


#### Registering Data in SQL

In [3]:
# Load Superstore data
# Note: Using encoding='latin-1' due to file encoding
superstore = pd.read_csv('../../data/day1/Sample - Superstore.csv', encoding='latin-1')

# Cast Order Date to datetime so ORDER BY sorts chronologically (not as text) and DATE_TRUNC works
superstore['Order Date'] = pd.to_datetime(superstore['Order Date'])

# Register with DuckDB
con.register('superstore', superstore)

print(f"✅ Loaded {len(superstore):,} rows!")

✅ Loaded 9,994 rows!


In [8]:
# Explore the data
con.execute("""
    SELECT 
        "Order ID",
        "Order Date",
        "Customer ID",
        "Customer Name",
        Category,
        "Product Name",
        Sales
    FROM superstore
    LIMIT 5
""").df()

Unnamed: 0,Order ID,Order Date,Customer ID,Customer Name,Category,Product Name,Sales
0,CA-2016-152156,2016-11-08,CG-12520,Claire Gute,Furniture,Bush Somerset Collection Bookcase,261.96
1,CA-2016-152156,2016-11-08,CG-12520,Claire Gute,Furniture,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,CA-2016-138688,2016-06-12,DV-13045,Darrin Van Huff,Office Supplies,Self-Adhesive Address Labels for Typewriters b...,14.62
3,US-2015-108966,2015-10-11,SO-20335,Sean O'Donnell,Furniture,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2015-108966,2015-10-11,SO-20335,Sean O'Donnell,Office Supplies,Eldon Fold 'N Roll Cart System,22.368


In [9]:
# Check date range
con.execute("""
    SELECT 
        MIN("Order Date") as first_order,
        MAX("Order Date") as last_order,
        COUNT(DISTINCT "Customer ID") as unique_customers,
        COUNT(*) as total_orders
    FROM superstore
""").df()

Unnamed: 0,first_order,last_order,unique_customers,total_orders
0,2014-01-03,2017-12-30,793,9994


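**Perfect!** ~10,000 rows across 4 years from ~800 customers. Note that each row is one line item, so a single order can span several rows (the preview above shows `CA-2016-152156` twice); `total_orders` here is really a row count. Great data for learning window functions.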
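Because one order can occupy several rows, `COUNT(*)` and the number of distinct orders can differ. The sketch below illustrates the difference on a made-up `line_items` table using Python's built-in `sqlite3`; the table, column names, and values are hypothetical and chosen only to show the idea.

```python
import sqlite3

# Hypothetical toy data: order O-1 has two line items, order O-2 has one.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE line_items (order_id TEXT, customer_id TEXT, sales REAL)")
demo.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [("O-1", "C-1", 10.0), ("O-1", "C-1", 5.0), ("O-2", "C-2", 7.5)],
)

# COUNT(*) counts rows; COUNT(DISTINCT order_id) counts orders.
n_rows, n_orders = demo.execute("""
    SELECT COUNT(*), COUNT(DISTINCT order_id)
    FROM line_items
""").fetchone()

print(f"rows: {n_rows}, distinct orders: {n_orders}")
```

The same pair of aggregates could be tried against `superstore` with `"Order ID"` to check the grain of the real data.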

---

## 3. The Mental Model: Windows vs GROUP BY

> **🚨 THIS IS THE MOST IMPORTANT CONCEPT**

### The Core Difference

| | GROUP BY | Window Functions |
|---|---|---|
| **What happens to rows?** | Collapses to summary | Keeps all rows |
| **Output row count** | Fewer rows (one per group) | Same row count as input |
| **Use when** | You want summary only | You want detail + calculation |
| **Example** | "Total sales per category" | "Each order + category total" |

Let's see this in action with real queries.

In [12]:
# ==============================================================================
# THE CRITICAL DIFFERENCE: Side-by-Side Comparison
# ==============================================================================

from IPython.display import display

print("="*70)
print("APPROACH 1: GROUP BY (Collapses Rows)")
print("="*70)
result_groupby = con.execute("""
    SELECT 
        Category,
        COUNT(*) AS order_count
    FROM superstore
    GROUP BY Category
    ORDER BY order_count DESC
""").df()

print(f"\n📊 Input: 9,994 rows")
print(f"📉 Output: {len(result_groupby)} rows (one per category)")
print(f"❌ We LOST all the details! Which products? Which customers? When?\n")
display(result_groupby)

print("\n" + "="*70)
print("APPROACH 2: WINDOW FUNCTION (Preserves Rows)")
print("="*70)
result_window = con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        Sales,
        COUNT(*) OVER (PARTITION BY Category) AS category_order_count
    FROM superstore
    LIMIT 10
""").df()

print(f"\n📊 Input: 9,994 rows")
print(f"📈 Output: 9,994 rows (all kept!)")
print(f"✅ We KEPT everything AND added the count!\n")
print("(Showing first 10 rows)\n")
display(result_window)

print("\n" + "="*70)
print("🔑 KEY INSIGHT:")
print("="*70)
print("   GROUP BY:  9,994 rows  →  3 rows      (COLLAPSED)")
print("   Window:    9,994 rows  →  9,994 rows  (PRESERVED)")
print("="*70)
print("\n💡 This is why window functions are powerful:")
print("   You get the DETAIL + the AGGREGATE in the same result!")
print("="*70)

APPROACH 1: GROUP BY (Collapses Rows)

📊 Input: 9,994 rows
📉 Output: 3 rows (one per category)
❌ We LOST all the details! Which products? Which customers? When?



Unnamed: 0,Category,order_count
0,Office Supplies,6026
1,Furniture,2121
2,Technology,1847



APPROACH 2: WINDOW FUNCTION (Preserves Rows)

📊 Input: 9,994 rows
📈 Output: 9,994 rows (all kept!)
✅ We KEPT everything AND added the count!

(Showing first 10 rows)



Unnamed: 0,Order ID,Product Name,Category,Sales,category_order_count
0,CA-2016-107104,"GE 48"" Fluorescent Tube, Cool White Energy Sav...",Furniture,595.38,2121
1,CA-2014-156160,"Computer Room Manger, 14""",Furniture,97.44,2121
2,CA-2014-156160,Office Star - Mid Back Dual function Ergonomic...,Furniture,579.528,2121
3,CA-2017-157448,Eldon Radial Chair Mat for Low to Medium Pile ...,Furniture,119.94,2121
4,CA-2017-157448,Eldon Image Series Black Desk Accessories,Furniture,12.42,2121
5,CA-2016-137393,"Executive Impressions 8-1/2"" Career Panel/Part...",Furniture,41.6,2121
6,CA-2017-122770,"Eldon Executive Woodline II Desk Accessories, ...",Furniture,201.04,2121
7,CA-2015-130183,"Atlantic Metals Mobile 5-Shelf Bookcases, Cust...",Furniture,613.9992,2121
8,CA-2016-122511,"DAX Charcoal/Nickel-Tone Document Frame, 5 x 7",Furniture,30.336,2121
9,CA-2016-161746,Office Star Flex Back Scooter Chair with Alumi...,Furniture,242.136,2121



🔑 KEY INSIGHT:
   GROUP BY:  9,994 rows  →  3 rows      (COLLAPSED)
   Window:    9,994 rows  →  9,994 rows  (PRESERVED)

💡 This is why window functions are powerful:
   You get the DETAIL + the AGGREGATE in the same result!


**See the difference?**
- All detail rows are still there!
- But we've added a new column: `category_order_count`
- Every row in "Furniture" shows the same count
- Every row in "Technology" shows its count

**The calculation happened "over a window" of rows, but we kept all rows!**
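One way to think about `COUNT(*) OVER (PARTITION BY ...)` is as "run the GROUP BY, then join the summary back onto every detail row". The sketch below checks that equivalence on a small made-up `items` table with Python's built-in `sqlite3` (assuming SQLite 3.25+); the table and names are hypothetical, and this is only an illustration of the mental model, not a description of how any engine executes the query.

```python
import sqlite3

# Hypothetical toy table: two Furniture rows, one Technology row.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE items (item_id INTEGER, category TEXT)")
demo.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "Furniture"), (2, "Furniture"), (3, "Technology")],
)

# Window version: one pass, all rows kept.
window_rows = demo.execute("""
    SELECT item_id, category,
           COUNT(*) OVER (PARTITION BY category) AS category_count
    FROM items
    ORDER BY item_id
""").fetchall()

# Manual version: collapse with GROUP BY, then join the counts back.
joined_rows = demo.execute("""
    SELECT i.item_id, i.category, g.category_count
    FROM items AS i
    JOIN (
        SELECT category, COUNT(*) AS category_count
        FROM items
        GROUP BY category
    ) AS g ON g.category = i.category
    ORDER BY i.item_id
""").fetchall()

print(window_rows == joined_rows)
```

Both queries return the same rows; the window version simply says it more directly.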

---

## 4. When to Use Each

### Decision Guide

**Use GROUP BY when:**
- ✅ You want summary only (one row per group)
- ✅ You don't need row-level details
- ✅ Example: "What's our revenue per region?" (just the totals)

**Use Window Functions when:**
- ✅ You want detail + calculation
- ✅ You need ranking (1st, 2nd, 3rd...)
- ✅ You need row-to-row comparisons ("this month vs last month")
- ✅ You need to filter AFTER calculating ("show me the top 3 per category")
- ✅ Example: "Show me all orders, with each order's rank within its category"

**Key insight:** If GROUP BY loses information you need, use window functions!

---

## 5. Basic Window Function Syntax

### Simple Example: No PARTITION or ORDER

In [13]:
# Add total order count to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        COUNT(*) OVER () AS total_orders_in_dataset
    FROM superstore
    LIMIT 5
""").df()

Unnamed: 0,Order ID,Product Name,total_orders_in_dataset
0,CA-2016-152156,Bush Somerset Collection Bookcase,9994
1,CA-2016-152156,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",9994
2,CA-2016-138688,Self-Adhesive Address Labels for Typewriters b...,9994
3,US-2015-108966,Bretford CR4500 Series Slim Rectangular Table,9994
4,US-2015-108966,Eldon Fold 'N Roll Cart System,9994


**What happened:** An empty `OVER ()` defines a single window covering the whole result, so `COUNT(*) OVER ()` counts ALL rows and adds that number to every row.
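A common use of an empty `OVER ()` is computing each row's share of a grand total. The following is a minimal sketch using Python's built-in `sqlite3` (assuming SQLite 3.25+) and a hypothetical `items` table; the names and numbers are made up for illustration.

```python
import sqlite3

# Hypothetical toy table whose sales sum to 100 for easy reading.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE items (item_id INTEGER, sales REAL)")
demo.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, 10.0), (2, 30.0), (3, 60.0)],
)

# SUM(...) OVER () is the grand total, repeated on every row,
# so it can be used directly in row-level arithmetic.
rows = demo.execute("""
    SELECT
        item_id,
        sales,
        SUM(sales) OVER () AS grand_total,
        sales / SUM(sales) OVER () AS share_of_total
    FROM items
    ORDER BY item_id
""").fetchall()

for row in rows:
    print(row)
```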

### With PARTITION BY

In [14]:
# Add count PER CATEGORY to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        COUNT(*) OVER (PARTITION BY Category) AS category_count
    FROM superstore
    LIMIT 10
""").df()

Unnamed: 0,Order ID,Product Name,Category,category_count
0,CA-2014-115812,Mitel 5320 IP Phone VoIP phone,Technology,1847
1,CA-2014-115812,Konftel 250 Conference phone - Charcoal black,Technology,1847
2,CA-2014-143336,Cisco SPA 501G IP Phone,Technology,1847
3,CA-2016-121755,Imation 8GB Mini TravelDrive USB 2.0 Flash Drive,Technology,1847
4,CA-2016-117590,GE 30524EE4,Technology,1847
5,CA-2015-117415,Plantronics HL10 Handset Lifter,Technology,1847
6,CA-2017-120999,Panasonic Kx-TS550,Technology,1847
7,CA-2016-118255,Verbatim 25 GB 6x Blu-ray Single Layer Recorda...,Technology,1847
8,CA-2016-169194,Imation 8gb Micro Traveldrive Usb 2.0 Flash Drive,Technology,1847
9,CA-2016-169194,"LF Elite 3D Dazzle Designer Hard Case Cover, L...",Technology,1847


**PARTITION BY is like GROUP BY for window functions!**
- "PARTITION BY Category" = "For each category..."
- COUNT happens within each partition
- But all rows are kept!

### Window Function Syntax Template

```sql
<function>() OVER (
    PARTITION BY group_column    -- Optional: "for each..."
    ORDER BY sort_column         -- When order matters (required for some functions)
)
```
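To make the template concrete, the sketch below puts three variants of the `OVER (...)` clause side by side: empty, `PARTITION BY` only, and `PARTITION BY` plus `ORDER BY`. It uses Python's built-in `sqlite3` (assuming SQLite 3.25+) with a hypothetical `items` table; none of these names come from the Superstore data.

```python
import sqlite3

# Hypothetical toy table: category A has two rows, category B has one.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE items (item_id INTEGER, category TEXT, amount REAL)")
demo.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "A", 5.0), (2, "A", 9.0), (3, "B", 4.0)],
)

rows = demo.execute("""
    SELECT
        item_id,
        category,
        amount,
        COUNT(*) OVER () AS n_all,                          -- whole result
        COUNT(*) OVER (PARTITION BY category) AS n_in_cat,  -- "for each category"
        ROW_NUMBER() OVER (
            PARTITION BY category
            ORDER BY amount DESC
        ) AS pos_in_cat                                     -- order matters here
    FROM items
    ORDER BY item_id
""").fetchall()

for row in rows:
    print(row)
```

Item 2 gets `pos_in_cat = 1` because it has the largest amount in category A, while the two counts do not depend on any ordering.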

---

## 6. The Most Common Use Case: ROW_NUMBER()

### The Business Problem

> **"I want the most recent order for each customer."**

This is a VERY common pattern in data analysis:
- Latest transaction per customer
- Most recent login per user
- Current status per order

### Why GROUP BY Fails

In [18]:
# Try with GROUP BY: Get latest date per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        MAX("Order Date") AS latest_order_date
    FROM superstore
    GROUP BY "Customer ID", "Customer Name"
    LIMIT 5
""").df()

Unnamed: 0,Customer ID,Customer Name,latest_order_date
0,GH-14485,Gene Hale,2016-12-08
1,LC-16930,Linda Cazamias,2017-11-19
2,RA-19885,Ruben Ausman,2017-11-17
3,JM-15265,Janet Molinari,2017-11-23
4,KM-16720,Kunst Miller,2017-12-02


**Problem:** We got the date, but we **lost the order details!**
- What was ordered?
- What category?
- Order ID?
- Sales amount?

GROUP BY collapsed everything. We need a different approach.

### Solution: ROW_NUMBER()

> **ROW_NUMBER() assigns a sequential number to each row within a group**

Strategy:
1. For each customer, rank orders by date (newest = 1)
2. Keep all the row details
3. Filter to rank = 1

### Step 1: Add Row Numbers

In [22]:
# Add row numbers
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        Sales,
        ROW_NUMBER() OVER (
            PARTITION BY "Customer ID" 
            ORDER BY "Order Date" DESC
        ) AS row_num
    FROM superstore
    LIMIT 20
""").df()

Unnamed: 0,Customer ID,Customer Name,Order ID,Order Date,Category,Sales,row_num
0,AB-10600,Ann Blume,US-2017-155425,2017-11-10,Technology,201.584,1
1,AB-10600,Ann Blume,US-2017-155425,2017-11-10,Technology,239.952,2
2,AB-10600,Ann Blume,US-2017-155425,2017-11-10,Technology,95.994,3
3,AB-10600,Ann Blume,US-2017-155425,2017-11-10,Office Supplies,38.388,4
4,AB-10600,Ann Blume,US-2017-155425,2017-11-10,Furniture,899.136,5
5,AB-10600,Ann Blume,CA-2015-158323,2015-11-30,Furniture,17.088,6
6,AB-10600,Ann Blume,CA-2015-111234,2015-02-18,Office Supplies,9.24,7
7,AB-10600,Ann Blume,CA-2014-115336,2014-11-18,Office Supplies,14.48,8
8,AH-10030,Aaron Hawkins,CA-2017-164000,2017-12-18,Office Supplies,18.704,1
9,AH-10030,Aaron Hawkins,CA-2016-162747,2016-03-20,Furniture,86.45,2


**Breaking it down:**
- `PARTITION BY "Customer ID"` = For each customer...
- `ORDER BY "Order Date" DESC` = Sort by date, newest first
- `ROW_NUMBER()` = Assign 1, 2, 3, ...

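**Result:** Each customer's rows are numbered, with 1 = most recent. Rows that share the same `"Order Date"` (such as Ann Blume's five line items from 2017-11-10) are numbered in an arbitrary order, because the `ORDER BY` has no tie-breaker.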
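When several rows share the same date, ordering by date alone does not determine which of them comes first. One way to make the numbering repeatable is to add a unique column as a second sort key. The sketch below shows this with Python's built-in `sqlite3` (assuming SQLite 3.25+); the `orders` table and its `line_id` column are hypothetical, so on real data you would substitute whatever column uniquely identifies a row.

```python
import sqlite3

# Hypothetical toy table: two line items share the same order_date.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, line_id INTEGER)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("C-1", "2017-11-10", 1),
        ("C-1", "2017-11-10", 2),
        ("C-1", "2015-02-18", 3),
    ],
)

# ISO-formatted date strings sort chronologically; line_id breaks ties,
# so the same row always receives row_num = 1.
rows = demo.execute("""
    SELECT
        customer,
        order_date,
        line_id,
        ROW_NUMBER() OVER (
            PARTITION BY customer
            ORDER BY order_date DESC, line_id ASC
        ) AS row_num
    FROM orders
    ORDER BY customer, row_num
""").fetchall()

for row in rows:
    print(row)
```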

### Step 2: Filter to Latest Only

In [23]:
# Now filter to row_num = 1 using a subquery
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        "Product Name",
        Sales
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order ID",
            "Order Date",
            Category,
            "Product Name",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num = 1
    ORDER BY "Order Date" DESC
    LIMIT 10
""").df()

Unnamed: 0,Customer ID,Customer Name,Order ID,Order Date,Category,Product Name,Sales
0,JM-15580,Jill Matthias,CA-2017-156720,2017-12-30,Office Supplies,Bagged Rubber Bands,3.024
1,PO-18865,Patrick O'Donnell,CA-2017-143259,2017-12-30,Furniture,"Bush Westfield Collection Bookcases, Fully Ass...",323.136
2,CC-12430,Chuck Clark,CA-2017-126221,2017-12-30,Office Supplies,Eureka The Boss Plus 12-Amp Hard Box Upright V...,209.3
3,EB-13975,Erica Bern,CA-2017-115427,2017-12-30,Office Supplies,"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",13.904
4,JG-15160,James Galang,CA-2017-118885,2017-12-29,Furniture,"Global High-Back Leather Tilter, Burgundy",393.568
5,KB-16600,Ken Brennan,CA-2017-158673,2017-12-29,Office Supplies,Xerox 1915,209.7
6,MC-17845,Michael Chen,US-2017-102638,2017-12-29,Office Supplies,Ideal Clamps,6.03
7,BS-11755,Bruce Stewart,CA-2017-130631,2017-12-29,Furniture,Hand-Finished Solid Wood Document Frame,68.46
8,BP-11185,Ben Peterman,CA-2017-146626,2017-12-29,Furniture,Nu-Dell Executive Frame,101.12
9,KH-16360,Katherine Hughes,US-2017-158526,2017-12-29,Furniture,DMI Arturo Collection Mission-style Design Woo...,1207.84


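**Perfect!** Now we have:
- ✅ One row per customer, from their most recent order date (if that order has several line items, only one of them is kept)
- ✅ All order details (Product, Category, Sales, etc.)
- ✅ No information lost!

**Key pattern:** Window functions add columns; they don't filter. `WHERE` is evaluated before window functions are computed, so `row_num` is not available to the `WHERE` clause of the same query. Use a **subquery** to filter on the window result.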
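To see why the subquery is needed, the sketch below first tries to use a window function directly in `WHERE`, then applies the subquery pattern. It uses Python's built-in `sqlite3` (assuming SQLite 3.25+) and a hypothetical `orders` table. SQLite rejects the first query with an error; other engines are expected to reject it too, though the exact message will differ.

```python
import sqlite3

# Hypothetical toy table: C-1 has two orders, C-2 has one.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer TEXT, order_date TEXT)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("C-1", "2017-01-05"), ("C-1", "2016-03-01"), ("C-2", "2015-07-09")],
)

# Attempt 1: window function inside WHERE. This is not allowed.
try:
    demo.execute("""
        SELECT customer, order_date
        FROM orders
        WHERE ROW_NUMBER() OVER (
            PARTITION BY customer ORDER BY order_date DESC
        ) = 1
    """)
    rejected = False
except sqlite3.Error as exc:
    rejected = True
    print("Rejected:", exc)

# Attempt 2: compute row_num in a subquery, then filter outside it.
latest = demo.execute("""
    SELECT customer, order_date
    FROM (
        SELECT
            customer,
            order_date,
            ROW_NUMBER() OVER (
                PARTITION BY customer ORDER BY order_date DESC
            ) AS row_num
        FROM orders
    )
    WHERE row_num = 1
    ORDER BY customer
""").fetchall()

print(latest)
```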

---

### ⏸️ Pause and Try!

**Your task:** Modify the query above to get the **3 most recent** order rows per customer (not just the latest).

**Requirements:**
1. Use the same ROW_NUMBER pattern
2. Change the `WHERE` filter to get top 3 instead of latest (hint: `<= 3`)
3. Keep all the same columns in the output
4. Order by Customer ID and row_num
5. Limit to 15 rows total

Replace the placeholder query in the cell below with your complete SQL query.

In [25]:
# Your turn! Write your TOP 3 query here:
# Filter to row_num <= 3 using a subquery
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        "Product Name",
        Sales
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order ID",
            "Order Date",
            Category,
            "Product Name",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num <= 3
    ORDER BY "Order Date" DESC
    LIMIT 15
""").df()



Unnamed: 0,Customer ID,Customer Name,Order ID,Order Date,Category,Product Name,Sales
0,EB-13975,Erica Bern,CA-2017-115427,2017-12-30,Office Supplies,GBC Binding covers,20.72
1,CC-12430,Chuck Clark,CA-2017-126221,2017-12-30,Office Supplies,Eureka The Boss Plus 12-Amp Hard Box Upright V...,209.3
2,EB-13975,Erica Bern,CA-2017-115427,2017-12-30,Office Supplies,"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",13.904
3,PO-18865,Patrick O'Donnell,CA-2017-143259,2017-12-30,Office Supplies,Wilson Jones Legal Size Ring Binders,52.776
4,PO-18865,Patrick O'Donnell,CA-2017-143259,2017-12-30,Furniture,"Bush Westfield Collection Bookcases, Fully Ass...",323.136
5,PO-18865,Patrick O'Donnell,CA-2017-143259,2017-12-30,Technology,Gear Head AU3700S Headset,90.93
6,JM-15580,Jill Matthias,CA-2017-156720,2017-12-30,Office Supplies,Bagged Rubber Bands,3.024
7,BS-11755,Bruce Stewart,CA-2017-130631,2017-12-29,Furniture,Hand-Finished Solid Wood Document Frame,68.46
8,KH-16360,Katherine Hughes,US-2017-158526,2017-12-29,Furniture,DMI Arturo Collection Mission-style Design Woo...,1207.84
9,JG-15160,James Galang,CA-2017-118885,2017-12-29,Furniture,"Global High-Back Leather Tilter, Burgundy",393.568


### Solution: Top 3 Orders Per Customer

In [26]:
# Get top 3 most recent orders per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order Date",
        Sales,
        row_num
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order Date",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num <= 3
    ORDER BY "Customer ID", row_num
    LIMIT 15
""").df()

Unnamed: 0,Customer ID,Customer Name,Order Date,Sales,row_num
0,AA-10315,Alex Avila,2017-06-29,362.94,1
1,AA-10315,Alex Avila,2017-06-29,11.54,2
2,AA-10315,Alex Avila,2016-03-03,3930.072,3
3,AA-10375,Allen Armold,2017-12-11,14.952,1
4,AA-10375,Allen Armold,2017-12-11,17.94,2
5,AA-10375,Allen Armold,2017-12-11,116.98,3
6,AA-10480,Andrew Allen,2017-04-15,15.552,1
7,AA-10480,Andrew Allen,2016-08-26,11.56,2
8,AA-10480,Andrew Allen,2016-08-26,8.64,3
9,AA-10645,Anna Andreadi,2017-11-05,12.96,1


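**Just change the filter!** `WHERE row_num <= 3` keeps up to 3 rows per customer. The solution also adds `row_num` to the output and sorts by `"Customer ID", row_num`, so each customer's rows appear together in rank order. It shows a shorter column list than the exercise asks for, but the pattern is the same.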

**This pattern works for:**
- Top N products per category
- Latest N transactions per account
- Most recent N logins per user
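As a sketch of how the same pattern generalizes to "top N products per category", the snippet below wraps the query in a small helper that takes `n` as a parameter. It uses Python's built-in `sqlite3` (assuming SQLite 3.25+); the `products` table, its columns, and the helper name `top_n_per_category` are all hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical toy table of product sales.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE products (category TEXT, product TEXT, sales REAL)")
demo.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Furniture", "chair", 120.0),
        ("Furniture", "desk", 300.0),
        ("Furniture", "lamp", 40.0),
        ("Technology", "phone", 500.0),
        ("Technology", "cable", 10.0),
    ],
)


def top_n_per_category(con, n):
    """Return the n highest-sales products in each category."""
    return con.execute("""
        SELECT category, product, sales, row_num
        FROM (
            SELECT
                category,
                product,
                sales,
                ROW_NUMBER() OVER (
                    PARTITION BY category
                    ORDER BY sales DESC, product  -- product name breaks ties
                ) AS row_num
            FROM products
        )
        WHERE row_num <= ?
        ORDER BY category, row_num
    """, (n,)).fetchall()


top2 = top_n_per_category(demo, 2)
for row in top2:
    print(row)
```

Only the `PARTITION BY` column, the `ORDER BY` column, and the value of `n` change from one "top N per group" question to the next.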

---

## 7. Summary: What You Learned

### Key Concepts

1. ✅ **Windows preserve rows, GROUP BY collapses**
   - GROUP BY: 9,994 rows → 3 rows (summary)
   - Windows: 9,994 rows → 9,994 rows (detail + calculation)

2. ✅ **Use PARTITION BY for groups** (like GROUP BY for windows)
   - `PARTITION BY Category` = "For each category..."
   - But all rows are kept!

3. ✅ **Use ORDER BY when order matters**
   - Needed for ROW_NUMBER() to know what "first" means (without it, the numbering is arbitrary)
   - `ORDER BY date DESC` = Newest first

4. ✅ **ROW_NUMBER() for "latest/top N per group"**
   - Add row numbers with PARTITION BY + ORDER BY
   - Filter with subquery: `WHERE row_num = 1`

5. ✅ **Pattern for filtering:** Use subquery
   - Window functions ADD columns
   - To FILTER on those columns, wrap in subquery

### Syntax You Know

```sql
-- Basic window
COUNT(*) OVER (PARTITION BY category)

-- ROW_NUMBER for ranking
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date DESC)

-- Filter pattern (subquery)
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (...) AS row_num
    FROM my_table  -- placeholder table name
)
WHERE row_num = 1
```

### Ready for More?

This primer covered the essentials. If you want to learn:
- **LAG()/LEAD()** for period-over-period comparisons ("this month vs last month")
- **Moving averages** with ROWS BETWEEN frame specifications
- **Advanced patterns** and edge cases

→ See the **Window Functions Deep Dive** notebook!

### You're Ready for HW1! 🎉

You now know:
- SELECT, WHERE, ORDER BY
- GROUP BY, HAVING
- Window functions with ROW_NUMBER()

That's everything you need for Homework 1. Practice these patterns - you'll use them constantly in real data work!

---