# Homework 1: SQL Foundations with DuckDB

**Name:** [Your Name Here]  
**Due:** Day 2, Start of Class  
**Total Points:** 100 (+ 10 bonus)

---

## Instructions

1. Complete all TODO sections below
2. Write SQL queries to answer each question
3. Add markdown explanations where requested
4. Before submitting: **Kernel → Restart & Run All Cells**
5. Verify all outputs are visible
6. Rename file to `hw1_[your_name].ipynb`

**Read the README.md for full assignment details, rubric, and tips!**

---

## Setup

Run these cells to set up your environment.

In [2]:
# Install DuckDB (if not already installed)
# !pip install duckdb -q

In [3]:
# Import libraries
import duckdb
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [4]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

✅ Connected to DuckDB!


In [5]:
# Load the Online Retail dataset
# This creates a table called 'retail' that you'll query
con.execute("""
    CREATE TABLE retail AS 
    SELECT * FROM 'data/online_retail_hw1.csv'
""")

print("✅ Dataset loaded!")

✅ Dataset loaded!


### Dataset Exploration

Let's explore the data before starting the assignment.

In [6]:
# Check row count
con.execute("SELECT COUNT(*) as total_rows FROM retail").df()


Unnamed: 0,total_rows
0,525461


In [7]:
# View first few rows
con.execute("SELECT * FROM retail LIMIT 5").df()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [8]:
# Check for NULL values in each column
con.execute("""
    SELECT 
        COUNT(*) - COUNT(Invoice) AS invoice_nulls,
        COUNT(*) - COUNT(StockCode) AS stockcode_nulls,
        COUNT(*) - COUNT(Description) AS description_nulls,
        COUNT(*) - COUNT(Quantity) AS quantity_nulls,
        COUNT(*) - COUNT(InvoiceDate) AS date_nulls,
        COUNT(*) - COUNT(Price) AS price_nulls,
        COUNT(*) - COUNT("Customer ID") AS customerid_nulls,
        COUNT(*) - COUNT(Country) AS country_nulls
    FROM retail
""").df()

Unnamed: 0,invoice_nulls,stockcode_nulls,description_nulls,quantity_nulls,date_nulls,price_nulls,customerid_nulls,country_nulls
0,0,0,2928,0,0,0,107927,0


In [9]:
# Check date range
con.execute("""
    SELECT 
        MIN(InvoiceDate) as first_transaction,
        MAX(InvoiceDate) as last_transaction
    FROM retail
""").df()

Unnamed: 0,first_transaction,last_transaction
0,2009-12-01 07:45:00,2010-12-09 20:01:00


**Good!** Now you know:
- Total row count
- Which columns have NULLs (Customer ID and Description)
- Date range covered

Keep this in mind as you write queries!

---

## Part 1: Basic Queries (30 points)

This section tests: SELECT, WHERE, ORDER BY, NULL handling, LIKE

### Question 1.1: Guest Checkouts (8 points)

**Business question:** How many transactions were guest checkouts (no Customer ID)?

**Requirements:**
- Count transactions where Customer ID is NULL
- Also calculate what percentage of total transactions this represents
- Your result should have two columns: `guest_transactions` and `pct_of_total`

**Hint:** Remember to use `IS NULL`, not `= NULL`!
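
The hint above can be checked on a toy table. This is a minimal sketch, not part of the assignment: the `orders` table and its values are made up, and it uses Python's built-in `sqlite3` instead of DuckDB so it runs without the homework data. NULL comparison semantics are standard SQL, so the same behaviour is expected in DuckDB.

```python
import sqlite3

# Toy stand-in for the retail table (hypothetical data):
# two rows with a customer ID, two guest checkouts without one.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (invoice TEXT, customer_id REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A1", 13085.0), ("A2", None), ("A3", None), ("A4", 17530.0)],
)

# "= NULL" evaluates to NULL (unknown) for every row, so no row passes the filter.
equals_null = demo.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id = NULL"
).fetchone()[0]

# "IS NULL" is the predicate that actually tests for missing values.
is_null = demo.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]

print(equals_null, is_null)  # 0 2
```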

In [10]:
# TODO: Write your query here
con.execute("""
    SELECT
        COUNT(*) AS guest_transactions,
        ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM retail), 2) AS pct_of_total
    FROM retail
    WHERE "Customer ID" IS NULL
""").df()

Unnamed: 0,guest_transactions,pct_of_total
0,107927,20.54


**TODO: Explain your result in 1-2 sentences:**
The query counts the rows where `Customer ID` IS NULL (guest checkouts), then divides that count by the total number of rows in `retail` and multiplies by 100 to get a percentage, rounded to 2 decimal places.

The result has two columns: `guest_transactions` (107,927 rows) and `pct_of_total` (20.54%), where the percentage is computed as:
> ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM retail), 2) AS pct_of_total

---

### Question 1.2: High-Value Transactions (7 points)

**Business question:** Show the top 20 highest-value transactions (revenue = Quantity * Price).

**Requirements:**
- Calculate revenue as Quantity * Price
- Show: Invoice, Description, Quantity, Price, Revenue
- Only include rows where Quantity and Price are both positive
- Filter out NULL values appropriately
- Sort by revenue descending
- Limit to top 20

**Hint:** Use calculated column with AS to name it `revenue`

In [11]:
# TODO: Write your query here
con.execute("""
    SELECT Quantity, Price, Invoice, Description, Quantity * Price AS Revenue
    FROM retail
    WHERE Quantity > 0
        AND Price > 0
        AND Revenue IS NOT NULL
    ORDER BY Revenue DESC
    LIMIT 20
""").df()


Unnamed: 0,Quantity,Price,Invoice,Description,Revenue
0,1,25111.09,512771,Manual,25111.09
1,9360,1.69,530715,ROTATING SILVER ANGELS T-LIGHT HLDR,15818.4
2,1,13541.33,537632,AMAZON FEE,13541.33
3,1,10953.5,502263,Manual,10953.5
4,1,10953.5,502265,Manual,10953.5
5,1,10468.8,525399,Manual,10468.8
6,1,10468.8,522796,Manual,10468.8
7,1,10468.8,524159,Manual,10468.8
8,1,8985.6,496115,Manual,8985.6
9,3500,2.55,511465,PINK PAPER PARASOL,8925.0


---

### Question 1.3: Product Search (7 points)
**Business question:** Find all products with "CHRISTMAS" in the description.

**Requirements:**
- Use LIKE with wildcard pattern matching
- Show: StockCode, Description
- Get distinct products only (no duplicates)
- Sort alphabetically by Description
- Limit to first 15 results

**Hint:** In DuckDB, `LIKE` is case-sensitive; use `ILIKE` if you need a case-insensitive match

In [12]:
# TODO: Write your query here
con.execute("""
    SELECT DISTINCT StockCode, Description
    FROM retail
    WHERE Description LIKE '%CHRISTMAS%'
    ORDER BY Description ASC
    LIMIT 15      
""").df()


Unnamed: 0,StockCode,Description
0,35962,12 ASS ZINC CHRISTMAS DECORATIONS
1,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS
2,72815,3 WICK CHRISTMAS BRIAR CANDLE
3,22950,36 DOILIES VINTAGE CHRISTMAS
4,22731,3D CHRISTMAS STAMPS STICKERS
5,22731,3D STICKERS CHRISTMAS STAMPS
6,22733,3D STICKERS TRADITIONAL CHRISTMAS
7,22732,3D STICKERS VINTAGE CHRISTMAS
8,22733,3D TRADITIONAL CHRISTMAS STICKERS
9,22732,3D VINTAGE CHRISTMAS STICKERS


---

### Question 1.4: Multi-Country Orders (8 points)

**Business question:** Show transactions from France, Germany, or Spain, with quantity greater than 10.

**Requirements:**
- Use IN operator for country filtering
- Filter for Quantity > 10
- Show: Invoice, Country, Description, Quantity, Price
- Sort by Country, then Quantity descending
- Limit to 25 rows
- Handle NULLs appropriately

**Hint:** Combine IN with AND for multiple conditions

In [13]:
# TODO: Write your query here
con.execute("""
    SELECT Invoice, Country, Description, Quantity, Price
    FROM retail
    WHERE Country IN ('France', 'Germany', 'Spain')
        AND Quantity > 10
        AND Price IS NOT NULL
        AND Invoice IS NOT NULL
    ORDER BY Country ASC, Quantity DESC
    LIMIT 25
            """).df()

Unnamed: 0,Invoice,Country,Description,Quantity,Price
0,518505,France,SET/6 FRUIT SALAD PAPER CUPS,7128,0.08
1,518505,France,SET/6 FRUIT SALAD PAPER PLATES,7008,0.13
2,518505,France,POP ART PEN CASE & PENS,5184,0.08
3,518505,France,MULTICOLOUR SPRING FLOWER MUG,4992,0.1
4,518505,France,BLACK SILVER FLOWER T-LIGHT HOLDER,4752,0.07
5,518505,France,TEATIME PEN CASE & PENS,4608,0.08
6,518505,France,WHITE BIRD GARDEN DESIGN MUG,4320,0.13
7,518505,France,S/4 BLUE ROUND DECOUPAGE BOXES,3936,0.42
8,518505,France,THE KING GIFT BAG,3744,0.05
9,518505,France,RED SPOTTY PUDDING BOWL,3648,0.13


**TODO: Why did you include (or not include) NULL checks in this query?**
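The `Country IN (...)` and `Quantity > 10` filters already drop rows where Country or Quantity is NULL, because a comparison with NULL is never true. I therefore only added explicit `IS NOT NULL` checks for Price and Invoice, which are not otherwise filtered.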
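
To illustrate how comparisons and `IN` treat NULLs, here is a small sketch on a made-up `orders` table (hypothetical rows, not the retail data). It uses Python's built-in `sqlite3` rather than DuckDB so it is self-contained; the NULL handling shown is standard SQL and is expected to carry over.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (invoice TEXT, country TEXT, quantity INTEGER)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A1", "France", 12),
        ("A2", "France", None),  # NULL quantity: NULL > 10 is unknown, row dropped
        ("A3", None, 50),        # NULL country: NULL IN (...) is unknown, row dropped
        ("A4", "Spain", 5),      # fails the quantity filter
    ],
)

# No explicit IS NOT NULL checks on country or quantity are needed here.
rows = demo.execute(
    """
    SELECT invoice
    FROM orders
    WHERE country IN ('France', 'Germany', 'Spain')
      AND quantity > 10
    """
).fetchall()

print(rows)  # [('A1',)]
```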

---

## Part 2: Aggregations (40 points)

This section tests: COUNT, SUM, AVG, GROUP BY, HAVING, WHERE vs HAVING

### Question 2.1: Revenue by Country (10 points)

**Business question:** What's our total revenue and transaction count for each country?

**Requirements:**
- Calculate total revenue (Quantity * Price) per country
- Count transactions per country
- Only include positive quantities and non-NULL prices
- Show: Country, total_revenue, transaction_count
- Sort by total_revenue descending
- Show all countries

**Hint:** Use SUM() and COUNT() with GROUP BY

In [14]:
# TODO: Write your query here
con.execute("""
    SELECT Country,
            ROUND(SUM(Quantity * Price), 3) AS total_revenue,
            COUNT(*) as country_transactions
    FROM retail
    WHERE Quantity > 0
        AND Price IS NOT NULL
    GROUP BY Country
    ORDER BY total_revenue DESC
""").df()


Unnamed: 0,Country,total_revenue,country_transactions
0,United Kingdom,8709577.243,474938
1,EIRE,380977.82,9460
2,Netherlands,268786.0,2730
3,Germany,202395.321,7661
4,France,147211.49,5532
5,Sweden,53525.39,887
6,Denmark,50906.85,418
7,Spain,47601.42,1235
8,Switzerland,43921.39,1170
9,Australia,31446.8,630


**TODO: Which country generates the most revenue? Does this surprise you?**

The UK generates the most revenue, which is not surprising given that this is a UK-based gift shop :)

---

### Question 2.2: Popular Products (10 points)

**Business question:** Which products have been ordered more than 1,000 times?

**Requirements:**
- Group by StockCode and Description
- Count how many times each product appears
- Calculate total quantity sold for each product
- Filter to products with MORE than 1,000 transactions (use HAVING!)
- Show: StockCode, Description, transaction_count, total_quantity_sold
- Sort by transaction_count descending

**Hint:** This requires HAVING, not WHERE, because you're filtering on an aggregate

In [15]:
# TODO: Write your query here
con.execute("""
    SELECT StockCode, 
            COUNT(*) AS transaction_count, 
            ROUND(SUM(Quantity * Price), 3) AS total_quantity_sold, 
            Description
    FROM retail
    WHERE Description IS NOT NULL
            AND Quantity > 0
    GROUP BY "StockCode", "Description"
    HAVING COUNT(*) > 1000
    ORDER BY transaction_count DESC
""").df()        


Unnamed: 0,StockCode,transaction_count,total_quantity_sold,Description
0,85123A,3422,158590.87,WHITE HANGING HEART T-LIGHT HOLDER
1,22423,2046,170078.51,REGENCY CAKESTAND 3 TIER
2,21232,1714,34496.68,STRAWBERRY CERAMIC TRINKET BOX
3,21212,1456,24069.28,PACK OF 72 RETRO SPOT CAKE CASES
4,84879,1450,73092.99,ASSORTED COLOUR BIRD ORNAMENT
5,84991,1394,18220.5,60 TEATIME FAIRY CAKE CASES
6,21754,1376,31029.77,HOME BUILDING BLOCK WORD
7,85099B,1255,54483.87,JUMBO BAG RED RETROSPOT
8,20725,1246,29866.97,LUNCH BAG RED SPOTTY
9,21034,1226,2185.95,REX CASH+CARRY JUMBO SHOPPER


**TODO: Explain why you used HAVING instead of WHERE for the >1000 filter:**

We used HAVING instead of WHERE because the condition is on `transaction_count`, an aggregate (`COUNT(*)`) that only exists after the rows have been grouped. WHERE is evaluated before GROUP BY, so it cannot filter on an aggregate; HAVING is evaluated after grouping, so it can.
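
The difference between the two filters can be seen on a toy table. This is a sketch under stated assumptions: the `sales` table and its values are invented, the threshold is lowered to `> 1` so the tiny data set produces output, and Python's built-in `sqlite3` stands in for DuckDB. Here `total_quantity_sold` is taken as `SUM(quantity)`.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (stock_code TEXT, quantity INTEGER)")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 5), ("A", 3), ("A", -2), ("B", 10), ("C", 1), ("C", 4)],
)

rows = demo.execute(
    """
    SELECT stock_code,
           COUNT(*)      AS transaction_count,
           SUM(quantity) AS total_quantity_sold
    FROM sales
    WHERE quantity > 0          -- row-level filter, applied before grouping
    GROUP BY stock_code
    HAVING COUNT(*) > 1         -- group-level filter, applied after aggregation
    ORDER BY transaction_count DESC, stock_code
    """
).fetchall()

# The return row (A, -2) is removed by WHERE; product B is removed by HAVING.
print(rows)  # [('A', 2, 8), ('C', 2, 5)]
```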

---

### Question 2.3: High-Value Customers (10 points)

**Business question:** Which customers have spent more than £5,000 total?

**Requirements:**
- Calculate total spending (SUM of Quantity * Price) per customer
- Count their number of transactions
- Only include customers with Customer ID (exclude guest checkouts)
- Only include positive quantities and prices
- Filter to customers with total spending > 5000
- Show: Customer ID, total_spent, transaction_count
- Sort by total_spent descending

**Hint:** Use WHERE for row-level filtering (NULLs, positive values) and HAVING for aggregate filtering (>5000)

In [16]:
# TODO: Write your query here
con.execute("""
    SELECT 
        "Customer ID",
        ROUND(SUM("Quantity" * "Price"), 2) AS total_spent,
        COUNT(*) AS transaction_count
    FROM retail
    WHERE "Customer ID" IS NOT NULL
      AND "Quantity" > 0
      AND "Price" > 0
    GROUP BY "Customer ID"
    HAVING total_spent > 5000
    ORDER BY total_spent DESC
""").df()

Unnamed: 0,Customer ID,total_spent,transaction_count
0,18102.0,349164.35,627
1,14646.0,248396.50,1773
2,14156.0,196566.74,2648
3,14911.0,152147.57,5570
4,13694.0,131443.19,957
...,...,...,...
282,12474.0,5048.66,286
283,16186.0,5019.17,307
284,13599.0,5013.96,163
285,13869.0,5006.62,553


---

### Question 2.4: Monthly Revenue Trend (10 points)

**Business question:** What's our revenue and transaction count by month?

**Requirements:**
- Extract month from InvoiceDate (use DATE_TRUNC('month', InvoiceDate))
- Calculate total revenue per month
- Count transactions per month
- Calculate average transaction value per month
- Only include positive quantities and prices
- Show: month, total_revenue, transaction_count, avg_transaction_value
- Sort by month chronologically

**Hint:** DATE_TRUNC('month', date_column) gives you the first day of each month

In [17]:
# TODO: Write your query here
(con.execute("""
    SELECT 
        DATE_TRUNC('month', InvoiceDate) AS month,
        ROUND(SUM(Quantity * Price), 2) AS total_revenue,
        COUNT(*) AS transaction_count,
        total_revenue / COUNT(*) AS avg_transaction_value
    FROM retail
    WHERE "Quantity" > 0
      AND "Price" > 0
    GROUP BY month
    ORDER BY month ASC
""").df())

Unnamed: 0,month,total_revenue,transaction_count,avg_transaction_value
0,2009-12-01,825685.76,43957,18.783942
1,2010-01-01,652708.5,30638,21.303887
2,2010-02-01,553713.31,28282,19.578294
3,2010-03-01,833570.13,40364,20.651326
4,2010-04-01,681528.99,33268,20.486022
5,2010-05-01,659858.86,33795,19.52534
6,2010-06-01,752270.14,38900,19.338564
7,2010-07-01,650712.94,32503,20.020089
8,2010-08-01,697274.91,32473,21.472451
9,2010-09-01,924333.01,41109,22.484931


**TODO: Do you see any seasonal patterns in the revenue?**
`avg_transaction_value` stays fairly consistent across the months. Revenue, however, shows some seasonality: it peaks in November and December (late autumn and early winter), drops after the holidays, and fluctuates mid-year.

---

## Part 3: Window Functions (30 points)

This section tests: ROW_NUMBER, LAG, moving averages

In [18]:
# View all columns with SQL
con.execute("""
    SELECT *
    FROM retail
""").df()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.10,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
...,...,...,...,...,...,...,...,...
525456,538171,22271,FELTCRAFT DOLL ROSIE,2,2010-12-09 20:01:00,2.95,17530.0,United Kingdom
525457,538171,22750,FELTCRAFT PRINCESS LOLA DOLL,1,2010-12-09 20:01:00,3.75,17530.0,United Kingdom
525458,538171,22751,FELTCRAFT PRINCESS OLIVIA DOLL,1,2010-12-09 20:01:00,3.75,17530.0,United Kingdom
525459,538171,20970,PINK FLORAL FELTCRAFT SHOULDER BAG,2,2010-12-09 20:01:00,3.75,17530.0,United Kingdom


### Question 3.1: Latest Purchase Per Customer (8 points)

**Business question:** What was each customer's most recent purchase?

**Requirements:**
- Use ROW_NUMBER() to rank transactions per customer by date
- Partition by Customer ID
- Order by InvoiceDate descending (most recent first)
- Filter to only the most recent transaction (row_num = 1)
- Only include customers with Customer ID (no guest checkouts)
- Show: Customer ID, Invoice, InvoiceDate, Description, Quantity, Price
- Sort by InvoiceDate descending
- Show first 20 customers

**Hint:** You'll need a subquery - use ROW_NUMBER() in inner query, filter in outer query

In [19]:
# TODO: Write your query here
# Structure: 
# SELECT ... FROM (
#     SELECT ..., ROW_NUMBER() OVER (...) as row_num
#     FROM retail
# )
# WHERE row_num = 1

con.execute("""
    SELECT 
        "Customer ID",
        Invoice,
        InvoiceDate,
        Description,
        Quantity,
        Price
    FROM (
        SELECT 
            "Customer ID",
            Invoice,
            InvoiceDate,
            Description,
            Quantity,
            Price,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID"
                ORDER BY InvoiceDate DESC
            ) AS row_num
        FROM retail
        WHERE "Customer ID" IS NOT NULL
    ) AS ranked_transactions
    WHERE row_num = 1
    ORDER BY InvoiceDate DESC
    LIMIT 20
""").df()

Unnamed: 0,Customer ID,Invoice,InvoiceDate,Description,Quantity,Price
0,17530.0,538171,2010-12-09 20:01:00,PACK OF 60 DINOSAUR CAKE CASES,2,0.55
1,13969.0,538170,2010-12-09 19:32:00,JAM MAKING SET PRINTED,4,1.45
2,13230.0,538169,2010-12-09 19:28:00,HEART DECORATION WITH PEARLS,2,0.85
3,14702.0,538168,2010-12-09 19:23:00,RIBBON REEL LACE DESIGN,5,2.1
4,14713.0,538167,2010-12-09 18:58:00,SET OF 4 NAPKIN CHARMS STARS,3,2.55
5,17965.0,538166,2010-12-09 18:09:00,CHOCOLATE HOT WATER BOTTLE,1,4.95
6,14031.0,538165,2010-12-09 17:34:00,SOLDIERS EGG CUP,72,1.25
7,17841.0,538163,2010-12-09 17:27:00,CIRCUS PARADE LUNCH BOX,1,1.95
8,17576.0,538157,2010-12-09 16:57:00,PENCIL CASE LIFE IS BEAUTIFUL,5,2.95
9,15555.0,538156,2010-12-09 16:53:00,6 RIBBONS ELEGANT CHRISTMAS,8,1.65


**TODO: Why did you use a window function instead of GROUP BY for this question?**

GROUP BY collapses each customer's rows into a single summary row, so it could give me the latest InvoiceDate per customer but not the other columns (Invoice, Description, Quantity, Price) of that transaction. A window function keeps every row and adds a per-customer rank, which makes it easy to filter down to the most recent transaction.
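
The "rank, then filter" pattern can be sketched on a made-up `orders` table (hypothetical customers and dates), again using Python's built-in `sqlite3` in place of DuckDB so the example is self-contained. Note that when several rows of one customer share the same timestamp, `ROW_NUMBER()` picks one of them arbitrarily unless a tie-breaker is added to the `ORDER BY`.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE orders ("
    "customer_id INTEGER, invoice TEXT, invoice_date TEXT, description TEXT)"
)
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "A1", "2010-01-05", "MUG"),
        (1, "A2", "2010-03-01", "CANDLE"),
        (2, "B1", "2010-02-10", "BAG"),
        (2, "B2", "2010-02-01", "CLOCK"),
    ],
)

latest = demo.execute(
    """
    SELECT customer_id, invoice, invoice_date, description
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id     -- restart numbering per customer
                   ORDER BY invoice_date DESC   -- most recent row gets 1
               ) AS row_num
        FROM orders
    ) AS ranked
    WHERE row_num = 1
    ORDER BY customer_id
    """
).fetchall()

print(latest)
# [(1, 'A2', '2010-03-01', 'CANDLE'), (2, 'B1', '2010-02-10', 'BAG')]
```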

---

### Question 3.2: Week-over-Week Revenue Change (12 points)

**Business question:** How is our weekly revenue changing week-over-week?

**Requirements:**
- First, aggregate to weekly level (use DATE_TRUNC('week', InvoiceDate))
- Calculate total revenue per week
- Use LAG() to get previous week's revenue
- Calculate the change (current week - previous week)
- Calculate percent change
- Only include positive quantities and prices
- Show: week, weekly_revenue, prev_week_revenue, revenue_change, pct_change
- Sort by week chronologically
- Show all weeks

**Hint:** Build this incrementally - first get weekly totals, then add LAG, then calculate changes

In [20]:
# TODO: Write your query here
# Consider using a WITH clause (CTE) to make it cleaner:
# WITH weekly AS (
#     SELECT ... GROUP BY week
# )
# SELECT ..., LAG(...) OVER (ORDER BY week) FROM weekly

con.execute("""
WITH weekly AS (
    SELECT 
        DATE_TRUNC('week', "InvoiceDate") AS week,
        SUM(Quantity * Price) AS weekly_revenue
    FROM retail
    WHERE Quantity > 0 AND Price > 0
    GROUP BY week
)
SELECT 
    week,
    weekly_revenue,
    LAG(weekly_revenue) OVER (ORDER BY week) AS prev_week_revenue,
    (weekly_revenue - LAG(weekly_revenue) OVER (ORDER BY week)) AS revenue_change,
    ((weekly_revenue - LAG(weekly_revenue) OVER (ORDER BY week)) / NULLIF(LAG(weekly_revenue) OVER (ORDER BY week), 0)) * 100 AS pct_change
FROM weekly
ORDER BY revenue_change DESC;
""").df()


Unnamed: 0,week,weekly_revenue,prev_week_revenue,revenue_change,pct_change
0,2010-01-04,168520.11,55262.37,113257.74,204.945499
1,2010-06-07,220415.75,108579.82,111835.93,102.998817
2,2010-11-08,382123.581,288871.07,93252.511,32.281707
3,2010-09-27,333403.36,246229.42,87173.94,35.403544
4,2010-02-15,175148.652,91110.61,84038.042,92.237383
5,2010-11-01,288871.07,220689.65,68181.42,30.894707
6,2010-09-20,246229.42,179573.591,66655.829,37.118949
7,2010-06-28,181541.81,127041.53,54500.28,42.899578
8,2010-08-09,180134.32,131419.35,48714.97,37.068339
9,2010-03-15,190300.321,142176.34,48123.981,33.848094


**TODO: Which week had the biggest increase in revenue? What might explain this?**
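After sorting the query above by `revenue_change` descending instead of by week, the week with the biggest increase is the week starting 2010-01-04, the first full calendar week of 2010: revenue rose by 113,257.74 (about 205%) over the previous week. Part of this jump reflects a low baseline, since the previous week in the data brought in only 55,262.37, presumably because of the holiday period, after which customers appear to have resumed shopping in the new year.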
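
The incremental approach (weekly totals, then `LAG`, then the changes) can be sketched as follows. The `weekly` table and its three revenue values are made up for illustration, and Python's built-in `sqlite3` is used instead of DuckDB. Computing `LAG` once in a CTE avoids repeating the window expression in every column.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE weekly (week TEXT, weekly_revenue REAL)")
demo.executemany(
    "INSERT INTO weekly VALUES (?, ?)",
    [("2010-01-04", 100.0), ("2010-01-11", 150.0), ("2010-01-18", 120.0)],
)

changes = demo.execute(
    """
    WITH lagged AS (
        SELECT week,
               weekly_revenue,
               LAG(weekly_revenue) OVER (ORDER BY week) AS prev_week_revenue
        FROM weekly
    )
    SELECT week,
           weekly_revenue,
           prev_week_revenue,
           weekly_revenue - prev_week_revenue AS revenue_change,
           -- NULLIF guards against division by zero
           100.0 * (weekly_revenue - prev_week_revenue)
               / NULLIF(prev_week_revenue, 0) AS pct_change
    FROM lagged
    ORDER BY week
    """
).fetchall()

for row in changes:
    print(row)
# ('2010-01-04', 100.0, None, None, None)   <- first week has no previous week
# ('2010-01-11', 150.0, 100.0, 50.0, 50.0)
# ('2010-01-18', 120.0, 150.0, -30.0, -20.0)
```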


---

### Question 3.3: 7-Day Moving Average (10 points)

**Business question:** What's the 7-day moving average of daily revenue?

**Requirements:**
- First, aggregate to daily level (DATE_TRUNC('day', InvoiceDate) or just InvoiceDate::DATE)
- Calculate total revenue per day
- Use window function with ROWS BETWEEN to calculate 7-day moving average
- The moving average should include current day + 6 days before
- Only include positive quantities and prices
- Show: date, daily_revenue, moving_avg_7day
- Sort by date
- Show first 30 days

**Hint:** ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you 7 days total

In [21]:
# TODO: Write your query here
con.execute("""
WITH daily_revenue AS (
    SELECT 
        DATE_TRUNC('day', "InvoiceDate") AS date,
        SUM(Quantity * Price) AS daily_revenue
    FROM retail
    WHERE Quantity > 0 AND Price > 0
    GROUP BY date
)
SELECT 
    date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7day
FROM daily_revenue
ORDER BY date
LIMIT 30;


""").df()

Unnamed: 0,date,daily_revenue,moving_avg_7day
0,2009-12-01,54513.5,54513.5
1,2009-12-02,63352.51,58933.005
2,2009-12-03,74037.91,63967.973333
3,2009-12-04,40732.92,58159.21
4,2009-12-05,9803.05,48487.978
5,2009-12-06,24613.64,44508.921667
6,2009-12-07,45083.35,44590.982857
7,2009-12-08,49517.23,43877.23
8,2009-12-09,40616.09,40629.17
9,2009-12-10,44442.11,36401.198571


**TODO: Why is a moving average useful for analyzing daily revenue?**
A moving average smooths out day-to-day fluctuations in sales. Here, each value of the 7-day moving average is the mean revenue over the current day and the six preceding days in the data, which makes the underlying trend easier to see and reduces the impact of daily volatility.
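
The window frame can be illustrated with a shorter window on toy data. This sketch assumes a made-up `daily` table with four days of revenue and a 3-row window instead of 7, and uses Python's built-in `sqlite3` rather than DuckDB. Two things are worth noticing: the first rows average over fewer values because fewer preceding rows exist, and `ROWS BETWEEN` counts rows rather than calendar days, so a day with no sales (and hence no row) is skipped, not treated as zero.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE daily (day TEXT, daily_revenue REAL)")
demo.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [
        ("2010-01-01", 10.0),
        ("2010-01-02", 20.0),
        ("2010-01-03", 30.0),
        ("2010-01-04", 40.0),
    ],
)

moving = demo.execute(
    """
    SELECT day,
           daily_revenue,
           AVG(daily_revenue) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- current row + 2 before
           ) AS moving_avg_3day
    FROM daily
    ORDER BY day
    """
).fetchall()

print([avg for _, _, avg in moving])  # [10.0, 15.0, 20.0, 30.0]
```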

---

## Bonus Question (10 points)

This tests: Synthesis of multiple concepts (window functions + GROUP BY)

### Bonus: Top Product Per Country (10 points)

**Business question:** What's the #1 best-selling product (by revenue) in each country?

**Requirements:**
- Calculate total revenue per product per country
- Rank products within each country by revenue
- Show only the #1 product for each country
- Only include positive quantities and prices
- Show: Country, StockCode, Description, total_revenue, rank
- Sort by Country

**Strategy:**
1. First: GROUP BY country and product to get revenue per product per country
2. Then: Use ROW_NUMBER() to rank products within each country
3. Finally: Filter to rank = 1

**Hint:** This combines aggregation (GROUP BY) with window functions (ROW_NUMBER)

In [22]:
# TODO: Write your query here (BONUS)
# This is challenging! Break it into steps:
# 1. Inner query: GROUP BY country and product
# 2. Middle query: Add ROW_NUMBER() partitioned by country
# 3. Outer query: Filter to row_num = 1

con.execute("""
    WITH product_revenue AS (
        SELECT
            Country,
            StockCode, 
            Description, 
            ROUND(SUM(Price * Quantity), 2) AS total_revenue
        FROM retail
        WHERE Quantity > 0 AND Price > 0
        GROUP BY Country, StockCode, Description
    ),
    ranked_products AS (
        SELECT 
            Country, 
            StockCode, 
            Description, 
            total_revenue, 
            ROW_NUMBER() OVER (
                PARTITION BY Country
                ORDER BY total_revenue DESC
            ) AS country_rank
        FROM product_revenue
    )
    SELECT 
        Country, 
        StockCode, 
        Description, 
        total_revenue
    FROM ranked_products
    WHERE country_rank = 1
        ORDER BY Country
""").df()

Unnamed: 0,Country,StockCode,Description,total_revenue
0,Australia,M,Manual,1133.45
1,Austria,POST,POSTAGE,1600.0
2,Bahrain,18097C,WHITE TALL PORCELAIN T-LIGHT HOLDER,202.5
3,Belgium,POST,POSTAGE,2653.0
4,Bermuda,84568,GIRLS ALPHABET IRON ON PATCHES,241.92
5,Brazil,20839,FRENCH PAISLEY CUSHION COVER,17.7
6,Canada,21584,RETROSPOT SMALL TUBE MATCHES,33.0
7,Channel Islands,51008,AFGHAN SLIPPER SOCK PAIR,1770.0
8,Cyprus,22423,REGENCY CAKESTAND 3 TIER,567.15
9,Denmark,85220,SMALL FAIRY CAKE FRIDGE MAGNETS,6467.6


**TODO: (Bonus) Explain your approach to this question:**
After carefully reading the business question and strategy, my first step was to create the `product_revenue` CTE, where I selected the country and product columns, calculated `total_revenue`, and grouped by country and the product identifiers.
The second step was to create (from `product_revenue`) a ranking of products within each country, ordered by total revenue, stored as `country_rank`.
Finally, I selected the country and product columns, filtering to the best-selling product (`country_rank = 1`).

I also rounded revenue to two decimal places for legibility.
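
The aggregate-then-rank strategy can be sketched end to end on a toy table. The `sales` table and its rows are hypothetical, and Python's built-in `sqlite3` stands in for DuckDB. Unlike the query above, this sketch also returns the rank column and adds `stock_code` as a tie-breaker so the result is deterministic when two products have equal revenue; both are illustrative choices, not requirements of the assignment.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE sales (country TEXT, stock_code TEXT, quantity INTEGER, price REAL)"
)
demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("France", "A", 2, 5.0),
        ("France", "A", 1, 5.0),
        ("France", "B", 1, 20.0),
        ("Spain", "A", 4, 5.0),
        ("Spain", "C", 1, 8.0),
    ],
)

top_products = demo.execute(
    """
    WITH product_revenue AS (
        -- Step 1: one row per (country, product) with its total revenue
        SELECT country, stock_code, SUM(quantity * price) AS total_revenue
        FROM sales
        WHERE quantity > 0 AND price > 0
        GROUP BY country, stock_code
    ),
    ranked AS (
        -- Step 2: rank products within each country
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY country
                   ORDER BY total_revenue DESC, stock_code
               ) AS country_rank
        FROM product_revenue
    )
    -- Step 3: keep only the top product per country
    SELECT country, stock_code, total_revenue, country_rank
    FROM ranked
    WHERE country_rank = 1
    ORDER BY country
    """
).fetchall()

print(top_products)  # [('France', 'B', 20.0, 1), ('Spain', 'A', 20.0, 1)]
```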

---

## Submission Checklist

Before submitting, verify:

- [x] All TODO sections completed
- [x] All queries produce results (no errors)
- [x] All query outputs are visible
- [x] All markdown explanations completed
- [x] SQL formatted nicely (uppercase keywords, indented)
- [x] NULL values handled appropriately (IS NULL, not = NULL)
- [x] **CRITICAL:** Kernel → Restart & Run All Cells (no errors)
- [x] File renamed to `hw1_[your_name].ipynb`

---

## Reflection (Optional but Recommended)

**What was the most challenging part of this assignment?**

[Your answer here]

**What concept do you feel most confident about now?**

[Your answer here]

**What would you like more practice with?**

[Your answer here]

---

**Great work! 🎉** You've completed queries on 525,000 rows of real data. This is the kind of work data professionals do every day. Be proud!