# OPIM510: Class 10 - Window Functions

## Window Functions

Defined by [Postgres](https://www.postgresql.org/docs/9.1/tutorial-window.html) as:

> A *window function* performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row --- the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

### Examples of Window Function use cases

- Ranking (or ranking within groups) based on a metric
- Running totals (or running totals within groups) of a metric
- Lead/lag calculations (i.e. Looking forward or backwards by *n* rows to perform a calculation on the current row)

## Prepare our environment

In [1]:
import polars as pl
import pandas as pd
from datetime import datetime, date

## Load our dataset

In [13]:
# Read the Superstore CSV file
sales = pd.read_csv('https://raw.githubusercontent.com/yajasarora/Superstore-Sales-Analysis-with-Tableau/refs/heads/master/Superstore%20sales%20dataset.csv')

sales

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,8/11/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2,0.00,41.9136
1,2,CA-2016-152156,8/11/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3,0.00,219.5820
2,3,CA-2016-138688,12/6/2016,16/6/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2,0.00,6.8714
3,4,US-2015-108966,11/10/2015,18/10/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.0310
4,5,US-2015-108966,11/10/2015,18/10/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,21/1/2014,23/1/2014,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,33180,South,FUR-FU-10001889,Furniture,Furnishings,Ultra Door Pull Handle,25.2480,3,0.20,4.1028
9990,9991,CA-2017-121258,26/2/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,FUR-FU-10000747,Furniture,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.9600,2,0.00,15.6332
9991,9992,CA-2017-121258,26/2/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,TEC-PH-10003645,Technology,Phones,Aastra 57i VoIP phone,258.5760,2,0.20,19.3932
9992,9993,CA-2017-121258,26/2/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,OFF-PA-10004041,Office Supplies,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.6000,4,0.00,13.3200


## Using window functions to compute ranking

**Business Question:** Who are our top 10 most `Profit`able `Customer Name`s?

Key Points:
- We need to aggregate `Profit` for each `Customer Name`
- We need to rank the `Customer Name` groups on total `Profit`, then filter for only the top 10

In [None]:
top_customers = (
    sales
    .group_by("Customer Name")
    .agg(
        pl.col("Profit").sum().alias("total_profit")
    )
    .with_columns(
        pl.col("total_profit").rank(method="min", descending=True).alias("profit_rank")
    )
    .filter(pl.col("profit_rank") < 11)
    .sort("profit_rank")
)

top_customers

## Using window function within groups

**Business Question:** Who are the most `Profit`able `Customer Names` for each `Region`?

Key Points:
- We need to aggregate `Profit` for each combination of `Customer Name` and `Region`
- We need to rank the `Customer Name` groups on total `Profit`, then filter for only the top 10 FOR EACH `Region`

In [None]:
top_customers_region = (
    sales
    .group_by(["Customer Name", "Region"])
    .agg(
        pl.col("Profit").sum().alias("total_profit")
    )
    .with_columns(
        pl.col("total_profit")
        .rank(method="min", descending=True)
        .over("Region")
        .alias("profit_rank")
    )
    .filter(pl.col("profit_rank") < 11)
    .sort(["Region", "profit_rank"])
)

print(f"Shape: {top_customers_region.shape}")
top_customers_region.head(20)

## Cumulative (running) totals

**Business Question:** Show the running total of `Sales` for the days (using by `Order Date`) leading up to 2016-11-08.

Key Points:
- Need to figure out how to get a running total of `Sales`
- Must order the data in ascending order by `Order Date`

In [None]:
running_sales = (
    sales
    .group_by("Order Date")
    .agg(
        pl.col("Sales").sum().alias("tot_sales")
    )
    .sort("Order Date")
    .with_columns(
        pl.col("tot_sales").cum_sum().alias("running_total")
    )
    .filter(pl.col("Order Date") < datetime(2016, 11, 8))
)

print(f"Shape: {running_sales.shape}")
running_sales.tail(10)

## Quick Summarization

**Business Question:** What is the the total amount of both `Sales` and `Profit` for each month of `Order Date`? (When computing months, each combination of year and month counts as a unique month.)

Key points:
- Need to figure out how to parse out only the year/months from the `Order Date` column
- We can use Polars' datetime functionality to truncate dates to month boundaries
- FOR EACH usually means we are grouping by something. In this case we have to figure out how to get year/month combinations from the `Order Date` field
- Create sums for both `Sales` and `Profit`

### Date truncation examples in Polars

In [None]:
# Examples of date truncation in Polars
example_date = datetime(2023, 11, 7, 12, 31, 13)
df_example = pl.DataFrame({"datetime": [example_date]})

print("Original datetime:", example_date)
print("\nTruncated to different units:")
print(df_example.with_columns([
    pl.col("datetime").dt.truncate("1y").alias("year"),
    pl.col("datetime").dt.truncate("1mo").alias("month"),
    pl.col("datetime").dt.truncate("1d").alias("day"),
    pl.col("datetime").dt.truncate("1h").alias("hour"),
    pl.col("datetime").dt.truncate("1m").alias("minute"),
]))

In [None]:
monthly_sales = (
    sales
    .with_columns(
        pl.col("Order Date").dt.truncate("1mo").alias("month")
    )
    .group_by("month")
    .agg([
        pl.col("Sales").sum().alias("tot_sales"),
        pl.col("Profit").sum().alias("tot_profit")
    ])
    .sort("month")
)

print(f"Shape: {monthly_sales.shape}")
monthly_sales.head(12)

## Going one level further

**Business Question:** Using the data frame from the previous question: between which months was the greatest jump in sales?

Key Points:
- Using the previous data frame
- Need to figure out the difference between `tot_sales` from month to month

In [None]:
delta_sales = (
    monthly_sales
    .with_columns(
        (pl.col("tot_sales") - pl.col("tot_sales").shift(1)).alias("month_sales_delta")
    )
)

print("\nMonth with greatest sales increase:")
print(delta_sales.sort("month_sales_delta", descending=True).head(1))

print("\nAll monthly deltas:")
delta_sales.head(12)

## Same dataset but now finding the `Profit` monthly delta

**Business Question:** Using the data frame from the previous question: between which months was the greatest jump in profit?

Key Points:
- Using the previous data frame
- Need to figure out the difference between `tot_profit` from month to month

In [None]:
delta_sales = (
    delta_sales
    .with_columns(
        (pl.col("tot_profit") - pl.col("tot_profit").shift(1)).alias("month_profit_delta")
    )
)

print("\nMonth with greatest profit increase:")
print(delta_sales.sort("month_profit_delta", descending=True).head(1))

print("\nAll monthly deltas:")
delta_sales.head(12)

## Additional Window Function Examples

### Lead function example
Let's look at the next month's sales alongside current month

In [None]:
# Using shift with negative value for lead
lead_example = (
    monthly_sales
    .with_columns([
        pl.col("tot_sales").shift(-1).alias("next_month_sales"),
        pl.col("tot_sales").shift(1).alias("prev_month_sales")
    ])
)

print("Lead/Lag example:")
lead_example.head(10)

### Rolling window calculations
Calculate 3-month moving average of sales

In [None]:
rolling_example = (
    monthly_sales
    .sort("month")
    .with_columns(
        pl.col("tot_sales").rolling_mean(window_size=3).alias("3_month_avg_sales")
    )
)

print("Rolling 3-month average:")
rolling_example.head(12)