# Analyzing Data with Spark Window Functions

Window functions let you compute running totals, rankings, and moving statistics across partitions of your data without collapsing rows.

## Setup and Data Load

We reuse the shared orders dataset stored at `notebooks/data/orders_demo.csv`.

In [None]:
from pathlib import Path
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('SparkWindowTutorial').getOrCreate()

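# Resolve the CSV whether the notebook runs from the repo root or from a directory that sits next to data/.
repo_root = Path.cwd()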
if (repo_root / 'notebooks').exists():
    data_path = repo_root / 'notebooks' / 'data' / 'orders_demo.csv'
else:
    data_path = Path('..') / 'data' / 'orders_demo.csv'

orders_df = (
    spark.read
    .option('header', True)
    .option('inferSchema', True)
    .csv(str(data_path))
)
orders_df = orders_df.withColumn('order_date', F.to_date('order_date'))
orders_df.orderBy('order_date', 'region').show()


## Defining Window Specifications

A window spec describes how rows are partitioned, ordered, and optionally framed before a function is evaluated over them.

In [None]:
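# Running frame: every row from the start of the region's partition through the current row.
region_day_window = Window.partitionBy('region').orderBy('order_date').rowsBetween(Window.unboundedPreceding, 0)
# No explicit frame: used to rank each region's days by order count, highest first.
region_rank_window = Window.partitionBy('region').orderBy(F.desc('orders'))
# Sliding frame: the previous row and the current row within each region.
rolling_two_day = Window.partitionBy('region').orderBy('order_date').rowsBetween(-1, 0)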


## Running Totals per Region

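`rowsBetween(Window.unboundedPreceding, 0)` frames every row from the start of the region's partition through the current row (offset `0`, the same value as `Window.currentRow`), so the sum accumulates in `order_date` order within each region.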

In [None]:
running_totals = orders_df.withColumn(
    'regional_running_orders',
    F.sum('orders').over(region_day_window),
)
running_totals.orderBy('order_date', 'region').show()
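
To cross-check what the running frame computes, here is a minimal plain-Python sketch of the same logic. It is not Spark code: the sample rows are made up for illustration (they are not the contents of `orders_demo.csv`), and `running_totals_by_region` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical sample rows (region, order_date, orders); not the real dataset.
rows = [
    ("east", "2024-01-01", 10),
    ("east", "2024-01-02", 14),
    ("east", "2024-01-03", 9),
    ("west", "2024-01-01", 7),
    ("west", "2024-01-02", 12),
]


def running_totals_by_region(rows):
    """Mimic sum('orders') over a window partitioned by region, ordered by
    date, with a frame from the partition start through the current row."""
    totals = defaultdict(int)
    result = []
    # Sorting by (region, date) stands in for partitionBy + orderBy.
    for region, order_date, orders in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[region] += orders  # the tally never crosses a region boundary
        result.append((region, order_date, orders, totals[region]))
    return result


running = running_totals_by_region(rows)
for row in running:
    print(row)
```

With these sample rows the running totals are 10, 24, 33 for `east` and 7, 19 for `west`: each region's tally starts from zero, which is the effect of `partitionBy('region')`.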


## Rolling Two-Day Average

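The frame `rowsBetween(-1, 0)` averages the current row with the row immediately before it in the same region, which smooths daily volume swings. That equals "today and yesterday" only when each region has exactly one row per calendar day with no gaps. The first row of each partition has no predecessor, so its average is just its own value.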

In [None]:
rolling_avg = orders_df.withColumn(
    'orders_two_day_avg',
    F.avg('orders').over(rolling_two_day),
)
rolling_avg.orderBy('order_date', 'region').show()
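
The difference between a row-based frame and a calendar-based one shows up as soon as a day is missing. The following plain-Python sketch (not Spark code) compares the two on a made-up single-region series with a gap. The data and both helper names are assumptions for illustration. In Spark, value-based frames are normally expressed with `rangeBetween` rather than `rowsBetween`.

```python
from datetime import date

# Hypothetical single-region series; 2024-01-03 is deliberately missing.
series = [
    (date(2024, 1, 1), 10),
    (date(2024, 1, 2), 14),
    (date(2024, 1, 4), 8),
]


def two_row_avg(series):
    """Average the current row with the row before it, like rowsBetween(-1, 0)."""
    out = []
    for i, (day, _) in enumerate(series):
        frame = [orders for _, orders in series[max(0, i - 1): i + 1]]
        out.append((day, sum(frame) / len(frame)))
    return out


def two_day_avg(series):
    """Average only rows dated on the current day or the calendar day before."""
    out = []
    for day, _ in series:
        frame = [orders for d, orders in series if 0 <= (day - d).days <= 1]
        out.append((day, sum(frame) / len(frame)))
    return out


for (day, by_row), (_, by_day) in zip(two_row_avg(series), two_day_avg(series)):
    print(day, by_row, by_day)
```

For 2024-01-04 the row-based version averages 14 and 8 to get 11.0, reaching back across the gap to 2024-01-02, while the calendar-based version sees only the single row for that day and returns 8.0.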


## Ranking Days by Demand

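Ranking identifies peak demand days per region while preserving the original rows. Because the window orders by `orders` descending, rank 1 is each region's busiest day; with `dense_rank`, tied days share a rank and the next distinct value gets the next consecutive rank.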

In [None]:
ranked = orders_df.withColumn(
    'demand_rank',
    F.dense_rank().over(region_rank_window),
)
ranked.orderBy('region', 'demand_rank').show()
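
How ties are handled is the main difference between the ranking functions. This plain-Python sketch (not Spark code) reproduces the tie behavior of `row_number`, `rank`, and `dense_rank` on a made-up list of daily order counts that is already sorted descending. The `*_like` function names are hypothetical stand-ins.

```python
# Hypothetical order counts for one region, sorted from highest to lowest.
orders_desc = [15, 12, 12, 9]


def row_number_like(values):
    """Every row gets a unique, consecutive number; ties are broken arbitrarily."""
    return list(range(1, len(values) + 1))


def rank_like(values):
    """Ties share a rank, and the rank after a tie skips ahead."""
    out = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            out.append(out[-1])
        else:
            out.append(i + 1)
    return out


def dense_rank_like(values):
    """Ties share a rank, and ranks stay consecutive with no gaps."""
    out = []
    current = 0
    for i, value in enumerate(values):
        if i == 0 or value != values[i - 1]:
            current += 1
        out.append(current)
    return out


print(row_number_like(orders_desc))  # [1, 2, 3, 4]
print(rank_like(orders_desc))        # [1, 2, 2, 4]
print(dense_rank_like(orders_desc))  # [1, 2, 2, 3]
```

The choice matters when filtering: keeping rows with a dense rank of at most 2 returns three days here, whereas a row number of at most 2 returns exactly two.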


## Clean Up

Stop the SparkSession when you are done working with the notebook.

In [None]:
spark.stop()


## Exercises

- Calculate a three-day trailing sum of orders for each region using a window frame.
- Use `row_number` to identify the first day each region exceeded 12 orders.
- Visualize the rolling averages with a simple `matplotlib` plot or describe how you would export the results for visualization elsewhere.
