**⭐ 1. What This Pattern Solves**

Window functions compute a value for every row from a logical partition of related rows, without collapsing those rows the way groupBy does. Essential for ranking, running totals, moving averages, and deduplication.

Use cases:

Rank customers by total purchase per month

Compute previous or next transaction amounts (lead/lag)

Moving averages over last N rows

Deduplicate latest records per key

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Rank
SELECT CustomerID, OrderDate, Amount,
       RANK() OVER (PARTITION BY CustomerID ORDER BY Amount DESC) AS RankAmt
FROM Orders;

-- Lead / Lag
SELECT CustomerID, OrderDate, Amount,
       LAG(Amount, 1) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS PrevAmount,
       LEAD(Amount, 1) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS NextAmount
FROM Orders;


**⭐ 3. Core Idea**

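Define a Window spec: partition + order + optional frame. If orderBy is given without a frame, aggregates use a growing frame from the start of the partition through the current row (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

Apply ranking, offset, or aggregate functions over this window without collapsing rows.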

Functions: rank(), dense_rank(), row_number(), lag(), lead(), sum().over(window) etc.
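
To make the three parts of a window spec concrete, here is a minimal plain-Python sketch (no Spark required) of what a running total does. It is meant to mirror `F.sum("Amount").over(Window.partitionBy("Customer").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow))`. The helper name `running_total` and the sample rows are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def running_total(rows):
    """Emulate a per-customer running sum; rows are (customer, date, amount)."""
    # Step 1, partition: group rows by key (the partitionBy part).
    partitions = defaultdict(list)
    for customer, date, amount in rows:
        partitions[customer].append((date, amount))

    result = []
    for customer in sorted(partitions):
        # Step 2, order: sort rows inside the partition (the orderBy part).
        ordered = sorted(partitions[customer])
        total = 0
        for date, amount in ordered:
            # Step 3, frame: everything from the partition start to this row.
            total += amount
            result.append((customer, date, amount, total))
    return result

# Hypothetical sample data, deliberately unsorted.
rows = [
    ("Alice", "2025-01-02", 150),
    ("Bob", "2025-01-01", 200),
    ("Alice", "2025-01-01", 100),
    ("Bob", "2025-01-02", 50),
]
totals = running_total(rows)
for row in totals:
    print(row)
```

Every input row appears in the result with one extra value, which is the key difference from a groupBy aggregation.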

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Window definition
window_spec = Window.partitionBy("col1").orderBy("col2")

# Using rank (withColumn returns a new DataFrame, so keep the result)
ranked_df = df.withColumn("rank_col", F.rank().over(window_spec))

# Using lead/lag
shifted_df = df.withColumn("prev_val", F.lag("metric", 1).over(window_spec)) \
  .withColumn("next_val", F.lead("metric", 1).over(window_spec))


**⭐ 5. Detailed Example**

In [0]:
data = [
    ("Alice", "2025-01-01", 100),
    ("Alice", "2025-01-02", 150),
    ("Bob", "2025-01-01", 200),
    ("Bob", "2025-01-02", 50)
]

df = spark.createDataFrame(data, ["Customer", "Date", "Amount"])

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Define window per customer ordered by date
# (ISO yyyy-MM-dd strings sort chronologically; cast to a date type for other formats)
window_spec = Window.partitionBy("Customer").orderBy("Date")

# Add rank, lag, lead
result = df.withColumn("rank", F.rank().over(window_spec)) \
           .withColumn("prev_amount", F.lag("Amount", 1).over(window_spec)) \
           .withColumn("next_amount", F.lead("Amount", 1).over(window_spec))

result.show()

In [0]:
+--------+----------+------+----+-----------+-----------+
|Customer|      Date|Amount|rank|prev_amount|next_amount|
+--------+----------+------+----+-----------+-----------+
|   Alice|2025-01-01|   100|   1|       null|        150|
|   Alice|2025-01-02|   150|   2|        100|       null|
|     Bob|2025-01-01|   200|   1|       null|         50|
|     Bob|2025-01-02|    50|   2|        200|       null|
+--------+----------+------+----+-----------+-----------+


**⭐ 6. Mini Practice Problems**

Compute row number per Customer ordered by Amount descending.

Find previous 2 transaction amounts for each Customer.

Compute 3-day moving average of Sales per Store.
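
As a hint for the moving-average problem, the sketch below emulates a trailing 3-row frame in plain Python. In PySpark the corresponding expression would be along the lines of `F.avg("Sales").over(Window.partitionBy("Store").orderBy("Date").rowsBetween(-2, 0))`. Note that `rowsBetween` counts rows, not calendar days, so this only equals a 3-day average under the assumption of exactly one row per store per day. The `Date` column, the helper name `moving_average`, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def moving_average(rows, window_size=3):
    """Trailing average over the current row and the rows just before it.

    rows are (store, date, sales) tuples; assumes one row per store per day.
    """
    partitions = defaultdict(list)
    for store, date, sales in rows:
        partitions[store].append((date, sales))

    result = []
    for store in sorted(partitions):
        ordered = sorted(partitions[store])
        for i, (date, sales) in enumerate(ordered):
            # Frame: up to window_size rows ending at the current row,
            # like rowsBetween(-2, 0) when window_size is 3.
            frame = ordered[max(0, i - window_size + 1): i + 1]
            avg = sum(s for _, s in frame) / len(frame)
            result.append((store, date, sales, avg))
    return result

# Hypothetical sample data.
sales_rows = [
    ("S1", "2025-01-01", 10),
    ("S1", "2025-01-02", 20),
    ("S1", "2025-01-03", 30),
    ("S1", "2025-01-04", 40),
    ("S2", "2025-01-01", 5),
]
averages = moving_average(sales_rows)
for row in averages:
    print(row)
```

The first rows of each store average over fewer than three values because the frame is clipped at the partition start, which is also how a row-based frame behaves in Spark.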

**⭐ 7. Full Data Engineering Problem**

Scenario: Bronze transaction table has CustomerID, OrderDate, Amount.

Goal: Silver table with latest order per customer and previous order amount.

Steps:

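Define a window partitioned by CustomerID and ordered by OrderDate DESC.

Use row_number() and keep the rows where it equals 1 to pick the latest order.

Use lead() over the same window to fetch the previous order amount. With descending order the earlier order is the next row, so lag() would return a later order instead; alternatively use lag() over an ascending window. Compute this column before filtering down to the latest row.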

Write to Silver Delta for downstream analytics.

This is commonly used for customer churn prediction, RFM scoring, and deduplication.
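
The logic of this scenario can be sketched in plain Python as a reference for what the Silver rows should contain. In PySpark it would correspond to `F.row_number()` and `F.lead("Amount", 1)` over `Window.partitionBy("CustomerID").orderBy(F.col("OrderDate").desc())`, followed by a filter on row number 1. The helper name `latest_with_previous`, the sample data, and the use of ISO date strings are assumptions for illustration.

```python
from collections import defaultdict

def latest_with_previous(orders):
    """Return one (customer_id, order_date, amount, prev_amount) row per customer."""
    partitions = defaultdict(list)
    for customer_id, order_date, amount in orders:
        partitions[customer_id].append((order_date, amount))

    silver = []
    for customer_id in sorted(partitions):
        # Newest first, matching ORDER BY OrderDate DESC.
        ordered = sorted(partitions[customer_id], reverse=True)
        # The first row is the one where row_number() would be 1.
        latest_date, latest_amount = ordered[0]
        # The previous order is the NEXT row in descending order (lead),
        # and is missing when the customer has only one order.
        prev_amount = ordered[1][1] if len(ordered) > 1 else None
        silver.append((customer_id, latest_date, latest_amount, prev_amount))
    return silver

# Hypothetical Bronze rows: (CustomerID, OrderDate, Amount).
bronze = [
    ("C1", "2025-01-01", 100),
    ("C1", "2025-01-05", 150),
    ("C1", "2025-01-03", 120),
    ("C2", "2025-01-02", 200),
]
silver_rows = latest_with_previous(bronze)
for row in silver_rows:
    print(row)
```

Customers with a single order get a null previous amount, so downstream consumers should expect nulls in that column.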

**⭐ 8. Time & Space Complexity**

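Time: A shuffle to co-locate rows by partition key, then a sort within each partition, roughly O(n log n) for a partition of n rows; Spark distributes partitions across executors.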

Space: Depends on partition size, because all rows of one window partition are processed by a single task. Large or skewed partitions may cause memory pressure and spilling.

**⭐ 9. Common Pitfalls**

Not partitioning → all data treated as single partition → huge shuffle and memory issues.

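Forgetting orderBy → Spark rejects rank/lag/lead on an unordered window, and ordering by a column with ties makes row_number/lag/lead results nondeterministic.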

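Using row_number() instead of rank() when duplicates matter → row_number() breaks ties arbitrarily, rank() gives tied rows the same rank and leaves gaps, dense_rank() gives the same rank without gaps.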

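Reusing a DataFrame with window columns across several actions on huge datasets without caching → the shuffle and sort are recomputed each time. Window functions that use different partition or order columns also add extra shuffles or sorts.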
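
To illustrate how the three ranking functions differ on tied values, here is a small plain-Python emulation of `row_number()`, `rank()`, and `dense_rank()` over amounts ordered descending. The helper name `rank_functions` and the sample amounts are hypothetical. One difference to keep in mind: Python's sort is stable, whereas in Spark the order of tied rows is not guaranteed.

```python
def rank_functions(amounts):
    """Return (amount, row_number, rank, dense_rank) for amounts sorted descending."""
    ordered = sorted(amounts, reverse=True)
    rows = []
    rank = dense_rank = 0
    previous = object()  # sentinel that compares unequal to every amount
    for row_number, amount in enumerate(ordered, start=1):
        if amount != previous:
            rank = row_number   # rank jumps to the row position, leaving gaps
            dense_rank += 1     # dense_rank advances by one, leaving no gaps
            previous = amount
        rows.append((amount, row_number, rank, dense_rank))
    return rows

# Hypothetical amounts with one tie at 200.
ranked = rank_functions([200, 150, 200, 50])
for row in ranked:
    print(row)
```

For deduplication, where exactly one row per key must survive, `row_number()` is the usual choice; for leaderboards where ties should share a position, `rank()` or `dense_rank()` fits better.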