# 🪟 WINDOW FUNCTIONS WITH PYSPARK

---

## 📋 **DAY 3 - LESSON 3: WINDOW FUNCTIONS**

### **🎯 OBJECTIVES:**

1. **Window Basics** - partitionBy, orderBy, rowsBetween
2. **Ranking Functions** - row_number, rank, dense_rank, ntile
3. **Analytic Functions** - lag, lead, first_value, last_value
4. **Aggregate Functions** - sum, avg, min, max over windows
5. **Frame Specifications** - rows vs range, unbounded
6. **Real-World Use Cases** - Running totals, moving averages, YoY growth

---

## 🔧 **SETUP**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("WindowFunctions") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

print("✅ Spark Session Created")

✅ Spark Session Created


---

## 📊 **1. CREATE SAMPLE DATA**

In [2]:
# Sales data with dates
data = [
    ("2024-01-01", "USA", "Electronics", 1200),
    ("2024-01-01", "USA", "Clothing", 300),
    ("2024-01-01", "UK", "Electronics", 800),
    ("2024-01-02", "USA", "Electronics", 1500),
    ("2024-01-02", "USA", "Clothing", 400),
    ("2024-01-02", "UK", "Electronics", 900),
    ("2024-01-03", "USA", "Electronics", 1100),
    ("2024-01-03", "USA", "Clothing", 350),
    ("2024-01-03", "UK", "Electronics", 1000),
    ("2024-01-04", "USA", "Electronics", 1300),
    ("2024-01-04", "USA", "Clothing", 450),
    ("2024-01-04", "UK", "Electronics", 850),
    ("2024-01-05", "USA", "Electronics", 1400),
    ("2024-01-05", "USA", "Clothing", 500),
    ("2024-01-05", "UK", "Electronics", 950),
]

df = spark.createDataFrame(data, ["date", "country", "category", "revenue"]) \
    .withColumn("date", to_date(col("date")))

print("üìä SAMPLE DATA:")
df.orderBy("date", "country", "category").show(20)

üìä SAMPLE DATA:


                                                                                

+----------+-------+-----------+-------+
|      date|country|   category|revenue|
+----------+-------+-----------+-------+
|2024-01-01|     UK|Electronics|    800|
|2024-01-01|    USA|   Clothing|    300|
|2024-01-01|    USA|Electronics|   1200|
|2024-01-02|     UK|Electronics|    900|
|2024-01-02|    USA|   Clothing|    400|
|2024-01-02|    USA|Electronics|   1500|
|2024-01-03|     UK|Electronics|   1000|
|2024-01-03|    USA|   Clothing|    350|
|2024-01-03|    USA|Electronics|   1100|
|2024-01-04|     UK|Electronics|    850|
|2024-01-04|    USA|   Clothing|    450|
|2024-01-04|    USA|Electronics|   1300|
|2024-01-05|     UK|Electronics|    950|
|2024-01-05|    USA|   Clothing|    500|
|2024-01-05|    USA|Electronics|   1400|
+----------+-------+-----------+-------+



---

## 🪟 **2. WINDOW BASICS**

In [3]:
# 2.1 Simple window - partition by country
print("üîπ Window partitioned by country:")
windowCountry = Window.partitionBy("country")

df_window = df.withColumn(
    "total_by_country",
    sum("revenue").over(windowCountry)
).withColumn(
    "avg_by_country",
    avg("revenue").over(windowCountry)
)

df_window.orderBy("country", "date").show()

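# 2.2 Window with ordering
# With orderBy and no explicit frame, Spark defaults to
# rangeBetween(Window.unboundedPreceding, Window.currentRow), so rows that tie
# on "date" share one running_total. row_number() breaks such ties arbitrarily.
print("\n🔹 Window with ordering:")
windowOrdered = Window.partitionBy("country").orderBy("date")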

df_ordered = df.withColumn(
    "row_num",
    row_number().over(windowOrdered)
).withColumn(
    "running_total",
    sum("revenue").over(windowOrdered)
)

df_ordered.orderBy("country", "date").show()

# 2.3 Multiple partitions
print("\n🔹 Window with multiple partitions:")
windowMulti = Window.partitionBy("country", "category").orderBy("date")

df_multi = df.withColumn(
    "running_total",
    sum("revenue").over(windowMulti)
)

df_multi.orderBy("country", "category", "date").show(20)

🔹 Window partitioned by country:

+----------+-------+-----------+-------+----------------+--------------+
|      date|country|   category|revenue|total_by_country|avg_by_country|
+----------+-------+-----------+-------+----------------+--------------+
|2024-01-01|     UK|Electronics|    800|            4500|         900.0|
|2024-01-02|     UK|Electronics|    900|            4500|         900.0|
|2024-01-03|     UK|Electronics|   1000|            4500|         900.0|
|2024-01-04|     UK|Electronics|    850|            4500|         900.0|
|2024-01-05|     UK|Electronics|    950|            4500|         900.0|
|2024-01-01|    USA|Electronics|   1200|            8500|         850.0|
|2024-01-01|    USA|   Clothing|    300|            8500|         850.0|
|2024-01-02|    USA|Electronics|   1500|            8500|         850.0|
|2024-01-02|    USA|   Clothing|    400|            8500|         850.0|
|2024-01-03|    USA|   Clothing|    350|            8500|         850.0|
|2024-01-03|    USA|Electronics|   1100|           

                                                                                

+----------+-------+-----------+-------+-------+-------------+
|      date|country|   category|revenue|row_num|running_total|
+----------+-------+-----------+-------+-------+-------------+
|2024-01-01|     UK|Electronics|    800|      1|          800|
|2024-01-02|     UK|Electronics|    900|      2|         1700|
|2024-01-03|     UK|Electronics|   1000|      3|         2700|
|2024-01-04|     UK|Electronics|    850|      4|         3550|
|2024-01-05|     UK|Electronics|    950|      5|         4500|
|2024-01-01|    USA|Electronics|   1200|      1|         1500|
|2024-01-01|    USA|   Clothing|    300|      2|         1500|
|2024-01-02|    USA|Electronics|   1500|      3|         3400|
|2024-01-02|    USA|   Clothing|    400|      4|         3400|
|2024-01-03|    USA|Electronics|   1100|      5|         4850|
|2024-01-03|    USA|   Clothing|    350|      6|         4850|
|2024-01-04|    USA|Electronics|   1300|      7|         6600|
|2024-01-04|    USA|   Clothing|    450|      8|       

+----------+-------+-----------+-------+-------------+
|      date|country|   category|revenue|running_total|
+----------+-------+-----------+-------+-------------+
|2024-01-01|     UK|Electronics|    800|          800|
|2024-01-02|     UK|Electronics|    900|         1700|
|2024-01-03|     UK|Electronics|   1000|         2700|
|2024-01-04|     UK|Electronics|    850|         3550|
|2024-01-05|     UK|Electronics|    950|         4500|
|2024-01-01|    USA|   Clothing|    300|          300|
|2024-01-02|    USA|   Clothing|    400|          700|
|2024-01-03|    USA|   Clothing|    350|         1050|
|2024-01-04|    USA|   Clothing|    450|         1500|
|2024-01-05|    USA|   Clothing|    500|         2000|
|2024-01-01|    USA|Electronics|   1200|         1200|
|2024-01-02|    USA|Electronics|   1500|         2700|
|2024-01-03|    USA|Electronics|   1100|         3800|
|2024-01-04|    USA|Electronics|   1300|         5100|
|2024-01-05|    USA|Electronics|   1400|         6500|
+----------+-------+-----------+-------+-------------+
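
The tied running totals in the 2.2 output (1500 and 1500 for the two USA rows on 2024-01-01) follow from the frame type. The sketch below is a plain-Python model of the two frame types, not Spark code; the helper names `running_total_rows` and `running_total_range` are hypothetical, and the input is the first four USA rows of the sample data.

```python
from itertools import groupby

# First four USA rows of the sample data as (date, revenue), sorted by date.
usa = [
    ("2024-01-01", 1200), ("2024-01-01", 300),
    ("2024-01-02", 1500), ("2024-01-02", 400),
]

def running_total_rows(rows):
    """ROWS frame: each row sees only the rows physically up to itself."""
    total, out = 0, []
    for _, revenue in rows:
        total += revenue
        out.append(total)
    return out

def running_total_range(rows):
    """RANGE frame: rows with the same ordering value (peers) share one frame."""
    total, out = 0, []
    for _, peers in groupby(rows, key=lambda r: r[0]):
        revenues = [revenue for _, revenue in peers]
        for revenue in revenues:
            total += revenue
        # Every peer gets the total that includes all of its peers.
        out.extend([total] * len(revenues))
    return out

print(running_total_rows(usa))   # [1200, 1500, 3000, 3400]
print(running_total_range(usa))  # [1500, 1500, 3400, 3400]
```

The RANGE result matches the `running_total` column shown for USA. In PySpark, adding `.rowsBetween(Window.unboundedPreceding, Window.currentRow)` to the window asks for the ROWS behaviour instead, although the order among tied rows is then not guaranteed.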

                                                                                

---

## 🏆 **3. RANKING FUNCTIONS**

In [4]:
# 3.1 row_number, rank, dense_rank
print("üîπ Ranking functions comparison:")
windowRank = Window.partitionBy("country").orderBy(desc("revenue"))

df_rank = df.withColumn("row_number", row_number().over(windowRank)) \
    .withColumn("rank", rank().over(windowRank)) \
    .withColumn("dense_rank", dense_rank().over(windowRank))

df_rank.orderBy("country", "row_number").show(20)

print("""
📝 DIFFERENCE:
- row_number: 1, 2, 3, 4, 5... (always sequential)
- rank: 1, 2, 2, 4, 5... (gaps after ties)
- dense_rank: 1, 2, 2, 3, 4... (no gaps)
""")

# 3.2 Top N per group
print("\n🔹 Top 3 revenue days per country:")
top3 = df.withColumn(
    "rank",
    row_number().over(Window.partitionBy("country").orderBy(desc("revenue")))
).filter(col("rank") <= 3)

top3.orderBy("country", "rank").show()

# 3.3 ntile - divide into buckets
print("\n🔹 Divide into quartiles:")
df_ntile = df.withColumn(
    "quartile",
    ntile(4).over(Window.partitionBy("country").orderBy("revenue"))
)

df_ntile.orderBy("country", "quartile").show(20)

# 3.4 Percent rank
print("\n🔹 Percent rank:")
df_percent = df.withColumn(
    "percent_rank",
    percent_rank().over(Window.partitionBy("country").orderBy("revenue"))
).withColumn(
    "cume_dist",
    cume_dist().over(Window.partitionBy("country").orderBy("revenue"))
)

df_percent.orderBy("country", "revenue").show()

🔹 Ranking functions comparison:
+----------+-------+-----------+-------+----------+----+----------+
|      date|country|   category|revenue|row_number|rank|dense_rank|
+----------+-------+-----------+-------+----------+----+----------+
|2024-01-03|     UK|Electronics|   1000|         1|   1|         1|
|2024-01-05|     UK|Electronics|    950|         2|   2|         2|
|2024-01-02|     UK|Electronics|    900|         3|   3|         3|
|2024-01-04|     UK|Electronics|    850|         4|   4|         4|
|2024-01-01|     UK|Electronics|    800|         5|   5|         5|
|2024-01-02|    USA|Electronics|   1500|         1|   1|         1|
|2024-01-05|    USA|Electronics|   1400|         2|   2|         2|
|2024-01-04|    USA|Electronics|   1300|         3|   3|         3|
|2024-01-01|    USA|Electronics|   1200|         4|   4|         4|
|2024-01-03|    USA|Electronics|   1100|         5|   5|         5|
|2024-01-05|    USA|   Clothing|    500|         6|   6|         6|
|2024-01-04| 
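
The sample data has no tied revenues within a country, so the three ranking columns above are identical. A minimal plain-Python sketch of the tie-handling rules, assuming a hypothetical list of revenues with one tie (the function name `rank_all` is also hypothetical), makes the difference visible:

```python
def rank_all(values_desc):
    """Return (row_number, rank, dense_rank) per value; input sorted descending."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to every real value
    for row_number, value in enumerate(values_desc, start=1):
        if value != prev:
            rank = row_number  # rank jumps to the current position (leaves gaps)
            dense += 1         # dense_rank advances by exactly one (no gaps)
            prev = value
        out.append((row_number, rank, dense))
    return out

revenues = [500, 400, 400, 300]  # hypothetical values, one tie at 400
for value, ranks in zip(revenues, rank_all(revenues)):
    print(value, ranks)
# 500 (1, 1, 1)
# 400 (2, 2, 2)
# 400 (3, 2, 2)
# 300 (4, 4, 3)
```

This is why a Top-N filter built on `row_number()` returns exactly N rows per group, while one built on `rank()` or `dense_rank()` can return more when values tie.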

---

## 📊 **4. ANALYTIC FUNCTIONS (LAG/LEAD)**

In [5]:
# 4.1 lag and lead
print("üîπ Lag and Lead:")
windowTime = Window.partitionBy("country", "category").orderBy("date")

df_lag_lead = df.withColumn(
    "prev_day_revenue",
    lag("revenue", 1).over(windowTime)
).withColumn(
    "next_day_revenue",
    lead("revenue", 1).over(windowTime)
).withColumn(
    "day_over_day_change",
    col("revenue") - lag("revenue", 1).over(windowTime)
).withColumn(
    "day_over_day_pct",
    round(((col("revenue") - lag("revenue", 1).over(windowTime)) / lag("revenue", 1).over(windowTime)) * 100, 2)
)

df_lag_lead.orderBy("country", "category", "date").show(20)

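# 4.2 first_value and last_value
# first_value/last_value were added to pyspark.sql.functions in Spark 3.5;
# on older versions use first/last. The explicit full-partition frame matters:
# with the default frame, last_value would only see rows up to the current one.
print("\n🔹 First and Last values:")
windowFull = Window.partitionBy("country", "category").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)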

df_first_last = df.withColumn(
    "first_revenue",
    first_value("revenue").over(windowFull)
).withColumn(
    "last_revenue",
    last_value("revenue").over(windowFull)
).withColumn(
    "growth_from_first",
    round(((col("revenue") - first_value("revenue").over(windowFull)) / first_value("revenue").over(windowFull)) * 100, 2)
)

df_first_last.orderBy("country", "category", "date").show(20)

🔹 Lag and Lead:
+----------+-------+-----------+-------+----------------+----------------+-------------------+----------------+
|      date|country|   category|revenue|prev_day_revenue|next_day_revenue|day_over_day_change|day_over_day_pct|
+----------+-------+-----------+-------+----------------+----------------+-------------------+----------------+
|2024-01-01|     UK|Electronics|    800|            NULL|             900|               NULL|            NULL|
|2024-01-02|     UK|Electronics|    900|             800|            1000|                100|            12.5|
|2024-01-03|     UK|Electronics|   1000|             900|             850|                100|           11.11|
|2024-01-04|     UK|Electronics|    850|            1000|             950|               -150|           -15.0|
|2024-01-05|     UK|Electronics|    950|             850|            NULL|                100|           11.76|
|2024-01-01|    USA|   Clothing|    300|            NULL|             400|           
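
The NULLs at the partition edges and the percentage column can be reproduced by hand. The sketch below is a plain-Python model of `lag`, `lead` and the day-over-day percentage for the UK Electronics partition. The helper names are hypothetical, and returning `None` for a zero previous value is an assumption of this sketch.

```python
import builtins  # explicit: the notebook's wildcard import shadows round()

def lag_values(values, offset=1):
    """Value `offset` rows earlier, or None where no such row exists."""
    return [values[i - offset] if i >= offset else None
            for i in range(len(values))]

def lead_values(values, offset=1):
    """Value `offset` rows later, or None where no such row exists."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

def pct_change(values):
    """Percentage change versus the previous row, rounded to 2 decimals."""
    out = []
    for current, previous in zip(values, lag_values(values)):
        if previous is None or previous == 0:
            out.append(None)  # no previous row, or division by zero
        else:
            out.append(builtins.round((current - previous) / previous * 100, 2))
    return out

uk = [800, 900, 1000, 850, 950]  # UK Electronics revenue, ordered by date
print(lag_values(uk))   # [None, 800, 900, 1000, 850]
print(lead_values(uk))  # [900, 1000, 850, 950, None]
print(pct_change(uk))   # [None, 12.5, 11.11, -15.0, 11.76]
```

The last line agrees with the `day_over_day_pct` values in the table above.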

---

## 📈 **5. AGGREGATE FUNCTIONS OVER WINDOWS**

In [6]:
# 5.1 Running totals
print("üîπ Running totals:")
windowRunning = Window.partitionBy("country", "category").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_running = df.withColumn(
    "running_total",
    sum("revenue").over(windowRunning)
).withColumn(
    "running_avg",
    avg("revenue").over(windowRunning)
).withColumn(
    "running_count",
    count("*").over(windowRunning)
)

df_running.orderBy("country", "category", "date").show(20)

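# 5.2 Moving averages (3-day)
# rowsBetween counts rows, not days: this is a 3-day average only because each
# (country, category) partition has exactly one row per day.
print("\n🔹 Moving averages (3-day):")
windowMoving = Window.partitionBy("country", "category").orderBy("date") \
    .rowsBetween(-2, 0)  # 2 rows before + current row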

df_moving = df.withColumn(
    "ma_3day",
    avg("revenue").over(windowMoving)
).withColumn(
    "sum_3day",
    sum("revenue").over(windowMoving)
)

df_moving.orderBy("country", "category", "date").show(20)

# 5.3 Centered moving average
print("\n🔹 Centered moving average (3-day):")
windowCentered = Window.partitionBy("country", "category").orderBy("date") \
    .rowsBetween(-1, 1)  # 1 before + current + 1 after

df_centered = df.withColumn(
    "centered_ma_3day",
    avg("revenue").over(windowCentered)
)

df_centered.orderBy("country", "category", "date").show(20)

# 5.4 Min/Max over window
print("\n🔹 Min/Max over window:")
df_minmax = df.withColumn(
    "max_so_far",
    max("revenue").over(windowRunning)
).withColumn(
    "min_so_far",
    min("revenue").over(windowRunning)
).withColumn(
    "is_new_high",
    when(col("revenue") == max("revenue").over(windowRunning), "Yes").otherwise("No")
)

df_minmax.orderBy("country", "category", "date").show(20)

🔹 Running totals:
+----------+-------+-----------+-------+-------------+------------------+-------------+
|      date|country|   category|revenue|running_total|       running_avg|running_count|
+----------+-------+-----------+-------+-------------+------------------+-------------+
|2024-01-01|     UK|Electronics|    800|          800|             800.0|            1|
|2024-01-02|     UK|Electronics|    900|         1700|             850.0|            2|
|2024-01-03|     UK|Electronics|   1000|         2700|             900.0|            3|
|2024-01-04|     UK|Electronics|    850|         3550|             887.5|            4|
|2024-01-05|     UK|Electronics|    950|         4500|             900.0|            5|
|2024-01-01|    USA|   Clothing|    300|          300|             300.0|            1|
|2024-01-02|    USA|   Clothing|    400|          700|             350.0|            2|
|2024-01-03|    USA|   Clothing|    350|         1050|             350.0|            3|
|2024-01-04
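
A row-based frame and a value-based frame give the same moving average only when there is one row per day with no gaps. The plain-Python sketch below illustrates the difference on hypothetical data with missing days; the function names and the day numbers are assumptions made for illustration, not part of the lesson's dataset.

```python
def mean(xs):
    """Arithmetic mean without relying on the (possibly shadowed) builtin sum."""
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)

def moving_avg_rows(values, preceding):
    """Like rowsBetween(-preceding, 0): current row plus N physical rows before."""
    out = []
    for i in range(len(values)):
        start = i - preceding if i >= preceding else 0
        out.append(mean(values[start:i + 1]))
    return out

def moving_avg_range(days, values, preceding_days):
    """Like rangeBetween(-preceding_days, 0) when ordering by a numeric day."""
    out = []
    for day in days:
        frame = [v for d, v in zip(days, values)
                 if day - preceding_days <= d <= day]
        out.append(mean(frame))
    return out

days = [1, 2, 5, 6]              # hypothetical day numbers, days 3-4 missing
revenue = [100, 200, 300, 400]   # hypothetical revenue per day
print(moving_avg_rows(revenue, 2))         # [100.0, 150.0, 200.0, 300.0]
print(moving_avg_range(days, revenue, 2))  # [100.0, 150.0, 300.0, 350.0]
```

On day 5 the row-based frame still averages in days 1 and 2, while the value-based frame only looks back two calendar days. When dates can be missing, a value-based frame over a numeric day count is usually closer to what "3-day average" means.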

---

## 🎯 **6. FRAME SPECIFICATIONS**

In [7]:
print("üìù FRAME SPECIFICATIONS:")
print("""
1. rowsBetween(start, end):
   - Physical rows
   - Window.unboundedPreceding: From start
   - Window.unboundedFollowing: To end
   - Window.currentRow: Current row
   - -N: N rows before
   - +N: N rows after

2. rangeBetween(start, end):
   - Logical range (based on orderBy column value)
   - Same syntax as rowsBetween

Examples:
""")

# Example 1: Last 3 rows
w1 = Window.partitionBy("country").orderBy("date").rowsBetween(-2, 0)
print("rowsBetween(-2, 0): Last 3 rows (including current)")

# Example 2: Next 2 rows
w2 = Window.partitionBy("country").orderBy("date").rowsBetween(0, 2)
print("rowsBetween(0, 2): Current + next 2 rows")

# Example 3: All rows up to current
w3 = Window.partitionBy("country").orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
print("rowsBetween(unboundedPreceding, 0): All rows up to current (running total)")

# Example 4: All rows
w4 = Window.partitionBy("country").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
print("rowsBetween(unboundedPreceding, unboundedFollowing): All rows in partition")

# Practical example
print("\n🔹 Practical example - Different frames:")
df_frames = df.filter(col("country") == "USA").filter(col("category") == "Electronics") \
    .withColumn("last_3_avg", avg("revenue").over(w1)) \
    .withColumn("next_2_avg", avg("revenue").over(w2)) \
    .withColumn("running_avg", avg("revenue").over(w3)) \
    .withColumn("overall_avg", avg("revenue").over(w4))

df_frames.orderBy("date").show()

📝 FRAME SPECIFICATIONS:

1. rowsBetween(start, end):
   - Physical rows
   - Window.unboundedPreceding: From start
   - Window.unboundedFollowing: To end
   - Window.currentRow: Current row
   - -N: N rows before
   - +N: N rows after

2. rangeBetween(start, end):
   - Logical range (based on orderBy column value)
   - Same syntax as rowsBetween

Examples:

rowsBetween(-2, 0): Last 3 rows (including current)
rowsBetween(0, 2): Current + next 2 rows
rowsBetween(unboundedPreceding, 0): All rows up to current (running total)
rowsBetween(unboundedPreceding, unboundedFollowing): All rows in partition

🔹 Practical example - Different frames:
+----------+-------+-----------+-------+------------------+------------------+------------------+-----------+
|      date|country|   category|revenue|        last_3_avg|        next_2_avg|       running_avg|overall_avg|
+----------+-------+-----------+-------+------------------+------------------+------------------+-----------+
|2024-01-01|    USA|
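
The four frames `w1` to `w4` can be worked through by hand on the USA Electronics rows. The sketch below is a plain-Python model of `rowsBetween(start, end)` with an average on top. `frame_avg` and the `UNBOUNDED` marker are hypothetical names, and the rounding to two decimals is added here only for readability.

```python
import builtins  # explicit: the notebook's wildcard import shadows round()

UNBOUNDED = None  # stands in for Window.unboundedPreceding / unboundedFollowing

def frame_avg(values, start, end):
    """Average over rows [i + start, i + end], clipped to the partition.

    Assumes start <= 0 <= end, so the frame always contains the current row.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is UNBOUNDED or i + start < 0 else i + start
        hi = n - 1 if end is UNBOUNDED or i + end > n - 1 else i + end
        total = 0
        for v in values[lo:hi + 1]:
            total += v
        out.append(builtins.round(total / (hi - lo + 1), 2))
    return out

usa_electronics = [1200, 1500, 1100, 1300, 1400]  # ordered by date

print(frame_avg(usa_electronics, -2, 0))                # w1: last 3 rows
# [1200.0, 1350.0, 1266.67, 1300.0, 1266.67]
print(frame_avg(usa_electronics, 0, 2))                 # w2: current + next 2
# [1266.67, 1300.0, 1266.67, 1350.0, 1400.0]
print(frame_avg(usa_electronics, UNBOUNDED, 0))         # w3: running average
# [1200.0, 1350.0, 1266.67, 1275.0, 1300.0]
print(frame_avg(usa_electronics, UNBOUNDED, UNBOUNDED))  # w4: whole partition
# [1300.0, 1300.0, 1300.0, 1300.0, 1300.0]
```

Near the partition edges the frame simply shrinks: the first row of `w1` averages one value, the second row two.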

---

## 💼 **7. REAL-WORLD USE CASES**

In [8]:
# 7.1 YoY Growth (Year over Year)
print("üîπ Year over Year Growth:")
# Create multi-year data
data_yoy = [
    ("2023-01-01", "USA", 1000),
    ("2023-02-01", "USA", 1100),
    ("2023-03-01", "USA", 1200),
    ("2024-01-01", "USA", 1300),
    ("2024-02-01", "USA", 1400),
    ("2024-03-01", "USA", 1500),
]

df_yoy = spark.createDataFrame(data_yoy, ["date", "country", "revenue"]) \
    .withColumn("date", to_date(col("date"))) \
    .withColumn("year", year(col("date"))) \
    .withColumn("month", month(col("date")))

windowYoY = Window.partitionBy("country", "month").orderBy("year")

df_yoy_calc = df_yoy.withColumn(
    "prev_year_revenue",
    lag("revenue", 1).over(windowYoY)
).withColumn(
    "yoy_growth",
    round(((col("revenue") - lag("revenue", 1).over(windowYoY)) / lag("revenue", 1).over(windowYoY)) * 100, 2)
)

df_yoy_calc.orderBy("date").show()

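# 7.2 Cumulative percentage
# Note: USA has two rows per date, so with a ROWS frame the order in which
# tied rows are accumulated (and thus their running_total) is not guaranteed.
print("\n🔹 Cumulative percentage:")
windowCum = Window.partitionBy("country").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)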

windowTotal = Window.partitionBy("country")

df_cum_pct = df.withColumn(
    "running_total",
    sum("revenue").over(windowCum)
).withColumn(
    "total_revenue",
    sum("revenue").over(windowTotal)
).withColumn(
    "cumulative_pct",
    round((sum("revenue").over(windowCum) / sum("revenue").over(windowTotal)) * 100, 2)
)

df_cum_pct.orderBy("country", "date").show(20)

# 7.3 Gap analysis
print("\n🔹 Gap analysis (days without sales):")
windowGap = Window.partitionBy("country", "category").orderBy("date")

df_gap = df.withColumn(
    "prev_date",
    lag("date", 1).over(windowGap)
).withColumn(
    "days_since_last",
    datediff(col("date"), lag("date", 1).over(windowGap))
)

df_gap.orderBy("country", "category", "date").show(20)

# 7.4 Streak counting
print("\n🔹 Consecutive days with revenue > 1000:")
df_streak = df.withColumn(
    "is_high",
    when(col("revenue") > 1000, 1).otherwise(0)
).withColumn(
    "streak_group",
    sum(when(col("revenue") <= 1000, 1).otherwise(0)).over(
        Window.partitionBy("country", "category").orderBy("date")
    )
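).withColumn(
    "streak_length",
    # Sum is_high rather than count("*"): every group after the first starts
    # with the low-revenue row that broke the streak, which must not be counted.
    sum("is_high").over(
        Window.partitionBy("country", "category", "streak_group").orderBy("date")
    )
)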

df_streak.filter(col("is_high") == 1).orderBy("country", "category", "date").show(20)

🔹 Year over Year Growth:
+----------+-------+-------+----+-----+-----------------+----------+
|      date|country|revenue|year|month|prev_year_revenue|yoy_growth|
+----------+-------+-------+----+-----+-----------------+----------+
|2023-01-01|    USA|   1000|2023|    1|             NULL|      NULL|
|2023-02-01|    USA|   1100|2023|    2|             NULL|      NULL|
|2023-03-01|    USA|   1200|2023|    3|             NULL|      NULL|
|2024-01-01|    USA|   1300|2024|    1|             1000|      30.0|
|2024-02-01|    USA|   1400|2024|    2|             1100|     27.27|
|2024-03-01|    USA|   1500|2024|    3|             1200|      25.0|
+----------+-------+-------+----+-----+-----------------+----------+


🔹 Cumulative percentage:
+----------+-------+-----------+-------+-------------+-------------+--------------+
|      date|country|   category|revenue|running_total|total_revenue|cumulative_pct|
+----------+-------+-----------+-------+-------------+-------------+--------------+
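
The streak counting in 7.4 uses a two-step grouping trick that is easier to follow outside Spark. The plain-Python sketch below assumes a hypothetical revenue series that dips below the threshold twice. The function name `streak_lengths` is hypothetical, and this version counts only high-revenue rows towards a streak.

```python
def streak_lengths(revenues, threshold=1000):
    """Return (group ids, streak length per row) for consecutive high values."""
    is_high = [1 if r > threshold else 0 for r in revenues]

    # Step 1: group id = running count of low rows. Each low row starts a new
    # group, so consecutive high rows after it share that group id.
    groups, lows = [], 0
    for flag in is_high:
        lows += 1 - flag
        groups.append(lows)

    # Step 2: running count of high rows inside each group. The low row that
    # opens a group contributes 0, so the first high row after it gets 1.
    lengths, counts = [], {}
    for flag, group in zip(is_high, groups):
        counts[group] = counts.get(group, 0) + flag
        lengths.append(counts[group])
    return groups, lengths

revenues = [1200, 900, 1100, 1300, 800, 1500]  # hypothetical daily revenue
groups, lengths = streak_lengths(revenues)
print(groups)   # [0, 1, 1, 1, 2, 2]
print(lengths)  # [1, 0, 1, 2, 0, 1]
```

Step 1 corresponds to the `streak_group` column and step 2 to a running aggregate partitioned by that group. Rows with length 0 are the low-revenue days, which the notebook filters out with `is_high == 1`.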


---

## 💾 **8. SAVE RESULTS**

In [None]:
# Save window analysis results
output_path = "s3a://warehouse/window_analysis/"

# Save running totals
df_running.write.mode("overwrite").partitionBy("country").parquet(f"{output_path}running_totals/")
print("✅ Running totals saved")

# Save moving averages
df_moving.write.mode("overwrite").partitionBy("country").parquet(f"{output_path}moving_averages/")
print("✅ Moving averages saved")

# Save rankings
df_rank.write.mode("overwrite").partitionBy("country").parquet(f"{output_path}rankings/")
print("✅ Rankings saved")

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Window Basics** - partitionBy, orderBy, frames
2. **Ranking** - row_number, rank, dense_rank, ntile
3. **Analytic** - lag, lead, first_value, last_value
4. **Aggregates** - sum, avg, min, max over windows
5. **Frames** - rowsBetween, rangeBetween
6. **Use Cases** - YoY growth, running totals, moving averages

### **📊 Window Function Cheat Sheet:**

```python
# Basic window
Window.partitionBy("col").orderBy("col")

# Ranking
row_number().over(window)  # 1,2,3,4...
rank().over(window)        # 1,2,2,4...
dense_rank().over(window)  # 1,2,2,3...

# Analytic
lag("col", 1).over(window)   # Previous row
lead("col", 1).over(window)  # Next row

# Frames
rowsBetween(-2, 0)           # Current row + 2 preceding (last 3 rows)
rowsBetween(0, 2)            # Current row + next 2 rows
rowsBetween(Window.unboundedPreceding, 0)  # Running total
```

### **🚀 Next:** Day 3 - Lesson 4: Joins

---

In [None]:
spark.stop()
print("✅ Spark session stopped")
print("\n🎉 DAY 3 - LESSON 3 COMPLETED!")