# Spark Streaming with PySpark
## Module 16: Window Operations

Aggregating data over time is a fundamental requirement in streaming. For example: *"Count the number of errors every 5 minutes"* or *"Calculate the average temperature over the last hour, updating every 10 minutes."*

Spark Structured Streaming provides three types of time windows to handle these scenarios.

### Objectives:
1.  **Tumbling Windows (Fixed):** Non-overlapping, contiguous time intervals.
2.  **Sliding Windows (Overlapping):** Fixed-duration windows that start at a regular slide interval, so consecutive windows can overlap.
3.  **Session Windows (Dynamic):** Variable-length windows bounded by gaps in activity rather than by a fixed duration.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, avg, count

spark = SparkSession.builder \
    .appName("Window_Operations_Demo") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Created.")

## 1. Tumbling Windows (Fixed)

*   **Description:** A series of fixed-sized, non-overlapping and contiguous time intervals.
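*   **Behavior:** Each event belongs to **exactly one** window. Window starts are inclusive and window ends are exclusive, so an event at exactly 12:10 falls into the 12:10 - 12:20 window.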
*   **Example:** "Count events every 10 minutes."
    *   Window 1: 12:00 - 12:10
    *   Window 2: 12:10 - 12:20

**Syntax:**
`window(timeColumn, windowDuration)`

In [None]:
# Example logic (defines the query only; no stream is started)

# Assume raw_df has the columns 'eventTime' (timestamp), 'deviceId' and 'temperature'
# Calculate the average temperature per device for each 10-minute window
# Tumbling window: windowDuration="10 minutes"

# tumbling_df = raw_df.groupBy(
#     window(col("eventTime"), "10 minutes"),
#     col("deviceId")
# ).agg(avg("temperature").alias("avg_temp"))

print("Tumbling Window logic defined (Code is commented to prevent execution without data source).")
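
To make the bucketing concrete, here is a minimal pure-Python sketch of how an event time maps to its tumbling window. It is a model of the semantics, not Spark's implementation: the helper `tumbling_window` is hypothetical, and it assumes windows are aligned to the Unix epoch (Spark's default when no `startTime` offset is given).

```python
from datetime import datetime, timedelta, timezone

# Assumption: windows are aligned to the Unix epoch, with no startTime offset.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_window(event_time, size):
    """Hypothetical helper: return the [start, end) window containing event_time."""
    elapsed = event_time - EPOCH
    # Integer-divide the elapsed time by the window size to find the bucket index.
    start = EPOCH + (elapsed // size) * size
    return start, start + size

ts = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)
start, end = tumbling_window(ts, timedelta(minutes=10))
print(start.time(), end.time())  # 12:00:00 12:10:00
```

Because the end is exclusive, an event stamped exactly 12:10 lands in the 12:10 - 12:20 bucket, never in both.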

## 2. Sliding Windows (Overlapping)

*   **Description:** Fixed-sized windows that "slide" over time.
*   **Behavior:** An input can belong to **multiple** windows if the slide duration is smaller than the window duration.
*   **Example:** "Calculate average temperature over the **last 10 minutes**, updating **every 5 minutes**."
    *   Window 1: 12:00 - 12:10
    *   Window 2: 12:05 - 12:15 (Overlaps with Window 1 by 5 mins)

**Syntax:**
`window(timeColumn, windowDuration, slideDuration)`

In [None]:
# Example Logic
# Window Size: 10 minutes
# Slide Duration: 5 minutes

# sliding_df = raw_df.groupBy(
#     window(col("eventTime"), "10 minutes", "5 minutes"),
#     col("deviceId")
# ).agg(avg("temperature").alias("avg_temp"))

print("Sliding Window logic defined.")
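
The overlap can be illustrated without a running stream. The sketch below lists every sliding window that a single event falls into; `sliding_windows` is a hypothetical helper written for this notebook, and it again assumes epoch-aligned window starts and half-open `[start, end)` intervals.

```python
from datetime import datetime, timedelta, timezone

# Assumption: window starts are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def sliding_windows(event_time, size, slide):
    """Hypothetical helper: return all [start, end) windows containing event_time."""
    elapsed = event_time - EPOCH
    # Latest window start that is at or before the event.
    start = EPOCH + (elapsed // slide) * slide
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

ts = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)
for start, end in sliding_windows(ts, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), end.time())
# 12:00:00 12:10:00
# 12:05:00 12:15:00
```

With a 10-minute window and a 5-minute slide, every event is counted in two windows, which is why sliding aggregations produce more output rows than tumbling ones over the same data.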

## 3. Session Windows (Dynamic)

*   **Description:** Windows that do not have a fixed start or end time. They are defined by periods of activity followed by a gap of inactivity.
*   **Behavior:** Useful for user sessions (e.g., web analytics). A session closes when no data arrives for a specific "gap" duration.
*   **Example:** "Group user clicks into sessions. If the user is idle for 30 minutes, close the session."

**Syntax:**
`session_window(timeColumn, gapDuration)`

In [None]:
from pyspark.sql.functions import session_window  # available in Spark 3.2 and later

# Example Logic
# Gap Duration: 30 minutes (Session closes after 30 mins of inactivity)

# session_df = raw_df.groupBy(
#     session_window(col("eventTime"), "30 minutes"),
#     col("userId")
# ).count()

print("Session Window logic defined.")
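
The gap rule can be sketched in plain Python for a single user's clicks. This is an illustrative model only: `sessionize` is a hypothetical helper, and it assumes each event opens a window of `[event, event + gap)` that is merged with the current session when the event arrives before that session's end.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap):
    """Hypothetical helper: merge event times into (start, end) sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event arrived before the session expired: extend the session end.
            start, _ = sessions[-1]
            sessions[-1] = (start, ts + gap)
        else:
            # No open session, or the gap elapsed: start a new session.
            sessions.append((ts, ts + gap))
    return sessions

clicks = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 10),
    datetime(2024, 1, 1, 12, 25),
    datetime(2024, 1, 1, 13, 30),
]
for start, end in sessionize(clicks, timedelta(minutes=30)):
    print(start.time(), end.time())
# 12:00:00 12:55:00
# 13:30:00 14:00:00
```

The first three clicks form one session that ends 30 minutes after the last of them, and the 13:30 click starts a new session because more than 30 minutes passed with no activity.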

## Up Next: Implementation

In the next module, we will take our **Device Data** pipeline and implement these windows practically. We will see how **Late Data** affects these windows and how to handle it using **Watermarking**.