**⭐ 1. What This Pattern Solves**

Sessionization groups each user's events into sessions, typically for web or app analytics.
Use cases include:

- Grouping clicks into browsing sessions
- Tracking how long a user stays active
- Calculating session-based metrics such as average session length

The key is defining the session boundary, most often an inactivity threshold: a gap of more than, say, 30 minutes between two consecutive events from the same user starts a new session.
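
The boundary rule can be sketched in plain Python, with no Spark required. This is a minimal illustration, assuming the strict greater-than comparison used in the SQL and PySpark cells of this notebook, so a gap of exactly 30 minutes stays in the same session. The helper name `starts_new_session` is hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical helper; the 30-minute default mirrors the example threshold.
def starts_new_session(prev_time, event_time, threshold=timedelta(minutes=30)):
    """Return True if event_time opens a new session for the user."""
    if prev_time is None:  # first event seen for this user
        return True
    return (event_time - prev_time) > threshold

t0 = datetime(2025, 12, 4, 10, 0)
print(starts_new_session(None, t0))                        # True
print(starts_new_session(t0, t0 + timedelta(minutes=10)))  # False
print(starts_new_session(t0, t0 + timedelta(minutes=30)))  # False (gap equals threshold)
print(starts_new_session(t0, t0 + timedelta(minutes=31)))  # True
```

Whether the comparison should be strict or inclusive is a product decision; the only hard requirement is that every pipeline computing sessions uses the same rule.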

**⭐ 2. SQL Equivalent**

In [0]:
%sql
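-- The inner query flags each user's first event and any event more than 30 minutes after the previous one.
-- The outer running sum of those flags numbers the sessions within each user.
SELECT *,
       SUM(is_new_session) OVER(PARTITION BY user_id ORDER BY event_time) AS session_id
FROM (
    SELECT *,
           CASE WHEN event_time - LAG(event_time) OVER(PARTITION BY user_id ORDER BY event_time) > interval '30' minute
                OR LAG(event_time) OVER(PARTITION BY user_id ORDER BY event_time) IS NULL
           THEN 1 ELSE 0 END AS is_new_session
    FROM events
) t;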


**⭐ 3. Core Idea**

1. Sort each user's events by timestamp.
2. Detect session boundaries from the time difference between consecutive events.
3. Assign a session ID as the running sum of the new-session flags.

In PySpark this maps directly onto Window and lag().
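
The three steps can be traced without Spark. The following is a plain-Python sketch of the same logic, not the PySpark implementation; the function name `sessionize` and the sample events are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(events, threshold=timedelta(minutes=30)):
    """events: iterable of (user_id, event_time).
    Returns a list of (user_id, event_time, session_id)."""
    by_user = defaultdict(list)
    for user_id, event_time in events:
        by_user[user_id].append(event_time)

    result = []
    for user_id, times in by_user.items():
        times.sort()                      # step 1: order events per user
        session_id, prev = 0, None
        for t in times:
            # step 2: boundary flag (first event, or gap above threshold)
            is_new = prev is None or (t - prev) > threshold
            session_id += int(is_new)     # step 3: running sum of flags
            result.append((user_id, t, session_id))
            prev = t
    return result

# Illustrative data, deliberately out of order for user A
events = [
    ("A", datetime(2025, 1, 1, 9, 45)),
    ("A", datetime(2025, 1, 1, 9, 0)),
    ("A", datetime(2025, 1, 1, 9, 5)),
    ("B", datetime(2025, 1, 1, 12, 0)),
]
for row in sessionize(events):
    print(row)
```

User A ends up with two sessions (09:00 and 09:05 together, 09:45 on its own) and user B with one. Session IDs restart at 1 for every user, so a globally unique key needs both user_id and session_id.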

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
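from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, when, sum as Fsum

# df is the input events DataFrame with user_id and event_time (timestamp) columns
threshold_seconds = 30 * 60  # inactivity threshold, in seconds

window_spec = Window.partitionBy("user_id").orderBy("event_time")

# Previous event time for the same user (null for the user's first event)
df_with_lag = df.withColumn("prev_time", lag("event_time").over(window_spec))

# Casting a timestamp to long yields epoch seconds, so the gap is in seconds
gap_seconds = col("event_time").cast("long") - col("prev_time").cast("long")
df_with_flag = df_with_lag.withColumn(
    "is_new_session",
    when(col("prev_time").isNull() | (gap_seconds > threshold_seconds), 1).otherwise(0)
)

# Running sum of the flags numbers the sessions within each user
df_sessions = df_with_flag.withColumn(
    "session_id",
    Fsum("is_new_session").over(window_spec)
)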

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, when, col, sum as Fsum

spark = SparkSession.builder.getOrCreate()

data = [
    ("U1", "2025-12-04 10:00:00"),
    ("U1", "2025-12-04 10:10:00"),
    ("U1", "2025-12-04 11:00:00"),
    ("U2", "2025-12-04 09:00:00"),
    ("U2", "2025-12-04 09:40:00")
]

df = spark.createDataFrame(data, ["user_id", "event_time"])
df = df.withColumn("event_time", col("event_time").cast("timestamp"))

threshold_seconds = 30*60  # 30 minutes

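window_spec = Window.partitionBy("user_id").orderBy("event_time")

# Previous event time per user; null on each user's first event
df_with_lag = df.withColumn("prev_time", lag("event_time").over(window_spec))

# Gap in seconds (timestamp cast to long is epoch seconds)
gap_seconds = col("event_time").cast("long") - col("prev_time").cast("long")
df_with_flag = df_with_lag.withColumn(
    "is_new_session",
    when(col("prev_time").isNull() | (gap_seconds > threshold_seconds), 1).otherwise(0)
)

# Running sum of flags: session numbers restart at 1 for every user
df_sessions = df_with_flag.withColumn(
    "session_id",
    Fsum("is_new_session").over(window_spec)
)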

df_sessions.show()

**Step-by-step:**

1. Partition by user_id
2. Order by event_time
3. Compute lag() for the previous event
4. Flag new sessions based on the threshold
5. Assign session_id using a cumulative sum
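
To check the example by hand, the sketch below replays the same five rows in plain Python and prints the intermediate values (gap in seconds, flag, session ID) that the steps produce. It is a verification aid under the same 30-minute threshold, not a reproduction of the exact `show()` output.

```python
from datetime import datetime
from itertools import groupby

rows = [
    ("U1", "2025-12-04 10:00:00"),
    ("U1", "2025-12-04 10:10:00"),
    ("U1", "2025-12-04 11:00:00"),
    ("U2", "2025-12-04 09:00:00"),
    ("U2", "2025-12-04 09:40:00"),
]
threshold_seconds = 30 * 60

# Sorting tuples orders by user_id first, then by event time
parsed = sorted((u, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")) for u, ts in rows)

trace = []
for user_id, group in groupby(parsed, key=lambda r: r[0]):
    prev, session_id = None, 0
    for _, t in group:
        gap = None if prev is None else int((t - prev).total_seconds())
        flag = 1 if gap is None or gap > threshold_seconds else 0
        session_id += flag
        trace.append((user_id, t.strftime("%H:%M"), gap, flag, session_id))
        prev = t

for line in trace:
    print(line)
# ('U1', '10:00', None, 1, 1)
# ('U1', '10:10', 600, 0, 1)
# ('U1', '11:00', 3000, 1, 2)
# ('U2', '09:00', None, 1, 1)
# ('U2', '09:40', 2400, 1, 2)
```

U1's 50-minute gap and U2's 40-minute gap both exceed the threshold, so each user ends with two sessions.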

**⭐ 6. Mini Practice Problems**

1. Sessionize clicks with a 15-minute inactivity window.
2. Compute the duration of each session per user.
3. Count events per session for each user.
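
As a starting point for the duration and count problems, here is a plain-Python sketch that aggregates already-sessionized rows. The input rows are the session assignments that the detailed example's data yields under a 30-minute threshold; the dictionary layout is an arbitrary choice for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Sessionized rows: (user_id, event_time, session_id)
sessionized = [
    ("U1", datetime(2025, 12, 4, 10, 0), 1),
    ("U1", datetime(2025, 12, 4, 10, 10), 1),
    ("U1", datetime(2025, 12, 4, 11, 0), 2),
    ("U2", datetime(2025, 12, 4, 9, 0), 1),
    ("U2", datetime(2025, 12, 4, 9, 40), 2),
]

# Group event times by (user, session)
groups = defaultdict(list)
for user_id, event_time, session_id in sessionized:
    groups[(user_id, session_id)].append(event_time)

# Duration is last event minus first event; single-event sessions get 0
summary = {
    key: {
        "events": len(times),
        "duration_s": int((max(times) - min(times)).total_seconds()),
    }
    for key, times in groups.items()
}
for key, stats in sorted(summary.items()):
    print(key, stats)
# ('U1', 1) {'events': 2, 'duration_s': 600}
# ('U1', 2) {'events': 1, 'duration_s': 0}
# ('U2', 1) {'events': 1, 'duration_s': 0}
# ('U2', 2) {'events': 1, 'duration_s': 0}
```

In PySpark the same shape would likely be a groupBy on user_id and session_id followed by min, max, and count aggregations over event_time.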

**⭐ 7. Full Data Engineering Problem**

Scenario: Track user activity on a mobile app. Ingest 100M events/day and calculate:

- Number of sessions per user
- Average session duration
- Most active session per day

Solution Approach:

1. Load events from Kafka
2. Partition by user_id and order by event_time
3. Detect new sessions using lag() and the threshold
4. Compute session_id and aggregate metrics
5. Store in Silver/Gold tables

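Performance tip: the window shuffles data by user_id, so keep the input partitioned or bucketed by user (and process one day at a time) to avoid massive shuffles. Note that a hard day boundary splits any session that spans midnight.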
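
The three metrics can be prototyped on a handful of rows before scaling out. The sketch below is plain Python over made-up sessionized events; it assumes that "most active" means the session with the most events and that a session belongs to the day on which it starts. Both are interpretations, not requirements stated in the scenario.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative sessionized events: (user_id, session_id, event_time)
events = [
    ("U1", 1, datetime(2025, 12, 4, 10, 0)),
    ("U1", 1, datetime(2025, 12, 4, 10, 10)),
    ("U1", 1, datetime(2025, 12, 4, 10, 20)),
    ("U1", 2, datetime(2025, 12, 4, 11, 30)),
    ("U2", 1, datetime(2025, 12, 4, 9, 0)),
    ("U2", 1, datetime(2025, 12, 4, 9, 5)),
    ("U2", 2, datetime(2025, 12, 5, 8, 0)),
]

# Collapse events into one list of timestamps per (user, session)
sessions = defaultdict(list)
for user_id, session_id, t in events:
    sessions[(user_id, session_id)].append(t)

sessions_per_user = defaultdict(int)
durations = []
busiest_per_day = {}  # day -> (event_count, user_id, session_id)
for (user_id, session_id), times in sessions.items():
    sessions_per_user[user_id] += 1
    durations.append((max(times) - min(times)).total_seconds())
    day = min(times).date()  # assumption: session is attributed to its start day
    candidate = (len(times), user_id, session_id)
    if day not in busiest_per_day or candidate[0] > busiest_per_day[day][0]:
        busiest_per_day[day] = candidate

avg_duration_s = sum(durations) / len(durations)
print(dict(sessions_per_user))  # {'U1': 2, 'U2': 2}
print(avg_duration_s)           # 375.0
print(busiest_per_day)
```

On this sample the busiest session on 2025-12-04 is U1's first session with three events, and on 2025-12-05 it is U2's second session with one.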

**⭐ 8. Time & Space Complexity**

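- Time: O(N log N) overall, dominated by the per-user sort; the lag and cumulative-sum passes are O(N).
- Space: O(N) per partition for the sort, lag, and cumulative sum.
- Users with very many events create skewed partitions → consider bucketing or batching.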

**⭐ 9. Common Pitfalls**

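- Using a fixed session ID instead of a cumulative sum → merges sessions incorrectly
- Not sorting events by time within each user → wrong session boundaries
- Ignoring time unit conversions (seconds vs minutes) → thresholds off by a factor of 60
- Very large partitions (heavy users) → out-of-memory errors or slow, skewed tasks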
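
Two of these pitfalls are easy to reproduce on three events. The plain-Python sketch below uses a hypothetical helper, `session_ids`, that applies the running-sum rule to events in whatever order they are given, so the effect of skipping the sort or passing the threshold in the wrong unit is visible directly.

```python
from datetime import datetime

def session_ids(times, threshold_seconds):
    """Running-sum session IDs for one user's events, in the order given."""
    ids, prev, sid = [], None, 0
    for t in times:
        if prev is None or (t - prev).total_seconds() > threshold_seconds:
            sid += 1
        ids.append(sid)
        prev = t
    return ids

times = [
    datetime(2025, 12, 4, 10, 0),
    datetime(2025, 12, 4, 11, 0),
    datetime(2025, 12, 4, 10, 10),  # arrives late, out of order
]

print(session_ids(times, 30 * 60))          # unsorted input:        [1, 2, 2]
print(session_ids(sorted(times), 30 * 60))  # sorted input:          [1, 1, 2]
print(session_ids(sorted(times), 30))       # 30 treated as seconds: [1, 2, 3]
```

Without sorting, the 10:10 event is attached to the 11:00 session because its gap is negative. With the threshold passed as 30 instead of 1800, every event becomes its own session.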