#### Large-Scale User Sessionization with Variable Inactivity Windows: Healthcare EMR Interactions in the HealthcareData Lakehouse

Goal: Demonstrate sessionizing clinician activity logs from an Electronic Medical Record (EMR) system, where the definition of an inactive period (triggering a new session) varies based on the clinician's role.

Techniques to Showcase:

1. Loading and preparing event data and dimension data (using Delta Lake).
2. Joining event data with dimension data to access variable parameters (inactivity threshold).
3. Applying Spark Window Functions (lag, conditional sum) partitioned by user (clinician_id) and ordered by time.
4. Calculating time differences between events.
5. Dynamically identifying session boundaries based on the variable inactivity threshold.
6. Generating a unique session ID.
7. Aggregating events into session-level metrics (start time, end time, duration, event count).

Sample Data:

1. Clinician Dimension Data (clinician_dim)

    Contains clinician info and their specific inactivity threshold in minutes.
2. EMR Event Logs (emr_events)

    Simulates raw event logs. Timestamps are crucial for testing the inactivity logic. Note the gaps between events for specific clinicians relative to their thresholds.
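
To make the clinician dimension file concrete, here is a minimal sketch that writes `clinicians.csv` with plain Python. The four rows are the ones shown in the `clinician_dim` sample output of this notebook; the temporary output directory is an assumption for illustration, and the file would still need to be uploaded to `Files/landing/healthcare` in the Lakehouse.

```python
import csv
import tempfile
from pathlib import Path

# Rows copied from the clinician_dim sample output in this notebook.
CLINICIAN_ROWS = [
    (101, "Dr. Alice Smith", "Physician", 30),
    (102, "Bob Jones RN", "Nurse", 20),
    (103, "Charlie Davis", "Admin", 45),
    (104, "Dr. Priya Patel", "Physician", 30),
]
HEADER = ["clinician_id", "clinician_name", "clinician_role", "session_inactivity_minutes"]


def write_clinicians_csv(target_dir):
    """Write clinicians.csv into target_dir and return its path."""
    path = Path(target_dir) / "clinicians.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        writer.writerows(CLINICIAN_ROWS)
    return path


# Hypothetical local directory, used here only so the sketch runs anywhere.
csv_path = write_clinicians_csv(tempfile.mkdtemp())
print(csv_path.read_text(encoding="utf-8"))
```

The threshold column is what makes the inactivity window variable: nurses get 20 minutes, physicians 30, and admin staff 45.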


#### Reset Demo

In [1]:
%%sql

DROP TABLE IF EXISTS clinician_emr_sessions;
DROP TABLE IF EXISTS clinician_dim;

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

#### Cell 1: Setup & Imports

In [2]:
# Required import for the Fabric Warehouse/Lakehouse Spark Connector (if interacting with WH/SQL EP)
# import com.microsoft.spark.fabric # Not strictly needed if only using Lakehouse Delta

# Import PySpark functions
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import TimestampType

print("Setup complete. PySpark functions imported.")

# --- Configuration ---
# Assumes your Lakehouse is named 'HealthcareData'
# Adjust paths if you uploaded CSVs elsewhere in 'Files'
landing_zone_path = "Files/landing/healthcare"
clinician_csv_path = f"{landing_zone_path}/clinicians.csv"
events_csv_path = f"{landing_zone_path}/emr_events.csv"

# Define Delta table names within the Lakehouse ('Tables' folder)
clinician_dim_table = "clinician_dim"
emr_events_table = "emr_events"
sessionized_output_table = "clinician_emr_sessions"

print(f"Clinician CSV Path: {clinician_csv_path}")
print(f"Events CSV Path: {events_csv_path}")
print(f"Output Table: {sessionized_output_table}")

Setup complete. PySpark functions imported.
Clinician CSV Path: Files/landing/healthcare/clinicians.csv
Events CSV Path: Files/landing/healthcare/emr_events.csv
Output Table: clinician_emr_sessions


#### Cell 2: Load, Prepare, and Save Initial Delta Tables

In [3]:
# Load Clinician Data
clinicians_df = spark.read.csv(clinician_csv_path, header=True, inferSchema=True)
# Cast threshold to integer just in case
clinicians_df = clinicians_df.withColumn(
    "session_inactivity_minutes", F.col("session_inactivity_minutes").cast("integer")
)

print("Clinician Schema & Sample:")
clinicians_df.printSchema()
clinicians_df.show(5)

# Write clinician data to Delta table (overwrite for demo purposes)
clinicians_df.write.format("delta").mode("overwrite").saveAsTable(clinician_dim_table)
print(f"Clinician data saved to Delta table: {clinician_dim_table}")

# Load Event Data
events_raw_df = spark.read.csv(events_csv_path, header=True, inferSchema=True)
# ** Crucial: Cast timestamp string to actual TimestampType **
events_raw_df = events_raw_df.withColumn(
    "event_timestamp", F.to_timestamp(F.col("event_timestamp"), "yyyy-MM-dd HH:mm:ss")
)
# Cast clinician_id to integer so it matches the join key type in clinician_dim
events_raw_df = events_raw_df.withColumn("clinician_id", F.col("clinician_id").cast("integer"))


print("\nRaw Events Schema & Sample:")
events_raw_df.printSchema()
# Show ordered by clinician and time to see sequences
events_raw_df.orderBy("clinician_id", "event_timestamp").show(10, truncate=False)

# Write event data to Delta table (overwrite for demo purposes)
events_raw_df.write.format("delta").mode("overwrite").saveAsTable(emr_events_table)
print(f"EMR Event data saved to Delta table: {emr_events_table}")


# Optional: Read back from Delta to ensure types are correct for next steps
# clinicians_dim_df = spark.table(clinician_dim_table)
# emr_events_df = spark.table(emr_events_table)
# print("\nData loaded from Delta tables for processing.")

Clinician Schema & Sample:
root
 |-- clinician_id: integer (nullable = true)
 |-- clinician_name: string (nullable = true)
 |-- clinician_role: string (nullable = true)
 |-- session_inactivity_minutes: integer (nullable = true)

+------------+---------------+--------------+--------------------------+
|clinician_id| clinician_name|clinician_role|session_inactivity_minutes|
+------------+---------------+--------------+--------------------------+
|         101|Dr. Alice Smith|     Physician|                        30|
|         102|   Bob Jones RN|         Nurse|                        20|
|         103|  Charlie Davis|         Admin|                        45|
|         104|Dr. Priya Patel|     Physician|                        30|
+------------+---------------+--------------+--------------------------+

Clinician data saved to Delta table: clinician_dim

Raw Events Schema & Sample:
root
 |-- event_id: string (nullable = true)
 |-- clinician_id: integer (nullable = true)
 |-- event_times

Demo Point: Explain loading from CSV, why event_timestamp must be cast to a real timestamp (the time-gap arithmetic in Cell 4 depends on it), and why the data is saved as Delta tables (best practice in a Lakehouse).
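
The reason for the cast can be illustrated outside Spark. The sketch below is a plain-Python analogy, not the notebook's Spark code: `%Y-%m-%d %H:%M:%S` is the `strptime` counterpart of the Spark pattern `yyyy-MM-dd HH:mm:ss`, and the helper name `parse_event_timestamp` is hypothetical. Spark's `to_timestamp` differs in detail (for instance, how it reports a mismatch depends on Spark settings), so treat this only as an illustration of the idea.

```python
from datetime import datetime

# Python's strptime counterpart of the Spark pattern "yyyy-MM-dd HH:mm:ss".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_event_timestamp(raw):
    """Return a datetime, or None when the text does not match the format."""
    try:
        return datetime.strptime(raw, TIMESTAMP_FORMAT)
    except ValueError:
        return None


# Two consecutive events for clinician 101 from the sample data.
first = parse_event_timestamp("2025-03-31 08:00:00")
second = parse_event_timestamp("2025-03-31 08:01:00")

# Only parsed timestamps can be subtracted to get a gap in seconds.
gap_seconds = (second - first).total_seconds()
print(gap_seconds)

# A value in a different layout fails to parse, so it is worth checking
# for missing timestamps before sessionizing.
bad = parse_event_timestamp("31/03/2025 08:00")
print(bad)
```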

#### Cell 3: Join Events with Clinician Thresholds

In [4]:
# Read data back from Delta tables for processing
clinicians_dim_df = spark.table(clinician_dim_table)
emr_events_df = spark.table(emr_events_table)

# Join events with clinician dimension to get the inactivity threshold
events_with_threshold_df = emr_events_df.join(
    clinicians_dim_df,
    on="clinician_id",
    how="inner"  # Inner join: events whose clinician_id has no match in clinician_dim are dropped
).select(
    "event_id",
    "clinician_id",  # Joining on the column name yields a single clinician_id column
    "event_timestamp",
    "event_type",
    "patient_id",
    "metadata",
    "clinician_role",
    "session_inactivity_minutes"
)

print("Events joined with inactivity threshold:")
events_with_threshold_df.orderBy("clinician_id", "event_timestamp").show(truncate=False)

Events joined with inactivity threshold:
+--------+------------+-------------------+---------------+----------+--------------------------------+--------------+--------------------------+
|event_id|clinician_id|event_timestamp    |event_type     |patient_id|metadata                        |clinician_role|session_inactivity_minutes|
+--------+------------+-------------------+---------------+----------+--------------------------------+--------------+--------------------------+
|EVT001  |101         |2025-03-31 08:00:00|LOGIN          |NULL      |{"ip_address": "192.168.1.10"}  |Physician     |30                        |
|EVT002  |101         |2025-03-31 08:01:00|OPEN_CHART     |PAT001    |{"chart_id": "C1"}              |Physician     |30                        |
|EVT003  |101         |2025-03-31 08:03:00|ADD_NOTE       |PAT001    |{"note_type": "Consult"}        |Physician     |30                        |
|EVT004  |101         |2025-03-31 08:08:00|PLACE_ORDER    |PAT001    |{"order_type"

Demo Point: Show the result of the join, highlighting that each event now has the corresponding clinician's inactivity threshold associated with it.
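
The effect of the inner join can be sketched in plain Python as a dictionary lookup. This is an analogy under stated assumptions: the thresholds come from `clinician_dim`, the first two events come from the sample data, and the `EVT999` row with clinician 999 is hypothetical, added only to show that an unmatched event is dropped rather than kept with a missing threshold.

```python
# Thresholds in minutes, keyed by clinician_id (values from clinician_dim).
thresholds = {101: 30, 102: 20, 103: 45, 104: 30}

events = [
    {"event_id": "EVT001", "clinician_id": 101, "event_type": "LOGIN"},
    {"event_id": "EVT002", "clinician_id": 101, "event_type": "OPEN_CHART"},
    # Hypothetical event with an unknown clinician, for illustration only.
    {"event_id": "EVT999", "clinician_id": 999, "event_type": "LOGIN"},
]


def inner_join(events, thresholds):
    """Attach each clinician's threshold; report events with no match."""
    joined, dropped = [], []
    for event in events:
        minutes = thresholds.get(event["clinician_id"])
        if minutes is None:
            dropped.append(event["event_id"])
            continue
        joined.append({**event, "session_inactivity_minutes": minutes})
    return joined, dropped


joined, dropped = inner_join(events, thresholds)
print(joined)
print("dropped:", dropped)
```

If silently losing such events is not acceptable, a left join followed by a check for missing thresholds is one alternative worth considering.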

#### Cell 4: Calculate Time Differences & Session Flags using Window Functions

In [5]:
# Define the window specification: partition by clinician, order by timestamp.
# event_id is a tiebreaker so events sharing a timestamp get a deterministic order.
window_spec = Window.partitionBy("clinician_id").orderBy("event_timestamp", "event_id")

# Calculate time difference and identify session starts
events_with_session_id_df = events_with_threshold_df.withColumn(
    # Get the timestamp of the previous event for the same clinician
    "prev_event_timestamp", F.lag("event_timestamp", 1).over(window_spec)
).withColumn(
    # Calculate time difference in seconds
    "time_diff_seconds",
    F.when(F.col("prev_event_timestamp").isNull(), 0) # Handle first event
    .otherwise(F.unix_timestamp("event_timestamp") - F.unix_timestamp("prev_event_timestamp"))
).withColumn(
    # Flag = 1 if time diff > threshold (or if it's the first event), else 0
    "is_new_session_flag",
    F.when(
        (F.col("prev_event_timestamp").isNull()) |
        (F.col("time_diff_seconds") > (F.col("session_inactivity_minutes") * 60)),
        1
    ).otherwise(0)
).withColumn(
    # Session ID: Running sum of the flag within the window
    "_session_group_num", F.sum("is_new_session_flag").over(window_spec)
).withColumn(
    # Create a unique session ID (clinician + session group number), casting explicitly to string
    "session_id",
    F.concat(
        F.col("clinician_id").cast("string"),
        F.lit("_S"),
        F.col("_session_group_num").cast("string"),
    ),
)


print("Events with Session Information Calculated:")
# Show relevant columns to trace the logic
events_with_session_id_df.select(
    "clinician_id",
    "event_timestamp",
    "event_type",
    "session_inactivity_minutes",
    "prev_event_timestamp",
    "time_diff_seconds",
    "is_new_session_flag",
    "session_id"
).orderBy("clinician_id", "event_timestamp").show(30, truncate=False)

Events with Session Information Calculated:
+------------+-------------------+---------------+--------------------------+--------------------+-----------------+-------------------+----------+
|clinician_id|event_timestamp    |event_type     |session_inactivity_minutes|prev_event_timestamp|time_diff_seconds|is_new_session_flag|session_id|
+------------+-------------------+---------------+--------------------------+--------------------+-----------------+-------------------+----------+
|101         |2025-03-31 08:00:00|LOGIN          |30                        |NULL                |0                |1                  |101_S1    |
|101         |2025-03-31 08:01:00|OPEN_CHART     |30                        |2025-03-31 08:00:00 |60               |0                  |101_S1    |
|101         |2025-03-31 08:03:00|ADD_NOTE       |30                        |2025-03-31 08:01:00 |120              |0                  |101_S1    |
|101         |2025-03-31 08:08:00|PLACE_ORDER    |30                

Demo Point: This is the core logic. Explain Window.partitionBy().orderBy(), then step through the calculation: lag() fetches the previous timestamp for the same clinician; time_diff_seconds is the gap to that event; the when() condition compares the gap against the clinician's own session_inactivity_minutes (converted to seconds) to set is_new_session_flag; and the running sum() of the flag over the window gives a per-clinician session number, which is concatenated with clinician_id to form a unique session_id. In the sample data, clinician 101 (30-minute threshold) has a 40-minute gap between 08:08 and 08:48, so the 08:48 event starts session 101_S2.
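
For checking the Spark result by hand, the same rule can be written as a small plain-Python reference. This is a sketch rather than the notebook's implementation: the function name `assign_sessions` and the helper `at` are hypothetical, clinician 101's timestamps are taken from this notebook's sample output, and clinician 102's timestamps are invented to probe the boundary case where the gap equals the threshold exactly.

```python
from datetime import datetime


def assign_sessions(events, threshold_minutes):
    """Assign session IDs to (clinician_id, timestamp) events.

    A new session starts on a clinician's first event, or when the gap to
    that clinician's previous event is strictly greater than the threshold.
    """
    previous = {}        # clinician_id -> timestamp of the previous event
    session_number = {}  # clinician_id -> running session count
    labelled = []
    # Sorting the tuples orders by clinician first, then by timestamp,
    # which plays the role of partitionBy(...).orderBy(...).
    for clinician_id, ts in sorted(events):
        limit_seconds = threshold_minutes[clinician_id] * 60
        prev = previous.get(clinician_id)
        is_new = prev is None or (ts - prev).total_seconds() > limit_seconds
        if is_new:
            session_number[clinician_id] = session_number.get(clinician_id, 0) + 1
        session_id = f"{clinician_id}_S{session_number[clinician_id]}"
        labelled.append((clinician_id, ts, session_id))
        previous[clinician_id] = ts
    return labelled


def at(hhmm):
    """Build a timestamp on the demo date 2025-03-31."""
    return datetime.strptime(f"2025-03-31 {hhmm}", "%Y-%m-%d %H:%M")


thresholds = {101: 30, 102: 20}
events = [
    # Clinician 101: timestamps taken from this notebook's sample output.
    (101, at("08:00")), (101, at("08:01")), (101, at("08:03")),
    (101, at("08:08")), (101, at("08:48")), (101, at("09:05")),
    # Clinician 102: hypothetical timestamps chosen to probe the boundary.
    (102, at("09:00")), (102, at("09:20")), (102, at("09:41")),
]

labelled = assign_sessions(events, thresholds)
session_ids = [session_id for _, _, session_id in labelled]
for row in labelled:
    print(row)
```

Because the comparison is strict, clinician 102's 20-minute gap (09:00 to 09:20) stays in the same session, while the 21-minute gap that follows opens a new one. The Spark code uses the same strict `>` comparison.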

#### Cell 5: Aggregate Session Details

In [6]:
# Group by session ID and clinician ID to get session metrics
sessions_df = events_with_session_id_df.groupBy(
    "session_id",
    "clinician_id"
).agg(
    # Role and threshold are constant within a session, so first() is safe here
    F.first("clinician_role").alias("clinician_role"),
    F.first("session_inactivity_minutes").alias("session_threshold_minutes"),
    F.min("event_timestamp").alias("session_start_time"),
    F.max("event_timestamp").alias("session_end_time"),
    F.count("*").alias("event_count"),
    # Note: collect_list does not guarantee element order after a shuffle
    F.collect_list("event_type").alias("event_types_in_session")
).withColumn(
    # Duration is computed from the aliased start/end columns produced by agg()
    "session_duration_seconds",
    F.unix_timestamp(F.col("session_end_time")) - F.unix_timestamp(F.col("session_start_time"))
)

print("Aggregated Session Data:")
sessions_df.orderBy("clinician_id", "session_start_time").show(truncate=False)

Aggregated Session Data:
+----------+------------+--------------+-------------------------+-------------------+-------------------+-----------+------------------------------------------------------+------------------------+
|session_id|clinician_id|clinician_role|session_threshold_minutes|session_start_time |session_end_time   |event_count|event_types_in_session                                |session_duration_seconds|
+----------+------------+--------------+-------------------------+-------------------+-------------------+-----------+------------------------------------------------------+------------------------+
|101_S1    |101         |Physician     |30                       |2025-03-31 08:00:00|2025-03-31 08:08:00|4          |[LOGIN, OPEN_CHART, ADD_NOTE, PLACE_ORDER]            |480                     |
|101_S2    |101         |Physician     |30                       |2025-03-31 08:48:00|2025-03-31 09:05:00|4          |[OPEN_CHART, VIEW_RESULTS, SIGN_NOTE, LOGOUT]         |1020  

Demo Point: Show how the events are now grouped into distinct sessions, calculating start/end times, duration, and event counts for each session identified in the previous step.
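
The aggregation can be cross-checked with a short plain-Python sketch. The function name `summarize_sessions` is hypothetical; the four rows are the events of session 101_S1 from the sample output, deliberately listed out of order to show that the event list is sorted by time before it is collected.

```python
from collections import defaultdict
from datetime import datetime


def summarize_sessions(rows):
    """Collapse (session_id, timestamp, event_type) rows into per-session metrics."""
    grouped = defaultdict(list)
    for session_id, ts, event_type in rows:
        grouped[session_id].append((ts, event_type))

    summary = {}
    for session_id, items in grouped.items():
        items.sort(key=lambda item: item[0])  # order events by time within the session
        start, end = items[0][0], items[-1][0]
        summary[session_id] = {
            "session_start_time": start,
            "session_end_time": end,
            "event_count": len(items),
            "event_types_in_session": [event_type for _, event_type in items],
            "session_duration_seconds": int((end - start).total_seconds()),
        }
    return summary


def at(hhmm):
    """Build a timestamp on the demo date 2025-03-31."""
    return datetime.strptime(f"2025-03-31 {hhmm}", "%Y-%m-%d %H:%M")


# Events of session 101_S1 from the sample output, intentionally shuffled.
rows = [
    ("101_S1", at("08:03"), "ADD_NOTE"),
    ("101_S1", at("08:00"), "LOGIN"),
    ("101_S1", at("08:08"), "PLACE_ORDER"),
    ("101_S1", at("08:01"), "OPEN_CHART"),
]

summary = summarize_sessions(rows)
print(summary["101_S1"])
```

The duration is measured from the first to the last event, so a session with a single event has a duration of 0 seconds.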

#### Cell 6: Write Sessionized Data to Delta Lake

In [7]:
# Write the aggregated session data to a new Delta table
try:
    sessions_df.write.format("delta").mode("overwrite").saveAsTable(sessionized_output_table)
    print(f"Successfully wrote sessionized data to Delta table: {sessionized_output_table}")
except Exception as e:
    print(f"Error writing session Delta table: {e}")
    raise  # Re-raise so a failed write stops the notebook instead of passing silently

Successfully wrote sessionized data to Delta table: clinician_emr_sessions


Demo Point: Explain that this final table holds one row per session (start and end time, duration, event count, and event types), derived from the raw logs.

#### Cell 7: Analyze/Query Session Data (SQL)

In [8]:
%%sql
-- Query the sessionized Delta table created in Cell 6
-- The table name is hard-coded here; it matches the Python variable `sessionized_output_table`

SELECT
    session_id,
    clinician_id,
    clinician_role,
    session_threshold_minutes,
    session_start_time,
    session_end_time,
    session_duration_seconds,
    (session_duration_seconds / 60.0) as session_duration_minutes, -- Calculate minutes
    event_count,
    event_types_in_session -- Show the list of events
FROM
    clinician_emr_sessions -- Reference the table directly
ORDER BY
    clinician_id, session_start_time;

-- Example Analysis: Average session duration per role
/*
SELECT
    clinician_role,
    AVG(session_duration_seconds / 60.0) as avg_session_duration_minutes,
    COUNT(*) as number_of_sessions
FROM
    clinician_emr_sessions
GROUP BY
    clinician_role
ORDER BY
    avg_session_duration_minutes DESC;
*/

<Spark SQL result set with 8 rows and 10 fields>

Demo Point: Use SQL to query the final session table. Show the details of individual sessions. Run the example aggregation query (commented out by default) to demonstrate analyzing session patterns (e.g., average duration by role). Discuss how this sessionized data is much more valuable for analysis than the raw event logs.
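
The arithmetic behind the per-role aggregation can be sketched in plain Python. The function name `average_minutes_by_role` is hypothetical. The two Physician durations (480 and 1020 seconds) are the sessions visible in the sample output; the Nurse row is invented for illustration, so the numbers printed here are not the result of the SQL query on the real table.

```python
from collections import defaultdict

sessions = [
    ("Physician", 480),   # 101_S1, from the sample output
    ("Physician", 1020),  # 101_S2, from the sample output
    ("Nurse", 600),       # hypothetical session, for illustration only
]


def average_minutes_by_role(sessions):
    """Return (role, average duration in minutes, session count) per role."""
    durations = defaultdict(list)
    for role, seconds in sessions:
        durations[role].append(seconds / 60.0)
    result = [
        (role, sum(values) / len(values), len(values))
        for role, values in durations.items()
    ]
    # Longest average first, matching ORDER BY avg_session_duration_minutes DESC
    return sorted(result, key=lambda item: item[1], reverse=True)


report = average_minutes_by_role(sessions)
for role, avg_minutes, count in report:
    print(role, avg_minutes, count)
```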