# Visual Pattern Detection in Dotted Charts

This notebook explores process event logs through multiple Dotted Chart views (Time, Case, Resource, Performance). We will:
- Load real and synthetic logs
- Visualize four views with Plotly
- Generate synthetic patterns (Burst, Idle)
- Summarize insights for future app implementation


In [9]:
# Imports & setup
import os
import sys
import warnings
from datetime import timedelta
import random

import numpy as np
import pandas as pd
import plotly.express as px

# Optional: pm4py for XES support
try:
    import pm4py  # noqa: F401
    from pm4py.objects.log.importer.xes import importer as xes_importer
    PM4PY_AVAILABLE = True
except Exception:
    PM4PY_AVAILABLE = False

pd.set_option('display.max_columns', 50)
warnings.filterwarnings('ignore')

DATA_DIR = os.path.abspath(os.path.join(os.path.dirname(''), '..', 'data'))
DATA_DIR


'/Users/jennifer.nikolovic/visual-pattern-detection-app/data'

In [10]:
# Loader helper
from typing import Optional

def load_event_log(file_path: str) -> pd.DataFrame:
    """
    Load an event log from CSV or XES.
    - Normalizes columns to lowercase
    - Renames common PM4Py columns to standard names: activity, resource, case_id, timestamp
    - Parses timestamps and sorts by timestamp
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    ext = os.path.splitext(file_path)[1].lower()
    if ext == '.csv':
        df = pd.read_csv(file_path)
    elif ext == '.xes':
        if not PM4PY_AVAILABLE:
            raise ImportError("pm4py not available. Install pm4py to load XES files.")
        log = xes_importer.apply(file_path)
        from pm4py.objects.conversion.log import converter as log_converter
        from pm4py.objects.log.util import dataframe_utils
        df = log_converter.apply(log, variant=log_converter.Variants.TO_DATA_FRAME)
        df = dataframe_utils.convert_timestamp_columns_in_df(df)
    else:
        raise ValueError("Unsupported file type. Use .csv or .xes")

    # Normalize column names
    df.columns = [c.lower() for c in df.columns]

    # Common renaming for PM4Py
    rename_map = {
        'concept:name': 'activity',
        'org:resource': 'resource',
        'time:timestamp': 'timestamp',
        'case:concept:name': 'case_id',
        'caseid': 'case_id'
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    # Ensure timestamp column
    if 'timestamp' not in df.columns:
        # try alternate typical names
        for alt in ['time', 'event_time', 'start_time', 'end_time', 'event_timestamp']:
            if alt in df.columns:
                df = df.rename(columns={alt: 'timestamp'})
                break

    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

    # Ensure case_id
    if 'case_id' not in df.columns:
        for alt in ['case', 'trace_id', 'case identifier']:
            if alt in df.columns:
                df = df.rename(columns={alt: 'case_id'})
                break

    # Ensure activity
    if 'activity' not in df.columns:
        for alt in ['activity_name', 'event', 'name']:
            if alt in df.columns:
                df = df.rename(columns={alt: 'activity'})
                break

    # Ensure resource (optional)
    if 'resource' not in df.columns:
        for alt in ['org:resource', 'user', 'resource_name']:
            if alt in df.columns:
                df = df.rename(columns={alt: 'resource'})
                break

    # Sort by timestamp if present
    if 'timestamp' in df.columns:
        df = df.sort_values('timestamp').reset_index(drop=True)

    return df


In [11]:
# Dotted chart plotting helper

def _ensure_case_duration(df: pd.DataFrame) -> pd.DataFrame:
    if 'timestamp' not in df.columns or 'case_id' not in df.columns:
        return df.copy()
    durations = (df.groupby('case_id')['timestamp']
                   .agg(lambda s: (s.max() - s.min()).total_seconds() if pd.notna(s.max()) and pd.notna(s.min()) else np.nan))
    out = df.copy()
    out['case_duration'] = out['case_id'].map(durations)
    return out


def plot_dotted_chart(df: pd.DataFrame, view: str = "Time"):
    """Plot different dotted chart views using plotly.express.scatter."""
    view = (view or 'Time').strip().lower()
    dfx = _ensure_case_duration(df)

    if view == 'time':
        if not {'timestamp', 'activity'} <= set(dfx.columns):
            raise ValueError("Time View requires 'timestamp' and 'activity'")
        fig = px.scatter(dfx, x='timestamp', y='activity', opacity=0.7, title='Dotted Chart - Time View')
    elif view == 'case':
        # The Case View needs both the case identifier and the event time
        if not {'case_id', 'timestamp'} <= set(dfx.columns):
            raise ValueError("Case View requires 'case_id' and 'timestamp'")
        fig = px.scatter(dfx, x='case_id', y='timestamp', opacity=0.7, title='Dotted Chart - Case View')
    elif view == 'resource':
        if not {'timestamp', 'resource'} <= set(dfx.columns):
            raise ValueError("Resource View requires 'timestamp' and 'resource'")
        fig = px.scatter(dfx, x='timestamp', y='resource', opacity=0.7, title='Dotted Chart - Resource View')
    elif view == 'performance':
        if 'case_duration' not in dfx.columns:
            dfx = _ensure_case_duration(dfx)
        if not {'timestamp', 'case_duration'} <= set(dfx.columns):
            raise ValueError("Performance View requires 'timestamp' and 'case_duration'")
        fig = px.scatter(dfx, x='timestamp', y='case_duration', opacity=0.7, title='Dotted Chart - Performance View (sec)')
    else:
        raise ValueError("Unknown view. Use one of: Time, Case, Resource, Performance")

    fig.update_layout(height=400, legend=dict(orientation='h'))
    fig.show()

# Quick demo helper to render all views

def render_all_views(df: pd.DataFrame):
    for v in ['Time', 'Case', 'Resource', 'Performance']:
        try:
            plot_dotted_chart(df, view=v)
        except Exception as e:
            print(f"Skipping {v} view: {e}")


In [12]:
# Synthetic log generation
from datetime import datetime

def generate_synthetic_log(mode: str = 'normal',
                            num_cases: int = 20,
                            activities: Optional[list] = None,
                            start_time: Optional[pd.Timestamp] = None,
                            seed: int = 42) -> pd.DataFrame:
    """
    Generate small synthetic event logs with patterns:
    - normal: steady flow
    - burst: many events clustered in a short window
    - idle: long gaps between events
    """
    rng = np.random.default_rng(seed)
    random.seed(seed)
    if activities is None:
        activities = ['A', 'B', 'C', 'D']
    if start_time is None:
        start_time = pd.Timestamp(datetime.now().replace(microsecond=0))

    rows = []

    if mode == 'normal':
        for case in range(1, num_cases + 1):
            t = start_time + pd.Timedelta(seconds=int(rng.integers(0, 600)))
            for act in activities:
                t += pd.Timedelta(seconds=int(rng.integers(20, 120)))
                rows.append({'case_id': f'C{case:03d}', 'activity': act, 'timestamp': t, 'resource': f'R{rng.integers(1,5)}'})

    elif mode == 'burst':
        burst_center = start_time + pd.Timedelta(minutes=5)
        for case in range(1, num_cases + 1):
            # majority events near burst center
            t = burst_center + pd.Timedelta(seconds=int(rng.normal(0, 10)))
            for act in activities:
                jitter = int(rng.integers(-5, 5))
                tt = t + pd.Timedelta(seconds=jitter)
                rows.append({'case_id': f'C{case:03d}', 'activity': act, 'timestamp': tt, 'resource': f'R{rng.integers(1,5)}'})

    elif mode == 'idle':
        for case in range(1, num_cases + 1):
            t = start_time + pd.Timedelta(seconds=int(rng.integers(0, 60)))
            for act in activities:
                gap = int(rng.choice([10, 15, 120, 300]))  # occasional long idle gaps
                t += pd.Timedelta(seconds=gap)
                rows.append({'case_id': f'C{case:03d}', 'activity': act, 'timestamp': t, 'resource': f'R{rng.integers(1,5)}'})
    else:
        raise ValueError("mode must be one of: 'normal', 'burst', 'idle'")

    df = pd.DataFrame(rows)
    df = df.sort_values('timestamp').reset_index(drop=True)
    return df

# Quick generate and preview examples
normal_df = generate_synthetic_log('normal', num_cases=15)
burst_df = generate_synthetic_log('burst', num_cases=15)
idle_df = generate_synthetic_log('idle', num_cases=15)

normal_df.head(), burst_df.head(), idle_df.head()


(  case_id activity           timestamp resource
 0    C014        A 2025-10-30 13:05:08       R1
 1    C002        A 2025-10-30 13:05:09       R4
 2    C015        A 2025-10-30 13:05:24       R3
 3    C001        A 2025-10-30 13:05:31       R3
 4    C012        A 2025-10-30 13:06:07       R1,
   case_id activity           timestamp resource
 0    C013        C 2025-10-30 13:07:48       R4
 1    C004        A 2025-10-30 13:07:48       R3
 2    C002        C 2025-10-30 13:07:48       R1
 3    C004        D 2025-10-30 13:07:50       R3
 4    C002        A 2025-10-30 13:07:50       R4,
   case_id activity           timestamp resource
 0    C014        A 2025-10-30 13:03:19       R1
 1    C014        B 2025-10-30 13:03:34       R3
 2    C010        A 2025-10-30 13:03:37       R4
 3    C009        A 2025-10-30 13:03:38       R3
 4    C007        A 2025-10-30 13:03:40       R3)

In [None]:
# Try loading example real logs (if available)
import gzip

# Search data directories for *.xes.gz files
import glob

data_folders = [
    os.path.join(DATA_DIR, d) for d in os.listdir(DATA_DIR) 
    if os.path.isdir(os.path.join(DATA_DIR, d))
]
xes_files = []
for folder in data_folders:
    xes_files += glob.glob(os.path.join(folder, '*.xes.gz'))

loaded_examples = {}
decompressed_paths = []  # temporary .xes files created below
for p in xes_files:
    try:
        # load_event_log only accepts .csv/.xes, so decompress next to the archive first
        decompressed = p[:-3]
        decompressed_paths.append(decompressed)
        with gzip.open(p, 'rb') as f_in, open(decompressed, 'wb') as f_out:
            f_out.write(f_in.read())
        df_example = load_event_log(decompressed)
        loaded_examples[os.path.basename(decompressed)] = df_example
        print(f"Loaded: {decompressed} | shape={df_example.shape}")
    except Exception as e:
        print(f"Failed to load {p}: {e}")

# Clean up decompressed files. They sit in the data subfolders, so remove them
# by full path rather than by joining their base name onto DATA_DIR.
for file_path in decompressed_paths:
    if os.path.exists(file_path):
        os.remove(file_path)

# If none loaded, fall back to synthetic
if not loaded_examples:
    print("No real logs found; using synthetic examples instead.")
    loaded_examples = {
        'synthetic_normal.csv': normal_df,
        'synthetic_burst.csv': burst_df,
        'synthetic_idle.csv': idle_df,
    }

# Show basic info for the first available example
first_name, first_df = next(iter(loaded_examples.items()))
print(f"\nPreview of: {first_name}")
display(first_df.head(10))
print("\nDataFrame info:")
print(first_df.info())



parsing log, completed traces :: 100%|██████████| 100000/100000 [00:11<00:00, 8704.13it/s]


Loaded (Hospital Billing): /Users/jennifer.nikolovic/visual-pattern-detection-app/data/Hospital Billing - Event Log_1_all/Hospital Billing - Event Log.xes | shape=(2535, 23)

Preview of: Hospital Billing - Event Log.xes


Unnamed: 0,iscancelled,diagnosis,timestamp,casetype,speciality,resource,activity,blocked,isclosed,flagd,flagb,flaga,state,lifecycle:transition,case_id,closecode,actred,actorange,flagc,msgcount,version,msgtype,msgcode
0,False,,2012-12-13 10:13:18+00:00,B,C,ResN,NEW,False,True,True,False,False,In progress,complete,RB,,,,,,,,
1,False,S,2012-12-13 11:12:05+00:00,A,I,ResV,NEW,False,True,True,False,False,In progress,complete,R,,,,,,,,
2,False,,2012-12-13 11:18:50+00:00,B,M,ResNA,NEW,False,False,False,False,False,In progress,complete,PB,,,,,,,,
3,True,,2012-12-13 14:06:26+00:00,,,,DELETE,,,,,,In progress,complete,PB,,,,,,,,
4,False,,2012-12-13 15:30:12+00:00,B,C,ResUB,NEW,False,True,True,False,False,In progress,complete,ND,,,,,,,,
5,,PC,2012-12-13 15:30:51+00:00,,,,CHANGE DIAGN,,,,,,In progress,complete,ND,,,,,,,,
6,False,UA,2012-12-13 23:35:24+00:00,A,H,ResA,NEW,False,True,True,False,False,In progress,complete,XA,,,,,,,,
7,False,EA,2012-12-14 03:34:01+00:00,A,M,ResA,NEW,False,True,True,False,False,In progress,complete,CA,,,,,,,,
8,False,FA,2012-12-14 04:17:07+00:00,A,M,ResA,NEW,False,True,True,False,False,In progress,complete,DA,,,,,,,,
9,False,,2012-12-14 11:00:47+00:00,B,C,ResUB,NEW,False,True,True,False,False,In progress,complete,QD,,,,,,,,



DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2535 entries, 0 to 2534
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   iscancelled           549 non-null    object             
 1   diagnosis             449 non-null    object             
 2   timestamp             2535 non-null   datetime64[ns, UTC]
 3   casetype              506 non-null    object             
 4   speciality            500 non-null    object             
 5   resource              1354 non-null   object             
 6   activity              2535 non-null   object             
 7   blocked               500 non-null    object             
 8   isclosed              500 non-null    object             
 9   flagd                 500 non-null    object             
 10  flagb                 500 non-null    object             
 11  flaga                 500 non-null    object        

In [14]:
# --- Plot all loaded real logs (all views) ---

for log_name, df in loaded_examples.items():
    print(f"\n{'='*50}\nReport for: {log_name}")
    display(df.head(10))
    print("\nDataFrame info:")
    print(df.info())
    for v in ['Time', 'Case', 'Resource', 'Performance']:
        print(f"\nDotted Chart View: {v}")
        try:
            plot_dotted_chart(df, view=v)
        except Exception as e:
            print(f"Error plotting {v}: {e}")




Report for: Hospital Billing - Event Log.xes




Dotted Chart View: Case



Dotted Chart View: Resource



Dotted Chart View: Performance


In [15]:
# Visualize four views for the first example
print(f"Rendering dotted chart views for: {first_name}")
render_all_views(first_df)


Rendering dotted chart views for: Hospital Billing - Event Log.xes


## Inspecting Patterns

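- Bursts typically appear as **vertical dense lines** in Time/Resource views, where `timestamp` is on the x-axis.
- Idle behavior shows as **gaps along the time axis**: horizontal gaps in the Time view and, because `plot_dotted_chart` puts `timestamp` on the y-axis for the Case view, vertical gaps within a case's column there.
- Periodic activity may form **regularly spaced clusters along the time axis** that repeat across activities/resources.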

We will now render the same views for synthetic data to observe these patterns clearly.


In [16]:
# Render views for synthetic datasets
print("Synthetic: normal flow")
render_all_views(normal_df)

print("Synthetic: burst pattern")
render_all_views(burst_df)

print("Synthetic: idle pattern")
render_all_views(idle_df)


Synthetic: normal flow


Synthetic: burst pattern


Synthetic: idle pattern


## Summary of Insights

- **Time View**: Best to spot bursts (vertical stacks) and overall temporal density.
- **Case View**: Highlights case progression and idleness (gaps along the time axis, which is the y-axis in this notebook's Case View).
- **Resource View**: Reveals resource-specific spikes or workload clustering.
- **Performance View**: Correlates event timing with case durations; useful for bottleneck hypotheses.

Potential heuristics for later detection:
- **Burst detection**: moving window counts over `timestamp`, z-score threshold.
- **Idle detection**: inter-event time per `case_id` exceeding a percentile threshold.
- **Periodicity**: autocorrelation or spectral density over event rates by `activity`/`resource`.
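
The burst and idle heuristics listed above could be prototyped roughly as follows. This is a minimal sketch rather than the app's implementation: the function names `detect_bursts` and `detect_idle_gaps`, the 10-second bins, the 6-bin window, and both thresholds are illustrative assumptions. Only the `timestamp` and `case_id` columns come from the loader in this notebook.

```python
import pandas as pd


def detect_bursts(df: pd.DataFrame, bin_size: str = "10s",
                  window_bins: int = 6, z_threshold: float = 2.0) -> pd.DataFrame:
    """Moving-window event counts over `timestamp`, flagged by z-score.

    All defaults are illustrative assumptions, not tuned values.
    """
    events = df.dropna(subset=["timestamp"]).set_index("timestamp").sort_index()
    # Count events in fixed bins, then sum the last `window_bins` bins
    # to obtain a trailing moving-window count.
    per_bin = events.resample(bin_size).size()
    moving = per_bin.rolling(window_bins, min_periods=1).sum()

    # z-score of each window count; a constant series has no outliers.
    std = moving.std(ddof=0)
    z = (moving - moving.mean()) / std if std > 0 else moving * 0.0
    return pd.DataFrame({"events": moving, "z": z, "is_burst": z > z_threshold})


def detect_idle_gaps(df: pd.DataFrame, percentile: float = 0.95) -> pd.DataFrame:
    """Flag events whose gap to the previous event of the same case is unusually long.

    The percentile threshold is an assumption; it is computed over all cases.
    """
    ordered = (df.dropna(subset=["timestamp"])
                 .sort_values(["case_id", "timestamp"])
                 .copy())
    # Inter-event time within each case, in seconds (NaN for a case's first event).
    ordered["gap_sec"] = (ordered.groupby("case_id")["timestamp"]
                                 .diff()
                                 .dt.total_seconds())
    threshold = ordered["gap_sec"].quantile(percentile)
    ordered["is_idle"] = ordered["gap_sec"] > threshold
    return ordered
```

For example, `detect_idle_gaps(idle_df).query('is_idle')` would list the events that follow an unusually long pause within their case. Because the threshold is a percentile of the observed gaps, some events are flagged whenever the gaps vary, so the flagged rows still need visual confirmation in the dotted chart.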

These explorations guide feature design for the Streamlit app’s dotted chart analysis.
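
The periodicity heuristic from the list of detection ideas can be sketched in a similar way. The example below assumes a hypothetical helper `event_rate_autocorrelation` that bins events into a rate series and computes its sample autocorrelation with NumPy. The bin size and maximum lag are illustrative choices, and spectral methods are not covered here.

```python
import numpy as np
import pandas as pd


def event_rate_autocorrelation(df: pd.DataFrame, bin_size: str = "1min",
                               max_lag: int = 60) -> pd.Series:
    """Sample autocorrelation of the binned event rate for lags 1..max_lag.

    `bin_size` and `max_lag` are assumptions chosen for illustration.
    """
    events = df.dropna(subset=["timestamp"]).set_index("timestamp").sort_index()
    # Events per bin; empty bins count as zero.
    rate = events.resample(bin_size).size().astype(float)

    centered = rate.to_numpy() - rate.mean()
    denom = float(np.dot(centered, centered))
    lags = list(range(1, min(max_lag, len(centered) - 1) + 1))
    if denom == 0.0:
        # A constant rate carries no periodic signal.
        values = [0.0 for _ in lags]
    else:
        values = [float(np.dot(centered[:-k], centered[k:]) / denom) for k in lags]
    return pd.Series(values, index=lags, name="autocorr")
```

A pronounced peak at some lag `k` suggests that activity tends to repeat every `k` bins. To look at a single activity or resource, filter the log first, for example `event_rate_autocorrelation(df[df['activity'] == 'NEW'])`.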
