# BAföG OCEL Simulation

This simulation generates an **Object-Centric Event Log (OCEL)** for the BAföG application process.

## Outputs
- `events.csv` - All events, including a sorting column
- `applications.csv` - Application objects
- `documents.csv` - Document objects
- `event_object_link.csv` - Links between events and objects (events ↔ objects)
- `log_not_sliced.csv` - Intermediate format for sanity checks

## Data model
Based on `agent/schema.sql`, with two object types:
- **Application**: one object per application
- **Document**: 1-5 documents per application (depending on its attributes)

<div style="border-left:6px solid #6366F1; padding:12px 14px; border-radius:10px;">

## How this notebook works (high level)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">OCEL</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">SimPy</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">CSV Export</span>

This notebook generates a **synthetic OCEL (Object-Centric Event Log)** for the BAföG application process.

### Flow

1. **Load config** from `data/bpmn_models/simulation_config/sim_ocel_config.json`
2. **Simulate arrivals** (time-dependent inter-arrival rates) + **process behavior** (gateways, deviations, durations)
3. **Model capacity / backlog** using a limited **SimPy resource** for `Clerk`
4. **Export OCEL CSVs** (`events.csv`, `applications.csv`, `documents.csv`, `event_object_link.csv`, `log_not_sliced.csv`)

### Key concept

- **System** activities are assumed to have *unlimited capacity (24/7)*
- **Clerk** activities are capacity-limited (queueing/backlogs can happen)

</div>
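
The key concept above (limited `Clerk` capacity leads to queueing) can be illustrated without SimPy. The following is a minimal standard-library sketch, not the notebook's implementation: `fifo_waiting_times` is a hypothetical helper, and the arrival times and the fixed service time are made-up values chosen so the result can be checked by hand.

```python
import heapq
from typing import List


def fifo_waiting_times(arrivals: List[float], service_minutes: float, capacity: int) -> List[float]:
    """Waiting time per case for a FIFO queue served by `capacity` identical clerks."""
    # Min-heap holding the time at which each clerk becomes free again.
    free_at = [0.0] * capacity
    heapq.heapify(free_at)

    waits = []
    for arrival in sorted(arrivals):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait only if no clerk is free yet
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_minutes)
    return waits


arrivals = [0.0, 1.0, 2.0]  # minutes; hypothetical arrival times
print(fifo_waiting_times(arrivals, service_minutes=5.0, capacity=1))  # [0.0, 4.0, 8.0]
print(fifo_waiting_times(arrivals, service_minutes=5.0, capacity=2))  # [0.0, 0.0, 3.0]
```

With one clerk the waiting time grows with every case, which is the backlog effect the simulation is meant to reproduce. Adding a second clerk removes most of it.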

## 1. Setup & Imports

In [37]:
import json
import random
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple
from pathlib import Path
import numpy as np
import pandas as pd
import simpy
import sys

# Make the project root importable (notebooks/ -> project/)
PROJECT_ROOT = Path.cwd().resolve().parent
sys.path.insert(0, str(PROJECT_ROOT))

# Project paths are defined in the project's config module
from config import ROOT, OCEL_SIMULATION_DIR, OCEL_DIR

print(f"Project Root: {ROOT}")
print(f"Config Path: {OCEL_SIMULATION_DIR}")
print(f"Output Dir: {OCEL_DIR}")

Project Root: C:\Users\abodu\Desktop\Clutter Desktop\اوراق الجامعة\Semesters\WinterSemester 25&26\Buisness Process Management\pm4py\Mining-tests
Config Path: C:\Users\abodu\Desktop\Clutter Desktop\اوراق الجامعة\Semesters\WinterSemester 25&26\Buisness Process Management\pm4py\Mining-tests\data\bpmn_models\simulation_config\sim_ocel_config.json
Output Dir: C:\Users\abodu\Desktop\Clutter Desktop\اوراق الجامعة\Semesters\WinterSemester 25&26\Buisness Process Management\pm4py\Mining-tests\data\outputs\event_logs\ocel


## 2. Load Configuration

<div style="border-left:6px solid #22C55E; padding:12px 14px; border-radius:10px;">

## Configuration knobs (`sim_ocel_config.json`)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">simulation</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">interarrival</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">resources</span>

### Most important fields under `simulation`

- `num_cases`
  Total number of applications to generate.
- `start_date`
  Timestamp of simulation start.
- `random_seed`
  Makes the simulation reproducible.

---

### Time span control

- `target_log_days`
  Target overall time span (days) for generated timestamps.
- `tail_buffer_days`
  Extra days after the *last arrival* so in-flight cases can finish (realistic tail).

---

### Realism / backlog controls

- `enforce_working_hours`
  If `true`, `Clerk` work is restricted to `resources.Clerk.availability`.
- `max_review_rounds`
  Upper bound for repeating the "Review → missing? → request/receive" cycle.

---

### Optional boundary events

- `include_start_event` → include/exclude `Application started`
- `include_end_event` → include/exclude `Application handled`

### Arrivals

Arrivals are configured under `interarrival` (weekday/weekend + hour-of-day mean minutes). The notebook calibrates these means by applying an internal scale factor so arrivals fit into `target_log_days`.

</div>
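
To make the shape of the configuration concrete, here is a hedged sketch of what an excerpt of `sim_ocel_config.json` might look like, together with a lookup of the inter-arrival mean for a given moment. The key names follow the fields described above and the keys read by the notebook code. `num_cases`, `start_date`, `random_seed`, and `target_log_days` match the printed run output; all other values are placeholders taken from the code's fallback defaults, not from the real file. `mean_interarrival_minutes` is a hypothetical helper.

```python
import json
from datetime import datetime

# Hypothetical excerpt: key names follow the notebook, most values are placeholders.
EXAMPLE_CONFIG = json.loads("""
{
  "simulation": {
    "num_cases": 9800,
    "start_date": "2024-09-15T00:00:00",
    "random_seed": 42,
    "target_log_days": 40,
    "tail_buffer_days": 5,
    "enforce_working_hours": true,
    "max_review_rounds": 1,
    "include_start_event": true,
    "include_end_event": true
  },
  "interarrival": {
    "weekend": {"mean_minutes": 300},
    "weekday_08_16": {"mean_minutes": 120},
    "weekday_16_21": {"mean_minutes": 30},
    "weekday_21_24": {"mean_minutes": 180}
  },
  "resources": {
    "Clerk": {
      "capacity": 1,
      "availability": {
        "days": ["Mon", "Tue", "Wed", "Thu", "Fri"],
        "start_hour": 8,
        "end_hour": 16
      }
    }
  }
}
""")


def mean_interarrival_minutes(cfg: dict, moment: datetime) -> float:
    """Pick the (unscaled) mean inter-arrival time for the window containing `moment`."""
    windows = cfg.get("interarrival", {})
    # weekday() is used instead of strftime('%a') so the check is locale-independent.
    if moment.weekday() >= 5:  # Saturday or Sunday
        key, default = "weekend", 300
    elif 8 <= moment.hour < 16:
        key, default = "weekday_08_16", 120
    elif 16 <= moment.hour < 21:
        key, default = "weekday_16_21", 30
    elif 21 <= moment.hour < 24:
        key, default = "weekday_21_24", 180
    else:
        return 240.0  # weekday 00:00-08:00: the notebook uses a fixed value here
    return float(windows.get(key, {}).get("mean_minutes", default))


start = datetime.fromisoformat(EXAMPLE_CONFIG["simulation"]["start_date"])
print(mean_interarrival_minutes(EXAMPLE_CONFIG, start))  # 300.0 (2024-09-15 is a Sunday)
```

The calibration step later multiplies these means by a single scale factor, so the relative shape between windows stays the same.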

In [38]:
with open(OCEL_SIMULATION_DIR, 'r', encoding='utf-8') as f:
    config = json.load(f)

# Extract key parameters
NUM_CASES = config['simulation']['num_cases']
START_DATE = datetime.fromisoformat(config['simulation']['start_date'])
RANDOM_SEED = config['simulation']['random_seed']
TARGET_LOG_DAYS = config.get('simulation', {}).get('target_log_days', 40)

# Debug mode: Set to True for faster testing with fewer cases
DEBUG_MODE = False
DEBUG_CASES = 50  # Number of cases in debug mode

if DEBUG_MODE:
    NUM_CASES = DEBUG_CASES
    print(f"⚠️  DEBUG MODE: Running with {NUM_CASES} cases")

# Set random seeds
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print(f"Simulation: {NUM_CASES} cases starting at {START_DATE}")
print(f"Random seed: {RANDOM_SEED}")
print(f"Target time span (days): {TARGET_LOG_DAYS}")

Simulation: 9800 cases starting at 2024-09-15 00:00:00
Random seed: 42
Target time span (days): 40


## 3. Data Classes

<div style="border-left:6px solid #A855F7; padding:12px 14px; border-radius:10px;">

### Data classes (OCEL objects + events)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Application</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Document</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Event</span>

This section defines the core in-memory structures that later become the OCEL tables:

- **`Application`**
  The primary business object (case/object), with attributes like `is_parent_independent`, `housing_type`, and final `status`.
- **`Document`**
  Secondary object type linked to an application (`application_id`). Documents can be `Missing` or `Received`, with an optional `submission_date`.
- **`Event`**
  Stores one activity execution with `activity`, `timestamp`, `sorting_integer`, `org_resource`, and a list of `linked_objects`.

Each class provides a `to_dict()` used during export to build the CSV files.

</div>
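
Because `Event.to_dict()` does not include `linked_objects`, the event-to-object relation has to be flattened separately for `event_object_link.csv`. The export code is not part of this section, so the following is only a sketch of one plausible flattening. The column names (`event_id`, `object_id`, `object_type`) and the helper `build_link_rows` are assumptions, and plain tuples stand in for the notebook's `Event` objects.

```python
import csv
import io
from typing import Dict, List, Tuple

# Stand-ins for Event objects: (event_id, activity, linked_objects)
events: List[Tuple[str, str, List[Tuple[str, str]]]] = [
    ("E_000001", "Receive Application",
     [("APP_000000", "Application"), ("DOC_000000", "Document"), ("DOC_000001", "Document")]),
    ("E_000002", "Review Document",
     [("APP_000000", "Application")]),
]


def build_link_rows(events) -> List[Dict[str, str]]:
    """Emit one row per (event, object) pair."""
    rows = []
    for event_id, _activity, linked_objects in events:
        for object_id, object_type in linked_objects:
            rows.append({"event_id": event_id, "object_id": object_id, "object_type": object_type})
    return rows


rows = build_link_rows(events)

# Write to an in-memory buffer instead of a file to keep the sketch side-effect free.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["event_id", "object_id", "object_type"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

An event linked to one application and two documents yields three link rows, which is what makes the log object-centric rather than case-centric.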

In [39]:
@dataclass
class Application:
    """Represents a BAföG application."""
    application_id: str
    student_id: str
    is_initial_application: bool = True
    is_parent_independent: bool = False
    housing_type: str = "Alleine"  # 'Eltern' or 'Alleine'
    status: str = "Pending"  # 'Pending', 'Approved', 'Rejected'
    
    def to_dict(self) -> dict:
        return {
            'application_id': self.application_id,
            'student_id': self.student_id,
            'is_initial_application': self.is_initial_application,
            'is_parent_independent': self.is_parent_independent,
            'housing_type': self.housing_type,
            'status': self.status
        }


@dataclass
class Document:
    """Represents a document attached to an application."""
    document_id: str
    application_id: str
    doc_type: str
    doc_category: str
    status: str = "Missing"  # 'Missing', 'Received'
    submission_date: Optional[datetime] = None
    
    def to_dict(self) -> dict:
        return {
            'document_id': self.document_id,
            'application_id': self.application_id,
            'doc_type': self.doc_type,
            'doc_category': self.doc_category,
            'status': self.status,
            'submission_date': self.submission_date.isoformat() if self.submission_date else None
        }


@dataclass
class Event:
    """Represents an event in the process."""
    event_id: str
    activity: str
    timestamp: datetime
    sorting_integer: int
    org_resource: str
    linked_objects: List[Tuple[str, str]] = field(default_factory=list)  # [(object_id, object_type), ...]
    
    def to_dict(self) -> dict:
        return {
            'event_id': self.event_id,
            'activity': self.activity,
            'timestamp': self.timestamp.isoformat(),
            'sorting_integer': self.sorting_integer,
            'org_resource': self.org_resource
        }

## 4. Duration Sampling Functions

<div style="border-left:6px solid #06B6D4; padding:12px 14px; border-radius:10px;">

### Duration sampling (activity processing times)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">uniform</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">normal</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">exponential</span>

This section turns the JSON activity definitions into sampled **durations in minutes**.

- `sample_duration(activity_config)`
  Samples from the configured distribution and applies bounds (min/max) to avoid unrealistic outliers.
- `get_activity_duration(activity_name)`
  Looks up the activity in `config['activities']` (exact match first, then partial match) and returns a sampled duration.
- `get_resource(activity_name)`
  Returns the configured resource (e.g., `System` or `Clerk`) for the activity. Only exact name matches are considered; unknown activities default to `System`.

These durations feed directly into SimPy timeouts and therefore shape the timestamp spacing in the final event log.

</div>
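
One property of the bounds is worth seeing in numbers: values outside `[min, max]` are clamped to the nearest bound rather than resampled, so probability mass accumulates exactly on the bounds. The sketch below uses only the standard library (`random.Random.expovariate` instead of NumPy) and hypothetical parameter values; `sample_clamped_exponential` is an illustrative helper, not a function from the notebook.

```python
import random


def sample_clamped_exponential(rng: random.Random, mean_minutes: float,
                               min_minutes: float, max_minutes: float) -> float:
    """Draw an exponential duration and clamp it into [min_minutes, max_minutes]."""
    raw = rng.expovariate(1.0 / mean_minutes)  # expovariate takes the rate, not the mean
    return max(min_minutes, min(max_minutes, raw))


rng = random.Random(42)
samples = [sample_clamped_exponential(rng, 60.0, 1.0, 180.0) for _ in range(10_000)]

share_at_max = sum(1 for s in samples if s == 180.0) / len(samples)
print(f"min={min(samples):.2f}, max={max(samples):.2f}")
print(f"share clamped to the upper bound: {share_at_max:.1%}")
```

With a mean of 60 and an upper bound of 180 (three times the mean, the notebook's default ratio), the expected share on the upper bound is about exp(-3), roughly 5%. If that spike in the duration histogram is undesirable, rejection sampling (redrawing until the value is in range) would be an alternative.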

In [40]:
def sample_duration(activity_config: dict) -> float:
    """Sample a duration in minutes based on the activity configuration."""
    dist = activity_config.get('distribution', 'uniform')
    
    if dist == 'uniform':
        duration = random.uniform(
            activity_config.get('min_minutes', 1),
            activity_config.get('max_minutes', 5)
        )
    elif dist == 'normal':
        mean = activity_config.get('mean_minutes', 10)
        std = activity_config.get('std_minutes', 2)
        min_val = activity_config.get('min_minutes', 1)
        max_val = activity_config.get('max_minutes', mean * 3)
        
        duration = np.random.normal(mean, std)
        duration = max(min_val, min(max_val, duration))  # Clamp to [min_val, max_val]
    elif dist == 'exponential':
        mean = activity_config.get('mean_minutes', 60)
        min_val = activity_config.get('min_minutes', 1)
        max_val = activity_config.get('max_minutes', mean * 3)
        
        duration = np.random.exponential(mean)
        duration = max(min_val, min(max_val, duration))  # Clamp to [min_val, max_val]
    else:
        duration = 5  # Default
    
    return duration


def get_activity_duration(activity_name: str) -> float:
    """Get duration for an activity in minutes."""
    activities = config.get('activities', {})
    
    # Try exact match first
    if activity_name in activities:
        return sample_duration(activities[activity_name])
    
    # Try partial match
    for key, cfg in activities.items():
        if key.lower() in activity_name.lower() or activity_name.lower() in key.lower():
            return sample_duration(cfg)
    
    # Default: 5 minutes
    return 5.0


def get_resource(activity_name: str) -> str:
    """Get resource for an activity."""
    activities = config.get('activities', {})
    if activity_name in activities:
        return activities[activity_name].get('resource', 'System')
    return 'System'

## 5. Document Generation

<div style="border-left:6px solid #F97316; padding:12px 14px; border-radius:10px;">

### Document generation (object creation rules)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">conditions</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">1..N docs</span>

This section generates `Document` objects for each `Application` based on attributes and JSON rules in `config['document_types']`.

- Documents with `always_generated: true` are created for every application.
- Conditional documents are created depending on application attributes, e.g.:
  - parent-dependent applications generate income-related forms
  - students not living with parents can generate housing documents

New documents start with status `Missing` and are later set to `Received` when the simulation reaches the corresponding receive step.

</div>
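
The rule evaluation can also be written as a lookup table from condition name to predicate. The sketch below is an alternative formulation, not the notebook's code. The `DOCUMENT_TYPES` table is a guess at the relevant part of `config['document_types']`: the document names and categories come from the notebook's test output, while the assignment of conditions to documents is an assumption.

```python
from typing import Callable, Dict, List

# Assumed rule table; the real one lives in sim_ocel_config.json.
DOCUMENT_TYPES: Dict[str, dict] = {
    "Formblatt 1": {"category": "Antrag", "always_generated": True},
    "Immatrikulationsbescheinigung": {"category": "Identität", "always_generated": True},
    "Formblatt 3": {"category": "Einkommen", "condition": "is_parent_dependent"},
    "Einkommensnachweis Eltern": {"category": "Einkommen", "condition": "has_formblatt_3"},
    "Mietbescheinigung": {"category": "Wohnen", "condition": "not_living_with_parents"},
}

# Each predicate receives (is_parent_independent, housing_type).
CONDITIONS: Dict[str, Callable[[bool, str], bool]] = {
    "is_parent_dependent": lambda independent, housing: not independent,
    "has_formblatt_3": lambda independent, housing: not independent,
    "not_living_with_parents": lambda independent, housing: housing != "Eltern",
}


def required_documents(is_parent_independent: bool, housing_type: str) -> List[str]:
    """Return the document types an application with these attributes would get."""
    selected = []
    for doc_type, rule in DOCUMENT_TYPES.items():
        if rule.get("always_generated", False):
            selected.append(doc_type)
            continue
        predicate = CONDITIONS.get(rule.get("condition"))
        if predicate is not None and predicate(is_parent_independent, housing_type):
            selected.append(doc_type)
    return selected


print(required_documents(is_parent_independent=True, housing_type="Eltern"))
print(required_documents(is_parent_independent=False, housing_type="Alleine"))
```

Under these assumed rules, a parent-independent student living with their parents gets only the two always-generated documents, while a parent-dependent student living alone gets all five.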

In [41]:
def generate_documents(application: Application, doc_counter: int) -> Tuple[List[Document], int]:
    """Generate documents for an application based on its attributes."""
    documents = []
    doc_types = config.get('document_types', {})
    
    for doc_type, doc_config in doc_types.items():
        should_generate = False
        
        if doc_config.get('always_generated', False):
            should_generate = True
        elif doc_config.get('condition') == 'is_parent_dependent':
            should_generate = not application.is_parent_independent
        elif doc_config.get('condition') == 'has_formblatt_3':
            # Check if Formblatt 3 was generated
            should_generate = not application.is_parent_independent
        elif doc_config.get('condition') == 'not_living_with_parents':
            should_generate = application.housing_type != 'Eltern'
        
        if should_generate:
            doc = Document(
                document_id=f"DOC_{doc_counter:06d}",
                application_id=application.application_id,
                doc_type=doc_type,
                doc_category=doc_config.get('category', 'Sonstiges'),
                status='Missing'
            )
            documents.append(doc)
            doc_counter += 1
    
    return documents, doc_counter


# Test document generation
test_app = Application(
    application_id="APP_TEST",
    student_id="STU_TEST",
    is_parent_independent=False,
    housing_type="Alleine"
)
test_docs, _ = generate_documents(test_app, 0)
print(f"Test application generates {len(test_docs)} documents:")
for doc in test_docs:
    print(f"  - {doc.doc_type} ({doc.doc_category})")

Test application generates 5 documents:
  - Formblatt 1 (Antrag)
  - Immatrikulationsbescheinigung (Identität)
  - Formblatt 3 (Einkommen)
  - Einkommensnachweis Eltern (Einkommen)
  - Mietbescheinigung (Wohnen)


## 6. Simulation Engine

<div style="border-left:6px solid #F59E0B; padding:12px 14px; border-radius:10px;">

### Simulation engine details

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Queues</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Capacity</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Working Hours</span>

This section implements the process dynamics using **SimPy**.

---

#### Resources and backlogs

- `System` activities run with effectively unlimited capacity and can happen at any time.
- `Clerk` activities use a limited SimPy resource:
  - `self.clerk = simpy.Resource(..., capacity=resources.Clerk.capacity)`
  - If demand > capacity, cases wait in the clerk queue.
  - Waiting times are stored in `sim.waiting_times_minutes['Clerk']`.

---

#### Working hours (optional)

If `simulation.enforce_working_hours = true`, clerk work is only performed during:

- `resources.Clerk.availability.days`
- `resources.Clerk.availability.start_hour` / `end_hour`

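Implementation note: before requesting the clerk, the simulation waits until the next shift start (`_minutes_until_next_clerk_shift`). The check runs only before the request, so a case that joins the queue during a shift can still be served after `end_hour`.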

---

#### Review rounds (controlled)

Bounded by `simulation.max_review_rounds`:

- `1`: at most one review round
- `> 1`: repeat up to N rounds

---

#### Optional start/end events

- `simulation.include_start_event` → `Application started`
- `simulation.include_end_event` → `Application handled`

These switches only affect the event log output, not the internal case objects.

</div>
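
The working-hours rule is easiest to check with a few concrete timestamps. The following standalone sketch mirrors the idea of `_minutes_until_next_clerk_shift` but is not the notebook's method: `minutes_until_shift` is a hypothetical free function, it uses `datetime.weekday()` instead of `strftime('%a')` so it does not depend on the locale, and it raises an error when no working day is configured, whereas the notebook falls back to a fixed delay.

```python
from datetime import datetime, timedelta
from typing import Iterable

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # index = datetime.weekday()


def minutes_until_shift(now: datetime,
                        days: Iterable[str] = ("Mon", "Tue", "Wed", "Thu", "Fri"),
                        start_hour: float = 8.0,
                        end_hour: float = 16.0) -> float:
    """Minutes from `now` until a clerk shift is open (0.0 if one is open already)."""
    allowed = set(days)
    hour = now.hour + now.minute / 60 + now.second / 3600
    if DAY_NAMES[now.weekday()] in allowed and start_hour <= hour < end_hour:
        return 0.0

    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    for add_days in range(8):  # today plus the next seven days covers any weekly pattern
        day = midnight + timedelta(days=add_days)
        if DAY_NAMES[day.weekday()] not in allowed:
            continue
        shift_start = day + timedelta(hours=start_hour)
        if shift_start > now:
            return (shift_start - now).total_seconds() / 60
    raise ValueError("no working day configured")


print(minutes_until_shift(datetime(2024, 9, 18, 10, 30)))  # Wednesday, in shift -> 0.0
print(minutes_until_shift(datetime(2024, 9, 16, 7, 0)))    # Monday 07:00 -> 60.0
print(minutes_until_shift(datetime(2024, 9, 15, 0, 0)))    # Sunday 00:00 -> 1920.0
print(minutes_until_shift(datetime(2024, 9, 20, 17, 0)))   # Friday 17:00 -> 3780.0
```

The third call corresponds to the notebook's configured start date: the simulation starts on a Sunday at midnight, so with Monday-to-Friday availability the first clerk activity cannot begin before 32 hours have passed.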

In [42]:
class BAfoegSimulation:
    """SimPy-based simulation of the BAföG application process."""
    
    def __init__(self, config: dict, start_date: datetime, interarrival_scale: float = 1.0):
        self.config = config
        self.start_date = start_date
        self.interarrival_scale = interarrival_scale
        self.env = simpy.Environment()

        sim_cfg = config.get('simulation', {})
        self.enforce_working_hours = sim_cfg.get('enforce_working_hours', True)
        self.max_review_rounds = int(sim_cfg.get('max_review_rounds', 1))
        if self.max_review_rounds < 1:
            self.max_review_rounds = 1

        self.include_start_event = bool(sim_cfg.get('include_start_event', True))
        self.include_end_event = bool(sim_cfg.get('include_end_event', True))
        
        # Counters
        self.event_counter = 0
        self.app_counter = 0
        self.doc_counter = 0
        
        # Storage
        self.applications: List[Application] = []
        self.documents: List[Document] = []
        self.events: List[Event] = []

        # Waiting time stats
        self.waiting_times_minutes: Dict[str, List[float]] = {"Clerk": [], "System": []}
        
        # Gateway probabilities
        self.gateways = config.get('gateways', {})
        
        # Deviation probabilities
        self.deviations = config.get('deviations', {})

        # SimPy resources
        resources_cfg = config.get('resources', {})
        clerk_capacity = int(resources_cfg.get('Clerk', {}).get('capacity', 1))
        self.clerk = simpy.Resource(self.env, capacity=clerk_capacity)
        self.resources_cfg = resources_cfg
    
    def sim_time_to_datetime(self, sim_time: float) -> datetime:
        """Convert simulation time (in minutes) to datetime."""
        return self.start_date + timedelta(minutes=sim_time)

    def record_event(self, activity: str, linked_objects: List[Tuple[str, str]], resource: Optional[str] = None):
        """Record an event."""
        self.event_counter += 1
        
        if resource is None:
            resource = get_resource(activity)
        
        event = Event(
            event_id=f"E_{self.event_counter:06d}",
            activity=activity,
            timestamp=self.sim_time_to_datetime(self.env.now),
            sorting_integer=self.event_counter,
            org_resource=resource,
            linked_objects=linked_objects
        )
        self.events.append(event)
        return event

    def decide_gateway(self, gateway_name: str, option_a: str, option_b: str) -> str:
        """Make a gateway decision based on probabilities."""
        gateway_config = self.gateways.get(gateway_name, {})
        prob_a = gateway_config.get(option_a, 0.5)
        return option_a if random.random() < prob_a else option_b

    def check_deviation(self, deviation_name: str) -> bool:
        """Check if a deviation should occur."""
        deviation_config = self.deviations.get(deviation_name, {})
        prob = deviation_config.get('probability', 0.0)
        return random.random() < prob

    def _minutes_until_next_clerk_shift(self) -> float:
        if not self.enforce_working_hours:
            return 0.0

        clerk_cfg = self.resources_cfg.get('Clerk', {}).get('availability', {})
        days_allowed = set(clerk_cfg.get('days', ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']))
        start_hour = float(clerk_cfg.get('start_hour', 8.0))
        end_hour = float(clerk_cfg.get('end_hour', 16.0))

        now_dt = self.sim_time_to_datetime(self.env.now)

        # Already during a shift?
        if now_dt.strftime('%a') in days_allowed:
            hour = now_dt.hour + now_dt.minute / 60 + now_dt.second / 3600
            if start_hour <= hour < end_hour:
                return 0.0

        # Find next shift start
        for add_days in range(0, 14):
            candidate_day = (now_dt + timedelta(days=add_days)).replace(hour=0, minute=0, second=0, microsecond=0)
            if candidate_day.strftime('%a') not in days_allowed:
                continue

            shift_start = candidate_day + timedelta(hours=start_hour)
            if add_days == 0 and shift_start <= now_dt:
                continue

            return (shift_start - now_dt).total_seconds() / 60

        return 24 * 60

    def perform_activity(self, activity: str, linked_objects: List[Tuple[str, str]], resource: Optional[str] = None):
        if resource is None:
            resource = get_resource(activity)

        duration = get_activity_duration(activity)

        if resource == 'Clerk':
            pre_wait = self._minutes_until_next_clerk_shift()
            if pre_wait > 0:
                yield self.env.timeout(pre_wait)

            request_start = self.env.now
            with self.clerk.request() as req:
                yield req
                queue_wait = self.env.now - request_start
                self.waiting_times_minutes['Clerk'].append(queue_wait)

                yield self.env.timeout(duration)
                self.record_event(activity, linked_objects, 'Clerk')

        else:
            self.waiting_times_minutes['System'].append(0.0)
            yield self.env.timeout(duration)
            self.record_event(activity, linked_objects, resource)

    def process_application(self, app_id: int):
        """Simulate the processing of a single application."""
        
        # === Create Application Object ===
        is_parent_independent = self.decide_gateway(
            'parent_data_required', 'not_required', 'required'
        ) == 'not_required'
        
        housing_type = random.choice(['Eltern', 'Alleine'])
        
        application = Application(
            application_id=f"APP_{app_id:06d}",
            student_id=f"STU_{app_id:06d}",
            is_initial_application=True,
            is_parent_independent=is_parent_independent,
            housing_type=housing_type,
            status="Pending"
        )
        self.applications.append(application)
        
        # === Generate Documents ===
        documents, self.doc_counter = generate_documents(application, self.doc_counter)
        self.documents.extend(documents)
        
        # Helper to get object links
        def app_link():
            return [(application.application_id, 'Application')]
        
        def all_doc_links():
            return [(doc.document_id, 'Document') for doc in documents]
        
        # === START: Application started (optional) ===
        if self.include_start_event:
            self.record_event("Application started", app_link())
        
        # === Gateway: Parent Data Required? ===
        if not is_parent_independent:
            yield from self.perform_activity("Request Parent Data", app_link(), "System")

            # Receive Parent Data can be delayed days
            yield self.env.timeout(get_activity_duration("Receive Parent Data"))
            for doc in documents:
                if doc.doc_type == "Formblatt 3":
                    doc.status = "Received"
                    doc.submission_date = self.sim_time_to_datetime(self.env.now)
            self.record_event(
                "Receive Parent Data",
                app_link() + [(d.document_id, 'Document') for d in documents if d.doc_type == "Formblatt 3"],
                "System"
            )
        
        yield from self.perform_activity("Send Application Mail", app_link(), "System")
        
        # === Receive Application ===
        yield self.env.timeout(get_activity_duration("Receive Application"))
        for doc in documents:
            if doc.doc_type in ["Formblatt 1", "Immatrikulationsbescheinigung"]:
                doc.status = "Received"
                doc.submission_date = self.sim_time_to_datetime(self.env.now)
        received_docs = [d for d in documents if d.doc_type in ["Formblatt 1", "Immatrikulationsbescheinigung"]]
        self.record_event("Receive Application", app_link() + [(d.document_id, 'Document') for d in received_docs], "System")
        
        # === Document Review Rounds (controlled) ===
        skip_review_once = self.check_deviation('review_skip')
        rounds = 0
        while rounds < self.max_review_rounds:
            rounds += 1

            if not skip_review_once:
                yield from self.perform_activity("Review Document", app_link() + all_doc_links(), "Clerk")
            else:
                skip_review_once = False

            decision = self.decide_gateway('documents_missing', 'complete', 'missing')
            if decision == 'missing':
                missing_docs = [d for d in documents if d.status == 'Missing']
                yield from self.perform_activity(
                    "Request Missing Documents",
                    app_link() + [(d.document_id, 'Document') for d in missing_docs[:1]],
                    "Clerk"
                )

                # "Receive Missing Documents" can be delayed by several days
                yield self.env.timeout(get_activity_duration("Receive Missing Documents"))
                for doc in missing_docs:
                    doc.status = "Received"
                    doc.submission_date = self.sim_time_to_datetime(self.env.now)
                self.record_event("Receive Missing Documents", app_link() + [(d.document_id, 'Document') for d in missing_docs], "System")

                continue

            break
        
        if self.check_deviation('direct_rejection'):
            yield from self.perform_activity("Send Rejection", app_link(), "Clerk")
            application.status = "Rejected"
            if self.include_end_event:
                self.record_event("Application handled", app_link())
            return
        
        yield from self.perform_activity("Assess Application", app_link(), "Clerk")
        
        eligibility = self.decide_gateway('eligibility_decision', 'approved', 'rejected')
        
        if eligibility == 'approved':
            yield from self.perform_activity("Calculate Claim", app_link(), "Clerk")
            yield from self.perform_activity("Send Notification", app_link(), "Clerk")
            application.status = "Approved"
        else:
            yield from self.perform_activity("Send Rejection", app_link(), "Clerk")
            application.status = "Rejected"

        # === END: Application handled (optional) ===
        if self.include_end_event:
            self.record_event("Application handled", app_link())

    def get_interarrival_time(self) -> float:
        """Get interarrival time based on current simulated datetime."""
        current_dt = self.sim_time_to_datetime(self.env.now)
        current_hour = current_dt.hour
        day_name = current_dt.strftime('%a')
        
        interarrival_config = self.config.get('interarrival', {})
        
        if day_name in ['Sat', 'Sun']:
            cfg = interarrival_config.get('weekend', {})
            mean = cfg.get('mean_minutes', 300)
        elif 8 <= current_hour < 16:
            cfg = interarrival_config.get('weekday_08_16', {})
            mean = cfg.get('mean_minutes', 120)
        elif 16 <= current_hour < 21:
            cfg = interarrival_config.get('weekday_16_21', {})
            mean = cfg.get('mean_minutes', 30)
        elif 21 <= current_hour < 24:
            cfg = interarrival_config.get('weekday_21_24', {})
            mean = cfg.get('mean_minutes', 180)
        else:
            mean = 240
        
        mean = mean * self.interarrival_scale
        return np.random.exponential(mean)
    
    def arrival_generator(self, num_cases: int):
        """Generate arrivals based on time-dependent interarrival times."""
        for i in range(num_cases):
            interarrival = self.get_interarrival_time()
            
            yield self.env.timeout(interarrival)
            self.env.process(self.process_application(i))
            
            if (i + 1) % 500 == 0:
                print(f"  Started {i + 1}/{num_cases} applications...")
    
    def run(self, num_cases: int, until_minutes: Optional[float] = None):
        """Run the simulation."""
        print(f"Starting simulation with {num_cases} cases...")
        
        self.env.process(self.arrival_generator(num_cases))
        
        if until_minutes is None:
            max_time = num_cases * 500 + 100000
            self.env.run(until=max_time)
        else:
            self.env.run(until=until_minutes)
        
        print("Simulation complete!")
        print(f"  - Applications: {len(self.applications)}")
        print(f"  - Documents: {len(self.documents)}")
        print(f"  - Events: {len(self.events)}")

## 7. Run Simulation

<div style="border-left:6px solid #0EA5E9; padding:12px 14px; border-radius:10px;">

### Run simulation details (time span calibration)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Calibration</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Target Days</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Diagnostics</span>

This section runs the simulation in two stages.

---

#### 1) Inter-arrival calibration

Arrivals use time-dependent inter-arrival times (weekday/weekend + hour-of-day). To achieve a log spanning roughly `simulation.target_log_days`, the notebook calibrates a multiplier `interarrival_scale`:

- repeatedly samples arrivals (fast, without full case processing)
- searches for a scale such that the last arrival happens around:

`ARRIVAL_TARGET_DAYS = target_log_days - tail_buffer_days`

This keeps the shape (peaks/weekends) but adjusts the overall speed.

---

#### 2) Full simulation run horizon

After calibration, the full SimPy simulation runs for approximately:

`until_minutes ≈ (target_log_days + 1) * 24 * 60`

`tail_buffer_days` matters because some activities intentionally take multiple days (document waiting), creating a realistic tail.

---

#### Interpreting printed diagnostics

- `Calibrated interarrival_scale=...` → smaller means higher arrival rate
- `Event log span: ... days` → actual timestamp window achieved
- Backlog indicators appear later (throughput quantiles + clerk waiting times)

</div>
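
The calibration idea can be reduced to a few lines. The sketch below is a simplified, standard-library-only version under stated assumptions: it uses a single constant mean instead of the time-dependent windows, and the case count, mean, and tolerance are made-up values. `last_arrival_minutes` and `calibrate_scale` are hypothetical names. The important detail it shares with the notebook is that every evaluation reuses the same seed, so the estimated last arrival grows monotonically with the scale and bisection is well defined.

```python
import random
from typing import Tuple


def last_arrival_minutes(num_cases: int, mean_minutes: float, scale: float, seed: int) -> float:
    """Sum of `num_cases` exponential inter-arrival times with mean `mean_minutes * scale`."""
    rng = random.Random(seed)  # same seed on every call: common random numbers
    t = 0.0
    for _ in range(num_cases):
        t += rng.expovariate(1.0 / (mean_minutes * scale))
    return t


def calibrate_scale(num_cases: int, mean_minutes: float, target_minutes: float, seed: int,
                    tol_minutes: float = 60.0, max_iter: int = 40) -> Tuple[float, float]:
    """Bisect the scale until the last arrival lands within `tol_minutes` of the target."""
    lo, hi = 0.05, 50.0
    mid, last = 1.0, float("inf")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        last = last_arrival_minutes(num_cases, mean_minutes, mid, seed)
        if abs(last - target_minutes) <= tol_minutes:
            break
        if last > target_minutes:
            hi = mid  # arrivals too slow: shrink the mean
        else:
            lo = mid  # arrivals too fast: stretch the mean
    return mid, last


target = 35 * 24 * 60  # e.g. target_log_days=40 minus tail_buffer_days=5, in minutes
scale, last = calibrate_scale(num_cases=1000, mean_minutes=60.0, target_minutes=target, seed=42)
print(f"scale={scale:.4f}, last arrival after {last / (24 * 60):.2f} days")
```

Without the fixed seed, two evaluations at nearly the same scale could disagree about whether the last arrival is before or after the target, and the search interval could be narrowed in the wrong direction.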

In [43]:
def estimate_last_arrival_minutes(config: dict, start_date: datetime, num_cases: int, interarrival_scale: float, seed: int) -> float:
    """Estimate last arrival time (in minutes) by sampling the inter-arrival distribution."""
    rng = np.random.default_rng(seed)
    t = 0.0
    interarrival_config = config.get('interarrival', {})

    for _ in range(num_cases):
        current_dt = start_date + timedelta(minutes=t)
        current_hour = current_dt.hour
        day_name = current_dt.strftime('%a')

        if day_name in ['Sat', 'Sun']:
            cfg = interarrival_config.get('weekend', {})
            mean = cfg.get('mean_minutes', 300)
        elif 8 <= current_hour < 16:
            cfg = interarrival_config.get('weekday_08_16', {})
            mean = cfg.get('mean_minutes', 120)
        elif 16 <= current_hour < 21:
            cfg = interarrival_config.get('weekday_16_21', {})
            mean = cfg.get('mean_minutes', 30)
        elif 21 <= current_hour < 24:
            cfg = interarrival_config.get('weekday_21_24', {})
            mean = cfg.get('mean_minutes', 180)
        else:
            mean = 240

        mean = mean * interarrival_scale
        t += rng.exponential(mean)

    return t


def calibrate_interarrival_scale(
    config: dict,
    start_date: datetime,
    num_cases: int,
    target_last_arrival_days: float,
    seed: int,
    tol_days: float = 0.5,
    max_iter: int = 25,
) -> Tuple[float, float]:
    """Binary-search a scale factor so the last arrival happens around target_last_arrival_days."""
    target_minutes = target_last_arrival_days * 24 * 60
    tol_minutes = tol_days * 24 * 60

    lo, hi = 0.05, 50.0

    best_scale = 1.0
    best_last = float('inf')

    for _ in range(max_iter):
        mid = (lo + hi) / 2
        last = estimate_last_arrival_minutes(config, start_date, num_cases, mid, seed)

        if abs(last - target_minutes) < abs(best_last - target_minutes):
            best_scale = mid
            best_last = last

        if abs(last - target_minutes) <= tol_minutes:
            return mid, last

        if last > target_minutes:
            # Arrivals are too slow -> reduce mean -> smaller scale
            hi = mid
        else:
            # Arrivals are too fast -> increase mean -> larger scale
            lo = mid

    return best_scale, best_last


# Calibrate so that the *arrivals* fit into the desired time span,
# leaving a tail buffer for in-flight cases to finish.
TAIL_BUFFER_DAYS = config.get('simulation', {}).get('tail_buffer_days', 5)
ARRIVAL_TARGET_DAYS = max(1.0, TARGET_LOG_DAYS - TAIL_BUFFER_DAYS)

scale, est_last_arrival = calibrate_interarrival_scale(
    config=config,
    start_date=START_DATE,
    num_cases=NUM_CASES,
    target_last_arrival_days=ARRIVAL_TARGET_DAYS,
    seed=RANDOM_SEED,
)

# Re-seed RNGs to keep the full simulation deterministic after calibration
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print(f"Calibrated interarrival_scale={scale:.4f}")
print(f"Estimated last arrival after ~{est_last_arrival / (24*60):.2f} days (target {ARRIVAL_TARGET_DAYS} days)")

# Create and run simulation
sim = BAfoegSimulation(config, START_DATE, interarrival_scale=scale)

# Run long enough so that the process instances can finish after the last arrival.
# This aims for an overall event span of roughly TARGET_LOG_DAYS.
until_minutes = (TARGET_LOG_DAYS + 1) * 24 * 60
sim.run(NUM_CASES, until_minutes=until_minutes)

# Quick span check
if len(sim.events) > 0:
    min_ts = min(e.timestamp for e in sim.events)
    max_ts = max(e.timestamp for e in sim.events)
    span_days = (max_ts - min_ts).total_seconds() / (24 * 3600)
    print(f"Event log span: {span_days:.2f} days ({min_ts} -> {max_ts})")

Calibrated interarrival_scale=0.0500
Estimated last arrival after ~36.74 days (target 35 days)
Starting simulation with 9800 cases...
  Started 500/9800 applications...
  Started 1000/9800 applications...
  Started 1500/9800 applications...
  Started 2000/9800 applications...
  Started 2500/9800 applications...
  Started 3000/9800 applications...
  Started 3500/9800 applications...
  Started 4000/9800 applications...
  Started 4500/9800 applications...
  Started 5000/9800 applications...
  Started 5500/9800 applications...
  Started 6000/9800 applications...
  Started 6500/9800 applications...
  Started 7000/9800 applications...
  Started 7500/9800 applications...
  Started 8000/9800 applications...
  Started 8500/9800 applications...
  Started 9000/9800 applications...
  Started 9500/9800 applications...
Simulation complete!
  - Applications: 9800
  - Documents: 40154
  - Events: 67159
Event log span: 40.99 days (2024-09-15 00:08:39.131123 -> 2024-10-25 23:59:57.362346)
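
In the output above, the reported scale (0.0500) equals the lower bound of the search bracket in `calibrate_interarrival_scale`, and the estimated last arrival (36.74 days) is more than `tol_days` (0.5) away from the 35-day target. That pattern suggests the search ran out of room instead of converging. A minimal sketch of a check for this case follows; `calibration_status` is a hypothetical helper and not part of the notebook.

```python
def calibration_status(scale, est_last_days, target_days,
                       lo=0.05, hi=50.0, tol_days=0.5, rel_eps=1e-3):
    """Classify a calibration result: converged, pinned to a bracket edge, or neither."""
    if abs(est_last_days - target_days) <= tol_days:
        return "converged"
    if abs(scale - lo) <= rel_eps * lo:
        return "pinned_at_lower_bound"  # arrivals still too slow at the smallest scale
    if abs(scale - hi) <= rel_eps * hi:
        return "pinned_at_upper_bound"  # arrivals still too fast at the largest scale
    return "not_converged"

# Values taken from the printed diagnostics of this run.
status = calibration_status(scale=0.0500, est_last_days=36.74, target_days=35)
print(status)  # pinned_at_lower_bound
```

If this status appears, widening the bracket (a smaller `lo`) or lowering `NUM_CASES` may let the arrivals fit into the target window; which option is appropriate depends on the load the log is meant to represent.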


## 8. Export OCEL

<div style="border-left:6px solid #7C3AED; padding:12px 14px; border-radius:10px;">

### Output files (OCEL CSV export)

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">events.csv</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">objects</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">links</span>

The export step produces the following files in `OCEL_DIR`:

- **events.csv**
  Activity table with `case_id`, `activity`, `timestamp`, `org_resource`, plus sorting columns.
- **applications.csv**
  Object table for `Application` objects.
- **documents.csv**
  Object table for `Document` objects (linked to applications via `application_id`).
- **event_object_link.csv**
  Link table between events and object IDs/types.
- **log_not_sliced.csv**
  Intermediate control file in an OCEL-like flat format.
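
When reading the files back, note that the export code writes the four OCEL tables with `sep=';'` and `log_not_sliced.csv` with the default comma. The sketch below parses sample rows copied from the previews in this notebook, using only the standard library and in-memory text instead of the real files.

```python
import ast
import csv
import io

# Rows in the layout of events.csv (semicolon-separated).
events_text = (
    "case_id;event_id;activity;timestamp;sorting_integer;case_sorting_integer;org_resource\n"
    "APP_000003;E_000004;Send Application Mail;2024-09-15 01:09:52;4;1;System\n"
    "APP_000003;E_000005;Receive Application;2024-09-15 01:12:53;5;2;System\n"
)

# One row in the layout of log_not_sliced.csv (comma-separated, lists stored as text).
not_sliced_text = """ocel:eid,ocel:timestamp,ocel:activity,ocel:type:Application,ocel:type:Document
E_000005,2024-09-15 01:12:53,Receive Application,['APP_000003'],"['DOC_000012', 'DOC_000013']"
"""

events = list(csv.DictReader(io.StringIO(events_text), delimiter=";"))
not_sliced = list(csv.DictReader(io.StringIO(not_sliced_text)))  # comma is the default

# The object columns hold str(list), so they need to be parsed back into lists.
doc_ids = ast.literal_eval(not_sliced[0]["ocel:type:Document"])
print(events[1]["activity"])  # Receive Application
print(doc_ids)                # ['DOC_000012', 'DOC_000013']
```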

<div style="margin-top:10px; padding:10px; border:1px dashed currentColor; border-radius:10px;">
<b>Tip</b>: If you disable `include_start_event` / `include_end_event`, the statistics section computes throughput using the **first and last event per case**.
</div>

</div>
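
The sorting columns in `events.csv` are a global `sorting_integer` and a per-case `case_sorting_integer`. The sketch below shows one way to derive the per-case index, ordering by timestamp and breaking ties with the global index. The rows are copied from the preview in this notebook; the tuple layout is an assumption made for the example.

```python
from collections import defaultdict

# (event_id, case_id, timestamp, sorting_integer)
events = [
    ("E_000004", "APP_000003", "2024-09-15 01:09:52", 4),
    ("E_000005", "APP_000003", "2024-09-15 01:12:53", 5),
    ("E_000006", "APP_000004", "2024-09-15 01:39:46", 6),
]

# Group events by case; tuples sort by timestamp first, then by the global index.
by_case = defaultdict(list)
for event_id, case_id, ts, sorting_integer in events:
    by_case[case_id].append((ts, sorting_integer, event_id))

# Number the events of each case starting at 1.
case_sorting = {}
for case_id, rows in by_case.items():
    for idx, (_, _, event_id) in enumerate(sorted(rows), start=1):
        case_sorting[event_id] = idx

print(case_sorting)  # {'E_000004': 1, 'E_000005': 2, 'E_000006': 1}
```

Timestamps in the `yyyy-MM-dd HH:mm:ss` format sort correctly as plain strings, so no parsing is needed for the ordering here.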

In [44]:
def export_ocel(sim: BAfoegSimulation, output_dir: Path):
    """Export simulation results to OCEL CSV files."""
    
    # Celonis timestamp format: yyyy-MM-dd HH:mm:ss
    CELONIS_TIMESTAMP_FORMAT = '%Y-%m-%d %H:%M:%S'

    def _parse_one_iso(val, allow_na: bool):
        if val is None:
            return pd.NaT
        if isinstance(val, float) and np.isnan(val):
            return pd.NaT
        if isinstance(val, (datetime, pd.Timestamp, np.datetime64)):
            return pd.Timestamp(val)
        if isinstance(val, str):
            s = val.strip()
            if s == "":
                return pd.NaT
            # Handle common ISO variants
            # - 2024-09-20T07:40:00
            # - 2024-09-20T07:40:00.123456
            # - 2024-09-20 07:40:00
            # - 2024-09-20T07:40:00Z
            if s.endswith('Z'):
                s = s[:-1] + '+00:00'
            try:
                return pd.Timestamp(datetime.fromisoformat(s))
            except ValueError:
                # Fall back to pandas/dateutil parser
                return pd.to_datetime(s, errors='coerce')
        return pd.to_datetime(val, errors='coerce')

    def _parse_dt_series(series: pd.Series, column_name: str, allow_na: bool) -> pd.Series:
        parsed = series.map(lambda v: _parse_one_iso(v, allow_na))
        if not allow_na and parsed.isna().any():
            bad = series[parsed.isna()].head(10).tolist()
            raise ValueError(f"Unparseable datetime values in '{column_name}': {bad}")
        return parsed
    
    # === Build event-to-application mapping for case_sorting ===
    # Create a mapping: event_id -> application_id (primary object)
    event_to_app = {}
    for event in sim.events:
        for obj_id, obj_type in event.linked_objects:
            if obj_type == 'Application':
                event_to_app[event.event_id] = obj_id
                break  # Only one application per event
    
    # Group events by application and assign case_sorting_integer
    from collections import defaultdict
    app_events = defaultdict(list)
    for event in sim.events:
        app_id = event_to_app.get(event.event_id)
        if app_id:
            app_events[app_id].append(event)
    
    # Sort events within each case by timestamp and global sorting_integer
    event_case_sorting = {}
    for app_id, events in app_events.items():
        # Sort by timestamp first, then by global sorting_integer for tie-breaking
        sorted_events = sorted(events, key=lambda e: (e.timestamp, e.sorting_integer))
        for idx, event in enumerate(sorted_events, start=1):
            event_case_sorting[event.event_id] = idx
    
    # === events.csv (with case_id for classic Process Mining) ===
    events_data = []
    for e in sim.events:
        d = e.to_dict()
        d['case_id'] = event_to_app.get(e.event_id, '')  # Add case_id (application_id)
        d['case_sorting_integer'] = event_case_sorting.get(e.event_id, 0)
        events_data.append(d)
    
    events_df = pd.DataFrame(events_data)
    events_df['timestamp'] = _parse_dt_series(events_df['timestamp'], 'events.timestamp', allow_na=False).dt.strftime(CELONIS_TIMESTAMP_FORMAT)
    # Reorder columns: case_id first for Celonis
    cols = ['case_id', 'event_id', 'activity', 'timestamp', 'sorting_integer', 'case_sorting_integer', 'org_resource']
    events_df = events_df[cols]
    events_df.to_csv(output_dir / 'events.csv', index=False, sep=';')
    print(f"✅ events.csv: {len(events_df)} events (with case_id & case_sorting_integer)")
    
    # === applications.csv ===
    apps_data = [a.to_dict() for a in sim.applications]
    apps_df = pd.DataFrame(apps_data)
    apps_df.to_csv(output_dir / 'applications.csv', index=False, sep=';')
    print(f"✅ applications.csv: {len(apps_df)} applications")
    
    # === documents.csv ===
    docs_data = [d.to_dict() for d in sim.documents]
    docs_df = pd.DataFrame(docs_data)
    # Format submission_date for Celonis
    docs_df['submission_date'] = _parse_dt_series(docs_df['submission_date'], 'documents.submission_date', allow_na=True).dt.strftime(CELONIS_TIMESTAMP_FORMAT)
    docs_df.to_csv(output_dir / 'documents.csv', index=False, sep=';')
    print(f"✅ documents.csv: {len(docs_df)} documents")
    
    # === event_object_link.csv ===
    links_data = []
    for event in sim.events:
        for obj_id, obj_type in event.linked_objects:
            links_data.append({
                'event_id': event.event_id,
                'object_id': obj_id,
                'object_type': obj_type
            })
    links_df = pd.DataFrame(links_data)
    links_df.to_csv(output_dir / 'event_object_link.csv', index=False, sep=';')
    print(f"✅ event_object_link.csv: {len(links_df)} links")
    
    # === log_not_sliced.csv (intermediate format) ===
    # Similar to example_log_not_sliced.csv
    not_sliced_data = []
    for event in sim.events:
        row = {
            'ocel:eid': event.event_id,
            'ocel:timestamp': event.timestamp.strftime(CELONIS_TIMESTAMP_FORMAT),
            'ocel:activity': event.activity,
            'ocel:type:Application': str([obj_id for obj_id, obj_type in event.linked_objects if obj_type == 'Application']),
            'ocel:type:Document': str([obj_id for obj_id, obj_type in event.linked_objects if obj_type == 'Document'])
        }
        not_sliced_data.append(row)
    
    not_sliced_df = pd.DataFrame(not_sliced_data)
    # Note: written with the default ',' separator, unlike the ';'-separated files above
    not_sliced_df.to_csv(output_dir / 'log_not_sliced.csv', index=False)
    print(f"✅ log_not_sliced.csv: {len(not_sliced_df)} rows")
    
    return {
        'events': events_df,
        'applications': apps_df,
        'documents': docs_df,
        'links': links_df,
        'not_sliced': not_sliced_df
    }


# Export
dfs = export_ocel(sim, OCEL_DIR)
print(f"\n📁 Output written to: {OCEL_DIR}")

✅ events.csv: 67159 events (with case_id & case_sorting_integer)
✅ applications.csv: 9800 applications
✅ documents.csv: 40154 documents
✅ event_object_link.csv: 133700 links
✅ log_not_sliced.csv: 67159 rows

📁 Output written to: C:\Users\abodu\Desktop\Clutter Desktop\اوراق الجامعة\Semesters\WinterSemester 25&26\Buisness Process Management\pm4py\Mining-tests\data\outputs\event_logs\ocel


## 9. Statistics & Validation

<div style="border-left:6px solid #14B8A6; padding:12px 14px; border-radius:10px;">

### How to read the results

<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Throughput</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Backlog</span>
<span style="display:inline-block; padding:2px 10px; border-radius:999px; border:1px solid currentColor; font-size:12px;">Sanity Checks</span>

This section prints a few checks so you can quickly validate that the generated log looks realistic.

---

#### 1) Distributions (high-level sanity)

- **Application status**: Approved/Rejected/Pending counts
- **Parent independence**: validates `parent_data_required` probabilities
- **Document types**: validates document generation
- **Activity distribution**: event counts per activity (spot unexpected spikes)
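
The counts printed by the statistics cell further down can also be cross-checked against each other. The sketch below hard-codes the numbers from this run's output and tests a few relationships that hold in it. These relationships are observations about this run, not rules documented by the simulation.

```python
# Counts copied from the printed output of this run.
applications = 9800
documents_total = 40154
status = {"Approved": 4308, "Pending": 3940, "Rejected": 1552}
parent_independent = {False: 7821, True: 1979}
doc_types = {
    "Formblatt 1": 9800,
    "Immatrikulationsbescheinigung": 9800,
    "Formblatt 3": 7821,
    "Einkommensnachweis Eltern": 7821,
    "Mietbescheinigung": 4912,
}
activities = {"Send Notification": 4308, "Send Rejection": 1552}

checks = {
    "status counts sum to applications": sum(status.values()) == applications,
    "parent flags sum to applications": sum(parent_independent.values()) == applications,
    "doc types sum to documents": sum(doc_types.values()) == documents_total,
    "Formblatt 3 matches parent-dependent cases": doc_types["Formblatt 3"] == parent_independent[False],
    "notifications match approvals": activities["Send Notification"] == status["Approved"],
    "rejection mails match rejections": activities["Send Rejection"] == status["Rejected"],
}

for name, ok in checks.items():
    print(f"{'OK ' if ok else 'FAIL'} {name}")
```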

---

#### 2) Throughput time (case duration)

Throughput is computed per case and shown as quantiles.

- If `include_start_event` and `include_end_event` are enabled:
  - `Application handled` − `Application started`
- Otherwise:
  - last event timestamp − first event timestamp per `case_id`
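
The fallback calculation can be sketched without pandas as follows. The rows are the first two events of two cases from the preview in this notebook, so the resulting durations cover only those two events and are far shorter than a full case.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (case_id, timestamp) pairs copied from the events preview.
rows = [
    ("APP_000003", "2024-09-15 01:09:52"),
    ("APP_000003", "2024-09-15 01:12:53"),
    ("APP_000006", "2024-09-15 02:33:09"),
    ("APP_000006", "2024-09-15 02:34:46"),
]

# Track the earliest and latest timestamp seen for each case.
first_last = {}
for case_id, ts_text in rows:
    ts = datetime.strptime(ts_text, FMT)
    start, end = first_last.get(case_id, (ts, ts))
    first_last[case_id] = (min(start, ts), max(end, ts))

# Throughput is last minus first; divide by 86400 to convert seconds to days.
throughput_seconds = {
    case_id: (end - start).total_seconds()
    for case_id, (start, end) in first_last.items()
}
print(throughput_seconds)  # {'APP_000003': 181.0, 'APP_000006': 97.0}
```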

---

#### 3) Backlog indicator (Clerk waiting times)

- Non-zero waiting times indicate real queueing/backlogs.
- If waiting times are near zero, increase demand or reduce `resources.Clerk.capacity`.
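
The effect of capacity on waiting times can be illustrated with a small deterministic queue. This is not the notebook's SimPy model: the function `clerk_waits`, the fixed 10-minute arrivals and the fixed 25-minute service time are assumptions chosen to keep the example easy to follow.

```python
import heapq

def clerk_waits(arrivals, service_minutes, capacity):
    """FIFO queue with `capacity` clerks; return each case's wait before service starts."""
    free_at = [0.0] * capacity  # min-heap of times at which each clerk becomes free
    heapq.heapify(free_at)
    waits = []
    for t in arrivals:
        earliest = heapq.heappop(free_at)  # next clerk to become available
        start = max(t, earliest)
        waits.append(start - t)
        heapq.heappush(free_at, start + service_minutes)
    return waits

arrivals = [10.0 * i for i in range(12)]  # one case every 10 minutes
waits_3 = clerk_waits(arrivals, 25.0, capacity=3)
waits_2 = clerk_waits(arrivals, 25.0, capacity=2)

print(max(waits_3))  # 0.0  (three clerks keep up with demand)
print(waits_2[-1])   # 25.0 (two clerks fall behind, so waits keep growing)
```

With three clerks each case starts immediately; with two, every new pair of cases waits five minutes longer than the previous pair, which is the kind of growing backlog the waiting-time percentiles are meant to reveal.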

</div>

In [45]:
# Application status distribution
status_counts = dfs['applications']['status'].value_counts()
print("Application Status Distribution:")
print(status_counts)
print(f"\nApproval Rate: {status_counts.get('Approved', 0) / len(dfs['applications']) * 100:.1f}%")

print("\n" + "="*50)

# Parent independence
parent_counts = dfs['applications']['is_parent_independent'].value_counts()
print("\nParent Independence:")
print(parent_counts)
print(f"Parent-Independent Rate: {parent_counts.get(True, 0) / len(dfs['applications']) * 100:.1f}%")

print("\n" + "="*50)

# Document types
print("\nDocument Types:")
print(dfs['documents']['doc_type'].value_counts())

print("\n" + "="*50)

# Activity distribution
print("\nActivity Distribution:")
print(dfs['events'].groupby('activity').size().sort_values(ascending=False))

print("\n" + "="*50)

# Throughput time + backlog indicators
case_events = dfs['events'][['case_id', 'activity', 'timestamp']].copy()
case_events['timestamp'] = pd.to_datetime(case_events['timestamp'])

start_rows = case_events[case_events['activity'] == 'Application started'][['case_id', 'timestamp']].rename(columns={'timestamp': 'start_ts'})
end_rows = case_events[case_events['activity'] == 'Application handled'][['case_id', 'timestamp']].rename(columns={'timestamp': 'end_ts'})

if len(start_rows) == 0 or len(end_rows) == 0:
    # Fallback if start/end events are disabled: use first/last event per case
    first_last = case_events.sort_values(['case_id', 'timestamp']).groupby('case_id').agg(
        start_ts=('timestamp', 'min'),
        end_ts=('timestamp', 'max')
    ).reset_index()
    throughput = first_last
    print("\nNOTE: Start/end events are disabled or missing; throughput computed using first/last event per case.")
else:
    throughput = start_rows.merge(end_rows, on='case_id', how='inner')

throughput['throughput_days'] = (throughput['end_ts'] - throughput['start_ts']).dt.total_seconds() / (24 * 3600)

print("\nThroughput time (days) quantiles:")
print(throughput['throughput_days'].quantile([0.5, 0.75, 0.9, 0.95, 0.99]).round(2))

if hasattr(sim, 'waiting_times_minutes') and 'Clerk' in sim.waiting_times_minutes:
    waits = np.array(sim.waiting_times_minutes['Clerk'], dtype=float)
    if waits.size > 0:
        print("\nClerk queue waiting time (minutes):")
        print(f"  Avg: {waits.mean():.2f}")
        print(f"  P50: {np.percentile(waits, 50):.2f}")
        print(f"  P90: {np.percentile(waits, 90):.2f}")
        print(f"  P95: {np.percentile(waits, 95):.2f}")
        print(f"  P99: {np.percentile(waits, 99):.2f}")

print("\nNOTE: If Clerk waiting times are near zero, increase demand (reduce interarrival means / increase num_cases) or reduce Clerk capacity.")

Application Status Distribution:
status
Approved    4308
Pending     3940
Rejected    1552
Name: count, dtype: int64

Approval Rate: 44.0%


Parent Independence:
is_parent_independent
False    7821
True     1979
Name: count, dtype: int64
Parent-Independent Rate: 20.2%


Document Types:
doc_type
Formblatt 1                      9800
Immatrikulationsbescheinigung    9800
Formblatt 3                      7821
Einkommensnachweis Eltern        7821
Mietbescheinigung                4912
Name: count, dtype: int64


Activity Distribution:
activity
Send Application Mail        9143
Receive Application          9142
Review Document              7825
Request Parent Data          7821
Receive Parent Data          7164
Assess Application           6539
Request Missing Documents    4905
Calculate Claim              4757
Send Notification            4308
Receive Missing Documents    4003
Send Rejection               1552
dtype: int64


NOTE: Start/end events are disabled or missing; throughput comput

## 10. Preview Output Files

In [46]:
print("=== events.csv (first 10 rows) ===")
display(dfs['events'].head(10))

print("\n=== applications.csv (first 5 rows) ===")
display(dfs['applications'].head(5))

print("\n=== documents.csv (first 10 rows) ===")
display(dfs['documents'].head(10))

print("\n=== event_object_link.csv (first 15 rows) ===")
display(dfs['links'].head(15))

print("\n=== log_not_sliced.csv (first 10 rows) ===")
display(dfs['not_sliced'].head(10))

=== events.csv (first 10 rows) ===


Unnamed: 0,case_id,event_id,activity,timestamp,sorting_integer,case_sorting_integer,org_resource
0,APP_000000,E_000001,Request Parent Data,2024-09-15 00:08:39,1,1,System
1,APP_000001,E_000002,Request Parent Data,2024-09-15 00:53:48,2,1,System
2,APP_000002,E_000003,Request Parent Data,2024-09-15 01:07:16,3,1,System
3,APP_000003,E_000004,Send Application Mail,2024-09-15 01:09:52,4,1,System
4,APP_000003,E_000005,Receive Application,2024-09-15 01:12:53,5,2,System
5,APP_000004,E_000006,Request Parent Data,2024-09-15 01:39:46,6,1,System
6,APP_000005,E_000007,Request Parent Data,2024-09-15 01:40:33,7,1,System
7,APP_000006,E_000008,Send Application Mail,2024-09-15 02:33:09,8,1,System
8,APP_000006,E_000009,Receive Application,2024-09-15 02:34:46,9,2,System
9,APP_000007,E_000010,Request Parent Data,2024-09-15 02:35:08,10,1,System



=== applications.csv (first 5 rows) ===


Unnamed: 0,application_id,student_id,is_initial_application,is_parent_independent,housing_type,status
0,APP_000000,STU_000000,True,False,Eltern,Rejected
1,APP_000001,STU_000001,True,False,Eltern,Approved
2,APP_000002,STU_000002,True,False,Eltern,Approved
3,APP_000003,STU_000003,True,True,Eltern,Approved
4,APP_000004,STU_000004,True,False,Eltern,Rejected



=== documents.csv (first 10 rows) ===


Unnamed: 0,document_id,application_id,doc_type,doc_category,status,submission_date
0,DOC_000000,APP_000000,Formblatt 1,Antrag,Received,2024-09-24 05:28:05
1,DOC_000001,APP_000000,Immatrikulationsbescheinigung,Identität,Received,2024-09-24 05:28:05
2,DOC_000002,APP_000000,Formblatt 3,Einkommen,Received,2024-09-24 05:21:26
3,DOC_000003,APP_000000,Einkommensnachweis Eltern,Einkommen,Received,2024-09-26 10:08:57
4,DOC_000004,APP_000001,Formblatt 1,Antrag,Received,2024-09-16 05:29:41
5,DOC_000005,APP_000001,Immatrikulationsbescheinigung,Identität,Received,2024-09-16 05:29:41
6,DOC_000006,APP_000001,Formblatt 3,Einkommen,Received,2024-09-16 05:23:37
7,DOC_000007,APP_000001,Einkommensnachweis Eltern,Einkommen,Missing,
8,DOC_000008,APP_000002,Formblatt 1,Antrag,Received,2024-09-16 01:14:52
9,DOC_000009,APP_000002,Immatrikulationsbescheinigung,Identität,Received,2024-09-16 01:14:52



=== event_object_link.csv (first 15 rows) ===


Unnamed: 0,event_id,object_id,object_type
0,E_000001,APP_000000,Application
1,E_000002,APP_000001,Application
2,E_000003,APP_000002,Application
3,E_000004,APP_000003,Application
4,E_000005,APP_000003,Application
5,E_000005,DOC_000012,Document
6,E_000005,DOC_000013,Document
7,E_000006,APP_000004,Application
8,E_000007,APP_000005,Application
9,E_000008,APP_000006,Application



=== log_not_sliced.csv (first 10 rows) ===


Unnamed: 0,ocel:eid,ocel:timestamp,ocel:activity,ocel:type:Application,ocel:type:Document
0,E_000001,2024-09-15 00:08:39,Request Parent Data,['APP_000000'],[]
1,E_000002,2024-09-15 00:53:48,Request Parent Data,['APP_000001'],[]
2,E_000003,2024-09-15 01:07:16,Request Parent Data,['APP_000002'],[]
3,E_000004,2024-09-15 01:09:52,Send Application Mail,['APP_000003'],[]
4,E_000005,2024-09-15 01:12:53,Receive Application,['APP_000003'],"['DOC_000012', 'DOC_000013']"
5,E_000006,2024-09-15 01:39:46,Request Parent Data,['APP_000004'],[]
6,E_000007,2024-09-15 01:40:33,Request Parent Data,['APP_000005'],[]
7,E_000008,2024-09-15 02:33:09,Send Application Mail,['APP_000006'],[]
8,E_000009,2024-09-15 02:34:46,Receive Application,['APP_000006'],"['DOC_000022', 'DOC_000023']"
9,E_000010,2024-09-15 02:35:08,Request Parent Data,['APP_000007'],[]


## 11. Celonis Import Instructions

To import in Celonis:

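1. **Upload Files**: Upload the four OCEL CSV files (`events.csv`, `applications.csv`, `documents.csv`, `event_object_link.csv`) to a Celonis Data Pool. They are semicolon-separated; `log_not_sliced.csv` is only an intermediate control file.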

2. **Create Data Model**:
   - Activity Table: `events.csv`
     - Case Key: `case_id` (the application ID), or link via `event_object_link` → `applications`
     - Activity: `activity`
     - Timestamp: `timestamp`
     - Sorting: `sorting_integer`
   
3. **Object Tables**:
   - `applications.csv` - Primary object
   - `documents.csv` - Secondary object (linked via `application_id`)

4. **Link Table**: `event_object_link.csv`
   - Links events to both Application and Document objects

5. **Foreign Keys**:
   - `documents.application_id` → `applications.application_id`
   - `event_object_link.event_id` → `events.event_id`
   - `event_object_link.object_id` → `applications.application_id` (when object_type = 'Application')
   - `event_object_link.object_id` → `documents.document_id` (when object_type = 'Document')
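
Before uploading, the conditional foreign keys on `object_id` can be checked locally. The sketch below is one way to do that; `find_dangling_links` is a hypothetical helper, the first three link rows come from the preview in this notebook, and the fourth row is deliberately invalid for the demonstration.

```python
def find_dangling_links(links, application_ids, document_ids):
    """Return link rows whose object_id is missing from the table named by object_type."""
    known = {"Application": set(application_ids), "Document": set(document_ids)}
    return [
        (event_id, object_id, object_type)
        for event_id, object_id, object_type in links
        if object_id not in known.get(object_type, set())
    ]

links = [
    ("E_000005", "APP_000003", "Application"),
    ("E_000005", "DOC_000012", "Document"),
    ("E_000005", "DOC_000013", "Document"),
    ("E_000005", "DOC_999999", "Document"),  # hypothetical broken row for the demo
]

dangling = find_dangling_links(
    links,
    application_ids=["APP_000003"],
    document_ids=["DOC_000012", "DOC_000013"],
)
print(dangling)  # [('E_000005', 'DOC_999999', 'Document')]
```

An empty result means every link resolves to an existing application or document, so the foreign keys listed above should hold after import.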