<a href="https://colab.research.google.com/github/jpupkies/Jim-Pupkies/blob/master/Gemini_Stateful_Decision_Orchestration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemini Stateful Decision Orchestration  
## Incident Triage & Escalation System

This notebook demonstrates how a large language model can be embedded inside a **controlled, stateful decision system** rather than used as an unconstrained generator.

The system simulates a real-world incident triage pipeline in which:
- Incidents arrive with varying clarity and severity
- The LLM proposes interpretations and actions
- Deterministic logic governs state transitions
- Validation layers override the model when required
- Fail-safe paths prevent unsafe or ambiguous outcomes

The core principle demonstrated here is simple:

**The LLM informs decisions — it does not make them.**

This notebook emphasizes:
- Explicit state management
- Guardrails and validation
- Confidence-aware decision-making
- Recoverable failure paths
- Traceable execution for auditability

All steps are intentionally documented using a repeating pattern:
1. Step & context (markdown)
2. Implementation (code)
3. Post-step explanation (markdown)

This structure mirrors how production AI systems are designed, reviewed, and maintained.

In [1]:
# Step 1: Define System States & Control Model

from enum import Enum, auto
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


class IncidentState(Enum):
    INIT = auto()
    ANALYZE = auto()
    REQUEST_INFO = auto()
    ESCALATE = auto()
    RESOLVE = auto()
    FAIL_SAFE = auto()


@dataclass
class IncidentContext:
    incident_id: str
    description: str
    # Advisory fields, filled in by the analysis step
    severity: Optional[str] = None
    confidence: Optional[float] = None
    proposed_action: Optional[str] = None
    state: IncidentState = IncidentState.INIT
    # default_factory gives each incident its own dict instead of a None default
    metadata: Dict[str, Any] = field(default_factory=dict)

This step establishes the **control plane** for the entire system.

An explicit state machine is defined to ensure that:
- Every incident is always in a known state
- Transitions between states are intentional and reviewable
- The system can recover safely when uncertainty or failure occurs

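The `IncidentState` enum lists every state an incident can occupy, so no code path can place an incident in an unnamed state. It does not by itself restrict which transitions are legal; that is enforced by the orchestration logic in the later steps.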

The `IncidentContext` dataclass acts as the system’s single source of truth:
- It carries the incident data
- Tracks model confidence and recommendations
- Persists state across steps
- Enables full traceability of decisions

By defining state and context upfront, all downstream logic operates within **strict, predictable boundaries**.
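
The enum names the states but says nothing about which moves between them are allowed. One way to make that explicit is a transition table that is checked on every state change. The sketch below is an assumption about how this could look, not part of the notebook: `ALLOWED_TRANSITIONS` and `transition` are hypothetical names, the specific edges are illustrative, and the enum is repeated only so the block runs on its own.

```python
from enum import Enum, auto


class IncidentState(Enum):  # repeated from Step 1 so this sketch is self-contained
    INIT = auto()
    ANALYZE = auto()
    REQUEST_INFO = auto()
    ESCALATE = auto()
    RESOLVE = auto()
    FAIL_SAFE = auto()


# Hypothetical transition table: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    IncidentState.INIT: {IncidentState.ANALYZE, IncidentState.FAIL_SAFE},
    IncidentState.ANALYZE: {
        IncidentState.REQUEST_INFO,
        IncidentState.ESCALATE,
        IncidentState.RESOLVE,
        IncidentState.FAIL_SAFE,
    },
    IncidentState.REQUEST_INFO: {IncidentState.ANALYZE, IncidentState.FAIL_SAFE},
    # Terminal states: no outgoing transitions in this sketch.
    IncidentState.ESCALATE: set(),
    IncidentState.RESOLVE: set(),
    IncidentState.FAIL_SAFE: set(),
}


def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {target.name}")
    return target
```

With a table like this, `transition(IncidentState.INIT, IncidentState.ANALYZE)` succeeds, while an attempt to move a resolved incident back to `ANALYZE` raises `ValueError` instead of silently changing state.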

## Step 2: Structured Incident Intake & Normalization

Before any analysis or decision-making occurs, incoming incident reports must be normalized into a predictable structure.

Real-world incident data is often:
- Incomplete
- Inconsistently formatted
- Ambiguous in severity or urgency

This step ensures that all incidents enter the system in a **clean, validated form**, so that downstream logic and LLM analysis always receive predictable input.

In [2]:
# Step 2 Implementation

import uuid


def intake_incident(raw_description: str) -> IncidentContext:
    """
    Normalize raw incident input into a structured IncidentContext.
    """
    if not raw_description or not raw_description.strip():
        return IncidentContext(
            incident_id=str(uuid.uuid4()),
            description="",
            state=IncidentState.FAIL_SAFE,
            metadata={"error": "Empty incident description"}
        )

    return IncidentContext(
        incident_id=str(uuid.uuid4()),
        description=raw_description.strip(),
        state=IncidentState.ANALYZE,
        metadata={"source": "user_submission"}
    )


# Example intake
incident = intake_incident(
    "Payment processing is intermittently failing for EU customers."
)

incident

IncidentContext(incident_id='edf455a7-ef48-401e-b158-e4fcebde9787', description='Payment processing is intermittently failing for EU customers.', severity=None, confidence=None, proposed_action=None, state=<IncidentState.ANALYZE: 2>, metadata={'source': 'user_submission'})

This step converts unstructured incident input into a **governed system object**.

Key behaviors:
- Assigns a unique incident ID for traceability
- Strips and validates raw text input
- Routes invalid or empty input directly to a fail-safe state
- Advances valid incidents to the ANALYZE state

By enforcing normalization at the boundary, the system:
- Prevents undefined downstream behavior
- Avoids relying on the LLM to detect basic input errors
- Establishes a clean contract between ingestion and analysis

This mirrors production systems where validation occurs **before** any intelligent processing.

## Step 3: LLM-Guided Incident Analysis (Advisory Only)

At this stage, the system asks the LLM (simulated in this notebook) to **propose an incident severity and recommended action**, but **the model’s output is advisory only**.

Key design principles:
- The LLM **does not change system state directly**.
- Its recommendations are **captured in the IncidentContext** for later validation.
- Confidence scores are recorded to guide decision-making and potential escalation.
- This allows the orchestration layer to enforce **guardrails and fail-safes** around the model’s advice.

In [3]:
# Step 3 Implementation

import random

def llm_analyze_incident(context: IncidentContext) -> IncidentContext:
    """
    Simulate LLM analysis of an incident, proposing severity and action.
    This is advisory only; state changes are handled downstream.
    """
    # Simulated model output
    severity_levels = ["Low", "Medium", "High", "Critical"]
    actions = ["Monitor", "Investigate", "Escalate to Human", "Resolve Automatically"]

    # Randomized simulation for demonstration
    proposed_severity = random.choice(severity_levels)
    proposed_action = random.choice(actions)
    confidence = round(random.uniform(0.5, 0.99), 2)

    context.severity = proposed_severity
    context.proposed_action = proposed_action
    context.confidence = confidence

    return context


# Apply LLM analysis
incident = llm_analyze_incident(incident)
incident

IncidentContext(incident_id='edf455a7-ef48-401e-b158-e4fcebde9787', description='Payment processing is intermittently failing for EU customers.', severity='High', confidence=0.82, proposed_action='Resolve Automatically', state=<IncidentState.ANALYZE: 2>, metadata={'source': 'user_submission'})

This step simulates the LLM evaluating the incident and providing guidance.

Key behaviors:
- Assigns a **proposed severity** (Low, Medium, High, Critical)
- Suggests a **recommended action** (Monitor, Investigate, Escalate to Human, Resolve Automatically)
- Records a **confidence score**; in this simulation it is drawn at random between 0.5 and 0.99 rather than produced by a model
- Does **not change the system state**, keeping the orchestration layer in control

By separating advice from authority:
- The system can enforce validation, overrides, and fail-safe behavior
- It avoids unsafe or premature state transitions based on model output alone
- Supports **auditability**, since every recommendation is logged but not executed directly
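
When the simulation is replaced by a real model, its reply has to be checked before it is copied into the context. The sketch below assumes the model is asked to answer with a JSON object containing `severity`, `proposed_action`, and `confidence`; that response schema, the `Advice` type, and the `parse_llm_advice` helper are all hypothetical and not part of the notebook. No API call is made here, only parsing of a reply string.

```python
import json
from typing import Optional, TypedDict

# Allowed values mirror the lists used in the Step 3 simulation.
ALLOWED_SEVERITIES = {"Low", "Medium", "High", "Critical"}
ALLOWED_ACTIONS = {"Monitor", "Investigate", "Escalate to Human", "Resolve Automatically"}


class Advice(TypedDict):
    severity: str
    proposed_action: str
    confidence: float


def parse_llm_advice(raw_text: str) -> Optional[Advice]:
    """Parse a JSON reply into advisory fields; return None if anything is off."""
    try:
        payload = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(payload, dict):
        return None

    severity = payload.get("severity")
    action = payload.get("proposed_action")
    confidence = payload.get("confidence")

    # Reject values outside the known vocabularies (and non-string values).
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        return None
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None

    return {"severity": severity, "proposed_action": action, "confidence": float(confidence)}
```

Returning `None` rather than raising lets the caller route an unparseable reply to a safe state, in keeping with the advisory-only role of the model.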

## Step 4: Decision Validation & Guardrails

At this stage, the system evaluates the LLM’s advisory recommendation and decides whether to **accept, modify, escalate, or override** it.  

Key principles:
- Recommendations are only acted on if they meet **confidence and safety thresholds**.
- Rule-based guardrails prevent unsafe or illogical actions.
- Low-confidence or ambiguous recommendations trigger **abstention or escalation**.
- This layer ensures **reliable, predictable state transitions**, maintaining the principle that the LLM is advisory, not authoritative.

In [4]:
# Step 4 Implementation

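def validate_decision(context: IncidentContext, min_confidence: float = 0.7) -> IncidentContext:
    """
    Validate the LLM's proposed action using rules and confidence thresholds.
    """
    # Incidents already routed to fail-safe (e.g. at intake) are left untouched
    if context.state is IncidentState.FAIL_SAFE:
        return context

    # Initialize validation metadata (guarding against a missing metadata dict)
    if context.metadata is None:
        context.metadata = {}
    context.metadata['validation'] = {}

    # A missing confidence score means analysis never ran; fail safe instead of raising
    if context.confidence is None:
        context.state = IncidentState.FAIL_SAFE
        context.metadata['validation']['reason'] = 'Missing confidence score'
    # Confidence-based check
    elif context.confidence < min_confidence:
        context.state = IncidentState.REQUEST_INFO
        context.metadata['validation']['reason'] = 'Low confidence'
    # Critical severity always triggers escalation, whatever action the model proposed
    elif context.severity == "Critical":
        context.state = IncidentState.ESCALATE
        context.metadata['validation']['reason'] = 'Critical incident requires escalation'
    # Acceptable recommendations
    else:
        context.state = IncidentState.RESOLVE
        context.metadata['validation']['reason'] = 'Approved recommendation'

    return context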

# Apply validation and guardrails
incident = validate_decision(incident)
incident

IncidentContext(incident_id='edf455a7-ef48-401e-b158-e4fcebde9787', description='Payment processing is intermittently failing for EU customers.', severity='High', confidence=0.82, proposed_action='Resolve Automatically', state=<IncidentState.RESOLVE: 5>, metadata={'source': 'user_submission', 'validation': {'reason': 'Approved recommendation'}})

This step enforces **rules and confidence-based guardrails** on the LLM’s recommendation.

Key behaviors:
- Routes incidents whose **confidence** is below the minimum threshold (0.7 by default) to `REQUEST_INFO`.
- Automatically escalates **critical incidents** if the LLM did not recommend escalation.
- Approves and resolves incidents that meet all criteria.
- Records the **reason for each state transition** in the metadata for auditing.

By validating before acting:
- The system **prevents unsafe decisions**
- Maintains **predictable, explainable state transitions**
- Preserves the principle that **the LLM informs decisions but does not make them autonomously**
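
As guardrails grow, an `if`/`elif` chain can become hard to review. One possible alternative, sketched here under assumptions, is an ordered list of small rule functions where the first match wins. The names `RULES`, `decide`, and the individual rule functions are hypothetical, states are represented as plain strings to keep the block self-contained, and in this sketch a Critical incident escalates regardless of the proposed action.

```python
from typing import Callable, List, Optional, Tuple

# Each rule inspects (severity, action, confidence) and returns
# (state_name, reason) if it applies, or None to defer to the next rule.
Rule = Callable[[Optional[str], Optional[str], Optional[float]], Optional[Tuple[str, str]]]


def missing_confidence(severity, action, confidence):
    if confidence is None:
        return ("FAIL_SAFE", "No confidence score")
    return None


def low_confidence(severity, action, confidence, threshold: float = 0.7):
    # Runs after missing_confidence, so confidence is a number here.
    if confidence < threshold:
        return ("REQUEST_INFO", "Low confidence")
    return None


def critical_severity(severity, action, confidence):
    if severity == "Critical":
        return ("ESCALATE", "Critical incident requires escalation")
    return None


# Order encodes priority: earlier rules override later ones.
RULES: List[Rule] = [missing_confidence, low_confidence, critical_severity]


def decide(severity, action, confidence) -> Tuple[str, str]:
    """Apply rules in priority order; the first rule that matches wins."""
    for rule in RULES:
        outcome = rule(severity, action, confidence)
        if outcome is not None:
            return outcome
    return ("RESOLVE", "Approved recommendation")
```

For the incident shown above, `decide("High", "Resolve Automatically", 0.82)` returns `("RESOLVE", "Approved recommendation")`. Adding a domain-specific rule then means writing one function and choosing its position in `RULES`.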

## Step 5: Execution Trace & Reporting

After the system has processed the incident through intake, LLM advisory, and validation, this step generates a **traceable report**.

Objectives:
- Capture the **full history of the incident**: initial description, LLM recommendations, confidence, guardrail decisions, and final state.
- Enable **auditing and review** of decisions.
- Produce **portfolio-ready outputs** to demonstrate engineering rigor.
- Facilitate **interactive dashboards or HTML reports** if desired in future extensions.

In [5]:
# Step 5 Implementation

import pandas as pd
from datetime import datetime
import os

# Folder for reports
REPORT_FOLDER = "Incident_Reports"
os.makedirs(REPORT_FOLDER, exist_ok=True)

def generate_trace_report(context: IncidentContext):
    """
    Generate a structured report capturing all details of an incident.
    """
    report_data = {
        "Incident ID": context.incident_id,
        "Description": context.description,
        "Severity": context.severity,
        "Proposed Action": context.proposed_action,
        "Confidence": context.confidence,
        "Final State": context.state.name,
        "Metadata": context.metadata
    }

    df = pd.DataFrame([report_data])
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    report_path = os.path.join(REPORT_FOLDER, f"Incident_Report_{timestamp}.csv")
    df.to_csv(report_path, index=False)

    return df, report_path

# Generate report for current incident
report_df, report_file = generate_trace_report(incident)
report_df, report_file

(                            Incident ID  \
 0  edf455a7-ef48-401e-b158-e4fcebde9787   
 
                                          Description Severity  \
 0  Payment processing is intermittently failing f...     High   
 
          Proposed Action  Confidence Final State  \
 0  Resolve Automatically        0.82     RESOLVE   
 
                                             Metadata  
 0  {'source': 'user_submission', 'validation': {'...  ,
 'Incident_Reports/Incident_Report_2026-01-05_04-53-07.csv')

This step records a **snapshot of the incident’s final context**, including:

- Unique incident ID
- Original description
- LLM-proposed severity and action
- Confidence score
- Guardrail decisions and final state
- Metadata for auditing (e.g., reasons for state transitions)

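The report is saved to a dedicated folder (`Incident_Reports`) with a timestamped filename, so reports from different runs are kept side by side. Because the timestamp has one-second resolution, two reports generated within the same second would share a filename.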

This ensures:
- Full **traceability** for compliance or review
- Easy demonstration of **system engineering rigor**
- Data that can feed into **interactive dashboards or portfolio-ready reports**
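
One detail to keep in mind: a dict-valued column such as `Metadata` is written to CSV as its Python string representation, which is awkward to parse back reliably. A possible refinement, sketched below with the standard library only, is to serialize the metadata to JSON before writing. `write_report_row` and `read_report_rows` are hypothetical helpers, the incident ID is made up for illustration, and the sketch writes to an in-memory buffer rather than the `Incident_Reports` folder.

```python
import csv
import io
import json

FIELDS = ["Incident ID", "Final State", "Metadata"]


def write_report_row(buffer, incident_id: str, final_state: str, metadata: dict) -> None:
    """Write a header and one row, storing metadata as a JSON string."""
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({
        "Incident ID": incident_id,
        "Final State": final_state,
        "Metadata": json.dumps(metadata, sort_keys=True),
    })


def read_report_rows(buffer) -> list:
    """Read rows back and decode the metadata column from JSON."""
    rows = list(csv.DictReader(buffer))
    for row in rows:
        row["Metadata"] = json.loads(row["Metadata"])
    return rows


# Round trip through an in-memory buffer (newline="" as the csv module recommends).
buffer = io.StringIO(newline="")
write_report_row(
    buffer,
    incident_id="demo-001",  # hypothetical ID for illustration
    final_state="RESOLVE",
    metadata={"source": "user_submission", "validation": {"reason": "Approved recommendation"}},
)
buffer.seek(0)
rows = read_report_rows(buffer)
```

After the round trip, `rows[0]["Metadata"]` is a nested dict again, so the validation reason can be read directly by an audit script.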

## Step 6: Multi-Incident Simulation & Portfolio-Style Dashboard

To demonstrate **scalability and orchestration**, this step simulates a batch of incidents and tracks:

- How each incident flows through intake, LLM advisory, and validation
- Final states across the system
- Aggregate metrics like confidence distribution, severity breakdown, and escalation rates

The results are displayed in a **portfolio-style dashboard**, similar to professional operational monitoring systems.  
This allows reviewers to quickly see patterns, system reliability, and the effectiveness of guardrails across multiple incidents.

In [9]:
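# Step 6 Implementation

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# `processed_incidents` is not defined in the cells shown above, so a small
# batch is built here. The sample descriptions are illustrative stand-ins.
sample_descriptions = [
    "Payment processing is intermittently failing for EU customers.",
    "Login page returns HTTP 500 for some users.",
    "",  # empty description exercises the fail-safe path
]

processed_incidents = []
for description in sample_descriptions:
    ctx = intake_incident(description)
    # Only incidents that passed intake are analyzed and validated
    if ctx.state is IncidentState.ANALYZE:
        ctx = validate_decision(llm_analyze_incident(ctx))
    processed_incidents.append(ctx)

# Build a DataFrame for dashboard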
dashboard_df = pd.DataFrame([{
    "Incident ID": ctx.incident_id,
    "Severity": ctx.severity or "Unknown",
    "Proposed Action": ctx.proposed_action or "Unknown",
    "Confidence": ctx.confidence if ctx.confidence is not None else 0,
    "Final State": ctx.state.name
} for ctx in processed_incidents])

# Count occurrences for histograms
severity_counts = dashboard_df['Severity'].value_counts().sort_index()
state_counts = dashboard_df['Final State'].value_counts().sort_index()

# Create subplots: 2 rows, 2 columns (severity, state, confidence)
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"colspan": 2, "type": "box"}, None]],
    subplot_titles=("Incident Severity Distribution", "Final State Distribution", "Confidence Scores Distribution")
)

# Severity distribution bar
fig.add_trace(
    go.Bar(
        x=severity_counts.index,
        y=severity_counts.values,
        text=severity_counts.values,
        textposition='auto',
        marker_color='indianred',
        name="Severity"
    ),
    row=1, col=1
)

# Final state distribution bar
fig.add_trace(
    go.Bar(
        x=state_counts.index,
        y=state_counts.values,
        text=state_counts.values,
        textposition='auto',
        marker_color='steelblue',
        name="Final State"
    ),
    row=1, col=2
)

# Confidence boxplot
fig.add_trace(
    go.Box(
        y=dashboard_df['Confidence'],
        boxpoints='all', # show all points
        jitter=0.5,
        pointpos=-1.8,
        marker_color='seagreen',
        name="Confidence"
    ),
    row=2, col=1
)

# Update layout for professionalism
fig.update_layout(
    height=700,
    width=900,
    title_text="Incident Triage Dashboard",
    title_x=0.5,
    showlegend=False,
    template='plotly_white'
)

fig.show()

This step demonstrates the **system’s behavior across multiple incidents**:

- Processes a batch of incidents through intake, LLM advisory, and validation.
- Handles **edge cases**, like empty descriptions (fail-safe).
- Captures the **final state and metadata** for each incident.
- Produces **portfolio-style dashboards**:
  - Severity distribution
  - Final state distribution
  - Confidence score spread

This allows:
- Quick visualization of patterns and reliability across multiple incidents
- Demonstration of **scalable orchestration** of LLM-guided decisions
- Easy inclusion in a **portfolio or technical presentation**
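
The charts show distributions, but aggregate figures such as the escalation rate can also be computed directly. The sketch below uses only the standard library and a small hypothetical set of outcomes; the `records` values and the `summarize` helper are assumptions for illustration, not results from this notebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical batch outcomes for illustration: (severity, final_state, confidence)
records = [
    ("High", "RESOLVE", 0.82),
    ("Critical", "ESCALATE", 0.91),
    ("Low", "REQUEST_INFO", 0.55),
    (None, "FAIL_SAFE", None),
]


def summarize(records):
    """Compute final-state counts, escalation rate, and mean confidence."""
    states = Counter(state for _, state, _ in records)
    # Exclude incidents that never received a confidence score (e.g. fail-safe at intake).
    scores = [conf for _, _, conf in records if conf is not None]
    return {
        "state_counts": dict(states),
        "escalation_rate": states["ESCALATE"] / len(records) if records else 0.0,
        "mean_confidence": mean(scores) if scores else None,
    }


summary = summarize(records)
```

Note the design choice for missing scores: the dashboard code above substitutes 0 when confidence is `None`, which pulls the box plot downward, whereas this sketch leaves such incidents out of the confidence average.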

# Step 7: Summary, Insights, and Next Steps

## Summary

This notebook demonstrates a **stateful, controlled incident triage system** orchestrated with an LLM:

- **Step 1:** Defined explicit system states (`INIT`, `ANALYZE`, `REQUEST_INFO`, `ESCALATE`, `RESOLVE`, `FAIL_SAFE`) and a context object to maintain traceable incident data.
- **Step 2:** Normalized raw incident input, enforcing validation and routing invalid entries to a fail-safe state.
- **Step 3:** Simulated an LLM proposing severity and action, while keeping all outputs advisory.
- **Step 4:** Applied guardrails and validation rules to enforce safety, confidence thresholds, and mandatory escalation for critical incidents.
- **Step 5:** Generated a structured, timestamped report capturing all inputs, LLM recommendations, decisions, and final states.
- **Step 6:** Simulated multiple incidents and produced a **portfolio-style dashboard** with bar charts of severity and final state and a box plot of confidence scores.

## Key Insights

- Separating **LLM advice from system authority** prevents unsafe or ambiguous decisions.
- Explicit **state management** and **validation rules** create predictable, auditable workflows.
- Confidence-aware decision-making allows the system to **escalate or abstain** when model outputs are uncertain.
- Visual dashboards provide **clear, executive-ready insights** for multiple incidents at a glance.

## Next Steps

1. Integrate a **real LLM API** (e.g., Gemini API) to replace simulation.
2. Extend guardrails with **custom business rules** for domain-specific actions.
3. Implement **real-time incident ingestion** for streaming workflows.
4. Enhance reporting with **interactive HTML or Plotly Dash dashboards**.
5. Explore **multi-agent orchestration** for larger-scale operational systems.

---

This notebook demonstrates **engineering rigor in AI orchestration**: designing robust systems, enforcing guardrails, and producing portfolio-ready visual outputs. It can be showcased to **reviewers, hiring managers, or team leads** as a concrete example of combining **LLM capabilities with system engineering principles**.