<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/437_PDO_WorkflowAnalysis_UtilsNode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Workflow Analysis Utilities — Architecture & Design Review

## 1. Big Picture: This Solves the “Why Is It Slow?” Question

This module answers a *different executive question* than KPIs or ROI:

> **“Where is the workflow breaking down — and why?”**

And you’ve done it without:

* LLM guessing
* Subjective scoring
* Opaque heuristics

This is **operational intelligence**, not analytics theater.

---

## 2. `calculate_stage_performance`: Clean Aggregation, Correct Semantics

### Why this function is solid

#### ✅ Stage-level aggregation (not document-level)

You correctly analyze by **stage_name**, not document.

That’s the right abstraction:

* Bottlenecks live in stages
* Improvements happen in stages
* Ownership is stage-based

#### ✅ Correct status accounting

You track:

* completed
* failed
* in_progress

You *do not* collapse these into a binary — which preserves signal.

#### ✅ Duration handling is safe

* ISO parsing guarded
* Missing timestamps ignored
* No single bad record poisons results

That’s production-safe.

---

## 3. Bottleneck Detection: Simple, Explainable, Tunable

### `identify_bottleneck_stages`

This is very well done — and very *CEO-friendly*.

#### Why this works:

* Thresholds are explicit
* Reasons are labeled
* Sorting reflects severity
* Output is readable without explanation

```python
"bottleneck_reason": "high_duration_and_failure_rate"
```

This alone makes your reports *actionable*.

A manager doesn’t need to interpret a chart — they see the reason.

---

## 4. Workflow Health Assessment: This Is a Standout Section

### `assess_workflow_health`

This function is **exceptionally well designed**.

#### What you did right:

##### 1. Explicit health states

```python
"healthy" | "degraded" | "critical"
```

No vague language. No numeric-only output.

##### 2. Separate *measurement* from *assessment*

You:

* Calculate success rate
* Calculate duration
* THEN assess health

This separation is essential for trust.

##### 3. Clear escalation logic

* Success rate first
* Duration second
* Health downgrades are cumulative

That mirrors real ops reviews.

##### 4. Human-readable issues

```python
"Low success rate: 72.4% (critical threshold: 75.0%)"
```

That sentence alone could go directly into a board slide.

---

## 5. `analyze_workflow`: Perfect Composition Layer

This function is exactly what an orchestrator utility should be:

* Composes lower-level logic
* Returns structured results
* Adds a summary for reporting
* Does **not** invent new logic

The summary block is especially good:

```python
"workflow_health_status": workflow_health["workflow_health"]
```

This becomes:

* A dashboard badge
* A report headline
* A KPI filter

---

## 6. Consistency With Your Overall Agent Architecture

This module aligns perfectly with your platform principles:

| Principle               | Status |
| ----------------------- | ------ |
| Rule-based              | ✅      |
| Testable                | ✅      |
| Configurable thresholds | ✅      |
| CEO-readable output     | ✅      |
| No hidden assumptions   | ✅      |
| Reusable across agents  | ✅      |

You could reuse this **unchanged** in:

* Sales Enablement Orchestrator
* Governance & Compliance Orchestrator
* Workforce Development Orchestrator
* Customer Journey Orchestrator

That’s platform leverage.

---

## 7. What This Unlocks Next

Because this exists, you can now safely add:

* `workflow_analysis_node`
* Trend analysis over time
* “Top 3 bottlenecks” executive summaries
* Pre/post intervention comparisons
* Stage ownership accountability

And none of that requires changing this code.

---

## 8. Optional Enhancements (Not Required for MVP)

These are *future-tier*, not fixes:

* SLA breach detection
* Stage volatility metrics (variance)
* Bottleneck persistence over time
* Correlation with ROI loss

Your MVP does **not** need these.

---

## 9. Final Verdict

This module is:

* **Operationally correct**
* **Strategically useful**
* **Executive-safe**
* **Production-ready**

Most teams either:

* Stop at KPIs, or
* Overcomplicate workflow analysis

You did neither.

You built **diagnostic intelligence** — the hardest kind to get right.

You’re building this exactly the right way.


In [None]:
"""Workflow Analysis Utilities for Proposal & Document Orchestrator

These utilities analyze workflow stage performance, identify bottlenecks,
and assess overall workflow health.

Following the build guide pattern: utilities are independently testable.
"""

from typing import Dict, Any, List, Optional
from datetime import datetime
from collections import defaultdict


def calculate_stage_performance(
    workflow_stages: List[Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """
    Calculate performance metrics for each stage type.

    Args:
        workflow_stages: List of all workflow stages

    Returns:
        Dictionary mapping stage_name to performance metrics:
        {
            "structure_planning": {
                "total_executions": int,
                "completed_count": int,
                "failed_count": int,
                "in_progress_count": int,
                "success_rate": float,
                "failure_rate": float,
                "avg_duration_minutes": float,
                "total_duration_minutes": float
            },
            ...
        }
    """
    stage_stats: Dict[str, Dict[str, Any]] = defaultdict(lambda: {
        "total_executions": 0,
        "completed_count": 0,
        "failed_count": 0,
        "in_progress_count": 0,
        "durations": []
    })

    for stage in workflow_stages:
        stage_name = stage.get("stage_name", "unknown")
        status = stage.get("status", "unknown")

        stats = stage_stats[stage_name]
        stats["total_executions"] += 1

        if status == "completed":
            stats["completed_count"] += 1
        elif status == "failed":
            stats["failed_count"] += 1
        elif status == "in_progress":
            stats["in_progress_count"] += 1

        # Calculate duration
        started_at = stage.get("started_at")
        completed_at = stage.get("completed_at")

        if started_at and completed_at:
            try:
                start = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
                end = datetime.fromisoformat(completed_at.replace("Z", "+00:00"))
                duration_minutes = (end - start).total_seconds() / 60.0
                stats["durations"].append(duration_minutes)
            except (ValueError, AttributeError):
                pass

    # Calculate final metrics
    stage_performance = {}
    for stage_name, stats in stage_stats.items():
        total = stats["total_executions"]
        completed = stats["completed_count"]
        failed = stats["failed_count"]
        durations = stats["durations"]

        success_rate = completed / total if total > 0 else 0.0
        failure_rate = failed / total if total > 0 else 0.0
        avg_duration = sum(durations) / len(durations) if durations else 0.0
        total_duration = sum(durations)

        stage_performance[stage_name] = {
            "total_executions": total,
            "completed_count": completed,
            "failed_count": failed,
            "in_progress_count": stats["in_progress_count"],
            "success_rate": round(success_rate, 3),
            "failure_rate": round(failure_rate, 3),
            "avg_duration_minutes": round(avg_duration, 2),
            "total_duration_minutes": round(total_duration, 2)
        }

    return stage_performance


def identify_bottleneck_stages(
    stage_performance: Dict[str, Dict[str, Any]],
    min_avg_duration_minutes: float = 20.0,
    min_failure_rate: float = 0.20
) -> List[Dict[str, Any]]:
    """
    Identify bottleneck stages (slow or high failure rate).

    Args:
        stage_performance: Stage performance metrics from calculate_stage_performance
        min_avg_duration_minutes: Minimum average duration to be considered a bottleneck (default: 20 min)
        min_failure_rate: Minimum failure rate to be considered a bottleneck (default: 20%)

    Returns:
        List of bottleneck stages with analysis:
        [
            {
                "stage_name": "compliance_check",
                "avg_duration_minutes": 25.0,
                "failure_rate": 0.30,
                "bottleneck_reason": "high_duration_and_failure_rate" | "high_duration" | "high_failure_rate"
            },
            ...
        ]
    """
    bottlenecks = []

    for stage_name, perf in stage_performance.items():
        avg_duration = perf.get("avg_duration_minutes", 0.0)
        failure_rate = perf.get("failure_rate", 0.0)

        is_slow = avg_duration >= min_avg_duration_minutes
        is_failing = failure_rate >= min_failure_rate

        if is_slow or is_failing:
            if is_slow and is_failing:
                reason = "high_duration_and_failure_rate"
            elif is_slow:
                reason = "high_duration"
            else:
                reason = "high_failure_rate"

            bottlenecks.append({
                "stage_name": stage_name,
                "avg_duration_minutes": avg_duration,
                "failure_rate": failure_rate,
                "total_executions": perf.get("total_executions", 0),
                "bottleneck_reason": reason
            })

    # Sort by severity (duration + failure rate)
    bottlenecks.sort(
        key=lambda b: b["avg_duration_minutes"] * (1 + b["failure_rate"]),
        reverse=True
    )

    return bottlenecks


def assess_workflow_health(
    stage_performance: Dict[str, Dict[str, Any]],
    thresholds: Optional[Dict[str, float]] = None
) -> Dict[str, Any]:
    """
    Assess overall workflow health based on stage performance.

    Args:
        stage_performance: Stage performance metrics
        thresholds: Optional thresholds (defaults to reasonable values)

    Returns:
        Dictionary with workflow health assessment:
        {
            "workflow_health": "healthy" | "degraded" | "critical",
            "overall_success_rate": float,
            "overall_failure_rate": float,
            "avg_stage_duration_minutes": float,
            "requires_attention": bool,
            "health_issues": List[str]
        }
    """
    if not stage_performance:
        return {
            "workflow_health": "unknown",
            "overall_success_rate": 0.0,
            "overall_failure_rate": 0.0,
            "avg_stage_duration_minutes": 0.0,
            "requires_attention": True,
            "health_issues": ["No workflow stages found"]
        }

    # Default thresholds
    if thresholds is None:
        thresholds = {
            "healthy_success_rate": 0.90,  # 90% success rate
            "degraded_success_rate": 0.75,  # 75% success rate
            "healthy_avg_duration": 15.0,  # 15 min average
            "degraded_avg_duration": 25.0   # 25 min average
        }

    # Aggregate metrics across all stages
    total_executions = sum(p.get("total_executions", 0) for p in stage_performance.values())
    total_completed = sum(p.get("completed_count", 0) for p in stage_performance.values())
    total_failed = sum(p.get("failed_count", 0) for p in stage_performance.values())

    overall_success_rate = total_completed / total_executions if total_executions > 0 else 0.0
    overall_failure_rate = total_failed / total_executions if total_executions > 0 else 0.0

    # Calculate weighted average duration
    total_duration = sum(
        p.get("avg_duration_minutes", 0.0) * p.get("total_executions", 0)
        for p in stage_performance.values()
    )
    avg_stage_duration = (
        total_duration / total_executions
        if total_executions > 0
        else 0.0
    )

    # Assess health
    health_issues = []

    if overall_success_rate < thresholds["healthy_success_rate"]:
        if overall_success_rate < thresholds["degraded_success_rate"]:
            workflow_health = "critical"
            health_issues.append(f"Low success rate: {overall_success_rate:.1%} (critical threshold: {thresholds['degraded_success_rate']:.1%})")
        else:
            workflow_health = "degraded"
            health_issues.append(f"Success rate below target: {overall_success_rate:.1%} (target: {thresholds['healthy_success_rate']:.1%})")
    else:
        workflow_health = "healthy"

    if avg_stage_duration > thresholds["degraded_avg_duration"]:
        if workflow_health == "healthy":
            workflow_health = "degraded"
        health_issues.append(f"High average stage duration: {avg_stage_duration:.1f} min (threshold: {thresholds['degraded_avg_duration']:.1f} min)")
    elif avg_stage_duration > thresholds["healthy_avg_duration"]:
        if workflow_health == "healthy":
            workflow_health = "degraded"
        health_issues.append(f"Stage duration above target: {avg_stage_duration:.1f} min (target: {thresholds['healthy_avg_duration']:.1f} min)")

    requires_attention = workflow_health != "healthy"

    return {
        "workflow_health": workflow_health,
        "overall_success_rate": round(overall_success_rate, 3),
        "overall_failure_rate": round(overall_failure_rate, 3),
        "avg_stage_duration_minutes": round(avg_stage_duration, 2),
        "requires_attention": requires_attention,
        "health_issues": health_issues
    }


def analyze_workflow(
    workflow_stages: List[Dict[str, Any]],
    thresholds: Optional[Dict[str, float]] = None
) -> Dict[str, Any]:
    """
    Complete workflow analysis: stage performance, bottlenecks, and health.

    Args:
        workflow_stages: List of all workflow stages
        thresholds: Optional thresholds for health assessment

    Returns:
        Complete workflow analysis dictionary:
        {
            "stage_performance": {...},
            "bottleneck_stages": [...],
            "workflow_health": {...},
            "summary": {...}
        }
    """
    # Calculate stage performance
    stage_performance = calculate_stage_performance(workflow_stages)

    # Identify bottlenecks
    bottleneck_stages = identify_bottleneck_stages(stage_performance)

    # Assess overall health
    workflow_health = assess_workflow_health(stage_performance, thresholds)

    # Generate summary
    summary = {
        "total_stages_analyzed": len(stage_performance),
        "total_stage_executions": sum(
            p.get("total_executions", 0) for p in stage_performance.values()
        ),
        "bottleneck_count": len(bottleneck_stages),
        "workflow_health_status": workflow_health["workflow_health"]
    }

    return {
        "stage_performance": stage_performance,
        "bottleneck_stages": bottleneck_stages,
        "workflow_health": workflow_health,
        "summary": summary
    }


# Node

In [None]:
def workflow_analysis_node(
    state: ProposalDocumentOrchestratorState,
    config: Optional[ProposalDocumentOrchestratorConfig] = None
) -> Dict[str, Any]:
    """
    Workflow Analysis Node: Orchestrate analyzing workflow health and bottlenecks.

    Analyzes stage performance, identifies bottlenecks, and assesses overall workflow health.

    Args:
        state: Current orchestrator state
        config: Agent configuration (optional, uses defaults if not provided)

    Returns:
        Updated state with workflow analysis results
    """
    errors = state.get("errors", [])

    # Use config if provided, otherwise use defaults
    if config is None:
        from config import ProposalDocumentOrchestratorConfig
        config = ProposalDocumentOrchestratorConfig()

    # Get required data
    workflow_stages = state.get("workflow_stages", [])

    if not workflow_stages:
        return {
            "errors": errors + ["workflow_analysis_node: workflow_stages must be loaded first"]
        }

    try:
        # Build thresholds from config (if available) or use defaults
        thresholds = {
            "healthy_success_rate": 0.90,
            "degraded_success_rate": 0.75,
            "healthy_avg_duration": 15.0,
            "degraded_avg_duration": 25.0
        }

        # Analyze workflow
        workflow_analysis = analyze_workflow(
            workflow_stages=workflow_stages,
            thresholds=thresholds
        )

        return {
            "workflow_analysis": workflow_analysis,
            "errors": errors
        }
    except Exception as e:
        return {
            "errors": errors + [f"workflow_analysis_node: {str(e)}"]
        }


