
This utility shows how to turn messy operational signals into executive-grade, programmable decisions. It is not ad hoc monitoring logic; it is a **deterministic health model**.

---

## Integration Health Analysis – Turning Infrastructure Signals into Predictable Business Decisions

This utility is responsible for converting raw infrastructure signals—uptime, latency, and authentication status—into **clear, explainable health assessments** that leadership can trust.

Rather than relying on opaque heuristics or probabilistic judgments, the system applies **explicit rules and weights** to ensure behavior is consistent, predictable, and auditable.

This is a deliberate design choice.

---

## Why This Matters: Reliability Over Guesswork

Most AI systems treat infrastructure health as a background concern until something breaks.

This orchestrator does the opposite.

It continuously evaluates whether underlying systems are **safe to depend on**, and it does so using logic that is:

* Deterministic
* Configurable
* Inspectable
* Repeatable

If the same inputs are provided twice, the same conclusions will be reached—every time.

That reliability is essential for executive confidence.

---

## Metric Scoring: Continuous, Not Binary

### Uptime Scoring

Uptime is scored on a **continuous 0–100 scale**, not a simple pass/fail.

* Systems at or above the healthy threshold receive a score of 100
* Uptime between the degraded and healthy thresholds is interpolated linearly from 75 to 100
* Uptime below the degraded threshold scales linearly from 75 down toward 0

This avoids sudden, unexplained jumps in system status and makes trends easier to understand.

Executives can see **how far** a system is from healthy—not just whether it crossed a line.
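
The piecewise-linear shape is easier to see with numbers. The sketch below mirrors the logic of `score_uptime` from the utilities cell; the function name `uptime_score` and the threshold values (99.0 and 95.0, in percent) are assumptions chosen only for illustration.

```python
def uptime_score(uptime: float, healthy: float, degraded: float) -> float:
    """Piecewise-linear uptime score on a 0-100 scale (higher is better)."""
    if uptime >= healthy:
        return 100.0
    if uptime >= degraded:
        span = healthy - degraded
        if span == 0:
            return 75.0
        # Linear interpolation: 75 at the degraded threshold, 100 at healthy
        return 75.0 + 25.0 * (uptime - degraded) / span
    if degraded == 0:
        return 0.0
    # Below the degraded threshold: scale 0-75 in proportion to the threshold
    return max(0.0, min(75.0, 75.0 * uptime / degraded))


# Assumed thresholds, in percent
HEALTHY, DEGRADED = 99.0, 95.0

for value in (99.5, 97.0, 95.0, 47.5):
    print(value, uptime_score(value, HEALTHY, DEGRADED))
```

Halfway between the two thresholds (97.0) the score is 87.5, and exactly at the degraded threshold it is 75.0. Below that point the score is proportional to the threshold itself, so a small dip under 95.0 costs only a point or two.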

---

### Latency Scoring

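Latency is handled the same way, but inverted:

* Lower latency is better, and anything at or below the healthy threshold scores 100
* Between the healthy and degraded thresholds, the score falls linearly from 100 to 75
* Beyond the degraded threshold, the score falls linearly from 75 to 0, reaching 0 at a fixed 2000 ms ceiling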

This reflects real-world impact: small increases may be tolerable, but extreme delays quickly erode reliability and user trust.
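
A small sketch of the inverted curve, following the logic of `score_latency` in the utilities cell. The name `latency_score`, the `zero_at_ms` parameter, and the 200 ms / 500 ms thresholds are assumptions for illustration; the 2000 ms ceiling is the constant used in the utilities code.

```python
def latency_score(
    p95_ms: float, healthy: float, degraded: float, zero_at_ms: float = 2000.0
) -> float:
    """Piecewise-linear latency score on a 0-100 scale (higher is better)."""
    if p95_ms <= healthy:
        return 100.0
    if p95_ms <= degraded:
        span = degraded - healthy
        if span == 0:
            return 75.0
        # 100 at the healthy threshold, falling to 75 at the degraded threshold
        return 75.0 + 25.0 * (degraded - p95_ms) / span
    max_excess = zero_at_ms - degraded
    if max_excess <= 0:
        return 0.0
    # Past the degraded threshold: fall from 75 to 0 at the ceiling, then clamp
    return max(0.0, 75.0 - 75.0 * (p95_ms - degraded) / max_excess)


# Assumed thresholds, in milliseconds
for p95 in (150.0, 350.0, 500.0, 1250.0, 3000.0):
    print(p95, latency_score(p95, healthy=200.0, degraded=500.0))
```

With these assumed thresholds, 350 ms scores 87.5, 1250 ms scores 37.5, and anything at or beyond 2000 ms is clamped to 0.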

---

### Authentication Scoring

Authentication status is treated as a **discrete control risk**, not a performance metric.

* Valid credentials score 100
* Credentials that are expiring soon score 50, which raises a warning signal
* Invalid or expired credentials score 0 and immediately reduce confidence
* Unrecognized statuses fall back to a neutral 50

This ensures the agent does not ignore governance and security risks simply because performance metrics look acceptable.
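
The mapping is small enough to show in full. This sketch reuses the status values and scores from `score_auth_status`; the names `AUTH_SCORES` and `auth_score` are illustrative.

```python
# Scores taken from the utilities code; unknown statuses fall back to 50.0
AUTH_SCORES = {
    "valid": 100.0,
    "expiring_soon": 50.0,
    "expired": 0.0,
    "invalid": 0.0,
}


def auth_score(status: str) -> float:
    """Map an authentication status to a 0-100 score, case-insensitively."""
    return AUTH_SCORES.get(status.lower(), 50.0)


for status in ("VALID", "expiring_soon", "expired", "unknown"):
    print(status, auth_score(status))
```

An unrecognized status such as `"unknown"` scores the same as `"expiring_soon"`, so it is flagged as an issue without pushing the system straight to the bottom of the scale.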

---

## Weighted Health Model: Explicit Trade-Offs

The overall system health score is computed using **explicit weights**:

* Uptime (50%)
* Latency (30%)
* Authentication status (20%)

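These weights are not learned implicitly or scattered across code paths. They appear as explicit constants in `assess_system_health`.

They reflect a conscious prioritization:

* Reliability first
* Performance second
* Access control always considered

Because the weights are visible in one place, changing them is a small, reviewable edit. In the code shown here they are literals rather than configuration values, so moving them into the agent's configuration alongside the thresholds is what would let leadership adjust priorities without touching logic.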
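
The weighted sum can be sketched as follows. The `weights` parameter and the sum-to-one check are assumptions added for illustration (the utilities code uses fixed literals); the default values match the 50/30/20 split listed above.

```python
from typing import Mapping

# Default split from the text: uptime 50%, latency 30%, authentication 20%
DEFAULT_WEIGHTS = {"uptime": 0.50, "latency": 0.30, "auth": 0.20}


def overall_score(
    scores: Mapping[str, float], weights: Mapping[str, float] = DEFAULT_WEIGHTS
) -> float:
    """Combine per-metric scores (0-100) into one weighted score (0-100)."""
    # Hypothetical safety check: reject weights that do not sum to 1.0
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[name] * weight for name, weight in weights.items())


scores = {"uptime": 100.0, "latency": 80.0, "auth": 50.0}
print(round(overall_score(scores), 1))  # 50 + 24 + 10 = 84.0
```

A system with perfect uptime can still land just under a score of 85 when latency and credentials both slip, which is the trade-off the weights are meant to make visible.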

---

## Health Status Classification: Predictable Escalation

Once scored, each system is classified into one of three states:

* **Healthy**
* **Degraded**
* **Critical**

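The thresholds for these states are explicit and stable: an overall score of 85 or higher is healthy, 70 up to (but not including) 85 is degraded, and anything below 70 is critical.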

There is no ambiguity about:

* When escalation occurs
* Why a system was flagged
* How severe the issue is

This predictability prevents alert fatigue and builds trust over time.
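
As a minimal sketch, the classification step reduces to two comparisons. The cutoffs 85 and 70 come from `assess_system_health`; exposing them as parameters, and the name `classify`, are assumptions made here for illustration.

```python
def classify(overall: float, healthy_min: float = 85.0, degraded_min: float = 70.0) -> str:
    """Map an overall score (0-100) to a health status label."""
    if overall >= healthy_min:
        return "healthy"
    if overall >= degraded_min:
        return "degraded"
    return "critical"


for score in (92.0, 85.0, 84.9, 70.0, 69.9):
    print(score, classify(score))
```

The boundaries are inclusive on the lower edge of each band: 85.0 is healthy and 70.0 is degraded.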

---

## Issue Attribution: Explaining *Why* Something Is Degraded

Rather than returning a single score, the system also records **specific contributing issues**:

* Low uptime
* High latency
* Authentication problems

This makes every health assessment **explainable**.

Executives and operators don’t just see *that* a system is degraded—they see *why*.
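
Issue attribution is a set of independent checks on the per-metric scores. The sketch below uses the same cutoffs and issue labels as `assess_system_health`; the helper name `identify_issues` is hypothetical.

```python
from typing import List


def identify_issues(uptime_score: float, latency_score: float, auth_score: float) -> List[str]:
    """List the metrics that are pulling a system's health down."""
    issues = []
    if uptime_score < 85.0:
        issues.append("uptime_low")
    if latency_score < 85.0:
        issues.append("latency_high")
    # Any authentication score short of 100 is reported
    if auth_score < 100.0:
        issues.append("auth_issue")
    return issues


print(identify_issues(100.0, 80.0, 50.0))
print(identify_issues(100.0, 100.0, 100.0))
```

Note that issues are flagged per metric, independently of the overall status. A system can be classified as healthy overall and still carry an `auth_issue` flag.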

---

## Blast Radius Awareness: Connecting Systems to Business Impact

A system’s health is only meaningful in context.

This utility explicitly identifies **which agents depend on each system**, making blast radius visible.

That means the agent can later answer:

* “How many workflows are at risk?”
* “Which business functions are exposed?”
* “Is this a localized issue or a portfolio-level concern?”

This is where technical signals become **business-relevant intelligence**.
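
A rough sketch of the dependency lookup, assuming `agents_lookup` maps agent IDs to records with a `dependencies` list (the shape used in the utilities code). The agent and system IDs below are made up for illustration.

```python
from typing import Any, Dict, List


def affected_agents(system_id: str, agents_lookup: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the IDs of agents that list this system as a dependency."""
    return [
        agent_id
        for agent_id, agent in agents_lookup.items()
        if system_id in agent.get("dependencies", [])
    ]


# Hypothetical agents and systems
agents_lookup = {
    "billing_agent": {"dependencies": ["crm", "payments_api"]},
    "support_agent": {"dependencies": ["crm"]},
    "report_agent": {},  # no dependencies recorded
}

print(affected_agents("crm", agents_lookup))
print(affected_agents("payments_api", agents_lookup))
```

The length of the returned list is a simple first measure of blast radius: here a `crm` outage touches two agents, while `payments_api` touches one.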

---

## Portfolio-Wide Analysis Without Hidden Complexity

The `analyze_all_systems` function applies the same logic consistently across all systems.

There are:

* No special cases
* No dynamic branching
* No adaptive heuristics

This uniformity is intentional.

It ensures:

* Fair comparisons
* Consistent scoring
* Stable behavior as the system scales
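
Because every system is assessed into the same record shape, portfolio-level questions become simple aggregations. The sketch below shows one possible way to summarize the list that `analyze_all_systems` returns; the `summarize` helper and the sample records are hypothetical, and only the keys `health_status` and `affected_agents` are taken from the utilities code.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count systems per status and collect agents exposed to unhealthy systems."""
    status_counts = Counter(r["health_status"] for r in results)
    at_risk_agents = sorted(
        {
            agent
            for r in results
            if r["health_status"] != "healthy"
            for agent in r["affected_agents"]
        }
    )
    return {"status_counts": dict(status_counts), "at_risk_agents": at_risk_agents}


# Hypothetical assessment records
results = [
    {"system_id": "crm", "health_status": "healthy", "affected_agents": ["support_agent"]},
    {"system_id": "payments_api", "health_status": "critical", "affected_agents": ["billing_agent"]},
    {"system_id": "erp", "health_status": "degraded", "affected_agents": ["billing_agent", "report_agent"]},
]

print(summarize(results))
```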

---

## What This Signals to Executives

This integration health model ensures the agent is:

* **Reliable** — same inputs produce the same outputs
* **Predictable** — thresholds define behavior explicitly
* **Explainable** — scores and issues are traceable
* **Programmable** — priorities and weights can be changed
* **Safe to scale** — no hidden logic or emergent behavior

This is not a system that “might do the right thing.”

It is a system that does **exactly what it was configured to do**—and nothing more.

---

## Architectural Takeaway

This utility demonstrates a core philosophy of the orchestrator:

> **AI should not decide what “healthy” means.
> Leadership should.**

The agent’s role is to apply that definition consistently, visibly, and without surprise.

That is how you build AI systems executives are willing to trust—and fund.



In [None]:
"""Integration health analysis utilities"""

from typing import Any, Dict, List


def score_uptime(uptime_30d: float, thresholds: Dict[str, float]) -> float:
    """Score uptime (0-100, higher is better)"""
    if uptime_30d >= thresholds["healthy"]:
        return 100.0
    elif uptime_30d >= thresholds["degraded"]:
        # Linear interpolation between degraded and healthy
        range_size = thresholds["healthy"] - thresholds["degraded"]
        if range_size == 0:
            return 75.0
        position = (uptime_30d - thresholds["degraded"]) / range_size
        return 75.0 + (position * 25.0)  # 75-100
    else:
        # Below degraded threshold
        if thresholds["degraded"] == 0:
            return 0.0
        position = uptime_30d / thresholds["degraded"]
        return max(0.0, min(75.0, position * 75.0))  # 0-75


def score_latency(latency_ms_p95: float, thresholds: Dict[str, float]) -> float:
    """Score latency (0-100, higher is better)"""
    if latency_ms_p95 <= thresholds["healthy"]:
        return 100.0
    elif latency_ms_p95 <= thresholds["degraded"]:
        # Linear interpolation between healthy and degraded
        range_size = thresholds["degraded"] - thresholds["healthy"]
        if range_size == 0:
            return 75.0
        position = (thresholds["degraded"] - latency_ms_p95) / range_size
        return 75.0 + (position * 25.0)  # 75-100
    else:
        # Above degraded threshold
        # Penalize heavily for exceeding degraded threshold
        excess = latency_ms_p95 - thresholds["degraded"]
        # Assume 2000ms is effectively 0 score
        max_excess = 2000.0 - thresholds["degraded"]
        if max_excess <= 0:
            return 0.0
        position = excess / max_excess
        return max(0.0, 75.0 - (position * 75.0))  # 0-75


def score_auth_status(auth_status: str) -> float:
    """Score authentication status (0-100)"""
    status_map = {
        "valid": 100.0,
        "expiring_soon": 50.0,
        "expired": 0.0,
        "invalid": 0.0
    }
    # Unrecognized or missing statuses (including None) get a neutral 50.0
    return status_map.get(str(auth_status).lower(), 50.0)


def assess_system_health(
    system: Dict[str, Any],
    thresholds: Dict[str, Dict[str, float]],
    agents_lookup: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Assess health of a single system"""
    system_id = system["system_id"]
    uptime_30d = system.get("uptime_30d", 0.0)
    # Caution: a missing latency defaults to 0.0 ms, which scores as fully healthy
    latency_ms_p95 = system.get("latency_ms_p95", 0.0)
    auth_status = system.get("auth_status", "unknown")

    uptime_score = score_uptime(uptime_30d, thresholds["uptime"])
    latency_score = score_latency(latency_ms_p95, thresholds["latency"])
    auth_score = score_auth_status(auth_status)

    # Weighted overall score
    overall_score = (uptime_score * 0.50) + (latency_score * 0.30) + (auth_score * 0.20)

    # Determine health status
    if overall_score >= 85.0:
        health_status = "healthy"
    elif overall_score >= 70.0:
        health_status = "degraded"
    else:
        health_status = "critical"

    # Identify issues
    issues = []
    if uptime_score < 85.0:
        issues.append("uptime_low")
    if latency_score < 85.0:
        issues.append("latency_high")
    if auth_score < 100.0:
        issues.append("auth_issue")

    # Find affected agents (agents that depend on this system)
    affected_agents = []
    for agent_id, agent in agents_lookup.items():
        dependencies = agent.get("dependencies", [])
        if system_id in dependencies:
            affected_agents.append(agent_id)

    return {
        "system_id": system_id,
        "health_status": health_status,
        "uptime_score": round(uptime_score, 1),
        "latency_score": round(latency_score, 1),
        "auth_score": round(auth_score, 1),
        "overall_score": round(overall_score, 1),
        "issues": issues,
        "affected_agents": affected_agents
    }


def analyze_all_systems(
    systems: List[Dict[str, Any]],
    thresholds: Dict[str, Dict[str, float]],
    agents_lookup: Dict[str, Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Analyze health of all systems"""
    return [
        assess_system_health(system, thresholds, agents_lookup)
        for system in systems
    ]


The node that calls these utilities is a **clean, disciplined orchestration step**. It keeps the agent *boring in the right ways*, which is what executives want from operational tooling.


---

## Integration Health Node – Applying Rules, Not Guessing

The `integration_health_node` is responsible for translating validated system data into **explicit, actionable integration health assessments**.

It does not discover health dynamically, infer behavior, or rely on probabilistic judgment.

It applies **predefined rules**, using **configurable thresholds**, in a **repeatable and auditable way**.

That distinction is critical.

---

## What This Node Does

At a high level, this node:

1. Confirms required inputs are present
2. Applies configured health thresholds
3. Produces deterministic health assessments for every system
4. Returns structured results for downstream analysis

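If `system_integrations` is missing or empty, the node stops immediately and returns only an error.

No partial execution.
No silent degradation.

One caveat: a missing `agents_lookup` is not treated as an error. It defaults to an empty mapping, so `affected_agents` comes back empty for every system.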

---

## Guardrails First: Explicit Input Requirements

Before any analysis occurs, the node checks that system integration data is present.

If it’s missing, execution halts with a clear error.

This ensures the agent never:

* Produces misleading output
* Assumes data exists when it does not
* Continues under invalid conditions

For leadership, this means **no false sense of safety**.
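
The guard can be sketched in isolation. The helper `check_inputs` is hypothetical (the node performs this check inline), but the state keys and the error message follow the node's code.

```python
from typing import Any, Dict, List, Optional

NODE = "integration_health_node"


def check_inputs(state: Dict[str, Any]) -> Optional[Dict[str, List[str]]]:
    """Return an error update if system data is missing or empty, else None."""
    errors = state.get("errors", [])
    if not state.get("system_integrations"):
        # List concatenation builds a new list; the state's own list is untouched
        return {"errors": errors + [f"{NODE}: system_integrations required"]}
    return None


state = {"errors": ["earlier_node: warning"], "system_integrations": []}
update = check_inputs(state)

print(update["errors"])  # earlier error plus the new one
print(state["errors"])   # unchanged
```

An empty list and a missing key are treated the same way, and earlier errors are preserved rather than overwritten.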

---

## Configuration-Driven Behavior

Thresholds for uptime and latency are sourced directly from the agent’s configuration.

This is not a cosmetic choice.

It ensures:

* Health definitions are programmable
* Changes are intentional
* Behavior can be aligned with business tolerance

Executives can adjust risk tolerance without rewriting logic.

That is real control.
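
For illustration, a configuration object of this kind might look like the sketch below. `HealthConfig` is a hypothetical stand-in for the orchestrator's real config class, and the threshold values are assumed; only the attribute names `uptime_thresholds` and `latency_thresholds` and the `healthy` / `degraded` keys come from the node and utilities code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HealthConfig:
    """Hypothetical stand-in for the orchestrator config; values are illustrative."""

    uptime_thresholds: Dict[str, float] = field(
        default_factory=lambda: {"healthy": 99.0, "degraded": 95.0}
    )
    latency_thresholds: Dict[str, float] = field(
        default_factory=lambda: {"healthy": 200.0, "degraded": 500.0}
    )


def build_thresholds(config: HealthConfig) -> Dict[str, Dict[str, float]]:
    """Assemble the thresholds mapping in the shape the scoring utilities expect."""
    return {"uptime": config.uptime_thresholds, "latency": config.latency_thresholds}


default = build_thresholds(HealthConfig())
# Tightening latency tolerance is a configuration change, not a logic change
strict = build_thresholds(
    HealthConfig(latency_thresholds={"healthy": 100.0, "degraded": 300.0})
)

print(default["latency"], strict["latency"])
```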

---

## Deterministic Health Analysis

The node delegates scoring to the integration health utilities, which:

* Score uptime, latency, and authentication explicitly
* Apply visible weights
* Classify systems into health states predictably
* Identify affected agents

The node itself remains simple and declarative.

That simplicity is a feature, not a limitation.

---

## Clean State Updates

The node returns:

* A structured `integration_health` output
* The existing error context

No state is mutated implicitly.
Downstream nodes receive exactly what they need—and nothing more.

This keeps the system:

* Testable
* Inspectable
* Safe to extend

---

## Defensive Error Handling

All analysis is wrapped in a `try/except` block.

If something unexpected occurs, the error is captured and surfaced immediately.

The agent does not attempt to “power through” failure.

That behavior mirrors how real operational systems should behave.
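
The capture-and-surface pattern can be sketched with a small wrapper. `run_guarded` and `broken_analysis` are hypothetical names used only here; the returned shapes and the error message format follow the node's code.

```python
from typing import Any, Callable, Dict, List


def run_guarded(
    node_name: str, errors: List[str], analysis: Callable[[], Any]
) -> Dict[str, Any]:
    """Run an analysis step; on failure, return only the accumulated errors."""
    try:
        return {"integration_health": analysis(), "errors": errors}
    except Exception as exc:
        return {"errors": errors + [f"{node_name}: {exc}"]}


def broken_analysis() -> List[str]:
    record = {"uptime_30d": 99.5}  # malformed record: no "system_id" key
    return [record["system_id"]]  # raises KeyError


result = run_guarded("integration_health_node", [], broken_analysis)
print(result)
```

On failure the result contains no `integration_health` key at all, so downstream nodes cannot mistake a failed run for an empty but valid one. The message for a `KeyError` is just the quoted key name, so richer context would have to be added by the caller.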

---

## Why This Matters to Executives

This node enforces properties leaders care about deeply:

* **Predictability** — health is assessed the same way every time
* **Transparency** — rules and thresholds are visible
* **Control** — tolerance levels are configurable
* **Safety** — execution stops when prerequisites are missing
* **Trust** — results can be defended in reviews

Most AI agents feel unpredictable because they *are*.

This one isn’t.

---

## Architectural Takeaway

The integration health node exemplifies the orchestrator’s core philosophy:

> **AI should apply policy, not invent it.**

By keeping logic simple, configuration explicit, and execution guarded, this node transforms raw infrastructure data into executive-grade signals without surprise.

That’s what makes the system safe to scale.




In [None]:
def integration_health_node(
    state: IntegrationRiskManagementOrchestratorState,
    config: IntegrationRiskManagementOrchestratorConfig
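) -> Dict[str, Any]:
    """Integration Health Node: analyze system integration health.

    Returns a partial state update: `integration_health` plus the existing
    errors on success, or only an extended `errors` list when the input is
    missing or the analysis raises.
    """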
    errors = state.get("errors", [])
    systems = state.get("system_integrations", [])
    agents_lookup = state.get("agents_lookup", {})

    if not systems:
        return {
            "errors": errors + ["integration_health_node: system_integrations required"]
        }

    try:
        thresholds = {
            "uptime": config.uptime_thresholds,
            "latency": config.latency_thresholds
        }

        integration_health = analyze_all_systems(systems, thresholds, agents_lookup)

        return {
            "integration_health": integration_health,
            "errors": errors
        }
    except Exception as e:
        return {
            "errors": errors + [f"integration_health_node: {str(e)}"]
        }
