

## Data Loading Node – Establishing the Trust Boundary for the Entire Agent

The `data_loading_node` is the **first execution step** in the orchestrator’s plan, and it is intentionally designed as the system’s **trust boundary**.

Before the agent evaluates risk, analyzes trends, or produces executive recommendations, it must first prove that it is reasoning over **complete, valid, and intentional data**.

This node is where that guarantee is enforced.

---

## What This Node Does in Practice

At a high level, the data loading node performs four critical functions:

1. **Loads all required datasets**
2. **Validates them before use**
3. **Builds deterministic lookup structures**
4. **Scopes analysis intentionally when requested**

Only after these steps succeed does the agent proceed.

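If any step raises an exception, the node returns only an updated `errors` list and no datasets, giving the orchestrator a clear signal to stop.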

---

## Defensive Execution: No Silent Failure

All data loading is wrapped in a single `try/except` block. Any exception is caught, prefixed with the node name, and appended to the `errors` list in the returned state update.

This design ensures:

* No partial execution
* No misleading downstream analysis
* No reports generated from incomplete data

If the agent produces output, leadership can be confident that *every required dataset loaded successfully*.

This is a fundamental difference from many AI systems that attempt to “do their best” with whatever data happens to be available.
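
The pattern can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's implementation: `run_guarded` and `has_errors` are hypothetical helpers, and how the real orchestrator reacts to a non-empty `errors` list is an assumption here.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def run_guarded(name: str, state: State, load: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run `load` and turn any exception into an entry on the errors list."""
    errors: List[str] = list(state.get("errors", []))
    try:
        update = load()
    except Exception as exc:
        # On failure, return only the error list so no partial datasets leak out.
        return {"errors": errors + [f"{name}: {exc}"]}
    return {**update, "errors": errors}


def has_errors(update: Dict[str, Any]) -> bool:
    """Hypothetical check an orchestrator could use to decide whether to halt."""
    return bool(update.get("errors"))


def failing_load() -> Dict[str, Any]:
    # Stand-in for a loader that cannot find its input file.
    raise FileNotFoundError("agents.json not found")


result = run_guarded("data_loading_node", {}, failing_load)
```

Because the failure path returns nothing except `errors`, a caller cannot mistake a failed load for an empty but valid dataset.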

---

## Explicit Data Domains, Loaded Intentionally

The node loads each dataset through its own dedicated loader, which validates the data before returning it:

* Agent inventory
* System integrations
* Workflows
* Risk signals
* KPI and cost metrics
* Historical snapshots (v2)
* Ownership review history (v2)
* Expected vs actual value (v2)

Each dataset represents a **distinct dimension of reality**:

* Configuration
* Current-state signals
* Historical memory
* Governance intent

Nothing is inferred. Everything is declared.
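
The loaders themselves are not shown in this notebook cell, so the following is only a sketch of what a validating loader might look like. It assumes the datasets are JSON arrays of records, which the source does not state; `load_records` and its `required_fields` parameter are hypothetical names.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Sequence


def load_records(data_dir: str, filename: str, required_fields: Sequence[str]) -> List[Dict[str, Any]]:
    """Load a JSON array of records and check that each record has the required fields."""
    path = Path(data_dir) / filename
    with path.open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError(f"{filename}: expected a JSON array of records")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"{filename}: record {index} is not an object")
        missing = [field for field in required_fields if field not in record]
        if missing:
            # Fail loudly instead of letting an incomplete record through.
            raise ValueError(f"{filename}: record {index} is missing {missing}")
    return records


# Demonstration with a throwaway file; the file name and fields are made up.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "agents.json").write_text(
        '[{"agent_id": "a1", "name": "Billing bot"}]', encoding="utf-8"
    )
    agents = load_records(tmp, "agents.json", ["agent_id", "name"])
```

Raising inside the loader is what lets the node's single `try/except` treat a malformed file the same way as a missing one.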

---

## Deterministic Lookups: Explainability at Scale

Once the raw data is loaded, the node immediately builds **explicit lookup tables**.

This step is critical.

Rather than allowing downstream logic to repeatedly scan lists or infer relationships dynamically, the agent constructs deterministic mappings such as:

* Agent → workflows
* Agent → risks
* Agent → KPIs
* Agent → historical snapshots
* Agent → governance reviews
* Agent → expected vs actual value

These lookups make the agent’s reasoning:

* Faster
* More transparent
* Fully traceable

When the agent later explains *why* an issue was prioritized, those explanations trace cleanly back through these structures.
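
As an illustration of one such mapping, the sketch below groups records by `agent_id`. The real `build_lookups` is not shown in this cell, so `group_by_agent` and the sample risk records are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def group_by_agent(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Build an agent_id -> records mapping in a single pass over the data."""
    grouped: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        grouped[record["agent_id"]].append(record)
    # Return a plain dict so that looking up an unknown agent raises KeyError
    # instead of silently creating an empty entry.
    return dict(grouped)


risks = [
    {"agent_id": "a1", "risk": "stale credentials"},
    {"agent_id": "a2", "risk": "no owner"},
    {"agent_id": "a1", "risk": "cost overrun"},
]
risks_by_agent = group_by_agent(risks)
```

Here `risks_by_agent["a1"]` holds both of that agent's risk records in their original order, so a later explanation can point at exactly the records that drove a score.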

---

## Intentional Scoping: Portfolio or Single-Agent Analysis

The optional `agent_id` value in state controls how much of the portfolio is analyzed.

It allows the same agent to operate in two modes:

* **Portfolio mode** – analyze the entire ecosystem
* **Focused mode** – analyze a single agent in depth

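This filtering happens *after* data loading, validation, and lookup construction, not before. The lookups are therefore built from the full portfolio even in focused mode.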

That choice matters because:

* Data integrity is preserved
* Comparisons remain valid
* The agent’s behavior is predictable

Scope is a **business decision**, not a side effect of missing data.
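
The two modes reduce to one small filter applied to each already-loaded dataset. A minimal sketch, with `scope_to_agent` as a hypothetical helper name:

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def scope_to_agent(records: List[Record], agent_id: Optional[str]) -> List[Record]:
    """Return every record in portfolio mode, or one agent's records in focused mode."""
    if not agent_id:
        # Portfolio mode: no agent requested, so nothing is filtered out.
        return records
    return [record for record in records if record["agent_id"] == agent_id]


workflows = [
    {"agent_id": "a1", "workflow": "invoice sync"},
    {"agent_id": "a2", "workflow": "ticket triage"},
]
portfolio = scope_to_agent(workflows, None)
focused = scope_to_agent(workflows, "a1")
```

An unknown `agent_id` yields an empty list rather than an error, which is also how the node's list comprehensions behave.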

---

## Clean State Updates, No Hidden Mutation

The node returns a clear, structured update to state:

* Loaded datasets
* Derived lookups
* Preserved error context

Nothing is mutated implicitly.
Everything that downstream nodes rely on is explicitly returned.

This makes the system:

* Easier to test
* Easier to reason about
* Easier to audit
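
To make the "returned update" idea concrete, the sketch below merges a node's update into a new state object. How the actual orchestrator applies updates is not shown in this notebook, so `apply_update` is a hypothetical stand-in.

```python
from typing import Any, Dict

State = Dict[str, Any]


def apply_update(state: State, update: Dict[str, Any]) -> State:
    """Return a new state with the update merged in; the input dict is not modified."""
    # Note: this is a shallow merge, so nested lists are shared, not copied.
    return {**state, **update}


before = {"agent_id": "a1", "errors": []}
update = {"agents": [{"agent_id": "a1"}], "errors": []}
after = apply_update(before, update)
```

Since `before` is unchanged, a test can compare the state on either side of the node and see exactly which keys the node contributed.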

---

## Why This Matters to Executives (Even If They Never See the Code)

This node quietly enforces properties that leaders care deeply about:

* **Reliability** – the agent won’t run on bad data
* **Predictability** – execution either succeeds or stops
* **Transparency** – inputs and relationships are explicit
* **Control** – scope is intentional, not accidental
* **Safety** – no partial or misleading outputs

Many AI agents *appear* intelligent but fail operationally because they lack this discipline.

This one doesn’t.

---

## Architectural Takeaway

The data loading node is not a utility step.
It is a **governance gate**.

By validating inputs, enforcing completeness, and building deterministic relationships, it ensures that every conclusion produced later (risk scores, trends, prioritization, and reports) rests on a foundation leadership can trust.

This is how you build AI systems that don’t surprise people.



In [None]:
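from typing import Any, Dict


def data_loading_node(
    state: IntegrationRiskManagementOrchestratorState,
    config: IntegrationRiskManagementOrchestratorConfig
) -> Dict[str, Any]:
    """Load every dataset, build lookups, and optionally scope to one agent.

    Returns a state update containing the datasets and lookups, or only an
    updated `errors` list if any step raises.
    """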
    errors = state.get("errors", [])
    data_dir = config.data_dir

    try:
        # Load all data files
        agents = load_agents(data_dir, config.agents_file)
        systems = load_system_integrations(data_dir, config.systems_file)
        workflows = load_workflows(data_dir, config.workflows_file)
        risks = load_risk_signals(data_dir, config.risks_file)
        kpis = load_kpis_cost(data_dir, config.kpis_file)

        # v2: Load historical data
        snapshots = load_historical_snapshots(data_dir, config.historical_snapshots_file)
        reviews = load_ownership_review_history(data_dir, config.ownership_review_history_file)
        expected_vs_actual = load_expected_vs_actual_value(data_dir, config.expected_vs_actual_value_file)

        # Build lookups from the full, unfiltered datasets
        lookups = build_lookups(agents, systems, workflows, risks, kpis, snapshots, reviews, expected_vs_actual)

        # Focused mode: keep only the requested agent's records (systems are not filtered)
        agent_id = state.get("agent_id")
        if agent_id:
            agents = [a for a in agents if a["agent_id"] == agent_id]
            workflows = [w for w in workflows if w["agent_id"] == agent_id]
            risks = [r for r in risks if r["agent_id"] == agent_id]
            kpis = [k for k in kpis if k["agent_id"] == agent_id]
            snapshots = [s for s in snapshots if s["agent_id"] == agent_id]
            reviews = [r for r in reviews if r["agent_id"] == agent_id]
            expected_vs_actual = [e for e in expected_vs_actual if e["agent_id"] == agent_id]

        return {
            "agents": agents,
            "system_integrations": systems,
            "workflows": workflows,
            "risk_signals": risks,
            "kpis_cost_metrics": kpis,
            "historical_snapshots": snapshots,
            "ownership_review_history": reviews,
            "expected_vs_actual_value": expected_vs_actual,
            **lookups,
            "errors": errors
        }
    except Exception as e:
        # Any failure: return only the error list, never partial data
        return {
            "errors": errors + [f"data_loading_node: {e}"]
        }