


# Phase 1 Tests: Why This Matters

These tests don’t just check whether functions run.
They validate that the **agent’s intent and structure are stable, predictable, and enforceable**.

That’s a critical distinction.

Most AI test suites check model outputs.
You’re testing **governance primitives**.

---

## What This Test Suite Does in Practical Terms

This Phase 1 test suite verifies that:

* The agent can **declare its evaluation goal**
* The agent can **build a deterministic execution plan**
* Those two steps work:

  * independently
  * together
  * and under error conditions

In other words, it confirms that the agent:

> *knows what it’s doing before it does anything.*

That’s foundational for trust.

---

## Goal Node Tests: Validating Intent Before Action

### Why this matters operationally

The `test_goal_node()` function ensures that:

* evaluation scope is explicit
* optional targeting behaves correctly
* the system never “assumes” what it should evaluate

This protects against one of the most common AI risks:

> Silent changes in behavior without a corresponding change in declared intent.

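Because the goal is structured and tested, any future change to:

* the objective string
* how `scenario_id` and `target_agent_id` flow into the scope

will **fail a test immediately**, rather than surfacing weeks later in production. Metrics and success criteria are not asserted in Phase 1, so they need their own assertions before they get the same protection.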
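As a rough illustration of what these assertions pin down, here is a minimal sketch of a goal node that would satisfy them. `goal_node_sketch` and the `Goal` / `GoalScope` types are hypothetical stand-ins; the real `goal_node` in `agents.eval_as_service.orchestrator.nodes` is not shown in this notebook and may differ.

```python
from typing import Any, Dict, Optional, TypedDict


class GoalScope(TypedDict):
    scenario_id: Optional[str]
    target_agent_id: Optional[str]


class Goal(TypedDict):
    objective: str
    scope: GoalScope


def goal_node_sketch(state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for goal_node: declare intent from explicit inputs only."""
    goal: Goal = {
        # Objective string copied from the assertion in test_goal_node.
        "objective": "Evaluate AI agent performance using test scenarios",
        "scope": {
            # None means "no specific target"; nothing is inferred or defaulted.
            "scenario_id": state.get("scenario_id"),
            "target_agent_id": state.get("target_agent_id"),
        },
    }
    # Return a new dict so the caller's state is not mutated.
    return {**state, "goal": goal, "errors": list(state.get("errors", []))}


result = goal_node_sketch({"scenario_id": "S001", "target_agent_id": None, "errors": []})
print(result["goal"]["scope"])
```

With a shape like this, widening the scope means adding a field to the goal, and that shows up as a reviewable change next to the assertions that cover it.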

---

### Why leaders are relieved to see this

Executives worry about systems that:

* gradually drift
* silently expand scope
* behave differently than expected

These tests demonstrate that:

* evaluation intent is explicit
* scope changes are deliberate
* nothing runs “by accident”

That’s governance, not just correctness.

---

### How this differs from most agents in production

Most agents:

* infer intent from prompts
* embed scope logic implicitly
* lack tests around *why* something runs

Here, intent is:

* declared
* testable
* enforceable

That’s a big maturity jump.

---

## Planning Node Tests: Verifying Predictable Workflow

### Why this matters operationally

The `test_planning_node()` function confirms that:

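* a valid goal produces a known four-step workflow
* a missing goal is reported in `errors` before any step runs
* execution steps are ordered and named explicitly (`data_loading`, `evaluation_execution`, `scoring_analysis`, `report_generation`)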

Because the plan is checked before anything executes, this guards against a run that:

* skips a step
* runs steps out of order
* produces partial outputs without explanation

That predictability is essential when evaluations affect:

* deployment decisions
* agent health status
* escalation paths
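A planning node that meets these checks can be very small. The sketch below is an assumption about its shape, not the project's implementation: `planning_node_sketch`, the `step` key, and the exact error text are hypothetical, while the four step names and the "goal is required" substring come from the assertions in `test_planning_node`.

```python
from typing import Any, Dict, List

# Step names and their order are taken from the assertions in test_planning_node.
PLAN_STEP_NAMES: List[str] = [
    "data_loading",
    "evaluation_execution",
    "scoring_analysis",
    "report_generation",
]


def planning_node_sketch(state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for planning_node: a fixed plan, or an error if no goal."""
    errors = list(state.get("errors", []))
    if not state.get("goal"):
        # The message text is an assumption; the test only looks for
        # the substring "goal is required".
        errors.append("planning_node: goal is required")
        return {**state, "errors": errors}
    # The plan is built from a constant list, so it does not vary between runs.
    plan = [{"step": i + 1, "name": name} for i, name in enumerate(PLAN_STEP_NAMES)]
    return {**state, "plan": plan, "errors": errors}


ok = planning_node_sketch({"goal": {"objective": "Evaluate AI agent performance"}, "errors": []})
print([step["name"] for step in ok["plan"]])

bad = planning_node_sketch({"errors": []})
print(bad["errors"])
```

Keeping the step list as a module-level constant is one way to make the workflow reviewable in a single place; a dynamically generated plan would need a different kind of test.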

---

### Why leaders would be reassured

From a business perspective, this test proves:

* the system follows a known process
* errors are detected before execution
* nothing “mysterious” happens mid-run

Executives don’t need to understand Python to understand this:

> *“The system checks its plan before it acts.”*

That’s the same expectation they have for financial systems, CI pipelines, or compliance tooling.

---

### How this differs from typical agent workflows

Many agentic systems:

* dynamically decide next steps
* rely on runtime reasoning
* fail in unpredictable ways

Your system:

* defines the workflow upfront
* validates it explicitly
* refuses to proceed if prerequisites aren’t met

That’s **operational safety by design**.

---

## Integration Test: Proving Composition Works

### Why this matters

The combined test, `test_goal_and_planning_together()`, confirms that:

* nodes compose cleanly
* state evolves predictably
* early-stage orchestration works as intended

The test is short, but it is the only one that exercises the hand-off of state from one node to the next.

It means:

* the agent can be reasoned about as a system
* nodes can be swapped or extended later
* failures are localized and understandable
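One way to picture the composition is a small runner that threads state through the nodes in order and stops at the first reported error. This is a sketch under assumptions: `run_nodes`, `declare_goal`, and `make_plan` are hypothetical names used only for illustration, and the real orchestrator may wire its nodes differently.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]


def run_nodes(state: State, nodes: List[Tuple[str, Node]]) -> Tuple[State, List[str]]:
    """Run nodes in order; stop as soon as a node reports an error."""
    completed: List[str] = []
    for name, node in nodes:
        state = node(state)
        if state.get("errors"):
            # Halt here so the failure is attributed to this node.
            break
        completed.append(name)
    return state, completed


# Tiny stand-in nodes, just enough to exercise the runner.
def declare_goal(state: State) -> State:
    return {**state, "goal": {"objective": "demo"}}


def make_plan(state: State) -> State:
    if "goal" not in state:
        return {**state, "errors": list(state.get("errors", [])) + ["goal is required"]}
    return {**state, "plan": [{"name": "data_loading"}]}


final, done = run_nodes({"errors": []}, [("goal", declare_goal), ("planning", make_plan)])
print(done, "plan" in final)

halted, done_early = run_nodes({"errors": []}, [("planning", make_plan)])
print(done_early, halted["errors"])
```

Because each node only reads and returns state, swapping one out means passing a different callable in the list, and the list of completed names shows where a run stopped.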

---

### Why this is rare in AI agent development

Most AI projects test:

* model outputs
* prompt responses
* end-to-end demos

Very few test:

* state transitions
* intent propagation
* orchestration integrity

You are testing the **spine of the agent**, not just its surface behavior.

---

## Why This Test Design Supports ROI and Accountability

These tests ensure that:

* Evaluation runs are repeatable
* Failures are caught early
* Changes are intentional and visible
* The system behaves like infrastructure, not experimentation

That directly supports:

* faster iteration
* safer releases
* clearer executive reporting
* lower operational risk

This is exactly the kind of testing leaders expect when AI starts influencing real business decisions.
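Repeatability can also be checked directly rather than assumed. The helper below is a hedged sketch: `assert_repeatable` and `fixed_plan_node` are hypothetical names, and the check only covers equality of outputs for equal inputs within a single process.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


def assert_repeatable(node: Callable[[State], State], state: State, runs: int = 3) -> State:
    """Call the node several times on equal inputs and require equal outputs."""
    baseline = node(copy.deepcopy(state))
    for _ in range(runs - 1):
        # Deep-copy the input each time so one run cannot affect the next.
        again = node(copy.deepcopy(state))
        assert again == baseline, "node output changed between identical runs"
    return baseline


# Stand-in node with a fixed plan, used only to demonstrate the helper.
def fixed_plan_node(state: State) -> State:
    return {**state, "plan": [{"name": "data_loading"}, {"name": "report_generation"}]}


stable = assert_repeatable(fixed_plan_node, {"goal": {"objective": "demo"}, "errors": []})
print(len(stable["plan"]))
```

A node that depends on a clock, a random seed, or a model call would fail this check, which is a useful signal about where determinism ends in the pipeline.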

---

## Executive Takeaway

What a CEO or business manager would see here is not “unit tests.”

They would see:

* guardrails
* predictability
* accountability
* professionalism

Most AI agents today are impressive demos.
This one is being built like a **system that can be trusted in production**.




In [None]:
"""
Phase 1 Test: Goal and Planning Nodes

Tests that goal_node and planning_node work correctly.
"""

import sys

# Add project root to path
sys.path.insert(0, '.')

from agents.eval_as_service.orchestrator.nodes import goal_node, planning_node
from config import EvalAsServiceOrchestratorState


def test_goal_node():
    """Test goal_node with minimal state"""
    print("Testing goal_node...")

    # Test 1: Basic goal creation
    state: EvalAsServiceOrchestratorState = {
        "scenario_id": None,
        "target_agent_id": None,
        "errors": []
    }

    result = goal_node(state)

    assert "goal" in result, "Goal should be created"
    assert result["goal"]["objective"] == "Evaluate AI agent performance using test scenarios"
    assert result["goal"]["scope"]["scenario_id"] is None
    assert result["goal"]["scope"]["target_agent_id"] is None
    assert "errors" in result
    print("✅ Goal node test 1 passed: Basic goal creation")

    # Test 2: Goal with specific scenario
    state2: EvalAsServiceOrchestratorState = {
        "scenario_id": "S001",
        "target_agent_id": None,
        "errors": []
    }

    result2 = goal_node(state2)
    assert result2["goal"]["scope"]["scenario_id"] == "S001"
    print("✅ Goal node test 2 passed: Specific scenario")

    # Test 3: Goal with specific agent
    state3: EvalAsServiceOrchestratorState = {
        "scenario_id": None,
        "target_agent_id": "shipping_update_agent",
        "errors": []
    }

    result3 = goal_node(state3)
    assert result3["goal"]["scope"]["target_agent_id"] == "shipping_update_agent"
    print("✅ Goal node test 3 passed: Specific agent")

    print("✅ All goal_node tests passed!\n")


def test_planning_node():
    """Test planning_node"""
    print("Testing planning_node...")

    # Test 1: Planning with goal
    state: EvalAsServiceOrchestratorState = {
        "goal": {
            "objective": "Evaluate AI agent performance",
            "evaluation_type": "comprehensive"
        },
        "errors": []
    }

    result = planning_node(state)

    assert "plan" in result, "Plan should be created"
    assert len(result["plan"]) == 4, "Plan should have 4 steps"
    assert result["plan"][0]["name"] == "data_loading"
    assert result["plan"][1]["name"] == "evaluation_execution"
    assert result["plan"][2]["name"] == "scoring_analysis"
    assert result["plan"][3]["name"] == "report_generation"
    print("✅ Planning node test 1 passed: Plan creation")

    # Test 2: Planning without goal (should error)
    state2: EvalAsServiceOrchestratorState = {
        "errors": []
    }

    result2 = planning_node(state2)
    assert "errors" in result2
    assert len(result2["errors"]) > 0
    assert "goal is required" in result2["errors"][0]
    print("✅ Planning node test 2 passed: Error handling")

    print("✅ All planning_node tests passed!\n")


def test_goal_and_planning_together():
    """Test goal and planning nodes together"""
    print("Testing goal and planning nodes together...")

    state: EvalAsServiceOrchestratorState = {
        "scenario_id": "S001",
        "target_agent_id": None,
        "errors": []
    }

    # Run goal node
    state = goal_node(state)
    assert "goal" in state

    # Run planning node
    state = planning_node(state)
    assert "plan" in state
    assert len(state["plan"]) == 4

    print("✅ Goal and planning together test passed!\n")


if __name__ == "__main__":
    print("=" * 60)
    print("Phase 1 Test: Goal and Planning Nodes")
    print("=" * 60)
    print()

    try:
        test_goal_node()
        test_planning_node()
        test_goal_and_planning_together()

        print("=" * 60)
        print("✅ Phase 1 Tests: ALL PASSED")
        print("=" * 60)
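    except AssertionError as e:
        # Bare asserts carry no message, so print the traceback to show which one failed
        import traceback
        traceback.print_exc()
        print(f"❌ Test failed: {e}")
        sys.exit(1)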
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


# Test Results

In [None]:
(.venv) micahshull@Micahs-iMac AI_AGENTS_021_EAAS % python3 test_eval_as_service_phase1.py
============================================================
Phase 1 Test: Goal and Planning Nodes
============================================================

Testing goal_node...
✅ Goal node test 1 passed: Basic goal creation
✅ Goal node test 2 passed: Specific scenario
✅ Goal node test 3 passed: Specific agent
✅ All goal_node tests passed!

Testing planning_node...
✅ Planning node test 1 passed: Plan creation
✅ Planning node test 2 passed: Error handling
✅ All planning_node tests passed!

Testing goal and planning nodes together...
✅ Goal and planning together test passed!

============================================================
✅ Phase 1 Tests: ALL PASSED
============================================================
