<p style="text-align:center">
  <a href="https://www.linkedin.com/company/100622063" target="_blank" title="Follow LevelUp360 on LinkedIn">
    <img src="../../assets/levelup360-inverted-logo-transparent.svg" alt="LevelUp360" width="220">
  </a>
</p>


# Marketing Team – Week 4: Content Planning Agent Routing Tests

**Objective:** Before building the full workflow, validate that the LangGraph content planning agent node routes each query to the appropriate tool(s).

**Pass Criteria:**
- Routing accuracy: >90%
- Routing consistency: >95%

**Test Approach:**
1. Define test cases with known correct routing decisions
2. Measure routing accuracy across all test cases
3. Measure routing consistency (5 runs per case)
4. Analyze failures and inconsistencies
5. Pass/fail decision based on thresholds

**Note:** Accuracy and consistency results depend on the system message configured in the brand config YAML (e.g. `itconsulting.yaml`) under `models/content_planning/system_message`.

---

In [None]:
# Imports and Setup
from src.evaluation.routing_evaluator import RoutingEvaluator, LangGraphRoutingAdapter
from src.agents.graphs.content_generation_graph import build_content_workflow
import pandas as pd
from rich import print as rprint

# Configuration
brand = "itconsulting"

# Build the graph
app = build_content_workflow(brand=brand)

# Initialize adapter and evaluator
adapter = LangGraphRoutingAdapter(
    app=app,
    brand=brand,
    template="LINKEDIN_POST_ZERO_SHOT",
    use_cot=True
)

evaluator = RoutingEvaluator(adapter)

rprint("✓ Setup complete")
rprint(f"  Brand: {brand}")
rprint("  Graph: Content generation workflow compiled")
rprint("  Tools: rag_search, web_search")
rprint("  Adapter: LangGraphRoutingAdapter")
rprint("  Evaluator: RoutingEvaluator")

---

## Test Cases

Defining 20 test cases covering different routing scenarios:
- **Company-specific queries** → `rag_search` (needs your documents)
- **General knowledge queries** → `direct_generation` (no retrieval needed)
- **Current/industry information** → `web_search` (needs current data)
- **Mixed queries** → Multiple tools (e.g., `rag_search` + `web_search`)

Each test case includes:
- `query`: The input topic/question
- `expected_tools`: List of tools that should be called (supports single or multiple)
- `reason`: Why these tools are expected
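
The schema above can be checked before any routing run, so that a typo in a tool name does not show up later as a routing failure. The following is a minimal sketch, not part of `RoutingEvaluator`: `RoutingTestCase`, `KNOWN_TOOLS`, and `validate_test_case` are hypothetical names, and the tool set is assumed from the tools listed in the setup cell.

```python
from typing import TypedDict

# Assumption: the router can only emit these tool names
# (taken from the tool list printed in the setup cell).
KNOWN_TOOLS = {"rag_search", "web_search"}


class RoutingTestCase(TypedDict):
    query: str
    expected_tools: list[str]  # empty list means direct generation
    reason: str


def validate_test_case(case: dict) -> list[str]:
    """Return the problems found in one test case (empty list if valid)."""
    problems = []
    for key in ("query", "expected_tools", "reason"):
        if key not in case:
            problems.append(f"missing key: {key}")
    unknown = set(case.get("expected_tools", [])) - KNOWN_TOOLS
    if unknown:
        problems.append(f"unknown tools: {sorted(unknown)}")
    return problems


example: RoutingTestCase = {
    "query": "What services does LevelUp360 offer?",
    "expected_tools": ["rag_search"],
    "reason": "Company-specific service information",
}
print(validate_test_case(example))  # []
```

Running `validate_test_case` over every entry in `routing_test_cases` and asserting that each result is empty would be one way to use it.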

---

In [None]:
# Define Test Cases 
routing_test_cases = [
    # Company-specific queries - should use rag_search
    {
        "query": "Create a post about our AI governance pipeline",
        "expected_tools": ["rag_search"],  
        "reason": "Company-specific implementation details"
    },
    {
        "query": "Explain how our agentic marketing team works",
        "expected_tools": ["rag_search"],  
        "reason": "Specific technical implementation from our docs"
    },
    {
        "query": "What services does LevelUp360 offer?",
        "expected_tools": ["rag_search"],  
        "reason": "Company-specific service information"
    },
    {
        "query": "Describe our Azure cloud architecture approach",
        "expected_tools": ["rag_search"],  
        "reason": "Our specific architecture patterns"
    },
    {
        "query": "What certifications and expertise do we have?",
        "expected_tools": ["rag_search"],  
        "reason": "Company-specific credentials"
    },
    {
        "query": "How do we implement secure AI infrastructure?",
        "expected_tools": ["rag_search"],  
        "reason": "Our specific security implementation"
    },
    {
        "query": "What is our approach to AI evaluation and governance?",
        "expected_tools": ["rag_search"],  
        "reason": "Company-specific methodology"
    },
    {
        "query": "Explain our content generation workflow",
        "expected_tools": ["rag_search"],  
        "reason": "Our specific workflow implementation"
    },
    
    # General knowledge - should use direct_generation (no tools)
    {
        "query": "Write about the importance of AI governance in general",
        "expected_tools": [],
        "reason": "General AI governance concepts, no company context needed"
    },
    {
        "query": "Discuss general cloud security best practices",
        "expected_tools": [],
        "reason": "General industry knowledge"
    },
    {
        "query": "Explain what agentic AI systems are",
        "expected_tools": [],
        "reason": "General AI concept definition"
    },
    {
        "query": "Write about the benefits of infrastructure as code",
        "expected_tools": [],
        "reason": "General DevOps concept"
    },
    {
        "query": "Discuss the role of AI in business transformation",
        "expected_tools": [],
        "reason": "General business/AI topic"
    },
    {
        "query": "Explain the concept of retrieval-augmented generation",
        "expected_tools": [],
        "reason": "General AI architecture concept"
    },
    
    # Current/industry information - should use web_search
    {
        "query": "What are the latest trends in AI implementation for 2025?",
        "expected_tools": ["web_search"],
        "reason": "Current industry trends, needs recent information"
    },
    {
        "query": "What are current cloud security compliance requirements?",
        "expected_tools": ["web_search"],
        "reason": "Current regulatory information"
    },
    {
        "query": "What are the latest Azure AI services announcements?",
        "expected_tools": ["web_search"],
        "reason": "Current product updates"
    },
    {
        "query": "What are companies saying about AI ROI in 2025?",
        "expected_tools": ["web_search"],
        "reason": "Current market sentiment and data"
    },
    
    # Edge cases - mixed signals
    {
        "query": "Compare our Azure governance approach to current industry standards",
        "expected_tools": ["rag_search", "web_search"],
        "reason": "Needs our approach (RAG) + industry standards (web search)"
    },
    {
        "query": "How does our agentic system compare to general AI agent frameworks?",
        "expected_tools": ["rag_search", "web_search"],
        "reason": "Needs our system (RAG) + general frameworks (web search)"
    },
]

# Count by routing category (an empty expected_tools list means direct generation)
single_rag = sum(1 for tc in routing_test_cases if tc['expected_tools'] == ['rag_search'])
single_direct = sum(1 for tc in routing_test_cases if tc['expected_tools'] == [])
single_web = sum(1 for tc in routing_test_cases if tc['expected_tools'] == ['web_search'])
multi_tool = sum(1 for tc in routing_test_cases if len(tc['expected_tools']) > 1)

rprint(f"  Total test cases: {len(routing_test_cases)}")
rprint(f"  Single tool - rag_search: {single_rag}")
rprint(f"  No tools - direct_generation: {single_direct}")
rprint(f"  Single tool - web_search: {single_web}")
rprint(f"  Multi-tool: {multi_tool}")

---

## Test 1: Routing Accuracy

Testing if the content planning agent chooses the correct tool(s) for each query type.

**Metric:** Percentage of test cases where `actual_tools == expected_tools` (order-independent)

**Target:** >90% accuracy
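
The order-independent comparison behind this metric can be sketched as follows. This is an illustration only, not the actual `RoutingEvaluator` code: `routing_matches` and `routing_accuracy` are hypothetical helpers, and comparing as sets is an assumption that also ignores duplicate tool calls.

```python
def routing_matches(actual_tools, expected_tools):
    """Order-independent comparison of two tool lists (duplicates ignored)."""
    return set(actual_tools) == set(expected_tools)


def routing_accuracy(pairs):
    """Fraction of (actual, expected) pairs whose tool sets match."""
    if not pairs:
        return 0.0
    return sum(routing_matches(a, e) for a, e in pairs) / len(pairs)


pairs = [
    (["rag_search"], ["rag_search"]),                              # match
    (["web_search", "rag_search"], ["rag_search", "web_search"]),  # match, order differs
    ([], []),                                                      # match, direct generation
    (["web_search"], []),                                          # mismatch, unneeded tool call
]
print(routing_accuracy(pairs))  # 0.75
```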

---

In [None]:
# Run Routing Accuracy Tests
rprint("Running routing accuracy tests...")
rprint("=" * 60)

accuracy_results = evaluator.test_routing_accuracy(routing_test_cases)

# Calculate overall accuracy
accuracy = accuracy_results['correct'].mean()

rprint(f"\n{'=' * 60}")
rprint(f"ROUTING ACCURACY: {accuracy * 100:.1f}%")
rprint(f"{'=' * 60}")
rprint(f"Correct: {accuracy_results['correct'].sum()}/{len(accuracy_results)}")

In [None]:
# Accuracy Breakdown by Tool Type
rprint("\nAccuracy breakdown:")
rprint("-" * 60)

# Helper to categorize test cases
def categorize_case(expected_tools):
    if len(expected_tools) == 0:
        # Empty list means direct generation (no tool call expected)
        return "Direct generation"
    if len(expected_tools) == 1:
        return f"Single: {expected_tools[0]}"
    return "Multi-tool"

accuracy_results['category'] = accuracy_results['expected_tools'].apply(categorize_case)

by_category = accuracy_results.groupby('category').agg({
    'correct': ['sum', 'count', 'mean']
}).round(3)

by_category.columns = ['Correct', 'Total', 'Accuracy']
by_category['Accuracy %'] = (by_category['Accuracy'] * 100).round(1)

rprint(by_category[['Correct', 'Total', 'Accuracy %']])

In [None]:
# Show Routing Failures
failures = accuracy_results[~accuracy_results['correct']]

if len(failures) > 0:
    rprint(f"\n{'=' * 60}")
    rprint(f"✗ ROUTING FAILURES: {len(failures)}")
    rprint(f"{'=' * 60}\n")
    
    for _, row in failures.iterrows():
        rprint(f"Query: {row['query']}")
        rprint(f"  Expected: {row['expected_tools']}")
        rprint(f"  Actual: {row['actual_tools']}")
        rprint(f"  content planning agent reasoning: {row['reasoning']}")
        rprint(f"  Why expected: {row['reason_for_expected']}")
        rprint()
else:
    rprint(f"\n{'=' * 60}")
    rprint("✓ NO ROUTING FAILURES")
    rprint(f"{'=' * 60}")

---

## Test 2: Routing Consistency

Testing if the content planning agent makes the same routing decision across multiple runs.

**Metric:** Percentage of test cases where all 5 runs produce the same routing decision

**Target:** >95% consistency
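
A minimal sketch of how this metric can be computed from the per-run decisions. The helper names `is_consistent` and `routing_consistency` are hypothetical, and treating two runs as equal when their tool sets match regardless of order is an assumption carried over from the accuracy metric, not something confirmed by `RoutingEvaluator`.

```python
def is_consistent(decisions):
    """True when every run produced the same set of tools."""
    unique_patterns = {frozenset(tools) for tools in decisions}
    return len(unique_patterns) == 1


def routing_consistency(decisions_per_case):
    """Fraction of test cases whose runs all agree."""
    if not decisions_per_case:
        return 0.0
    consistent = sum(is_consistent(d) for d in decisions_per_case)
    return consistent / len(decisions_per_case)


runs = [
    # Case 1: all five runs agree
    [["rag_search"]] * 5,
    # Case 2: the third run drops web_search, so the case is inconsistent
    [
        ["rag_search", "web_search"],
        ["web_search", "rag_search"],
        ["rag_search"],
        ["rag_search", "web_search"],
        ["rag_search", "web_search"],
    ],
]
print(routing_consistency(runs))  # 0.5
```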

---

In [None]:
# Run Routing Consistency Tests
rprint("Running routing consistency tests (5 runs per case)...")
rprint("=" * 60)

consistency_results = evaluator.test_routing_consistency(
    routing_test_cases,
    num_runs=5
)

# Calculate overall consistency
consistency = consistency_results['consistent'].mean()

rprint(f"\n{'=' * 60}")
rprint(f"ROUTING CONSISTENCY: {consistency * 100:.1f}%")
rprint(f"{'=' * 60}")
rprint(f"Consistent: {consistency_results['consistent'].sum()}/{len(consistency_results)}")

In [None]:
# Show Inconsistent Routing Cases
inconsistent = consistency_results[~consistency_results['consistent']]

if len(inconsistent) > 0:
    rprint(f"\n{'=' * 60}")
    rprint(f"✗ INCONSISTENT ROUTING: {len(inconsistent)}")
    rprint(f"{'=' * 60}\n")
    
    for _, row in inconsistent.iterrows():
        rprint(f"Query: {row['query']}")
        rprint(f"  Expected: {row['expected_tools']}")
        rprint(f"  Decisions across {len(row['decisions'])} runs:")
        for i, decision in enumerate(row['decisions'], 1):
            rprint(f"    Run {i}: {decision}")
        rprint(f"  Unique decision patterns: {row['variance']}")
        rprint()
else:
    rprint(f"\n{'=' * 60}")
    rprint("✓ ALL ROUTING DECISIONS CONSISTENT")
    rprint(f"{'=' * 60}")

---

## Analysis and Decision

Analyzing results against pass criteria:
- **Routing accuracy:** >90%
- **Routing consistency:** >95%
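
Because both criteria are strict inequalities, it helps to work out what they mean for the 20 test cases defined above. The sketch below assumes strict comparison, which matches the `<=` checks in the final decision cell; whether `evaluator.analyze_results` compares strictly is not shown here, and `routing_passes` and `min_cases_needed` are hypothetical helpers.

```python
def routing_passes(accuracy, consistency,
                   accuracy_threshold=0.90, consistency_threshold=0.95):
    """Strict comparison: a score equal to the threshold does not pass."""
    return accuracy > accuracy_threshold and consistency > consistency_threshold


def min_cases_needed(total_cases, threshold):
    """Smallest number of passing cases whose ratio strictly exceeds the threshold."""
    return next(n for n in range(total_cases + 1) if n / total_cases > threshold)


print(min_cases_needed(20, 0.90))        # 19
print(min_cases_needed(20, 0.95))        # 20
print(routing_passes(18 / 20, 20 / 20))  # False
```

Under this reading, at most one of the 20 cases may be routed incorrectly, and every case must be consistent across its runs, since 19/20 is exactly 95% and does not exceed the threshold.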

---

In [None]:
# Analysis
analysis = evaluator.analyze_results(
    accuracy_results,
    consistency_results,
    accuracy_threshold=0.90,
    consistency_threshold=0.95
)

rprint("=" * 60)
rprint("ANALYSIS")
rprint("=" * 60)
rprint(f"\nAccuracy: {analysis['accuracy'] * 100:.1f}% (threshold: {analysis['thresholds']['accuracy'] * 100:.0f}%)")
rprint(f"Consistency: {analysis['consistency'] * 100:.1f}% (threshold: {analysis['thresholds']['consistency'] * 100:.0f}%)")

rprint(f"\nFailure summary:")
rprint(f"  Routing failures: {len(analysis['failures'])}")
rprint(f"  Inconsistent cases: {len(analysis['inconsistent_cases'])}")

if len(analysis['failures']) > 0:
    rprint(f"\nMost common failure patterns:")
    failure_df = pd.DataFrame(analysis['failures'])
    # Show which expected tools had most failures
    if 'expected_tools' in failure_df.columns:
        rprint(failure_df['expected_tools'].value_counts().head())

In [None]:
# Final Decision
rprint("\n" + "=" * 60)
rprint("FINAL ASSESSMENT")
rprint("=" * 60)

if analysis['passes']:
    rprint("\n✓ PASS: content planning agent routing is reliable")
    rprint(f"\n  Accuracy: {analysis['accuracy'] * 100:.1f}% ✓")
    rprint(f"  Consistency: {analysis['consistency'] * 100:.1f}% ✓")
    rprint("\nNext steps:")
    rprint("  → Proceed to building full LangGraph workflow")
    rprint("  → Test end-to-end content quality vs Week 3 baseline")
    rprint("  → Validate multi-tool coordination in full workflow")
else:
    rprint("\n✗ FAIL: content planning agent routing needs tuning")
    rprint("\nIssues identified:")
    
    if analysis['accuracy'] <= 0.90:
        rprint(f"  → Accuracy too low: {analysis['accuracy'] * 100:.1f}% (target: >90%)")
        rprint("     Action: Review failed cases and adjust content planning agent prompt")
        
        # Show specific failure patterns
        if len(analysis['failures']) > 0:
            rprint("\n     Common failure types:")
            failure_df = pd.DataFrame(analysis['failures'])
            for _, failure in failure_df.head(3).iterrows():
                rprint(f"       - Expected {failure['expected_tools']}, got {failure['actual_tools']}")
    
    if analysis['consistency'] <= 0.95:
        rprint(f"  → Consistency too low: {analysis['consistency'] * 100:.1f}% (target: >95%)")
        rprint("     Action: Consider lowering temperature or adjusting routing logic")
    
    rprint("\nNext steps:")
    rprint("  → Fix identified issues")
    rprint("  → Re-run routing tests")
    rprint("  → Do not proceed to full workflow until routing passes")

---

## Save Results

Saving test results for documentation and future comparison.

---

In [None]:
# Save Results to CSV
import os
from datetime import datetime

# Create output directory if it doesn't exist
os.makedirs("data/week_04", exist_ok=True)

# Save accuracy results
accuracy_path = "data/week_04/routing_accuracy.csv"
accuracy_results.to_csv(accuracy_path, index=False)
rprint(f"✓ Saved accuracy results: {accuracy_path}")

# Save consistency results
consistency_path = "data/week_04/routing_consistency.csv"
consistency_results.to_csv(consistency_path, index=False)
rprint(f"✓ Saved consistency results: {consistency_path}")

# Save summary
summary_df = pd.DataFrame([{
    "timestamp": datetime.now().isoformat(),
    "brand": brand,
    "accuracy": analysis['accuracy'],
    "consistency": analysis['consistency'],
    "passes": analysis['passes'],
    "failures": len(analysis['failures']),
    "inconsistent_cases": len(analysis['inconsistent_cases']),
    "total_test_cases": len(routing_test_cases),
    # Direct-generation cases expect no tools, so they are counted separately
    "direct_generation_cases": len([tc for tc in routing_test_cases if len(tc['expected_tools']) == 0]),
    "single_tool_cases": len([tc for tc in routing_test_cases if len(tc['expected_tools']) == 1]),
    "multi_tool_cases": len([tc for tc in routing_test_cases if len(tc['expected_tools']) > 1])
}])

summary_path = "data/week_04/routing_summary.csv"
summary_df.to_csv(summary_path, index=False)
rprint(f"✓ Saved summary: {summary_path}")

rprint(f"\nAll results saved to data/week_04/")

---

## Test Case Details

Full details of all test cases for reference and debugging.

---

In [None]:
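# Display Full Results Table
from IPython.display import display  # explicit import so the cell also runs outside a notebook kernel
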
rprint("Full Accuracy Results:")
rprint("=" * 60)
display(accuracy_results[['query', 'expected_tools', 'actual_tools', 'correct', 'category']])

rprint("\nFull Consistency Results:")
rprint("=" * 60)
display(consistency_results[['query', 'expected_tools', 'actual_decisions', 'consistent', 'unique_decisions', 'mode_decision']])

---

## Multi-Tool Analysis

Specific analysis of multi-tool routing cases (queries requiring multiple tools).
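
For a failed multi-tool case, it is often more useful to know which tool was dropped or added than to see only a mismatch. A minimal sketch of that breakdown, using a hypothetical helper `diff_tools` that is not part of `RoutingEvaluator`:

```python
def diff_tools(expected_tools, actual_tools):
    """Split a routing mismatch into tools that were missed and tools that were not needed."""
    expected, actual = set(expected_tools), set(actual_tools)
    return {
        "missing": sorted(expected - actual),  # expected but not called
        "extra": sorted(actual - expected),    # called but not expected
    }


# The agent retrieved company documents but skipped the web search
print(diff_tools(["rag_search", "web_search"], ["rag_search"]))
# {'missing': ['web_search'], 'extra': []}
```

Applied to each failing row of `accuracy_results`, this would show whether multi-tool failures come mostly from a skipped tool or from an unnecessary one.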

---

In [None]:
# Multi-Tool Specific Analysis
multi_tool_cases = accuracy_results[accuracy_results['category'] == 'Multi-tool']

if len(multi_tool_cases) > 0:
    rprint("=" * 60)
    rprint("MULTI-TOOL ROUTING ANALYSIS")
    rprint("=" * 60)
    
    multi_accuracy = multi_tool_cases['correct'].mean()
    rprint(f"\nMulti-tool accuracy: {multi_accuracy * 100:.1f}%")
    rprint(f"Cases: {multi_tool_cases['correct'].sum()}/{len(multi_tool_cases)} correct")
    
    rprint("\nMulti-tool test cases:")
    for _, row in multi_tool_cases.iterrows():
        status = "✓" if row['correct'] else "✗"
        rprint(f"\n{status} {row['query'][:60]}...")
        rprint(f"  Expected: {row['expected_tools']}")
        rprint(f"  Actual: {row['actual_tools']}")
        if not row['correct']:
            rprint(f"  Issue: {row['reasoning']}")
else:
    rprint("No multi-tool test cases defined")

---

## Notes and Observations

**Key Findings:**
- [Add observations after running tests]

**Multi-Tool Coordination:**
- [Note how well content planning agent handles queries needing multiple tools]

**Potential Issues:**
- [Note any patterns in failures, especially multi-tool cases]

**Recommendations:**
- [Suggestions for improvement if tests fail]

---

**Test completed:** [Date/Time auto-filled when run]

**Next steps:** [Determined by pass/fail results]

---