# 🔌 MAS Evaluation Framework - Real ADK Agents Demo

This notebook demonstrates evaluating **REAL Google ADK agents** (not simulated spans).

## MAS Architecture (Flat - No Orchestrator)
```
   ┌─────────────┐ ┌─────────────┐
   │ Academic    │ │ Industry    │
   │ Researcher  │ │ Researcher  │
   └──────┬──────┘ └──────┬──────┘
          │               │
          └───────┬───────┘
                  ▼
           ┌─────────────┐
           │   Writer    │
           └──────┬──────┘
                  ▼
           ┌─────────────┐
           │   Critic    │
           └─────────────┘
```

## Metrics Used
| Metric | Source |
|--------|--------|
| IDS (Information Diversity) | GEMMAS Paper |
| UPR (Unnecessary Path Ratio) | GEMMAS Paper |
| TRS (Thought Relevance) | Custom (no paper) |
| MAST 14 Failure Modes | MAST Paper (using MAD dataset for ICL) |

---
## 📦 Step 1: Installation

In [None]:
# Install all required packages
!pip install -q google-adk google-generativeai opentelemetry-sdk networkx sentence-transformers matplotlib scikit-learn

In [None]:
# Set up API key
import os
from google.colab import userdata

try:
    os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Manual fallback
    # os.environ["GOOGLE_API_KEY"] = "your-api-key-here"
    print("⚠️ Set GOOGLE_API_KEY in Colab secrets or manually")

In [None]:
# Upload mas_eval folder to Colab
# Option 1: Upload zip and extract
# Option 2: Clone from GitHub
# Option 3: Manual upload

import sys
sys.path.insert(0, '.')

# Verify import
try:
    from mas_eval import MASEvaluator
    from mas_eval.adapters import ADKAdapter, ADKTracingCallback
    from mas_eval.metrics import ThoughtRelevanceMetric, GEMMAS_Evaluator
    from mas_eval.mast import MASTClassifier, MASTTaxonomy
    from mas_eval.graph import CRGModule
    print("✅ mas_eval framework loaded")
except ImportError as e:
    print(f"❌ Upload the mas_eval folder first: {e}")

---
## 🤖 Step 2: Create Real ADK Agents (Flat Architecture)

**Architecture (No Orchestrator):**
- 2 Researchers (parallel) - Academic & Industry
- 1 Writer (sequential)
- 1 Critic (quality control)

There is no dedicated orchestrator agent. The agents share one session context, and hand-offs happen through ADK `sub_agents` delegation, with the Critic acting as the root agent.
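
As a rough picture of how the four roles relate, the sketch below models them with a hypothetical `Node` dataclass (a stand-in, not an ADK class) and walks the delegation tree. The Critic → Writer → Researchers links are taken from the `sub_agents` arguments used in this notebook's agent definitions; everything else is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an agent; not part of google-adk."""
    name: str
    sub_agents: list["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> list[tuple[int, str]]:
    """Return (depth, name) pairs in depth-first order."""
    rows = [(depth, node.name)]
    for child in node.sub_agents:
        rows.extend(walk(child, depth + 1))
    return rows


# Mirror the delegation links used in this notebook.
academic_node = Node("Academic_Researcher")
industry_node = Node("Industry_Researcher")
writer_node = Node("Writer", [academic_node, industry_node])
critic_node = Node("Critic", [writer_node])

for depth, name in walk(critic_node):
    print("  " * depth + name)
```

Printing the tree this way makes it easy to confirm which agent is the root before running the real agents.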

In [None]:
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.adk.sessions import InMemorySessionService
from google.adk.artifacts import InMemoryArtifactService

MODEL = "gemini-2.0-flash"

# Tool functions for agents
def search_academic(query: str, max_results: int = 3) -> dict:
    """Search academic papers and scholarly sources."""
    return {
        "status": "success",
        "source": "academic",
        "results": [
            {"title": f"Academic Paper on {query}", "journal": "Nature, 2024", "finding": f"Key research finding about {query}"},
            {"title": f"Meta-analysis of {query}", "journal": "Science, 2024", "finding": f"Comprehensive review of {query}"}
        ][:max_results]
    }

def search_industry(query: str, max_results: int = 3) -> dict:
    """Search industry reports and market analysis."""
    return {
        "status": "success",
        "source": "industry",
        "results": [
            {"company": "Google", "insight": f"Major investment in {query}"},
            {"company": "Microsoft", "insight": f"Launching {query} products for enterprise"}
        ][:max_results]
    }

def write_section(title: str, content: str) -> dict:
    """Write a section of the report."""
    return {"status": "success", "section": {"title": title, "content": content}}

def critique_report(report: str) -> dict:
    """Critique a report and suggest improvements."""
    return {
        "status": "success",
        "critique": "The report covers key points but could use more specific data.",
        "score": 7,
        "needs_revision": False
    }

print("✅ Tool functions defined")

In [None]:
# Create the agents (Flat Architecture - No Orchestrator)

# 1. Academic Researcher
academic_researcher = Agent(
    name="Academic_Researcher",
    model=MODEL,
    description="Expert at finding peer-reviewed academic papers and scholarly research.",
    instruction="""You are an academic researcher. Your role:
1. Search for peer-reviewed papers using the search_academic tool
2. Analyze research methodology and cite sources
3. Focus on scientific rigor and recent publications (last 5 years)
4. Summarize findings with proper citations

Always provide evidence-based insights from scholarly sources.""",
    tools=[search_academic]
)

# 2. Industry Researcher
industry_researcher = Agent(
    name="Industry_Researcher",
    model=MODEL,
    description="Expert at researching industry trends, companies, and market analysis.",
    instruction="""You are an industry analyst. Your role:
1. Search for industry insights using the search_industry tool
2. Analyze market trends and company initiatives
3. Identify commercial applications and growth projections
4. Provide actionable business insights

Focus on real-world applications and market data.""",
    tools=[search_industry]
)

# 3. Writer (receives research from both researchers)
writer = Agent(
    name="Writer",
    model=MODEL,
    description="Expert at synthesizing research into clear, structured reports.",
    instruction="""You are a professional research writer. Your role:
1. Synthesize findings from Academic_Researcher and Industry_Researcher
2. Structure content with clear sections using write_section tool
3. Balance academic rigor with accessibility
4. Highlight key insights and implications

Create reports that are informative and well-organized.""",
    tools=[write_section],
    sub_agents=[academic_researcher, industry_researcher]  # Can delegate to researchers
)

# 4. Critic (reviews writer's output)
critic = Agent(
    name="Critic",
    model=MODEL,
    description="Quality control expert that reviews and critiques reports.",
    instruction="""You are a quality critic. Your role:
1. Review the report produced by the Writer
2. Use critique_report tool to evaluate quality
3. If score < 7, request revisions from Writer
4. If score >= 7, approve and pass to final output

Be constructive but thorough in your critique.""",
    tools=[critique_report],
    sub_agents=[writer]  # Can delegate back to writer for revisions
)

# The root agent is Critic (coordinates writer and researchers)
root_agent = critic

print("✅ 4 agents created (flat architecture):")
print("   - Academic_Researcher")
print("   - Industry_Researcher")
print("   - Writer (coordinates researchers)")
print("   - Critic (root - quality control)")

---
## 🔍 Step 3: Run & Capture Traces with ADKAdapter

The ADKAdapter automatically captures:
- **Thoughts**: Internal reasoning, planning, analysis
- **Actions**: Tool calls, agent transfers
- **Observations**: Tool results
- **Outputs**: Final responses

These traces are used to build the Graph of Thoughts (GOT) for evaluation.
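
To make the four span categories concrete, here is a minimal sketch of what a captured span could look like. The `StepType` enum and `Span` dataclass below are hypothetical stand-ins: only the attribute names `agent_name`, `step_type.value`, and `content` come from the cells in this notebook, and the `"observation"` and `"output"` values are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class StepType(Enum):
    """Assumed step categories; the real mas_eval enum may differ."""
    THOUGHT = "thought"
    ACTION = "action"
    OBSERVATION = "observation"
    OUTPUT = "output"


@dataclass
class Span:
    """Hypothetical span record with the attributes this notebook reads."""
    agent_name: str
    step_type: StepType
    content: str


demo_spans = [
    Span("Writer", StepType.THOUGHT, "Plan the report outline"),
    Span("Writer", StepType.ACTION, "call write_section"),
    Span("Writer", StepType.OBSERVATION, "status: success"),
    Span("Critic", StepType.THOUGHT, "Check coverage of both sources"),
    Span("Critic", StepType.OUTPUT, "Report approved"),
]

# Count spans per category, as the exploration cell does for real traces.
counts = Counter(span.step_type.value for span in demo_spans)
print(dict(counts))
```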

In [None]:
from mas_eval.adapters import ADKAdapter
import asyncio

# Wrap root agent with ADKAdapter for tracing
adapter = ADKAdapter(root_agent, service_name="research-mas")

# Define the research query
RESEARCH_QUERY = "Research the current state of AI in Healthcare, including academic advances and commercial applications"

print(f"🔍 Research Query: {RESEARCH_QUERY}")
print("\n🚀 Running MAS with tracing...")
print("   (Thought tracing enabled for GOT construction)")

In [None]:
# Run the MAS and capture traces
async def run_mas():
    response, spans = await adapter.run_with_tracing(
        user_message=RESEARCH_QUERY,
        user_id="demo_user"
    )
    return response, spans

# Execute
response, captured_spans = await run_mas()

print(f"\n✅ Execution complete!")
print(f"📊 Captured {len(captured_spans)} spans")
print(f"\n📝 Final Response (preview):")
print(response[:500] + "..." if len(response) > 500 else response)

In [None]:
# Explore captured spans (including thoughts for GOT)
print("📊 Captured Spans (Thoughts & Actions):")
print("=" * 60)

# Count by type
type_counts = {}
for span in captured_spans:
    t = span.step_type.value
    type_counts[t] = type_counts.get(t, 0) + 1

print("\n📈 Span Types:")
for t, count in sorted(type_counts.items()):
    emoji = "💭" if t == "thought" else "⚡" if t == "action" else "🔧" if "tool" in t else "📤"
    print(f"   {emoji} {t}: {count}")

print("\n📊 Sample Spans:")
for i, span in enumerate(captured_spans[:15]):  # Show first 15
    emoji = "💭" if span.step_type.value == "thought" else "⚡"
    print(f"\n[{i+1}] {emoji} {span.agent_name} - {span.step_type.value}")
    print(f"    Content: {span.content[:100]}..." if len(span.content) > 100 else f"    Content: {span.content}")

# Show agent distribution
agent_counts = {}
for span in captured_spans:
    agent_counts[span.agent_name] = agent_counts.get(span.agent_name, 0) + 1

print(f"\n\n📈 Spans per Agent:")
for agent, count in sorted(agent_counts.items()):
    print(f"   {agent}: {count}")

---
## 📊 Step 4: Build Causal Reasoning Graph (CRG) / Graph of Thoughts (GOT)

The CRG represents the flow of reasoning through the MAS:
- **Nodes**: Agent thoughts, actions, and outputs
- **Edges**: Causal and temporal relationships

This is also known as the Graph of Thoughts (GOT) when focused on reasoning patterns.
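
The node and edge structure can be sketched without any graph library. This is not how `CRGModule` builds its graph (its internals are not shown here). The tuple layout of the spans and the rule that an action "causes" the observation directly following it for the same agent are both assumptions made for illustration.

```python
# Each span is (agent_name, step_type, content); this layout is an assumption.
demo_spans = [
    ("Academic_Researcher", "thought", "Look for recent papers"),
    ("Academic_Researcher", "action", "search_academic('AI in Healthcare')"),
    ("Academic_Researcher", "observation", "2 results"),
    ("Writer", "thought", "Combine both research summaries"),
    ("Writer", "output", "Draft report"),
]


def build_graph(spans):
    """Return (nodes, edges); edges are (src, dst, kind) triples."""
    nodes = {
        i: {"agent": agent, "type": step, "content": text}
        for i, (agent, step, text) in enumerate(spans)
    }
    edges = []
    for i in range(len(spans) - 1):
        # Temporal edge: every span points to the one recorded after it.
        edges.append((i, i + 1, "temporal"))
        # Assumed causal rule: an action causes the next observation
        # when both belong to the same agent.
        same_agent = spans[i][0] == spans[i + 1][0]
        if same_agent and spans[i][1] == "action" and spans[i + 1][1] == "observation":
            edges.append((i, i + 1, "causal"))
    return nodes, edges


demo_nodes, demo_edges = build_graph(demo_spans)
thought_ids = [i for i, node in demo_nodes.items() if node["type"] == "thought"]
print(len(demo_nodes), "nodes,", len(demo_edges), "edges, thoughts at", thought_ids)
```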

In [None]:
from mas_eval.graph import CRGModule, GraphVisualizer
import matplotlib.pyplot as plt

# Build CRG/GOT from captured spans
crg = CRGModule()
graph = crg.build(captured_spans)

# Get statistics
stats = crg.get_statistics()
print("📊 Graph of Thoughts (GOT) Statistics:")
for key, value in stats.items():
    print(f"   {key}: {value}")

# Show thought node count
thought_nodes = crg.get_nodes_by_type("thought")
print(f"\n💭 Thought nodes in GOT: {len(thought_nodes)}")

In [None]:
# Visualize the CRG/GOT
viz = GraphVisualizer(graph)
plt.figure(figsize=(16, 12))
viz.plot(figsize=(16, 12), color_by="agent", title="MAS Graph of Thoughts (GOT) - Real ADK Agents")
plt.tight_layout()
plt.show()

---
## 📈 Step 5: Calculate GEMMAS Metrics (IDS/UPR)

**Source: GEMMAS Paper**
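
For intuition only, the sketch below computes two simplified proxies: a diversity score (mean pairwise Jaccard distance between messages) and an unnecessary-path ratio (share of source-to-sink paths that never reach the final node). These are **not** the GEMMAS formulas and not what `GEMMAS_Evaluator` computes. The function names, the toy messages, and the toy graph are all hypothetical, and the graph is assumed to be acyclic.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def diversity(messages: list[str]) -> float:
    """Mean pairwise distance; a crude proxy for information diversity."""
    pairs = list(combinations(messages, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def all_paths(adj, node, path=()):
    """Yield every path from node to a sink (assumes an acyclic graph)."""
    path = path + (node,)
    children = adj.get(node, [])
    if not children:
        yield path
    for child in children:
        yield from all_paths(adj, child, path)


def unnecessary_path_ratio(adj, sources, final_node) -> float:
    """Share of source-to-sink paths that do not end at final_node."""
    paths = [p for s in sources for p in all_paths(adj, s)]
    wasted = [p for p in paths if p[-1] != final_node]
    return len(wasted) / len(paths) if paths else 0.0


toy_messages = [
    "deep learning improves radiology",
    "hospitals adopt deep learning",
    "market growth in telehealth",
]
toy_adj = {
    "academic": ["writer"],
    "industry": ["writer", "dead_end"],
    "writer": ["critic"],
}

ids_proxy = diversity(toy_messages)
upr_proxy = unnecessary_path_ratio(toy_adj, ["academic", "industry"], "critic")
print(f"diversity proxy: {ids_proxy:.3f}, unnecessary-path proxy: {upr_proxy:.3f}")
```

Higher diversity and a lower unnecessary-path ratio are the directions one would generally hope for, but consult the GEMMAS paper for the actual definitions.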

In [None]:
from mas_eval.metrics import GEMMAS_Evaluator

gemmas = GEMMAS_Evaluator()
metrics = gemmas.evaluate(graph, captured_spans)

print(gemmas.summary(metrics))

---
## 🧠 Step 6: Calculate Thought Relevance Score (TRS)

**⚠️ Note: TRS is a CUSTOM metric (no research paper)**

This metric evaluates how relevant each agent's thoughts are to the task.
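
One simple way to picture a relevance score is a bag-of-words cosine similarity between each thought and the task, averaged per agent. The sketch below does exactly that with the standard library. It is an assumption-laden illustration, not the implementation of `ThoughtRelevanceMetric` (which may well use sentence embeddings instead), and the helper names and sample thoughts are made up.

```python
import math
from collections import Counter, defaultdict


def cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def relevance_by_agent(thoughts, task):
    """Average the thought-to-task similarity for each agent."""
    scores = defaultdict(list)
    for agent, text in thoughts:
        scores[agent].append(cosine(text, task))
    return {agent: sum(vals) / len(vals) for agent, vals in scores.items()}


toy_task = "research ai in healthcare"
toy_thoughts = [
    ("Academic_Researcher", "find papers on ai in healthcare"),
    ("Writer", "pick a nice font"),
]

toy_scores = relevance_by_agent(toy_thoughts, toy_task)
for agent, score in toy_scores.items():
    print(f"{agent}: {score:.2f}")
```

Word overlap misses paraphrases, which is why an embedding-based similarity is usually preferable for real traces.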

In [None]:
from mas_eval.metrics import ThoughtRelevanceMetric

trs_metric = ThoughtRelevanceMetric()
trs_result = trs_metric.calculate(
    graph=graph,
    spans=captured_spans,
    task_description=RESEARCH_QUERY  # Use the original query as reference
)

print("🧠 Thought Relevance Score (TRS)")
print("⚠️  Note: This is a CUSTOM metric, not from a published paper")
print("=" * 50)
print(f"\nOverall TRS: {trs_result['overall_score']:.4f}")
print(f"Interpretation: {trs_result['interpretation']}")
print(f"Total thoughts analyzed: {trs_result['thought_count']}")

print("\n📊 Per-Agent Breakdown:")
for agent, data in trs_result.get('agent_scores', {}).items():
    score = data['score']
    emoji = "🟢" if score >= 0.7 else "🟡" if score >= 0.5 else "🔴"
    print(f"   {emoji} {agent}: {score:.2f} ({data['thought_count']} thoughts)")

In [None]:
# Visualize TRS by agent
if trs_result.get('agent_scores'):
    agents = list(trs_result['agent_scores'].keys())
    scores = [trs_result['agent_scores'][a]['score'] for a in agents]
    colors = ['#4CAF50' if s >= 0.7 else '#FF9800' if s >= 0.5 else '#F44336' for s in scores]

    plt.figure(figsize=(10, 6))
    bars = plt.barh(agents, scores, color=colors)
    plt.xlabel('Thought Relevance Score')
    plt.title('🧠 TRS by Agent (Custom Metric - No Paper)')
    plt.xlim(0, 1)
    
    for bar, score in zip(bars, scores):
        plt.text(score + 0.02, bar.get_y() + bar.get_height()/2, f'{score:.2f}', va='center')
    
    plt.tight_layout()
    plt.show()

---
## 🎯 Step 7: MAST Failure Classification (with MAD Dataset)

**Source: "Why Do Multi-Agent LLM Systems Fail?" Paper**

The classifier uses **In-Context Learning (ICL)** with examples from the MAD (Multi-Agent System Traces) dataset, the labeled dataset expected in the uploaded `mast_dataset/` folder.

In this notebook, `FINE_TUNED` mode refers to in-context learning: it loads real labeled examples from `MAD_human_labelled_dataset.json` and uses them as few-shot examples for classification.
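
In-context learning here simply means placing labeled examples in the prompt ahead of the trace to classify. The sketch below shows that idea in isolation. The `build_prompt` helper, the `trace`/`label` field names, and the two example traces are hypothetical and are not taken from the MAD dataset or from `MASTClassifier`.

```python
def build_prompt(examples, trace, k=2):
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = ["Classify the failure mode of the final trace."]
    for example in examples[:k]:
        lines.append(f"Trace: {example['trace']}")
        lines.append(f"Failure mode: {example['label']}")
    # The unlabeled trace goes last, leaving the label for the model to fill in.
    lines.append(f"Trace: {trace}")
    lines.append("Failure mode:")
    return "\n".join(lines)


# Made-up examples for illustration only.
demo_examples = [
    {"trace": "Agent repeats the same search three times.", "label": "Step repetition"},
    {"trace": "Agent stops before the report is written.", "label": "Premature termination"},
]

demo_prompt = build_prompt(demo_examples, "Writer ignores the critic's feedback.")
print(demo_prompt)
```

Varying `k` trades prompt length against the number of demonstrations the model sees.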

In [None]:
from mas_eval.mast import MASTClassifier, MASTTaxonomy
from mas_eval.mast.classifier import ClassifierMode

# Show MAST taxonomy
taxonomy = MASTTaxonomy()
print(taxonomy.summary())

In [None]:
# Run MAST classification with real dataset examples
# Using FINE_TUNED mode = In-Context Learning with MAD dataset
MAST_DATASET_PATH = "./mast_dataset"  # Path to the MAST dataset folder

classifier = MASTClassifier(
    model="gemini-2.0-flash",
    mode=ClassifierMode.FINE_TUNED,  # Uses MAD dataset for in-context learning
    confidence_threshold=0.5,
    dataset_path=MAST_DATASET_PATH  # Pass the dataset path
)

print("🎯 Running MAST Classification...")
print(f"📂 Using dataset: {MAST_DATASET_PATH}")
print("   Mode: In-Context Learning with real MAD examples")
print()

mast_result = classifier.classify(captured_spans)
print(classifier.summary(mast_result))

---
## 📋 Step 8: Full Evaluation with MASEvaluator

In [None]:
from mas_eval import MASEvaluator

# Full evaluation
evaluator = MASEvaluator(
    enable_tracing=True,
    enable_crg=True,
    enable_gemmas=True,
    enable_mast=True
)

print("🔍 Running Full Evaluation...")
eval_result = evaluator.evaluate(captured_spans)

print("\n" + "=" * 60)
print("COMPLETE EVALUATION RESULTS")
print("=" * 60)
print(eval_result.summary())

In [None]:
# Generate HTML report
evaluator.to_html(eval_result, "mas_evaluation_report.html")
print("📄 HTML report saved to: mas_evaluation_report.html")

# Display in notebook
from IPython.display import HTML, display
with open("mas_evaluation_report.html", "r", encoding="utf-8") as f:
    display(HTML(f.read()))

---
## 💡 Step 9: Get Improvement Suggestions

In [None]:
from mas_eval.suggestions import MASAdvisor

advisor = MASAdvisor()
suggestions = advisor.generate_suggestions(eval_result, trs_result)

print("💡 IMPROVEMENT SUGGESTIONS")
print(advisor.format_suggestions(suggestions))