# Ground Truth Evaluation

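This notebook uses Strands Evals, an extensible LLM-based evaluation framework, to evaluate agent responses against ground truth (expected outputs). Unlike rubric-only evaluation, ground truth evaluation compares the actual output with a predefined reference answer.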

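**Use cases:**
- Regression testing: verify that agent changes do not break known-good responses
- Quality benchmarking: measure how closely the agent matches expected behavior
- Training data validation: validate agent outputs against curated examples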

**Two data sources (separate files):**
1. **Traces file** (`demo_traces.json`): contains the actual agent responses from AgentCore Observability
2. **Ground truth file** (`demo_ground_truth.json`): contains the expected output for each trace

The notebook merges these files on `trace_id` to compare actual against expected.

**Two modes:**
1. **Demo Mode**: loads sample data from JSON files (no AWS access required)
2. **Live Mode**: fetches real traces from AgentCore Observability; you supply your own ground truth file

**This notebook demonstrates two evaluators:**
- Output evaluation: compares the actual response with the expected ground truth
- Trajectory evaluation: compares the actual tool usage with the expected tools

Strands Evals supports custom evaluators for nearly any type of evaluation, so you can extend this pattern to any criterion that can be expressed as a scoring rubric.

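**Workflow:**
1. Load traces (from the demo file or live from AgentCore Observability)
2. Load ground truth expectations (expected output and trajectory)
3. Merge on `trace_id`
4. Run evaluators that compare actual against expected
5. Log results to AgentCore Observability (optional)
6. Analyze results and identify gaps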

## Where This Notebook Fits

This is **Notebook 3 (Option B)**: it evaluates a session against expected ground truth outputs.

![Notebook Workflow](images/notebook_workflow.svg)

## How Ground Truth Evaluation Works

A subject matter expert (SME) creates a ground truth file containing the expected outputs. It is merged with the actual traces on `trace_id`:

![Ground Truth Flow](images/ground_truth_flow.svg)

## Setup

Import the modules and configure logging.

In [None]:
import json
import logging
import sys
from datetime import datetime, timedelta, timezone
from typing import List, Dict, Any, Optional

sys.path.insert(0, ".")

from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator, TrajectoryEvaluator

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

## Configuration

Choose between Demo Mode (uses the sample files) and Live Mode (fetches from AgentCore Observability).

**Demo Mode** (`USE_DEMO_MODE = True`):
- Loads traces from `DEMO_TRACES_PATH`
- Loads ground truth from `DEMO_GROUND_TRUTH_PATH`
- No AWS credentials required

**Live Mode** (`USE_DEMO_MODE = False`):
- Fetches traces from AgentCore Observability using `SESSION_ID`
- Loads ground truth from `GROUND_TRUTH_PATH` (you must create this file yourself)
- Requires AWS credentials and `config.py` settings

**CloudWatch Logging** (`LOG_TO_CLOUDWATCH = True`):
- Sends evaluation results to the AgentCore Observability dashboard
- Requires `EVALUATION_CONFIG_ID` and the evaluator names

In [None]:
# =============================================================================
# Mode Selection
# =============================================================================
USE_DEMO_MODE = True

# =============================================================================
# Demo Mode Paths
# =============================================================================
DEMO_TRACES_PATH = "demo_traces.json"           # Actual agent responses
DEMO_GROUND_TRUTH_PATH = "demo_ground_truth.json"  # Your expected outputs

# =============================================================================
# Live Mode Settings
# =============================================================================
SESSION_ID = "your-session-id-here"              # Session to evaluate
GROUND_TRUTH_PATH = "my_ground_truth.json"       # Your ground truth file

# =============================================================================
# CloudWatch Logging
# =============================================================================
LOG_TO_CLOUDWATCH = True  # Imports config.py and sends results to AgentCore Observability; set to False to skip
OUTPUT_EVALUATOR_NAME = "Custom.GroundTruthOutput"
TRAJECTORY_EVALUATOR_NAME = "Custom.GroundTruthTrajectory"

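## File Formats

Ground truth evaluation uses **two separate files** that share the same `session_id`.

**Key concepts:**
- `session_id`: groups all traces from a single user session
- `trace_id`: identifies each individual interaction within the session

### 1. Traces File (actual agent responses)
Contains what the agent actually did, either fetched from CloudWatch or saved locally: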
```json
{
  "session_id": "5B467129-E54A-4F70-908D-CB31818004B5",
  "traces": [
    {
      "trace_id": "693cb6c4e931",
      "user_prompt": "What is the best route for a NZ road trip?",
      "actual_output": "Based on the search results, here are the best routes...",
      "actual_trajectory": ["web_search"]
    },
    {
      "trace_id": "693cb6fa87aa",
      "user_prompt": "Should I visit North or South Island?",
      "actual_output": "Here's how the islands compare...",
      "actual_trajectory": ["web_search"]
    }
  ]
}
```

### 2. Ground Truth File (expected outputs)
The SME reviews the traces and writes the expected output for each `trace_id`:
```json
{
  "session_id": "5B467129-E54A-4F70-908D-CB31818004B5",
  "ground_truth": [
    {
      "trace_id": "693cb6c4e931",
      "user_prompt_reference": "What is the best route for a NZ road trip?",
      "expected_output": "Response should mention Milford Road, Southern Scenic Route...",
      "expected_trajectory": ["web_search"]
    },
    {
      "trace_id": "693cb6fa87aa",
      "user_prompt_reference": "Should I visit North or South Island?",
      "expected_output": "Response should compare both islands with key features...",
      "expected_trajectory": ["web_search"]
    }
  ]
}
```

**Note:** `user_prompt_reference` is optional. It helps the SME remember which trace each expectation was written for.
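
Because only traces whose `trace_id` appears in both files end up being evaluated, it can help to check coverage before running anything. The following is a minimal sketch, not part of the notebook: `find_unmatched_trace_ids` is a hypothetical helper, and the sample data (including the `000000000000` entry) is invented for illustration in the two file formats shown above.

```python
from typing import Any, Dict, Set, Tuple


def find_unmatched_trace_ids(
    traces_data: Dict[str, Any], gt_data: Dict[str, Any]
) -> Tuple[Set[str], Set[str]]:
    """Return (trace IDs lacking ground truth, ground truth IDs lacking a trace)."""
    trace_ids = {t["trace_id"] for t in traces_data["traces"]}
    gt_ids = {g["trace_id"] for g in gt_data["ground_truth"]}
    return trace_ids - gt_ids, gt_ids - trace_ids


# Hypothetical sample data following the two file formats.
sample_traces = {
    "session_id": "5B467129-E54A-4F70-908D-CB31818004B5",
    "traces": [
        {"trace_id": "693cb6c4e931", "user_prompt": "What is the best route for a NZ road trip?"},
        {"trace_id": "693cb6fa87aa", "user_prompt": "Should I visit North or South Island?"},
    ],
}
sample_gt = {
    "session_id": "5B467129-E54A-4F70-908D-CB31818004B5",
    "ground_truth": [
        {"trace_id": "693cb6c4e931", "expected_output": "Response should mention Milford Road..."},
        # Invented stale entry that matches no trace.
        {"trace_id": "000000000000", "expected_output": "Unused expectation"},
    ],
}

# The two files are expected to describe the same session.
same_session = sample_traces["session_id"] == sample_gt["session_id"]
missing_gt, orphaned_gt = find_unmatched_trace_ids(sample_traces, sample_gt)

print(f"Same session: {same_session}")
print(f"Traces without ground truth: {sorted(missing_gt)}")
print(f"Ground truth entries without a trace: {sorted(orphaned_gt)}")
```

Traces reported as lacking ground truth would be skipped during evaluation, so a non-empty result is a cue to extend the ground truth file.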

## Load Traces and Ground Truth

Load the trace data from the demo files or from CloudWatch, then load the ground truth expectations and merge them on `trace_id`.

In [None]:
if USE_DEMO_MODE:
    # Load traces (actual agent responses)
    with open(DEMO_TRACES_PATH, "r") as f:
        traces_data = json.load(f)
    
    SESSION_ID = traces_data["session_id"]
    traces = []
    for i, t in enumerate(traces_data["traces"]):
        traces.append({
            "trace_index": i,
            "trace_id": t.get("trace_id", f"demo-trace-{i:03d}"),
            "user_prompt": t["user_prompt"],
            "actual_output": t.get("actual_output", ""),
            "actual_trajectory": t.get("actual_trajectory", []),
        })
    
    # Load ground truth (expected outputs) - separate file!
    with open(DEMO_GROUND_TRUTH_PATH, "r") as f:
        gt_data = json.load(f)
    
    # Build ground truth lookup by trace_id
    gt_by_trace_id = {
        gt["trace_id"]: {
            "expected_output": gt["expected_output"],
            "expected_trajectory": gt.get("expected_trajectory", []),
        }
        for gt in gt_data["ground_truth"]
    }
    
    # Merge: match traces to ground truth by trace_id
    ground_truth = {}
    matched_count = 0
    for trace in traces:
        trace_id = trace["trace_id"]
        if trace_id in gt_by_trace_id:
            ground_truth[trace["trace_index"]] = gt_by_trace_id[trace_id]
            matched_count += 1
    
    print(f"Demo Mode:")
    print(f"  Traces loaded: {len(traces)} from {DEMO_TRACES_PATH}")
    print(f"  Ground truth loaded: {len(gt_data['ground_truth'])} entries from {DEMO_GROUND_TRUTH_PATH}")
    print(f"  Matched by trace_id: {matched_count}")
    print(f"  Session ID: {SESSION_ID}")
    if gt_data.get("description"):
        print(f"  Description: {gt_data['description']}")

else:
    from config import AWS_REGION, SOURCE_LOG_GROUP, LOOKBACK_HOURS
    from utils import CloudWatchSessionMapper, ObservabilityClient
    from strands_evals.types.trace import AgentInvocationSpan, ToolExecutionSpan
    
    # Fetch traces from CloudWatch
    obs_client = ObservabilityClient(region_name=AWS_REGION, log_group=SOURCE_LOG_GROUP)
    mapper = CloudWatchSessionMapper()
    
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=LOOKBACK_HOURS)
    
    trace_data = obs_client.get_session_data(
        session_id=SESSION_ID,
        start_time_ms=int(start_time.timestamp() * 1000),
        end_time_ms=int(end_time.timestamp() * 1000),
        include_runtime_logs=False,
    )
    
    if not trace_data.spans:
        raise ValueError(f"No spans found for session {SESSION_ID}")
    
    session = trace_data.to_session(mapper)
    traces = []
    for i, trace in enumerate(session.traces):
        agent_span = None
        tool_calls = []
        for span in trace.spans:
            if isinstance(span, AgentInvocationSpan):
                agent_span = span
            elif isinstance(span, ToolExecutionSpan):
                tool_calls.append(span.tool_call.name)
        if agent_span:
            traces.append({
                "trace_index": i,
                "trace_id": trace.trace_id,
                "user_prompt": agent_span.user_prompt or "",
                "actual_output": agent_span.agent_response or "",
                "actual_trajectory": tool_calls,
            })
    
    # Load ground truth from separate file
    try:
        with open(GROUND_TRUTH_PATH, "r") as f:
            gt_data = json.load(f)
        gt_by_trace_id = {
            gt["trace_id"]: {
                "expected_output": gt["expected_output"],
                "expected_trajectory": gt.get("expected_trajectory", []),
            }
            for gt in gt_data["ground_truth"]
        }
        ground_truth = {}
        for trace in traces:
            trace_id = trace["trace_id"]
            if trace_id in gt_by_trace_id:
                ground_truth[trace["trace_index"]] = gt_by_trace_id[trace_id]
        print(f"Ground truth loaded: {len(ground_truth)} matches from {GROUND_TRUTH_PATH}")
    except FileNotFoundError:
        ground_truth = {}
        print(f"Ground truth file not found: {GROUND_TRUTH_PATH}")
        print("Create a ground truth file or define manually in the next section")
    
    print(f"Live Mode: Loaded {len(traces)} traces from CloudWatch")

## Review Traces

Display the details of each trace. Review them to understand what the ground truth for each interaction should look like.

In [None]:
for trace in traces:
    print(f"\n{'='*70}")
    print(f"TRACE {trace['trace_index'] + 1} (ID: {trace['trace_id']})")
    print(f"{'='*70}")
    
    prompt = trace['user_prompt']
    output = trace['actual_output']
    
    print(f"\nUSER PROMPT:\n{prompt[:500]}..." if len(prompt) > 500 else f"\nUSER PROMPT:\n{prompt}")
    print(f"\nACTUAL OUTPUT:\n{output[:500]}..." if len(output) > 500 else f"\nACTUAL OUTPUT:\n{output}")
    print(f"\nACTUAL TRAJECTORY: {trace['actual_trajectory']}")
    
    if trace['trace_index'] in ground_truth:
        gt = ground_truth[trace['trace_index']]
        print(f"\nEXPECTED OUTPUT: {gt['expected_output']}")
        print(f"EXPECTED TRAJECTORY: {gt['expected_trajectory']}")

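## Define Ground Truth (Live Mode only)

In Live Mode, define the expected output and expected trajectory for each trace here. If no ground truth was loaded, the cell prints a template to copy, edit, and paste.

**In Demo Mode, the ground truth is already loaded from the JSON file.**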

In [None]:
if not ground_truth:
    print("Ground truth not defined. Generating template...\n")
    print("Copy, edit, and paste the following:\n")
    print("ground_truth = {")
    for trace in traces:
        prompt_preview = trace['user_prompt'][:50].replace('"', "'")
        print(f"    {trace['trace_index']}: {{")
        print(f'        "expected_output": "TODO: {prompt_preview}...",') 
        print(f'        "expected_trajectory": {trace["actual_trajectory"]},')
        print(f"    }},")
    print("}")
else:
    print(f"Ground truth already defined for {len(ground_truth)} traces")
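
The template above is keyed by trace index, while the ground truth file described earlier is keyed by `trace_id`. A minimal sketch of generating a file-format skeleton for an SME to fill in follows; `build_ground_truth_skeleton` is a hypothetical helper, not part of the notebook or of Strands Evals, and the sample trace is taken from the file format example.

```python
import json
from typing import Any, Dict, List


def build_ground_truth_skeleton(session_id: str, traces: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a ground truth skeleton keyed by trace_id, ready for an SME to edit."""
    return {
        "session_id": session_id,
        "ground_truth": [
            {
                "trace_id": t["trace_id"],
                "user_prompt_reference": t["user_prompt"],
                "expected_output": "TODO",
                # Pre-filled with the observed tools as a starting point only;
                # the SME should confirm or correct it.
                "expected_trajectory": list(t.get("actual_trajectory", [])),
            }
            for t in traces
        ],
    }


example_traces = [
    {
        "trace_id": "693cb6c4e931",
        "user_prompt": "What is the best route for a NZ road trip?",
        "actual_trajectory": ["web_search"],
    }
]
skeleton = build_ground_truth_skeleton("5B467129-E54A-4F70-908D-CB31818004B5", example_traces)
print(json.dumps(skeleton, indent=2))
```

Writing the result with `json.dump` to the path configured as `GROUND_TRUTH_PATH` would give the Live Mode loader a file in the format it reads.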

## Evaluator Rubrics

Both evaluators use a rubric to compare actual and expected values:

- **OutputEvaluator**: compares the actual response with the expected output using semantic similarity
- **TrajectoryEvaluator**: compares the actual tool trajectory with the expected trajectory
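
Each rubric in the next cell awards points per criterion and sums them into a final score between 0.0 and 1.0. The scoring itself is done by the LLM judge; the sketch below only illustrates the arithmetic for the output rubric. `combine_rubric_scores` and the criterion keys are hypothetical names chosen for this example.

```python
from typing import Dict

# Maximum points per criterion, as stated in the output rubric.
OUTPUT_RUBRIC_MAX: Dict[str, float] = {
    "semantic_match": 0.5,
    "completeness": 0.3,
    "correctness": 0.2,
}


def combine_rubric_scores(scores: Dict[str, float], maxima: Dict[str, float]) -> float:
    """Sum per-criterion scores after checking each one is within its allowed range."""
    for name, value in scores.items():
        if name not in maxima:
            raise KeyError(f"Unknown criterion: {name}")
        if not 0.0 <= value <= maxima[name]:
            raise ValueError(f"{name}={value} is outside 0.0-{maxima[name]}")
    # Round to two decimals to hide floating-point noise in the sum.
    return round(sum(scores.values()), 2)


# Partial semantic alignment, most key information present, no contradictions.
example_scores = {"semantic_match": 0.3, "completeness": 0.15, "correctness": 0.2}
total = combine_rubric_scores(example_scores, OUTPUT_RUBRIC_MAX)
print(f"Final score: {total:.2f}")
```

A response rated at the middle tier for the first two criteria and the top tier for the third therefore lands at 0.65, above the 0.5 threshold that the summary uses to flag low scores.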

In [None]:
ground_truth_output_rubric = """
Compare the agent's actual output against the expected ground truth output.

Evaluation criteria:
1. Semantic Match (0-0.5): Does the actual output convey the same meaning as expected?
   - 0.5: Full semantic alignment - same information and intent
   - 0.3: Partial alignment - captures main points but misses details
   - 0.0: No alignment - different information or wrong answer

2. Completeness (0-0.3): Does the actual output include all key points from expected?
   - 0.3: All key information present
   - 0.15: Most key information present
   - 0.0: Missing critical information

3. Correctness (0-0.2): Is the actual output factually consistent with expected?
   - 0.2: No factual contradictions
   - 0.1: Minor inconsistencies
   - 0.0: Major contradictions or errors

Final score = sum of all criteria (0.0 to 1.0)
"""

trajectory_rubric = """
Compare the agent's actual tool trajectory against the expected trajectory.

Evaluation criteria:
1. Tool Match (0-0.5): Did the agent use the expected tools?
   - 0.5: All expected tools were used
   - 0.25: Some expected tools were used
   - 0.0: None of the expected tools were used

2. No Extra Tools (0-0.3): Did the agent avoid unnecessary tools?
   - 0.3: No extra tools beyond expected
   - 0.15: One extra tool
   - 0.0: Multiple unnecessary tools

3. Order (0-0.2): Were tools used in the expected sequence?
   - 0.2: Correct order
   - 0.1: Minor order differences
   - 0.0: Completely different order

Final score = sum of all criteria (0.0 to 1.0)
"""

## Create Evaluation Cases

Create Cases that contain both the actual output and the expected ground truth for comparison.

In [None]:
def create_ground_truth_cases(traces: List[Dict], ground_truth: Dict[int, Dict], session_id: str) -> List[Case]:
    """Create evaluation cases with ground truth for comparison."""
    cases = []
    
    for trace in traces:
        idx = trace["trace_index"]
        gt = ground_truth.get(idx, {})
        
        if not gt:
            print(f"Warning: No ground truth for trace {idx}, skipping")
            continue
        
        case = Case(
            name=f"trace_{idx}_{trace['trace_id'][:8]}",
            input=trace["user_prompt"],
            expected_output=gt.get("expected_output", ""),
            session_id=session_id,
            metadata={
                "actual_output": trace["actual_output"],
                "actual_trajectory": trace["actual_trajectory"],
                "expected_trajectory": gt.get("expected_trajectory", []),
                "trace_id": trace["trace_id"],
            },
        )
        cases.append(case)
    
    return cases

cases = create_ground_truth_cases(traces, ground_truth, SESSION_ID)
print(f"Created {len(cases)} evaluation cases")

## Task Functions

These functions extract the actual and expected values that the evaluators compare.

In [None]:
def ground_truth_task_fn(case: Case) -> str:
    """Return actual and expected output for comparison."""
    actual = case.metadata.get("actual_output", "")
    expected = case.expected_output or ""
    return f"ACTUAL OUTPUT:\n{actual}\n\nEXPECTED OUTPUT (Ground Truth):\n{expected}"


def trajectory_task_fn(case: Case):
    """Return output and trajectory for TrajectoryEvaluator.
    
    TrajectoryEvaluator expects a dictionary with 'output' and 'trajectory' keys.
    """
    actual_output = case.metadata.get("actual_output", "")
    actual_trajectory = case.metadata.get("actual_trajectory", [])
    expected_trajectory = case.metadata.get("expected_trajectory", [])
    
    # Format output to include comparison context
    comparison_output = f"""ACTUAL OUTPUT:
{actual_output}

EXPECTED TRAJECTORY: {expected_trajectory}
ACTUAL TRAJECTORY: {actual_trajectory}"""
    
    # Return dictionary format expected by TrajectoryEvaluator
    return {"output": comparison_output, "trajectory": actual_trajectory}
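
To see what the output evaluator receives, the sketch below reproduces the formatting of `ground_truth_task_fn` on a sample case. `StubCase` is a hypothetical stand-in that exposes only the attributes the task functions read, so the example runs without `strands_evals`; the sample texts come from the file format examples earlier in this notebook.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class StubCase:
    """Hypothetical stand-in with only the attributes the task functions read."""
    input: str
    expected_output: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_output_comparison(case: StubCase) -> str:
    """Format actual and expected output the same way ground_truth_task_fn does."""
    actual = case.metadata.get("actual_output", "")
    expected = case.expected_output or ""
    return f"ACTUAL OUTPUT:\n{actual}\n\nEXPECTED OUTPUT (Ground Truth):\n{expected}"


stub = StubCase(
    input="What is the best route for a NZ road trip?",
    expected_output="Response should mention Milford Road, Southern Scenic Route...",
    metadata={"actual_output": "Based on the search results, here are the best routes..."},
)
payload = format_output_comparison(stub)
print(payload)
```

The judge sees both texts in a single string, which is why the rubric can refer to "actual" and "expected" output by label.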

## Run Ground Truth Evaluation

Run the evaluations that compare the actual outputs against the ground truth.

In [None]:
if cases:
    print("Running Output Evaluation...")
    output_evaluator = OutputEvaluator(rubric=ground_truth_output_rubric)
    output_experiment = Experiment(cases=cases, evaluators=[output_evaluator])
    output_results = output_experiment.run_evaluations(ground_truth_task_fn)
    output_report = output_results[0]  # Extract the report from the list
    print(f"Output Evaluation Complete - Overall Score: {output_report.overall_score:.2f}")
else:
    print("No cases to evaluate. Please define ground truth first.")

In [None]:
if cases:
    print("Running Trajectory Evaluation...")
    
    # Collect all unique tools from actual and expected trajectories
    all_tools = set()
    for trace in traces:
        all_tools.update(trace["actual_trajectory"])
    for gt in ground_truth.values():
        all_tools.update(gt.get("expected_trajectory", []))
    
    trajectory_evaluator = TrajectoryEvaluator(
        rubric=trajectory_rubric,
        trajectory_description={"available_tools": list(all_tools)}
    )
    trajectory_experiment = Experiment(cases=cases, evaluators=[trajectory_evaluator])
    trajectory_results = trajectory_experiment.run_evaluations(trajectory_task_fn)
    trajectory_report = trajectory_results[0]
    print(f"Trajectory Evaluation Complete - Overall Score: {trajectory_report.overall_score:.2f}")
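
The `TrajectoryEvaluator` score comes from an LLM judge applying the rubric. As a complementary sanity check, tool lists can also be compared deterministically. The sketch below is an illustration under stated assumptions rather than part of Strands Evals: `compare_trajectories` is a hypothetical helper and `calculator` is an invented tool name.

```python
from typing import Any, Dict, List


def compare_trajectories(expected: List[str], actual: List[str]) -> Dict[str, Any]:
    """Compare two tool-name lists without an LLM (this is not the rubric score)."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        # Same tools, same order, same number of calls.
        "exact_match": expected == actual,
        # Same tools and call counts, ignoring order.
        "same_tools_any_order": sorted(expected) == sorted(actual),
        # Expected tools the agent never called.
        "missing_tools": sorted(expected_set - actual_set),
        # Tools the agent called that were not expected.
        "extra_tools": sorted(actual_set - expected_set),
    }


diff = compare_trajectories(["web_search"], ["web_search", "calculator"])
for key, value in diff.items():
    print(f"{key}: {value}")
```

A mismatch here does not necessarily mean the agent was wrong; it only flags traces where the LLM-based trajectory explanation deserves a closer look.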

## Detailed Results

Review the per-trace scores and explanations.

In [None]:
if cases:
    print("\n" + "="*70)
    print("GROUND TRUTH EVALUATION RESULTS")
    print("="*70)
    
    for i, case in enumerate(cases):
        print(f"\n--- Trace {i+1}: {case.name} ---")
        prompt_display = case.input[:80] + "..." if len(case.input) > 80 else case.input
        print(f"User Prompt: {prompt_display}")
        
        output_score = output_report.scores[i] if i < len(output_report.scores) else 0
        output_reason = output_report.reasons[i] if i < len(output_report.reasons) else "N/A"
        print(f"\nOutput Score: {output_score:.2f}")
        print(f"Explanation: {str(output_reason)[:250]}..." if len(str(output_reason)) > 250 else f"Explanation: {output_reason}")
        
        traj_score = trajectory_report.scores[i] if i < len(trajectory_report.scores) else 0
        traj_reason = trajectory_report.reasons[i] if i < len(trajectory_report.reasons) else "N/A"
        print(f"\nTrajectory Score: {traj_score:.2f}")
        print(f"Expected Tools: {case.metadata.get('expected_trajectory', [])}")
        print(f"Actual Tools:   {case.metadata.get('actual_trajectory', [])}")
        print(f"Explanation: {str(traj_reason)[:250]}..." if len(str(traj_reason)) > 250 else f"Explanation: {traj_reason}")

## Log Results to CloudWatch (optional)

Send the evaluation results to the AgentCore Observability dashboard using the original trace IDs. This lets you view the ground truth evaluation scores alongside the original traces in the AgentCore Observability console.

To enable this feature, set `LOG_TO_CLOUDWATCH = True` in the configuration section.

In [None]:
if LOG_TO_CLOUDWATCH and cases:
    from config import EVALUATION_CONFIG_ID, setup_cloudwatch_environment
    from utils import send_evaluation_to_cloudwatch
    
    # Setup CloudWatch environment
    setup_cloudwatch_environment()
    
    output_logged = 0
    trajectory_logged = 0
    
    print("Logging results to CloudWatch...")
    
    for i, case in enumerate(cases):
        trace_id = case.metadata.get("trace_id", "")
        if not trace_id:
            continue
        
        # Log output evaluation result
        output_score = output_report.scores[i] if i < len(output_report.scores) else 0
        output_reason = output_report.reasons[i] if i < len(output_report.reasons) else ""
        if send_evaluation_to_cloudwatch(
            trace_id=trace_id,
            session_id=SESSION_ID,
            evaluator_name=OUTPUT_EVALUATOR_NAME,
            score=output_score,
            explanation=str(output_reason)[:500],
            config_id=EVALUATION_CONFIG_ID,
        ):
            output_logged += 1
        
        # Log trajectory evaluation result
        traj_score = trajectory_report.scores[i] if i < len(trajectory_report.scores) else 0
        traj_reason = trajectory_report.reasons[i] if i < len(trajectory_report.reasons) else ""
        if send_evaluation_to_cloudwatch(
            trace_id=trace_id,
            session_id=SESSION_ID,
            evaluator_name=TRAJECTORY_EVALUATOR_NAME,
            score=traj_score,
            explanation=str(traj_reason)[:500],
            config_id=EVALUATION_CONFIG_ID,
        ):
            trajectory_logged += 1
    
    print(f"CloudWatch logging complete:")
    print(f"  Output evaluations logged: {output_logged}/{len(cases)}")
    print(f"  Trajectory evaluations logged: {trajectory_logged}/{len(cases)}")
else:
    if not LOG_TO_CLOUDWATCH:
        print("CloudWatch logging disabled. Set LOG_TO_CLOUDWATCH = True to enable.")

## Summary

Aggregate results showing how closely the agent matches the ground truth.

In [None]:
if cases:
    print("\n" + "="*70)
    print("SUMMARY")
    print("="*70)
    print(f"\nSession: {SESSION_ID}")
    print(f"Mode: {'Demo' if USE_DEMO_MODE else 'Live'}")
    print(f"Traces Evaluated: {len(cases)}")
    
    print(f"\nOutput Evaluation (Actual vs Expected Response):")
    print(f"  Overall Score: {output_report.overall_score:.2f}")
    print(f"  Range: {min(output_report.scores):.2f} - {max(output_report.scores):.2f}")
    
    print(f"\nTrajectory Evaluation (Actual vs Expected Tools):")
    print(f"  Overall Score: {trajectory_report.overall_score:.2f}")
    print(f"  Range: {min(trajectory_report.scores):.2f} - {max(trajectory_report.scores):.2f}")
    
    low_output = [i+1 for i, s in enumerate(output_report.scores) if s < 0.5]
    low_traj = [i+1 for i, s in enumerate(trajectory_report.scores) if s < 0.5]
    
    if low_output:
        print(f"\nTraces with low output scores (<0.5): {low_output}")
    if low_traj:
        print(f"Traces with low trajectory scores (<0.5): {low_traj}")
    
    if not low_output and not low_traj:
        print("\nAll traces scored 0.5 or higher - agent behavior matches ground truth well!")

## Export Results

Save the evaluation results as JSON.

In [None]:
if cases:
    export_data = {
        "evaluation_time": datetime.now(timezone.utc).isoformat(),
        "session_id": SESSION_ID,
        "mode": "demo" if USE_DEMO_MODE else "live",
        "evaluation_type": "ground_truth",
        "summary": {
            "traces_evaluated": len(cases),
            "output_overall_score": output_report.overall_score,
            "trajectory_overall_score": trajectory_report.overall_score,
        },
        "traces": [
            {
                "trace_id": case.metadata.get("trace_id"),
                "user_prompt": case.input,
                "expected_output": case.expected_output,
                "actual_output": case.metadata.get("actual_output"),
                "expected_trajectory": case.metadata.get("expected_trajectory"),
                "actual_trajectory": case.metadata.get("actual_trajectory"),
                "output_score": output_report.scores[i] if i < len(output_report.scores) else None,
                "trajectory_score": trajectory_report.scores[i] if i < len(trajectory_report.scores) else None,
            }
            for i, case in enumerate(cases)
        ],
    }
    
    output_path = "ground_truth_results.json"
    with open(output_path, "w") as f:
        json.dump(export_data, f, indent=2)
    
    print(f"Results exported to {output_path}")
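
For the regression testing use case, an exported results file from an earlier run can serve as a baseline. The sketch below assumes both runs were exported in the format above and that the same prompts were replayed, so entries are matched on `user_prompt` (trace IDs would likely differ between runs). `find_regressions` and the `tolerance` default are hypothetical choices, and the scores are made-up sample values.

```python
from typing import Any, Dict, List


def find_regressions(
    baseline: Dict[str, Any],
    current: Dict[str, Any],
    score_key: str = "output_score",
    tolerance: float = 0.05,
) -> List[Dict[str, Any]]:
    """List prompts whose score dropped by more than `tolerance` versus the baseline."""
    baseline_scores = {t["user_prompt"]: t.get(score_key) for t in baseline["traces"]}
    regressions = []
    for trace in current["traces"]:
        before = baseline_scores.get(trace["user_prompt"])
        after = trace.get(score_key)
        if before is None or after is None:
            continue  # No comparable score for this prompt.
        if before - after > tolerance:
            regressions.append(
                {"user_prompt": trace["user_prompt"], "baseline": before, "current": after}
            )
    return regressions


# Made-up sample exports containing only the fields this check reads.
baseline_export = {
    "traces": [
        {"user_prompt": "What is the best route for a NZ road trip?", "output_score": 0.9},
        {"user_prompt": "Should I visit North or South Island?", "output_score": 0.8},
    ]
}
current_export = {
    "traces": [
        {"user_prompt": "What is the best route for a NZ road trip?", "output_score": 0.6},
        {"user_prompt": "Should I visit North or South Island?", "output_score": 0.8},
    ]
}

regressions = find_regressions(baseline_export, current_export)
for r in regressions:
    print(f"{r['user_prompt']}: {r['baseline']:.2f} -> {r['current']:.2f}")
```

Because LLM-judged scores can vary somewhat between runs, a small tolerance avoids flagging noise; passing `score_key="trajectory_score"` applies the same check to tool usage.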