# Prepare Vote Extraction Dataset for Datadog LLM Experiments

**Objective**: Create a curated dataset from Thai election form images for systematic testing

**What You'll Learn**:
1. Load and inspect test images from `assets/`
2. Create dataset records with input and expected output (ground truth)
3. Save dataset as local JSON for version control
4. Push dataset to Datadog via API
5. Validate dataset quality

**Prerequisites**:
- Images in `assets/ss5-18-images/`
- Datadog API and App keys
- Python packages: `requests`, `pillow` (imported as `PIL`), `python-dotenv`, `ddtrace`, `httpx`, `google-genai` (`json` ships with Python)

---


## 📦 Setup and Imports

**Install Required Packages** (run once):

*If you have `fastapi-backend` installed and see a dependency conflict, uncomment and run the cell below first:*


In [1]:
# Optional: Upgrade fastapi-backend to resolve dependency conflicts
# Uncomment the line below if you see ddtrace version conflicts
%pip install --quiet --upgrade -e ../../services/fastapi-backend/

print("✅ Optional: fastapi-backend upgraded (if uncommented)")
print("   This ensures ddtrace>=4.0.0 compatibility")


Note: you may need to restart the kernel to use updated packages.
✅ Optional: fastapi-backend upgraded (if uncommented)
   This ensures ddtrace>=4.0.0 compatibility


In [2]:
# Install required packages for dataset preparation and experiments
%pip install --quiet --upgrade requests pillow python-dotenv "ddtrace>=3.18.0" httpx "google-genai>=1.56.0"

print("✅ All required packages installed!")
print("\nInstalled packages:")
print("  - requests: HTTP requests for Datadog API")
print("  - pillow: Image processing")
print("  - python-dotenv: Environment variables")
print("  - ddtrace: Datadog LLM Observability SDK (>=3.18.0)")
print("  - httpx: Async HTTP client for API calls")
print("  - google-genai: Google Generative AI SDK (>=1.56.0)")
print("\n💡 Tip: If you see any dependency warnings, they're typically safe to ignore")
print("   for this notebook as it doesn't directly use all backend dependencies.")


Note: you may need to restart the kernel to use updated packages.
✅ All required packages installed!

Installed packages:
  - requests: HTTP requests for Datadog API
  - pillow: Image processing
  - python-dotenv: Environment variables
  - ddtrace: Datadog LLM Observability SDK (>=3.18.0)
  - httpx: Async HTTP client for API calls
  - google-genai: Google Generative AI SDK (>=1.56.0)

💡 Tip: If you see any dependency warnings, they're typically safe to ignore
   for this notebook as it doesn't directly use all backend dependencies.


In [4]:
import json
import os
import sys
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any

import requests
from PIL import Image
from dotenv import load_dotenv

# Add project root to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Load environment variables
load_dotenv(project_root / ".env")

print(f"✅ Project root: {project_root}")
print(f"✅ Current working directory: {Path.cwd()}")
print("✅ Environment loaded from .env")


✅ Project root: /Users/nuttee.jirattivongvibul/Projects/genai-app-python
✅ Current working directory: /Users/nuttee.jirattivongvibul/Projects/genai-app-python/notebooks/datasets
✅ Environment loaded from .env


## 🔑 Configuration & API Keys


In [None]:
# Configuration
DD_API_KEY = os.getenv("DD_API_KEY")
DD_APP_KEY = os.getenv("DD_APP_KEY")
DD_SITE = os.getenv("DD_SITE", "datadoghq.com")

# Paths
IMAGES_DIR = project_root / "assets" / "ss5-18-images"
DATASET_DIR = project_root / "datasets" / "vote-extraction"
DATASET_DIR.mkdir(parents=True, exist_ok=True)

# Verify setup
print("🔑 API Keys:")
print(f"   DD_API_KEY: {'✅ Set' if DD_API_KEY else '❌ Missing'}")
print(f"   DD_APP_KEY: {'✅ Set' if DD_APP_KEY else '❌ Missing'}")
print(f"\n📂 Paths:")
print(f"   Images: {IMAGES_DIR}")
print(f"   Dataset output: {DATASET_DIR}")
print(f"   Images exist: {'✅ Yes' if IMAGES_DIR.exists() else '❌ No'}")


🔑 API Keys:
   DD_API_KEY: ✅ Set
   DD_APP_KEY: ✅ Set

📂 Paths:
   Images: /Users/nuttee.jirattivongvibul/Projects/genai-app-python/assets/ss5-18-images
   Dataset output: /Users/nuttee.jirattivongvibul/Projects/genai-app-python/datasets/vote-extraction
   Images exist: ✅ Yes


## 📸 Step 1: Discover Images

Let's see what images we have available.


In [None]:
# Discover all images
image_files = sorted(list(IMAGES_DIR.glob("*.jpg")) + list(IMAGES_DIR.glob("*.png")))

print(f"📊 Found {len(image_files)} images:")
print("=" * 80)

for i, img_path in enumerate(image_files[:10], 1):  # Show first 10
    # Open via a context manager so the file handle is closed once the size is read
    with Image.open(img_path) as img:
        width, height = img.size
    size_mb = img_path.stat().st_size / (1024 * 1024)
    print(f"{i:2}. {img_path.name:45} {width:4}x{height:4}px {size_mb:6.2f}MB")

if len(image_files) > 10:
    print(f"... and {len(image_files) - 10} more")

print(f"\n✅ Total: {len(image_files)} images")
print(f"   Estimated form sets (6 pages each): {len(image_files) // 6}")
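
The "6 pages each" estimate above suggests a simple way to turn the sorted image list into form sets. The following is a minimal sketch, assuming consecutive files in sorted order belong to the same form; `group_into_form_sets`, the `form-set-NN` naming, and the demo file names are hypothetical and not part of the project code.

```python
from pathlib import Path
from typing import Any, Dict, List

PAGES_PER_FORM = 6  # assumption taken from the "6 pages each" estimate


def group_into_form_sets(
    paths: List[Path], pages_per_form: int = PAGES_PER_FORM
) -> List[Dict[str, Any]]:
    """Chunk a sorted list of page images into consecutive form sets."""
    form_sets = []
    for start in range(0, len(paths), pages_per_form):
        chunk = paths[start:start + pages_per_form]
        form_sets.append({
            # Hypothetical naming scheme; real form sets may be named differently
            "form_set_name": f"form-set-{start // pages_per_form + 1:02d}",
            "image_paths": [str(p) for p in chunk],
            "num_pages": len(chunk),
        })
    return form_sets


# Hypothetical file names, used only to demonstrate the grouping
demo_paths = [Path(f"page_{i:02d}.jpg") for i in range(1, 14)]
demo_sets = group_into_form_sets(demo_paths)
print([fs["num_pages"] for fs in demo_sets])
```

A trailing group with fewer than six pages (as in the demo) is a hint that a page is missing or that the naming does not follow the assumed order.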


## 💾 Step 2: Work with Local Dataset Files

We store datasets as JSON files for:
- Version control (Git-friendly)
- Local editing and review
- Backup and sharing
- Incremental updates
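
As a rough illustration of what such a file can look like, the sketch below builds and saves a dataset using the field names that the loading and push cells in this notebook read (`metadata.name`, `version`, `num_records`, `total_pages`, `created_at`, and records with `id`, `input`, `expected_output`). The helper names, the file naming scheme, and the sample values are assumptions made for the example.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List


def build_dataset(name: str, records: List[Dict[str, Any]], version: int = 1) -> Dict[str, Any]:
    """Wrap records in a metadata envelope (hypothetical helper)."""
    return {
        "metadata": {
            "name": name,
            "version": version,
            "num_records": len(records),
            "total_pages": sum(r["input"].get("num_pages", 0) for r in records),
            "created_at": datetime.now().isoformat(timespec="seconds"),
        },
        "records": records,
    }


def save_dataset(dataset: Dict[str, Any], out_dir: Path) -> Path:
    """Write the dataset as UTF-8 JSON; the file name pattern is an assumption."""
    out_dir.mkdir(parents=True, exist_ok=True)
    meta = dataset["metadata"]
    path = out_dir / f"{meta['name']}_v{meta['version']}.json"
    # ensure_ascii=False keeps Thai text human-readable in Git diffs
    path.write_text(json.dumps(dataset, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Made-up sample record, not real ground truth
sample_record = {
    "id": "demo-001",
    "input": {"form_set_name": "demo", "image_paths": ["page_01.jpg"], "num_pages": 1},
    "expected_output": {"form_info": {"district": "demo-district"}},
}
sample_dataset = build_dataset("vote-extraction-demo", [sample_record])
saved_path = save_dataset(sample_dataset, Path(tempfile.mkdtemp()))
reloaded = json.loads(saved_path.read_text(encoding="utf-8"))
print(reloaded["metadata"]["num_records"])
```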


In [None]:
# List existing dataset files
dataset_files = sorted(DATASET_DIR.glob("*.json"))

print(f"📁 Existing dataset files in {DATASET_DIR}:")
print("=" * 80)

if dataset_files:
    for i, file in enumerate(dataset_files, 1):
        size_kb = file.stat().st_size / 1024
        modified = datetime.fromtimestamp(file.stat().st_mtime)
        print(f"{i}. {file.name}")
        print(f"   Size: {size_kb:.2f} KB | Modified: {modified.strftime('%Y-%m-%d %H:%M:%S')}")
else:
    print("No dataset files found yet.")
    print("\n💡 Use the Streamlit app to create your first dataset!")
    print("   Run: streamlit run frontend/streamlit/pages/2_📊_Dataset_Manager.py")


## 📤 Step 3: Push Dataset to Datadog LLMObs

Load the most recent local dataset file, then push it to Datadog LLM Observability as a project and dataset.
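
Local records in this notebook use the keys `id`, `input`, and `expected_output`, while records pulled back through the SDK in Step 4 expose `input_data`, `expected_output`, and `metadata`. The sketch below shows one way to map between the two shapes. `to_llmobs_records` and the `local_id` metadata key are hypothetical, and passing the result to the SDK's dataset-creation call is an assumption to verify against the ddtrace version you have installed.

```python
from typing import Any, Dict, List


def to_llmobs_records(local_records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert local JSON records to the record shape seen on pulled datasets."""
    converted = []
    for record in local_records:
        converted.append({
            "input_data": record["input"],
            "expected_output": record["expected_output"],
            # Hypothetical metadata key that keeps a link back to the local record
            "metadata": {"local_id": record.get("id")},
        })
    return converted


# Made-up local record for demonstration only
local_records = [{
    "id": "demo-001",
    "input": {"form_set_name": "demo", "image_paths": ["page_01.jpg"], "num_pages": 1},
    "expected_output": {"form_info": {"district": "demo-district"}},
}]
converted = to_llmobs_records(local_records)
print(sorted(converted[0].keys()))

# Assumption: recent ddtrace releases accept a list like `converted` as the
# `records` argument when creating a dataset through LLMObs; check the SDK docs.
```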


In [None]:
# Load the latest dataset file
latest_file = DATASET_DIR / "vote-extraction-dataset_latest.json"

if not latest_file.exists() and dataset_files:
    # Fall back to the last file by name (the list is sorted alphabetically, not by modification time)
    latest_file = dataset_files[-1]

if latest_file.exists():
    print(f"📂 Loading dataset from: {latest_file.name}")

    with open(latest_file, 'r', encoding='utf-8') as f:
        dataset = json.load(f)

    print(f"\n✅ Dataset loaded:")
    print(f"   Name: {dataset['metadata']['name']}")
    print(f"   Version: {dataset['metadata']['version']}")
    print(f"   Records: {dataset['metadata']['num_records']}")
    print(f"   Total Pages: {dataset['metadata']['total_pages']}")
    print(f"   Created: {dataset['metadata']['created_at']}")

    # Show first record as example
    if dataset['records']:
        print(f"\n📄 Example record:")
        print(json.dumps(dataset['records'][0], indent=2, ensure_ascii=False)[:500] + "...")
else:
    print("❌ No dataset files found.")
    print("\n💡 Create one using the Streamlit Dataset Manager app!")
    dataset = None


## 🚀 Push to Datadog (Using the HTTP API)

**Note**: Uncomment and run this cell when you're ready to push your dataset to Datadog.


In [None]:
# Uncomment to push dataset to Datadog
'''
if dataset and DD_API_KEY and DD_APP_KEY:
    print("🔗 Pushing dataset to Datadog...")

    # Using HTTP API
    base_url = f"https://api.{DD_SITE}/api/v2/llm-obs/v1"
    headers = {
        "DD-API-KEY": DD_API_KEY,
        "DD-APPLICATION-KEY": DD_APP_KEY,
        "Content-Type": "application/json",
    }

    # 1. Create or get project
    project_name = "vote-extraction-project"
    response = requests.get(f"{base_url}/projects", headers=headers)
    response.raise_for_status()  # Fail fast on auth or endpoint errors
    projects = response.json().get("data", [])

    project_id = None
    for proj in projects:
        if proj["attributes"]["name"] == project_name:
            project_id = proj["id"]
            print(f"✅ Found existing project: {project_id}")
            break

    if not project_id:
        # Create new project
        payload = {
            "data": {
                "type": "project",
                "attributes": {
                    "name": project_name,
                    "description": "Thai election vote extraction testing"
                }
            }
        }
        response = requests.post(f"{base_url}/projects", json=payload, headers=headers)
        response.raise_for_status()
        project_id = response.json()["data"]["id"]
        print(f"✅ Created new project: {project_id}")

    # 2. Create dataset
    payload = {
        "data": {
            "type": "dataset",
            "attributes": {
                "project_id": project_id,
                "name": dataset["metadata"]["name"],
                "description": dataset["metadata"].get("description", ""),
                "dataset_version": 1
            }
        }
    }
    response = requests.post(f"{base_url}/datasets", json=payload, headers=headers)
    response.raise_for_status()
    dataset_id = response.json()["data"]["id"]
    print(f"✅ Created dataset: {dataset_id}")

    # 3. Add records
    print(f"\n📤 Adding {len(dataset['records'])} records...")
    for i, record in enumerate(dataset['records'], 1):
        payload = {
            "data": {
                "type": "dataset_record",
                "attributes": {
                    "input": record["input"],
                    "expected_output": record["expected_output"]
                }
            }
        }
        response = requests.post(f"{base_url}/datasets/{dataset_id}/records", json=payload, headers=headers)
        response.raise_for_status()  # Only report success if the record was accepted
        print(f"✅ Added record {i}/{len(dataset['records'])}: {record['id']}")

    print(f"\n🎉 Dataset pushed successfully!")
    print(f"🔗 View in Datadog: https://app.{DD_SITE}/llm/experiments")
else:
    print("⚠️ Skipped: No dataset or API keys not set")
'''

print("💡 TIP: Use the Streamlit Dataset Manager app to push datasets with a GUI!")
print("   It provides a much better experience for managing datasets.")


## 🧪 Step 4: Run Experiments

Experiments let you systematically test your LLM application by running your agent across a set of scenarios from your dataset and measuring performance against expected outputs.

**Key Components**:
1. **Task**: Core workflow to evaluate (single LLM call or complex flow)
2. **Evaluators**: Functions that measure performance (boolean, score, categorical)
3. **Summary Evaluators**: Aggregate metrics across all records (precision, recall, accuracy)
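
To make the relationship between these three components concrete, here is a plain-Python illustration of how they fit together. It is a toy loop written for this explanation, not the ddtrace implementation; `run_toy_experiment`, `echo_task`, `exact_match`, and `match_rate` are hypothetical names, and the evaluator signatures simply mirror the ones used later in this notebook.

```python
from typing import Any, Callable, Dict, List


def run_toy_experiment(
    records: List[Dict[str, Any]],
    task: Callable[[Dict[str, Any]], Any],
    evaluators: List[Callable[..., Any]],
    summary_evaluators: List[Callable[..., Any]],
) -> Dict[str, Any]:
    """Run the task on every record, score each output, then aggregate."""
    inputs = [r["input_data"] for r in records]
    expected = [r["expected_output"] for r in records]
    outputs = [task(inp) for inp in inputs]

    # Per-record results, keyed by evaluator name
    evaluations = {
        ev.__name__: [ev(i, o, e) for i, o, e in zip(inputs, outputs, expected)]
        for ev in evaluators
    }
    # Aggregates computed from the per-record results
    summary = {
        s.__name__: s(inputs, outputs, expected, evaluations)
        for s in summary_evaluators
    }
    return {"outputs": outputs, "evaluations": evaluations, "summary": summary}


def echo_task(input_data: Dict[str, Any]) -> Dict[str, Any]:
    return {"value": input_data["value"]}


def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output


def match_rate(inputs, outputs, expected_outputs, evaluators_results) -> float:
    results = evaluators_results["exact_match"]
    return results.count(True) / len(results)


toy_records = [
    {"input_data": {"value": 1}, "expected_output": {"value": 1}},
    {"input_data": {"value": 2}, "expected_output": {"value": 3}},
]
toy_result = run_toy_experiment(toy_records, echo_task, [exact_match], [match_rate])
print(toy_result["summary"])
```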

**Benefits**:
- Compare different app configurations side-by-side
- Track performance improvements over time
- Identify failure patterns
- Validate before production deployment


### 4.1 Load Dataset from Datadog

First, load the dataset we created earlier from Datadog.


In [53]:
from ddtrace.llmobs import LLMObs
from typing import Dict, Any, Optional, List, Callable

# Load dataset from Datadog
dataset_name = "vote-extraction-bangbamru-1-10"  # Change to your dataset name
project_name = "vote-extraction-project"

# Initialize LLMObs
LLMObs.enable(
    ml_app="vote-extractor",
    api_key=DD_API_KEY,
    site=DD_SITE,
    agentless_enabled=True,
    project_name=project_name,
)

print(f"📥 Loading dataset '{dataset_name}' from Datadog...")

try:
    experiment_dataset = LLMObs.pull_dataset(
        dataset_name=dataset_name,
        project_name=project_name,
        version=2  # Optional: specify version, defaults to latest
    )

    print(f"✅ Dataset loaded successfully!")
    print(f"   Records: {len(experiment_dataset)}")

    # Try to get version (may not be available in all ddtrace versions)
    version = getattr(experiment_dataset, 'version', None)
    if version is not None:
        print(f"   Version: {version}")

    # Preview first record
    if len(experiment_dataset) > 0:
        first_record = experiment_dataset[0]
        print(f"\n📄 First record preview:")
        print(f"   Input keys: {list(first_record['input_data'].keys())}")
        print(f"   Expected output keys: {list(first_record['expected_output'].keys())}")

except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("\n💡 Make sure you've pushed the dataset to Datadog first!")
    print("   Use the Streamlit Dataset Manager or Step 3 above")
    experiment_dataset = None


📥 Loading dataset 'vote-extraction-bangbamru-1-10' from Datadog...
✅ Dataset loaded successfully!
   Records: 11
   Version: 2

📄 First record preview:
   Input keys: ['form_set_name', 'image_paths', 'num_pages']
   Expected output keys: ['ballot_statistics', 'form_info', 'notes', 'vote_results', 'voter_statistics']


In [54]:
# 🔍 Inspect Dataset Object (Debug)
if experiment_dataset:
    print("📊 Dataset Object Inspection:")
    print("=" * 80)

    # Show type
    print(f"Type: {type(experiment_dataset)}")

    # Show available attributes
    print(f"\n📝 Available attributes:")
    attrs = [attr for attr in dir(experiment_dataset) if not attr.startswith('_')]
    for attr in attrs[:15]:  # Show first 15
        try:
            value = getattr(experiment_dataset, attr, None)
            if not callable(value):
                print(f"   - {attr}: {value}")
        except Exception:
            # Some attributes may raise on access; skip them
            pass

    # Show structure
    print(f"\n📦 Dataset Structure:")
    print(f"   - Length: {len(experiment_dataset)}")
    if len(experiment_dataset) > 0:
        print(f"   - First record type: {type(experiment_dataset[0])}")
        print(f"   - First record keys: {list(experiment_dataset[0].keys())}")

    print("\n💡 Note: The Dataset object is a wrapper around a list of records.")
    print("   Version info may be stored internally or not exposed as an attribute.")


📊 Dataset Object Inspection:
Type: <class 'ddtrace.llmobs._experiment.Dataset'>

📝 Available attributes:
   - BATCH_UPDATE_THRESHOLD: 5242880
   - description: Auto-generated from LLM extraction on 2026-01-04 02:20:54
   - latest_version: 3
   - name: vote-extraction-bangbamru-1-10
   - project: {'name': 'vote-extraction-project', '_id': 'b8001020-6c33-4e94-a476-bcd78efddf3b'}
   - url: https://app.datadoghq.com/llm/datasets/241bfded-e79d-4d2d-bbc4-a74bb06d85f9
   - version: 2

📦 Dataset Structure:
   - Length: 11
   - First record type: <class 'dict'>
   - First record keys: ['record_id', 'input_data', 'expected_output', 'metadata']

💡 Note: The Dataset object is a wrapper around a list of records.
   Version info may be stored internally or not exposed as an attribute.


### 4.2 Define Task Function

The task function processes each dataset record. For vote extraction, we'll call our FastAPI backend.

**⚠️ Important: API Key Configuration**

The FastAPI backend requires authentication if `API_KEY` is set in your environment:

- **If you get 401 Unauthorized errors**: Set `API_KEY` in your `.env` file or environment
- **For local testing without API key**: Ensure `API_KEY=""` (empty) and `API_KEY_REQUIRED=false` in `.env`
- **The notebook will automatically load `API_KEY` from environment** and include it in requests

Example `.env` configuration:
```bash
# Option 1: Use API key (recommended for production-like testing)
API_KEY=your-secret-key-here
API_KEY_REQUIRED=true

# Option 2: Disable API key (for quick local testing)
API_KEY=
API_KEY_REQUIRED=false
```


In [67]:
import mimetypes

import httpx
from ddtrace.llmobs.decorators import workflow

# FastAPI backend URL and API key
API_BASE_URL = os.getenv("API_BASE_URL", "http://localhost:8000")
API_KEY = os.getenv("API_KEY", "")  # Load API key from environment

@workflow
def vote_extraction_task(input_data: Dict[str, Any], config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """
    Task function that extracts vote data from election form images.

    Args:
        input_data: Dictionary containing 'form_set_name' and 'image_paths'
        config: Optional configuration (model settings, etc.)

    Returns:
        Dictionary with extracted vote data (form_info, voter_statistics,
        ballot_statistics, vote_results), or {"error": ...} on failure
    """
    form_set_name = input_data.get("form_set_name")
    image_paths = input_data.get("image_paths", [])

    print(f"Processing: {form_set_name} ({len(image_paths)} pages)")

    try:
        # Read images
        files = []
        for img_path in image_paths:
            img_file = Path(img_path)
            if img_file.exists():
                # Guess the MIME type from the extension (Step 1 discovers both .jpg and .png)
                mime_type = mimetypes.guess_type(img_file.name)[0] or "image/jpeg"
                files.append(("files", (img_file.name, img_file.read_bytes(), mime_type)))
        
        # Prepare headers with API key (if configured)
        headers = {}
        if API_KEY:
            headers["X-API-Key"] = API_KEY
        
        # Call extraction API
        with httpx.Client(timeout=300.0) as client:
            response = client.post(
                f"{API_BASE_URL}/api/v1/vote-extraction/extract",
                files=files,
                headers=headers
            )
            response.raise_for_status()
            result = response.json()
        
        # Extract first form data (for single-form datasets)
        if result.get("data") and len(result["data"]) > 0:
            extracted_data = result["data"][0]
            return {
                "form_info": extracted_data.get("form_info"),
                "voter_statistics": extracted_data.get("voter_statistics"),
                "ballot_statistics": extracted_data.get("ballot_statistics"),
                "vote_results": extracted_data.get("vote_results", [])
            }
        else:
            return {"error": "No data extracted"}
            
    except Exception as e:
        print(f"❌ Error processing {form_set_name}: {e}")
        return {"error": str(e)}

print("✅ Task function defined: vote_extraction_task()")


✅ Task function defined: vote_extraction_task()


### 4.3 Define Evaluator Functions

Evaluators measure how well the model performs. We'll create evaluators for different aspects of vote extraction.


### 4.3.1 LLM-as-Judge Evaluator (Advanced)

This evaluator uses a more powerful LLM (gemini-3-pro-preview via Vertex AI) to assess extraction quality.
It provides detailed reasoning and identifies specific errors.
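
The evaluator in the next cell retries empty responses with exponential backoff. As a standalone illustration of that pattern, here is a minimal sketch; `call_with_backoff` and its parameters are hypothetical and do not appear in the notebook, and the fake response sequence exists only to exercise the helper without calling any API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_valid: Callable[[T], bool],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn until is_valid(result) holds, doubling the wait between attempts."""
    delay = initial_delay
    for attempt in range(1, max_retries + 1):
        result = fn()
        if is_valid(result):
            return result
        if attempt < max_retries:
            sleep(delay)
            delay *= 2  # Exponential backoff
    return None  # All attempts returned an invalid (for example empty) result


# Simulate two empty responses followed by a valid one; record delays instead of sleeping
responses = iter(["", "", '{"score": 0.9}'])
delays = []
final = call_with_backoff(lambda: next(responses), bool, sleep=delays.append)
print(final, delays)
```

Injecting `sleep` as a parameter is a design choice for the example: it keeps the demonstration instant and makes the delay schedule easy to inspect.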


In [73]:
def llm_judge_evaluator(input_data: Dict[str, Any], output_data: Dict[str, Any], expected_output: Dict[str, Any]) -> float:
    """
    LLM-as-Judge evaluator using gemini-3-pro-preview via Vertex AI.
    
    Uses a more powerful LLM to evaluate the quality of extraction outputs
    by comparing them with ground truth. Provides a quality score from 0.0 to 1.0.
    Includes retry logic with exponential backoff for empty responses.
    
    Note: Requires the GOOGLE_CLOUD_PROJECT environment variable.
    VERTEX_AI_LOCATION is optional and defaults to "global".
    """
    import json
    import time
    from google import genai
    from google.genai import types
    from ddtrace import tracer
    
    # Retry configuration
    MAX_RETRIES = 3
    INITIAL_RETRY_DELAY = 1.0  # seconds
    
    # Define response schema for structured output
    EVALUATION_SCHEMA = {
        "type": "OBJECT",
        "properties": {
            "score": {
                "type": "NUMBER",
                "description": "Quality score between 0.0 (worst) and 1.0 (perfect)"
            },
            "reasoning": {
                "type": "STRING",
                "description": "Brief explanation of the score"
            },
            "errors": {
                "type": "ARRAY",
                "description": "List of specific errors found",
                "items": {
                    "type": "OBJECT",
                    "properties": {
                        "field": {"type": "STRING", "description": "Field path with error"},
                        "expected": {"type": "STRING", "description": "Expected value"},
                        "actual": {"type": "STRING", "description": "Actual value"},
                        "severity": {"type": "STRING", "enum": ["minor", "major", "critical"]}
                    }
                }
            }
        },
        "required": ["score", "reasoning", "errors"]
    }
    
    # Create main evaluation span
    form_set_name = input_data.get("form_set_name", "Unknown")
    
    with tracer.trace(
        "llm_judge.evaluate",
        service="vote-extractor",
        resource=f"evaluate_{form_set_name}"
    ) as eval_span:
        eval_span.set_tag("form_set_name", form_set_name)
        eval_span.set_tag("evaluator", "llm_judge")
        eval_span.set_tag("model", "gemini-3-pro-preview")
        
        try:
            # Get GCP configuration
            project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
            location = os.getenv("VERTEX_AI_LOCATION", "global")
            
            if not project_id:
                print("⚠️  LLM Judge: GOOGLE_CLOUD_PROJECT not set, skipping evaluation")
                eval_span.set_tag("error.skip", "missing_gcp_project")
                return 0.0  # Cannot evaluate without project
            
            # Initialize Google GenAI client with Vertex AI
            with tracer.trace("llm_judge.initialize_client", service="vote-extractor") as init_span:
                init_span.set_tag("project_id", project_id)
                init_span.set_tag("location", location)
                
                client = genai.Client(
                    vertexai=True,
                    project=project_id,
                    location=location,
                )
            
            # Build evaluation prompt with tracing
            with tracer.trace("llm_judge.build_prompt", service="vote-extractor") as prompt_span:
                prompt_span.set_tag("form_set_name", form_set_name)
                
                prompt = f"""You are an expert evaluator for Thai election vote extraction systems.

Compare the extracted output with the ground truth and evaluate the quality.

**Input Form:** {form_set_name}

**Ground Truth:**
{json.dumps(expected_output, indent=2, ensure_ascii=False)}

**Extracted Output:**
{json.dumps(output_data, indent=2, ensure_ascii=False)}

Evaluate the extraction quality considering:
1. Form Information (date, location, polling station)
2. Voter Statistics (eligible voters, voters present)
3. Ballot Statistics (allocated, used, good, bad, no-vote)
4. Vote Results (candidate numbers, names, vote counts)

Provide:
- score: float between 0.0 (worst) and 1.0 (perfect)
- reasoning: brief explanation
- errors: list of specific errors found (if any)
"""
                prompt_span.set_metric("prompt_length", len(prompt))
            
            # Call Gemini 3 Pro Preview as judge with structured schema
            # Retry logic for empty responses
            response = None
            retry_delay = INITIAL_RETRY_DELAY
            
            for attempt in range(1, MAX_RETRIES + 1):
                with tracer.trace("llm_judge.api_call", service="vote-extractor") as api_span:
                    api_span.set_tag("model", "gemini-3-pro-preview")
                    api_span.set_tag("provider", "google")
                    api_span.set_tag("temperature", "0.0")
                    api_span.set_metric("attempt", attempt)
                    api_span.set_metric("max_retries", MAX_RETRIES)
                    
                    try:
                        response = client.models.generate_content(
                            model="gemini-3-pro-preview",
                            contents=prompt,
                            config=types.GenerateContentConfig(
                                response_mime_type="application/json",
                                response_schema=EVALUATION_SCHEMA,  # ✅ Enforce structured output
                                temperature=0.0,  # Deterministic evaluation
                                max_output_tokens=4096,
                            ),
                        )
                        
                        api_span.set_tag("response_received", response is not None)
                        
                        # Debug: Inspect response structure
                        if response:
                            # finish_reason is read from the first candidate when one is present
                            candidates = getattr(response, 'candidates', None) or []
                            finish_reason = getattr(candidates[0], 'finish_reason', 'N/A') if candidates else 'N/A'
                            api_span.set_tag("finish_reason", str(finish_reason))
                            api_span.set_metric("candidates_count", len(candidates))

                            print(f"🔍 Response Debug - {form_set_name} (attempt {attempt}):")
                            print(f"   - Has text: {bool(response.text) if hasattr(response, 'text') else False}")
                            print(f"   - Text length: {len(response.text) if hasattr(response, 'text') and response.text else 0}")
                            print(f"   - Finish reason: {finish_reason}")
                            print(f"   - Candidates: {len(candidates)}")

                        # Check if response has content
                        if response and response.text:
                            api_span.set_tag("response_valid", True)
                            print(f"✅ LLM Judge: Valid response for {form_set_name} (attempt {attempt})")
                            break  # Success! Exit retry loop
                        else:
                            api_span.set_tag("response_valid", False)
                            api_span.set_tag("retry_reason", "empty_response")

                            # Detailed error info for debugging
                            if response:
                                print(f"⚠️  Response object exists but no text:")
                                print(f"   - finish_reason: {finish_reason}")
                                print(f"   - candidates: {len(candidates)}")
                                if hasattr(response, 'prompt_feedback'):
                                    print(f"   - prompt_feedback: {response.prompt_feedback}")
                                print(f"   - text attr: {hasattr(response, 'text')}")
                                print(f"   - text value: {repr(response.text) if hasattr(response, 'text') else 'N/A'}")

                            if attempt < MAX_RETRIES:
                                print(f"⚠️  LLM Judge: Empty response for {form_set_name} (attempt {attempt}/{MAX_RETRIES}), retrying in {retry_delay:.1f}s...")
                                time.sleep(retry_delay)
                                retry_delay *= 2  # Exponential backoff
                            else:
                                print(f"❌ LLM Judge: Empty response for {form_set_name} after {MAX_RETRIES} attempts")
                    
                    except Exception as api_error:
                        api_span.set_tag("api_error", str(api_error))
                        
                        if attempt < MAX_RETRIES:
                            print(f"⚠️  LLM Judge: API error for {form_set_name} (attempt {attempt}/{MAX_RETRIES}): {api_error}, retrying in {retry_delay:.1f}s...")
                            time.sleep(retry_delay)
                            retry_delay *= 2
                        else:
                            print(f"❌ LLM Judge: API error for {form_set_name} after {MAX_RETRIES} attempts: {api_error}")
                            raise
            
            # Parse and validate response
            with tracer.trace("llm_judge.parse_response", service="vote-extractor") as parse_span:
                if not response or not response.text:
                    print(f"⚠️  LLM Judge: Empty response for {form_set_name}")
                    parse_span.set_tag("error", "empty_response")
                    eval_span.set_metric("score", 0.0)
                    return 0.0
                
                parse_span.set_metric("response_length", len(response.text))
                
                evaluation = json.loads(response.text)
                score = float(evaluation.get("score", 0.0))
                
                parse_span.set_metric("score", score)
                eval_span.set_metric("final_score", score)
                eval_span.set_tag("result", "success")
            
            # Optional: Print reasoning for debugging
            # print(f"LLM Judge: {form_set_name} - Score: {score:.2f}")
            # print(f"Reasoning: {evaluation.get('reasoning', 'N/A')}")
            
            return score
            
        except json.JSONDecodeError as e:
            print(f"⚠️  LLM Judge JSON Parse Error for {form_set_name}: {e}")
            print(f"    Raw response: {response.text if 'response' in locals() else 'None'}")
            eval_span.set_tag("error", "json_decode_error")
            eval_span.set_metric("score", 0.0)
            return 0.0
        except Exception as e:
            print(f"⚠️  LLM Judge Error for {form_set_name}: {e}")
            eval_span.set_tag("error", type(e).__name__)
            eval_span.set_metric("score", 0.0)
            return 0.0


print("✅ LLM-as-Judge evaluator defined:")
print("   - llm_judge_evaluator (score 0.0-1.0)")
print("   - Uses: gemini-3-pro-preview via Vertex AI")
print("   - Provides: Quality score + detailed reasoning")


✅ LLM-as-Judge evaluator defined:
   - llm_judge_evaluator (score 0.0-1.0)
   - Uses: gemini-3-pro-preview via Vertex AI
   - Provides: Quality score + detailed reasoning


In [74]:
def exact_form_match(input_data: Dict[str, Any], output_data: Dict[str, Any], expected_output: Dict[str, Any]) -> bool:
    """
    Boolean evaluator: Check if form info matches exactly.
    """
    if "error" in output_data or not output_data.get("form_info"):
        return False
    
    output_form = output_data.get("form_info", {})
    expected_form = expected_output.get("form_info", {})
    
    # Compare key fields
    return (
        output_form.get("district") == expected_form.get("district") and
        output_form.get("polling_station_number") == expected_form.get("polling_station_number")
    )


def ballot_accuracy_score(input_data: Dict[str, Any], output_data: Dict[str, Any], expected_output: Dict[str, Any]) -> float:
    """
    Score evaluator: Calculate ballot statistics accuracy (0.0 to 1.0).
    """
    if "error" in output_data or not output_data.get("ballot_statistics"):
        return 0.0
    
    output_ballots = output_data.get("ballot_statistics", {})
    expected_ballots = expected_output.get("ballot_statistics", {})
    
    # Compare key ballot counts
    fields = ["ballots_used", "good_ballots", "bad_ballots", "no_vote_ballots"]
    matches = 0
    total = 0
    
    for field in fields:
        if field in expected_ballots:
            total += 1
            if output_ballots.get(field) == expected_ballots.get(field):
                matches += 1
    
    return matches / total if total > 0 else 0.0


def vote_results_quality(input_data: Dict[str, Any], output_data: Dict[str, Any], expected_output: Dict[str, Any]) -> str:
    """
    Categorical evaluator: Assess vote results quality (excellent/good/fair/poor).
    """
    if "error" in output_data or not output_data.get("vote_results"):
        return "poor"
    
    output_votes = output_data.get("vote_results", [])
    expected_votes = expected_output.get("vote_results", [])
    
    # Count matching candidates
    if len(expected_votes) == 0:
        return "poor"
    
    matches = 0
    for exp_vote in expected_votes:
        for out_vote in output_votes:
            if (out_vote.get("number") == exp_vote.get("number") and
                out_vote.get("vote_count") == exp_vote.get("vote_count")):
                matches += 1
                break
    
    accuracy = matches / len(expected_votes)
    
    if accuracy >= 0.95:
        return "excellent"
    elif accuracy >= 0.80:
        return "good"
    elif accuracy >= 0.60:
        return "fair"
    else:
        return "poor"


def has_no_errors(input_data: Dict[str, Any], output_data: Dict[str, Any], expected_output: Dict[str, Any]) -> bool:
    """
    Boolean evaluator: Check if extraction completed without errors.
    """
    return "error" not in output_data

print("✅ Evaluators defined:")
print("   - exact_form_match (boolean)")
print("   - ballot_accuracy_score (score)")
print("   - vote_results_quality (categorical)")
print("   - has_no_errors (boolean)")


✅ Evaluators defined:
   - exact_form_match (boolean)
   - ballot_accuracy_score (score)
   - vote_results_quality (categorical)
   - has_no_errors (boolean)
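
To make the evaluator contract concrete, the sketch below runs a simplified copy of the ballot scoring logic on made-up records. `field_accuracy` is a hypothetical stand-in for `ballot_accuracy_score`, and the ballot counts are invented for illustration; they do not come from the dataset.

```python
from typing import Any, Dict, List

BALLOT_FIELDS: List[str] = ["ballots_used", "good_ballots", "bad_ballots", "no_vote_ballots"]


def field_accuracy(output: Dict[str, Any], expected: Dict[str, Any]) -> float:
    """Fraction of expected ballot fields that the output reproduces exactly."""
    checked = [f for f in BALLOT_FIELDS if f in expected]
    if not checked:
        return 0.0
    hits = sum(1 for f in checked if output.get(f) == expected[f])
    return hits / len(checked)


# Toy values, invented for illustration only.
expected = {"ballots_used": 500, "good_ballots": 480, "bad_ballots": 15, "no_vote_ballots": 5}
extracted = {"ballots_used": 500, "good_ballots": 480, "bad_ballots": 51, "no_vote_ballots": 5}

score = field_accuracy(extracted, expected)
print(f"ballot accuracy: {score:.2f}")  # 3 of the 4 fields match
```

Only fields present in the expected output count toward the denominator, so a record with partial ground truth is not penalised for the fields it lacks.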


### 4.4 Define Summary Evaluators (Optional)

Summary evaluators aggregate results across all records.


In [75]:
def overall_accuracy(inputs: List[Any], outputs: List[Any], expected_outputs: List[Any], evaluators_results: Dict[str, List]) -> float:
    """
    Summary evaluator: Calculate overall accuracy across all records.
    """
    form_matches = evaluators_results.get("exact_form_match", [])
    if not form_matches:
        return 0.0
    
    return form_matches.count(True) / len(form_matches)


def success_rate(inputs: List[Any], outputs: List[Any], expected_outputs: List[Any], evaluators_results: Dict[str, List]) -> float:
    """
    Summary evaluator: Calculate percentage of records processed without errors.
    """
    no_errors = evaluators_results.get("has_no_errors", [])
    if not no_errors:
        return 0.0
    
    return no_errors.count(True) / len(no_errors)


def avg_ballot_accuracy(inputs: List[Any], outputs: List[Any], expected_outputs: List[Any], evaluators_results: Dict[str, List]) -> float:
    """
    Summary evaluator: Average ballot accuracy score across all records.
    """
    scores = evaluators_results.get("ballot_accuracy_score", [])
    if not scores:
        return 0.0
    
    return sum(scores) / len(scores)


def avg_llm_judge_score(inputs: List[Any], outputs: List[Any], expected_outputs: List[Any], evaluators_results: Dict[str, List]) -> float:
    """
    Summary evaluator: Average LLM Judge quality score across all records.
    """
    scores = evaluators_results.get("llm_judge_evaluator", [])
    if not scores:
        return 0.0
    
    return sum(scores) / len(scores)

print("✅ Summary evaluators defined:")
print("   - overall_accuracy (float)")
print("   - success_rate (float)")
print("   - avg_ballot_accuracy (float)")
print("   - avg_llm_judge_score (float) ⭐ NEW!")


✅ Summary evaluators defined:
   - overall_accuracy (float)
   - success_rate (float)
   - avg_ballot_accuracy (float)
   - avg_llm_judge_score (float) ⭐ NEW!
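
A summary evaluator receives `evaluators_results`, a dict that maps each evaluator name to the list of its per-record values. The sketch below shows that shape with invented values. It also assumes the LLM judge may yield `None` for a record it could not score, which this notebook does not confirm; `safe_mean` is a hypothetical helper that skips such entries instead of failing inside `sum()`.

```python
from typing import Dict, List, Optional


def safe_mean(values: List[Optional[float]]) -> float:
    """Average the numeric entries, ignoring None; 0.0 if nothing is left."""
    numeric = [v for v in values if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return sum(numeric) / len(numeric) if numeric else 0.0


# Toy per-record results keyed by evaluator name (values invented).
evaluators_results: Dict[str, list] = {
    "exact_form_match": [True, True, False, True],
    "ballot_accuracy_score": [1.0, 0.75, 0.5, 1.0],
    "llm_judge_evaluator": [0.9, None, 0.7, 0.8],  # None: judge gave no score (assumed case)
}

form_matches = evaluators_results["exact_form_match"]
overall = form_matches.count(True) / len(form_matches)
avg_ballot = safe_mean(evaluators_results["ballot_accuracy_score"])
avg_judge = safe_mean(evaluators_results["llm_judge_evaluator"])

print(f"overall_accuracy: {overall:.2%}")
print(f"avg_ballot_accuracy: {avg_ballot:.4f}")
print(f"avg_llm_judge_score: {avg_judge:.4f}")
```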


### 4.5 Create and Run Experiment

Now we'll create an experiment and run it against our dataset.


In [76]:
# Create experiment
experiment = LLMObs.experiment(
    name="vote-extraction-baseline",
    task=vote_extraction_task,
    dataset=experiment_dataset,
    evaluators=[
        exact_form_match,
        ballot_accuracy_score,
        vote_results_quality,
        has_no_errors,
        llm_judge_evaluator  # ⭐ NEW: LLM-as-Judge quality assessment
    ],
    summary_evaluators=[
        overall_accuracy,
        success_rate,
        avg_ballot_accuracy,
        avg_llm_judge_score  # ⭐ NEW: Average LLM judge score
    ],
    description="Baseline evaluation of vote extraction accuracy with LLM-as-Judge",
    config={
        "model": "gemini-2.5-flash",
        "temperature": 0.0,
        "version": "1.0"
    },
)

print(f"✅ Experiment created: {experiment.name}")
print(f"   Dataset: {dataset_name}")
print(f"   Records: {len(experiment_dataset)}")
print(f"   Evaluators: {len(experiment._evaluators)} (includes LLM Judge)")
print(f"   Summary Evaluators: {len(experiment._summary_evaluators)}")
# The URL ends in "None" until experiment.run() has assigned an experiment ID
print(f"\n📊 View in Datadog: {experiment.url}")


✅ Experiment created: vote-extraction-baseline
   Dataset: vote-extraction-bangbamru-1-10
   Records: 11
   Evaluators: 5 (includes LLM Judge)
   Summary Evaluators: 4

📊 View in Datadog: https://app.datadoghq.com/llm/experiments/None


### 4.6 Run Experiment

Run the experiment with various options.


In [77]:
# Option 1: Run on a 10-record sample with 2 parallel jobs, failing fast on errors
# (omit sample_size to run on all records)
print("🚀 Running experiment on all records...")
print("⏱️  This may take several minutes depending on dataset size...")

results = experiment.run(
    sample_size=10,
    jobs=2,
    raise_errors=True
)

# Option 2: Test on a sample (for faster iteration)
# results = experiment.run(sample_size=3)

# Option 3: Parallel processing (faster execution)
# results = experiment.run(jobs=4)

# Option 4: Stop on first error (for debugging)
# results = experiment.run(raise_errors=True)

print("\n✅ Experiment completed!")
print(f"   Total records processed: {len(results.get('rows', []))}")


🚀 Running experiment on all records...
⏱️  This may take several minutes depending on dataset size...
Processing: บางบำหรุ4 (6 pages)
Processing: บางบำหรุ1 (6 pages)
Processing: บางบำหรุ5 (6 pages)
Processing: บางบำหรุ2 (6 pages)
Processing: บางบำหรุ3 (6 pages)
Processing: บางบำหรุ4 (6 pages)
Processing: บางบำหรุ9 (6 pages)
Processing: บางบำหรุ7 (6 pages)
Processing: บางบำหรุ6 (6 pages)
Processing: บางบำหรุ10 (6 pages)
🔍 Response Debug - บางบำหรุ4 (attempt 1):
   - Has text: True
   - Text length: 1709
   - Finish reason: N/A
   - Candidates: 1
✅ LLM Judge: Valid response for บางบำหรุ4 (attempt 1)
🔍 Response Debug - บางบำหรุ1 (attempt 1):
   - Has text: True
   - Text length: 995
   - Finish reason: N/A
   - Candidates: 1
✅ LLM Judge: Valid response for บาง...

In [14]:
print(experiment.url)

https://app.datadoghq.com/llm/experiments/edd06b7d-bb70-47ef-ae67-41ca9dc226ff


### 4.7 View and Analyze Results

Process and display experiment results.


In [20]:
# Display summary statistics
print("📊 Experiment Results Summary")
print("=" * 80)

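# Summary evaluators results
# The run recorded below printed no summary metrics, so "summary_evaluators" was
# not a key of `results`. Assumption: ddtrace exposes them as "summary_evaluations",
# each entry shaped like {"value": ..., "error": ...}.
summary = results.get("summary_evaluations") or results.get("summary_evaluators") or {}
if summary:
    print("\n🎯 Summary Metrics:")
    for metric_name, metric in summary.items():
        metric_value = metric.get("value") if isinstance(metric, dict) else metric
        if isinstance(metric_value, float):
            print(f"   {metric_name}: {metric_value:.2%}")
        else:
            print(f"   {metric_name}: {metric_value}")

# Per-record results
print("\n📄 Per-Record Results:")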
print("-" * 80)

for i, row in enumerate(results.get("rows", [])[:5], 1):  # Show first 5 records
    print(f"\n{i}. Record {row.get('idx', i)}:")
    
    # Input info
    input_data = row.get("input", {})
    form_name = input_data.get("form_set_name", "Unknown")
    print(f"   Form: {form_name}")
    
    # Evaluations
    evaluations = row.get("evaluations", {})
    for eval_name, eval_result in evaluations.items():
        value = eval_result.get("value")
        if isinstance(value, float):
            print(f"   {eval_name}: {value:.2%}")
        else:
            print(f"   {eval_name}: {value}")
    
    # Errors
    error = row.get("error", {})
    if error.get("message"):
        print(f"   ⚠️ Error: {error.get('message')}")

if len(results.get("rows", [])) > 5:
    print(f"\n... and {len(results.get('rows', [])) - 5} more records")

print("\n\n🔗 View full results in Datadog:")
print(f"   {experiment.url}")


📊 Experiment Results Summary

📄 Per-Record Results:
--------------------------------------------------------------------------------

1. Record 0:
   Form: บางบำหรุ1
   exact_form_match: True
   ballot_accuracy_score: 100.00%
   vote_results_quality: excellent
   has_no_errors: True

2. Record 1:
   Form: บางบำหรุ5
   exact_form_match: True
   ballot_accuracy_score: 100.00%
   vote_results_quality: excellent
   has_no_errors: True

3. Record 2:
   Form: บางบำหรุ2
   exact_form_match: True
   ballot_accuracy_score: 100.00%
   vote_results_quality: excellent
   has_no_errors: True

4. Record 3:
   Form: บางบำหรุ3
   exact_form_match: True
   ballot_accuracy_score: 100.00%
   vote_results_quality: excellent
   has_no_errors: True

5. Record 4:
   Form: บางบำหรุ4
   exact_form_match: True
   ballot_accuracy_score: 100.00%
   vote_results_quality: excellent
   has_no_errors: True

... and 5 more records
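
The recorded output above contains no "Summary Metrics" block, which suggests the key the cell looked up was not present in `results`. The sketch below is one hedged way to read summary values without committing to a single key. `read_summary` is a hypothetical helper; `"summary_evaluators"` and `"summary_metrics"` are the keys this notebook reads, while `"summary_evaluations"` and the `{"value": ..., "error": ...}` entry shape are assumptions about ddtrace's result dict.

```python
from typing import Any, Dict

# Candidate keys, tried in order (see the assumptions stated above).
SUMMARY_KEYS = ("summary_evaluations", "summary_evaluators", "summary_metrics")


def read_summary(results: Dict[str, Any]) -> Dict[str, Any]:
    """Return {evaluator_name: value} from whichever summary key is populated."""
    for key in SUMMARY_KEYS:
        section = results.get(key)
        if section:
            return {
                name: entry.get("value") if isinstance(entry, dict) else entry
                for name, entry in section.items()
            }
    return {}


# Stand-in result dict (shape and values are illustrative, not real output).
fake_results = {
    "rows": [],
    "summary_evaluations": {
        "overall_accuracy": {"value": 1.0, "error": None},
        "success_rate": {"value": 0.9, "error": None},
    },
}

for name, value in read_summary(fake_results).items():
    print(f"{name}: {value:.2%}")
```

Printing `sorted(results.keys())` once after `experiment.run()` is a quick way to confirm which key your installed ddtrace version actually returns.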




---

## 5. Model Comparison Experiments

Let's run experiments with different Gemini models to compare performance on vote extraction tasks.

**Models to Test**:
- `gemini-2.5-flash` - Fast, cost-effective (baseline)
- `gemini-2.5-flash-lite` - Ultra-fast, lower cost
- `gemini-3-pro-preview` - Most capable, higher cost

**Optimized Parameters for Data Extraction**:
- `temperature=0.0` - Deterministic output (best for structured data)
- `temperature=0.1` - Slightly more varied (testing tolerance)
- `sample_size=10` - Evaluates 10 records per run (the dataset loaded in 4.5 reports 11)
- `jobs=2` - Parallel processing (balanced for API rate limits)
- `raise_errors=True` - Fail fast for debugging
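
The four runs in this section can also be described as data, which keeps model, temperature, and experiment name in sync. The sketch below is illustrative only: `ModelRun` and its properties are hypothetical names, and the naming rule is inferred from the experiment names used in 5.1 to 5.4.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelRun:
    model: str
    temperature: float

    @property
    def temp_tag(self) -> str:
        # 0.0 -> "t0", 0.1 -> "t01", mirroring the names used in this section
        digits = str(self.temperature).replace(".", "").rstrip("0")
        return f"t{digits or '0'}"

    @property
    def experiment_name(self) -> str:
        return f"vote-extraction-{self.model}-{self.temp_tag}"


RUNS: List[ModelRun] = [
    ModelRun("gemini-2.5-flash", 0.0),
    ModelRun("gemini-2.5-flash-lite", 0.0),
    ModelRun("gemini-3-pro-preview", 0.0),
    ModelRun("gemini-2.5-flash", 0.1),
]

for run in RUNS:
    print(run.experiment_name)
```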


### 5.1 Experiment 1: gemini-2.5-flash (Baseline, Temperature 0.0)


In [None]:
# Experiment 1: gemini-2.5-flash with temperature 0.0 (deterministic)
print("=" * 80)
print("🧪 Experiment 1: gemini-2.5-flash (temperature=0.0)")
print("=" * 80)

experiment_flash_t0 = LLMObs.experiment(
    name="vote-extraction-gemini-2.5-flash-t0",
    task=vote_extraction_task,
    dataset=experiment_dataset,
    evaluators=[
        exact_form_match,
        ballot_accuracy_score,
        vote_results_quality,
        has_no_errors
    ],
    summary_evaluators=[
        overall_accuracy,
        success_rate,
        avg_ballot_accuracy
    ],
    config={  # passed as `config`, matching the baseline experiment in 4.5
        "model": "gemini-2.5-flash",
        "temperature": 0.0,
        "purpose": "Baseline - deterministic extraction",
        "cost_tier": "medium"
    }
)

print(f"✅ Created: {experiment_flash_t0.name}")
print(f"📊 View: {experiment_flash_t0.url}")

# Run experiment
print("\n🚀 Running experiment...")
results_flash_t0 = experiment_flash_t0.run(
    sample_size=10,
    jobs=2,
    raise_errors=True
)

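print(f"\n✅ Completed! Processed {len(results_flash_t0.get('rows', []))} records")
print("📈 Summary Metrics:")
# Summary key assumed to be 'summary_evaluations'; the original 'summary_metrics' is kept as a fallback
summary = results_flash_t0.get('summary_evaluations') or results_flash_t0.get('summary_metrics') or {}
for key, value in summary.items():
    print(f"   - {key}: {value}")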


### 5.2 Experiment 2: gemini-2.5-flash-lite (Ultra-Fast, Temperature 0.0)


In [None]:
# Experiment 2: gemini-2.5-flash-lite (ultra-fast, lower cost)
print("=" * 80)
print("🧪 Experiment 2: gemini-2.5-flash-lite (temperature=0.0)")
print("=" * 80)

experiment_flash_lite = LLMObs.experiment(
    name="vote-extraction-gemini-2.5-flash-lite-t0",
    task=vote_extraction_task,
    dataset=experiment_dataset,
    evaluators=[
        exact_form_match,
        ballot_accuracy_score,
        vote_results_quality,
        has_no_errors
    ],
    summary_evaluators=[
        overall_accuracy,
        success_rate,
        avg_ballot_accuracy
    ],
    config={  # passed as `config`, matching the baseline experiment in 4.5
        "model": "gemini-2.5-flash-lite",
        "temperature": 0.0,
        "purpose": "Speed test - ultra-fast model",
        "cost_tier": "low"
    }
)

print(f"✅ Created: {experiment_flash_lite.name}")
print(f"📊 View: {experiment_flash_lite.url}")

# Run experiment
print("\n🚀 Running experiment...")
results_flash_lite = experiment_flash_lite.run(
    sample_size=10,
    jobs=2,
    raise_errors=True
)

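print(f"\n✅ Completed! Processed {len(results_flash_lite.get('rows', []))} records")
print("📈 Summary Metrics:")
# Summary key assumed to be 'summary_evaluations'; the original 'summary_metrics' is kept as a fallback
summary = results_flash_lite.get('summary_evaluations') or results_flash_lite.get('summary_metrics') or {}
for key, value in summary.items():
    print(f"   - {key}: {value}")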


### 5.3 Experiment 3: gemini-3-pro-preview (Most Capable, Temperature 0.0)


In [None]:
# Experiment 3: gemini-3-pro-preview (most capable, higher accuracy expected)
print("=" * 80)
print("🧪 Experiment 3: gemini-3-pro-preview (temperature=0.0)")
print("=" * 80)

experiment_pro = LLMObs.experiment(
    name="vote-extraction-gemini-3-pro-preview-t0",
    task=vote_extraction_task,
    dataset=experiment_dataset,
    evaluators=[
        exact_form_match,
        ballot_accuracy_score,
        vote_results_quality,
        has_no_errors
    ],
    summary_evaluators=[
        overall_accuracy,
        success_rate,
        avg_ballot_accuracy
    ],
    config={  # passed as `config`, matching the baseline experiment in 4.5
        "model": "gemini-3-pro-preview",
        "temperature": 0.0,
        "purpose": "Quality test - most capable model",
        "cost_tier": "high"
    }
)

print(f"✅ Created: {experiment_pro.name}")
print(f"📊 View: {experiment_pro.url}")

# Run experiment
print("\n🚀 Running experiment...")
results_pro = experiment_pro.run(
    sample_size=10,
    jobs=2,
    raise_errors=True
)

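print(f"\n✅ Completed! Processed {len(results_pro.get('rows', []))} records")
print("📈 Summary Metrics:")
# Summary key assumed to be 'summary_evaluations'; the original 'summary_metrics' is kept as a fallback
summary = results_pro.get('summary_evaluations') or results_pro.get('summary_metrics') or {}
for key, value in summary.items():
    print(f"   - {key}: {value}")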


### 5.4 Experiment 4: gemini-2.5-flash (Temperature 0.1) - Tolerance Test


In [None]:
# Experiment 4: gemini-2.5-flash with temperature 0.1 (slight variation)
print("=" * 80)
print("🧪 Experiment 4: gemini-2.5-flash (temperature=0.1)")
print("=" * 80)

experiment_flash_t01 = LLMObs.experiment(
    name="vote-extraction-gemini-2.5-flash-t01",
    task=vote_extraction_task,
    dataset=experiment_dataset,
    evaluators=[
        exact_form_match,
        ballot_accuracy_score,
        vote_results_quality,
        has_no_errors
    ],
    summary_evaluators=[
        overall_accuracy,
        success_rate,
        avg_ballot_accuracy
    ],
    config={  # passed as `config`, matching the baseline experiment in 4.5
        "model": "gemini-2.5-flash",
        "temperature": 0.1,
        "purpose": "Tolerance test - slightly more varied output",
        "cost_tier": "medium"
    }
)

print(f"✅ Created: {experiment_flash_t01.name}")
print(f"📊 View: {experiment_flash_t01.url}")

# Run experiment
print("\n🚀 Running experiment...")
results_flash_t01 = experiment_flash_t01.run(
    sample_size=10,
    jobs=2,
    raise_errors=True
)

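print(f"\n✅ Completed! Processed {len(results_flash_t01.get('rows', []))} records")
print("📈 Summary Metrics:")
# Summary key assumed to be 'summary_evaluations'; the original 'summary_metrics' is kept as a fallback
summary = results_flash_t01.get('summary_evaluations') or results_flash_t01.get('summary_metrics') or {}
for key, value in summary.items():
    print(f"   - {key}: {value}")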


### 5.5 Compare Results

Compare all experiments side-by-side to determine the best model for vote extraction.


In [None]:
# Compare all experiments
import pandas as pd


def summary_value(results, name):
    """Look up one summary evaluator value; 'N/A' if it is missing."""
    # Assumption: ddtrace stores summaries under 'summary_evaluations' as
    # {"value": ..., "error": ...}; 'summary_metrics' is kept as a fallback.
    summary = results.get('summary_evaluations') or results.get('summary_metrics') or {}
    entry = summary.get(name, 'N/A')
    return entry.get('value', 'N/A') if isinstance(entry, dict) else entry


experiments_data = [
    {
        "Experiment": "gemini-2.5-flash (T=0.0)",
        "Model": "gemini-2.5-flash",
        "Temperature": 0.0,
        "Cost Tier": "Medium",
        "Overall Accuracy": summary_value(results_flash_t0, 'overall_accuracy'),
        "Success Rate": summary_value(results_flash_t0, 'success_rate'),
        "Avg Ballot Accuracy": summary_value(results_flash_t0, 'avg_ballot_accuracy'),
        "URL": experiment_flash_t0.url
    },
    {
        "Experiment": "gemini-2.5-flash-lite (T=0.0)",
        "Model": "gemini-2.5-flash-lite",
        "Temperature": 0.0,
        "Cost Tier": "Low",
        "Overall Accuracy": summary_value(results_flash_lite, 'overall_accuracy'),
        "Success Rate": summary_value(results_flash_lite, 'success_rate'),
        "Avg Ballot Accuracy": summary_value(results_flash_lite, 'avg_ballot_accuracy'),
        "URL": experiment_flash_lite.url
    },
    {
        "Experiment": "gemini-3-pro-preview (T=0.0)",
        "Model": "gemini-3-pro-preview",
        "Temperature": 0.0,
        "Cost Tier": "High",
        "Overall Accuracy": summary_value(results_pro, 'overall_accuracy'),
        "Success Rate": summary_value(results_pro, 'success_rate'),
        "Avg Ballot Accuracy": summary_value(results_pro, 'avg_ballot_accuracy'),
        "URL": experiment_pro.url
    },
    {
        "Experiment": "gemini-2.5-flash (T=0.1)",
        "Model": "gemini-2.5-flash",
        "Temperature": 0.1,
        "Cost Tier": "Medium",
        "Overall Accuracy": summary_value(results_flash_t01, 'overall_accuracy'),
        "Success Rate": summary_value(results_flash_t01, 'success_rate'),
        "Avg Ballot Accuracy": summary_value(results_flash_t01, 'avg_ballot_accuracy'),
        "URL": experiment_flash_t01.url
    }
]

comparison_df = pd.DataFrame(experiments_data)

print("=" * 120)
print("📊 EXPERIMENT COMPARISON SUMMARY")
print("=" * 120)
print()
print(comparison_df[["Experiment", "Cost Tier", "Overall Accuracy", "Success Rate", "Avg Ballot Accuracy"]].to_string(index=False))
print()
print("=" * 120)
print("🔗 View in Datadog:")
for exp in experiments_data:
    print(f"   • {exp['Experiment']}: {exp['URL']}")
print("=" * 120)

# Identify best performing model
print("\n✨ RECOMMENDATIONS:")
print()

# Find best accuracy
best_accuracy = max(
    [e["Overall Accuracy"] for e in experiments_data if isinstance(e["Overall Accuracy"], (int, float))],
    default=0
)

# Find best cost/performance
for exp in experiments_data:
    if isinstance(exp["Overall Accuracy"], (int, float)) and exp["Overall Accuracy"] == best_accuracy:
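        print(f"🏆 BEST ACCURACY: {exp['Experiment']}")
        # Summary evaluators return fractions in [0, 1]; format them as percentages
        print(f"   - Overall Accuracy: {exp['Overall Accuracy']:.2%}")
        success = exp['Success Rate']
        print(f"   - Success Rate: {success:.2%}" if isinstance(success, (int, float)) else f"   - Success Rate: {success}")
        print(f"   - Cost Tier: {exp['Cost Tier']}")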
        break

# Find best value
flash_lite_exp = next((e for e in experiments_data if "flash-lite" in e["Experiment"]), None)
if flash_lite_exp and isinstance(flash_lite_exp["Overall Accuracy"], (int, float)):
    # overall_accuracy is a fraction in [0, 1], so the 95% bar is 0.95
    if flash_lite_exp["Overall Accuracy"] >= 0.95:
        print(f"\n💰 BEST VALUE: {flash_lite_exp['Experiment']}")
        print(f"   - Overall Accuracy: {flash_lite_exp['Overall Accuracy']:.2%} (excellent)")
        print(f"   - Cost Tier: {flash_lite_exp['Cost Tier']} (fastest, cheapest)")
        print("   - Recommendation: Use for high-volume processing")

# Temperature comparison
flash_t0 = next((e for e in experiments_data if e["Temperature"] == 0.0 and e["Model"] == "gemini-2.5-flash"), None)
flash_t01 = next((e for e in experiments_data if e["Temperature"] == 0.1 and e["Model"] == "gemini-2.5-flash"), None)
if flash_t0 and flash_t01:
    print("\n🌡️ TEMPERATURE IMPACT:")
    acc_t0, acc_t01 = flash_t0['Overall Accuracy'], flash_t01['Overall Accuracy']
    if isinstance(acc_t0, (int, float)) and isinstance(acc_t01, (int, float)):
        print(f"   - T=0.0 Accuracy: {acc_t0:.2%}")
        print(f"   - T=0.1 Accuracy: {acc_t01:.2%}")
        # Accuracies are fractions in [0, 1], so 0.05 equals five percentage points
        diff = abs(acc_t0 - acc_t01)
        if diff < 0.05:
            print(f"   - Impact: Minimal ({diff:.2%} difference)")
            print("   - Recommendation: Use T=0.0 for deterministic results")
        else:
            print(f"   - Impact: Significant ({diff:.2%} difference)")
            print("   - Recommendation: Use T=0.0 for structured data extraction")
    else:
        print(f"   - T=0.0 Accuracy: {acc_t0}")
        print(f"   - T=0.1 Accuracy: {acc_t01}")
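
Picking the single highest accuracy ignores cost. One possible refinement, sketched below under stated assumptions, is to choose the cheapest experiment whose accuracy lies within a tolerance of the best one. `pick_best_value`, `COST_RANK`, and the 0.02 tolerance are hypothetical, and the accuracy values are invented placeholders, not measured results.

```python
from typing import Dict, List

COST_RANK = {"Low": 0, "Medium": 1, "High": 2}


def pick_best_value(rows: List[Dict], tolerance: float = 0.02) -> Dict:
    """Cheapest row whose accuracy is within `tolerance` of the best accuracy."""
    scored = [r for r in rows if isinstance(r["Overall Accuracy"], (int, float))]
    if not scored:
        raise ValueError("no experiment has a numeric accuracy")
    best = max(r["Overall Accuracy"] for r in scored)
    close_enough = [r for r in scored if best - r["Overall Accuracy"] <= tolerance]
    # Prefer the lower cost tier; break ties by higher accuracy
    return min(close_enough, key=lambda r: (COST_RANK[r["Cost Tier"]], -r["Overall Accuracy"]))


# Accuracy values below are invented placeholders, not measured results.
rows = [
    {"Experiment": "gemini-2.5-flash (T=0.0)", "Cost Tier": "Medium", "Overall Accuracy": 0.98},
    {"Experiment": "gemini-2.5-flash-lite (T=0.0)", "Cost Tier": "Low", "Overall Accuracy": 0.90},
    {"Experiment": "gemini-3-pro-preview (T=0.0)", "Cost Tier": "High", "Overall Accuracy": 0.99},
]

choice = pick_best_value(rows)
print(f"{choice['Experiment']}: {choice['Overall Accuracy']:.2%}")
```

The rows use the same keys as `experiments_data`, so the function could be applied to that list directly once the experiments have run.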


### 5.6 Production Deployment Strategy

Based on experiment results, choose the best model for production deployment.


In [None]:
print("=" * 80)
print("🚀 PRODUCTION DEPLOYMENT STRATEGY")
print("=" * 80)
print()

print("📋 Decision Framework:")
print()

print("1️⃣  HIGH VOLUME / COST SENSITIVE:")
print("   Model: gemini-2.5-flash-lite")
print("   Temperature: 0.0")
print("   Rationale: Lowest cost, fastest processing")
print("   Use When: Processing thousands of forms, budget constraints")
print("   Trade-off: Slightly lower accuracy acceptable")
print()

print("2️⃣  BALANCED (RECOMMENDED):")
print("   Model: gemini-2.5-flash")
print("   Temperature: 0.0")
print("   Rationale: Best balance of cost, speed, and accuracy")
print("   Use When: Standard production workloads")
print("   Trade-off: None - optimal for most use cases")
print()

print("3️⃣  MAXIMUM QUALITY:")
print("   Model: gemini-3-pro-preview")
print("   Temperature: 0.0")
print("   Rationale: Highest accuracy, most capable")
print("   Use When: Critical data, legal/compliance requirements")
print("   Trade-off: Higher cost, slower processing")
print()

print("=" * 80)
print("🔧 IMPLEMENTATION STEPS:")
print("=" * 80)
print()

print("1. Update backend configuration:")
print("   File: services/fastapi-backend/app/config.py")
print("   ")
print("   # Set based on experiment results")
print("   DEFAULT_MODEL = 'gemini-2.5-flash'  # or your chosen model")
print("   DEFAULT_TEMPERATURE = 0.0")
print()

print("2. Deploy to Cloud Run:")
print("   ")
print("   git add -A")
print("   git commit -m 'chore: Update to optimal model from experiments'")
print("   git push origin main")
print("   ")
print("   # CI/CD will automatically deploy")
print()

print("3. Monitor in production:")
print("   - Track accuracy metrics in Datadog LLMObs")
print("   - Set up alerts for accuracy drops")
print("   - Review cost vs. performance monthly")
print()

print("4. Continuous improvement:")
print("   - Add more ground truth data to dataset")
print("   - Re-run experiments quarterly")
print("   - Test new model versions as they release")
print()

print("=" * 80)
print("📊 MONITORING CHECKLIST:")
print("=" * 80)
print()
print("✅ Set up Datadog monitors for:")
print("   • Overall accuracy threshold (e.g., < 95%)")
print("   • Success rate threshold (e.g., < 90%)")
print("   • Error rate spike (e.g., > 5%)")
print("   • Latency increase (e.g., p95 > 10s)")
print("   • Cost anomalies")
print()

print("=" * 80)
print("✨ Experiment Complete! Ready for production deployment.")
print("=" * 80)
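
The example thresholds in the monitoring checklist can be rehearsed locally before any monitors are configured. The sketch below is plain Python and does not call the Datadog monitor API; `find_breaches` and the metric names are hypothetical, and the sample values are invented.

```python
from typing import Dict, List

# Thresholds copied from the checklist's examples; tune them for your workload.
THRESHOLDS = {
    "overall_accuracy": ("min", 0.95),
    "success_rate": ("min", 0.90),
    "error_rate": ("max", 0.05),
    "latency_p95_seconds": ("max", 10.0),
}


def find_breaches(metrics: Dict[str, float]) -> List[str]:
    """Return a message for every metric that violates its threshold."""
    breaches = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported; nothing to check
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append(f"{name}={value} violates {kind} {limit}")
    return breaches


# Invented sample values for illustration.
sample = {
    "overall_accuracy": 0.97,
    "success_rate": 0.85,
    "error_rate": 0.02,
    "latency_p95_seconds": 12.5,
}
for message in find_breaches(sample):
    print(message)
```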


---

## 6. Wrapper Function: Easy Experiment Configuration

Create a reusable wrapper function for running multiple experiments with custom configurations.
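
One detail to watch in the wrapper below: when a config omits `name_suffix`, the fallback keeps only the last hyphen-separated token of the model name and truncates the temperature to one decimal digit, so distinct configurations can map to the same experiment name. The sketch reproduces that fallback expression and compares it with a hypothetical alternative, `explicit_suffix`, which is an illustration rather than part of the notebook.

```python
def default_suffix(model: str, temperature: float) -> str:
    # Same expression as the wrapper's fallback for a missing "name_suffix"
    return f"{model.split('-')[-1]}-t{int(temperature * 10)}"


def explicit_suffix(model: str, temperature: float) -> str:
    # Hypothetical alternative: keep the full model name and two decimals
    return f"{model}-t{temperature:.2f}"


configs = [
    ("gemini-2.5-flash", 0.0),
    ("gemini-2.5-flash", 0.05),
    ("gemini-2.5-flash-lite", 0.0),
]

defaults = [default_suffix(m, t) for m, t in configs]
explicit = [explicit_suffix(m, t) for m, t in configs]

print(defaults)
print(explicit)
print("default suffixes unique:", len(set(defaults)) == len(defaults))
```

Passing an explicit `name_suffix` for every entry in `model_configs` avoids the issue without changing the wrapper.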


In [None]:
import os  # needed for os.getenv("DD_API_KEY") below
from ddtrace.llmobs import LLMObs
from typing import Dict, Any, Optional, List, Callable

def run_model_experiments(
    # LLMObs Configuration
    ml_app: str = "vote-extractor",
    api_key: Optional[str] = None,
    site: str = "datadoghq.com",
    agentless_enabled: bool = True,
    project_name: str = "vote-extraction-project",
    
    # Dataset Configuration
    dataset_name: str = "vote-extraction-bangbamru-1-10",
    dataset_version: Optional[int] = None,
    
    # Models and Temperatures to Test
    model_configs: Optional[List[Dict[str, Any]]] = None,
    
    # Task Function
    task_function: Optional[Callable] = None,
    
    # Evaluators
    evaluators: Optional[List[Callable]] = None,
    summary_evaluators: Optional[List[Callable]] = None,
    
    # Run Configuration
    sample_size: Optional[int] = None,
    jobs: int = 2,
    raise_errors: bool = True,
    
    # Options
    show_comparison: bool = True,
    return_results: bool = True
) -> Dict[str, Any]:
    """
    Run multiple LLM experiments with different model configurations.
    
    Args:
        ml_app: Datadog LLMObs application name
        api_key: Datadog API key (defaults to DD_API_KEY env var)
        site: Datadog site (e.g., datadoghq.com, datadoghq.eu)
        agentless_enabled: Enable agentless mode
        project_name: Datadog LLMObs project name
        
        dataset_name: Name of the dataset to load from Datadog
        dataset_version: Specific version to use (defaults to latest)
        
        model_configs: List of model configurations to test. Each dict should have:
                       - model: str (model name)
                       - temperature: float (0.0-1.0)
                       - name_suffix: str (optional, for experiment naming)
                       - metadata: dict (optional, extra metadata)
        
        task_function: Task function to use (defaults to vote_extraction_task)
        evaluators: List of evaluator functions (defaults to standard set)
        summary_evaluators: List of summary evaluator functions
        
        sample_size: Number of records to test (None = all records)
        jobs: Number of parallel jobs
        raise_errors: Stop on first error
        
        show_comparison: Print comparison table at the end
        return_results: Return experiment results dictionary
    
    Returns:
        Dictionary with experiment results and comparison data
    
    Example:
        >>> results = run_model_experiments(
        ...     model_configs=[
        ...         {"model": "gemini-2.5-flash", "temperature": 0.0},
        ...         {"model": "gemini-2.5-flash-lite", "temperature": 0.0},
        ...         {"model": "gemini-3-pro-preview", "temperature": 0.0}
        ...     ],
        ...     sample_size=10,
        ...     jobs=2
        ... )
    """
    from ddtrace.llmobs import LLMObs
    import pandas as pd
    
    # Initialize LLMObs
    print("=" * 80)
    print("🔧 INITIALIZING DATADOG LLMOBS")
    print("=" * 80)
    
    if api_key is None:
        api_key = os.getenv("DD_API_KEY")
    
    if not api_key:
        raise ValueError("DD_API_KEY not found in environment or parameters")
    
    try:
        LLMObs.enable(
            ml_app=ml_app,
            api_key=api_key,
            site=site,
            agentless_enabled=agentless_enabled,
            project_name=project_name,
        )
        print("✅ LLMObs enabled")
        print(f"   App: {ml_app}")
        print(f"   Site: {site}")
        print(f"   Project: {project_name}")
    except Exception as e:
        print(f"⚠️  LLMObs already enabled or error: {e}")
    
    # Load dataset
    print(f"\n📊 Loading dataset: {dataset_name}")
    dataset = LLMObs.pull_dataset(
        dataset_name=dataset_name,
        project_name=project_name,
        version=dataset_version
    )
    print(f"✅ Dataset loaded: {len(dataset)} records")
    
    # Extract dataset ID from URL for comparison link
    dataset_id = None
    try:
        # Dataset URL format: https://app.datadoghq.com/llm/datasets/{dataset_id}
        if hasattr(dataset, 'url') and dataset.url:
            dataset_id = dataset.url.split('/datasets/')[-1]
            print(f"   Dataset ID: {dataset_id}")
    except Exception as e:
        print(f"   ⚠️  Could not extract dataset ID: {e}")
    
    # Default model configurations
    if model_configs is None:
        model_configs = [
            {"model": "gemini-2.5-flash", "temperature": 0.0, "name_suffix": "flash-t0"},
            {"model": "gemini-2.5-flash-lite", "temperature": 0.0, "name_suffix": "flash-lite-t0"},
            {"model": "gemini-3-pro-preview", "temperature": 0.0, "name_suffix": "pro-t0"},
        ]
    
    # Default task function
    if task_function is None:
        if 'vote_extraction_task' not in globals():
            raise ValueError("task_function not provided and vote_extraction_task not defined")
        task_function = vote_extraction_task
    
    # Default evaluators
    if evaluators is None:
        evaluators = [exact_form_match, ballot_accuracy_score, vote_results_quality, has_no_errors, llm_judge_evaluator]  # ⭐ Added LLM Judge
    
    if summary_evaluators is None:
        summary_evaluators = [overall_accuracy, success_rate, avg_ballot_accuracy, avg_llm_judge_score]  # ⭐ Added LLM Judge Score
    
    # Run experiments
    print(f"\n{'=' * 80}")
    print(f"🚀 RUNNING {len(model_configs)} EXPERIMENTS")
    print(f"{'=' * 80}")
    print(f"   Sample Size: {sample_size or 'All records'}")
    print(f"   Parallel Jobs: {jobs}")
    print(f"   Raise Errors: {raise_errors}")
    print()
    
    all_results = []
    
    for i, config in enumerate(model_configs, 1):
        model = config.get("model")
        temperature = config.get("temperature", 0.0)
        name_suffix = config.get("name_suffix", f"{model.split('-')[-1]}-t{int(temperature*10)}")
        extra_metadata = config.get("metadata", {})
        
        print(f"\n{'─' * 80}")
        print(f"🧪 Experiment {i}/{len(model_configs)}: {model} (T={temperature})")
        print(f"{'─' * 80}")
        
        # Create experiment
        experiment_name = f"vote-extraction-{name_suffix}"
        
        # Prepare tags (combine model, temperature, and extra metadata)
        tags = {
            "model": model,
            "temperature": str(temperature),
            **{k: str(v) for k, v in extra_metadata.items()}
        }
        
        experiment = LLMObs.experiment(
            name=experiment_name,
            task=task_function,
            dataset=dataset,
            evaluators=evaluators,
            summary_evaluators=summary_evaluators,
            tags=tags
        )
        
        print(f"✅ Created: {experiment.name}")
        print(f"📊 View: {experiment.url}")
        
        # Run experiment
        print("⏱️  Running...")
        try:
            results = experiment.run(
                sample_size=sample_size,
                jobs=jobs,
                raise_errors=raise_errors
            )
            
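            print(f"✅ Completed! Processed {len(results.get('rows', []))} records")
            
            # Normalize summary values. Assumption: ddtrace reports them under
            # 'summary_evaluations' as {"value": ..., "error": ...}; the original
            # 'summary_metrics' key is kept as a fallback.
            raw_summary = results.get('summary_evaluations') or results.get('summary_metrics') or {}
            summary_metrics = {
                key: (entry.get('value') if isinstance(entry, dict) else entry)
                for key, entry in raw_summary.items()
            }
            
            # Collect results
            all_results.append({
                "experiment_name": experiment_name,
                "model": model,
                "temperature": temperature,
                "sample_size": len(results.get('rows', [])),
                "summary_metrics": summary_metrics,
                "url": experiment.url,
                "status": "success"
            })
            
            # Print summary metrics
            if summary_metrics:
                print("📈 Summary Metrics:")
                for key, value in summary_metrics.items():
                    print(f"   - {key}: {value}")
        
        except Exception as e:
            print(f"❌ Error: {e}")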
            all_results.append({
                "experiment_name": experiment_name,
                "model": model,
                "temperature": temperature,
                "sample_size": 0,
                "summary_metrics": {},
                "url": experiment.url,
                "status": "failed",
                "error": str(e)
            })
            
            if raise_errors:
                raise
    
    # Show comparison
    if show_comparison and all_results:
        print(f"\n{'=' * 120}")
        print("📊 EXPERIMENT COMPARISON")
        print(f"{'=' * 120}\n")
        
        # Create comparison DataFrame
        comparison_data = []
        for result in all_results:
            metrics = result['summary_metrics']
            comparison_data.append({
                "Experiment": result['experiment_name'],
                "Model": result['model'],
                "Temperature": result['temperature'],
                "Status": result['status'],
                "Records": result['sample_size'],
                "Overall Accuracy": metrics.get('overall_accuracy', 'N/A'),
                "Success Rate": metrics.get('success_rate', 'N/A'),
                "Avg Ballot Accuracy": metrics.get('avg_ballot_accuracy', 'N/A'),
                "Avg LLM Judge Score": metrics.get('avg_llm_judge_score', 'N/A'),  # ⭐ NEW
            })
        
        df = pd.DataFrame(comparison_data)
        print(df.to_string(index=False))
        
        print(f"\n{'=' * 120}")
        print("🔗 View in Datadog:")
        for result in all_results:
            status_icon = "✅" if result['status'] == "success" else "❌"
            print(f"   {status_icon} {result['experiment_name']}: {result['url']}")
        print(f"{'=' * 120}")
        
        # Generate comparison URL if dataset_id is available
        if dataset_id:
            comparison_url = f"https://app.datadoghq.com/llm/experiments?dataset={dataset_id}&project={project_name}"
            print("\n🔍 Compare all experiments side-by-side:")
            print(f"   {comparison_url}")
            print(f"{'=' * 120}\n")
        else:
            print()
        
        # Best performing
        successful_results = [r for r in all_results if r['status'] == 'success']
        if successful_results:
            best_accuracy = max(
                [r['summary_metrics'].get('overall_accuracy', 0) for r in successful_results],
                default=0
            )
            
            if best_accuracy > 0:
                best_exp = next(
                    (r for r in successful_results if r['summary_metrics'].get('overall_accuracy') == best_accuracy),
                    None
                )
                if best_exp:
                    print("🏆 BEST PERFORMER:")
                    print(f"   Model: {best_exp['model']}")
                    print(f"   Temperature: {best_exp['temperature']}")
                    print(f"   Overall Accuracy: {best_accuracy}%")
                    print()
    
    # Return results
    if return_results:
        result_dict = {
            "experiments": all_results,
            "total_experiments": len(all_results),
            "successful_experiments": len([r for r in all_results if r['status'] == 'success']),
            "failed_experiments": len([r for r in all_results if r['status'] == 'failed']),
            "dataset_name": dataset_name,
            "dataset_size": len(dataset),
            "project_name": project_name
        }
        
        # Add comparison URL if available
        if dataset_id:
            result_dict["comparison_url"] = f"https://app.datadoghq.com/llm/experiments?dataset={dataset_id}&project={project_name}"
            result_dict["dataset_id"] = dataset_id
        
        return result_dict
    
    return None


# Print function signature help
print("✅ Wrapper function defined: run_model_experiments()")
print("\n📖 Quick Usage:")
print("""
results = run_model_experiments(
    model_configs=[
        {"model": "gemini-2.5-flash", "temperature": 0.0},
        {"model": "gemini-2.5-flash-lite", "temperature": 0.0},
    ],
    sample_size=10,
    jobs=2,
    raise_errors=True
)
""")


✅ Wrapper function defined: run_model_experiments()

📖 Quick Usage:

results = run_model_experiments(
    model_configs=[
        {"model": "gemini-2.5-flash", "temperature": 0.0},
        {"model": "gemini-2.5-flash-lite", "temperature": 0.0},
    ],
    sample_size=10,
    jobs=2,
    raise_errors=True
)



### 6.1 Example 1: Run with Default Configuration

The simplest usage: run the three default models at temperature 0.0.


In [32]:
# Example 1: Use defaults (3 models: flash, flash-lite, pro-preview at T=0.0)
results = run_model_experiments(
    sample_size=10,
    jobs=2,
    raise_errors=True
)

# Results include comparison and best performer
print(f"\n📊 Summary:")
print(f"   Total experiments: {results['total_experiments']}")
print(f"   Successful: {results['successful_experiments']}")
print(f"   Failed: {results['failed_experiments']}")

# Access comparison URL for side-by-side view
if 'comparison_url' in results:
    print(f"\n🔍 Compare all experiments:")
    print(f"   {results['comparison_url']}")


🔧 INITIALIZING DATADOG LLMOBS
✅ LLMObs enabled
   App: vote-extractor
   Site: datadoghq.com
   Project: vote-extraction-project

📊 Loading dataset: vote-extraction-bangbamru-1-10
✅ Dataset loaded: 11 records
   Dataset ID: 241bfded-e79d-4d2d-bbc4-a74bb06d85f9

🚀 RUNNING 3 EXPERIMENTS
   Sample Size: 10
   Parallel Jobs: 2
   Raise Errors: True


────────────────────────────────────────────────────────────────────────────────
🧪 Experiment 1/3: gemini-2.5-flash (T=0.0)
────────────────────────────────────────────────────────────────────────────────
✅ Created: vote-extraction-flash-t0
📊 View: https://app.datadoghq.com/llm/experiments/None
⏱️  Runnin

KeyboardInterrupt: 

❌ Error processing บางบำหรุ10: Server disconnected without sending a response.
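
The run above stopped on a dropped connection ("Server disconnected without sending a response"). One common way to make a task more tolerant of such transient failures is to retry the call with exponential backoff. The helper below is a minimal sketch, not part of this notebook's code: `call_with_retries` and its parameters are hypothetical names, and the exception types to retry on are an assumption (with `httpx`, `httpx.TransportError` would be a likely candidate).

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying transient errors with backoff.

    Hypothetical helper: `retry_on` lists the exception types treated as
    transient; anything else propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2x, then 4x ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside a task function, the API call could then be wrapped as `call_with_retries(send_request, payload)`, where `send_request` stands for whatever function performs the request.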


### 6.2 Example 2: Custom Model Configurations

Test specific models at different temperatures, with a custom name suffix and metadata for each configuration.


In [None]:
# Example 2: Custom model configurations with different temperatures
results = run_model_experiments(
    model_configs=[
        {
            "model": "gemini-2.5-flash",
            "temperature": 0.0,
            "name_suffix": "flash-deterministic",
            "metadata": {"purpose": "Production baseline", "cost_tier": "medium"}
        },
        {
            "model": "gemini-2.5-flash",
            "temperature": 0.1,
            "name_suffix": "flash-tolerant",
            "metadata": {"purpose": "Tolerance test", "cost_tier": "medium"}
        },
        {
            "model": "gemini-2.5-flash",
            "temperature": 0.2,
            "name_suffix": "flash-varied",
            "metadata": {"purpose": "Variation test", "cost_tier": "medium"}
        },
        {
            "model": "gemini-2.5-flash-lite",
            "temperature": 0.0,
            "name_suffix": "lite-speed",
            "metadata": {"purpose": "High-volume test", "cost_tier": "low"}
        }
    ],
    sample_size=10,
    jobs=2,
    raise_errors=False  # Continue even if one fails
)
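
When the list of configurations grows, writing each dictionary by hand gets repetitive. The sketch below builds the same kind of `model_configs` list from a set of models and temperatures. `build_model_configs` is a hypothetical helper, and the `name_suffix` format it generates is an illustrative assumption rather than a convention required by `run_model_experiments()`.

```python
from itertools import product


def build_model_configs(models, temperatures, metadata=None):
    """Build one config dict per (model, temperature) combination."""
    configs = []
    for model, temperature in product(models, temperatures):
        # Drop the "gemini-<version>-" prefix, e.g. "gemini-2.5-flash-lite" -> "flash-lite"
        short_name = model.split("-", 2)[-1]
        config = {
            "model": model,
            "temperature": temperature,
            # Assumed naming scheme, for illustration only
            "name_suffix": f"{short_name}-t{temperature}",
        }
        if metadata:
            config["metadata"] = dict(metadata)  # copy so configs stay independent
        configs.append(config)
    return configs


model_configs = build_model_configs(
    models=["gemini-2.5-flash", "gemini-2.5-flash-lite"],
    temperatures=[0.0, 0.1],
)
for config in model_configs:
    print(config)
```

The resulting list can be passed as `model_configs=model_configs` in place of the hand-written list.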


### 6.3 Example 3: Full Configuration with Custom Settings

Advanced usage with all configuration options


In [31]:
# Example 3: Full configuration with all options
results = run_model_experiments(
    # LLMObs configuration
    ml_app="vote-extractor-advanced",
    api_key=os.getenv("DD_API_KEY"),  # Or pass directly
    site="datadoghq.com",
    agentless_enabled=True,
    project_name="vote-extraction-project",
    
    # Dataset configuration
    dataset_name="vote-extraction-bangbamru-1-10",
    dataset_version=None,  # Latest version
    
    # Models to test
    model_configs=[
        {"model": "gemini-2.5-flash", "temperature": 0.0},
        {"model": "gemini-2.5-flash", "temperature": 0.1},
        {"model": "gemini-2.5-flash-lite", "temperature": 0.0},
        {"model": "gemini-2.5-flash-lite", "temperature": 0.1},
        {"model": "gemini-3-pro-preview", "temperature": 0.0},
        {"model": "gemini-3-pro-preview", "temperature": 0.1},
    ],
    
    # Task and evaluators (uses defaults if not specified)
    task_function=vote_extraction_task,
    evaluators=[exact_form_match, ballot_accuracy_score, vote_results_quality, has_no_errors],
    summary_evaluators=[overall_accuracy, success_rate, avg_ballot_accuracy],
    
    # Run configuration
    sample_size=10,  # Limit the run to 10 records
    jobs=2,          # Parallel processing
    raise_errors=True,  # Stop on first error
    
    # Display options
    show_comparison=True,
    return_results=True
)

# Access detailed results
print("\n" + "=" * 80)
print("📊 DETAILED RESULTS")
print("=" * 80)

for exp in results['experiments']:
    print(f"\n🧪 {exp['experiment_name']}:")
    print(f"   Status: {exp['status']}")
    print(f"   Model: {exp['model']}")
    print(f"   Temperature: {exp['temperature']}")
    print(f"   Records: {exp['sample_size']}")
    
    if exp['summary_metrics']:
        print(f"   Metrics:")
        for key, value in exp['summary_metrics'].items():
            print(f"     - {key}: {value}")
    
    print(f"   URL: {exp['url']}")


🔧 INITIALIZING DATADOG LLMOBS
✅ LLMObs enabled
   App: vote-extractor-advanced
   Site: datadoghq.com
   Project: vote-extraction-project

📊 Loading dataset: vote-extraction-bangbamru-1-10
✅ Dataset loaded: 11 records
   Dataset ID: 241bfded-e79d-4d2d-bbc4-a74bb06d85f9

🚀 RUNNING 6 EXPERIMENTS
   Sample Size: 10
   Parallel Jobs: 5
   Raise Errors: True


────────────────────────────────────────────────────────────────────────────────
🧪 Experiment 1/6: gemini-2.5-flash (T=0.0)
────────────────────────────────────────────────────────────────────────────────
✅ Created: vote-extraction-flash-t0
📊 View: https://app.datadoghq.com/llm/experiments/None
⏱
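
Once `run_model_experiments()` returns, the `experiments` list can also be ranked programmatically instead of read off the printed table. The sketch below assumes the result structure built by the wrapper (`status`, `summary_metrics`, `experiment_name`); `rank_experiments` is a hypothetical helper, and the sample data is made up to show the shape, not measured output.

```python
def rank_experiments(results, metric="overall_accuracy"):
    """Return successful experiments sorted by a summary metric, best first."""
    scored = [
        exp for exp in results["experiments"]
        if exp["status"] == "success"
        # Skip experiments where the metric is missing or non-numeric
        and isinstance(exp["summary_metrics"].get(metric), (int, float))
    ]
    return sorted(scored, key=lambda exp: exp["summary_metrics"][metric], reverse=True)


# Made-up results in the shape returned by the wrapper; not real measurements.
example_results = {
    "experiments": [
        {"experiment_name": "exp-a", "status": "success",
         "summary_metrics": {"overall_accuracy": 80.0}},
        {"experiment_name": "exp-b", "status": "failed",
         "summary_metrics": {}},
        {"experiment_name": "exp-c", "status": "success",
         "summary_metrics": {"overall_accuracy": 90.0}},
    ]
}

for exp in rank_experiments(example_results):
    print(exp["experiment_name"], exp["summary_metrics"]["overall_accuracy"])
```

Failed experiments are skipped because the wrapper records them with an empty `summary_metrics` dictionary.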

## 💡 Tips & Next Steps

**Experiment Optimization**:
- Use a small `sample_size` for fast iteration during development
- Increase `jobs` (for example, `jobs=4`) to process records in parallel on large datasets
- Set `raise_errors=True` to stop at the first failure and debug it early
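
One way to apply these tips consistently is to keep the run settings for each phase in one place. This is a minimal sketch: `RUN_PRESETS` and `run_settings` are hypothetical names, and the numbers are illustrative rather than tuned values.

```python
# Hypothetical presets; the numbers are illustrative, not tuned values.
RUN_PRESETS = {
    # Small sample, single worker, fail fast: quick feedback while developing.
    "dev": {"sample_size": 3, "jobs": 1, "raise_errors": True},
    # Larger sample, parallel workers, keep going when one experiment fails.
    "full": {"sample_size": 10, "jobs": 4, "raise_errors": False},
}


def run_settings(phase, **overrides):
    """Return keyword arguments for the given phase, with optional overrides."""
    if phase not in RUN_PRESETS:
        raise ValueError(f"Unknown phase {phase!r}; expected one of {sorted(RUN_PRESETS)}")
    # Build a new dict so the presets themselves are never modified
    return {**RUN_PRESETS[phase], **overrides}
```

The settings can then be unpacked into the wrapper, for example `run_model_experiments(**run_settings("dev"))`.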

**Comparing Configurations**:
```python
# Run multiple experiments with different configs.
# `task`, `dataset`, and `evaluators` stand for the task function, dataset,
# and evaluator list defined earlier in this notebook.
experiment_v1 = LLMObs.experiment(
    name="vote-extraction-v1",
    task=task,
    dataset=dataset,
    evaluators=evaluators,
    config={"model": "gemini-2.5-flash", "temperature": 0.0}
)

experiment_v2 = LLMObs.experiment(
    name="vote-extraction-v2",
    task=task,
    dataset=dataset,
    evaluators=evaluators,
    config={"model": "gemini-2.5-pro", "temperature": 0.1}
)

results_v1 = experiment_v1.run()
results_v2 = experiment_v2.run()

# Compare side-by-side in Datadog UI
```

**Using Pandas for Analysis**:
```python
# Export dataset to DataFrame for advanced analysis
df = experiment_dataset.as_dataframe()
print(df.head())
```

**Next Steps**:
1. ✅ Create ground truth datasets (Streamlit Dataset Manager)
2. ✅ Push datasets to Datadog
3. ✅ Run experiments to establish baselines
4. 🔄 Iterate on prompts and models
5. 📊 Compare results in Datadog
6. 🚀 Deploy best performing configuration to production

**Resources**:
- [Datadog LLMObs Experiments Documentation](https://docs.datadoghq.com/llm_observability/experiments/)
- [Guide 04: Experiments and Datasets](../../guides/llmobs/04_EXPERIMENTS_AND_DATASETS.md)
- [Evaluation Metric Types Guide](../../guides/llmobs/03_EVALUATION_METRIC_TYPES.md)
