# Thai Election SS6/1 Extraction: Datadog LLMObs Experiments

Systematically evaluate Gemini models on extracting structured data from Thai election **announcement PDFs (Form S.S. 6/1)** stored in **Google Drive** (776 files across 76 provinces, 2026 election).

**Two document types:**

| Type | Thai folder name | Contents |
|---|---|---|
| `แบ่งเขต` | แบบแบ่งเขต (constituency form) | Candidate-level vote totals per constituency |
| `บัญชีรายชื่อ` | แบบบัญชีรายชื่อ (party-list form) | Party-level vote totals per constituency |

**Workflow:**
1. Setup: install dependencies, configure credentials
2. Schema: Pydantic models + Gemini JSON schema
3. Dataset: load records from JSONL drive index
4. Task: extraction function using Gemini + Google Drive URIs

## 1. Setup

In [1]:
!pip install -q google-genai pydantic pandas ddtrace python-dotenv tenacity

In [2]:
!pip freeze | grep -E 'google-genai|pydantic|pandas|ddtrace'

ddtrace==4.4.0
google-genai==1.64.0
pandas==3.0.1
pydantic==2.12.5
pydantic-settings==2.13.1
pydantic_core==2.41.5


In [38]:
import json
import os
import time
from pathlib import Path
from typing import Any, Dict, List, Literal, Optional

import pandas as pd
from dotenv import load_dotenv
from google import genai
from google.genai import types
from pydantic import BaseModel, Field

load_dotenv(override=True)
print("✅ Imports ready")

✅ Imports ready


In [39]:
# ── Credentials ──────────────────────────────────────────────────────────────
GEMINI_API_KEY       = os.environ["GEMINI_API_KEY"]
DD_API_KEY           = os.environ["DD_API_KEY"]
DD_APP_KEY           = os.environ["DD_APP_KEY"]

# ── Project settings ─────────────────────────────────────────────────────────
ML_APP               = "gemini-ss6_1"
LLMOBS_PROJECT_NAME  = "vote-extraction-project"
DD_SITE              = "us3.datadoghq.com"

# ── Data settings ────────────────────────────────────────────────────────────
DRIVE_FILES_JSONL    = Path("datasets/ect_2026_drive_files.jsonl")
DATASET_NAME         = "ss6_1_nuttee"

print(f"✅ Config ready | ml_app={ML_APP} | dataset={DATASET_NAME}")

✅ Config ready | ml_app=gemini-ss6_1 | dataset=ss6_1_nuttee
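
The credential lookups above raise a bare `KeyError` as soon as one variable is missing. A minimal sketch of a pre-flight check that reports every missing name at once, assuming only the three variable names used above; `missing_env_vars` is a hypothetical helper, not part of any library:

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_ENV_VARS = ("GEMINI_API_KEY", "DD_API_KEY", "DD_APP_KEY")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_ENV_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


# An explicit mapping keeps the example independent of the real environment.
example_env = {"GEMINI_API_KEY": "abc", "DD_API_KEY": ""}
print(missing_env_vars(env=example_env))  # ['DD_API_KEY', 'DD_APP_KEY']
```

Calling `missing_env_vars()` with no arguments checks the live process environment, so it can run right after `load_dotenv`.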


In [40]:
# ── Monkey-patch: fix ddtrace bug with Gemini 2.5 token count ────────────────
# Bug: google_utils.py line 144 does `input_tokens + output_tokens` without
# guarding against None, but Gemini 2.5 Flash omits `prompt_token_count` in
# some responses, causing TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
# https://github.com/DataDog/dd-trace-py/issues/XXXX
import ddtrace.llmobs._integrations.google_genai as _dd_google_genai
from ddtrace.llmobs._integrations.google_utils import _get_attr
from ddtrace.llmobs._constants import (
    INPUT_TOKENS_METRIC_KEY,
    OUTPUT_TOKENS_METRIC_KEY,
    CACHE_READ_INPUT_TOKENS_METRIC_KEY,
    TOTAL_TOKENS_METRIC_KEY,
    REASONING_OUTPUT_TOKENS_METRIC_KEY,
)


def _patched_extract_generation_metrics_google_genai(response):
    """Fixed version guarding against None + int TypeError (ddtrace bug with Gemini 2.5)."""
    if not response:
        return {}
    usage_metadata = _get_attr(response, "usage_metadata", {})
    if not usage_metadata:
        return {}

    usage = {}
    input_tokens = _get_attr(usage_metadata, "prompt_token_count", None)

    candidates_tokens = _get_attr(usage_metadata, "candidates_token_count", None)
    thought_tokens = _get_attr(usage_metadata, "thoughts_token_count", None)
    if candidates_tokens is not None or thought_tokens is not None:
        output_tokens = (candidates_tokens or 0) + (thought_tokens or 0)
    else:
        output_tokens = None

    cached_tokens = _get_attr(usage_metadata, "cached_content_token_count", None)
    # Fix: guard against None + int when prompt_token_count is absent
    total_tokens = _get_attr(usage_metadata, "total_token_count", None) or (
        (input_tokens + output_tokens) if input_tokens is not None and output_tokens is not None else None
    )

    if input_tokens is not None:
        usage[INPUT_TOKENS_METRIC_KEY] = input_tokens
    if output_tokens is not None:
        usage[OUTPUT_TOKENS_METRIC_KEY] = output_tokens
    if cached_tokens is not None:
        usage[CACHE_READ_INPUT_TOKENS_METRIC_KEY] = cached_tokens
    if total_tokens is not None:
        usage[TOTAL_TOKENS_METRIC_KEY] = total_tokens
    if thought_tokens is not None:
        usage[REASONING_OUTPUT_TOKENS_METRIC_KEY] = thought_tokens

    return usage


_dd_google_genai.extract_generation_metrics_google_genai = _patched_extract_generation_metrics_google_genai
print("✅ ddtrace Gemini 2.5 token-count patch applied")

✅ ddtrace Gemini 2.5 token-count patch applied
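
The core of the patch is the None-safe total. The same rule can be exercised without ddtrace installed; the sketch below is a stand-alone restatement of that guard (the name `safe_total_tokens` is hypothetical), not the library's code:

```python
from typing import Optional


def safe_total_tokens(
    reported_total: Optional[int],
    input_tokens: Optional[int],
    output_tokens: Optional[int],
) -> Optional[int]:
    """Prefer the reported total; otherwise add the parts only if both are known."""
    if reported_total:
        return reported_total
    if input_tokens is not None and output_tokens is not None:
        return input_tokens + output_tokens
    return None


print(safe_total_tokens(None, None, 120))  # None: prompt count missing, no TypeError
print(safe_total_tokens(None, 300, 120))   # 420: both parts known
print(safe_total_tokens(450, None, 120))   # 450: reported total wins
```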


In [41]:
from ddtrace.llmobs import LLMObs, EvaluatorResult

LLMObs.enable(
    ml_app=ML_APP,
    api_key=DD_API_KEY,
    app_key=DD_APP_KEY,
    project_name=LLMOBS_PROJECT_NAME,
    site=DD_SITE,
    agentless_enabled=True,
)
print("✅ Datadog LLMObs enabled")

✅ Datadog LLMObs enabled


## 2. Schema

### SS6/1 vs SS5/18 key differences

| Dimension | SS5/18 | SS6/1 |
|---|---|---|
| Granularity | Per-polling-station | Per-constituency (aggregated) |
| Location fields | province, district, sub_district, polling_station, village | province, constituency_number only |
| Voter stats | eligible + present voters | eligible + present voters (both optional) |
| Ballot stats | allocated, used, good, bad, no_vote, remaining | valid, invalid, no_vote, total_used |
| Results rows | แบ่งเขต: candidate+party / บัญชีรายชื่อ: party only | same pattern |

SS6/1 is the **official constituency announcement**: one document per constituency per form type.

In [42]:
class NumberTextPair(BaseModel):
    """Numeric value recorded as both Arabic numeral and Thai text."""

    arabic: int = Field(..., description="Arabic numeral (e.g., 12500)")
    thai_text: Optional[str] = Field(None, description="Thai text (e.g., หนึ่งหมื่นสองพันห้าร้อย)")


class SS61FormInfo(BaseModel):
    """Header / identity fields of the SS6/1 announcement document."""

    form_type: Literal["แบ่งเขต", "บัญชีรายชื่อ"] = Field(
        ..., description="แบ่งเขต = Constituency, บัญชีรายชื่อ = Party List"
    )
    province: str = Field(..., description="Province name (จังหวัด)")
    constituency_number: str = Field(..., description="Constituency zone number (เขตเลือกตั้งที่)")
    date: Optional[str] = Field(None, description="Vote counting date (วันที่นับคะแนน)")


class BallotSummary(BaseModel):
    """Aggregate ballot statistics for the constituency."""

    eligible_voters: Optional[NumberTextPair] = Field(
        None, description="ผู้มีสิทธิเลือกตั้ง: total eligible voters"
    )
    present_voters: Optional[NumberTextPair] = Field(
        None, description="ผู้มาใช้สิทธิ: voters who showed up"
    )
    valid_ballots: Optional[NumberTextPair] = Field(
        None, description="บัตรดี: valid ballots counted"
    )
    invalid_ballots: Optional[NumberTextPair] = Field(
        None, description="บัตรเสีย: spoiled/invalid ballots"
    )
    no_vote_ballots: Optional[NumberTextPair] = Field(
        None, description="ไม่เลือกผู้ใด / ไม่ประสงค์ลงคะแนน: abstain ballots"
    )
    total_ballots_used: Optional[NumberTextPair] = Field(
        None, description="รวมบัตรที่ใช้: total ballots used (valid + invalid + no_vote)"
    )


class ResultEntry(BaseModel):
    """One row in the vote results table."""

    number: int = Field(..., description="Row number (ที่/ลำดับ)")
    candidate_name: Optional[str] = Field(
        None, description="Candidate full name, แบ่งเขต only (ชื่อ-สกุล)"
    )
    party_name: Optional[str] = Field(
        None, description="Party name (สังกัดพรรคการเมือง), both form types"
    )
    vote_count: NumberTextPair = Field(..., description="Votes received (คะแนน)")


class Official(BaseModel):
    name: str = Field(..., description="Full name")
    position: str = Field(..., description="ประธาน / กรรมการ / เลขานุการ")


class SS61FormData(BaseModel):
    """Root extraction model for one SS6/1 announcement document."""

    form_info: SS61FormInfo
    ballot_summary: Optional[BallotSummary] = None
    results: List[ResultEntry] = Field(default_factory=list)
    total_votes: Optional[NumberTextPair] = Field(
        None, description="รวม row at the bottom of the results table"
    )
    officials: Optional[List[Official]] = None


print("✅ Pydantic models defined")

✅ Pydantic models defined
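
Because `NumberTextPair` stores each figure twice, the Thai text can serve as an independent check on the Arabic numeral. The following is a minimal sketch of such a cross-check, assuming the Thai text is a plain cardinal number in words up to the millions; `thai_text_to_int` is a hypothetical helper and is not used elsewhere in this notebook:

```python
# Thai number words. "เอ็ด" is 1 in the units position and "ยี่" is 2 before "สิบ".
DIGITS = {
    "ศูนย์": 0, "หนึ่ง": 1, "เอ็ด": 1, "สอง": 2, "ยี่": 2, "สาม": 3, "สี่": 4,
    "ห้า": 5, "หก": 6, "เจ็ด": 7, "แปด": 8, "เก้า": 9,
}
UNITS = {"สิบ": 10, "ร้อย": 100, "พัน": 1_000, "หมื่น": 10_000, "แสน": 100_000}
MILLION = "ล้าน"
TOKENS = sorted([*DIGITS, *UNITS, MILLION], key=len, reverse=True)


def thai_text_to_int(text: str) -> int:
    """Parse a Thai cardinal number written in words (hypothetical helper)."""
    text = text.replace(" ", "")
    done = 0      # value already closed off by a million marker
    section = 0   # value accumulated since the last million marker
    digit = 0     # pending digit waiting for its unit word
    pos = 0
    while pos < len(text):
        token = next((t for t in TOKENS if text.startswith(t, pos)), None)
        if token is None:
            raise ValueError(f"unrecognised number text at offset {pos}: {text[pos:]!r}")
        if token in DIGITS:
            digit = DIGITS[token]
        elif token in UNITS:
            section += (digit or 1) * UNITS[token]  # a bare "สิบ" means 10
            digit = 0
        else:  # million marker
            done = (done + section + digit) * 1_000_000
            section = digit = 0
        pos += len(token)
    return done + section + digit


print(thai_text_to_int("หนึ่งหมื่นสองพันห้าร้อย"))  # 12500
```

A mismatch between `thai_text_to_int(pair["thai_text"])` and `pair["arabic"]` would flag a likely misread digit. A `ValueError` only means the text used wording this sketch does not cover.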


In [43]:
# Gemini JSON schema (mirrors Pydantic models above)
_num_text_pair = {
    "type": "OBJECT",
    "required": ["arabic"],
    "properties": {
        "arabic": {"type": "INTEGER"},
        "thai_text": {"type": "STRING"},
    },
}

SS61_DATA_SCHEMA = {
    "type": "OBJECT",
    "description": "Extracted data from one SS6/1 announcement PDF",
    "required": ["form_info", "results"],
    "properties": {
        "form_info": {
            "type": "OBJECT",
            "required": ["form_type", "province", "constituency_number"],
            "properties": {
                "form_type": {
                    "type": "STRING",
                    "enum": ["แบ่งเขต", "บัญชีรายชื่อ"],
                    "description": "แบ่งเขต = Constituency, บัญชีรายชื่อ = Party List",
                },
                "province": {"type": "STRING"},
                "constituency_number": {
                    "type": "STRING",
                    "description": "Constituency zone number (เขตเลือกตั้งที่)",
                },
                "date": {
                    "type": "STRING",
                    "description": "Vote counting date as shown on document",
                },
            },
        },
        "ballot_summary": {
            "type": "OBJECT",
            "properties": {
                "eligible_voters": _num_text_pair,
                "present_voters": _num_text_pair,
                "valid_ballots": _num_text_pair,
                "invalid_ballots": _num_text_pair,
                "no_vote_ballots": _num_text_pair,
                "total_ballots_used": _num_text_pair,
            },
        },
        "results": {
            "type": "ARRAY",
            "description": "All rows from the vote results table",
            "items": {
                "type": "OBJECT",
                "required": ["number", "vote_count"],
                "properties": {
                    "number": {"type": "INTEGER"},
                    "candidate_name": {
                        "type": "STRING",
                        "description": "Candidate full name, แบ่งเขต only",
                    },
                    "party_name": {"type": "STRING"},
                    "vote_count": _num_text_pair,
                },
            },
        },
        "total_votes": {
            **_num_text_pair,
            "description": "รวม row at the bottom of the results table",
        },
        "officials": {
            "type": "ARRAY",
            "items": {
                "type": "OBJECT",
                "required": ["name", "position"],
                "properties": {
                    "name": {"type": "STRING"},
                    "position": {"type": "STRING"},
                },
            },
        },
    },
}

print("✅ Gemini JSON schema defined")

✅ Gemini JSON schema defined
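
Since the schema is a plain dict, a parsed response can be checked against its `required` lists without any extra library. The sketch below walks a schema written in the same `OBJECT` / `ARRAY` style and lists the required keys that are absent. `missing_required` is a hypothetical helper, and `demo_schema` is a cut-down stand-in for illustration, not the full schema above:

```python
from typing import Any, Dict, List


def missing_required(schema: Dict[str, Any], value: Any, path: str = "$") -> List[str]:
    """List paths of required keys that are absent or null in `value`."""
    problems: List[str] = []
    kind = schema.get("type")
    if kind == "OBJECT" and isinstance(value, dict):
        for key in schema.get("required", []):
            if value.get(key) is None:
                problems.append(f"{path}.{key}")
        # Recurse only into properties that are actually present.
        for key, sub_schema in schema.get("properties", {}).items():
            if value.get(key) is not None:
                problems += missing_required(sub_schema, value[key], f"{path}.{key}")
    elif kind == "ARRAY" and isinstance(value, list):
        for i, item in enumerate(value):
            problems += missing_required(schema.get("items", {}), item, f"{path}[{i}]")
    return problems


demo_schema = {
    "type": "OBJECT",
    "required": ["form_info", "results"],
    "properties": {
        "form_info": {
            "type": "OBJECT",
            "required": ["province"],
            "properties": {"province": {"type": "STRING"}},
        },
        "results": {
            "type": "ARRAY",
            "items": {
                "type": "OBJECT",
                "required": ["number", "vote_count"],
                "properties": {"number": {"type": "INTEGER"}},
            },
        },
    },
}

demo_value = {"form_info": {}, "results": [{"number": 1}]}
print(missing_required(demo_schema, demo_value))
# ['$.form_info.province', '$.results[0].vote_count']
```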


## 3. Dataset

Load the 776-file drive index from `datasets/ect_2026_drive_files.jsonl`, infer `form_type` from the folder path, and build `input_data` records compatible with the task function.

**`input_data` shape** (matches task function expectations):
```json
{
  "drive_uri": "https://drive.google.com/uc?export=download&id={file_id}",
  "source_file_metadata": {
    "province_name": "หนองคาย",
    "path": "ประกาศผลการนับคะแนน สส. .../หนองคาย/แบบแบ่งเขต/70. หนองคาย เขต 3.pdf",
    "size_mb": 0.1196,
    "file_id": "1JDOkP6n...",
    "folder_id": "1EvhR0-E...",
    "form_type": "แบ่งเขต"
  }
}
```

In [44]:
def infer_form_type(path: str) -> str:
    """Infer SS6/1 form type from the Google Drive folder path."""
    if "แบบบัญชีรายชื่อ" in path:
        return "บัญชีรายชื่อ"
    if "แบบแบ่งเขต" in path:
        return "แบ่งเขต"
    return "unknown"


def build_input_data(record: dict) -> dict:
    """Build the input_data dict expected by the task function from a JSONL record."""
    file_id = record["file_id"]
    drive_uri = f"https://drive.google.com/uc?export=download&id={file_id}"
    return {
        "drive_uri": drive_uri,
        "source_file_metadata": {
            **record,
            "form_type": infer_form_type(record["path"]),
        },
    }


# ── Load JSONL ───────────────────────────────────────────────────────────────
raw_records = []
with DRIVE_FILES_JSONL.open(encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            raw_records.append(json.loads(line))

df = pd.DataFrame(raw_records)
df["form_type"] = df["path"].apply(infer_form_type)

print(f"Total files        : {len(df)}")
print(f"แบ่งเขต            : {(df['form_type'] == 'แบ่งเขต').sum()}")
print(f"บัญชีรายชื่อ       : {(df['form_type'] == 'บัญชีรายชื่อ').sum()}")
print(f"Unknown            : {(df['form_type'] == 'unknown').sum()}")
print(f"Unique provinces   : {df['province_name'].nunique()}")
df.head(6)

Total files        : 776
แบ่งเขต            : 378
บัญชีรายชื่อ       : 383
Unknown            : 15
Unique provinces   : 76


Unnamed: 0,province_name,path,size_mb,file_id,folder_id,form_type
0,หนองคาย,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/หนองคาย/แบบแบ่งเขต/70. หนองคาย เขต 3.pdf,0.1196,1JDOkP6nW0qNfSg27DBX_7ZzZ4epF6u32,1EvhR0-EoN_vCkqo3R9xFsNp6ipA7YsHy,แบ่งเขต
1,หนองคาย,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/หนองคาย/แบบแบ่งเขต/70. หนองคาย เขต 2.pdf,0.124,1bKuaF46ErxZZrQsbrFl-ri35t3SI47Do,1EvhR0-EoN_vCkqo3R9xFsNp6ipA7YsHy,แบ่งเขต
2,หนองคาย,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/หนองคาย/แบบแบ่งเขต/70. หนองคาย เขต 1.pdf,0.1309,1wHWZJl0oCsp8mAR2oF_O1xDYY--m8WAt,1EvhR0-EoN_vCkqo3R9xFsNp6ipA7YsHy,แบ่งเขต
3,หนองคาย,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/หนองคาย/แบบบัญชีรายชื่อ/70. หนองคาย เขต 3 (บช).pdf,0.2377,1zQQJm3O-vPlpj1xpk6gn-07a83XBTxzz,1ITySwBS-RMSOsVahucnByxG7clsOjHd0,บัญชีรายชื่อ
4,หนองคาย,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/หนองคาย/แบบบัญชีรายชื่อ/70. หนองคาย เขต 2 (บช).pdf,0.2521,15ZCUDRMUbILF3Zv1Z4gvw28Hw4sRRsNm,1ITySwBS-RMSOsVahucnByxG7clsOjHd0,บัญชีรายชื่อ
5,หนองคาย,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/หนองคาย/แบบบัญชีรายชื่อ/70. หนองคาย เขต 1 (บช).pdf,0.2663,1dCqfZ6ix_uBso92V_6HWJ3jJEQX53oI-,1ITySwBS-RMSOsVahucnByxG7clsOjHd0,บัญชีรายชื่อ


In [45]:
# ── Sample a balanced set for experimentation ────────────────────────────────
# Pick a few records of each type across different provinces
SAMPLE_SIZE_PER_TYPE = 3

sample_baeng_khet = (
    df[df["form_type"] == "แบ่งเขต"]
    .sample(n=SAMPLE_SIZE_PER_TYPE, random_state=42)
)
sample_ban_chi = (
    df[df["form_type"] == "บัญชีรายชื่อ"]
    .sample(n=SAMPLE_SIZE_PER_TYPE, random_state=42)
)

all_unknown = (
    df[df["form_type"] == "unknown"]
    .sample(n=SAMPLE_SIZE_PER_TYPE, random_state=42)
)

sample_df = pd.concat([sample_baeng_khet, sample_ban_chi, all_unknown]).reset_index(drop=True)

print(f"Sample size: {len(sample_df)} records")
pd.set_option('display.max_colwidth', None)
sample_df[["province_name", "form_type", "size_mb", "path"]]

Sample size: 9 records


Unnamed: 0,province_name,form_type,size_mb,path
0,เชียงใหม่,แบ่งเขต,0.106,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/เชียงใหม่/แบบแบ่งเขต/14. เชียงใหม่ เขต 6.pdf
1,นครราชสีมา,แบ่งเขต,0.0922,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/นครราชสีมา/แบบแบ่งเขต/21. นครราชสีมา เขต 13.pdf
2,นครศรีธรรมราช,แบ่งเขต,0.1211,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/นครศรีธรรมราช/แบบแบ่งเขต/22. นครศรีธรรมราช เขต 8.pdf
3,กรุงเทพมหานคร,บัญชีรายชื่อ,0.2231,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/กรุงเทพมหานคร/แบบบัญชีรายชื่อ/1. กรุงเทพมหานคร เขต 26 (บช).pdf
4,กรุงเทพมหานคร,บัญชีรายชื่อ,0.2405,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/กรุงเทพมหานคร/แบบบัญชีรายชื่อ/1. กรุงเทพมหานคร เขต 2 (บช).pdf
5,แพร่,บัญชีรายชื่อ,0.2378,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/แพร่/แบบบัญชีรายชื่อ/41. แพร่ เขต 1 (บช).pdf
6,กาญจนบุรี,unknown,0.6997,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/กาญจนบุรี/เเบบเเบ่งเขต/3. กาญจนบุรี เขต 3 (1).pdf
7,กาญจนบุรี,unknown,0.8692,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/กาญจนบุรี/เเบบเเบ่งเขต/3. กาญจนบุรี เขต 1 (1).pdf
8,อ่างทอง,unknown,0.1012,ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/อ่างทอง/แบบแบ่งเบต/72. อ่างทอง เขต 2.pdf
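
The three `unknown` rows in this sample come from misspelled folder names rather than a third document type. Two use a doubled SARA E (U+0E40 U+0E40) where SARA AE (U+0E41) belongs, and one reads `แบบแบ่งเบต` instead of `แบบแบ่งเขต`. A minimal sketch of a more tolerant classifier follows, assuming the remaining `unknown` files have similar near-miss names. `infer_form_type_lenient` and the `cutoff` value are hypothetical choices, not part of the notebook:

```python
from difflib import get_close_matches

# Canonical folder names mapped to the form_type values used in this notebook.
FOLDER_TO_FORM_TYPE = {
    "แบบแบ่งเขต": "แบ่งเขต",
    "แบบบัญชีรายชื่อ": "บัญชีรายชื่อ",
}


def infer_form_type_lenient(path: str, cutoff: float = 0.8) -> str:
    """Like infer_form_type, but tolerant of small folder-name typos (a sketch)."""
    # Two consecutive SARA E characters are treated as a mistyped SARA AE.
    normalised = path.replace("\u0e40\u0e40", "\u0e41")
    for folder, form_type in FOLDER_TO_FORM_TYPE.items():
        if folder in normalised:
            return form_type
    # Fall back to fuzzy matching on each path component.
    for part in normalised.split("/"):
        close = get_close_matches(part, list(FOLDER_TO_FORM_TYPE), n=1, cutoff=cutoff)
        if close:
            return FOLDER_TO_FORM_TYPE[close[0]]
    return "unknown"


typo_path = "ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/อ่างทอง/แบบแบ่งเบต/72. อ่างทอง เขต 2.pdf"
print(infer_form_type_lenient(typo_path))  # แบ่งเขต
```

Any path that still comes back as `unknown` is worth inspecting by hand before it is sent to the model.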


In [46]:
# ── Build input_data records ─────────────────────────────────────────────────
sample_inputs = [build_input_data(row) for row in sample_df.to_dict(orient="records")]

# Inspect one record
print("── แบ่งเขต sample ──")
s = sample_inputs[0]
meta = s["source_file_metadata"]
print(f"  drive_uri  : {s['drive_uri']}")
print(f"  province   : {meta['province_name']}")
print(f"  form_type  : {meta['form_type']}")
print(f"  path       : {meta['path']}")
print(f"  size_mb    : {meta['size_mb']}")

print()
print("── บัญชีรายชื่อ sample ──")
s2 = sample_inputs[SAMPLE_SIZE_PER_TYPE]
meta2 = s2["source_file_metadata"]
print(f"  drive_uri  : {s2['drive_uri']}")
print(f"  province   : {meta2['province_name']}")
print(f"  form_type  : {meta2['form_type']}")
print(f"  path       : {meta2['path']}")
print(f"  size_mb    : {meta2['size_mb']}")

── แบ่งเขต sample ──
  drive_uri  : https://drive.google.com/uc?export=download&id=1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH
  province   : เชียงใหม่
  form_type  : แบ่งเขต
  path       : ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/เชียงใหม่/แบบแบ่งเขต/14. เชียงใหม่ เขต 6.pdf
  size_mb    : 0.106

── บัญชีรายชื่อ sample ──
  drive_uri  : https://drive.google.com/uc?export=download&id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
  province   : กรุงเทพมหานคร
  form_type  : บัญชีรายชื่อ
  path       : ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/กรุงเทพมหานคร/แบบบัญชีรายชื่อ/1. กรุงเทพมหานคร เขต 26 (บช).pdf
  size_mb    : 0.2231


## 4. Task

The task calls Gemini with a PDF from Google Drive and returns a structured `SS61FormData` dict.

**Prompt covers:**
- Header fields (province, constituency, date, form_type)
- Ballot summary stats (eligible/present voters, valid/invalid/no-vote/total ballots)
- Results table: candidates+party (แบ่งเขต) or party only (บัญชีรายชื่อ)
- Total votes (รวม row)
- Officials
- Validation: `total_ballots_used = valid + invalid + no_vote` and `total_votes = sum(results)`

**Span hierarchy in Datadog LLMObs:**

```
@workflow  extract_ss61_form          ← captures full input_data + output dict
  └── @llm  gemini_extract_ss61       ← Gemini API call with token counts & latency
        ├── evaluation: ballot_check  ← score 0.0 / 1.0
        └── evaluation: votes_check   ← score 0.0 / 1.0
```

**Inline External Evaluations** (submitted via `LLMObs.submit_evaluation` on the active `@llm_span`):

| Label | Check | Pass condition |
|---|---|---|
| `ballot_check` | `total_ballots_used = valid + invalid + no_vote` | Arithmetic identity holds |
| `votes_check` | `total_votes = Σ results[*].vote_count` | Sum matches recorded total |

In [47]:
SS61_EXTRACTION_PROMPT = """
You are an expert data entry assistant for Thai Election announcement documents (Form S.S. 6/1, ประกาศผลการนับคะแนน สส.).

This PDF is a constituency-level vote counting announcement for the 2026 Thai General Election (8 February 2569 BE).
It is either:
  • แบบแบ่งเขต (แบ่งเขต)   - Constituency form: lists individual candidates with their party and vote total
  • แบบบัญชีรายชื่อ (บัญชีรายชื่อ) - Party-list form: lists parties with their vote total

EXTRACTION INSTRUCTIONS:

1. HEADER (form_info):
   - form_type: "แบ่งเขต" if document title/folder indicates แบ่งเขต; "บัญชีรายชื่อ" for party-list.
   - province: Thai province name (จังหวัด).
   - constituency_number: the เขตเลือกตั้งที่ number as a string (e.g., "1", "3").
   - date: the vote counting date exactly as written on the document (e.g., "8 กุมภาพันธ์ 2569").

2. BALLOT SUMMARY (ballot_summary) - aggregate totals for the whole constituency:
   - eligible_voters : ผู้มีสิทธิเลือกตั้ง (if present on document)
   - present_voters  : ผู้มาใช้สิทธิ (if present)
   - valid_ballots   : บัตรดี
   - invalid_ballots : บัตรเสีย
   - no_vote_ballots : ไม่เลือกผู้ใด / ไม่ประสงค์ลงคะแนน
   - total_ballots_used : รวมบัตรที่ใช้
   Record BOTH the Arabic numeral and the Thai text for every value.

3. RESULTS TABLE (results) - every data row in the vote results table:
   - number         : row number (ที่ / ลำดับ)
   - candidate_name : full name, แบ่งเขต ONLY (leave null for บัญชีรายชื่อ)
   - party_name     : party name (สังกัดพรรคการเมือง / ชื่อพรรค)
   - vote_count     : votes received (คะแนน), both Arabic and Thai text

4. TOTAL VOTES (total_votes):
   - The "รวม" row at the bottom of the results table, both Arabic and Thai text.

5. OFFICIALS:
   - All names and positions from the signature / certification section.

VALIDATION RULES (apply internally before returning):
   - total_ballots_used.arabic = valid_ballots.arabic + invalid_ballots.arabic + no_vote_ballots.arabic
   - total_votes.arabic = sum of all results[*].vote_count.arabic

Return all Thai strings exactly as they appear in the document (do not translate or romanise).
"""

In [48]:
gemini_client = genai.Client(api_key=GEMINI_API_KEY, vertexai=False)
print("✅ Gemini client initialized")

✅ Gemini client initialized


In [49]:
from ddtrace.llmobs.decorators import llm as llm_span, workflow, task
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
    before_sleep_log,
)
import logging

_gemini_retry_logger = logging.getLogger("ss61.gemini_retry")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")


# ── Evaluation helpers ───────────────────────────────────────────────────────

def _arabic(obj) -> int:
    """Safely extract arabic value from a NumberTextPair dict."""
    if isinstance(obj, dict):
        return obj.get("arabic", 0) or 0
    return 0

@task(name="ballot_check")
def _ballot_check(result: dict) -> tuple[float, str]:
    """
    Check: total_ballots_used = valid_ballots + invalid_ballots + no_vote_ballots

    Returns (score, reasoning) where score is 1.0 (pass) or 0.0 (fail).
    Returns 0.0 with a note if ballot_summary is absent or all zeros.
    """
    bs = result.get("ballot_summary") or {}
    valid    = _arabic(bs.get("valid_ballots"))
    invalid  = _arabic(bs.get("invalid_ballots"))
    no_vote  = _arabic(bs.get("no_vote_ballots"))
    total    = _arabic(bs.get("total_ballots_used"))

    if total == 0 and valid == 0 and invalid == 0 and no_vote == 0:
        return 0.0, "ballot_summary missing or all zeros; cannot verify"

    calc = valid + invalid + no_vote
    ok   = total == calc
    reasoning = (
        f"total_ballots_used={total}, "
        f"valid({valid}) + invalid({invalid}) + no_vote({no_vote}) = {calc}"
    )
    return (1.0 if ok else 0.0), reasoning

@task(name="votes_check")
def _votes_check(result: dict) -> tuple[float, str]:
    """
    Check: total_votes = sum(results[*].vote_count)

    Returns (score, reasoning) where score is 1.0 (pass) or 0.0 (fail).
    Returns 0.0 with a note if results list is empty.
    """
    entries     = result.get("results", [])
    total_votes = _arabic(result.get("total_votes"))

    if not entries:
        return 0.0, "no results entries found"

    calc = sum(_arabic(r.get("vote_count")) for r in entries)
    ok   = total_votes == calc
    reasoning = (
        f"total_votes={total_votes}, "
        f"sum({len(entries)} result entries) = {calc}"
    )
    return (1.0 if ok else 0.0), reasoning


print("✅ Evaluation helpers defined: _ballot_check, _votes_check")

✅ Evaluation helpers defined: _ballot_check, _votes_check
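
One consequence of the `_arabic` coercion is that a field the model failed to extract counts as 0, so the identity check fails rather than being skipped. The stand-alone sketch below restates the ballot arithmetic without the ddtrace decorators to show both outcomes. The helper names are hypothetical and the numbers are made up, not taken from any real document:

```python
def arabic(pair) -> int:
    """Same coercion as `_arabic` above: a missing or null value counts as 0."""
    if isinstance(pair, dict):
        return pair.get("arabic", 0) or 0
    return 0


def ballot_identity_holds(ballot_summary: dict) -> bool:
    """True when total_ballots_used == valid + invalid + no_vote."""
    parts = ("valid_ballots", "invalid_ballots", "no_vote_ballots")
    total = arabic(ballot_summary.get("total_ballots_used"))
    return total == sum(arabic(ballot_summary.get(name)) for name in parts)


# Made-up numbers for illustration only.
complete = {
    "valid_ballots": {"arabic": 9000},
    "invalid_ballots": {"arabic": 300},
    "no_vote_ballots": {"arabic": 200},
    "total_ballots_used": {"arabic": 9500},
}
# Same summary, but with no_vote_ballots missing from the extraction.
partial = {k: v for k, v in complete.items() if k != "no_vote_ballots"}

print(ballot_identity_holds(complete))  # True
print(ballot_identity_holds(partial))   # False: 9000 + 300 + 0 != 9500
```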


In [50]:
_task_logger = logging.getLogger("ss61.task")


@workflow(name="extract_ss61_form")
def extract_ss61_form(input_data: Dict[str, Any], config: Dict[str, Any]) -> dict:
    """
    Task function for LLMObs experiments: SS6/1 extraction.

    Accepts a dataset record's input_data and the experiment config dict,
    calls Gemini with the SS6/1 schema, and returns a structured dict.

    After the LLM call, two External Evaluations are submitted inline to
    Datadog LLMObs on the active LLM span:
      • ballot_check: total_ballots_used = valid + invalid + no_vote
      • votes_check:  total_votes = sum(results[*].vote_count)

    Args:
        input_data: JSON string or dict with shape:
            {
                "drive_uri": "https://drive.google.com/uc?export=download&id=...",
                "source_file_metadata": {
                    "province_name": str,
                    "path": str,
                    "size_mb": float,
                    "file_id": str,
                    "folder_id": str,
                    "form_type": "แบ่งเขต" | "บัญชีรายชื่อ",
                }
            }
        config: dict with keys:
            model         (str)      - Gemini model name, default "gemini-2.5-flash"
            temperature   (float)    - sampling temperature, default 0.0
            max_tokens    (int)      - max output tokens, default 8192
            thinking_mode (str|None) - None | "LOW" | "HIGH"

    Returns:
        dict: extracted SS61FormData (matches SS61_DATA_SCHEMA)
    """
    model         = config.get("model", "gemini-2.5-flash")
    temperature   = config.get("temperature", 0.0)
    max_tokens    = config.get("max_tokens", 8192)
    thinking_mode = config.get("thinking_mode")  # None | "LOW" | "HIGH"

    if isinstance(input_data, str):
        input_data = json.loads(input_data)

    file_id       = input_data["source_file_metadata"]["file_id"]
    form_type     = input_data["source_file_metadata"].get("form_type", "unknown")
    province_name = input_data["source_file_metadata"].get("province_name", "?")
    drive_uri     = f"https://drive.google.com/uc?export=download&id={file_id}"

    _task_logger.info(
        "[extract_ss61_form] START  province=%s  form_type=%s  model=%s  "
        "thinking=%s  max_tokens=%d  file_id=%s",
        province_name, form_type, model, thinking_mode or "none", max_tokens, file_id,
    )

    file_part = types.Part.from_uri(file_uri=drive_uri, mime_type="application/pdf")
    _task_logger.info("[extract_ss61_form] PDF part built  drive_uri=%s", drive_uri)

    _thinking_budget = {"LOW": 1024, "HIGH": 8192}
    gen_config_params: Dict[str, Any] = {
        "response_mime_type": "application/json",
        "response_schema": SS61_DATA_SCHEMA,
        "temperature": temperature,
        "max_output_tokens": max_tokens,
        "top_p": 0.95,
    }
    if thinking_mode:
        budget = _thinking_budget.get(thinking_mode, 1024)
        gen_config_params["thinking_config"] = types.ThinkingConfig(thinking_budget=budget)
        _task_logger.info(
            "[extract_ss61_form] thinking enabled  mode=%s  budget=%d tokens",
            thinking_mode, budget,
        )

    # ── Inner function decorated with @task so Datadog LLMObs creates a task span.
    # Evaluations are submitted *inside* this span so they are attached to the
    # exact call that produced the result. The @llm_span variant is kept below,
    # commented out, for reference.
    # @llm_span(model_name=model, name="gemini_extract_ss61", model_provider="google")
    @task(name="gemini_extract_ss61")
    def _call_and_evaluate() -> dict:
        # Retry only the API call; evaluations run once on the successful response.
        @retry(
            stop=stop_after_attempt(4),          # 1 initial + 3 retries
            wait=wait_exponential(multiplier=2, min=2, max=30),
            retry=retry_if_exception_type(Exception),
            before_sleep=before_sleep_log(_gemini_retry_logger, logging.WARNING),
            reraise=True,
        )
        def _call_gemini():
            _task_logger.info(
                "[gemini_extract_ss61] calling Gemini  model=%s  file_id=%s", model, file_id
            )
            return gemini_client.models.generate_content(
                model=model,
                contents=[file_part, SS61_EXTRACTION_PROMPT],
                config=types.GenerateContentConfig(**gen_config_params),
            )

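        response = _call_gemini()
        # finish_reason is reported per candidate, not on the response object itself.
        candidates = getattr(response, "candidates", None) or []
        _task_logger.info(
            "[gemini_extract_ss61] response received  finish_reason=%s  output_tokens=%s",
            getattr(candidates[0], "finish_reason", "?") if candidates else "?",
            getattr(getattr(response, "usage_metadata", None), "candidates_token_count", "?"),
        )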

        parsed = json.loads(response.text)
        _task_logger.info(
            "[gemini_extract_ss61] parsed OK  n_results=%d  total_votes=%d",
            len(parsed.get("results") or []),
            _arabic(parsed.get("total_votes")),
        )

        # ── External Evaluations ──────────────────────────────────────────────
        # export_span(span=None) exports the currently active span (the @task span above).
        span_ctx = LLMObs.export_span(span=None)
        if span_ctx:
            ts_ms = int(time.time() * 1000)
            tags  = {"form_type": form_type}

            ballot_score, ballot_reason = _ballot_check(parsed)
            _task_logger.info(
                "[gemini_extract_ss61] ballot_check  score=%.1f  %s", ballot_score, ballot_reason
            )
            LLMObs.submit_evaluation(
                span=span_ctx,
                ml_app=ML_APP,
                label="ballot_check",
                metric_type="score",
                value=ballot_score,
                assessment="pass" if ballot_score == 1.0 else "fail",
                reasoning=ballot_reason,
                tags=tags,
                timestamp_ms=ts_ms,
            )

            votes_score, votes_reason = _votes_check(parsed)
            _task_logger.info(
                "[gemini_extract_ss61] votes_check   score=%.1f  %s", votes_score, votes_reason
            )
            LLMObs.submit_evaluation(
                span=span_ctx,
                ml_app=ML_APP,
                label="votes_check",
                metric_type="score",
                value=votes_score,
                assessment="pass" if votes_score == 1.0 else "fail",
                reasoning=votes_reason,
                tags=tags,
                timestamp_ms=ts_ms,
            )

        else:
            _task_logger.info("[gemini_extract_ss61] no active span; evaluations skipped")

        return parsed

    result = _call_and_evaluate()
    _task_logger.info(
        "[extract_ss61_form] DONE  province=%s  form_type=%s", province_name, form_type
    )
    return result


print("✅ Task function defined (with inline ballot_check + votes_check evaluations)")

✅ Task function defined (with inline ballot_check + votes_check evaluations)


### Optional: smoke-test on a single sample

In [51]:
# ── Inspect the first sample input ────────────────────────────────────────────
sample = sample_inputs[0]
meta = sample["source_file_metadata"]

print(f"drive_uri  : {sample['drive_uri']}")
print(f"file_id    : {meta['file_id']}")
print(f"province   : {meta['province_name']}")
print(f"form_type  : {meta['form_type']}")
print(f"path       : {meta['path']}")
print(f"size_mb    : {meta['size_mb']}")

drive_uri  : https://drive.google.com/uc?export=download&id=1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH
file_id    : 1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH
province   : เชียงใหม่
form_type  : แบ่งเขต
path       : ประกาศผลการนับคะแนน สส. (8 ก.พ. 2569)/เชียงใหม่/แบบแบ่งเขต/14. เชียงใหม่ เขต 6.pdf
size_mb    : 0.106


In [53]:
# ── Run extraction on the first sample ────────────────────────────────────────

os.environ["DD_TRACE_AGENT_URL"] = "http://datadog-agent:8126"

result = extract_ss61_form(
    sample,
    {"model": "gemini-3.1-pro-preview", "temperature": 0.0, "max_tokens": 8192, "thinking_mode": "LOW"},
)

print("── form_info ──")
print(json.dumps(result.get("form_info"), indent=2, ensure_ascii=False))

print("\n── ballot_summary ──")
print(json.dumps(result.get("ballot_summary"), indent=2, ensure_ascii=False))

print(f"\n── results ({len(result.get('results', []))} entries) ──")
for r in result.get("results", [])[:5]:
    print(json.dumps(r, ensure_ascii=False))

print("\n── total_votes ──")
print(json.dumps(result.get("total_votes"), indent=2, ensure_ascii=False))

2026-02-21 15:59:26,419 INFO [extract_ss61_form] START  province=เชียงใหม่  form_type=แบ่งเขต  model=gemini-3.1-pro-preview  thinking=LOW  max_tokens=8192  file_id=1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH
2026-02-21 15:59:26,420 INFO [extract_ss61_form] PDF part built  drive_uri=https://drive.google.com/uc?export=download&id=1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH
2026-02-21 15:59:26,420 INFO [extract_ss61_form] thinking enabled  mode=LOW  budget=1024 tokens
2026-02-21 15:59:26,421 INFO [gemini_extract_ss61] calling Gemini  model=gemini-3.1-pro-preview  file_id=1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH
2026-02-21 15:59:26,422 INFO AFC is enabled with max remote calls: 10.
2026-02-21 15:59:55,201 INFO HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro-preview:generateContent "HTTP/1.1 200 OK"
2026-02-21 15:59:55,205 INFO [gemini_extract_ss61] response received  finish_reason=?  output_tokens=1245
2026-02-21 15:59:55,205 INFO [gemini_extr

── form_info ──
{
  "form_type": "แบ่งเขต",
  "province": "เชียงใหม่",
  "constituency_number": "6",
  "date": "8 เดือน กุมภาพันธ์ พ.ศ. 2569"
}

── ballot_summary ──
{
  "eligible_voters": {
    "arabic": 127678,
    "thai_text": "หนึ่งแสนสองหมื่นเจ็ดพันหกร้อยเจ็ดสิบแปด"
  },
  "present_voters": {
    "arabic": 91404,
    "thai_text": "เก้าหมื่นหนึ่งพันสี่ร้อยสี่"
  },
  "valid_ballots": {
    "arabic": 81574,
    "thai_text": "แปดหมื่นหนึ่งพันห้าร้อยเจ็ดสิบสี่"
  },
  "invalid_ballots": {
    "arabic": 6347,
    "thai_text": "หกพันสามร้อยสี่สิบเจ็ด"
  },
  "no_vote_ballots": {
    "arabic": 3483,
    "thai_text": "สามพันสี่ร้อยแปดส

In [54]:
# ── Quick validation checks ───────────────────────────────────────────────────
def validate_extraction(result: dict) -> None:
    """Print a quick sanity-check of the extracted data."""

    def arabic(obj):
        # Null-safe read of the integer in an {"arabic": ..., "thai_text": ...} pair.
        return (obj or {}).get("arabic", 0) or 0

    bs = result.get("ballot_summary") or {}
    fi = result.get("form_info") or {}
    results = result.get("results") or []
    total_votes = arabic(result.get("total_votes"))

    # Ballot math: ballots used should equal valid + invalid + no-vote.
    valid = arabic(bs.get("valid_ballots"))
    invalid = arabic(bs.get("invalid_ballots"))
    no_vote = arabic(bs.get("no_vote_ballots"))
    total_used = arabic(bs.get("total_ballots_used"))
    calc_used = valid + invalid + no_vote
    ballot_ok = total_used == calc_used

    # Vote math: total_votes should equal the sum of per-entry vote counts.
    calc_votes = sum(arabic(r.get("vote_count")) for r in results)
    votes_ok = calc_votes == total_votes

    print(f"form_type        : {fi.get('form_type')}")
    print(f"province         : {fi.get('province')}")
    print(f"constituency     : เขต {fi.get('constituency_number')}")
    print(f"results entries  : {len(results)}")
    print()
    print(f"ballot check     : {total_used} = {valid}+{invalid}+{no_vote}={calc_used}  {'✅' if ballot_ok else '❌'}")
    print(f"votes check      : total_votes={total_votes}, sum(results)={calc_votes}  {'✅' if votes_ok else '❌'}")


validate_extraction(result)

form_type        : แบ่งเขต
province         : เชียงใหม่
constituency     : เขต 6
results entries  : 8

ballot check     : 91404 = 81574+6347+3483=91404  ✅
votes check      : total_votes=81574, sum(results)=81574  ✅
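
Both checks above are plain arithmetic identities over the extracted numbers. The following is a minimal standalone sketch of them, using the ballot figures printed for the เชียงใหม่ sample. The helper names `ballot_identity` and `votes_identity` are hypothetical and exist only for this illustration; the notebook's own `_ballot_check` and `_votes_check` are defined earlier and may differ in detail. The vote entries in the second call are toy values, not extracted data.

```python
from typing import Optional, Tuple


def arabic(field: Optional[dict]) -> int:
    """Read the integer from an {"arabic": ..., "thai_text": ...} pair; None or missing gives 0."""
    return (field or {}).get("arabic") or 0


def ballot_identity(ballot_summary: dict) -> Tuple[bool, int, int]:
    """Check total_ballots_used == valid + invalid + no_vote. Returns (ok, reported, expected)."""
    expected = sum(
        arabic(ballot_summary.get(key))
        for key in ("valid_ballots", "invalid_ballots", "no_vote_ballots")
    )
    reported = arabic(ballot_summary.get("total_ballots_used"))
    return reported == expected, reported, expected


def votes_identity(results: list, total_votes: Optional[dict]) -> Tuple[bool, int, int]:
    """Check total_votes == sum of per-entry vote counts. Returns (ok, reported, expected)."""
    expected = sum(arabic(r.get("vote_count")) for r in results)
    reported = arabic(total_votes)
    return reported == expected, reported, expected


# Ballot figures taken from the smoke-test output above.
chiang_mai = {
    "valid_ballots": {"arabic": 81574},
    "invalid_ballots": {"arabic": 6347},
    "no_vote_ballots": {"arabic": 3483},
    "total_ballots_used": {"arabic": 91404},
}
print(ballot_identity(chiang_mai))  # (True, 91404, 91404)

# Toy vote entries (illustrative only, not from any form).
toy_results = [{"vote_count": {"arabic": 3}}, {"vote_count": {"arabic": 4}}]
print(votes_identity(toy_results, {"arabic": 7}))  # (True, 7, 7)
```

Returning the reported and expected totals alongside the boolean makes it straightforward to build a human-readable reason string for a failed check.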


## 5. Local Pre-labeling Run

Run all clean `sample_inputs` locally across **4 model configurations** to generate candidate labels.

**Purpose**: Use LLMs to pre-label data so a human can review and correct efficiently.  
After human validation, Section 6 pushes the approved records as a Datadog LLMObs dataset.  
Datadog LLMObs Experiments come **after** the dataset is labeled.

| # | Model | Thinking budget |
|---|---|---|
| 1 | `gemini-3-pro-preview` | LOW  (1 024 tokens) |
| 2 | `gemini-3-pro-preview` | HIGH (8 192 tokens) |
| 3 | `gemini-2.5-pro` | LOW  (1 024 tokens) |
| 4 | `gemini-2.5-pro` | HIGH (8 192 tokens) |

In [82]:
PRELABEL_CONFIGS = [
    {
        "name": "gemini-3-pro-preview / LOW",
        "model": "gemini-3-pro-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "LOW",
    },
    {
        "name": "gemini-3-pro-preview / HIGH",
        "model": "gemini-3-pro-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
    {
        "name": "gemini-2.5-pro / LOW",
        "model": "gemini-2.5-pro",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "LOW",
    },
    {
        "name": "gemini-2.5-pro / HIGH",
        "model": "gemini-2.5-pro",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
]

# Each config is already a flat dict, so it can be tabulated directly.
pd.DataFrame(PRELABEL_CONFIGS)[["name", "model", "thinking_mode", "max_tokens"]]

,name,model,thinking_mode,max_tokens
0,gemini-3-pro-preview / LOW,gemini-3-pro-preview,LOW,16384
1,gemini-3-pro-preview / HIGH,gemini-3-pro-preview,HIGH,16384
2,gemini-2.5-pro / LOW,gemini-2.5-pro,LOW,16384
3,gemini-2.5-pro / HIGH,gemini-2.5-pro,HIGH,16384


In [None]:
# ── Filter to recognized form types ───────────────────────────────────────────
# "unknown" stays in the allow-list, so the กาญจนบุรี, กระบี่, and อ่างทอง outliers are
# still pre-labeled. Remove it from the tuple to exclude those records.
sample_clean = [
    inp for inp in sample_inputs
    if inp["source_file_metadata"].get("form_type") in ("แบ่งเขต", "บัญชีรายชื่อ", "unknown")
]
excluded = len(sample_inputs) - len(sample_clean)
print(f"Clean records to pre-label : {len(sample_clean)}  (excluded {excluded} 'unknown' form_type)")

Clean records to pre-label : 9  (excluded 0 'unknown' form_type)


In [84]:
# ── Run each config against every clean sample (local pre-labeling) ───────────
# Results are collected in local_results for human review.
# Inline evaluations (ballot_check, votes_check) are still submitted to Datadog
# LLMObs via submit_evaluation so you can spot patterns in the trace explorer.

local_results: Dict[str, list] = {}  # cfg_label → list of row dicts

for cfg in PRELABEL_CONFIGS:
    cfg_label   = cfg["name"]
    task_config = {k: v for k, v in cfg.items() if k != "name"}
    rows        = []

    print(f"▶ {cfg_label}  ({len(sample_clean)} records)")
    for i, inp in enumerate(sample_clean, 1):
        meta = inp["source_file_metadata"]
        try:
            result       = extract_ss61_form(inp, task_config)
            b_score, b_r = _ballot_check(result)
            v_score, v_r = _votes_check(result)
            b_icon = "✅" if b_score == 1.0 else "❌"
            v_icon = "✅" if v_score == 1.0 else "❌"
            rows.append({
                "input":         inp,
                "result":        result,
                "ballot_score":  b_score,
                "ballot_reason": b_r,
                "votes_score":   v_score,
                "votes_reason":  v_r,
                "error":         None,
            })
        except Exception as e:
            # Keep the same keys as the success branch so every row has one shape.
            b_icon = v_icon = "💥"
            rows.append({
                "input":         inp,
                "result":        None,
                "ballot_score":  0.0,
                "ballot_reason": None,
                "votes_score":   0.0,
                "votes_reason":  None,
                "error":         str(e),
            })
        print(f"  [{i}/{len(sample_clean)}] {meta['province_name']} ({meta['form_type']}) - ballot={b_icon}  votes={v_icon}")

    local_results[cfg_label] = rows
    print()

print("✅ Pre-labeling complete")

▶ gemini-3-pro-preview / LOW  (9 records)
  [1/9] เชียงใหม่ (แบ่งเขต) - ballot=✅  votes=✅


failed to send, dropping 3 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect)


  [2/9] นครราชสีมา (แบ่งเขต) - ballot=✅  votes=✅
  [3/9] นครศรีธรรมราช (แบ่งเขต) - ballot=✅  votes=✅
  [4/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [5/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [6/9] แพร่ (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [7/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [8/9] กาญจนบุรี (unknown) - ballot=✅  votes=❌
  [9/9] อ่างทอง (unknown) - ballot=✅  votes=✅

▶ gemini-3-pro-preview / HIGH  (9 records)
  [1/9] เชียงใหม่ (แบ่งเขต) - ballot=✅  votes=✅
  [2/9] นครราชสีมา (แบ่งเขต) - ballot=✅  votes=✅
  [3/9] นครศรีธรรมราช (แบ่งเขต) - ballot=✅  votes=✅
  [4/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [5/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [6/9] แพร่ (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [7/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [8/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [9/9] อ่างทอง (unknown) - ballot=✅  votes=✅

▶ gemini-2.5-pro / LOW  (9 records)
  [1/9] เชียงใหม่ (แบ่งเขต) - ballot=✅  votes=✅
  [2/9] นครราชสีมา (แบ่งเขต) - ballot=✅  votes=✅
  [3/9] นครศรีธรรมราช (แบ่งเขต) - ballot=✅  votes=✅
  [4/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [5/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [6/9] แพร่ (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [7/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [8/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [9/9] อ่างทอง (unknown) - ballot=✅  votes=✅

▶ gemini-2.5-pro / HIGH  (9 records)
  [1/9] เชียงใหม่ (แบ่งเขต) - ballot=✅  votes=✅
  [2/9] นครราชสีมา (แบ่งเขต) - ballot=✅  votes=✅
  [3/9] นครศรีธรรมราช (แบ่งเขต) - ballot=✅  votes=✅
  [4/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [5/9] กรุงเทพมหานคร (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [6/9] แพร่ (บัญชีรายชื่อ) - ballot=✅  votes=✅
  [7/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [8/9] กาญจนบุรี (unknown) - ballot=✅  votes=✅
  [9/9] อ่างทอง (unknown) - ballot=✅  votes=✅

✅ Pre-labeling complete


In [85]:
# ── Build comparison rows from local_results ──────────────────────────────────
comparison_rows = []

for cfg_label, rows in local_results.items():
    for row in rows:
        meta = row["input"].get("source_file_metadata", {})
        filename = meta.get("path", "").split("/")[-1].replace(".pdf", "")

        if row["error"]:
            # Extraction raised: mark every column as an error, not a failed check,
            # so the pivot matches the legend printed below.
            ballot_ok = votes_ok = all_ok = "💥"
        else:
            ballot_ok = "✅" if row["ballot_score"] == 1.0 else "❌"
            votes_ok  = "✅" if row["votes_score"]  == 1.0 else "❌"
            all_ok    = "✅" if ballot_ok == votes_ok == "✅" else "❌"

        comparison_rows.append({
            "config":       cfg_label,
            "province":     meta.get("province_name", "?"),
            "form_type":    meta.get("form_type", "?"),
            "file":         filename,
            "ballot_check": ballot_ok,
            "votes_check":  votes_ok,
            "all_match":    all_ok,
        })

comp_df = pd.DataFrame(comparison_rows)

# ── Pivot: rows = file, columns = config ──────────────────────────────────────
pivot = comp_df.pivot_table(
    index=["province", "form_type", "file"],
    columns="config",
    values="all_match",
    aggfunc="first",
).reset_index()

print("Comparison table  (✅ both checks pass  |  ❌ at least one fails  |  💥 error)")
print("─" * 70)
pivot

Comparison table  (✅ both checks pass  |  ❌ at least one fails  |  💥 error)
──────────────────────────────────────────────────────────────────────


config,province,form_type,file,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,กรุงเทพมหานคร,บัญชีรายชื่อ,1. กรุงเทพมหานคร เขต 2 (บช),✅,✅,✅,✅
1,กรุงเทพมหานคร,บัญชีรายชื่อ,1. กรุงเทพมหานคร เขต 26 (บช),✅,✅,✅,✅
2,กาญจนบุรี,unknown,3. กาญจนบุรี เขต 1 (1),✅,✅,✅,❌
3,กาญจนบุรี,unknown,3. กาญจนบุรี เขต 3 (1),✅,✅,✅,✅
4,นครราชสีมา,แบ่งเขต,21. นครราชสีมา เขต 13,✅,✅,✅,✅
5,นครศรีธรรมราช,แบ่งเขต,22. นครศรีธรรมราช เขต 8,✅,✅,✅,✅
6,อ่างทอง,unknown,72. อ่างทอง เขต 2,✅,✅,✅,✅
7,เชียงใหม่,แบ่งเขต,14. เชียงใหม่ เขต 6,✅,✅,✅,✅
8,แพร่,บัญชีรายชื่อ,41. แพร่ เขต 1 (บช),✅,✅,✅,✅


In [86]:
# ── Per-config accuracy summary ───────────────────────────────────────────────
summary = (
    comp_df
    .assign(
        ballot_pass=comp_df["ballot_check"] == "✅",
        votes_pass=comp_df["votes_check"]   == "✅",
        all_pass=comp_df["all_match"]       == "✅",
    )
    .groupby("config", sort=False)
    .agg(
        n           =("all_pass",    "count"),
        ballot_pass =("ballot_pass", "sum"),
        votes_pass  =("votes_pass",  "sum"),
        all_pass    =("all_pass",    "sum"),
    )
    .assign(
        ballot_pct=lambda x: (x["ballot_pass"] / x["n"] * 100).round(1),
        votes_pct =lambda x: (x["votes_pass"]  / x["n"] * 100).round(1),
        all_pct   =lambda x: (x["all_pass"]    / x["n"] * 100).round(1),
    )
    .rename(columns={
        "ballot_pct": "ballot %",
        "votes_pct":  "votes %",
        "all_pct":    "all %",
    })
)

print("── Per-config pass rates (internal consistency checks) ──")
summary[["n", "ballot_pass", "ballot %", "votes_pass", "votes %", "all_pass", "all %"]]

── Per-config pass rates (internal consistency checks) ──


config,n,ballot_pass,ballot %,votes_pass,votes %,all_pass,all %
gemini-3-pro-preview / LOW,9,9,100.0,8,88.9,8,88.9
gemini-3-pro-preview / HIGH,9,9,100.0,9,100.0,9,100.0
gemini-2.5-pro / LOW,9,9,100.0,9,100.0,9,100.0
gemini-2.5-pro / HIGH,9,9,100.0,9,100.0,9,100.0


In [87]:
# ── Inputs where configs disagree: good candidates for human review ───────────
cfg_cols      = list(local_results.keys())
disagreements = pivot[pivot[cfg_cols].nunique(axis=1) > 1]

if disagreements.empty:
    print("✅ All configs agree on every input.")
else:
    print(f"⚠️  {len(disagreements)} input(s) where configs disagree - review these first:\n")
    print(disagreements.to_string(index=False))

⚠️  1 input(s) where configs disagree - review these first:

 province form_type                   file gemini-2.5-pro / HIGH gemini-2.5-pro / LOW gemini-3-pro-preview / HIGH gemini-3-pro-preview / LOW
กาญจนบุรี   unknown 3. กาญจนบุรี เขต 1 (1)                     ✅                    ✅                           ✅                          ❌


---
## 5.1 Detailed Config Comparison  *(reads directly from JSONL)*

Compares what each model config extracted for the **same file** across three levels:

| Level | What is compared |
|---|---|
| **Checks** | `ballot_check` and `votes_check` pass/fail per config |
| **Key numbers** | Ballot summary + `total_votes`: do all configs agree? |
| **Vote entries** | Per-candidate/party vote count: value and count agreement |
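
The cells in this section read `datasets/ss6_1_prelabels.jsonl`, which has to exist before they run. The following is a minimal sketch of how the in-memory `local_results` from Section 5 could be flattened into that file. It assumes the record shape that the loader below reads (`prelabel_config`, `source_file`, `extracted_data`, `ballot_check`, `votes_check`, `error`). The function name `write_prelabels` is hypothetical, and the demo row uses placeholder values rather than real extraction output.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_prelabels(local_results: Dict[str, List[dict]], configs: List[dict], path: Path) -> int:
    """Flatten {config name -> rows} into one JSON object per line; return the record count."""
    cfg_by_name = {c["name"]: c for c in configs}
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("w", encoding="utf-8") as f:
        for cfg_name, rows in local_results.items():
            for row in rows:
                record = {
                    "prelabel_config": cfg_by_name.get(cfg_name, {"name": cfg_name}),
                    "source_file": row["input"]["source_file_metadata"],
                    "extracted_data": row["result"],
                    "ballot_check": row["ballot_score"],
                    "votes_check": row["votes_score"],
                    "error": row["error"],
                }
                # ensure_ascii=False keeps Thai text readable in the file.
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
    return written


# Round-trip demo with one placeholder row (illustrative values only).
demo_results = {
    "demo-config": [{
        "input": {"source_file_metadata": {
            "file_id": "FILE_ID_PLACEHOLDER",
            "path": "demo/placeholder.pdf",
            "province_name": "demo",
            "form_type": "unknown",
        }},
        "result": {"total_votes": {"arabic": 7}},
        "ballot_score": 1.0,
        "votes_score": 1.0,
        "error": None,
    }]
}
demo_path = Path(tempfile.mkdtemp()) / "prelabels_demo.jsonl"
written = write_prelabels(demo_results, [{"name": "demo-config", "model": "demo-model"}], demo_path)
first = json.loads(demo_path.read_text(encoding="utf-8").splitlines()[0])
print(written, first["prelabel_config"]["name"], first["source_file"]["file_id"])
```

In the notebook itself the call would presumably look like `write_prelabels(local_results, PRELABEL_CONFIGS, Path("datasets/ss6_1_prelabels.jsonl"))`, giving one record per (config, file) pair.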

In [89]:
# ── Load pre-label records from JSONL ─────────────────────────────────────────
from collections import defaultdict
from pathlib import Path

PRELABEL_PATH = Path("datasets/ss6_1_prelabels.jsonl")

prelabel_records_raw: list[dict] = []
with PRELABEL_PATH.open(encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            prelabel_records_raw.append(json.loads(line))

all_cfg_names  = sorted({r["prelabel_config"]["name"] for r in prelabel_records_raw})
all_file_ids   = list(dict.fromkeys(r["source_file"]["file_id"] for r in prelabel_records_raw))

print(f"Records loaded : {len(prelabel_records_raw)}")
print(f"Configs        : {all_cfg_names}")
print(f"Unique files   : {len(all_file_ids)}")

# ── Index by file_id → config_name → record ───────────────────────────────────
by_file: dict[str, dict[str, dict]] = defaultdict(dict)
for rec in prelabel_records_raw:
    by_file[rec["source_file"]["file_id"]][rec["prelabel_config"]["name"]] = rec

Records loaded : 36
Configs        : ['gemini-2.5-pro / HIGH', 'gemini-2.5-pro / LOW', 'gemini-3-pro-preview / HIGH', 'gemini-3-pro-preview / LOW']
Unique files   : 9


In [90]:
# ── Helper: extract key numeric values from one record ────────────────────────

def _kv(rec: dict) -> dict | None:
    """Return a flat dict of key values from a pre-label record, or None on error."""
    if rec.get("error"):
        return None
    ext = rec.get("extracted_data") or {}
    bs  = ext.get("ballot_summary") or {}
    fi  = ext.get("form_info") or {}
    results = ext.get("results") or []

    return {
        # form info
        "form_type":           fi.get("form_type", "?"),
        "constituency_number": fi.get("constituency_number", "?"),
        # ballot summary
        "eligible_voters":     _arabic(bs.get("eligible_voters")),
        "present_voters":      _arabic(bs.get("present_voters")),
        "valid_ballots":       _arabic(bs.get("valid_ballots")),
        "invalid_ballots":     _arabic(bs.get("invalid_ballots")),
        "no_vote_ballots":     _arabic(bs.get("no_vote_ballots")),
        "total_ballots_used":  _arabic(bs.get("total_ballots_used")),
        # totals
        "total_votes":         _arabic(ext.get("total_votes")),
        "n_results":           len(results),
        # internal checks
        "ballot_check":        rec.get("ballot_check", 0.0),
        "votes_check":         rec.get("votes_check", 0.0),
        # raw results list for vote-level diff
        "_results":            results,
    }


def _icon(v) -> str:
    if v is None:
        return "💥"
    return "✅" if v == 1.0 else "❌"


print("✅ Helpers defined")

✅ Helpers defined


In [91]:
# ── Level 1: Check scores + total_votes per file per config ───────────────────
# Pivot: rows = file, col groups = config → ballot_check / votes_check / total_votes

NUMERIC_FIELDS = [
    "eligible_voters", "present_voters",
    "valid_ballots", "invalid_ballots", "no_vote_ballots", "total_ballots_used",
    "total_votes",
]

summary_rows = []
for fid in all_file_ids:
    cfg_data = by_file[fid]
    first_rec = next(iter(cfg_data.values()))
    meta = first_rec["source_file"]
    label = meta["path"].split("/")[-1].replace(".pdf", "")

    row: dict = {
        "province":  meta["province_name"],
        "form_type": meta["form_type"],
        "file":      label,
    }

    values_by_field: dict[str, list] = defaultdict(list)

    for cfg in all_cfg_names:
        rec = cfg_data.get(cfg)
        if rec is None:
            # No record for this config: show a placeholder in every column.
            row[f"{cfg} | ballot"] = "-"
            row[f"{cfg} | votes"]  = "-"
            row[f"{cfg} | total_votes"] = "-"
            continue

        kv = _kv(rec)
        if kv is None:
            # The record exists but extraction failed.
            row[f"{cfg} | ballot"] = "💥"
            row[f"{cfg} | votes"]  = "💥"
            row[f"{cfg} | total_votes"] = "ERR"
            continue

        row[f"{cfg} | ballot"] = _icon(kv["ballot_check"])
        row[f"{cfg} | votes"]  = _icon(kv["votes_check"])
        row[f"{cfg} | total_votes"] = kv["total_votes"]

        for field in NUMERIC_FIELDS:
            values_by_field[field].append(kv[field])

    # Mark fields where configs disagree
    disagreed = [f for f, vs in values_by_field.items() if len(set(vs)) > 1]
    row["field_disagreements"] = ", ".join(disagreed) if disagreed else "✅ all match"

    summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)

# Pretty-print column order
base_cols = ["province", "form_type", "file", "field_disagreements"]
cfg_cols  = [c for c in summary_df.columns if c not in base_cols]

print("=== Level 1: Check scores + total_votes + field disagreements ===")
summary_df[base_cols + cfg_cols]

=== Level 1: Check scores + total_votes + field disagreements ===


,province,form_type,file,field_disagreements,gemini-2.5-pro / HIGH | ballot,gemini-2.5-pro / HIGH | votes,gemini-2.5-pro / HIGH | total_votes,gemini-2.5-pro / LOW | ballot,gemini-2.5-pro / LOW | votes,gemini-2.5-pro / LOW | total_votes,gemini-3-pro-preview / HIGH | ballot,gemini-3-pro-preview / HIGH | votes,gemini-3-pro-preview / HIGH | total_votes,gemini-3-pro-preview / LOW | ballot,gemini-3-pro-preview / LOW | votes,gemini-3-pro-preview / LOW | total_votes
0,เชียงใหม่,แบ่งเขต,14. เชียงใหม่ เขต 6,✅ all match,✅,✅,81574,✅,✅,81574,✅,✅,81574,✅,✅,81574
1,นครราชสีมา,แบ่งเขต,21. นครราชสีมา เขต 13,"invalid_ballots, no_vote_ballots",✅,✅,66505,✅,✅,66505,✅,✅,66505,✅,✅,66505
2,นครศรีธรรมราช,แบ่งเขต,22. นครศรีธรรมราช เขต 8,✅ all match,✅,✅,92847,✅,✅,92847,✅,✅,92847,✅,✅,92847
3,กรุงเทพมหานคร,บัญชีรายชื่อ,1. กรุงเทพมหานคร เขต 26 (บช),✅ all match,✅,✅,92428,✅,✅,92428,✅,✅,92428,✅,✅,92428
4,กรุงเทพมหานคร,บัญชีรายชื่อ,1. กรุงเทพมหานคร เขต 2 (บช),✅ all match,✅,✅,86220,✅,✅,86220,✅,✅,86220,✅,✅,86220
5,แพร่,บัญชีรายชื่อ,41. แพร่ เขต 1 (บช),✅ all match,✅,✅,92182,✅,✅,92182,✅,✅,92182,✅,✅,92182
6,กาญจนบุรี,unknown,3. กาญจนบุรี เขต 3 (1),✅ all match,✅,✅,107950,✅,✅,107950,✅,✅,107950,✅,✅,107950
7,กาญจนบุรี,unknown,3. กาญจนบุรี เขต 1 (1),✅ all match,✅,✅,85850,✅,✅,85850,✅,✅,85850,✅,❌,85850
8,อ่างทอง,unknown,72. อ่างทอง เขต 2,"eligible_voters, invalid_ballots, no_vote_ballots",✅,✅,79838,✅,✅,79838,✅,✅,79838,✅,✅,79838


In [92]:
# ── Level 2: Numeric field comparison (only files where configs disagree) ──────
# For each disagreeing file × field pair, show what each config returned

detail_rows = []
for fid in all_file_ids:
    cfg_data = by_file[fid]
    first_rec = next(iter(cfg_data.values()))
    meta = first_rec["source_file"]
    label = meta["path"].split("/")[-1].replace(".pdf", "")

    # Gather per-field per-config values
    kv_by_cfg = {}
    for cfg in all_cfg_names:
        rec = cfg_data.get(cfg)
        kv_by_cfg[cfg] = _kv(rec) if rec else None

    for field in NUMERIC_FIELDS:
        vals = {cfg: (kv[field] if kv else None) for cfg, kv in kv_by_cfg.items()}
        unique_vals = set(v for v in vals.values() if v is not None)
        if len(unique_vals) <= 1:
            continue  # all configs agree, skip

        # Only disagreeing fields reach this point, so no per-row agree flag is needed.
        row = {
            "province": meta["province_name"],
            "form_type": meta["form_type"],
            "file": label,
            "field": field,
        }
        for cfg in all_cfg_names:
            # vals always has every config as a key, so test for None explicitly
            row[cfg] = vals[cfg] if vals[cfg] is not None else "-"
        detail_rows.append(row)

if not detail_rows:
    print("✅ All configs agree on every numeric field for every file!")
else:
    detail_df = pd.DataFrame(detail_rows)
    print(f"=== Level 2: Numeric field disagreements ({len(detail_df)} rows) ===")
    display(detail_df[["province", "form_type", "file", "field"] + all_cfg_names])

=== Level 2: Numeric field disagreements (5 rows) ===


Unnamed: 0,province,form_type,file,field,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,นครราชสีมา,แบ่งเขต,21. นครราชสีมา เขต 13,invalid_ballots,6631,6631,6231,6631
1,นครราชสีมา,แบ่งเขต,21. นครราชสีมา เขต 13,no_vote_ballots,5354,5354,5754,5354
2,อ่างทอง,unknown,72. อ่างทอง เขต 2,eligible_voters,112384,112384,113384,113384
3,อ่างทอง,unknown,72. อ่างทอง เขต 2,invalid_ballots,3301,3301,3311,3311
4,อ่างทอง,unknown,72. อ่างทอง เขต 2,no_vote_ballots,3239,3239,3229,3229
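The Level 2 rows above hint at a limit of the internal checks. In both files the competing readings of `invalid_ballots` and `no_vote_ballots` differ by offsetting amounts, so their sum is unchanged and a sum-based check such as `ballot_check` cannot tell them apart. The sketch below only re-adds the numbers from the table; the `readings` dictionary, its English keys, and the `pair_sums` helper are illustrative names, not part of the notebook.

```python
# (invalid_ballots, no_vote_ballots) pairs copied from the Level 2 table.
# Keys are illustrative English labels for the two files.
readings = {
    "file 21 (Nakhon Ratchasima, district 13)": [(6631, 5354), (6231, 5754)],
    "file 72 (Ang Thong, district 2)": [(3301, 3239), (3311, 3229)],
}


def pair_sums(pairs):
    """Return invalid + no_vote for each competing reading."""
    return [invalid + no_vote for invalid, no_vote in pairs]


sums_by_file = {name: pair_sums(pairs) for name, pairs in readings.items()}

for name, sums in sums_by_file.items():
    # Identical sums mean a sum-based consistency check passes for every reading.
    verdict = "indistinguishable" if len(set(sums)) == 1 else "distinguishable"
    print(f"{name}: {sums} -> {verdict}")
```

This is consistent with the Level 1 table, where every config passes `ballot_check` on these two files despite the disagreement, and it is one reason to compare against the source PDF rather than rely on the internal checks alone.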


In [93]:
# ── Level 3: Vote-entry comparison per file ─────────────────────────────────────
# For each file, build a table: rows = candidate number, cols = config → vote_count
# Marks rows where at least one config differs.

for fid in all_file_ids:
    cfg_data = by_file[fid]
    first_rec = next(iter(cfg_data.values()))
    meta  = first_rec["source_file"]
    label = meta["path"].split("/")[-1].replace(".pdf", "")

    # Collect all candidate numbers seen across all configs
    vote_maps: dict[str, dict[int, int]] = {}
    name_maps: dict[int, str] = {}
    party_maps: dict[int, str] = {}
    for cfg in all_cfg_names:
        rec = cfg_data.get(cfg)
        kv  = _kv(rec) if rec else None
        if kv is None:
            vote_maps[cfg] = {}
            continue
        vm = {}
        for entry in kv["_results"]:
            num = entry.get("number")
            vm[num] = _arabic(entry.get("vote_count"))
            # Use the first non-null name/party we see
            if num not in name_maps and entry.get("candidate_name"):
                name_maps[num]  = entry.get("candidate_name", "")
                party_maps[num] = entry.get("party_name", "")
        vote_maps[cfg] = vm

    all_nums = sorted({n for vm in vote_maps.values() for n in vm})
    if not all_nums:
        continue

    vote_rows = []
    has_disagreement = False
    for num in all_nums:
        # "-" marks a config that returned no entry for this candidate number
        vals = {cfg: vote_maps[cfg].get(num, "-") for cfg in all_cfg_names}
        unique = set(v for v in vals.values() if v != "-")
        agree = len(unique) <= 1
        if not agree:
            has_disagreement = True
        row = {
            "#":         num,
            "candidate": name_maps.get(num, ""),
            "party":     party_maps.get(num, ""),
            "agree":     "✅" if agree else "❌",
        }
        row.update(vals)
        vote_rows.append(row)

    vote_df = pd.DataFrame(vote_rows)
    status  = "❌ DISAGREE" if has_disagreement else "✅ all match"
    print(f"\n{'='*70}")
    print(f"  {meta['province_name']}  {meta['form_type']}  |  {label}")
    print(f"  Vote entries: {status}")
    display(vote_df[["#", "candidate", "party", "agree"] + all_cfg_names])


  ‡πÄ‡∏ä‡∏µ‡∏¢‡∏á‡πÉ‡∏´‡∏°‡πà  ‡πÅ‡∏ö‡πà‡∏á‡πÄ‡∏Ç‡∏ï  |  14. ‡πÄ‡∏ä‡∏µ‡∏¢‡∏á‡πÉ‡∏´‡∏°‡πà ‡πÄ‡∏Ç‡∏ï 6
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡∏ô‡∏≤‡∏á‡∏™‡∏≤‡∏ß‡∏™‡∏∏‡∏†‡∏≤‡∏ô‡∏±‡∏ô‡∏ó‡πå ‡∏õ‡∏±‡∏ç‡∏ç‡∏≤‡∏ó‡∏¥‡∏û‡∏¢‡πå,‡∏Å‡∏•‡πâ‡∏≤‡∏ò‡∏£‡∏£‡∏°,‚úÖ,33043,33043,33043,33043
1,2,‡∏ß‡πà‡∏≤‡∏ó‡∏µ‡πà‡∏£‡πâ‡∏≠‡∏¢‡∏ï‡∏£‡∏µ‡∏´‡∏ç‡∏¥‡∏á‡∏≠‡∏£‡∏û‡∏£‡∏£‡∏ì ‡∏à‡∏±‡∏ô‡∏ï‡∏≤‡πÄ‡∏£‡∏∑‡πà‡∏≠‡∏á,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ä‡∏ô,‚úÖ,26649,26649,26649,26649
2,3,‡∏ô‡∏≤‡∏¢‡πÄ‡∏≠‡∏Å ‡∏õ‡∏∏‡∏Å‡∏°‡∏ì‡∏µ,‡∏ß‡∏¥‡∏ä‡∏ä‡∏±‡πà‡∏ô‡πÉ‡∏´‡∏°‡πà,‚úÖ,327,327,327,327
3,4,‡∏ô‡∏≤‡∏¢‡∏≠‡∏∏‡∏ó‡∏¥‡∏® ‡∏™‡∏≤‡∏¢‡∏î‡∏ß‡∏á‡πÅ‡∏Å‡πâ‡∏ß,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡∏±‡∏ï‡∏¢‡πå,‚úÖ,1534,1534,1534,1534
4,5,‡∏ô‡∏≤‡∏¢‡∏ß‡∏£‡πÇ‡∏ä‡∏ï‡∏¥ ‡∏à‡∏µ‡πâ‡πÄ‡∏£‡∏∑‡∏≠‡∏ô,‡∏†‡∏π‡∏°‡∏¥‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,2041,2041,2041,2041
5,6,‡∏ô‡∏≤‡∏¢‡∏ò‡∏£‡∏£‡∏°‡∏°‡∏ç ‡∏ß‡∏∏‡∏í‡∏¥‡∏•‡∏±‡∏Å‡∏©‡∏ì‡πå,‡πÄ‡∏®‡∏£‡∏©‡∏ê‡∏Å‡∏¥‡∏à,‚úÖ,1166,1166,1166,1166
6,7,‡∏ô‡∏≤‡∏¢‡∏≠‡∏£‡∏∏‡∏ì ‡∏ò‡∏ô‡∏∞‡∏´‡∏°‡∏µ,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‚úÖ,798,798,798,798
7,8,‡∏ô‡∏≤‡∏¢‡∏ö‡∏±‡∏ì‡∏à‡∏á‡∏®‡∏±‡∏Å‡∏î‡∏¥‡πå ‡∏ß‡∏á‡∏®‡πå‡∏£‡∏±‡∏ï‡∏ô‡∏ß‡∏£‡∏£‡∏ì,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,16016,16016,16016,16016



  ‡∏ô‡∏Ñ‡∏£‡∏£‡∏≤‡∏ä‡∏™‡∏µ‡∏°‡∏≤  ‡πÅ‡∏ö‡πà‡∏á‡πÄ‡∏Ç‡∏ï  |  21. ‡∏ô‡∏Ñ‡∏£‡∏£‡∏≤‡∏ä‡∏™‡∏µ‡∏°‡∏≤ ‡πÄ‡∏Ç‡∏ï 13
  Vote entries: ‚ùå DISAGREE


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡∏ñ‡∏π‡∏Å‡∏ñ‡∏≠‡∏ô‡∏ä‡∏∑‡πà‡∏≠,‡∏Å‡∏•‡πâ‡∏≤‡∏ò‡∏£‡∏£‡∏°,‚ùå,0,3028,3028,3028
1,2,‡∏ô‡∏≤‡∏¢‡πÑ‡∏ß‡∏ß‡∏¥‡∏Å ‡∏™‡∏ß‡∏£‡∏£‡∏ì‡∏≤,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡∏±‡∏ï‡∏¢‡πå,‚úÖ,1775,1775,1775,1775
2,3,‡∏ô‡∏≤‡∏¢‡∏ß‡∏¥‡∏ä‡∏±‡∏¢ ‡∏Ç‡∏≠‡∏´‡∏°‡∏±‡πà‡∏ô‡∏Å‡∏•‡∏≤‡∏á,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡∏ö‡πâ‡∏≤‡∏ô‡πÄ‡∏°‡∏∑‡∏≠‡∏á,‚ùå,310,310,0,0
3,4,‡∏ô‡∏≤‡∏¢‡∏™‡∏∏‡∏Å‡∏§‡∏©‡∏ì‡πå ‡∏ß‡∏±‡∏ä‡∏£‡∏°‡∏≤‡∏•‡∏µ‡∏Å‡∏∏‡∏•,‡∏û‡∏•‡∏±‡∏á‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏£‡∏±‡∏ê,‚úÖ,1944,1944,1944,1944
4,5,‡∏ô‡∏≤‡∏¢‡∏û‡∏ä‡∏£ ‡∏à‡∏±‡∏ô‡∏ó‡∏£‡∏£‡∏ß‡∏á‡∏ó‡∏≠‡∏á,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,28157,28157,28157,28157
5,6,‡∏ô‡∏≤‡∏á‡∏™‡∏≤‡∏ß‡∏ô‡∏≤‡∏•‡∏±‡∏ô‡∏ó‡∏≤ ‡∏ö‡∏∏‡∏ç‡∏ä‡∏¥‡∏ï,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ä‡∏ô,‚úÖ,25404,25404,25404,25404
6,7,‡∏ô‡∏≤‡∏¢‡∏°‡∏ô‡∏ï‡πå‡∏ä‡∏±‡∏¢ ‡∏û‡∏á‡∏©‡πå‡πÄ‡∏à‡∏£‡∏¥‡∏ç,‡∏†‡∏π‡∏°‡∏¥‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,5887,5887,5887,5887
7,8,‡∏ô‡∏≤‡∏¢‡∏≠‡∏±‡∏Ñ‡∏Ñ‡∏ä‡∏≤ ‡∏û‡∏£‡∏´‡∏°‡∏™‡∏π‡∏ï‡∏£,‡πÄ‡∏®‡∏£‡∏©‡∏ê‡∏Å‡∏¥‡∏à,‚ùå,3028,0,310,310



  ‡∏ô‡∏Ñ‡∏£‡∏®‡∏£‡∏µ‡∏ò‡∏£‡∏£‡∏°‡∏£‡∏≤‡∏ä  ‡πÅ‡∏ö‡πà‡∏á‡πÄ‡∏Ç‡∏ï  |  22. ‡∏ô‡∏Ñ‡∏£‡∏®‡∏£‡∏µ‡∏ò‡∏£‡∏£‡∏°‡∏£‡∏≤‡∏ä ‡πÄ‡∏Ç‡∏ï 8
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡∏ô‡∏≤‡∏á‡∏≠‡∏ß‡∏¢‡∏û‡∏£‡∏®‡∏£‡∏µ ‡πÄ‡∏ä‡∏≤‡∏ß‡∏•‡∏¥‡∏ï,‡∏†‡∏π‡∏°‡∏¥‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,54040,54040,54040,54040
1,2,‡∏ô‡∏≤‡∏¢‡∏ò‡∏µ‡∏£‡∏ß‡∏±‡∏í‡∏ô‡πå ‡∏ö‡∏∏‡∏ç‡∏ß‡∏£‡∏£‡∏ì,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ä‡∏ô,‚úÖ,12201,12201,12201,12201
2,3,‡∏ô‡∏≤‡∏¢‡∏ò‡∏µ‡∏£‡∏û‡∏á‡∏®‡πå ‡∏™‡∏¥‡∏ó‡∏ò‡∏≤,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‚úÖ,805,805,805,805
3,4,‡∏ô‡∏≤‡∏á‡∏™‡∏≤‡∏ß‡∏£‡∏±‡∏ï‡∏ô‡∏≤‡∏ß‡∏î‡∏µ ‡∏®‡∏£‡∏µ‡∏ô‡∏≤‡∏Ñ‡∏ä,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,860,860,860,860
4,5,‡∏ô‡∏≤‡∏¢‡∏≠‡∏ô‡∏∏‡∏ä‡∏¥‡∏ï ‡∏û‡∏£‡∏´‡∏°‡∏à‡∏±‡∏ô‡∏ó‡∏£‡πå,‡∏û‡∏•‡∏±‡∏á‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏£‡∏±‡∏ê,‚úÖ,1630,1630,1630,1630
5,6,‡∏ô‡∏≤‡∏¢‡∏õ‡∏è‡∏¥‡∏ß‡∏±‡∏ï‡∏¥ ‡∏¢‡∏∏‡∏ï‡∏¥‡∏ò‡∏£‡∏£‡∏°,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡∏±‡∏ï‡∏¢‡πå,‚úÖ,22534,22534,22534,22534
6,7,‡∏ô‡∏≤‡∏¢‡∏™‡∏°‡∏ö‡∏π‡∏£‡∏ì‡πå ‡∏´‡∏±‡∏ï‡∏ñ‡∏õ‡∏£‡∏∞‡∏î‡∏¥‡∏©‡∏ê‡πå,‡∏û‡∏•‡∏ß‡∏±‡∏ï,‚úÖ,487,487,487,487
7,8,‡∏ô‡∏≤‡∏ß‡∏≤‡∏≠‡∏≤‡∏Å‡∏≤‡∏®‡πÄ‡∏≠‡∏Å ‡∏™‡∏∏‡∏£‡∏¥‡∏ô‡∏ó‡∏£‡πå ‡πÄ‡∏°‡∏Ü‡∏≤‡∏ß‡∏£‡∏£‡∏ì,‡∏ó‡∏≤‡∏á‡πÄ‡∏•‡∏∑‡∏≠‡∏Å‡πÉ‡∏´‡∏°‡πà,‚úÖ,290,290,290,290



  ‡∏Å‡∏£‡∏∏‡∏á‡πÄ‡∏ó‡∏û‡∏°‡∏´‡∏≤‡∏ô‡∏Ñ‡∏£  ‡∏ö‡∏±‡∏ç‡∏ä‡∏µ‡∏£‡∏≤‡∏¢‡∏ä‡∏∑‡πà‡∏≠  |  1. ‡∏Å‡∏£‡∏∏‡∏á‡πÄ‡∏ó‡∏û‡∏°‡∏´‡∏≤‡∏ô‡∏Ñ‡∏£ ‡πÄ‡∏Ç‡∏ï 26 (‡∏ö‡∏ä)
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,string,‡πÑ‡∏ó‡∏¢‡∏ó‡∏£‡∏±‡∏û‡∏¢‡πå‡∏ó‡∏ß‡∏µ,‚úÖ,202,202,202,202
1,2,string,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡∏ä‡∏≤‡∏ï‡∏¥‡πÑ‡∏ó‡∏¢,‚úÖ,290,290,290,290
2,3,string,‡πÉ‡∏´‡∏°‡πà,‚úÖ,56,56,56,56
3,4,string,‡∏°‡∏¥‡∏ï‡∏¥‡πÉ‡∏´‡∏°‡πà,‚úÖ,111,111,111,111
4,5,string,‡∏£‡∏ß‡∏°‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,161,161,161,161
5,6,string,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‚úÖ,2033,2033,2033,2033
6,7,string,‡∏û‡∏•‡∏ß‡∏±‡∏ï,‚úÖ,547,547,547,547
7,8,string,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡πÑ‡∏ï‡∏¢‡πÉ‡∏´‡∏°‡πà,‚úÖ,291,291,291,291
8,9,string,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,10607,10607,10607,10607
9,10,string,‡∏ó‡∏≤‡∏á‡πÄ‡∏•‡∏∑‡∏≠‡∏Å‡πÉ‡∏´‡∏°‡πà,‚úÖ,467,467,467,467



  ‡∏Å‡∏£‡∏∏‡∏á‡πÄ‡∏ó‡∏û‡∏°‡∏´‡∏≤‡∏ô‡∏Ñ‡∏£  ‡∏ö‡∏±‡∏ç‡∏ä‡∏µ‡∏£‡∏≤‡∏¢‡∏ä‡∏∑‡πà‡∏≠  |  1. ‡∏Å‡∏£‡∏∏‡∏á‡πÄ‡∏ó‡∏û‡∏°‡∏´‡∏≤‡∏ô‡∏Ñ‡∏£ ‡πÄ‡∏Ç‡∏ï 2 (‡∏ö‡∏ä)
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡πÑ‡∏ó‡∏¢‡∏ó‡∏£‡∏±‡∏û‡∏¢‡πå‡∏ó‡∏ß‡∏µ,‡πÑ‡∏ó‡∏¢‡∏ó‡∏£‡∏±‡∏û‡∏¢‡πå‡∏ó‡∏ß‡∏µ,‚úÖ,39,39,39,39
1,2,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡∏ä‡∏≤‡∏ï‡∏¥‡πÑ‡∏ó‡∏¢,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡∏ä‡∏≤‡∏ï‡∏¥‡πÑ‡∏ó‡∏¢,‚úÖ,280,280,280,280
2,3,‡πÉ‡∏´‡∏°‡πà,‡πÉ‡∏´‡∏°‡πà,‚úÖ,69,69,69,69
3,4,‡∏°‡∏¥‡∏ï‡∏¥‡πÉ‡∏´‡∏°‡πà,‡∏°‡∏¥‡∏ï‡∏¥‡πÉ‡∏´‡∏°‡πà,‚úÖ,129,129,129,129
4,5,‡∏£‡∏ß‡∏°‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‡∏£‡∏ß‡∏°‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,215,215,215,215
5,6,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‚úÖ,1819,1819,1819,1819
6,7,‡∏û‡∏•‡∏ß‡∏±‡∏ï,‡∏û‡∏•‡∏ß‡∏±‡∏ï,‚úÖ,116,116,116,116
7,8,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡πÑ‡∏ï‡∏¢‡πÉ‡∏´‡∏°‡πà,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡πÑ‡∏ï‡∏¢‡πÉ‡∏´‡∏°‡πà,‚úÖ,211,211,211,211
8,9,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,6226,6226,6226,6226
9,10,‡∏ó‡∏≤‡∏á‡πÄ‡∏•‡∏∑‡∏≠‡∏Å‡πÉ‡∏´‡∏°‡πà,‡∏ó‡∏≤‡∏á‡πÄ‡∏•‡∏∑‡∏≠‡∏Å‡πÉ‡∏´‡∏°‡πà,‚úÖ,358,358,358,358



  ‡πÅ‡∏û‡∏£‡πà  ‡∏ö‡∏±‡∏ç‡∏ä‡∏µ‡∏£‡∏≤‡∏¢‡∏ä‡∏∑‡πà‡∏≠  |  41. ‡πÅ‡∏û‡∏£‡πà ‡πÄ‡∏Ç‡∏ï 1 (‡∏ö‡∏ä)
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,string,‡πÑ‡∏ó‡∏¢‡∏ó‡∏£‡∏±‡∏û‡∏¢‡πå‡∏ó‡∏ß‡∏µ,‚úÖ,1115,1115,1115,1115
1,2,string,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡∏ä‡∏≤‡∏ï‡∏¥‡πÑ‡∏ó‡∏¢,‚úÖ,1638,1638,1638,1638
2,3,string,‡πÉ‡∏´‡∏°‡πà,‚úÖ,866,866,866,866
3,4,string,‡∏°‡∏¥‡∏ï‡∏¥‡πÉ‡∏´‡∏°‡πà,‚úÖ,346,346,346,346
4,5,string,‡∏£‡∏ß‡∏°‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,278,278,278,278
5,6,string,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‚úÖ,1364,1364,1364,1364
6,7,string,‡∏û‡∏•‡∏ß‡∏±‡∏î,‚úÖ,48,48,48,48
7,8,string,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡πÑ‡∏ï‡∏¢‡πÉ‡∏´‡∏°‡πà,‚úÖ,272,272,272,272
8,9,string,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,24136,24136,24136,24136
9,10,string,‡∏ó‡∏≤‡∏á‡πÄ‡∏•‡∏∑‡∏≠‡∏Å‡πÉ‡∏´‡∏°‡πà,‚úÖ,270,270,270,270



  ‡∏Å‡∏≤‡∏ç‡∏à‡∏ô‡∏ö‡∏∏‡∏£‡∏µ  unknown  |  3. ‡∏Å‡∏≤‡∏ç‡∏à‡∏ô‡∏ö‡∏∏‡∏£‡∏µ ‡πÄ‡∏Ç‡∏ï 3 (1)
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡∏ô‡∏≤‡∏¢‡∏ä‡∏≤‡∏ï‡∏¥‡∏ä‡∏≤‡∏¢ ‡∏ö‡∏±‡∏ß‡∏ã‡πâ‡∏≠‡∏ô,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡∏±‡∏ï‡∏¢‡πå,‚úÖ,821,821,821,821
1,2,‡∏ô‡∏≤‡∏¢‡∏ä‡∏∏‡∏°‡∏û‡∏• ‡πÅ‡∏™‡∏á‡∏ß‡∏£‡∏£‡∏ì,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ä‡∏ô,‚úÖ,14303,14303,14303,14303
2,3,‡∏ô‡∏≤‡∏á‡∏™‡∏≤‡∏ß‡∏™‡∏∏‡∏°‡∏ì‡∏ë‡∏≤ ‡πÅ‡∏Å‡πà‡∏ô‡∏≠‡∏≤‡∏™‡∏≤,‡∏û‡∏•‡∏ß‡∏±‡∏ï,‚úÖ,509,509,509,509
3,4,‡∏ô‡∏≤‡∏á‡∏™‡∏≤‡∏ß‡∏û‡∏•‡∏≠‡∏¢ ‡∏ò‡∏ô‡∏¥‡∏Å‡∏∏‡∏•,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,40278,40278,40278,40278
4,5,‡∏ô‡∏≤‡∏¢‡∏¢‡∏®‡∏ß‡∏±‡∏í‡∏ô‡πå ‡∏°‡∏≤‡πÑ‡∏û‡∏®‡∏≤‡∏•‡∏™‡∏¥‡∏ô,‡∏†‡∏π‡∏°‡∏¥‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,50759,50759,50759,50759
5,6,‡∏ô‡∏≤‡∏á‡∏™‡∏≤‡∏ß‡∏ß‡∏£‡∏≤‡∏û‡∏£ ‡πÄ‡∏ï‡∏ä‡∏≤‡∏ß‡∏±‡∏í‡∏ô‡∏ß‡∏¥‡∏®‡∏≤‡∏•,‡πÄ‡∏®‡∏£‡∏©‡∏ê‡∏Å‡∏¥‡∏à,‚úÖ,1280,1280,1280,1280



  ‡∏Å‡∏≤‡∏ç‡∏à‡∏ô‡∏ö‡∏∏‡∏£‡∏µ  unknown  |  3. ‡∏Å‡∏≤‡∏ç‡∏à‡∏ô‡∏ö‡∏∏‡∏£‡∏µ ‡πÄ‡∏Ç‡∏ï 1 (1)
  Vote entries: ‚ùå DISAGREE


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡∏û.‡∏ï.‡∏ó.‡∏Å‡∏¥‡∏ï‡∏ï‡∏¥‡∏û‡∏¥‡∏ä‡∏ç‡πå ‡∏à‡∏±‡∏ô‡∏ó‡∏£‡πå‡∏™‡∏°‡∏ö‡∏π‡∏£‡∏ì‡πå,‡∏†‡∏π‡∏°‡∏¥‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,9864,9864,9864,9864
1,2,‡∏ô‡∏≤‡∏¢‡∏ö‡∏∏‡∏ç‡∏§‡∏ó‡∏ò‡∏¥‡πå ‡∏ò‡∏£‡∏£‡∏°‡∏®‡∏£,‡∏Å‡∏•‡πâ‡∏≤‡∏ò‡∏£‡∏£‡∏°,‚úÖ,976,976,976,976
2,3,‡∏ô‡∏≤‡∏¢‡∏ò‡∏ô‡∏Å‡∏£ ‡∏ó‡∏≠‡∏á‡πÉ‡∏ö,‡πÄ‡∏™‡∏£‡∏µ‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢,‚úÖ,515,515,515,515
3,4,‡∏ô‡∏≤‡∏¢‡∏≠‡∏ô‡∏∏‡∏Å‡∏π‡∏• ‡πÅ‡∏û‡∏£‡πÑ‡∏û‡∏®‡∏≤‡∏•,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡∏±‡∏ï‡∏¢‡πå,‚úÖ,2829,2829,2829,2829
4,5,‡∏ô‡∏≤‡∏¢‡∏≠‡∏±‡∏Ñ‡∏£‡∏ô‡∏±‡∏ô‡∏ó‡πå ‡∏Å‡∏±‡∏ì‡∏ì‡πå‡∏Å‡∏¥‡∏ï‡∏ï‡∏¥‡∏ô‡∏±‡∏ô‡∏ó‡πå,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,38572,38572,38572,38572
5,6,‡∏ô‡∏≤‡∏¢‡∏†‡∏π‡∏ß‡∏ô‡∏≤‡∏ó ‡∏£‡∏±‡∏®‡∏°‡∏µ‡∏§‡∏Å‡∏©‡πå‡πÄ‡∏®‡∏£‡∏©‡∏ê‡πå,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ä‡∏ô,‚ùå,26849,26849,26849,26899
6,7,‡∏ô‡∏≤‡∏¢‡πÅ‡∏î‡∏ô‡∏™‡∏£‡∏ß‡∏á ‡∏Å‡∏•‡∏¥‡πà‡∏ô‡∏™‡∏∏‡∏Ñ‡∏ô‡∏ò‡πå,‡πÄ‡∏®‡∏£‡∏©‡∏ê‡∏Å‡∏¥‡∏à,‚úÖ,1696,1696,1696,1696
7,8,‡∏ô‡∏≤‡∏¢‡∏Å‡∏≥‡∏ò‡∏£ ‡∏™‡∏£‡πâ‡∏≠‡∏¢‡∏û‡∏£‡∏£‡∏ì‡∏≤,‡∏£‡∏ß‡∏°‡πÑ‡∏ó‡∏¢‡∏™‡∏£‡πâ‡∏≤‡∏á‡∏ä‡∏≤‡∏ï‡∏¥,‚úÖ,3002,3002,3002,3002
8,9,‡∏™.‡∏≠.‡∏ò‡∏ß‡∏±‡∏ä ‡∏à‡∏π‡∏≠‡∏¥‡∏ô‡∏ó‡∏£‡πå,‡∏õ‡∏ß‡∏á‡∏ä‡∏ô‡πÑ‡∏ó‡∏¢,‚úÖ,1241,1241,1241,1241
9,10,‡∏ô‡∏≤‡∏¢‡∏ò‡∏±‡∏ä‡∏Å‡∏§‡∏ä ‡∏´‡∏≠‡∏•‡∏∞‡πÄ‡∏≠‡∏µ‡∏¢‡∏î,‡∏û‡∏•‡∏±‡∏á‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,306,306,306,306



  ‡∏≠‡πà‡∏≤‡∏á‡∏ó‡∏≠‡∏á  unknown  |  72. ‡∏≠‡πà‡∏≤‡∏á‡∏ó‡∏≠‡∏á ‡πÄ‡∏Ç‡∏ï 2
  Vote entries: ‚úÖ all match


Unnamed: 0,#,candidate,party,agree,gemini-2.5-pro / HIGH,gemini-2.5-pro / LOW,gemini-3-pro-preview / HIGH,gemini-3-pro-preview / LOW
0,1,‡∏ô‡∏≤‡∏¢‡∏™‡∏≤‡πÇ‡∏£‡∏à‡∏ô‡πå ‡∏â‡πà‡∏≥‡∏à‡∏¥‡∏ï‡∏£,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ä‡∏ô,‚úÖ,13876,13876,13876,13876
1,2,‡∏ô‡∏≤‡∏¢‡∏Å‡∏£‡∏ß‡∏µ‡∏£‡πå ‡∏õ‡∏£‡∏¥‡∏®‡∏ô‡∏≤‡∏ô‡∏±‡∏ô‡∏ó‡∏Å‡∏∏‡∏•,‡∏†‡∏π‡∏°‡∏¥‡πÉ‡∏à‡πÑ‡∏ó‡∏¢,‚úÖ,60611,60611,60611,60611
2,3,‡∏ô‡∏≤‡∏¢‡∏ä‡∏ß‡∏Å‡∏£ ‡∏®‡∏£‡∏µ‡∏£‡∏≤‡∏ä‡∏≤,‡πÄ‡∏û‡∏∑‡πà‡∏≠‡πÑ‡∏ó‡∏¢,‚úÖ,4149,4149,4149,4149
3,4,‡∏ô‡∏≤‡∏¢‡∏≠‡∏ô‡∏∏‡∏£‡∏±‡∏Å‡∏©‡πå ‡∏≠‡∏°‡∏£‡πÄ‡∏°‡∏ï‡∏ï‡∏≤‡∏à‡∏¥‡∏ï,‡∏õ‡∏£‡∏∞‡∏ä‡∏≤‡∏ò‡∏¥‡∏õ‡∏±‡∏ï‡∏¢‡πå,‚úÖ,1202,1202,1202,1202


In [94]:
# ── Overall summary ────────────────────────────────────────────────────────────
# Pass rates per config across all files

print("=== Pass rates per config ===\n")
header = f"{'Config':<35} {'ballot':>7} {'votes':>7} {'both':>7}"
print(header)
print("─" * len(header))

for cfg in all_cfg_names:
    cfg_recs = [r for r in prelabel_records_raw if r["prelabel_config"]["name"] == cfg]
    total = len(cfg_recs)
    if total == 0:
        continue
    b_pass   = sum(1 for r in cfg_recs if r.get("ballot_check") == 1.0)
    v_pass   = sum(1 for r in cfg_recs if r.get("votes_check") == 1.0)
    both     = sum(1 for r in cfg_recs if r.get("ballot_check") == 1.0 and r.get("votes_check") == 1.0)
    errors   = sum(1 for r in cfg_recs if r.get("error"))
    print(
        f"{cfg:<35} "
        f"{b_pass}/{total} ({b_pass/total:.0%})  "
        f"{v_pass}/{total} ({v_pass/total:.0%})  "
        f"{both}/{total} ({both/total:.0%})"
        + (f"  [{errors} errors]" if errors else "")
    )

# Cross-config agreement rate
n_files = len(all_file_ids)
all_numeric_agree = 0
for fid in all_file_ids:
    cfg_data = by_file[fid]
    kv_list  = [_kv(r) for r in cfg_data.values() if r and not r.get("error")]
    kv_list  = [kv for kv in kv_list if kv]
    if not kv_list:
        continue
    if all(
        len({kv[field] for kv in kv_list}) == 1
        for field in NUMERIC_FIELDS
    ):
        all_numeric_agree += 1

print(f"\nFiles where ALL configs agree on ALL numeric fields: {all_numeric_agree}/{n_files} ({all_numeric_agree/n_files:.0%})")

=== Pass rates per config ===

Config                               ballot   votes    both
───────────────────────────────────────────────────────────
gemini-2.5-pro / HIGH               9/9 (100%)  9/9 (100%)  9/9 (100%)
gemini-2.5-pro / LOW                9/9 (100%)  9/9 (100%)  9/9 (100%)
gemini-3-pro-preview / HIGH         9/9 (100%)  9/9 (100%)  9/9 (100%)
gemini-3-pro-preview / LOW          9/9 (100%)  8/9 (89%)  8/9 (89%)

Files where ALL configs agree on ALL numeric fields: 7/9 (78%)
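For the two files where configs disagree, a strict-majority vote can point at the likely outlier before the PDF is opened. The sketch below is one possible triage aid, not something the notebook does; `find_outliers` is a hypothetical helper, and the inputs are the `invalid_ballots` values from the Level 2 output. Agreement among models is not proof of correctness, so a majority value is only a hint for the reviewer.

```python
from collections import Counter


def find_outliers(values_by_config):
    """Return (majority_value, outlier_configs), or (None, []) without a strict majority."""
    counts = Counter(values_by_config.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(values_by_config):
        return None, []  # tie or plurality only: leave the decision to the reviewer
    outliers = [cfg for cfg, v in values_by_config.items() if v != value]
    return value, outliers


# invalid_ballots readings copied from the Level 2 output
nakhon_invalid = {
    "gemini-2.5-pro / HIGH": 6631,
    "gemini-2.5-pro / LOW": 6631,
    "gemini-3-pro-preview / HIGH": 6231,
    "gemini-3-pro-preview / LOW": 6631,
}
ang_thong_invalid = {
    "gemini-2.5-pro / HIGH": 3301,
    "gemini-2.5-pro / LOW": 3301,
    "gemini-3-pro-preview / HIGH": 3311,
    "gemini-3-pro-preview / LOW": 3311,
}

print(find_outliers(nakhon_invalid))     # (6631, ['gemini-3-pro-preview / HIGH'])
print(find_outliers(ang_thong_invalid))  # (None, [])
```

The second case is a 2-2 split between model families, so no vote can settle it and the source PDF has to be checked.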


## 6. Save Pre-labels for Human Review

Export all pre-labeled results to JSONL so a human can inspect and correct them.

**Review workflow:**
1. Open `datasets/ss6_1_prelabels.jsonl`
2. Check each record's `extracted_data` against the source PDF (linked via `drive_uri`)
3. Set `"human_validated": true` and correct `extracted_data` where it is wrong
4. Run **Section 7** to push the chosen records to Datadog as a labeled dataset (the Section 7 cells select records by config and do not read `human_validated`)
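If only reviewed records should reach the dataset, the JSONL can be filtered on the `human_validated` flag before selection. The following is a minimal sketch under that assumption; `load_validated` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import json
from pathlib import Path


def load_validated(path):
    """Return only the records a reviewer has marked with human_validated = true."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines left by manual editing
            rec = json.loads(line)
            if rec.get("human_validated") is True:
                records.append(rec)
    return records


# Example call (path as used in this section):
# validated = load_validated("datasets/ss6_1_prelabels.jsonl")
```

Checking for `is True` rather than truthiness means a stray string such as `"yes"` typed during review is not counted as validated.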

In [88]:
PRELABEL_OUTPUT = Path("datasets/ss6_1_prelabels.jsonl")
PRELABEL_OUTPUT.parent.mkdir(parents=True, exist_ok=True)

prelabel_records = []

for cfg_label, rows in local_results.items():
    cfg_meta = next(c for c in PRELABEL_CONFIGS if c["name"] == cfg_label)
    for row in rows:
        inp  = row["input"]
        meta = inp["source_file_metadata"]
        record = {
            # ── Source ──────────────────────────────────────────────────────
            "source_file": {
                "file_id":       meta["file_id"],
                "province_name": meta["province_name"],
                "form_type":     meta["form_type"],
                "path":          meta["path"],
                "size_mb":       meta["size_mb"],
                "drive_uri":     inp["drive_uri"],
            },
            # ── Model config used for pre-labeling ───────────────────────
            "prelabel_config": {
                "name":          cfg_label,
                "model":         cfg_meta["model"],
                "thinking_mode": cfg_meta["thinking_mode"],
            },
            # ── Extracted output ─────────────────────────────────────────
            "extracted_data":  row["result"],
            # ── Internal consistency scores ──────────────────────────────
            "ballot_check": row["ballot_score"],
            "votes_check":  row["votes_score"],
            "error":        row["error"],
            # ── Human review fields (fill in after review) ────────────────
            "human_validated": False,
            "human_notes":     "",
        }
        prelabel_records.append(record)

with PRELABEL_OUTPUT.open("w", encoding="utf-8") as f:
    for rec in prelabel_records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print(f"✅ {len(prelabel_records)} pre-label records saved → {PRELABEL_OUTPUT}")
print(f"   Configs: {list(local_results.keys())}")
print(f"   Size   : {PRELABEL_OUTPUT.stat().st_size / 1024:.1f} KB")

✅ 36 pre-label records saved → datasets/ss6_1_prelabels.jsonl
   Configs: ['gemini-3-pro-preview / LOW', 'gemini-3-pro-preview / HIGH', 'gemini-2.5-pro / LOW', 'gemini-2.5-pro / HIGH']
   Size   : 253.9 KB


## 7. Create Datadog LLMObs Dataset  *(run after human review)*

After reviewing the Level 3 comparison above, pick the config whose results look correct and push those records to Datadog as a labeled dataset.

**Two selection modes** (set `SELECTION_MODE` below):

| Mode | When to use |
|---|---|
| `"by_config"` | One config is correct for every file (e.g. `gemini-2.5-pro / HIGH`) |
| `"by_file"` | Mixed results: choose the best config per file in `FILE_CONFIG_OVERRIDES`; files not listed fall back to `SELECTED_CONFIG` |

Each dataset record will have:
- `input_data`: drive URI + source file metadata
- `expected_output`: validated extracted data (ground truth for evaluators)

In [95]:
from collections import defaultdict  # used for _by_file below; harmless if already imported

# ══════════════════════════════════════════════════════════════════════════════
# ✏️  CONFIGURE YOUR SELECTION
# ══════════════════════════════════════════════════════════════════════════════

SELECTION_MODE = "by_config"   # "by_config"  |  "by_file"

# ── Mode 1: use one config for ALL files ──────────────────────────────────────
SELECTED_CONFIG = "gemini-2.5-pro / HIGH"   # change to whichever config you reviewed

# ── Mode 2: pick best config per file_id (overrides SELECTED_CONFIG) ──────────
# Format: { "file_id": "config_name", ... }
FILE_CONFIG_OVERRIDES: dict[str, str] = {
    # "1JvmnPF8_XFcEB6f3_LxbwJCpH6tSp3NH": "gemini-3-pro-preview / HIGH",
}

# ══════════════════════════════════════════════════════════════════════════════
# Load pre-label records (reads from JSONL, so this works without re-running Section 5)
# ══════════════════════════════════════════════════════════════════════════════
PRELABEL_PATH = Path("datasets/ss6_1_prelabels.jsonl")

_all_recs: list[dict] = []
with PRELABEL_PATH.open(encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            _all_recs.append(json.loads(line))

# Index by file_id → config_name → record
_by_file: dict[str, dict[str, dict]] = defaultdict(dict)
for _r in _all_recs:
    _by_file[_r["source_file"]["file_id"]][_r["prelabel_config"]["name"]] = _r

available_configs = sorted({r["prelabel_config"]["name"] for r in _all_recs})
print(f"Available configs : {available_configs}")
print(f"Unique files      : {len(_by_file)}")

# ── Select one record per file ────────────────────────────────────────────────
selected: list[dict] = []
skipped:  list[str]  = []

for fid, cfg_map in _by_file.items():
    # per-file override takes priority, then SELECTED_CONFIG
    if SELECTION_MODE == "by_file" and fid in FILE_CONFIG_OVERRIDES:
        target_cfg = FILE_CONFIG_OVERRIDES[fid]
    else:
        target_cfg = SELECTED_CONFIG

    rec = cfg_map.get(target_cfg)
    if rec is None:
        skipped.append(f"{fid}  (config '{target_cfg}' not found)")
        continue
    if rec.get("error"):
        skipped.append(f"{fid}  (extraction error: {rec['error'][:60]})")
        continue

    selected.append(rec)

print(f"\nRecords selected for dataset : {len(selected)}")
if skipped:
    print(f"Skipped ({len(skipped)}):")
    for s in skipped:
        print(f"  ⚠️  {s}")

# ── Preview selected records ──────────────────────────────────────────────────
preview_rows = []
for rec in selected:
    meta = rec["source_file"]
    preview_rows.append({
        "province":     meta["province_name"],
        "form_type":    meta["form_type"],
        "file":         meta["path"].split("/")[-1].replace(".pdf", ""),
        "config_used":  rec["prelabel_config"]["name"],
        "ballot_check": "✅" if rec["ballot_check"] == 1.0 else "❌",
        "votes_check":  "✅" if rec["votes_check"]  == 1.0 else "❌",
    })
display(pd.DataFrame(preview_rows))

Available configs : ['gemini-2.5-pro / HIGH', 'gemini-2.5-pro / LOW', 'gemini-3-pro-preview / HIGH', 'gemini-3-pro-preview / LOW']
Unique files      : 9

Records selected for dataset : 9


Unnamed: 0,province,form_type,file,config_used,ballot_check,votes_check
0,เชียงใหม่,แบ่งเขต,14. เชียงใหม่ เขต 6,gemini-2.5-pro / HIGH,✅,✅
1,นครราชสีมา,แบ่งเขต,21. นครราชสีมา เขต 13,gemini-2.5-pro / HIGH,✅,✅
2,นครศรีธรรมราช,แบ่งเขต,22. นครศรีธรรมราช เขต 8,gemini-2.5-pro / HIGH,✅,✅
3,กรุงเทพมหานคร,บัญชีรายชื่อ,1. กรุงเทพมหานคร เขต 26 (บช),gemini-2.5-pro / HIGH,✅,✅
4,กรุงเทพมหานคร,บัญชีรายชื่อ,1. กรุงเทพมหานคร เขต 2 (บช),gemini-2.5-pro / HIGH,✅,✅
5,แพร่,บัญชีรายชื่อ,41. แพร่ เขต 1 (บช),gemini-2.5-pro / HIGH,✅,✅
6,กาญจนบุรี,unknown,3. กาญจนบุรี เขต 3 (1),gemini-2.5-pro / HIGH,✅,✅
7,กาญจนบุรี,unknown,3. กาญจนบุรี เขต 1 (1),gemini-2.5-pro / HIGH,✅,✅
8,อ่างทอง,unknown,72. อ่างทอง เขต 2,gemini-2.5-pro / HIGH,✅,✅


In [96]:
# ── Push to Datadog LLMObs ────────────────────────────────────────────────────
# Run this cell only after reviewing the preview above.

if not selected:
    print("⚠️  No records selected. Check SELECTED_CONFIG or FILE_CONFIG_OVERRIDES above.")
else:
    dataset_records = [
        {
            "input_data": json.dumps(
                {
                    "drive_uri": rec["source_file"]["drive_uri"],
                    "source_file_metadata": rec["source_file"],
                },
                ensure_ascii=False,
            ),
            "expected_output": json.dumps(
                {
                    "source_file":    rec["source_file"],
                    "extracted_data": rec["extracted_data"],
                },
                ensure_ascii=False,
            ),
            "metadata": {
                "config_used":  rec["prelabel_config"]["name"],
                "ballot_check": rec["ballot_check"],
                "votes_check":  rec["votes_check"],
            },
        }
        for rec in selected
    ]

    dataset = LLMObs.create_dataset(
        dataset_name=DATASET_NAME,
        description=(
            "SS6/1 announcement PDFs, 2026 Thai election  "
            f"(labels from: {SELECTED_CONFIG if SELECTION_MODE == 'by_config' else 'mixed'})"
        ),
        project_name=LLMOBS_PROJECT_NAME,
        records=dataset_records,
    )
    print(f"✅ Dataset '{DATASET_NAME}' created ({len(dataset_records)} records)")
    print(f"   Config used : {SELECTED_CONFIG if SELECTION_MODE == 'by_config' else 'mixed (by_file)'}")
    print(f"   View        : {dataset.url}")

✅ Dataset 'ss6_1_nuttee' created (9 records)
   Config used : gemini-2.5-pro / HIGH
   View        : https://us3.datadoghq.com/llm/datasets/3f8bc4c3-63c1-4e95-b41a-fc416478e690


---
## 8. Experiments on Labeled Dataset

Run Datadog LLMObs Experiments systematically: each model config is one experiment run, evaluated against the ground-truth dataset created in Section 7.

**Evaluators** (compare output vs. `expected_output` from the dataset):

| Evaluator | What it checks | Score |
|---|---|---|
| `ballot_check` | Internal: `total_ballots_used = valid + invalid + no_vote` | 0.0 / 1.0 |
| `votes_check` | Internal: `total_votes = Σ results[*].vote_count` | 0.0 / 1.0 |
| `total_votes_match` | Extracted `total_votes` matches ground truth | 0.0 / 1.0 |
| `ballot_summary_match` | All 6 ballot summary numbers match ground truth | 0.0 / 1.0 |
| `vote_counts_match` | Per-candidate vote counts match ground truth | 0.0 to 1.0 (partial credit) |
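To make the partial-credit score concrete, here is a small worked example using the Level 3 vote counts for file 21 (Nakhon Ratchasima, district 13). It treats the `gemini-2.5-pro / HIGH` output as the expected values, because that is the config Section 7 pushed as labels, and scores the `gemini-2.5-pro / LOW` output against it. Whether those labels are actually correct is not established here. `partial_credit` is an illustrative stand-in that mirrors the scoring rule, not the evaluator itself.

```python
def partial_credit(out_map, exp_map):
    """Fraction of expected candidate numbers whose vote count matches exactly."""
    if not exp_map:
        return 0.0
    correct = sum(1 for n, v in exp_map.items() if out_map.get(n) == v)
    return round(correct / len(exp_map), 4)


# candidate number -> vote_count, copied from the Level 3 table for file 21
expected = {1: 0, 2: 1775, 3: 310, 4: 1944, 5: 28157, 6: 25404, 7: 5887, 8: 3028}  # gemini-2.5-pro / HIGH
output = {1: 3028, 2: 1775, 3: 310, 4: 1944, 5: 28157, 6: 25404, 7: 5887, 8: 0}    # gemini-2.5-pro / LOW

score = partial_credit(output, expected)
same_total = sum(output.values()) == sum(expected.values())

print(score)       # 0.75 (6 of 8 candidates match)
print(same_total)  # True: both sum to 66505
```

The two outputs swap 3028 votes between candidates 1 and 8, so the total is unchanged and `votes_check` passes for both. Only a per-candidate comparison against ground truth separates them.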

**Experiment configs** (each runs as a separate named experiment for side-by-side comparison in Datadog UI):

| # | Model | Thinking |
|---|---|---|
| 1 | `gemini-2.5-flash` | none |
| 2 | `gemini-2.5-flash` | LOW |
| 3 | `gemini-2.5-flash` | HIGH |
| 4 | `gemini-3-pro-preview` | LOW |
| 5 | `gemini-3-pro-preview` | HIGH |
| 6 | `gemini-2.5-pro` | LOW |
| 7 | `gemini-2.5-pro` | HIGH |

In [55]:
from ddtrace.llmobs import EvaluatorResult
from ddtrace.llmobs.decorators import task


# ── Helpers (safe JSON parse; `_arabic` is defined earlier in the notebook) ────

def _parse(x) -> dict:
    """Parse a JSON string into a dict; pass a dict through; map None or bad JSON to {}."""
    if isinstance(x, str):
        try:
            return json.loads(x)
        except json.JSONDecodeError:
            return {}
    return x or {}


# ── Row-level evaluators ──────────────────────────────────────────────────────

def ballot_check_eval(input_data, output_data, expected_output):
    """Internal consistency: total_ballots_used = valid + invalid + no_vote."""
    result = _parse(output_data)
    score, reason = _ballot_check(result)
    return EvaluatorResult(
        value=score,
        reasoning=reason,
        assessment="pass" if score == 1.0 else "fail",
        tags={"evaluator": "internal_consistency"},
    )


def votes_check_eval(input_data, output_data, expected_output):
    """Internal consistency: total_votes = sum(results[*].vote_count)."""
    result = _parse(output_data)
    score, reason = _votes_check(result)
    return EvaluatorResult(
        value=score,
        reasoning=reason,
        assessment="pass" if score == 1.0 else "fail",
        tags={"evaluator": "internal_consistency"},
    )


def total_votes_match(input_data, output_data, expected_output):
    """Ground-truth check: extracted total_votes matches labeled value."""
    out = _parse(output_data)
    exp = _parse(expected_output).get("extracted_data", {})

    got      = _arabic(out.get("total_votes"))
    expected = _arabic(exp.get("total_votes"))
    ok = got == expected
    return EvaluatorResult(
        value=1.0 if ok else 0.0,
        reasoning=f"extracted={got}, ground_truth={expected}",
        assessment="pass" if ok else "fail",
        tags={"evaluator": "ground_truth"},
    )


def ballot_summary_match(input_data, output_data, expected_output):
    """Ground-truth check: all 6 ballot summary numbers match labeled values."""
    FIELDS = [
        "eligible_voters", "present_voters",
        "valid_ballots", "invalid_ballots", "no_vote_ballots", "total_ballots_used",
    ]
    out    = _parse(output_data)
    exp    = _parse(expected_output).get("extracted_data", {})
    out_bs = out.get("ballot_summary") or {}
    exp_bs = exp.get("ballot_summary") or {}

    mismatches = []
    for field in FIELDS:
        got_v = _arabic(out_bs.get(field))
        exp_v = _arabic(exp_bs.get(field))
        if got_v != exp_v:
            mismatches.append(f"{field}: got={got_v} exp={exp_v}")

    ok = not mismatches
    return EvaluatorResult(
        value=1.0 if ok else 0.0,
        reasoning="; ".join(mismatches) if mismatches else f"all {len(FIELDS)} ballot fields match",
        assessment="pass" if ok else "fail",
        tags={"evaluator": "ground_truth"},
    )


def vote_counts_match(input_data, output_data, expected_output):
    """Ground-truth check: per-candidate vote counts match labeled values (partial credit)."""
    out = _parse(output_data)
    exp = _parse(expected_output).get("extracted_data", {})

    out_map = {r["number"]: _arabic(r.get("vote_count")) for r in out.get("results") or []}
    exp_map = {r["number"]: _arabic(r.get("vote_count")) for r in exp.get("results") or []}

    if not exp_map:
        return EvaluatorResult(
            value=0.0,
            reasoning="no expected results in ground truth",
            assessment="fail",
        )

    wrong    = [f"#{n}: got={out_map.get(n)} exp={v}" for n, v in exp_map.items() if out_map.get(n) != v]
    extra    = sorted(set(out_map) - set(exp_map))
    missing  = sorted(set(exp_map) - set(out_map))
    if extra:
        wrong.append(f"extra candidates: {extra}")
    if missing:
        wrong.append(f"missing candidates: {missing}")

    # partial credit: fraction of candidates with correct vote count
    correct = sum(1 for n, v in exp_map.items() if out_map.get(n) == v)
    score   = correct / len(exp_map)
    ok      = score == 1.0
    return EvaluatorResult(
        value=round(score, 4),
        reasoning="; ".join(wrong) if wrong else f"all {len(exp_map)} vote counts match",
        assessment="pass" if ok else "fail",
        tags={"evaluator": "ground_truth", "n_candidates": str(len(exp_map))},
    )


# ── Summary evaluator ─────────────────────────────────────────────────────────

def overall_pass_rate(inputs, outputs, expected_outputs, evaluators_results):
    """Fraction of records where ALL ground-truth evaluators pass."""
    gt_keys = ["total_votes_match", "ballot_summary_match", "vote_counts_match"]
    n = len(inputs)
    if n == 0:
        return 0.0
    all_pass = 0
    for i in range(n):
        if all(
            (evaluators_results.get(k) or [None] * n)[i] in (True, 1.0)
            for k in gt_keys
            if k in evaluators_results
        ):
            all_pass += 1
    return round(all_pass / n, 4)


print("✅ Evaluators defined:")
print("   Row-level  : ballot_check_eval, votes_check_eval")
print("   Ground truth: total_votes_match, ballot_summary_match, vote_counts_match")
print("   Summary    : overall_pass_rate")

✅ Evaluators defined:
   Row-level  : ballot_check_eval, votes_check_eval
   Ground truth: total_votes_match, ballot_summary_match, vote_counts_match
   Summary    : overall_pass_rate
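
The partial-credit rule used by `vote_counts_match` can be illustrated in isolation. The sketch below is a simplified stand-in, not the evaluator itself: `partial_credit` is a hypothetical helper, and the candidate numbers and vote counts are made-up toy data.

```python
def partial_credit(expected: dict[int, int], extracted: dict[int, int]) -> float:
    """Fraction of expected candidates whose extracted vote count is correct."""
    if not expected:
        return 0.0
    correct = sum(
        1 for number, votes in expected.items() if extracted.get(number) == votes
    )
    return round(correct / len(expected), 4)


# Hypothetical toy data: candidate number -> vote count.
expected = {1: 1200, 2: 850, 3: 430, 4: 95}
extracted = {1: 1200, 2: 805, 3: 430, 5: 10}  # #2 misread, #4 missing, #5 extra

score = partial_credit(expected, extracted)
print(score)  # 0.5 (2 of 4 expected candidates are correct)
```

As in the evaluator above, the denominator is the number of expected candidates, so an extra candidate in the output shows up in the reasoning text but does not lower the score.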


In [56]:
# ── Pull the labeled dataset from Datadog ─────────────────────────────────────
exp_dataset = LLMObs.pull_dataset(dataset_name=DATASET_NAME)

print(f"Dataset : {DATASET_NAME}")
print(f"Records : {len(exp_dataset)}")
#exp_dataset.as_dataframe()[["input_data", "expected_output", "metadata"]]

Dataset : ss6_1_nuttee
Records : 9


In [57]:
# ── Experiment configs ────────────────────────────────────────────────────────
# Each entry produces one named experiment run for side-by-side comparison in Datadog.

EXPERIMENT_CONFIGS = [
    {
        "name": "2.5-flash-no-think",
        "model": "gemini-2.5-flash",
        "temperature": 0.0,
        "max_tokens": 8192,
        "thinking_mode": None,
    },
    {
        "name": "2.5-flash-LOW",
        "model": "gemini-2.5-flash",
        "temperature": 0.0,
        "max_tokens": 8192,
        "thinking_mode": "LOW",
    },
    {
        "name": "2.5-flash-HIGH",
        "model": "gemini-2.5-flash",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
    {
        "name": "2.5-pro-LOW",
        "model": "gemini-2.5-pro",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "LOW",
    },
    {
        "name": "2.5-pro-HIGH",
        "model": "gemini-2.5-pro",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
    {
        "name": "3-flash-LOW",
        "model": "gemini-3-flash-preview",
        "temperature": 0.0,
        "max_tokens": 8192,
        "thinking_mode": "LOW",
    },
    {
        "name": "3-flash-HIGH",
        "model": "gemini-3-flash-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
    {
        "name": "3-pro-LOW",
        "model": "gemini-3-pro-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "LOW",
    },
    {
        "name": "3-pro-HIGH",
        "model": "gemini-3-pro-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
    {
        "name": "3.1-pro-LOW",
        "model": "gemini-3.1-pro-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "LOW",
    },
    {
        "name": "3.1-pro-HIGH",
        "model": "gemini-3.1-pro-preview",
        "temperature": 0.0,
        "max_tokens": 16384,
        "thinking_mode": "HIGH",
    },
]

# Preview the configs as a table (column selection fixes the display order).
pd.DataFrame(EXPERIMENT_CONFIGS)[["name", "model", "thinking_mode", "max_tokens"]]

| | name | model | thinking_mode | max_tokens |
|---|---|---|---|---|
| 0 | 2.5-flash-no-think | gemini-2.5-flash | | 8192 |
| 1 | 2.5-flash-LOW | gemini-2.5-flash | LOW | 8192 |
| 2 | 2.5-flash-HIGH | gemini-2.5-flash | HIGH | 16384 |
| 3 | 2.5-pro-LOW | gemini-2.5-pro | LOW | 16384 |
| 4 | 2.5-pro-HIGH | gemini-2.5-pro | HIGH | 16384 |
| 5 | 3-flash-LOW | gemini-3-flash-preview | LOW | 8192 |
| 6 | 3-flash-HIGH | gemini-3-flash-preview | HIGH | 16384 |
| 7 | 3-pro-LOW | gemini-3-pro-preview | LOW | 16384 |
| 8 | 3-pro-HIGH | gemini-3-pro-preview | HIGH | 16384 |
| 9 | 3.1-pro-LOW | gemini-3.1-pro-preview | LOW | 16384 |


In [58]:
from datetime import datetime

# ══════════════════════════════════════════════════════════════════════════════
# ✏️  RUN SETTINGS
# ══════════════════════════════════════════════════════════════════════════════

JOBS         = 4      # parallel workers per experiment (increase for speed)
SAMPLE_SIZE  = None   # None = full dataset, int = subset (use 3 for a quick test)
RAISE_ERRORS = False  # True = stop on first error (useful when debugging)

# Limit to specific configs by name; None runs all
RUN_CONFIGS: list[str] | None = None
# e.g.:  RUN_CONFIGS = ["2.5-flash-no-think", "2.5-pro-HIGH"]

# ══════════════════════════════════════════════════════════════════════════════

_run_stamp = datetime.now().strftime("%m%d-%H%M")

experiment_urls: dict[str, str] = {}

for cfg in EXPERIMENT_CONFIGS:
    if RUN_CONFIGS is not None and cfg["name"] not in RUN_CONFIGS:
        print(f"⏭  Skipping {cfg['name']}")
        continue

    exp_name    = f"ss61-{cfg['name']}-{_run_stamp}"
    task_config = {k: v for k, v in cfg.items() if k != "name"}

    print(f"▶ Running experiment: {exp_name}")
    print(f"   jobs={JOBS}  sample_size={SAMPLE_SIZE}  raise_errors={RAISE_ERRORS}")

    experiment = LLMObs.experiment(
        name=exp_name,
        dataset=exp_dataset,
        task=extract_ss61_form,
        evaluators=[
            ballot_check_eval,
            votes_check_eval,
            total_votes_match,
            ballot_summary_match,
            vote_counts_match,
        ],
        summary_evaluators=[overall_pass_rate],
        config=task_config,
        description=(
            f"SS6/1 extraction - model={cfg['model']}  "
            f"thinking={cfg['thinking_mode'] or 'none'}  "
            f"sample_size={SAMPLE_SIZE}"
        ),
    )

    run_kwargs: dict = {"jobs": JOBS, "raise_errors": RAISE_ERRORS}
    if SAMPLE_SIZE is not None:
        run_kwargs["sample_size"] = SAMPLE_SIZE

    results = experiment.run(**run_kwargs)

    experiment_urls[cfg["name"]] = experiment.url
    print(f"   ✅ Done - {len(results)} records")
    print(f"   View: {experiment.url}\n")

print("=" * 60)
print("All experiment URLs:")
for name, url in experiment_urls.items():
    print(f"  {name:<20} {url}")

▶ Running experiment: ss61-2.5-flash-no-think-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:02:30,886 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-flash  thinking=none  max_tokens=8192  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:02:30,887 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-2.5-flash  thinking=none  max_tokens=8192  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:02:30,887 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-flash  thinking=none  max_tokens=8192  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:02:30,888 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-2.5-flash  thinking=none  max_tokens=8192  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:02:30,888 INFO [extract_ss61_form] PDF part built  drive_uri=https

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/64b613bd-cfc2-4848-89a1-f6f7c8c62922

▶ Running experiment: ss61-2.5-flash-LOW-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:03:37,832 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-flash  thinking=LOW  max_tokens=8192  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:03:37,832 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-2.5-flash  thinking=LOW  max_tokens=8192  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:03:37,833 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-flash  thinking=LOW  max_tokens=8192  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:03:37,833 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-2.5-flash  thinking=LOW  max_tokens=8192  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:03:37,836 INFO [extract_ss61_form] PDF part built  drive_uri=https://d

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/1c73e5ff-0a94-4962-89fe-ca1df31d2c95

▶ Running experiment: ss61-2.5-flash-HIGH-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:04:28,904 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-flash  thinking=HIGH  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:04:28,905 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-2.5-flash  thinking=HIGH  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:04:28,906 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-flash  thinking=HIGH  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:04:28,911 INFO [extract_ss61_form] PDF part built  drive_uri=https://drive.google.com/uc?export=download&id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:04:28,909 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-2.5-fl

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/0d469602-7ffd-4939-9d81-ffdfed5341c4

▶ Running experiment: ss61-2.5-pro-LOW-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:05:29,884 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-pro  thinking=LOW  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:05:29,885 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-2.5-pro  thinking=LOW  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:05:29,885 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-pro  thinking=LOW  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:05:29,886 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-2.5-pro  thinking=LOW  max_tokens=16384  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:05:29,886 INFO [extract_ss61_form] PDF part built  drive_uri=https://drive

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/9006f2a5-bbe6-4caa-a8ad-a0f9f5188460

▶ Running experiment: ss61-2.5-pro-HIGH-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:06:53,585 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-pro  thinking=HIGH  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:06:53,585 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-2.5-pro  thinking=HIGH  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:06:53,585 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-2.5-pro  thinking=HIGH  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:06:53,586 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-2.5-pro  thinking=HIGH  max_tokens=16384  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:06:53,586 INFO [extract_ss61_form] PDF part built  drive_uri=https://d

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/2a312983-7047-4e9a-8f7c-64973e00dd78

▶ Running experiment: ss61-3-flash-LOW-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:08:38,236 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-flash-preview  thinking=LOW  max_tokens=8192  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:08:38,240 INFO [extract_ss61_form] PDF part built  drive_uri=https://drive.google.com/uc?export=download&id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:08:38,239 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-flash-preview  thinking=LOW  max_tokens=8192  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:08:38,239 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-3-flash-preview  thinking=LOW  max_tokens=8192  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:08:38,238 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=g

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/138e0d16-1761-46c6-9ca1-28845015fb08

▶ Running experiment: ss61-3-flash-HIGH-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:09:31,071 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-flash-preview  thinking=HIGH  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:09:31,071 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-3-flash-preview  thinking=HIGH  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:09:31,072 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-flash-preview  thinking=HIGH  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:09:31,072 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-3-flash-preview  thinking=HIGH  max_tokens=16384  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:09:31,072 INFO [extract_ss61_form] PDF

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/14aaca77-944a-47e4-bb53-a0d7dce44cfa

▶ Running experiment: ss61-3-pro-LOW-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:11:24,163 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-pro-preview  thinking=LOW  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:11:24,164 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-3-pro-preview  thinking=LOW  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:11:24,164 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-pro-preview  thinking=LOW  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:11:24,165 INFO [extract_ss61_form] PDF part built  drive_uri=https://drive.google.com/uc?export=download&id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:11:24,165 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemi

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/acd23bcf-d6c9-489a-8bd7-1eb0224f77f7

▶ Running experiment: ss61-3-pro-HIGH-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:14:28,053 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-pro-preview  thinking=HIGH  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:14:28,054 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-3-pro-preview  thinking=HIGH  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:14:28,055 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3-pro-preview  thinking=HIGH  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:14:28,056 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-3-pro-preview  thinking=HIGH  max_tokens=16384  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:14:28,056 INFO [extract_ss61_form] PDF part bu

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/f8c27c53-f878-4e2e-9d2a-5e4f2a6d69ed

▶ Running experiment: ss61-3.1-pro-LOW-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:17:35,765 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3.1-pro-preview  thinking=LOW  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:17:35,765 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-3.1-pro-preview  thinking=LOW  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:17:35,766 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3.1-pro-preview  thinking=LOW  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:17:35,766 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-3.1-pro-preview  thinking=LOW  max_tokens=16384  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:17:35,769 INFO [extract_ss61_form] PDF par

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/9ed31a37-4bad-41fb-a580-ed50fe4efc51

▶ Running experiment: ss61-3.1-pro-HIGH-0221-1602
   jobs=4  sample_size=None  raise_errors=False


2026-02-21 16:19:21,828 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3.1-pro-preview  thinking=HIGH  max_tokens=16384  file_id=1otADY98vhxj0tUwI4-msaULPea5d2yyh
2026-02-21 16:19:21,829 INFO [extract_ss61_form] START  province=กาญจนบุรี  form_type=unknown  model=gemini-3.1-pro-preview  thinking=HIGH  max_tokens=16384  file_id=15OKzVH_AbuDuJ-ObFxu_V1w7xAGKPY3G
2026-02-21 16:19:21,829 INFO [extract_ss61_form] START  province=กรุงเทพมหานคร  form_type=บัญชีรายชื่อ  model=gemini-3.1-pro-preview  thinking=HIGH  max_tokens=16384  file_id=1l8FFCSWg3cBfI79LLiauIV5AOcH8PsJ6
2026-02-21 16:19:21,830 INFO [extract_ss61_form] START  province=อ่างทอง  form_type=unknown  model=gemini-3.1-pro-preview  thinking=HIGH  max_tokens=16384  file_id=1ji_kd6b3ETWy-Q-7UgbEMiDRMaYml3AA
2026-02-21 16:19:21,830 INFO [extract_ss61_form] PDF

   ✅ Done - 3 records
   View: https://us3.datadoghq.com/llm/experiments/e3ab6f05-5e15-4f9b-9812-3e1e6ac68349

All experiment URLs:
  2.5-flash-no-think   https://us3.datadoghq.com/llm/experiments/64b613bd-cfc2-4848-89a1-f6f7c8c62922
  2.5-flash-LOW        https://us3.datadoghq.com/llm/experiments/1c73e5ff-0a94-4962-89fe-ca1df31d2c95
  2.5-flash-HIGH       https://us3.datadoghq.com/llm/experiments/0d469602-7ffd-4939-9d81-ffdfed5341c4
  2.5-pro-LOW          https://us3.datadoghq.com/llm/experiments/9006f2a5-bbe6-4caa-a8ad-a0f9f5188460
  2.5-pro-HIGH         https://us3.datadoghq.com/llm/experiments/2a312983-7047-4e9a-8f7c-64973e00dd78
  3-flash-LOW          https://us3.datadoghq.com/llm/experiments/138e0d16-1761-46c6-9ca1-28845015fb08
  3-flash-HIGH         https://us3.datadoghq.com/llm/experiments/14aaca77-944a-47e4-bb53-a0d7dce44cfa
  3-pro-LOW            https://us3.datadoghq.com/llm/experiments/acd23bcf-d6c9-489a-8bd7-1eb0224f77f7
  3-pro-HIGH           https://us3.datadoghq.com

In [59]:
# ── Local results summary ─────────────────────────────────────────────────────
# Quick table of experiment links; the scores themselves are compared in the Datadog UI.

EVAL_KEYS = [
    "ballot_check_eval",
    "votes_check_eval",
    "total_votes_match",
    "ballot_summary_match",
    "vote_counts_match",
]

summary_rows = []

for cfg in EXPERIMENT_CONFIGS:
    if cfg["name"] not in experiment_urls:
        continue

    # Scores are not recomputed locally: re-running the task just for a summary
    # would repeat every model call. Each row links to the experiment in
    # Datadog, where the evaluator results can be compared side by side.
    summary_rows.append({
        "config":       cfg["name"],
        "model":        cfg["model"],
        "thinking":     cfg["thinking_mode"] or "none",
        "experiment_url": experiment_urls[cfg["name"]],
    })

if summary_rows:
    summary_df = pd.DataFrame(summary_rows)
    print("Experiments submitted. Open the URLs below to compare results in Datadog:\n")
    for _, row in summary_df.iterrows():
        print(f"  [{row['config']:<20}]  {row['experiment_url']}")
    print()
    display(summary_df[["config", "model", "thinking", "experiment_url"]])
else:
    print("No experiments ran in this session.")

Experiments submitted. Open the URLs below to compare results in Datadog:

  [2.5-flash-no-think  ]  https://us3.datadoghq.com/llm/experiments/64b613bd-cfc2-4848-89a1-f6f7c8c62922
  [2.5-flash-LOW       ]  https://us3.datadoghq.com/llm/experiments/1c73e5ff-0a94-4962-89fe-ca1df31d2c95
  [2.5-flash-HIGH      ]  https://us3.datadoghq.com/llm/experiments/0d469602-7ffd-4939-9d81-ffdfed5341c4
  [2.5-pro-LOW         ]  https://us3.datadoghq.com/llm/experiments/9006f2a5-bbe6-4caa-a8ad-a0f9f5188460
  [2.5-pro-HIGH        ]  https://us3.datadoghq.com/llm/experiments/2a312983-7047-4e9a-8f7c-64973e00dd78
  [3-flash-LOW         ]  https://us3.datadoghq.com/llm/experiments/138e0d16-1761-46c6-9ca1-28845015fb08
  [3-flash-HIGH        ]  https://us3.datadoghq.com/llm/experiments/14aaca77-944a-47e4-bb53-a0d7dce44cfa
  [3-pro-LOW           ]  https://us3.datadoghq.com/llm/experiments/acd23bcf-d6c9-489a-8bd7-1eb0224f77f7
  [3-pro-HIGH          ]  https://us3.datadoghq.com/llm/experiments/f8c27c53-f878-4e2

| | config | model | thinking | experiment_url |
|---|---|---|---|---|
| 0 | 2.5-flash-no-think | gemini-2.5-flash | none | https://us3.datadoghq.com/llm/experiments/64b613bd-cfc2-4848-89a1-f6f7c8c62922 |
| 1 | 2.5-flash-LOW | gemini-2.5-flash | LOW | https://us3.datadoghq.com/llm/experiments/1c73e5ff-0a94-4962-89fe-ca1df31d2c95 |
| 2 | 2.5-flash-HIGH | gemini-2.5-flash | HIGH | https://us3.datadoghq.com/llm/experiments/0d469602-7ffd-4939-9d81-ffdfed5341c4 |
| 3 | 2.5-pro-LOW | gemini-2.5-pro | LOW | https://us3.datadoghq.com/llm/experiments/9006f2a5-bbe6-4caa-a8ad-a0f9f5188460 |
| 4 | 2.5-pro-HIGH | gemini-2.5-pro | HIGH | https://us3.datadoghq.com/llm/experiments/2a312983-7047-4e9a-8f7c-64973e00dd78 |
| 5 | 3-flash-LOW | gemini-3-flash-preview | LOW | https://us3.datadoghq.com/llm/experiments/138e0d16-1761-46c6-9ca1-28845015fb08 |
| 6 | 3-flash-HIGH | gemini-3-flash-preview | HIGH | https://us3.datadoghq.com/llm/experiments/14aaca77-944a-47e4-bb53-a0d7dce44cfa |
| 7 | 3-pro-LOW | gemini-3-pro-preview | LOW | https://us3.datadoghq.com/llm/experiments/acd23bcf-d6c9-489a-8bd7-1eb0224f77f7 |
| 8 | 3-pro-HIGH | gemini-3-pro-preview | HIGH | https://us3.datadoghq.com/llm/experiments/f8c27c53-f878-4e2e-9d2a-5e4f2a6d69ed |
| 9 | 3.1-pro-LOW | gemini-3.1-pro-preview | LOW | https://us3.datadoghq.com/llm/experiments/9ed31a37-4bad-41fb-a580-ed50fe4efc51 |


---
## 9. Full Dataset Run  *(after reviewing sample experiment results)*

Run all **776 files from `df`** (the full drive index, not the 9-record labeled dataset) with **tenacity retry + exponential backoff** to handle transient API errors.

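**Retry strategy per record** (`wait_exponential(multiplier=2, min=2, max=120)`, up to 5 retries with the default settings). Each wait is `multiplier × 2^(attempt − 1)` seconds, clamped to the range `[min, max]`:

```
attempt 1  → fail → wait  2s   (multiplier × 2⁰)
attempt 2  → fail → wait  4s   (multiplier × 2¹)
attempt 3  → fail → wait  8s   (multiplier × 2²)
attempt 4  → fail → wait 16s   (multiplier × 2³)
attempt 5  → fail → wait 32s   (multiplier × 2⁴)
attempt 6  → fail → give up, record error  (reraise=True)
```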
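
The backoff schedule above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming tenacity's default `exp_base` of 2 for `wait_exponential`; `wait_schedule` is a hypothetical helper introduced only for illustration and is not part of the notebook or of tenacity.

```python
def wait_schedule(retries: int, multiplier: float, min_wait: float, max_wait: float) -> list[float]:
    """Seconds waited before each retry: multiplier * 2**k, clamped to [min_wait, max_wait]."""
    return [
        max(min_wait, min(multiplier * 2 ** k, max_wait))
        for k in range(retries)  # k = attempt number - 1
    ]


print(wait_schedule(retries=3, multiplier=2, min_wait=2, max_wait=30))   # [2, 4, 8]
print(wait_schedule(retries=5, multiplier=2, min_wait=2, max_wait=120))  # [2, 4, 8, 16, 32]
print(wait_schedule(retries=6, multiplier=2, min_wait=2, max_wait=30))   # [2, 4, 8, 16, 30, 30]
```

The last call shows the cap taking effect: once the exponential term exceeds `max_wait`, every further wait stays at the cap.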

Configure `FULL_RUN_JOBS`, `MAX_RETRIES`, `RETRY_MIN_WAIT`, and `RETRY_MAX_WAIT` in the settings cell below.

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# ✏️  FULL-RUN SETTINGS
# ══════════════════════════════════════════════════════════════════════════════

FULL_RUN_JOBS       = 5      # parallel workers (increase for higher throughput)
MAX_RETRIES         = 5      # max retry attempts after first failure
RETRY_MIN_WAIT      = 2.0    # seconds: lower bound on any single wait
RETRY_MAX_WAIT      = 120.0  # seconds: cap on exponential growth
RETRY_MULTIPLIER    = 2.0    # wait = multiplier × 2^(attempt-1), clamped to [min, max]
FULL_RAISE_ERRORS   = False  # True = abort entire run on unrecoverable error

# Limit to specific configs by name; None runs all defined in EXPERIMENT_CONFIGS
#FULL_RUN_CONFIGS: list[str] | None = None
FULL_RUN_CONFIGS = ["2.5-pro-LOW"]
# e.g.:  FULL_RUN_CONFIGS = ["2.5-pro-HIGH"]

FULL_RUN_SKIP_UNKNOWN = False   # True = skip files whose form_type couldn't be inferred from path

# ══════════════════════════════════════════════════════════════════════════════
# Retry-wrapped task (tenacity)
# ══════════════════════════════════════════════════════════════════════════════

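# Imports used by the retry wrapper (harmless if already imported in an earlier cell).
import logging

from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

_retry_logger = logging.getLogger("ss61.retry")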
logging.basicConfig(level=logging.INFO, format="%(message)s")

@retry(
    stop=stop_after_attempt(MAX_RETRIES + 1),           # 1 first try + N retries
    wait=wait_exponential(
        multiplier=RETRY_MULTIPLIER,
        min=RETRY_MIN_WAIT,
        max=RETRY_MAX_WAIT,
    ),
    retry=retry_if_exception_type(Exception),
    before_sleep=before_sleep_log(_retry_logger, logging.WARNING),
    reraise=True,                                        # re-raise original error after exhaustion
)
def extract_ss61_with_retry(input_data, config):
    """extract_ss61_form wrapped with tenacity exponential-backoff retry."""
    return extract_ss61_form(input_data, config)


# ── Show effective wait schedule ──────────────────────────────────────────────
# wait_exponential waits multiplier * 2**(attempt - 1) seconds (default
# exp_base=2), clamped to the range [RETRY_MIN_WAIT, RETRY_MAX_WAIT].
print("✅ tenacity retry wrapper ready")
print(f"   attempts      : {MAX_RETRIES + 1}  (1 initial + {MAX_RETRIES} retries)")
waits = [
    f"{max(RETRY_MIN_WAIT, min(RETRY_MULTIPLIER * 2 ** i, RETRY_MAX_WAIT)):.0f}s"
    for i in range(MAX_RETRIES)
]
print(f"   wait schedule : {' → '.join(waits)}")
print(f"   max single wait: {RETRY_MAX_WAIT:.0f}s  (cap)")

✅ tenacity retry wrapper ready
   attempts      : 6  (1 initial + 5 retries)
   wait schedule : 2s → 4s → 8s → 16s → 32s
   max single wait: 120s  (cap)


In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed

# ── Worker: run one record, return (index, result_row) ────────────────────────
def _run_one(idx: int, record: dict, task_config: dict, cfg_name: str) -> tuple[int, dict]:
    input_data = record.get("input_data", {})
    meta       = _parse(input_data).get("source_file_metadata", {})
    try:
        result       = extract_ss61_with_retry(input_data, task_config)
        b_score, b_r = _ballot_check(result)
        v_score, v_r = _votes_check(result)
        return idx, {
            "config":        cfg_name,
            "province":      meta.get("province_name", "?"),
            "form_type":     meta.get("form_type", "?"),
            "file_id":       meta.get("file_id", "?"),
            "result":        result,
            "ballot_score":  b_score,
            "ballot_reason": b_r,
            "votes_score":   v_score,
            "votes_reason":  v_r,
            "error":         None,
        }
    except Exception as exc:
        # Keep the same keys as the success row so downstream code sees a uniform schema.
        return idx, {
            "config":        cfg_name,
            "province":      meta.get("province_name", "?"),
            "form_type":     meta.get("form_type", "?"),
            "file_id":       meta.get("file_id", "?"),
            "result":        None,
            "ballot_score":  0.0,
            "ballot_reason": "",
            "votes_score":   0.0,
            "votes_reason":  "",
            "error":         str(exc),
        }


# ── Build input records from full drive index (df, 776 files) ────────────────
_source_df = df[df["form_type"] != "unknown"] if FULL_RUN_SKIP_UNKNOWN else df
excluded   = len(df) - len(_source_df)

_all_records = [
    {"input_data": build_input_data(row)}
    for row in _source_df.to_dict("records")
]

print(f"Source  : df  ({len(df)} total files)")
print(f"Included: {len(_all_records)}  (excluded {excluded} 'unknown' form_type)")
print()

FULL_RUN_OUTPUT_DIR = Path("datasets/full_run_results")
FULL_RUN_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
_run_stamp = datetime.now().strftime("%Y%m%d_%H%M")

full_results_store: dict[str, list] = {}

for cfg in EXPERIMENT_CONFIGS:
    if FULL_RUN_CONFIGS is not None and cfg["name"] not in FULL_RUN_CONFIGS:
        print(f"⏭  Skipping {cfg['name']}")
        continue

    cfg_name    = cfg["name"]
    task_config = {k: v for k, v in cfg.items() if k != "name"}
    n           = len(_all_records)
    cfg_slug    = cfg_name.replace("/", "-").replace(" ", "_")
    out_path    = FULL_RUN_OUTPUT_DIR / f"ss6_1_{cfg_slug}_{_run_stamp}.jsonl"

    print(f"{'='*60}")
    print(f"▶ {cfg_name}  ({n} records  jobs={FULL_RUN_JOBS}  retries={MAX_RETRIES})")
    print(f"  Writing → {out_path}")

    rows: list[dict | None] = [None] * n

    with out_path.open("w", encoding="utf-8") as jsonl_file, \
         ThreadPoolExecutor(max_workers=FULL_RUN_JOBS) as pool:

        futures = {
            pool.submit(_run_one, i, rec, task_config, cfg_name): i
            for i, rec in enumerate(_all_records)
        }
        for future in as_completed(futures):
            idx, row = future.result()
            rows[idx] = row

            # ── Stream-write to JSONL immediately (main thread, no lock needed) ──
            src_rec  = _all_records[idx]
            src_meta = _parse(src_rec.get("input_data", {})).get("source_file_metadata", {})
            jsonl_file.write(json.dumps({
                "source_file": {
                    "file_id":       row["file_id"],
                    "province_name": row["province"],
                    "form_type":     row["form_type"],
                    "path":          src_meta.get("path", ""),
                    "size_mb":       src_meta.get("size_mb"),
                    "drive_uri":     _parse(src_rec.get("input_data", {})).get("drive_uri", ""),
                },
                "config":         row["config"],
                "extracted_data": row["result"],
                "ballot_check":   row["ballot_score"],
                "ballot_reason":  row.get("ballot_reason", ""),
                "votes_check":    row["votes_score"],
                "votes_reason":   row.get("votes_reason", ""),
                "error":          row["error"],
            }, ensure_ascii=False) + "\n")
            jsonl_file.flush()

            # ── Progress ──────────────────────────────────────────────────────
            b_icon = "✅" if row["ballot_score"] == 1.0 else ("💥" if row["error"] else "❌")
            v_icon = "✅" if row["votes_score"]  == 1.0 else ("💥" if row["error"] else "❌")
            done   = sum(1 for r in rows if r is not None)
            print(
                f"  [{done:>{len(str(n))}}/{n}] "
                f"{row['province']} ({row['form_type']}) "
                f"- ballot={b_icon} votes={v_icon}"
                + (f"  ERROR: {row['error'][:60]}" if row["error"] else "")
            )

    full_results_store[cfg_name] = rows
    errors  = sum(1 for r in rows if r and r.get("error"))
    size_kb = out_path.stat().st_size / 1024
    print(f"  ✅ Done - {n} records  errors={errors}")
    print(f"  💾 {out_path.name}  ({size_kb:.0f} KB)\n")

print("=" * 60)
print(f"✅ Full run complete  |  output dir: {FULL_RUN_OUTPUT_DIR.resolve()}")

In [None]:
# ── Full-run results summary ──────────────────────────────────────────────────
if not full_results_store:
    print("No full-run results yet - run the cell above first.")
else:
    summary_rows = []
    for cfg_name, rows in full_results_store.items():
        n      = len(rows)
        errors = sum(1 for r in rows if r and r.get("error"))
        ok_rows = [r for r in rows if r and not r.get("error")]

        def _rate(key):
            vals = [r[key] for r in ok_rows]
            if not vals:
                return "-"
            passes = sum(1 for v in vals if v == 1.0)
            mean   = sum(vals) / len(vals)
            return f"{mean:.0%}  ({passes}/{len(vals)})"

        summary_rows.append({
            "config":        cfg_name,
            "total":         n,
            "errors":        errors,
            "ballot_check":  _rate("ballot_score"),
            "votes_check":   _rate("votes_score"),
        })

    full_summary_df = pd.DataFrame(summary_rows)
    print("Full-run internal check pass rates  (% pass  |  n_pass/n_ok)\n")
    display(full_summary_df)

    # ── Per-file detail ───────────────────────────────────────────────────────
    detail_rows = []
    for cfg_name, rows in full_results_store.items():
        for row in rows:
            if row is None:
                continue
            b_icon = "✅" if row["ballot_score"] == 1.0 else ("💥" if row["error"] else "❌")
            v_icon = "✅" if row["votes_score"]  == 1.0 else ("💥" if row["error"] else "❌")
            detail_rows.append({
                "config":        row["config"],
                "province":      row["province"],
                "form_type":     row["form_type"],
                "ballot_check":  b_icon,
                "votes_check":   v_icon,
                "all_ok":        "✅" if b_icon == "✅" and v_icon == "✅" else "❌",
                "error":         (row.get("error") or "")[:80],
            })

    detail_df = pd.DataFrame(detail_rows)
    print("\nPer-file detail:")
    display(detail_df)

No full-run results yet - run the cell above first.
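
When `full_results_store` is empty, for example after a kernel restart, the same counts can in principle be rebuilt from a streamed JSONL file instead of rerunning the extraction. The sketch below is one way to do that. It assumes the record keys written by the full-run cell (`ballot_check`, `votes_check`, `error`); the helper name `summarize_jsonl` and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def summarize_jsonl(path):
    """Recompute pass counts from one streamed results file."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # one JSON object per line
            if line.strip():
                records.append(json.loads(line))
    ok = [r for r in records if not r.get("error")]
    return {
        "total": len(records),
        "errors": len(records) - len(ok),
        "ballot_pass": sum(1 for r in ok if r.get("ballot_check") == 1.0),
        "votes_pass": sum(1 for r in ok if r.get("votes_check") == 1.0),
    }


# Hypothetical sample records, written to a temporary file for the demo.
sample = [
    {"config": "demo", "ballot_check": 1.0, "votes_check": 1.0, "error": None},
    {"config": "demo", "ballot_check": 1.0, "votes_check": 0.0, "error": None},
    {"config": "demo", "ballot_check": 0.0, "votes_check": 0.0, "error": "timeout"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in sample) + "\n",
        encoding="utf-8",
    )
    summary = summarize_jsonl(demo_path)

print(summary)  # {'total': 3, 'errors': 1, 'ballot_pass': 2, 'votes_pass': 1}
```

As in the summary cell, errored records are excluded from the pass counts, so the pass rates are relative to successfully extracted files only.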


2026-02-21 18:37:38,493 INFO HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent "HTTP/1.1 200 OK"
--- Logging error ---
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.14_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/logging/__init__.py", line 1113, in emit
    stream.write(msg + self.terminator)
  File "/Users/nuttee.jirattivongvibul/Projects/genai-app-python/.venv/lib/python3.11/site-packages/ipykernel/iostream.py", line 760, in write
    self._schedule_flush()
  File "/Users/nuttee.jirattivongvibul/Projects/genai-app-python/.venv/lib/python3.11/site-packages/ipykernel/iostream.py", line 656, in _schedule_flush
    self.pub_thread.schedule(_schedule_in_thread)
  File "/Users/nuttee.jirattivongvibul/Projects/genai-app-python/.venv/lib/python3.11/site-packages/ipykernel/iostream.py", line 339, in schedule
    self._event_pipe.send(b"")
  File "/Users/nuttee.jirattivongvibul/Projects/gen