# Thai Election Form Extraction — Datadog LLMObs Experiments

Systematically evaluate Gemini models on extracting structured data from Thai election PDFs (Form S.S. 5/18) stored in **Google Drive**, discovered via **BigQuery**.

**Workflow:**
1. Setup — install dependencies, configure credentials
2. Schema — Pydantic models + Gemini JSON schema
3. Dataset — pull labeled records from Datadog LLMObs
4. Task — extraction function using Gemini + Google Drive URIs
5. Evaluators — score ballot stats, voter stats, and total votes
6. Experiment — run and compare model configurations in Datadog

## 1. Setup

In [1]:
!pip install -q google-cloud-bigquery google-genai pydantic pandas ddtrace python-dotenv

In [2]:
!pip freeze | grep -E 'google-cloud-bigquery|google-genai|pydantic|pandas|ddtrace'

ddtrace==4.4.0
google-cloud-bigquery==3.40.1
google-genai==1.64.0
pandas==3.0.1
pydantic==2.12.5
pydantic_core==2.41.5


In [3]:
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, List, Literal, Optional

import pandas as pd
from dotenv import load_dotenv
from google import genai
from google.cloud import bigquery
from google.genai import types
from pydantic import BaseModel, Field

load_dotenv(override=True)
print("✅ Imports ready")

✅ Imports ready


In [4]:
# ── Credentials ──────────────────────────────────────────────────────────────
GEMINI_API_KEY       = os.environ["GEMINI_API_KEY"]
GOOGLE_CLOUD_PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]
DD_API_KEY           = os.environ["DD_API_KEY"]
DD_APP_KEY           = os.environ["DD_APP_KEY"]

# ── Project settings ─────────────────────────────────────────────────────────
ML_APP               = "gemini-ss5_18"
LLMOBS_PROJECT_NAME  = "vote-extraction-project"
DD_SITE              = "us3.datadoghq.com"

# ── Data settings ─────────────────────────────────────────────────────────────
BQ_TABLE             = "sourceinth.vote69_ect.raw_files"
DATASET_NAME         = "ss5_18_nuttee"

print(f"✅ Config ready | project={GOOGLE_CLOUD_PROJECT} | ml_app={ML_APP}")

✅ Config ready | project=datadog-ese-sandbox | ml_app=gemini-ss5_18


In [5]:
from ddtrace.llmobs import LLMObs, EvaluatorResult

LLMObs.enable(
    ml_app=ML_APP,
    api_key=DD_API_KEY,
    app_key=DD_APP_KEY,
    project_name=LLMOBS_PROJECT_NAME,
    site=DD_SITE,
    agentless_enabled=True,
)
print("✅ Datadog LLMObs enabled")

✅ Datadog LLMObs enabled


## 2. Schema

In [6]:
class NumberTextPair(BaseModel):
    """Numeric value in both Arabic numeral and Thai text."""
    arabic: int = Field(..., description="Arabic numeral (e.g., 120)")
    thai_text: Optional[str] = Field(None, description="Thai text (e.g., หนึ่งร้อยยี่สิบ)")


class FormInfo(BaseModel):
    form_type: Optional[str] = Field(None, description="Constituency or PartyList")
    set_number: Optional[str] = Field(None, description="Set number (ชุดที่)")
    date: Optional[str] = Field(None, description="Date of election")
    province: Optional[str] = Field(None, description="Province name")
    constituency_number: Optional[str] = Field(None, description="Constituency number")
    district: str = Field(..., description="District name")
    sub_district: Optional[str] = Field(None, description="Sub-district name")
    polling_station_number: str = Field(..., description="Polling station number")
    village_moo: Optional[str] = Field(None, description="Village number (หมู่ที่)")


class VoterStatistics(BaseModel):
    eligible_voters: Optional[NumberTextPair] = Field(None, description="1.1 Total eligible voters")
    present_voters: Optional[NumberTextPair] = Field(None, description="1.2 Voters who showed up")


class BallotStatistics(BaseModel):
    ballots_allocated: Optional[NumberTextPair] = Field(None, description="2.1 Allocated ballots")
    ballots_used: Optional[NumberTextPair] = Field(None, description="2.2 Used ballots")
    good_ballots: Optional[NumberTextPair] = Field(None, description="2.2.1 Valid ballots")
    bad_ballots: Optional[NumberTextPair] = Field(None, description="2.2.2 Invalid ballots")
    no_vote_ballots: Optional[NumberTextPair] = Field(None, description="2.2.3 No vote ballots")
    ballots_remaining: Optional[NumberTextPair] = Field(None, description="2.3 Remaining ballots")


class VoteResult(BaseModel):
    number: int = Field(..., description="Candidate/Party number")
    candidate_name: Optional[str] = Field(None, description="Candidate name (Constituency only)")
    party_name: Optional[str] = Field(None, description="Party name")
    vote_count: NumberTextPair


class Official(BaseModel):
    name: str
    position: str = Field(..., description="ประธาน / กรรมการ / เลขานุการ")


class ElectionFormData(BaseModel):
    form_info: FormInfo
    voter_statistics: Optional[VoterStatistics] = None
    ballot_statistics: Optional[BallotStatistics] = None
    vote_results: List[VoteResult] = Field(default_factory=list)
    total_votes_recorded: Optional[NumberTextPair] = Field(
        None, description="รวม row at the bottom of vote table"
    )
    officials: Optional[List[Official]] = None


print("✅ Pydantic models defined")

✅ Pydantic models defined


In [7]:
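# Gemini JSON schema: hand-maintained counterpart of the Pydantic models above.
# Note that the `required` lists here are not identical to the Pydantic required fields.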
ELECTION_DATA_SCHEMA = {
    "type": "ARRAY",
    "description": "List of election reports found in the PDF",
    "items": {
        "type": "OBJECT",
        "required": ["form_info", "vote_results"],
        "properties": {
            "form_info": {
                "type": "OBJECT",
                "required": ["form_type", "province", "district", "polling_station_number"],
                "properties": {
                    "form_type": {"type": "STRING", "enum": ["Constituency", "PartyList"]},
                    "set_number": {"type": "STRING"},
                    "date": {"type": "STRING"},
                    "province": {"type": "STRING"},
                    "constituency_number": {"type": "STRING"},
                    "district": {"type": "STRING"},
                    "sub_district": {"type": "STRING"},
                    "polling_station_number": {"type": "STRING"},
                    "village_moo": {"type": "STRING"},
                },
            },
            "voter_statistics": {
                "type": "OBJECT",
                "properties": {
                    "eligible_voters": {
                        "type": "OBJECT",
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                    "present_voters": {
                        "type": "OBJECT",
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                },
            },
            "ballot_statistics": {
                "type": "OBJECT",
                "properties": {
                    "ballots_allocated": {
                        "type": "OBJECT",
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                    "ballots_used": {
                        "type": "OBJECT",
                        "required": ["arabic"],
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                    "good_ballots": {
                        "type": "OBJECT",
                        "required": ["arabic"],
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                    "bad_ballots": {
                        "type": "OBJECT",
                        "required": ["arabic"],
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                    "no_vote_ballots": {
                        "type": "OBJECT",
                        "required": ["arabic"],
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                    "ballots_remaining": {
                        "type": "OBJECT",
                        "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                    },
                },
            },
            "vote_results": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "required": ["number", "vote_count"],
                    "properties": {
                        "number": {"type": "INTEGER"},
                        "candidate_name": {"type": "STRING"},
                        "party_name": {"type": "STRING"},
                        "vote_count": {
                            "type": "OBJECT",
                            "required": ["arabic"],
                            "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
                        },
                    },
                },
            },
            "total_votes_recorded": {
                "type": "OBJECT",
                "properties": {"arabic": {"type": "INTEGER"}, "thai_text": {"type": "STRING"}},
            },
            "officials": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "required": ["name", "position"],
                    "properties": {
                        "name": {"type": "STRING"},
                        "position": {"type": "STRING"},
                    },
                },
            },
        },
    },
}

print("✅ Gemini JSON schema defined")

✅ Gemini JSON schema defined
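
Because the schema is a plain dict, a response can be checked against its `required` lists without any extra dependency. The following is a minimal sketch of that idea, not part of the notebook: `MINI_SCHEMA` is a trimmed, hypothetical excerpt of the schema above, and `missing_required` is an illustrative helper name.

```python
from typing import Any, Dict, List

# Trimmed, hypothetical excerpt of the schema above, just enough to show the idea.
MINI_SCHEMA: Dict[str, Any] = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "required": ["form_info", "vote_results"],
        "properties": {
            "form_info": {
                "type": "OBJECT",
                "required": ["district", "polling_station_number"],
                "properties": {
                    "district": {"type": "STRING"},
                    "polling_station_number": {"type": "STRING"},
                },
            },
            "vote_results": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "required": ["number"],
                    "properties": {"number": {"type": "INTEGER"}},
                },
            },
        },
    },
}


def missing_required(schema: Dict[str, Any], value: Any, path: str = "$") -> List[str]:
    """Return the paths of required keys that are absent from `value`."""
    missing: List[str] = []
    if schema.get("type") == "ARRAY" and isinstance(value, list):
        for i, item in enumerate(value):
            missing += missing_required(schema.get("items", {}), item, f"{path}[{i}]")
    elif schema.get("type") == "OBJECT" and isinstance(value, dict):
        # Report required keys that are missing at this level.
        for key in schema.get("required", []):
            if key not in value:
                missing.append(f"{path}.{key}")
        # Recurse into the properties that are present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                missing += missing_required(sub_schema, value[key], f"{path}.{key}")
    return missing


# Made-up response with two gaps: no polling station number, one empty vote row.
sample = [{"form_info": {"district": "โพธิ์ประทับช้าง"}, "vote_results": [{"number": 1}, {}]}]
problems = missing_required(MINI_SCHEMA, sample)
print(problems)
```

A check like this only covers key presence; it does not validate types or enum values.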


## 3. Dataset

Pull the labeled dataset from Datadog LLMObs. Each record has:
- `input_data` — Google Drive file metadata (file_id, path, province_name, …)
- `expected_output` — ground-truth ballot/voter/vote data
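
As a rough illustration of these two fields, the sketch below unpacks one record, assuming both values arrive as JSON strings and that `expected_output` wraps the ground truth in an `extracted_data` list (the shape described in the Evaluators section). The record contents are invented placeholders, not data from the real dataset, and `load_json_field` is a hypothetical helper.

```python
import json

# Hypothetical record mimicking the documented shapes (all values are made up).
record = {
    "input_data": json.dumps(
        {
            "drive_uri": "https://example.invalid/file",
            "source_file_metadata": {"file_id": "FILE_ID_PLACEHOLDER", "province_name": "พิจิตร"},
        },
        ensure_ascii=False,
    ),
    "expected_output": json.dumps(
        {
            "source_file": {"file_id": "FILE_ID_PLACEHOLDER"},
            "extracted_data": [{"total_votes_recorded": {"arabic": 120}}],
        }
    ),
}


def load_json_field(value):
    """Parse a field that may be a JSON string or an already-parsed object."""
    return json.loads(value) if isinstance(value, str) else value


inputs = load_json_field(record["input_data"])
expected = load_json_field(record["expected_output"])

file_id = inputs["source_file_metadata"]["file_id"]
ground_truth = expected["extracted_data"][0]  # first (and here only) form in the file
print(file_id, ground_truth["total_votes_recorded"]["arabic"])
```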

In [8]:
dataset = LLMObs.pull_dataset(
    dataset_name=DATASET_NAME,
    project_name=LLMOBS_PROJECT_NAME,
)

print(f"✅ Dataset '{DATASET_NAME}' loaded — {len(dataset)} records")
dataset.as_dataframe().head()

✅ Dataset 'ss5_18_nuttee' loaded — 5 records


Unnamed: 0_level_0,expected_output,metadata,input_data
Unnamed: 0_level_1,Unnamed: 1_level_1,datadog,Unnamed: 3_level_1
0,"{""source_file"": {""file_id"": ""1-MsML3nSXUrscvmz...","{'tags': ['ddtrace.version:4.4.0', 'sensitive_...","{""drive_uri"": ""[HTTP(S) URL Scanner]"", ""source..."
1,"{""source_file"": {""file_id"": ""1tzz6gMXk1n3pQtre...","{'tags': ['ddtrace.version:4.4.0', 'sensitive_...","{""drive_uri"": ""[HTTP(S) URL Scanner]"", ""source..."
2,"{""source_file"": {""file_id"": ""1a5jF1Oyv3UEatBq1...","{'tags': ['ddtrace.version:4.4.0', 'sensitive_...","{""drive_uri"": ""[HTTP(S) URL Scanner]"", ""source..."
3,"{""source_file"": {""file_id"": ""1_j0DNaqCXIkEk0MK...","{'tags': ['ddtrace.version:4.4.0', 'sensitive_...","{""drive_uri"": ""[HTTP(S) URL Scanner]"", ""source..."
4,"{""source_file"": {""file_id"": ""1gDxp58u2W14uhdb6...","{'tags': ['ddtrace.version:4.4.0', 'sensitive_...","{""drive_uri"": ""[HTTP(S) URL Scanner]"", ""source..."


## 4. Task

The task builds a Google Drive download URL from each record's `file_id`, passes it to Gemini as a PDF part, and returns the parsed JSON extraction as a list of report dicts.
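
The URL construction can be sketched in isolation as follows. This is only an illustration: `drive_download_url` is a hypothetical helper, the file ID is a placeholder, and the sketch assumes the file is shared so that an unauthenticated download link is reachable.

```python
from urllib.parse import urlencode


def drive_download_url(file_id: str) -> str:
    """Build a direct-download URL for a Drive file ID (assumes link sharing is enabled)."""
    query = urlencode({"export": "download", "id": file_id})
    return f"https://drive.google.com/uc?{query}"


url = drive_download_url("FILE_ID_PLACEHOLDER")
print(url)
```

Using `urlencode` rather than string formatting keeps the URL valid if an ID ever contains characters that need escaping.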

In [9]:
EXTRACTION_PROMPT = """
You are an expert data entry assistant for Thai Election documents (Form S.S. 5/18).

CRITICAL INSTRUCTIONS:

1. Analyze all pages of this PDF carefully.

2. Extract BOTH number formats for all numerical values:
   - Arabic numerals (e.g., 120)
   - Thai text (e.g., "หนึ่งร้อยยี่สิบ")
   This applies to: voter statistics, ballot statistics, vote counts, and total votes.

3. Header Information:
   - Form type: "Constituency" (แบบแบ่งเขต) or "PartyList" (บัญชีรายชื่อ)
   - Set number (ชุดที่), Date, Province, District, Sub-district
   - Polling station number (หน่วยเลือกตั้งที่)
   - Village number (หมู่ที่) if present

4. Section 1 — Voter Statistics:
   - 1.1 Eligible voters (ผู้มีสิทธิเลือกตั้งตามบัญชี)
   - 1.2 Present voters (ผู้มาแสดงตน)

5. Section 2 — Ballot Statistics:
   - 2.1 Allocated ballots (บัตรที่ได้รับจัดสรร)
   - 2.2 Used ballots (บัตรที่ใช้)
     - 2.2.1 Valid ballots (บัตรดี)
     - 2.2.2 Invalid ballots (บัตรเสีย)
     - 2.2.3 No vote ballots (ไม่เลือก)
   - 2.3 Remaining ballots (บัตรเหลือ)

6. Section 3 — Vote Results Table:
   - Consolidate across all pages (table often spans multiple pages)
   - For each entry: number, candidate name (Constituency only), party name, vote count

7. Total Votes Recorded:
   - The "รวม" row at the bottom of the vote results table

8. Officials:
   - Names and positions from the signature section

9. Validation rules:
   - ballots_used = good_ballots + bad_ballots + no_vote_ballots
   - total_votes_recorded = sum of all vote_count values
"""

In [10]:
# Initialize clients
bq_client = bigquery.Client(project=GOOGLE_CLOUD_PROJECT)
gemini_client = genai.Client(api_key=GEMINI_API_KEY, vertexai=False)
print("✅ BigQuery and Gemini clients initialized")

✅ BigQuery and Gemini clients initialized


In [11]:
def extract_election_form(input_data: str | Dict[str, Any], config: Dict[str, Any]) -> List[dict]:
    """
    Task function for LLMObs experiments.

    Accepts a dataset record's input_data and the experiment config dict,
    calls Gemini, and returns parsed extraction.

    Args:
        input_data: JSON string or dict. Expected shape:
            {"drive_uri": "...", "source_file_metadata": {"file_id": ..., ...}, ...}
        config: dict with model, temperature, max_tokens, thinking_mode (optional)

    Returns:
        List of extracted election report dicts.
    """
    model          = config.get("model", "gemini-2.5-flash")
    temperature    = config.get("temperature", 0.0)
    max_tokens     = config.get("max_tokens", 8192)
    thinking_mode  = config.get("thinking_mode")  # None or "LOW" / "HIGH"

    # Dataset stores input_data as a JSON string — parse it if needed
    if isinstance(input_data, str):
        input_data = json.loads(input_data)

    source_file = input_data["source_file_metadata"]
    file_id     = source_file["file_id"]
    drive_uri   = f"https://drive.google.com/uc?export=download&id={file_id}"

    file_part = types.Part.from_uri(file_uri=drive_uri, mime_type="application/pdf")

    # thinking_budget mapped from symbolic mode name to token budget
    _thinking_budget = {"LOW": 1024, "HIGH": 8192}

    gen_config_params = {
        "response_mime_type": "application/json",
        "response_schema": ELECTION_DATA_SCHEMA,
        "temperature": temperature,
        "max_output_tokens": max_tokens,
        "top_p": 0.95,
    }
    if thinking_mode:
        gen_config_params["thinking_config"] = types.ThinkingConfig(
            thinking_budget=_thinking_budget.get(thinking_mode, 1024)
        )

    response = gemini_client.models.generate_content(
        model=model,
        contents=[file_part, EXTRACTION_PROMPT],
        config=types.GenerateContentConfig(**gen_config_params),
    )

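    # response.text can be None when the model returns no text part
    # (for example a blocked or empty response), so fail with a clear message.
    if not response.text:
        raise ValueError(f"Gemini returned no text for file_id={file_id}")
    return json.loads(response.text)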


print("✅ Task function defined")

✅ Task function defined


### Optional: smoke test on a single sample

In [12]:
# ── Inspect dataset record structure ──────────────────────────────────────────
raw_input = dataset[0]["input_data"]

# input_data is stored as a JSON string — parse it
sample_input = json.loads(raw_input) if isinstance(raw_input, str) else raw_input

# The file metadata lives under "source_file_metadata"
source_file_metadata = sample_input["source_file_metadata"]

print("drive_uri      :", sample_input.get("drive_uri"))
print("file_id        :", source_file_metadata["file_id"])
print("province       :", source_file_metadata["province_name"])
print("path           :", source_file_metadata["path"])
print("size_kb        :", f"{source_file_metadata['size_kb']:.1f} KB")


drive_uri      : [HTTP(S) URL Scanner]
file_id        : 1-MsML3nSXUrscvmzdZb7R4yTkZcuTc5m
province       : พิจิตร
path           : เขตเลือกตั้งที่ 3/อำเภอโพธิ์ประทับช้าง/ทต.โพธิ์ประทับช้าง/หน่วยเลือกตั้งที่ 10/สส5ทับ18 น_10.pdf
size_kb        : 50.2 KB


In [15]:
# ── Run extraction on the first sample ────────────────────────────────────────
sample_result = extract_election_form(
    sample_input,  # parsed dict — task function also handles raw JSON strings
    {"model": "gemini-3-flash-preview", "temperature": 0.0, "max_tokens": 32764},
)
print(json.dumps(sample_result[0]["form_info"], indent=2, ensure_ascii=False))

{
  "form_type": "Constituency",
  "set_number": "สีขาว",
  "date": "8 พฤษภาคม 2566",
  "province": "พิจิตร",
  "constituency_number": "3",
  "district": "โพธิ์ประทับช้าง",
  "sub_district": "ดงเสือเหลือง",
  "polling_station_number": "10",
  "village_moo": "2"
}


## 5. Evaluators

Each evaluator receives `(input_data, output_data, expected_output)` and returns a float score from **0.0** (all wrong) to **1.0** (all correct).

**Dataset shapes:**
- `output_data` — list of form dicts returned by the task: `[{"form_info": ..., "ballot_statistics": ..., ...}]`
- `expected_output` — JSON **string** with shape `{"source_file": {...}, "extracted_data": [{...form data...}]}`

`_parse_expected()` handles the JSON string parsing and unwraps `extracted_data[0]` automatically.
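
To make the scoring concrete, here is a small worked example of the fraction-of-matching-fields approach the evaluators use. It is a sketch with made-up numbers; `arabic` and `fraction_matching` are illustrative stand-ins, not the notebook's own helpers.

```python
def arabic(obj) -> int:
    """Read the integer from a {"arabic": ..., "thai_text": ...} dict, treating missing as 0."""
    if isinstance(obj, dict):
        return obj.get("arabic", 0) or 0
    return obj or 0


def fraction_matching(actual: dict, expected: dict, fields: list) -> float:
    """Score = number of fields whose arabic values agree / number of fields checked."""
    checks = [arabic(actual.get(f)) == arabic(expected.get(f)) for f in fields]
    return sum(checks) / len(checks)


# Made-up values: eligible_voters agrees, present_voters has two digits transposed.
actual = {"eligible_voters": {"arabic": 500}, "present_voters": {"arabic": 321}}
expected = {"eligible_voters": {"arabic": 500}, "present_voters": {"arabic": 312}}

score = fraction_matching(actual, expected, ["eligible_voters", "present_voters"])
print(score)  # 0.5: one of the two fields matches
```

One consequence of treating a missing value as 0 is that a field absent from both sides counts as a match.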

In [29]:
def _arabic(obj) -> int:
    """Safely extract arabic value from a NumberTextPair dict."""
    if isinstance(obj, dict):
        return obj.get("arabic", 0) or 0
    return obj or 0


def _parse_expected(expected_output) -> dict:
    """
    Normalise expected_output into a plain election-form dict.

    Dataset stores expected_output as a JSON string with shape:
        {"source_file": {...}, "extracted_data": [{...form data...}]}

    Returns the first item of extracted_data, or {} on any error.
    """
    if isinstance(expected_output, str):
        try:
            expected_output = json.loads(expected_output)
        except (json.JSONDecodeError, TypeError):
            return {}

    if isinstance(expected_output, dict):
        extracted = expected_output.get("extracted_data")
        if isinstance(extracted, list) and extracted:
            return extracted[0]
        # Fallback: already the form dict directly
        return expected_output

    return {}


def _parse_output(output_data) -> dict:
    """Return the first report dict from the task output list."""
    if not output_data:
        return {}
    return output_data[0] if isinstance(output_data, list) else output_data


def ballot_statistics(input_data, output_data, expected_output) -> float:
    """
    Score ballot statistics accuracy.

    Checks: allocated, used, good, bad, no_vote, remaining
    Bonus check: ballots_used == good + bad + no_vote (internal validation)
    """
    # `or {}` also covers keys that are present but null in the JSON.
    actual   = _parse_output(output_data).get("ballot_statistics") or {}
    expected = _parse_expected(expected_output).get("ballot_statistics") or {}

    if not actual:
        return 0.0

    fields = ["ballots_allocated", "ballots_used", "good_ballots", "bad_ballots", "no_vote_ballots", "ballots_remaining"]
    checks = [_arabic(actual.get(f)) == _arabic(expected.get(f)) for f in fields]

    # Internal validation: used = good + bad + no_vote
    used  = _arabic(actual.get("ballots_used"))
    parts = _arabic(actual.get("good_ballots")) + _arabic(actual.get("bad_ballots")) + _arabic(actual.get("no_vote_ballots"))
    checks.append(used == parts)

    return sum(checks) / len(checks)


def voter_statistics(input_data, output_data, expected_output) -> float:
    """Score voter statistics accuracy (eligible_voters, present_voters)."""
    # `or {}` also covers keys that are present but null in the JSON.
    actual   = _parse_output(output_data).get("voter_statistics") or {}
    expected = _parse_expected(expected_output).get("voter_statistics") or {}

    if not actual:
        return 0.0

    checks = [
        _arabic(actual.get("eligible_voters")) == _arabic(expected.get("eligible_voters")),
        _arabic(actual.get("present_voters"))  == _arabic(expected.get("present_voters")),
    ]
    return sum(checks) / len(checks)


def total_votes(input_data, output_data, expected_output) -> float:
    """
    Score total vote accuracy.

    Checks:
    - sum(vote_results) == total_votes_recorded  (internal consistency)
    - total_votes_recorded == expected value
    """
    report   = _parse_output(output_data)
    expected = _parse_expected(expected_output)

    results          = report.get("vote_results") or []  # tolerate a null vote_results
    recorded_total   = _arabic(report.get("total_votes_recorded"))
    expected_total   = _arabic(expected.get("total_votes_recorded"))
    calculated_total = sum(_arabic(v.get("vote_count")) for v in results)

    if not results:
        return 0.0

    checks = [
        calculated_total == recorded_total,
        recorded_total   == expected_total,
    ]
    return sum(checks) / len(checks)


print("✅ Evaluators defined: ballot_statistics, voter_statistics, total_votes")

✅ Evaluators defined: ballot_statistics, voter_statistics, total_votes


## 6. Experiment Configurations

In [30]:
EXPERIMENT_CONFIGS = [
    # ── Gemini 3 Flash — deterministic ───────────────────────────────────────
    {"model": "gemini-3-flash-preview", "temperature": 0.0, "max_tokens": 32764, "thinking_mode": "LOW"},
    {"model": "gemini-3-flash-preview", "temperature": 0.0, "max_tokens": 32764, "thinking_mode": "HIGH"},
    # ── Gemini 3 Pro — deterministic ─────────────────────────────────────────
    {"model": "gemini-3-pro-preview",   "temperature": 0.0, "max_tokens": 32764, "thinking_mode": "LOW"},
    {"model": "gemini-3-pro-preview",   "temperature": 0.0, "max_tokens": 32764, "thinking_mode": "HIGH"},
    # ── Gemini 2.5 Flash (baseline) ───────────────────────────────────────────
    #{"model": "gemini-2.5-flash",        "temperature": 0.0, "max_tokens": 32764},
    # ── Gemini 3 Flash — creative ────────────────────────────────────────────
    {"model": "gemini-3-flash-preview", "temperature": 0.5, "max_tokens": 32764, "thinking_mode": "LOW"},
    {"model": "gemini-3-flash-preview", "temperature": 0.5, "max_tokens": 32764, "thinking_mode": "HIGH"},
    # ── Gemini 3 Pro — creative ───────────────────────────────────────────────
    {"model": "gemini-3-pro-preview",   "temperature": 0.5, "max_tokens": 32764, "thinking_mode": "LOW"},
    {"model": "gemini-3-pro-preview",   "temperature": 0.5, "max_tokens": 32764, "thinking_mode": "HIGH"},
    # ── Gemini 2.5 Flash — creative ───────────────────────────────────────────
    #{"model": "gemini-2.5-flash",        "temperature": 0.5, "max_tokens": 32764},
]

print(f"✅ {len(EXPERIMENT_CONFIGS)} configurations ready")
pd.DataFrame(EXPERIMENT_CONFIGS)

✅ 8 configurations ready


Unnamed: 0,model,temperature,max_tokens,thinking_mode
0,gemini-3-flash-preview,0.0,8192,LOW
1,gemini-3-flash-preview,0.0,8192,HIGH
2,gemini-3-pro-preview,0.0,8192,LOW
3,gemini-3-pro-preview,0.0,8192,HIGH
4,gemini-3-flash-preview,0.5,8192,LOW
5,gemini-3-flash-preview,0.5,8192,HIGH
6,gemini-3-pro-preview,0.5,8192,LOW
7,gemini-3-pro-preview,0.5,8192,HIGH
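
The active configurations form a full grid over model, temperature, and thinking mode. As an alternative to listing them by hand, the same grid could be generated with `itertools.product`. This is a sketch under the assumption that every combination should be run; the constant names are illustrative, and the values are copied from the list above.

```python
from itertools import product

MODELS = ["gemini-3-flash-preview", "gemini-3-pro-preview"]
TEMPERATURES = [0.0, 0.5]
THINKING_MODES = ["LOW", "HIGH"]
MAX_TOKENS = 32764

# Temperature varies slowest, thinking mode fastest, matching the hand-written order.
grid = [
    {"model": model, "temperature": temp, "max_tokens": MAX_TOKENS, "thinking_mode": mode}
    for temp, model, mode in product(TEMPERATURES, MODELS, THINKING_MODES)
]

print(len(grid))  # 2 temperatures x 2 models x 2 thinking modes = 8
```

Adding a model or a thinking mode then means extending one list rather than editing several dict literals.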


## 7. Run Experiments

Each call to `LLMObs.experiment().run()` sends results to Datadog.
View all runs at: https://us3.datadoghq.com/llm/experiments

In [31]:
def make_experiment_name(cfg: dict) -> str:
    model   = cfg["model"].replace(".", "_").replace("-", "_")
    temp    = f"temp{cfg['temperature']}"
    thinking = f"_thinking_{cfg['thinking_mode']}" if cfg.get("thinking_mode") else ""
    return f"{model}{thinking}_{temp}"


def run_experiment(cfg: dict) -> Any:
    """Create and run a single LLMObs experiment for the given config; return its results."""
    name = make_experiment_name(cfg)
    print(f"\n▶ {name}")

    experiment = LLMObs.experiment(
        name=name,
        dataset=dataset,
        task=extract_election_form,
        evaluators=[ballot_statistics, voter_statistics, total_votes],
        config=cfg,
        description=(
            f"{cfg['model']} | temp={cfg['temperature']}"
            + (f" | thinking={cfg['thinking_mode']}" if cfg.get('thinking_mode') else "")
        ),
    )

    results = experiment.run(jobs=5)
    print(f"  ✅ Done — {experiment.url}")
    return results

In [32]:
# ── Option A: single config (quick sanity check) ─────────────────────────────
# run_experiment(EXPERIMENT_CONFIGS[0])


▶ gemini_3_flash_preview_thinking_LOW_temp0.0


failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [2 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/a1d1a064-7c56-4603-b0f9-c0c961d6cdbd


{'summary_evaluations': {},
 'rows': [{'idx': 0,
   'span_id': '5474019286533082380',
   'trace_id': '6996cde6000000007383175c4d8303d3',
   'timestamp': 1771490790661967000,
   'record_id': 'ebf6791b-b5f6-4571-99f7-f5d34005f366',
   'input': '{"drive_uri": "[HTTP(S) URL Scanner]", "source_file_metadata": {"file_id": "1-MsML3nSXUrscvmzdZb7R4yTkZcuTc5m", "path": "\\u0e40\\u0e02\\u0e15\\u0e40\\u0e25\\u0e37\\u0e2d\\u0e01\\u0e15\\u0e31\\u0e49\\u0e07\\u0e17\\u0e35\\u0e48 3/\\u0e2d\\u0e33\\u0e40\\u0e20\\u0e2d\\u0e42\\u0e1e\\u0e18\\u0e34\\u0e4c\\u0e1b\\u0e23\\u0e30\\u0e17\\u0e31\\u0e1a\\u0e0a\\u0e49\\u0e32\\u0e07/\\u0e17\\u0e15.\\u0e42\\u0e1e\\u0e18\\u0e34\\u0e4c\\u0e1b\\u0e23\\u0e30\\u0e17\\u0e31\\u0e1a\\u0e0a\\u0e49\\u0e32\\u0e07/\\u0e2b\\u0e19\\u0e48\\u0e27\\u0e22\\u0e40\\u0e25\\u0e37\\u0e2d\\u0e01\\u0e15\\u0e31\\u0e49\\u0e07\\u0e17\\u0e35\\u0e48 10/\\u0e2a\\u0e2a5\\u0e17\\u0e31\\u0e1a18 \\u0e19_10.pdf", "mime_type": "application/pdf", "folder_id": "1At01M8EipiffkqpTRT2uln5FTcA-Mktj", "prov

In [None]:
# ── Option B: all configs ─────────────────────────────────────────────────────
# for cfg in EXPERIMENT_CONFIGS:
#     run_experiment(cfg)

In [33]:
# ── Option C: gemini-3 models only ────────────────────────────────────────────
for cfg in EXPERIMENT_CONFIGS:
    if "gemini-3" in cfg["model"]:
        run_experiment(cfg)


▶ gemini_3_flash_preview_thinking_LOW_temp0.0


failed to send, dropping 2 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [3 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/c793c8d3-86bb-4268-ada5-c9876babc931

▶ gemini_3_flash_preview_thinking_HIGH_temp0.0
  ✅ Done — https://us3.datadoghq.com/llm/experiments/a815c0d6-d913-4657-ae1a-19ef042a165b

▶ gemini_3_pro_preview_thinking_LOW_temp0.0


failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [5 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/5a36ddfb-c354-44ff-b48c-7545590dce3b

▶ gemini_3_pro_preview_thinking_HIGH_temp0.0


failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [3 skipped]
failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect)
failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [2 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/436c72fa-ce76-40e9-9b87-d522eadac1fa

▶ gemini_3_flash_preview_thinking_LOW_temp0.5
  ✅ Done — https://us3.datadoghq.com/llm/experiments/467066a2-753d-4bb3-8286-e1d5a2afafc7

▶ gemini_3_flash_preview_thinking_HIGH_temp0.5


failed to send, dropping 2 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [3 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/bb17af89-0338-4454-bd39-3ab6ced3a3ac

▶ gemini_3_pro_preview_thinking_LOW_temp0.5


failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [3 skipped]
failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [2 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/aa96b82f-c900-497d-9319-e65e6f61957a

▶ gemini_3_pro_preview_thinking_HIGH_temp0.5


failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [1 skipped]
failed to send, dropping 1 traces to intake at http://datadog-agent:8126/v0.5/traces: client error (Connect) [3 skipped]


  ✅ Done — https://us3.datadoghq.com/llm/experiments/79fa8bd2-ed75-40f8-b8ae-e11b8dc3dee0


## Results

Open the Datadog Experiments UI to compare runs side-by-side:

**https://us3.datadoghq.com/llm/experiments**

Useful search filters:

| What | Filter |
|------|--------|
| Ballot score | `@evaluation.external.ballot_statistics.value:>=0.8` |
| Voter score  | `@evaluation.external.voter_statistics.value:1` |
| Votes score  | `@evaluation.external.total_votes.value:1` |
| Slow calls   | `@duration:>=10s` |