# 🧠 Text-to-SQL Error Analysis Notebook

Welcome to the **Text-to-SQL Error Analysis Notebook**. This notebook is designed to help you systematically evaluate and debug the performance of Text-to-SQL models using the `text2sql-eval-toolkit`.

## 📋 Notebook Overview

This notebook is organized into the following sections:

1. **Setup** – Install dependencies and prepare your environment.
2. **Inference** – Run baseline or custom inference to generate and save SQL queries.
3. **Execution** – Execute ground truth and predicted SQL queries against a target database.
4. **Evaluation** – Compute accuracy metrics on the predictions.
5. **Error Analysis** – Visualize and analyze common failure modes and error types.

## 🧰 Toolkit Features

- Plug-and-play evaluation for any Text-to-SQL model or pipeline
- Execution-based and string-based metrics
- Easy integration with Postgres, Cognos (WX BI), and SQLite databases (more to come soon!)
- A set of enterprise and academic benchmarks (configured in [benchmarks.json](../data/benchmarks.json))
- Tools for error analysis, including using an LLM as a judge
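
The difference between string-based and execution-based metrics can be illustrated with a small, self-contained sketch. It does not use the toolkit's own metric code: `normalize_sql`, `string_match`, and `execution_match` are hypothetical helpers, and the in-memory SQLite table holds toy data.

```python
import sqlite3


def normalize_sql(sql: str) -> str:
    """Crude normalization: lowercase, collapse whitespace, drop a trailing semicolon.

    Lowercasing also touches string literals, so this is only suitable for a sketch.
    """
    return " ".join(sql.strip().rstrip(";").lower().split())


def string_match(gold_sql: str, predicted_sql: str) -> bool:
    """String-based check: the queries are textually identical after normalization."""
    return normalize_sql(gold_sql) == normalize_sql(predicted_sql)


def execution_match(conn: sqlite3.Connection, gold_sql: str, predicted_sql: str) -> bool:
    """Execution-based check: both queries return the same rows.

    Row order and column names are ignored; a query that fails to run
    counts as a mismatch.
    """
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    return sorted(map(repr, gold_rows)) == sorted(map(repr, predicted_rows))


# Toy database (hypothetical data, loosely modeled on a `yearmonth` table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearmonth (CustomerID INTEGER, Date TEXT, Consumption REAL)")
conn.executemany(
    "INSERT INTO yearmonth VALUES (?, ?, ?)",
    [(1, "201201", 10.0), (1, "201202", 20.0), (2, "201201", 5.0)],
)

gold = "SELECT SUM(Consumption) FROM yearmonth WHERE CustomerID = 1"
predicted = (
    "SELECT SUM(ym.Consumption) AS total_consumption "
    "FROM yearmonth AS ym WHERE ym.CustomerID = 1"
)

print("string match:   ", string_match(gold, predicted))
print("execution match:", execution_match(conn, gold, predicted))
```

Here the predicted query differs textually from the gold query but returns the same rows, so the string-based check fails while the execution-based check passes.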

---

> ⚠️ **Note**: This notebook assumes you have access to the internal GitHub repository and a working SSH setup for installation.

Let's get started!


# 1. Setup

In [1]:
# ## Installation

# Auto-reload for dev
%load_ext autoreload
%autoreload 2

# Check if the package is already installed
try:
    import text2sql_eval_toolkit
except ImportError:
    # Install via git+SSH if not installed
    !pip install -e git+ssh://git@github.com/IBM/text2sql-eval-toolkit.git#egg=text2sql-eval-toolkit


In [2]:
## The toolkit ships with a set of predefined benchmarks.
## They require some setup in advance, such as setting environment variables (including connection strings) or downloading the SQLite databases.
## Check the benchmarks.json file for the environment variables or folder configurations each benchmark needs.
from pathlib import Path
import json

# benchmarks_json_file = Path().resolve().parent / "data" / "benchmarks.json"
benchmarks_json_file = Path().resolve().parent / "data" / "test-benchmarks.json"
with open(benchmarks_json_file, "r") as f:
    benchmarks_info = json.load(f)
print("Available benchmarks:")
for benchmark_id in benchmarks_info.keys():
    print(f"- {benchmark_id}")

Available benchmarks:
- bird_mini_dev_sqlite_test_50
- bird_mini_dev_postgres_test_50
- spider_dev_test_50
- beaver_test_10
- archer_en_dev_test_10
- bird_sqlite_test_benchmark
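
Before running inference or evaluation, it can help to check which benchmarks already have a predictions file on disk. The sketch below assumes only what this notebook relies on: each benchmark entry may carry a `predictions` key holding a path relative to the folder of the benchmarks JSON file. The helper `find_missing_predictions` and the demo file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def find_missing_predictions(benchmarks_json_file: Path) -> dict:
    """Return {benchmark_id: expected_path} for entries whose predictions file is absent."""
    with open(benchmarks_json_file, "r", encoding="utf-8") as f:
        benchmarks_info = json.load(f)

    missing = {}
    for benchmark_id, info in benchmarks_info.items():
        relative_path = info.get("predictions")
        if relative_path is None:
            continue  # entry does not declare a predictions file
        path = benchmarks_json_file.parent / relative_path
        if not path.exists():
            missing[benchmark_id] = path
    return missing


# Demo with a throwaway config (hypothetical benchmark IDs and file names).
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "preds_a.json").write_text("[]")
    config = {
        "bench_a": {"predictions": "preds_a.json"},
        "bench_b": {"predictions": "preds_b.json"},
    }
    config_file = data_dir / "benchmarks.json"
    config_file.write_text(json.dumps(config))
    missing_ids = sorted(find_missing_predictions(config_file))

print("Benchmarks without a predictions file:", missing_ids)
```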


# 2. Inference: Generating SQL for benchmark queries, or loading prior results

In [3]:
# Run your model or load precomputed predictions
# If you're using a model API or local inference, replace this with your own logic
import pandas as pd
from text2sql_eval_toolkit.inference.baseline_llm_pipeline import (
    LLMSQLGenerationPipeline,
)

# Pipeline run is async, so we need nest_asyncio to run in a notebook
import nest_asyncio

nest_asyncio.apply()

benchmark_id = "bird_mini_dev_sqlite_test_50"

model_parameters = {
    "decoding_method": "greedy",
    "max_new_tokens": 256,
    "stop_sequences": ["```"],
}

model_names = [
    "wxai:meta-llama/llama-3-3-70b-instruct",
    # "wxai:ibm/granite-3-3-8b-instruct",
    # "wxai:meta-llama/llama-4-maverick-17b-128e-instruct-fp8"
]

pipeline = LLMSQLGenerationPipeline()

# for model in model_names:
#     pipeline.run_pipeline(
#         benchmark_id, model_name=model, model_parameters=model_parameters
#     )

predictions_file = (
    benchmarks_json_file.parent / benchmarks_info[benchmark_id]["predictions"]
)
with open(predictions_file, "r") as f:
    predictions = json.load(f)

# Gather all pipeline IDs (for our baseline: the model name followed by "-greedy-zero-shot")
pipeline_ids = list(
    set().union(
        *(
            r["predictions"].keys()
            for r in predictions
            if "predictions" in r and isinstance(r["predictions"], dict)
        )
    )
)
print(f"Pipelines with inference results: {pipeline_ids}")

# Optional: Display a few predictions for the selected models
df = pd.DataFrame(predictions)[["id", "sql", "predictions"]].head()
expanded = df["predictions"].apply(pd.Series)
df_expanded = pd.concat([df.drop(columns=["predictions"]), expanded], axis=1)
df_expanded


Pipelines with inference results: ['wxai:openai/gpt-oss-120b-agentic-baseline2-3attempts', 'wxai:openai/gpt-oss-120b-agentic-baseline5-3attempts', 'wxai:openai/gpt-oss-120b-greedy-zero-shot-chatapi', 'wxai:meta-llama/llama-4-maverick-17b-128e-instruct-fp8-greedy-zero-shot-chatapi', 'wxai:openai/gpt-oss-120b-agentic-baseline0-3attempts', 'wxai:openai/gpt-oss-120b-agentic-baseline3-3attempts', 'wxai:ibm/granite-4-h-small-greedy-zero-shot-chatapi', 'wxai:openai/gpt-oss-120b-agentic-baseline4-3attempts', 'wxai:openai/gpt-oss-120b-agentic-baseline1-3attempts', 'wxai:meta-llama/llama-3-3-70b-instruct-greedy-zero-shot-chatapi']


Unnamed: 0,id,sql,wxai:meta-llama/llama-3-3-70b-instruct-greedy-zero-shot-chatapi,wxai:ibm/granite-4-h-small-greedy-zero-shot-chatapi,wxai:meta-llama/llama-4-maverick-17b-128e-instruct-fp8-greedy-zero-shot-chatapi,wxai:openai/gpt-oss-120b-greedy-zero-shot-chatapi,wxai:openai/gpt-oss-120b-agentic-baseline0-3attempts,wxai:openai/gpt-oss-120b-agentic-baseline1-3attempts,wxai:openai/gpt-oss-120b-agentic-baseline2-3attempts,wxai:openai/gpt-oss-120b-agentic-baseline3-3attempts,wxai:openai/gpt-oss-120b-agentic-baseline4-3attempts,wxai:openai/gpt-oss-120b-agentic-baseline5-3attempts
0,1483,[SELECT SUM(Consumption) FROM yearmonth WHERE ...,{'predicted_sql': 'SELECT SUM(Consumption) FR...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...,{'predicted_sql': 'SELECT SUM(Consumption) FR...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...,{'predicted_sql': 'SELECT SUM(Consumption...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...,{'predicted_sql': 'SELECT SUM(Consumption) AS ...
1,1471,"[SELECT CAST(SUM(IIF(Currency = 'EUR', 1, 0)) ...",{'predicted_sql': 'SELECT (SELECT COUNT(*...,{'predicted_sql': 'SELECT (SELECT COUNT(D...,{'predicted_sql': 'SELECT (SELECT COUNT(Cus...,{'predicted_sql': 'SELECT  CAST(SUM(CASE WH...,{'predicted_sql': 'SELECT /* Count of cus...,{'predicted_sql': 'SELECT (SUM(CASE WHEN ...,{'predicted_sql': 'SELECT  SUM(CASE WHEN Cu...,{'predicted_sql': 'SELECT (CAST(COUNT(CAS...,{'predicted_sql': 'SELECT SUM(CASE WHEN C...,{'predicted_sql': 'SELECT 1.0 * SUM(CASE WHEN ...
2,1473,[SELECT AVG(T2.Consumption) / 12 FROM customer...,{'predicted_sql': 'SELECT AVG(T2.Consumption) ...,{'predicted_sql': 'SELECT AVG(Consumption) / 1...,{'predicted_sql': 'SELECT AVG(Consumption) FR...,{'predicted_sql': 'SELECT AVG(ym.Consumption) ...,{'predicted_sql': '-- Average monthly consumpt...,{'predicted_sql': 'SELECT AVG(month_total) AS ...,{'predicted_sql': 'SELECT AVG(ym.Consumption) ...,{'predicted_sql': 'SELECT AVG(ym.Consumption) ...,{'predicted_sql': 'SELECT AVG(ym.Consumption) ...,{'predicted_sql': 'SELECT AVG(ym.Consumption) ...
3,1484,"[SELECT SUM(IIF(Country = 'CZE', 1, 0)) - SUM(...",{'predicted_sql': 'SELECT (SELECT COUNT(*...,{'predicted_sql': 'SELECT  SUM(CASE WHEN Co...,{'predicted_sql': 'SELECT (SELECT COUNT(Gas...,{'predicted_sql': 'SELECT (SELECT COUNT(*...,{'predicted_sql': 'SELECT (SELECT COUNT(*...,{'predicted_sql': 'SELECT (SELECT COUNT(*...,{'predicted_sql': 'SELECT  (SELECT COUNT(*)...,{'predicted_sql': 'SELECT  (SELECT COUNT(*)...,{'predicted_sql': 'SELECT (  SELECT COU...,{'predicted_sql': 'SELECT (  SELECT COUNT(*) ...
4,1480,"[SELECT SUBSTR(T2.Date, 5, 2) FROM customers A...",{'predicted_sql': 'SELECT Date FROM yearmonth...,"{'predicted_sql': 'SELECT substr(Date, 5,...",{'predicted_sql': 'SELECT Date FROM yearmonth...,{'predicted_sql': 'SELECT ym.Date FROM yearmon...,{'predicted_sql': 'SELECT ym.Date AS Year...,{'predicted_sql': 'SELECT  ym.Date AS Month...,"{'predicted_sql': 'SELECT y.Date AS Month, SUM...",{'predicted_sql': 'SELECT  ym.Date AS PeakM...,"{'predicted_sql': 'SELECT ym.Date AS month, SU...","{'predicted_sql': 'SELECT SUBSTR(ym.Date, 5, 2..."
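
The nested `predictions` column is easier to compare across pipelines once it is flattened. The following sketch uses toy records that mimic the structure seen above (an `id`, a `sql` list, and a `predictions` dict keyed by pipeline ID with a `predicted_sql` entry). The pipeline names and queries are made up for illustration.

```python
import pandas as pd

# Toy records in the same shape as the predictions file (hypothetical values).
predictions = [
    {
        "id": 1,
        "sql": ["SELECT 1"],
        "predictions": {
            "model-a-greedy-zero-shot": {"predicted_sql": "SELECT 1"},
            "model-b-greedy-zero-shot": {"predicted_sql": "SELECT 2"},
        },
    },
    {
        "id": 2,
        "sql": ["SELECT 3"],
        "predictions": {
            "model-a-greedy-zero-shot": {"predicted_sql": "SELECT 3"},
        },
    },
    {"id": 3, "sql": ["SELECT 4"]},  # no predictions for this record yet
]

# Collect every pipeline ID that appears in at least one record, in a stable order.
pipeline_ids = sorted(
    {
        pipeline_id
        for record in predictions
        if isinstance(record.get("predictions"), dict)
        for pipeline_id in record["predictions"]
    }
)

# Long format: one row per (record, pipeline) pair.
long_df = pd.DataFrame(
    [
        {
            "id": record["id"],
            "pipeline_id": pipeline_id,
            "predicted_sql": prediction.get("predicted_sql"),
        }
        for record in predictions
        if isinstance(record.get("predictions"), dict)
        for pipeline_id, prediction in record["predictions"].items()
    ]
)

# Wide format: one column of predicted SQL per pipeline, NaN where a pipeline has no result.
wide_df = long_df.pivot(index="id", columns="pipeline_id", values="predicted_sql")

print(pipeline_ids)
print(wide_df)
```

Sorting the IDs keeps their order stable across runs, whereas a list built directly from a `set` has no guaranteed order.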


# 3. Execution: Getting the DataFrames for the ground-truth and predicted SQL

In [4]:
## Execute the ground-truth (gt) and predicted SQL queries and store the resulting DataFrames in the predictions file
from text2sql_eval_toolkit.execution.execution_tools import run_execution

# run_execution(benchmark_id)

predictions_file = (
    benchmarks_json_file.parent / benchmarks_info[benchmark_id]["predictions"]
)
with open(predictions_file, "r") as f:
    predictions = json.load(f)

# Optional: Display a few predictions for the selected models
df = pd.DataFrame(predictions)[["id", "sql", "gt_df", "predictions"]].head()
expanded = df["predictions"].apply(pd.Series)
df_expanded = pd.concat([df.drop(columns=["predictions"]), expanded], axis=1)

pipeline_id = pipeline_ids[0]
expanded = df_expanded[pipeline_id].apply(pd.Series)
df_expanded = pd.concat([df_expanded.drop(columns=[pipeline_id]), expanded], axis=1)
cols = ["id", "sql", "gt_df", "predicted_sql", "predicted_df"]
print(f"Predictions and df for pipeline: {pipeline_id}")
df_expanded[cols]

Predictions and df for pipeline: wxai:openai/gpt-oss-120b-agentic-baseline2-3attempts


| | id | sql | gt_df | predicted_sql | predicted_df |
|---|---|---|---|---|---|
| 0 | 1483 | `[SELECT SUM(Consumption) FROM yearmonth WHERE ...` | `{"columns":["SUM(Consumption)"],"index":[0],"d...` | `SELECT SUM(Consumption) AS total_consumption\n...` | `{"columns":["total_consumption"],"index":[0],"...` |
| 1 | 1471 | `[SELECT CAST(SUM(IIF(Currency = 'EUR', 1, 0)) ...` | `{"columns":["ratio"],"index":[0],"data":[[0.06...` | `SELECT\n SUM(CASE WHEN Currency = 'EUR' THE...` | `{"columns":["eur_to_czk_ratio"],"index":[0],"d...` |
| 2 | 1473 | `[SELECT AVG(T2.Consumption) / 12 FROM customer...` | `{"columns":["AVG(T2.Consumption) \/ 12"],"inde...` | `SELECT AVG(ym.Consumption) AS avg_monthly_cons...` | `{"columns":["avg_monthly_consumption"],"index"...` |
| 3 | 1484 | `[SELECT SUM(IIF(Country = 'CZE', 1, 0)) - SUM(...` | `{"columns":["SUM(IIF(Country = 'CZE', 1, 0)) -...` | `SELECT\n (SELECT COUNT(*) FROM gasstations ...` | `{"columns":["discount_gasstation_difference"],...` |
| 4 | 1480 | `[SELECT SUBSTR(T2.Date, 5, 2) FROM customers A...` | `{"columns":["SUBSTR(T2.Date, 5, 2)"],"index":[...` | `SELECT y.Date AS Month, SUM(y.Consumption) AS ...` | `{"columns":["Month","TotalConsumption"],"index...` |
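
The `gt_df` and `predicted_df` values shown above are JSON strings with `columns`, `index`, and `data` keys, which matches the layout pandas produces with `to_json(orient="split")`. The sketch below, under that assumption, decodes such strings back into DataFrames and compares them by value while ignoring column names and row order. The helpers `load_result_frame`, `canonical_rows`, and `same_result` are hypothetical and are not the toolkit's evaluation logic; the JSON strings are toy data.

```python
import json

import pandas as pd


def load_result_frame(serialized: str) -> pd.DataFrame:
    """Rebuild a DataFrame from a JSON string with 'columns', 'index', and 'data' keys."""
    payload = json.loads(serialized)
    return pd.DataFrame(payload["data"], index=payload["index"], columns=payload["columns"])


def canonical_rows(df: pd.DataFrame, ndigits: int = 6) -> list:
    """Rows as tuples with floats rounded, sorted so that row order does not matter."""
    rows = [
        tuple(round(value, ndigits) if isinstance(value, float) else value for value in row)
        for row in df.itertuples(index=False, name=None)
    ]
    return sorted(rows, key=repr)


def same_result(gt_df: pd.DataFrame, predicted_df: pd.DataFrame) -> bool:
    """True when both frames hold the same rows, regardless of column names or row order."""
    return canonical_rows(gt_df) == canonical_rows(predicted_df)


# Toy serialized results (hypothetical values, not taken from the benchmark).
gt_json = '{"columns":["SUM(Consumption)"],"index":[0],"data":[[30.0]]}'
renamed_json = '{"columns":["total_consumption"],"index":[0],"data":[[30.0]]}'
extra_column_json = '{"columns":["Month","TotalConsumption"],"index":[0],"data":[["201304",30.0]]}'

gt_frame = load_result_frame(gt_json)
renamed_frame = load_result_frame(renamed_json)
extra_column_frame = load_result_frame(extra_column_json)

print(same_result(gt_frame, renamed_frame))       # only the column alias differs
print(same_result(gt_frame, extra_column_frame))  # extra column changes the rows
```

This comparison is deliberately strict about extra columns and does not treat two missing values as equal, so it is a starting point for inspection and not a replacement for the toolkit's metrics.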


# 4. Evaluation: Getting accuracy metrics on the predictions

In [5]:
# Running evaluation (saves the output as well)

import logging
import os
from text2sql_eval_toolkit.utils import get_benchmark_info
from text2sql_eval_toolkit.evaluation.evaluation_tools import evaluate_predictions
from dotenv import load_dotenv

# The LLM judge needs its environment variables to be set; adjust the .env path if yours differs:
load_dotenv(os.path.expanduser("~/.env"))

for name in logging.root.manager.loggerDict:
    logging.getLogger(name).setLevel(logging.INFO)

benchmark_info = get_benchmark_info(benchmark_id)
predictions_path = benchmark_info["predictions_path"]
# Note: use_llm=True makes the evaluation much slower, because the LLM judge runs inference as part of scoring.
pred_eval_data, summary_df = evaluate_predictions(predictions_path, use_llm=True)

16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):   1%|          | 6/500 [00:00<00:15, 31.35it/s]

16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):   3%|‚ñé         | 16/500 [00:00<00:08, 59.95it/s]

16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):   6%|‚ñå         | 28/500 [00:00<00:05, 80.14it/s]

16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                    

16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):   9%|‚ñâ         | 44/500 [00:00<00:06, 74.09it/s]

16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):   9%|‚ñâ         | 44/500 [00:00<00:06, 74.09it/s]

16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:56 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  12%|‚ñà‚ñè        | 60/500 [00:00<00:06, 69.55it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  12%|‚ñà‚ñè        | 60/500 [00:00<00:06, 69.55it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  14%|‚ñà‚ñç        | 69/500 [00:01<00:05, 72.06it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  15%|‚ñà‚ñå        | 77/500 [00:01<00:06, 69.28it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                    

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  19%|‚ñà‚ñä        | 93/500 [00:01<00:05, 73.80it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  20%|‚ñà‚ñà        | 102/500 [00:01<00:05, 77.50it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                     

16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  24%|‚ñà‚ñà‚ñç       | 120/500 [00:01<00:04, 82.18it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  26%|‚ñà‚ñà‚ñå       | 129/500 [00:01<00:05, 73.01it/s]

16:25:57 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  27%|‚ñà‚ñà‚ñã       | 137/500 [00:01<00:04, 74.67it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  29%|‚ñà‚ñà‚ñâ       | 145/500 [00:02<00:04, 72.33it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  29%|‚ñà‚ñà‚ñâ       | 145/500 [00:02<00:04, 72.33it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  31%|‚ñà‚ñà‚ñà       | 153/500 [00:02<00:05, 62.92it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  32%|‚ñà‚ñà‚ñà‚ñè      | 161/500 [00:02<00:05, 65.66it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  34%|‚ñà‚ñà‚ñà‚ñç      | 170/500 [00:02<00:04, 69.15it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  36%|‚ñà‚ñà‚ñà‚ñå      | 179/500 [00:02<00:04, 71.13it/s]

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)


                                                                                     

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                     

16:25:58 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:58 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)


                                                                                     

16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  43%|‚ñà‚ñà‚ñà‚ñà‚ñé     | 213/500 [00:03<00:04, 71.17it/s]

16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  43%|‚ñà‚ñà‚ñà‚ñà‚ñé     | 213/500 [00:03<00:04, 71.17it/s]

16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  44%|‚ñà‚ñà‚ñà‚ñà‚ñç     | 221/500 [00:03<00:04, 62.21it/s]

16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  46%|‚ñà‚ñà‚ñà‚ñà‚ñå     | 229/500 [00:03<00:04, 62.22it/s]

16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                     

16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  47%|‚ñà‚ñà‚ñà‚ñà‚ñã     | 236/500 [00:03<00:05, 46.51it/s]

16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  48%|‚ñà‚ñà‚ñà‚ñà‚ñä     | 242/500 [00:03<00:05, 46.28it/s]

16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                     

16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 1.0)
16:25:59 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 252/500 [00:03<00:04, 57.48it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)


                                                                                     

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)


                                                                                     

16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  55%|█████▌    | 276/500 [00:04<00:03, 64.06it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  57%|█████▋    | 285/500 [00:04<00:03, 65.58it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  58%|█████▊    | 292/500 [00:04<00:03, 65.94it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  61%|██████    | 304/500 [00:04<00:02, 79.45it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  63%|██████▎   | 313/500 [00:04<00:02, 78.95it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  64%|██████▍   | 322/500 [00:04<00:02, 81.90it/s]

16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:00 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  66%|██████▌   | 331/500 [00:04<00:02, 72.99it/s]

16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  70%|██████▉   | 348/500 [00:05<00:02, 74.81it/s]

16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  71%|███████   | 356/500 [00:05<00:02, 67.09it/s]

16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:01 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  80%|████████  | 401/500 [00:05<00:01, 86.35it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  82%|████████▏ | 412/500 [00:05<00:00, 90.80it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  84%|████████▍ | 422/500 [00:06<00:00, 85.59it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)


Evaluating (concurrency limit: 16):  86%|████████▌ | 431/500 [00:06<00:00, 81.06it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  90%|█████████ | 451/500 [00:06<00:00, 78.35it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  92%|█████████▏| 460/500 [00:06<00:00, 67.58it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16):  95%|█████████▌| 475/500 [00:06<00:00, 66.33it/s]

16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:02 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)


Evaluating (concurrency limit: 16): 100%|██████████| 500/500 [00:07<00:00, 70.87it/s]


16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 0.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM judge results (score: 1.0)
16:26:03 - INFO - Reusing cached LLM jud

In [6]:
summary_df_filtered = summary_df.loc[:, ~summary_df.columns.str.endswith("std")]
summary_df_filtered.sort_values(
    by="subset_non_empty_execution_accuracy_avg", ascending=False
)


| | Model | Total | Evaluated | Number of Correct Non-Empty Data Frames | Number of Correct Subset/Superset Non-Empty Data Frames | Number of Correct Results According to LLM Judge | Evaluation Errors | Dataframe Errors | LLM Judge Errors | Total Tokens | ... | sql_exact_match_avg | sql_syntactic_equivalence_avg | eval_error_avg | df_error_avg | prompt_tokens_avg | completion_tokens_avg | total_tokens_avg | inference_time_ms_avg | execution_time_ms_avg | llm_score_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | wxai:meta-llama/llama-4-maverick-17b-128e-inst... | 50 | 50 | 31 | 33 | 46 | 0 | 1 | 0 | 80365 | ... | 0.18 | 0.22 | 0.0 | 0.02 | 1509.38 | 97.92 | 1607.3 | 2229.425 | 50.1092 | 0.92 |
| 0 | wxai:meta-llama/llama-3-3-70b-instruct-greedy-... | 50 | 50 | 31 | 32 | 48 | 0 | 1 | 0 | 81126 | ... | 0.16 | 0.24 | 0.0 | 0.02 | 1549.08 | 73.44 | 1622.52 | 4774.583 | 45.9014 | 0.96 |
| 3 | wxai:openai/gpt-oss-120b-greedy-zero-shot-chatapi | 50 | 50 | 28 | 30 | 47 | 0 | 2 | 0 | 94770 | ... | 0.04 | 0.08 | 0.0 | 0.04 | 1569.68 | 325.72 | 1895.4 | 3800.2334 | 31.119 | 0.94 |
| 1 | wxai:ibm/granite-4-h-small-greedy-zero-shot-ch... | 50 | 50 | 28 | 29 | 37 | 0 | 4 | 0 | 80121 | ... | 0.04 | 0.1 | 0.0 | 0.08 | 1527.08 | 75.34 | 1602.42 | 3976.3452 | 38.9186 | 0.74 |
| 6 | wxai:openai/gpt-oss-120b-agentic-baseline2-3at... | 50 | 50 | 27 | 28 | 44 | 0 | 0 | 0 | 98607 | ... | 0.04 | 0.08 | 0.0 | 0.0 | 1581.86 | 390.28 | 1972.14 | 4509.1956 | | 0.88 |
| 5 | wxai:openai/gpt-oss-120b-agentic-baseline1-3at... | 50 | 50 | 27 | 27 | 44 | 0 | 1 | 0 | 100369 | ... | 0.04 | 0.08 | 0.0 | 0.02 | 1622.96 | 384.42 | 2007.38 | 4038.1356 | | 0.88 |
| 9 | wxai:openai/gpt-oss-120b-agentic-baseline5-3at... | 50 | 46 | 26 | 27 | 39 | 4 | 6 | 0 | 645129 | ... | 0.04 | 0.04 | 0.0 | 0.04 | 11612.56 | 1290.02 | 12902.58 | 132503.8202 | | 0.78 |
| 4 | wxai:openai/gpt-oss-120b-agentic-baseline0-3at... | 50 | 50 | 20 | 25 | 44 | 0 | 1 | 0 | 119077 | ... | 0.04 | 0.06 | 0.0 | 0.02 | 1851.48 | 530.06 | 2381.54 | 5256.7206 | | 0.88 |
| 8 | wxai:openai/gpt-oss-120b-agentic-baseline4-3at... | 50 | 46 | 23 | 25 | 43 | 4 | 5 | 0 | 235698 | ... | 0.04 | 0.06 | 0.0 | 0.02 | 4015.74 | 698.22 | 4713.96 | 71631.531 | | 0.86 |
| 7 | wxai:openai/gpt-oss-120b-agentic-baseline3-3at... | 50 | 50 | 0 | 1 | 3 | 0 | 45 | 0 | 299717 | ... | 0.04 | 0.06 | 0.0 | 0.9 | 4844.04 | 1150.3 | 5994.34 | 34748.8144 | | 0.06 |
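
The summary cell drops the `*_std` columns with a boolean mask over the column names and then ranks the pipelines by a single metric. Here is a minimal sketch of that pattern on a toy frame. The model names and numbers are made up for illustration and are not results from this run.

```python
import pandas as pd

# Toy stand-in for summary_df; values are illustrative only.
toy_summary = pd.DataFrame(
    {
        "Model": ["model-a", "model-b", "model-c"],
        "subset_non_empty_execution_accuracy_avg": [0.58, 0.66, 0.60],
        "subset_non_empty_execution_accuracy_std": [0.49, 0.47, 0.49],
        "llm_score_avg": [0.88, 0.92, 0.94],
    }
)

# Boolean mask over column names: True for every column not ending in "std".
keep = ~toy_summary.columns.str.endswith("std")
toy_filtered = toy_summary.loc[:, keep]

# Rank pipelines from best to worst on the chosen metric.
ranked = toy_filtered.sort_values(
    by="subset_non_empty_execution_accuracy_avg", ascending=False
)
print(ranked.to_string(index=False))
```

Sorting on one metric hides disagreements with the others: in the toy frame, as in the real table, the pipeline with the best execution accuracy is not the one with the best `llm_score_avg`.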


# 5. Error Analysis: Do the results make sense?

In [7]:
with open(predictions_path, "r") as f:
    predictions = json.load(f)
print(f"Number of predictions: {len(predictions)}")
print(f"Number of evaluations: {len(pred_eval_data)}")


Number of predictions: 50
Number of evaluations: 50
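
Matching counts do not by themselves show that both files describe the same questions. A stricter check compares record IDs. This is a minimal sketch, assuming each record carries an `id` or `_id` key; `record_id`, `compare_ids`, and the toy records are hypothetical and used only for illustration.

```python
def record_id(record, fallback):
    """Return the record's ID, trying `id` then `_id` (assumed key names)."""
    return record.get("id", record.get("_id", fallback))


def compare_ids(predictions, evaluations):
    """Return (IDs only in predictions, IDs only in evaluations)."""
    pred_ids = {record_id(r, f"pred_{i}") for i, r in enumerate(predictions)}
    eval_ids = {record_id(r, f"eval_{i}") for i, r in enumerate(evaluations)}
    # Sort by string form so mixed int/str IDs can be ordered together.
    only_pred = sorted(pred_ids - eval_ids, key=str)
    only_eval = sorted(eval_ids - pred_ids, key=str)
    return only_pred, only_eval


# Toy records, not taken from the benchmark files.
toy_predictions = [{"id": 1493}, {"id": 1500}, {"_id": "q-7"}]
toy_evaluations = [{"id": 1493}, {"_id": "q-7"}]

only_pred, only_eval = compare_ids(toy_predictions, toy_evaluations)
print("Only in predictions:", only_pred)
print("Only in evaluations:", only_eval)
```

On the real data, a call such as `compare_ids(predictions, pred_eval_data)` returning two empty lists would indicate that every predicted question was also evaluated.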


In [8]:
import pandas as pd
from IPython.display import display, Markdown, clear_output
from text2sql_eval_toolkit.utils import parse_dataframe


def get_model_names(records):
    return list(records[0]["predictions"].keys())


def get_failed_records(records, pipeline_id, metric="execution_accuracy"):
    return [
        r for r in records if r["predictions"][pipeline_id]["evaluation"][metric] == 0
    ]


def safe_snippet(text, head=1000, tail=1000):
    if len(text) <= head + tail:
        return text
    return text[:head] + "\n…\n" + text[-tail:]


def show_failed_example(pipeline_id: str, failed: list[dict], example_index: int):
    if not failed:
        display(Markdown(f"**No failed predictions found for model `{pipeline_id}`.**"))
        return
    if example_index >= len(failed):
        display(
            Markdown(
                f"**Index {example_index} is out of range. Only {len(failed)} failed examples available.**"
            )
        )
        return

    record = failed[example_index]
    pred = record["predictions"][pipeline_id]
    
    # Handle inference errors (when SQL generation fails)
    if "inference_error" in pred:
        clear_output(wait=True)
        question_id = record.get("id", record.get("_id", f"example_{example_index}"))
        utterance = (
            record.get("page_content")
            or record.get("question")
            or record.get("utterance", "")
        )
        display(
            Markdown(
                f"### ⚠️  Inference Failed - Question #{example_index} (of {len(failed)} examples) - Question ID: `{question_id}`\n\n**Question**: {utterance}"
            )
        )
        display(Markdown("### ❌ Inference Error"))
        display(Markdown(f"```\n{pred.get('inference_error', 'Unknown error')}\n```"))
        
        if pred.get('raw_response'):
            display(Markdown("### 📄 Raw Model Response"))
            display(Markdown(f"```\n{pred['raw_response']}\n```"))
        
        if pred.get('prompt'):
            display(Markdown("### 📝 Prompt Used"))
            display(Markdown(f"```\n{safe_snippet(pred['prompt'], head=500, tail=500)}\n```"))
        return
    
    # Get ground truth SQL from multiple possible locations
    gt_sqls = (
        record.get("sql")
        or record.get("SQL")
        or record.get("metadata", {}).get("sql", [])
    )
    gt_sqls = [gt_sqls] if isinstance(gt_sqls, str) else gt_sqls

    # Parse ground truth DFs
    gt_dfs = []
    raw_gt_dfs = record.get("gt_df", [])
    if isinstance(raw_gt_dfs, str):
        gt_dfs = [parse_dataframe(raw_gt_dfs)]
    else:
        for df in raw_gt_dfs:
            try:
                gt_dfs.append(parse_dataframe(df))
            except Exception as e:
                gt_dfs.append(f"⚠️ Error loading GT DF: {e}")

    pred_df = None
    pred_df_error = None
    if "predicted_df" in pred:
        try:
            pred_df = parse_dataframe(pred["predicted_df"])
        except Exception as e:
            pred_df_error = f"⚠️ Error loading predicted_df: {e}"

    clear_output(wait=True)
    utterance = (
        record.get("page_content")
        or record.get("question")
        or record.get("utterance", "")
    )
    question_id = record.get("id", record.get("_id", f"example_{example_index}"))
    display(
        Markdown(
            f"### ❓ Failed Question #{example_index} (out of {len(failed)}) - Question ID: {question_id}\nQuestion: {utterance}"
        )
    )
    display(Markdown("### ✅ Ground Truth SQL(s)"))
    for sql in gt_sqls:
        display(Markdown(f"```sql\n{sql}\n```"))

    display(Markdown("### ❌ Predicted SQL"))
    display(Markdown(f"```sql\n{pred['predicted_sql']}\n```"))

    display(Markdown("### 📊 Evaluation Metrics"))
    eval_df = pd.DataFrame([pred.get("evaluation", {})])
    llm_explanation = None
    if "llm_explanation" in eval_df.columns:
        llm_explanation = eval_df.at[0, "llm_explanation"]
    columns_to_drop = ["gt_sql", "gt_df", "llm_explanation"]
    eval_df.drop(columns=columns_to_drop, errors="ignore", inplace=True)
    display(eval_df)

    display(Markdown("### 📘 Ground Truth Result(s)"))
    for i, df in enumerate(gt_dfs):
        display(Markdown(f"**Result {i + 1}:**"))
        if isinstance(df, pd.DataFrame):
            display(df)
        else:
            display(Markdown(df))

    display(Markdown("### 📕 Predicted Result"))
    if pred_df is not None:
        display(pred_df)
    elif pred_df_error:
        display(Markdown(pred_df_error))

    # Display agent trace for agentic pipelines, or prompt for standard baseline
    if "agent_trace" in pred and pred["agent_trace"]:
        display(Markdown("### 🤖 Agent Interaction Trace"))
        trace = pred["agent_trace"]
        if isinstance(trace, list):
            for i, interaction in enumerate(trace, 1):
                step_name = interaction.get("step", f"step_{i}")
                display(Markdown(f"\n**Step {i}: {step_name}**\n"))
                
                # Show messages (prompts sent to LLM)
                if "messages" in interaction:
                    display(Markdown("<details><summary>📝 Messages</summary>\n"))
                    for msg in interaction["messages"]:
                        role = msg.get("role", "unknown").capitalize()
                        content = msg.get("content", "").strip()
                        display(Markdown(f"**{role}:**\n```\n{safe_snippet(content, head=500, tail=500)}\n```\n"))
                    display(Markdown("</details>\n"))
                
                # Show response
                if "response" in interaction:
                    display(Markdown(f"**Response:** `{safe_snippet(interaction['response'][:200])}`\n"))
                
                # Show parsed SQL if available
                if "parsed_sql" in interaction:
                    display(Markdown(f"**Parsed SQL:** \n```sql\n{interaction['parsed_sql']}\n```\n"))
                
                # Show LLM judge verdict if this is a validation step
                if "verdict" in interaction:
                    display(Markdown(f"**Verdict:** {interaction['verdict']} (Confidence: {interaction.get('confidence', 'N/A')})\n"))
                    if "reasoning" in interaction:
                        display(Markdown(f"**Reasoning:** {safe_snippet(interaction['reasoning'][:300])}\n"))
                
                # Show error if any
                if "error" in interaction:
                    display(Markdown(f"**Error:** {interaction['error']}\n"))
        else:
            display(Markdown(f"```\n{str(trace)}\n```"))
        
        # Also show number of attempts if available
        if "agent_attempts" in pred:
            display(Markdown(f"\n**Total Attempts:** {pred['agent_attempts']}"))
    elif "agent_reasoning" in pred and pred["agent_reasoning"]:
        # Fallback to agent_reasoning if trace not available
        display(Markdown("### 🤖 Agent Reasoning"))
        reasoning_list = pred["agent_reasoning"]
        if isinstance(reasoning_list, list):
            reasoning_text = "\n".join(
                f"{i}. {reasoning}" for i, reasoning in enumerate(reasoning_list, 1)
            )
            display(Markdown(f"```\n{reasoning_text}\n```"))
        else:
            display(Markdown(f"```\n{str(reasoning_list)}\n```"))
        
        # Also show number of attempts if available
        if "agent_attempts" in pred:
            display(Markdown(f"\n**Attempts:** {pred['agent_attempts']}"))
    elif "prompt" in pred:
        display(Markdown("### 🧠 Prompt"))
        prompt = pred.get("prompt", "")
        if isinstance(prompt, list):
            # Chat-style prompt
            for msg in prompt:
                role = msg.get("role", "unknown").capitalize()
                content = msg.get("content", "").strip()
                display(Markdown(f"**{role}:**\n```\n{safe_snippet(content)}\n```\n"))
        else:
            # String prompt
            display(Markdown(f"```\n{safe_snippet(prompt)}\n```"))
    else:
        display(Markdown("### 🧠 Context"))
        display(Markdown("_No prompt or agent trace available_"))
    
    if llm_explanation:
        display(Markdown(f"### 🤖 LLM Judge Assessment\nLLM judge score: `{eval_df.at[0, 'llm_score'] if 'llm_score' in eval_df.columns else 'N/A'}`\n"))
        display(Markdown("LLM judge explanation (if applicable):\n"))
        display(Markdown(f"```\n{llm_explanation}\n```"))
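
The helpers in this cell assume a particular record layout: each record holds a `predictions` dict keyed by pipeline ID, and each prediction holds an `evaluation` dict of metric values. The sketch below builds that layout from toy data and also picks out the failures that the LLM judge scored as correct, which can be a useful subset to read first. `toy-pipeline`, the question IDs, `make_record`, and `judge_disagreements` are hypothetical.

```python
def get_failed_records(records, pipeline_id, metric="execution_accuracy"):
    # Same filter as the notebook helper: keep records whose metric is 0.
    return [
        r for r in records if r["predictions"][pipeline_id]["evaluation"][metric] == 0
    ]


def judge_disagreements(failed, pipeline_id):
    # Hypothetical helper: failed records that the LLM judge scored as correct.
    return [
        r
        for r in failed
        if r["predictions"][pipeline_id]["evaluation"].get("llm_score") == 1.0
    ]


def make_record(question_id, accuracy, llm_score):
    # Build a toy record in the layout the helpers expect.
    return {
        "id": question_id,
        "predictions": {
            "toy-pipeline": {
                "predicted_sql": "SELECT 1",
                "evaluation": {
                    "execution_accuracy": accuracy,
                    "llm_score": llm_score,
                },
            }
        },
    }


toy_records = [
    make_record("q1", 1, 1.0),  # passes the metric
    make_record("q2", 0, 1.0),  # fails the metric, judge says correct
    make_record("q3", 0, 0.0),  # fails the metric, judge agrees it is wrong
]

failed = get_failed_records(toy_records, "toy-pipeline")
disputed = judge_disagreements(failed, "toy-pipeline")
print("Failed:", [r["id"] for r in failed])
print("Failed but accepted by the judge:", [r["id"] for r in disputed])
```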


In [9]:
pipeline_id = "wxai:openai/gpt-oss-120b-greedy-zero-shot-chatapi"
failed_examples = get_failed_records(
    pred_eval_data, pipeline_id, "subset_non_empty_execution_accuracy"
)
show_failed_example(pipeline_id, failed_examples, 5)

### ❓ Failed Question #5 (out of 20) - Question ID: 1493
Question: In February 2012, what percentage of customers consumed more than 528.3?

### ✅ Ground Truth SQL(s)

```sql
SELECT CAST(SUM(IIF(Consumption > 528.3, 1, 0)) AS FLOAT) * 100 / COUNT(CustomerID) FROM yearmonth WHERE Date = '201202'
```

### ❌ Predicted SQL

```sql
SELECT 
    ROUND(100.0 * SUM(CASE WHEN Consumption > 528.3 THEN 1 ELSE 0 END) / COUNT(*), 2) AS percentage
FROM yearmonth
WHERE Date = '201202'
```

### 📊 Evaluation Metrics

| | execution_accuracy | non_empty_execution_accuracy | subset_non_empty_execution_accuracy | logic_execution_accuracy | bird_execution_accuracy | is_sqlglot_parsable | is_sqlparse_parsable | sqlglot_equivalence | sqlglot_optimized_equivalence | sqlparse_equivalence | sql_exact_match | sql_syntactic_equivalence | eval_error | df_error | prompt_tokens | completion_tokens | total_tokens | inference_time_ms | execution_time_ms | llm_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1018 | 224 | 1242 | 3603.17 | 26.62 | 1.0 |


### 📘 Ground Truth Result(s)

**Result 1:**

| | CAST(SUM(IIF(Consumption > 528.3, 1, 0)) AS FLOAT) * 100 / COUNT(CustomerID) |
|---|---|
| 0 | 66.623008 |


### 📕 Predicted Result

| | percentage |
|---|---|
| 0 | 66.62 |


### 🧠 Prompt

```
Your task is to convert a natural language question into an accurate SQL query using the given sqlite database schema.

**Question:**:
In February 2012, what percentage of customers consumed more than 528.3?

**Database Engine / Dialect:**:
sqlite

**Schema:**
Table: customers
  Columns:
    - CustomerID (INTEGER) (Primary Key) # Example values: 3, 5, 6, 7, 9
    - Segment (TEXT) # Example values: SME, LAM, KAM
    - Currency (TEXT) # Example values: EUR, CZK

Table: gasstations
  Columns:
    - GasStationID (INTEGER) (Primary Key) # Example values: 44, 45, 46, 47, 48
    - ChainID (INTEGER) # Example values: 13, 6, 23, 33, 4
    - Country (TEXT) # Example values: CZE, SVK
    - Segment (TEXT) # Example values: Value for money, Premium, Other, Noname, Discount

Table: products
  Columns:
    - ProductID (INTEGER) (Primary Key) # Example values: 1, 2, 3, 4, 5
    - Description (TEXT) # Example values: Rucní zadání, Nafta, Special, Super, Natural

Table: transactions_1k
  Columns:
    - 
…
- CustomerID (INTEGER) (Primary Key) # Example values: 39, 63, 172, 603, 1492
    - Date (TEXT) (Primary Key) # Example values: 201112, 201201, 201202, 201203, 201204
    - Consumption (REAL) # Example values: 528.3, 1598.28, 1931.36, 1497.14, 51.06


**Instructions:**
- Only use columns listed in the schema.
- Do not use any other columns or tables not mentioned in the schema.
- Ensure the SQL query is valid and executable.
- Use proper SQL syntax and conventions.
- Generate a complete SQL query that answers the question.
- Use the correct SQL dialect for the database, i.e., sqlite.
- Do not include any explanations or comments in the SQL output.
- Your output must start with ```sql and end with ```.

***Hints***
- February 2012 refers to '201202' in yearmonth.date
- The first 4 strings of the Date values in the yearmonth table can represent year
- The 5th and 6th string of the date can refer to month.

Question: In February 2012, what percentage of customers consumed more than 528.3?
```

### 🤖 LLM Judge Assessment
LLM judge score: `1.0`


LLM judge explanation (if applicable):


```
Yes

The predicted SQL query is correct. It accurately calculates the percentage of customers who consumed more than 528.3 in February 2012. 

Here's a breakdown of why the predicted SQL is correct:
1. The query filters the `yearmonth` table for the date '201202', which corresponds to February 2012.
2. It uses a `CASE` statement within the `SUM` function to count the number of customers who consumed more than 528.3. This is equivalent to the `IIF` function used in the ground truth SQL.
3. The result is then divided by the total count of customers (`COUNT(*)`) and multiplied by 100 to get the percentage.
4. The `ROUND` function is used to round the result to 2 decimal places, which is a reasonable formatting choice.

The predicted result (66.62) matches the ground truth result (66.623008) when rounded to 2 decimal places, which further confirms the correctness of the predicted SQL query. 

Overall, the predicted SQL query is a valid and reasonable interpretation of the natural language question, and it produces the correct result. Therefore, the verdict is "Yes". 
```
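
The judge's reasoning about the `CASE`-inside-`SUM` pattern can be checked on a toy table. The sketch below is an illustration only: it uses Python's standard `sqlite3` module rather than the toolkit, and the in-memory `yearmonth` rows are invented. The `IIF` variant is compared only when the bundled SQLite is 3.32 or newer, since `IIF` is not available before that release.

```python
import sqlite3

# Toy stand-in for the yearmonth table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yearmonth (CustomerID INTEGER, Date TEXT, Consumption REAL)"
)
conn.executemany(
    "INSERT INTO yearmonth VALUES (?, ?, ?)",
    [
        (1, "201202", 100.0),
        (2, "201202", 600.0),
        (3, "201202", 900.0),
        (4, "201203", 1000.0),  # different month, removed by the WHERE clause
    ],
)

# Conditional count divided by total count, scaled to a percentage.
query = """
SELECT ROUND(
    SUM(CASE WHEN Consumption > 528.3 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2
)
FROM yearmonth
WHERE Date = '201202'
"""
pct = conn.execute(query).fetchone()[0]
print(pct)  # 2 of the 3 February rows exceed the threshold

# IIF(cond, a, b) is shorthand for the CASE expression on SQLite >= 3.32.
iif_checked = sqlite3.sqlite_version_info >= (3, 32, 0)
if iif_checked:
    iif_query = query.replace(
        "CASE WHEN Consumption > 528.3 THEN 1 ELSE 0 END",
        "IIF(Consumption > 528.3, 1, 0)",
    )
    iif_pct = conn.execute(iif_query).fetchone()[0]
    print(iif_pct == pct)
```

Multiplying by `100.0` rather than `100` keeps the division in floating point; with integer operands SQLite would truncate the ratio.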

In [10]:
# Alternatively, use the built-in display function
from text2sql_eval_toolkit.analysis.error_analysis import (
    get_failed_records,
    format_failed_example,
)

pipeline_id = pipeline_ids[0]
print(f"Error analysis for pipeline: {pipeline_id}")
failed_records = get_failed_records(
    pred_eval_data, pipeline_id, metric="subset_non_empty_execution_accuracy"
)
failed_record_index = 5
display(
    Markdown(
        format_failed_example(
            failed_records[failed_record_index - 1],
            pipeline_id,
            failed_record_index,
            len(failed_records),
        )
    )
)

Error analysis for pipeline: wxai:openai/gpt-oss-120b-agentic-baseline2-3attempts


### ❓ Failed Question #5 (of 22 examples) - Question ID: `1500`


**Question**: Please list the product description of the products consumed in September, 2013.


### ✅ Ground Truth SQL(s)

```sql
SELECT T3.Description FROM transactions_1k AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID = T2.CustomerID INNER JOIN products AS T3 ON T1.ProductID = T3.ProductID WHERE T2.Date = '201309'
```

### ❌ Predicted SQL

```sql
SELECT DISTINCT p.Description
FROM products p
JOIN transactions_1k t ON p.ProductID = t.ProductID
WHERE t.Date >= '2013-09-01' AND t.Date < '2013-10-01'
```

### 📊 Evaluation Metrics

|   execution_accuracy |   non_empty_execution_accuracy |   subset_non_empty_execution_accuracy |   logic_execution_accuracy |   bird_execution_accuracy |   is_sqlglot_parsable |   is_sqlparse_parsable |   sqlglot_equivalence |   sqlglot_optimized_equivalence |   sqlparse_equivalence |   sql_exact_match |   sql_syntactic_equivalence |   eval_error |   df_error |   prompt_tokens |   completion_tokens |   total_tokens |   inference_time_ms |   llm_score |
|---------------------:|-------------------------------:|--------------------------------------:|---------------------------:|--------------------------:|----------------------:|-----------------------:|----------------------:|--------------------------------:|-----------------------:|------------------:|----------------------------:|-------------:|-----------:|----------------:|--------------------:|---------------:|--------------------:|------------:|
|                    0 |                              0 |                                     0 |                          0 |                         0 |                     1 |                      1 |                     0 |                               0 |                      0 |                 0 |                           0 |            0 |          0 |             958 |                 247 |           1205 |              4029.6 |           0 |

### 📘 Ground Truth Result(s)

**Result 1:**

| Description     |
|:----------------|
| Nafta           |
| Nafta           |
| Provoz.nápl.    |
| Natural         |
| Nafta           |
| Natural         |
| Natural         |
| Nemrz.kapal.    |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| Natural         |
| Natural         |
| Oleje,tuky      |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| ...             |
| ... (truncated) |
| ...             |
| Nafta           |
| Nafta           |
| Diesel +        |
| Natural         |
| Natural         |
| Natural         |
| Nafta           |
| Nafta           |
| Nafta           |
| Natural         |
| Natural         |
| Natural         |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |
| Nafta           |

### 📕 Predicted Result

| Description   |
|---------------|

### 🤖 Agent Interaction Trace


**Step 1: generate_sql_attempt_1**


<details>

<summary>📝 Messages</summary>



## Full Chat Prompt Conversation

<div style="max-height: 400px; overflow-y: auto; border: 1px solid #ccc; padding: 10px; background-color: #f9f9f9; font-family: monospace;">
<div style='margin-bottom: 15px;'>
<strong>System Message 1:</strong>
<pre style='margin: 5px 0; padding: 10px; background-color: #ffffff; border-left: 3px solid #007acc; white-space: pre-wrap; word-wrap: break-word;'>You are a SQL expert. Your task is to convert natural language questions into accurate SQL queries using the given database schema and instructions.</pre>
</div>
<div style='margin-bottom: 15px;'>
<strong>User Message 2:</strong>
<pre style='margin: 5px 0; padding: 10px; background-color: #ffffff; border-left: 3px solid #007acc; white-space: pre-wrap; word-wrap: break-word;'>Your task is to convert a natural language question into an accurate SQL query using the given sqlite database schema.

**Question:**:
Please list the product description of the products consumed in September, 2013.

**Database Engine / Dialect:**:
sqlite

**Schema:**
Table: customers
  Columns:
    - CustomerID (INTEGER) (Primary Key) # Example values: 3, 5, 6, 7, 9
    - Segment (TEXT) # Example values: SME, LAM, KAM
    - Currency (TEXT) # Example values: EUR, CZK

Table: gasstations
  Columns:
    - GasStationID (INTEGER) (Primary Key) # Example values: 44, 45, 46, 47, 48
    - ChainID (INTEGER) # Example values: 13, 6, 23, 33, 4
    - Country (TEXT) # Example values: CZE, SVK
    - Segment (TEXT) # Example values: Value for money, Premium, Other, Noname, Discount

Table: products
  Columns:
    - ProductID (INTEGER) (Primary Key) # Example values: 1, 2, 3, 4, 5
    - Description (TEXT) # Example values: Rucní zadání, Nafta, Special, Super, Natural

Table: transactions_1k
  Columns:
    - TransactionID (INTEGER) (Primary Key) # Example values: 1, 2, 3, 4, 5
    - Date (DATE) # Example values: 2012-08-24, 2012-08-23, 2012-08-25, 2012-08-26
    - Time (TEXT) # Example values: 09:41:00, 10:03:00, 13:53:00, 08:49:00, 08:53:00
    - CustomerID (INTEGER) # Example values: 31543, 46707, 7654, 17373, 7881
    - CardID (INTEGER) # Example values: 486621, 550134, 684220, 536109, 99745
    - GasStationID (INTEGER) # Example values: 3704, 656, 741, 1152, 636
    - ProductID (INTEGER) # Example values: 2, 23, 5, 11, 7
    - Amount (INTEGER) # Example values: 28, 18, 1, 5, 4
    - Price (REAL) # Example values: 672.64, 430.72, 121.99, 120.74, 645.05

Table: sqlite_sequence
  Columns:
    - name (TEXT) # Example values: transactions_1k
    - seq (TEXT) # Example values: 1000

Table: yearmonth
  Columns:
    - CustomerID (INTEGER) (Primary Key) # Example values: 39, 63, 172, 603, 1492
    - Date (TEXT) (Primary Key) # Example values: 201112, 201201, 201202, 201203, 201204
    - Consumption (REAL) # Example values: 528.3, 1598.28, 1931.36, 1497.14, 51.06


**Instructions:**
- Only use columns listed in the schema.
- Do not use any other columns or tables not mentioned in the schema.
- Ensure the SQL query is valid and executable.
- Use proper SQL syntax and conventions.
- Generate a complete SQL query that answers the question.
- Use the correct SQL dialect for the database, i.e., sqlite.
- Do not include any explanations or comments in the SQL output.
- Your output must start with ```sql and end with ```.

Question: Please list the product description of the products consumed in September, 2013.</pre>
</div>
</div>


</details>


**Response:** `SELECT DISTINCT p.Description
FROM products p
JOIN transactions_1k t ON p.ProductID = t.ProductID
WHERE t.Date >= '2013-09-01' AND t.Date < '2013-10-01'`


**Parsed SQL:** 
```sql
SELECT DISTINCT p.Description
FROM products p
JOIN transactions_1k t ON p.ProductID = t.ProductID
WHERE t.Date >= '2013-09-01' AND t.Date < '2013-10-01'
```



**Total Attempts:** 1

### 🤖 LLM Judge Assessment
LLM judge score: `0.0`


LLM judge explanation (if applicable):


<pre>No

The predicted SQL query does not match the ground truth SQL query. Although the predicted query attempts to filter transactions by date, it does so in a manner that is inconsistent with the ground truth query. The ground truth query joins the `transactions_1k` table with the `yearmonth` table on the `CustomerID` column and filters the results to include only transactions where the `Date` column in the `yearmonth` table is `&#x27;201309&#x27;`. In contrast, the predicted query filters the `transactions_1k` table directly based on the `Date` column, which may not accurately reflect the desired date range.

Furthermore, the predicted result is an empty dataframe, which suggests that the query did not return any matching rows. This is likely due to the fact that the `Date` column in the `transactions_1k` table may not contain dates in the format `&#x27;2013-09-01&#x27;` or `&#x27;2013-10-01&#x27;`, or that the dates in the table do not fall within the specified range.

In contrast, the ground truth result contains a list of product descriptions, which suggests that the ground truth query was able to successfully retrieve the desired data. Therefore, based on the differences between the predicted and ground truth queries, as well as the empty predicted result, I conclude that the predicted SQL query is incorrect. 

Note: The empty predicted result could also be due to the data in the `transactions_1k` table, but without more information about the data, it&#x27;s impossible to say for certain. However, given the differences between the predicted and ground truth queries, it&#x27;s likely that the predicted query is the cause of the empty result. 

The best answer is No.</pre>
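
The judge leaves open whether the empty result comes from the date format or from the data itself. One way to settle that kind of question is to look at the actual range of the filtered column. The sketch below is a minimal illustration using Python's standard `sqlite3` module. `column_range` is a hypothetical helper, not part of `text2sql_eval_toolkit`, and the in-memory rows reuse the example values from the schema above so the block runs on its own. Against the real benchmark you would connect to the SQLite file configured for this database instead.

```python
import sqlite3


def column_range(conn, table, column):
    """Return (min, max, row_count) for one column of one table.

    Hypothetical helper. Identifiers are interpolated into the SQL text,
    so only pass trusted table and column names.
    """
    sql = f'SELECT MIN("{column}"), MAX("{column}"), COUNT(*) FROM "{table}"'
    return conn.execute(sql).fetchone()


# Invented rows, using the example values listed in the schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions_1k (TransactionID INTEGER, Date DATE)")
conn.executemany(
    "INSERT INTO transactions_1k VALUES (?, ?)",
    [(1, "2012-08-24"), (2, "2012-08-23"), (3, "2012-08-26")],
)

low, high, n_rows = column_range(conn, "transactions_1k", "Date")
print(low, high, n_rows)

# ISO-formatted dates compare correctly as strings, so the half-open
# filter [2013-09-01, 2013-10-01) can only match if it overlaps [low, high].
overlaps = low < "2013-10-01" and high >= "2013-09-01"
print("September 2013 overlaps the stored dates:", overlaps)
```

If the stored range does not overlap the filter, the empty result is explained by the data rather than by the date format, which is a cue to check whether another table (here `yearmonth`) covers the requested period.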

## Digging deeper with an LLM as a judge

In [11]:
failed_examples[0]

{'question_id': 1480,
 'db_id': 'debit_card_specializing',
 'question': 'What was the gas consumption peak month for SME customers in 2013?',
 'evidence': 'Year 2013 can be presented as Between 201301 And 201312; The first 4 strings of the Date values in the yearmonth table can represent year; The 5th and 6th string of the date can refer to month.',
 'SQL': "SELECT SUBSTR(T2.Date, 5, 2) FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID = T2.CustomerID WHERE SUBSTR(T2.Date, 1, 4) = '2013' AND T1.Segment = 'SME' GROUP BY SUBSTR(T2.Date, 5, 2) ORDER BY SUM(T2.Consumption) DESC LIMIT 1",
 'difficulty': 'moderate',
 'sql': ["SELECT SUBSTR(T2.Date, 5, 2) FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID = T2.CustomerID WHERE SUBSTR(T2.Date, 1, 4) = '2013' AND T1.Segment = 'SME' GROUP BY SUBSTR(T2.Date, 5, 2) ORDER BY SUM(T2.Consumption) DESC LIMIT 1"],
 'meta': {'features': {'query_table_count': 2,
   'query_column_count': 7,
   'query_nested_count': 0,
   'quer

In [12]:
# If the evaluation was run without use_llm, or you need additional judgments, you can call the LLM judge directly as shown here.

from text2sql_eval_toolkit.inference.inference_tools import WXAIClient
from text2sql_eval_toolkit.evaluation.llm_as_judge import (
    evaluate_sql_prediction_with_llm,
    load_llm_judge_config,
)

prediction_pipeline_id = "wxai:openai/gpt-oss-120b-greedy-zero-shot-chatapi"
failed_example = failed_examples[0]
question = (
    failed_example["page_content"]
    if "page_content" in failed_example
    else failed_example["question"]
    if "question" in failed_example
    else failed_example["utterance"]
)
ground_truth_sql = failed_example["sql"]
ground_truth_df = failed_example["gt_df"]
predicted_sql = failed_example["predictions"][prediction_pipeline_id]["predicted_sql"]
predicted_df = failed_example["predictions"][prediction_pipeline_id]["predicted_df"]
prompt = failed_example["predictions"][prediction_pipeline_id]["prompt"]
llm_judge_config = load_llm_judge_config()
llm_as_judge_response = evaluate_sql_prediction_with_llm(
    question,
    ground_truth_sql,
    ground_truth_df,
    predicted_sql,
    predicted_df,
    prompt,
    llm_judge_config,
)
llm_as_judge_response

{'verdict': 'Yes',
 'score': 1.0,
 'explanation': 'Yes\n\nThe predicted SQL query is correct. It joins the `yearmonth` table with the `customers` table on the `CustomerID` column, filters the results to only include rows where the `Date` is between \'201301\' and \'201312\' (i.e., the year 2013) and the `Segment` is \'SME\'. The results are then grouped by the `Date` column, ordered by the sum of the `Consumption` column in descending order, and limited to the top row (i.e., the peak month).\n\nThe predicted result, [201304], indicates that the peak month for SME customers in 2013 was April (04). This is a reasonable interpretation of the question, as the `Date` column in the `yearmonth` table appears to be in the format \'YYYYMM\'.\n\nThe ground truth SQL query uses a different approach, extracting the month from the `Date` column using the `SUBSTR` function and grouping by the resulting month. However, the predicted SQL query is more straightforward and easier to understand, and it p

In [13]:
import pandas as pd
from tqdm import tqdm

# Use the pipeline_id from the current analysis
prediction_pipeline_id = pipeline_id

evaluation_results = []

for idx, failed_example in tqdm(enumerate(failed_examples), total=len(failed_examples)):
    question = (
        failed_example.get("page_content")
        or failed_example.get("question")
        or failed_example.get("utterance")
    )
    
    # Use safe access with .get() to handle missing fields
    ground_truth_sql = (
        failed_example.get("sql")
        or failed_example.get("SQL")
        or failed_example.get("metadata", {}).get("sql", [])
    )
    ground_truth_df = failed_example.get("gt_df", "")
    
    # Check if pipeline_id exists in predictions
    if prediction_pipeline_id not in failed_example.get("predictions", {}):
        print(f"Skipping example {idx}: pipeline_id not found")
        continue
    
    pred = failed_example["predictions"][prediction_pipeline_id]
    predicted_sql = pred.get("predicted_sql", "")
    predicted_df = pred.get("predicted_df", "")
    prompt = pred.get("prompt", "")
    
    # Skip if predicted_df is missing (execution error)
    if not predicted_df:
        print(f"Skipping example {idx}: no predicted_df (likely execution error)")
        continue

    eval_response = evaluate_sql_prediction_with_llm(
        question,
        ground_truth_sql,
        ground_truth_df,
        predicted_sql,
        predicted_df,
        prompt,
        llm_judge_config,
    )

    evaluation_results.append(
        {
            "Index": idx,
            "Verdict": eval_response.get("verdict"),
            "Score": eval_response.get("score"),
            "Explanation": eval_response.get("explanation"),
        }
    )

results_df = pd.DataFrame(evaluation_results)
results_df[["Index", "Verdict", "Score", "Explanation"]]


100%|██████████| 20/20 [02:46<00:00,  8.33s/it]


|   Index | Verdict   |   Score | Explanation                                        |
|--------:|:----------|--------:|:---------------------------------------------------|
|       0 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is a reasonable...  |
|       1 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. Alt...  |
|       2 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. Alt...  |
|       3 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. It ...  |
|       4 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. Alt...  |
|       5 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. It ...  |
|       6 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. Alt...  |
|       7 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is a reasonable...  |
|       8 | No        |     0.0 | No\n\nThe predicted SQL is incorrect because i...  |
|       9 | Yes       |     1.0 | Yes\n\nThe predicted SQL query is correct. It ...  |
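
With one verdict per failed example, a quick tally shows how often the judge accepts a prediction that the chosen execution metric rejected. The sketch below assumes records shaped like the `evaluation_results` entries built in the loop above (`Index`, `Verdict`, `Score`). The sample rows are invented, and `summarize_verdicts` is a hypothetical helper, not a toolkit function.

```python
from collections import Counter


def summarize_verdicts(results):
    """Tally verdicts and average the scores of a list of judge results."""
    verdict_counts = Counter(r["Verdict"] for r in results)
    scores = [r["Score"] for r in results if r["Score"] is not None]
    mean_score = sum(scores) / len(scores) if scores else None
    # Examples the judge also rejects are the likeliest genuine failures.
    rejected = [r["Index"] for r in results if r["Verdict"] == "No"]
    return verdict_counts, mean_score, rejected


# Invented sample in the same shape as evaluation_results.
sample_results = [
    {"Index": 0, "Verdict": "Yes", "Score": 1.0},
    {"Index": 1, "Verdict": "Yes", "Score": 1.0},
    {"Index": 2, "Verdict": "No", "Score": 0.0},
    {"Index": 3, "Verdict": "Yes", "Score": 1.0},
]
counts, mean_score, rejected = summarize_verdicts(sample_results)
print(dict(counts), mean_score, rejected)
```

A high share of "Yes" verdicts among metric failures suggests the metric is stricter than the question requires, so the indices in `rejected` are a reasonable place to start manual review.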


In [14]:
index_to_investigate = 1
show_failed_example(pipeline_id, failed_examples, index_to_investigate)
# Note: LLM judge explanation is now displayed within show_failed_example() if available

### ❓ Failed Question #1 (out of 20) - Question ID: 1498
Question: What is the highest monthly consumption in the year 2012?

### ✅ Ground Truth SQL(s)

```sql
SELECT SUM(Consumption) FROM yearmonth WHERE SUBSTR(Date, 1, 4) = '2012' GROUP BY SUBSTR(Date, 5, 2) ORDER BY SUM(Consumption) DESC LIMIT 1
```

### ❌ Predicted SQL

```sql
SELECT MAX(month_total) AS highest_monthly_consumption
FROM (
    SELECT SUM(Consumption) AS month_total
    FROM yearmonth
    WHERE Date LIKE '2012%'
    GROUP BY Date
)
```

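### 📊 Evaluation Metrics

|   execution_accuracy |   non_empty_execution_accuracy |   subset_non_empty_execution_accuracy |   logic_execution_accuracy |   bird_execution_accuracy |   is_sqlglot_parsable |   is_sqlparse_parsable |   sqlglot_equivalence |   sqlglot_optimized_equivalence |   sqlparse_equivalence |   sql_exact_match |   sql_syntactic_equivalence |   eval_error |   df_error |   prompt_tokens |   completion_tokens |   total_tokens |   inference_time_ms |   llm_score |
|---------------------:|-------------------------------:|--------------------------------------:|---------------------------:|--------------------------:|----------------------:|-----------------------:|----------------------:|--------------------------------:|-----------------------:|------------------:|----------------------------:|-------------:|-----------:|----------------:|--------------------:|---------------:|--------------------:|------------:|
|                    1 |                              1 |                                     1 |                          1 |                         1 |                     1 |                      1 |                     0 |                               0 |                      0 |                 0 |                           0 |            0 |          0 |             952 |                 390 |           1342 |             4570.91 |         1.0 |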


### 📘 Ground Truth Result(s)

**Result 1:**

|   SUM(Consumption) |
|-------------------:|
|        51787161.74 |


### 📕 Predicted Result

|   highest_monthly_consumption |
|------------------------------:|
|                   51787161.74 |
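
The two results above differ only in the column name (`SUM(Consumption)` versus `highest_monthly_consumption`), so a comparison based on values alone treats them as equal. The sketch below illustrates that idea with two hypothetical helpers. It is not the toolkit's implementation of `execution_accuracy` or of the subset metric, whose exact rules (ordering, duplicates, float tolerance) are not shown in this notebook.

```python
from collections import Counter


def same_rows(gt_rows, pred_rows):
    """Multiset comparison: ignores column names and row order, keeps duplicates."""
    return Counter(map(tuple, gt_rows)) == Counter(map(tuple, pred_rows))


def same_distinct_rows(gt_rows, pred_rows):
    """Set comparison: additionally ignores how often each row appears."""
    return set(map(tuple, gt_rows)) == set(map(tuple, pred_rows))


# Single-value results like the ones above: equal under both comparisons.
gt = [(51787161.74,)]
pred = [(51787161.74,)]
print(same_rows(gt, pred), same_distinct_rows(gt, pred))

# A DISTINCT prediction against a ground truth with repeated rows
# (invented values): equal as sets, different as multisets.
gt_dupes = [("Nafta",), ("Nafta",), ("Natural",)]
pred_distinct = [("Natural",), ("Nafta",)]
print(same_rows(gt_dupes, pred_distinct), same_distinct_rows(gt_dupes, pred_distinct))
```

Whether duplicates should count is a design choice of the metric. The multiset version penalizes an added or missing `DISTINCT`, while the set version does not.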


### 🤖 Agent Interaction Trace


**Step 1: generate_sql_attempt_1**


<details><summary>📝 Messages</summary>


**System:**
```
You are a SQL expert. Your task is to convert natural language questions into accurate SQL queries using the given database schema and instructions.
```


**User:**
```
Your task is to convert a natural language question into an accurate SQL query using the given sqlite database schema.

**Question:**:
What is the highest monthly consumption in the year 2012?

**Database Engine / Dialect:**:
sqlite

**Schema:**
Table: customers
  Columns:
    - CustomerID (INTEGER) (Primary Key) # Example values: 3, 5, 6, 7, 9
    - Segment (TEXT) # Example values: SME, LAM, KAM
    - Currency (TEXT) # Example values: EUR, CZK

Table: gasstations
  Columns:
    - GasStationID (
…
e columns listed in the schema.
- Do not use any other columns or tables not mentioned in the schema.
- Ensure the SQL query is valid and executable.
- Use proper SQL syntax and conventions.
- Generate a complete SQL query that answers the question.
- Use the correct SQL dialect for the database, i.e., sqlite.
- Do not include any explanations or comments in the SQL output.
- Your output must start with ```sql and end with ```.

Question: What is the highest monthly consumption in the year 2012?
```


</details>


**Response:** `SELECT MAX(month_total) AS highest_monthly_consumption
FROM (
    SELECT SUM(Consumption) AS month_total
    FROM yearmonth
    WHERE Date LIKE '2012%'
    GROUP BY Date
)`


**Parsed SQL:** 
```sql
SELECT MAX(month_total) AS highest_monthly_consumption
FROM (
    SELECT SUM(Consumption) AS month_total
    FROM yearmonth
    WHERE Date LIKE '2012%'
    GROUP BY Date
)
```



**Total Attempts:** 1

### 🤖 LLM Judge Assessment
LLM judge score: `1.0`


LLM judge explanation (if applicable):


```
N/A (did not use LLM due to subset match)
```