In [1]:
!pip install -U openai sqlalchemy psycopg2 pandas tabulate


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:
from llm_handler import LLMClient, LLMConversation

API_KEY = None

# Test LLMClient

## Simple call

In [2]:
client = LLMClient(api_key=API_KEY)

response = client.generate_response(
    prompt="Explain quantum tunneling in simple terms."
)

print(response)

2025-11-25 14:32:19,969 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Quantum tunneling is when a particle can pass through a barrier that it wouldn’t be able to cross in everyday (classical) thinking.

- The key idea: In quantum mechanics, particles behave like waves. The particle’s wave can extend a little bit into the barrier.
- Even if the particle’s energy is less than the barrier, the wave inside the barrier isn’t zero. It dies away, but a tiny part of it leaks to the other side.
- If the barrier isn’t too tall or too wide (and the particle isn’t too heavy), there’s a real, nonzero chance that the particle shows up on the far side. That’s tunneling.

What controls the chance:
- Barrier height and width: higher or thicker barriers make tunneling less likely.
- Particle mass: heavier particles tunnel less easily.
- Particle energy: closer to the barrier height increases tunneling probability.

A simple intuition: imagine a wave hitting a wall. Part of the wave reflects back, but part can “tunnel” through the wall and appear on the other side, even th

## Using reasoning_effort, max_tokens, verbosity

In [2]:
client = LLMClient(api_key=API_KEY)

response = client.generate_response(
    prompt="Explain quantum tunneling in simple terms.",
    max_tokens=500,
    reasoning_effort='minimal',
    verbosity='low'
)

print(response)

2025-11-25 14:54:22,198 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Quantum tunneling is like a particle trying to go through a wall it doesn’t have enough energy to pass over. In the weird world of quantum physics, particles don’t act like tiny balls with definite paths. Instead, they behave like waves that spread out.

- If there’s a barrier (a wall) with some height, classically the particle shouldn’t get through if its energy is less than the barrier.
- But the particle’s wave can extend on the other side of the barrier.
- Part of the wave “leaks” through the barrier and emerges on the other side, which means there’s a chance the particle appears there even though it didn’t have enough energy to climb over.

That chance is called tunneling. It explains phenomena like radioactive particles escaping a nucleus, certain electronic devices like tunnel diodes, and fusion in stars. In short: particles can sometimes appear on the other side of a barrier because their quantum wave can pass through it, not because they magically gain enough energy.


## With conversations

In [15]:
client = LLMClient(api_key=API_KEY)

conv = LLMConversation("You are a helpful assistant specialized in health science.")

conv.add_message("user", "What is creatine and how does it work?")
conv.add_message("assistant", "Creatine is a compound that helps regenerate ATP...")
conv.add_message("user", "Should athletes take it every day?")

response = client.generate_response_for_conversation(
    conversation=conv,
    reasoning_effort="minimal",
    max_tokens=1000
)

print(response)

2025-11-24 07:50:09,847 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Response(id='resp_006acfc19250e0a10069240e22e3348199a9c01467a1414a52', created_at=1763970594.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5-nano-2025-08-07', object='response', output=[ResponseReasoningItem(id='rs_006acfc19250e0a10069240e232b148199b1ec07a25723f81f', summary=[], type='reasoning', content=None, encrypted_content=None, status=None), ResponseOutputMessage(id='msg_006acfc19250e0a10069240e2e804481998315dafb7926e2f1', content=[ResponseOutputText(annotations=[], text='Yes. For most athletes, taking creatine every day is recommended to keep muscle creatine stores elevated and support ongoing performance gains. You don’t need to take it only on training days; daily dosing helps maintain saturation.\n\nPractical dosing approaches\n\n- Loading + maintenance (faster saturation):\n  - Loading: about 20 g per day, split into 4 doses, for 5–7 days.\n  - Maintenance: 3–5 g per day after that.\n  - Pros: reaches full effect quickly (within a week or

In [16]:
# Store the reply text rather than the whole Response object printed above.
conv.add_message("assistant", response.output_text)
conv.add_message("user", "Is it safe long-term?")

follow_up = client.generate_response_for_conversation(conv)
print(follow_up)

2025-11-24 07:52:20,187 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Response(id='resp_0b52c9732030b6c10069240ea6982c819496fe611f1b64b459', created_at=1763970726.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5-nano-2025-08-07', object='response', output=[ResponseReasoningItem(id='rs_0b52c9732030b6c10069240ea71e288194ab96ebd3dde8a439', summary=[], type='reasoning', content=None, encrypted_content=None, status=None), ResponseOutputMessage(id='msg_0b52c9732030b6c10069240eb07db081949ed46e076420f7bc', content=[ResponseOutputText(annotations=[], text='Short answer: Yes, for healthy people, creatine monohydrate is generally safe for long-term use when taken at typical doses. Most research shows no meaningful harm to kidneys, liver, or other organs over multiple years.\n\nWhat the evidence suggests\n- Population: healthy adults (athletes and active individuals).\n- What’s been studied: up to several years of daily creatine at 3–5 g; occasional short-term loading (e.g., 20 g/day for a week) is also studied.\n- Findings: no co
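
`LLMConversation` comes from `llm_handler`, whose source is not part of this notebook. Judging only by the calls above (a system prompt in the constructor, then `add_message(role, content)`), a conversation holder can be as small as the sketch below. `SimpleConversation` and `to_input` are hypothetical names used for illustration, not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleConversation:
    """Hypothetical stand-in: a system prompt plus an ordered message list."""

    system_prompt: str
    messages: List[Dict[str, str]] = field(default_factory=list)

    def add_message(self, role: str, content: str) -> None:
        # Only the two roles used in this notebook are accepted.
        if role not in {"user", "assistant"}:
            raise ValueError(f"Unsupported role: {role!r}")
        self.messages.append({"role": role, "content": content})

    def to_input(self) -> List[Dict[str, str]]:
        # The system prompt always comes first, followed by the turns in order.
        return [{"role": "system", "content": self.system_prompt}, *self.messages]


demo = SimpleConversation("You are a helpful assistant specialized in health science.")
demo.add_message("user", "What is creatine and how does it work?")
demo.add_message("assistant", "Creatine is a compound that helps regenerate ATP...")
payload = demo.to_input()
```

Keeping every turn as plain text is what makes the follow-up call possible: the whole list is sent again each time.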

## Many responses in parallel

In [18]:
client = LLMClient(api_key=API_KEY)

prompts = [
    "Write a short poem about AI.",
    "Summarize the plot of Inception.",
    "Explain recursion to a child.",
]

batch_responses = client.generate_multiple_responses(
    list_of_prompts=prompts,
    reasoning_effort="minimal",
    max_tokens=200
)

for i, r in enumerate(batch_responses):
    print(f"Response {i+1}: {r}\n")

2025-11-24 07:53:47,332 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Response(id='resp_06e5fbf8372b92db0069240f09df98819b91fcd3b9f46eef5f', created_at=1763970825.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5-nano-2025-08-07', object='response', output=[ResponseReasoningItem(id='rs_06e5fbf8372b92db0069240f0a2d24819b8e41267539a4c2fb', summary=[], type='reasoning', content=None, encrypted_content=None, status=None), ResponseOutputMessage(id='msg_06e5fbf8372b92db0069240f0a5510819bb604259f3b86056f', content=[ResponseOutputText(annotations=[], text='Whirring thoughts in chrome-lit dawn,\npatterns bloom where circuits yawn.\nA whisper learned from countless streams,\ncrafting truths from borrowed dreams.\n\nNot heart, but rhythm, bright and clear,\nlogic singing, near and near.\nWe shape the dawn with careful grace,\na mirror held to humankind—our face.', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p

2025-11-24 07:53:48,056 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Response(id='resp_0c7fdbf750e68ded0069240f09e3bc819ba92a976c1036c059', created_at=1763970825.0, error=None, incomplete_details=IncompleteDetails(reason='max_output_tokens'), instructions=None, metadata={}, model='gpt-5-nano-2025-08-07', object='response', output=[ResponseReasoningItem(id='rs_0c7fdbf750e68ded0069240f0a2bd4819b8d84054d03957ea2', summary=[], type='reasoning', content=None, encrypted_content=None, status=None), ResponseOutputMessage(id='msg_0c7fdbf750e68ded0069240f0a4ed0819b96c45264a199e823', content=[ResponseOutputText(annotations=[], text='Sure! Here’s a simple way to think about it kid-friendly.\n\nWhat is recursion?\n- Recursion is when a problem tells you to solve a smaller version of the same problem, and you keep doing that until it’s easy to solve.\n\nExample: stacking cups to reach the top\n- Imagine you want to stack 5 cups, but you can only place one cup at a time. \n- First, you tell yourself: “I’ll stack 4 cups, and I’ll figure out how to do that the same way.

2025-11-24 07:53:48,460 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Response(id='resp_08ab141c965051730069240f0a3f448199a1a0cf48b0559085', created_at=1763970826.0, error=None, incomplete_details=IncompleteDetails(reason='max_output_tokens'), instructions=None, metadata={}, model='gpt-5-nano-2025-08-07', object='response', output=[ResponseReasoningItem(id='rs_08ab141c965051730069240f0aa6208199a01831ccbbceee20', summary=[], type='reasoning', content=None, encrypted_content=None, status=None), ResponseOutputMessage(id='msg_08ab141c965051730069240f0ac82481999b1fd71597518dc9', content=[ResponseOutputText(annotations=[], text='Inception follows Dom Cobb, a skilled thief who specializes in "extraction"—stealing secrets from inside people\'s dreams. After a mission ends disastrously, Cobb is offered a chance at redemption: perform "inception"—planting an idea in someone’s mind—rather than stealing one. This is considered nearly impossible.\n\nTo complete the job, Cobb assembles a team: Arthur (coordinator), Ariadne (a young architect who designs dreamscapes), 
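
How `generate_multiple_responses` schedules its requests is not shown in this notebook. One common way to fan out independent prompts is a thread pool, sketched below under that assumption. `generate_in_parallel` and `fake_generate` are hypothetical names, and the stand-in function replaces the real API call so the example runs without a key. `executor.map` yields results in prompt order even when later requests finish first.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_in_parallel(
    generate: Callable[[str], str],
    prompts: List[str],
    max_workers: int = 4,
) -> List[str]:
    """Call `generate` once per prompt on a thread pool, keeping prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns results in the order of `prompts`, not completion order.
        return list(executor.map(generate, prompts))


# Hypothetical stand-in for a model call; the first prompt is the slowest.
DELAYS = {"a": 0.03, "b": 0.02, "c": 0.0}


def fake_generate(prompt: str) -> str:
    time.sleep(DELAYS[prompt])
    return f"echo: {prompt}"


results = generate_in_parallel(fake_generate, ["a", "b", "c"])
```

Threads suit this workload because each call spends most of its time waiting on the network.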

## Data agent

In [66]:
import psycopg2
import pandas as pd

user = "padel_user_local"
password = "localpassword"
host = "localhost"
port = "5432"
database = "padel_league_local"

class SQLClient:
    def execute(self, sql_query):
        conn = psycopg2.connect(
            dbname=database, user=user, password=password, host=host, port=port
        )
        try:
            return pd.read_sql_query(sql_query, conn)
        finally:
            conn.close()
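
`SQLClient.execute` runs whatever string it is given, and in this notebook that string comes from an LLM. A coarse pre-check can reject anything that is not a single read-only statement before it reaches the database. The sketch below is a heuristic, not a security boundary: `is_read_only` and its keyword list are assumptions made for illustration, and a database role limited to SELECT privileges is the more robust safeguard.

```python
import re

# Strip "-- ..." line comments before inspecting the statement.
_LINE_COMMENT = re.compile(r"--[^\n]*")

# Assumed deny-list of keywords that modify data, schema, or privileges.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Return True if `sql` looks like a single SELECT or WITH statement."""
    statement = _LINE_COMMENT.sub("", sql).strip().rstrip(";").strip()
    if not statement or ";" in statement:
        # Empty input, or more than one statement.
        return False
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        return False
    # Conservative: a forbidden keyword inside a string literal is also rejected.
    return _FORBIDDEN.search(statement) is None
```

The check errs on the side of rejecting: a valid query that contains a semicolon or one of the listed words inside a string literal is refused.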

In [119]:
import re
from pandas.errors import DatabaseError

class Agent:
    def __init_subclass__(cls):
        super().__init_subclass__()
        if not hasattr(cls, "name") or not hasattr(cls, "description"):
            raise TypeError(f"Class '{cls.__name__}' must define 'name' and 'description'.")

    def run(self, **kwargs):
        raise NotImplementedError

class DataAgent(Agent):
    
    name = 'DataAgent'
    description = """
    Retrieves structured data from the SQL database. 
    Takes a fully-formed question about padel, the league, 
    divisions, matches, schedules, rankings, or players.
    """
    
    def __init__(self, llm_client, sql_client):
        self.llm_client = llm_client
        self.sql_client = sql_client
        
        self.schema = """
        ## Table leagues
        leagues (
        id BIGINT PRIMARY KEY,
        name TEXT
        )

        ## Table editions
        editions (
        id BIGINT PRIMARY KEY,
        name TEXT,
        league_id BIGINT REFERENCES leagues(id)
        )

        ## Table divisions
        divisions (
        id BIGINT PRIMARY KEY,
        name TEXT,
        beginning_datetime TIMESTAMP,
        rating BIGINT,
        end_date TIMESTAMP,
        logo_image_path TEXT,
        large_picture_path TEXT,
        has_ended BOOLEAN,
        open_division BOOLEAN,
        edition_id BIGINT REFERENCES editions(id),
        logo_image_id INTEGER,
        large_picture_id INTEGER
        )

        ## Table players
        players (
        id BIGINT PRIMARY KEY,
        name TEXT,
        full_name TEXT,
        birthday TIMESTAMP,
        picture_path TEXT,
        large_picture_path TEXT,
        ranking_points BIGINT,
        ranking_position BIGINT,
        height DOUBLE PRECISION,
        prefered_hand TEXT,
        prefered_position TEXT
        )

        ## Table players_in_division
        players_in_division (
        id BIGINT PRIMARY KEY,
        player_id BIGINT REFERENCES players(id),
        division_id BIGINT REFERENCES divisions(id),
        place BIGINT,
        points DOUBLE PRECISION,
        appearances BIGINT,
        percentage_of_appearances DOUBLE PRECISION,
        wins BIGINT,
        draws BIGINT,
        losts BIGINT,
        games_won BIGINT,
        games_lost BIGINT,
        matchweek BIGINT
        )

        ## Table matches
        matches (
        id BIGINT PRIMARY KEY,
        games_home_team BIGINT,
        games_away_team BIGINT,
        date_hour TIMESTAMP,
        winner BIGINT,
        matchweek BIGINT,
        field TEXT,
        played BOOLEAN,
        division_id BIGINT REFERENCES divisions(id)
        )

        ## Table players_in_match
        players_in_match (
        id BIGINT PRIMARY KEY,
        player_id BIGINT REFERENCES players(id),
        match_id BIGINT REFERENCES matches(id),
        team TEXT
        )
        """
        
    def ask_llm_client(self, prompt):
        return self.llm_client.generate_response(
            prompt=prompt,
            reasoning_effort='minimal',
            verbosity='low'
        )
        
    def extract_sql_block(self, text: str) -> str:
        """
        Extracts the SQL inside ```sql ... ``` from an LLM response.
        Raises a clean exception if no SQL is found.
        """
        match = re.search(r"```sql\s*(.*?)\s*```", text, re.DOTALL)
        if not match:
            raise ValueError("No SQL code block found in LLM output.")
        return match.group(1).strip()

    def repair_sql(self, question, faulty_sql, db_error):
        """
        Sends the schema, faulty SQL and error to the LLM
        and asks for a corrected SQL-only output.
        """
        repair_prompt = f"""
            You are an expert PostgreSQL fixer.

            You will receive:
            - The database schema
            - A natural-language question
            - The faulty SQL the previous LLM generated
            - The exact DatabaseError message

            Your task:
            - Analyze the faulty SQL
            - Identify the mistake
            - Output a corrected SQL query
            - Follow *all* the semantic rules in the schema description

            Return ONLY a ```sql ... ``` block.

            ---
            # Schema
            {self.schema}

            ---
            # Question
            {question}

            ---
            # Faulty SQL
            ```sql
            {faulty_sql}
            ```

            ---
            # Database Error
            {db_error}

            Output the corrected SQL only, inside a single sql code block.
        """

        llm_answer = self.llm_client.generate_response(
            model="gpt-5.1",
            prompt=repair_prompt,
            reasoning_effort="medium",
            verbosity="low"
        )
        return self.extract_sql_block(llm_answer)

    def prompt(self, user_question):
        return f"""
        You are an expert in SQL that writes accurate, safe PostgreSQL SQL queries for a database about a friendly padel league.
        Your job is to read a natural-language question and output only SQL, respecting the schema and all semantic rules.

        ---

        # Database Schema

        {self.schema}

        ---

        # Crucial Semantic Rules (INSIGHTS)

        1. Divisions are current when has_ended = FALSE.
        2. Current edition = edition where all its divisions have has_ended = FALSE.
        3. Closed edition = all divisions have has_ended = TRUE.
        4. When asked about a “Division”, default to the division with that rating in the current edition.
        5. “Última edição” usually means last closed edition.
        6. Divisions identified by rating, not name:  
        2000 → D1, 1000 → D2, 500 → D3, 250 → D4, 125 → D5.
        7. For place/points/faltas questions → current edition.
        8. For historical counts → all editions.
        9. matches.winner stores **1 for home team win**, **-1 for away team win**, **0 for tie** — NOT a player_id.
        10. To get the winning **players**, match the team logic:  
            - If winner = 1 → winners are players where LOWER(team) = 'home'  
            - If winner = -1 → winners are players where LOWER(team) = 'away'
        11. Do NOT compare TEXT with BIGINT.
        12. players_in_match shows who played each match.
        13. team is literal text 'home' or 'away'.
        14. Improvements only between **closed editions**.
        15. melhoria = place_prev - place_curr.
        16. Include edition names and division ratings when comparing improvements.
        17. Always use explicit joins via IDs.
        18. Use correct PostgreSQL types consistently.
        19. Avoid invalid or non-existent column names.
        20. Follow strict PostgreSQL GROUP BY rules.

        ---

        # Interpreting ambiguous questions
        - “Divisão X” → current edition’s division.
        - “Última edição” → last closed edition.
        - Historical questions → all editions.
        - “Quantas vezes jogou a 1ª divisão?” → count all divisions rating = 2000.
        ---

        # Final Instructions
        - Output pure SQL only.
        - Always apply insights.
        - Prefer CTEs for complex logic.

        Example:

        Input question: 

        "Quantas vitórias tem cada jogador da Divisão 3?"

        Output:

        ```sql
        WITH div3 AS (
            SELECT id
            FROM divisions
            WHERE rating = 500
            AND has_ended = FALSE
            ORDER BY id DESC
            LIMIT 1
        ),
        vitorias AS (
            SELECT 
                pim.player_id,
                COUNT(*) AS vitorias
            FROM matches m
            JOIN players_in_match pim ON pim.match_id = m.id
            WHERE m.division_id = (SELECT id FROM div3)
            AND m.played = TRUE
            AND (
                    (m.winner = 1  AND LOWER(pim.team) = 'home')
                OR  (m.winner = -1 AND LOWER(pim.team) = 'away')
            )
            GROUP BY pim.player_id
        )
        SELECT 
            p.name,
            COALESCE(v.vitorias, 0) AS vitorias
        FROM players_in_division pid
        JOIN players p ON p.id = pid.player_id
        LEFT JOIN vitorias v ON v.player_id = p.id
        WHERE pid.division_id = (SELECT id FROM div3)
        ORDER BY vitorias DESC, p.name;
        ```

        Write a valid PostgreSQL query that answers the following question.
        Return **only** a fenced markdown code block. No extra explanation.

        Question: {user_question}
        Answer:
        """

    def run(self, question, max_retries=1):
        llm_answer = self.ask_llm_client(self.prompt(question))
        sql = self.extract_sql_block(llm_answer)
        try:
            rows = self.sql_client.execute(sql)
            return {"sql": sql, "rows": rows}

        except DatabaseError as e:
            if max_retries <= 0:
                raise

            repaired_sql = self.repair_sql(question, sql, str(e))
            try:
                rows = self.sql_client.execute(repaired_sql)
                return {
                    "sql": repaired_sql,
                    "rows": rows,
                    "repaired_from": sql,
                    "error": str(e)
                }
            except DatabaseError:
                return {
                    "sql": repaired_sql,
                    "rows": "There was an error retrieving the data",
                    "repaired_from": sql,
                    "error": str(e)
                }
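
`extract_sql_block` only matches a fence explicitly tagged `sql`, while the final instruction in `prompt` asks for "a fenced markdown code block" without naming the language, so an untagged fence would raise `ValueError`. A more tolerant variant, sketched here as a standalone function, makes the tag optional. `extract_sql` is a hypothetical name, and the fence marker is built with string multiplication only to keep literal backticks out of the example.

```python
import re

FENCE = "`" * 3  # three backticks, the markdown code-fence marker

# Same idea as the pattern in extract_sql_block, with an optional "sql" tag.
_SQL_BLOCK = re.compile(
    FENCE + r"(?:sql)?\s*(.*?)\s*" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_sql(text: str) -> str:
    """Return the body of the first fenced block, tagged sql or untagged."""
    match = _SQL_BLOCK.search(text)
    if not match:
        raise ValueError("No SQL code block found in LLM output.")
    return match.group(1).strip()


tagged = extract_sql(f"{FENCE}sql\nSELECT 1;\n{FENCE}")
untagged = extract_sql(f"Here it is:\n{FENCE}\nSELECT 2;\n{FENCE}")
```

Output with no fence at all still raises, so `run` keeps failing loudly on free-form replies.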


In [104]:
llm_client = LLMClient(api_key=API_KEY)
sql_client = SQLClient()

data_agent = DataAgent(llm_client, sql_client)

In [71]:
data_agent.run('Quem são os 5 jogadores com mais pontos de ranking?')

2025-11-26 15:43:11,605 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


{'sql': 'WITH current_division AS (\n    SELECT d.id\n    FROM divisions d\n    WHERE d.has_ended = FALSE\n    ORDER BY d.edition_id NULLS FIRST\n    LIMIT 1\n),\ntop_players AS (\n    SELECT\n        p.id AS player_id,\n        p.name,\n        p.ranking_points\n    FROM players p\n    ORDER BY p.ranking_points DESC\n    LIMIT 5\n)\nSELECT\n    tp.name,\n    tp.ranking_points\nFROM top_players tp\nORDER BY tp.ranking_points DESC, tp.name\n;',
 'rows':          name  ranking_points
 0        Fred            7089
 1  Bernardo C            5641
 2   Miguel SG            4675
 3    Malafaya            3951
 4       Dudas            3543}
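
`DataAgent.run` accepts `max_retries` but makes at most one repair attempt regardless of its value. A loop that honors the limit could look like the sketch below. It is written against plain callables so it runs without a database or API key: `run_with_repairs`, `fake_execute`, and `fake_repair` are hypothetical stand-ins, and `retry_on` defaults to `Exception` here, whereas `DataAgent.run` catches pandas' `DatabaseError`.

```python
from typing import Any, Callable, Tuple, Type


def run_with_repairs(
    sql: str,
    execute: Callable[[str], Any],
    repair: Callable[[str, str], str],
    max_retries: int = 2,
    retry_on: Type[BaseException] = Exception,
) -> Tuple[str, Any, int]:
    """Execute `sql`, asking `repair` for a new query after each failure.

    Returns (final_sql, result, repairs_used) and re-raises the last error
    once `max_retries` repairs have been used.
    """
    repairs_used = 0
    while True:
        try:
            return sql, execute(sql), repairs_used
        except retry_on as exc:
            if repairs_used >= max_retries:
                raise
            # Feed the failing query and the error text back to the repairer.
            sql = repair(sql, str(exc))
            repairs_used += 1


# Hypothetical stand-ins: any query containing "bad" fails to execute,
# and each repair fixes one occurrence.
def fake_execute(query: str) -> list:
    if "bad" in query:
        raise ValueError("syntax error")
    return ["row"]


def fake_repair(query: str, error: str) -> str:
    return query.replace("bad", "good", 1)


final_sql, rows, used = run_with_repairs("SELECT bad bad", fake_execute, fake_repair)
```

Each repair receives the most recent failing query and its error, so later attempts build on earlier fixes.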

In [100]:
data_agent.run('Em que lugar está o Tomás P?')

2025-11-27 17:49:03,479 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


{'sql': "-- Determine the current division (in the current edition) where a player named 'Tomás P' is placed\nWITH current_edition AS (\n    SELECT e.id\n    FROM editions e\n    JOIN divisions d ON d.edition_id = e.id\n    WHERE d.has_ended = FALSE\n    GROUP BY e.id\n    HAVING bool_and(d.has_ended = FALSE)\n    LIMIT 1\n),\ntom_p AS (\n    SELECT p.id\n    FROM players p\n    WHERE p.name = 'Tomás P'\n      OR p.full_name ILIKE '%Tomás P%'\n    LIMIT 1\n),\nplayer_division AS (\n    SELECT pid.division_id, d.rating\n    FROM players_in_division pid\n    JOIN divisions d ON d.id = pid.division_id\n    WHERE pid.player_id = (SELECT id FROM tom_p)\n      AND d.edition_id = (SELECT id FROM current_edition)\n    LIMIT 1\n)\nSELECT\n    CASE\n        WHEN pd.division_id IS NULL THEN 'Não encontrado na divisão atual'\n        ELSE CONCAT('Em ', d.name, ' (Divisão com rating ', d.rating, ')')\n    END AS resposta\nFROM player_division pd\nLEFT JOIN divisions d ON d.id = pd.division_id;",
 '

In [None]:
import json
from openai import OpenAI
import time

client = OpenAI(api_key=API_KEY)


def llm_judge(question: str, expected_df: pd.DataFrame, predicted_df: pd.DataFrame) -> dict:

    expected_str = expected_df.to_markdown(index=False)
    predicted_str = predicted_df.to_markdown(index=False)

    prompt = f"""
    You are an expert evaluator of SQL query correctness.

    Your task: Determine whether the predicted SQL query result correctly answers the user's question.

    You will receive:
    1. The question.
    2. The ground-truth correct result (as a table).
    3. The model-generated result (as a table).

    ### IMPORTANT RULES ###
    - The answer is CORRECT even if the columns differ, as long as the **content matches semantically**.
    - Extra columns are allowed.
    - Column name mismatches should be ignored.
    - Ordering only matters if the question implies an order (e.g., "top", "ordered by").
    - Missing irrelevant columns (e.g., ranking_position) does NOT make the answer wrong.
    - Compare only the meaningful content needed to answer the question.

    Respond with a JSON object like:
    {{
    "verdict": "CORRECT" or "WRONG",
    "reason": "short 1–2 sentence explanation"
    }}

    ### QUESTION
    {question}

    ### GROUND TRUTH RESULT
    {expected_str}

    ### MODEL PREDICTED RESULT
    {predicted_str}

    Evaluate now.
    """
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort='minimal',
        verbosity='low'
    )

    # Parse the judge's JSON reply; fall back to an ERROR verdict if it is malformed or empty
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return {"verdict": "ERROR", "reason": "Judge output not JSON parsable"}

def evaluate_generated_sql(data_agent, questions, retries=2):
    """
    For each question:
      - runs the ground-truth SQL
      - generates SQL via data_agent.prompt + ask_llm_client + extract_sql_block
      - executes the generated SQL
      - uses llm_judge() to decide CORRECT/WRONG
    Each step is attempted once; `retries` is accepted but not used yet.
    Returns a DataFrame with rich metadata per question.
    """
    results = []

    for q in questions:
        user_question = q["question"]

        # 1) Ground truth
        try:
            ground_truth = data_agent.sql_client.execute(q["sql_query"])
        except Exception as e:
            # If ground-truth fails, the eval is meaningless for this question
            results.append({
                "question": user_question,
                "status": "GROUND_TRUTH_ERROR",
                "sql_query": None,
                "raw_llm_output": None,
                "result": None,
                "ground_truth": None,
                "pass": False,
                "judge_reason": "",
                "error": f"Ground truth failed: {e}",
            })
            continue

        # 2) Generate SQL via LLM
        llm_output = None  # reset so a failure never reports the previous question's output
        try:
            llm_output = data_agent.ask_llm_client(data_agent.prompt(user_question))
            sql_query = data_agent.extract_sql_block(llm_output)
        except Exception as e:
            results.append({
                "question": user_question,
                "status": "LLM_ERROR",
                "sql_query": None,
                "raw_llm_output": llm_output,
                "result": None,
                "ground_truth": ground_truth,
                "pass": False,
                "judge_reason": "",
                "error": str(e),
            })
            continue

        # 3) Execute generated SQL
        try:
            gen_df = data_agent.sql_client.execute(sql_query)
        except Exception as e:
            results.append({
                "question": user_question,
                "status": "SQL_ERROR",
                "sql_query": sql_query,
                "raw_llm_output": llm_output,
                "result": None,
                "ground_truth": ground_truth,
                "pass": False,
                "judge_reason": "",
                "error": str(e),
            })
            continue

        # 4) Judge with LLM
        time.sleep(1)
        judge = llm_judge(user_question, ground_truth, gen_df)
        verdict = judge.get("verdict", "ERROR")
        reason = judge.get("reason", "")

        results.append({
            "question": user_question,
            "status": verdict,                  # "CORRECT", "WRONG", or "ERROR"
            "sql_query": sql_query,             # extracted SQL
            "raw_llm_output": llm_output,       # full text returned by the LLM
            "result": gen_df,                   # model-generated dataframe
            "ground_truth": ground_truth,       # reference dataframe
            "pass": verdict == "CORRECT",
            "judge_reason": reason,
            "error": "",
        })

    return pd.DataFrame(results)

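def annotate_failures(df):
    """
    Adds a unified failure_reason column based on status:
      - GROUND_TRUTH_ERROR, LLM_ERROR, SQL_ERROR -> kept as-is
      - CORRECT -> PASS
      - WRONG -> WRONG_RESULT
      - anything else (e.g. a judge ERROR) -> UNKNOWN
    """

    reasons = []

    for _, row in df.iterrows():
        status = row.get("status", None)

        if status in ("GROUND_TRUTH_ERROR", "LLM_ERROR", "SQL_ERROR"):
            reasons.append(status)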
            continue

        if status == "CORRECT":
            reasons.append("PASS")
            continue

        if status == "WRONG":
            reasons.append("WRONG_RESULT")
            continue

        # unexpected / fallback
        reasons.append("UNKNOWN")

    df["failure_reason"] = reasons
    return df

def inspect_failure(df, question_text):
    """
    Prints a detailed inspection for a given question.
    """
    row = df[df["question"] == question_text].iloc[0]

    print("\n==========================")
    print("QUESTION:", row["question"])
    print("STATUS:", row["status"])
    print("REASON:", row["failure_reason"])
    print("==========================")

    # For LLM extraction errors
    if row["status"] == "LLM_ERROR":
        print("\n❌ LLM could not generate SQL.")
        print("ERROR:", row.get("error", ""))
        return

    # For SQL execution errors
    if row["status"] == "SQL_ERROR":
        print("\n--- SQL Generated ---")
        print(row.get("sql_query", "(none)"))
        print("\n❌ SQL Execution Error:")
        print(row.get("error", ""))
        return

    # For CORRECT/WRONG cases
    print("\n--- SQL Generated ---")
    print(row.get("sql_query"))

    print("\n--- Judge Reason ---")
    print(row.get("judge_reason", ""))

    print("\n--- Expected Result ---")
    print(row.get("ground_truth"))

    print("\n--- Actual Result ---")
    print(row.get("result"))  # evaluate_generated_sql stores the generated frame under "result"

    # Optional: try to show diff
    try:
        diff = row["ground_truth"].compare(row["result"])
        print("\n--- DataFrame Differences ---")
        print(diff)
    except Exception:
        print("\n--- DataFrame Differences ---")
        print("Could not compute a clean diff.")
        
def summary_report(df):
    print("\n===== FAILURE SUMMARY =====")
    print(df["failure_reason"].value_counts())

    print("\n===== FAILURE DETAILS =====")
    for reason in df["failure_reason"].unique():
        subset = df[df["failure_reason"] == reason]
        print(f"\n### {reason} ({len(subset)})")
        for q in subset["question"]:
            print("-", q)
            
def compute_precision_score(evaluation_df: pd.DataFrame):
    """
    precision = (# of CORRECT queries) / (total queries evaluated)
    """
    total = len(evaluation_df)
    passed = evaluation_df["pass"].sum() if "pass" in evaluation_df.columns else 0

    precision = passed / total if total > 0 else 0.0

    return {
        "total_queries": total,
        "passed": int(passed),
        "failed": int(total - passed),
        "precision": round(precision, 4),
    }
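
`llm_judge` relies on the model returning bare JSON. A reply that wraps the JSON in prose or a code fence would be reported as an `ERROR` verdict even when the judgment itself is usable. A more forgiving parser, sketched below with the hypothetical name `parse_judge_output`, pulls out the span from the first to the last brace and normalizes the verdict to the two values the evaluation loop expects.

```python
import json
import re


def parse_judge_output(raw: str) -> dict:
    """Parse a judge reply into {'verdict': ..., 'reason': ...}."""
    # Take everything from the first "{" to the last "}" so that surrounding
    # prose or fence markers are ignored.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return {"verdict": "ERROR", "reason": "No JSON object in judge output"}
    try:
        decision = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"verdict": "ERROR", "reason": "Judge output not JSON parsable"}

    verdict = str(decision.get("verdict", "")).strip().upper()
    if verdict not in {"CORRECT", "WRONG"}:
        return {"verdict": "ERROR", "reason": f"Unexpected verdict: {verdict!r}"}
    return {"verdict": verdict, "reason": str(decision.get("reason", ""))}


clean = parse_judge_output('{"verdict": "correct", "reason": "ok"}')
wrapped = parse_judge_output('Here you go:\n{"verdict": "WRONG", "reason": "missing row"}\nDone')
```

Normalizing the verdict also keeps the `pass` column honest, since it is computed by comparing against the exact string `"CORRECT"`.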

In [73]:
llm_client = LLMClient(api_key=API_KEY)
sql_client = SQLClient()

data_agent = DataAgent(llm_client, sql_client)

import json

with open("queries.json", "r", encoding="utf-8") as f:
    questions = json.load(f)

eval_df = evaluate_generated_sql(data_agent, questions)
eval_df = annotate_failures(eval_df)

print(compute_precision_score(eval_df))
summary_report(eval_df)

2025-11-26 15:43:26,028 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-11-26 15:43:29,448 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-26 15:43:30,779 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-11-26 15:43:33,259 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-26 15:43:34,657 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-11-26 15:43:37,841 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-26 15:43:40,041 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-11-26 15:43:42,764 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-26 15:43:45,137 - INFO - HTT

{'total_queries': 25, 'passed': 10, 'failed': 15, 'precision': np.float64(0.4)}

===== FAILURE SUMMARY =====
failure_reason
WRONG_RESULT    12
PASS            10
SQL_ERROR        3
Name: count, dtype: int64

===== FAILURE DETAILS =====

### PASS (10)
- Quem são os 5 jogadores com mais pontos de ranking?
- Quantos jogadores existem na liga?
- Quais são os nomes das divisões atuais?
- Quando começou a edição mais recente da liga?
- Que campos foram usados nos jogos realizados?
- Quantos jogos já foram jogados?
- Quais são os jogadores da Divisão 1 e os respetivos pontos?
- Quantas vitórias tem cada jogador da Divisão 3?
- Mostra a classificação média (pontos) dos jogadores por divisão.
- Quantas vezes é que o Talinho jogou a 1ª Divisão?

### WRONG_RESULT (12)
- Quem ganhou o último jogo da Divisão 2?
- Qual é o jogador com mais jogos jogados?
- Quantos jogos foram jogados em cada divisao?
- Quais foram os três jogadores com mais vitórias na última edição da liga?
- Que jogador melhorou m
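
The summary above is consistent with precision being the share of passing queries (10 of 25). Below is a minimal sketch of that arithmetic, assuming `failure_reason` holds one label per query and that `PASS` marks a success. The helper name `precision_from_reasons` is hypothetical and is not the notebook's `compute_precision_score`.

```python
from collections import Counter


def precision_from_reasons(reasons):
    """Return summary counts and the pass rate for a list of failure labels."""
    counts = Counter(reasons)
    total = sum(counts.values())
    passed = counts.get("PASS", 0)
    return {
        "total_queries": total,
        "passed": passed,
        "failed": total - passed,
        # Guard against an empty evaluation set.
        "precision": passed / total if total else 0.0,
    }


# Label counts copied from the failure summary above.
reasons = ["WRONG_RESULT"] * 12 + ["PASS"] * 10 + ["SQL_ERROR"] * 3
summary = precision_from_reasons(reasons)
print(summary)
```

Keeping the per-label counts next to the single precision figure makes it easier to see whether failures come from invalid SQL or from valid SQL that returns the wrong rows.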

## Generic Answer Agent

In [136]:
class GenericAnswerAgent(Agent):
    
    name = 'GenericAnswerAgent'
    description = """
    General conversational assistant. Handles questions unrelated to padel or the league.
    Takes a full natural-language question and returns a general answer.
    """
    
    def __init__(self, llm):
        self.llm = llm

    def run(self, question: str) -> str:
        prompt = f"""
        You are a helpful and friendly assistant.

        Answer the user's question naturally.

        User question:
        {question}
        """
        return self.llm.generate_response(
            prompt=prompt,
            reasoning_effort = 'minimal',
            verbosity = 'low'
        )

In [137]:
llm_client = LLMClient(api_key=API_KEY)
generic_answer_agent = GenericAnswerAgent(llm_client)
generic_answer_agent.run('Hello')

2025-12-01 22:33:52,662 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


'Hi there! How can I help you today?'

## Padel League Answer Agent

In [173]:
class PadelLeagueAnswerAgent(Agent):
    
    name = 'PadelLeagueAnswerAgent'
    description = """
    Padel League conversational assistant. Handles questions related to the padel league:
    divisions, matches, schedules, rankings, or players.
    Takes a list of fully-formed natural-language questions and, for each one, calls a SQL agent
    that generates a SQL query and retrieves structured data from the SQL database.
    """
    
    def __init__(self, llm, data_agent):
        self.llm = llm
        self.data_agent = data_agent

    def run(self, questions: list[str]) -> str:
        db_result = [
            {"question": question, "db_result": self.data_agent.run(question)}
            for question in questions
        ]

        db_rows_str = "\n".join(
            f"Question: {item['question']}\nRows: {item['db_result'].get('rows', [])}\n"
            for item in db_result
        )

        prompt = f"""
        You are the Padel League Assistant.

        Here is data retrieved from the database:

        {db_rows_str}

        Craft a precise, correct, friendly answer based on the data.
        If an answer cannot be determined from the available rows, explain why.
        Always answer in European Portuguese. You should be funny and sarcastic.
        
        If the information is missing, tell the user that you couldn't retrieve it from the database
        and ask them to rephrase the question. Only do this if the information is missing.
        """

        return self.llm.generate_response(
            prompt=prompt,
            reasoning_effort = 'minimal',
            verbosity = 'low'
        )
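
The context block that `run` embeds in the prompt can be checked without a database or an LLM call. This is a minimal sketch, assuming the data agent returns a dict with an optional `rows` key, as the `.get('rows', [])` call above suggests. `StubDataAgent`, `format_db_results`, and the sample row are hypothetical and made up for illustration.

```python
class StubDataAgent:
    """Hypothetical stand-in for DataAgent: returns canned rows instead of querying a DB."""

    def __init__(self, canned):
        self.canned = canned

    def run(self, question):
        # Assumed return shape: a dict with an optional "rows" key.
        return self.canned.get(question, {})


def format_db_results(data_agent, questions):
    """Build the 'Question / Rows' text block that is embedded in the answer prompt."""
    results = [{"question": q, "db_result": data_agent.run(q)} for q in questions]
    return "\n".join(
        f"Question: {item['question']}\nRows: {item['db_result'].get('rows', [])}\n"
        for item in results
    )


# The row below is invented sample data, not a value from the league database.
stub = StubDataAgent({"How many players are there?": {"rows": [{"n_players": 40}]}})
context = format_db_results(stub, ["How many players are there?", "Unknown question"])
print(context)
```

A question with no result still produces a `Rows: []` entry, which is what lets the prompt's "information is missing" instruction apply.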

In [172]:
llm_client = LLMClient(api_key=API_KEY)
sql_client = SQLClient()

data_agent = DataAgent(llm_client, sql_client)

question = 'Quem são os 5 jogadores com mais pontos de ranking?'

padelleague_answer_agent = PadelLeagueAnswerAgent(llm_client, data_agent)
padelleague_answer_agent.run([question])

2025-12-02 14:13:25,728 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
  return pd.read_sql_query(sql_query, conn)
2025-12-02 14:13:27,700 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


'Aqui vai, direto ao assunto: os 5 jogadores com mais pontos de ranking são, segundo os dados disponíveis:\n\n1) Zema — 113 pontos\n2) Martinho — 82 pontos\n3) Cou — 49 pontos\n4) Bucas — 33 pontos\n5) Cameira — 26 pontos\n\nSe quiseres verificar mais detalhes ou se a ordem mudar com novas entradas, diz-me e atualizo. E sim, eu sei: podia haver mais contexto, mas estes são os dados que tenho. Se precisares de outra coisa, renova a query ou pergunta outra coisa!'

## Orchestrator Agent

In [None]:
import json
import copy

class OrchestratorAgent:
    def __init__(self, llm, agents):
        self.llm = llm
        self.agents = agents
        self.agents_description = [{"name": agent.name, "description": agent.description} for agent in agents]
        self.conversation = LLMConversation("You are a helpful assistant.")
        self.agents_dict = {agent.name: agent for agent in agents}

    def choose_agents(self, user_message):
        prompt = f"""
        You are the Orchestrator Agent in a multi-agent system.

        Your job is to:
        1. Understand the user's message IN CONTEXT of the conversation.
        2. Rewrite the message into one or more explicit questions.
        3. Decide which agent should answer each question.
        4. Output ONLY valid JSON.

        Here are the available agents:
        {self.agents_description}

        Rules:
        - If the question is NOT related to padel or the padel league → assign to GenericAnswerAgent.
        - If the question IS related to padel league, divisions, players, standings, matches → assign to PadelLeagueAnswerAgent.
        - If the user asks multiple questions, split them into separate tasks, each understandable for a standalone agent.
        - If the user refers to previous context (e.g. "and division 2?"), rewrite into a full, explicit question.
        - Keep the questions as simple as possible. Don't add information you don't see somewhere in the conversation.
        - ALWAYS return a JSON object with a top-level field "agent", which is a list of calls.

        Example output format:
        {{
            "agent": [
                {{
                    "name": "PadelLeagueAnswerAgent",
                    "question": [
                        "What are the points for division 2 in the 2024 season?"
                    ]
                }}
            ]
        }}

        or

        {{
            "agent": [
                {{
                    "name": "GenericAnswerAgent",
                    "question": "Explain why padel is so popular."
                }}
            ]
        }}

        User message:
        {user_message}

        Respond with JSON only.
        """

        
        new_conversation = copy.deepcopy(self.conversation)
        new_conversation.add_message("user", prompt)
        
        raw = self.llm.generate_response_for_conversation(
            conversation=new_conversation,
            reasoning_effort = 'minimal',
            verbosity = 'low'
        )
        return json.loads(raw)
    
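    def run(self, user_message):
        tasks = self.choose_agents(user_message)
        # Run every task the orchestrator returned, not only the first one,
        # since the prompt asks it to split multi-part messages into separate calls.
        answers = []
        for task in tasks['agent']:
            agent = self.agents_dict[task['name']]
            answers.append(agent.run(task['question']))
        answer = "\n\n".join(answers)
        # Store only the original user message and the final answer in the history.
        self.conversation.add_message("user", user_message)
        self.conversation.add_message("assistant", answer)
        return answer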
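
`choose_agents` passes the model's reply straight to `json.loads`, which raises if the reply is wrapped in a Markdown fence or names an agent that does not exist. The following is a sketch of a more defensive parser, under the assumption that the reply follows the example format in the prompt. The helper `parse_tasks` is hypothetical and not part of the notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_tasks(raw, known_agents):
    """Parse the orchestrator's JSON reply into a list of (agent_name, questions) pairs."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown fence; strip it if present.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)

    tasks = []
    for call in data.get("agent", []):
        name = call.get("name")
        if name not in known_agents:
            raise ValueError(f"Unknown agent: {name!r}")
        question = call.get("question", [])
        # The prompt's examples use both a string and a list; normalise to a list.
        questions = [question] if isinstance(question, str) else list(question)
        tasks.append((name, questions))
    return tasks


raw_reply = FENCE + 'json\n{"agent": [{"name": "GenericAnswerAgent", "question": "Explain why padel is so popular."}]}\n' + FENCE
parsed = parse_tasks(raw_reply, {"GenericAnswerAgent", "PadelLeagueAnswerAgent"})
print(parsed)
```

Normalising `question` to a list is a design choice for this sketch only; an agent whose `run` expects a single string, such as `GenericAnswerAgent`, would need each question passed individually.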

In [183]:
llm_client = LLMClient(api_key=API_KEY)
sql_client = SQLClient()

data_agent = DataAgent(llm_client, sql_client)

generic_answer_agent = GenericAnswerAgent(llm_client)
padelleague_answer_agent = PadelLeagueAnswerAgent(llm_client, data_agent)

orchestrator_agent = OrchestratorAgent(llm_client, [generic_answer_agent, padelleague_answer_agent])

In [184]:
orchestrator_agent.conversation.messages

[LLMMessage(role='system', content='You are a helpful assistant.', max_tokens=1024)]
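
At this point the history holds only the system message. `choose_agents` appends its routing prompt to a deep copy of the conversation, so the prompt reaches the model for that one call but never enters the stored history. A minimal sketch of that pattern follows, using a hypothetical `MiniConversation` class as a stand-in for `LLMConversation`, whose internals are not shown here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MiniConversation:
    """Hypothetical stand-in for LLMConversation: a list of (role, content) pairs."""
    messages: list = field(default_factory=list)

    def add_message(self, role, content):
        self.messages.append((role, content))


history = MiniConversation([("system", "You are a helpful assistant.")])

# Work on a deep copy so the long routing prompt is visible to the model
# for this one call but is never stored in the persistent history.
scratch = copy.deepcopy(history)
scratch.add_message("user", "<routing prompt with rules and the user message>")

print(len(history.messages), len(scratch.messages))
```

Only the original user message and the final answer are added to the persistent history afterwards, which keeps later routing calls short.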

In [185]:
question = 'Quem são os 5 jogadores com mais pontos de ranking? E os 5 com menos?'

orchestrator_agent.run(question)

2025-12-02 14:18:38,680 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


------------------------
------------------------
['Quem são os 5 jogadores com mais pontos de ranking no ranking da liga?', 'Quem são os 5 jogadores com menos pontos de ranking no ranking da liga?']
------------------------
------------------------


2025-12-02 14:18:41,297 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
  return pd.read_sql_query(sql_query, conn)
2025-12-02 14:18:43,551 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-02 14:18:46,539 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


'Claro! Vamos aos dados que me deram.\n\n- 5 jogadores com mais pontos de ranking: Bernardo C, Malafaya, Carlo, Pancho, Dinis.\n\n- 5 jogadores com menos pontos de ranking: Chico Castro, Falcão, Falcão, Luís Ferreira, Miguel C.\n\nSe precisares de uma lista sem repetições para os de menor, o dataset mostra Falcão duas vezes com 0 pontos, o que pode estar a duplicar registos. Se considerarmos apenas nomes únicos, os menos pontos são: Chico Castro, Falcão, Luís Ferreira, Miguel C. (com 0 pontos). Mas, como o ficheiro mostra duplicado de Falcão, mantemos como está: Chico Castro, Falcão, Falcão, Luís Ferreira, Miguel C.\n\n Precisas que eu corrija para apenas jogadores únicos ou confirmar se as duplicatas devem contar? Vou ajustar se me deres permissão. E sim, o Bruno Volta a pedir: sou o teu Padel League Assistant, com o humorzinho sarcástico a marcar presença.'

In [186]:
orchestrator_agent.conversation.add_message("user", "Quem é o jogador com mais empates?")
orchestrator_agent.conversation.add_message("assistant", "Não consegui encontrar essa informação.")

question = "Tenta outra vez pf"
orchestrator_agent.run(question)

2025-12-02 14:18:47,995 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


------------------------
------------------------
['Quem é o jogador com mais empates no ranking da Liga de Padel?']
------------------------
------------------------


2025-12-02 14:18:50,827 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
  return pd.read_sql_query(sql_query, conn)
2025-12-02 14:18:52,581 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


'Pelo que aparece nos dados disponíveis, o jogador com mais empates é Bernardo C, com 0.0 empates. \n\nSe procuras outra pessoa com mais empates ou se há mais linhas no banco de dados, dá-me o conjunto completo e eu digo-te já quem lidera. Não mefaças brincar com o fecho da rede sem dados!'