# Prompt Error Identification
**Purpose:** Score individual prompts and identify their errors against a gold standard

---
**Copyright (c) 2025 Michael Powers**

# Run subquestion results through Gemini for scoring

In [56]:
import os
import logging
import json
import pandas as pd
from datetime import datetime
import google.generativeai as genai
import time
import random

In [57]:
gemini_model = 'gemini-2.5-flash-lite-preview-06-17'
api_key="YOUR_API_KEY"

In [58]:
def ask_gemini_json(prompt, use_json=True, model='models/gemini-2.0-flash-lite'):
    """Send `prompt` to Gemini and return the response text.

    When `use_json` is True, the model is asked to return JSON.
    """
    genai.configure(api_key=api_key)
    # Use a separate name so the `model` argument (a string) is not shadowed.
    client = genai.GenerativeModel(model)
    if use_json:
        generation_config = genai.GenerationConfig(response_mime_type="application/json")
        response = client.generate_content(prompt, generation_config=generation_config)
    else:
        response = client.generate_content(prompt)
    return response.text

In [59]:
def clean_response(response):
    response = response.replace("<think>", "").replace("</think>", "")
    response = response.strip()
    if response.startswith("```json") and response.endswith("```"):
        response = response[len("```json"): -len("```")].strip()
    elif response.startswith("```") and response.endswith("```"):
        response = response[len("```"): -len("```")].strip()
    if response.lower().startswith('sql'):
        response = response[3:].strip()
    return response
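
`clean_response` removes the `<think>` tags but keeps whatever text sat between them, which can leave the result unparseable as JSON. A minimal sketch of a stricter variant follows. The helper name `extract_json_payload` and the sample string are hypothetical and not part of the notebook.

```python
import json
import re

FENCE = "`" * 3  # built at runtime to avoid a literal code fence in this example


def extract_json_payload(raw: str) -> dict:
    """Hypothetical helper: strip <think> blocks and code fences, then parse JSON."""
    # Remove each <think>...</think> block together with its contents.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # Remove a surrounding fence (with or without a "json" tag) if present.
    match = re.match(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)


sample = (
    "<think>checking joins</think>\n"
    + FENCE + "json\n"
    + '{"did_error_occur": false, "score": 5}\n'
    + FENCE
)
parsed = extract_json_payload(sample)
```

Here `json.loads` raises `json.JSONDecodeError` on malformed input, so a caller can catch that specific exception instead of a bare `Exception`.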

In [71]:
def get_prompt(sub_prompt, generated_response, 
               original_user_query, tables_to_be_used,full_gold_star_sql, sub_prompt_purpose):

    json_format = """
    ```json
{
  "did_error_occur": true/false,
  "error_category": "string",
  "why_error_occurred": "string",
  "suggestions_for_improvement": "string",
  "score": integer (1-5)
}
```
    """
    
    prompt = f"""
You are an expert AI assistant tasked with evaluating the performance of individual "sub-prompts" within a RAG-to-SQL system. Your goal is to assess how accurately and effectively a sub-prompt's generated output contributes to building the correct full SQL query for a given user question.

**Crucially, your evaluation MUST be strictly aligned with the defined purpose of the 'sub_prompt' provided in `<SUB_PROMPT_PURPOSE>`. Do not infer or assume a different purpose for the sub-prompt than what is explicitly stated there.**

You will be provided with the following information:
- The sub-prompt itself (`sub_prompt`).
- The response generated by the sub-prompt (`generated_response`).
- The original user query (`original_user_query`).
- A list of table names available in the database (`tables_to_be_used`).
- The complete, correct SQL query that should be generated by the entire RAG-to-SQL process (`full_gold_star_sql`).
- The defined purpose of the sub-prompt (`sub_prompt_purpose`).

Your evaluation should consider the specific purpose of the sub-prompt based on `sub_prompt_purpose` (e.g., identifying tables, grouping, calculations, filtering, joins). You must determine if the `generated_response` from the sub-prompt provides the correct and necessary information to lead to the `full_gold_star_sql` in its specific domain, *as defined by its stated purpose*.

**Input:**
<SUB_PROMPT>
{sub_prompt}
</SUB_PROMPT>

<GENERATED_RESPONSE>
{generated_response}
</GENERATED_RESPONSE>

<ORIGINAL_USER_QUERY>
{original_user_query}
</ORIGINAL_USER_QUERY>

<TABLES_TO_BE_USED>
{tables_to_be_used}
</TABLES_TO_BE_USED>

<FULL_GOLD_STAR_SQL>
{full_gold_star_sql}
</FULL_GOLD_STAR_SQL>

<SUB_PROMPT_PURPOSE>
{sub_prompt_purpose}
</SUB_PROMPT_PURPOSE>

**Evaluation Criteria and Output Format:**

Based on the above inputs, analyze the `generated_response` in the context of the `sub_prompt`'s explicitly stated purpose (`<SUB_PROMPT_PURPOSE>`) and the `full_gold_star_sql`. Provide your assessment as a JSON object with the following fields:

1.  **`did_error_occur`**: A boolean (`true` if an error or significant imperfection occurred, `false` if the `generated_response` was perfectly correct and useful for its sub-prompt's goal).
2.  **`error_category`**: A string representing the most relevant error type if `did_error_occur` is `true`. If `did_error_occur` is `false`, set this to "No Error". Choose from the following categories:
    * `Schema Misidentification`: The sub-prompt incorrectly identified or missed relevant tables or columns.
    * `Incorrect Aggregation/Calculation`: The sub-prompt suggested wrong aggregations (e.g., SUM instead of COUNT), wrong columns for aggregation, or missed required calculations.
    * `Incorrect Filtering`: The sub-prompt suggested wrong filtering conditions, wrong columns for filters, or missed required filters (WHERE/HAVING clauses).
    * `Incorrect Join Logic`: The sub-prompt suggested wrong join types (e.g., LEFT instead of INNER), wrong join columns, or missed required joins.
    * `Incorrect Grouping`: The sub-prompt suggested wrong grouping columns or missed required GROUP BY clauses.
    * `Misinterpretation of User Intent`: The sub-prompt's understanding of the user query was fundamentally flawed, leading to an irrelevant or incorrect `generated_response`.
    * `Irrelevant/Hallucinated Information`: The sub-prompt included information not requested or entirely made up.
    * `Syntax/Format Error`: The `generated_response` itself was not in the expected format (e.g., malformed JSON, unparseable text).
    * `Partial Correctness`: The `generated_response` was partially correct but had significant omissions or minor inaccuracies that did not fit the more specific categories.
    * `False Negative`: The `generated_response` claimed the sub-prompt was not needed but according to the `full_gold_star_sql` it was needed. (e.g. claimed No Filtering needed when Filtering was needed)
    * `Other Error`: An error occurred that does not fit the above categories.
    * `No Error`: The `generated_response` was fully correct and aligned with the `full_gold_star_sql`'s requirements for its specific sub-task.
3.  **`why_error_occurred`**: A detailed string explaining *why* the error occurred, referring to specific parts of the `generated_response`, `original_user_query`, and `full_gold_star_sql`. **Crucially, if the error stems from the `generated_response` deviating from the `<SUB_PROMPT_PURPOSE>`, explain how it deviated.** If no error, state "N/A".
4.  **`suggestions_for_improvement`**: A string providing actionable suggestions on how to modify or improve the `sub_prompt` itself to prevent this error in the future. Be specific. If no error, state "N/A".
5.  **`score`**: An integer from 1 to 5, reflecting the performance of the `generated_response` in achieving its specific sub-prompt goal, in the context of contributing to the `full_gold_star_sql`.
    * **5 (Excellent):** The `generated_response` is perfectly accurate and complete for its intended purpose, directly and correctly contributing to the `full_gold_star_sql`. No missing or incorrect information.
    * **4 (Good):** The `generated_response` is mostly accurate and helpful, but might have minor omissions or slightly irrelevant information that doesn't hinder the overall process significantly towards the `full_gold_star_sql`.
    * **3 (Acceptable):** The `generated_response` contains some correct information but also significant omissions or inaccuracies that require substantial correction or might lead to errors down the line. It's partially helpful for reaching the `full_gold_star_sql`.
    * **2 (Poor):** The `generated_response` is largely incorrect, misleading, or missing critical information. It actively steers the process in the wrong direction or provides minimal value.
    * **1 (Failed):** The `generated_response` is completely wrong, irrelevant, unusable, or actively harms the process.

**JSON Output:**
{json_format}


    
    """
    return prompt
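
The prompt above asks the judge model for five fields with fixed types and a closed list of categories. A minimal sketch of checking a parsed response against that contract before it is written to the results follows. `validate_evaluation` and `ALLOWED_CATEGORIES` are hypothetical names introduced for illustration, and the category strings are copied from the prompt text.

```python
ALLOWED_CATEGORIES = {
    "Schema Misidentification",
    "Incorrect Aggregation/Calculation",
    "Incorrect Filtering",
    "Incorrect Join Logic",
    "Incorrect Grouping",
    "Misinterpretation of User Intent",
    "Irrelevant/Hallucinated Information",
    "Syntax/Format Error",
    "Partial Correctness",
    "False Negative",
    "Other Error",
    "No Error",
}


def validate_evaluation(data):
    """Hypothetical check: return a list of problems found in a parsed judge response."""
    problems = []
    if not isinstance(data.get("did_error_occur"), bool):
        problems.append("did_error_occur must be a boolean")
    if data.get("error_category") not in ALLOWED_CATEGORIES:
        problems.append("error_category is not one of the listed categories")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        problems.append("score must be an integer from 1 to 5")
    for key in ("why_error_occurred", "suggestions_for_improvement"):
        if not isinstance(data.get(key), str):
            problems.append(f"{key} must be a string")
    return problems


good = {
    "did_error_occur": False,
    "error_category": "No Error",
    "why_error_occurred": "N/A",
    "suggestions_for_improvement": "N/A",
    "score": 5,
}
bad = {"did_error_occur": "false", "error_category": "Typo", "score": 7}
good_problems = validate_evaluation(good)
bad_problems = validate_evaluation(bad)
```

An empty list means the response matches the requested format. A non-empty list could be logged alongside the test ID instead of silently storing `None` values.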

In [67]:
def read_string(filename):
    try:
        # First, try to open with the standard UTF-8
        with open(filename, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        # If UTF-8 fails, try a different encoding like latin-1
        print(f"Warning: '{filename}' is not UTF-8. Trying latin-1...")
        with open(filename, 'r', encoding='latin-1') as f:
            return f.read()

In [62]:
def get_value_by_id(jsonl_file, instance_id, target_key):
    with open(jsonl_file, 'r') as f:
        for i, line in enumerate(f):
            try:
                entry = json.loads(line.strip())
                if entry.get("instance_id") == instance_id:
                    return entry.get(target_key)
            except Exception as e:
                print(f"ERROR: line {i}: {e}")
    return None
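
`get_value_by_id` rescans the whole JSONL file on every lookup, and the evaluation loop calls it twice per test file. A minimal sketch of reading the file once and indexing it by `instance_id` follows. The helper name `load_jsonl_index` and the demo records are hypothetical.

```python
import json
import os
import tempfile


def load_jsonl_index(jsonl_file, id_key="instance_id"):
    """Hypothetical helper: read a JSONL file once and index its entries by id."""
    index = {}
    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"ERROR: line {line_number}: {e}")
                continue
            if id_key in entry:
                index[entry[id_key]] = entry
    return index


# Demo with a throwaway file containing one malformed line
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"instance_id": "q1", "question": "How many users?"}\n')
    tmp.write("not json\n")
    tmp.write('{"instance_id": "q2", "question": "Total revenue?"}\n')
    path = tmp.name

index = load_jsonl_index(path)
os.remove(path)
question = index["q2"]["question"]
```

With the index in memory, a lookup such as `index.get(test_id, {}).get("question")` replaces a full file scan.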
            

In [84]:
def eval_subprompts(test_directory="./results/",
                                gold_directory="./gold/",
                                output_filename="./results.csv",
                                prompt_to_evaluate = "tables",
                                gold_tables_file = "gold.json",
                                orig_prompt = "hi",
                                original_questions_file = "questions.jsonl",
                                prompt_purpose="win",
                                model=gemini_model,
                                rpm_limit=15):

    RPM_LIMIT = rpm_limit
    MAX_RETRIES = 5
    BASE_SLEEP_TIME = 8.5
    df_rows = []
    review_rows = []

    num_responses = 0

    # Rate-limit state is shared across all files; resetting it for every
    # file would mean the RPM limit is never reached.
    requests_in_minute = 0
    start_time_minute = time.time()

    print("Starting generation.")
    # Loop through all files in test_directory
    for item in os.listdir(test_directory):
        item_path = os.path.join(test_directory, item)
        if os.path.isfile(item_path) and item.endswith('.json'):
            test_filename = item_path
            try: 
                entry = json.loads(read_string(test_filename))
                test_id = entry.get("ID")
                gold_filename = os.path.join(gold_directory, f"{test_id}.sql")

                # gold for full sql
                gold_contents = read_string(gold_filename)
                # response to evaluate
                to_evaluate = entry.get(prompt_to_evaluate)
                # gold schema info
                gold_schema = get_value_by_id(gold_tables_file, test_id, "gold_tables")
                # original query info
                orig_query = get_value_by_id(original_questions_file, test_id, "question")

                # Pass by keyword: get_prompt expects the table list before the gold SQL.
                prompt = get_prompt(sub_prompt=orig_prompt,
                                    generated_response=to_evaluate,
                                    original_user_query=orig_query,
                                    tables_to_be_used=gold_schema,
                                    full_gold_star_sql=gold_contents,
                                    sub_prompt_purpose=prompt_purpose)

                # Retry counter for this file only
                retries = 0
        
                while retries < MAX_RETRIES:
                # Check RPM limit
                    current_time = time.time()
                    if current_time - start_time_minute >= 60:
                        requests_in_minute = 0
                        start_time_minute = current_time

                    if requests_in_minute >= RPM_LIMIT:
                        wait_time = 60 - (current_time - start_time_minute)
                        print(f"Rate limit hit. Waiting for {wait_time:.2f} seconds...")
                        time.sleep(wait_time + 1)
                        requests_in_minute = 0
                        start_time_minute = time.time()

                    try: # LLM CALL
                        print(f'Making LLM Call {num_responses+1}')
                        # Count the request before sending it so that failed calls
                        # also count toward the RPM limit.
                        requests_in_minute += 1
                        response = ask_gemini_json(prompt, use_json=True, model=model)
                
                    # Clean the response string
                        response = clean_response(response)
                        if num_responses < 1:
                            print(f"FIRST RESPONSE:\n{response}")

                        try:
                            num_responses += 1
                            data = json.loads(response)
                            did_error_occur = data.get("did_error_occur")
                            df_rows.append({
                                "ID" : test_id,
                                "contains_error": data.get("did_error_occur"),
                                "error_category" : data.get("error_category"),
                                "why_error_occurred": data.get("why_error_occurred"),
                                "suggestions" : data.get("suggestions_for_improvement"),
                                "score": data.get("score")
                            })

                            review_rows.append({
                                "ID": test_id,
                                "score": data.get("score"),
                                "error_category": data.get("error_category"),
                                "model_response": to_evaluate,
                                "reason" : data.get("why_error_occurred"),
                            })
                           
                        except Exception as e:
                            print(f'Error decoding json response: {e}')
                            print("-------DATA NO GOOD")
                        
                        break # Success, break out of retry loop
                    except Exception as e:
                        retries += 1
                        sleep_duration = BASE_SLEEP_TIME * (2 ** (retries - 1)) + random.uniform(0, 1)
                        print(f"API Error : {e}. Retrying in {sleep_duration:.2f}s... (Attempt {retries}/{MAX_RETRIES})")
                        time.sleep(sleep_duration)

                    if retries == MAX_RETRIES:
                        print(f"Failed to process test ID {test_id} after {MAX_RETRIES} retries. Skipping.")
                        print("-------DATA NO GOOD")
            except Exception as e:
                # Skip the unreadable file but keep the results collected so far.
                print(f"Error decoding test data in {test_filename}: {e}")
                continue
            
    df = pd.DataFrame(df_rows)
    print(df)
    df.to_csv(output_filename, index=False)

    review_df = pd.DataFrame(review_rows)
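    # Use a filesystem-safe timestamp (no spaces or colons) in the file name.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    review_df.to_csv(f"ReviewResults_{timestamp}.csv", index=False)

    print("------DONE-----")
    return df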
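
The retry loop in `eval_subprompts` waits `BASE_SLEEP_TIME * 2 ** (retries - 1)` seconds plus up to one second of random jitter after each API error. A minimal sketch of that schedule as a standalone function follows, handy for checking how long a run can stall before a file is skipped. The name `backoff_delay` and its parameters are hypothetical.

```python
import random


def backoff_delay(attempt, base=8.5, jitter=1.0, rng=random):
    """Delay before retry `attempt` (1-based): base * 2**(attempt - 1) plus jitter."""
    return base * (2 ** (attempt - 1)) + rng.uniform(0, jitter)


# Deterministic part of the schedule for five retries (jitter disabled)
schedule = [backoff_delay(n, jitter=0.0) for n in range(1, 6)]
total_wait = sum(schedule)
```

With the notebook's settings (`BASE_SLEEP_TIME = 8.5`, `MAX_RETRIES = 5`) the waits are 8.5, 17, 34, 68 and 136 seconds, so one failing file can block the run for roughly four and a half minutes before it is skipped.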

In [89]:
## TABLES PROMPT
output_filename_table = "./table_evaluations_v3.csv"
prompt_to_evaluate_table = "table prompt result"
table_prompt_purpose = "Identify a definitive list of tables and columns that will actually be used in the SELECT, FROM, and JOIN clauses. Inclusion of extraneous tables and columns does not make the response incorrect."
original_prompt_table = """Given the user question and the database schema context, identify the most relevant tables and columns needed to answer the question. 
    These tables and columns will be used in the SELECT, FROM and JOIN clauses.
    Focus on tables and columns that directly relate to the entities and operations mentioned in the query. Consider table relationships and how they are used for joins or filtering.

    Output your answer as a JSON object with 'tables' (list of table names) and 'columns' (list of 'table_name.column_name' strings) and 'reasoning' (string).

    --- Schema Context ---
    {schema_context}

    --- User Question ---
    {query_str}

    JSON Output:

    """
    
## GROUPING PROMPT
output_filename_grouping = "./grouping_evaluations_v2.csv"
prompt_to_evaluate_grouping = "grouping prompt result including CTE"
grouping_prompt_purpose = "To identify any GROUP BY clauses and the associated aggregate functions (COUNT, SUM, AVG, etc.)."
original_prompt_grouping = """Given the user question, the previously identified tables and columns, and the business terms context, determine if any grouping or aggregation is required.
    If so, list the columns to group by and briefly explain why. If not, state 'No grouping needed'.
    Output as a JSON object with 'group_by_columns' (list of 'table_name.column_name' strings) and 'reasoning' (string).
    
    --- Schema Context ---
    {schema_context}
    --- Business Terms Context ---
    {business_terms_context}
    --- User Question ---
    {query_str}
    --- Identified Tables/Columns ---
    {identified_tables_columns_json}
    
    JSON Output:
"""

## Calculations prompt
output_filename_calculations = "./calculations_evaluations_v2.csv"
prompt_to_evaluate_calculations = "calculations prompt result"
calculations_prompt_purpose = "To define complex mathematical formulas or metrics that aren't simple aggregations (e.g., revenue / transactions)."
original_prompt_calculations = """Given the user question, the previously identified tables and columns, and the business terms context, **determine only the core mathematical or aggregate calculations required to directly answer the user's question.**

**Focus strictly on arithmetic operations (e.g., +, -, *, /) and aggregate functions (e.g., SUM, COUNT, AVG, MIN, MAX) that directly compute a value.**

**DO NOT include:**
* `GROUP BY` clauses
* `ORDER BY` clauses
* `HAVING` clauses
* `WHERE` clauses (these are for filtering, a separate step)
* Table joins or other structural SQL elements

If no direct calculations are needed (e.g., the question is simply asking to retrieve raw data or identify entities without aggregation/arithmetic), state 'No calculations needed'.

Output as a JSON object with 'calculations' (a list of concise 'calculation_string' strings) and 'reasoning' (a string explaining *why* each listed calculation is necessary to answer the question, without describing the full SQL logic).

--- Schema Context ---
{schema_context}
--- Business Terms Context ---
{business_terms_context}
--- User Question ---
{query_str}
--- Identified Tables/Columns ---
{identified_tables_columns_json}

JSON Output:
"""

## Filtering prompt
output_filename_filtering = "./filtering_evaluations_v2.csv"
prompt_to_evaluate_filtering = "filtering prompt result"
filtering_prompt_purpose = "To identify conditions that filter the data. This step must distinguish between pre-aggregation filters (WHERE) and post-aggregation filters (HAVING). This prompt is not responsible for CTEs."
original_prompt_filtering =   """Given the user question, the previously identified tables and columns, and the business terms context, determine if any filtering is required.
    If so, list the filtering. If not, state 'No filtering needed'.
    Output as a JSON object with 'filters' (list of 'filter' strings) and 'reasoning' (string).
    
    --- Schema Context ---
    {schema_context}
    --- Business Terms Context ---
    {business_terms_context}
    --- User Question ---
    {query_str}
    --- Identified Tables/Columns ---
    {identified_tables_columns_json}
    
    JSON Output:
    """





###########################
test_directory = "../application/prompt_logs/"
gold_directory = "../../spider/Spider2-main/spider2-lite/evaluation_suite/gold/sql/"
gold_tables_file = "../../spider/Spider2-main/methods/gold-tables/spider2-lite-gold-tables.jsonl"
original_questions = "../sql_test_set.jsonl"

print('\n\n ---- START TABLES ------------')
eval_subprompts(test_directory=test_directory,
                                gold_directory=gold_directory,
                                output_filename=output_filename_table,
                                prompt_to_evaluate = prompt_to_evaluate_table,
                                gold_tables_file = gold_tables_file,
                                orig_prompt =original_prompt_table,
                                original_questions_file=original_questions, 
                                prompt_purpose=table_prompt_purpose)

#print('\n\n ---- START GROUPING ------------')
#eval_subprompts(test_directory=test_directory,
#                                gold_directory=gold_directory,
#                                output_filename=output_filename_grouping,
#                                prompt_to_evaluate = prompt_to_evaluate_grouping,
#                                gold_tables_file = gold_tables_file,
#                                orig_prompt =original_prompt_grouping,
#                                original_questions_file=original_questions, 
#                                prompt_purpose=grouping_prompt_purpose)

#print('\n\n ---- START CALCULATIONS ------------')
#eval_subprompts(test_directory=test_directory,
#                                gold_directory=gold_directory,
#                                output_filename=output_filename_calculations,
#                                prompt_to_evaluate = prompt_to_evaluate_calculations,
#                                gold_tables_file = gold_tables_file,
#                                orig_prompt =original_prompt_calculations,
#                                original_questions_file=original_questions, 
#                                prompt_purpose=calculations_prompt_purpose)

#print('\n\n ---- START FILTERING ------------')
#eval_subprompts(test_directory=test_directory,
#                                gold_directory=gold_directory,
#                                output_filename=output_filename_filtering,
#                                prompt_to_evaluate = prompt_to_evaluate_filtering,
#                                gold_tables_file = gold_tables_file,
#                                orig_prompt =original_prompt_filtering,
#                                original_questions_file=original_questions, 
#                                prompt_purpose=filtering_prompt_purpose)



print('\n\n---------------------- DONE ---------------')



 ---- START TABLES ------------
Starting generation.
Making LLM Call 1
FIRST RESPONSE:
{
  "did_error_occur": false,
  "error_category": "No Error",
  "why_error_occurred": "N/A",
  "suggestions_for_improvement": "N/A",
  "score": 5
}
Making LLM Call 2
Making LLM Call 3
Making LLM Call 4
Making LLM Call 5
Making LLM Call 6
Making LLM Call 7
Making LLM Call 8
Making LLM Call 9
Making LLM Call 10
Making LLM Call 11
Making LLM Call 12
Making LLM Call 13
Making LLM Call 14
Making LLM Call 15
Making LLM Call 16
Making LLM Call 17
Making LLM Call 18
Making LLM Call 19
Making LLM Call 20
Making LLM Call 21
Making LLM Call 22
Making LLM Call 23
API Error : 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. [violations {
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds