# Run Evaluation Harness for Transcript Agent

This notebook implements and runs the evaluation harness for the Transcript Agent. It uses the processed and annotated data prepared in the `parse_spans.ipynb` notebook, which has been uploaded as a Phoenix Dataset.

**Goal:** Evaluate the agent's performance on key criteria (Tool Usage, SQL Correctness, Final Answer Quality) using LLM-as-judge, leveraging the Phoenix Experiments framework.

**Plan:**

1.  **Setup:** Import necessary libraries and initialize the Phoenix client and the evaluation LLM (e.g., GPT-4o).
2.  **Load Dataset:** Retrieve the specific evaluation dataset (`transcript-agent-eval-data-...`) previously uploaded to Phoenix.
3.  **Define Task Function:** Create the simple "dummy" task function required by `run_experiment` to pass through the pre-computed agent outputs from the dataset.
4.  **Define Evaluators:**
    *   Create three LLM-as-judge evaluator functions using `phoenix.evals.llm_classify`.
    *   Develop prompts for each evaluator (Tool Usage, SQL Correctness, Final Answer Quality) instructing the LLM to assess the agent's output based on the input query, agent's actions/results, and referencing the human-provided labels and explanations from the dataset.
5.  **Run Experiment:** Execute `phoenix.experiments.run_experiment`, passing the loaded dataset, the task function, and the list of defined evaluators.
6.  **Review Results:** Review the results (scores and LLM judge explanations) in the Phoenix UI experiment view.

In [1]:
# --- 1. Setup (explicit Phoenix cloud client configuration) ---

import warnings
warnings.filterwarnings('ignore') # Optional: suppress warnings

import phoenix as px
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments import run_experiment
from phoenix.experiments.types import Example

import pandas as pd
from datetime import datetime
import json
import os
import nest_asyncio
from dotenv import load_dotenv # Make sure dotenv is imported

# Apply nest_asyncio for environments like Jupyter
nest_asyncio.apply()

# --- Load Environment Variables ---
# Ensure this runs reliably near the start if not already done
if 'dotenv_loaded' not in locals(): # Simple flag to avoid reloading if already done
    print("Attempting to load .env file...")
    # Ensure your .env file has PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_CLIENT_HEADERS
    load_dotenv(verbose=True) # verbose=True shows which file is loaded
    dotenv_loaded = True # Set flag
    print("Finished loading .env file (if found).")
else:
    print(".env file assumed to be loaded previously.")
# --- End Load ---

# --- Configuration ---
# Define the API endpoint (base URL for cloud)
cloud_api_endpoint = "https://app.phoenix.arize.com"

# Get the required header string from environment
# This MUST be set correctly in your .env file: PHOENIX_CLIENT_HEADERS="api_key=YOUR_KEY_VALUE"
api_headers_str = os.getenv("PHOENIX_CLIENT_HEADERS")

if not api_headers_str:
    raise ValueError("CRITICAL: PHOENIX_CLIENT_HEADERS environment variable not found. Ensure it's set in your .env file and load_dotenv() ran.")

# Parse the header string ("api_key=value") into the required dictionary format
api_headers_dict = {}
try:
    key, value = api_headers_str.split('=', 1)
    parsed_key_name = key.strip()
    parsed_key_value = value.strip()
    if parsed_key_name != "api_key" or not parsed_key_value:
        raise ValueError("Parsed key name is not 'api_key' or value is empty.")
    api_headers_dict[parsed_key_name] = parsed_key_value # Store as dict {"api_key": "value"}
    print(f"Successfully parsed headers: Key='{parsed_key_name}'")
except Exception as parse_err:
    # Do not echo the raw header string here: it contains the API key.
    print(f"ERROR: Could not parse PHOENIX_CLIENT_HEADERS. Expected 'api_key=value' format. Error: {parse_err}")
    raise ValueError("Invalid PHOENIX_CLIENT_HEADERS format") from parse_err
# --- End Configuration ---


# --- Initialize Client Explicitly ---
print("\nInitializing Phoenix client explicitly for cloud...")
px_client = None # Initialize to None
try:
    # Initialize with explicit endpoint and headers arguments
    print(f"Attempting px.Client(endpoint='{cloud_api_endpoint}', headers=...)")
    px_client = px.Client(endpoint=cloud_api_endpoint, headers=api_headers_dict) # EXPLICIT INIT
    print("Phoenix client initialized successfully using explicit arguments.")
except Exception as e:
    print(f"ERROR initializing Phoenix Client explicitly: {e}")
    print("Check endpoint URL and header format/value in your .env file.")
    # px_client remains None
# --- End Initialization ---

# --- Initialize Judge LLM ---
if not os.getenv("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY not found in environment. Evaluation LLM might fail.")

print("\nInitializing evaluation LLM (GPT-4o)...")
eval_model = None # Initialize to None
try:
    eval_model = OpenAIModel(model="gpt-4o")
    print("Evaluation LLM initialized.")
except Exception as e:
    print(f"Error initializing OpenAIModel: {e}")
    # eval_model remains None
# --- End Judge LLM Init ---

print("\n--- Setup Cell Complete ---")

Attempting to load .env file...
Finished loading .env file (if found).
Successfully parsed headers: Key='api_key'

Initializing Phoenix client explicitly for cloud...
Attempting px.Client(endpoint='https://app.phoenix.arize.com', headers=...)
Phoenix client initialized successfully using explicit arguments.

Initializing evaluation LLM (GPT-4o)...
Evaluation LLM initialized.

--- Setup Cell Complete ---


## 2. Load Evaluation Dataset

Retrieve the specific `transcript-agent-eval-data-...` dataset previously uploaded to Phoenix. We need this dataset object to pass to the `run_experiment` function later.

In [2]:
# --- 2. Load Dataset ---

# Exact dataset name identified from the parse_spans.ipynb notebook output
dataset_name = "transcript-agent-eval-data-20250428-102511"

print(f"Attempting to load dataset '{dataset_name}'...")

# Load the specified dataset by its exact name
# This will raise an error if the dataset doesn't exist or px_client isn't initialized
evaluation_dataset = px_client.get_dataset(name=dataset_name)
print("Dataset loaded successfully.")

# Print number of examples
print(f"Number of examples in dataset: {len(evaluation_dataset)}")
if len(evaluation_dataset) != 17:
    print(f"Warning: Dataset contains {len(evaluation_dataset)} examples, but we expected 17 based on UI/previous note.")


Attempting to load dataset 'transcript-agent-eval-data-20250428-102511'...
Dataset loaded successfully.
Number of examples in dataset: 17


In [3]:
# --- Inspect Loaded Dataset ---

print(f"Type of loaded dataset object: {type(evaluation_dataset)}")

# Display the first example to check structure
if len(evaluation_dataset) > 0:
    print("\n--- First Example ---")
    first_example = evaluation_dataset[0]

    print("\nInput Data:")
    # Assumes first_example.input exists and is dict-like
    print(json.dumps(first_example.input, indent=2))

    print("\nOutput/Expected Data:")
    # Assumes first_example.output exists and is dict-like
    print(json.dumps(first_example.output, indent=2))

    print("\nMetadata:")
    # Assumes first_example.metadata exists and is dict-like
    print(json.dumps(first_example.metadata, indent=2))

else:
    print("Dataset appears to be empty.")

Type of loaded dataset object: <class 'phoenix.experiments.types.Dataset'>

--- First Example ---

Input Data:
{
  "tool_called": true,
  "user_query": "Who is Jeff Pidcock?",
  "generated_sql": "SELECT * FROM transcript_segments WHERE text LIKE '%Jeff Pidcock%'",
  "final_answer": "I cannot answer the question about who Jeff Pidcock is based on the available transcript data."
}

Output/Expected Data:
{
  "tool_usage_explanation": "The agent correctly identified that answering this question requires querying the database to find mentions of the name.",
  "sql_correctness_label": "Incorrect",
  "tool_usage_correctness_label": "Correct",
  "final_answer_quality_label": "Fail",
  "sql_correctness_explanation": "The specific SQL query (LIKE '%Jeff Pidcock%') failed functionally. It did not retrieve the existing mention of \"Jeff Pidcock\" from the transcript, most likely due to case sensitivity, making it an incorrect implementation for the task.",
  "final_answer_explanation": "The agent 

## 3. Define Task Function

The Phoenix `run_experiment` function is designed to run a specific "task" (like executing our agent) for each example in a dataset and then evaluate the result.

In our case, we've already run the agent and processed its results into our dataset (in the `input` fields like `final_answer`, `generated_sql`, etc.). However, the `run_experiment` function still requires *some* function to be passed as the "task".

So, we'll define a very simple "dummy" task function. Its only job is to take the `input` data provided for each example in our dataset and return it directly. This satisfies the structural requirement of `run_experiment` without re-running our agent. The evaluators we define later will then use this returned data.
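The data flow described above can be sketched without Phoenix at all. The snippet below is a minimal illustration, not the `run_experiment` implementation: `run_sketch`, `passthrough_task`, and `tool_flag_present` are hypothetical names, and examples are plain dictionaries with `input` and `output` keys standing in for Phoenix `Example` objects.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a dataset example: a plain dict with
# 'input' (pre-computed agent results) and 'output' (human labels).
ExampleDict = Dict[str, Dict[str, Any]]
Evaluator = Callable[[Dict[str, Any], Dict[str, Any]], float]


def passthrough_task(example: ExampleDict) -> Dict[str, Any]:
    """Return the pre-computed agent results instead of re-running the agent."""
    return example["input"]


def run_sketch(
    examples: List[ExampleDict],
    task: Callable[[ExampleDict], Dict[str, Any]],
    evaluators: Dict[str, Evaluator],
) -> List[Dict[str, float]]:
    """Run the task on each example, then hand its return value to every evaluator."""
    results = []
    for example in examples:
        output = task(example)  # what evaluators see as 'output'
        expected = example["output"]  # what evaluators see as 'expected'
        results.append({name: fn(output, expected) for name, fn in evaluators.items()})
    return results


def tool_flag_present(output: Dict[str, Any], expected: Dict[str, Any]) -> float:
    """Toy evaluator: 1.0 if the pass-through output carries a 'tool_called' flag."""
    return 1.0 if "tool_called" in output else 0.0


examples = [
    {
        "input": {"user_query": "Who is Jeff Pidcock?", "tool_called": True},
        "output": {"tool_usage_correctness_label": "Correct"},
    }
]
results = run_sketch(examples, passthrough_task, {"tool_flag_present": tool_flag_present})
print(results)
```

The point of the sketch is only that the task's return value becomes the evaluators' `output`, which is why a pass-through task is enough when the agent results are already stored in the dataset.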

In [4]:
# --- 3. Define Task Function ---
from phoenix.experiments.types import Example

def dummy_task_function(example: Example) -> dict:
    """
    This function acts as the 'task' for run_experiment.
    Since our agent outputs are already pre-computed and stored in the
    dataset's 'input' fields, this function simply returns that input data.
    The evaluators will receive this dictionary as their 'output' parameter.
    """
    # The input attribute of the Example object holds the dictionary
    # containing user_query, final_answer, generated_sql, tool_called.
    return example.input

# --- Quick test of the function ---
if len(evaluation_dataset) > 0:
    print("Testing dummy_task_function with the first example:")
    test_output = dummy_task_function(evaluation_dataset[0])
    print("Output from dummy task:")
    print(json.dumps(test_output, indent=2))
else:
    print("Skipping test, dataset is empty.")


Testing dummy_task_function with the first example:
Output from dummy task:
{
  "tool_called": true,
  "user_query": "Who is Jeff Pidcock?",
  "generated_sql": "SELECT * FROM transcript_segments WHERE text LIKE '%Jeff Pidcock%'",
  "final_answer": "I cannot answer the question about who Jeff Pidcock is based on the available transcript data."
}
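Because the evaluators read specific keys from this dictionary, it can help to check the structure before spending LLM calls. The following is a minimal sketch under assumptions: `missing_keys`, `REQUIRED_INPUT_KEYS`, and `NON_NULL_KEYS` are hypothetical helpers, and the key names are taken from the first example shown above.

```python
from typing import Any, Dict, List, Sequence

# Keys observed in the first example's input; treated here as the expected schema.
REQUIRED_INPUT_KEYS = ("user_query", "tool_called", "generated_sql", "final_answer")
# Keys that must also be non-null ('generated_sql' may be empty when no tool was called).
NON_NULL_KEYS = ("user_query", "tool_called")


def missing_keys(
    record: Dict[str, Any],
    required: Sequence[str] = REQUIRED_INPUT_KEYS,
    non_null: Sequence[str] = NON_NULL_KEYS,
) -> List[str]:
    """Return the keys that are absent, or present but None where a value is required."""
    problems = []
    for key in required:
        if key not in record:
            problems.append(key)
        elif key in non_null and record[key] is None:
            problems.append(key)
    return problems


complete = {
    "tool_called": True,
    "user_query": "Who is Jeff Pidcock?",
    "generated_sql": "SELECT * FROM transcript_segments WHERE text LIKE '%Jeff Pidcock%'",
    "final_answer": "I cannot answer the question about who Jeff Pidcock is based on the available transcript data.",
}
incomplete = {"user_query": None, "tool_called": False, "generated_sql": None}

print(missing_keys(complete))
print(missing_keys(incomplete))
```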


## 4. Define Evaluators

Now we define the functions that will evaluate the agent's performance for each example. We will use the `phoenix.evals.llm_classify` function to create LLM-as-judge evaluators for our three criteria: Tool Usage Correctness, SQL Correctness, and Final Answer Quality.

**Design Approach:**

We'll build each evaluator sequentially using a "Recipe & Chef" analogy:

1.  **Define the Prompt Template (The Recipe):** For each criterion, we'll first write the detailed instructions (the prompt template) telling the LLM *how* to perform the specific evaluation.
2.  **Define the Evaluator Function (The Chef):** Next, we'll create the Python function (the evaluator) that takes the data for an example, uses the corresponding prompt template (recipe), and manages the call to the LLM (the worker) via `llm_classify`.
3.  **Test the Evaluator:** We'll run a quick test on the first example to ensure the evaluator function works as expected.

We will repeat this Prompt -> Function -> Test sequence for each of our three evaluation criteria:

1.  **Tool Usage Correctness:** Was the decision to call the SQL tool (or not) appropriate?
2.  **SQL Correctness:** If the SQL tool was called, was the generated SQL query correct and effective?
3.  **Final Answer Quality:** Was the final text answer provided to the user clear, correct, and relevant?

Each evaluator function will ultimately return a score (e.g., 1.0 for success, 0.0 for failure) based on the LLM judge's assessment.
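The label-to-score step can be sketched on its own. This is one possible design rather than what the evaluators in this notebook do: the hypothetical `label_to_score` helper returns `None` for any label outside the allowed set, so that an unusable judge response is not silently counted as a failure.

```python
from typing import Optional, Sequence


def label_to_score(
    label: object,
    rails: Sequence[str] = ("Correct", "Incorrect"),
    positive: str = "Correct",
) -> Optional[float]:
    """Map a judge label to 1.0 or 0.0; return None if the label is not in the rails."""
    if not isinstance(label, str):
        return None
    cleaned = label.strip()
    if cleaned not in rails:
        return None
    return 1.0 if cleaned == positive else 0.0


for candidate in ["Correct", "Incorrect", "Maybe", None]:
    print(repr(candidate), "->", label_to_score(candidate))
```

Keeping "could not read the label" separate from "Incorrect" makes it easier to tell a judging problem from an agent problem when reviewing scores.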

In [5]:
# --- 4a. Prompt Template: Tool Usage Correctness ---

TOOL_USAGE_PROMPT_TEMPLATE = """
You are evaluating an AI agent's decision on whether to use a specific tool ('query_database') to answer a user's query about a workshop transcript.
The agent has access to a database table 'transcript_segments'.

**Instructions:**
1. Analyze the User Query.
2. Analyze the Agent's Action: Did the agent call the 'query_database' tool? (indicated by 'tool_called' flag and presence/absence of 'generated_sql').
3. Determine if the Agent's Action was Correct: Should the agent have used the tool to answer this query effectively? Consider if the query asks for specific factual information likely in the transcript vs. general knowledge or conversational elements.
4. Compare your assessment to the Human Label and Explanation provided (for context, but make your own judgment).
5. Output a final LABEL ('Correct' or 'Incorrect') based *only* on your assessment of the agent's action.
6. Provide a detailed EXPLANATION for your label, referencing the query and the agent's action.

**Input Data:**
User Query: {user_query}
Agent Called Tool ('query_database'): {tool_called}
Agent Generated SQL (if tool called): {generated_sql}

**Reference (Human Annotation):**
Human Label: {tool_usage_correctness_label}
Human Explanation: {tool_usage_explanation}

**Your Task:**
Based *only* on the User Query and the Agent's Action, was the decision to use (or not use) the 'query_database' tool correct?

EXPLANATION: [Provide your reasoning here]
LABEL: [Correct or Incorrect]
"""

print("Tool Usage Prompt Template defined.")
# print(TOOL_USAGE_PROMPT_TEMPLATE) # Optional: uncomment to view the template


Tool Usage Prompt Template defined.


In [6]:
# --- 4b. Evaluator Function: Tool Usage Correctness ---

def evaluate_tool_usage(output: dict, expected: dict, input: dict, model_to_use: OpenAIModel) -> float:
    """
    Evaluates Tool Usage Correctness using LLM-as-judge based on TOOL_USAGE_PROMPT_TEMPLATE.

    Args:
        output (dict): The dictionary returned by the dummy_task_function.
                       Contains 'user_query', 'generated_sql', 'tool_called'.
        expected (dict): The dictionary containing the expected outputs (human labels/explanations).
                         Contains 'tool_usage_correctness_label', 'tool_usage_explanation'.
        input (dict): The dictionary containing the original input keys.
        model_to_use (OpenAIModel): The initialized OpenAIModel instance for the judge.

    Returns:
        float: Score (1.0 for Correct, 0.0 for Incorrect based on LLM judge).
               Returns 0.0 if evaluation fails or label is missing from LLM response.
    """
    # Prepare data for the prompt template
    user_query = output.get('user_query')
    tool_called = output.get('tool_called')
    # `or` also covers a key that is present with a None value;
    # a .get() default alone only applies when the key is absent.
    generated_sql = output.get('generated_sql') or 'N/A'
    human_label = expected.get('tool_usage_correctness_label')
    human_explanation = expected.get('tool_usage_explanation')

    # Check if essential inputs for the LLM are present
    if user_query is None or tool_called is None:
        print("Warning: Missing essential input (query or tool_called) for Tool Usage eval. Returning 0.0")
        return 0.0

    # Create a one-row DataFrame for llm_classify; column names must match the template variables
    eval_df = pd.DataFrame([{
        "user_query": user_query,
        "tool_called": tool_called,
        "generated_sql": generated_sql,
        "tool_usage_correctness_label": human_label if human_label is not None else "N/A",
        "tool_usage_explanation": human_explanation if human_explanation is not None else "N/A"
    }])

    # Call LLM judge using the template defined previously, passing the specific model
    response = llm_classify(
        data=eval_df,
        template=TOOL_USAGE_PROMPT_TEMPLATE, # Uses the variable defined earlier
        model=model_to_use, # Use the passed-in model
        rails=["Correct", "Incorrect"], # Expected LLM output labels
        provide_explanation=True
    )

    # Extract the label assigned by the LLM judge
    try:
        llm_label = response['label'].iloc[0]
        score = 1.0 if llm_label == 'Correct' else 0.0
        return score
    except (IndexError, KeyError, TypeError) as e:
        print(f"Error parsing LLM response for Tool Usage: {e}. Response: {response}")
        return 0.0 # Score as incorrect if LLM response is malformed

print("Evaluator function 'evaluate_tool_usage' defined (updated to accept model).")

Evaluator function 'evaluate_tool_usage' defined (updated to accept model).


In [7]:
# --- 4c. Test: Tool Usage Evaluator ---

# Ensure the dataset object exists and has examples
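# eval_model is set to None in the Setup cell if initialization failed, so check its value too
if 'evaluation_dataset' in locals() and len(evaluation_dataset) > 0 and 'eval_model' in locals() and eval_model is not None: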
    print("Testing Tool Usage evaluator with the first example:")
    # Get the necessary parts from the first example
    first_example = evaluation_dataset[0]
    test_output = dummy_task_function(first_example) # Use dummy task to get 'output' dict
    test_expected = first_example.output # Ground truth labels/explanations
    test_input = first_example.input # Contains original inputs

    # Call the evaluator function (Corrected Line Below - passing model)
    try:
        # Ensure eval_model is passed to the updated function
        score = evaluate_tool_usage(output=test_output,
                                    expected=test_expected,
                                    input=test_input,
                                    model_to_use=eval_model) # Pass eval_model here
        print(f"LLM Judge Score for Tool Usage (First Example): {score}")
        print("(Score reflects LLM judgment: 1.0 for 'Correct', 0.0 for 'Incorrect')")
    except Exception as e:
        print(f"An error occurred during the evaluate_tool_usage test: {e}")
        import traceback
        traceback.print_exc()


    # You can also inspect the individual components passed to the evaluator
    # print("\nData passed to evaluator:")
    # print("Output (from dummy task):", json.dumps(test_output, indent=2))
    # print("Expected (human labels):", json.dumps(test_expected, indent=2))
    # print("Input (original):", json.dumps(test_input, indent=2))

elif 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
    print("Skipping test, evaluation_dataset not loaded or is empty.")
else: # This means eval_model is missing
    print("Skipping test, eval_model not found. Ensure the Setup cell was run.")


Testing Tool Usage evaluator with the first example:



LLM Judge Score for Tool Usage (First Example): 0.0
(Score reflects LLM judgment: 1.0 for 'Correct', 0.0 for 'Incorrect')


In [8]:
# --- Introspect Tool Usage on First 5 Examples (Show Explanations) ---
import pandas as pd # Make sure pandas is imported if not already

num_examples_to_test = 5
print(f"Inspecting Tool Usage LLM response for first {num_examples_to_test} examples...")

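# eval_model is set to None in the Setup cell if initialization failed, so check its value too
if 'evaluation_dataset' in locals() and len(evaluation_dataset) > 0 and 'eval_model' in locals() and eval_model is not None: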
    for i in range(min(num_examples_to_test, len(evaluation_dataset))):
        print(f"\n--- Processing Example {i} ---")
        example = evaluation_dataset[i]
        output_data = dummy_task_function(example) # Agent's output from dataset
        expected_data = example.output           # Human labels from dataset
        input_data = example.input               # Original input query etc.

        # Prepare data for llm_classify (similar to inside evaluate_tool_usage)
        user_query = output_data.get('user_query')
        tool_called = output_data.get('tool_called')
        generated_sql = output_data.get('generated_sql') or 'N/A'  # also covers an explicit None value
        human_label = expected_data.get('tool_usage_correctness_label', 'N/A')
        human_explanation = expected_data.get('tool_usage_explanation', 'N/A')

        if user_query is None or tool_called is None:
             print(f"  Skipping Example {i}: Missing essential input (query or tool_called).")
             continue

        eval_df = pd.DataFrame([{
            "user_query": user_query,
            "tool_called": tool_called,
            "generated_sql": generated_sql,
            "tool_usage_correctness_label": human_label,
            "tool_usage_explanation": human_explanation
        }])

        try:
            # Call llm_classify directly HERE within the test cell
            response_df = llm_classify(
                data=eval_df,
                template=TOOL_USAGE_PROMPT_TEMPLATE, # Use the existing template
                model=eval_model,                   # Use the existing model
                rails=["Correct", "Incorrect"],
                provide_explanation=True
            )

            # Extract score AND explanation from the response DataFrame
            llm_label = response_df['label'].iloc[0]
            explanation = response_df['explanation'].iloc[0]
            score = 1.0 if llm_label == 'Correct' else 0.0

            print(f"  Example {i}: Score={score}, LLM Explanation: {explanation}")

        except Exception as e:
            print(f"  Example {i}: ERROR during llm_classify call: {e}")
            # import traceback
            # traceback.print_exc() # Uncomment for full error details if needed

else:
    print("Skipping test - dataset or eval_model not loaded.")

print("\n--- End Inspection ---")

Inspecting Tool Usage LLM response for first 5 examples...

--- Processing Example 0 ---



  Example 0: Score=0.0, LLM Explanation: The user query asks for information about 'Jeff Pidcock'. This is a specific factual query that likely requires information from the workshop transcript to answer accurately. The agent's decision to use the 'query_database' tool is appropriate because it allows the agent to search the transcript for mentions of 'Jeff Pidcock' and provide a precise answer based on the content of the transcript. The generated SQL query is correctly formulated to search for any segments in the transcript that mention 'Jeff Pidcock'. Therefore, the agent's action to call the tool is correct.

--- Processing Example 1 ---



  Example 1: Score=0.0, LLM Explanation: The user query specifically asks for what Stefan Krawczyk said during his introduction. This is a request for specific factual information that would be found in the transcript of the workshop. The agent correctly decided to use the 'query_database' tool to retrieve this information, as it involves searching for a specific speaker and context within the transcript. The generated SQL query is appropriately designed to find segments where Stefan Krawczyk is the speaker and the content is related to his introduction. Therefore, the agent's action to call the tool was correct.

--- Processing Example 2 ---



  Example 2: Score=0.0, LLM Explanation: The user query asks for a list of all unique speakers mentioned in the workshop transcript. This is a request for specific factual information that is likely stored in the 'transcript_segments' database table. The agent correctly decided to use the 'query_database' tool to retrieve this information. The generated SQL query, 'SELECT DISTINCT speaker FROM transcript_segments', is appropriate for obtaining a list of unique speakers, as it selects distinct speaker names from the table. Therefore, the agent's action to call the tool and the SQL query generated are both correct for answering the user's query.

--- Processing Example 3 ---



  Example 3: Score=0.0, LLM Explanation: The user query asks for the total number of words spoken by a specific speaker, Hugo, in the workshop transcript. This is a factual query that requires specific data from the transcript, specifically the aggregation of word counts for Hugo. The agent correctly decided to use the 'query_database' tool to retrieve this information, as it involves summing up the word counts from the database table 'transcript_segments' where the speaker is Hugo. The generated SQL query is appropriate for this task, as it selects the sum of the word counts for the specified speaker. Therefore, the agent's action to call the tool was correct.

--- Processing Example 4 ---



  Example 4: Score=0.0, LLM Explanation: The user query specifically requests to find segments in a transcript that mention the word 'evaluation' and to provide the timestamps for these segments. This is a request for specific factual information that is likely stored in the 'transcript_segments' database table. The agent correctly called the 'query_database' tool and generated an appropriate SQL query to retrieve the relevant data. The SQL query is designed to search for the keyword 'evaluation' in the text of the transcript segments and return the start and end times along with the text, which aligns perfectly with the user's request. Therefore, the agent's action to use the tool was appropriate and necessary to fulfill the query.

--- End Inspection ---


In [9]:
# --- Define Revised Prompt: Tool Usage Correctness ---

REVISED_TOOL_USAGE_PROMPT_TEMPLATE = """
You are evaluating an AI agent's decision on whether to use a specific tool ('query_database') to answer a user's query about a workshop transcript.
The agent has access to a database table 'transcript_segments'.

**Instructions:**
1. Analyze the User Query: What information is the user asking for?
2. Analyze the Agent's Action: Did the agent call the 'query_database' tool? (indicated by 'tool_called' flag).
3. Determine if the Agent's Action was Correct: Based ONLY on the User Query, should the agent have used the 'query_database' tool to answer effectively?
    - 'Correct': Tool usage is appropriate if the query asks for specific factual information likely only found within the transcript data.
    - 'Incorrect': Tool usage is inappropriate if the query is conversational, asks for general knowledge, or could be answered without accessing the transcript data.

**Input Data:**
User Query: {user_query}
Agent Called Tool ('query_database'): {tool_called}
Agent Generated SQL (if tool called): {generated_sql}

**Your Task:**
Based *only* on the User Query and the Agent's Action, was the decision to use (or not use) the 'query_database' tool correct?

EXPLANATION: [Provide your reasoning here, focusing only on the query and the agent's action.]
LABEL: [Correct or Incorrect]
"""

print("Revised Tool Usage Prompt Template defined.")
# print(REVISED_TOOL_USAGE_PROMPT_TEMPLATE) # Optional: uncomment to view

Revised Tool Usage Prompt Template defined.


In [10]:
# --- Test Tool Usage on First 5 Examples (Using REVISED Prompt) ---
import pandas as pd # Ensure pandas is imported

num_examples_to_test = 5
print(f"Testing Tool Usage with REVISED prompt for first {num_examples_to_test} examples...")

if 'evaluation_dataset' in locals() and len(evaluation_dataset) > 0 and 'eval_model' in locals() and 'REVISED_TOOL_USAGE_PROMPT_TEMPLATE' in locals():
    for i in range(min(num_examples_to_test, len(evaluation_dataset))):
        print(f"\n--- Processing Example {i} ---")
        example = evaluation_dataset[i]
        output_data = dummy_task_function(example)
        expected_data = example.output # Still needed if you want to compare later
        input_data = example.input

        # Prepare data for llm_classify
        user_query = output_data.get('user_query')
        tool_called = output_data.get('tool_called')
        generated_sql = output_data.get('generated_sql') or 'N/A'  # also covers an explicit None value
        # Note: We don't need human labels for the prompt input anymore, but keep for potential comparison
        human_label = expected_data.get('tool_usage_correctness_label', 'N/A')
        human_explanation = expected_data.get('tool_usage_explanation', 'N/A')


        if user_query is None or tool_called is None:
             print(f"  Skipping Example {i}: Missing essential input.")
             continue

        # DataFrame still includes human labels, though not used in revised prompt
        eval_df = pd.DataFrame([{
            "user_query": user_query,
            "tool_called": tool_called,
            "generated_sql": generated_sql,
            "tool_usage_correctness_label": human_label,
            "tool_usage_explanation": human_explanation
        }])

        try:
            # Call llm_classify directly using the REVISED template
            response_df = llm_classify(
                data=eval_df,
                template=REVISED_TOOL_USAGE_PROMPT_TEMPLATE, # Use the new template variable
                model=eval_model,
                rails=["Correct", "Incorrect"],
                provide_explanation=True
            )

            llm_label = response_df['label'].iloc[0]
            explanation = response_df['explanation'].iloc[0]
            score = 1.0 if llm_label == 'Correct' else 0.0

            print(f"  Example {i}: Score={score}, LLM Explanation: {explanation}")
            # You could add a comparison here if desired:
            # print(f"    (Human Label was: {human_label})")

        except Exception as e:
            print(f"  Example {i}: ERROR during llm_classify call: {e}")

else:
    missing = []
    if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
        missing.append("dataset")
    if 'eval_model' not in locals():
        missing.append("eval_model")
    if 'REVISED_TOOL_USAGE_PROMPT_TEMPLATE' not in locals():
        missing.append("revised prompt template")
    print(f"Skipping test - required components not loaded: {', '.join(missing)}")


print("\n--- End Revised Prompt Test ---")

Testing Tool Usage with REVISED prompt for first 5 examples...

--- Processing Example 0 ---



  Example 0: Score=0.0, LLM Explanation: The user query asks for information about 'Jeff Pidcock'. This is a specific factual query that likely requires accessing the transcript data to find relevant information about this individual. The agent's decision to use the 'query_database' tool to search for mentions of 'Jeff Pidcock' in the transcript is appropriate, as it is the most direct way to obtain accurate and specific information about him from the workshop transcript.

--- Processing Example 1 ---



  Example 1: Score=0.0, LLM Explanation: The user query specifically asks for what Stefan Krawczyk said during his introduction. This is a request for specific factual information that would be found in the transcript data. The agent's decision to use the 'query_database' tool to retrieve this information is appropriate, as it allows the agent to access the exact words spoken by Stefan Krawczyk during his introduction, which is likely stored in the transcript segments.

--- Processing Example 2 ---



  Example 2: Score=0.0, LLM Explanation: The user query asks for a list of all unique speakers mentioned in the workshop transcript. This is a specific factual request that requires accessing the transcript data to identify and list the unique speakers. The agent's decision to call the 'query_database' tool and execute a SQL query to retrieve distinct speaker names from the 'transcript_segments' table is appropriate and necessary to fulfill the user's request. Therefore, the agent's action to use the tool is correct.

--- Processing Example 3 ---



  Example 3: Score=0.0, LLM Explanation: The user query asks for the total number of words spoken by a specific individual, Hugo, in a workshop transcript. This is a specific factual question that requires accessing the transcript data to calculate the total word count for Hugo. The agent's decision to use the 'query_database' tool is appropriate because the information needed to answer the query is likely stored in the database table 'transcript_segments'. Therefore, the agent's action to call the tool and generate the SQL query to sum the word count for Hugo is correct.

--- Processing Example 4 ---


llm_classify |          | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s

  Example 4: Score=0.0, LLM Explanation: The user query specifically asks for segments mentioning 'evaluation' along with their timestamps. This is a request for specific factual information that is likely only available in the transcript data. The agent's decision to use the 'query_database' tool to retrieve this information is appropriate, as it allows the agent to search the transcript segments for mentions of 'evaluation' and provide the corresponding timestamps, which cannot be answered without accessing the database.

--- End Revised Prompt Test ---


## Attempting Direct LLM Call for Tool Usage Evaluation

The previous attempts using `phoenix.evals.llm_classify` with various prompt refinements (`TOOL_USAGE_PROMPT_TEMPLATE`, `REVISED_TOOL_USAGE_PROMPT_TEMPLATE`, `REVISED_TOOL_USAGE_PROMPT_TEMPLATE_V2`) consistently produced contradictory results. The LLM's generated explanations indicated correct reasoning about tool usage appropriateness, but the final classification label forced by the `rails=["Correct", "Incorrect"]` parameter was persistently 'Incorrect' (Score=0.0).

This suggests a potential issue either with how `llm_classify` handles the rails in conjunction with the explanation for this specific task, or a deeper problem with the LLM's ability to follow the structured output format reliably under these conditions.

To isolate the problem, we will now bypass `llm_classify` and directly query the evaluation LLM (`gpt-4o`) using `REVISED_TOOL_USAGE_PROMPT_TEMPLATE` (the V1 revision, which is what the cell below formats). We will manually inspect the raw output to see whether the explanation and the final label align when generated without the constraints of the `llm_classify` framework. This will help determine whether the core LLM can perform the task correctly when called directly.

In [11]:
# --- Direct OpenAI API Call Test for Tool Usage (Example 0, V1 Prompt) ---
# Bypasses Phoenix entirely. Requires 'openai' library and OPENAI_API_KEY env var.

import os
import re
import json
from openai import OpenAI

print("Performing DIRECT OpenAI API call test (Example 0, V1 Prompt)...")

# --- Configuration ---
MODEL_TO_USE = "gpt-4o" # Specify the model
# --- End Configuration ---

# Ensure necessary components are available (Dataset and Prompt V1)
missing = []
if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
    missing.append("evaluation_dataset")
if 'REVISED_TOOL_USAGE_PROMPT_TEMPLATE' not in locals():
    missing.append("REVISED_TOOL_USAGE_PROMPT_TEMPLATE (V1)")
if "OPENAI_API_KEY" not in os.environ:
    missing.append("OPENAI_API_KEY environment variable")

if not missing:
    # Initialize OpenAI client directly
    try:
        client = OpenAI()
        print(f"OpenAI client initialized for model: {MODEL_TO_USE}")
    except Exception as e:
        print(f"Error initializing OpenAI client: {e}")
        client = None

    if client:
        example_index = 0
        example = evaluation_dataset[example_index]
        # Get input data robustly
        if 'dummy_task_function' in locals():
            output_data = dummy_task_function(example)
        elif hasattr(example, 'input'):
             output_data = example.input
        else:
            print(f"  Skipping Example {example_index}: Cannot access input data.")
            output_data = None

        if output_data:
            expected_data = example.output if hasattr(example, 'output') else {}

            # Prepare data
            user_query = output_data.get('user_query')
            tool_called = output_data.get('tool_called')
            generated_sql = output_data.get('generated_sql', 'N/A')
            human_label = expected_data.get('tool_usage_correctness_label', 'N/A')

            if user_query is None or tool_called is None:
                 print(f"  Skipping Example {example_index}: Missing essential input fields (user_query or tool_called).")
            else:
                # Format the V1 prompt
                formatted_prompt = REVISED_TOOL_USAGE_PROMPT_TEMPLATE.format(
                    user_query=user_query,
                    tool_called=tool_called,
                    generated_sql=generated_sql
                )

                print("\n--- Prompt Sent to OpenAI API (V1) ---")
                print(formatted_prompt)
                print("------------------------------------")

                try:
                    # Make the direct API call
                    response = client.chat.completions.create(
                        model=MODEL_TO_USE,
                        messages=[{"role": "user", "content": formatted_prompt}],
                        temperature=0.0,
                        # max_tokens=250 # Optional: limit response length
                    )
                    raw_output = response.choices[0].message.content

                    print("\n--- Raw OpenAI API Response ---")
                    print(raw_output)
                    print("-----------------------------")

                    # Simple parsing attempt
                    explanation_match = re.search(r"EXPLANATION:\s*(.*?)\s*LABEL:", raw_output, re.DOTALL | re.IGNORECASE)
                    label_match = re.search(r"LABEL:\s*(\w+)", raw_output, re.IGNORECASE)

                    extracted_explanation = explanation_match.group(1).strip() if explanation_match else "Parsing failed"
                    extracted_label = label_match.group(1).strip() if label_match else "Parsing failed"

                    print("\n--- Parsed Output (V1 Prompt) ---")
                    print(f"Extracted Explanation: {extracted_explanation}")
                    print(f"Extracted Label: {extracted_label}")
                    print("---------------------------------")

                    print("\n>>> Please manually check if the Extracted Label ('Correct'/'Incorrect') logically follows the Extracted Explanation.")
                    print(f"    (For reference, Human Label was: {human_label})")

                except Exception as e:
                    print(f"\n--- ERROR during OpenAI API call ---")
                    print(e)
                    import traceback
                    traceback.print_exc()
                    print("------------------------------------")
else:
    print(f"Skipping test - required components not loaded: {', '.join(missing)}")


print("\n--- End Direct OpenAI API Call Test ---")

Performing DIRECT OpenAI API call test (Example 0, V1 Prompt)...
OpenAI client initialized for model: gpt-4o

--- Prompt Sent to OpenAI API (V1) ---

You are evaluating an AI agent's decision on whether to use a specific tool ('query_database') to answer a user's query about a workshop transcript.
The agent has access to a database table 'transcript_segments'.

**Instructions:**
1. Analyze the User Query: What information is the user asking for?
2. Analyze the Agent's Action: Did the agent call the 'query_database' tool? (indicated by 'tool_called' flag).
3. Determine if the Agent's Action was Correct: Based ONLY on the User Query, should the agent have used the 'query_database' tool to answer effectively?
    - 'Correct': Tool usage is appropriate if the query asks for specific factual information likely only found within the transcript data.
    - 'Incorrect': Tool usage is inappropriate if the query is conversational, asks for general knowledge, or could be answered without accessin

In [12]:
# --- Evaluate ALL Examples Directly via OpenAI API (No Progress Bar) ---
# Focuses on core logic, minimal error handling, NO tqdm dependency.

import os
import re
import json
from openai import OpenAI
import pandas as pd
# Removed: from tqdm.notebook import tqdm

print("Evaluating ALL examples via DIRECT OpenAI API call (No Progress Bar)...")

# --- Configuration ---
MODEL_TO_USE = "gpt-4o"
PROMPT_TEMPLATE = REVISED_TOOL_USAGE_PROMPT_TEMPLATE # Assumes V1 is defined
# --- End Configuration ---

# List to store results
evaluation_results = []

# Initialize OpenAI client (basic check)
try:
    client = OpenAI()
    print(f"OpenAI client initialized for model: {MODEL_TO_USE}")
except Exception as e:
    print(f"STOPPING: Failed to initialize OpenAI Client: {e}. Make sure OPENAI_API_KEY is set.")
    client = None # Ensure client is None if init fails

if client and 'evaluation_dataset' in locals() and PROMPT_TEMPLATE:
    print(f"Processing {len(evaluation_dataset)} examples...")

    # Removed tqdm wrapper from the loop
    for i, example in enumerate(evaluation_dataset):
        print(f"Processing Example {i}...") # Simple print indicator instead of progress bar

        # --- 1. Get Data ---
        user_query = example.input.get('user_query', 'N/A')
        tool_called = example.input.get('tool_called', None)
        generated_sql = example.input.get('generated_sql', 'N/A')
        human_label = example.output.get('tool_usage_correctness_label', 'N/A')

        llm_label = "Skipped"
        llm_explanation = "Skipped due to missing input"

        if user_query != 'N/A' and tool_called is not None:
            # --- 2. Format Prompt ---
            prompt = PROMPT_TEMPLATE.format(
                user_query=user_query,
                tool_called=tool_called,
                generated_sql=generated_sql
            )

            try:
                # --- 3. Call API ---
                response = client.chat.completions.create(
                    model=MODEL_TO_USE,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.0,
                )
                raw_output = response.choices[0].message.content

                # --- 4. Parse Result ---
                explanation_match = re.search(r"EXPLANATION:\s*(.*?)\s*LABEL:", raw_output, re.DOTALL | re.IGNORECASE)
                label_match = re.search(r"LABEL:\s*(\w+)", raw_output, re.IGNORECASE)

                llm_explanation = explanation_match.group(1).strip() if explanation_match else "Parsing failed"
                llm_label = label_match.group(1).strip().capitalize() if label_match else "Parsing failed"

            except Exception as e:
                # Minimal error handling for API call failure
                print(f"  API Call Error on Example {i}: {e}")
                llm_label = "API Error"
                llm_explanation = f"API Call Error: {e}"

        # --- 5. Store Essentials ---
        evaluation_results.append({
            "index": i,
            "query": user_query,
            "human_label": human_label,
            "llm_label": llm_label,
            "llm_explanation": llm_explanation,
        })

    # --- Convert to DataFrame ---
    print("\nEvaluation complete. Creating DataFrame...")
    results_df_final = pd.DataFrame(evaluation_results)

    # --- Display DataFrame ---
    print("Direct API Evaluation Results (No Progress Bar):")
    pd.set_option('display.max_rows', 50)
    pd.set_option('display.max_colwidth', 150)
    display(results_df_final)

else:
    if not client:
        print("Evaluation skipped because OpenAI client failed to initialize.")
    else:
        print("Evaluation skipped - check dataset and prompt template definitions.")

print("\n--- End Direct OpenAI API Full Evaluation (No Progress Bar) ---")

Evaluating ALL examples via DIRECT OpenAI API call (No Progress Bar)...
OpenAI client initialized for model: gpt-4o
Processing 17 examples...
Processing Example 0...
Processing Example 1...
Processing Example 2...
Processing Example 3...
Processing Example 4...
Processing Example 5...
Processing Example 6...
Processing Example 7...
Processing Example 8...
Processing Example 9...
Processing Example 10...
Processing Example 11...
Processing Example 12...
Processing Example 13...
Processing Example 14...
Processing Example 15...
Processing Example 16...

Evaluation complete. Creating DataFrame...
Direct API Evaluation Results (No Progress Bar):


Unnamed: 0,index,query,human_label,llm_label,llm_explanation
0,0,Who is Jeff Pidcock?,Correct,Correct,"The user query ""Who is Jeff Pidcock?"" is asking for specific information about an individual named Jeff Pidcock. This type of query typically requ..."
1,1,What did Stefan Krawczyk say during his introduction?,Correct,Correct,The user query specifically asks for what Stefan Krawczyk said during his introduction. This is a request for specific factual information that is...
2,2,List all unique speakers mentioned.,Correct,Correct,The user query asks for a list of all unique speakers mentioned in the workshop transcript. This is a request for specific factual information tha...
3,3,How many words did Hugo speak in total?,Correct,Correct,The user query asks for a specific factual piece of information: the total number of words spoken by Hugo. This type of query requires accessing d...
4,4,Find segments mentioning 'evaluation' and provide timestamps.,Correct,Correct,The user query specifically asks for segments mentioning the word 'evaluation' along with their timestamps. This request requires searching throug...
5,5,Which speaker has the most segments?,Correct,Correct,The user query asks for specific factual information about which speaker has the most segments in a workshop transcript. This type of information ...
6,6,What is the total word count for all segments combined?,Correct,Parsing failed,** The user query asks for the total word count for all segments combined. This is a specific factual request that requires aggregating data from ...
7,7,Who mentioned Carvana?,Correct,Correct,"The user query asks for specific information about who mentioned ""Carvana"" in a workshop transcript. This type of query requires accessing the tra..."
8,8,List the builders in residence mentioned.,Correct,Correct,"The user query specifically asks for a list of ""builders in residence"" mentioned, which is a request for specific factual information that is like..."
9,9,When did Nathan Danielsen first speak?,Correct,Correct,The user query asks for specific factual information about when Nathan Danielsen first spoke during a workshop. This type of information is likely...



--- End Direct OpenAI API Full Evaluation (No Progress Bar) ---
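
Row 6 above shows `Parsing failed` even though its explanation reads like a normal answer. The leading `**` suggests the model wrapped the field names in Markdown bold (for example `**LABEL:** Correct`), which the pattern `LABEL:\s*(\w+)` does not match. The raw response was not saved, so this cause is an assumption. Under that assumption, a more tolerant parser could look like the sketch below; `parse_judge_output` is a hypothetical helper name, not part of Phoenix or the OpenAI SDK.

```python
import re
from typing import Optional, Tuple

# Characters that may surround field names or values when the model adds Markdown emphasis.
_DECOR = r"[\s*_]*"
_EXPLANATION_RE = re.compile(
    rf"EXPLANATION:{_DECOR}(.*?){_DECOR}LABEL:", re.DOTALL | re.IGNORECASE
)
_LABEL_RE = re.compile(rf"LABEL:{_DECOR}([A-Za-z]+)", re.IGNORECASE)


def parse_judge_output(raw_output: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (explanation, label); either element is None when it cannot be found."""
    explanation_match = _EXPLANATION_RE.search(raw_output)
    label_match = _LABEL_RE.search(raw_output)
    explanation = explanation_match.group(1).strip() if explanation_match else None
    label = label_match.group(1).capitalize() if label_match else None
    return explanation, label


# Plain and bold-decorated responses parse to the same result.
print(parse_judge_output("EXPLANATION: The tool was needed.\nLABEL: Correct"))
print(parse_judge_output("**EXPLANATION:** The tool was needed.\n\n**LABEL:** Correct"))
```

Returning `None` instead of the string `"Parsing failed"` keeps parse failures distinguishable from a genuine `Incorrect` label when the results are later compared with human annotations.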


## Defining a Custom Evaluator Function for Tool Usage (Direct API Call)

Our previous attempts to evaluate Tool Usage using `phoenix.evals.llm_classify` resulted in persistent inconsistencies: the LLM judge's explanations suggested correct reasoning, but the final label forced by the `rails=["Correct", "Incorrect"]` parameter was always 'Incorrect'.

We subsequently tested making direct calls to the OpenAI API (`gpt-4o`) using the simplified `REVISED_TOOL_USAGE_PROMPT_TEMPLATE`. This approach **worked correctly**: in the results shown above, the explanations and labels were consistent with each other and matched the human annotations, with the exception of example 6, where the label could not be parsed from the response.

Therefore, we will now define a new evaluator function, `evaluate_tool_usage_direct_api`, that encapsulates this successful direct API call logic. This function will:

1.  Accept the `output` (from the agent/task function) and `expected` (from the dataset) dictionaries as input, following the standard Phoenix evaluator signature.
2.  Extract the necessary fields (`user_query`, `tool_called`, `generated_sql`).
3.  Format the `REVISED_TOOL_USAGE_PROMPT_TEMPLATE`.
4.  Call the OpenAI API directly.
5.  Parse the response to get the LLM's label ('Correct'/'Incorrect').
6.  Return a float score (1.0 for 'Correct'; 0.0 for 'Incorrect', a parsing failure, a skipped example, or an API error).

This allows us to replace the faulty `llm_classify`-based evaluator with our custom, validated logic while potentially still using the `phoenix.experiments.run_experiment` framework for overall execution and integration with other evaluators (like SQL and Final Answer quality, which we will define next).

In [13]:
# --- Phoenix-Compatible Evaluator: Tool Usage (Direct API - Returns Float) ---
# Corrected based on L11(1).ipynb: Accepts output/expected, returns float score.

import os
import re
import json
from openai import OpenAI
# Removed Score object import - not needed based on reference
import logging

# Configure logging (optional)
logging.basicConfig(level=logging.WARNING, format='%(levelname)s: %(message)s')

# --- Configuration ---
MODEL_TO_USE = "gpt-4o"
# Assumes REVISED_TOOL_USAGE_PROMPT_TEMPLATE is defined globally
if 'REVISED_TOOL_USAGE_PROMPT_TEMPLATE' not in globals():
    logging.error("STOPPING: REVISED_TOOL_USAGE_PROMPT_TEMPLATE not found.")
    REVISED_TOOL_USAGE_PROMPT_TEMPLATE = "Prompt not defined"

# Initialize OpenAI client once
try:
    openai_client = OpenAI()
    # logging.info(f"OpenAI client initialized for model: {MODEL_TO_USE}") # Less verbose
except Exception as e:
    logging.error(f"Failed to initialize OpenAI Client: {e}. Check API key.")
    openai_client = None
# --- End Configuration ---


def evaluate_tool_usage_direct_api(output: dict, expected: dict) -> float:
    """
    Evaluates tool usage correctness via direct OpenAI API calls.
    Returns a float score (1.0 for Correct, 0.0 for Incorrect/Error).
    Compatible with phoenix.experiments.run_experiment evaluators list.

    Args:
        output: Dictionary containing agent outputs (user_query, tool_called, etc.).
        expected: Dictionary containing expected outputs/labels from the dataset.

    Returns:
        A float score (0.0 or 1.0).
    """
    global openai_client, REVISED_TOOL_USAGE_PROMPT_TEMPLATE

    score = 0.0 # Default score for errors/skips/Incorrect

    if not openai_client:
        logging.warning("Skipping evaluation: OpenAI client not initialized")
        return score

    if not REVISED_TOOL_USAGE_PROMPT_TEMPLATE or REVISED_TOOL_USAGE_PROMPT_TEMPLATE == "Prompt not defined":
         logging.warning("Skipping evaluation: Prompt template not defined")
         return score

    # --- 1. Get Data ---
    user_query = output.get('user_query')
    tool_called = output.get('tool_called')
    generated_sql = output.get('generated_sql', 'N/A')

    if user_query is None or tool_called is None:
        logging.warning("Skipping evaluation: Missing 'user_query' or 'tool_called' in output")
        return score

    # --- 2. Format Prompt ---
    try:
        prompt = REVISED_TOOL_USAGE_PROMPT_TEMPLATE.format(
            user_query=user_query,
            tool_called=tool_called,
            generated_sql=generated_sql
        )
    except KeyError as e:
        logging.warning(f"Skipping evaluation: Error formatting prompt - missing key {e}")
        return score # Return 0.0 on formatting error

    # --- 3. Call API & Parse ---
    try:
        response = openai_client.chat.completions.create(
            model=MODEL_TO_USE,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        raw_output = response.choices[0].message.content

        # Parse Label (tolerates Markdown emphasis such as "**LABEL:** Correct")
        label_match = re.search(r"LABEL:[\s*_]*([A-Za-z]+)", raw_output, re.IGNORECASE)
        llm_label_str = label_match.group(1).capitalize() if label_match else "Parsing failed"

        # --- 4. Determine Score ---
        if llm_label_str == "Correct":
            score = 1.0
        # else: score remains 0.0 for "Incorrect", "Parsing failed", or API errors

    except Exception as e:
        logging.error(f"API Call Error during tool usage evaluation: {e}")
        # score remains 0.0

    return score

# --- Quick Test (Optional) ---
if 'evaluation_dataset' in locals() and len(evaluation_dataset) > 0:
    if openai_client and REVISED_TOOL_USAGE_PROMPT_TEMPLATE != "Prompt not defined":
        print("\n--- Testing evaluate_tool_usage_direct_api (returns float) with Example 0 ---")
        test_example = evaluation_dataset[0]
        test_output_data = test_example.input
        test_expected_data = test_example.output
        test_score_float = evaluate_tool_usage_direct_api(test_output_data, test_expected_data)
        print(f"Score Returned: {test_score_float} (Type: {type(test_score_float)})")
        print("--- End Test ---")
    else:
        print("\nSkipping function test - OpenAI client or prompt not ready.")
else:
    print("\nSkipping function test - evaluation_dataset not loaded.")



--- Testing evaluate_tool_usage_direct_api (returns float) with Example 0 ---
Score Returned: 1.0 (Type: <class 'float'>)
--- End Test ---


## Integration Test: Running Experiment with Custom Tool Usage Evaluator

Before defining the evaluators for SQL Correctness and Final Answer Quality, we will run a minimal Phoenix experiment using **only** our custom `evaluate_tool_usage_direct_api` function.

**Purpose:**

*   Verify that our custom evaluator integrates correctly with the `phoenix.experiments.run_experiment` framework.
*   Confirm that the scores generated by the direct API calls are logged to the Phoenix backend and visible in the UI.

This serves as a crucial integration check to ensure our direct API call approach works within the Phoenix ecosystem before proceeding further. We expect to see scores (likely all 1.0) for the `evaluate_tool_usage_direct_api` evaluator in the resulting Phoenix experiment UI.
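
Before sending anything to Phoenix, the evaluator's contract can also be checked locally. The sketch below assumes that examples expose `.input` and `.output` dictionaries, as they do elsewhere in this notebook. `smoke_test_evaluator` and the stub objects are hypothetical stand-ins for the real dataset, task function, and evaluator.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, List


def smoke_test_evaluator(examples: Iterable, task: Callable, evaluator: Callable) -> List[float]:
    """Run task -> evaluator on each example and check the evaluator's return contract."""
    scores = []
    for i, example in enumerate(examples):
        output = task(example)                      # same call shape as dummy_task_function(example)
        score = evaluator(output, example.output)   # same call shape as evaluator(output, expected)
        if not isinstance(score, float) or not 0.0 <= score <= 1.0:
            raise TypeError(f"Example {i}: expected a float in [0, 1], got {score!r}")
        scores.append(score)
    return scores


# Stand-ins for illustration only; swap in evaluation_dataset, dummy_task_function,
# and evaluate_tool_usage_direct_api to exercise the real components.
stub_examples = [
    SimpleNamespace(input={"user_query": "Who spoke first?", "tool_called": True},
                    output={"tool_usage_correctness_label": "Correct"}),
    SimpleNamespace(input={"user_query": "Hello!", "tool_called": True},
                    output={"tool_usage_correctness_label": "Incorrect"}),
]


def stub_task(example):
    return example.input


def stub_evaluator(output, expected):
    return 1.0 if expected["tool_usage_correctness_label"] == "Correct" else 0.0


print(smoke_test_evaluator(stub_examples, stub_task, stub_evaluator))
```

A check like this catches signature or return-type mistakes without spending API calls or creating an experiment in the UI.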

In [None]:
# --- Run Experiment with ONLY the Custom Tool Usage Evaluator ---
# ... (previous code in the cell: imports, checks, defining experiment_name etc.) ...

if not missing:
    print("Running experiment with only the custom direct API tool usage evaluator...")
    # ... (now_str, experiment_name defined here) ...

    # *** ADDED: Explicitly set environment vars right before run_experiment ***
    print("Ensuring environment variables are set for run_experiment...")
    cloud_endpoint = "https://app.phoenix.arize.com"
    client_headers = os.getenv("PHOENIX_CLIENT_HEADERS") # Get from env (should be loaded)
    if client_headers:
        os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = cloud_endpoint
        os.environ["PHOENIX_CLIENT_HEADERS"] = client_headers # Re-set it just in case
        print(f"  Set PHOENIX_COLLECTOR_ENDPOINT={cloud_endpoint}")
        print(f"  Set PHOENIX_CLIENT_HEADERS=api_key=...")
    else:
        print("  WARNING: PHOENIX_CLIENT_HEADERS not found in os.getenv(), cannot set for run_experiment.")
        # Optionally raise an error here if headers are essential

    # *** Original run_experiment call (without client=px_client) ***
    experiment_tool_usage_only = run_experiment(
        evaluation_dataset,                 # First positional: dataset
        dummy_task_function,                # Second positional: task function
        evaluators=[evaluate_tool_usage_direct_api], # Use list with just the one evaluator
        experiment_name=experiment_name,    # Keyword: experiment_name
        experiment_description="Testing direct API tool usage evaluator integration." # Keyword: description
        # concurrency=5 # Optional concurrency
    )

    print(f"\nExperiment '{experiment_name}' run initiated.")
    print("Please check the Phoenix UI for results.")

else:
    print(f"Skipping experiment - required components not loaded: {', '.join(missing)}")

print("\n--- End Tool Usage Only Experiment ---")

## Define Evaluators: SQL Correctness & Final Answer Quality (Direct API)

Following the successful pattern established for `evaluate_tool_usage_direct_api`, we now define the evaluators for SQL Correctness and Final Answer Quality using direct calls to the OpenAI API via our `call_openai_judge` helper function.

Each evaluator:
1. Takes the `output` and `expected` dictionaries as input, matching the signature of `evaluate_tool_usage_direct_api`.
2. Formats a specific prompt using data from `output` (`user_query`, `generated_sql`, `final_answer`).
3. Calls the `call_openai_judge` function.
4. Parses the response ("Correct"/"Incorrect" or "Good"/"Bad") into a float score (1.0 or 0.0).
5. Handles missing data or API errors by returning 0.0.

Finally, we combine all three custom evaluators into a list (`all_custom_evaluators`) to be used in the experiment run.
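
The evaluators differ only in their prompt, the fields they read, and the label that counts as a pass, so the shared steps could be factored into one place. The sketch below assumes the judge is any callable that maps a prompt string to raw response text, such as a wrapper around the chat completions call. `make_label_evaluator` and `fake_judge` are hypothetical names, and the fake judge exists only so the example runs without an API key.

```python
import re
from typing import Callable, Sequence


def make_label_evaluator(
    prompt_template: str,
    fields: Sequence[str],
    judge: Callable[[str], str],
    positive_label: str,
) -> Callable[[dict, dict], float]:
    """Build an evaluator(output, expected) -> float from a prompt and a judge callable."""

    def evaluator(output: dict, expected: dict) -> float:
        values = {name: output.get(name) for name in fields}
        # Missing or empty inputs score 0.0 (a boolean False such as tool_called=False is allowed).
        if any(v is None or v == "" for v in values.values()):
            return 0.0
        try:
            raw = judge(prompt_template.format(**values))
        except Exception:
            return 0.0  # judge failures also score 0.0
        match = re.search(r"LABEL:[\s*_]*([A-Za-z]+)", raw or "", re.IGNORECASE)
        if match and match.group(1).lower() == positive_label.lower():
            return 1.0
        return 0.0

    return evaluator


def fake_judge(prompt: str) -> str:
    """Canned judge for illustration: approves any prompt that contains SELECT."""
    verdict = "Correct" if "SELECT" in prompt else "Incorrect"
    return f"EXPLANATION: canned response.\nLABEL: {verdict}"


sql_eval = make_label_evaluator(
    "User Query:\n{user_query}\n\nGenerated SQL:\n{generated_sql}\n",
    fields=("user_query", "generated_sql"),
    judge=fake_judge,
    positive_label="Correct",
)
print(sql_eval({"user_query": "List speakers",
                "generated_sql": "SELECT DISTINCT speaker FROM transcript_segments"}, {}))
```

Passing the judge in as an argument also makes each evaluator testable offline, and a "Good"/"Bad" evaluator is the same call with `positive_label="Good"`.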

In [None]:
# --- CELL 1: DEFINITIONS ---
import re
import os
from openai import OpenAI # Ensure OpenAI is imported
# Assumes openai_client is initialized globally in a *previous* cell and is working
# Assumes MODEL_TO_USE = "gpt-4o" is defined globally

# --- Define Prompt Template ---
SQL_CORRECTNESS_PROMPT_TEMPLATE = """Evaluate if the Generated SQL is semantically correct and appropriate for the User Query. Consider typical schemas (e.g., transcript_segments table). Ignore Final Answer quality.

User Query:
{user_query}

Generated SQL:
{generated_sql}

Is the SQL correct and appropriate?
Provide a brief EXPLANATION and finish with LABEL: Correct or LABEL: Incorrect.
"""
print("Defined: SQL_CORRECTNESS_PROMPT_TEMPLATE")

# --- Helper Function to Call OpenAI ---
# Define this if it's not already defined and available from another cell
def call_openai_judge(prompt: str, model: str = "gpt-4o") -> str:
    """Calls the specified OpenAI model as a judge and returns the raw text response."""
    if not openai_client:
        raise RuntimeError("OpenAI client is not initialized.")
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error during OpenAI API call: {e}")
        return "API Error" # Return error string
print("Defined: call_openai_judge helper function")

# --- Evaluator Definition ---
def evaluate_sql_correctness_direct_api(output: dict, expected: dict) -> float:
    """
    (Simplified) Evaluates SQL correctness via direct OpenAI API call.
    Returns 1.0 for Correct, 0.0 otherwise.
    """
    # Function body as defined in the previous simplified version...
    if not openai_client:
        print("Prerequisite Error: OpenAI client not initialized.")
        return 0.0
    if not SQL_CORRECTNESS_PROMPT_TEMPLATE:
        print("Prerequisite Error: SQL_CORRECTNESS_PROMPT_TEMPLATE not defined.")
        return 0.0

    user_query = output.get('user_query')
    generated_sql = output.get('generated_sql')

    if not user_query or not generated_sql:
        return 0.0

    try:
        prompt = SQL_CORRECTNESS_PROMPT_TEMPLATE.format(
            user_query=user_query,
            generated_sql=generated_sql
        )
    except KeyError as e:
         print(f"Prompt Formatting Error: Missing key {e}")
         return 0.0

    try:
        raw_output = call_openai_judge(prompt, model=MODEL_TO_USE) # Use helper
        # Tolerate Markdown emphasis around the label, e.g. "**LABEL:** Correct"
        label_match = re.search(r"LABEL:[\s*_]*([A-Za-z]+)", raw_output, re.IGNORECASE)

        if label_match and label_match.group(1).strip().capitalize() == "Correct":
            return 1.0
        else:
            # Covers API Error string, parsing failure, or Incorrect label
            # print(f"Debug: Raw output '{raw_output[:50]}...' resulted in 0.0") # Optional debug
            return 0.0

    except Exception as e:
        print(f"Unexpected error during SQL correctness evaluation: {e}")
        return 0.0

print("Defined: evaluate_sql_correctness_direct_api function")
# --- END CELL 1 ---

In [None]:
# --- CELL 2: FULL EVALUATION & DATAFRAME OUTPUT ---
import re
import pandas as pd # Import pandas
# Assumes functions from Cell 1 (call_openai_judge, evaluate_sql_correctness_direct_api) are defined
# Assumes openai_client, evaluation_dataset, MODEL_TO_USE, SQL_CORRECTNESS_PROMPT_TEMPLATE are available

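# Report the dataset size only if the dataset exists; the prerequisite check below reports the failure otherwise
n_examples = len(evaluation_dataset) if 'evaluation_dataset' in locals() else 0
print(f"\n--- Running Full Evaluation for SQL Correctness on {n_examples} Examples ---")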

# --- Prerequisite Check ---
test_passed = True
if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
    print("Evaluation FAIL: evaluation_dataset not found or empty.")
    test_passed = False
if 'openai_client' not in locals() or not openai_client:
    print("Evaluation FAIL: openai_client not initialized.")
    test_passed = False
if 'SQL_CORRECTNESS_PROMPT_TEMPLATE' not in locals() or not SQL_CORRECTNESS_PROMPT_TEMPLATE:
     print("Evaluation FAIL: SQL_CORRECTNESS_PROMPT_TEMPLATE not defined.")
     test_passed = False
if 'evaluate_sql_correctness_direct_api' not in locals():
     print("Evaluation FAIL: evaluate_sql_correctness_direct_api function not defined (Run Cell 1?).")
     test_passed = False
if 'call_openai_judge' not in locals():
     print("Evaluation FAIL: call_openai_judge function not defined (Run Cell 1?).")
     test_passed = False
# --- End Prerequisite Check ---

results_list = [] # Initialize list to store results

if test_passed:
    # Iterate through ALL examples using enumerate
    for i, test_example in enumerate(evaluation_dataset):
        print(f"Processing Example {i}...") # Progress indicator

        test_output_data = test_example.input
        test_expected_data = test_example.output # needed for function signature

        user_query = test_output_data.get('user_query', 'MISSING')
        generated_sql = test_output_data.get('generated_sql', 'MISSING')

        # --- Call LLM judge directly for raw output ---
        raw_judge_response = "Skipped direct call" # Default
        if user_query != 'MISSING' and generated_sql != 'MISSING':
            try:
                test_prompt = SQL_CORRECTNESS_PROMPT_TEMPLATE.format(
                    user_query=user_query,
                    generated_sql=generated_sql
                )
                raw_judge_response = call_openai_judge(test_prompt, model=MODEL_TO_USE) # Use helper
                # Reduce printing inside the loop for large datasets
                # print("--- Raw LLM Judge Response ---")
                # print(raw_judge_response)
                # print("----------------------------")
            except Exception as e:
                print(f"  Error calling LLM Judge directly on Example {i}: {e}")
                raw_judge_response = f"Error during direct call: {e}"
        # else:
            # print(f"  Skipping direct LLM call on Example {i} due to missing query/SQL.")


        # --- Call the evaluator function ---
        test_score_float = 0.0 # Default score
        try:
            test_score_float = evaluate_sql_correctness_direct_api(test_output_data, test_expected_data)
            # print(f"  Score Returned: {test_score_float}") # Optional print inside loop
        except Exception as e:
            print(f"  Error calling evaluator function on Example {i}: {e}")
            test_score_float = 0.0 # Assign 0.0 on error

        # --- Append results to list ---
        results_list.append({
            "index": i,
            "user_query": user_query,
            "generated_sql": generated_sql,
            "raw_llm_response": raw_judge_response,
            "sql_correctness_score": test_score_float
        })

    # --- Convert list to DataFrame ---
    print("\nEvaluation loop complete. Creating DataFrame...")
    results_df = pd.DataFrame(results_list)

    # --- Display DataFrame ---
    print("SQL Correctness Evaluation Results:")
    pd.set_option('display.max_rows', 100) # Show more rows if needed
    pd.set_option('display.max_colwidth', 200) # Show more text width
    display(results_df) # Use display() for better rendering in notebooks

else:
    print("--- Evaluation Aborted due to failed prerequisites ---")

print(f"\n--- End Full Evaluation ---")

In [None]:
# --- Compare LLM SQL Correctness Scores to Human Labels ---
import pandas as pd # Ensure pandas is imported

print("Comparing LLM SQL scores to human labels...")

# --- Configuration ---
# Key for the human label in evaluation_dataset[i].output
# Identified from notebook inspection as 'sql_correctness_label'
HUMAN_LABEL_KEY = 'sql_correctness_label'
# --- End Configuration ---

try:
    # Check prerequisite DataFrames/Datasets
    if 'results_df' not in locals():
        raise NameError("'results_df' DataFrame not found. Please run the evaluation cell first.")
    if 'evaluation_dataset' not in locals():
        raise NameError("'evaluation_dataset' not found. Please ensure it is loaded.")

    # 1. Extract Human Labels from the dataset
    human_labels = [example.output.get(HUMAN_LABEL_KEY, "MISSING") for example in evaluation_dataset]

    # Check length consistency
    if len(human_labels) != len(results_df):
        raise ValueError(f"Mismatch in lengths: results_df has {len(results_df)} rows, but extracted {len(human_labels)} human labels.")

    # 2. Add Human Labels column to the results DataFrame
    results_df['human_sql_label'] = human_labels

    # 3. Add LLM Label (string format) & Comparison column
    def score_to_label(score):
        # Converts 1.0 to "Correct", anything else (0.0, errors) to "Incorrect"
        return "Correct" if score == 1.0 else "Incorrect"

    results_df['llm_sql_label'] = results_df['sql_correctness_score'].apply(score_to_label)

    # Compare strings (case-insensitive), handle "MISSING" human labels
    results_df['match'] = results_df.apply(
        lambda row: str(row['llm_sql_label']).lower() == str(row['human_sql_label']).lower()
                    if row['human_sql_label'] != "MISSING" else None, # Result is None if human label was missing
        axis=1
    )

    # 4. Display Key Comparison Columns
    print("\nComparison Results (Human vs. LLM Judge for SQL Correctness):")
    display_cols = ['user_query', 'generated_sql', 'human_sql_label', 'llm_sql_label', 'match', 'raw_llm_response']
    # Filter out any columns that might not exist (e.g., if df creation failed partially)
    display_cols = [col for col in display_cols if col in results_df.columns]
    pd.set_option('display.max_rows', 100)
    pd.set_option('display.max_colwidth', 200)
    display(results_df[display_cols])

    # 5. Calculate and Print Match Percentage
    if 'match' in results_df.columns:
        match_count = results_df['match'].sum() # Counts True values
        valid_comparisons = results_df['match'].notna().sum() # Counts non-None values
        if valid_comparisons > 0:
            match_percentage = (match_count / valid_comparisons) * 100
            print(f"\nAgreement between LLM Judge and Human Labels: {match_percentage:.2f}% ({match_count}/{valid_comparisons})")
        else:
            print("\nCould not calculate agreement percentage (no valid comparisons).")

except (NameError, AttributeError, ValueError, KeyError) as e:
    print(f"\nError during comparison: {e}")
    print("Please ensure 'results_df' exists, 'evaluation_dataset' is loaded correctly,")
    print(f"and the key '{HUMAN_LABEL_KEY}' is correct for the human labels in evaluation_dataset[i].output.")
except Exception as e:
    print(f"An unexpected error occurred during comparison: {e}")


In [None]:
# --- Evaluator 3: Final Answer Quality (Mimicking Successful Pattern) ---
import re
import logging # Or use print if preferred for errors
# Assumes openai_client is initialized globally
# Assumes MODEL_TO_USE = "gpt-4o" is defined globally

# --- Define Prompt Template ---
# Make sure this is defined before the function
FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE = """Evaluate if the Final Answer accurately and completely answers the User Query, based ONLY on the query and answer text. Do not assume external data or SQL.

User Query:
{user_query}

Final Answer:
{final_answer}

Is the Final Answer good quality (accurate, relevant, complete)?
Provide a brief EXPLANATION and finish with LABEL: Good or LABEL: Bad.
"""
print("Defined: FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE") # Optional confirmation

# --- Evaluator Definition ---
def evaluate_final_answer_quality_direct_api(output: dict, expected: dict) -> float:
    """
    Evaluates Final Answer quality via direct OpenAI API call (Mimics successful pattern).
    Returns 1.0 for Good, 0.0 otherwise (Bad, Error, Parsing Failure).
    """
    # Rely on global variables: openai_client, FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE, MODEL_TO_USE
    score = 0.0 # Default to 0.0 (representing "Bad" or error)

    # --- 1. Prerequisites & Data Checks ---
    if not openai_client:
        print("Prerequisite Error: OpenAI client not initialized.") # Changed from logging
        return 0.0
    if not FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE:
        print("Prerequisite Error: FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE not defined.")
        return 0.0

    user_query = output.get('user_query')
    final_answer = output.get('final_answer')

    if not user_query or not final_answer:
        # print("Data Error: Missing 'user_query' or 'final_answer' in output.") # Optional
        return 0.0

    # --- 2. Format Prompt ---
    try:
        prompt = FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE.format(
            user_query=user_query,
            final_answer=final_answer
        )
    except KeyError as e:
        print(f"Prompt Formatting Error: Missing key {e}")
        return 0.0

    # --- 3. Call API & Parse Result ---
    try:
        # Assuming call_openai_judge helper is defined and available
        raw_output = call_openai_judge(prompt, model=MODEL_TO_USE)
        label_match = re.search(r"LABEL:\s*(\w+)", raw_output, re.IGNORECASE)

        # Return 1.0 ONLY if API succeeded AND label is exactly "Good"
        if label_match and label_match.group(1).strip().capitalize() == "Good":
            score = 1.0
        # else: score remains 0.0

    except NameError as e:
        # Raised if call_openai_judge or MODEL_TO_USE is not defined
        print(f"Error: required name not defined: {e}")
        # score remains 0.0
    except Exception as e:
        # Covers cases: OpenAI API call failed inside helper, or other unexpected errors
        print(f"API Call/Evaluation Error during Final Answer evaluation: {e}")
        # score remains 0.0

    return score

print("Defined: evaluate_final_answer_quality_direct_api function")
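
The regular expression in the evaluator above takes the first `LABEL:` match and expects the label word to follow immediately. As a hedged sketch, a slightly more tolerant parser could accept Markdown emphasis around the marker and prefer the last occurrence, since the prompt asks the judge to finish with the label. `parse_judge_label` is a hypothetical helper introduced here for illustration; it is not part of the notebook or of any library.

```python
import re
from typing import Optional

# Hypothetical helper (an assumption for illustration, not defined elsewhere in this notebook).
# \W* tolerates punctuation such as ":" or ":** " between the marker and the label word.
_LABEL_PATTERN = re.compile(r"\bLABEL\W*(Good|Bad)\b", re.IGNORECASE)

def parse_judge_label(raw_output: str) -> Optional[str]:
    """Return "Good" or "Bad" from the judge's text, or None if no label is found."""
    matches = _LABEL_PATTERN.findall(raw_output or "")
    if not matches:
        return None
    # The prompt asks the judge to finish with the label, so use the last match.
    return matches[-1].capitalize()

print(parse_judge_label("EXPLANATION: concise and relevant.\n**LABEL:** Good"))
print(parse_judge_label("No verdict given."))
```

Returning `None` for an unparseable response keeps "the judge said Bad" distinguishable from "the judge's output could not be read", which the current 0.0 default conflates.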

In [None]:
# --- Full Evaluation & Comparison for Final Answer Quality ---
import pandas as pd
import re # Ensure re is imported

print("\n--- Running Full Evaluation for Final Answer Quality ---")

# --- Configuration ---
# Key for the human label in evaluation_dataset[i].output
# Identified from notebook inspection as 'final_answer_quality_label'
HUMAN_LABEL_KEY_ANSWER = 'final_answer_quality_label'

# Expected Labels from the LLM for this evaluator
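# Note: The dataset's human labels (Pass/Fail/Correct) differ from the labels our prompt asks for (Good/Bad), so we map them.
# Without the positive mappings, a human "Pass" or "Correct" could never equal the LLM's "Good".
HUMAN_LABEL_MAP = {"Fail": "Bad", "Pass": "Good", "Correct": "Good"}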
LLM_POSITIVE_LABEL = "Good" # What our prompt asks for as the "good" label
# --- End Configuration ---


# --- Prerequisite Check ---
eval_passed = True
if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
    print("Evaluation FAIL: evaluation_dataset not found or empty.")
    eval_passed = False
if 'openai_client' not in locals() or not openai_client:
    print("Evaluation FAIL: openai_client not initialized.")
    eval_passed = False
if 'FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE' not in locals() or not FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE:
    print("Evaluation FAIL: FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE not defined.")
    eval_passed = False
if 'evaluate_final_answer_quality_direct_api' not in locals():
    print("Evaluation FAIL: evaluate_final_answer_quality_direct_api function not defined.")
    eval_passed = False
if 'call_openai_judge' not in locals():
    print("Evaluation FAIL: call_openai_judge function not defined.")
    eval_passed = False
# --- End Prerequisite Check ---

answer_results_list = [] # Initialize list for results

if eval_passed:
    print(f"Processing {len(evaluation_dataset)} examples for Final Answer Quality...")
    # Iterate through ALL examples
    for i, test_example in enumerate(evaluation_dataset):
        # print(f"Processing Example {i}...") # Can uncomment for verbose progress

        test_output_data = test_example.input
        test_expected_data = test_example.output # needed for function signature

        user_query = test_output_data.get('user_query', 'MISSING')
        final_answer = test_output_data.get('final_answer', 'MISSING')
        human_label_raw = test_example.output.get(HUMAN_LABEL_KEY_ANSWER, "MISSING_KEY")

        # --- Call LLM judge directly for raw output ---
        raw_judge_response_answer = "Skipped direct call" # Default
        if user_query != 'MISSING' and final_answer != 'MISSING':
            try:
                test_prompt_answer = FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE.format(
                    user_query=user_query,
                    final_answer=final_answer
                )
                raw_judge_response_answer = call_openai_judge(test_prompt_answer, model=MODEL_TO_USE)
            except Exception as e:
                # print(f"  Error calling LLM Judge directly on Example {i}: {e}") # Verbose error
                raw_judge_response_answer = f"Error during direct call: {e}"
        # else:
             # print(f"  Skipping direct LLM call on Example {i} due to missing query/answer.")


        # --- Call the evaluator function ---
        test_score_float_answer = 0.0 # Default score
        try:
            test_score_float_answer = evaluate_final_answer_quality_direct_api(test_output_data, test_expected_data)
        except Exception as e:
            # print(f"  Error calling evaluator function on Example {i}: {e}") # Verbose error
            test_score_float_answer = 0.0 # Assign 0.0 on error

        # --- Append results to list ---
        answer_results_list.append({
            "index": i,
            "user_query": user_query,
            "final_answer": final_answer,
            "human_answer_label_raw": human_label_raw, # Store the original human label
            "raw_llm_response_answer": raw_judge_response_answer,
            "final_answer_score": test_score_float_answer
        })

    # --- Convert list to DataFrame ---
    print("\nEvaluation loop complete. Creating DataFrame...")
    answer_results_df = pd.DataFrame(answer_results_list)

    # --- Add Comparison Columns ---
    print("Adding comparison columns...")
    # Map human label "Fail" to "Bad" for comparison
    answer_results_df['human_answer_label_mapped'] = answer_results_df['human_answer_label_raw'].map(HUMAN_LABEL_MAP).fillna(answer_results_df['human_answer_label_raw'])

    # Convert score (1.0/0.0) to LLM label ("Good"/"Bad")
    def answer_score_to_label(score):
        return LLM_POSITIVE_LABEL if score == 1.0 else "Bad" # Assumes 0.0 means Bad

    answer_results_df['llm_answer_label'] = answer_results_df['final_answer_score'].apply(answer_score_to_label)

    # Compare mapped human label with LLM label (case-insensitive)
    answer_results_df['answer_match'] = answer_results_df.apply(
        lambda row: str(row['llm_answer_label']).lower() == str(row['human_answer_label_mapped']).lower()
                    if row['human_answer_label_mapped'] not in ["MISSING", "MISSING_KEY"] else None,
        axis=1
    )


    # --- Display DataFrame ---
    print("\nFinal Answer Quality Evaluation Results:")
    # Select and reorder columns
    display_cols_answer = ['index', 'user_query', 'final_answer', 'human_answer_label_raw', 'llm_answer_label', 'answer_match', 'raw_llm_response_answer']
    display_cols_answer = [col for col in display_cols_answer if col in answer_results_df.columns] # Ensure columns exist

    pd.set_option('display.max_rows', 100)
    pd.set_option('display.max_colwidth', 200)
    display(answer_results_df[display_cols_answer])

    # --- Calculate Match Percentage ---
    if 'answer_match' in answer_results_df.columns:
        match_count_answer = answer_results_df['answer_match'].sum() # Counts True values
        valid_comparisons_answer = answer_results_df['answer_match'].notna().sum() # Counts non-None values
        if valid_comparisons_answer > 0:
            match_percentage_answer = (match_count_answer / valid_comparisons_answer) * 100
            print(f"\nAgreement between LLM Judge and Human Labels (Final Answer): {match_percentage_answer:.2f}% ({match_count_answer}/{valid_comparisons_answer})")
        else:
            print("\nCould not calculate agreement percentage (no valid comparisons).")

else:
    print("--- Final Answer Evaluation Aborted due to failed prerequisites ---")

print("\n--- End Final Answer Quality Evaluation ---")


## Final Answer Quality Evaluation & Alignment Challenge

We successfully ran the `evaluate_final_answer_quality_direct_api` function across the dataset and compared its judgments to the human labels.

**Key Result:**

*   **Agreement with Human Labels:** ~18% (3/17)

**Analysis:**

The LLM judge achieved very low agreement with the human assessment for final answer quality. Examining the results table reveals the likely cause:
*   The LLM judge, following our prompt to evaluate *only* based on the query and answer text (ignoring SQL/data correctness), rated almost all answers as "Good".
*   The human labels (`Pass`/`Fail`/`Correct`), however, likely incorporated factual correctness based on the underlying data, resulting in many "Fail" labels.
*   This fundamental mismatch in evaluation criteria led to the significant disagreement.
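
A cross-tabulation makes this kind of skew visible at a glance. The sketch below uses a small made-up frame (the values are illustrative only, not the notebook's results) with the same column names as `answer_results_df`; in the notebook, the same `pd.crosstab` call could presumably be applied to `answer_results_df` directly.

```python
import pandas as pd

# Toy data for illustration only; these are NOT the results reported above.
toy_results = pd.DataFrame({
    "human_answer_label_mapped": ["Bad", "Bad", "Good", "Bad"],
    "llm_answer_label": ["Good", "Good", "Good", "Bad"],
})

# Rows: human label, columns: LLM judge label, cells: number of examples.
confusion = pd.crosstab(
    toy_results["human_answer_label_mapped"],
    toy_results["llm_answer_label"],
)
print(confusion)
```

A large count in the (human `Bad`, LLM `Good`) cell would point to the judge being too lenient, rather than to random disagreement.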

**Pedagogical Opportunity:**

This outcome serves as an excellent illustration of the challenges in automated evaluation:
*   **Technical Success vs. Meaningful Results:** We successfully built and ran the evaluator, but the results lack strong alignment with human judgment in this case.
*   **Importance of Criteria & Prompting:** It highlights how critical defining the *right* evaluation criteria and crafting effective prompts is. Simply asking if an answer is "Good" based on text alone was insufficient here.
*   **Improvement Task:** This presents a clear opportunity for improvement. Students could be tasked with refining the `FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE` to encourage the LLM to consider factual accuracy (perhaps by providing the generated SQL or context), or exploring different evaluation scales beyond simple "Good/Bad", aiming to increase alignment with human judgment.

This demonstrates that building LLM-as-judge systems requires not just coding but careful consideration of evaluation design and alignment.
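
As one possible starting point for that improvement task, the sketch below extends the prompt with the generated SQL so the judge has something to check factual claims against. The template name `FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE_V2`, the helper `build_final_answer_prompt`, the prompt wording, and the example query and table name are all assumptions made for illustration. Whether this actually raises agreement with the human labels would need to be measured.

```python
# Hypothetical revised template: the name and wording are illustrative assumptions.
FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE_V2 = """Evaluate whether the Final Answer accurately and completely answers the User Query.
Use the Generated SQL as evidence of what data the answer was based on.

User Query:
{user_query}

Generated SQL:
{generated_sql}

Final Answer:
{final_answer}

Provide a brief EXPLANATION and finish with LABEL: Good or LABEL: Bad.
"""

def build_final_answer_prompt(output: dict) -> str:
    """Fill the template from the flat evaluator `output` dict, with placeholders for missing fields."""
    return FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE_V2.format(
        user_query=output.get("user_query") or "(missing)",
        generated_sql=output.get("generated_sql") or "(no SQL was generated)",
        final_answer=output.get("final_answer") or "(missing)",
    )

# Made-up example record; the query, SQL, and table name are hypothetical.
example_output = {
    "user_query": "How many transcripts mention pricing?",
    "generated_sql": "SELECT COUNT(*) FROM transcripts WHERE text LIKE '%pricing%'",
    "final_answer": "12 transcripts mention pricing.",
}
print(build_final_answer_prompt(example_output))
```

The output format (`EXPLANATION` followed by `LABEL: Good` or `LABEL: Bad`) is kept unchanged so the existing label-parsing logic would still apply.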

## Run Full Experiment via Phoenix

Now that all three evaluator functions (`evaluate_tool_usage_direct_api`, `evaluate_sql_correctness_direct_api`, `evaluate_final_answer_quality_direct_api`) have been defined using the direct OpenAI API call pattern and verified, we will execute the full evaluation harness using `phoenix.experiments.run_experiment`.

This function will:
1.  Iterate through each example in the `evaluation_dataset`.
2.  Run the `dummy_task_function` for each example (simply passing through the pre-computed agent outputs).
3.  Call each of our three defined evaluator functions for every example.
4.  Log the inputs, outputs, human labels, and the scores from all three evaluators to the Phoenix/Arize platform under a timestamped experiment name.

The results, including aggregate scores for each evaluator, will be summarized below the cell upon completion, and the full details can be explored in the linked Phoenix UI. This provides a centralized record of the agent's performance across all evaluation criteria for this version.
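
Conceptually, the four steps amount to the loop sketched below. This is a plain-Python stand-in to make the data flow concrete, not Phoenix's implementation: `ToyExample`, `run_experiment_sketch`, `passthrough_task`, and `has_final_answer` are hypothetical names, and the real `run_experiment` additionally takes care of logging results to Phoenix.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class ToyExample:
    """Hypothetical stand-in for a dataset example."""
    input: Dict[str, Any]   # pre-computed agent data
    output: Dict[str, Any]  # human labels

def run_experiment_sketch(
    examples: List[ToyExample],
    task: Callable[[ToyExample], Dict[str, Any]],
    evaluators: List[Callable[[Dict[str, Any], Dict[str, Any]], float]],
) -> Tuple[List[Dict[str, Any]], Dict[str, float]]:
    rows = []
    for example in examples:                      # step 1: iterate over examples
        task_output = task(example)               # step 2: run the task function
        scores = {                                # step 3: call every evaluator
            ev.__name__: ev(task_output, example.output) for ev in evaluators
        }
        rows.append({"input": example.input, "output": task_output, "scores": scores})
    # Step 4 (logging) is reduced here to computing per-evaluator averages.
    averages = {ev.__name__: mean(r["scores"][ev.__name__] for r in rows) for ev in evaluators}
    return rows, averages

def passthrough_task(example: ToyExample) -> Dict[str, Any]:
    return example.input

def has_final_answer(output: Dict[str, Any], expected: Dict[str, Any]) -> float:
    return 1.0 if output.get("final_answer") else 0.0

toy_examples = [
    ToyExample(input={"final_answer": "12 transcripts."}, output={"label": "Pass"}),
    ToyExample(input={"final_answer": None}, output={"label": "Fail"}),
]
rows, averages = run_experiment_sketch(toy_examples, passthrough_task, [has_final_answer])
print(averages)
```

The sketch also shows why the task function's return value matters: whatever it returns is exactly what each evaluator receives as `output`.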

In [None]:
# --- Run Phoenix Experiment with All Corrected OpenAI Evaluators ---
import phoenix as px # Ensure phoenix is imported
from phoenix.experiments import run_experiment
from datetime import datetime
# Assumes 'evaluation_dataset' is loaded
# Assumes 'dummy_task_function' is defined (or define it here)
# Assumes the three evaluator functions are defined:
#   - evaluate_tool_usage_direct_api
#   - evaluate_sql_correctness_direct_api
#   - evaluate_final_answer_quality_direct_api

print("\n--- Preparing to run Phoenix experiment with all OpenAI evaluators ---")

# --- Combine Evaluators ---
# Ensure the function names below match exactly how they were defined
try:
    all_final_evaluators = [
        evaluate_tool_usage_direct_api,
        evaluate_sql_correctness_direct_api,
        evaluate_final_answer_quality_direct_api
    ]
    print(f"Created list 'all_final_evaluators' with {len(all_final_evaluators)} functions.")
except NameError as e:
    print(f"Error: One or more evaluator functions not defined: {e}")
    all_final_evaluators = None # Prevent running experiment if list creation failed

# --- Define Dummy Task Function (if not already defined) ---
# This function simply passes the input data through, as the agent results are pre-computed in the dataset.
if 'dummy_task_function' not in locals():
    print("Defining dummy_task_function...")
    def dummy_task_function(example):
        """Takes an example and returns its input field."""
        # Ensure it returns a dictionary suitable for the evaluators
        return example.input if hasattr(example, 'input') else {}
    print("dummy_task_function defined.")


# --- Prerequisite Check for Run ---
run_passed = True
if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
    print("Run FAIL: evaluation_dataset not found or empty.")
    run_passed = False
if 'dummy_task_function' not in locals():
    print("Run FAIL: dummy_task_function not defined.")
    run_passed = False
if not all_final_evaluators or len(all_final_evaluators) != 3:
    print("Run FAIL: Evaluator list 'all_final_evaluators' not ready.")
    run_passed = False
# --- End Prerequisite Check ---


if run_passed:
    print("\nRunning Phoenix experiment...")
    now_str = datetime.now().strftime("%Y%m%d-%H%M%S")
    experiment_name = f"Full_OpenAI_Eval_{now_str}"

    # Call run_experiment with positional dataset/task_fn and keyword evaluators
    try:
        experiment_run = run_experiment(
            evaluation_dataset,         # 1st Positional: Dataset
            dummy_task_function,        # 2nd Positional: Task Function
            evaluators=all_final_evaluators, # Keyword: List of evaluator functions
            experiment_name=experiment_name,
            experiment_description="Full evaluation using direct OpenAI calls for ToolUsage, SQLCorrectness, FinalAnswer."
            # concurrency=... # Optional: Adjust concurrency if needed
        )
        print(f"\nExperiment '{experiment_name}' run initiated.")
        print("Check the Phoenix UI for detailed results and scores from all evaluators.")
        # The experiment summary table will print automatically below if successful.

    except Exception as e:
        print(f"\nError during run_experiment: {e}")
        print("Please check function signatures, dataset structure, and Phoenix connection.")

else:
    print("\n--- Experiment Run Aborted due to failed prerequisites ---")


### Addressing Evaluation Warnings and Zero Scores

The previous execution of `run_experiment` for the `gpt-4o` dataset completed, but generated warnings (`Skipping evaluation: Missing 'user_query' or 'tool_called' in output`) and resulted in average scores of 0.0 for all evaluators.

**Cause:**

This happened because the `dummy_task_function` (defined earlier) simply returned the raw `.input` data from each example. This raw data has a nested structure (e.g., the user query is inside `input['messages'][1]['content']`). However, our evaluator functions (`evaluate_tool_usage_direct_api`, etc.) expect to receive a flat dictionary with top-level keys like `user_query`, `tool_called`, `final_answer`, and `generated_sql`. Since these keys weren't present at the top level of the data passed to the evaluators, they skipped the evaluation or defaulted to a 0.0 score.

**Solution:**

To fix this, we need to **redefine `dummy_task_function`**. The new definition (in the next cell) will correctly extract the required fields from the nested `example.input` data structure and return a flat dictionary in the format expected by our evaluator functions.

After redefining `dummy_task_function`, we will re-run the `run_experiment` cell for the `gpt-4o` dataset. The warnings should disappear, and the evaluators should now compute meaningful scores.
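
The mismatch is easy to reproduce on a toy record. The dictionary below is a made-up example shaped like the nested structure described above (the message contents are illustrative, not taken from the dataset).

```python
# Made-up record mirroring the nested shape: the user query sits in messages[1]["content"].
nested_input = {
    "messages": [
        {"role": "system", "content": "You answer questions about transcripts."},
        {"role": "user", "content": "How many transcripts mention pricing?"},
    ]
}

# What the evaluators attempt: a top-level lookup, which finds nothing.
print(nested_input.get("user_query"))  # prints None

# Where the query actually lives: inside the first message whose role is "user".
user_query = next(
    msg["content"] for msg in nested_input["messages"] if msg.get("role") == "user"
)
flat_output = {"user_query": user_query}
print(flat_output["user_query"])
```

Because the top-level lookup returns `None`, an evaluator that checks `if not user_query` exits early with 0.0, which matches the all-zero averages observed.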

In [None]:
# Cell: Redefine dummy_task_function to extract data correctly

import json # Ensure json is imported
from phoenix.experiments.types import Example # Ensure Example is imported

# --- Define Dummy Task Function (Corrected Extraction) ---
# This version extracts the fields needed by the evaluators from the example input.
print("\n--- Re-defining dummy_task_function with data extraction ---")

# Helper function to safely extract user query from messages list
def get_user_query(messages_list):
    if not isinstance(messages_list, list): return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "user" and "content" in msg:
            return msg["content"]
    return None

# Helper to find the final assistant answer (can be refined based on agent log structure)
def get_final_answer(messages_list):
    if not isinstance(messages_list, list): return None
    for msg in reversed(messages_list): # Look from the end
        if isinstance(msg, dict) and msg.get("role") == "assistant":
            # Prefer final 'content' if available
            if msg.get("content"):
                return msg.get("content")
            # If no content, maybe check if tool call was the last action? Requires more context.
    return None # Fallback

# Helper to check if a specific tool was called
def check_tool_called(messages_list, tool_name="query_database"):
    if not isinstance(messages_list, list): return False
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tool_call in msg.get("tool_calls", []):
                if isinstance(tool_call, dict) and tool_call.get("function", {}).get("name") == tool_name:
                    return True
    return False

# Helper to extract SQL query
def get_generated_sql(messages_list):
    if not isinstance(messages_list, list): return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tool_call in msg.get("tool_calls", []):
                if isinstance(tool_call, dict) and tool_call.get("function", {}).get("name") == "query_database":
                    try:
                        args = json.loads(tool_call.get("function", {}).get("arguments", "{}"))
                        return args.get("sql_query")
                    except json.JSONDecodeError:
                        return None # Argument parsing failed
    return None


def dummy_task_function(example: Example) -> dict:
    """
    Extracts relevant fields from the example.input (which contains the raw run data)
    and returns a FLAT dictionary matching the structure expected by the evaluators.
    """
    if not hasattr(example, 'input') or not isinstance(example.input, dict):
        return {} # Return empty if input is missing or not a dict

    input_data = example.input
    # Assuming the structure like {"messages": [...]} is consistent in example.input
    messages = input_data.get("messages", [])

    # Extract the data pieces needed by the evaluators
    user_query = get_user_query(messages)
    final_answer = get_final_answer(messages)
    tool_called = check_tool_called(messages)
    generated_sql = get_generated_sql(messages)

    # Return the flat dictionary that evaluators expect as their 'output' argument
    return {
        "user_query": user_query,
        "final_answer": final_answer,
        "tool_called": tool_called,
        "generated_sql": generated_sql
    }

print("dummy_task_function re-defined with extraction logic.")

# --- Quick test of the function (Optional but recommended) ---
if 'evaluation_dataset_gpt4o' in locals() and len(evaluation_dataset_gpt4o) > 0:
    print("\nTesting updated dummy_task_function with the first gpt4o example:")
    test_output = dummy_task_function(evaluation_dataset_gpt4o[0])
    print("Output from updated dummy task:")
    print(json.dumps(test_output, indent=2))
    # Check if expected keys are present:
    print(f"  Keys present: {list(test_output.keys())}")
    assert "user_query" in test_output, "user_query missing!"
    assert "tool_called" in test_output, "tool_called missing!"
    assert "final_answer" in test_output, "final_answer missing!"
    # generated_sql might be None if tool wasn't called, which is okay
    print("  Test output structure seems correct.")

elif 'evaluation_dataset' in locals() and len(evaluation_dataset) > 0:
    print("\nTesting updated dummy_task_function with the first baseline example:")
    test_output = dummy_task_function(evaluation_dataset[0]) # Test with baseline data structure too if possible
    print("Output from updated dummy task:")
    print(json.dumps(test_output, indent=2))

else:
    print("\nSkipping dummy_task_function test, datasets not ready.")


In [None]:
# Cell to Run Experiment on GPT-4o Dataset (with function redefined inside)

import json # Ensure json is imported
import phoenix as px # Ensure phoenix is imported if needed
from phoenix.experiments import run_experiment
from datetime import datetime
from typing import List, Callable, Dict, Any # Ensure imports
from phoenix.experiments.types import Example # Make sure Example is imported

# --- Re-define Dummy Task Function HERE to ensure it's used ---
print("\n--- Defining/Re-defining dummy_task_function with extraction logic IN THIS CELL ---")

# Helper function to safely extract user query from messages list
def get_user_query(messages_list):
    if not isinstance(messages_list, list): return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "user" and "content" in msg:
            return msg["content"]
    return None

# Helper to find the final assistant answer
def get_final_answer(messages_list):
    if not isinstance(messages_list, list): return None
    for msg in reversed(messages_list): # Look from the end
        if isinstance(msg, dict) and msg.get("role") == "assistant":
            # Prefer final 'content' if available
            if msg.get("content"):
                return msg.get("content")
            # If no content, maybe check if tool call was the last action? Requires more context.
    return None # Fallback

# Helper to check if a specific tool was called
def check_tool_called(messages_list, tool_name="query_database"):
    if not isinstance(messages_list, list): return False
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tool_call in msg.get("tool_calls", []):
                if isinstance(tool_call, dict) and tool_call.get("function", {}).get("name") == tool_name:
                    return True
    return False

# Helper to extract SQL query
def get_generated_sql(messages_list):
    if not isinstance(messages_list, list): return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tool_call in msg.get("tool_calls", []):
                if isinstance(tool_call, dict) and tool_call.get("function", {}).get("name") == "query_database":
                    try:
                        # Use .get() with default to handle potential missing keys safely
                        func_dict = tool_call.get("function", {})
                        args_str = func_dict.get("arguments", "{}")
                        args = json.loads(args_str)
                        return args.get("sql_query")
                    except json.JSONDecodeError:
                        return None # Argument parsing failed
                    except Exception: # Catch other potential errors during access
                        return None
    return None

# The actual dummy task function definition
def dummy_task_function(example: Example) -> dict:
    """
    Extracts relevant fields from the example.input (which contains the raw run data)
    and returns a FLAT dictionary matching the structure expected by the evaluators.
    """
    if not hasattr(example, 'input') or not isinstance(example.input, dict):
        return {} # Return empty if input is missing or not a dict

    input_data = example.input
    # Assuming the structure like {"messages": [...]} is consistent in example.input
    messages = input_data.get("messages", [])

    # Extract the data pieces needed by the evaluators
    user_query = get_user_query(messages)
    final_answer = get_final_answer(messages)
    tool_called = check_tool_called(messages)
    generated_sql = get_generated_sql(messages)

    # Return the flat dictionary that evaluators expect as their 'output' argument
    return {
        "user_query": user_query,
        "final_answer": final_answer,
        "tool_called": tool_called,
        "generated_sql": generated_sql
    }

print("dummy_task_function defined/re-defined in this cell.")
# --- End Function Definition ---


# --- Now proceed with running the experiment ---
# Check if the gpt4o dataset variable is loaded from previous cell
if 'evaluation_dataset_gpt4o' in locals() and len(evaluation_dataset_gpt4o) > 0:
    missing_deps = []
    if 'dummy_task_function' not in locals(): missing_deps.append("dummy_task_function") # Should be found now
    # Ensure the evaluator list from earlier setup cells is available
    if 'all_final_evaluators' not in locals(): missing_deps.append("all_final_evaluators list")
    if 'px_client' not in locals() or px_client is None: missing_deps.append("px_client")

    if not missing_deps:
        print("\nRunning full experiment for GPT-4o using direct API evaluators...")
        now_str = datetime.now().strftime("%Y%m%d-%H%M%S")
        # Give this experiment run a distinct name
        experiment_name_gpt4o = f"GPT4o_Full_OpenAI_Eval_{now_str}" # Matched naming convention

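        # Type alias for the evaluator list (documentation only; not used below).
        # Each evaluator takes (output, expected) dicts and returns a float score.
        EvaluatorList = List[Callable[[Dict[str, Any], Dict[str, Any]], float]]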

        # Assuming run_experiment works without explicit env vars here
        try:
            full_experiment_gpt4o = run_experiment(
                evaluation_dataset_gpt4o,           # Use the GPT-4o dataset variable
                dummy_task_function,                # Uses the function defined above in this cell
                evaluators=all_final_evaluators,   # Re-use the same evaluators list
                experiment_name=experiment_name_gpt4o, # Use the new, distinct experiment name
                experiment_description="Eval GPT-4o: Direct API ToolUsage, SQLCorrectness, FinalAnswer." # Updated description
                # concurrency=5 # Optional: Adjust concurrency if needed
            )

            print(f"\nFull Experiment '{experiment_name_gpt4o}' run initiated.")
            print(f"Check Phoenix UI for results associated with experiment name '{experiment_name_gpt4o}'.")

        except Exception as e:
            # Catch potential errors like Connection refused if it reappears
            print(f"\nERROR during run_experiment for GPT-4o: {e}")
            print("If this is a connection error, consider setting the Phoenix connection variables via os.environ (e.g., PHOENIX_COLLECTOR_ENDPOINT) before the call.")


    else:
        print(f"Skipping GPT-4o experiment - missing dependencies: {', '.join(missing_deps)}")

else:
    print("Skipping GPT-4o experiment - 'evaluation_dataset_gpt4o' was not found or is empty. Did you run the cell above to load it?")

print("\n--- End GPT-4o Full Experiment ---")


In [None]:
# --- Evaluator 3: Final Answer Quality (Mimicking Successful Pattern) ---
import re
import logging # Or use print if preferred for errors
# Assumes openai_client is initialized globally
# Assumes MODEL_TO_USE = "gpt-4o" is defined globally

# --- Define Prompt Template ---
# Make sure this is defined before the function
FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE = """Evaluate if the Final Answer accurately and completely answers the User Query, based ONLY on the query and answer text. Do not assume external data or SQL.

User Query:
{user_query}

Final Answer:
{final_answer}

Is the Final Answer good quality (accurate, relevant, complete)?
Provide a brief EXPLANATION and finish with LABEL: Good or LABEL: Bad.
"""
print("Defined: FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE") # Optional confirmation

# --- Evaluator Definition ---
def evaluate_final_answer_quality_direct_api(output: dict, expected: dict) -> float:
    """
    Evaluates Final Answer quality via direct OpenAI API call (Mimics successful pattern).
    Returns 1.0 for Good, 0.0 otherwise (Bad, Error, Parsing Failure).
    """
    # Rely on global variables: openai_client, FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE, MODEL_TO_USE
    score = 0.0 # Default to 0.0 (representing "Bad" or error)

    # --- 1. Prerequisites & Data Checks ---
    if not openai_client:
        print("Prerequisite Error: OpenAI client not initialized.") # Changed from logging
        return 0.0
    if not FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE:
        print("Prerequisite Error: FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE not defined.")
        return 0.0

    user_query = output.get('user_query')
    final_answer = output.get('final_answer')

    if not user_query or not final_answer:
        # print("Data Error: Missing 'user_query' or 'final_answer' in output.") # Optional
        return 0.0

    # --- 2. Format Prompt ---
    try:
        prompt = FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE.format(
            user_query=user_query,
            final_answer=final_answer
        )
    except KeyError as e:
         print(f"Prompt Formatting Error: Missing key {e}")
         return 0.0

    # --- 3. Call API & Parse Result ---
    try:
        # Assuming call_openai_judge helper is defined and available
        raw_output = call_openai_judge(prompt, model=MODEL_TO_USE)
        label_match = re.search(r"LABEL:\s*(\w+)", raw_output, re.IGNORECASE)

        # Return 1.0 ONLY if the API call succeeded AND the parsed label is "Good" (compared case-insensitively)
        if label_match and label_match.group(1).strip().capitalize() == "Good":
            score = 1.0
        # else: score remains 0.0

    except NameError:
        # Explicitly catch if the helper function isn't defined
         print("Error: call_openai_judge function not found.")
         # score remains 0.0
    except Exception as e:
        # Covers cases: OpenAI API call failed inside helper, or other unexpected errors
        print(f"API Call/Evaluation Error during Final Answer evaluation: {e}")
        # score remains 0.0

    return score

print("Defined: evaluate_final_answer_quality_direct_api function")
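
The label parsing used by the evaluator above can be checked in isolation, without calling the API. The following is a minimal sketch that applies the same `LABEL:` pattern to a few judge replies. The helper name `parse_label_score` and the sample replies are hypothetical, invented for illustration.

```python
import re

# Same pattern the evaluator uses to pull the verdict out of the judge's reply.
LABEL_PATTERN = re.compile(r"LABEL:\s*(\w+)", re.IGNORECASE)

def parse_label_score(raw_output: str) -> float:
    """Return 1.0 only when the first parsed label is 'Good'; 0.0 otherwise."""
    match = LABEL_PATTERN.search(raw_output or "")
    if match and match.group(1).capitalize() == "Good":
        return 1.0
    return 0.0

# Hypothetical judge replies, invented for illustration.
samples = {
    "clean_good": "EXPLANATION: Answers the question directly. LABEL: Good",
    "clean_bad": "EXPLANATION: Misses half the question. LABEL: Bad",
    "lowercase": "explanation: fine. label: good",
    "no_label": "The answer looks good to me.",
}
scores = {name: parse_label_score(text) for name, text in samples.items()}
print(scores)
```

Note that a reply with no parsable label scores the same as an explicit "Bad". If that distinction matters for the agreement numbers, it may be worth counting parse failures separately.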

In [None]:
# --- Full Evaluation & Comparison for Final Answer Quality ---
import pandas as pd
import re # Ensure re is imported

print("\n--- Running Full Evaluation for Final Answer Quality ---")

# --- Configuration ---
# Key for the human label in evaluation_dataset[i].output
# Identified from notebook inspection as 'final_answer_quality_label'
HUMAN_LABEL_KEY_ANSWER = 'final_answer_quality_label'

# Expected Labels from the LLM for this evaluator
# Note: Dataset uses "Fail" but our prompt asks for "Bad". We need to map.
HUMAN_LABEL_MAP = {"Fail": "Bad"} # Map human label "Fail" to expected LLM "Bad"
LLM_POSITIVE_LABEL = "Good" # What our prompt asks for as the "good" label
# --- End Configuration ---


# --- Prerequisite Check ---
eval_passed = True
if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0:
    print("Evaluation FAIL: evaluation_dataset not found or empty.")
    eval_passed = False
if 'openai_client' not in locals() or not openai_client:
    print("Evaluation FAIL: openai_client not initialized.")
    eval_passed = False
if 'FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE' not in locals() or not FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE:
     print("Evaluation FAIL: FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE not defined.")
     eval_passed = False
if 'evaluate_final_answer_quality_direct_api' not in locals():
     print("Evaluation FAIL: evaluate_final_answer_quality_direct_api function not defined.")
     eval_passed = False
if 'call_openai_judge' not in locals():
     print("Evaluation FAIL: call_openai_judge function not defined.")
     eval_passed = False
# --- End Prerequisite Check ---

answer_results_list = [] # Initialize list for results

if eval_passed:
    print(f"Processing {len(evaluation_dataset)} examples for Final Answer Quality...")
    # Iterate through ALL examples
    for i, test_example in enumerate(evaluation_dataset):
        # print(f"Processing Example {i}...") # Can uncomment for verbose progress

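        # The dataset stores the agent's pre-computed outputs in `.input` and the human labels in `.output`.
        test_output_data = test_example.input
        test_expected_data = test_example.output # needed for function signature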

        user_query = test_output_data.get('user_query', 'MISSING')
        final_answer = test_output_data.get('final_answer', 'MISSING')
        human_label_raw = test_example.output.get(HUMAN_LABEL_KEY_ANSWER, "MISSING_KEY")

        # --- Call LLM judge directly for raw output ---
        raw_judge_response_answer = "Skipped direct call" # Default
        if user_query != 'MISSING' and final_answer != 'MISSING':
            try:
                test_prompt_answer = FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE.format(
                    user_query=user_query,
                    final_answer=final_answer
                )
                raw_judge_response_answer = call_openai_judge(test_prompt_answer, model=MODEL_TO_USE)
            except Exception as e:
                # print(f"  Error calling LLM Judge directly on Example {i}: {e}") # Verbose error
                raw_judge_response_answer = f"Error during direct call: {e}"
        # else:
             # print(f"  Skipping direct LLM call on Example {i} due to missing query/answer.")


        # --- Call the evaluator function ---
        test_score_float_answer = 0.0 # Default score
        try:
            test_score_float_answer = evaluate_final_answer_quality_direct_api(test_output_data, test_expected_data)
        except Exception as e:
            # print(f"  Error calling evaluator function on Example {i}: {e}") # Verbose error
            test_score_float_answer = 0.0 # Assign 0.0 on error

        # --- Append results to list ---
        answer_results_list.append({
            "index": i,
            "user_query": user_query,
            "final_answer": final_answer,
            "human_answer_label_raw": human_label_raw, # Store the original human label
            "raw_llm_response_answer": raw_judge_response_answer,
            "final_answer_score": test_score_float_answer
        })

    # --- Convert list to DataFrame ---
    print("\nEvaluation loop complete. Creating DataFrame...")
    answer_results_df = pd.DataFrame(answer_results_list)

    # --- Add Comparison Columns ---
    print("Adding comparison columns...")
    # Map human label "Fail" to "Bad" for comparison
    answer_results_df['human_answer_label_mapped'] = answer_results_df['human_answer_label_raw'].map(HUMAN_LABEL_MAP).fillna(answer_results_df['human_answer_label_raw'])

    # Convert score (1.0/0.0) to LLM label ("Good"/"Bad")
    def answer_score_to_label(score):
        return LLM_POSITIVE_LABEL if score == 1.0 else "Bad" # Assumes 0.0 means Bad

    answer_results_df['llm_answer_label'] = answer_results_df['final_answer_score'].apply(answer_score_to_label)

    # Compare mapped human label with LLM label (case-insensitive)
    answer_results_df['answer_match'] = answer_results_df.apply(
        lambda row: str(row['llm_answer_label']).lower() == str(row['human_answer_label_mapped']).lower()
                    if row['human_answer_label_mapped'] not in ["MISSING", "MISSING_KEY"] else None,
        axis=1
    )


    # --- Display DataFrame ---
    print("\nFinal Answer Quality Evaluation Results:")
    # Select and reorder columns
    display_cols_answer = ['index', 'user_query', 'final_answer', 'human_answer_label_raw', 'llm_answer_label', 'answer_match', 'raw_llm_response_answer']
    display_cols_answer = [col for col in display_cols_answer if col in answer_results_df.columns] # Ensure columns exist

    pd.set_option('display.max_rows', 100)
    pd.set_option('display.max_colwidth', 200)
    display(answer_results_df[display_cols_answer])

    # --- Calculate Match Percentage ---
    if 'answer_match' in answer_results_df.columns:
        match_count_answer = answer_results_df['answer_match'].sum() # Counts True values
        valid_comparisons_answer = answer_results_df['answer_match'].notna().sum() # Counts non-None values
        if valid_comparisons_answer > 0:
            match_percentage_answer = (match_count_answer / valid_comparisons_answer) * 100
            print(f"\nAgreement between LLM Judge and Human Labels (Final Answer): {match_percentage_answer:.2f}% ({match_count_answer}/{valid_comparisons_answer})")
        else:
            print("\nCould not calculate agreement percentage (no valid comparisons).")

else:
    print("--- Final Answer Evaluation Aborted due to failed prerequisites ---")

print("\n--- End Final Answer Quality Evaluation ---")


In [None]:
    import re # Needed for parsing
    from typing import Dict, Any # Added type hinting

    # --- Evaluator 2: SQL Correctness (Direct API Call) ---

    SQL_CORRECTNESS_PROMPT_TEMPLATE = """Evaluate if the Generated SQL is semantically correct and appropriate for the User Query. Consider typical schemas (e.g., transcript_segments table). Ignore Final Answer quality.

    User Query:
    {user_query}

    Generated SQL:
    {generated_sql}

    Is the SQL correct and appropriate? Answer ONLY "Correct" or "Incorrect".
    """

    def evaluate_sql_correctness_direct_api(example: Dict[str, Any]) -> float:
        """Evaluates SQL correctness via direct LLM API call (1.0 Correct, 0.0 Incorrect)."""
        input_data = example.get("input", {})
        user_query = input_data.get("user_query", "")
        generated_sql = input_data.get("generated_sql", "")

        if not user_query or not generated_sql: return 0.0

        prompt = SQL_CORRECTNESS_PROMPT_TEMPLATE.format(user_query=user_query, generated_sql=generated_sql)

        try:
            response_text = call_gemini(prompt)
            # Word boundaries are required: a bare "Correct" pattern also matches inside "Incorrect".
            return 1.0 if re.search(r"\bCorrect\b", response_text, re.IGNORECASE) else 0.0
        except Exception as e:
            # print(f"Error during SQL correctness direct API call: {e}") # Keep commented unless debugging
            return 0.0

    # --- Evaluator 3: Final Answer Quality (Direct API Call) ---

    FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE = """Evaluate if the Final Answer accurately and completely answers the User Query, based ONLY on the query and answer text. Do not assume external data or SQL.

    User Query:
    {user_query}

    Final Answer:
    {final_answer}

    Is the Final Answer Good or Bad? Answer ONLY "Good" or "Bad".
    """

    def evaluate_final_answer_quality_direct_api(example: Dict[str, Any]) -> float:
        """Evaluates Final Answer quality via direct LLM API call (1.0 Good, 0.0 Bad)."""
        input_data = example.get("input", {})
        user_query = input_data.get("user_query", "")
        final_answer = input_data.get("final_answer", "")

        if not user_query or not final_answer: return 0.0

        prompt = FINAL_ANSWER_QUALITY_PROMPT_TEMPLATE.format(user_query=user_query, final_answer=final_answer)

        try:
            response_text = call_gemini(prompt)
            return 1.0 if re.search(r"Good", response_text, re.IGNORECASE) else 0.0
        except Exception as e:
            # print(f"Error during Final Answer quality direct API call: {e}") # Keep commented unless debugging
            return 0.0

    print("Defined direct API evaluators: evaluate_sql_correctness_direct_api, evaluate_final_answer_quality_direct_api")

    # Combine all custom evaluators
    all_custom_evaluators = [
        evaluate_tool_usage_direct_api,
        evaluate_sql_correctness_direct_api,
        evaluate_final_answer_quality_direct_api
    ]
    print(f"Combined evaluator list created with {len(all_custom_evaluators)} evaluators.")
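
A note on parsing one-word verdicts: because "Incorrect" contains "Correct", a plain case-insensitive substring search cannot tell the two apart. The sketch below compares a substring pattern with a word-boundary pattern on a few hypothetical replies. The helper names are illustrative and not part of the notebook.

```python
import re

def verdict_substring(response_text: str) -> float:
    # Plain substring search: "Incorrect" also contains "Correct".
    return 1.0 if re.search(r"Correct", response_text, re.IGNORECASE) else 0.0

def verdict_whole_word(response_text: str) -> float:
    # Word boundaries stop "Incorrect" from matching.
    return 1.0 if re.search(r"\bCorrect\b", response_text, re.IGNORECASE) else 0.0

# Hypothetical judge replies, invented for illustration.
for reply in ["Correct", "Incorrect", "incorrect."]:
    print(f"{reply!r}: substring={verdict_substring(reply)}, whole_word={verdict_whole_word(reply)}")
```

The whole-word version still accepts phrases such as "not correct", so comparing the stripped reply against the exact expected labels would be stricter where the judge reliably answers with a single word.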

## Run Full Experiment with All Custom Evaluators

Now we execute the Phoenix `run_experiment` function using:
* The `evaluation_dataset`.
* The `dummy_task_function` (as a placeholder for the actual agent).
* Our list of all three custom direct API evaluators (`all_custom_evaluators`).

This will run the evaluation process for each example in the dataset, calling our custom functions to generate scores for Tool Usage, SQL Correctness, and Final Answer Quality based on the direct API calls.

The results, including inputs, outputs, and scores, will be logged to a new, timestamped experiment in the Phoenix UI, allowing for detailed analysis and comparison later.

In [None]:
# --- Run Full Experiment with All Custom Direct API Evaluators (Corrected Call) ---
import phoenix as px
from phoenix.experiments import run_experiment
from datetime import datetime
from typing import List, Callable, Dict, Any # Added imports

# Ensure necessary components are loaded
missing = []
if 'evaluation_dataset' not in locals() or len(evaluation_dataset) == 0: missing.append("evaluation_dataset")
if 'dummy_task_function' not in locals(): missing.append("dummy_task_function")
if 'all_custom_evaluators' not in locals() or len(all_custom_evaluators) != 3: missing.append("all_custom_evaluators list")
if 'px_client' not in locals(): missing.append("px_client")

if not missing:
    print("Running full experiment with all custom direct API evaluators...")
    now_str = datetime.now().strftime("%Y%m%d-%H%M%S")
    experiment_name = f"FullDirectAPI_Eval_{now_str}"

    # Define type hint for evaluator list
    EvaluatorList = List[Callable[[Dict[str, Any]], float]]

    # *** Corrected call: Pass dummy_task_function as the second positional argument ***
    full_experiment = run_experiment(
        evaluation_dataset,                 # First positional: dataset
        dummy_task_function,                # Second positional: task function
        evaluators=all_custom_evaluators,   # Keyword: evaluators
        experiment_name=experiment_name,    # Keyword: experiment_name
        experiment_description="Eval: Direct API ToolUsage, SQLCorrectness, FinalAnswer." # Keyword: description
        # concurrency=5
    )

    print(f"\nFull Experiment '{experiment_name}' run initiated.")
    print("Check Phoenix UI for results.")
    # if hasattr(px_client, 'get_endpoint'): print(f"Phoenix UI Endpoint: {px_client.get_endpoint()}")

else:
    print(f"Skipping full experiment - missing: {', '.join(missing)}")

print("\n--- End Full Experiment ---")

### Experiment: Running the Agent with a Different Model (`gpt-4o`)

Having established a baseline run and evaluation using the **`gpt-4o-mini`** model, our next step is to run an experiment comparing its performance against a different model, **`gpt-4o`**.

The process involves:
1.  Modifying the `MODEL` constant in `src/agent/agent.py` to `"gpt-4o"`.
2.  Re-running the agent script (`python src/agent/agent.py`), which executes the predefined test queries using the new model.
3.  Crucially, ensuring the OpenTelemetry traces for this experimental run are captured in Arize Phoenix so we can directly compare latency, token usage, and potentially tool calls between the `gpt-4o-mini` (baseline) and `gpt-4o` runs within the Phoenix UI.

#### Challenge: Capturing Traces for the Experimental Run (`gpt-4o`)

When we re-ran `src/agent/agent.py` after changing the model to `gpt-4o`, we unexpectedly hit significant issues with the Arize Phoenix tracing configuration. The script failed to connect to the Phoenix cloud endpoint (`https://app.phoenix.arize.com`), even though neither the tracing setup code nor the `.env` file had changed since the previous successful baseline run.

**Debugging the Connection Issues:**

*   **Defaulting to Localhost:** The primary issue was that the `phoenix.otel.register()` function, despite having environment variables set (either standard `OTEL_...` or `PHOENIX_...`), consistently ignored the cloud endpoint and attempted to connect via **gRPC** to `localhost:4317`, resulting in `StatusCode.UNAVAILABLE` errors.
*   **Unreliable Auto-Detection:** Attempts to guide the automatic detection by forcing the protocol (`http/protobuf`) or using specific environment variable names recommended in documentation (`PHOENIX_COLLECTOR_ENDPOINT`, `PHOENIX_CLIENT_HEADERS`) were unsuccessful in making the library use the correct hostname from the environment variables. It kept defaulting to `localhost` (either port 4317 for gRPC or 6006 for HTTP).
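
When debugging this kind of issue, it can help to confirm what the Python process actually sees before `register()` is called. The following is a minimal sketch, not part of the original agent script: it lists the two Phoenix variables named above alongside the standard OTLP exporter variables, and masks header values so the API key is not printed. The function names are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Variables named in the text above plus the standard OTLP exporter ones.
TRACING_ENV_VARS = (
    "PHOENIX_COLLECTOR_ENDPOINT",
    "PHOENIX_CLIENT_HEADERS",
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_EXPORTER_OTLP_HEADERS",
    "OTEL_EXPORTER_OTLP_PROTOCOL",
)

def mask_value(name: str, value: str) -> str:
    """Hide everything after '=' for header variables, which carry the API key."""
    if "HEADERS" in name:
        key, sep, _ = value.partition("=")
        return f"{key}=***" if sep else "***"
    return value

def snapshot_tracing_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Return the tracing-related variables as seen by this process (None if unset)."""
    env = os.environ if env is None else env
    return {
        name: (mask_value(name, env[name]) if name in env else None)
        for name in TRACING_ENV_VARS
    }

for name, value in snapshot_tracing_env().items():
    print(f"{name:30} {value if value is not None else '<unset>'}")
```

Running this right after `load_dotenv()` shows whether the variables were loaded at all. It does not explain why the library ignored them, but it separates "variable missing" from "variable set but not honored".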

**Working Solution: Explicit Configuration:**

The only reliable way to ensure traces were sent correctly to `https://app.phoenix.arize.com` for this experimental run was to **bypass the automatic environment variable detection entirely within the `phoenix.otel.register()` function.**

This required modifying `src/agent/agent.py` for the tracing setup block:
1.  Ensure the `.env` file contains the correct header variable as specified in the Phoenix cloud documentation: `PHOENIX_CLIENT_HEADERS="api_key=YOUR_KEY_VALUE"`.
2.  Update the Python code to:
    *   Explicitly define the full endpoint URL: `endpoint = "https://app.phoenix.arize.com/v1/traces"`
    *   Read the `PHOENIX_CLIENT_HEADERS` environment variable using `os.getenv()`.
    *   Parse the header string into the required dictionary format: `headers_dict = {"api_key": "YOUR_KEY_VALUE"}`.
    *   Pass these directly to the registration function:
        ```python
        phoenix_tracer_provider = register(
            project_name=PROJECT_NAME,
            endpoint=endpoint,
            headers=headers_dict
        )
        ```
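
The header parsing in step 2 can be sketched as follows. This is one possible implementation, assuming the variable holds comma-separated `key=value` pairs and that values contain no commas. The function name `parse_client_headers` is hypothetical, and the fallback string is the placeholder from step 1, not a real key.

```python
import os
from typing import Dict

def parse_client_headers(raw: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict; raise ValueError on malformed input."""
    headers: Dict[str, str] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        # Split on the first '=' only, so values that contain '=' stay intact.
        key, sep, value = pair.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError("Expected 'key=value' pairs in the header string.")
        headers[key.strip()] = value.strip()
    if not headers:
        raise ValueError("No headers found in the header string.")
    return headers

# Falls back to the placeholder from step 1 when the variable is unset.
raw_headers = os.getenv("PHOENIX_CLIENT_HEADERS", "api_key=YOUR_KEY_VALUE")
headers_dict = parse_client_headers(raw_headers)
endpoint = "https://app.phoenix.arize.com/v1/traces"
print(f"endpoint={endpoint}, header keys={list(headers_dict)}")
```

The resulting `endpoint` and `headers_dict` are the values passed to `register()` in the snippet above. Failing fast on a malformed header string makes a missing or mistyped key visible before any connection attempt.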

**Outcome:**

With this explicit configuration hardcoded in the script, the agent successfully connected to Phoenix, and the traces for our `gpt-4o` experimental run were captured. This allows us to proceed with comparing the two models (`gpt-4o-mini` vs `gpt-4o`) within the Phoenix UI. This troubleshooting detour highlights potential fragility in automatic OTel configuration detection and underscores the utility of explicit configuration when encountering connection problems.

In [None]:
# Cell to Load GPT-4o Dataset

import json # Make sure json is imported

# This is the name we decided on for the dataset created in the UI.
# Make sure this exactly matches the name in your Phoenix UI.
new_dataset_name = "Experiment_GPT4o_AllSpans"

print(f"\nAttempting to load dataset '{new_dataset_name}'...")

# Check if client was initialized successfully in the setup cell
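# Check existence first, then the value: comparing the string literal 'px_client' to None is always False.
if 'px_client' not in locals() or px_client is None:
    raise NameError("Phoenix client 'px_client' was not initialized successfully. Please re-run the modified Setup Cell (#1).")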

# Load the specified dataset by its exact name into a NEW variable
evaluation_dataset_gpt4o = px_client.get_dataset(name=new_dataset_name)
print("Dataset loaded successfully.")

# Print number of examples
print(f"Number of examples in new dataset '{new_dataset_name}': {len(evaluation_dataset_gpt4o)}")

# --- Inspect the first few examples ---
num_examples_to_show = 3 # Adjust if you want to see more/less
print(f"\n--- Inspecting First {num_examples_to_show} Examples from '{new_dataset_name}' ---")

if len(evaluation_dataset_gpt4o) > 0:
    for i, example in enumerate(evaluation_dataset_gpt4o[:num_examples_to_show]):
        print(f"\n--- Example {i+1} ---")
        print("\nInput Data:")
        try:
            print(json.dumps(example.input, indent=2))
        except Exception as e:
            print(f"Could not display input: {e}")
        print("\nOutput/Label Data:") # Check if labels are present as expected
        try:
            print(json.dumps(example.output, indent=2))
        except Exception as e:
            print(f"Could not display output/labels: {e}")
        print("\nMetadata:")
        try:
            print(json.dumps(example.metadata, indent=2))
        except Exception as e:
            print(f"Could not display metadata: {e}")
else:
    print(f"Dataset '{new_dataset_name}' appears to be empty.")

print(f"\n--- Finished Loading and Inspecting GPT-4o Dataset ({new_dataset_name}) ---")


In [None]:
    # Cell 1: Initialize Phoenix Client EXPLICITLY for Cloud

    import phoenix as px
    import os
    import json

    print("--- Initializing Phoenix Client Explicitly for Cloud ---")

    # --- Configuration ---
    cloud_api_endpoint = "https://app.phoenix.arize.com"
    api_headers_str = os.getenv("PHOENIX_CLIENT_HEADERS") # Read from environment

    if not api_headers_str:
        raise ValueError("CRITICAL: PHOENIX_CLIENT_HEADERS environment variable not found.")

    api_headers_dict = {}
    try:
        key, value = api_headers_str.split('=', 1)
        api_headers_dict[key.strip()] = value.strip()
        if not api_headers_dict: raise ValueError("Parsed headers are empty.")
        print(f"Found headers: Key='{list(api_headers_dict.keys())[0]}'")
    except Exception as parse_err:
        raise ValueError(f"Invalid PHOENIX_CLIENT_HEADERS format: '{api_headers_str}'. Expected 'key=value'. Error: {parse_err}") from parse_err
    # --- End Configuration ---

    # --- Initialize Client Explicitly ---
    try:
        print(f"Attempting px.Client(endpoint='{cloud_api_endpoint}', headers=...)")
        # Use explicit endpoint and headers
        px_client = px.Client(endpoint=cloud_api_endpoint, headers=api_headers_dict)
        print("Phoenix client initialized successfully using explicit arguments.")
    except Exception as e:
        print(f"ERROR initializing Phoenix Client explicitly: {e}")
        px_client = None
    # --- End Initialization ---

    print("--- Client Initialization Complete ---")

### Prepare Combined Data for Evaluation

We now have two datasets loaded:

1.  `evaluation_dataset`: Contains the results from the baseline (`gpt-4o-mini`) run. It may have been built only from examples that received feedback in the Phoenix UI, so it might not contain *all* examples from the original run. Its `.output` field holds the ground truth labels and explanations derived from that UI feedback.
2.  `evaluation_dataset_gpt4o`: Contains the results from the experimental (`gpt-4o`) run, loaded directly from the trace data. Its `.input` field contains the agent's outputs for this run (e.g., `final_answer`, `generated_sql`), but its `.output` field likely lacks the ground truth labels.

Our LLM-as-a-Judge evaluators (`all_custom_evaluators`) need both the agent's output (from the `gpt-4o` run) and the corresponding ground truth labels (from the baseline dataset).

Therefore, the next step is to **combine** these two datasets within the notebook. We will iterate through the examples from the `gpt-4o` run (`evaluation_dataset_gpt4o`) and, using the `user_query` as a key, attempt to find the matching example with ground truth labels in the baseline dataset (`evaluation_dataset`).

We will create a new list, `combined_eval_data`, containing only the examples where a match was found. Each item in this list will have:
*   The `user_query`.
*   The outputs generated by the `gpt-4o` model (`gpt4o_output`).
*   The ground truth labels and explanations (`ground_truth`) from the baseline dataset.

This `combined_eval_data` list will be the input for our evaluation process in the subsequent steps. We will also report how many examples from the `gpt-4o` run could be matched and how many were skipped due to missing labels in the baseline dataset.
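
Because the join key is free text, small differences in whitespace or casing between the two runs would cause silent mismatches. The sketch below shows one way to make the match more tolerant by normalizing the query before lookup. This is an assumption layered on top of the plan above, not the notebook's method, and the helper names and toy data are hypothetical.

```python
from typing import Any, Dict, List

def normalize_query(query: str) -> str:
    """Collapse whitespace and case-fold so near-identical queries share a key."""
    return " ".join(query.split()).casefold()

def combine_by_query(
    experimental: Dict[str, Dict[str, Any]],
    baseline_labels: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each experimental output with baseline labels for the same query."""
    labels_by_key = {normalize_query(q): labels for q, labels in baseline_labels.items()}
    combined = []
    for query, output in experimental.items():
        labels = labels_by_key.get(normalize_query(query))
        if labels is not None:
            combined.append(
                {"user_query": query, "gpt4o_output": output, "ground_truth": labels}
            )
    return combined

# Toy data, invented for illustration.
baseline = {"What did the CEO say about margins?": {"final_answer_quality_label": "Good"}}
experimental = {
    "what did the CEO say  about margins? ": {"final_answer": "Margins improved."},
    "Unlabeled question": {"final_answer": "n/a"},
}
combined = combine_by_query(experimental, baseline)
print(f"matched {len(combined)} of {len(experimental)} experimental examples")
```

Normalization trades strictness for recall: two genuinely different queries that differ only in case would collapse to one key, so an exact match remains the safer default when the query strings are known to be identical across runs.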

In [None]:
# Cell: Compare Datasets and Prepare Combined Data for Evaluation

import json

print("\n--- Comparing Datasets and Preparing Combined Data ---")

# Ensure both datasets are loaded
if 'evaluation_dataset' not in locals() or 'evaluation_dataset_gpt4o' not in locals():
    raise NameError("One or both datasets ('evaluation_dataset', 'evaluation_dataset_gpt4o') are not loaded.")

# --- Extract Baseline Labels (keyed by user_query) ---
baseline_labels = {}
for example in evaluation_dataset:
    try:
        # Assuming 'user_query' is directly in the input field
        query = example.input.get("user_query")
        if query and example.output: # Check if query exists and there's label data
            baseline_labels[query] = example.output
    except Exception as e:
        print(f"Warning: Could not process baseline example: {e}")

print(f"\nExtracted {len(baseline_labels)} labeled examples from baseline dataset ('evaluation_dataset').")
if len(baseline_labels) < len(evaluation_dataset):
     print("  (Note: Some baseline examples might have been skipped if missing 'user_query' or 'output' field).")


# --- Extract Experimental Results (keyed by user_query) ---
experimental_results = {}
for example in evaluation_dataset_gpt4o:
     try:
        # Assuming 'user_query' is directly in the input field
        query = example.input.get("user_query")
        if query:
            # Store the whole input dict containing gpt-4o's answers/SQL etc.
            experimental_results[query] = example.input
     except Exception as e:
        print(f"Warning: Could not process experimental example: {e}")

print(f"\nExtracted {len(experimental_results)} examples from experimental dataset ('evaluation_dataset_gpt4o').")

# --- Create Combined Data for Evaluation ---
combined_eval_data = []
missing_labels_count = 0
found_labels_count = 0

print("\nCombining data...")
for query, gpt4o_result in experimental_results.items():
    if query in baseline_labels:
        # Found corresponding labels in the baseline dataset
        combined_item = {
            "user_query": query,
            "gpt4o_output": gpt4o_result, # Contains final_answer, generated_sql etc. from gpt-4o run
            "ground_truth": baseline_labels[query] # Contains labels and explanations
        }
        combined_eval_data.append(combined_item)
        found_labels_count += 1
    else:
        # No labels found for this query in the baseline dataset
        # print(f"  Skipping query (no labels found): {query[:80]}...") # Uncomment to see skipped queries
        missing_labels_count += 1

print(f"\nSuccessfully combined data for {found_labels_count} examples.")
if missing_labels_count > 0:
    print(f"Skipped {missing_labels_count} examples from the gpt-4o run because corresponding labels were not found in the baseline dataset.")

# --- Inspect the first combined item ---
if combined_eval_data:
    print("\n--- First Combined Example ---")
    print(json.dumps(combined_eval_data[0], indent=2))
else:
    print("\nNo combined data was prepared. Check dataset alignment or content.")

print("\n--- Data Preparation Complete ---")
# The variable 'combined_eval_data' now holds the data ready for evaluation.


In [None]:
# Cell to Inspect Structure of Experimental Dataset Example

import json

print("\n--- Inspecting Structure of First Example from 'evaluation_dataset_gpt4o' ---")

if 'evaluation_dataset_gpt4o' in locals() and len(evaluation_dataset_gpt4o) > 0:
    first_exp_example = evaluation_dataset_gpt4o[0]

    print("\n--- first_exp_example.input ---")
    try:
        # See what keys are actually inside 'input'
        print(f"Keys: {list(first_exp_example.input.keys())}")
        print(json.dumps(first_exp_example.input, indent=2))
    except Exception as e:
        print(f"Could not display input: {e}")

    print("\n--- first_exp_example.output ---")
    try:
        # Check if 'output' contains anything useful (might be None or empty)
        print(f"Keys: {list(first_exp_example.output.keys())}")
        print(json.dumps(first_exp_example.output, indent=2))
    except Exception as e:
        print(f"Could not display output: {e}")


    print("\n--- first_exp_example.metadata ---")
    try:
         # Check if metadata holds the query
        print(f"Keys: {list(first_exp_example.metadata.keys())}")
        print(json.dumps(first_exp_example.metadata, indent=2))
    except Exception as e:
        print(f"Could not display metadata: {e}")

else:
    print("Variable 'evaluation_dataset_gpt4o' not found or is empty.")

print("\n--- Inspection Complete ---")


In [None]:
# Cell: Compare Datasets and Prepare Combined Data for Evaluation (Corrected Extraction)

import json

print("\n--- Comparing Datasets and Preparing Combined Data (Corrected Extraction) ---")

# Ensure both datasets are loaded
if 'evaluation_dataset' not in locals() or 'evaluation_dataset_gpt4o' not in locals():
    raise NameError("One or both datasets ('evaluation_dataset', 'evaluation_dataset_gpt4o') are not loaded.")

# Helper function to safely extract user query from messages list
def get_user_query(messages_list):
    if not isinstance(messages_list, list):
        return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "user" and "content" in msg:
            return msg["content"]
    return None

# --- Extract Baseline Labels (keyed by user_query) ---
# Assumes baseline dataset structure also has query in input (potentially different structure)
# If baseline also failed before, this needs adjustment too. Let's assume it worked.
baseline_labels = {}
for example in evaluation_dataset:
    try:
        # --- !! ADJUST THIS IF BASELINE STRUCTURE IS DIFFERENT !! ---
        query = example.input.get("user_query") # Assuming baseline WAS flat structure
        # If baseline also had nested structure, use:
        # query = get_user_query(example.input.get("messages"))
        # --- !! ---------------------------------------------- !! ---

        if query and example.output: # Check if query exists and there's label data
            baseline_labels[query] = example.output
    except Exception as e:
        print(f"Warning: Could not process baseline example: {e}")

print(f"\nExtracted {len(baseline_labels)} labeled examples from baseline dataset ('evaluation_dataset').")


# --- Extract Experimental Results (keyed by user_query) ---
# We now know the structure for the experimental dataset
experimental_results = {}
for example in evaluation_dataset_gpt4o:
     try:
        # Extract query using the helper function for the known nested structure
        query = get_user_query(example.input.get("messages"))
        if query:
            # Need to find the agent's actual output (final_answer, generated_sql)
            # This likely comes from the processing done in parse_spans.ipynb or similar step
            # Let's ASSUME the important outputs were copied into the .input field
            # during dataset creation, similar to the baseline structure for simplicity.
            # If not, we need to figure out where the final_answer etc. are stored for gpt4o run.
            # For now, let's just store the whole input dict.
            experimental_results[query] = example.input # Storing the raw input for now
     except Exception as e:
        print(f"Warning: Could not process experimental example: {e}")

print(f"\nExtracted {len(experimental_results)} examples from experimental dataset ('evaluation_dataset_gpt4o').")
if len(experimental_results) == 0:
     print("ERROR: Failed to extract any results from the experimental dataset. Check extraction logic.")


# --- Create Combined Data for Evaluation ---
combined_eval_data = []
missing_labels_count = 0
found_labels_count = 0

print("\nCombining data...")
if len(experimental_results) > 0:
    for query, gpt4o_exp_input in experimental_results.items():
        if query in baseline_labels:
            # Found corresponding labels in the baseline dataset

            # **** IMPORTANT ASSUMPTION ****
            # The key agent outputs (final_answer, generated_sql, tool_called) are assumed
            # to be present in gpt4o_exp_input (the example's .input field).
            # This may not hold if the experimental dataset stored only raw messages/tools.
            # If evaluation fails later, revisit how this value is constructed.
            # For now, pass the whole dict through.
            gpt4o_output_data_for_eval = gpt4o_exp_input

            combined_item = {
                "user_query": query,
                # This is the data the evaluator will receive as the 'output' of the task
                "gpt4o_output_for_evaluator": gpt4o_output_data_for_eval,
                # This is the ground truth data from the baseline dataset
                "ground_truth": baseline_labels[query]
            }
            combined_eval_data.append(combined_item)
            found_labels_count += 1
        else:
            # No labels found for this query in the baseline dataset
            missing_labels_count += 1

    print(f"\nSuccessfully combined data for {found_labels_count} examples.")
    if missing_labels_count > 0:
        print(f"Skipped {missing_labels_count} examples from the gpt-4o run because corresponding labels were not found in the baseline dataset.")
    if found_labels_count == 0 and len(experimental_results) > 0:
        print("WARNING: No examples could be matched between datasets based on user_query. Check for subtle differences in query strings.")


    # --- Inspect the first combined item ---
    if combined_eval_data:
        print("\n--- First Combined Example Structure ---")
        # Print structure, not necessarily full content
        first_combined = combined_eval_data[0]
        print(f"Keys: {list(first_combined.keys())}")
        print(f"  user_query: {first_combined.get('user_query')[:80]}...")
        print(f"  gpt4o_output_for_evaluator keys: {list(first_combined.get('gpt4o_output_for_evaluator', {}).keys())}")
        print(f"  ground_truth keys: {list(first_combined.get('ground_truth', {}).keys())}")
        # print(json.dumps(combined_eval_data[0], indent=2)) # Uncomment for full details
    else:
        print("\nNo combined data was prepared. Check dataset alignment or content.")

else:
    print("\nSkipping combination because no experimental results were extracted.")


print("\n--- Data Preparation Complete ---")
# The variable 'combined_eval_data' now holds the data ready for evaluation.
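
The warning above names subtle differences in query strings as a likely cause of failed matches. One way to reduce that risk is to normalize queries before using them as dictionary keys. The sketch below is only an illustration: `normalize_query` is a hypothetical helper that is not part of this notebook, and collapsing whitespace and ignoring case is an assumption about which differences are harmless.

```python
import re
from typing import Optional


def normalize_query(query: Optional[str]) -> Optional[str]:
    """Hypothetical helper: collapse whitespace and ignore case for key matching."""
    if not isinstance(query, str):
        return None
    return re.sub(r"\s+", " ", query).strip().casefold()


# Made-up data, for illustration only
baseline = {normalize_query("How many  calls mention pricing?"): {"label": "correct"}}
experimental_query = " how many calls mention pricing?\n"

# The lookup succeeds even though the raw strings differ in spacing and case
match = baseline.get(normalize_query(experimental_query))
```

Applying the same normalization when building both `baseline_labels` and `experimental_results` would keep the keys comparable. Whether case-insensitive matching is appropriate depends on the queries in the dataset.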


### Run GPT-4o Evaluation

We have successfully loaded the experimental dataset containing the results from the `gpt-4o` agent run into the variable `evaluation_dataset_gpt4o`.

We have also confirmed that our custom evaluation functions (`all_final_evaluators`) operate in a **zero-shot** manner. They evaluate the agent's performance (Tool Usage, SQL Correctness, Final Answer Quality) by analyzing the agent's inputs and outputs (contained in the `input` field of each example in `evaluation_dataset_gpt4o`) using an LLM judge (`call_gemini`), without requiring the separate ground truth labels that were added to the baseline dataset via the UI.

**Next Step: Run Evaluation**

The next step is to execute the evaluation using the `phoenix.experiments.run_experiment` function. We will pass it:
*   The `evaluation_dataset_gpt4o` (containing the data from the `gpt-4o` run).
*   The `dummy_task_function` (which extracts the user query, final answer, tool-call flag, and generated SQL from each example's messages and returns them as a flat dict).
*   The `all_final_evaluators` list (containing our zero-shot evaluator functions).

This will run the evaluations for the `gpt-4o` model and log the results to a new experiment run in Phoenix, allowing us to compare performance against the baseline.
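
To make the zero-shot setup concrete, here is a minimal sketch of an evaluator that scores only the task output. It is not the notebook's actual evaluator: `make_final_answer_evaluator`, the prompt text, the `good`/`bad` labels, and the injected `judge` callable (a stand-in for the real LLM call) are all assumptions made for illustration. It also assumes the task function returns a flat dict with `user_query` and `final_answer` keys.

```python
from typing import Any, Callable, Dict


def make_final_answer_evaluator(
    judge: Callable[[str], str],
) -> Callable[[Dict[str, Any]], float]:
    """Build a zero-shot evaluator around a hypothetical judge(prompt) -> label callable."""

    def final_answer_quality(output: Dict[str, Any]) -> float:
        # Only the task output is inspected; no ground-truth label is consulted.
        query = output.get("user_query")
        answer = output.get("final_answer")
        if not query or not answer:
            return 0.0  # Nothing to judge
        prompt = (
            "You are grading an assistant's answer.\n"
            f"Question: {query}\n"
            f"Answer: {answer}\n"
            "Reply with exactly one word: good or bad."
        )
        label = judge(prompt).strip().lower()
        return 1.0 if label == "good" else 0.0

    return final_answer_quality


# Stub judge so the sketch runs without any API call (hypothetical behavior)
def stub_judge(prompt: str) -> str:
    return "good" if "Answer: 12 calls" in prompt else "bad"


evaluator = make_final_answer_evaluator(stub_judge)
score_ok = evaluator({"user_query": "How many calls?", "final_answer": "12 calls"})
score_missing = evaluator({"user_query": "How many calls?", "final_answer": None})
```

Injecting the judge as a parameter keeps the scoring logic testable without network access. In a Phoenix experiment, an evaluator whose parameter is named `output` receives the task function's return value, which is why a flat dict is convenient here.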

In [None]:
# Cell: Run Experiment on GPT-4o Dataset
# (helpers and task function are redefined here so run_experiment uses the correct versions)

import json
from datetime import datetime
from typing import Any, Callable, Dict, List

import phoenix as px
from phoenix.experiments import run_experiment
from phoenix.experiments.types import Example

# --- Define/Re-define Dummy Task Function HERE to ensure it's used by run_experiment ---
print("\n--- Defining/Re-defining dummy_task_function with data extraction IN THIS CELL ---")

# Helper to safely extract the user query from a messages list
def get_user_query(messages_list):
    if not isinstance(messages_list, list):
        return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "user" and "content" in msg:
            return msg["content"]
    return None

# Helper to find the final assistant answer (last assistant message with content)
def get_final_answer(messages_list):
    if not isinstance(messages_list, list):
        return None
    for msg in reversed(messages_list):  # Look from the end
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("content"):
            return msg.get("content")
    return None

# Helper to check whether a specific tool was called
def check_tool_called(messages_list, tool_name="query_database"):
    if not isinstance(messages_list, list):
        return False
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tool_call in msg.get("tool_calls", []):
                if isinstance(tool_call, dict) and tool_call.get("function", {}).get("name") == tool_name:
                    return True
    return False

# Helper to extract the SQL query from the first query_database tool call
def get_generated_sql(messages_list):
    if not isinstance(messages_list, list):
        return None
    for msg in messages_list:
        if isinstance(msg, dict) and msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tool_call in msg.get("tool_calls", []):
                if isinstance(tool_call, dict) and tool_call.get("function", {}).get("name") == "query_database":
                    try:
                        args_str = tool_call.get("function", {}).get("arguments", "{}")
                        return json.loads(args_str).get("sql_query")
                    except Exception:
                        # Covers json.JSONDecodeError and other malformed argument payloads
                        return None
    return None

# The actual dummy task function definition
def dummy_task_function(example: Example) -> dict:
    """
    Extracts relevant fields from the example.input (which contains the raw run data)
    and returns a FLAT dictionary matching the structure expected by the evaluators.
    """
    if not hasattr(example, 'input') or not isinstance(example.input, dict):
        return {}
    input_data = example.input
    messages = input_data.get("messages", [])
    user_query = get_user_query(messages)
    final_answer = get_final_answer(messages)
    tool_called = check_tool_called(messages)
    generated_sql = get_generated_sql(messages)
    # Return the flat dictionary that evaluators expect as their 'output' argument
    return {
        "user_query": user_query,
        "final_answer": final_answer,
        "tool_called": tool_called,
        "generated_sql": generated_sql,
    }

print("dummy_task_function defined/re-defined in this cell.")
# --- End Function Definition ---


# --- Now proceed with running the experiment ---
# Check if the gpt4o dataset variable is loaded from previous cell
if 'evaluation_dataset_gpt4o' in locals() and len(evaluation_dataset_gpt4o) > 0:
    missing_deps = []
    if 'dummy_task_function' not in locals(): missing_deps.append("dummy_task_function") # Should be found now
    # Ensure the evaluator list from earlier setup cells is available
    if 'all_final_evaluators' not in locals(): missing_deps.append("all_final_evaluators list")
    if 'px_client' not in locals() or px_client is None: missing_deps.append("px_client")

    if not missing_deps:
        print("\nRunning full experiment for GPT-4o using direct API evaluators...")
        now_str = datetime.now().strftime("%Y%m%d-%H%M%S")
        # Give this experiment run a distinct name
        experiment_name_gpt4o = f"GPT4o_Full_OpenAI_Eval_{now_str}" # Matched naming convention

        # Define type hint for evaluator list (if not defined globally)
        EvaluatorList = List[Callable[[Dict[str, Any]], float]]

        # Assuming run_experiment works without explicit env vars here
        try:
            full_experiment_gpt4o = run_experiment(
                evaluation_dataset_gpt4o,           # Use the GPT-4o dataset variable
                dummy_task_function,                # Uses the function defined above in this cell
                evaluators=all_final_evaluators,   # Re-use the same evaluators list
                experiment_name=experiment_name_gpt4o, # Use the new, distinct experiment name
                experiment_description="Eval GPT-4o: Direct API ToolUsage, SQLCorrectness, FinalAnswer." # Updated description
                # concurrency=5 # Optional: Adjust concurrency if needed
            )

            print(f"\nFull Experiment '{experiment_name_gpt4o}' run finished.")
            print(f"Check Phoenix UI for results associated with experiment name '{experiment_name_gpt4o}'.")

        except Exception as e:
            # Catch potential errors like Connection refused if it reappears
            print(f"\nERROR during run_experiment for GPT-4o: {e}")
            print("If this is a connection error, consider setting the Phoenix connection variables (e.g. PHOENIX_COLLECTOR_ENDPOINT) via os.environ before the call.")

    else:
        print(f"Skipping GPT-4o experiment - missing dependencies: {', '.join(missing_deps)}")

else:
    print("Skipping GPT-4o experiment - 'evaluation_dataset_gpt4o' was not found or is empty. Did you run the cell above to load it?")

print("\n--- End GPT-4o Full Experiment ---")