# Debugging Phoenix Evaluations

This notebook helps diagnose and fix common issues with Phoenix evaluations, focusing particularly on QA evaluations using Azure OpenAI models.

In [None]:
# Apply nest_asyncio early to avoid asyncio issues
import nest_asyncio
nest_asyncio.apply()

import os
import pandas as pd
import re
from phoenix.evals import OpenAIModel, QA_PROMPT_TEMPLATE, QA_PROMPT_RAILS_MAP, llm_classify
from phoenix.experiments.types import EvaluationResult



## 1. Inspect QA Template & Variables

First, let's examine the template to understand exactly what variables it expects:

In [2]:
# Print the template to clearly see expected variables
import re

template_text = str(QA_PROMPT_TEMPLATE)
print(template_text)

# Dynamically extract `{var}` placeholders from the template
template_vars = sorted(set(re.findall(r"\{(\w+)\}", template_text)))
print("\nExtracted template variables:", template_vars)

print("\nRails (expected output values):", list(QA_PROMPT_RAILS_MAP.values()))


You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.


Extracted template variables: ['input', 'output', 'reference']

Rails (expected output values): ['correct', 'incorrect']
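
If the DataFrame passed to `llm_classify` uses column names that differ from the template's placeholders, the evaluation may not run as intended, so it can help to compare the two up front. The sketch below is a standalone illustration: `missing_template_columns` is a hypothetical helper (not part of Phoenix), and `example_template` is a shortened stand-in for `str(QA_PROMPT_TEMPLATE)`.

```python
import re

def missing_template_columns(template_text, columns):
    """Return template placeholders that have no matching column name."""
    template_vars = set(re.findall(r"\{(\w+)\}", template_text))
    return sorted(template_vars - set(columns))

# Shortened stand-in for str(QA_PROMPT_TEMPLATE); an assumption for illustration.
example_template = "[Question]: {input}\n[Reference]: {reference}\n[Answer]: {output}"

# Columns named after dataset keys rather than template variables.
print(missing_template_columns(example_template, ["question", "context", "output"]))

# Columns that match the template placeholders exactly.
print(missing_template_columns(example_template, ["input", "reference", "output"]))
```

In the notebook, the same check could be applied as `missing_template_columns(template_text, test_data.columns)` once `test_data` exists.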


## 2. Test with a Single Example

Testing with a single example isolates issues and verifies that the basic functionality works:

In [3]:
# Test data with proper column names matching template variables
test_data = pd.DataFrame({
    "input": ["What is the amount of men in Prague at the end of Q3 2024?"],
    "reference": ["676069"],
    "output": ["Based on our data, there are 676,069 men living in Prague at the end of Q3 2024."]
})

# Display the test data
test_data

   input                                              reference  output
0  What is the amount of men in Prague at the end...  676069     Based on our data, there are 676,069 men livin...


## 3. Test Azure OpenAI Configuration

Make sure your Azure OpenAI configuration works correctly:

In [4]:
# Configure the Azure OpenAI Model
model = OpenAIModel(
    model="gpt-4o__test1",  # For Azure, this is the deployment name
    api_version="2024-05-01-preview", 
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    temperature=0.0
)

# Test the model with a simple query
try:
    response = model("Hello, please respond with 'OK' if you're working correctly.")
    print("Model test:", response)
    print("✅ Azure OpenAI connection successful")
except Exception as e:
    print("❌ Azure OpenAI connection failed:")
    print(f"Error: {e}")

Model test: OK
✅ Azure OpenAI connection successful
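
Because `os.getenv` returns `None` for an unset variable, a missing endpoint or key can surface later as a less obvious connection error. A minimal sketch for checking the two variables used above before building the model; `missing_env_vars` is a hypothetical helper introduced here for illustration.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# The two variables read by the model configuration cell above.
required = ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"]

missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set")
```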


## 4. Manual QA Evaluation Test

Run a single evaluation to verify the functionality:

In [5]:
rails = list(QA_PROMPT_RAILS_MAP.values())

try:
    # Run classification on single example
    eval_results = llm_classify(
        data=test_data,  # Use 'data' not 'dataframe'
        template=QA_PROMPT_TEMPLATE,
        model=model,
        rails=rails,
        provide_explanation=True
    )
    
    # Display the results
    print("Evaluation results:")
    print(eval_results)
    print("\nExpected rails:", rails)
    print("\nActual label:", eval_results["label"].iloc[0])
    print("\nExplanation:", eval_results["explanation"].iloc[0])
    
except Exception as e:
    print(f"❌ Evaluation failed with error: {e}")
    
    # Try to get more detailed error information
    import traceback
    traceback.print_exc()

llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:02<00:00 |  2.43s/it

Evaluation results:
     label                                        explanation exceptions  \
0  correct  To determine if the answer is correct, we comp...         []   

  execution_status  execution_seconds  
0        COMPLETED           1.433905  

Expected rails: ['correct', 'incorrect']

Actual label: correct

Explanation: To determine if the answer is correct, we compare the information provided in the answer with the reference text. The question asks for the number of men in Prague at the end of Q3 2024. The reference text states the number as 676069. The answer provided states that there are 676,069 men living in Prague at the end of Q3 2024. Both the numerical value and the context match exactly. Therefore, the answer correctly answers the question based on the reference text.
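
When evaluating more than one row, comparing the returned labels against the rails is a quick way to spot responses that could not be mapped to an expected value. The sketch below uses a hypothetical helper, `labels_outside_rails`, and made-up labels; the `"NOT_PARSABLE"` string is an assumed example of an off-rail label and should be checked against the Phoenix version in use.

```python
def labels_outside_rails(labels, rails):
    """Return (position, label) pairs for labels that are not one of the rails."""
    allowed = set(rails)
    return [(i, label) for i, label in enumerate(labels) if label not in allowed]

rails = ["correct", "incorrect"]

# Made-up labels for illustration; in the notebook these would come from
# eval_results["label"].
example_labels = ["correct", "NOT_PARSABLE", "incorrect", None]

print(labels_outside_rails(example_labels, rails))
```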





## 5. Implement QA Evaluator Function

Test the evaluator function in isolation:

In [9]:
def qa_test_evaluator(output, reference):
    """LLM‑based evaluator that returns a Phoenix EvaluationResult."""
    from phoenix.experiments.types import EvaluationResult  # local import avoids circular issues

    answer = str(output.get("results", "Based on our data, there are 676,069 men living in Prague at the end of Q3 2024."))
    question = output.get("query", "What is the amount of men in Prague at the end of Q3 2024?")
    context = str(reference.get("context", "676069"))

    # Build DataFrame expected by QA_PROMPT_TEMPLATE
    df_eval = pd.DataFrame(
        {"input": [question], "reference": [context], "output": [answer]}
    )
    print("Evaluation DataFrame:")
    print(df_eval)

    try:
        eval_df = llm_classify(
            data=df_eval,
            template=QA_PROMPT_TEMPLATE,
            model=model,
            rails=rails,
            provide_explanation=True,
        )
        label = eval_df["label"].iloc[0]
        score = 1 if label == "correct" else 0
        explanation = eval_df["explanation"].iloc[0]

        print(f"Evaluation successful: {label} (score={score})")
        print(f"Explanation: {explanation}")

        return EvaluationResult(label=label, score=score, explanation=explanation)
    except Exception as e:
        print(f"Evaluation failed: {e}")
        return EvaluationResult(
            label="error",
            score=0,
            explanation=f"Error during evaluation: {e}",
        )

# Test the evaluator with dummy data
test_output = {"query": "What is the amount of men in Prague at the end of Q3 2024?", "results": "Based on our data, there are 676,069 men living in Prague at the end of Q3 2024."}
test_reference = {"context": "676069"}

result = qa_test_evaluator(test_output, test_reference)

Evaluation DataFrame:
                                               input reference  \
0  What is the amount of men in Prague at the end...    676069   

                                              output  
0  Based on our data, there are 676,069 men livin...  


llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:02<00:00 |  2.42s/it

Evaluation successful: correct (score=1)
Explanation: To determine if the answer is correct, we compare the information provided in the answer with the reference text. The question asks for the number of men in Prague at the end of Q3 2024. The reference text states the number as 676069. The answer states that there are 676,069 men living in Prague at the end of Q3 2024. Both the numerical value and the context match exactly. Therefore, the answer correctly answers the question based on the reference text.
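
The `.get(...)` calls in `qa_test_evaluator` fall back to hard-coded defaults. That is convenient for a smoke test, but it can hide a task that returned no `query` or `results` key. One hedged alternative is to fail loudly instead; `require_keys` below is a hypothetical helper, not a Phoenix API.

```python
def require_keys(mapping, keys, name="mapping"):
    """Return the values for keys, raising a KeyError that lists every missing key."""
    missing = [key for key in keys if key not in mapping]
    if missing:
        raise KeyError(f"{name} is missing keys: {missing}")
    return [mapping[key] for key in keys]

# A task output that lacks the "results" key.
incomplete_output = {"query": "What is the amount of men in Prague at the end of Q3 2024?"}

try:
    question, answer = require_keys(incomplete_output, ["query", "results"], name="output")
except KeyError as exc:
    print(f"Evaluator input problem: {exc}")
```

Inside an evaluator, the `except` branch could return an `EvaluationResult` with an `"error"` label, mirroring the error handling already used in `qa_test_evaluator`.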





## 6. Finding Experiment Issues in Phoenix

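The `run_experiment` function can fail when event loop conflicts arise. To isolate such failures, the cells below run a minimal experiment: a one-row dataset, a synchronous task that returns a fixed answer, and the evaluator from section 5: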

In [10]:
from phoenix.session.client import Client

# Initialize Phoenix client
px_client = Client(warn_if_server_not_running=True)

# Build a one-row test set
test_ground_truth = {
    "What is the amount of men in Prague at the end of Q3 2024?": "676069",
}
test_df = pd.DataFrame(test_ground_truth.items(), columns=["question", "context"])

# Create a simplified dataset just for testing
try:
    test_dataset = px_client.upload_dataset(
        dataset_name=f"test_dataset_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}",
        dataframe=test_df,
        input_keys=["question"],
        output_keys=["context"]
    )
    print(f"Created test dataset with {len(test_df)} records")
except Exception as e:
    print(f"Error creating dataset: {e}")

📤 Uploading dataset...
💾 Examples uploaded: https://app.phoenix.arize.com/datasets/RGF0YXNldDoyOQ==/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MzQ=
Created test dataset with 1 records


In [11]:
# Implement a simpler synchronous task function just for testing
def simple_task(input_data):
    """A simplified task function for testing"""
    question = input_data["question"]
    # Return a predefined result instead of calling main()
    return {
        "query": question,
        "results": f"Based on our data, the answer to '{question}' is 676069.",
        "error": None
    }

# Pass qa_test_evaluator to run_experiment directly; no create_evaluator wrapper is used here

try:
    from phoenix.experiments import run_experiment
    simple_experiment = run_experiment(
        dataset=test_dataset,
        task=simple_task,
        evaluators=[qa_test_evaluator],  # pass function directly
        experiment_name=f"simple_test_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}"
    )
    print("Experiment completed successfully!")
except Exception as e:
    print(f"Error running experiment: {e}")
    import traceback
    traceback.print_exc()

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/datasets/RGF0YXNldDoyOQ==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/datasets/RGF0YXNldDoyOQ==/compare?experimentId=RXhwZXJpbWVudDo3Mw==


running tasks |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.35s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s

Evaluation DataFrame:
                                               input reference  \
0  What is the amount of men in Prague at the end...    676069   

                                              output  
0  Based on our data, the answer to 'What is the ...  


llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:02<00:00 |  2.31s/it



Evaluation successful: correct (score=1)
Explanation: To determine if the answer is correct, we first examine the question, which asks for the amount of men in Prague at the end of Q3 2024. The reference text provides the number 676069. The answer states that the amount of men in Prague at the end of Q3 2024 is 676069, which matches the reference text exactly. Therefore, the answer correctly and fully addresses the question.


running experiment evaluations |██████████| 1/1 (100.0%) | ⏳ 00:03<00:00 |  3.79s/it


🔗 View this experiment: https://app.phoenix.arize.com/datasets/RGF0YXNldDoyOQ==/compare?experimentId=RXhwZXJpbWVudDo3Mw==

Experiment Summary (04/27/25 08:50 PM +0200)
--------------------------------------------
           evaluator  n  n_scores  avg_score  n_labels    top_2_labels
0  qa_test_evaluator  1         1        1.0         1  {'correct': 1}

Tasks Summary (04/27/25 08:50 PM +0200)
---------------------------------------
   n_examples  n_runs  n_errors
0           1       1         0
Experiment completed successfully!
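
On the event loop issues mentioned at the start of this section: a notebook kernel usually already runs an asyncio event loop, which is the situation `nest_asyncio.apply()` in the first cell is meant to handle. A small standard-library sketch for checking whether code is executing inside a running loop; the helper name `in_running_event_loop` is an assumption for illustration.

```python
import asyncio

def in_running_event_loop():
    """Return True if called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop raises RuntimeError when no loop is running.
        return False
    return True

async def check_inside_loop():
    # A coroutine body always executes inside a running loop.
    return in_running_event_loop()

print("Called from top-level code:", in_running_event_loop())
print("Called from inside a coroutine:", asyncio.run(check_inside_loop()))
```

In a plain Python script the top-level call reports `False`. In a notebook it will typically report `True`, and the `asyncio.run` call then relies on `nest_asyncio.apply()` having been executed first.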



