# Testing QA Evaluator Implementation

This notebook tests and debugs the QA evaluator implementation to confirm that it works correctly with Phoenix.

In [None]:
import os
import pandas as pd
import nest_asyncio
from phoenix.evals import OpenAIModel, QA_PROMPT_TEMPLATE, QA_PROMPT_RAILS_MAP, llm_classify

# Patch the notebook's event loop so llm_classify can submit requests asynchronously (faster)
nest_asyncio.apply()



## Test Data Setup

First, let's create some test data that follows the expected format for the QA template:
- `input`: The question being asked
- `reference`: The context/ground truth
- `output`: The generated answer to evaluate

In [2]:
# Sample test data
test_data = pd.DataFrame({
    "input": ["What is the amount of men in Prague at the end of Q3 2024?"],
    "reference": ["676069"],
    "output": ["Based on the data, the number of men in Prague at the end of Q3 2024 is 676,069."]
})

test_data

                                               input  reference                                             output
0  What is the amount of men in Prague at the end...     676069  Based on the data, the number of men in Prague...


## Configure Model

Now set up the Azure OpenAI model. The endpoint and API key are read from the `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` environment variables:

In [3]:
# Configure the model
model = OpenAIModel(
    model="gpt-4o__test1",  # Azure deployment name
    api_version="2024-05-01-preview",
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    temperature=0.0
)
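
Because `os.getenv` returns `None` for an unset variable, a missing credential tends to surface only later, as an authentication or connection error. A minimal sketch of a pre-flight check, assuming the two variable names used above; `missing_env_vars` is a hypothetical helper, not part of Phoenix:

```python
import os

# Variable names taken from the model configuration above.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a dictionary as `environ` makes the check easy to exercise without touching the real environment.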

## Run Evaluation

Run the evaluation using Phoenix's `llm_classify` function:

In [4]:
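# Run the evaluation
rails = list(QA_PROMPT_RAILS_MAP.values())  # labels the model is allowed to return
results = None  # stays None if the call fails, so later cells do not hit a NameError
try:
    results = llm_classify(
        data=test_data,
        template=QA_PROMPT_TEMPLATE,
        model=model,
        rails=rails,
        provide_explanation=True,  # also return the model's reasoning
    )
except Exception as e:
    print(f"Error during evaluation: {e}")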

llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:02<00:00 |  2.63s/it


In [5]:
print(results)

     label                                        explanation exceptions  \
0  correct  To determine if the answer is correct, we comp...         []   

  execution_status  execution_seconds  
0        COMPLETED            1.62129  
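
The `try`/`except` in the evaluation cell only reports errors raised by the call itself. The `exceptions` and `execution_status` columns in the output suggest that per-row problems are recorded in the results instead, so it may be worth scanning them. A minimal sketch, assuming `COMPLETED` is the only success status (it is the only value seen in this run); `incomplete_rows` is a hypothetical helper that works on plain row dictionaries, for example from `results.to_dict("records")`:

```python
def incomplete_rows(records):
    """Return (index, status, exceptions) for rows that did not complete cleanly."""
    problems = []
    for i, row in enumerate(records):
        status = row.get("execution_status")
        errors = row.get("exceptions") or []
        # Flag the row if the status is not the assumed success value
        # or if any exceptions were recorded for it.
        if status != "COMPLETED" or errors:
            problems.append((i, status, list(errors)))
    return problems

# Row copied from the output above: it completed without exceptions.
sample = [{"label": "correct", "exceptions": [], "execution_status": "COMPLETED"}]
print(incomplete_rows(sample))
```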


## Debugging Common Issues

If you encounter errors, check these common issues:

1. **Template Variable Names**: Ensure DataFrame columns match the template variables exactly
2. **API Connectivity**: Verify Azure OpenAI endpoint and credentials are correct
3. **Model Availability**: Check if the specified model/deployment exists in your Azure resource

In [6]:
# Print the QA template to verify expected variables
print(QA_PROMPT_TEMPLATE)


You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.
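
The placeholders in the printed template (`{input}`, `{reference}`, `{output}`) can also be compared against the DataFrame columns programmatically rather than by eye. A minimal sketch using only the standard library, assuming the template text uses plain `str.format`-style placeholders as shown above; `template_variables` and `missing_columns` are hypothetical helpers, not Phoenix functions:

```python
from string import Formatter

def template_variables(template_text):
    """Return the set of {placeholder} names found in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template_text) if field}

def missing_columns(template_text, columns):
    """Return template variables that have no matching column, sorted by name."""
    return sorted(template_variables(template_text) - set(columns))

# Abbreviated copy of the data section of the template printed above.
template_text = "[Question]: {input}\n[Reference]: {reference}\n[Answer]: {output}"

# Columns that match the template, then a set with a misnamed column.
print(missing_columns(template_text, ["input", "reference", "output"]))
print(missing_columns(template_text, ["question", "reference", "output"]))
```

In the notebook itself the call would be `missing_columns(str(QA_PROMPT_TEMPLATE), test_data.columns)`; an empty list means every template variable has a matching column.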

