# Verify Code-first Setup

The [Azure AI Evaluation client library](https://pypi.org/project/azure-ai-evaluation/) helps you assess the performance of your generative AI applications. It lets you:

1. **Evaluate data results** from your generative AI applications
1. **Evaluate app targets** directly to assess live responses
1. **Evaluate using AI-assisted metrics** for quality and safety

It does this with three core features:

1. **Simulators** - to help you generate synthetic datasets for evaluation
1. **Built-in Evaluators** - to help you evaluate the performance of your generative AI applications
1. **evaluate()** - API to help you run evaluations in bulk mode, using multiple evaluators and datasets

_In this notebook, we will verify the code-first setup of the Azure AI Evaluation client library._

---

## 1. Verify Installed Packages

In [None]:
# Check that you have the azure-ai libraries installed
!pip list | grep azure

In [None]:
# Check that you have the latest openai libraries installed
!pip list | grep openai

In [None]:
# Check that you are logged into Azure, for generating keyless auth credentials
!az ad signed-in-user show

In [None]:

import pandas as pd
import os
import json
from pprint import pprint

In [None]:
# Generate a default credential
# You must be logged into Azure first (az login --use-device-code)

from azure.identity import DefaultAzureCredential
credential=DefaultAzureCredential()
pprint(credential)


In [None]:
# Generate the azure-ai-project object

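# Project Connection String
connection_string = os.environ.get("AZURE_AI_CONNECTION_STRING")
if not connection_string:
    raise ValueError("Set the AZURE_AI_CONNECTION_STRING environment variable first")

# Extract details (four fields separated by semicolons)
region_id, subscription_id, resource_group_name, project_name = connection_string.split(";")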

# Populate it
azure_ai_project = {
    "subscription_id": subscription_id,
    "resource_group_name": resource_group_name,
    "project_name": project_name,
}
pprint(azure_ai_project)
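The split above assumes the connection string holds exactly four semicolon-separated fields. A minimal sketch of a more defensive parser is shown below. `parse_connection_string` is a hypothetical helper (not part of any Azure SDK), and the sample value is a placeholder rather than a real connection string.

```python
def parse_connection_string(connection_string: str) -> dict:
    """Split a 'region;subscription;resource group;project' string into named fields."""
    parts = [part.strip() for part in connection_string.split(";")]
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"Expected 4 non-empty fields separated by ';', got {len(parts)}"
        )
    # The region field is not needed for the project dictionary
    _region_id, subscription_id, resource_group_name, project_name = parts
    return {
        "subscription_id": subscription_id,
        "resource_group_name": resource_group_name,
        "project_name": project_name,
    }


# Placeholder values, for illustration only
example_project = parse_connection_string("my-region;my-subscription;my-rg;my-project")
print(example_project)
```

Failing early with a clear message tends to be easier to debug than an unpacking error raised by a malformed environment variable.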

---

## 2. Try an NLP Evaluator

In [None]:
from azure.ai.evaluation import BleuScoreEvaluator

# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

pprint(result)
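To build some intuition for the score, note that BLEU is based on clipped n-gram precision between the response and the reference. The sketch below is a simplified illustration in plain Python, not the library's implementation; the full metric also combines several n-gram orders and applies a brevity penalty. The helper names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only word characters."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(response, ground_truth, n):
    """Fraction of response n-grams that also occur in the reference (counts clipped)."""
    response_counts = Counter(ngrams(tokenize(response), n))
    reference_counts = Counter(ngrams(tokenize(ground_truth), n))
    total = sum(response_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, reference_counts[gram]) for gram, count in response_counts.items()
    )
    return matched / total


response = "Tokyo is the capital of Japan."
ground_truth = "The capital of Japan is Tokyo."
unigram_precision = clipped_precision(response, ground_truth, 1)
bigram_precision = clipped_precision(response, ground_truth, 2)
print(unigram_precision, bigram_precision)
```

Every word of the response appears in the reference, so the unigram precision is 1.0, but only three of the five bigrams match (0.6). This is why reordered sentences score below a perfect match.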

---

## 3. Try an AI-Assisted Quality Evaluator

In [None]:
from azure.ai.evaluation import RelevanceEvaluator

# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_EVAL_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

pprint(result)

---

## 4. Try an AI-Assisted Safety Evaluator



In [None]:
from azure.ai.evaluation import ViolenceEvaluator

# AI assisted safety evaluator
violence_evaluator = ViolenceEvaluator(azure_ai_project=azure_ai_project, credential=credential)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
pprint(result)


---

## 5. Try a Custom Evaluator

In [None]:
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, answer: str, **kwargs):
        contains_block_word = any(word in answer for word in self._blocklist)
        return {"score": contains_block_word}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

# Test custom evaluator 1
result = response_length("The capital of Japan is Tokyo.")
print(result)

# Test custom evaluator 2
result = blocklist_evaluator(answer="The capital of Japan is Tokyo.")
print(result)

# Test custom evaluator 3
result = blocklist_evaluator(answer="This is a bad idea.")
print(result)
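The `in` check in `BlocklistEvaluator` matches substrings, so "bad" would also match inside "badge". If whole-word matching is what you need, one possible variant is sketched below. `WholeWordBlocklistEvaluator` is a hypothetical name introduced for illustration, and the tokenization rule (letters and apostrophes) is an assumption.

```python
import re


class WholeWordBlocklistEvaluator:
    """Flags an answer only when a blocked word appears as a whole word."""

    def __init__(self, blocklist):
        self._blocklist = {word.lower() for word in blocklist}

    def __call__(self, *, answer: str, **kwargs):
        # Split the answer into lowercase words before comparing
        words = set(re.findall(r"[a-z']+", answer.lower()))
        return {"score": bool(words & self._blocklist)}


whole_word_evaluator = WholeWordBlocklistEvaluator(blocklist=["bad", "worst", "terrible"])
flagged = whole_word_evaluator(answer="This is a bad idea.")
not_flagged = whole_word_evaluator(answer="She wore her badge proudly.")
print(flagged, not_flagged)
```

The first call is flagged and the second is not, whereas the substring version would flag both.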

---

## 6. Try a Composite Evaluator

In [None]:

from azure.ai.evaluation import evaluate, QAEvaluator

qa_evaluator = QAEvaluator(model_config=model_config)

eval_output = evaluate(
    data="data.jsonl",
    evaluators={"QAEvaluator": qa_evaluator},
    evaluation_name="06-using-composite-evaluator",
    evaluator_config={
        "QAEvaluator": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "context": "${data.ground_truth}",
                "ground_truth": "${data.ground_truth}",
            }
        }
    },
    
    # Optionally provide your Azure AI Foundry project information to track evaluation results in that project
    azure_ai_project=azure_ai_project,

    # Optionally provide an output path for a JSON file with the metric summary, row-level data, and the AI Foundry URL
    output_path="./data_composite_results.json"
)
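The `column_mapping` above tells `evaluate()` which dataset column feeds each evaluator input. The sketch below illustrates the idea for `${data.<column>}` templates only. It is a conceptual stand-in, not the library's implementation, and `resolve_column_mapping` is a hypothetical helper.

```python
import re

# Matches templates of the form ${data.<column>}
_DATA_TEMPLATE = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_column_mapping(column_mapping: dict, row: dict) -> dict:
    """Build evaluator keyword arguments from one dataset row."""
    resolved = {}
    for evaluator_input, template in column_mapping.items():
        match = _DATA_TEMPLATE.match(template)
        if match is None:
            raise ValueError(f"Unsupported template: {template!r}")
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Row has no column {column!r}")
        resolved[evaluator_input] = row[column]
    return resolved


mapping = {
    "query": "${data.query}",
    "response": "${data.response}",
    "context": "${data.ground_truth}",
    "ground_truth": "${data.ground_truth}",
}
sample_row = {
    "query": "What is the capital of Japan?",
    "response": "Tokyo is the capital of Japan.",
    "ground_truth": "The capital of Japan is Tokyo.",
}
evaluator_kwargs = resolve_column_mapping(mapping, sample_row)
print(evaluator_kwargs)
```

Note that `context` and `ground_truth` resolve to the same column here, mirroring the mapping used in the cell above for a dataset that has no separate context column.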

---

## 7. Try the Simulator

Simulators let you generate synthetic data using your application. The simulator expects a callback method that invokes your AI application; that callback is where your application and the simulator integrate.
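Before wiring in a real model, it can help to exercise the callback contract offline. The sketch below assumes the simulator passes a dictionary with a `messages` list and expects the same shape back, as in the callback used in this notebook. `echo_application` and `stub_callback` are hypothetical stand-ins, not part of the library.

```python
import asyncio
from typing import Any, Dict, List, Optional


def echo_application(query: str) -> str:
    """Hypothetical stand-in for your AI application."""
    return f"You asked: {query}"


async def stub_callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    # Take the latest message as the query for the application
    query = messages["messages"][-1]["content"]
    response = echo_application(query)
    # Append the reply in chat-message form
    messages["messages"].append(
        {"content": response, "role": "assistant", "context": {"citations": None}}
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


conversation = {"messages": [{"role": "user", "content": "Who was Leonardo da Vinci?"}]}
stub_result = asyncio.run(stub_callback(conversation))
print(stub_result["messages"][-1]["content"])
```

Inside a notebook, where an event loop is already running, use `await stub_callback(conversation)` instead of `asyncio.run(...)`.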

In [None]:
from azure.ai.evaluation.simulator import Simulator

simulator = Simulator(model_config=model_config)

In [None]:
# This is my "application" for generating simulated responses from input text

from typing import List, Dict, Any, Optional
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# This is the application chat_completion endpoint
def call_to_your_ai_application(query: str) -> str:
    # logic to call your application
    # use a try except block to catch any errors
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), 
        "https://cognitiveservices.azure.com/.default"
    )

    deployment = os.environ.get("AZURE_OPENAI_EVAL_DEPLOYMENT")
    endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
        azure_ad_token_provider=token_provider,
    )
    completion = client.chat.completions.create(
        model=deployment,
        messages=[
            {
                "role": "user",
                "content": query,
            }
        ],
        max_tokens=800,
        temperature=0.7,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
    )
    message = completion.to_dict()["choices"][0]["message"]
    # change this to return the response from your application
    return message["content"]

# This is the callback function that is called by the Simulator
# It takes the messages from the simulator and calls your application
async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = None
    # call your endpoint or ai application here
    response = call_to_your_ai_application(query)
    # we are formatting the response to follow the OpenAI chat protocol format
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": {
            "citations": None,
        },
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}

In [None]:
# In this example we use a wikipedia article as raw text to generate Query Response pairs. 
import wikipedia

wiki_search_term = "Leonardo da vinci"
wiki_title = wikipedia.search(wiki_search_term)[0]
wiki_page = wikipedia.page(wiki_title)
text = wiki_page.summary[:5000]

In [None]:
# Call the Simulator
outputs = await simulator(
    target=callback,
    text=text,
    num_queries=4,
    max_conversation_turns=3,
    tasks=[
        f"I am a student and I want to learn more about {wiki_search_term}",
        f"I am a teacher and I want to teach my students about {wiki_search_term}",
        f"I am a researcher and I want to do a detailed research on {wiki_search_term}",
        f"I am a statistician and I want to do a detailed table of factual data concerning {wiki_search_term}",
    ],
)

In [None]:
# Save the output to a file
from pathlib import Path

output_file = Path("data_simulation.json")
# Open in write mode so that reruns do not append a second JSON document to the file
with output_file.open("w") as f:
    json.dump(outputs, f)

In [None]:
# Now you can run evaluations on the "simulated" dataset
# Here we will try to run the following evaluators:
#   GroundednessEvaluator, 
#   RelevanceEvaluator, 
#   CoherenceEvaluator, 
#   FluencyEvaluator, 
#   SimilarityEvaluator, 
#   F1ScoreEvaluator 
#
# From the documentation we know that running those evaluators needs 
# the following data:
# { query, response, context, ground_truth }
#
# For simplicity's sake, we can use our source document text as both 
# context and ground_truth. This step evaluates only the last user message 
# and the last response from your AI application in each of the simulated 
# conversations. **LET'S CREATE THE JSONL**

In [None]:
# Write the data to a variable in json format
data_simulation_jsonl = ""
for output in outputs:
    query = None
    response = None
    context = text
    ground_truth = text
    for message in output["messages"]:
        if message["role"] == "user":
            query = message["content"]
        if message["role"] == "assistant":
            response = message["content"]
    if query and response:
        data_simulation_jsonl += (
            json.dumps(
                {
                    "query": query,
                    "response": response,
                    "context": context,
                    "ground_truth": ground_truth,
                }
            )
            + "\n"
        )

In [None]:
# Store that to a JSONL file format for evaluations
data_simulation_jsonl_file = Path("data_simulation.jsonl")
with data_simulation_jsonl_file.open("w") as f:
    f.write(data_simulation_jsonl)
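Before passing the file to `evaluate()`, it may be worth checking that every line parses as JSON and carries the columns the evaluators need. A minimal sketch follows. `check_jsonl` is a hypothetical helper, and the demo row is a placeholder written to a temporary file rather than real simulator output.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"query", "response", "context", "ground_truth"}


def check_jsonl(path: Path, required_keys=REQUIRED_KEYS) -> int:
    """Return the number of valid rows; raise ValueError on the first bad line."""
    count = 0
    with path.open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)  # raises a ValueError subclass on invalid JSON
            missing = required_keys - set(row)
            if missing:
                raise ValueError(f"Line {line_number} is missing {sorted(missing)}")
            count += 1
    return count


# Demo on a temporary file with one placeholder row
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.jsonl"
    demo_row = {
        "query": "placeholder query",
        "response": "placeholder response",
        "context": "placeholder context",
        "ground_truth": "placeholder ground truth",
    }
    demo_file.write_text(json.dumps(demo_row) + "\n", encoding="utf-8")
    valid_rows = check_jsonl(demo_file)

print(valid_rows)
```

In this notebook the equivalent call would be `check_jsonl(data_simulation_jsonl_file)`.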

In [None]:
# Now run the evaluation using this JSONL with a QAEvaluator 
# This is a composite evaluator that conveniently runs 
# GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, F1ScoreEvaluator

# Optionally set the azure_ai_project to upload the evaluation results to 
# Azure AI Foundry

from azure.ai.evaluation import evaluate, QAEvaluator

qa_evaluator = QAEvaluator(model_config=model_config)

eval_output = evaluate(
    data=str(data_simulation_jsonl_file),
    evaluators={"QAEvaluator": qa_evaluator},
    evaluation_name="07-using-simulator-data",
    evaluator_config={
        "QAEvaluator": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "context": "${data.context}",
                "ground_truth": "${data.ground_truth}",
            }
        }
    },
    azure_ai_project=azure_ai_project,  # optional to store the evaluation results in Azure AI Foundry
    output_path="./data_simulation_eval_results.json",  # optional to store the evaluation results in a file
)

---

## 8. Try the `evaluate()` API

### 8.1 Generate the data in JSONL format

I used this prompt with Copilot:

```bash title="" linenums="0"
create a JSONL file called data.jsonl in the notebooks folder. 
Make sure it has 5 lines - each has {query, ground_truth, response} properties - where the query can be related to camping or hiking equipment
```

### 8.2 Run the evaluation

In [None]:
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import RelevanceEvaluator

# provide your data here (no trailing comma, which would turn this into a tuple)
data = "data.jsonl"

# configure your quality evaluators here
relevance_evaluator = RelevanceEvaluator(model_config)

result = evaluate(
    data=data,  # the JSONL file defined above
    evaluators={
        #"blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    evaluation_name="08-using-evaluate-api",
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.query}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        }
    },
    # Optionally provide your Azure AI Foundry project information to track evaluation results in that project
    azure_ai_project=azure_ai_project,

    # Optionally provide an output path for a JSON file with the metric summary, row-level data, and the AI Foundry URL
    output_path="./data_evaluation_results.json"
)

### 8.3 Run evaluation on app

In [None]:
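# Check that you have the BeautifulSoup library installed (the package is named beautifulsoup4)
!pip list | grep -i -E "beautifulsoup4|bs4"

In [None]:
# Check that you have the Jinja2 library installed (pip lists it as "Jinja2", so match case-insensitively)
!pip list | grep -i jinja2

In [None]:
from askwiki import ask_wiki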

result = evaluate(
    data="askwiki.jsonl",
    target=ask_wiki,
    evaluators={
        "relevance": relevance_evaluator
    },
    evaluation_name="08-using-app-target",
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        }
    }
)