# Why LLM Evals?
Traditional software testing isn't applicable on LLM's because their output isn't deterministic. BUt you still need a way to judge the quality of your LLM outputs.

When you adjust your prompts or user input's different data, how will you know whether your application has improved and by how much using?

Well, say hello to LLM Evals

Traditionally Machine Learning output was evaluated on certain metrics depending on area of application like NLP, Ranking Alogithms, Regression Tasks etc. Some common metrics are-

1.   Accuracy
2.   Precision
3.   Recall
4. Mean Absolute Error (MAE)
5. Mean Squared Error (MSE)
6. BLEU (Bilingual Evaluation Understudy)
7. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
8. Precision@k
9. Mean Reciprocal Rank (MRR)



# Let's see some of them in action

In [None]:
# !pip install evaluate
# !pip install nltk
# !pip install --upgrade boto3 botocore s3fs aiobotocore
# !pip install deepeval

In [None]:
import warnings
warnings.filterwarnings('ignore')

Let's compare the two strings to see how similar they are, let's use BLEU as a metric

Task: We have a story of fox jumping over the dog and let's say the AI system was aksed to summarize it.
To evalaute it, we will ask human also to write a summary and we will comapre the two using some metrics

In [None]:
import evaluate
from pprint import pprint

bleu = evaluate.load('bleu')

predictions = ["A quick brown fox jumps over a lazy dog"]
human_response = ["A quick brown fox jumps over a dog"]

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=human_response)
pprint(results)

You can use multiple metrics for the same task and get more indicators of perfromance of your model. Let's also calculate METEOR score for a comparing AI generated summary with a human expected summary

In [None]:

meteor = evaluate.load("meteor")

predictions = ["A quick brown fox jumps over a lazy dog"]
human_response = [["A quick brown fox jumps over a dog"]]

results = meteor.compute(predictions=predictions, references=human_response)
pprint(results)

## Is there a problem here?

1.   Need of labelled data

  1.   Labelled data is expensive to annotate, i.e, getting that human_response is hard
  2.   It is biased by human judgment and intelligence.
  3.   It's not scalable
  2.   It's still not accurate enough.

2.   Lack of semantic understanding.




### Another problem look:
**Task:**

In **question answering task** like this one below:

**Context:** Working at mines is not easy. Workers work all day. After a long day at work, the workers go to the their beds to rest for the night. And there is hardly anything else to do.

**Question:** What did the workers do after finishing their work?

**Human answer:** They went to their beds to sleep and rest for the night.

**AI generated answer:** They went to sleep.



Let's apply the BLEU on this example

In [None]:
ai_response = ["They went to their beds to sleep and rest for the night."]
human_response = [["They went to sleep"]]

results = bleu.compute(predictions=ai_response, references=human_response)
pprint(results)

Don't worry, let's use METEOR

In [None]:
ai_response = ["They went to their beds to sleep and rest for the night."]
human_response = [["Went to rest"]]

results = meteor.compute(predictions=ai_response, references=human_response)
print(results)

## Oh, that didn't work!

What's the solution here. Well, LLMs to the rescue ✈

Let's use LLMs to evaluate LLM outputs .



Python library to the rescue - [deepeval](https://github.com/confident-ai/deepeval)

In [None]:
import os
from deepeval.models import GPTModel
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv("../.env")

openai_eval = GPTModel(
    model="gpt-4.1-mini"
)


In [None]:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model=openai_eval)
test_case = LLMTestCase(
    input="What did the workers do after finishing their work?",
    # Replace this with the actual output from your LLM application
    actual_output="They went for rest.",
    expected_output="They went to their beds to sleep for the night",
    retrieval_context=["After a long day at work, the workers went to the their beds to rest for the night"]
)
evaluate([test_case], [answer_relevancy_metric])

## Pretty cool, right!

Let's take it up a notch. What if we didn't have the ground/golden truth/labelled output by human expert?

Good news, LLMs have world knowledge, they can act like human for us.

In [None]:
from pprint import pprint 
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model=openai_eval)
test_case = LLMTestCase(
    input="What did the workers do after finishing their work?",
    # Replace this with the actual output from your LLM application
    actual_output="They went for rest.",
    retrieval_context=["After a long day at work, the workers went to the their beds to rest for the night"]
)
result = evaluate([test_case], [answer_relevancy_metric])
pprint(result)

Discussion: What is pretty cool here? 👀









Let's see more cool things....

## Buzz word of the GenAI era - Hallucinations.

Can we test if model is hallucinating using Evals?

In [None]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual documents that you are passing as input to your LLM.
context=["A man with blond-hair, and a brown shirt drinking out of a public water fountain."]

# Replace this with the actual output from your LLM application
actual_output="A blond was drinking in public."

test_case = LLMTestCase(
    input="What was the blond doing?",
    actual_output=actual_output,
    context=context
)
metric = HallucinationMetric(threshold=0.5, model= openai_eval)

metric.measure(test_case)
pprint(metric.score)
pprint(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

In [None]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual documents that you are passing as input to your LLM.
context=["A man with blond-hair, and a brown shirt drinking out of a public water fountain."]

# Replace this with the actual output from your LLM application
actual_output="A blond drinking water in public."

test_case = LLMTestCase(
    input="What was the blond doing?",
    actual_output=actual_output,
    context=context
)
metric = HallucinationMetric(threshold=0.5, model= openai_eval)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

## Let's go one level deep and measure TOXICITY

In [None]:
from deepeval.metrics import ToxicityMetric

# Replace this with the actual documents that you are passing as input to your LLM.
input_content="A man with blond-hair, and a brown shirt drinking out of a public water fountain."

# Replace this with the actual output from your LLM application
actual_output="A blond drinking water in public."

toxicity_metric = ToxicityMetric(threshold=0.9, model=openai_eval)
test_case = LLMTestCase(
    input=input_content,
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)

toxicity_metric.measure(test_case)
print(toxicity_metric.score)
print(toxicity_metric.reason)

## Should always remember to be INCLUSIVE!

In [None]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

# for details on how to create more such metrics, refer - https://docs.confident-ai.com/docs/metrics-llm-evals
inclusivity_metric = GEval(
    name="Inclusivity",
    criteria="Determine whether the output uses only inclusive language based on the expected output.",
    evaluation_steps=[
        "Check whether there are any terms, phrases, or structures that could be considered exclusive, biased, or marginalizing based on gender, race, ethnicity, ability, age, or other identity factors.'",
        "If any such terms are found, the output is considered not inclusive.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model = openai_eval,
    threshold = 0.9,
    strict_mode = True
    # verbose_mode = True
)

test_case = LLMTestCase(
    # Input to LLM
    input="Create a course for tech leaders on how to write application for tech accelorator programs.",
    actual_output="Each applicant should ensure that he finishes the application before the deadline",

)

inclusivity_metric.measure(test_case)
print(f"Inclusivity score: {inclusivity_metric.score}")
print(f"Reason : {inclusivity_metric.reason}")