### LLM Evaluation with DeepEval
- More details can be found at https://github.com/confident-ai/deepeval
- Some examples of actual use cases can be found at https://github.com/xinyaohuu/llm-evaluation/tree/main

In [1]:
import os,sys
sys.path.insert(0,'../libs')
from dotenv import load_dotenv
env_path = '../../.env'
load_dotenv(dotenv_path=env_path)

True
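The `True` above only reports that `load_dotenv` set at least one variable, not that the keys needed by later cells are present. A minimal sketch of an explicit check follows. The variable name `OPENAI_API_KEY` is an assumption (the metrics below use `gpt-4o`, which normally requires it), and `missing_env_vars` is a hypothetical helper, not part of `dotenv` or DeepEval.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumption: the OpenAI-backed metrics read OPENAI_API_KEY.
required = ["OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {missing}")
```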


#### Most basic usage

In [9]:
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

In [6]:
test_case = LLMTestCase(
  input="What if these shoes don't fit?",
  # Replace this with the actual output of your LLM application
  actual_output="We offer a 30-day full refund at no extra cost.",
  # Replace this with the retrieval context (in the RAG pipeline) of your LLM application.
  # Only RAG metrics require retrieval_context when creating a test case.
  retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
)

To evaluate the test case, we first need to create an evaluation metric. In this example, we create an `AnswerRelevancyMetric`, which measures the answer relevancy of a RAG-based LLM application. Not all metrics are RAG metrics. For a full list of metrics and an explanation of each, visit [the metrics section in our docs](https://docs.confident-ai.com/docs/metrics-introduction).

In [8]:
from deepeval.metrics import AnswerRelevancyMetric
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.6, model="gpt-4o", include_reason=True)
# See https://docs.confident-ai.com/docs/metrics-answer-relevancy#how-is-it-calculated for more details on the threshold
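The linked page describes the score as the share of statements in the actual output that are relevant to the input, and the metric passes when that score reaches the threshold. A rough sketch of that arithmetic is shown here. `relevancy_score` and `passes` are hypothetical helpers for illustration, not DeepEval functions; treating every verdict other than "no" as relevant, and scoring an empty list as 1.0, are assumptions.

```python
def relevancy_score(verdicts):
    """Fraction of statements not judged irrelevant (hypothetical helper)."""
    if not verdicts:
        return 1.0  # Assumption: nothing to judge counts as fully relevant.
    relevant = sum(1 for v in verdicts if v["verdict"].strip().lower() != "no")
    return relevant / len(verdicts)

def passes(score, threshold):
    """A metric passes when its score is at least the threshold."""
    return score >= threshold

# Illustrative verdicts: three relevant statements out of four.
verdicts = [{"verdict": "yes"}, {"verdict": "no"}, {"verdict": "yes"}, {"verdict": "yes"}]
score = relevancy_score(verdicts)
print(score, passes(score, threshold=0.6))  # 0.75 True
```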

With a test case and metric ready, you can start using our `evaluate()` function to evaluate your LLM (application).

The `evaluate()` function accepts a list of test cases and a list of metrics. Under the hood, it evaluates each individual test case using the list of provided metrics. A test case only passes if all of its metrics pass. For more information, including how to use our Pytest integration for evaluation, visit the [evaluation section in our docs](https://docs.confident-ai.com/docs/evaluation-introduction).
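The pass rule can be pictured with a few lines of plain Python. This is only a sketch of the "all metrics must pass" logic, assuming each metric passes when its score is at least its threshold; `case_passes` is a hypothetical helper, not part of DeepEval.

```python
def case_passes(metric_results):
    """metric_results: list of (score, threshold) pairs for one test case."""
    return all(score >= threshold for score, threshold in metric_results)

print(case_passes([(1.0, 0.6), (0.9, 0.5)]))   # True: every metric clears its threshold
print(case_passes([(1.0, 0.6), (0.25, 0.6)]))  # False: one failing metric fails the case
```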

In [11]:
evaluate([test_case], [answer_relevancy_metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:02,  2.78s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.6, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the answer is completely relevant and directly addresses the concern about shoe fit without any irrelevant information. Great job!, error: None)

For test case:

  - input: What if these shoes don't fit?
  - actual output: We offer a 30-day full refund at no extra cost.
  - expected output: None
  - context: None
  - retrieval context: ['All customers are eligible for a 30 day full refund at no extra cost.']


Overall Metric Pass Rates

Answer Relevancy: 100.00% pass rate







EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Answer Relevancy', threshold=0.6, success=True, score=1.0, reason='The score is 1.00 because the answer is completely relevant and directly addresses the concern about shoe fit without any irrelevant information. Great job!', strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.0027575, verbose_logs='Statements:\n[\n    "We offer a 30-day full refund at no extra cost."\n] \n \nVerdicts:\n[\n    {\n        "verdict": "yes",\n        "reason": null\n    }\n]')], conversational=False, multimodal=False, input="What if these shoes don't fit?", actual_output='We offer a 30-day full refund at no extra cost.', expected_output=None, context=None, retrieval_context=['All customers are eligible for a 30 day full refund at no extra cost.'])], confident_link=None)
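The returned object nests metric data inside test results, so a summary like "Overall Metric Pass Rates" can be rebuilt by walking it. The sketch below does that on stand-in objects that carry only the attribute names visible in the repr above (`metrics_data`, `name`, `success`); `metric_pass_rates` is a hypothetical helper, and the second test case is invented for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace

def metric_pass_rates(test_results):
    """Map each metric name to the fraction of test cases where it succeeded."""
    outcomes = defaultdict(list)
    for result in test_results:
        for metric in result.metrics_data:
            outcomes[metric.name].append(metric.success)
    return {name: sum(flags) / len(flags) for name, flags in outcomes.items()}

# Stand-in objects, not DeepEval classes.
demo_results = [
    SimpleNamespace(name="test_case_0", metrics_data=[
        SimpleNamespace(name="Answer Relevancy", success=True, score=1.0)]),
    SimpleNamespace(name="test_case_1", metrics_data=[
        SimpleNamespace(name="Answer Relevancy", success=False, score=0.4)]),
]
print(metric_pass_rates(demo_results))  # {'Answer Relevancy': 0.5}
```

With a real run, the same function should work on the `test_results` attribute of the value returned by `evaluate(...)`, assuming that attribute is exposed as the repr suggests.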

In [12]:
## or you can also run the metric standalone
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)

Output()

1.0
The score is 1.00 because the response is perfectly relevant and directly addresses the concern about shoe fit without any irrelevant information. Great job!


#### Some other example metrics

**G-Eval** is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs based on any custom criteria. The G-Eval metric is the most versatile type of metric deepeval has to offer, and is capable of evaluating almost any use case with human-like accuracy.

In [13]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

In [14]:

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
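    # NOTE: provide either criteria or evaluation_steps, not both.
    # Both are set in this call, which contradicts the note; drop one of them in your own code.
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    # NOTE: the criteria and steps refer to the expected output, which the judge only sees
    # if LLMTestCaseParams.EXPECTED_OUTPUT is also listed here.
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],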
    threshold=0.6,
    model='gpt-4o',
    verbose_mode=False
)

In [15]:
test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)

Output()

0.24787772456933815
The actual output uses vague language and does not directly identify the cat as the one who ran up the tree, omitting this detail.
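The score above is not a round number, most likely because G-Eval, as described in the original paper, weights each candidate rating by the probability the judge model assigns to it instead of taking a single sampled rating. A minimal sketch of that weighted sum follows, assuming a 0 to 10 rating scale rescaled to [0, 1]. The probabilities are made up for illustration, and `weighted_geval_score` is a hypothetical helper, not DeepEval's implementation.

```python
def weighted_geval_score(rating_probs, max_rating=10):
    """Probability-weighted mean rating, rescaled to the range [0, 1]."""
    total = sum(rating_probs.values())
    expected = sum(rating * p for rating, p in rating_probs.items()) / total
    return expected / max_rating

# Illustrative judge probabilities for ratings 2, 3 and 4 (assumed values).
probs = {2: 0.6, 3: 0.3, 4: 0.1}
print(round(weighted_geval_score(probs), 3))  # 0.25
```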
