## <b><font color='darkblue'>Preface</font></b>
([source](https://www.datacamp.com/tutorial/deepeval)) <font size='3ptx'><b>Learn to use [DeepEval](https://deepeval.com/) to create Pytest-like relevance tests, evaluate LLM outputs with the G-Eval metric, and benchmark Qwen 2.5 on MMLU (Massive Multitask Language Understanding).</b></font>

Few evaluation frameworks in the machine learning ecosystem can effectively test large language models (LLMs) across a range of properties. <b>As LLM development accelerates, it has become crucial to assess model performance across tasks and to evaluate strengths, weaknesses, and limitations</b>.

<b><font size='3ptx'>[DeepEval](https://github.com/confident-ai/deepeval) steps in as a comprehensive and reliable solution to address this need</font></b>, offering a robust framework for testing LLMs on multiple dimensions, such as accuracy, reasoning, coherence, and ethical alignment.

<b>In this tutorial, you will learn how to set up [DeepEval](https://github.com/confident-ai/deepeval) and create a relevance test in the style of Pytest.</b> Then, you will test LLM outputs using the G-Eval metric and run the MMLU benchmark on the Qwen 2.5 model. This beginner-friendly tutorial is intended for anyone with a technical background who wants to understand the DeepEval ecosystem better.

### <b><font color='darkgreen'>Understanding DeepEval</font></b>
<font size='3ptx'><b>DeepEval is an open-source evaluation framework designed specifically for large language models, enabling developers to efficiently build, improve, test, and monitor LLM-based applications.</b> It simplifies the process of evaluating LLM outputs by allowing users to "unit test" their models, much like using Pytest for traditional software testing.</font>

<b>DeepEval offers a wide range of features for comprehensive evaluation, including 14 research-backed LLM evaluation metrics, synthetic dataset generation, LLM benchmarks, red teaming, and real-time evaluations in production</b>. With these features, DeepEval streamlines testing and helps improve the safety and reliability of LLM-powered applications.

Read the [**LLM Evaluation blog**](https://www.datacamp.com/blog/llm-evaluation) to learn how to evaluate large language models using key metrics, methodologies, and best practices.

### <b><font color='darkgreen'>Initialization & Importing</font></b>
We need to install some packages beforehand to start this tutorial:

In [1]:
#!pip install -U deepeval
#!pip install -U bitsandbytes

In [2]:
!pip freeze | grep -P '(deepeval|bitsandbytes)'

bitsandbytes==0.49.2
deepeval==3.8.4
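
As an alternative to `pip freeze`, you can check the installed versions from Python itself. The sketch below relies only on the standard-library `importlib.metadata`. The helper name `installed_versions` is hypothetical and used purely for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found[name] = None
    return found


print(installed_versions(["deepeval", "bitsandbytes"]))
```

A value of `None` means the package still needs to be installed before you continue.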


In [3]:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

## <b><font color='darkblue'>Getting Started with DeepEval</font></b>
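Define the following test function in a notebook cell. It follows the Pytest convention of a `test_`-prefixed function, so the same code could also be saved to a Python file (for example, with the `%%writefile` cell magic).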

In [5]:
def test_relevancy():
    relevancy_metric = AnswerRelevancyMetric(threshold=0.5, model="gemini-2.5")
    test_case_1 = LLMTestCase(
        input="Can I return these shoes after 30 days?",
        actual_output="Unfortunately, returns are only accepted within 30 days of purchase.",
        retrieval_context=[
            "All customers are eligible for a 30-day full refund at no extra cost.",
            "Returns are only accepted within 30 days of purchase.",
        ],
    )
    assert_test(test_case_1, [relevancy_metric])

In [7]:
# test_relevancy()

The following test result shows that the LLM test case passed and achieved a perfect relevancy score of 1.0, indicating that the model's response was completely relevant.
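
To make the relationship between the score and `threshold=0.5` concrete, here is a minimal sketch of the pass/fail logic. It assumes that the relevancy score is the fraction of statements in the output judged relevant to the input, and that a test passes when the score is at least the threshold. This is a simplification and not DeepEval's implementation: the real metric uses an LLM judge to produce the verdicts, and the names `RelevancyResult` and `relevancy_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RelevancyResult:
    score: float
    passed: bool


def relevancy_score(verdicts, threshold=0.5):
    """Score = relevant statements / total statements; pass when score >= threshold."""
    if not verdicts:
        raise ValueError("need at least one statement verdict")
    score = sum(1 for v in verdicts if v) / len(verdicts)
    return RelevancyResult(score=score, passed=score >= threshold)


# Hypothetical verdicts; in practice an LLM judge would produce these.
all_relevant = relevancy_score([True])
mostly_off_topic = relevancy_score([True, False, False])

print(all_relevant)
print(mostly_off_topic)
```

Under these assumptions, an output whose only statement is relevant scores 1.0 and passes, while an output with one relevant statement out of three falls below the 0.5 threshold and fails.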