# G-Eval 

G-Eval is a framework that uses an LLM with chain-of-thought (CoT) prompting to evaluate the output of an LLM against criteria that you define.

### Required Arguments

To use G-Eval, you need an LLMTestCase that contains an input and an actual_output. The example below also sets expected_output, because its criteria compare the actual output against it.

### Example 

In [1]:
import os
import sys

from deepeval.models.base_model import DeepEvalBaseLLM
from langchain_openai import AzureChatOpenAI

# Make the local api_key_config module importable before importing it
sys.path.append('/opt/project/src/evaluate_llm/')
from api_key_config import settings

# Export the Azure OpenAI settings as environment variables for the client
os.environ["OPENAI_API_VERSION"] = settings.OPENAI_API_VERSION
os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY
os.environ["AZURE_OPENAI_ENDPOINT"] = settings.AZURE_OPENAI_ENDPOINT

class AzureOpenAI(DeepEvalBaseLLM):
    """Adapter that lets deepeval call a LangChain Azure chat model."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace deployment_name with the name of your own Azure OpenAI deployment
custom_model = AzureChatOpenAI(
    deployment_name="gpt-35-turbo",
)
azure_openai = AzureOpenAI(model=custom_model)
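
The wrapper above only relies on the chat model exposing invoke and ainvoke, each returning an object with a content attribute. As a minimal offline sketch of that contract, the block below swaps in a hypothetical StubChatModel for AzureChatOpenAI and a hypothetical OfflineWrapper that mirrors the same method names. It deliberately does not subclass DeepEvalBaseLLM, so it runs without deepeval or network access and is not meant as a replacement for the real wrapper.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubMessage:
    """Hypothetical stand-in for the message object a chat model returns."""
    content: str


class StubChatModel:
    """Hypothetical offline replacement for AzureChatOpenAI (no network calls)."""

    def invoke(self, prompt: str) -> StubMessage:
        return StubMessage(content=f"echo: {prompt}")

    async def ainvoke(self, prompt: str) -> StubMessage:
        return self.invoke(prompt)


class OfflineWrapper:
    """Mirrors the method names of the AzureOpenAI wrapper (assumption: duck-typed only)."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.load_model().invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        res = await self.load_model().ainvoke(prompt)
        return res.content

    def get_model_name(self) -> str:
        return "Offline stub model"


wrapper = OfflineWrapper(StubChatModel())
sync_reply = wrapper.generate("ping")
async_reply = asyncio.run(wrapper.a_generate("ping"))
print(sync_reply, "|", async_reply)
```

asyncio.run works when this is executed as a plain script. Inside a notebook, where an event loop is already running, `await wrapper.a_generate("ping")` would be used instead.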



In [2]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
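    # EXPECTED_OUTPUT is included because the criteria and steps refer to it
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model=azure_openai,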
)

Instantiating the GEval class takes three mandatory and five optional parameters:

- name: the name of the metric.
- criteria: a description of the specific aspects to evaluate in each test case.
- evaluation_params: a list of LLMTestCaseParams values. Include only the parameters that are relevant to the evaluation.
- [Optional] evaluation_steps: a list of strings outlining the exact steps the LLM should take during evaluation. If evaluation_steps is not provided, GEval generates a series of steps on your behalf from the provided criteria. You can only provide either evaluation_steps OR criteria, not both.
- [Optional] threshold: the passing threshold. Defaults to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, or any custom LLM of type DeepEvalBaseLLM. Defaults to 'gpt-4o'.
- [Optional] strict_mode: a boolean that, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaults to False.
- [Optional] async_mode: a boolean that, when set to True, enables concurrent execution within the measure() method. Defaults to True.
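
The evaluation_steps description above implies a two-phase flow: steps are derived from the criteria when none are supplied, and the judge LLM then scores the test case from the fields named in evaluation_params. The sketch below illustrates that flow offline. Everything in it is hypothetical: run_g_eval_sketch, fake_llm, the prompt wording, and the 0 to 10 reply scale are assumptions made for illustration, not deepeval's actual prompts or parsing.

```python
from typing import Callable, List, Optional

prompts: List[str] = []  # records every prompt sent to the stub judge


def fake_llm(prompt: str) -> str:
    """Hypothetical canned judge so the sketch runs without an API key."""
    prompts.append(prompt)
    if prompt.startswith("List evaluation steps"):
        return "Compare the facts\nPenalize omissions"
    return "8"


def run_g_eval_sketch(
    llm: Callable[[str], str],
    criteria: str,
    test_case: dict,
    evaluation_params: List[str],
    evaluation_steps: Optional[List[str]] = None,
) -> float:
    """Illustrative two-phase flow; not deepeval's implementation."""
    # Phase 1: derive steps from the criteria only when none were supplied.
    if evaluation_steps is None:
        reply = llm(f"List evaluation steps, one per line, for: {criteria}")
        evaluation_steps = [ln.strip() for ln in reply.splitlines() if ln.strip()]

    # Phase 2: show the judge only the fields named in evaluation_params.
    fields = "\n".join(f"{name}: {test_case[name]}" for name in evaluation_params)
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(evaluation_steps, 1))
    reply = llm(
        f"Follow these steps:\n{steps}\n\n{fields}\n\nReply with a score from 0 to 10."
    )

    # Assumption: the judge replies with an integer 0-10, normalised here to 0-1.
    return int(reply.strip()) / 10


score = run_g_eval_sketch(
    fake_llm,
    criteria="Determine whether the actual output is factually correct.",
    test_case={
        "input": "The dog chased the cat up the tree, who ran up the tree?",
        "actual_output": "It depends, some might consider the cat.",
        "expected_output": "The cat.",
    },
    evaluation_params=["input", "actual_output"],
)
print(score)         # 0.8
print(len(prompts))  # 2: one call to generate steps, one to score
```

Passing evaluation_steps explicitly skips the first phase, so the stub judge is called only once.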

In [3]:
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)

0.8
The actual output includes vague language and contradicts the expected output.
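
With the default threshold of 0.5 from the parameter list, the score of 0.8 printed above would count as a pass. The block below is a minimal sketch of that pass/fail rule together with the strict_mode behaviour described earlier. The helper is_passing is hypothetical, and treating a score equal to the threshold as a pass is an assumption of this sketch rather than something the text above states.

```python
def is_passing(score: float, threshold: float = 0.5, strict_mode: bool = False) -> bool:
    """Pass/fail rule sketched from the parameter descriptions; not deepeval's code."""
    if strict_mode:
        # Strict mode: binary score (1 only for a perfect result), threshold forced to 1.
        score = 1.0 if score == 1.0 else 0.0
        threshold = 1.0
    # Assumption: a score equal to the threshold counts as a pass.
    return score >= threshold


print(is_passing(0.8))                    # True
print(is_passing(0.8, threshold=0.9))     # False
print(is_passing(0.8, strict_mode=True))  # False
print(is_passing(1.0, strict_mode=True))  # True
```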


Paper: https://arxiv.org/pdf/2303.16634