# G-Eval

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against **any** custom criteria. It is the most versatile metric offered by DeepEval.

The [paper](https://arxiv.org/abs/2303.16634) that introduced it reports that this approach aligns more closely with human judgement than earlier automatic metrics on tasks such as summarisation and dialogue generation.
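The paper also describes a scoring step in which each candidate rating is weighted by the probability the model assigns to it, rather than taking the single rating the model emits. The sketch below illustrates only that weighting. The function name `weighted_score`, the normalisation step and the example probabilities are assumptions made for illustration; none of them is part of DeepEval's API.

```python
def weighted_score(rating_probs: dict[int, float]) -> float:
    """Probability-weighted rating: sum of p(s_i) * s_i over candidate ratings."""
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Assumption: normalise in case the probabilities do not sum exactly to 1.
    return sum(rating * p for rating, p in rating_probs.items()) / total


# Made-up probabilities for ratings on a 1 to 5 scale.
example = {1: 0.0, 2: 0.05, 3: 0.15, 4: 0.5, 5: 0.3}
print(weighted_score(example))
```

The result is a continuous value rather than an integer, which lets two outputs that would both be rated "4" still be told apart.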

This notebook runs through an example using G-Eval from [DeepEval](https://docs.confident-ai.com/).

In [None]:
# First, import relevant modules.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase
from deepeval.models.base_model import DeepEvalBaseLLM

import os

import httpx
import requests

We open the same summaries and context data used in `LLM_as_a_Judge_Exaple.ipynb`.

The context data covers the history of the NHS and is taken from [this page](https://www.england.nhs.uk/nhsbirthday/about-the-nhs-birthday/nhs-history/) on the NHS England website.

Summary 1 is 100% grounded, whilst summary 2 contains an incorrect sentence not from the source material.

In [None]:
with open("example_documents/nhs_history.txt", "r") as file:
    data = file.read()
with open("example_documents/summary_1.txt", "r") as file:
    summary_1 = file.read()
with open("example_documents/summary_2.txt", "r") as file:
    summary_2 = file.read()

We then create a custom LLM class by subclassing DeepEval's `DeepEvalBaseLLM`. It calls an Azure OpenAI model we have deployed.

DeepEval's documentation on using a custom LLM for evaluation is found [here](https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm). It gives examples of using LLMs hosted in other locations.

In [None]:
class AzureOpenAI(DeepEvalBaseLLM):

    # Our initialisation function must contain a 'model' parameter.
    # We also pass the url and api key for our LLM endpoint.
    def __init__(self, model, endpoint_url, api_key):
        self.endpoint_url = endpoint_url
        self.api_key = api_key
        self.model = model

    # Returns the name of the model we are using
    def get_model_name(self):
        return "Azure OpenAI Model"
    
    # Returns the model we are using.
    def load_model(self):
        return self.model

    # Generates a response from the model given a prompt.
    def generate(self, prompt: str) -> str:

        headers = {
            "Content-Type": "application/json",
            "api-key": self.api_key,
        }

        payload = {
            "messages": [
                {
                    "role": "system",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        }
                    ]
                },
            ],
            "temperature": 0.4,
            "top_p": 0.95,
            "max_tokens": 800
        }

        try:
            response = requests.post(self.endpoint_url, headers=headers, json=payload)
            response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
        except requests.RequestException as e:
            raise SystemExit(f"Failed to make the request. Error: {e}")

        json_response = response.json()
        
        return json_response["choices"][0]["message"]["content"]
    
    # Asynchronously generates a response from the LLM using httpx.
    async def a_generate(self, prompt: str) -> str:

        headers = {
            "Content-Type": "application/json",
            "api-key": self.api_key,
        }

        payload = {
            "messages": [
                {
                    "role": "system",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        }
                    ]
                },
            ],
            "temperature": 0.4,
            "top_p": 0.95,
            "max_tokens": 800
        }

        # Open an asynchronous context for the HTTP client session
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(self.endpoint_url, headers=headers, json=payload)
                response.raise_for_status()
            # httpx.HTTPError also covers the HTTPStatusError raised by raise_for_status()
            except httpx.HTTPError as e:
                raise SystemExit(f"Failed to make the request. Error: {e}")

        json_response = response.json()
        
        return json_response["choices"][0]["message"]["content"]
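The two `generate` methods above build the same payload and parse the response in the same way. A minimal sketch of factoring that shared logic into helpers follows. `build_chat_payload` and `extract_content` are hypothetical names introduced here, not part of DeepEval, and the response shape is assumed to be the one the class already reads.

```python
from typing import Any


def build_chat_payload(prompt: str, temperature: float = 0.4,
                       top_p: float = 0.95, max_tokens: int = 800) -> dict[str, Any]:
    """Build the request body shared by generate and a_generate."""
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": prompt}]},
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def extract_content(json_response: dict[str, Any]) -> str:
    """Pull the generated text out of the response, with a clearer error if the shape is off."""
    try:
        return json_response["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as e:
        raise ValueError(f"Unexpected response shape: {e!r}") from e
```

With these in place, each method could reduce to building the payload, posting it, and returning `extract_content(response.json())`.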

Now that we have our class set up, we can call it and ensure it works correctly.

In [None]:
ENDPOINT = os.getenv("ENDPOINT_URL")
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")

model = AzureOpenAI(model="model", endpoint_url=ENDPOINT, api_key=API_KEY)
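If either environment variable is unset, `os.getenv` returns `None` and the problem only surfaces when the first request is made. A minimal sketch of failing early instead, where `require_env` is a hypothetical helper and not part of DeepEval:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could be used as `ENDPOINT = require_env("ENDPOINT_URL")` in place of the plain `os.getenv` call.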

In [None]:
model.generate("Tell me a fact")

Now let us define our G-Eval metric.

We test for groundedness. The metric needs a name, criteria (the rules the evaluator should follow), the test case parameters required for evaluation, and a model.

In [None]:
groundedness_metric = GEval(
    name="Groundedness",
    criteria="Determine whether each sentence in the actual output is grounded based on the context. For the actual output to be grounded, each sentence must have clear support within the context.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT],
    model=AzureOpenAI(model="model", endpoint_url=ENDPOINT, api_key=API_KEY)
)

Let's run a test with summary_1. We expect this to be highly grounded.

In [None]:
test_case = LLMTestCase(
    input="None",
    actual_output=summary_1,
    context=[data]
)

groundedness_metric.measure(test_case)

print(groundedness_metric.score)
print(groundedness_metric.reason)

Across 5 runs, summary 1 scored 1 every time, suggesting the response is very well grounded.

In the final run, it gave the following reasoning:

```
All sentences in the actual output are clearly supported by the context: the introduction of the Hib vaccine in 1992, laser surgery for twin-to-twin transfusion syndrome, the establishment of the NHS Organ Donor Register in 1994, the use of a vaccine against Group C meningococcal disease in 1999, and the introduction of NHS walk-in centres in 2000 are all mentioned in the context.
```

Now let us look at summary_2. Remember, this includes a sentence that is not grounded in the source.

In [None]:
test_case = LLMTestCase(
    input="None",
    actual_output=summary_2,
    context=[data]
)
groundedness_metric.measure(test_case)

print(groundedness_metric.score)
print(groundedness_metric.reason)

Across 5 runs, summary 2 scored 0.8, 0.9, 0.9, 0.9 and 0.8 for groundedness, lower than summary 1.

In the final run, it gave the following reasoning:

```
Most events in the actual output are supported by the context: the Hib vaccine in 1992, laser surgery for twin-to-twin transfusion syndrome in 1992, the NHS Organ Donor Register in 1994, the Group C meningococcal vaccine in 1999, and NHS walk-in centres in 2000. However, the massive tea party in 1993 for a knee transplant lacks context support
```

This is as expected: the metric confirms the supported events and flags the one sentence that has no support in the context.
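Because the score for summary 2 varies between runs, it helps to summarise the repeated runs rather than rely on a single one. The sketch below uses the five scores reported above for each summary. The `summarise` helper is a hypothetical name introduced for illustration.

```python
from statistics import mean

# Scores reported above for the five runs of each summary.
summary_1_scores = [1.0, 1.0, 1.0, 1.0, 1.0]
summary_2_scores = [0.8, 0.9, 0.9, 0.9, 0.8]


def summarise(scores: list[float]) -> dict[str, float]:
    """Return the mean, minimum and maximum of a list of scores."""
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


print("summary 1:", summarise(summary_1_scores))
print("summary 2:", summarise(summary_2_scores))
```

Every score for summary 2 falls below every score for summary 1, so the ordering of the two summaries does not depend on which run is inspected.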