# [Code for paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"](https://github.com/nlpyang/geval/tree/main)

## Example: Summary

In [1]:
from typing import Optional, List
from tqdm.notebook import tqdm
import os
import json
import time

from dotenv import load_dotenv
from openai import OpenAI  # >=1.0.0

In [2]:
class GEval:
    def __init__(
        self,
        prompt: str,
        api_key: Optional[str] = None,
        model: str = 'gpt-4-0613'
    ) -> None:
        self.prompt = prompt

        if api_key is None:
            load_dotenv()  # load environment variables from the .env file
            api_key = os.getenv('OPENAI_API_KEY')  # read the API key from the environment

        if api_key is None:
            raise ValueError("OpenAI API key is not provided. Please provide an api_key argument or set the 'OPENAI_API_KEY' in your .env file.")

        self.api_key = api_key
        self.client = OpenAI(api_key=self.api_key)
        self.model = model
    
    def evaluate(
        self,
        source: str,
        summ: str,
    ) -> List[str]:
        cur_prompt = self.prompt.replace('{{Document}}', source).replace('{{Summary}}', summ)
        
        ignore = 0
        responses: List[str] = []  # stays empty if the request fails with a non-rate-limit error
        while True:
            try:
                _response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[{"role": "system", "content": cur_prompt}],
                    temperature=2,
                    max_tokens=5,
                    top_p=1,
                    frequency_penalty=0,
                    presence_penalty=0,
                    stop=None,
                    n=20
                )
                time.sleep(0.5)
                responses = [_response.choices[i].message.content for i in
                                range(len(_response.choices))]
                break
            except Exception as e:
                print(e)
                if "limit" in str(e):
                    time.sleep(2)
                else:
                    ignore += 1
                    print('ignored', ignore)
                    break

        print('ignored total', ignore)
        return responses
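
The retry loop in `evaluate` retries indefinitely whenever the error message contains "limit". A bounded exponential backoff is one possible alternative. The sketch below is a minimal illustration under stated assumptions: the helper name `call_with_backoff` and its parameters are hypothetical and not part of the G-Eval repository or the OpenAI client, and the "limit" substring check simply mirrors the heuristic used in the class above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying rate-limit-like errors with exponential backoff.

    Hypothetical helper: an error counts as a rate limit if its message
    contains "limit", the same heuristic as in GEval.evaluate.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            is_rate_limit = "limit" in str(exc)
            is_last_attempt = attempt == max_retries - 1
            if not is_rate_limit or is_last_attempt:
                raise  # give up: unrelated error, or retries exhausted
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise ValueError("max_retries must be at least 1")
```

Inside `evaluate`, the API call could then be wrapped as `call_with_backoff(lambda: self.client.chat.completions.create(...))`, leaving the caller to decide how to handle an error that survives all retries.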

In [3]:
prompt = """You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.


Example:


Source Text:

{{Document}}

Summary:

{{Summary}}


Evaluation Form (scores ONLY):

- Coherence:"""

g_eval = GEval(prompt=prompt)

summeval_fp = 'data/summeval.json'
with open(summeval_fp) as f:
    summeval = json.load(f)

print("Total number of summeval entries:", len(summeval))
print("Top 3 summeval entries:")
for entry in summeval[:3]:
    print(entry)

Total number of summeval entries: 1600
Top 3 summeval entries:
{'doc_id': 'dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2', 'system_id': 'M11', 'source': "Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team 's 0-0 draw with Burnley on Sunday . 'Just been watching the game , did you miss the coach ? # RubberDub # 7minutes , ' Merson put on Twitter . Merson initially angered Townsend for writing in his Sky Sports column that 'if Andros Townsend can get in ( the England team ) then it opens it up to anybody . ' Paul Merson had another dig at Andros Townsend after his appearance for Tottenham against Burnley Townsend was brought on in the 83rd minute for Tottenham as they drew 0-0 against Burnley Andros Townsend scores England 's equaliser in their 1-1 friendly draw with Italy in Turin on Tuesday night The former Arsenal man was proven wrong when Townsend hit a stunning equaliser for England ag

In [4]:
new_json = []
for instance in tqdm(summeval[:3]):
    # Sample n=20 responses for this (source, summary) pair.
    all_responses = g_eval.evaluate(source=instance['source'], summ=instance['system_output'])
    instance['all_responses'] = all_responses
    new_json.append(instance)  # append once, after the responses are attached
    
new_json

ignored total 0
ignored total 0
ignored total 0


[{'doc_id': 'dm-test-8764fb95bfad8ee849274873a92fb8d6b400eee2',
  'system_id': 'M11',
  'source': "Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team 's 0-0 draw with Burnley on Sunday . 'Just been watching the game , did you miss the coach ? # RubberDub # 7minutes , ' Merson put on Twitter . Merson initially angered Townsend for writing in his Sky Sports column that 'if Andros Townsend can get in ( the England team ) then it opens it up to anybody . ' Paul Merson had another dig at Andros Townsend after his appearance for Tottenham against Burnley Townsend was brought on in the 83rd minute for Tottenham as they drew 0-0 against Burnley Andros Townsend scores England 's equaliser in their 1-1 friendly draw with Italy in Turin on Tuesday night The former Arsenal man was proven wrong when Townsend hit a stunning equaliser for England against Italy and he duly admitted his mistake . 'It 's not 
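
The cell above stores the 20 raw responses per summary but does not turn them into a single score. One way to do so, sketched here under assumptions, is to parse a number from each response and average the valid ones. Averaging sampled scores approximates a probability-weighted score, which is my reading of why the request uses `n=20`. The function names `parse_score` and `aggregate_scores` are hypothetical, and the regular expression is an illustrative choice, not the parser from the G-Eval repository. Because the request uses `temperature=2`, some responses may not contain a usable number, so unparseable or out-of-range values are dropped.

```python
import re
from typing import List, Optional


def parse_score(response: Optional[str]) -> Optional[float]:
    """Return the first number in the response if it lies in [1, 5], else None."""
    if not response:
        return None
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return None
    score = float(match.group())
    return score if 1.0 <= score <= 5.0 else None


def aggregate_scores(responses: List[Optional[str]]) -> Optional[float]:
    """Average all parseable scores; return None if none could be parsed."""
    scores = [s for s in (parse_score(r) for r in responses) if s is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

With the notebook's variables, this could be applied as `instance['score'] = aggregate_scores(instance['all_responses'])` for each entry in `new_json`.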

# [DeepEval](https://github.com/confident-ai/deepeval?tab=readme-ov-file)

In [17]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval

dataset = EvaluationDataset(
    test_cases=[LLMTestCase(input=instance['source'], actual_output=instance['system_output']) for instance in summeval[:3]]
)

coherence_metric = GEval(
    name="Coherence",
    criteria='''Coherence - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."''',
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    evaluation_steps=[
        "1. Read the news article carefully and identify the main topic and key points.",
        "2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.",
        "3. Assign a score for coherence."
    ],
    model='gpt-4-0613'
)
evaluate(dataset, [coherence_metric])

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...


Metrics Summary

  - ❌ Coherence (GEval) (score: 0.2773726230323355, threshold: 0.5, strict: False, evaluation model: gpt-4-0613, reason: The output summary is confusing and does not accurately represent the main points of the article. Key details, such as the conflict between Paul Merson and Andros Townsend and Townsend's performance for England, are missing or misrepresented., error: None)

For test case:

  - input: Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team 's 0-0 draw with Burnley on Sunday . 'Just been watching the game , did you miss the coach ? # RubberDub # 7minutes , ' Merson put on Twitter . Merson initially angered Townsend for writing in his Sky Sports column that 'if Andros Townsend can get in ( the England team ) then it opens it up to anybody . ' Paul Merson had another dig at Andros Townsend after his appearance for Tottenham against Burnley Townsend was brought on



[TestResult(success=False, metrics=[<deepeval.metrics.g_eval.g_eval.GEval object at 0x7fb2f85a96a0>], input="Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team 's 0-0 draw with Burnley on Sunday . 'Just been watching the game , did you miss the coach ? # RubberDub # 7minutes , ' Merson put on Twitter . Merson initially angered Townsend for writing in his Sky Sports column that 'if Andros Townsend can get in ( the England team ) then it opens it up to anybody . ' Paul Merson had another dig at Andros Townsend after his appearance for Tottenham against Burnley Townsend was brought on in the 83rd minute for Tottenham as they drew 0-0 against Burnley Andros Townsend scores England 's equaliser in their 1-1 friendly draw with Italy in Turin on Tuesday night The former Arsenal man was proven wrong when Townsend hit a stunning equaliser for England against Italy and he duly admitted his mistake . '