# Summarization

Uses an LLM to judge whether the output of your LLM application is a factually correct summary that also includes the necessary details from the original text.

Key components of a SummarizationMetric test case:
- input
- actual_output

## Example

In [1]:
import os
import sys

from deepeval.models.base_model import DeepEvalBaseLLM
from langchain_openai import AzureChatOpenAI

# Make the project-local settings module importable
sys.path.append('/opt/project/src/evaluate_llm/')
from api_key_config import settings

# AzureChatOpenAI reads its credentials and endpoint from these environment variables
os.environ["OPENAI_API_VERSION"] = settings.OPENAI_API_VERSION
os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY
os.environ["AZURE_OPENAI_ENDPOINT"] = settings.AZURE_OPENAI_ENDPOINT

class AzureOpenAI(DeepEvalBaseLLM):
    """Wrap a LangChain Azure chat model so deepeval can use it as the evaluation model."""
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace the deployment name with your own Azure OpenAI deployment
custom_model = AzureChatOpenAI(
    deployment_name="gpt-35-turbo",
)
azure_openai = AzureOpenAI(model=custom_model)



In [2]:
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""

In [3]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model=azure_openai,
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

0.5
The score is 0.50 because the summary accurately captures the main point about the coverage score calculation but does not explicitly mention how it quantifies how well a summary represents key information from the original text. It also fails to include any extra information not mentioned in the original text. It does not raise any questions that the original text can answer, so the summary is good in that aspect.
Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Summarization (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.33 because the summary accurately captures the information in the original text without any contradicting information or extra details. Additionally, it does not leave out any important questions that the original text can answer but not the summary., error: None)
      - Alignment (score: 1.0)
      - Coverage (score: 0.3333333333333333)

For test case:

  - input: 
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original co



[TestResult(success=False, metrics=[<deepeval.metrics.summarization.summarization.SummarizationMetric object at 0x7ff27ece9460>], input="\nThe 'coverage score' is calculated as the percentage of assessment questions\nfor which both the summary and the original document provide a 'yes' answer. This\nmethod ensures that the summary not only includes key information from the original\ntext but also accurately represents it. A higher coverage score indicates a\nmore comprehensive and faithful summary, signifying that the summary effectively\nencapsulates the crucial points and details from the original content.\n", actual_output='\nThe coverage score quantifies how well a summary captures and\naccurately represents key information from the original text,\nwith a higher score indicating greater comprehensiveness.\n', expected_output=None, context=None, retrieval_context=None)]

There are seven optional parameters when instantiating a SummarizationMetric:

- [Optional] threshold: the passing threshold, defaulted to 0.5.
- [Optional] assessment_questions: a list of close-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If assessment_questions is not provided, a set of assessment questions is generated for you at evaluation time. The assessment_questions are used to calculate the coverage_score.
- [Optional] n: the number of assessment questions to generate when assessment_questions is not provided. Defaulted to 5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
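
To make the interaction between threshold and strict_mode concrete, the sketch below applies the behaviour described in the list above to a raw score. It is an illustration under assumptions, not deepeval's own code: the helper names `apply_strict_mode` and `passes` are hypothetical, and treating a score equal to the threshold as a pass is an assumption.

```python
def apply_strict_mode(raw_score: float, strict_mode: bool = False) -> float:
    """Return the score as reported: binary in strict mode, unchanged otherwise."""
    if strict_mode:
        # Only a perfect result keeps its score; anything less becomes 0.
        return 1.0 if raw_score == 1.0 else 0.0
    return raw_score


def passes(score: float, threshold: float = 0.5) -> bool:
    """Assumption: a test case passes when its score reaches the threshold."""
    return score >= threshold


# Hypothetical raw score, evaluated with and without strict mode.
raw = 0.5
print(apply_strict_mode(raw), passes(apply_strict_mode(raw)))
print(apply_strict_mode(raw, strict_mode=True), passes(apply_strict_mode(raw, strict_mode=True)))
```

Under these assumptions, a summary that scores 0.5 would pass with the default threshold but fail once strict_mode is enabled, because only a perfect score survives.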

## How Is It Calculated?

The summarization metric is calculated with the following formula:

Summarization = min(Alignment Score, Coverage Score)

- alignment_score: whether the summary contains hallucinated information or contradicts the original text.
- coverage_score: whether the summary includes the necessary information from the original text.

In practice, alignment_score works much like HallucinationMetric, while coverage_score is computed by generating a set of yes/no questions and then taking the ratio of questions on which the original text and the summary give the same answer.
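
The calculation can be sketched in plain Python. This is a minimal illustration, not deepeval's actual implementation: the function names `coverage_score` and `summarization_score` and the hard-coded answers are hypothetical. It follows the definition quoted in the example input, where a question counts as covered when both the original text and the summary answer 'yes'.

```python
from typing import List


def coverage_score(original_answers: List[str], summary_answers: List[str]) -> float:
    """Fraction of assessment questions answered 'yes' by both texts."""
    if len(original_answers) != len(summary_answers):
        raise ValueError("Both answer lists must cover the same questions.")
    if not original_answers:
        return 0.0
    both_yes = sum(
        1
        for orig, summ in zip(original_answers, summary_answers)
        if orig == "yes" and summ == "yes"
    )
    return both_yes / len(original_answers)


def summarization_score(alignment: float, coverage: float) -> float:
    """The final score is the weaker of the two component scores."""
    return min(alignment, coverage)


# Hypothetical answers to three assessment questions.
original_answers = ["yes", "yes", "yes"]
summary_answers = ["yes", "no", "no"]

coverage = coverage_score(original_answers, summary_answers)
score = summarization_score(alignment=1.0, coverage=coverage)
print(coverage, score)
```

With a perfect alignment score and one of three questions covered, the final score is one third. This has the same shape as the Alignment 1.0 / Coverage 0.33 breakdown in the evaluation output above, which shows why a faithful summary can still fail when it omits details.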

https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task