#### Automatic Summary Evaluation

This project demonstrates an LLM-as-a-judge technique to evaluate summaries generated by machine learning models. Summary evaluation is a useful application of LLMs in production environments and has analogies in similar evaluations of GenAI outputs (or the automated evaluation of human outputs).

This notebook uses data from SummEval [1], a dataset of CNN and Daily Mail articles, machine-generated summaries of those articles, and human evaluations of those summaries. In this notebook, a G-Eval technique is used to evaluate the summaries on consistency. G-Eval is a powerful tool that correlates relatively well with human evaluations, with the added benefit that it can be used with a foundation model and does not require fine-tuning to achieve good results. It is especially powerful when the metric is otherwise difficult to score: How would you define the "relevance" of a summary using a rules-based NLP method? How would you get enough "relevance" examples to train an NLP model?

G-Eval [2] was originally used to better measure Relevance, Consistency, Fluency, and Coherence. Relevance measures how well the summary captures the key points of the article. Consistency measures whether the summary reproduces all facts accurately and does not make up untrue information. Fluency measures the quality of individual sentences (for example, whether they are grammatically correct). Coherence measures the quality of the summary as a whole.

In general, machine-written summaries are highly Fluent and Coherent [3], to the point that focus can shift entirely to Relevance and Consistency. Relevance is not as "solved" as Fluency or Coherence, but LLMs often produce output with acceptable levels of relevance. The biggest struggle for LLMs is often consistency, which suffers when the summary represents events in the source text incorrectly. These incorrect representations are known as hallucinations. A hallucination can be a description of events that contradicts the source (an intrinsic hallucination) or the introduction of new information that cannot be verified by the source (an extrinsic hallucination). That being said, SummEval was published in 2020 and the summaries in the dataset were produced with non-LLM models, so they are not always fully fluent or coherent.

This implementation of G-Eval uses several custom prompts as well as a modified version of an example prompt from Microsoft's Prompt Flow [4].

In [None]:
import os
import json
from typing import Union, Tuple, Dict, Callable

from numpy.typing import ArrayLike
import asyncio
import aiohttp
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr, kendalltau

import llm_utils
from llm_utils import LlmUtils

# Dataset paths
SUMM_EVAL_DATA = "./data/summEval/summeval.json"

# Cached request/responses
CACHE_PATH = "./data/summEval/llm_cache.pickle"

# Async Parameters
NUM_PARALLEL_CONNS_LLAMA = 3
NUM_PARALLEL_CONNS_GPT = 2

#### The Dataset

This dataset is from SummEval and is comprised of full text (source) articles from CNN and Daily Mail, summaries generated by various NLP models, and human evaluations of those summaries.

For this notebook, the only score we are interested in is the consistency score. Consistency can have multiple definitions in the literature and in practice, but the definition given to the human evaluators was: "The [Consistency] rating measures whether the facts in the summary are consistent with the facts in the original article. Consider whether the summary does reproduce all facts accurately and does not make up untrue information."

It has been several years since SummEval was published. Since then, much of the literature has broken consistency into 2 metrics: faithfulness (also known as groundedness) and coverage (also known as completeness or recall). This notebook will focus on consistency since there is no ground truth data for faithfulness or coverage in the SummEval dataset.

However, there is an interesting, almost philosophical, discussion to be had about faithfulness and coverage. Domain-specific language, especially with easy-to-understand definitions, can often be useful when discussing with stakeholders because it can avoid miscommunications.

Faithfulness: Faithfulness is defined as staying consistent and truthful to the provided source, and we do not consider ‘factuality’ where valid external facts are acceptable [3]. A summary that omits details but isn't otherwise misleading or inaccurate would still be faithful. Not considering factuality means a factually inaccurate piece of information in the source should be faithfully represented in the summary. Therefore a faithful summary of a flat-earther article would have the same factual inaccuracies that are present in the source article.

Coverage: Coverage is defined as the comprehensiveness of the summary and quantifies how well a summary captures and accurately represents key information from the source. A summary that contains all information from the text, but hallucinates new information, so long as the hallucinations do not mislead or inaccurately represent the source, would still have high coverage.
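To make the distinction concrete, the two definitions can be read as a precision-like and a recall-like ratio over sets of atomic facts. The following is only a toy sketch: the fact sets are hand-written and hypothetical, the function names are mine, and a real system would need a model to extract and match facts rather than exact string comparison.

```python
def faithfulness(summary_facts: set, source_facts: set) -> float:
    """Fraction of summary facts supported by the source (precision-like)."""
    if not summary_facts:
        return 1.0  # an empty summary makes no unsupported claims
    return len(summary_facts & source_facts) / len(summary_facts)


def coverage(summary_facts: set, source_facts: set) -> float:
    """Fraction of source facts that appear in the summary (recall-like)."""
    if not source_facts:
        return 1.0
    return len(summary_facts & source_facts) / len(source_facts)


# Hypothetical facts, written by hand for illustration only
source = {"fire started at 2am", "three people were rescued", "cause is unknown"}
terse = {"three people were rescued"}               # omits details, invents nothing
padded = source | {"the mayor visited the scene"}   # complete, plus one invented fact

terse_scores = (faithfulness(terse, source), coverage(terse, source))
padded_scores = (faithfulness(padded, source), coverage(padded, source))
print(terse_scores)   # fully faithful, low coverage
print(padded_scores)  # full coverage, reduced faithfulness
```

The terse summary stays fully faithful while covering a third of the source, and the padded summary keeps full coverage while losing faithfulness, which mirrors the two edge cases described above.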

In [None]:
# Data used for evaluation available from the G-Eval paper's GitHub repo
#   https://raw.githubusercontent.com/nlpyang/geval/refs/heads/main/data/summeval.json
# Use a context manager so the file handle is closed after loading
with open(SUMM_EVAL_DATA) as f:
    summ_eval = json.load(f)
summ_eval = pd.DataFrame(summ_eval)
summ_eval["consistency"] = summ_eval["scores"].apply(lambda x: x["consistency"])
summ_eval.drop(["doc_id", "system_id", "reference", "scores"] , axis=1, inplace=True)
summ_eval.rename(columns={"source": "article", "system_output": "summary"}, inplace=True)

# 5 articles (and all 16 summaries associated with each of the articles) get flagged by the Azure OpenAI Service's
#   Content Filter, even on the most permissive settings
# This filter can be removed for trusted users so long as they still uphold Microsoft's content moderation policies
#   My personal account is not a trusted user, so I cannot remove the content filter
# I enumerate each index explicitly because it would be possible for a specific summary to be flagged, but not the
#   other summaries associated with the same article. This makes it easier to apply to a new dataset in the future
azure_error_indices = [
      80,   81,   82,   83,   84,   85,   86,   87,   88,   89,   90,   91,   92,   93,   94,   95,
     352,  353,  354,  355,  356,  357,  358,  359,  360,  361,  362,  363,  364,  365,  366,  367,
     448,  449,  450,  451,  452,  453,  454,  455,  456,  457,  458,  459,  460,  461,  462,  463,
    1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487,
    1520, 1521, 1522, 1523, 1524, 1525, 1526, 1527, 1528, 1529, 1530, 1531, 1532, 1533, 1534, 1535
]

#### Calling the models

This notebook compares the performance of 4 models on 4 prompts.

The models are Llama3.2-vision [5], Phi4 [6], DeepSeek-R1-Distill-Qwen-14B [7], and GPT-4o mini [8]. 
* Llama3.2-vision is an 11B parameter model released by Meta in 09/2024
* Phi4 is a 14B parameter model released by Microsoft in 12/2024
* DeepSeek-R1-Distill-Qwen-14B uses the Qwen-14B model architecture and was distilled from DeepSeek-R1 (released by DeepSeek in 01/2025)
* GPT-4o mini is OpenAI's "most cost-efficient small model" and was released in 07/2024.

All models except GPT-4o mini are open source and are called using a local Ollama server. GPT-4o mini is called using the Azure OpenAI Service. It would be easy to compare more models, especially ones available on Ollama or Azure OpenAI Service. The open source models were selected because they are the most performant (and largest) models that I can reasonably run on my personal GPU. GPT-4o mini was selected because it is the most performant model for which I am willing to pay the monetary cost. GPT-4o and o1-mini are both more than an order of magnitude more expensive and o1 is nearly 2 orders of magnitude more expensive.

Both Ollama and the Azure OpenAI Service use similar REST API endpoints, but there are some small differences between the APIs. All differences are contained and managed by llm_utils.
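As a rough illustration of the kind of difference such a wrapper has to absorb, the sketch below builds the endpoint URL and pulls the generated text out of each service's response. It is not the code in llm_utils, which is not shown here. The function names, the default local host, and the sample replies are my assumptions. The sketch assumes Ollama's `/api/generate` endpoint and the Azure OpenAI chat completions endpoint, and the Azure resource, deployment, and API version are left as parameters.

```python
def build_url(model_type, ollama_host="http://localhost:11434",
              azure_resource="", azure_deployment="", api_version=""):
    """Return the endpoint URL for a request (hypothetical helper)."""
    if model_type == "ollama":
        # Ollama's text-generation endpoint takes a single "prompt" string
        return f"{ollama_host}/api/generate"
    # Azure OpenAI chat completions takes a list of role/content "messages"
    return (f"https://{azure_resource}.openai.azure.com/openai/deployments/"
            f"{azure_deployment}/chat/completions?api-version={api_version}")


def extract_text(model_type, response_json):
    """Pull the generated text out of the parsed JSON response body."""
    if model_type == "ollama":
        return response_json["response"]
    return response_json["choices"][0]["message"]["content"]


# Hand-written stand-ins for the two response shapes (not real model output)
ollama_reply = {"response": '{"score": 4}', "done": True}
azure_reply = {"choices": [{"message": {"role": "assistant",
                                        "content": '{"score": 4}'}}]}

ollama_url = build_url("ollama")
ollama_text = extract_text("ollama", ollama_reply)
azure_text = extract_text("azure", azure_reply)
```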

In [None]:
# All calls to the LLMs are saved to a 'cache', which is just a dictionary with requests as keys and responses as values
# This is useful when rerunning, adding new models, or otherwise experimenting or debugging, since the calls to the
#   LLMs are expensive in time and, for GPT, in money.

# I am using a try-except in this manner so I can re-run the full notebook without accidentally overwriting the cache.
# The cache is pickled and saved to disk once all samples are passed through a prompt-model pair
try:
    CACHE
except NameError:
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            CACHE = pickle.load(f)
    else:
        CACHE = dict()

In [None]:
# Creates the request payload for the model calls
# Uses structured outputs to ensure the model gives an output that can be parsed
def get_request_payload(
        model_name: llm_utils.AllowedModelNames,
        prompt: Union[str, Tuple[str, str]]
) -> dict:
    """ Creates the request payload for the given model
    """
    # Set some fields based on the inputs
    model_type = LlmUtils.get_model_type(model_name)
    num_output_tokens = 10
    seed = 314159265

    # Set the randomness options to their minimum values so repeated calls are as deterministic as possible
    temperature = 0.0
    top_p = 0.0
    top_k = 1

    # Create the portions of the request payload that are identical between Ollama and Azure OpenAI
    payload = {
        "stream": False,
    }

    # Create the Ollama specific portions if this is an Ollama request
    if model_type == "ollama":
        payload["prompt"] = prompt
        payload["model"] = model_name
        payload["options"]= {
            "seed": seed,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "num_predict": num_output_tokens
        }

        payload["format"] = {
            'properties': {
                'score': {
                    'enum': [1, 2, 3, 4, 5],
                    'title': 'Score',
                    'type': 'integer'
                }
            },
            'required': ['score'],
            'title': 'score',
            'type': 'object'
        }

    # Create the Azure OpenAI specific portions if this is a GPT request
    else:
        payload["messages"] = (
            {
                "role": "system",
                "content": prompt[0]
            },
            {
                "role": "user",
                "content": prompt[1]
            }
        )
        payload["seed"] = seed
        payload["temperature"] = temperature
        payload["top_p"] = top_p
        payload["max_tokens"] = num_output_tokens

        payload["response_format"] = {
            "type": "json_schema",
            "json_schema": {
                "name": "score",
                "schema": {
                    "type": "object",
                    "properties": {
                        "score": {
                            "enum": [1, 2, 3, 4, 5],
                            "type": "integer"
                        }
                    },
                    "required": ["score"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }

    return payload

#### Scoring Metric

Once all summaries are scored, the correlation between those scores and the human evaluations is measured. Correlation is measured using Pearson, Spearman, and Kendall-Tau correlation. Pearson correlation measures the linear relationship between the machine scores and human evaluations, while Spearman and Kendall-Tau look at the ranking of the data and measure the monotonic relationship between the machine scores and human evaluations. Kendall-Tau is typically more robust when dealing with tied ranks and will be the preferred metric of this analysis since both the machine scores and human evaluations have many ties.

On the subject of tied rankings: G-Eval outputs a binned score, an integer with values 1, 2, 3, 4, 5 (although it never gave a score of 1 in any of my trials). In practice, this leads to ~500-way ties in the ranking of article-summary pairs. Allowing for so many ties makes the evaluation task significantly easier. The authors of G-Eval noted this fact, and they also acknowledged that it likely explains why the normalized probability score (which was a continuous score from 1-5) was worse than the binned integer score.
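The two scoring styles and the effect of ties can be sketched in a few lines. The probability-weighted score below follows the idea in the G-Eval paper of weighting each rating by the probability the model assigns to it. The tau-b function is a hand-written illustration of the tie correction, and `scipy.stats.kendalltau`, which this notebook uses, computes the same tau-b variant by default. All numbers are made up for illustration and are not results from this notebook.

```python
from itertools import combinations
from math import sqrt


def probability_weighted_score(rating_probs):
    """Expected rating: sum of rating * P(rating), renormalized to sum to 1."""
    total = sum(rating_probs.values())
    return sum(rating * p for rating, p in rating_probs.items()) / total


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects the denominator for tied pairs."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither term
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    n_x = concordant + discordant + ties_x_only
    n_y = concordant + discordant + ties_y_only
    if n_x == 0 or n_y == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / sqrt(n_x * n_y)


# Made-up probabilities for the ratings 3, 4, and 5: gives about 4.3
weighted = probability_weighted_score({3: 0.1, 4: 0.5, 5: 0.4})

# Made-up human averages and binned machine scores with several ties
human = [1.0, 2.3, 3.0, 4.7, 5.0, 5.0]
binned = [2, 3, 3, 5, 5, 5]
tau = kendall_tau_b(human, binned)
print(weighted, tau)
```

Every untied pair in this example is ordered the same way in both lists, yet tau-b stays below 1 because the binned scores tie on pairs that the human scores separate.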

In [None]:
# Calculate the correlation between the model and human evaluations
def summEval_correlations(y_true: ArrayLike, y_pred: ArrayLike) -> Tuple[float, float, float]:
    pearson = pearsonr(y_true, y_pred)[0]
    spearman = spearmanr(y_true, y_pred)[0]
    kendall_tau = kendalltau(y_true, y_pred)[0]
    return pearson, spearman, kendall_tau

#### G-Eval Prompting

A G-Eval-based evaluator defines a metric in the prompt, then instructs a non-fine-tuned large language model to return the score of the article-summary pair using that metric. The model is instructed (and forced by the structured outputs API) to return the score as JSON. Enforcing a JSON output ensures that the score can be parsed.

The conversation that I will be simulating with stakeholders regarding G-Eval revolves around inconsistency. In this case, inconsistency refers to the LLM's outputs. When prompt engineering, a seemingly inconsequential change to the prompt or article-summary pair can lead to undesired outputs from the LLM. To highlight this, I include 3 extremely similar prompts and an additional similar prompt that is from Microsoft Prompt Flow [4]. Ideally, for a given model the 4 prompts should produce similar results. In practice the 4 prompts produce correlation scores that differ by +/- 5 points (about 10% of the correlation score), and the ranking of the prompts based on correlation score is different for each model.

The prompts do not include any advanced prompt engineering techniques such as Chain of Thought, Few-Shot Prompting, etc. Different models respond differently to these techniques, so they are left out of this simple demonstration. Additionally, it can be difficult to prompt engineer when using a G-Eval-based prompt because the model output has no explainability. Without fine-tuning the model, it can be very difficult to develop confidence that the model's output will produce good results on new data.

* Prompt A is the base prompt. It is the simplest prompt.
* Prompt B is prompt A with an additional, redundant line that repeats part of the instructions.
* Prompt C is prompt B with several additional instructions. It instructs the model to be strict in its output. Notably, the model is already well constrained, as I use the structured outputs parameters with both Ollama and Azure OpenAI.
* Prompt D is similar to A, B, and C but is the most dissimilar of the 4. It is the Microsoft Prompt Flow example. It includes many short sentences, a short list of instructions, and generous newline characters.

In fairness to LLMs, larger models tend to give more consistent outputs. GPT-4o and o1 tend to be much more consistent, but still not nearly as consistent as other non-LLM models and techniques. I elaborate more in the "Discussion" section.

In [None]:
g_eval_base_system_prompt = """You are a super intelligent artificial intelligence that scores a summary based on its consistency with the source article. Your score is a rating from 1 to 5 (1, 2, 3, 4, 5) with 1 being a bad score and 5 being a good score. Your output is in the following format: {"score": rating} where rating is 1, 2, 3, 4 or 5.
Consistency is defined as the factual alignment between the summary and the source article. A factually consistent summary contains only statements that are entailed by the source article. Summaries that contain inconsistent information should be penalized.
{additional_instructions}
"""

g_eval_base_user_message = """Source Article:
{article}

Summary:
{summary}"""

g_eval_consistency_suffix = """

Consistency JSON:
"""

def g_eval_faithfulness_prompt_a(
        summary: str,
        article: str,
        model_type: llm_utils.ServerTypes
) -> Union[str, Tuple[str, str]]:
    system_prompt = g_eval_base_system_prompt
    system_prompt = system_prompt.replace("{additional_instructions}", "")
    
    user_message = g_eval_base_user_message
    user_message = user_message.replace("{article}", article)
    user_message = user_message.replace("{summary}", summary)
    
    if model_type == "ollama":
        system_prompt = system_prompt + user_message + g_eval_consistency_suffix
        return system_prompt

    else:
        return system_prompt, user_message



def g_eval_faithfulness_prompt_b(
        summary: str,
        article: str,
        model_type: llm_utils.ServerTypes
) -> Union[str, Tuple[str, str]]:
    additional_instructions = "You will be given a summary and the source article. Your task is to answer with the rated score for that summary.\n"
    system_prompt = g_eval_base_system_prompt
    system_prompt = system_prompt.replace("{additional_instructions}", additional_instructions)
    
    user_message = g_eval_base_user_message
    user_message = user_message.replace("{article}", article)
    user_message = user_message.replace("{summary}", summary)
    
    if model_type == "ollama":
        system_prompt = system_prompt + user_message + g_eval_consistency_suffix
        return system_prompt

    else:
        return system_prompt, user_message


def g_eval_faithfulness_prompt_c(
        summary: str,
        article: str,
        model_type: llm_utils.ServerTypes
) -> Union[str, Tuple[str, str]]:
    # Only the extra instructions go here; the base system prompt is filled in below
    additional_instructions = """You will be given a summary and the source article. Your task is to answer with the rated score, and only the score, for that summary.
Your answers should *strictly* be a number from 1 to 5. Do *not* output anything except a single number between 1 and 5.
Do *not* output additional information, comments, or context. Do *not* answer with any spaces, whitespace, newline characters or any other formatting.
"""
    system_prompt = g_eval_base_system_prompt
    system_prompt = system_prompt.replace("{additional_instructions}", additional_instructions)
    
    user_message = g_eval_base_user_message
    user_message = user_message.replace("{article}", article)
    user_message = user_message.replace("{summary}", summary)
    
    if model_type == "ollama":
        system_prompt = system_prompt + user_message + g_eval_consistency_suffix
        return system_prompt

    else:
        return system_prompt, user_message

In [None]:
def g_eval_faithfulness_prompt_d(
        summary: str,
        article: str,
        model_type: llm_utils.ServerTypes
) -> Union[str, Tuple[str, str]]:
    system_prompt = """You will be given a source document. You will then be given one summary written for this source document.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Consistency (1-5) - the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. Annotators were also asked to penalize summaries that contained hallucinated facts.

Evaluation Steps:

1. Read the source document carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the source document. Check if the summary contains any factual errors that are not supported by the source document.
3. Assign a score for consistency based on the Evaluation Criteria.
4. Format your response into a JSON of the form {"score": consistency} where consistency is your score from 1 to 5.
"""

    user_message = f"""Source Document:

{article}

Summary:

{summary}"""

    if model_type == "ollama":
        system_prompt = system_prompt + "\n" + user_message + g_eval_consistency_suffix + "\n"
        return system_prompt

    else:
        return system_prompt, user_message


#### Calling the Models

We have 1600 pairs of summaries and articles to churn through. While it is overkill for my local Ollama server and a GPT deployment with a low tokens-per-minute rate limit, I make these calls asynchronously. It is overkill because I then have to limit the number of simultaneous connections to Ollama and GPT, otherwise the requests may time out while waiting in the queue. For GPT, I use exponential backoff to throttle the calls when the rate limit is reached.

As mentioned briefly in the "G-Eval Prompting" section, a JSON schema is enforced on the model output to ensure the output can be parsed. Without it, the model can "go rogue" and might output more than just the integer answer that we want. The structured outputs API puts more guardrails on the models and makes them more acceptable for production environments.
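The throttling described above can be sketched as follows. This is a minimal illustration, not the logic inside llm_utils: `RateLimitError`, `call_with_backoff`, and `bounded_gather` are hypothetical names, and the semaphore is a stand-in for the connection limit that this notebook sets through `aiohttp.TCPConnector(limit=...)`.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 (rate limited) response."""


async def call_with_backoff(make_call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry make_call(), doubling the wait after each rate-limit error."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps parallel callers from retrying in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))


async def bounded_gather(call_factories, limit, base_delay=1.0):
    """Run all calls with at most `limit` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await call_with_backoff(factory, base_delay=base_delay)

    return await asyncio.gather(*(run(factory) for factory in call_factories))


async def demo():
    """Simulate a call that is rate limited twice, then a small bounded batch."""
    attempts = {"count": 0}

    async def flaky_call():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitError("simulated 429")
        return "ok"

    async def echo(i):
        await asyncio.sleep(0)
        return i

    first = await call_with_backoff(flaky_call, base_delay=0.01)
    batch = await bounded_gather([lambda i=i: echo(i) for i in range(5)],
                                 limit=2, base_delay=0.01)
    return first, attempts["count"], batch
```

In a notebook the demo can be run with `await demo()`, and in a script with `asyncio.run(demo())`.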

In [None]:
async def g_eval_faithfulness(
        model_name: llm_utils.AllowedModelNames,
        session: aiohttp.ClientSession,
        summary: str,
        article: str,
        prompt_base: Callable
        ):
    """ Returns the score for one article-summary pair using the given model and prompt_function
    """
    model_type = LlmUtils.get_model_type(model_name)
    prompt = prompt_base(summary, article, model_type)
    response = await LlmUtils.call_cached_llm(
        model_name,
        session,
        get_request_payload(
            model_name,
            prompt,
        ),
        CACHE,
        returned_stopped_only=False
    )

    # Parse the JSON score; report and skip any response that cannot be parsed
    try:
        return int(json.loads(response.strip())['score'])
    except Exception:
        print("response:", response)
        return None


async def g_eval_faithfulnesses(
        model_name: llm_utils.AllowedModelNames,
        summ_eval: pd.DataFrame,
        prompts: Dict[str, Callable]
):
    """ Returns the correlation for the given model on all article-summary pairs for each of the 4 prompts
    """
    # Setup the processing for this model
    model_type = LlmUtils.get_model_type(model_name)
    num_conns = NUM_PARALLEL_CONNS_LLAMA if model_type == "ollama" else NUM_PARALLEL_CONNS_GPT
    connector = aiohttp.TCPConnector(limit=num_conns)
    timeout = aiohttp.ClientTimeout(total=None)

    geval_correlations = {}
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        # Run each prompt
        for prompt_name, prompt_base in prompts.items():
            # Asynchronously run each article-summary pair through the model
            geval_tasks = [
                g_eval_faithfulness(model_name, session, summary, article, prompt_base)
                for summary, article in zip(summ_eval["summary"], summ_eval["article"])
            ]
            geval_results = await asyncio.gather(*geval_tasks)
            
            # Get the correlation scores for this model and prompt
            geval_correlations[prompt_name] = summEval_correlations(summ_eval["consistency"], geval_results)
            
            with open(CACHE_PATH, "wb") as f:
                pickle.dump(CACHE, f, pickle.HIGHEST_PROTOCOL)

    return geval_correlations


In [None]:
prompts = {
    "a": g_eval_faithfulness_prompt_a,
    "b": g_eval_faithfulness_prompt_b,
    "c": g_eval_faithfulness_prompt_c,
    "d": g_eval_faithfulness_prompt_d
}
g_eval_correlations = {
    "phi4": await g_eval_faithfulnesses("phi4", summ_eval, prompts),
    "llama3.2-vision": await g_eval_faithfulnesses("llama3.2-vision", summ_eval, prompts),
    "deepseek-r1:14b": await g_eval_faithfulnesses("deepseek-r1:14b", summ_eval, prompts),
    # "gpt4o-mini": await g_eval_faithfulnesses("gpt4o-mini", summ_eval, prompts),
}

In [None]:
# Save to the same path the cache is loaded from so the next run picks it up
with open(CACHE_PATH, "wb") as f:
    pickle.dump(CACHE, f, pickle.HIGHEST_PROTOCOL)

#### Performance

The models perform well enough, but the scores have a good amount of variation. Not only did the correlation scores vary by +/- 5 points between prompts (about 10% of the score), but the rankings of the prompts also differed between models.
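One way to put a number on that variation is to compute, for each model, the range of a correlation metric across prompts and the resulting prompt ranking. The sketch below assumes the same nested structure as `g_eval_correlations` (model, then prompt, then a `(pearson, spearman, kendall_tau)` tuple). The function name is mine, and the values are illustrative placeholders, not results from this notebook.

```python
def prompt_spread(correlations, metric_index=2):
    """Per model: range of one metric across prompts, and prompts ranked best first."""
    spread = {}
    for model, by_prompt in correlations.items():
        values = {prompt: scores[metric_index] for prompt, scores in by_prompt.items()}
        spread[model] = {
            "range": max(values.values()) - min(values.values()),
            "ranking": sorted(values, key=values.get, reverse=True),
        }
    return spread


# Placeholder values for illustration only (pearson, spearman, kendall_tau)
example = {
    "model_x": {"a": (0.50, 0.45, 0.40), "b": (0.55, 0.50, 0.45), "c": (0.52, 0.47, 0.42)},
    "model_y": {"a": (0.60, 0.55, 0.50), "b": (0.54, 0.49, 0.44), "c": (0.58, 0.53, 0.48)},
}
example_spread = prompt_spread(example)
for model, stats in example_spread.items():
    print(model, round(stats["range"], 3), stats["ranking"])
```

If the prompts were interchangeable, the ranges would be close to zero and the rankings would agree across models.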

In [None]:
models = list(g_eval_correlations.keys())
prompts = list(g_eval_correlations[models[0]].keys())
correlation_names = ['Pearson', 'Spearman', 'Kendall-Tau']

x = np.arange(len(models))
width = 0.2  # width of bars

fig, axes = plt.subplots(1, len(correlation_names), figsize=(25, 5), sharey=True)
handles = []
labels = [f'Prompt {prompt}' for prompt in prompts]

for i, metric in enumerate(correlation_names):
    ax = axes[i]
    for j, prompt in enumerate(prompts):
        values = [g_eval_correlations[model][prompt][i] for model in models]
        bars = ax.bar(x + j * width, values, width)
        
        # Collect handles and labels only once
        if i == 0:
            handles.append(bars[0])
            ax.set_ylabel("Score")
    
    ax.set_xlabel("Models")
    ax.set_title(f"{metric} Correlation")
    ax.set_xticks(x + width * 1.5)
    ax.set_xticklabels(models)

# Add a single legend to the figure
fig.legend(handles, labels, loc='lower center', ncol=len(prompts), bbox_to_anchor=(0.5, -0.1))
plt.suptitle("Evaluation Scores by Model, Prompt, and Correlation Type")
plt.show()

#### Discussion and Example Application

Hopefully I have shown that LLMs are not a silver bullet and that there are concerns with using an LLM for tasks that are not generative in nature. A task such as evaluating the faithfulness of article-summary pairs is likely best left to an embedding model rather than a language model.

That being said, generative models absolutely have their place in the data scientist's or machine learning engineer's toolkit. For one, generative models are often able to very quickly produce a good enough result. Consider the following example, keeping in line with the spirit of the summary-article pairs of SummEval:
* Let's assume we are working on an internal process to summarize full-length transcripts of video calls
* Previously, it was company policy for employees to manually summarize meetings
* The full transcript along with the summary is logged in some records system
* For these summaries, there are company guidelines as far as professional language, factual correctness, etc
* These summaries are very important for legal and record keeping reasons
* Employees are inconsistent with their summaries. Many employees don't do a great job writing summaries, and we want to improve the summaries
* Instead of manually reviewing every summary-transcript pair, we can use a technique like G-Eval to score them based on those existing policies
* Any transcript-summary pair with a score of 3 or lower gets sent back to a human to be revised

Better yet, we can leverage the generative nature of LLMs to create the summaries from the start
* Have the LLM create the summary based on the full transcript. But feedback from the stakeholders indicates the biggest issue with the summaries is the faithfulness.
* Use a G-Eval-based technique to evaluate the transcript-summary pair, sending it back to the LLM to be rewritten if it scores a 3 or lower
* After a summary with a score of 4 or 5 is created, we present it to a human employee for the final evaluation
* After the human employee reviews the summary and approves it, it is sent to the same records system as before

The best part about this system is that it can be created quickly. It requires no training data, and the infrastructure to host the LLM already exists via Azure OpenAI Service or AWS Bedrock. The other parts of this service, like how we show the summary to the human for review or how we automatically log it in the records system, are necessary regardless of the machine learning technique we use to summarize and evaluate the summaries. Even if this system isn't perfect, it's a lot easier to correct a factually inconsistent summary than it is to write one from scratch. So we are likely still reducing the manual labor required to write the summaries even if we don't eliminate it.

There is a major benefit to quickly creating a system: "A lot of times, people don't know what they want until you show it to them" - Steve Jobs. Stakeholders (and people in general) often know what improvements can be made, and what would drive business value. However, they don't know how to create that system. In our example, the stakeholders knew they needed to improve the summarization, but didn't know how they could go about doing it. Creating a system, even one that is imperfect like G-Eval, closes the gap between a paper diagram and the real world. After creating the system, our stakeholders can re-evaluate whether we solved the problem at hand. We have a modular system, so if the summaries are not faithful enough to their source transcripts due to hallucinations, we can improve the summarization and evaluation portions of the pipeline.

If we wanted to improve the evaluation step of the pipeline, MQAG [9] and TrueTeacher [10] are both techniques that are more correlated with human evaluation of faithfulness, MQAG especially so.
* MQAG uses a language model trained on multiple choice questions and answers, and an embedding model trained to answer those questions. The language model is used to generate multiple choice questions and answers from the summary. The embedding model then answers those questions based on the source article (without access to the summary). If the embedding model gets the answer incorrect, that implies information in the summary differs from information in the article (and therefore the summary is to some degree unfaithful to the article). Based on the proportion of correct answers, the overall faithfulness of the summary can be calculated.
* TrueTeacher first generates machine-written summaries from source articles. An LLM, pre-trained on similar text, is then used to label those summaries with faithfulness labels. Because the summaries are machine written, they are expected to have some amount of factual inconsistencies. An embedding model is then trained on the source articles and generated summaries with the labels as ground truth. The embedding model is eventually able to surpass the performance of the LLM, even though the LLM was used to generate the labels in the first place.
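The final scoring step of the MQAG description above, turning per-question agreement into one number, can be sketched as below. It follows the simplified "proportion of correct answers" wording used here rather than the formulation in the MQAG paper, and it leaves out question generation and answering, which need trained models. The function name and the answer lists are hypothetical.

```python
def agreement_score(answers_from_summary, answers_from_article):
    """Fraction of questions where the article-based answer matches the summary-based one."""
    if len(answers_from_summary) != len(answers_from_article):
        raise ValueError("Both answer lists must cover the same questions")
    if not answers_from_summary:
        raise ValueError("At least one question is required")
    matches = sum(
        summary_answer == article_answer
        for summary_answer, article_answer in zip(answers_from_summary, answers_from_article)
    )
    return matches / len(answers_from_summary)


# Hypothetical multiple choice answers for four generated questions
summary_answers = ["B", "A", "D", "C"]   # answers implied by the summary
article_answers = ["B", "A", "C", "C"]   # answers chosen using only the article
mqag_style_score = agreement_score(summary_answers, article_answers)
print(mqag_style_score)  # 3 of 4 answers agree
```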

The drawback to those methods is that they require some kind of training data, and more time to develop the model as well as host it. MQAG requires a set of questions and answers generated on text similar to the application. TrueTeacher requires a large corpus for pretraining an LLM, even if it isn't labeled. G-Eval requires nothing and can be used with a good degree of success out of the box.

#### Citations

[1] Fabbri, Alexander R., et al. "Summeval: Re-evaluating summarization evaluation." Transactions of the Association for Computational Linguistics 9 (2021): 391-409.

[2] Liu, Yang, et al. "G-eval: Nlg evaluation using gpt-4 with better human alignment." arXiv preprint arXiv:2303.16634 (2023).

[3] Maynez, Joshua, et al. "On faithfulness and factuality in abstractive summarization." arXiv preprint arXiv:2005.00661 (2020).

[4] Fujimoto, K., & Maneck, B. (2024, February 24). Prompt flow. Prompt flow - Prompt flow documentation. https://microsoft.github.io/promptflow/ Maneck, B., & Fujimoto, K. (n.d.). Promptflow/examples/flows/evaluation/eval-summarization at main · Microsoft/promptflow. Summarization Evaluation. https://github.com/microsoft/promptflow/tree/main/examples/flows/evaluation/eval-summarization#meta-evaluation 

[5] AI@Meta. (2024). Llama 3.2 Model Card. GitHub. https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md 

[6] Abdin, Marah, et al. "Phi-4 technical report." arXiv preprint arXiv:2412.08905 (2024).

[7] DeepSeek-AI, et al. "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).

[8] OpenAI. "GPT-4o mini: advancing cost-efficient intelligence." (2024, July 18). https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence

[9] Manakul, Potsawee, Adian Liusie, and Mark JF Gales. "MQAG: Multiple-choice question answering and generation for assessing information consistency in summarization." arXiv preprint arXiv:2301.12307 (2023).

[10] Gekhman, Zorik, et al. "Trueteacher: Learning factual consistency evaluation with large language models." arXiv preprint arXiv:2305.11171 (2023).
