
<div class="title-slide">
  
# Module 1.1 - Foundations of GenAI Evaluation  
<span style="font-size:20px; line-height:2;">
Dr. Hari Manassery Koduvely <br> 
Principal Data Scientist  <br>  
Cybersecurity Analytics <br>  
Ottawa, Canada <br>  
November 24, 2025
</span>
</div>


## How to Clone this Repository on your Work Laptop   
- <span style="font-size:18px;">If your laptop does not have Git, install it from https://github.com/git-guides/install-git</span>  
- <span style="font-size:18px;">If your laptop does not have an IDE that can open Jupyter notebooks, install Visual Studio Code from https://code.visualstudio.com/download</span>  
- <span style="font-size:18px;">To clone the repository:</span>  
    - <span style="font-size:16px;">git clone https://github.com/harik68/Course-Evaluation-of-GenAI-Applications.git</span>  
- <span style="font-size:18px;">Change directory to Course-Evaluation-of-GenAI-Applications/Session-1-Evaluation-Principles-and-Methods</span>  
- <span style="font-size:18px;">Open the Jupyter Notebook Module1.1-Foundations-of-GenAI-Evaluation.ipynb using the IDE</span>    

### Importing Libraries

In [None]:
import os
import textwrap
from collections import Counter
from typing import Dict

import pandas as pd
from IPython.display import HTML, Image, display
from openai import OpenAI

## Learning Objectives  

- <span style="font-size:18px;"> Understand why evaluation is critical for GenAI applications </span>    
  
- <span style="font-size:18px;"> Learn key challenges in evaluating non-deterministic systems </span>  
  
- <span style="font-size:18px;"> Identify different evaluation paradigms </span>   
  
- <span style="font-size:18px;"> Identify where the evaluations fit in the SDLC pipeline </span>  



### Why Is Evaluation Critical for GenAI Applications?

<div style='text-align: center;'>
    <img src='../Images/Why-Evaluation-Critical-for-GenAI.png' height='700'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>Figure 1: Why Evaluation is Critical for GenAI Applications <br> (AI Generated Image)</div>
</div>

#### More Details ...



- <span style="font-size:18px;">Ensures Output Quality </span>   
    - Evaluation helps verify that GenAI models produce accurate, relevant, and coherent outputs 
    - Example: a chatbot providing correct medical advice.    
  
- <span style="font-size:18px;">Detects Bias and Fairness Issues </span>   
    - Regular evaluation can uncover biases in model responses, such as gender or racial bias in hiring recommendations.  
    
- <span style="font-size:18px;">Monitors Model Drift </span>   
    - Continuous evaluation identifies when a model's performance degrades over time due to changes in data or user behavior.  
    
- <span style="font-size:18px;">Supports Regulatory Compliance </span>   
    - Evaluation ensures outputs meet legal and ethical standards, such as GDPR compliance in data privacy.  
    
- <span style="font-size:18px;">Guides Model Improvement </span>   
    - Feedback from evaluation highlights areas for retraining or fine-tuning, like improving summarization accuracy in news aggregation.  
    
- <span style="font-size:18px;">Builds User Trust </span>   
    - Demonstrating consistent and reliable performance increases user confidence in GenAI applications   
    - Example: AI writing assistants.  
    
- <span style="font-size:18px;">Prevents Harmful Outputs </span>   
    - Evaluation helps catch toxic, unsafe, or inappropriate content before it reaches users    
    - Example: content moderation tools.  
    
- <span style="font-size:18px;">Measures Real-world Impact </span>   
    - Evaluation assesses how well the model performs in actual use cases, such as customer support automation.  
    
- <span style="font-size:18px;">Enables Benchmarking </span>   
    - Comparing model performance against baselines or competitors helps set improvement goals.  
    
- <span style="font-size:18px;">Informs Deployment Decisions </span>   
    - Evaluation results guide when a model is ready for production or needs further development.  
   

### What are the Key Challenges in Evaluating GenAI Applications?  
- <span style="font-size:18px;">Traditional software evaluation is a deterministic process of **Verification**</span>  
  
- <span style="font-size:18px;">GenAI application evaluation is a holistic process of **Validation**</span>  

### Silent Failure

<div style='text-align: center;'>
    <img src='../Images/Agent-Failure-Modes.png' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>Figure 2: Agent Failure Modes <br> Reference: Agent Quality - Google White Paper</div>
</div>

#### The Need to Evaluate All Components 

<div style='text-align: center;'>
    <img src='../Images/LLM-Summarization-Pipeline-Evaluation.jpg' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>Figure 3: LLM Summarization Pipeline Evaluation <br> (AI Generated Image)</div>
</div>

#### More Details ...

  

- <span style="font-size:18px;"> All Components Need Evaluation </span>   
    - Evaluating only the LLM is not sufficient   
    - Poor performance of any component in the pipeline can affect the final output quality  

- <span style="font-size:18px;">Non-deterministic Outputs </span>   
    - GenAI models can produce different responses to the same input, making consistent evaluation difficult   
    - Example: a chatbot giving varied answers to the same question 
   
- <span style="font-size:18px;">Subjectivity of Quality </span>   
    - Assessing output quality often depends on human judgment, which can vary   
    - Example: evaluating the creativity of AI-generated stories  
  
- <span style="font-size:18px;">Lack of Clear Ground Truth </span>   
    - Many tasks lack a single correct answer, complicating automated evaluation   
    - Example: summarizing a news article  
  
- <span style="font-size:18px;">Bias and Fairness Detection </span>  
    - Identifying subtle biases in outputs requires careful analysis   
    - Example: AI resume screening favoring certain demographics 
  
- <span style="font-size:18px;">Scalability of Human Evaluation </span>  
    - Manual review is time-consuming and expensive for large-scale outputs   
    - Example: reviewing thousands of AI-generated images
  
- <span style="font-size:18px;">Contextual Understanding </span>  
    - Models may fail to grasp nuanced context, leading to misleading outputs   
    - Example: translation errors due to cultural references 
  
- <span style="font-size:18px;">Measuring Real-world Impact </span>  
    - Simulated metrics may not reflect actual user experience or business outcomes  
    - Example: customer satisfaction with AI support
  
- <span style="font-size:18px;">Robustness to Adversarial Inputs </span>  
    - Evaluating how models handle unexpected or malicious inputs is challenging  
    - Example: prompt injection attacks  
  
- <span style="font-size:18px;">Ethical and Safety Concerns </span>  
    - Ensuring outputs do not cause harm or violate ethical standards  
    - Example: AI generating unsafe medical advice
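The non-determinism challenge listed above can be made measurable by sampling the same prompt several times and checking how much the responses agree. The sketch below is only an illustration: the helper names `jaccard` and `self_consistency` and the sample responses are hypothetical, and token-set overlap is a crude stand-in for semantic agreement.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings (case-insensitive)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def self_consistency(responses: List[str]) -> float:
    """Mean pairwise Jaccard similarity over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical responses returned for the same prompt on three separate calls
responses = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "France's capital city is Paris",
]
score = self_consistency(responses)
print(f"Self-consistency: {score:.3f}")
```

A low score does not by itself mean the answers are wrong; here all three responses say the same thing, yet the third is worded differently enough to pull the score down. That gap is one reason lexical metrics alone are not sufficient.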
  

### What are the Evaluation Paradigms for GenAI Applications?

<div style='text-align: center;'>
    <img src='../Images/GenAI_Evaluation_Paradigms.jpg' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>Figure 4: GenAI Evaluation Paradigms <br> (AI Generated Image)</div>
</div>

#### More Details ...



- <span style="font-size:18px;">Human-in-the-Loop Evaluation</span>  
    - Involves human judges assessing the quality, relevance, or safety of model outputs.  
    - Often considered the gold standard.  
  
- <span style="font-size:18px;">Automated Metrics</span>  
    - Uses predefined metrics (e.g., BLEU, ROUGE, perplexity) to score outputs against references or standards.   
    - Can also use a more powerful GenAI model as the evaluator (e.g., LLM-as-a-Judge).  
  
- <span style="font-size:18px;">Adversarial Testing</span>  
    - Exposes models to challenging or intentionally tricky inputs to probe weaknesses and robustness.  
  
- <span style="font-size:18px;">User Feedback</span>  
    - Collects real-world user ratings or comments to evaluate model performance in production.  
  
- <span style="font-size:18px;">A/B Testing</span>  
    - Compares different model versions with live users to measure impact on key metrics.  
  
- <span style="font-size:18px;">Benchmarking</span>  
    - Assesses models against standardized datasets and tasks for objective comparison.  
  

### Where Do Evaluations Fit in the SDLC Pipeline for GenAI Applications?

<div style='text-align: center;'>
    <img src='../Images/SDLC_GenAI_Evaluations.jpg' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>Figure 5: GenAI Evaluations in an SDLC Pipeline <br> (AI Generated Image)</div>
</div>

#### More Details ...



Evaluation is a continuous process throughout the Software Development Life Cycle (SDLC) for GenAI applications. Key stages include:

- <span style="font-size:18px;">Requirement Analysis</span>    
    - Define evaluation criteria and success metrics (e.g., accuracy, fairness, safety).

- <span style="font-size:18px;">Data Collection & Preparation</span>  
    - Evaluate data quality, representativeness, and potential biases.

- <span style="font-size:18px;">Model Development</span>  
    - Use automated metrics and small-scale human evaluation to assess model iterations.

- <span style="font-size:18px;">Model Validation & Testing</span>  
    - Conduct a comprehensive evaluation on an unseen dataset.

- <span style="font-size:18px;">Deployment</span>  
    - Monitor model performance with real-world user feedback and A/B testing.

- <span style="font-size:18px;">Maintenance & Monitoring</span>  
    - Continuously evaluate for model drift, emerging biases, and real-world impact.
    - Update evaluation criteria as requirements evolve.
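For the Maintenance & Monitoring stage, one simple way to watch for drift is to compare recent evaluation scores against a baseline recorded at release time. This is a minimal sketch under stated assumptions: the function name `detect_drift`, the `max_drop` tolerance of 0.05, and the score values are all hypothetical, and a production setup would more likely use a statistical test over a larger window.

```python
from statistics import mean
from typing import Sequence


def detect_drift(
    baseline_scores: Sequence[float],
    recent_scores: Sequence[float],
    max_drop: float = 0.05,  # hypothetical tolerance; tune per application
) -> bool:
    """Return True if the mean recent score fell more than max_drop below baseline."""
    if not baseline_scores or not recent_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(baseline_scores) - mean(recent_scores) > max_drop


# Hypothetical quality scores (e.g., ROUGE-1 F1) at release time vs. last week
baseline = [0.80, 0.82, 0.78]
recent = [0.70, 0.72, 0.68]
drifted = detect_drift(baseline, recent)
print("Drift detected" if drifted else "No drift detected")
```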

 

## Types of GenAI Evaluations      

### Reference-based Evaluation 
- <span style="font-size:18px;">Compares model outputs to one or more predefined "reference" answers or ground truth examples.</span>  
  
- <span style="font-size:18px;">Uses automated metrics like BLEU (for translation), ROUGE (for summarization), or METEOR to quantify similarity.</span>  
  
- <span style="font-size:18px;">Effective when high-quality reference data is available (e.g., machine translation, text summarization).</span>  
  
- <span style="font-size:18px;">May penalize creative or valid outputs that differ from references, limiting flexibility.</span>  
  
    - Example: Comparing an AI-generated summary of an article to a human-written summary using the ROUGE score.


### Common Evaluation Metrics for Reference-based Evaluation

- <span style="font-size:18px;">ROUGE: (Recall-Oriented Understudy for Gisting Evaluation)</span>  
    - Measures overlap of n-grams, word sequences, and word pairs between the machine-generated summary and reference summaries. 
    - Common variants include ROUGE-N, ROUGE-L, and ROUGE-S.  
  
- <span style="font-size:18px;">BLEU: (Bilingual Evaluation Understudy)</span>  
    - Originally developed for machine translation.  
    - Measures n-gram precision between generated and reference texts.  
  
- <span style="font-size:18px;">METEOR: (Metric for Evaluation of Translation with Explicit ORdering)</span>   
    - Considers synonymy and stemming, providing a more nuanced comparison than BLEU.  
  
- <span style="font-size:18px;">Precision, Recall, and F1 Score</span>  
    - Used to evaluate the overlap of content units (e.g., sentences, key phrases) between generated and reference summaries.  
  
- <span style="font-size:18px;">Coverage</span>
    - Measures how much of the important content from the source is included in the summary.  
  
- <span style="font-size:18px;">Compression Ratio</span>  
    - The ratio of the length of the summary to the length of the original text.  
  
- <span style="font-size:18px;">BERTScore (built on BERT: Bidirectional Encoder Representations from Transformers)</span>  
    - Measures the semantic similarity between generated and reference summaries using contextual embeddings from the BERT model.  
    - Provides a more nuanced evaluation than n-gram overlap.
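Of the ROUGE variants listed above, ROUGE-L is based on the longest common subsequence (LCS) of tokens, so it rewards words that appear in the same order even when they are not adjacent. The following is a minimal sketch, assuming whitespace tokenization and a plain F1 (the original ROUGE-L definition uses a weighted F-measure); the function names `lcs_length` and `rouge_l` are illustrative.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, generated: str) -> Dict[str, float]:
    """ROUGE-L recall, precision, and (unweighted) F1 from the token-level LCS."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(gen) if gen else 0.0
    total = recall + precision
    f1 = (2 * recall * precision) / total if total > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


result = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(result)
```

In this example the LCS is "the cat on the mat" (5 tokens) against 6 tokens on each side, so recall and precision are both 5/6.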


### Reference-free Evaluation
- <span style="font-size:18px;">Assesses model outputs without relying on reference answers or ground truth.</span>  
- <span style="font-size:18px;">Uses criteria such as fluency, coherence, relevance, or factual accuracy.</span>  
  
- <span style="font-size:18px;">Useful for open-ended tasks where multiple valid outputs exist (e.g., dialogue generation, creative writing).</span>  
  
- <span style="font-size:18px;">Enables evaluation in specialized domains lacking annotated reference data</span>  
  

### Multi-Dimensional Evaluation

- <span style="font-size:18px;">GenAI outputs are complex and can't be fully assessed by a single metric.</span>  
  - <span style="font-size:18px;">A chatbot response needs to be:</span>   
    - <span style="font-size:18px;">factually correct</span>  
    - <span style="font-size:18px;">relevant</span>
    - <span style="font-size:18px;">fluent</span>
    - <span style="font-size:18px;">safe </span>   
  - <span style="font-size:18px;">Evaluating only one aspect may miss issues in other areas.</span>  

- <span style="font-size:18px;">Multi-dimensional evaluation involves assessing outputs along several axes:</span>  
  - <span style="font-size:18px;">Factual accuracy</span>
  - <span style="font-size:18px;">Relevance</span>
  - <span style="font-size:18px;">Fluency/grammar</span>
  - <span style="font-size:18px;">Coherence</span>
  - <span style="font-size:18px;">Safety/toxicity</span>

- <span style="font-size:18px;">Use a combination of automated metrics and human ratings.</span>  
  - <span style="font-size:18px;">ROUGE score for content relevance </span>
  - <span style="font-size:18px;">Human raters scoring summaries for fluency and factual correctness</span>
  - <span style="font-size:18px;">Toxicity classifier to check for unsafe content</span>
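One way to operationalize multi-dimensional evaluation is to check each axis against its own minimum rather than averaging everything into one number. The sketch below assumes scores normalized to a 0 to 1 scale; the dimension names, thresholds, and the `evaluate_dimensions` helper are hypothetical choices for illustration.

```python
from typing import Dict, List, Union

# Hypothetical per-dimension minimum scores on a 0-1 scale
THRESHOLDS: Dict[str, float] = {
    "factual_accuracy": 0.80,
    "relevance": 0.70,
    "fluency": 0.70,
    "safety": 0.95,
}


def evaluate_dimensions(
    scores: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Dict[str, Union[bool, List[str]]]:
    """Flag every dimension whose score is missing or below its minimum."""
    failed = [
        dim for dim, minimum in thresholds.items()
        if scores.get(dim, 0.0) < minimum
    ]
    return {"passed": not failed, "failed_dimensions": failed}


# Hypothetical scores for one chatbot response
scores = {"factual_accuracy": 0.90, "relevance": 0.85, "fluency": 0.95, "safety": 0.60}
report = evaluate_dimensions(scores)
average = sum(scores.values()) / len(scores)
print(report, f"average={average:.3f}")
```

The average of these scores is 0.825, which looks acceptable, while the per-dimension check shows that the response fails on safety. This is the kind of issue a single aggregate metric can hide.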


## Exercise 1  
#### Compute ROUGE Score between a Generated Summary and Reference Summary Text  
- <span style="font-size:18px;">Text Summarization use case.</span> 
- <span style="font-size:18px;">Use CNN-DailyMail Dataset</span> 
- <span style="font-size:18px;">Contains news articles and their summaries written by journalists.</span>  
- <span style="font-size:18px;">Dataset available from [Kaggle Competition Website](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail)</span>  
- <span style="font-size:18px;">Dataset has 3 columns:</span> 
  - <span style="font-size:18px;">An id unique to each article</span>
  - <span style="font-size:18px;">article which contains the text of the article</span> 
  - <span style="font-size:18px;">highlights which is the summary</span>  
- <span style="font-size:18px;">Split into 3 files: Train, Validation, and Test</span>    
- <span style="font-size:18px;">Dataset distribution:</span>
  - <span style="font-size:18px;">Train: 287,113</span>  
  - <span style="font-size:18px;">Validation: 13,368</span>  
  - <span style="font-size:18px;">Test: 11,490</span>    
- <span style="font-size:18px;">Sampled Down Dataset for Exercise</span>
  - <span style="font-size:18px;">Train: 1,000</span>  
  - <span style="font-size:18px;">Validation: 100</span>  
  - <span style="font-size:18px;">Test: 100</span>  
- <span style="font-size:18px;">Sampled data is in the folder Data-Summarization</span>    

#### Importing Python Libraries

#### Setting up OpenAI API for LLM

In [None]:
#os.environ["OPENAI_API_KEY"] = "your_api_key_here"  # Replace with your actual API key 

In [19]:
# Initialize the client
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

In [20]:
def get_chatgpt_response(prompt, model="gpt-4.1-mini"):
    """
    Sends a query to ChatGPT API and returns the model's response text.
    
    Args:
        prompt (str): The question or instruction for the model.
        model (str): Model name to use (default: "gpt-4.1-mini").
    
    Returns:
        str: The text output from ChatGPT.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    
    # Extract the message text
    return response.choices[0].message.content

#### Python Function to Generate Prompt

In [21]:
def generate_summarization_prompt(article_text: str, summary_length: int = 100) -> str:
    """
    Generates a prompt for text summarization to send to the OpenAI API.

    Args:
        article_text (str): The input article or document to summarize.
        summary_length (int): Desired length of the summary in words (default: 100).

    Returns:
        str: The formatted prompt for the OpenAI API.
    """
    prompt = (
        f"Summarize the following article in about {summary_length} words:\n\n"
        f"Article:\n{article_text}\n\n"
        "Summary:"
    )
    return prompt

#### Python function to compute the ROUGE Score


In [22]:

def rouge_n(reference: str, generated: str, n: int = 1) -> Dict[str, float]:
    """
    - Compute ROUGE-N (recall, precision, f1) between reference and generated summaries.
    - The function first splits both texts into n-grams
    - Then counts their occurrences
    - Calculates the overlap between reference and generated n-grams 
    - Returns a dictionary with recall, precision, and F1 score

    Args:
        reference (str): The reference summary text.
        generated (str): The generated summary text.
        n (int): n-gram length (default 1 for ROUGE-1).
    Returns:
        dict: Dictionary with recall, precision, and f1 scores.
    """
    def ngrams(text, n):
        tokens = text.lower().split()
        return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

    ref_ngrams = Counter(ngrams(reference, n))
    gen_ngrams = Counter(ngrams(generated, n))

    overlap = sum((ref_ngrams & gen_ngrams).values())
    ref_count = sum(ref_ngrams.values())
    gen_count = sum(gen_ngrams.values())

    recall = overlap / ref_count if ref_count > 0 else 0.0
    precision = overlap / gen_count if gen_count > 0 else 0.0
    f1 = (2 * recall * precision) / (recall + precision) if (recall + precision) > 0 else 0.0

    return {"recall": recall, "precision": precision, "f1": f1}

# Example usage:
# reference = "The cat sat on the mat."
# generated = "The cat is on the mat."
# print(rouge_n(reference, generated, n=1))  # ROUGE-1
# print(rouge_n(reference, generated, n=2))  # ROUGE-2           

In [23]:
# Load the train data
df_train = pd.read_csv('../Data-Summarization/train_sample.csv')

In [24]:
df_train.count()

id            10
article       10
highlights    10
dtype: int64

In [25]:
article_text = df_train.loc[0, 'article']
reference_summary = df_train.loc[0, 'highlights']

In [26]:
# Print the article as wrapped text
print(textwrap.fill(article_text, width=100))

A woman in the Northwest Highlands of Scotland who'd fallen ill tested negative for Ebola, the
Scottish government said Tuesday. A spokesman for the government said the woman had been in West
Africa recently, though she had no direct contact with anyone with Ebola. "A patient at Aberdeen
Royal Infirmary has tested negative for Ebola," the press release said. "The individual was
transferred to the hospital by the Scottish Ambulance Service yesterday after falling ill while
visiting Torridon in the Scottish Highlands." Meanwhile, a health care worker who was diagnosed with
the Ebola virus after returning to Scotland from Sierra Leone was transferred to the Royal Free
Hospital in London. The patient is Pauline Cafferkey, 39, of Glasgow, Scotland, the hospital said.
She was working with Save the Children at an Ebola treatment center, said Michael von Bertele,
humanitarian director at that organization. She traveled via Casablanca, Morocco, and London
Heathrow Airport before arriving at Gla

In [27]:
# Print the reference summary as a wrapped Text
print(textwrap.fill(reference_summary, width=100))

Woman in the Scottish Highlands tests negative for Ebola, government says . A health care worker
diagnosed with the virus is moved to a London hospital . She was working with Save the Children in
Sierra Leone as a volunteer nurse . A third suspected Ebola case is being tested in the southwest of
England, officials say .


### Generate Prompt for Summarization

In [28]:
summary_length = 200
prompt = generate_summarization_prompt(article_text, summary_length)
print(textwrap.fill(prompt, width=100))

Summarize the following article in about 200 words:  Article: A woman in the Northwest Highlands of
Scotland who'd fallen ill tested negative for Ebola, the Scottish government said Tuesday. A
spokesman for the government said the woman had been in West Africa recently, though she had no
direct contact with anyone with Ebola. "A patient at Aberdeen Royal Infirmary has tested negative
for Ebola," the press release said. "The individual was transferred to the hospital by the Scottish
Ambulance Service yesterday after falling ill while visiting Torridon in the Scottish Highlands."
Meanwhile, a health care worker who was diagnosed with the Ebola virus after returning to Scotland
from Sierra Leone was transferred to the Royal Free Hospital in London. The patient is Pauline
Cafferkey, 39, of Glasgow, Scotland, the hospital said. She was working with Save the Children at an
Ebola treatment center, said Michael von Bertele, humanitarian director at that organization. She
traveled via Casablanc

### Generate Summary Using OpenAI API


In [32]:

model = "gpt-3.5-turbo"
generated_summary = get_chatgpt_response(prompt, model)

In [33]:
print(textwrap.fill(generated_summary, width=100))

A woman in the Northwest Highlands of Scotland tested negative for Ebola after falling ill, despite
recent travel to West Africa. Meanwhile, a health care worker named Pauline Cafferkey was diagnosed
with Ebola after returning to Scotland from Sierra Leone and was transferred to a London hospital.
Cafferkey had traveled through Morocco and London before arriving in Glasgow. The hospital she was
transferred to has a high-level isolation unit equipped with a specially designed tent for patients.
UK authorities are working to trace those who came into contact with Cafferkey. Another suspected
Ebola case is being tested in southwest England. As of December 24, the current Ebola outbreak had
resulted in at least 7,693 deaths and 19,695 cases in Liberia, Sierra Leone, and Guinea.
Humanitarian workers returning from Ebola-affected countries are expected to monitor their health
for 21 days. Both the British and Scottish governments are closely monitoring the situation to
protect public health.

In [34]:
rouge_1 = rouge_n(reference_summary, generated_summary, n=1)

In [35]:
print(rouge_1)

{'recall': 0.6724137931034483, 'precision': 0.2532467532467532, 'f1': 0.3679245283018868}
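A single article gives only one data point. In the result above, recall is much higher than precision, which is what one would expect when a roughly 200-word generated summary is compared with a much shorter reference. To get a more stable picture, the score can be averaged over many article and summary pairs. The sketch below is self-contained, so it re-implements a compact ROUGE-1 with the same logic as `rouge_n(..., n=1)`; the helper `average_rouge_1` and the toy pairs are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple


def rouge_1(reference: str, generated: str) -> Dict[str, float]:
    """Unigram ROUGE with whitespace tokenization (same logic as rouge_n with n=1)."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((ref & gen).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(gen.values()) if gen else 0.0
    total = recall + precision
    f1 = (2 * recall * precision) / total if total > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


def average_rouge_1(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Average ROUGE-1 scores over (reference, generated) pairs."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    scores = [rouge_1(ref, gen) for ref, gen in pairs]
    return {key: mean(s[key] for s in scores) for key in ("recall", "precision", "f1")}


# Toy (reference, generated) pairs; in the exercise these would come from
# the 'highlights' column and the model's generated summaries.
pairs = [("a b c d", "a b"), ("x y", "x y")]
averages = average_rouge_1(pairs)
print(averages)
```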


## LLM-as-a-Judge Methodology

- <span style="font-size:20px;">Using large language models (LLMs) to automatically evaluate and score the outputs of other AI models.</span> 
  
- <span style="font-size:20px;">Judge-LLMs are prompted with evaluation criteria and asked to rate or compare model outputs.</span>      
  
- <span style="font-size:20px;">Replaces or supplements human judges.</span>  
  
- <span style="font-size:20px;">Ideal for scaling the evaluation process.</span>  


<div style='text-align: center;'>
    <img src='../Images/LLM-as-Judge-Illustration.png' height='500'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>Figure 6: LLM-as-a-Judge Illustration <br> Reference: Mastering LLM as Judge, Galileo</div>
</div>


### Is Human Feedback Always the Best?

<div style='text-align: center;'>
    <img src='../Images/Human-Feedback-Not-Gold-Standard.png' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>
        Figure 7: Human Feedback Not Gold Standard<br>
        Reference: Human Feedback is Not a Gold Standard, ICLR 2024
    </div>
</div>

#### Challenges with Human Evaluation  
- <span style="font-size:18px;">More assertive or complex outputs tend to be perceived as more factually accurate. </span> 
  
- <span style="font-size:18px;">Human evaluators have individual preferences and biases that can skew evaluation results. </span>  
  
- <span style="font-size:18px;">Enterprise-scale human evaluation is both resource-intensive and time-consuming. </span> 

#### Advantages of using LLM-as-a-Judge  
 - <span style="font-size:18px;">**Scalability**: Ideal for large-scale evaluations involving thousands of samples. </span>  
   
 - <span style="font-size:18px;">**Cost-Effectiveness**: Significantly lower cost than using human experts. </span>  
   
 - <span style="font-size:18px;">**Flexibility**: LLMs can be fine-tuned or prompt-engineered for specific tasks in different domains to reduce bias and enhance relevance. </span>  
   
 - <span style="font-size:18px;">**Use in Complex Evaluation Scenarios**: Evaluate intricate texts across various formats, providing nuanced assessment. </span>  

### Example Scenarios of Using LLM-as-a-Judge  

#### Example 1 - Evaluation of Customer Service Response Generation  
<span style="font-size:18px;">**Goal**: Evaluate an AI system that generates customer service responses.</span>   
  
<span style="font-size:18px;">Statistical metrics won’t capture whether the response is helpful or empathetic.</span>    
  
<span style="font-size:18px;">Using an LLM as a judge, it is possible to evaluate:</span>  
- <span style="font-size:16px;">Does the response address the customer’s underlying concern?</span>  
  
- <span style="font-size:16px;">Is the tone appropriately professional yet friendly?</span>  
  
- <span style="font-size:16px;">Does it handle cultural nuances well?</span>  
  
- <span style="font-size:16px;">Would this likely lead to customer satisfaction?</span>  

#### Example 2 - Retrieval-Augmented Generation Evaluation 

<div style='text-align: center;'>
    <img src='../Images/LLM-as-Judge-in-RAG.png' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>
        Figure 8: LLM as Judge in RAG<br>
        Reference: Mastering LLM as Judge, Galileo
    </div>
</div>

### Types of LLM-as-Judge Evaluation 

- <span style="font-size:18px;">Single Output Scoring with a Reference.</span>    
   
- <span style="font-size:18px;">Single Output Scoring without a Reference.</span>    
   
- <span style="font-size:18px;">Pairwise Comparison.</span>    

#### 1. Single Output Scoring without a Reference  
<span style="font-size:18px;">The LLM is tasked with assigning scores based on predefined criteria.</span>   
- <span style="font-size:16px;">Scores are typically assigned on a discrete scale with a limited number of values.</span>    
   
- <span style="font-size:16px;">Each value on the scale should be clearly defined to ensure consistency in evaluation.</span>  
  
- <span style="font-size:16px;">The LLM relies solely on the output and the evaluation criteria provided in the prompt.</span>  

#### Example  

<span style="font-size:16px;">**Output to evaluate**: “I understand your frustration with the delayed delivery. Our team is working on your order, and you’ll receive a tracking number within 24 hours.”</span>    
   
<span style="font-size:16px;">**Scoring criteria Rubric (1-3)**:</span>  
  
- <span style="font-size:16px;">Unprofessional or dismissive: 1</span>   
   
- <span style="font-size:16px;">Professional but incomplete resolution: 2 </span>  
  
- <span style="font-size:16px;">Professional, empathetic, and provides clear resolution: 3 </span>      

<span style="font-size:16px;">**Advantage**: Single Output Scoring (Without Reference) is particularly useful for
straightforward evaluations where the quality of the output can be assessed
independently.</span>  
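The rubric above can be turned into a judge prompt and a small parser for the judge's reply. This is a sketch only: the prompt wording and the helper names `build_judge_prompt` and `parse_score` are assumptions, and the actual call to a judge model is left out (it could reuse a function such as `get_chatgpt_response` from Exercise 1).

```python
import re
from typing import Dict, Optional

# Rubric taken from the example above
RUBRIC: Dict[int, str] = {
    1: "Unprofessional or dismissive",
    2: "Professional but incomplete resolution",
    3: "Professional, empathetic, and provides clear resolution",
}


def build_judge_prompt(output_text: str, rubric: Dict[int, str]) -> str:
    """Format the rubric and the output to evaluate into a single judge prompt."""
    rubric_lines = "\n".join(f"{score}: {desc}" for score, desc in sorted(rubric.items()))
    return (
        "You are evaluating a customer service response.\n"
        f"Scoring rubric:\n{rubric_lines}\n\n"
        f"Response to evaluate:\n{output_text}\n\n"
        "Reply with the score only, as a single integer."
    )


def parse_score(judge_reply: str, rubric: Dict[int, str]) -> Optional[int]:
    """Extract the first integer from the reply; return None if it is not a rubric value."""
    match = re.search(r"\d+", judge_reply)
    if match is None:
        return None
    score = int(match.group())
    return score if score in rubric else None


output_text = (
    "I understand your frustration with the delayed delivery. Our team is working "
    "on your order, and you'll receive a tracking number within 24 hours."
)
prompt = build_judge_prompt(output_text, RUBRIC)
print(prompt)
```

Validating the parsed score against the rubric keeps malformed or out-of-range judge replies from silently entering the results.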

#### 2. Single Output Scoring with a Reference 

<span style="font-size:18px;">The prompt includes supplementary information, or “reference,” to aid the LLM in its evaluation.</span>  
- <span style="font-size:18px;">References may include:</span> 
    - <span style="font-size:16px;">reasoning steps.</span>
      
    - <span style="font-size:16px;">expected answers.</span>  
      
    - <span style="font-size:16px;">other relevant details that simplify the LLM’s task.</span>
     
- <span style="font-size:18px;">Leads to more nuanced and informed evaluations.</span>  

#### Example  
<span style="font-size:18px;">**Output to evaluate**: “The new environmental law requires companies to reduce carbon
emissions by 30% by 2030.”</span>  
   
<span style="font-size:18px;">**Reference text**: “The Environmental Protection Act of 2024 mandates a 30% reduction in carbon emissions for companies with over 500 employees by 2030, with annual progress reports required.”</span>  

<span style="font-size:18px;">**Scoring criteria Rubric (1-4)**:</span>  
- <span style="font-size:16px;">Inaccurate information: 1</span>  
  
- <span style="font-size:16px;">Partially accurate but missing key details: 2</span>  
  
- <span style="font-size:16px;">Accurate but incomplete: 3</span>  
  
- <span style="font-size:16px;">Complete and accurate match with reference: 4</span>    

<span style="font-size:18px;">The LLM would score this as 3 since it captures the main point but omits the company
size requirement and reporting details.</span>  

<span style="font-size:18px;">**Advantage**: Single Output Scoring (With Reference) can lead to more nuanced and informed evaluations, especially for complex outputs.</span>  
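For reference-based scoring, the judge prompt additionally carries the reference text, and asking for a structured reply makes the score and its justification easier to consume downstream. The sketch below assumes the judge is instructed to answer in JSON; the prompt wording, the JSON keys, and the helper names `build_reference_judge_prompt` and `parse_judge_json` are hypothetical, and no model call is made here.

```python
import json
from typing import Any, Dict, Optional


def build_reference_judge_prompt(output_text: str, reference_text: str) -> str:
    """Build a judge prompt that includes the reference and the 1-4 rubric above."""
    return (
        "Score the output against the reference using this rubric:\n"
        "1: Inaccurate information\n"
        "2: Partially accurate but missing key details\n"
        "3: Accurate but incomplete\n"
        "4: Complete and accurate match with reference\n\n"
        f"Reference:\n{reference_text}\n\n"
        f"Output to evaluate:\n{output_text}\n\n"
        'Reply as JSON: {"score": <1-4>, "justification": "<one sentence>"}'
    )


def parse_judge_json(reply: str) -> Optional[Dict[str, Any]]:
    """Parse the judge reply; return None if it is not valid JSON with a 1-4 score."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    if isinstance(score, int) and not isinstance(score, bool) and 1 <= score <= 4:
        return {"score": score, "justification": str(data.get("justification", ""))}
    return None


# Hypothetical judge reply for the example above
reply = '{"score": 3, "justification": "Omits company size and reporting details."}'
parsed = parse_judge_json(reply)
print(parsed)
```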

#### 3. Pairwise Comparison  
<span style="font-size:18px;">Here the LLM is asked to directly compare two outputs.</span>    

- <span style="font-size:18px;">The judge LLM is presented with two candidate outputs and asked to select the superior one based on specified criteria.</span>  
  
- <span style="font-size:18px;">Since the LLM only needs to make a comparative judgment, this method avoids some challenges associated with absolute scoring.</span>   

#### Example  

- <span style="font-size:18px;">Description A: “Our wireless headphones offer 20-hour battery life and noise  cancellation.”</span>    
  
- <span style="font-size:18px;">Description B: “Experience uninterrupted music with our wireless headphones, featuring 20-hour battery life, advanced noise cancellation, and comfortable memory   foam ear cups.”</span>  

<span style="font-size:18px;">The LLM judges Description B as superior because it provides more specific features and benefits while maintaining clarity and engagement.</span>    

<span style="font-size:18px;">**Advantage**: The pairwise comparison method is particularly effective for relative assessments, such as determining which of two responses is more relevant or comprehensive.</span>    
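A pairwise judge may be sensitive to the order in which the two candidates are presented. A common precaution, sketched below, is to run the comparison twice with the order swapped and accept a winner only when both runs agree. The function `pairwise_with_swap` and the two stub judges are hypothetical; in practice the `judge` callable would wrap an LLM call that returns `"first"` or `"second"`.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def pairwise_with_swap(judge: Judge, output_a: str, output_b: str) -> str:
    """Compare A and B in both orders; return "A", "B", or "tie" on disagreement."""
    verdict_ab = judge(output_a, output_b)  # A is shown first
    verdict_ba = judge(output_b, output_a)  # B is shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else "tie"


# Stub judges standing in for an LLM call (for illustration only)
def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


def always_first(first: str, second: str) -> str:
    return "first"  # a judge that always picks whatever is shown first


description_a = "Our wireless headphones offer 20-hour battery life and noise cancellation."
description_b = (
    "Experience uninterrupted music with our wireless headphones, featuring 20-hour "
    "battery life, advanced noise cancellation, and comfortable memory foam ear cups."
)
consistent = pairwise_with_swap(prefers_longer, description_a, description_b)
order_biased = pairwise_with_swap(always_first, description_a, description_b)
print(consistent, order_biased)
```

The stub that always picks the first candidate produces a tie, which makes the order sensitivity visible instead of letting it decide the outcome.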

### Comparison between Different LLM-as-a-Judge Methods  

<div style='text-align: center;'>
    <img src='../Images/Comparison-llm-as-judge-evaluation-methods.png' height='600'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>
        Figure 9: Comparison of LLM as Judge Evaluation Methods<br>
        Reference: Mastering LLM as Judge, Galileo
    </div>
</div>

### When to use LLM-as-a-Judge:  
- <span style="font-size:18px;">The output is primarily subjective (such as sentiment or quality of summarization).</span>    
    
- <span style="font-size:18px;">Evaluation requires understanding complex context or nuances.</span>     
   
- <span style="font-size:18px;">Multiple aspects need to be evaluated together (e.g. coherence, relevance and accuracy).</span>   

- <span style="font-size:18px;">Traditional metrics like BLEU or ROUGE miss important qualitative aspects.</span>    
  
- <span style="font-size:18px;">A large number of output samples needs to be evaluated quickly.</span>    
   
- <span style="font-size:18px;">Consistent evaluation criteria must be applied across all the samples.</span>    
  
- <span style="font-size:18px;">The cost of LLM API calls is justified compared to human evaluation.</span>    
  
- <span style="font-size:18px;">Evaluation requires cross-referencing with context or background knowledge.</span>    

### Ideal Use Cases:  
- <span style="font-size:18px;">Content generation quality assessment.</span>    
  
- <span style="font-size:18px;">Conversational AI response evaluation.</span>     
  
- <span style="font-size:18px;">Document summarization accuracy.</span>     
  
- <span style="font-size:18px;">Style and tone consistency checking.</span>     
  
- <span style="font-size:18px;">Creativity and innovation measurement.</span>     
  


### Use Cases to Avoid:  
- <span style="font-size:18px;">Ground truth exists and objective metrics suffice.</span>     
  
- <span style="font-size:18px;">Binary judgement decisions are needed.</span>     
  
- <span style="font-size:18px;">Extremely high-stakes decisions are involved.</span>      
  

### Issues with LLM-as-a-Judge  

#### <span style="font-size:18px;">1. Nepotism Bias </span>
- <span style="font-size:16px;">Tendency to favor text generated from the same family of LLMs.</span>   

#### <span style="font-size:18px;">2. Authority Bias </span> 
- <span style="font-size:16px;">Assigning greater credibility to statements from perceived authorities.</span>

#### <span style="font-size:18px;">3. Beauty Bias </span> 
- <span style="font-size:16px;">Prioritizing aesthetically pleasing text over factual accuracy or completeness.</span>

#### <span style="font-size:18px;">4. Verbosity Bias </span> 
- <span style="font-size:16px;">Favouring verbose text over concise text. </span> 

#### <span style="font-size:18px;">5. Positional Bias or Attention Bias </span>
- <span style="font-size:16px;">Missing information in the middle and focusing only on information at the beginning and the end.</span>


### Mitigation Strategies  
- <span style="font-size:18px;">Use Chain-of-Thought reasoning in the prompt.</span>  
  
- <span style="font-size:18px;">Generate multiple independent responses to the same prompt and aggregate the results (self-consistency).</span>  
  
- <span style="font-size:18px;">Implement a multi-model evaluation approach to reduce nepotism bias.</span>  
  
- <span style="font-size:18px;">Standardize input lengths to reduce verbosity bias.</span>  
  
- <span style="font-size:18px;">Anonymize sources to prevent authority bias.</span>  
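
Two of the strategies above, self-consistency and multi-model evaluation, reduce to the same operation: collect several independent scores and aggregate them. The sketch below is one possible way to do that, assuming each judge is a hypothetical callable that takes a prompt and returns a numeric score between 0 and 100. The median is used here as an illustrative choice because it is less sensitive to a single outlier than the mean.

```python
from statistics import median
from typing import Callable, Sequence


def aggregate_judgements(prompt: str,
                         judges: Sequence[Callable[[str], float]],
                         samples_per_judge: int = 3) -> float:
    """Pool scores from several judges and several samples, return the median.

    Each entry in `judges` is a hypothetical callable wrapping one judge model.
    Passing a single judge gives plain self-consistency; passing judges from
    different model families gives a multi-model evaluation.
    """
    if not judges or samples_per_judge < 1:
        raise ValueError("need at least one judge and one sample per judge")
    scores = [
        judge(prompt)
        for judge in judges
        for _ in range(samples_per_judge)
    ]
    return median(scores)
```

For the repeated samples to differ, the judge model needs a non-zero sampling temperature; with fully deterministic decoding the samples would be identical and only the multi-model part would add information.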


## Exercise 2
#### Evaluate LLM-based Summarization using LLM-as-a-Judge Method

<div style='text-align: center;'>
    <img src='../Images/Reference-Free-Summarization-Evaluation-with-LLMs.png' height='500'>
    <div style='font-size:16px; color:gray; margin-top:8px;'>
        Figure 10: Reference-Free Evaluation of Summarization with LLMs<br>
        Reference: Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
    </div>
</div>

#### Multidimensional Metrics for Summarization Evaluation  

- <span style="font-size:18px;">Coherence</span>   
- <span style="font-size:18px;">Completeness</span>   
- <span style="font-size:18px;">Conciseness</span>   
- <span style="font-size:18px;">Consistency</span>   
- <span style="font-size:18px;">Readability</span>    
- <span style="font-size:18px;">Syntax</span>     

#### Prompts Generation for LLM-as-a-Judge Method

##### 1. Function to Generate Prompt for Coherence Measurement

In [36]:
def generate_prompt_coherence(original_text, summary_text):
    """
    Generate a prompt for evaluating the coherence of a summary.
    """
    return f"""
    You are an expert language model tasked with evaluating the coherence of a summary. Coherence measures how logically and seamlessly the ideas flow in the summary compared to the original text.

    Original Text:
    {original_text}
    
    Summary Text:
    {summary_text}
    
    - Please provide a score between 0 and 100 for coherence. 
    - A highly coherent summary should have a score close to 100 and a poorly coherent summary should have a score close to 0.
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should have a json format like this "{{'score': 'score', 'reason': 'traces_of_reasoning'}}"
    """

##### 2. Function to Generate Prompt for Completeness Measurement

In [37]:
def generate_prompt_completeness(original_text, summary_text):
    """
    Generate a prompt for evaluating the completeness of a summary.
    """
    return f"""
    You are an expert language model tasked with evaluating the completeness of a summary. Completeness measures how well the summary captures all the important points from the original text.

    Original Text:
    {original_text}

    Summary Text:
    {summary_text}

    - Please provide a score between 0 and 100 for completeness. 
    - A highly complete summary should have a score close to 100 and a poorly complete summary should have a score close to 0. 
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should have a json format like this "{{'score': 'score', 'reason': 'traces_of_reasoning'}}"
    """

##### 3. Function to Generate Prompt for Conciseness Measurement

In [38]:
def generate_prompt_conciseness(original_text, summary_text):
    """
    Generate a prompt for evaluating the conciseness of a summary.
    """
    return f"""
    You are an expert language model tasked with evaluating the conciseness of a summary. Conciseness measures how effectively the summary conveys the essential information without unnecessary verbosity.

    Original Text:
    {original_text}

    Summary Text:
    {summary_text}

    - Please provide a score between 0 and 100 for conciseness. 
    - A highly concise summary should have a score close to 100 and a poorly concise summary should have a score close to 0.
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should have a json format like this "{{'score': 'score', 'reason': 'traces_of_reasoning'}}"
    """

##### 4. Function to Generate Prompt for Consistency Measurement

In [39]:
def generate_prompt_consistency(original_text, summary_text):
    """
    Generate a prompt for evaluating the consistency of a summary.
    """
    return f"""
    You are an expert language model tasked with evaluating the consistency of a summary. Consistency measures whether the summary aligns with the facts and details in the original text without introducing contradictions.

    Original Text:
    {original_text}

    Summary Text:
    {summary_text}

    - Please provide a score between 0 and 100 for consistency. 
    - A highly consistent summary should have a score close to 100 and a poorly consistent summary should have a score close to 0. 
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should have a json format like this "{{'score': 'score', 'reason': 'traces_of_reasoning'}}"
    """

##### 5. Function to Generate Prompt for Readability Measurement

In [40]:

def generate_prompt_readability(original_text, summary_text):
    """
    Generate a prompt for evaluating the readability of a summary.
    """
    return f"""
    You are an expert language model tasked with evaluating the readability of a summary. Readability measures how easy it is to read and understand the summary.

    Original Text:
    {original_text}

    Summary Text:
    {summary_text}

    - Please provide a score between 0 and 100 for readability. 
    - A highly readable summary should have a score close to 100 and a poorly readable summary should have a score close to 0.
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should have a json format like this "{{'score': 'score', 'reason': 'traces_of_reasoning'}}"
    """

##### 6. Function to Generate Prompt for Syntax Measurement

In [41]:
def generate_prompt_syntax(original_text, summary_text):
    """
    Generate a prompt for evaluating the syntax of a summary.
    """
    return f"""
    You are an expert language model tasked with evaluating the syntax of a summary. Syntax measures the grammatical correctness and sentence structure of the summary.

    Original Text:
    {original_text}

    Summary Text:
    {summary_text}

    - Please provide a score between 0 and 100 for syntax. 
    - A summary with highly correct syntax should have a score close to 100 and a summary with poor syntax should have a score close to 0.
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should have a json format like this "{{'score': 'score', 'reason': 'traces_of_reasoning'}}"
    """

##### Generate Prompts
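
The six prompt functions defined above differ only in the metric name and its one-sentence definition. As an optional alternative, the sketch below folds them into a single template. `METRIC_DEFINITIONS` and `generate_prompt` are hypothetical names introduced here; the definitions are copied from the functions above. One deliberate deviation: the output example uses double-quoted keys so that the requested format is valid JSON.

```python
# Metric definitions copied from the six prompt functions.
METRIC_DEFINITIONS = {
    "coherence": "Coherence measures how logically and seamlessly the ideas flow in the summary compared to the original text.",
    "completeness": "Completeness measures how well the summary captures all the important points from the original text.",
    "conciseness": "Conciseness measures how effectively the summary conveys the essential information without unnecessary verbosity.",
    "consistency": "Consistency measures whether the summary aligns with the facts and details in the original text without introducing contradictions.",
    "readability": "Readability measures how easy it is to read and understand the summary.",
    "syntax": "Syntax measures the grammatical correctness and sentence structure of the summary.",
}


def generate_prompt(metric, original_text, summary_text):
    """Hypothetical generic builder: one template shared by all six metrics."""
    if metric not in METRIC_DEFINITIONS:
        raise ValueError(f"Unknown metric: {metric}")
    definition = METRIC_DEFINITIONS[metric]
    return f"""
    You are an expert language model tasked with evaluating the {metric} of a summary. {definition}

    Original Text:
    {original_text}

    Summary Text:
    {summary_text}

    - Please provide a score between 0 and 100 for {metric}.
    - A score close to 100 means high {metric}; a score close to 0 means poor {metric}.
    - Use reasoning and chain-of-thought to explain your evaluation before arriving at the final score.
    - The final output should be JSON like this: {{"score": <number>, "reason": "<traces_of_reasoning>"}}
    """
```

Keeping the definitions in one dictionary makes it easier to add or reword a metric without touching the shared instructions.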

##### 1. Prompt for Coherence Measurement

In [42]:
prompt_coherence_measurement = generate_prompt_coherence(article_text, generated_summary)
print(textwrap.fill(prompt_coherence_measurement, width=100))

     You are an expert language model tasked with evaluating the coherence of a summary. Coherence
measures how logically and seamlessly the ideas flow in the summary compared to the original text.
Original Text:     A woman in the Northwest Highlands of Scotland who'd fallen ill tested negative
for Ebola, the Scottish government said Tuesday. A spokesman for the government said the woman had
been in West Africa recently, though she had no direct contact with anyone with Ebola. "A patient at
Aberdeen Royal Infirmary has tested negative for Ebola," the press release said. "The individual was
transferred to the hospital by the Scottish Ambulance Service yesterday after falling ill while
visiting Torridon in the Scottish Highlands." Meanwhile, a health care worker who was diagnosed with
the Ebola virus after returning to Scotland from Sierra Leone was transferred to the Royal Free
Hospital in London. The patient is Pauline Cafferkey, 39, of Glasgow, Scotland, the hospital said.
She was wo

In [43]:
model = "gpt-4.1-mini"
response_coherence = get_chatgpt_response(prompt_coherence_measurement, model)
print(textwrap.fill(response_coherence, width=100))

{   "score": 90,    "reason": "The summary presents information in a logical and mostly seamless
manner that aligns well with the original text. It begins by addressing the negative test of the
first woman who fell ill and followed with the key case of Pauline Cafferkey, thus maintaining the
sequence of critical events. The inclusion of details about Cafferkey's travel path and hospital
isolation setup enhances continuity and context. The summary also covers contact tracing, a
suspected case elsewhere, outbreak statistics, guidelines for returning humanitarian workers, and
government responses, reflecting the breadth of the original text. Minor coherence gaps stem from a
somewhat abrupt shift to the outbreak statistics and monitoring guidance without explicit
transitional phrases, which slightly lessens flow. Nevertheless, the overall logical progression
from individual cases to public health measures preserves clarity and cohesion, justifying a high
coherence score close to 90." }


##### 2. Prompt for Completeness Measurement

In [44]:
prompt_completeness_measurement = generate_prompt_completeness(article_text, generated_summary)
print(textwrap.fill(prompt_completeness_measurement, width=100))

     You are an expert language model tasked with evaluating the completeness of a summary.
Completeness measures how well the summary captures all the important points from the original text.
Original Text:     A woman in the Northwest Highlands of Scotland who'd fallen ill tested negative
for Ebola, the Scottish government said Tuesday. A spokesman for the government said the woman had
been in West Africa recently, though she had no direct contact with anyone with Ebola. "A patient at
Aberdeen Royal Infirmary has tested negative for Ebola," the press release said. "The individual was
transferred to the hospital by the Scottish Ambulance Service yesterday after falling ill while
visiting Torridon in the Scottish Highlands." Meanwhile, a health care worker who was diagnosed with
the Ebola virus after returning to Scotland from Sierra Leone was transferred to the Royal Free
Hospital in London. The patient is Pauline Cafferkey, 39, of Glasgow, Scotland, the hospital said.
She was working

In [45]:
model = "gpt-4.1-mini"
response_completeness = get_chatgpt_response(prompt_completeness_measurement, model)
print(textwrap.fill(response_completeness, width=100))

{   "score": "85",   "reason": "The summary captures the key points around the two main patients:
the woman in the Northwest Highlands who tested negative and Pauline Cafferkey, the healthcare
worker who tested positive and was transferred to a London hospital with a high-level isolation
unit. It correctly notes Cafferkey's travel route (through Morocco and London) and that UK
authorities are tracing her contacts, plus mentioning another suspected case in southwest England,
the death and infection toll in West Africa, the 21-day health monitoring guideline for returning
humanitarian workers, and government coordination to protect public health. However, the summary
omits several specific details that could be relevant for completeness: it does not mention that
Cafferkey worked with Save the Children at an Ebola treatment center, or that the hospital
facilities include controlled ventilation with a tent over the patient's bed. It also excludes
mention of the specialized military aircraf

##### 3. Prompt for Conciseness Measurement

In [46]:
prompt_concisenss_measurement = generate_prompt_conciseness(article_text, generated_summary)
print(textwrap.fill(prompt_concisenss_measurement, width=100))

     You are an expert language model tasked with evaluating the conciseness of a summary.
Conciseness measures how effectively the summary conveys the essential information without
unnecessary verbosity.      Original Text:     A woman in the Northwest Highlands of Scotland who'd
fallen ill tested negative for Ebola, the Scottish government said Tuesday. A spokesman for the
government said the woman had been in West Africa recently, though she had no direct contact with
anyone with Ebola. "A patient at Aberdeen Royal Infirmary has tested negative for Ebola," the press
release said. "The individual was transferred to the hospital by the Scottish Ambulance Service
yesterday after falling ill while visiting Torridon in the Scottish Highlands." Meanwhile, a health
care worker who was diagnosed with the Ebola virus after returning to Scotland from Sierra Leone was
transferred to the Royal Free Hospital in London. The patient is Pauline Cafferkey, 39, of Glasgow,
Scotland, the hospital said

In [47]:
model = "gpt-4.1-mini"
response_conciseness = get_chatgpt_response(prompt_concisenss_measurement, model)
print(textwrap.fill(response_conciseness, width=100))

{   "score": "85",   "reason": "The summary effectively condenses the essential information from the
original text about the negative Ebola case, the confirmed case of Pauline Cafferkey, the hospital's
capabilities, contact tracing efforts, an additional suspected case, outbreak statistics, and
government responses. It omits many specific details such as flight routes, the specific nature of
the isolation unit beyond the tent, the name and background of the healthcare worker's employer, and
details about other cases and past treatments, which is appropriate for conciseness. However, the
summary could be slightly more concise by removing some redundant phrases (e.g., 'after falling ill'
is implied by testing negative for Ebola). It mostly avoids verbosity and focuses on key facts
relevant to understanding the situation, maintaining clarity while trimming detail. Therefore, it
achieves high conciseness but is not perfectly minimalistic or devoid of minor redundancies." }


##### 4. Prompt for Consistency Measurement

In [48]:
prompt_consistency_measurement = generate_prompt_consistency(article_text, generated_summary)
print(textwrap.fill(prompt_consistency_measurement, width=100))

     You are an expert language model tasked with evaluating the consistency of a summary.
Consistency measures whether the summary aligns with the facts and details in the original text
without introducing contradictions.      Original Text:     A woman in the Northwest Highlands of
Scotland who'd fallen ill tested negative for Ebola, the Scottish government said Tuesday. A
spokesman for the government said the woman had been in West Africa recently, though she had no
direct contact with anyone with Ebola. "A patient at Aberdeen Royal Infirmary has tested negative
for Ebola," the press release said. "The individual was transferred to the hospital by the Scottish
Ambulance Service yesterday after falling ill while visiting Torridon in the Scottish Highlands."
Meanwhile, a health care worker who was diagnosed with the Ebola virus after returning to Scotland
from Sierra Leone was transferred to the Royal Free Hospital in London. The patient is Pauline
Cafferkey, 39, of Glasgow, Scotland,

In [49]:
model = "gpt-4.1-mini"
response_consistency = get_chatgpt_response(prompt_consistency_measurement, model)
print(textwrap.fill(response_consistency, width=100))

{   "score": 98,   "reason": "The summary accurately reflects the key factual details from the
original text without contradictions. It correctly states that a woman in the Northwest Highlands
tested negative for Ebola after recent travel to West Africa, matching the original text. The
summary properly identifies Pauline Cafferkey as the healthcare worker diagnosed with Ebola after
returning from Sierra Leone and her subsequent transfer to a hospital in London. It also notes her
travel route through Morocco and London before arriving in Glasgow, consistent with the original.
The description of the hospital's high-level isolation unit, including the specially designed tent,
aligns with the details provided. The summary mentions UK authorities' efforts to trace contacts and
the existence of another suspected case in southwest England, both present in the original. The
statistics on Ebola deaths and cases as of December 24 are correctly reported, as is the guidance
for humanitarian worker

##### 5. Prompt for Readability Measurement

In [77]:
prompt_readability_measurement = generate_prompt_readability(article_text, generated_summary)
print(textwrap.fill(prompt_readability_measurement, width=100))

     You are an expert language model tasked with evaluating the readability of a summary.
Readability measures how easy it is to read and understand the summary.      Original Text:     A
woman in the Northwest Highlands of Scotland who'd fallen ill tested negative for Ebola, the
Scottish government said Tuesday. A spokesman for the government said the woman had been in West
Africa recently, though she had no direct contact with anyone with Ebola. "A patient at Aberdeen
Royal Infirmary has tested negative for Ebola," the press release said. "The individual was
transferred to the hospital by the Scottish Ambulance Service yesterday after falling ill while
visiting Torridon in the Scottish Highlands." Meanwhile, a health care worker who was diagnosed with
the Ebola virus after returning to Scotland from Sierra Leone was transferred to the Royal Free
Hospital in London. The patient is Pauline Cafferkey, 39, of Glasgow, Scotland, the hospital said.
She was working with Save the Children a

In [78]:
model = "gpt-4.1-mini"
response_readability = get_chatgpt_response(prompt_readability_measurement, model)
print(textwrap.fill(response_readability, width=100))

{   "score": 85,   "reason": "The summary is concise and covers the main points of the original text
clearly, including the negative test result of the woman in Scotland, Pauline Cafferkey's diagnosis
and transfer, contact tracing efforts, the low risk to the public, a suspected case, and the ongoing
outbreak statistics. The language is straightforward, avoiding unnecessary jargon or complex
sentence structures which aids readability. However, the summary compresses multiple key details
into relatively dense sentences, which might require readers to pause and process information, such
as linking Pauline Cafferkey's travel and work with her diagnosis and transfer. Some minor elements
slightly reduce flow, for example, the brief mention of the ongoing outbreak at the end feels
somewhat abrupt and could be better integrated. Overall, the summary is highly readable for a
general audience, maintaining a clear narrative while balancing completeness and brevity." }


##### 6. Prompt for Syntax Measurement

In [79]:
prompt_syntax_measurement = generate_prompt_syntax(article_text, generated_summary)
print(textwrap.fill(prompt_syntax_measurement, width=100))

     You are an expert language model tasked with evaluating the syntax of a summary. Syntax
measures the grammatical correctness and sentence structure of the summary.      Original Text:
A woman in the Northwest Highlands of Scotland who'd fallen ill tested negative for Ebola, the
Scottish government said Tuesday. A spokesman for the government said the woman had been in West
Africa recently, though she had no direct contact with anyone with Ebola. "A patient at Aberdeen
Royal Infirmary has tested negative for Ebola," the press release said. "The individual was
transferred to the hospital by the Scottish Ambulance Service yesterday after falling ill while
visiting Torridon in the Scottish Highlands." Meanwhile, a health care worker who was diagnosed with
the Ebola virus after returning to Scotland from Sierra Leone was transferred to the Royal Free
Hospital in London. The patient is Pauline Cafferkey, 39, of Glasgow, Scotland, the hospital said.
She was working with Save the Children

In [80]:
model = "gpt-4.1-mini"
response_syntax = get_chatgpt_response(prompt_syntax_measurement, model)
print(textwrap.fill(response_syntax, width=100))

{   "score": 95,   "reason": "The summary demonstrates strong syntactic correctness with well-formed
sentences and proper grammatical structure throughout. Sentence structure is varied and clear,
effectively conveying the key points from the original text without ambiguity or awkward phrasing.
The summary correctly uses complex sentences and appropriate conjunctions (e.g., 'Meanwhile,'
'while') to maintain coherence and flow. There is a minor issue with a double period at the end of
the summary, which is a small punctuation error, slightly impacting the overall score. Additionally,
the phrase 'has resulted in thousands of deaths and cases..' uses 'thousands of deaths and cases'
somewhat vaguely without clarifying the distinction between deaths and cases, but this is more about
clarity than syntax. Overall, the summary is grammatically correct, coherent, and syntactically
sound, justifying a high syntax score close to perfect." }
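
The judge replies above are JSON-like strings, and the score arrives sometimes as a number (`90`) and sometimes as a string (`"85"`). Before the scores can be aggregated they need to be parsed into one consistent type. The helper below is a minimal sketch of that step; `parse_judge_response` is a hypothetical name, and the fallback to `ast.literal_eval` is an assumption that a model might echo the single-quoted example from the prompt instead of strict JSON.

```python
import ast
import json
from typing import Any, Dict


def parse_judge_response(response_text: str) -> Dict[str, Any]:
    """Parse a judge reply into {'score': float, 'reason': str}.

    Raises ValueError if the reply cannot be parsed or the score is
    missing, non-numeric, or outside the 0-100 range.
    """
    text = response_text.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: accept a Python-style dict with single quotes.
        try:
            data = ast.literal_eval(text)
        except (ValueError, SyntaxError) as exc:
            raise ValueError("judge reply is not a parseable dict") from exc
    if not isinstance(data, dict) or "score" not in data:
        raise ValueError("judge reply has no 'score' field")
    try:
        score = float(data["score"])  # handles both 90 and "85"
    except (TypeError, ValueError) as exc:
        raise ValueError("judge score is not numeric") from exc
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside the 0-100 range")
    return {"score": score, "reason": str(data.get("reason", ""))}
```

A reply that is cut off before its closing brace cannot be parsed and raises `ValueError`, so a pipeline using this helper would need to decide whether to retry or skip such samples.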


### Alignment with Business Metrics
- <span style="font-size:18px;">The target values for these metrics depend on the Business Objectives.</span>   
  
- <span style="font-size:18px;">Hence, any optimization of GenAI Applications based on these metrics should be done only after aligning them with Business Metrics, with the help of SMEs.</span>  
  
- <span style="font-size:18px;">For example, for Text Summarization the Business Objective might be to produce content that is more readable and concise, even if completeness is below 100%.</span>  
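
One simple way to encode such a business preference is a weighted average of the per-metric scores. The sketch below is illustrative only: `weighted_quality_score` is a hypothetical helper, and the weights in the usage note are made-up values that would in practice be agreed with SMEs.

```python
from typing import Dict


def weighted_quality_score(scores: Dict[str, float],
                           weights: Dict[str, float]) -> float:
    """Combine per-metric scores (0-100) into one weighted average.

    `weights` expresses business priorities; metrics absent from
    `weights` are ignored, and every weighted metric must have a score.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[m] * w for m, w in weights.items()) / total_weight
```

For instance, taking the six scores reported in Exercise 2 (coherence 90, completeness 85, conciseness 85, consistency 98, readability 85, syntax 95) and hypothetical weights of 3 for readability and conciseness and 1 for the rest, the function returns a single number that leans toward the two prioritized metrics.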
  


## Action Plans 

### Text Summarization Task 
- <span style="font-size:18px;">Define a Text Summarization Task in your Product Domain or using the CNN-DailyMail Dataset.</span>
- <span style="font-size:18px;">Set up a Text Summarization Pipeline using gpt-3.5-turbo model. </span>  
- <span style="font-size:18px;">Set up an LLM-as-a-Judge Evaluation Pipeline using gpt-4.1-mini model.</span>
- <span style="font-size:18px;">Use the 6 Metrics learned here for Evaluation.</span>  
- <span style="font-size:18px;">Run the pipeline and plot the distribution of the values of each metric.</span> 
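
As a starting point for this action plan, the evaluation loop and the distribution step could be organized roughly as follows. This is a sketch under assumptions, not a reference solution: `evaluate_summaries` and `score_histogram` are hypothetical helpers, `prompt_builders` is assumed to map metric names to prompt functions such as `generate_prompt_coherence`, and `judge` is assumed to be a callable that sends the prompt to the judge model and returns a numeric score.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def evaluate_summaries(
    pairs: Iterable[Tuple[str, str]],
    prompt_builders: Dict[str, Callable[[str, str], str]],
    judge: Callable[[str], float],
) -> Dict[str, List[float]]:
    """Score every (article, summary) pair on every metric."""
    scores: Dict[str, List[float]] = {metric: [] for metric in prompt_builders}
    for article, summary in pairs:
        for metric, build_prompt in prompt_builders.items():
            scores[metric].append(judge(build_prompt(article, summary)))
    return scores


def score_histogram(values: Iterable[float], bin_width: int = 10) -> Dict[int, int]:
    """Count scores per bin, keyed by the bin's lower edge.

    A score of exactly 100 is placed in the top bin.
    """
    top_bin = 100 - bin_width
    counts = Counter(
        min(int(v // bin_width) * bin_width, top_bin) for v in values
    )
    return dict(sorted(counts.items()))
```

The bin counts returned by `score_histogram` can be passed to any plotting library as a bar chart, one chart per metric.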


## What is Coming Up in Session 2  
- <span style="font-size:18px;">G-Eval Framework.</span>  
  
- <span style="font-size:18px;">RAGAS Framework.</span>  
  
- <span style="font-size:18px;">DeepEval Framework.</span>

## Thank You
<span class='muted'><span style="font-size:20px;">Questions?</span></span>