# Practice using an LLM as an Evaluator/Judge

In [1]:
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from datasets import load_dataset

from dotenv import load_dotenv

import json
from time import sleep
import pandas as pd
import os

from azure.identity import (
    DefaultAzureCredential,
)

from azure.identity import AzureAuthorityHosts
from azure.keyvault.secrets import SecretClient

### Create your Azure OpenAI Resource and Key Vault

_If you have your resource from the last exercise, you don't need to complete the following steps._

Navigate to the [Azure Portal](https://portal.azure.com/#home) or [US Gov Azure Portal](https://portal.azure.us/#home) and log in with your account. Next, create an Azure OpenAI resource: create a new resource group and give the resource any unique name.  
Once the resource is created, open [Azure AI Foundry](https://ai.azure.com/) or [Azure OpenAI Studio](https://ai.azure.us/) to deploy the model. Navigate to Deployments, select Deploy model, and choose gpt-4o-mini. Make sure to increase your rate limit, measured in tokens per minute (around 700k should be sufficient).

Once that is created, copy the key (found under Resource Management > Keys and Endpoints) and create a new key vault. Assign it to the same resource group as your Azure OpenAI resource, and again pick a unique name for the key vault.  

Once the key vault is created, make a new secret with the API key.  

### Connect to Azure OpenAI
First, run `az login` in the terminal and log in to your FedAIRS account.  
<br>
If you are using a Gov account:<br>
`az cloud set --name AzureUSGovernment` <br>
`az login` <br>
`az account set --subscription="your subscription"`<br>

If you are using a commercial account:<br>
`az login` <br>
`az account set --subscription="your subscription"`<br>
<br>
Two things are needed to connect to your Azure OpenAI resource:
- Your API key
- Your Endpoint  
  
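For the API key, we will connect to the key vault we just made and retrieve the key from it. The code below reads the vault URL from the `KEY_VAULT_URL` environment variable and the secret's name from `SECRET_NAME`, so set both (for example in your `.env` file) to match your key vault.  
Next, we read the endpoint URL of our Azure OpenAI resource from `AZURE_OPENAI_ENDPOINT`.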

Our `azure_client` is where we call the LLM and connect to the model we deployed. Other parameters, such as `timeout` or `max_retries`, can also be passed in.

Note: In the block below we use `credential = DefaultAzureCredential(authority=AzureAuthorityHosts.AZURE_GOVERNMENT)`. This is because a FedAIRS account connects to the US Government cloud rather than the regular commercial cloud. If you are using a commercial account, omit the `authority` argument.
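The cloud choice can also be driven by configuration rather than hard-coded. The sketch below is one possible way to do that, assuming a hypothetical `AZURE_CLOUD` environment variable that is not part of this notebook. The two host names are the values that `AzureAuthorityHosts.AZURE_GOVERNMENT` and `AzureAuthorityHosts.AZURE_PUBLIC_CLOUD` stand for in `azure-identity`.

```python
import os

# Login authorities for the two clouds discussed above. These strings mirror
# AzureAuthorityHosts.AZURE_GOVERNMENT and AzureAuthorityHosts.AZURE_PUBLIC_CLOUD.
AUTHORITY_HOSTS = {
    "gov": "login.microsoftonline.us",
    "commercial": "login.microsoftonline.com",
}


def pick_authority(cloud=None):
    """Return the login authority for the requested cloud.

    AZURE_CLOUD is a hypothetical environment variable used only in this sketch.
    """
    cloud = (cloud or os.getenv("AZURE_CLOUD", "gov")).strip().lower()
    if cloud not in AUTHORITY_HOSTS:
        raise ValueError(
            f"Unknown cloud {cloud!r}; expected one of {sorted(AUTHORITY_HOSTS)}"
        )
    return AUTHORITY_HOSTS[cloud]


# The result could then be passed as DefaultAzureCredential(authority=...).
print(pick_authority("commercial"))
```

With a helper like this, the same notebook could serve both account types by changing one environment variable instead of editing the credential line.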

In [2]:
load_dotenv()

credential = DefaultAzureCredential(authority=AzureAuthorityHosts.AZURE_GOVERNMENT)

secret_client = SecretClient(vault_url=os.getenv('KEY_VAULT_URL'), credential=credential)
deployment = os.getenv('DEPLOYMENT')
endpoint_url = os.getenv('AZURE_OPENAI_ENDPOINT')
api_version = os.getenv('API_VERSION')
api_key = secret_client.get_secret(os.getenv('SECRET_NAME')).value


azure_client = AzureChatOpenAI(
                api_key=api_key
                ,api_version=api_version
                ,azure_endpoint=endpoint_url
                ,deployment_name=deployment
                ,temperature=0
                ,max_tokens=4000
                ,model_kwargs={"response_format": {"type": "json_object"}}
)

### Our Dataset

Like the last exercise, we are going to pull data from [Hugging Face](https://huggingface.co/). The [SAMsum dataset](https://huggingface.co/datasets/Samsung/samsum) contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English. Linguists were asked to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. The style and register are diversified: conversations could be informal, semi-formal or formal, and they may contain slang words, emoticons and typos. Then, the conversations were annotated with summaries. It was assumed that summaries should be a concise brief of what people talked about in the conversation, written in the third person.

In [3]:
summary_dataset=load_dataset("Samsung/samsum", split="train")
summaries = summary_dataset.to_pandas()
summaries.head()

Found cached dataset samsum (C:/Users/mislentz/.cache/huggingface/datasets/Samsung___samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


| | id | dialogue | summary |
|---|---|---|---|
| 0 | 13818513 | Amanda: I baked cookies. Do you want some?\r\... | Amanda baked cookies and will bring Jerry some... |
| 1 | 13728867 | Olivia: Who are you voting for in this electio... | Olivia and Olivier are voting for liberals in ... |
| 2 | 13681000 | Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa... | Kim may try the pomodoro technique recommended... |
| 3 | 13730747 | Edward: Rachel, I think I'm in ove with Bella.... | Edward thinks he is in love with Bella. Rachel... |
| 4 | 13728094 | Sam: hey overheard rick say something\r\nSam:... | Sam is confused, because he overheard Rick com... |


Like last time, we are going to take a sample of 100 rows of the dataset.

In [4]:
summaries_sample = summaries.sample(100).reset_index()
summaries_sample_array = summaries_sample.to_numpy()

### Evaluating the summaries

From our sample dataset, we are going to use the LLM to evaluate the summaries that the SAMsum dataset provided. We will use a few criteria:  
* **Coherence:** the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic". Scored from 1-5.
* **Consistency:** the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. Annotators were also asked to penalize summaries that contained hallucinated facts. Scored from 1-5.
* **Fluency:** the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure. Scored from 1-3.
* **Relevance:** selection of important content from the source. The summary should include only important information from the source document. Annotators were instructed to penalize summaries which contained redundancies and excess information. Scored from 1-5.

We are prompting the LLM to give a score for each of the summaries and provide reasoning as to why that score was chosen. Then we will join all of the results together into one dataframe.
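Because every downstream step depends on the judge returning a usable score, it can help to check each response before it is added to the dataframe. The following is a minimal sketch, not part of the original notebook: `parse_judgement` is a hypothetical helper that parses one JSON response and confirms the score is an integer within the range given for that criterion.

```python
import json


def parse_judgement(raw, metric, low, high):
    """Parse one judge response and check that the score is in range.

    Returns (score, rationale); raises ValueError on malformed output.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("Response JSON must be an object")

    score = data.get(metric)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"Missing or non-integer score for {metric!r}")
    if not low <= score <= high:
        raise ValueError(f"{metric} score {score} is outside {low}-{high}")

    return score, str(data.get(f"{metric}_rationale", ""))


example = '{"coherence": 4, "coherence_rationale": "Clear and logically ordered."}'
print(parse_judgement(example, "coherence", 1, 5))
```

A check like this surfaces out-of-range scores early instead of letting them skew the totals and averages computed later.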


In [5]:
coherence_prompt = PromptTemplate.from_template(

    template = '''
### Instructions
You will be given one summary written for a source document.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

### Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

### Evaluation Steps:
1. Read the source document carefully and identify the main topic and key points.
2. Read the summary and compare it to the source document. Check if the summary covers the main topic and key points of the source document, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.  
4. Return your score in JSON

### Source Document:
{Document}

### Summary:
{Summary}

### Return JSON:
{{
    "coherence": <coherence score from 1-5>,
    "coherence_rationale": "<Explain why you assigned the score>"
}}
'''
)
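Note the doubled braces around the JSON skeleton in the template. `PromptTemplate` uses f-string style formatting by default, so single braces mark variables to fill in and doubled braces produce literal braces. The sketch below illustrates the same rule with plain `str.format`, on the assumption that it behaves like the template's formatter for this purpose; the template text is a shortened stand-in, not the full prompt.

```python
# Shortened stand-in for the prompt: one variable plus a literal JSON skeleton.
template = (
    "### Summary:\n"
    "{Summary}\n\n"
    "### Return JSON:\n"
    '{{\n    "coherence": <score>\n}}'
)

# Single braces are substituted; doubled braces collapse to literal braces.
rendered = template.format(Summary="Amanda baked cookies.")
print(rendered)

# A placeholder written with doubled braces is NOT substituted.
escaped = "{{Summary}}".format(Summary="Amanda baked cookies.")
print(escaped)
```

The practical consequence is that input variables such as `Document` and `Summary` must use single braces, while the braces of the expected JSON output must be doubled.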

In [6]:
coherence_chain = (
    coherence_prompt
    | azure_client
    | StrOutputParser()
)

In [7]:
consistency_prompt = PromptTemplate.from_template(

    template = '''
### Instructions:
You will be given a source document. You will then be given one summary written for this source document.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

### Evaluation Criteria:
Consistency (1-5) - the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. Annotators were also asked to penalize summaries that contained hallucinated facts. 

### Evaluation Steps:
1. Read the source document carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the source document. Check if the summary contains any factual errors that are not supported by the source document.
3. Assign a score for consistency based on the Evaluation Criteria.
4. Return your score in JSON

### Source Document: 
{Document}

### Summary: 
{Summary}

### Return JSON:
{{
    "consistency": <consistency score from 1-5>,
    "consistency_rationale": "<Explain why you assigned the score>"
}}
'''
)

In [8]:
consistency_chain = (
    consistency_prompt
    | azure_client
    | StrOutputParser()
)

In [9]:
fluency_prompt = PromptTemplate.from_template(

    template = '''
### Instructions:
You will be given one summary written for a source document.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.
Return your score in JSON

### Evaluation Criteria:
Fluency (1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
- 1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
- 2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
- 3: Good. The summary has few or no errors and is easy to read and follow.

### Summary:
{Summary}

### Return JSON:
{{
    "fluency": <fluency score from 1-3>,
    "fluency_rationale": "<Explain why you assigned the score>"
}}
'''
)

In [10]:
fluency_chain = (
    fluency_prompt
    | azure_client
    | StrOutputParser()
)

In [11]:
relevance_prompt = PromptTemplate.from_template(

    template = '''
### Instructions:
You will be given one summary written for a source document.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.


### Evaluation Criteria:
Relevance (1-5) - selection of important content from the source. The summary should include only important information from the source document. Annotators were instructed to penalize summaries which contained redundancies and excess information.

### Evaluation Steps:
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the source document.
3. Assess how well the summary covers the main points of the source document, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.

### Source Document:
{Document}

### Summary:
{Summary}


### Return JSON:
{{
    "relevance": <relevance score from 1-5>,
    "relevance_rationale": "<Explain why you assigned the score>"
}}
'''
)

In [12]:
relevance_chain = (
    relevance_prompt
    | azure_client
    | StrOutputParser()
)

In [13]:
async def runChain(summary_array, chain):
    texts = [{"Document": x[2], "Summary": x[3]} for x in summary_array]
    return await chain.abatch(texts)
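
`abatch` submits the whole sample at once, which can run into the deployment's tokens-per-minute limit. LangChain runnables accept `config={"max_concurrency": n}` on `abatch` to cap the number of parallel calls. The sketch below shows the underlying idea using only the standard library; `fake_judge` is a hypothetical stand-in for a chain call, and the limit of 5 is an arbitrary choice.

```python
import asyncio


async def fake_judge(item):
    """Hypothetical stand-in for a chain call; returns the summary length."""
    await asyncio.sleep(0.01)  # simulate network latency
    return len(item["Summary"])


async def run_bounded(items, worker, max_concurrency=5):
    """Run `worker` over `items` with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order, so results line up with the input rows.
    return await asyncio.gather(*(guarded(item) for item in items))


items = [{"Document": "placeholder", "Summary": "x" * n} for n in range(10)]
results = asyncio.run(run_bounded(items, fake_judge))
print(results)
```

Inside a notebook, where an event loop is already running, use `await run_bounded(items, fake_judge)` instead of `asyncio.run`.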

In [14]:
async def assess_summaries(df, array, chain):
    results = await runChain(array, chain)
    df["results"] = results

    results_df = pd.json_normalize(df.results.apply(json.loads))
    output = pd.concat([df, results_df], axis=1)
    output.drop(columns='results', inplace=True)
    return output

Next, we are going to join all of the responses into one dataframe for evaluation. Each row is one summary with its scores for coherence, consistency, fluency, and relevance, as well as the reasoning for each.
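
Conceptually, the join is positional: the i-th response of every chain belongs to the i-th row of the sample. The sketch below mirrors that with plain dictionaries instead of a dataframe. The rows and judge outputs are made-up values for illustration, and `merge_results` is a hypothetical helper.

```python
import json

# Two sample rows plus made-up judge outputs for two metrics (illustrative only).
rows = [
    {"id": "a1", "summary": "Kitty forgot the keys."},
    {"id": "b2", "summary": "Ross is at the mall."},
]
coherence_raw = [
    '{"coherence": 4, "coherence_rationale": "Logical order."}',
    '{"coherence": 5, "coherence_rationale": "Well organized."}',
]
fluency_raw = [
    '{"fluency": 3, "fluency_rationale": "No errors."}',
    '{"fluency": 2, "fluency_rationale": "Some awkward phrasing."}',
]


def merge_results(rows, *metric_outputs):
    """Attach each metric's parsed JSON fields to the matching row by position."""
    merged = []
    for i, row in enumerate(rows):
        record = dict(row)  # copy so the input rows are left untouched
        for outputs in metric_outputs:
            record.update(json.loads(outputs[i]))
        merged.append(record)
    return merged


merged = merge_results(rows, coherence_raw, fluency_raw)
print(merged[0])
```

Because the match is by position, the order of the responses has to stay aligned with the order of the rows.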

In [15]:
input_df = summaries_sample
input_array = summaries_sample_array
chains = [coherence_chain, consistency_chain, fluency_chain, relevance_chain]
output_df = None

for chain in chains:
    print(chain)
    output_df = await assess_summaries(input_df, input_array, chain)
    input_df = output_df


first=PromptTemplate(input_variables=['Document', 'Summary'], template='\n### Instructions\nYou will be given one summary written for a source document.\nYour task is to rate the summary on one metric.\nPlease make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.\n\n### Evaluation Criteria:\nCoherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."\n\n### Evaluation Steps:\n1. Read the source document carefully and identify the main topic and key points.\n2. Read the summary and compare it to the source document. Check if the summary covers the main topic and key points of the source document, and if it presents the

In [16]:
output_df.head()

| | index | id | dialogue | summary | coherence | coherence_rationale | consistency | consistency_rationale | fluency | fluency_rationale | relevance | relevance_rationale |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5285 | 13611619 | Kitty: Are you there?\r\nAnne: sure, what's up... | Kitty forgot the keys and can't get inside. An... | 4 | The summary effectively captures the main even... | 5 | The summary accurately reflects the main facts... | 3 | The summary is well-written with correct gramm... | 5 | The summary effectively captures the main poin... |
| 1 | 12541 | 13820714 | Henryk: Have you seen the new episode? ;)\r\nP... | Henryk and Piotr want to discuss the latest ep... | 5 | The summary effectively captures the main topi... | 5 | The summary accurately reflects the main facts... | 3 | The summary is well-written with correct gramm... | 5 | The summary accurately captures the main point... |
| 2 | 12420 | 13730466 | Robert: yo did u talk to the teacher\r\nBale: ... | Robert and Bale have to submit the project. Ba... | 4 | The summary captures the main topic of Robert ... | 4 | The summary accurately captures the main point... | 3 | The summary is well-written with correct gramm... | 4 | The summary captures the main points of the co... |
| 3 | 1131 | 13729470 | Tom: <file_other> read it!\r\nBridget: gosh, h... | Tom and Brigdet discuss an event involving an ... | 4 | The summary presents the main topic of the con... | 5 | The summary accurately reflects the main point... | 3 | The summary is well-written with correct gramm... | 4 | The summary captures the main points of the co... |
| 4 | 8830 | 13810253 | Ross: Hey\r\nRoss: I am at the mall\r\nRoss: W... | Ross is at the mall and Rose prefers pure blac... | 4 | The summary effectively captures the main topi... | 5 | The summary accurately reflects the main facts... | 3 | The summary is well-written with correct gramm... | 5 | The summary accurately captures the main point... |


Here's an example of one of the coherence evaluations. Below is the short dialogue as well as the score and the reasoning for why it was given that score.

In [17]:
print(f'''
Example Record:
Dialogue: 	{output_df["dialogue"][0]}
Summary: {output_df["summary"][0]}
Coherence score: {output_df["coherence"][0]}
Coherence rationale: {output_df["coherence_rationale"][0]}
''')


Example Record:
Dialogue: 	Kitty: Are you there?
Anne: sure, what's up?
Kitty: I forgot the keys:/ again...
Anne: Nooo... I am totally across town, can't come now
Kitty: Shit;/ I gotta get ready for a meeting
Anne: How about if I uber them to you?
Kitty: Could you? That would be great! Where are you I'll order it
Anne: My office, just hurry up, I gotta go see my client soon
Kitty: On its way! 5 min
Summary: Kitty forgot the keys and can't get inside. Anne offers to send the keys to Kitty via Uber. Kitty orders the Uber.
Coherence score: 4
Coherence rationale: The summary effectively captures the main events of the conversation between Kitty and Anne, presenting them in a clear and logical order. It outlines Kitty's problem of forgetting the keys, Anne's offer to send them via Uber, and Kitty's action of ordering the Uber. However, it could be slightly improved by including more context about the urgency of the situation, which would enhance the overall coherence and depth of the summar

Next, as a way to measure overall error, we subtract each row's score from the maximum score it can receive for that category. Then we add up the errors for each category.
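
As a small worked example of this calculation, the sketch below uses made-up scores for three summaries (not results from this notebook). It also checks the equivalent shortcut: the total error for a category equals the number of rows times the maximum score, minus the sum of the scores.

```python
# Maximum score per category, as defined in the evaluation criteria.
MAX_SCORE = {"coherence": 5, "consistency": 5, "fluency": 3, "relevance": 5}

# Made-up scores for three summaries, for illustration only.
scores = {
    "coherence": [4, 5, 3],
    "consistency": [5, 5, 4],
    "fluency": [3, 3, 2],
    "relevance": [4, 5, 5],
}

# Per-row error is (max - score); the total error sums it over all rows.
total_error = {
    metric: sum(MAX_SCORE[metric] - s for s in values)
    for metric, values in scores.items()
}

# Equivalent shortcut: n_rows * max - sum(scores).
shortcut = {
    metric: len(values) * MAX_SCORE[metric] - sum(values)
    for metric, values in scores.items()
}

print(total_error)
```

A total error of 0 means every summary received the maximum score in that category.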

In [18]:
output_df["coherence_error"] = 5 - output_df["coherence"]
output_df["consistency_error"] = 5 - output_df["consistency"]
output_df["fluency_error"] = 3 - output_df["fluency"]
output_df["relevance_error"] = 5 - output_df["relevance"]

print(f'''
Error Scores:
coherence: {output_df["coherence_error"].sum()}
consistency: {output_df["consistency_error"].sum()}
fluency: {output_df["fluency_error"].sum()}
relevance: {output_df["relevance_error"].sum()}
''')



Error Scores:
coherence: 144
consistency: 55
fluency: 0
relevance: 93



As another measure, let's look at the average score for each category.  

In this run, fluency received a perfect average of 3.0; that criterion covers only the grammar, spelling, and wording of the summaries. Coherence, which focuses on the collective quality of all sentences, was rated lowest at 3.56.

In [19]:
print(f'''
Average Scores:
coherence: {output_df["coherence"].mean()}
consistency: {output_df["consistency"].mean()}
fluency: {output_df["fluency"].mean()}
relevance: {output_df["relevance"].mean()}
''')



Average Scores:
coherence: 3.56
consistency: 4.45
fluency: 3.0
relevance: 4.07
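
Fluency is scored on a 1-3 scale while the other criteria use 1-5, so the raw averages are not directly comparable. One possible way to put them on a common footing, offered here as a sketch rather than as part of the original analysis, is min-max normalization to the 0-1 range using each criterion's scale. The averages are the ones printed above.

```python
# (lowest, highest) possible score for each criterion.
SCALE = {
    "coherence": (1, 5),
    "consistency": (1, 5),
    "fluency": (1, 3),
    "relevance": (1, 5),
}

# Average scores reported in the output above.
averages = {"coherence": 3.56, "consistency": 4.45, "fluency": 3.0, "relevance": 4.07}

# Min-max normalization: 0 is the lowest possible score, 1 the highest.
normalized = {
    metric: round((avg - SCALE[metric][0]) / (SCALE[metric][1] - SCALE[metric][0]), 4)
    for metric, avg in averages.items()
}

for metric, value in normalized.items():
    print(f"{metric}: {value}")
```

On this scale the ranking of the four criteria is unchanged, but the gaps between them can be read as fractions of each criterion's full range.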

