<a href="https://colab.research.google.com/github/mattambrogi/llm-summary-evals/blob/main/llm_summarization_evals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Summarization Evals
### Comparing the performance of Claude 3 Opus and GPT-4 on a summarization task

## Set up and data exploration

In [None]:
!pip install openai
!pip install datasets
!pip install pandas
import pandas as pd
import os

In [None]:
from datasets import load_dataset
billsum = load_dataset("billsum", split="train")

In [None]:
# Preview data
billsum[0]["summary"]

"Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is used by such organization; and (3) the business entity authorized the use of such facility by the organization. \nMakes this Act inapplicable to an injury or death that results from an act or omission of a business entity that constitutes gross negligence or intentional misconduct, including misconduct that: (1) constitutes a hate crime or a crime of violence or act of international terrorism for which the defendant has been convicted in any court; or (2) involves a sexual offense for which the defendant has been convicted in any court or misconduct for which the defendant has been found to have violated a Federal or State civil rights law. \nPreempts Stat

In [None]:
for i in range(25):
    print(f"Index: {i}, Text Length: {len(billsum[i]['text'])}, Summary Length: {len(billsum[i]['summary'])}")

Index: 0, Text Length: 5026, Summary Length: 1561
Index: 1, Text Length: 19094, Summary Length: 1317
Index: 2, Text Length: 11013, Summary Length: 473
Index: 3, Text Length: 10745, Summary Length: 162
Index: 4, Text Length: 18047, Summary Length: 4462
Index: 5, Text Length: 8875, Summary Length: 867
Index: 6, Text Length: 17714, Summary Length: 1022
Index: 7, Text Length: 19915, Summary Length: 716
Index: 8, Text Length: 7689, Summary Length: 1888
Index: 9, Text Length: 10213, Summary Length: 1379
Index: 10, Text Length: 6201, Summary Length: 657
Index: 11, Text Length: 5199, Summary Length: 185
Index: 12, Text Length: 12943, Summary Length: 350
Index: 13, Text Length: 16483, Summary Length: 1749
Index: 14, Text Length: 19874, Summary Length: 998
Index: 15, Text Length: 7959, Summary Length: 600
Index: 16, Text Length: 9056, Summary Length: 1047
Index: 17, Text Length: 11391, Summary Length: 1594
Index: 18, Text Length: 6638, Summary Length: 860
Index: 19, Text Length: 9246, Summary Le

In [None]:
# get first 20 bills with a summary between 200 and 1500 characters
def valid(bill):
  return len(bill['summary']) > 200 and len(bill['summary']) < 1500

test_bills = []
i = 0
while len(test_bills) < 20:
  if valid(billsum[i]):
    test_bills.append(billsum[i])
  i+=1

## Set up eval prompt

In [None]:
# Evaluation prompt template based on G-Eval
EVALUATION_PROMPT_TEMPLATE = """
You will be given one human written, source of truth, summary written for an article. You will also be given an AI generated summary for the same article. Your task is to rate the AI summary on one metric.
Please make sure you read and understand these instructions very carefully.
Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Source of Trust Summary:

{document}

AI Summary:

{summary}

Evaluation Form (return scores ONLY):

- {metric_name}
"""

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance(1-10) - selection of important content from the source. \
The summary should include only important information from the source summary. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """
1. Read the AI summary and the human summary carefully.
2. Compare the AI summary to the human summary and identify the main points of the article.
3. Assess how well the AI summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 10. ONLY return the number of the score, no additional text.
"""

# Metric 2: Coherence

COHERENCE_SCORE_CRITERIA = """
Coherence(1-10) - the collective quality of all sentences. \
We align this dimension with the DUC quality question of structure and coherence \
whereby "the summary should be well-structured and well-organized. \
The summary should not just be a heap of related information, but should build from sentence to a\
coherent body of information about a topic."
"""

COHERENCE_SCORE_STEPS = """
1. Read the human summary carefully and identify the main topic and key points.
2. Read the AI summary and compare it to the human summary. Check if the summary covers the main topic and key points of the article,
and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 10, where 1 is the lowest and 10 is the highest based on the Evaluation Criteria. ONLY return the number of the score, no additional text.
"""

# Metric 3: Consistency

CONSISTENCY_SCORE_CRITERIA = """
Consistency(1-10) - the factual alignment between the AI summary and the human summary. \
A factually consistent AI summary contains only statements that are entailed by the source summary. \
Annotators were also asked to penalize summaries that contained hallucinated facts.
"""

CONSISTENCY_SCORE_STEPS = """
1. Read the human summary carefully and identify the main facts and details it presents.
2. Read the AI summary and compare it to the human sunmmary. Check if the summary contains any factual errors that are not supported by the source of truth summary.
3. Assign a score for consistency based on the Evaluation Criteria. ONLY return the number of the score, no additional text.
"""

# Metric 4: Fluency

FLUENCY_SCORE_CRITERIA = """
Fluency(1-5): the quality of the AI summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
3: Good. The summary has few or no errors and is easy to read and follow.
"""

FLUENCY_SCORE_STEPS = """
Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 5. ONLY return the number of the score, no additional text.
"""

## Anthropic Evaluation

In [None]:
!pip install anthropic

In [None]:
import anthropic
# add API key to os.environ
anthropic_client = anthropic.Anthropic()

In [None]:
import math
import time

class ClaudeSummarizer:
  """
  A class to summarize text using a specified language model.

  Attributes:
      system (str): A system message to initialize the summarizer.
      model (str): The language model to use for summarization. Defaults to "claude-3-opus-20240229".
  """
  def __init__(self, system="", model="claude-3-opus-20240229"):
    """
    Initializes the Summarizer with a system message and a model choice.
    """
    self.system = system
    self.model = model

  def __call__(self, message):
    result = self.execute(message)
    return result

  def execute(self, message):
    start_time = time.time()
    message = anthropic_client.messages.create(
      model=self.model,
      temperature=0,
      max_tokens=500,
      system=self.system,
      messages=[
          {
              "role": "user",
              "content": [
                  {
                      "type": "text",
                      "text": message
                  }
              ]
          }
      ]
    )
    elapsed_time = time.time() - start_time
    total_tokens = message.usage.input_tokens + message.usage.output_tokens

    return {
        "content": message.content[0].text,
        "duration": elapsed_time,
        "tokens": total_tokens
    }



In [None]:
claude_summarizer = ClaudeSummarizer(
    """
    You are a helpful assistant that summarizes text. You will be provided with a bill from the california legislature. Your job is to create a summary which concisely captures all key information from the bill.

    The user input will just the text of the bill. The output should be the summary and no addtional text.
    """
)

In [None]:
# Test Claude Summarizer
bill = test_bills[0]
response = claude_summarizer(bill["text"])
print(response["content"])

The Human Rights Information Act is a proposed U.S. legislation that aims to declassify and publicly disclose human rights records related to Guatemala and Honduras after 1944. Key points of the bill include:

1. Federal agencies must identify, review, and organize relevant human rights records within 120 days of enactment, and make them public within 30 days after review.

2. Records can only be withheld if there is clear and convincing evidence that disclosure would gravely threaten U.S. military defense, intelligence operations, or foreign relations.

3. Requests for records from official entities in other countries investigating human rights violations must be reviewed and disclosed according to the Act's standards.

4. The Interagency Security Classification Appeals Panel will review agency decisions to withhold records and can direct disclosure.

5. The President has final authority over the Panel's determinations based on the Act's standards.

6. The Information Security Policy 

In [None]:
def get_geval_score_claude(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
):
    claude_eval_client = anthropic.Anthropic()
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )

    response = claude_eval_client.messages.create(
      model="claude-3-opus-20240229",
      max_tokens=7,
      temperature=0,
      messages=[
          {
              "role": "user",
              "content": [
                  {
                      "type": "text",
                      "text": prompt
                  }
              ]
          }
      ]
    )
    return response.content[0].text


evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

In [None]:
import re

In [None]:
bill = test_bills[0]
base_summary = bill["summary"]
generated_summary = claude_summarizer(bill["text"])
generated_summary_text = generated_summary["content"]

scores = {}

for eval_type, (criteria, steps) in evaluation_metrics.items():
    result = get_geval_score_claude(
        criteria,
        steps,
        base_summary,
        generated_summary_text,
        eval_type,
    )
    score_num = float(re.findall(r'\d+', result)[0])
    scores[eval_type] = score_num

for criteria, score in scores.items():
    print(f"{criteria}: {score}")

Relevance: 9.0
Coherence: 9.0
Consistency: 9.0
Fluency: 5.0


**Note:** regex required. Having some issues with Claude 3 Opus returning just the score. Not seeing same issue with GPT-4.

In [None]:
# generate and display scores for test set
all_scores = []
i = 0
for bill in test_bills:
    print(f"Generating summary for index: {i}")
    i+=1
    base_summary = bill["summary"]
    summary_result = claude_summarizer(bill["text"])
    generated_summary = summary_result["content"]
    scores = {}

    for eval_type, (criteria, steps) in evaluation_metrics.items():
        result = get_geval_score_claude(criteria, steps, base_summary, generated_summary, eval_type)
        try:
            score_num = float(re.findall(r'\d+', result)[0])
        except ValueError:
          score_num = -1
        scores[eval_type] = score_num

    scores['Duration'] = summary_result['duration']
    scores['Total Tokens'] = summary_result['tokens']
    scores['Content'] = summary_result['content']

    all_scores.append(scores)
    time.sleep(35)  # avoid rate limit

# Create the DataFrame
df = pd.DataFrame(all_scores)

# Label summaries
df.index = [f"Summary {i+1}" for i in range(len(test_bills))]

# Display of df
display(df)

Generating summary for index: 0
Generating summary for index: 1
Generating summary for index: 2
Generating summary for index: 3
Generating summary for index: 4
Generating summary for index: 5
Generating summary for index: 6
Generating summary for index: 7
Generating summary for index: 8
Generating summary for index: 9
Generating summary for index: 10
Generating summary for index: 11
Generating summary for index: 12
Generating summary for index: 13
Generating summary for index: 14
Generating summary for index: 15
Generating summary for index: 16
Generating summary for index: 17
Generating summary for index: 18
Generating summary for index: 19


Unnamed: 0,Relevance,Coherence,Consistency,Fluency,Duration,Total Tokens,Content
Summary 1,8.0,9.0,9.0,5.0,24.423361,4391,The Human Rights Information Act is a proposed...
Summary 2,8.0,8.0,8.0,4.0,16.09074,2893,The Jackie Robinson Commemorative Coin Act aut...
Summary 3,8.0,8.0,8.0,4.0,18.924698,2418,The Holocaust Victims Insurance Relief Act of ...
Summary 4,8.0,8.0,8.0,4.0,21.547284,4274,Summary:\n\nThis bill amends the Elementary an...
Summary 5,8.0,8.0,8.0,5.0,21.186474,4959,The Gallatin Land Consolidation Act of 1998 au...
Summary 6,8.0,9.0,9.0,4.0,16.919641,2471,The Indian Needs Assessment and Program Evalua...
Summary 7,8.0,8.0,9.0,4.0,22.107932,1780,The Kidney Disease Educational Benefits Act of...
Summary 8,6.0,8.0,7.0,4.0,19.970268,3165,Summary of the National Center for Social Work...
Summary 9,8.0,8.0,8.0,4.0,22.834198,5024,The San Juan Mountains Wilderness Act of 2009 ...
Summary 10,8.0,8.0,8.0,5.0,29.827525,2295,The New Millennium Classrooms Act is a bill th...


In [28]:
df.to_csv('summary_eval_results_claude3_v2.csv', index=True)

## OpenAI Evaluation

In [31]:
from openai import OpenAI
# Add API key to os.environ here
openai_client = OpenAI()

In [37]:
import math
import time

class OpenAISummarizer:
  """
  A class to summarize text using a specified language model.

  Attributes:
      system (str): A system message to initialize the summarizer.
      model (str): The language model to use for summarization. Defaults to "gpt-4-1106-preview".
  """
  def __init__(self, system="", model="gpt-4-1106-preview"):
    """
    Initializes the Summarizer with a system message and a model choice.
    """
    self.system = system
    self.messages = []
    self.model = model
    if self.system:
      self.messages.append({"role": "system", "content": system})

  def __call__(self, message):
    result = self.execute(message)
    return result

  def execute(self, message):
    start_time = time.time()
    completion = openai_client.chat.completions.create(
        model=self.model,
        messages = self.messages + [{"role": "user", "content": message}]
    )
    elapsed_time = time.time() - start_time
    total_tokens = completion.usage.total_tokens

    return {
        "content": completion.choices[0].message.content,
        "duration": elapsed_time,
        "tokens": total_tokens
    }

In [38]:
openai_summarizer = OpenAISummarizer(
    """
    You are a helpful assistant that summarizes text. You will be provided with a bill from the california legislature. Your job is to create a summary which concisely captures all key information from the bill.

    The user input will just the text of the bill. The output should be the summary and no addtional text.
    """
)

In [39]:
# Test summarizer on one doc
bill = test_bills[0]
response = openai_summarizer(bill["text"])
print(response["content"])

This Act, known as the Human Rights Information Act, asserts that the protection and promotion of human rights are foundational values of the United States and its constitution. Acknowledging possession of documents pertaining to human rights violations abroad, the Act calls for the strengthening of declassification procedures of such documents to assist in international human rights investigations.

Key Provisions:

1. Definitions: Human rights records, agencies, and gross violations of internationally recognized human rights are defined as they pertain to the Act.

2. Declassification: Agencies including the Department of State, CIA, DOD, and others are directed to identify, review, and declassify human rights records related to activities in Guatemala and Honduras post-1944, with a report on compliance due to Congress within 150 days.

3. Grounds for Postponement: Disclosure may be postponed if it is determined that releasing the information could significantly threaten U.S. militar

In [40]:
def get_geval_score_openai(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
):
    openai_eval_client = OpenAI()
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    response = openai_eval_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content


evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

In [42]:
# Generate evals for one summary
bill = test_bills[0]
base_summary = bill["summary"]
generated_summary = openai_summarizer(bill["text"])
generated_summary_text = generated_summary["content"]

scores = {}

for eval_type, (criteria, steps) in evaluation_metrics.items():
    result = get_geval_score_openai(
        criteria,
        steps,
        base_summary,
        generated_summary_text,
        eval_type,
    )
    score_num = float(result.strip())
    scores[eval_type] = score_num

for criteria, score in scores.items():
    print(f"{criteria}: {score}")

Relevance: 9.0
Coherence: 9.0
Consistency: 8.0
Fluency: 5.0


In [None]:
# generate and display scores for test set
all_scores = []
i = 0
for bill in test_bills:
    print(f"Generating summary for index: {i}")
    i+=1
    base_summary = bill["summary"]
    summary_result = openai_summarizer(bill["text"])
    generated_summary = summary_result["content"]
    scores = {}

    for eval_type, (criteria, steps) in evaluation_metrics.items():
        result = get_geval_score_openai(criteria, steps, base_summary, generated_summary, eval_type)
        try:
          score_num = float(re.findall(r'\d+', result)[0])
        except ValueError:
          score_num = -1
        scores[eval_type] = score_num

    scores['Duration'] = summary_result['duration']
    scores['Total Tokens'] = summary_result['tokens']
    scores['Content'] = summary_result['content']

    all_scores.append(scores)
    time.sleep(65)  # avoid rate limit

# Create the DataFrame
df = pd.DataFrame(all_scores)

# Label summaries
df.index = [f"Summary {i+1}" for i in range(num_summaries)]

# Display of df
display(df)

Generating summary for index: 0
Generating summary for index: 1
Generating summary for index: 2
Generating summary for index: 3
Generating summary for index: 4
Generating summary for index: 5
Generating summary for index: 6
Generating summary for index: 7
Generating summary for index: 8
Generating summary for index: 9
Generating summary for index: 10
Generating summary for index: 11
Generating summary for index: 12
Generating summary for index: 13
Generating summary for index: 14
Generating summary for index: 15
Generating summary for index: 16
Generating summary for index: 17
Generating summary for index: 18
Generating summary for index: 19


In [46]:
display(df)

Unnamed: 0,Relevance,Coherence,Consistency,Fluency,Duration,Total Tokens,Content
0,10.0,10.0,10.0,5.0,16.075001,4110,"The ""Human Rights Information Act"" aims to:\n\..."
1,5.0,8.0,7.0,5.0,14.584744,2797,"The ""Jackie Robinson Commemorative Coin Act"" e..."
2,9.0,9.0,7.0,5.0,15.16286,2257,Title: Holocaust Victims Insurance Relief Act ...
3,10.0,9.0,9.0,5.0,18.227484,3821,The bill amends Subpart 14 of title V of the E...
4,8.0,10.0,9.0,5.0,18.745847,4439,The Gallatin Land Consolidation Act of 1998 ai...
5,10.0,10.0,10.0,5.0,15.370878,2341,"The ""Indian Needs Assessment and Program Evalu..."
6,10.0,10.0,10.0,5.0,12.055676,1592,"This Act is named the ""Kidney Disease Educatio..."
7,10.0,10.0,10.0,5.0,16.931851,2924,"The ""National Center for Social Work Research ..."
8,9.0,9.0,10.0,5.0,17.113862,4564,"This Act, cited as the ""San Juan Mountains Wil..."
9,10.0,10.0,9.0,5.0,15.83017,2080,"The ""New Millennium Classrooms Act"" aims to en..."


In [47]:
df.to_csv('summary_eval_results_gpt4_v2.csv', index=True)

## Cross Evaluations

Have Opus evaluate GPT-4 summaries and vice versa.