# **Demo: ROUGE Benchmark**

This demo reads a PDF file, generates a summary of it with an OpenAI chat model, and then computes ROUGE scores for the summary by comparing it with the original document. The ROUGE scores measure how much text the summary shares with the document, which serves as a rough indicator of summary quality.





## **Steps to Perform:**


*   Step 1: Import the Necessary Libraries
*   Step 2: Read the PDF File
*   Step 3: Define the Summary Function
*   Step 4: Load the ROUGE Metric

## **Step 1: Import the Necessary Libraries**

In [2]:
!pip install rouge-score

Defaulting to user installation because normal site-packages is not writeable
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24936 sha256=3a6391fc7cbe83ab9f9918d8fb096ac65b5f4d70a143ecfb299f238e37c81161
  Stored in directory: /voc/work/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2



In [3]:
from rouge_score import rouge_scorer
import PyPDF2
import pandas as pd
from openai import OpenAI

# Create the OpenAI client; it reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
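
Because `OpenAI()` is called without arguments, the client depends on the `OPENAI_API_KEY` environment variable being set. The check below is a minimal sketch of how to confirm that before running the later cells; the helper name `has_api_key` is hypothetical and not part of the `openai` package.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if the named variable is present and non-empty.

    `env` defaults to the process environment; a plain dict can be
    passed instead, which is convenient for testing.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; the OpenAI client will not be able to authenticate.")
```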

## **Step 2: Read the PDF File**

*   Open the PDF file.
*   Create a **PdfReader** object for the PDF file.
*   Extract the text from each page of the PDF and concatenate it into a single string.


In [5]:
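# Read the PDF file; the with-block closes the file once the text is extracted
with open('arxiv_impact_of_GENAI.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    num_pages = len(pdf_reader.pages)
    document_text = ""
    for page in range(num_pages):
        # Fall back to an empty string if a page yields no text
        document_text += pdf_reader.pages[page].extract_text() or ""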
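
Concatenating pages with no separator can fuse the last word of one page with the first word of the next, which changes the n-gram counts that ROUGE relies on. The sketch below shows one way to join page texts with a newline and skip pages that yield no text. It assumes only that each page object has an `extract_text()` method, as the elements of `pdf_reader.pages` do. The names `join_page_text` and `FakePage` are hypothetical and exist only for this illustration.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate page texts, skipping pages that yield no text."""
    texts = []
    for page in pages:
        text = page.extract_text() or ""  # guard against None or empty results
        if text.strip():
            texts.append(text)
    return separator.join(texts)


class FakePage:
    """Stand-in for a PDF page object, used only to demonstrate the helper."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [
    FakePage("first page ends here"),
    FakePage(None),  # e.g. an image-only page
    FakePage("Second page starts"),
]
print(join_page_text(demo_pages))
```

With a real reader, the call would be `join_page_text(pdf_reader.pages)`.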

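## **Step 3: Define the Summary Function**

*   Truncate the document text to its first 3,000 characters.
*   Send the truncated text to the chat completions API with a summarization prompt.
*   Return the model's reply as the summary.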



In [6]:
def summarize_text(text, model="gpt-3.5-turbo"):
    # Keep only the first 3,000 characters (not tokens) so the prompt fits the model's context window
    truncated_text = text[:3000]
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that summarizes text."},
            {"role": "user", "content": f"Summarize the following text: {truncated_text}"},
        ],
    )
    return response.choices[0].message.content

# The summary therefore covers only the beginning of the document
summary_text = summarize_text(document_text)
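
The slice `text[:3000]` cuts at a fixed character position, so it can end in the middle of a word. The helper below is a minimal sketch of a cut that backs up to the previous whitespace instead. The name `truncate_at_word` is hypothetical, and the limit is still measured in characters rather than model tokens.

```python
def truncate_at_word(text, max_chars=3000):
    """Return at most max_chars characters without splitting the last word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the next character is whitespace, the cut already ends on a word boundary
    if text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word, if there is any whitespace to split on
    parts = cut.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else cut


print(truncate_at_word("alpha beta gamma", 8))
```

Inside `summarize_text`, this could replace the slice as `truncated_text = truncate_at_word(text, 3000)`.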

## **Step 4: Load the ROUGE Metric**

*   Load the ROUGE metric.
*   Compute the ROUGE scores for the summary.
*   Print the scores.



In [7]:
# Compute ROUGE scores using rouge-score
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# score(target, prediction): the full document is the reference, the summary is the prediction
scores = scorer.score(document_text, summary_text)
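
To show what these three metrics measure, here is a minimal pure-Python sketch of ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence). It is not the `rouge-score` implementation: the tokenizer is a simplified assumption (lowercased alphanumeric runs, no stemming), and all function names are hypothetical, so its numbers can differ from the library's.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (simplified; no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches, pred_total, ref_total):
    """Turn a match count into (precision, recall, F1)."""
    precision = matches / pred_total if pred_total else 0.0
    recall = matches / ref_total if ref_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def rouge_n(reference, prediction, n=1):
    ref = ngrams(tokenize(reference), n)
    pred = ngrams(tokenize(prediction), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref & pred).values())
    return prf(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, prediction):
    ref, pred = tokenize(reference), tokenize(prediction)
    return prf(lcs_length(ref, pred), len(pred), len(ref))


# Toy example: the prediction is a prefix of a longer reference
reference = "the cat sat on the mat and looked at the dog"
prediction = "the cat sat on the mat"
for name, result in [
    ("rouge1", rouge_n(reference, prediction, 1)),
    ("rouge2", rouge_n(reference, prediction, 2)),
    ("rougeL", rouge_l(reference, prediction)),
]:
    print(name, [round(value, 4) for value in result])
```

In the toy example every n-gram of the prediction appears in the reference, so precision is 1.0, while recall is limited by how much longer the reference is.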


In [8]:
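# Convert scores to a DataFrame: one row per metric, one column per Score field
# (each Score is a named tuple of precision, recall, and fmeasure)
scores_df = pd.DataFrame(scores).T
scores_df.columns = ['Precision', 'Recall', 'F1-Score']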

# Print the scores
print("ROUGE Scores DataFrame:\n", scores_df)

ROUGE Scores DataFrame:
         Precision    Recall  F1-Score
rouge1   0.943396  0.014599  0.028752
rouge2   0.600000  0.009198  0.018119
rougeL   0.716981  0.011095  0.021852


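This approach uses the rouge-score library to calculate ROUGE scores. The output is a DataFrame showing precision, recall, and F1 scores for rouge1, rouge2, and rougeL. Precision is high because most of the summary's words also occur in the document, while recall is very low because the summary is far shorter than the full document it is scored against.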
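
The F1 column is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two values. As a quick check, the sketch below recomputes F1 from the precision and recall values printed above. The helper name `f1_score` is hypothetical, and small differences are expected because the printed values are rounded to six decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the DataFrame output above
reported = {
    "rouge1": (0.943396, 0.014599, 0.028752),
    "rouge2": (0.600000, 0.009198, 0.018119),
    "rougeL": (0.716981, 0.011095, 0.021852),
}

for name, (precision, recall, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(precision, recall):.6f}, reported = {f1:.6f}")
```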

### **Conclusion**

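The ROUGE output shows precision, recall, and F-measure for three variants of the metric: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence). These scores measure how closely the summary matches the reference, which in this demo is the original document. The higher a score (closer to 1), the better the match. Because the reference here is the full document rather than a reference summary, recall and F-measure stay close to zero even when nearly every word of the summary appears in the document, so precision is the more informative value in this setup.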