# EVAL USING ROUGE

### prepare imports

In [1]:
from transformers import pipeline
from rouge import Rouge
import torch
import json
import numpy as np



### prepare functions

In [2]:
summarizer = pipeline(
    "summarization",
    model="../models/model2",
    tokenizer="../models/model2",
)
        
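def summarize(text, prompt):
    """Summarize `text`, prefixing it with `prompt` as an instruction."""
    inputs = f"{prompt}: {text}"
    # Count the input tokens so the output length limits scale with the input.
    input_tokens = summarizer.tokenizer.encode(inputs, truncation=False)
    input_len = len(input_tokens)
    # Allow up to twice the input length (capped at 1024 tokens) and require
    # at least a quarter of it (never fewer than 32 tokens).
    max_length = min(input_len * 2, 1024)
    min_length = max(32, input_len // 4)
    summary = summarizer(
        inputs,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
    )
    return summary[0]['summary_text']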
    
def process_data(response, prompt):
    """Summarize a Reddit post and its comments from the raw JSON response."""
    # response[0] is the post listing, response[1] is the comment listing.
    post_content = response[0]['data']['children'][0]['data'].get('selftext', '')
    # Keep only the children that carry a comment body.
    comments = [
        comment['data']['body']
        for comment in response[1]['data']['children']
        if 'body' in comment['data']
    ]
    comments_all = ' '.join(comments)

    post_summary = summarize(post_content, prompt)
    comments_summary = summarize(comments_all, prompt)

    return {
        "post_summary": post_summary,
        "comments_summary": comments_summary
    }

Device set to use cuda:0


### load the Reddit post, summarize it, then save the summary to another JSON file

In [3]:
with open('../data/response.json') as file:
    reddit_post = json.load(file)

summary = process_data(reddit_post, "Summarize and highlight popular brands")

with open('../data/summary.json', 'w') as file:
    json.dump(summary, file, indent=4)
print("Summary saved to summary.json")

Your max_length is set to 274, but your input_length is only 137. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)
Your max_length is set to 380, but your input_length is only 190. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=95)


Summary saved to summary.json
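
The two warnings above appear because `max_length` is set to twice the input length (274 for 137 tokens, 380 for 190 tokens). One possible way to keep the bounds below the input length is sketched here. The helper `length_bounds` and its ratio parameters are hypothetical: they are not part of the notebook or of `transformers`, and the ratios are illustrative assumptions rather than tuned values.

```python
def length_bounds(input_len, max_ratio=0.5, min_ratio=0.25, cap=1024):
    """Return (min_length, max_length) with min_length < max_length.

    Hypothetical helper: the default ratios are assumptions for illustration.
    """
    # Scale the upper bound with the input, cap it, and never go below 2 tokens.
    max_length = max(2, min(int(input_len * max_ratio), cap))
    # Keep the lower bound strictly below the upper bound.
    min_length = max(1, min(int(input_len * min_ratio), max_length - 1))
    return min_length, max_length


# Input lengths reported in the warnings above.
for n in (137, 190):
    print(n, length_bounds(n))
```

With `max_ratio=0.5`, the upper bounds come out as 68 and 95, the same values the warnings suggest. The returned pair could be passed as `min_length` and `max_length` in `summarize`.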


### verify by printing the summaries

In [4]:
print("Post Summary:\n", summary["post_summary"])
print("\nComments Summary:\n", summary["comments_summary"])

Post Summary:
 {"title": "Budget for a New Chair with Adjustable Headrest and Armrest", "selftext": "The user is looking for a new chair with an adjustable headrest, armrest, chair height, and a lumbar pillow with mesh material. They mention a budget of P3,000-P4,000 and highlight popular brands.", "comments": ["Users provide feedback on the chair's specifications and suggest alternative options.", "Some users suggest reaching out to influencers for recommendations on speakerphones or headphones for better long-distance communication.", "The sentiment is positive, with users appreciating the user's effort to find a chair and offering helpful suggestions for its design."], "sentiment": " the sentiment is encouraging and supportive, with Users sharing their own search experiences and offering suggestions for additional features on chairs and speakers."}

Comments Summary:
 {"title": "Budget-friendly office chair suggestions", "selftext": {"overview": "The post provides a budget-friendly 

### fix the JSON files (each summary is stored as a JSON-encoded string instead of a nested JSON object)

In [5]:
with open('../data/summary.json', 'r') as file:
    generated_summaries = file.read()

with open('../data/reference.json', 'r') as file:
    reference_summaries = file.read()

def fix_json(jsonfile, path):
    """Decode the nested JSON strings in a summary file and save the result.

    `jsonfile` is the raw file content; `path` is where the fixed JSON is written.
    """
    fixed_json = json.loads(jsonfile)

    # Each summary was saved as a JSON-encoded string, so decode it a second time.
    fixed_json['post_summary'] = json.loads(fixed_json['post_summary'])
    fixed_json['comments_summary'] = json.loads(fixed_json['comments_summary'])

    print(json.dumps(fixed_json, indent=4))

    with open(path, 'w') as file:
        json.dump(fixed_json, file, indent=4)
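
The second `json.loads` in `fix_json` is needed because the model returns each summary as JSON-formatted text, which `json.dump` then stores as an escaped string. A small self-contained illustration, using a made-up record (not the notebook's data):

```python
import json

# Made-up record that mimics summary.json: the value is itself a JSON string.
raw = json.dumps({"post_summary": json.dumps({"title": "Example title"})})

outer = json.loads(raw)
# After one decode the summary is still a plain string.
print(type(outer["post_summary"]).__name__)

# A second decode turns it into a nested object.
outer["post_summary"] = json.loads(outer["post_summary"])
print(outer["post_summary"]["title"])
```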

### fix the formatting of the generated summary

In [6]:
fix_json(generated_summaries, '../data/summary.json')

{
    "post_summary": {
        "title": "Budget for a New Chair with Adjustable Headrest and Armrest",
        "selftext": "The user is looking for a new chair with an adjustable headrest, armrest, chair height, and a lumbar pillow with mesh material. They mention a budget of P3,000-P4,000 and highlight popular brands.",
        "comments": [
            "Users provide feedback on the chair's specifications and suggest alternative options.",
            "Some users suggest reaching out to influencers for recommendations on speakerphones or headphones for better long-distance communication.",
            "The sentiment is positive, with users appreciating the user's effort to find a chair and offering helpful suggestions for its design."
        ],
        "sentiment": " the sentiment is encouraging and supportive, with Users sharing their own search experiences and offering suggestions for additional features on chairs and speakers."
    },
    "comments_summary": {
        "title": "

### evaluate the results using ROUGE

In [7]:
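rouge = Rouge()

# get_scores takes the hypothesis (generated text) first, then the reference.
# Both arguments are the raw file contents read in cell In [5].
results = rouge.get_scores(generated_summaries, reference_summaries)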

print(f"ROUGE Scores: {results}")

ROUGE Scores: [{'rouge-1': {'r': 0.5766423357664233, 'p': 0.4340659340659341, 'f': 0.4952978007421311}, 'rouge-2': {'r': 0.2930232558139535, 'p': 0.20723684210526316, 'f': 0.24277456162102168}, 'rouge-l': {'r': 0.5620437956204379, 'p': 0.4230769230769231, 'f': 0.4827586157891531}}]


### **ROUGE-1 SCORES:**
- **Recall (r)** = 57.66%
- **Precision (p)** = 43.41%
- **F1-Score (f)** = 49.53%

### **ROUGE-2 SCORES:**
- **Recall (r)** = 29.30%
- **Precision (p)** = 20.72%
- **F1-Score (f)** = 24.28%

### **ROUGE-L SCORES:**
- **Recall (r)** = 56.20%
- **Precision (p)** = 42.31%
- **F1-Score (f)** = 48.28%
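
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The sketch below uses the scores from the output above, rounded to four decimal places; the helper name `f1` is introduced here for illustration only.

```python
# Scores copied from the ROUGE output above, rounded to four decimal places.
scores = {
    "rouge-1": {"r": 0.5766, "p": 0.4341, "f": 0.4953},
    "rouge-2": {"r": 0.2930, "p": 0.2072, "f": 0.2428},
    "rouge-l": {"r": 0.5620, "p": 0.4231, "f": 0.4828},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, s in scores.items():
    recomputed = f1(s["p"], s["r"])
    print(f"{name}: r={s['r']:.2%} p={s['p']:.2%} f={s['f']:.2%} recomputed f={recomputed:.2%}")
```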

<br>

**ROUGE-1:** measures the overlap of unigrams (individual words) between the generated summary and the reference summary. Recall is the proportion of reference words that also appear in the generated summary; precision is the proportion of generated words that also appear in the reference. Example: Reference text: “The cat is on the rug” Generated text: “The dog is on the rug”. Counting every word occurrence, five of the six words match, so ROUGE-1 recall and precision are both 5/6 ≈ 0.83.

**ROUGE-2:** measures the overlap of bigrams (pairs of consecutive words) between the generated summary and the reference summary, with recall and precision defined as for ROUGE-1 but over bigrams. For the same sentence pair, three of the five bigrams match (“is on”, “on the”, “the rug”), so ROUGE-2 = 3/5 = 0.6.
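
A minimal sketch of the n-gram overlap calculation, applied to the cat/dog sentence pair. It is a simplified illustration that lowercases, splits on whitespace, and counts repeated n-grams; it is not the implementation used by the `rouge` package, so its numbers may differ from that package's output. The function names `ngrams` and `rouge_n` are introduced here for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n):
    """Return recall, precision and F1 for n-gram overlap."""
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((gen & ref).values())
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


reference = "The cat is on the rug"
generated = "The dog is on the rug"
print(rouge_n(generated, reference, 1))  # unigram overlap: 5 of 6
print(rouge_n(generated, reference, 2))  # bigram overlap: 3 of 5
```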

**ROUGE-L:** measures the similarity between the generated summary and the reference summary using their longest common subsequence (LCS): the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously. Unlike ROUGE-1 and ROUGE-2, which count fixed-length n-grams, ROUGE-L rewards matching word order across the whole sequence.
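
The LCS idea can be sketched with the classic dynamic-programming algorithm. This is a simplified word-level illustration, not the `rouge` package's implementation; `lcs_length` and `rouge_l` are names introduced here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence that ended before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best length found so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated, reference):
    """Return LCS-based recall, precision and F1."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(gen, ref)
    precision = lcs / max(len(gen), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


# The shared subsequence is "the is on the rug" (5 of 6 words).
print(rouge_l("The dog is on the rug", "The cat is on the rug"))
```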

**Source:** https://fabianofalcao.medium.com/metrics-for-evaluating-summarization-of-texts-performed-by-transformers-how-to-evaluate-the-b3ce68a309c3

### **CONCLUSIONS**
**ROUGE-1 SCORE:** The ROUGE-1 score measures unigram overlap with the reference summary. A recall of 57.66% means the generated summary covers more than half of the reference's words, which suggests it captures much of the key content. A precision of 43.41% means fewer than half of the generated words appear in the reference, so the output contains a substantial share of extra words. The F1-score of 49.53% reflects this trade-off: recall is moderate and precision is the weaker side, so there is room to improve both.

**ROUGE-2 SCORE:** The ROUGE-2 score, which measures bigram overlap, is considerably lower than ROUGE-1. A recall of 29.30% means the model reproduces a little under 30% of the reference's bigrams, so it often uses the same words as the reference without keeping the same word pairs. A precision of 20.72% means that roughly four out of five generated bigrams do not occur in the reference. The F1-score of 24.28% is correspondingly low. A gap of this kind is common in summarization tasks, because rephrasing changes word pairs even when the content is preserved.

**ROUGE-L SCORE:** The ROUGE-L score, which is based on the longest common subsequence (LCS), is close to ROUGE-1. A recall of 56.20% means that most of the reference words matched by ROUGE-1 also appear in the same relative order in the generated summary. A precision of 42.31% shows that the generated summary still contains many words outside that common subsequence, so it could be more concise. The F1-score of 48.28% suggests that the summary follows the word order of the reference reasonably well. ROUGE-L measures order-preserving overlap; it does not measure fluency or coherence directly.

### **SUMMARY**  
- The ROUGE-1 F1-score (49.53%) is moderate: the model recovers more than half of the reference's words, which suggests it identifies much of the core content of the Reddit discussion.

- The ROUGE-2 F1-score (24.28%) is relatively low: the model rarely reproduces the reference's exact word pairs, which points to different phrasing and not necessarily to missing content.

- The ROUGE-L F1-score (48.28%) is close to ROUGE-1, so the words the model shares with the reference mostly appear in the same order. This is a positive outcome for summarizing Reddit posts, where keeping the overall message intact is important.

The model performs best on ROUGE-1 and ROUGE-L and worst on ROUGE-2. Raising bigram overlap would bring the phrasing of the summaries closer to the reference. Because these scores come from a single post and a single reference summary, they are indicative only.