# 04 - LLM-Based Review Summarization

## 1. Intro & Objectives
- Generate category-level blog-style summaries from Amazon product reviews using an instruction-tuned LLM (LLaMA 3.2-3B).

This notebook:
- Aggregates reviews at product level
- Generates structured blog articles per meta-category
- Compares decoding strategies (controlled vs creative)
- Analyzes hallucination and data noise effects


## 2. Environment Setup
- GPU: Colab L4
- Model: LLaMA 3.2-3B-Instruct
- Inference only (no fine-tuning)


In [None]:
# Environment Setup (First Run Only)

# !pip install transformers
# !pip install torch
# !pip install rouge-score
# !pip install bert-score
# !pip install sentencepiece
# !pip install accelerate

In [2]:
# Imports

import os
import subprocess
import re
import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

from rouge_score import rouge_scorer
from bert_score import score as bertscore

In [3]:
# GPU check
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))


CUDA available: True
Device: NVIDIA L4


## 3. Helper Functions

In [4]:
def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return set(text.split())


In [5]:
def compute_grounding_ratio(source_text, generated_text):
    source_tokens = tokenize(source_text)
    generated_tokens = tokenize(generated_text)

    overlap = generated_tokens.intersection(source_tokens)

    if len(generated_tokens) == 0:
        return 0

    return len(overlap) / len(generated_tokens)

In [6]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def compute_rouge(reference, generated):
    scores = scorer.score(reference, generated)

    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure
    }

In [7]:
def compute_bertscore(reference, generated):
    P, R, F1 = bertscore(
        [generated],
        [reference],
        lang="en",
        rescale_with_baseline=False   # raw scores; without rescaling, values sit in a narrow high range
    )

    return float(F1.mean())

In [8]:
def compute_compression_ratio(source_text, generated_text):
    source_len = len(source_text.split())
    generated_len = len(generated_text.split())

    # Guard against an empty source to avoid division by zero
    if source_len == 0:
        return 0

    return generated_len / source_len

## 4. Data Preparation

1. Load processed electronics dataset
2. Re-apply meta-category assignment (for reproducibility)
3. Aggregate reviews at product level to reduce cross-product noise


In [9]:
REPO = "https://github.com/marcosfsousa/project-ironhack-automated-customer-reviews.git"

if not os.path.exists("/content/repo"):
    subprocess.run(["git", "clone", REPO, "/content/repo"], check=True)
    print("Repo cloned.")
else:
    subprocess.run(["git", "-C", "/content/repo", "pull"], check=True)
    print("Repo updated.")

DATA_PATH = "/content/repo/data/processed/electronics_ready.csv"
print(f"File exists: {os.path.exists(DATA_PATH)}")


Repo cloned.
File exists: True


In [11]:
# Local filesystem path
# df = pd.read_csv("../data/processed/electronics_ready.csv")

# Colab-only path
df = pd.read_csv(DATA_PATH)
print(df.shape)
df.head()

(30487, 4)


| | name | brand | rating | review_text |
|---|---|---|---|---|
| 0 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | This product so far has not disappointed. My c... |
| 1 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | great for beginner or experienced person. Boug... |
| 2 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | Inexpensive tablet for him to use and learn on... |
| 3 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 4 | I've had my Fire HD 8 two weeks now and I love... |
| 4 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | I bought this for my grand daughter when she c... |


### Category Assignment

In [12]:
# Assign Meta Categories

def assign_meta_category(name):
    """
    Assign products to one of 5 meta-categories based on name keywords.
    Uses the category hierarchy determined during EDA.
    """
    if pd.isna(name):
        return "Unknown"

    name_lower = name.lower()

    # Order matters — Kids must come before general Tablets
    if any(kw in name_lower for kw in ["kids edition", "kid-proof"]):
        return "Fire Kids Edition"

    elif any(kw in name_lower for kw in ["fire tablet", "fire hd", "fire 7",
                                          "fire 8", "fire 10", " fire ", "tablet"]):
        return "Fire Tablets"

    elif any(kw in name_lower for kw in ["kindle", "e-reader", "ebook",
                                          "paperwhite", "voyage", "oasis"]):
        return "Kindle E-Readers"

    elif any(kw in name_lower for kw in ["echo", "tap", "alexa"]):
        return "Echo & Smart Speakers"

    elif any(kw in name_lower for kw in ["fire tv", "firetv", "streaming",
                                          "media player"]):
        return "Fire TV & Streaming"

    else:
        return "Accessories & Other"

# Apply to full dataset
df["meta_category"] = df["name"].apply(assign_meta_category)

print("Meta-category distribution:")
print(df["meta_category"].value_counts())

# Check what landed in "Other"
other_products = df[df["meta_category"] == "Accessories & Other"]["name"].unique()
print(f"\nProducts in 'Accessories & Other': {len(other_products)}")
if len(other_products) < 20:
    print(other_products)


Meta-category distribution:
meta_category
Fire Tablets             19739
Echo & Smart Speakers     4262
Kindle E-Readers          4231
Fire Kids Edition         2191
Accessories & Other         58
Fire TV & Streaming          6
Name: count, dtype: int64

Products in 'Accessories & Other': 9
['Coconut Water Red Tea 16.5 Oz (pack of 12)'
 'AmazonBasics Nylon CD/DVD Binder (400 Capacity)'
 'AmazonBasics Ventilated Adjustable Laptop Stand'
 'AmazonBasics Backpack for Laptops up to 17-inches'
 'AmazonBasics 11.6-Inch Laptop Sleeve'
 'AmazonBasics External Hard Drive Case'
 'AmazonBasics USB 3.0 Cable - A-Male to B-Male - 6 Feet (1.8 Meters)'
 'AmazonBasics 16-Gauge Speaker Wire - 100 Feet'
 'AmazonBasics Bluetooth Keyboard for Android Devices - Black']


### Why Aggregate at Product Level?

Initial generation attempts revealed review contamination across categories.
To reduce noise, we:
- First aggregate reviews per product
- Then aggregate top-N products per category

In [13]:
# Aggregate reviews by product and category (first 20 reviews per product)
product_df = (
    df.groupby(["meta_category", "name"])
      .agg({
          "review_text": lambda x: " ".join(x.astype(str).head(20))
      })
      .reset_index()
)

product_df

| | meta_category | name | review_text |
|---|---|---|---|
| 0 | Accessories & Other | AmazonBasics 11.6-Inch Laptop Sleeve | BETTER THAN NOTHING, But not as good as the CA... |
| 1 | Accessories & Other | AmazonBasics 16-Gauge Speaker Wire - 100 Feet | As advised. Came really fast Great feel and se... |
| 2 | Accessories & Other | AmazonBasics Backpack for Laptops up to 17-inches | This is a very basic, functional backpack that... |
| 3 | Accessories & Other | AmazonBasics Bluetooth Keyboard for Android De... | Like a lot of reviewers here, I struggled to f... |
| 4 | Accessories & Other | AmazonBasics External Hard Drive Case | I have the Western Digital My Passport Ultra 2... |
| ... | ... | ... | ... |
| 75 | Kindle E-Readers | Kindle Paperwhite | This is my 2 nd kindle. I bought this because ... |
| 76 | Kindle E-Readers | Kindle Paperwhite E-reader - White, 6 High-Res... | I purchased this for my son overseas as he had... |
| 77 | Kindle E-Readers | Kindle PowerFast International Charging Kit (f... | I travel internationally at least once a year ... |
| 78 | Kindle E-Readers | Kindle Voyage E-reader, 6 High-Resolution Disp... | Much better than my original Kindle. Lighter a... |


## 5. Model Setup

In [15]:
#MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # Public access model used while gated model authorization was being reviewed

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct" # Larger model that is optimized for text summarization task

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,
    device_map="auto"
)

print("LLaMA model loaded successfully.")

LLaMA model loaded successfully.


In [16]:
def build_category_text(category, top_n_products=3):
    products = (
        product_df[product_df["meta_category"] == category]
        .head(top_n_products)
    )
    return " ".join(products["review_text"])

## Prompt Engineering Strategy

Goals:
- Minimize hallucination
- Avoid competitor mentions
- Prevent generic marketing language
- Force grounded, review-based summarization


In [17]:
def build_prompt(category, review_text):
    return f"""
<|system|>
You are a critical but fair technology journalist.
Only use information explicitly stated in the provided customer reviews.
Do NOT introduce technical specifications or product details that are not mentioned.
Do NOT mention products from other categories or competitor brands.
Do NOT use bullet points.
Write in continuous paragraphs only.
Avoid marketing language and generic praise.

<|user|>
Using only the information in the reviews below, write a 350–500 word blog-style article about {category}.

Your article must:
1. Identify the main products mentioned and describe how customers experience them.
2. Highlight concrete strengths with real-world examples from the reviews.
3. Explain recurring complaints or limitations.
4. End with a clear, opinionated recommendation that explains:
   - Who should buy products in this category
   - Who may find them frustrating

Be analytical and grounded in customer language.
Do not invent specifications.
Do not mention products outside {category}.
Do not use bullet points.

Category: {category}

Customer Reviews:
{review_text[:2500]}

<|assistant|>
"""

In [18]:
# Display the actual prompt being used
print("=" * 80)
print("PROMPT TEMPLATE EXAMPLE")
print("=" * 80)
sample_reviews = "Sample review text here..."
sample_prompt = build_prompt("Fire Tablets", sample_reviews)
print(sample_prompt)
print("=" * 80)

PROMPT TEMPLATE EXAMPLE

<|system|>
You are a critical but fair technology journalist.
Only use information explicitly stated in the provided customer reviews.
Do NOT introduce technical specifications or product details that are not mentioned.
Do NOT mention products from other categories or competitor brands.
Do NOT use bullet points.
Write in continuous paragraphs only.
Avoid marketing language and generic praise.

<|user|>
Using only the information in the reviews below, write a 350–500 word blog-style article about Fire Tablets.

Your article must:
1. Identify the main products mentioned and describe how customers experience them.
2. Highlight concrete strengths with real-world examples from the reviews.
3. Explain recurring complaints or limitations.
4. End with a clear, opinionated recommendation that explains:
   - Who should buy products in this category
   - Who may find them frustrating

Be analytical and grounded in customer language.
Do not invent specifications.
Do not me
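
The `<|system|>`, `<|user|>` and `<|assistant|>` markers in `build_prompt` follow the Zephyr-style layout used by the TinyLlama chat model, whereas Llama 3.x Instruct models define their own chat template. As written, the notebook passes these markers to LLaMA as plain text. One possible alternative, sketched below, is to express the same instructions as role-tagged messages and let the tokenizer render the model-specific markup. `build_chat_messages` and its `max_chars` parameter are hypothetical names introduced for illustration, and the system message is abridged; this is not the prompt used to produce the results in this notebook.

```python
def build_chat_messages(category, review_text, max_chars=2500):
    """Hypothetical helper: build_prompt's instructions as role-tagged messages."""
    system_msg = (
        "You are a critical but fair technology journalist. "
        "Only use information explicitly stated in the provided customer reviews. "
        "Do NOT mention products from other categories or competitor brands. "
        "Write in continuous paragraphs only."
    )
    user_msg = (
        f"Using only the information in the reviews below, write a 350-500 word "
        f"blog-style article about {category}.\n\n"
        f"Category: {category}\n\n"
        f"Customer Reviews:\n{review_text[:max_chars]}"
    )
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]


messages = build_chat_messages("Fire Tablets", "Sample review text here...")
for message in messages:
    print(message["role"], "->", len(message["content"]), "chars")

# With the notebook's tokenizer loaded, the model-specific markup could then be
# rendered by the tokenizer instead of being written by hand:
# prompt = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True
# )
```

Whether this changes output quality for these categories would need to be checked by regenerating and re-evaluating the articles.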

## 6. Text Generation

- Start by sampling a single category, then use its output to refine the prompt


In [19]:
# Generation Cell - Sampling with Single Category

category = "Echo & Smart Speakers"

echo_text = build_category_text(category, top_n_products=3)

prompt = build_prompt(category, echo_text)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=600,
    temperature=0.15,
    repetition_penalty=1.1,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the prompt tokens
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print(generated_text.strip())

As a tech-savvy consumer, I've had the chance to explore various Echo & Smart Speakers, and I'm excited to share my findings with you. At the heart of these devices are Amazon's Echo and its competitors, such as the Google Home. These speakers have become an integral part of many households, offering a range of features that make life easier and more enjoyable.

One of the standout aspects of these devices is their ease of setup and use. Many reviewers have praised the user-friendly interface, which guides you through the installation process with minimal hassle. For instance, one reviewer raved about the simplicity of setting up their Echo, saying it "walks you right through the install process." Another reviewer appreciated the seamless integration with their Amazon apps, such as Audible, making it easy to access their favorite content.

Music lovers will appreciate the impressive sound quality and vast music library available on these devices. Reviewers have enjoyed using their Echo

In [20]:
def filter_competitor_mentions(text):
    """Replace competitor brand mentions with a generic placeholder and report them."""
    competitors = [
        'CASE LOGIC', 'Samsung', 'Apple', 'Logitech',
        'Belkin', 'Anker', 'Sony', 'LG', 'Microsoft',
        'Google', 'Google Home', 'HP', 'Dell'
    ]

    found_competitors = []
    # Match longer names first so "Google Home" is replaced before "Google"
    for comp in sorted(competitors, key=len, reverse=True):
        # Whole-word match avoids false positives such as "LG" inside "algorithm"
        pattern = re.compile(rf'\b{re.escape(comp)}\b', flags=re.IGNORECASE)
        if pattern.search(text):
            found_competitors.append(comp)
            # Replace with generic term
            text = pattern.sub('[competitor product]', text)

    if found_competitors:
        print(f"⚠️ WARNING: Removed competitor mentions: {found_competitors}")

    return text

We generate two variants per category:
- Controlled (low temperature)
- Creative (higher temperature)

In [21]:
def generate_article(category, temperature=0.15):
    category_text = build_category_text(category, top_n_products=3)
    prompt = build_prompt(category, category_text)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=600,
        temperature=temperature,
        top_p=0.9,
        repetition_penalty=1.1,  # Reduces "I would recommend" repetition
        do_sample=True,  # Sampling must be on for temperature/top_p to take effect
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id  # Ensures proper stopping
    )

    # Decode only the newly generated tokens, skipping the prompt tokens
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    generated_text = filter_competitor_mentions(generated_text)

    return generated_text


## Generating Articles for All Categories

Temperature = 0.15 → more factual, grounded  
Temperature = 0.35 → more narrative, expressive  
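
To make the effect of these two settings concrete, the sketch below applies temperature scaling to a toy set of next-token logits. The logits are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not part of the `transformers` API. In the actual `model.generate` call, `top_p` and `repetition_penalty` also reshape the distribution; this sketch covers temperature only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens
toy_logits = [2.0, 1.5, 0.5]

for t in (0.15, 0.35, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the highest-scoring token, which is why 0.15 behaves close to greedy decoding while 0.35 leaves somewhat more room for alternative wordings.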

In [22]:
all_categories = df["meta_category"].unique()

# Start from an empty results dict, then generate both variants for every category
results = {}
for category in all_categories:
    print(f"Generating for: {category}")
    results[category] = {
        "controlled": generate_article(category, temperature=0.15),
        "creative": generate_article(category, temperature=0.35)
    }

Generating for: Fire Tablets
Generating for: Kindle E-Readers
Generating for: Fire Kids Edition
Generating for: Echo & Smart Speakers
Generating for: Accessories & Other
Generating for: Fire TV & Streaming


## Persisting Results

Generated articles are saved to:
outputs/models/generated_blogposts_v2.txt

Persisting the articles keeps the evaluation reproducible without re-running generation and places the outputs under version control.


In [26]:
# Create the output directory for generated articles [Colab environment]
os.makedirs("/content/repo/outputs/models/", exist_ok=True)

output_path = "/content/repo/outputs/models/generated_blogposts_v2.txt"

with open(output_path, "w", encoding="utf-8") as f:
    for category, versions in results.items():
        f.write("="*80 + "\n")
        f.write(f"CATEGORY: {category}\n")
        f.write("="*80 + "\n\n")

        f.write("---- CONTROLLED VERSION ----\n\n")
        f.write(versions["controlled"] + "\n\n")

        f.write("---- CREATIVE VERSION ----\n\n")
        f.write(versions["creative"] + "\n\n\n")

print(f"Saved to {output_path}")


Saved to /content/repo/outputs/models/generated_blogposts_v2.txt


In [27]:
# Generation statistics
print("\n" + "=" * 80)
print("GENERATION SUMMARY")
print("=" * 80)

for category, versions in results.items():
    controlled_len = len(versions['controlled'].split())
    creative_len = len(versions['creative'].split())

    print(f"\n{category}:")
    print(f"  Controlled: {controlled_len} words")
    print(f"  Creative: {creative_len} words")

print("=" * 80)


GENERATION SUMMARY

Fire Tablets:
  Controlled: 343 words
  Creative: 407 words

Kindle E-Readers:
  Controlled: 392 words
  Creative: 402 words

Fire Kids Edition:
  Controlled: 361 words
  Creative: 291 words

Echo & Smart Speakers:
  Controlled: 408 words
  Creative: 422 words

Accessories & Other:
  Controlled: 422 words
  Creative: 433 words

Fire TV & Streaming:
  Controlled: 311 words
  Creative: 382 words


## 7. Evaluation

In [28]:
def build_extractive_baseline(category, top_n_reviews=5, max_words=400):
    """
    Simple extractive baseline:
    - Select top N longest reviews in the category
    - Concatenate and truncate to max_words
    """
    reviews = df[df["meta_category"] == category]["review_text"].dropna()

    # Sort by length (descending)
    reviews = sorted(reviews, key=lambda x: len(x), reverse=True)

    selected = " ".join(reviews[:top_n_reviews])

    words = selected.split()

    return " ".join(words[:max_words])


In [30]:
def parse_generated_file(filepath):
    results = {}

    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    # Split by CATEGORY blocks
    category_blocks = re.split(r"={5,}\nCATEGORY:\s*", content)

    for block in category_blocks:
        if not block.strip():
            continue

        # First line is category name
        lines = block.strip().split("\n")
        category = lines[0].strip()

        # Extract controlled and creative versions
        controlled_match = re.search(
            r"---- CONTROLLED VERSION ----\n\n(.*?)\n\n---- CREATIVE VERSION ----",
            block,
            re.DOTALL
        )

        creative_match = re.search(
            r"---- CREATIVE VERSION ----\n\n(.*)",
            block,
            re.DOTALL
        )

        controlled_text = controlled_match.group(1).strip() if controlled_match else ""
        creative_text = creative_match.group(1).strip() if creative_match else ""

        results[category] = {
            "controlled": controlled_text,
            "creative": creative_text
        }

    return results


In [31]:
# Load the saved articles so that evaluation does not require re-running generation

filepath = "/content/repo/outputs/models/generated_blogposts_v2.txt"

results = parse_generated_file(filepath)

print(results.keys())

dict_keys(['Fire Tablets', 'Kindle E-Readers', 'Fire Kids Edition', 'Echo & Smart Speakers', 'Accessories & Other', 'Fire TV & Streaming'])


In [32]:
evaluation_results = []

for category in results.keys():

    print(f"Evaluating: {category}")

    # Extractive pseudo-reference
    reference = build_extractive_baseline(category)

    # Generated text
    generated = results[category]["controlled"]

    # ROUGE
    rouge_scores = compute_rouge(reference, generated)

    # BERTScore
    bert_f1 = compute_bertscore(reference, generated)

    # Compression ratio
    source_text = build_category_text(category, top_n_products=3)
    compression = compute_compression_ratio(source_text, generated)

    # Grounding Ratio
    grounding = compute_grounding_ratio(source_text, generated)

    evaluation_results.append({
        "category": category,
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "bertscore_f1": bert_f1,
        "compression_ratio": compression,
        "grounding_ratio": grounding
    })


Evaluating: Fire Tablets


RobertaModel LOAD REPORT from: roberta-large
Key                             | Status     | 
--------------------------------+------------+-
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
pooler.dense.bias               | MISSING    | 
pooler.dense.weight             | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Evaluating: Kindle E-Readers



Evaluating: Fire Kids Edition



Evaluating: Echo & Smart Speakers



Evaluating: Accessories & Other



Evaluating: Fire TV & Streaming



In [33]:
eval_df = pd.DataFrame(evaluation_results)

eval_df

| | category | rouge1 | rouge2 | rougeL | bertscore_f1 | compression_ratio | grounding_ratio |
|---|---|---|---|---|---|---|---|
| 0 | Fire Tablets | 0.3 | 0.029024 | 0.121053 | 0.797683 | 0.199883 | 0.492308 |
| 1 | Kindle E-Readers | 0.324786 | 0.026928 | 0.126984 | 0.814341 | 0.278211 | 0.401826 |
| 2 | Fire Kids Edition | 0.378517 | 0.048718 | 0.158568 | 0.821574 | 0.147407 | 0.556122 |
| 3 | Echo & Smart Speakers | 0.375 | 0.036855 | 0.14951 | 0.824447 | 0.211289 | 0.542222 |
| 4 | Accessories & Other | 0.347222 | 0.039443 | 0.136574 | 0.79932 | 0.191123 | 0.495575 |
| 5 | Fire TV & Streaming | 0.329635 | 0.079491 | 0.133122 | 0.829529 | 0.928358 | 0.285714 |


In [34]:
eval_df.mean(numeric_only=True)

| Metric | Mean |
|---|---|
| rouge1 | 0.342527 |
| rouge2 | 0.04341 |
| rougeL | 0.137635 |
| bertscore_f1 | 0.814482 |
| compression_ratio | 0.326045 |
| grounding_ratio | 0.462295 |


In [35]:
# Compare controlled vs creative
comparison_data = []

for category in results.keys():
    controlled_text = results[category]['controlled']
    creative_text = results[category]['creative']

    comparison_data.append({
        'category': category,
        'controlled_words': len(controlled_text.split()),
        'creative_words': len(creative_text.split()),
        'controlled_unique_words': len(set(controlled_text.lower().split())),
        'creative_unique_words': len(set(creative_text.lower().split()))
    })

comp_df = pd.DataFrame(comparison_data)
comp_df['creative_richness'] = (
    comp_df['creative_unique_words'] / comp_df['creative_words']
)
comp_df['controlled_richness'] = (
    comp_df['controlled_unique_words'] / comp_df['controlled_words']
)

print("Decoding Strategy Comparison:")
display(comp_df[['category', 'controlled_richness', 'creative_richness']])

Decoding Strategy Comparison:


| | category | controlled_richness | creative_richness |
|---|---|---|---|
| 0 | Fire Tablets | 0.594752 | 0.58231 |
| 1 | Kindle E-Readers | 0.589286 | 0.574627 |
| 2 | Fire Kids Edition | 0.578947 | 0.646048 |
| 3 | Echo & Smart Speakers | 0.580882 | 0.594787 |
| 4 | Accessories & Other | 0.563981 | 0.489607 |
| 5 | Fire TV & Streaming | 0.646302 | 0.502618 |


## Evaluation & Observations

### Model Behavior & Prompting

Compared to TinyLlama, LLaMA 3.2-3B produced more coherent, structurally consistent, and grounded summaries. Early prompt iterations introduced competitor brands and generic phrasing; tightening instructions and lowering temperature significantly reduced this behavior.

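Controlled decoding (low temperature) yielded more stable and grounded outputs. Higher-temperature decoding was intended to add stylistic variety at the cost of a slight drift risk, although the lexical richness comparison in the Decoding Strategy Analysis section shows mixed results.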

### Quantitative Evaluation

#### Metrics Comparison: V1 vs V2

| Metric | V1 (Original) | V2 (After Fix) | Change |
|--------|---------------|----------------|--------|
| ROUGE-1 | 0.335 | 0.343 | +2.4% ↑ |
| ROUGE-2 | 0.046 | 0.043 | -6.5% ↓ |
| ROUGE-L | 0.143 | 0.138 | -3.5% ↓ |
| BERTScore | 0.811 | 0.814 | +0.4% ↑ |
| Compression | 0.373 | 0.326 | -12.6% (tighter) |
| Grounding | 0.532 | 0.462 | -13.2% ↓ |

Because no human-written reference summaries were available, extractive baselines were constructed from the longest reviews per category. Generated summaries were evaluated using ROUGE and BERTScore.

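The ROUGE-1 difference between versions (+2.4%) reflects slightly better unigram overlap with the extractive reference in V2, while ROUGE-2 and ROUGE-L fell slightly (-6.5% and -3.5%). With a single generation per category over six categories, changes of this size may not be meaningful.
BERTScore F1 ≈ 0.81 suggests reasonable semantic alignment between the generated summaries and the extractive references. Because the score was computed without baseline rescaling, raw values tend to fall in a narrow high range, so it is more useful for comparing versions than as an absolute measure.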
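
To clarify what the ROUGE-1 figures measure, here is a minimal sketch of a unigram-overlap F-measure. It is a simplification and an assumption on my part: the `rouge_score` scorer used in this notebook applies stemming and its own tokenizer, which this sketch omits, so values will not match exactly. `rouge1_f` is a hypothetical helper name.

```python
from collections import Counter
import re


def rouge1_f(reference, generated):
    """Unigram-overlap F-measure (no stemming; simplified tokenization)."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    gen_counts = Counter(re.findall(r"[a-z0-9]+", generated.lower()))

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & gen_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings, not taken from the dataset
reference = "the tablet is great for kids and the battery lasts all day"
generated = "customers say the tablet is great for kids"
print(round(rouge1_f(reference, generated), 3))  # 6 shared words: P=0.75, R=0.5 -> 0.6
```

Because the reference here is itself a concatenation of raw reviews, a moderate ROUGE-1 mostly indicates shared vocabulary with the longest reviews rather than agreement with an ideal summary.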

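The average compression ratio fell from 0.37 in V1 to 0.33 in V2 (0.373 vs 0.326), meaning V2 articles are shorter relative to their source text. Both values reflect substantial condensation of multi-review inputs into recommendation-style articles. The V2 mean is pulled upward by Fire TV & Streaming (0.93), a category with only 6 reviews; the other five categories range from 0.15 to 0.28.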
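
The influence of the smallest category on the mean can be checked directly from the per-category values in the evaluation table. The sketch below copies the controlled-version compression ratios and averages them with and without Fire TV & Streaming, which has only 6 reviews and a ratio close to 1.

```python
from statistics import mean

# Controlled-version compression ratios copied from the evaluation table
compression = {
    "Fire Tablets": 0.199883,
    "Kindle E-Readers": 0.278211,
    "Fire Kids Edition": 0.147407,
    "Echo & Smart Speakers": 0.211289,
    "Accessories & Other": 0.191123,
    "Fire TV & Streaming": 0.928358,
}

overall = mean(compression.values())
without_outlier = mean(
    ratio for name, ratio in compression.items() if name != "Fire TV & Streaming"
)

print(f"Mean over all six categories: {overall:.3f}")         # 0.326
print(f"Mean without Fire TV & Streaming: {without_outlier:.3f}")  # 0.206
```

Reporting a median, or the mean alongside the per-category values, would make the summary figure less sensitive to a single sparse category.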

### Grounding & Hallucination Proxy

Lexical grounding (overlap between generated and source vocabulary) averaged ≈0.53 on V1: over half of the generated vocabulary was directly anchored in review text, with the remainder largely attributable to narrative framing and recommendation language. In V1, mentions of "Amazon Echo" products appeared throughout the outputs, which artificially inflated the overlap. In V2 the grounding ratio averages ≈0.46, which is consistent with the model paraphrasing in article language rather than copying review vocabulary, as expected of **abstractive summarization**. Lexical overlap is only a proxy, though: a lower ratio on its own cannot distinguish paraphrase from unsupported content.
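
One way to make the grounding proxy less sensitive to function words is to compare content words only. The sketch below is an illustrative variant, not the metric used for the numbers above: `STOPWORDS` is a deliberately tiny, made-up list, and `content_tokens` and `content_grounding_ratio` are hypothetical names that mirror the notebook's `tokenize` and `compute_grounding_ratio` helpers.

```python
import re

# Small illustrative stopword list (an assumption; a fuller list would be used in practice)
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "for", "in", "is", "it",
    "of", "on", "that", "the", "this", "to", "with",
}


def content_tokens(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a set of words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return {word for word in cleaned.split() if word not in STOPWORDS}


def content_grounding_ratio(source_text, generated_text):
    """Share of generated content words that also occur in the source text."""
    generated = content_tokens(generated_text)
    if not generated:
        return 0.0
    return len(generated & content_tokens(source_text)) / len(generated)


# Toy strings, not taken from the dataset
source = "The setup is easy and the sound is great for music."
summary = "Reviewers praise the easy setup and the sound quality."
print(content_grounding_ratio(source, summary))  # 3 of 6 content words grounded -> 0.5
```

The ungrounded words in the toy summary ("reviewers", "praise", "quality") are framing rather than invented facts, which illustrates why a vocabulary-overlap score should be read alongside a manual check.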

## Decoding Strategy Analysis

Lexical richness (unique words / total words) per category:

```
Category                  Controlled  Creative
Fire Tablets              0.595       0.582
Kindle E-Readers          0.589       0.575
Fire Kids Edition         0.579       0.646
Echo & Smart Speakers     0.581       0.595
Accessories & Other       0.564       0.490
Fire TV & Streaming       0.646       0.503
```
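- Fire Kids Edition: the creative version is more diverse (0.646 > 0.579)
- Fire TV & Streaming: the controlled version is more diverse (0.646 > 0.503)

The creative variant is lexically richer in only two of the six categories (Fire Kids Edition and Echo & Smart Speakers). With a single sample per setting, these figures do not show a consistent effect of raising the temperature from 0.15 to 0.35 on vocabulary diversity.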
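
One caveat when reading these ratios: unique-word share generally falls as a text gets longer, and the creative articles were longer than the controlled ones in five of the six categories. A length-matched comparison, sketched below, is one way to reduce that effect. `richness` and `matched_length_richness` are hypothetical helpers, and truncating to the shorter text's length is an assumption about how one might control for length, not the method used for the table above.

```python
def richness(text, max_words=None):
    """Unique-word share of a text, optionally over its first max_words words only."""
    words = text.lower().split()
    if max_words is not None:
        words = words[:max_words]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def matched_length_richness(text_a, text_b):
    """Compare two texts over the same number of words (the shorter text's length)."""
    shared_len = min(len(text_a.split()), len(text_b.split()))
    return richness(text_a, shared_len), richness(text_b, shared_len)


# Toy strings, not taken from the generated articles
short_text = "easy setup great sound"
long_text = "easy setup great sound and great sound again"
print(richness(short_text), richness(long_text))       # 1.0 0.75
print(matched_length_richness(short_text, long_text))  # (1.0, 1.0)
```

In the toy case the longer text only looks less diverse because it is longer; over the same number of words the two are identical.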

### Limitations

Generation quality remains sensitive to dataset noise and cross-category review contamination. Automated metrics provide directional insight but do not fully capture coherence, recommendation strength, or factual faithfulness.


## Final Conclusions

This notebook demonstrates that LLM-based category summarization is feasible using constrained prompting and controlled decoding. Compared to TinyLlama, LLaMA produced more coherent, structurally consistent, and better-grounded summaries.

Dataset noise (including cross-product review contamination) required product-level aggregation to stabilize generation. Stricter prompt constraints and low-temperature decoding further reduced drift and competitor hallucinations.

Overall, LLM-based summarization generated meaningful, interpretable category insights with manageable hallucination risk under structured prompting.

