# 04 - LLM-Based Review Summarization

## 1. Intro & Objectives
- Generate category-level blog-style summaries from Amazon product reviews using an instruction-tuned LLM (LLaMA 3.2-3B).

This notebook:
- Aggregates reviews at product level
- Generates structured blog articles per meta-category
- Compares decoding strategies (controlled vs creative)
- Analyzes hallucination and data noise effects


## 2. Environment Setup
- GPU: Colab L4
- Model: LLaMA 3.2-3B-Instruct
- Inference only (no fine-tuning)


In [None]:
# Environment Setup (First Run Only)

# !pip install transformers
# !pip install torch
# !pip install rouge-score
# !pip install bert-score
# !pip install sentencepiece
# !pip install accelerate

In [2]:
# Imports

import os
import subprocess
import re
import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

from rouge_score import rouge_scorer
from bert_score import score as bertscore

In [3]:
# GPU check
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))


CUDA available: True
Device: NVIDIA L4


## 3. Helper Functions

In [4]:
def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return set(text.split())


In [5]:
def compute_grounding_ratio(source_text, generated_text):
    source_tokens = tokenize(source_text)
    generated_tokens = tokenize(generated_text)

    overlap = generated_tokens.intersection(source_tokens)

    if len(generated_tokens) == 0:
        return 0

    return len(overlap) / len(generated_tokens)
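
To make the grounding ratio concrete, here is a small self-contained sketch that re-declares the two helpers above and runs them on toy sentences. The sentences are hypothetical, not taken from the dataset. The ratio is computed over unique tokens, so common words such as "the" count as grounded even when the claim itself is unsupported.

```python
import re

def tokenize(text):
    # Lowercase, drop punctuation, return the set of unique tokens
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return set(text.split())

def grounding_ratio(source_text, generated_text):
    # Share of unique generated tokens that also occur in the source
    source_tokens = tokenize(source_text)
    generated_tokens = tokenize(generated_text)
    if not generated_tokens:
        return 0.0
    return len(generated_tokens & source_tokens) / len(generated_tokens)

# Hypothetical toy inputs
source = "The battery lasts all day and the screen is bright."
grounded = "The screen is bright and the battery lasts."
ungrounded = "The camera takes stunning photos."

print(grounding_ratio(source, grounded))    # 1.0
print(grounding_ratio(source, ungrounded))  # 0.2 (only "the" overlaps)
```

The second result shows why this metric is only a proxy: a sentence about a feature the reviews never mention still scores above zero through stopword overlap.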

In [6]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def compute_rouge(reference, generated):
    scores = scorer.score(reference, generated)

    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure
    }
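
For intuition about what the ROUGE-1 F-measure captures, the sketch below computes unigram overlap by hand. It is a simplification: it uses whitespace tokenization and no stemming, so its numbers will not match `rouge_score` exactly. The function name `rouge1_f` and the example strings are hypothetical.

```python
from collections import Counter

def rouge1_f(reference, generated):
    # Clipped unigram overlap between reference and generated text
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# 4 of 4 generated words match (precision 1.0), 4 of 6 reference words are covered (recall 0.67)
print(round(rouge1_f("the tablet is fast and cheap", "the tablet is cheap"), 3))
```

Because an abstractive article paraphrases rather than copies, its unigram overlap with an extractive reference is expected to be modest.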

In [7]:
def compute_bertscore(reference, generated):
    P, R, F1 = bertscore(
        [generated],
        [reference],
        lang="en",
        rescale_with_baseline=False   # report raw scores (no baseline rescaling)
    )

    return float(F1.mean())
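
The functional `score` API loads the underlying model on each call. One possible alternative, sketched here under assumptions, is to build a single `bert_score.BERTScorer` and reuse it for every category. `make_bertscore_fn`, `shared_scorer`, and `compute_bertscore_cached` are hypothetical names; the wrapper only relies on the scorer exposing `score(candidates, references)`.

```python
def make_bertscore_fn(scorer):
    """Wrap a preloaded scorer into a per-pair F1 function."""
    def compute(reference, generated):
        # BERTScorer.score takes candidates first, then references
        P, R, F1 = scorer.score([generated], [reference])
        return float(F1.mean())
    return compute

# Assumed usage (downloads roberta-large on first use):
# from bert_score import BERTScorer
# shared_scorer = BERTScorer(lang="en", rescale_with_baseline=False)
# compute_bertscore_cached = make_bertscore_fn(shared_scorer)
# compute_bertscore_cached(reference_text, generated_text)
```

With this shape the model is loaded once when the scorer is created rather than once per evaluated category.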

In [8]:
def compute_compression_ratio(source_text, generated_text):
    source_len = len(source_text.split())
    generated_len = len(generated_text.split())

    # Guard against an empty source to avoid ZeroDivisionError
    if source_len == 0:
        return 0.0

    return generated_len / source_len

## 4. Data Preparation

1. Load processed electronics dataset
2. Re-apply meta-category assignment (for reproducibility)
3. Aggregate reviews at product level to reduce cross-product noise


In [9]:
REPO = "https://github.com/marcosfsousa/project-ironhack-automated-customer-reviews.git"

if not os.path.exists("/content/repo"):
    subprocess.run(["git", "clone", REPO, "/content/repo"], check=True)
    print("Repo cloned.")
else:
    subprocess.run(["git", "-C", "/content/repo", "pull"], check=True)
    print("Repo updated.")

DATA_PATH = "/content/repo/data/processed/electronics_ready.csv"
print(f"File exists: {os.path.exists(DATA_PATH)}")


Repo cloned.
File exists: True


In [10]:
# Local filesystem path
# df = pd.read_csv("../data/processed/electronics_ready.csv")

# Colab-only path
df = pd.read_csv(DATA_PATH)
print(df.shape)
df.head()

(30487, 4)


| | name | brand | rating | review_text |
|---|---|---|---|---|
| 0 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | This product so far has not disappointed. My c... |
| 1 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | great for beginner or experienced person. Boug... |
| 2 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | Inexpensive tablet for him to use and learn on... |
| 3 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 4 | I've had my Fire HD 8 two weeks now and I love... |
| 4 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | I bought this for my grand daughter when she c... |


### Category Assignment

In [33]:
# Assign Meta Categories

def assign_meta_category(name):
    """
    Assign product to one of 5 meta-categories.
    Returns None for products that don't fit any category.
    """
    if pd.isna(name):
        return None

    name_lower = name.lower()

    # Kids must come before general Tablets
    if any(kw in name_lower for kw in ["kids edition", "kid-proof"]):
        return "Fire Kids Edition"

    elif any(kw in name_lower for kw in ["fire tablet", "fire hd", "fire 7",
                                          "fire 8", "fire 10", " fire ", "tablet"]):
        return "Fire Tablets"

    elif any(kw in name_lower for kw in ["kindle", "e-reader", "ebook",
                                          "paperwhite", "voyage", "oasis"]):
        return "Kindle E-Readers"

    elif any(kw in name_lower for kw in ["echo", "tap", "alexa"]):
        return "Echo & Smart Speakers"

    elif any(kw in name_lower for kw in ["fire tv", "firetv", "streaming",
                                          "media player"]):
        return "Fire TV & Streaming"

    else:
        # Explicitly return None for products that don't match
        # These will be filtered out at aggregation step
        return None

# Apply to full dataset
df["meta_category"] = df["name"].apply(assign_meta_category)

print("Meta-category distribution (before filtering):")
print(df["meta_category"].value_counts(dropna=False))


Meta-category distribution (before filtering):
meta_category
Fire Tablets             19739
Echo & Smart Speakers     4262
Kindle E-Readers          4231
Fire Kids Edition         2191
None                        58
Fire TV & Streaming          6
Name: count, dtype: int64
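
Because the rules are substring matches checked in a fixed order, a name is claimed by the first branch that matches, even when a later branch fits better. The sketch below replays the same keyword lists and ordering on a few hypothetical product names (not taken from the dataset) and also lists every branch that would have matched. `first_match` and `all_matches` are illustrative helpers, not part of the notebook.

```python
# Same keyword lists and order as assign_meta_category
RULES = [
    ("Fire Kids Edition", ["kids edition", "kid-proof"]),
    ("Fire Tablets", ["fire tablet", "fire hd", "fire 7", "fire 8",
                      "fire 10", " fire ", "tablet"]),
    ("Kindle E-Readers", ["kindle", "e-reader", "ebook",
                          "paperwhite", "voyage", "oasis"]),
    ("Echo & Smart Speakers", ["echo", "tap", "alexa"]),
    ("Fire TV & Streaming", ["fire tv", "firetv", "streaming", "media player"]),
]

def first_match(name, rules=RULES):
    # First-match-wins, mirroring the if/elif chain
    name_lower = name.lower()
    for category, keywords in rules:
        if any(kw in name_lower for kw in keywords):
            return category
    return None

def all_matches(name, rules=RULES):
    # Every category whose keywords appear in the name
    name_lower = name.lower()
    return [cat for cat, kws in rules if any(kw in name_lower for kw in kws)]

# Hypothetical product names for illustration only
for name in ["Amazon Fire TV Stick", "Fire TV with Alexa Voice Remote", "Kindle Paperwhite"]:
    print(f"{name} -> {first_match(name)} | all matches: {all_matches(name)}")
```

In this replay, a name containing " fire " is assigned to Fire Tablets before the Fire TV branch is reached, and a name containing "alexa" goes to Echo & Smart Speakers. This ordering effect may contribute to the very small Fire TV & Streaming count, although that would need checking against the actual product names.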


In [35]:
# Remove None reviews BEFORE aggregation
df = df[df["meta_category"].notna()]

print("\n✅ Meta-category distribution (after filtering):")
print(df["meta_category"].value_counts())
print(f"Reviews remaining: {len(df)}")


✅ Meta-category distribution (after filtering):
meta_category
Fire Tablets             19739
Echo & Smart Speakers     4262
Kindle E-Readers          4231
Fire Kids Edition         2191
Fire TV & Streaming          6
Name: count, dtype: int64
Reviews remaining: 30429


### Why Aggregate at Product Level?

Initial generation attempts revealed review contamination across categories.
To reduce noise, we:
- First aggregate reviews per product
- Then concatenate the first N products per category (N = 3 in this notebook)

In [38]:
# Aggregate reviews by product and category
product_df = df.groupby("name").agg({
    "brand": "first",
    "rating": "mean",
    "review_text": " ".join,
    "meta_category": "first"
}).reset_index()

print(f"Products: {len(product_df)}")
print("\nProduct category distribution:")
print(product_df["meta_category"].value_counts())

# Should now show NO None values!

Products: 71

Product category distribution:
meta_category
Fire Tablets             31
Kindle E-Readers         22
Echo & Smart Speakers    12
Fire Kids Edition         5
Fire TV & Streaming       1
Name: count, dtype: int64


## 5. Model Setup

In [57]:
#MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # Public access model used while gated model authorization was being reviewed

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct" # Larger instruction-tuned model used for the summaries in this notebook

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,
    device_map="auto"
)

print("LLaMA model loaded successfully.")


LLaMA model loaded successfully.


In [58]:
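def build_category_text(category, top_n_products=3):
    # Takes the first N products in product_df order (alphabetical by name
    # after the groupby), then joins their concatenated reviews.
    products = (
        product_df[product_df["meta_category"] == category]
        .head(top_n_products)
    )
    return " ".join(products["review_text"])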

## Prompt Engineering Strategy

Goals:
- Minimize hallucination
- Avoid competitor mentions
- Prevent generic marketing language
- Force grounded, review-based summarization


In [59]:
def build_prompt(category, review_text):
    return f"""
<|system|>
You are a critical but fair technology journalist writing product recommendations.
Only use information explicitly stated in the provided customer reviews.
Do NOT mention products from other categories or competitor brands.
Write in continuous paragraphs only.

<|user|>
Using only the information in the reviews below, write a 350–500 word
RECOMMENDATION ARTICLE about {category}.

Your article must:
1. Identify the main strengths customers love about {category} products
2. Highlight recurring complaints and limitations
3. Explain which types of customers {category} products are BEST suited for
4. Give your VERDICT: When is {category} the BEST CHOICE vs alternatives?
5. End with a clear recommendation:
   - "Choose {category} if you..." (specific use cases)
   - "Look elsewhere if you..." (when it's NOT the best choice)

Write as a reviewer helping customers decide if {category} is the RIGHT choice
for their needs. Be opinionated but fair. Ground everything in the reviews.

Category: {category}

Customer Reviews:
{review_text[:2500]}

<|assistant|>
"""

In [60]:
# Display the actual prompt being used
print("=" * 80)
print("PROMPT TEMPLATE EXAMPLE")
print("=" * 80)
sample_reviews = "Sample review text here..."
sample_prompt = build_prompt("Fire Tablets", sample_reviews)
print(sample_prompt)
print("=" * 80)

PROMPT TEMPLATE EXAMPLE

<|system|>
You are a critical but fair technology journalist writing product recommendations.
Only use information explicitly stated in the provided customer reviews.
Do NOT mention products from other categories or competitor brands.
Write in continuous paragraphs only.

<|user|>
Using only the information in the reviews below, write a 350–500 word 
RECOMMENDATION ARTICLE about Fire Tablets.

Your article must:
1. Identify the main strengths customers love about Fire Tablets products
2. Highlight recurring complaints and limitations
3. Explain which types of customers Fire Tablets products are BEST suited for
4. Give your VERDICT: When is Fire Tablets the BEST CHOICE vs alternatives?
5. End with a clear recommendation:
   - "Choose Fire Tablets if you..." (specific use cases)
   - "Look elsewhere if you..." (when it's NOT the best choice)

Write as a reviewer helping customers decide if Fire Tablets is the RIGHT choice 
for their needs. Be opinionated but fair
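
The `<|system|>` / `<|user|>` / `<|assistant|>` markers in the prompt follow the chat layout used by TinyLlama. Llama 3 Instruct checkpoints ship their own chat template with the tokenizer, so one alternative is to build role-tagged messages and let `tokenizer.apply_chat_template` render the model-specific markers. The sketch below is a minimal version under that assumption: `build_messages` and `render_prompt` are hypothetical helpers, and the instruction text is abbreviated compared with the full prompt.

```python
def build_messages(category, review_text, max_chars=2500):
    # Abbreviated version of the instructions used in build_prompt
    system = (
        "You are a critical but fair technology journalist writing product "
        "recommendations. Only use information explicitly stated in the "
        "provided customer reviews."
    )
    user = (
        f"Using only the information in the reviews below, write a 350-500 word "
        f"RECOMMENDATION ARTICLE about {category}.\n\n"
        f"Customer Reviews:\n{review_text[:max_chars]}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def render_prompt(tokenizer, messages):
    # apply_chat_template inserts the model-specific role markers;
    # add_generation_prompt=True appends the assistant header so
    # generation starts at the assistant turn.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

messages = build_messages("Fire Tablets", "Sample review text here...")
print(messages[1]["content"])
```

Whether this changes output quality for the articles in this notebook has not been tested here; it only shows how the same instructions could be expressed in the tokenizer's native format.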

## 6. Text Generation

- Start by sampling a single category so the prompt can be refined before generating all categories


In [61]:
# Generation Cell - Sampling with Single Category

category = "Echo & Smart Speakers"

echo_text = build_category_text(category, top_n_products=3)

prompt = build_prompt(category, echo_text)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=600,
    temperature=0.15,
    repetition_penalty=1.1,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Remove the prompt by character offset (can clip the start of the article if decoding alters the prompt text)
generated_text = generated_text[len(prompt):]

print(generated_text.strip())

As a tech-savvy individual, I've had the pleasure of exploring various Echo & Smart Speakers products, and I'm excited to share my findings with you. After scouring through numerous customer reviews, I've identified some key strengths and weaknesses of these devices.

One of the primary advantages that customers rave about is the ease of setup and user-friendliness. Many reviewers praised the intuitive interface, making it simple to navigate and control the device. Music enthusiasts also appreciate the high-quality sound and seamless integration with popular streaming services. Additionally, the ability to control multiple smart home devices with voice commands is a significant draw for those looking to streamline their living spaces.

However, some customers have expressed frustration with the limited song selection and potential misinterpretation of accents and pronunciations. A few reviewers also noted that the device can be finicky when it comes to connecting to certain smartphones

In [None]:
# def filter_competitor_mentions(text):
#     """Flag or remove competitor brand mentions."""
#     competitors = [
#         'CASE LOGIC', 'Samsung', 'Apple', 'Logitech',
#         'Belkin', 'Anker', 'Sony', 'LG', 'Microsoft',
#         'Google','Google Home', 'HP', 'Dell'
#     ]

#     found_competitors = []
#     for comp in competitors:
#         if comp.lower() in text.lower():
#             found_competitors.append(comp)
#             # Replace with generic term
#             text = re.sub(
#                 f'\\b{re.escape(comp)}\\b',
#                 '[competitor product]',
#                 text,
#                 flags=re.IGNORECASE
#             )

#     if found_competitors:
#         print(f"⚠️ WARNING: Removed competitor mentions: {found_competitors}")

#     return text

We generate two variants per category:
- Controlled (low temperature)
- Creative (higher temperature)

In [62]:
def generate_article(category, temperature=0.15):
    category_text = build_category_text(category, top_n_products=3)
    prompt = build_prompt(category, category_text)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=600,  # upper bound on generated tokens
        temperature=temperature,
        top_p=0.9,
        repetition_penalty=1.1,  # discourages repeated phrases such as "I would recommend"
        do_sample=True,  # sampling must be on for temperature and top_p to apply
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id  # stop when the end-of-sequence token is generated
    )

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Strip the prompt: keep only the text after the last assistant marker,
    # which avoids clipping the start of the article
    if "<|assistant|>" in generated_text:
        generated_text = generated_text.split("<|assistant|>")[-1].strip()
    else:
        # Fallback: drop the prompt by character offset
        generated_text = generated_text[len(prompt):].strip()

    return generated_text
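
An alternative to string-based prompt removal is to slice at the token level: for decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so decoding only the tail avoids any dependence on how the prompt decodes back to text. This is a sketch, not what the notebook ran; `decode_new_tokens` is a hypothetical helper.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_token_count):
    """Decode only the tokens generated after the prompt."""
    new_token_ids = output_ids[prompt_token_count:]
    return tokenizer.decode(new_token_ids, skip_special_tokens=True).strip()

# Assumed usage inside generate_article:
# prompt_token_count = inputs["input_ids"].shape[1]
# generated_text = decode_new_tokens(tokenizer, outputs[0], prompt_token_count)
```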


## Generating Articles for All Categories

Temperature = 0.15 → more factual, grounded  
Temperature = 0.35 → more narrative, expressive  
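
To illustrate why a lower temperature gives more conservative text, the sketch below applies temperature scaling to a softmax over a few made-up next-token logits. The logit values are hypothetical, and the sketch ignores `top_p` and `repetition_penalty`, which the generation call also applies.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits

for t in (0.15, 0.35, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the highest-scoring token, so sampling at 0.15 picks the top candidate more often than sampling at 0.35.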

In [63]:
all_categories = df["meta_category"].unique()

# Reset results before regenerating every category
results = {}
for category in all_categories:
    print(f"Generating for: {category}")
    results[category] = {
        "controlled": generate_article(category, temperature=0.15),
        "creative": generate_article(category, temperature=0.35)
    }

Generating for: Fire Tablets
Generating for: Kindle E-Readers
Generating for: Fire Kids Edition
Generating for: Echo & Smart Speakers
Generating for: Fire TV & Streaming


## Persisting Results

Generated articles are saved to:
outputs/models/generated_blogposts_v3_no_early_clip.txt

Writing them to disk lets the evaluation cells reuse the text without regenerating it and keeps the outputs under version control.


In [64]:
# Create the output directory for generated articles [Colab environment]
os.makedirs("/content/repo/outputs/models/", exist_ok=True)

output_path = "/content/repo/outputs/models/generated_blogposts_v3_no_early_clip.txt"

with open(output_path, "w", encoding="utf-8") as f:
    for category, versions in results.items():
        f.write("="*80 + "\n")
        f.write(f"CATEGORY: {category}\n")
        f.write("="*80 + "\n\n")

        f.write("---- CONTROLLED VERSION ----\n\n")
        f.write(versions["controlled"] + "\n\n")

        f.write("---- CREATIVE VERSION ----\n\n")
        f.write(versions["creative"] + "\n\n\n")

print(f"Saved to {output_path}")


Saved to /content/repo/outputs/models/generated_blogposts_v3_no_early_clip.txt


In [65]:
# Quick quality check
with open(output_path, "r", encoding="utf-8") as f:
    content = f.read()

# Should be 5 categories
print("Categories:", content.count('CATEGORY:'))

# Check for quality markers
print("Verdict mentions:", content.lower().count('verdict'))
print("'Choose' phrases:", content.count('Choose'))

# Check for common repetitive phrases (should be reduced)
repetitive = [
    'I would recommend',
    'great option for',
    'solid choice for'
]
for phrase in repetitive:
    count = content.count(phrase)
    print(f"'{phrase}': {count} times")

Categories: 5
Verdict mentions: 5
'Choose' phrases: 6
'I would recommend': 1 times
'great option for': 0 times
'solid choice for': 4 times


In [66]:
# Generation statistics
print("\n" + "=" * 80)
print("GENERATION SUMMARY")
print("=" * 80)

for category, versions in results.items():
    controlled_len = len(versions['controlled'].split())
    creative_len = len(versions['creative'].split())

    print(f"\n{category}:")
    print(f"  Controlled: {controlled_len} words")
    print(f"  Creative: {creative_len} words")

print("=" * 80)


GENERATION SUMMARY

Fire Tablets:
  Controlled: 335 words
  Creative: 336 words

Kindle E-Readers:
  Controlled: 333 words
  Creative: 368 words

Fire Kids Edition:
  Controlled: 444 words
  Creative: 347 words

Echo & Smart Speakers:
  Controlled: 352 words
  Creative: 510 words

Fire TV & Streaming:
  Controlled: 432 words
  Creative: 429 words


## 7. Evaluation

In [67]:
def build_extractive_baseline(category, top_n_reviews=5, max_words=400):
    """
    Simple extractive baseline:
    - Select top N longest reviews in the category
    - Concatenate and truncate to max_words
    """
    reviews = df[df["meta_category"] == category]["review_text"].dropna()

    # Sort by character length (descending)
    reviews = sorted(reviews, key=len, reverse=True)

    selected = " ".join(reviews[:top_n_reviews])

    words = selected.split()

    return " ".join(words[:max_words])


In [68]:
def parse_generated_file(filepath):
    results = {}

    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    # Split by CATEGORY blocks
    category_blocks = re.split(r"={5,}\nCATEGORY:\s*", content)

    for block in category_blocks:
        if not block.strip():
            continue

        # First line is category name
        lines = block.strip().split("\n")
        category = lines[0].strip()

        # Extract controlled and creative versions
        controlled_match = re.search(
            r"---- CONTROLLED VERSION ----\n\n(.*?)\n\n---- CREATIVE VERSION ----",
            block,
            re.DOTALL
        )

        creative_match = re.search(
            r"---- CREATIVE VERSION ----\n\n(.*)",
            block,
            re.DOTALL
        )

        controlled_text = controlled_match.group(1).strip() if controlled_match else ""
        creative_text = creative_match.group(1).strip() if creative_match else ""

        results[category] = {
            "controlled": controlled_text,
            "creative": creative_text
        }

    return results


In [69]:
# Load articles from the saved text file so evaluation does not require re-running generation

filepath = "/content/repo/outputs/models/generated_blogposts_v3_no_early_clip.txt"

results = parse_generated_file(filepath)

print(results.keys())

dict_keys(['Fire Tablets', 'Kindle E-Readers', 'Fire Kids Edition', 'Echo & Smart Speakers', 'Fire TV & Streaming'])


In [70]:
evaluation_results = []

for category in results.keys():

    print(f"Evaluating: {category}")

    # Extractive pseudo-reference
    reference = build_extractive_baseline(category)

    # Generated text
    generated = results[category]["controlled"]

    # ROUGE
    rouge_scores = compute_rouge(reference, generated)

    # BERTScore
    bert_f1 = compute_bertscore(reference, generated)

    # Compression ratio
    source_text = build_category_text(category, top_n_products=3)
    compression = compute_compression_ratio(source_text, generated)

    # Grounding Ratio
    grounding = compute_grounding_ratio(source_text, generated)

    evaluation_results.append({
        "category": category,
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "bertscore_f1": bert_f1,
        "compression_ratio": compression,
        "grounding_ratio": grounding
    })


Evaluating: Fire Tablets


Loading weights:   0%|          | 0/389 [00:00<?, ?it/s]

RobertaModel LOAD REPORT from: roberta-large
Key                             | Status     | 
--------------------------------+------------+-
lm_head.dense.weight            | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
pooler.dense.weight             | MISSING    | 
pooler.dense.bias               | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Evaluating: Kindle E-Readers




Evaluating: Fire Kids Edition




Evaluating: Echo & Smart Speakers




Evaluating: Fire TV & Streaming




In [71]:
eval_df = pd.DataFrame(evaluation_results)

eval_df

| | category | rouge1 | rouge2 | rougeL | bertscore_f1 | compression_ratio | grounding_ratio |
|---|---|---|---|---|---|---|---|
| 0 | Fire Tablets | 0.26738 | 0.013405 | 0.109626 | 0.793782 | 0.011135 | 0.843575 |
| 1 | Kindle E-Readers | 0.263505 | 0.02642 | 0.105402 | 0.809699 | 0.092449 | 0.516304 |
| 2 | Fire Kids Edition | 0.373853 | 0.043678 | 0.151376 | 0.820962 | 0.030744 | 0.700893 |
| 3 | Echo & Smart Speakers | 0.296689 | 0.023904 | 0.129801 | 0.817762 | 0.03322 | 0.659794 |
| 4 | Fire TV & Streaming | 0.330645 | 0.040431 | 0.115591 | 0.813928 | 1.289552 | 0.27451 |
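
One caveat when reading the `compression_ratio` column: `build_prompt` passes only the first 2,500 characters of the source text to the model, while `compute_compression_ratio` divides by the word count of the full concatenated source. The sketch below contrasts the two denominators on synthetic text. `compression_vs_prompt_window` is a hypothetical helper and the strings are made up.

```python
def compression_vs_prompt_window(source_text, generated_text, max_chars=2500):
    # Ratio relative to the slice of source text the model actually saw
    seen_len = len(source_text[:max_chars].split())
    if seen_len == 0:
        return 0.0
    return len(generated_text.split()) / seen_len

# Synthetic stand-ins: a 10,000-word source and a 300-word article
full_source = "good tablet " * 5000
article = "solid budget tablet " * 100

full_ratio = len(article.split()) / len(full_source.split())
window_ratio = compression_vs_prompt_window(full_source, article)

print(round(full_ratio, 3))    # ratio against the full source
print(round(window_ratio, 3))  # ratio against the 2,500-character window
```

For categories with many reviews, the full-source ratio can be very small even when the article is nearly as long as the text the model was shown, so the two numbers answer different questions.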


In [72]:
eval_df.mean(numeric_only=True)

| metric | mean |
|---|---|
| rouge1 | 0.306414 |
| rouge2 | 0.029568 |
| rougeL | 0.122359 |
| bertscore_f1 | 0.811227 |
| compression_ratio | 0.29142 |
| grounding_ratio | 0.599015 |


In [73]:
# Compare controlled vs creative
comparison_data = []

for category in results.keys():
    controlled_text = results[category]['controlled']
    creative_text = results[category]['creative']

    comparison_data.append({
        'category': category,
        'controlled_words': len(controlled_text.split()),
        'creative_words': len(creative_text.split()),
        'controlled_unique_words': len(set(controlled_text.lower().split())),
        'creative_unique_words': len(set(creative_text.lower().split()))
    })

comp_df = pd.DataFrame(comparison_data)
comp_df['creative_richness'] = (
    comp_df['creative_unique_words'] / comp_df['creative_words']
)
comp_df['controlled_richness'] = (
    comp_df['controlled_unique_words'] / comp_df['controlled_words']
)

print("Decoding Strategy Comparison:")
display(comp_df[['category', 'controlled_richness', 'creative_richness']])

Decoding Strategy Comparison:


| | category | controlled_richness | creative_richness |
|---|---|---|---|
| 0 | Fire Tablets | 0.58209 | 0.58631 |
| 1 | Kindle E-Readers | 0.588589 | 0.589674 |
| 2 | Fire Kids Edition | 0.554054 | 0.579251 |
| 3 | Echo & Smart Speakers | 0.585227 | 0.547059 |
| 4 | Fire TV & Streaming | 0.509259 | 0.538462 |
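
The richness column is a type-token ratio (unique words divided by total words), which tends to fall as a text gets longer because common words repeat. A sketch of a length-matched variant follows; `lexical_richness` and its `max_words` parameter are hypothetical, as are the example strings.

```python
def lexical_richness(text, max_words=None):
    # Unique words / total words, optionally on the first max_words only
    words = text.lower().split()
    if max_words is not None:
        words = words[:max_words]
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Hypothetical examples: the longer text repeats more words
short = "great tablet for kids"
long = "great tablet for kids and great tablet for reading and great value"

print(lexical_richness(short))
print(round(lexical_richness(long), 3))
print(lexical_richness(long, max_words=4))
```

Truncating both variants to the same number of words before computing the ratio is one way to keep article length from driving the comparison.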


## Evaluation & Observations

### Model Behavior & Prompting

Compared to TinyLlama, LLaMA 3.2-3B produced more coherent, structurally consistent, and grounded summaries. Early prompt iterations introduced competitor brands and generic phrasing; tightening instructions and lowering temperature significantly reduced this behavior.

Controlled decoding (low temperature) yielded more stable and grounded outputs, while higher temperature decoding improved stylistic richness at the cost of slight drift risk.

### Quantitative Evaluation

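#### V3 Final Results:
- Categories: 5 (Fire Tablets, Kindle, Fire Kids, Echo, Fire TV)
- BERTScore F1: 0.811 mean (raw, without baseline rescaling; unrescaled scores tend to sit in a narrow high range, so read this as a relative indicator)
- Grounding: 0.599 mean (about 60% of unique generated tokens also appear in the source reviews; range 0.27 to 0.84)
- Compression: 0.291 mean, skewed by Fire TV & Streaming (1.29, where the article is longer than its source); the other four categories fall between 0.01 and 0.09
- ROUGE-1: 0.306 mean (appropriate for abstractive style)
- Average words: ≈389 across both variants per the generation summary (target: 350-500); four articles fall slightly below 350 and one exceeds 500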

Because no human-written reference summaries were available, extractive pseudo-references were built from the five longest reviews per category (truncated to 400 words). The controlled-decoding articles were scored against them with ROUGE and BERTScore.

Across earlier iterations, the ROUGE-1 difference between versions (+2.4%) reflects slightly better lexical alignment in V2.
BERTScore F1 ≈ 0.81 indicates semantic similarity between the generated summaries and the extractive pseudo-references.

In the earlier runs, the average compression ratio of 0.37 showed multi-review inputs being condensed into concise recommendation-style articles. V2 produced shorter articles (326 vs. 373 words on average), lowering the compression ratio to 0.32.

### Grounding & Hallucination Proxy

Lexical grounding (overlap between generated and source vocabulary) averaged ≈0.53 on V1: over half of the generated vocabulary was directly anchored in review text, with the remainder largely attributable to narrative framing and recommendation language. However, mentions of "Amazon Echo" products appeared throughout, which artificially inflated the overlap. Conversely, on V2 the grounding ratio averaged ≈0.46, which suggests the model is writing more naturally in article language rather than copying review vocabulary. This is consistent with **abstractive summarization working correctly**, although lexical overlap alone cannot separate paraphrase from unsupported content.

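### Decoding Strategy Analysis

Lexical richness (unique words / total words) by decoding strategy:

```
Category                Controlled   Creative
Fire Tablets            0.582        0.586
Kindle E-Readers        0.589        0.590
Fire Kids Edition       0.554        0.579
Echo & Smart Speakers   0.585        0.547
Fire TV & Streaming     0.509        0.538
```
Creative decoding yields slightly higher lexical richness in four of the five categories; Echo & Smart Speakers is the exception. The differences are small (at most about 0.04), and because this ratio drops as texts get longer, the Echo creative article (510 words vs. 352 controlled) is not directly comparable. The temperature setting therefore has a measurable but modest effect on vocabulary diversity.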

### Limitations

Generation quality remains sensitive to dataset noise and cross-category review contamination. Automated metrics provide directional insight but do not fully capture coherence, recommendation strength, or factual faithfulness.


## Final Conclusions

This notebook demonstrates that LLM-based category summarization is feasible using constrained prompting and controlled decoding. Compared to TinyLlama, LLaMA produced more coherent, structurally consistent, and better-grounded summaries.

Dataset noise (including cross-product review contamination) required product-level aggregation to stabilize generation. Stricter prompt constraints and low-temperature decoding further reduced drift and competitor hallucinations.

Overall, LLM-based summarization generated meaningful, interpretable category insights with manageable hallucination risk under structured prompting.

