# 04 — LLM-Based Review Summarization

## Objective
Generate category-level blog-style summaries from Amazon product reviews using an instruction-tuned LLM (LLaMA 3.2-3B).

This notebook:
- Aggregates reviews at product level
- Generates structured blog articles per meta-category
- Compares decoding strategies (controlled vs creative)
- Analyzes hallucination and data noise effects


## Environment Setup
- GPU: Colab L4
- Model: LLaMA 3.2-3B-Instruct
- Inference only (no fine-tuning)


In [None]:
# Environment Setup (First Run Only)

# !pip install transformers
# !pip install torch
# !pip install rouge-score
# !pip install bert-score
# !pip install sentencepiece
# !pip install accelerate

In [None]:
# Imports

import os
import subprocess
import re
import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

from rouge_score import rouge_scorer
from bert_score import score as bertscore

In [None]:
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))


CUDA available: True
Device: NVIDIA L4


## Data Preparation

We:
1. Load the processed electronics dataset
2. Re-apply meta-category assignment (for reproducibility)
3. Aggregate reviews at product level to reduce cross-product noise


In [None]:
REPO = "https://github.com/marcosfsousa/project-ironhack-automated-customer-reviews.git"

if not os.path.exists("/content/repo"):
    subprocess.run(["git", "clone", REPO, "/content/repo"], check=True)
    print("Repo cloned.")
else:
    subprocess.run(["git", "-C", "/content/repo", "pull"], check=True)
    print("Repo updated.")

DATA_PATH = "/content/repo/data/processed/electronics_ready.csv"
print(f"File exists: {os.path.exists(DATA_PATH)}")


Repo updated.
File exists: True


In [None]:
df = pd.read_csv(DATA_PATH)  # path defined and checked in the previous cell
print(df.shape)
df.head()


(30487, 4)


|   | name | brand | rating | review_text |
|---|------|-------|--------|-------------|
| 0 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | This product so far has not disappointed. My c... |
| 1 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | great for beginner or experienced person. Boug... |
| 2 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | Inexpensive tablet for him to use and learn on... |
| 3 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 4 | I've had my Fire HD 8 two weeks now and I love... |
| 4 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | 5 | I bought this for my grand daughter when she c... |


In [None]:
#MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # Public access model used while gated model authorization was being reviewed

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct" # Gated model that requires auth from the Repo Owners in HF

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,
    device_map="auto"
)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

print("LLaMA model loaded successfully.")

LLaMA model loaded successfully.


### Why Aggregate at Product Level?

Initial generation attempts revealed review contamination across categories.
To reduce noise, we:
- First concatenate up to 20 reviews per product
- Then concatenate the first N products per category (in the name-sorted order produced by `groupby`)


In [19]:
# Assign Meta Categories

def assign_meta_category(name):
    """
    Assign products to one of six meta-categories based on name keywords.
    Uses the category hierarchy determined during EDA.
    """
    if pd.isna(name):
        return "Unknown"
    
    name_lower = name.lower()
    
    # Order matters — Kids must come before general Tablets
    if any(kw in name_lower for kw in ["kids edition", "kid-proof"]):
        return "Fire Kids Edition"
    
    elif any(kw in name_lower for kw in ["fire tablet", "fire hd", "fire 7", 
                                          "fire 8", "fire 10", " fire ", "tablet"]):
        return "Fire Tablets"
    
    elif any(kw in name_lower for kw in ["kindle", "e-reader", "ebook", 
                                          "paperwhite", "voyage", "oasis"]):
        return "Kindle E-Readers"
    
    elif any(kw in name_lower for kw in ["echo", "tap", "alexa"]):
        return "Echo & Smart Speakers"
    
    elif any(kw in name_lower for kw in ["fire tv", "firetv", "streaming", 
                                          "media player"]):
        return "Fire TV & Streaming"
    
    else:
        return "Accessories & Other"

# Apply to full dataset
df["meta_category"] = df["name"].apply(assign_meta_category)

print("Meta-category distribution:")
print(df["meta_category"].value_counts())

# Check what landed in "Other"
other_products = df[df["meta_category"] == "Accessories & Other"]["name"].unique()
print(f"\nProducts in 'Accessories & Other': {len(other_products)}")
if len(other_products) < 20:
    print(other_products)


Meta-category distribution:
meta_category
Fire Tablets             19739
Echo & Smart Speakers     4262
Kindle E-Readers          4231
Fire Kids Edition         2191
Accessories & Other         58
Fire TV & Streaming          6
Name: count, dtype: int64

Products in 'Accessories & Other': 9
<StringArray>
[                         'Coconut Water Red Tea 16.5 Oz (pack of 12)',
                     'AmazonBasics Nylon CD/DVD Binder (400 Capacity)',
                     'AmazonBasics Ventilated Adjustable Laptop Stand',
                   'AmazonBasics Backpack for Laptops up to 17-inches',
                                'AmazonBasics 11.6-Inch Laptop Sleeve',
                               'AmazonBasics External Hard Drive Case',
 'AmazonBasics USB 3.0 Cable - A-Male to B-Male - 6 Feet (1.8 Meters)',
                       'AmazonBasics 16-Gauge Speaker Wire - 100 Feet',
         'AmazonBasics Bluetooth Keyboard for Android Devices - Black']
Length: 9, dtype: str


In [11]:
# Aggregate reviews by product and category

product_df = (
    df.groupby(["meta_category", "name"])
      .agg({
          "review_text": lambda x: " ".join(x.astype(str).head(20))
      })
      .reset_index()
)

product_df

|   | meta_category | name | review_text |
|---|---------------|------|-------------|
| 0 | Accessories & Other | AmazonBasics 11.6-Inch Laptop Sleeve | BETTER THAN NOTHING, But not as good as the CA... |
| 1 | Accessories & Other | AmazonBasics 16-Gauge Speaker Wire - 100 Feet | As advised. Came really fast Great feel and se... |
| 2 | Accessories & Other | AmazonBasics Backpack for Laptops up to 17-inches | This is a very basic, functional backpack that... |
| 3 | Accessories & Other | AmazonBasics Bluetooth Keyboard for Android De... | Like a lot of reviewers here, I struggled to f... |
| 4 | Accessories & Other | AmazonBasics External Hard Drive Case | I have the Western Digital My Passport Ultra 2... |
| ... | ... | ... | ... |
| 75 | Kindle E-Readers | Kindle Paperwhite | This is my 2 nd kindle. I bought this because ... |
| 76 | Kindle E-Readers | Kindle Paperwhite E-reader - White, 6 High-Res... | I purchased this for my son overseas as he had... |
| 77 | Kindle E-Readers | Kindle PowerFast International Charging Kit (f... | I travel internationally at least once a year ... |
| 78 | Kindle E-Readers | Kindle Voyage E-reader, 6 High-Resolution Disp... | Much better than my original Kindle. Lighter a... |


## Prompt Engineering Strategy

Goals:
- Minimize hallucination
- Avoid competitor mentions
- Prevent generic marketing language
- Force grounded, review-based summarization


In [None]:
def build_prompt(category, review_text):
    # The task text is parameterized by `category` so the same prompt
    # can be reused for every meta-category, not only Echo products.
    return f"""
<|system|>
You are a critical but fair technology journalist.
Only use information explicitly stated in the provided customer reviews.
Do NOT introduce technical specifications or product details that are not mentioned.
Do NOT use bullet points.
Write in continuous paragraphs only.
Avoid marketing language and generic praise.

<|user|>
Using only the information in the reviews below, write a 350–500 word blog-style article about Amazon products in the "{category}" category.

Your article must:
1. Identify the main product types mentioned and describe how customers experience them.
2. Highlight concrete strengths with real-world examples from the reviews.
3. Explain recurring complaints or limitations.
4. End with a clear, opinionated recommendation that explains:
   - Who should buy these products
   - Who may find them frustrating

Be analytical and grounded in customer language.
Do not invent specifications.
Do not use bullet points.

Category: {category}

Customer Reviews:
{review_text[:2500]}

<|assistant|>
"""
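
The `<|system|>` / `<|user|>` / `<|assistant|>` markers in the prompt follow the Zephyr-style format used by TinyLlama chat models, whereas Llama 3.x instruct models define their own chat template. A possible alternative, sketched below, is to build a list of role-tagged messages and let the tokenizer render them with `apply_chat_template`. The helper names `build_messages` and `render_prompt` and the shortened instruction text are illustrative assumptions, not part of the original notebook.

```python
def build_messages(category, review_text, max_chars=2500):
    """Build role-tagged chat messages (hypothetical helper, shortened instructions)."""
    system = (
        "You are a critical but fair technology journalist. "
        "Only use information explicitly stated in the provided customer reviews. "
        "Write in continuous paragraphs only."
    )
    user = (
        "Using only the information in the reviews below, write a 350-500 word "
        f"blog-style article about products in the category: {category}.\n\n"
        f"Customer Reviews:\n{review_text[:max_chars]}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def render_prompt(tokenizer, messages):
    """Render messages with the chat template that ships with the tokenizer."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return a string instead of token ids
        add_generation_prompt=True,  # append the assistant header so the model starts answering
    )


messages = build_messages("Echo & Smart Speakers", "Great sound. Alexa mishears me sometimes.")
print([m["role"] for m in messages])
```

With the notebook's `tokenizer` loaded, `render_prompt(tokenizer, messages)` would return the prompt string to tokenize and pass to `model.generate`.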

In [20]:
def build_category_text(category, top_n_products=3):
    products = (
        product_df[product_df["meta_category"] == category]
        .head(top_n_products)
    )
    return " ".join(products["review_text"])
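
`build_category_text` takes the first `top_n_products` rows of `product_df`, which come out of `groupby` in name order rather than ranked by review volume. If "top-N" is meant as "most reviewed", one option is to rank products by review count first. The sketch below assumes the same column names as `df`; the function name `top_products_by_review_count` and the toy data are hypothetical.

```python
import pandas as pd


def top_products_by_review_count(df, category, top_n=3, reviews_per_product=20):
    """Concatenate review text for the most-reviewed products in one category."""
    subset = df[df["meta_category"] == category]
    # value_counts() sorts product names by number of reviews, descending.
    top_names = subset["name"].value_counts().head(top_n).index
    parts = []
    for name in top_names:
        reviews = subset.loc[subset["name"] == name, "review_text"].astype(str)
        parts.append(" ".join(reviews.head(reviews_per_product)))
    return " ".join(parts)


# Toy data for illustration only: product "B" has three reviews, "A" has one.
toy = pd.DataFrame({
    "meta_category": ["Echo"] * 4 + ["Kindle"],
    "name": ["A", "B", "B", "B", "K"],
    "review_text": ["a1", "b1", "b2", "b3", "k1"],
})
print(top_products_by_review_count(toy, "Echo", top_n=1))
```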


## Generation Pipeline

- Start by generating for a single category so the prompt can be refined before running all categories


In [None]:
# Generation Cell - Sampling with Single Category

category = "Echo & Smart Speakers"

echo_text = build_category_text(category, top_n_products=3)

prompt = build_prompt(category, echo_text)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

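outputs = model.generate(
    **inputs,
    max_new_tokens=450,
    do_sample=True,      # sampling must be on for temperature/top_p to take effect
    temperature=0.15,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens. Slicing the decoded string by
# len(prompt) is fragile because decoding may not reproduce the prompt exactly.
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)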

print(generated_text.strip())

We generate two variants per category:
- Controlled (low temperature)
- Creative (higher temperature)

In [51]:
def generate_article(category, temperature=0.15):
    category_text = build_category_text(category, top_n_products=3)
    prompt = build_prompt(category, category_text)
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

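    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,      # sampling must be on for temperature/top_p to take effect
        temperature=temperature,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens instead of slicing the
    # decoded string by len(prompt), which can misalign after decoding.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()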
    
    return generated_text


## Generating Articles for All Categories

Temperature = 0.15 → more factual, grounded  
Temperature = 0.35 → more narrative, expressive  
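
To make the effect of the two settings concrete, the following sketch applies temperature scaling and top-p (nucleus) filtering to a made-up logit vector. It is a plain-Python illustration of the general technique, not the code path `model.generate` uses internally; the function names and the logits are assumptions chosen for the example.

```python
import math


def temperature_softmax(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, 0.0]  # made-up next-token scores
for t in (0.15, 0.35, 1.0):
    probs = temperature_softmax(logits, t)
    candidates = sorted(top_p_filter(probs, top_p=0.9))
    print(f"T={t}: top-token prob={probs[0]:.3f}, candidates after top-p={candidates}")
```

Both 0.15 and 0.35 concentrate most of the probability on the top token for this toy distribution, which is consistent with the two variants differing in style more than in content.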

In [52]:
all_categories = df["meta_category"].unique()

results = {}

for category in all_categories:
    print(f"Generating for: {category}")
    
    results[category] = {
        "controlled": generate_article(category, temperature=0.15),
        "creative": generate_article(category, temperature=0.35)
    }


Generating for: Fire Tablets
Generating for: Kindle E-Readers
Generating for: Fire Kids Edition
Generating for: Echo & Smart Speakers
Generating for: Accessories & Other
Generating for: Fire TV & Streaming


## Persisting Results

Generated articles are saved to:
outputs/models/generated_blogposts.txt

Saving them to disk lets the evaluation cells run without regenerating the articles and keeps the outputs under version control.


In [54]:
# Create directory for the generated articles
os.makedirs("../outputs/models/", exist_ok=True)

output_path = "../outputs/models/generated_blogposts.txt"

with open(output_path, "w", encoding="utf-8") as f:
    for category, versions in results.items():
        f.write("="*80 + "\n")
        f.write(f"CATEGORY: {category}\n")
        f.write("="*80 + "\n\n")
        
        f.write("---- CONTROLLED VERSION ----\n\n")
        f.write(versions["controlled"] + "\n\n")
        
        f.write("---- CREATIVE VERSION ----\n\n")
        f.write(versions["creative"] + "\n\n\n")

print(f"Saved to {output_path}")


Saved to ../outputs/models/generated_blogposts.txt
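
The text file is easy to read, but loading it back requires pattern matching on the separator lines. A JSON copy of `results` could be written alongside it to make the round trip lossless. The sketch below assumes a hypothetical file name `generated_blogposts.json` and uses a temporary directory and placeholder article text so it runs on its own.

```python
import json
import os
import tempfile


def save_results_json(results, path):
    """Write the {category: {"controlled": ..., "creative": ...}} dict as JSON."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)


def load_results_json(path):
    """Read the dict back exactly as it was saved."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Placeholder content; the second string contains a separator-like line on purpose.
demo = {
    "Echo & Smart Speakers": {
        "controlled": "Placeholder article text.",
        "creative": "Another placeholder.\n\n---- CREATIVE VERSION ----\n\nStill the same article.",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_blogposts.json")  # hypothetical file name
    save_results_json(demo, path)
    restored = load_results_json(path)

print(restored == demo)
```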


## Evaluation

In [None]:
def build_extractive_baseline(category, top_n_reviews=5, max_words=400):
    """
    Simple extractive baseline:
    - Select top N longest reviews in the category
    - Concatenate and truncate to max_words
    """
    reviews = df[df["meta_category"] == category]["review_text"].dropna()
    
    # Sort by length (descending)
    reviews = sorted(reviews, key=lambda x: len(x), reverse=True)
    
    selected = " ".join(reviews[:top_n_reviews])
    
    words = selected.split()
    
    return " ".join(words[:max_words])


In [None]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def compute_rouge(reference, generated):
    scores = scorer.score(reference, generated)
    
    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure
    }


In [None]:
def compute_bertscore(reference, generated):
    P, R, F1 = bertscore(
        [generated],
        [reference],
        lang="en",
        rescale_with_baseline=False   # report raw BERTScore values (no baseline rescaling)
    )
    
    return float(F1.mean())



In [7]:
def compute_compression_ratio(source_text, generated_text):
    source_len = len(source_text.split())
    generated_len = len(generated_text.split())
    
    return generated_len / source_len


In [None]:
def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return set(text.split())

def compute_grounding_ratio(source_text, generated_text):
    source_tokens = tokenize(source_text)
    generated_tokens = tokenize(generated_text)
    
    overlap = generated_tokens.intersection(source_tokens)
    
    if len(generated_tokens) == 0:
        return 0
    
    return len(overlap) / len(generated_tokens)

In [None]:
def parse_generated_file(filepath):
    results = {}
    
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
    
    # Split by CATEGORY blocks
    category_blocks = re.split(r"={5,}\nCATEGORY:\s*", content)
    
    for block in category_blocks:
        if not block.strip():
            continue
        
        # First line is category name
        lines = block.strip().split("\n")
        category = lines[0].strip()
        
        # Extract controlled and creative versions
        controlled_match = re.search(
            r"---- CONTROLLED VERSION ----\n\n(.*?)\n\n---- CREATIVE VERSION ----",
            block,
            re.DOTALL
        )
        
        creative_match = re.search(
            r"---- CREATIVE VERSION ----\n\n(.*)",
            block,
            re.DOTALL
        )
        
        controlled_text = controlled_match.group(1).strip() if controlled_match else ""
        creative_text = creative_match.group(1).strip() if creative_match else ""
        
        results[category] = {
            "controlled": controlled_text,
            "creative": creative_text
        }
    
    return results

In [None]:
# Load the saved generated_blogposts.txt so evaluation can run without regenerating the articles

filepath = "../outputs/models/generated_blogposts.txt"

results = parse_generated_file(filepath)

print(results.keys())


In [31]:
evaluation_results = []

for category in results.keys():
    
    print(f"Evaluating: {category}")
    
    # Extractive pseudo-reference
    reference = build_extractive_baseline(category)
    
    # Generated text
    generated = results[category]["controlled"]
    
    # ROUGE
    rouge_scores = compute_rouge(reference, generated)
    
    # BERTScore
    bert_f1 = compute_bertscore(reference, generated)
    
    # Compression ratio
    source_text = build_category_text(category, top_n_products=3)
    compression = compute_compression_ratio(source_text, generated)

    # Grounding Ratio
    grounding = compute_grounding_ratio(source_text, generated)
    
    evaluation_results.append({
        "category": category,
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "bertscore_f1": bert_f1,
        "compression_ratio": compression,
        "grounding_ratio": grounding
    })


Evaluating: Fire Tablets


RobertaModel LOAD REPORT from: roberta-large
Key                             | Status     |
--------------------------------+------------+-
lm_head.dense.weight            | UNEXPECTED |
lm_head.layer_norm.bias         | UNEXPECTED |
lm_head.dense.bias              | UNEXPECTED |
lm_head.layer_norm.weight       | UNEXPECTED |
lm_head.bias                    | UNEXPECTED |
roberta.embeddings.position_ids | UNEXPECTED |
pooler.dense.bias               | MISSING    |
pooler.dense.weight             | MISSING    |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Evaluating: Kindle E-Readers




Evaluating: Fire Kids Edition




Evaluating: Echo & Smart Speakers




Evaluating: Accessories & Other




Evaluating: Fire TV & Streaming




In [None]:
eval_df = pd.DataFrame(evaluation_results)

eval_df

|   | category | rouge1 | rouge2 | rougeL | bertscore_f1 | compression_ratio | grounding_ratio |
|---|----------|--------|--------|--------|--------------|-------------------|-----------------|
| 0 | Fire Tablets | 0.292135 | 0.014085 | 0.126404 | 0.800121 | 0.176573 | 0.509934 |
| 1 | Kindle E-Readers | 0.304993 | 0.024357 | 0.137652 | 0.808534 | 0.227821 | 0.463087 |
| 2 | Fire Kids Edition | 0.332948 | 0.034762 | 0.145665 | 0.805684 | 0.177215 | 0.543624 |
| 3 | Echo & Smart Speakers | 0.343008 | 0.031746 | 0.137203 | 0.825337 | 0.180735 | 0.630303 |
| 4 | Accessories & Other | 0.357895 | 0.065651 | 0.152047 | 0.797558 | 0.194293 | 0.578947 |
| 5 | Fire TV & Streaming | 0.381963 | 0.106383 | 0.156499 | 0.83048 | 1.283582 | 0.464481 |


In [None]:
eval_df.mean(numeric_only=True)

rouge1               0.335490
rouge2               0.046164
rougeL               0.142578
bertscore_f1         0.811286
compression_ratio    0.373370
grounding_ratio      0.531729
dtype: float64


## Evaluation & Observations

### Model Behavior & Prompting

Compared to TinyLlama, LLaMA 3.2-3B produced more coherent, structurally consistent, and grounded summaries. Early prompt iterations introduced competitor brands and generic phrasing; tightening the instructions and lowering the temperature noticeably reduced this behavior in the outputs inspected.

Controlled decoding (low temperature) yielded more stable and grounded outputs, while higher temperature decoding improved stylistic richness at the cost of slight drift risk.

### Quantitative Evaluation

Because no human-written reference summaries were available, an extractive pseudo-reference was built for each category from its five longest reviews, truncated to 400 words. The controlled (temperature 0.15) articles were then scored against it with ROUGE and BERTScore.

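ROUGE-1 ≈ 0.33 reflects moderate unigram overlap with the pseudo-reference, while ROUGE-2 ≈ 0.05 shows that few exact bigrams are shared; both are expected for abstractive summaries whose wording differs from the reviews.  
BERTScore F1 ≈ 0.81 points to semantic similarity with the pseudo-reference, but the score is reported without baseline rescaling, where values tend to fall in a narrow, high range, so it is better read as a relative comparison across categories than as an absolute measure of quality.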
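
As a reminder of what a ROUGE-1 F-measure captures, here is a simplified unigram-overlap calculation. It omits the stemming that `rouge_scorer` applies with `use_stemmer=True`, so its values will not match the library exactly; the function name `unigram_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(reference, generated):
    """F1 over clipped unigram matches between reference and generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((ref_counts & gen_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the tablet is easy for kids to use"
generated = "kids find the tablet easy to handle"
print(round(unigram_f1(reference, generated), 3))
```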

The mean compression ratio of 0.37 is inflated by a single outlier. Five of the six categories compress their source text to roughly 0.18 to 0.23 of its length, whereas Fire TV & Streaming, with only 6 reviews, expands it (ratio 1.28) because the article is longer than the available source material.
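
The influence of the Fire TV & Streaming value on the average can be checked directly from the compression ratios in the evaluation table. The snippet below is a small arithmetic check using only the reported numbers; the variable names are illustrative.

```python
from statistics import mean, median

# Compression ratios copied from the evaluation table.
compression = {
    "Fire Tablets": 0.176573,
    "Kindle E-Readers": 0.227821,
    "Fire Kids Edition": 0.177215,
    "Echo & Smart Speakers": 0.180735,
    "Accessories & Other": 0.194293,
    "Fire TV & Streaming": 1.283582,
}

overall_mean = mean(compression.values())
overall_median = median(compression.values())
mean_without_outlier = mean(
    value for name, value in compression.items() if name != "Fire TV & Streaming"
)

print(f"mean={overall_mean:.3f}")
print(f"median={overall_median:.3f}")
print(f"mean without Fire TV & Streaming={mean_without_outlier:.3f}")
```

The median and the mean without the outlier both sit near 0.19, which describes the typical category better than the overall mean.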

### Grounding & Hallucination Proxy

Lexical grounding (the share of unique tokens in the generated article that also occur in the source text) averaged ≈0.53, ranging from 0.46 (Kindle E-Readers) to 0.63 (Echo & Smart Speakers). Roughly half of the generated vocabulary was directly anchored in review text; the remainder plausibly reflects narrative framing and recommendation language, although this was not verified token by token. While this metric does not capture semantic hallucination, it provides a coarse but useful signal of source alignment.
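
One way to look behind the ratio is to list the generated tokens that never appear in the source, since specifications or brand names the model introduced would show up there. The sketch below reuses the notebook's tokenization rule; the helper `ungrounded_tokens` and the two example sentences are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase, drop punctuation, and return the set of unique tokens."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return set(text.split())


def ungrounded_tokens(source_text, generated_text):
    """Tokens used in the generated text that do not occur in the source text."""
    return sorted(tokenize(generated_text) - tokenize(source_text))


# Made-up example: "Bluetooth 5.0" is not supported by the source sentence.
source = "The speaker sounds great and Alexa understands me."
generated = "Alexa understands users, and the speaker has Bluetooth 5.0."
print(ungrounded_tokens(source, generated))
```

Most entries in such a list will be harmless function words, so it is a starting point for manual review and not a hallucination detector in itself.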

### Limitations

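Generation quality remains sensitive to dataset noise and cross-category review contamination. The prompt passes only the first 2,500 characters of the aggregated reviews to the model, whereas compression and grounding are computed against the full aggregated text, so both ratios describe the pool of reviews and not exactly what the model saw. Automated metrics provide directional insight but do not fully capture coherence, recommendation strength, or factual faithfulness.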


## Final Conclusions

This notebook demonstrates that LLM-based category summarization is feasible using constrained prompting and controlled decoding. Compared to TinyLlama, LLaMA 3.2-3B produced more coherent, structurally consistent, and better-grounded summaries.

Dataset noise (including cross-product review contamination) required product-level aggregation to stabilize generation. Prompt constraints and low-temperature decoding further reduced drift and competitor hallucinations.

Quantitative evaluation showed moderate lexical overlap (ROUGE-1 ≈ 0.33) and a high raw BERTScore (≈ 0.81) against extractive pseudo-references, with source text compressed to about a fifth of its length in five of the six categories. Grounding analysis (≈0.53 lexical overlap) suggests that about half of the generated vocabulary is drawn directly from the source reviews, with the rest being new wording introduced during summarization.

Overall, LLM-based summarization generated meaningful, interpretable category insights with manageable hallucination risk under structured prompting.

