## Text Generation Model

- Load clustered products

- Aggregate reviews per meta-category

- Load a lightweight LLM

- Design a strong prompt

- Generate summaries per category

In [None]:
#%pip install -q transformers accelerate bitsandbytes sentencepiece rouge-score nltk

In [3]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():  # get_device_name raises when no GPU is present
    print("Device:", torch.cuda.get_device_name(0))


CUDA available: True
Device: NVIDIA L4


## Load the Data

In [17]:
import os
import subprocess

REPO = "https://github.com/marcosfsousa/project-ironhack-automated-customer-reviews.git"

if not os.path.exists("/content/repo"):
    subprocess.run(["git", "clone", REPO, "/content/repo"], check=True)
    print("Repo cloned.")
else:
    subprocess.run(["git", "-C", "/content/repo", "pull"], check=True)
    print("Repo updated.")

DATA_PATH = "/content/repo/data/processed/electronics_ready.csv"
print(f"File exists: {os.path.exists(DATA_PATH)}")


Repo updated.
File exists: True


In [19]:
import pandas as pd

df = pd.read_csv(DATA_PATH)  # reuse the path checked in the previous cell
print(df.shape)
df.head()


(30487, 4)


Unnamed: 0,name,brand,rating,review_text
0,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",Amazon,5,This product so far has not disappointed. My c...
1,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",Amazon,5,great for beginner or experienced person. Boug...
2,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",Amazon,5,Inexpensive tablet for him to use and learn on...
3,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",Amazon,4,I've had my Fire HD 8 two weeks now and I love...
4,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",Amazon,5,I bought this for my grand daughter when she c...


In [35]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Public fallback used while access to the gated model was pending

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # Gated model: access must be approved by the repo owners on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,
    device_map="auto"
)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

print("LLaMA model loaded successfully.")

LLaMA model loaded successfully.


## Building the Prompt

In [49]:
def build_prompt(category, review_text):
    return f"""
<|system|>
You are a critical but fair technology journalist.
Only use information explicitly stated in the provided customer reviews.
Do NOT introduce technical specifications or product details that are not mentioned.
Do NOT use bullet points.
Write in continuous paragraphs only.
Avoid marketing language and generic praise.

<|user|>
Using only the information in the reviews below, write a 350–500 word blog-style article about Amazon products in the "{category}" category.

Your article must:
1. Identify the main product types mentioned and describe how customers experience them.
2. Highlight concrete strengths with real-world examples from the reviews.
3. Explain recurring complaints or limitations.
4. End with a clear, opinionated recommendation that explains:
   - Who should buy these products
   - Who may find them frustrating

Be analytical and grounded in customer language.
Do not invent specifications.
Do not use bullet points.

Category: {category}

Customer Reviews:
{review_text[:2500]}

<|assistant|>
"""
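
The `<|system|>`, `<|user|>` and `<|assistant|>` markers in the prompt above follow the chat format used by the TinyLlama fallback model; Llama 3.2 Instruct ships its own chat template. A minimal sketch of one way to stay model-agnostic is to build role-tagged messages and let the tokenizer render them. `build_messages` is a hypothetical helper, not part of the notebook, and the instruction text is abbreviated here for brevity.

```python
def build_messages(category, review_text, max_chars=2500):
    """Hypothetical helper: return role-tagged chat messages instead of a hand-written template."""
    system = (
        "You are a critical but fair technology journalist. "
        "Only use information explicitly stated in the provided customer reviews. "
        "Write in continuous paragraphs only, without bullet points."
    )
    user = (
        "Using only the reviews below, write a 350-500 word blog-style article.\n\n"
        f"Category: {category}\n\n"
        f"Customer Reviews:\n{review_text[:max_chars]}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages("Echo & Smart Speakers", "Great sound. Alexa sometimes mishears me.")

# With the notebook's tokenizer loaded, the model-specific prompt could be rendered with:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The rendered Llama 3 template typically already starts with the BOS token, so
# tokenizing it afterwards with add_special_tokens=False avoids a duplicate.
```

Keeping the message content separate from the template also makes it easier to switch between the public and the gated model without editing the prompt text.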

In [None]:
# Assign Meta Categories

def assign_meta_category(name):
    """
    Assign products to one of six meta-categories based on name keywords.
    Uses the category hierarchy determined during EDA.
    """
    if pd.isna(name):
        return "Unknown"
    
    name_lower = name.lower()
    
    # Order matters — Kids must come before general Tablets
    if any(kw in name_lower for kw in ["kids edition", "kid-proof"]):
        return "Fire Kids Edition"
    
    elif any(kw in name_lower for kw in ["fire tablet", "fire hd", "fire 7", 
                                          "fire 8", "fire 10", " fire ", "tablet"]):
        return "Fire Tablets"
    
    elif any(kw in name_lower for kw in ["kindle", "e-reader", "ebook", 
                                          "paperwhite", "voyage", "oasis"]):
        return "Kindle E-Readers"
    
    elif any(kw in name_lower for kw in ["echo", "tap", "alexa"]):
        return "Echo & Smart Speakers"
    
    elif any(kw in name_lower for kw in ["fire tv", "firetv", "streaming", 
                                          "media player"]):
        return "Fire TV & Streaming"
    
    else:
        return "Accessories & Other"

# Apply to full dataset
df["meta_category"] = df["name"].apply(assign_meta_category)

print("Meta-category distribution:")
print(df["meta_category"].value_counts())

# Check what landed in "Other"
other_products = df[df["meta_category"] == "Accessories & Other"]["name"].unique()
print(f"\nProducts in 'Accessories & Other': {len(other_products)}")
if len(other_products) < 20:
    print(other_products)
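
Because `assign_meta_category` returns on the first keyword hit, a product name that matches several rules is decided purely by rule order. The sketch below is a hypothetical diagnostic (not part of the notebook) that lists every category whose keywords match a name, which may help spot ambiguous products. The keyword lists are copied from the function above; the example product names are illustrative and not taken from the dataset.

```python
# Keyword rules copied from assign_meta_category, in the same order.
RULES = [
    ("Fire Kids Edition", ["kids edition", "kid-proof"]),
    ("Fire Tablets", ["fire tablet", "fire hd", "fire 7", "fire 8", "fire 10", " fire ", "tablet"]),
    ("Kindle E-Readers", ["kindle", "e-reader", "ebook", "paperwhite", "voyage", "oasis"]),
    ("Echo & Smart Speakers", ["echo", "tap", "alexa"]),
    ("Fire TV & Streaming", ["fire tv", "firetv", "streaming", "media player"]),
]


def matching_categories(name):
    """Hypothetical diagnostic: list every category whose keywords match, not just the first."""
    name_lower = name.lower()
    return [cat for cat, kws in RULES if any(kw in name_lower for kw in kws)]


# Illustrative product names (assumptions, not rows from the dataset).
print(matching_categories("Amazon Fire TV with Alexa Voice Remote"))
print(matching_categories("Kindle Paperwhite"))
```

Any name that returns more than one category is worth a manual look, since the first entry in the list is the one the keyword function would assign. On the real data this could be run over `df["name"].dropna().unique()`.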


In [None]:
# Aggregate reviews by product and category

product_df = (
    df.groupby(["meta_category", "name"])
      .agg({
          "review_text": lambda x: " ".join(x.astype(str).head(20))
      })
      .reset_index()
)

product_df


Unnamed: 0,meta_category,name,review_text
0,Accessories & Other,AmazonBasics 11.6-Inch Laptop Sleeve,"BETTER THAN NOTHING, But not as good as the CA..."
1,Accessories & Other,AmazonBasics 16-Gauge Speaker Wire - 100 Feet,As advised. Came really fast Great feel and se...
2,Accessories & Other,AmazonBasics Backpack for Laptops up to 17-inches,"This is a very basic, functional backpack that..."
3,Accessories & Other,AmazonBasics Bluetooth Keyboard for Android De...,"Like a lot of reviewers here, I struggled to f..."
4,Accessories & Other,AmazonBasics External Hard Drive Case,I have the Western Digital My Passport Ultra 2...
...,...,...,...
75,Kindle E-Readers,Kindle Paperwhite,This is my 2 nd kindle. I bought this because ...
76,Kindle E-Readers,"Kindle Paperwhite E-reader - White, 6 High-Res...",I purchased this for my son overseas as he had...
77,Kindle E-Readers,Kindle PowerFast International Charging Kit (f...,I travel internationally at least once a year ...
78,Kindle E-Readers,"Kindle Voyage E-reader, 6 High-Resolution Disp...",Much better than my original Kindle. Lighter a...


In [33]:
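def build_category_text(category, top_n_products=3):
    """Join the aggregated reviews of the first N products in a category.

    product_df is sorted by product name (groupby default), so "first N"
    means alphabetical order, not popularity or rating.
    """
    products = (
        product_df[product_df["meta_category"] == category]
        .head(top_n_products)
    )
    return " ".join(products["review_text"])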
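
Since `product_df` comes out of a `groupby`, its rows are ordered by product name, so `.head(top_n_products)` picks products alphabetically. If the intent is to summarize the products customers wrote about most, one possible alternative is to rank products by review count first. The following is a sketch under that assumption; `build_category_text_by_volume` and the toy DataFrame are hypothetical and only mirror the notebook's column names.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are illustrative only).
toy_df = pd.DataFrame({
    "meta_category": ["Echo"] * 6,
    "name": ["A", "A", "A", "B", "C", "C"],
    "review_text": ["a1", "a2", "a3", "b1", "c1", "c2"],
})


def build_category_text_by_volume(df, category, top_n_products=2, reviews_per_product=20):
    """Hypothetical variant: join reviews from the most-reviewed products in a category."""
    subset = df[df["meta_category"] == category]
    # value_counts sorts by frequency (descending), so the busiest products come first.
    top_names = subset["name"].value_counts().head(top_n_products).index
    parts = [
        " ".join(
            subset.loc[subset["name"] == name, "review_text"]
            .astype(str)
            .head(reviews_per_product)
        )
        for name in top_names
    ]
    return " ".join(parts)


print(build_category_text_by_volume(toy_df, "Echo"))
```

In the toy data, products A and C have the most reviews, so only their text is joined and B is left out.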


In [None]:
# Generation Cell - Sampling with Single Category

category = "Echo & Smart Speakers"

echo_text = build_category_text(category, top_n_products=3)

prompt = build_prompt(category, echo_text)

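inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=450,
    do_sample=True,  # explicit, so temperature/top_p apply regardless of the model's defaults
    temperature=0.15,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens; slicing the decoded string by
# len(prompt) assumes decoding reproduces the prompt exactly, which is not guaranteed.
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)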

print(generated_text.strip())

In [51]:
def generate_article(category, temperature=0.15):
    category_text = build_category_text(category, top_n_products=3)
    prompt = build_prompt(category, category_text)
    
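    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,  # explicit, so temperature/top_p apply regardless of the model's defaults
        temperature=temperature,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) assumes decoding reproduces the prompt exactly, which is not guaranteed.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()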
    
    return generated_text


In [52]:
all_categories = df["meta_category"].unique()

results = {}

for category in all_categories:
    print(f"Generating for: {category}")
    
    results[category] = {
        "controlled": generate_article(category, temperature=0.15),
        "creative": generate_article(category, temperature=0.35)
    }


Generating for: Fire Tablets
Generating for: Kindle E-Readers
Generating for: Fire Kids Edition
Generating for: Echo & Smart Speakers
Generating for: Accessories & Other
Generating for: Fire TV & Streaming
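
The prompt asks for 350–500 words, while `generate_article` caps output at 500 new tokens. A word generally takes at least one token, so the upper end of that range is unlikely to be reached and some articles may be cut off mid-sentence. A quick length check over `results` can make this visible. The helper below is a hypothetical sketch that assumes the same nested dict shape as `results` and uses toy text in place of real model output.

```python
def word_count_report(results, min_words=350, max_words=500):
    """Hypothetical helper: count words per generated article and flag out-of-range ones."""
    report = {}
    for category, versions in results.items():
        for version, text in versions.items():
            n_words = len(text.split())
            report[(category, version)] = {
                "words": n_words,
                "in_range": min_words <= n_words <= max_words,
            }
    return report


# Toy input with the same shape as the notebook's `results` dict.
toy_results = {
    "Echo & Smart Speakers": {"controlled": "word " * 400, "creative": "too short"}
}
for key, stats in word_count_report(toy_results).items():
    print(key, stats)
```

If many articles fall short of the target, raising `max_new_tokens` or lowering the requested word count in the prompt would be the two obvious levers to try.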


In [54]:
# Create the output directory for the generated articles
os.makedirs("../outputs/models/", exist_ok=True)

output_path = "../outputs/models/generated_blogposts.txt"

with open(output_path, "w", encoding="utf-8") as f:
    for category, versions in results.items():
        f.write("="*80 + "\n")
        f.write(f"CATEGORY: {category}\n")
        f.write("="*80 + "\n\n")
        
        f.write("---- CONTROLLED VERSION ----\n\n")
        f.write(versions["controlled"] + "\n\n")
        
        f.write("---- CREATIVE VERSION ----\n\n")
        f.write(versions["creative"] + "\n\n\n")

print(f"Saved to {output_path}")


Saved to ../outputs/models/generated_blogposts.txt


### Conclusion

During generation testing, we observed cross-product review contamination in the dataset: some review texts did not match the product they were filed under, likely because of scraping inconsistencies in the original data. To mitigate this, we aggregated reviews at the product level, kept at most 20 reviews per product, and built each category prompt from three products.