## Generate blogposts for all categories

## Model 3 - Blog entry generation for each category

This notebook generates blog-like summaries that help users choose among products.
It takes clustered product data and creates human-readable reviews using a language model.

#### 1. 🔐 Authenticate with Hugging Face

Authenticates the session with Hugging Face to allow downloading gated models using our account credentials.

In [None]:
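import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()  # Read variables from a local .env file into the environment
hf_key = os.getenv("HF_KEY")  # Hugging Face access token, needed for gated models

# Pass the token explicitly; new_session=False reuses a saved login instead of prompting again
login(token=hf_key, new_session=False)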

#### 2. Install and Import Libraries

In [None]:
#!pip install git+https://github.com/huggingface/accelerate.git
#!pip install git+https://github.com/huggingface/transformers.git
#!pip install bitsandbytes python-dotenv nltk

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from dotenv import load_dotenv
import os

#### 3. Load Mistral 7B Model with 8-bit Quantization

The code below uses `BitsAndBytesConfig` to load the Mistral model in 8-bit precision. This roughly halves the memory needed for the weights compared with 16-bit loading, usually at the cost of only a small loss in numerical precision.

In [None]:
model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# 8-bit loading only needs load_in_8bit=True. The quant_type, compute_dtype and
# double-quantization options exist only for 4-bit loading (bnb_4bit_*).
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

# Load model and tokenizer with memory-efficient 8-bit settings
blog_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_key  # `token` replaces the deprecated `use_auth_token` argument
)
blog_tokenizer = AutoTokenizer.from_pretrained(model_path, token=hf_key)
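
To get a feel for what 8-bit loading saves, the memory taken by the weights can be estimated as the parameter count times the bytes per parameter. The sketch below assumes roughly 7.2 billion parameters for Mistral 7B (an approximation, not a figure stated in this notebook) and ignores activations, the KV cache, and any quantization overhead. The helper name `estimate_weight_memory_gb` is hypothetical.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Rough size of the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: ~7.2e9 parameters; the exact count is not given in this notebook
N_PARAMS = 7.2e9

for label, bits in [("float16", 16), ("int8", 8)]:
    print(f"{label}: ~{estimate_weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights need about 14 GB in 16-bit and about 7 GB in 8-bit, which is the difference between fitting and not fitting on many single consumer GPUs.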

#### 4. Save Model and Tokenizer Locally

Saves the loaded model and tokenizer to disk so that they do not need to be downloaded again at deployment time.

In [None]:
blog_model.save_pretrained('blog_generator')
blog_tokenizer.save_pretrained('blog_generator')
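
The generation function in the next step expects a `generator` callable, which this notebook does not define. The following is a minimal sketch of one way to build it from the saved directory. It assumes that the `transformers` text-generation pipeline is the intended generator and that `bitsandbytes` is installed so the saved 8-bit weights can be reloaded. The helper name `load_blog_generator` is hypothetical.

```python
def load_blog_generator(model_dir: str = "blog_generator"):
    """Reload the saved model and tokenizer and wrap them in a text-generation pipeline."""
    # Imported lazily so that defining this helper does not load the heavy libraries
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Assumption: the quantization settings were stored with the saved config,
    # so they are not passed again here
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```

Calling `generator = load_blog_generator()` would then give an object that can be passed as the second argument of the generation function.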

#### 5. Define Blogpost Generation Function

This function loops over product categories and generates short blog posts for each. It:

- Ranks top-rated products

- Extracts most common complaints from negative reviews

- Builds prompts with examples

- Uses the text generation model to generate a blog-style post
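
The first two steps can be illustrated on toy data. The sketch below uses plain Python rather than pandas so that it runs without dependencies. The review rows, the tiny stop-word set (a stand-in for NLTK's English list), and the helper names are all made up for illustration. Unlike the notebook's function, it also strips trailing punctuation from words.

```python
from collections import Counter, defaultdict

# Toy data: (product name, sentiment label, review text); all values are invented
reviews = [
    ("Reader A", "positive", "Great screen and long battery life"),
    ("Reader A", "negative", "Battery drains fast and battery swells"),
    ("Reader A", "negative", "Screen glare outdoors, glare indoors, battery weak"),
    ("Reader B", "positive", "Love it"),
    ("Reader B", "neutral", "Works fine"),
]

SENTIMENT_POINTS = {"positive": 2, "neutral": 1, "negative": 0}
STOP_WORDS = {"and", "the", "it"}  # assumption: stand-in for NLTK's stop words


def rank_products(rows):
    """Return (name, mean points, review count), best product first."""
    points = defaultdict(list)
    for name, sentiment, _ in rows:
        points[name].append(SENTIMENT_POINTS[sentiment])
    stats = [(name, sum(p) / len(p), len(p)) for name, p in points.items()]
    # Sort by mean sentiment, then by review count, both descending
    return sorted(stats, key=lambda s: (s[1], s[2]), reverse=True)


def top_complaints(rows, product, n=2):
    """Return the n most frequent non-stop-words in a product's negative reviews."""
    words = [
        w.lower().strip(".,!?")
        for name, sentiment, text in rows
        if name == product and sentiment == "negative"
        for w in text.split()
    ]
    filtered = [w for w in words if w not in STOP_WORDS and len(w) > 2]
    return [word for word, _ in Counter(filtered).most_common(n)]


ranking = rank_products(reviews)
print(ranking)
print(top_complaints(reviews, "Reader A"))
```

Plain word counts surface frequent nouns such as "battery" but lose context, so the resulting complaint list is best read as a set of hints for the language model rather than as a precise summary.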

In [None]:
def generate_blogposts_for_all_categories(
    df,
    generator,
    n_shot_prompt=None,
    n_top_products=3,
    n_complaints=2,
    max_new_tokens=800,
    temperature=0.4,
    top_p=0.9,
    repetition_penalty=1.1
):
    from nltk.corpus import stopwords
    from collections import Counter

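    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Map sentiment labels to numeric values
    sentiment_map = {'positive': 2, 'neutral': 1, 'negative': 0}
    df['sentiment_points'] = df['sentiment'].map(sentiment_map)

    # Requires the NLTK stop-word corpus: run nltk.download('stopwords') once beforehand
    stop_words = set(stopwords.words('english'))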
    
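    # Use the built-in few-shot examples only when the caller does not supply a prompt
    if n_shot_prompt is None:
        n_shot_prompt = """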
# Context:
You are a helpful assistant that writes concise, well-structured blogposts comparing consumer tech products. The user provides product review summaries, and you respond with a styled blogpost including top picks, key complaints, and a warning about the worst-rated product.

## Example 1

<|system|>
You are a product review blogger assistant.

<|user|>
Write a blogpost based on review data for e-readers:
1. Kindle Paperwhite (Avg. Rating: 4.8, 1200 reviews)
   Complaints: screen glare, slow refresh
2. Kobo Clara HD (Avg. Rating: 4.7, 950 reviews)
   Complaints: limited store, battery life
3. Kindle Oasis (Avg. Rating: 4.6, 800 reviews)
   Complaints: price, weight
Worst: Nook GlowLight 3 (Avg. Rating: 3.2), Complaints: slow performance, software glitches

<|assistant|>
--- Blogpost for E-readers ---
Looking for the perfect e-reader? Here are our top picks!

1. **Kindle Paperwhite** (Avg. Rating: 4.8, 1200 reviews)  
   *Top complaints:* screen glare, slow refresh

2. **Kobo Clara HD** (Avg. Rating: 4.7, 950 reviews)  
   *Top complaints:* limited store, battery life

3. **Kindle Oasis** (Avg. Rating: 4.6, 800 reviews)  
   *Top complaints:* price, weight

The **Kindle Paperwhite** stands out for its crisp display and waterproof design, making it ideal for reading anywhere. **Kobo Clara HD** is a great alternative with support for multiple formats, though some users wish for a larger store. The **Kindle Oasis** offers premium features, but its higher price and weight are noted by some reviewers.

**Worst product:** Nook GlowLight 3 (Avg. Rating: 3.2)  
Many customers report slow performance and frequent software glitches. Unless you’re a die-hard Nook fan, we recommend considering other options for a smoother reading experience.

---

## Example 2

<|user|>
Write a blogpost using this tablet review data:
1. iPad Air (4.9, 2100) - complaints: price, limited ports
2. Galaxy Tab S7 (4.8, 1800) - complaints: bloatware, charger speed
3. Fire HD 10 (4.5, 1600) - complaints: app selection, ads
Worst: Lenovo Tab M8 (3.4) - complaints: sluggish, weak display

<|assistant|>
--- Blogpost for Smart Tablets ---
Shopping for a smart tablet? Here are the top choices this year:

1. **Apple iPad Air** (Avg. Rating: 4.9, 2100 reviews)  
   *Top complaints:* price, limited ports

2. **Samsung Galaxy Tab S7** (Avg. Rating: 4.8, 1800 reviews)  
   *Top complaints:* bloatware, charger speed

3. **Amazon Fire HD 10** (Avg. Rating: 4.5, 1600 reviews)  
   *Top complaints:* app selection, ads

The **iPad Air** is praised for its speed and display quality, though it comes at a premium. The **Galaxy Tab S7** offers excellent multitasking, but some users dislike the pre-installed apps. The **Fire HD 10** is a budget-friendly option, but its app ecosystem is more limited.

**Worst product:** Lenovo Tab M8 (Avg. Rating: 3.4)  
Users mention sluggish performance and a lackluster screen. For a better experience, consider one of the top-rated tablets above.

---
"""
    
    blogposts = []

    def extract_top_complaints(product_reviews, product_name, n=n_complaints):
        neg_reviews = product_reviews[
            (product_reviews['name'] == product_name) & 
            (product_reviews['sentiment'] == 'negative')
        ]['combined_reviews']
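        # Skip missing reviews and coerce to str so that join() cannot fail on NaN values
        words = ' '.join(neg_reviews.dropna().astype(str)).split()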
        filtered = [w.lower() for w in words if w.lower() not in stop_words and len(w) > 2]
        return [word for word, _ in Counter(filtered).most_common(n)]

    for category in df['clustered_category'].unique():
        cat_df = df[df['clustered_category'] == category]
        if cat_df['name'].nunique() == 0:
            continue

        # Rank products by sentiment score and review count
        product_stats = (
            cat_df.groupby('name')['sentiment_points']
            .agg(['mean', 'count'])
            .sort_values(by=['mean', 'count'], ascending=[False, False])
            .reset_index()
        )
        avg_ratings = cat_df.groupby('name')['reviews.rating'].mean()

        num_products = len(product_stats)
        if num_products == 1:
            top_n, include_worst = 1, False
        elif num_products == 2:
            top_n, include_worst = 1, True
        elif num_products == 3:
            top_n, include_worst = 2, True
        else:
            # Cap top_n so the worst product never also appears among the top picks
            top_n, include_worst = min(n_top_products, num_products - 1), True

        top_products = product_stats.head(top_n)
        worst_product = product_stats.tail(1) if include_worst else None

        # Compile product details for prompt
        product_lines = []
        for idx, row in top_products.iterrows():
            name = row['name']
            rating = avg_ratings[name]
            count = int(row['count'])
            complaints = extract_top_complaints(cat_df, name)
            complaint_text = ', '.join(complaints) if complaints else 'Few complaints!'
            product_lines.append(f"{idx + 1}. {name} (Avg. Rating: {rating:.2f}, {count} reviews)\n   Top complaints: {complaint_text}")

        # Compile worst product info
        worst_line = ""
        if include_worst and worst_product is not None:
            worst_name = worst_product.iloc[0]['name']
            worst_rating = avg_ratings[worst_name]
            worst_line = (
                f"The worst product is {worst_name} (Avg. Rating: {worst_rating:.2f}).\n"
                "Explain why customers should avoid the worst product from the category, based on reviews.\n"
            )

        # Construct final prompt
        category_prompt = (
            "\n\n### Blogpost:\n\n"
            f"You are a product reviewer. Write a short, helpful blogpost for customers shopping for {category}.\n"
            f"- The top {top_n} product{'s are' if top_n > 1 else ' is'}:\n"
            + '\n'.join(product_lines) + "\n"  # newline so the next sentence starts on its own line
            + worst_line +
            "Write the blog entry in a friendly, informative tone.\n"
        )

        full_prompt = (n_shot_prompt or "") + category_prompt
        generated = generator(
            full_prompt,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            return_full_text=False  # return only the completion, not the echoed prompt
        )[0]['generated_text']

        # The prompt is no longer part of the output, so no string splitting is needed
        blog_entry = generated.strip()
        blogposts.append(f"--- Blogpost for {category} ---\n\n{blog_entry}")

    return blogposts
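
A text-generation pipeline returns a list of dictionaries with a `generated_text` key, and by default that text echoes the prompt. The sketch below uses a stub with the same return shape to show how the completion can be separated from the prompt without loading a model. Both `fake_generator` and `extract_completion` are hypothetical names and are not part of the function above.

```python
def fake_generator(prompt, **generation_kwargs):
    """Stand-in for a text-generation pipeline: same return shape, no model needed."""
    return [{"generated_text": prompt + " Here are our top picks!"}]


def extract_completion(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt, if present, and trim surrounding whitespace."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = "Write the blog entry in a friendly, informative tone.\n"
raw = fake_generator(prompt, max_new_tokens=50)[0]["generated_text"]
print(extract_completion(raw, prompt))
```

A stub like this also makes it possible to exercise the ranking and prompt-building logic in a quick test before spending GPU time on real generation.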