# ReviewLens AI: Model Development and Validation

**Author:** Kevin Della Piazza
**Date:** October 2025

## 1. Introduction
This notebook documents the development and validation of the core AI models for the ReviewLens platform. The objective is to test four distinct NLP models on a sample of the dataset, validate their effectiveness, and prototype the foundational Python code for the production AWS Lambda functions. It covers the data science workflow from data preparation to model evaluation.

### AI Models Tested:
1.  **Sentiment Analysis:** For overall positive/negative classification.
2.  **Zero-Shot Classification:** For dynamic topic tagging.
3.  **Aspect-Based Sentiment Analysis (ABSA):** For fine-grained sentiment on specific features.
4.  **Topic Modeling:** For discovering latent themes in the text corpus.

In [None]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import warnings
from IPython.display import display
import re
import stopit

# --- AI Libraries ---
from transformers import pipeline
from pyabsa import ATEPCCheckpointManager
from bertopic import BERTopic

# --- Validation ---
from sklearn.metrics import accuracy_score, classification_report

# --- Configuration ---
# Suppress ignorable warnings for a cleaner output
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
tqdm.pandas()

print("Libraries imported successfully.")

## 2. Data Loading, Cleaning & Sanitization
We load the raw dataset and apply the cleaning steps that will be replicated in our AWS pipeline. A text sanitization step is included to prevent errors from special characters, making the code more robust for production.
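
As a quick sanity check of the sanitization rule described above, the following standalone sketch re-implements the same two steps (ampersand replacement and control-character removal) and applies them to a made-up string. The sample input is hypothetical. Note that tab, newline, and carriage return fall outside the removed ranges, so they are preserved.

```python
import re

# Same control-character ranges as the notebook's sanitize_text:
# C0/C1 control codes except tab (\x09), newline (\x0a) and carriage return (\x0d).
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')

def sanitize_text(text):
    """Return a cleaned string; non-string input (e.g. NaN) becomes ''."""
    if not isinstance(text, str):
        return ""
    text = text.replace('&', 'and')
    return CONTROL_CHARS.sub('', text)

sample = "Soft & cozy\x00 top\x07\nruns small"  # hypothetical review text
cleaned = sanitize_text(sample)
print(repr(cleaned))
```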

In [None]:
# Load the raw dataset
try:
    # Use a relative path to ensure portability
    file_path = '../data/reviews.csv'
    df = pd.read_csv(file_path)
    print(f"Dataset loaded successfully with {len(df)} rows.")

    # --- Data Cleaning ---
    print("Starting data cleaning...")
    df_cleaned = df.drop('Unnamed: 0', axis=1, errors='ignore')
    df_cleaned.dropna(subset=['Review Text'], inplace=True)
    df_cleaned['Title'] = df_cleaned['Title'].fillna('')
    df_cleaned['full_review_text'] = df_cleaned['Title'] + ' ' + df_cleaned['Review Text']
    df_cleaned.dropna(subset=['Division Name', 'Department Name', 'Class Name'], inplace=True)
    
    # --- Text Sanitization Function ---
    def sanitize_text(text):
        if not isinstance(text, str): return ""
        # Replace ampersand, which was identified as a "poison pill" for ABSA
        text = text.replace('&', 'and')
        # Remove non-printable control characters
        text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)
        return text

    # Apply sanitization to the main text column
    df_cleaned['full_review_text'] = df_cleaned['full_review_text'].apply(sanitize_text)
    print(f"Data cleaned and sanitized. {len(df_cleaned)} rows remaining.")

except FileNotFoundError:
    print("Error: reviews.csv not found. Please ensure the file is in the '../data/' directory.")

## 3. Sample Creation
To ensure rapid development and testing, we create two distinct samples:
* A **small sample (100 reviews)** for fast, iterative tests on row-level models.
* A **larger sample (2000 reviews)** for Topic Modeling, which requires a larger corpus to generate meaningful results.

In [None]:
if 'df_cleaned' in locals() and not df_cleaned.empty:
    # Small sample for fast, iterative tests
    df_sample_fast = df_cleaned.sample(100, random_state=42).copy()
    print(f"Created 'df_sample_fast' with {len(df_sample_fast)} reviews for rapid testing.")

    # Larger sample required for meaningful Topic Modeling
    df_sample_topic = df_cleaned.sample(2000, random_state=42).copy()
    print(f"Created 'df_sample_topic' with {len(df_sample_topic)} reviews for Topic Modeling.")

## 4. AI Model Development & Validation

### A Note on Model Constraints (Token Limits)

Models based on BERT-style Transformer encoders (including DistilBERT and `pyabsa`'s underlying models) have a **maximum input size limit**, typically 512 tokens (approx. 350-400 words).

In a production environment, passing text longer than this limit without truncation will cause the model to fail. To build a robust pipeline, we must **truncate** the input text to this 512-token limit.

**Is this a problem?** No. For review analysis, this is an acceptable trade-off. The core sentiment and topics of a review are usually contained in the first few paragraphs, so truncation keeps inference time bounded and avoids length-related failures at what should be a small cost to accuracy. The embedding model behind `bertopic` truncates over-long inputs automatically.
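
The word-count heuristic noted above (roughly 350-400 words for 512 tokens) can be sketched as a dependency-free helper. The name `truncate_words` and its default of 400 are illustrative assumptions. The ratio of words to tokens varies with the tokenizer and the text, so tokenizer-level truncation is the stricter safeguard when it is available.

```python
def truncate_words(text, max_words=400):
    """Keep at most `max_words` whitespace-separated words.

    This only approximates a 512-token limit: subword tokenizers
    may split a single word into several tokens.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])

long_review = "great dress " * 500   # hypothetical 1000-word review
short_review = "Love it."

print(len(truncate_words(long_review).split()))
print(truncate_words(short_review))
```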



### 4.1. Sentiment Analysis
* **Business Question:** Are our customers generally happy or unhappy?
* **Model Chosen:** `distilbert-base-uncased-finetuned-sst-2-english`
* **Justification:** Chosen for its excellent balance of speed and accuracy, making it ideal for scalable, serverless environments.

In [None]:
print("Loading Sentiment Analysis model...")
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def safe_get_sentiment(text):
    """Applies sentiment analysis with error handling."""
    try:
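        # Truncate at the tokenizer level (512 tokens); text[:512] would cut at 512 characters instead
        return sentiment_pipeline(text, truncation=True, max_length=512)[0]['label']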
    except Exception as e:
        print(f"--> SENTIMENT FAILED on review: '{text[:50]}...' | Error: {e}")
        return "ERROR"

print(f"Applying Sentiment Analysis to the fast sample ({len(df_sample_fast)} reviews)...")
df_sample_fast['sentiment_prediction'] = df_sample_fast['full_review_text'].progress_apply(safe_get_sentiment)
print("Sentiment Analysis complete.")

# --- Quantitative Validation ---
print("\n--- Model Validation ---")
# Define "ground truth" from user ratings: > 3 stars is positive, so 3-star reviews count as negative
df_sample_fast['true_sentiment'] = np.where(df_sample_fast['Rating'] > 3, 'POSITIVE', 'NEGATIVE')

# Calculate accuracy
accuracy = accuracy_score(df_sample_fast['true_sentiment'], df_sample_fast['sentiment_prediction'])
print(f"Sentiment Model Accuracy on Sample: {accuracy:.2%}")

# Display a detailed classification report
print("\nClassification Report:")
print(classification_report(df_sample_fast['true_sentiment'], df_sample_fast['sentiment_prediction']))

### 4.2. Zero-Shot Classification
* **Business Question:** What specific topics (e.g., 'price', 'shipping') are customers talking about?
* **Model Chosen:** `typeform/distilbert-base-uncased-mnli`
* **Justification:** This distilled MNLI model provides zero-shot classification at low latency, which suits a serverless environment. It lets us dynamically categorize reviews against key business topics (like 'price' or 'shipping') without retraining the model for new categories.

In [None]:
print("Loading Zero-Shot Classification model (typeform/distilbert-base-uncased-mnli)...")
zero_shot_classifier = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli")
print("Zero-Shot model loaded successfully.")

candidate_labels = ['price', 'quality', 'shipping', 'customer service', 'fit', 'fabric']

def safe_get_top_topic(review_text):
    """Applies zero-shot classification with error handling."""
    try:
        # Note: [:512] slices characters, a conservative stand-in for the 512-token limit
        return zero_shot_classifier(review_text[:512], candidate_labels)['labels'][0]
    except Exception as e:
        print(f"--> ZERO-SHOT FAILED on review: '{review_text[:50]}...' | Error: {e}")
        return "ERROR"

print(f"Applying Zero-Shot Classification to the fast sample ({len(df_sample_fast)} reviews)...")
df_sample_fast['zero_shot_topic'] = df_sample_fast['full_review_text'].progress_apply(safe_get_top_topic)
print("Zero-Shot Classification complete.")

# --- Qualitative Validation ---
print("\n--- Qualitative Validation Examples ---")
display(df_sample_fast[['full_review_text', 'zero_shot_topic']].head())

### 4.3. Aspect-Based Sentiment Analysis (ABSA using Zero-Shot)
* **Business Question:** When customers discuss a topic, what specific *features* do they like or dislike, and with what sentiment?
* **Model Chosen:** `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` (used via Zero-Shot Classification pipeline)
* **Justification:** We perform ABSA with a Zero-Shot Classification technique built on the multilingual mDeBERTa model. By providing a curated list of "Aspect-Sentiment Pairs" (e.g., 'slow delivery') as candidate labels and enabling `multi_label=True`, every label is scored independently, so a single review can match several aspects. Labels scoring above a confidence threshold are kept. The approach is stable enough to run in a deployable serverless function.

In [None]:
print("Loading Zero-Shot Classification model for ABSA (MoritzLaurer/mDeBERTa-v3-base-mnli-xnli)...")
# Note: This is a different, more powerful model than the one used for Topic classification
absa_zero_shot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")
print("Zero-Shot ABSA model loaded successfully.")

# --- Define Aspect-Sentiment Pair (ASP) Labels ---
# These labels are crucial. They define the specific aspect-sentiment combinations the model will look for.
# Customize this list based on the expected domain (e.g., clothing, electronics).
aspect_sentiment_labels = [
    'slow delivery', 'fast delivery', 'damaged box', 'good packaging',
    'good quality', 'poor quality', 'defective item',
    'good fit', 'tight fit', 'loose fit', 'wrong size',
    'good price', 'expensive', 'value for money',
    'soft fabric', 'rough fabric', 'nice color', 'wrong color',
    'good customer service', 'poor customer service'
]
# Define the confidence score threshold
score_threshold = 0.6

def safe_get_aspects_zeroshot(review_text):
    """
    Applies zero-shot classification with multi_label=True to find all relevant
    Aspect-Sentiment Pairs above a certain threshold.
    """
    if not isinstance(review_text, str) or not review_text.strip():
        return "N/A"

    try:
        # Truncate text
        truncated_text = " ".join(review_text.split()[:400]) # Safe estimate for 512 tokens

        # Run inference with multi_label=True
        results = absa_zero_shot_classifier(
            truncated_text,
            aspect_sentiment_labels,
            multi_label=True
        )

        # Filter results based on the score threshold
        matching_aspects = []
        for label, score in zip(results['labels'], results['scores']):
            if score >= score_threshold:
                # Store as "label (score)" for validation, or just "label" for production
                matching_aspects.append(f"{label} ({score:.2f})")

        if not matching_aspects:
            return "N/A"

        # Return comma-separated string
        return ", ".join(matching_aspects)

    except Exception as e:
        print(f"--> ZERO-SHOT ABSA FAILED on review: '{review_text[:50]}...' | Error: {e}")
        return "PREDICTION_ERROR" # Use a distinct error code

print(f"Applying Zero-Shot ABSA to the fast sample ({len(df_sample_fast)} reviews)...")
# Apply the new function
df_sample_fast['aspects'] = df_sample_fast['full_review_text'].progress_apply(safe_get_aspects_zeroshot)
print("Zero-Shot ABSA complete.")

# --- Qualitative Validation & Error Checking ---
print("\n--- Qualitative Validation Examples (Aspects) ---")
# Display reviews where aspects were successfully found
display(df_sample_fast[~df_sample_fast['aspects'].isin(["N/A", "PREDICTION_ERROR"])][['full_review_text', 'aspects']].head())

failed_reviews = df_sample_fast[df_sample_fast['aspects'] == "PREDICTION_ERROR"]
if not failed_reviews.empty:
    print(f"\nWARNING: {len(failed_reviews)} reviews failed during Zero-Shot ABSA prediction.")

### 4.4. Topic Modeling
* **Business Question:** What are the hidden, high-level themes of conversation across all reviews?
* **Model Chosen:** `bertopic`
* **Justification:** BERTopic leverages transformer embeddings to find semantically coherent topics, which are more interpretable than traditional methods.

In [None]:
print(f"Starting Topic Modeling on the larger sample ({len(df_sample_topic)} reviews)...")
docs = df_sample_topic['full_review_text'].tolist()

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# 1. Create the list of stop words to ignore
stop_words = list(ENGLISH_STOP_WORDS)

# 2. Create the CountVectorizer and give it the stop word list
vectorizer = CountVectorizer(stop_words=stop_words)

# 3. Initialize BERTopic and tell it to use our custom vectorizer
topic_model = BERTopic(language="english", vectorizer_model=vectorizer, verbose=False)

topics, probs = topic_model.fit_transform(docs)

print("Topic Modeling complete!")

# --- Qualitative Validation ---
print("\n--- Discovered Topics Summary ---")

# Display a summary of the most prominent topics. Topic -1 contains outliers and can be ignored.
# head(5) shows the first five rows: the outlier row (if present) plus the largest topics.
display(topic_model.get_topic_info().head(5))

#### **How to Read This Table:**

* **`Topic`**: The ID number for the theme.
    * **Topic `-1` (The Outliers):** This is the **outlier group**. It contains all unique, one-off reviews that don't fit into a larger theme. **You should ignore Topic -1 when analyzing trends.**
* **`Count`**: The number of reviews in that theme. This shows you how popular a topic is.
* **`Representation`**: The **new, clean keywords** that best describe the theme. After removing the stop words, we can see the true essence of the conversation (e.g., `[jeans, fit, pants, love]`).
* **`Representative_Docs`**: A full, real review from that group, which provides the ultimate context. **Read this column to give the topic a human-readable name.**

**Example Interpretation:**
* **Topic 0:** Keywords might be `[dress, love, fit, size]`. This is the **"Dress Reviews"** cluster.
* **Topic 1:** Keywords might be `[jeans, fit, pants, comfortable]`. This is the **"Pants & Jeans"** cluster.
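
A minimal sketch of the reading rule above, using plain Python rows in place of the `get_topic_info()` DataFrame. The topic IDs, counts, and keywords below are made-up placeholders, not results from this notebook, and `real_topics` is a hypothetical helper.

```python
# Hypothetical rows mimicking the Topic / Count / Representation columns.
topic_rows = [
    {"Topic": -1, "Count": 50, "Representation": ["like", "just", "really"]},
    {"Topic": 1, "Count": 20, "Representation": ["jeans", "fit", "pants", "comfortable"]},
    {"Topic": 0, "Count": 30, "Representation": ["dress", "love", "fit", "size"]},
]

def real_topics(rows):
    """Drop the outlier group (Topic -1) and sort the rest by size, largest first."""
    kept = [row for row in rows if row["Topic"] != -1]
    return sorted(kept, key=lambda row: row["Count"], reverse=True)

for row in real_topics(topic_rows):
    print(row["Topic"], row["Count"], ", ".join(row["Representation"]))
```

With the actual DataFrame, the equivalent filter would be `info[info["Topic"] != -1]`, where `info` holds the result of `topic_model.get_topic_info()`.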

## 5. Final Result & Conclusion
The sample DataFrame is now enriched with the outputs of the three row-level models (topic modeling is a corpus-level result and is reported separately in section 4.4). This validated logic is ready to be integrated into the production Lambda functions. The final data structure provides a multi-dimensional view of each customer review, enabling a rich, interactive analysis on the final dashboard.

In [None]:
# Select and reorder columns for a clean final view
# Note: 'bertopic_id' is not in this fast sample; it's a corpus-level insight.
final_columns = [
    'full_review_text', 
    'sentiment_prediction', 
    'zero_shot_topic',
    'aspects',
    'Rating',
    'true_sentiment'
]

print("--- Final Enriched Sample DataFrame (from fast sample) ---")
display(df_sample_fast[final_columns].head(10))

The DataFrame above represents the **final enriched product** for a *single sample* of reviews. 
Each new column provides a different layer of AI-driven insight:

* **`full_review_text` (Input):** The raw, sanitized text (Title + Review Text) that was fed into the pipeline.
* **`sentiment_prediction` (AI Layer 1):** The **overall sentiment** of the entire review (e.g., `POSITIVE`/`NEGATIVE`), as determined by our *Sentiment Lambda*.
* **`zero_shot_topic` (AI Layer 2):** The **primary topic** of the review, dynamically classified into one of our predefined business categories (e.g., `price`, `quality`), as determined by our *Zero-Shot Lambda*.
* **`aspects` (AI Layer 3):** The **most granular and actionable analysis**. This lists the specific aspect-sentiment pairs detected in the review (e.g., `poor quality`, `tight fit`), each followed by its confidence score in parentheses, as determined by our *ABSA Lambda*.
* **`Rating` (Input):** The original 1-5 star rating provided by the user.
* **`true_sentiment` (Validation Only):** A "ground truth" column created *only* in this notebook to validate our model. This column is not part of the final production pipeline.
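
The validation rule behind `true_sentiment` can be sketched in plain Python. The ratings and predictions below are invented for illustration. The rule itself (more than 3 stars counts as `POSITIVE`) mirrors the notebook, which means 3-star reviews are scored as `NEGATIVE` and any `ERROR` prediction counts as a miss.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to the binary label used for validation."""
    return "POSITIVE" if rating > 3 else "NEGATIVE"

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

ratings = [5, 4, 3, 1, 2]  # hypothetical ratings
predictions = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "ERROR"]

truth = [rating_to_label(r) for r in ratings]
print(truth)
print(f"Accuracy: {accuracy(truth, predictions):.2%}")
```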

## 6. Conclusive Analysis: How to Interpret and Use These Results

The final, enriched table is a powerful business tool. Its value is unlocked by asking specific questions and combining the AI-generated columns to get answers.

Here is a practical guide on how a business stakeholder would use this data.

### Business Question 1: "How are we doing?"

**Analysis:** Use the `sentiment_prediction` column.

By aggregating this column, a manager can get an instant, high-level KPI of overall customer happiness. This dashboard metric can be tracked weekly to spot immediate changes in brand perception.

**Example Insight:**
* "This week, our Positive Sentiment Score dropped by 8%."
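
One way to compute such a KPI, sketched with plain Python. The weekly label lists are hypothetical, and `positive_share` is an illustrative helper rather than part of the pipeline. Here the change is expressed in percentage points.

```python
def positive_share(labels):
    """Share of POSITIVE labels among valid predictions (ERROR rows excluded)."""
    valid = [label for label in labels if label in ("POSITIVE", "NEGATIVE")]
    if not valid:
        return 0.0
    return sum(label == "POSITIVE" for label in valid) / len(valid)

last_week = ["POSITIVE"] * 8 + ["NEGATIVE"] * 2              # hypothetical data
this_week = ["POSITIVE"] * 7 + ["NEGATIVE"] * 3 + ["ERROR"]  # hypothetical data

change = positive_share(this_week) - positive_share(last_week)
print(f"Positive share: {positive_share(this_week):.0%} ({change:+.0%} vs last week)")
```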

---
### Business Question 2: "Our sentiment dropped... *Why*?"

**Analysis:** Combine `sentiment_prediction` and `zero_shot_topic`.

This is the first level of diagnosis. A manager can filter for all reviews where `sentiment_prediction == 'NEGATIVE'` and then create a bar chart of the `zero_shot_topic` column.

**Example Insight:**
* "Our sentiment dropped because of a 60% spike in complaints. By filtering for those complaints, we see that **72% of them are about 'shipping'**."
* **Action:** The business now knows exactly where the problem is. They don't need to waste time investigating 'price' or 'quality'; they have a clear priority.
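
The filter-then-count step described above can be sketched as follows. The rows are hypothetical stand-ins for the `sentiment_prediction` and `zero_shot_topic` columns.

```python
from collections import Counter

# Hypothetical enriched rows: (sentiment_prediction, zero_shot_topic)
rows = [
    ("NEGATIVE", "shipping"),
    ("NEGATIVE", "shipping"),
    ("NEGATIVE", "fit"),
    ("POSITIVE", "quality"),
    ("NEGATIVE", "shipping"),
    ("POSITIVE", "price"),
]

# Keep only negative reviews, then count how often each topic appears.
negative_topics = Counter(topic for sentiment, topic in rows if sentiment == "NEGATIVE")
total_negative = sum(negative_topics.values())

for topic, count in negative_topics.most_common():
    print(f"{topic}: {count} ({count / total_negative:.0%} of negative reviews)")
```

On the enriched DataFrame, a comparable result should come from `df.loc[df['sentiment_prediction'] == 'NEGATIVE', 'zero_shot_topic'].value_counts(normalize=True)`.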

---
### Business Question 3: "Okay, shipping is the problem. But *what about* shipping?"

**Analysis:** Combine `zero_shot_topic` and `aspects`.

This is the most powerful, granular insight. The manager can now filter for all reviews where `zero_shot_topic == 'shipping'` and analyze the `aspects` column.

**Example Insight:**
* By creating a word cloud from the `aspects` in this segment, they see that the most common phrases are **`slow delivery`** and **`damaged box`**.
* **Action:** The business has its final answer. The problem isn't the cost of shipping; it's that the courier is slow and the packaging is getting damaged. They can now take surgical action, like renegotiating with their courier or improving their packaging materials.
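
Because the `aspects` column stores a comma-separated string in the `label (score)` format produced in section 4.3, it has to be parsed before it can be counted. The sketch below shows one possible way. The rows are hypothetical, and `parse_aspects` is an illustrative helper that assumes scores are formatted with two decimals.

```python
import re
from collections import Counter

# Matches one entry such as "slow delivery (0.91)".
ASPECT_PATTERN = re.compile(r"^(?P<label>.+) \((?P<score>\d\.\d{2})\)$")

def parse_aspects(cell):
    """Turn 'slow delivery (0.91), damaged box (0.74)' into [(label, score), ...]."""
    if cell in ("N/A", "PREDICTION_ERROR"):
        return []
    pairs = []
    for part in cell.split(", "):
        match = ASPECT_PATTERN.match(part)
        if match:
            pairs.append((match["label"], float(match["score"])))
    return pairs

# Hypothetical (zero_shot_topic, aspects) rows
rows = [
    ("shipping", "slow delivery (0.91), damaged box (0.74)"),
    ("shipping", "slow delivery (0.83)"),
    ("fit", "tight fit (0.88)"),
    ("shipping", "N/A"),
]

# Count aspect labels within the 'shipping' segment only.
shipping_aspects = Counter(
    label
    for topic, cell in rows
    if topic == "shipping"
    for label, _score in parse_aspects(cell)
)
print(shipping_aspects.most_common())
```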

---
### Business Question 4: "What problems are we not even aware of?"

**Analysis:** Use the `bertopic_id` from the full, corpus-level analysis (as seen in section 4.4).

This analysis is performed by the `stitcher-lambda` on the entire dataset. It automatically clusters reviews by "hidden themes" that we didn't define in advance.

**Example Insight:**
* By examining the `topic_model.get_topic_info()` output, the manager spots a new cluster (e.g., `[jeans, fit, pants, comfortable]`).
* **Action:** The manager realizes that a huge number of customers (Count: 219) are discussing jeans not just in terms of price or quality (our predefined topics), but specifically in terms of fit and comfort. This hidden theme was previously invisible.

---
### Final Conclusion

This notebook has validated the logic for each of these four analytical layers. The deployed AWS pipeline is built to perform this exact multi-layer analysis at scale, transforming raw text feedback from a "cost center" (something to be stored) into a **strategic asset** (something to be queried).