From: Alex Rivera, Head of Product Management
Challenge: Our product categorization system currently relies on manual tagging by our content
team. This process is time-consuming and inconsistent. We believe that customer reviews contain rich information about product characteristics that could be used to automatically classify
products into the correct categories.
Request: Develop a classification system that can automatically categorize products into their
appropriate departments (electronics, home goods, fashion, beauty, etc.) based solely on the
language used in customer reviews. This would help us with:
• Automatically categorizing new products
• Identifying miscategorized existing products
• Understanding cross-category product attributes
Success metrics: Classification accuracy of at least 85% across major categories and a clear
explanation of which review elements are most predictive of product categories.

In [None]:
from de3_preprocessing import load_preprocessed, compute_tfidf, normalize_features
import pandas as pd
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df, embeddings = load_preprocessed("/content/drive/MyDrive/DEAssignment3/review")



TF-IDF is used below to show the top terms in reviews for each category. It weights each word by how often it appears in a review, discounted by how many reviews contain it. Distinctive words like "album", "fit", or "battery" therefore score higher, whereas generic words that appear across most reviews score lower.
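To make the weighting concrete, here is a minimal sketch on three made-up reviews (the `toy_*` names and the review texts are hypothetical and not part of the assignment data). It relies on scikit-learn's default settings, where a word that appears in every document gets the lowest inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, not taken from the assignment data
toy_reviews = [
    "great battery life",
    "great fit",
    "great album",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)
toy_vocab = toy_vectorizer.vocabulary_  # maps each term to its column index

# "great" appears in every review, so its inverse document frequency is the lowest
idf_great = toy_vectorizer.idf_[toy_vocab["great"]]
idf_battery = toy_vectorizer.idf_[toy_vocab["battery"]]
print(f"idf(great)={idf_great:.3f}, idf(battery)={idf_battery:.3f}")

# Within the first review, the distinctive word outweighs the generic one
score_great = toy_matrix[0, toy_vocab["great"]]
score_battery = toy_matrix[0, toy_vocab["battery"]]
print(f"review 0: great={score_great:.3f}, battery={score_battery:.3f}")
```

Both words occur once in the first review, so the difference in their scores comes entirely from how many reviews contain each word.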

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


### 1. TF-IDF Term Visualization
tfidf = TfidfVectorizer(max_features=300)  # converts text into a matrix of TF-IDF scores, keeping the 300 most frequent terms in the corpus
tfidf_matrix = tfidf.fit_transform(df["clean_text"])
tfidf_features = np.array(tfidf.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_features)
tfidf_df["category"] = df["category"].values

top_n = 10  # number of top terms to keep per category
top_terms_all = []
for category in tfidf_df["category"].unique():
    cat_df = tfidf_df[tfidf_df["category"] == category].drop(columns="category")
    avg_scores = cat_df.mean().sort_values(ascending=False).head(top_n)  # mean TF-IDF score per term, highest first
    for term, score in avg_scores.items():
        top_terms_all.append({"category": category, "term": term, "score": score})

top_terms_df = pd.DataFrame(top_terms_all)
plt.figure(figsize=(12, 6))
sns.barplot(data=top_terms_df, x="term", y="score", hue="category")
plt.title("Top TF-IDF Terms by Category")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Review-Level Classification
- Review length and helpfulness votes are normalized (scaled).
- Text features (TF-IDF), semantic features (embeddings), and numeric features are horizontally stacked into a single feature matrix used for modeling.
- Models used:


*   Logistic Regression
*   Random Forest
*   Support Vector Machine
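
The scaling and stacking steps described above can be sketched on toy data as follows. This is an illustration under stated assumptions: the `toy_*` arrays are hypothetical, and z-score scaling with `StandardScaler` is an assumption, since the notebook's `normalize_features` helper may scale differently.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the three feature blocks (4 reviews each)
toy_tfidf = csr_matrix(np.array([[0.0, 0.7], [0.5, 0.0], [0.0, 0.0], [0.3, 0.3]]))
toy_embeddings = np.random.default_rng(0).normal(size=(4, 3))
toy_numeric = np.array([[120, 0], [45, 3], [300, 12], [80, 1]], dtype=float)

# Assumption: z-score scaling; normalize_features in the notebook may differ
toy_scaled = StandardScaler().fit_transform(toy_numeric)

# Stack column-wise; every block must have the same number of rows
toy_X = hstack(
    [toy_tfidf, csr_matrix(toy_embeddings), csr_matrix(toy_scaled)],
    format="csr",
)
print(toy_X.shape)  # (4, 7)
```

Scaling matters here because raw review lengths can be orders of magnitude larger than TF-IDF scores, which would otherwise dominate the linear models.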



In [None]:
# Normalize numeric features
scaled_numeric, numeric_sparse = normalize_features(df, ["review_length", "helpful_vote"])
embedding_sparse = csr_matrix(embeddings)
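# Stack TF-IDF, embedding, and numeric features column-wise; CSR format supports the row indexing used when splitting
X_combined = hstack([tfidf_matrix, embedding_sparse, numeric_sparse], format="csr")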
y = df["category"]

X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, stratify=y, random_state=42)

# Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_log = logreg.predict(X_test)
print("Logistic Regression:\n", classification_report(y_test, y_pred_log))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))
print("")

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest:\n", classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("")

# Support Vector Machine
svm = LinearSVC(max_iter=1000)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("Support Vector Machine:\n", classification_report(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))

Product-Level Classification
- This approach aggregates information across all reviews for a given product, grouped by parent_asin.
- For each product, the review embeddings are averaged into a single embedding vector that captures the overall sentiment and themes of its reviews.
- Each product is assigned the category label of its first review.
- Models used:


*   Logistic Regression
*   Random Forest Classifier
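
The aggregation described above can be sketched on toy data as follows. The `toy_*` names and values are hypothetical. The majority-vote label is shown only as a possible alternative for products whose reviews disagree on the category; it is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 reviews across 2 products, 2-dimensional embeddings
toy_df = pd.DataFrame({
    "parent_asin": ["A1", "A1", "B2", "B2", "B2"],
    "category": ["electronics", "electronics", "beauty", "beauty", "fashion"],
})
toy_emb = np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 1.0], [0.0, 3.0], [3.0, 2.0]])

# Average the embedding rows of each product (rows of toy_emb align with toy_df)
emb_df = pd.DataFrame(toy_emb)
product_vectors = emb_df.groupby(toy_df["parent_asin"].values).mean()

# Label taken from the first review of each product, as in the list above
first_label = toy_df.groupby("parent_asin")["category"].first()

# Alternative (not used in the notebook): most common label per product
majority_label = toy_df.groupby("parent_asin")["category"].agg(
    lambda s: s.mode().iloc[0]
)

print(product_vectors)
print(first_label.to_dict())
```

Taking the first review's label is simple, but it depends on row order whenever a product's reviews carry more than one category.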



In [None]:
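# Group reviews by product; resetting to a positional index keeps group.index aligned with the rows of `embeddings`
grouped = df.reset_index(drop=True).groupby("parent_asin")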
product_embeddings = {}
product_labels = {}
for product_id, group in grouped:
    product_embeddings[product_id] = np.mean(embeddings[group.index], axis=0)
    product_labels[product_id] = group["category"].iloc[0]

X_product = np.array(list(product_embeddings.values()))
y_product = np.array(list(product_labels.values()))

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_product, y_product, test_size=0.2, stratify=y_product, random_state=42)

# Product-Level Logistic Regression
clf_p = LogisticRegression(max_iter=1000)
clf_p.fit(X_train_p, y_train_p)
y_pred_p = clf_p.predict(X_test_p)
print("Product-Level Logistic Regression:\n", classification_report(y_test_p, y_pred_p))
print("Confusion Matrix:\n", confusion_matrix(y_test_p, y_pred_p))

# Product-Level Random Forest
rf_p = RandomForestClassifier(n_estimators=100, random_state=42)
rf_p.fit(X_train_p, y_train_p)
y_pred_rf_p = rf_p.predict(X_test_p)
print("Product-Level Random Forest:\n", classification_report(y_test_p, y_pred_rf_p))
print("Confusion Matrix:\n", confusion_matrix(y_test_p, y_pred_rf_p))