# Smart Gift Planner - Holiday Data Jam 2025

**Author:** Viktor  
**Date:** December 2025  
**Project:** Smart Gift Planner Recommender System

## Introduction

This project implements a Smart Gift Planner, a recommender system that suggests personalized gifts based on a user's stated interests and budget. It is built on the Amazon Products Dataset, which, as loaded here, provides product titles, image and product URLs, star ratings, review counts, prices, list prices, category IDs, a best-seller flag, and a bought-in-last-month count.

The workflow starts with preprocessing to clean and normalize the data, then builds a baseline and an enhanced recommendation model. The baseline model uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with cosine similarity to match user interests against product text (in this dataset, the cleaned titles). The enhanced model upgrades this approach with sentence-transformer embeddings for semantic matching and adds ranking signals such as average rating, review count, and a popularity score. As a result, recommendations are relevant to the user's stated interests and also reflect product quality and community validation.

The final deliverable includes cleaned datasets, recommender functions, visualizations for SE integration, and JSON outputs ready for mobile app implementation.
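
As a rough illustration of the baseline described above, the sketch below fits a TF-IDF vectorizer on a handful of made-up product titles, filters by budget, ranks the remaining items by cosine similarity to the interest string, and serializes the result to JSON. The function name `recommend_tfidf`, its parameters, and the toy catalog are assumptions for illustration, not the project's actual implementation.

```python
import json

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog (made-up rows) standing in for the cleaned Amazon data.
catalog = pd.DataFrame({
    "title_clean": [
        "wireless bluetooth headphones noise cancelling",
        "yoga mat non slip exercise fitness",
        "stainless steel chef knife kitchen cooking",
        "bluetooth speaker portable waterproof",
    ],
    "price": [59.99, 24.99, 34.50, 129.00],
})


def recommend_tfidf(interests, budget, products, top_n=3):
    """Hypothetical helper: rank in-budget products by TF-IDF cosine similarity."""
    in_budget = products[products["price"] <= budget].reset_index(drop=True)
    if in_budget.empty:
        # Nothing fits the budget, so return an empty frame with the same columns.
        return in_budget.assign(similarity=pd.Series(dtype=float))
    vectorizer = TfidfVectorizer(stop_words="english")
    item_vectors = vectorizer.fit_transform(in_budget["title_clean"])
    query_vector = vectorizer.transform([interests.lower()])
    scores = cosine_similarity(query_vector, item_vectors).ravel()
    ranked = in_budget.assign(similarity=scores)
    return ranked.sort_values("similarity", ascending=False).head(top_n)


results = recommend_tfidf("bluetooth headphones", budget=100, products=catalog)
# Records-style JSON is one plausible shape for a mobile client to consume.
print(json.dumps(results.to_dict(orient="records"), indent=2))
```

The sketch refits the vectorizer on every call to keep it short. On the full catalog it would likely be preferable to fit once and reuse the resulting matrix across queries.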

## 1. Setup and Imports

In [3]:
import pandas as pd
import numpy as np
import json
import re
import warnings
warnings.filterwarnings('ignore')

# Text processing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sentence transformers for enhanced model
from sentence_transformers import SentenceTransformer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go

## 2. Data Loading and Preprocessing

In [5]:
df = pd.read_csv('amazon_products.csv')

print(f"Dataset Shape: {df.shape}")
print(f"Total Products: {len(df):,}")
print(f"\nColumns: {df.columns.tolist()}")

# Data types
print(f"\nData Types:\n{df.dtypes}")

print("\nFirst 5 Rows:")
df.head()

Dataset Shape: (1426337, 11)
Total Products: 1,426,337

Columns: ['asin', 'title', 'imgUrl', 'productURL', 'stars', 'reviews', 'price', 'listPrice', 'category_id', 'isBestSeller', 'boughtInLastMonth']

Data Types:
asin                  object
title                 object
imgUrl                object
productURL            object
stars                float64
reviews                int64
price                float64
listPrice            float64
category_id            int64
isBestSeller            bool
boughtInLastMonth      int64
dtype: object

First 5 Rows:


Unnamed: 0,asin,title,imgUrl,productURL,stars,reviews,price,listPrice,category_id,isBestSeller,boughtInLastMonth
0,B014TMV5YE,"Sion Softside Expandable Roller Luggage, Black...",https://m.media-amazon.com/images/I/815dLQKYIY...,https://www.amazon.com/dp/B014TMV5YE,4.5,0,139.99,0.0,104,False,2000
1,B07GDLCQXV,Luggage Sets Expandable PC+ABS Durable Suitcas...,https://m.media-amazon.com/images/I/81bQlm7vf6...,https://www.amazon.com/dp/B07GDLCQXV,4.5,0,169.99,209.99,104,False,1000
2,B07XSCCZYG,Platinum Elite Softside Expandable Checked Lug...,https://m.media-amazon.com/images/I/71EA35zvJB...,https://www.amazon.com/dp/B07XSCCZYG,4.6,0,365.49,429.99,104,False,300
3,B08MVFKGJM,Freeform Hardside Expandable with Double Spinn...,https://m.media-amazon.com/images/I/91k6NYLQyI...,https://www.amazon.com/dp/B08MVFKGJM,4.6,0,291.59,354.37,104,False,400
4,B01DJLKZBA,Winfield 2 Hardside Expandable Luggage with Sp...,https://m.media-amazon.com/images/I/61NJoaZcP9...,https://www.amazon.com/dp/B01DJLKZBA,4.5,0,174.99,309.99,104,False,400


In [6]:
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
print(missing_df)

                   Missing Count  Missing %
asin                           0    0.00000
title                          1    0.00007
imgUrl                         0    0.00000
productURL                     0    0.00000
stars                          0    0.00000
reviews                        0    0.00000
price                          0    0.00000
listPrice                      0    0.00000
category_id                    0    0.00000
isBestSeller                   0    0.00000
boughtInLastMonth              0    0.00000


In [8]:
# Cleaning functions
def clean_text(text):
    """Clean and normalize text fields."""
    if pd.isna(text):
        return ''
    text = str(text).lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
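
On a dataset of roughly 1.4 million titles, a row-by-row `.apply` can be slow. A possible vectorized counterpart using pandas string methods is sketched below. `clean_text_series` is a hypothetical name and the sample titles are made up; the function is intended to produce the same output as `clean_text` above.

```python
import pandas as pd


def clean_text_series(titles: pd.Series) -> pd.Series:
    """Hypothetical vectorized counterpart of clean_text for a whole column."""
    return (
        titles.fillna("")                               # missing titles become ''
        .astype(str)
        .str.lower()
        .str.replace(r"[^a-z0-9\s]", " ", regex=True)   # drop punctuation and symbols
        .str.replace(r"\s+", " ", regex=True)           # collapse repeated whitespace
        .str.strip()
    )


# Made-up sample titles, including a missing value.
sample = pd.Series(["Luggage Sets: PC+ABS, 3-Piece!", None, "  Roller   Bag  "])
print(clean_text_series(sample).tolist())
```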

In [12]:
# Apply preprocessing
print("Applying data cleaning:")

df['title_clean'] = df['title'].apply(clean_text)

# Add alias columns under the names the models expect (the originals are kept)
df['rating'] = df['stars']
df['review_count'] = df['reviews']

# Filter out products with zero/invalid prices
print(f"Products before filtering: {len(df):,}")
df_clean = df[df['price'] > 0].copy()
print(f"Products after removing $0 prices: {len(df_clean):,}")

df_clean.head(5)

Applying data cleaning:
Products before filtering: 1,426,337
Products after removing $0 prices: 1,393,565


Unnamed: 0,asin,title,imgUrl,productURL,stars,reviews,price,listPrice,category_id,isBestSeller,boughtInLastMonth,title_clean,rating,review_count
0,B014TMV5YE,"Sion Softside Expandable Roller Luggage, Black...",https://m.media-amazon.com/images/I/815dLQKYIY...,https://www.amazon.com/dp/B014TMV5YE,4.5,0,139.99,0.0,104,False,2000,sion softside expandable roller luggage black ...,4.5,0
1,B07GDLCQXV,Luggage Sets Expandable PC+ABS Durable Suitcas...,https://m.media-amazon.com/images/I/81bQlm7vf6...,https://www.amazon.com/dp/B07GDLCQXV,4.5,0,169.99,209.99,104,False,1000,luggage sets expandable pc abs durable suitcas...,4.5,0
2,B07XSCCZYG,Platinum Elite Softside Expandable Checked Lug...,https://m.media-amazon.com/images/I/71EA35zvJB...,https://www.amazon.com/dp/B07XSCCZYG,4.6,0,365.49,429.99,104,False,300,platinum elite softside expandable checked lug...,4.6,0
3,B08MVFKGJM,Freeform Hardside Expandable with Double Spinn...,https://m.media-amazon.com/images/I/91k6NYLQyI...,https://www.amazon.com/dp/B08MVFKGJM,4.6,0,291.59,354.37,104,False,400,freeform hardside expandable with double spinn...,4.6,0
4,B01DJLKZBA,Winfield 2 Hardside Expandable Luggage with Sp...,https://m.media-amazon.com/images/I/61NJoaZcP9...,https://www.amazon.com/dp/B01DJLKZBA,4.5,0,174.99,309.99,104,False,400,winfield 2 hardside expandable luggage with sp...,4.5,0


In [13]:
# Handle missing values
print("Handling missing values:")

# clean_text already maps the 1 missing title to ''; this fillna is only a safeguard
df_clean['title_clean'] = df_clean['title_clean'].fillna('')

# Fill any missing ratings with median
df_clean['rating'] = df_clean['rating'].fillna(df_clean['rating'].median())

print(f"Total products ready: {len(df_clean):,}")

Handling missing values:
Total products ready: 1,393,565


In [17]:
# Text feature for the recommender (the dataset has no description column,
# so the cleaned title is the only text available)
df_clean['combined_text'] = df_clean['title_clean']

print("\nSample combined text:")
print(df_clean['combined_text'].iloc[0])


Sample combined text:
sion softside expandable roller luggage black checked large 29 inch


In [18]:
# Identify anomalies (flagged only; no rows are removed)
print("Anomaly Detection:")

# Price anomalies
print(df_clean['price'].describe())

# Flag extreme prices (potential anomalies)
Q1 = df_clean['price'].quantile(0.25)
Q3 = df_clean['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

price_anomalies = df_clean[(df_clean['price'] < lower_bound) | (df_clean['price'] > upper_bound)]
print(f"\nPrice anomalies detected: {len(price_anomalies):,} products")
print(f"(Products priced below ${lower_bound:.2f} or above ${upper_bound:.2f})")

# Rating anomalies
rating_anomalies = df_clean[(df_clean['rating'] < 0) | (df_clean['rating'] > 5)]
print(f"Rating anomalies detected: {len(rating_anomalies):,} products")

Anomaly Detection:
count    1.393565e+06
mean     4.439545e+01
std      1.316405e+02
min      1.000000e-02
25%      1.199000e+01
50%      1.999000e+01
75%      3.696000e+01
max      1.973181e+04
Name: price, dtype: float64

Price anomalies detected: 156,524 products
(Products priced below $-25.46 or above $74.41)
Rating anomalies detected: 0 products
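
The 1.5 * IQR rule used above can be wrapped in a small reusable helper, sketched here under stated assumptions: `iqr_outlier_mask` is a hypothetical name and the numbers are toy data, not values from the dataset. Because prices are strictly positive and right-skewed, the lower fence is negative (as in the output above), so in practice only the upper fence flags anything.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the k * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


prices = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # toy prices, not from the dataset
mask = iqr_outlier_mask(prices)
print(prices[mask].tolist())
```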


In [19]:
# Calculate popularity score for enhanced model
print("Calculating popularity scores:")

# Normalize review count (0-1 scale)
max_reviews = df_clean['review_count'].max()
df_clean['review_score'] = (df_clean['review_count'] / max_reviews) if max_reviews > 0 else 0

# Normalize rating (0-5 to 0-1)
df_clean['rating_score'] = df_clean['rating'] / 5

# Normalize boughtInLastMonth (0-1 scale)
max_bought = df_clean['boughtInLastMonth'].max()
df_clean['bought_score'] = (df_clean['boughtInLastMonth'] / max_bought) if max_bought > 0 else 0

# Combined popularity score (weighted average; weights sum to 1.0)
df_clean['popularity_score'] = (
    0.4 * df_clean['rating_score']
    + 0.3 * df_clean['review_score']
    + 0.3 * df_clean['bought_score']
)

print(f"Popularity score range: {df_clean['popularity_score'].min():.3f} - {df_clean['popularity_score'].max():.3f}")

Calculating popularity scores:
Popularity score range: 0.000 - 0.944
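
One way the popularity score could feed into the enhanced ranking is a weighted blend with the text-similarity score. The sketch below assumes a mixing weight `alpha` and a helper name `blend_scores`, neither of which comes from the notebook; the similarity and popularity values are invented.

```python
import numpy as np


def blend_scores(similarity, popularity, alpha=0.7):
    """Hypothetical blend: alpha * similarity + (1 - alpha) * popularity."""
    similarity = np.asarray(similarity, dtype=float)
    popularity = np.asarray(popularity, dtype=float)
    return alpha * similarity + (1.0 - alpha) * popularity


# Invented scores for three candidate products.
similarity = [0.80, 0.78, 0.20]
popularity = [0.10, 0.90, 0.95]

final = blend_scores(similarity, popularity)
ranking = np.argsort(-final)  # candidate indices, best first
print(ranking.tolist())
```

With these toy values, the second candidate overtakes the first: it is slightly less similar to the query but far more popular. A larger `alpha` would favour relevance over popularity.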


In [23]:
# Final cleaned dataset summary
print("Preprocessing complete - Final dataset summary:")
print(f"\nTotal products: {len(df_clean):,}")
print(f"Columns: {len(df_clean.columns)}")
print(f"\nPrice range: ${df_clean['price'].min():.2f} - ${df_clean['price'].max():.2f}")
print(f"Rating range: {df_clean['rating'].min():.1f} - {df_clean['rating'].max():.1f}")
print(f"Unique categories: {df_clean['category_id'].nunique()}")
print(f"\nBestSellers: {df_clean['isBestSeller'].sum():,}")
print(f"Products with reviews: {(df_clean['review_count'] > 0).sum():,}")

Preprocessing complete - Final dataset summary:

Total products: 1,393,565
Columns: 19

Price range: $0.01 - $19731.81
Rating range: 0.0 - 5.0
Unique categories: 248

BestSellers: 8,469
Products with reviews: 290,127
