## Final Project Part 2: Product Recommender System Using NLP   

Laine Close  
Marcos Fernandez  
Owen Randolph

Recommender systems are a common feature in many digital platforms where user choice plays a central role. They provide significant benefits for both users and creators. For instance, Amazon's recommendation engine has greatly increased user engagement and overall spending on its e-commerce site. For customers, these systems personalize the shopping experience by suggesting products that best match their preferences, whether inferred from individual browsing history or derived from broader popularity trends. Recommender systems have fundamentally transformed how users discover and interact with products online.

This notebook demonstrates the preliminary development and testing of a product recommender system using machine learning. It focuses on establishing the preprocessing, feature extraction, and model functionality needed to generate personalized product recommendations.

## Part 0: Setup Python Environment

The pymongo library is used as the Python driver for MongoDB, enabling interaction with the database directly from Python. It allows queries, inserts, updates, and data retrieval through an API similar to working with dictionaries and lists in Python.

In [1]:
!pip install pymongo



In [None]:
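import os
from pymongo import MongoClient

# Read the connection string from the environment rather than hard-coding
# credentials. MONGODB_URI is an assumed variable name; set it to your Atlas URI.
uri = os.environ["MONGODB_URI"]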

# Create the client
client = MongoClient(uri)

# List all databases in the cluster
print("Databases available:")
print(client.list_database_names())

Databases available:
['admin', 'amazon_reviews', 'config', 'local']


In [3]:
# If using Google Colab, save both datasets to Google Drive; they are too large to upload directly to the local '/content' directory
# from google.colab import drive
# drive.mount('/content/drive')

##### NLTK: NLP Library

NLTK is a toolkit for text preprocessing. It lets us:

- Tokenize text

- Remove stop words

- Tag parts of speech

- Lemmatize words

- Access lexical database

- Support multilingual data

- Enable sentence splitting

In [4]:
import gzip
import json
import re
import string

import nltk
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\orand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\orand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\orand\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\orand\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\orand\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\orand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


##### BERT
Before using advanced neural network models like BERT, the required transformer libraries are installed to enable text embedding and model loading.

In [5]:
# Used for text embedding with BERT
!pip install sentence-transformers
!pip install transformers



In [6]:
# Used for collaborative filtering
!pip install scikit-surprise


## Part 1. Import Electronics Data

We connect to our MongoDB cluster to load the data for the recommender system. MongoDB is a good fit because the source records are JSON documents with nested fields and long free-text strings, which map naturally onto a document (NoSQL) store. We used MongoDB Atlas to set up a three-node cluster and connect to the active cluster from the notebook. A cursor object queries the data from within the notebook; here it retrieves the first 10,000 documents so we can inspect them.

In [8]:
import os

from pymongo import MongoClient
import pandas as pd

# connect to MongoDB; the connection string is read from the environment
# (MONGODB_URI is an assumed variable name) rather than hard-coded
client = MongoClient(os.environ["MONGODB_URI"])
db = client["amazon_reviews"]
collection = db["Electronics"]

# example: read first 10,000 docs
cursor = collection.find({}, {"_id": 0}).limit(10000)
data = list(cursor)

# convert to pandas DataFrame
df = pd.DataFrame(data)
print(df.head())

   rating                                  title_x  \
0     3.0        Smells like gasoline! Going back!   
1     1.0  Didn’t work at all lenses loose/broken.   
2     5.0                               Excellent!   
3     5.0                   Great laptop backpack!   
4     5.0                solid sound for the price   

                                                text  \
0  First & most offensive: they reek of gasoline ...   
1  These didn’t work. Idk if they were damaged in...   
2  I love these. They even come with a carry case...   
3  I was searching for a sturdy backpack for scho...   
4  Update 2-they sent a new warranty replacement....   

                                            images_x        asin  \
0  [{'attachment_type': 'IMAGE', 'large_image_url...  B083NRGZMM   
1                                                 []  B07N69T6TM   
2                                                 []  B01G8JO5F2   
3                                                 []  B001OC5J

In [9]:
print(df.info())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rating             10000 non-null  float64
 1   title_x            10000 non-null  object 
 2   text               10000 non-null  object 
 3   images_x           10000 non-null  object 
 4   asin               10000 non-null  object 
 5   parent_asin_x      10000 non-null  object 
 6   user_id            10000 non-null  object 
 7   timestamp          10000 non-null  int64  
 8   helpful_vote       10000 non-null  int64  
 9   verified_purchase  10000 non-null  bool   
 10  main_category      9794 non-null   object 
 11  title_y            10000 non-null  object 
 12  average_rating     10000 non-null  float64
 13  rating_number      10000 non-null  int64  
 14  features           10000 non-null  object 
 15  description        10000 non-null  object 
 16  price              1000

Import the Electronics review file and store it as the df_review DataFrame. Only records with non-missing parent_asin values are included. A potential future feature of this project is to query a targeted subset of 10,000 documents to improve recommendation quality after model training, for instance by filtering on users (user_id) with more than five reviews or on reviews with ratings of four or higher.

The source file is in JSON Lines format (.jsonl) and compressed with gzip. For testing, only 10,000 records are temporarily loaded due to file size; this limit will be removed prior to the final submission.
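The targeted subset described above could be selected before the DataFrame is built. The following is a minimal sketch, assuming the records are plain dictionaries like those returned by `list(cursor)`. The helper names `reviews_from_active_users` and `highly_rated_reviews`, the default thresholds, and the toy records are hypothetical and not part of the notebook.

```python
from collections import Counter


def reviews_from_active_users(records, min_reviews=6):
    """Keep reviews written by users with at least `min_reviews` reviews.

    The default of 6 corresponds to "more than five reviews".
    """
    counts = Counter(r["user_id"] for r in records)
    return [r for r in records if counts[r["user_id"]] >= min_reviews]


def highly_rated_reviews(records, min_rating=4.0):
    """Keep reviews whose rating is `min_rating` or higher."""
    return [r for r in records if r["rating"] >= min_rating]


# Toy records standing in for MongoDB documents (illustrative only)
sample = (
    [{"user_id": "u1", "rating": 5.0} for _ in range(6)]
    + [{"user_id": "u1", "rating": 2.0}, {"user_id": "u2", "rating": 5.0}]
)

active = reviews_from_active_users(sample)  # all 7 reviews by u1
liked = highly_rated_reviews(sample)        # the 7 five-star reviews
both = highly_rated_reviews(active)         # u1's 6 five-star reviews
print(len(active), len(liked), len(both))   # prints: 7 7 6
```

The two filters are kept separate so they can be applied alone or chained, since the text presents them as alternatives.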

In [10]:
# Import 'Electronics' review data
## NOTE: UPDATE PATH TO YOUR SOURCE DATA
#file = '/content/Electronics.jsonl.gz'
#file = '/content/drive/MyDrive/Graduate School/Natural Language Processing/Electronics.jsonl.gz'

'''data = []
with gzip.open(file, 'rt', encoding='utf-8') as fp:
    for i, line in enumerate(fp):
        if i >= 5000:
            break
        record = json.loads(line.strip())
        if record.get('parent_asin'):
            data.append(record)

# Convert to dataframe
df_review = pd.DataFrame(data)
df_review.head()'''

Import the Electronics metadata file and store it as the df_meta DataFrame. Only records with non-missing parent_asin values are pulled in.
The file is in JSON Lines format (.jsonl) and compressed with gzip. For now, only the parent_asin values that appear in the review DataFrame are loaded; this restriction will be removed prior to the final submission.

In [11]:
# Import 'Electronics' metadata
## NOTE: UPDATE PATH TO YOUR SOURCE DATA
#file = '/content/meta_Electronics.jsonl.gz'
#file = '/content/drive/MyDrive/Graduate School/Natural Language Processing/meta_Electronics.jsonl.gz'
file = "meta_Electronics.jsonl.gz"
data = []

# Create a set of valid ASINs from the review DataFrame (df)
valid_asins = set(df['asin'])

with gzip.open(file, 'rt', encoding='utf-8') as fp:
    for line in fp:
        record = json.loads(line.strip())
        # Keep only metadata whose parent_asin appears in the reviews
        parent_asin = record.get('parent_asin')
        if parent_asin and parent_asin in valid_asins:
            data.append(record)

# Convert to dataframe
df_meta = pd.DataFrame(data)
df_meta.head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author
0,Computers,Plugable USB 3.0 Sharing Switch for One-Button...,4.2,1325,[SIMPLE DEVICE SHARING - Compact design provid...,[],,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'How to Switch Between Two Computer...,Plugable,"[Electronics, Computers & Accessories, Compute...","{'Operation Mode': 'Toggle', 'Operating Voltag...",B00JX3Q28Y,,,
1,Amazon Home,"Aproca Hard Storage Travel Case, for AKASO EK7...",4.6,489,[Eco-friendly Material: Made of High-density E...,[],14.99,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'LTGEM EVA Hard Case for AKASO EK70...,Aproca,"[Electronics, Camera & Photo, Bags & Cases, Ca...",{'Package Dimensions': '9.1 x 5.8 x 3.6 inches...,B07ZZ595TG,,,
2,Camera & Photo,"Teleprompter,Desview T3S Promoting for Device ...",4.0,210,[👍【Latest Teleprompter from Desview Official】C...,[],129.95,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'The Teleprompter You've Been Waiti...,Desview,"[Electronics, Camera & Photo, Lighting & Studi...",{'Package Dimensions': '9.09 x 9.02 x 5.39 inc...,B0952X8NBY,,,
3,Camera & Photo,Fujifilm Instax Mini 8 Instant Film Camera (Po...,4.5,120,"[New slimmer and lighter body, Works with Fuji...",[The Fujifilm Instax Mini 8 Instant Film Camer...,,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'Don’t buy it light switch won’t ch...,Fujifilm,"[Electronics, Camera & Photo, Film Photography...","{'Product Dimensions': '3 x 5 x 5.5 inches', '...",B075S8H5LY,,,
4,Office Products,Allstate 3-Year Office Protection Plan ($75-99...,4.0,749,[],"[From the Manufacturer, Let's face it—warranti...",,[{'thumb': 'https://m.media-amazon.com/images/...,[],SquareTrade,"[Electronics, Electronics Warranties]","{'Manufacturer': 'SquareTrade', 'Brand': 'Squa...",B008I63HDA,,,


In [12]:
df_meta['main_category'].value_counts()

main_category
All Electronics                 2741
Computers                       2166
Camera & Photo                   894
Cell Phones & Accessories        872
Home Audio & Theater             695
Amazon Devices                   267
Industrial & Scientific          218
Tools & Home Improvement         150
Office Products                  140
Amazon Home                      114
Car Electronics                   71
AMAZON FASHION                    60
Sports & Outdoors                 58
Health & Personal Care            42
Automotive                        39
Apple Products                    38
Musical Instruments               37
GPS & Navigation                  20
Toys & Games                      18
Portable Audio & Accessories      12
Video Games                       11
Books                              9
All Beauty                         7
Arts, Crafts & Sewing              7
Baby                               6
Pet Supplies                       4
Software                

The largest categories are All Electronics (2,741 products) and Computers (2,166), so the data skews toward general electronics and computer hardware.  

Some open questions: 
- What is the overlap in product recommendations by product type? For instance, would I be recommended a camera if I bought a computer?  
- Would it recommend a computer camera instead of a digital photography camera? 
- How does the data distribution influence the model's behavior?

In [13]:
df_meta.count()

main_category      8708
title              8893
average_rating     8893
rating_number      8893
features           8893
description        8893
price              3251
images             8893
videos             8893
store              8871
categories         8893
details            8893
parent_asin        8893
bought_together       0
subtitle              8
author                8
dtype: int64

In [14]:
# Merge the review data with the metadata: review 'asin' matches metadata 'parent_asin'
comb_df = pd.merge(df,
                   df_meta,
                   how='inner', # Inner join to only keep records that are in both reviews and meta
                   left_on='asin', # Review-side key
                   right_on='parent_asin' # Metadata-side key
)

pd.set_option('display.max_columns', None)
display(comb_df.head())
print(len(comb_df))

Unnamed: 0,rating,title_x,text,images_x,asin,parent_asin_x,user_id,timestamp,helpful_vote,verified_purchase,main_category_x,title_y,average_rating_x,rating_number_x,features_x,description_x,price_x,images_y,videos_x,store_x,categories_x,details_x,parent_asin_y,bought_together_x,subtitle_x,author_x,main_category_y,title,average_rating_y,rating_number_y,features_y,description_y,price_y,images,videos_y,store_y,categories_y,details_y,parent_asin,bought_together_y,subtitle_y,author_y
0,3.0,Smells like gasoline! Going back!,First & most offensive: they reek of gasoline ...,"[{'attachment_type': 'IMAGE', 'large_image_url...",B083NRGZMM,B083NRGZMM,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1658185117948,0,True,Camera & Photo,"Binoculars, 12x42 Binoculars for Adults, Binoc...",4.3,134,[],[],,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['Really Good Binoculars with Great ...,Hikkogo,"[Electronics, Camera & Photo, Binoculars & Sco...","{""Package Dimensions"": ""7.28 x 6.77 x 3.07 inc...",B083NRGZMM,,,,Camera & Photo,"Binoculars, 12x42 Binoculars for Adults, Binoc...",4.3,134,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'Really Good Binoculars with Great ...,Hikkogo,"[Electronics, Camera & Photo, Binoculars & Sco...",{'Package Dimensions': '7.28 x 6.77 x 3.07 inc...,B083NRGZMM,,,
1,1.0,Didn’t work at all lenses loose/broken.,These didn’t work. Idk if they were damaged in...,[],B07N69T6TM,B07N69T6TM,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1592678549731,0,True,Camera & Photo,"Toys for 4-5 Year Old Boys, Mom&myaboys 8 X 21...",4.1,115,[✔SUPERIOR SAFETY -Soft Rubber Surrounded Eyep...,[],15.99,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [' Binocular for Kids', 'Perfect for...",mom&myaboys,"[Electronics, Camera & Photo, Binoculars & Sco...","{""Package Dimensions"": ""4.8 x 3.6 x 2.3 inches...",B07N69T6TM,,,,Camera & Photo,"Toys for 4-5 Year Old Boys, Mom&myaboys 8 X 21...",4.1,115,[✔SUPERIOR SAFETY -Soft Rubber Surrounded Eyep...,[],15.99,[{'thumb': 'https://m.media-amazon.com/images/...,"[{'title': ' Binocular for Kids', 'url': 'http...",mom&myaboys,"[Electronics, Camera & Photo, Binoculars & Sco...",{'Package Dimensions': '4.8 x 3.6 x 2.3 inches...,B07N69T6TM,,,
2,5.0,Excellent!,I love these. They even come with a carry case...,[],B01G8JO5F2,B01G8JO5F2,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1523093017534,0,True,All Electronics,"Senso Bluetooth Headphones, Best Wireless Spor...",4.1,42824,[True HD high Fidelity sound featuring latest ...,[],24.96,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Senso,"[Electronics, Headphones, Earbuds & Accessorie...","{""Product Dimensions"": ""4.9 x 4.7 x 1.3 inches...",B01G8JO5F2,,,,All Electronics,"Senso Bluetooth Headphones, Best Wireless Spor...",4.1,42824,[True HD high Fidelity sound featuring latest ...,[],24.96,[{'thumb': 'https://m.media-amazon.com/images/...,[],Senso,"[Electronics, Headphones, Earbuds & Accessorie...",{'Product Dimensions': '4.9 x 4.7 x 1.3 inches...,B01G8JO5F2,,,
3,5.0,Great laptop backpack!,I was searching for a sturdy backpack for scho...,[],B001OC5JKY,B001OC5JKY,AGGZ357AO26RQZVRLGU4D4N52DZQ,1290278495000,18,True,,"Targus Air Traveler Laptop Backpack, Professio...",4.2,265,[The Targus Zip-Thru Air Traveler Backpack is ...,"[Product Description, The Targus Checkpoint-Fr...",,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Durable And Spacious Backpack!', '...",Targus,"[Electronics, Computers & Accessories, Laptop ...","{""Product Dimensions"": ""17.8 x 14.8 x 3.8 inch...",B001OC5JKY,,,,,"Targus Air Traveler Laptop Backpack, Professio...",4.2,265,[The Targus Zip-Thru Air Traveler Backpack is ...,"[Product Description, The Targus Checkpoint-Fr...",,[{'thumb': 'https://m.media-amazon.com/images/...,"[{'title': 'Durable And Spacious Backpack!', '...",Targus,"[Electronics, Computers & Accessories, Laptop ...",{'Product Dimensions': '17.8 x 14.8 x 3.8 inch...,B001OC5JKY,,,
4,5.0,solid sound for the price,Update 2-they sent a new warranty replacement....,[],B07BHHB5RH,B07BHHB5RH,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,1565130879386,0,True,Cell Phones & Accessories,"Bluetooth Headphones, Soundcore Spirit Sports ...",4.1,2126,[IPX7 Sweat Guard Technology：Truly sweat proof...,[],,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['HEALTH HAZARD; Random LOUD Buzzing...,Anker,"[Electronics, Headphones, Earbuds & Accessorie...","{""Product Dimensions"": ""23.6 x 1.3 x 0.47 inch...",B07BHHB5RH,,,,Cell Phones & Accessories,"Bluetooth Headphones, Soundcore Spirit Sports ...",4.1,2126,[IPX7 Sweat Guard Technology：Truly sweat proof...,[],,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'HEALTH HAZARD; Random LOUD Buzzing...,Anker,"[Electronics, Headphones, Earbuds & Accessorie...",{'Product Dimensions': '23.6 x 1.3 x 0.47 inch...,B07BHHB5RH,,,


10000


The combined DataFrame has 10,000 rows, which is enough to begin preprocessing and work through model training.

## Part 2. Preprocessing  



In [15]:
print(comb_df.columns.tolist())

['rating', 'title_x', 'text', 'images_x', 'asin', 'parent_asin_x', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase', 'main_category_x', 'title_y', 'average_rating_x', 'rating_number_x', 'features_x', 'description_x', 'price_x', 'images_y', 'videos_x', 'store_x', 'categories_x', 'details_x', 'parent_asin_y', 'bought_together_x', 'subtitle_x', 'author_x', 'main_category_y', 'title', 'average_rating_y', 'rating_number_y', 'features_y', 'description_y', 'price_y', 'images', 'videos_y', 'store_y', 'categories_y', 'details_y', 'parent_asin', 'bought_together_y', 'subtitle_y', 'author_y']


In [16]:
# Only keep columns required for recommender system
keep_columns = ['rating','text','user_id','title_y','categories_y','parent_asin_y','description_x']

df_preprocess = comb_df[keep_columns]

df_preprocess.head()

Unnamed: 0,rating,text,user_id,title_y,categories_y,parent_asin_y,description_x
0,3.0,First & most offensive: they reek of gasoline ...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,"Binoculars, 12x42 Binoculars for Adults, Binoc...","[Electronics, Camera & Photo, Binoculars & Sco...",B083NRGZMM,[]
1,1.0,These didn’t work. Idk if they were damaged in...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,"Toys for 4-5 Year Old Boys, Mom&myaboys 8 X 21...","[Electronics, Camera & Photo, Binoculars & Sco...",B07N69T6TM,[]
2,5.0,I love these. They even come with a carry case...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,"Senso Bluetooth Headphones, Best Wireless Spor...","[Electronics, Headphones, Earbuds & Accessorie...",B01G8JO5F2,[]
3,5.0,I was searching for a sturdy backpack for scho...,AGGZ357AO26RQZVRLGU4D4N52DZQ,"Targus Air Traveler Laptop Backpack, Professio...","[Electronics, Computers & Accessories, Laptop ...",B001OC5JKY,"[Product Description, The Targus Checkpoint-Fr..."
4,5.0,Update 2-they sent a new warranty replacement....,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,"Bluetooth Headphones, Soundcore Spirit Sports ...","[Electronics, Headphones, Earbuds & Accessorie...",B07BHHB5RH,[]


##### Text Cleaning

- Removing English stop words and HTML or special character artifacts (e.g., "nbsp").

- Tokenizing the text into manageable word units for later feature extraction and embedding.

In [None]:
# Clean text. This was modified from Week 8 Recommender Systems Demo
# Load BERT tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def clean_bert(text):
    text = text.replace("'", "")  # Remove apostrophes
    text = text.replace("nbsp", "")  # Remove leftover HTML non-breaking-space entities

    tokens = tokenizer.tokenize(text)  # Tokenize into lowercase WordPiece tokens

    # Drop stop words, punctuation, and tokens of two characters or fewer
    filtered_tokens = [
        token for token in tokens
        if token not in stop_words
        and token not in string.punctuation
        and len(token) > 2
    ]

    return " ".join(filtered_tokens)

In [18]:
# Combine text columns into a single column for data preprocessing
df_preprocess = df_preprocess.copy()

df_preprocess.loc[:, "prod_desc"] = (
    df_preprocess["title_y"].astype(str)
    + " "
    + df_preprocess["categories_y"].astype(str)
    + " "
    + df_preprocess["description_x"].astype(str)
)

df_preprocess["categories_y"] = df_preprocess["categories_y"].astype(str)
df_preprocess.head()

Unnamed: 0,rating,text,user_id,title_y,categories_y,parent_asin_y,description_x,prod_desc
0,3.0,First & most offensive: they reek of gasoline ...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,"Binoculars, 12x42 Binoculars for Adults, Binoc...","['Electronics', 'Camera & Photo', 'Binoculars ...",B083NRGZMM,[],"Binoculars, 12x42 Binoculars for Adults, Binoc..."
1,1.0,These didn’t work. Idk if they were damaged in...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,"Toys for 4-5 Year Old Boys, Mom&myaboys 8 X 21...","['Electronics', 'Camera & Photo', 'Binoculars ...",B07N69T6TM,[],"Toys for 4-5 Year Old Boys, Mom&myaboys 8 X 21..."
2,5.0,I love these. They even come with a carry case...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,"Senso Bluetooth Headphones, Best Wireless Spor...","['Electronics', 'Headphones, Earbuds & Accesso...",B01G8JO5F2,[],"Senso Bluetooth Headphones, Best Wireless Spor..."
3,5.0,I was searching for a sturdy backpack for scho...,AGGZ357AO26RQZVRLGU4D4N52DZQ,"Targus Air Traveler Laptop Backpack, Professio...","['Electronics', 'Computers & Accessories', 'La...",B001OC5JKY,"[Product Description, The Targus Checkpoint-Fr...","Targus Air Traveler Laptop Backpack, Professio..."
4,5.0,Update 2-they sent a new warranty replacement....,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,"Bluetooth Headphones, Soundcore Spirit Sports ...","['Electronics', 'Headphones, Earbuds & Accesso...",B07BHHB5RH,[],"Bluetooth Headphones, Soundcore Spirit Sports ..."


In [19]:
# Apply clean_bert function to relevant text columns
df_preprocess['prod_desc'] = df_preprocess['prod_desc'].apply(clean_bert)
df_preprocess['text'] = df_preprocess['text'].apply(clean_bert)
df_preprocess['title_y'] = df_preprocess['title_y'].apply(clean_bert)
df_preprocess['categories_y'] = df_preprocess['categories_y'].apply(clean_bert)

df_preprocess.head()

Unnamed: 0,rating,text,user_id,title_y,categories_y,parent_asin_y,description_x,prod_desc
0,3.0,first offensive ##ek gasoline sensitive allerg...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,binoculars ##x ##42 binoculars adults binocula...,electronics camera photo binoculars scope ##s ...,B083NRGZMM,[],binoculars ##x ##42 binoculars adults binocula...
1,1.0,work ##k damaged shipping lenses loose somethi...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,toys year old boys mom ##ab ##oys kids binocul...,electronics camera photo binoculars scope ##s ...,B07N69T6TM,[],toys year old boys mom ##ab ##oys kids binocul...
2,5.0,love even come carry case several sizes ear bu...,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,sen ##so blue ##tooth head ##phones best wirel...,electronics head ##phones ear ##bu ##ds access...,B01G8JO5F2,[],sen ##so blue ##tooth head ##phones best wirel...
3,5.0,searching sturdy backpack school would allow c...,AGGZ357AO26RQZVRLGU4D4N52DZQ,tar ##gus air traveler laptop backpack profess...,electronics computers accessories laptop acces...,B001OC5JKY,"[Product Description, The Targus Checkpoint-Fr...",tar ##gus air traveler laptop backpack profess...
4,5.0,update sent new warrant ##y replacement good c...,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,blue ##tooth head ##phones sound ##core spirit...,electronics head ##phones ear ##bu ##ds access...,B07BHHB5RH,[],blue ##tooth head ##phones sound ##core spirit...
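
The cleaned columns above contain WordPiece continuation markers (`##`), because `tokenizer.tokenize` splits rare words into sub-word pieces. For display purposes the pieces could be joined back together. The following is a minimal sketch; `merge_wordpieces` is a hypothetical helper, not part of the notebook. It cannot restore tokens that `clean_bert` already dropped, so the result only approximates the original wording.

```python
def merge_wordpieces(text):
    """Re-attach WordPiece continuation tokens ('##xyz') to the preceding token."""
    words = []
    for token in text.split():
        if token.startswith("##"):
            piece = token[2:]
            if words:
                words[-1] += piece   # glue onto the previous word
            else:
                words.append(piece)  # leading fragment: keep it without the marker
        else:
            words.append(token)
    return " ".join(words)


print(merge_wordpieces("blue ##tooth head ##phones sound ##core"))
# prints: bluetooth headphones soundcore
```

Applied to the `title_y` column, a helper like this would make printed recommendations easier to read without changing the text that is passed to the embedding model.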


## Feature Extraction  
In this stage, we use BERT embeddings to transform product descriptions and review text into numerical vector representations. These embeddings capture the semantic meaning of the text, allowing the recommender system to compare products based on contextual similarity rather than simple keyword matching.
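
Comparing products by contextual similarity comes down to comparing their embedding vectors. The notebook does not state which similarity measure it intends; cosine similarity is a common choice and is assumed in this sketch. The three-dimensional vectors are toy stand-ins for real embeddings (the `all-MiniLM-L6-v2` model produces 384-dimensional vectors), and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for product embeddings (illustrative values only)
headphones_a = [0.9, 0.1, 0.0]
headphones_b = [0.8, 0.2, 0.1]
backpack = [0.0, 0.2, 0.9]

sim_similar = cosine_similarity(headphones_a, headphones_b)
sim_different = cosine_similarity(headphones_a, backpack)
print(sim_similar > sim_different)  # prints: True
```

Values close to 1 indicate vectors pointing in nearly the same direction, so the two headphone vectors score higher with each other than with the backpack vector.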

In [20]:
import os
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"


In [21]:
#!pip install tf-keras

In [22]:
#!pip install --user tf-keras

In [23]:
from sentence_transformers import SentenceTransformer

bert_model = SentenceTransformer('all-MiniLM-L6-v2')
item_embeddings = bert_model.encode(
    df_preprocess['prod_desc'].tolist(),
    show_progress_bar=True
)

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

In [24]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# Encode the user_id and item_id
df_preprocess['user_idx'] = df_preprocess['user_id'].astype('category').cat.codes

# Store the categorical series of prod_desc before getting codes
item_categorical = df_preprocess['prod_desc'].astype('category')
df_preprocess['item_idx'] = item_categorical.cat.codes

# Count the unique users and items
num_users = df_preprocess['user_idx'].nunique()
num_items = df_preprocess['item_idx'].nunique()

# Prepare training data
X = df_preprocess[['user_idx', 'item_idx']].values
y = df_preprocess['rating'].values

# Generate train and test sets and split the data at 0.2 for test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pytorch class representing a dataset. This is PyTorch documentation found in reference [6]. See 'class torch.utils.data.Dataset' section
class CF_Dataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, index):
        user, item = self.data[index]
        return torch.tensor(user, dtype=torch.long), torch.tensor(item, dtype=torch.long), torch.tensor(self.labels[index], dtype=torch.float32)

    def __len__(self):
        return len(self.data)

train_dataset = CF_Dataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

# Collaborative Filtering Model
## Note: Collaborative filtering code template was pulled from reference [5] and modified for our needs
class CF_Model(nn.Module):
    def __init__(self, num_users, num_items, latent_dim=32):
        super(CF_Model, self).__init__()
        self.user_embedding = nn.Embedding(num_users, latent_dim)
        self.item_embedding = nn.Embedding(num_items, latent_dim)
        self.fc = nn.Sequential(
            nn.Linear(latent_dim * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, user, item):
        user_vec = self.user_embedding(user)
        item_vec = self.item_embedding(item)
        x = torch.cat([user_vec, item_vec], dim=-1)
        return self.fc(x).squeeze()

# Initialize model
model = CF_Model(num_users, num_items)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop - 50 epochs for now; the printed loss is the sum over all batches in the epoch
for epoch in range(50):
    model.train()
    total_loss = 0
    for user, item, rating in train_loader:
        optimizer.zero_grad()
        output = model(user, item)
        loss = criterion(output, rating)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

Epoch 1, Loss: 635.8890
Epoch 2, Loss: 425.9041
Epoch 3, Loss: 251.3650
Epoch 4, Loss: 133.1683
Epoch 5, Loss: 83.0218
Epoch 6, Loss: 64.0955
Epoch 7, Loss: 54.6042
Epoch 8, Loss: 50.0155
Epoch 9, Loss: 47.2680
Epoch 10, Loss: 44.8997
Epoch 11, Loss: 43.2620
Epoch 12, Loss: 41.6674
Epoch 13, Loss: 39.9885
Epoch 14, Loss: 39.1764
Epoch 15, Loss: 37.5106
Epoch 16, Loss: 36.3945
Epoch 17, Loss: 35.0589
Epoch 18, Loss: 34.0244
Epoch 19, Loss: 33.3378
Epoch 20, Loss: 32.0483
Epoch 21, Loss: 31.0779
Epoch 22, Loss: 30.3072
Epoch 23, Loss: 29.2592
Epoch 24, Loss: 28.2736
Epoch 25, Loss: 27.4145
Epoch 26, Loss: 26.3189
Epoch 27, Loss: 25.7282
Epoch 28, Loss: 25.1463
Epoch 29, Loss: 24.1017
Epoch 30, Loss: 23.5999
Epoch 31, Loss: 22.4263
Epoch 32, Loss: 21.7253
Epoch 33, Loss: 21.0327
Epoch 34, Loss: 20.2164
Epoch 35, Loss: 19.4260
Epoch 36, Loss: 18.6727
Epoch 37, Loss: 18.1713
Epoch 38, Loss: 17.4508
Epoch 39, Loss: 16.6234
Epoch 40, Loss: 15.8059
Epoch 41, Loss: 15.2099
Epoch 42, Loss: 14.43

In [25]:
#Measure how well the model performs
from sklearn.metrics import mean_squared_error
import numpy as np

# Get predictions on the test set
model.eval() # Set the model to evaluation mode
with torch.no_grad(): # Disable gradient calculation
    user_test = torch.tensor(X_test[:, 0], dtype=torch.long)
    item_test = torch.tensor(X_test[:, 1], dtype=torch.long)
    model_predictions = model(user_test, item_test).numpy()

# Calculate RMSE manually
rmse = np.sqrt(mean_squared_error(y_test, model_predictions))
print(f"RMSE on test set: {rmse:.4f}")

RMSE on test set: 1.4071


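This is our preliminary result. The training loss falls steadily, yet the test RMSE is 1.4071 on a 1 to 5 star rating scale, which may indicate overfitting. We plan further experimentation and model tuning to improve these metrics.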
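
One way to put the RMSE in context is to compare it with a trivial baseline that always predicts the mean training rating. The following is a minimal sketch with toy ratings; `rmse` and `mean_baseline_rmse` are hypothetical helpers, and no baseline value for the actual dataset is claimed here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(train_ratings, test_ratings):
    """RMSE obtained by predicting the mean training rating for every test row."""
    mean_rating = sum(train_ratings) / len(train_ratings)
    return rmse(test_ratings, [mean_rating] * len(test_ratings))


# Toy ratings (illustrative only, not taken from the dataset)
toy_train = [5.0, 5.0, 4.0, 1.0, 5.0, 3.0]
toy_test = [5.0, 2.0, 4.0]
baseline = mean_baseline_rmse(toy_train, toy_test)
print(f"Baseline RMSE: {baseline:.4f}")
```

In the notebook, calling `mean_baseline_rmse(y_train, y_test)` should give the number the collaborative filtering model needs to beat.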

## Main Functionality  
This section implements the key recommendation logic of the system. The trained collaborative filtering model uses its learned user and item embeddings to generate personalized recommendations.

The system maps users and items to unique indices, builds each user's history of rated products, and excludes those products from the candidates. The recommend_top_n_cf() function runs the PyTorch model in evaluation mode, predicts a rating for every product the user has not yet rated, and prints the titles of the N highest-scoring products.
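
The selection step can be illustrated independently of the model. The following is a minimal sketch in which `score_fn` stands in for a call to the trained model; `top_n_unrated` and the toy scores are hypothetical names and values, not part of the notebook.

```python
import heapq


def top_n_unrated(score_fn, num_items, rated_items, n=10):
    """Return the n highest-scoring (item_idx, score) pairs among unrated items."""
    candidates = (i for i in range(num_items) if i not in rated_items)
    scored = ((i, score_fn(i)) for i in candidates)
    # nlargest keeps only n pairs in memory and returns them in descending order
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy predicted ratings for five items (illustrative only)
toy_scores = {0: 4.1, 1: 2.7, 2: 4.8, 3: 3.9, 4: 4.5}

# Item 2 was already rated by the user, so it is skipped
top = top_n_unrated(toy_scores.get, num_items=5, rated_items={2}, n=2)
print(top)  # prints: [(4, 4.5), (0, 4.1)]
```

Using `heapq.nlargest` avoids sorting the full candidate list, which matters little at 10,000 rows but could help once the row limit is removed.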

In [26]:
# Map user_id to user_idx
user_id_to_idx = dict(zip(df_preprocess['user_id'], df_preprocess['user_idx']))

# Map item_idx to title_y for display
item_titles = df_preprocess.drop_duplicates('item_idx')[['item_idx', 'title_y']]
idx_to_title = dict(zip(item_titles['item_idx'], item_titles['title_y']))

# Build user-item history to avoid recommending already rated items
user_rated_items = df_preprocess.groupby('user_idx')['item_idx'].apply(set).to_dict()

In [27]:
# Recommend the top N items for a specified user
def recommend_top_n_cf(user_id, model, N=10):
    model.eval()
    user_idx = user_id_to_idx.get(user_id)
    if user_idx is None:  # Unknown user_id: report it and stop
        print(f"User ID {user_id} not found.")
        return

    rated_items = user_rated_items.get(user_idx, set())
    scores = []

    for item_idx in range(num_items):
        if item_idx in rated_items:
            continue  # Skip items already rated

        with torch.no_grad():
            score = model(torch.tensor(user_idx), torch.tensor(item_idx)).item()
        scores.append((item_idx, score))

    top_items = sorted(scores, key=lambda x: x[1], reverse=True)[:N]

    # Print the top N items, with a blank line after each title for readability
    print(f"\nTop {N} recommended products for user id: {user_id}:\n")
    for item_idx, score in top_items:
        print(f"{idx_to_title.get(item_idx, 'Unknown Item')}\n")  # Line break after each title
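

The function above calls the model once per item, which can be slow for a large catalog. One possible alternative, sketched here under assumptions, is to obtain all scores for a user in a single batched call and then rank them with NumPy. The helper name `top_n_unrated` and the toy scores are hypothetical; the commented batched call assumes the model accepts 1-D index tensors, as it does in the RMSE cell.

```python
import numpy as np

def top_n_unrated(scores, rated_items, n=10):
    """Return (item_idx, score) pairs for the n highest-scoring unrated items.

    scores      : 1-D array where scores[i] is the predicted rating for item i
    rated_items : iterable of item indices the user has already rated
    """
    scores = np.asarray(scores, dtype=float).copy()
    rated = np.fromiter(rated_items, dtype=int)
    scores[rated] = -np.inf                     # exclude already rated items
    n = min(n, int(np.isfinite(scores).sum()))  # never return excluded items
    top = np.argsort(-scores)[:n]               # indices by descending score
    return [(int(i), float(scores[i])) for i in top]

# Assumed way to build the score vector in the notebook (not run here):
#   with torch.no_grad():
#       users = torch.full((num_items,), user_idx, dtype=torch.long)
#       items = torch.arange(num_items)
#       scores = model(users, items).numpy()

# Hypothetical toy scores, for illustration only
toy_scores = [3.1, 4.7, 2.0, 4.9, 4.2]
print(top_n_unrated(toy_scores, rated_items={3}, n=2))
```

Masking rated items with negative infinity keeps the ranking step a single sort, and the result can be passed to the same title lookup used for printing.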

In [28]:
# Generate the top 5 recommendations for a sample user
recommend_top_n_cf('AFKZENTNBQ7A7V7UXW5JJI6UGRYQ', model, N=5)


Top 5 recommended products for user id: AFKZENTNBQ7A7V7UXW5JJI6UGRYQ:

videos ##ec ##u studio speaker wall ceiling mount bracket holder ##5 white pack

net ##ge ##ar night ##hawk ##4 ##s smart ##fi route ##r ##7 ##80 ##0 ##26 ##00 wireless speed 260 ##0 ##ps 2500 coverage devices ##g ethernet usb esa ##ta ports

flex ##imo ##unt ##s ##6 quad lcd arm monitor stand desk mounts samsung dell ##us ace ##r ##c lcd computer monitor quad lcd mount

##m ##x plus ipad mini 6th gen 2021 inch rugged case apple pencil holder magnetic closure stand sleep wake cover mil spec drop tested midnight blue ##m 222 341 ##g ##x

armor ##suit military ##shi ##eld apple ipad ##tina display ipad ipad screen protector shield lifetime replacements



In [29]:
# Enter a user_id to list the items that user has already rated
user_id = 'AFKZENTNBQ7A7V7UXW5JJI6UGRYQ'

# Filter rows for this user
user_purchases = df_preprocess[df_preprocess['user_id'] == user_id]

# Print each purchased item's title_y with a line break
print(f"\nItems purchased by user {user_id}:\n")
for title in user_purchases['title_y']:
    print(f"{title}\n")


Items purchased by user AFKZENTNBQ7A7V7UXW5JJI6UGRYQ:

binoculars ##x ##42 binoculars adults binoculars hunting compact binoculars trip ##od smartphone adapt ##er hunting bird watching hiking traveling sports

toys year old boys mom ##ab ##oys kids binoculars children compact telescope boys gifts years old bird watching scenery yellow

sen ##so blue ##tooth head ##phones best wireless sports ear ##bu ##ds mic ##x ##7 water ##proof stereo sweat ##proof ear ##phones gym running workout noise cancel ##ling ear ##phones ear ##bu ##ds noise cancel ##ling heads ##ets



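As with the metrics, this output is preliminary. The recommended products are not yet clearly related to the user's history, and the titles are displayed as preprocessed WordPiece tokens rather than readable product names. We aim to refine the model and the display step to produce more coherent output.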
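
One small step toward more readable output is to merge the WordPiece continuation tokens (those prefixed with `##`) back onto the preceding token before printing. The sketch below is a minimal illustration; `merge_wordpieces` is a hypothetical helper, not part of the notebook. Because preprocessing has already dropped some tokens, the merged text is only an approximation of the original title, and displaying the raw title column would be more faithful if one is available.

```python
def merge_wordpieces(title):
    """Join WordPiece continuation tokens ('##xyz') onto the preceding token."""
    words = []
    for token in title.split():
        if token.startswith("##") and words:
            words[-1] += token[2:]      # glue the piece onto the previous word
        elif token.startswith("##"):
            words.append(token[2:])     # leading piece with nothing to attach to
        else:
            words.append(token)
    return " ".join(words)

# Example using a fragment of a title from the output above
print(merge_wordpieces("net ##ge ##ar night ##hawk ##4 ##s smart ##fi route ##r"))
# prints: netgear nighthawk4s smartfi router
```

The result still shows gaps left by preprocessing (for example "smartfi"), which is why this is best treated as a display aid rather than a full reconstruction.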

## Personal Contribution Statement  
Summary of tasks and team members' contributions  
Proofreading

## References

McAuley, J. (2023). Amazon Reviews 2023 Dataset. Hugging Face.
Retrieved from https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

Codegenes. (n.d.). Collaborative Filtering Using PyTorch.
Retrieved from https://www.codegenes.net/blog/collaborative-filtering-pytorch/

GeeksforGeeks. (n.d.). Build a Recommendation Engine with Collaborative Filtering.
Retrieved from https://www.geeksforgeeks.org/machine-learning/build-a-recommendation-engine-with-collaborative-filtering/

PyTorch. (n.d.). Introducing TorchRec.
Retrieved from https://pytorch.org/blog/introducing-torchrec/

Sling Academy. (n.d.). Combining Content-Based and Collaborative Approaches in PyTorch Recommenders.
Retrieved from https://www.slingacademy.com/article/combining-content-based-and-collaborative-approaches-in-pytorch-recommenders/

PyTorch Documentation. (n.d.). Data API Reference.
Retrieved from https://docs.pytorch.org/docs/stable/data.html
