## Final Project Part 2: Product Recommender System Using NLP   

Laine Close  
Marcos Fernandez  
Owen Randolph

Recommender systems are a common feature in many digital platforms where user choice plays a central role. They provide significant benefits for both users and creators. For instance, Amazon's recommendation engine has greatly increased user engagement and overall spending on its e-commerce site. For customers, these systems personalize the shopping experience by suggesting products that best match their preferences, whether inferred from individual browsing history or derived from broader popularity trends. Recommender systems have fundamentally transformed how users discover and interact with products online.

This notebook demonstrates the preliminary development and testing of a product recommender system using machine learning. It focuses on establishing the preprocessing, feature extraction, and model functionality needed to generate personalized product recommendations.

## Part 0: Setup Python Environment

The pymongo library is used as the Python driver for MongoDB, enabling interaction with the database directly from Python. It allows queries, inserts, updates, and data retrieval through an API similar to working with dictionaries and lists in Python.

In [None]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-4.15.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.8.0-py3-none-any.whl.metadata (5.7 kB)
Downloading pymongo-4.15.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)
Downloading dnspython-2.8.0-py3-none-any.whl (331 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.8.0 pymongo-4.15.5


In [None]:
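import os
from pymongo import MongoClient

# Read the connection string from the environment instead of hard-coding credentials.
# MONGODB_URI is an assumed variable name; set it to your Atlas connection string.
uri = os.environ["MONGODB_URI"]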

# Create the client
client = MongoClient(uri)

# List all databases in the cluster
print("Databases available:")
print(client.list_database_names())

Databases available:
['admin', 'amazon_reviews', 'config', 'local']


In [None]:
# If using Google Colab, save both datasets to Google Drive: they are too large to upload directly to the local 'content' directory
# from google.colab import drive
# drive.mount('/content/drive')

##### NLTK: NLP Library

The NLTK library is a toolkit for preprocessing text data. It lets us:

- Tokenize text

- Remove stop words

- Tag parts of speech

- Lemmatize words

- Access lexical database

- Support multilingual data

- Enable sentence splitting

In [None]:
import gzip
import json
import re
import string

import nltk
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


##### BERT
Before using advanced neural network models like BERT, the required transformer libraries are installed to enable text embedding and model loading.

In [None]:
# Used for text embedding with BERT
!pip install sentence-transformers
!pip install transformers



In [None]:
# Used for collaborative filtering
!pip install scikit-surprise



## Part 1. Import Electronics Data

We connect to our MongoDB cluster to load the data for the recommender system. MongoDB is a good fit because the source records are JSON documents: several fields are long text strings or nested lists, which a document (NoSQL) database stores without forcing them into a fixed relational schema. We used MongoDB Atlas to set up a three-node cluster and connect to the active cluster from the notebook. A cursor object is used to query data within the notebook environment, in our case just to check the data.

In [None]:
import os

from pymongo import MongoClient
import pandas as pd

# connect to MongoDB (MONGODB_URI is an assumed environment variable holding the Atlas connection string)
client = MongoClient(os.environ["MONGODB_URI"])
db = client["amazon_reviews"]
collection = db["Electronics"]

# example: read first 10,000 docs
cursor = collection.find({}, {"_id": 0}).limit(10000)
data = list(cursor)

# convert to pandas DataFrame
df = pd.DataFrame(data)
print(df.head())

   rating                                            title_x  \
0     5.0                             Great laptop backpack!   
1     1.0  Not sure why this has good reviews. My devices...   
2     5.0                                         Great Case   
3     5.0                                      Great Product   
4     4.0                                   Worthy Purchase!   

                                                text images_x        asin  \
0  I was searching for a sturdy backpack for scho...       []  B001OC5JKY   
1  This devices brought my wifi network to its kn...       []  B00L0YLRUW   
2  Great case. Fits both the iPad Air & Air 2. Co...       []  B00PQBDQPO   
3  AWESOME sound!  Easy to install. The wireless ...       []  B00CRF9V2O   
4  Arrived sooner than the original really long e...       []  B00852JEPW   

  parent_asin_x                       user_id      timestamp  helpful_vote  \
0    B001OC5JKY  AGGZ357AO26RQZVRLGU4D4N52DZQ  1290278495000            18

In [None]:
print(df.info())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rating             10000 non-null  float64
 1   title_x            10000 non-null  object 
 2   text               10000 non-null  object 
 3   images_x           10000 non-null  object 
 4   asin               10000 non-null  object 
 5   parent_asin_x      10000 non-null  object 
 6   user_id            10000 non-null  object 
 7   timestamp          10000 non-null  int64  
 8   helpful_vote       10000 non-null  int64  
 9   verified_purchase  10000 non-null  bool   
 10  main_category      9795 non-null   object 
 11  title_y            10000 non-null  object 
 12  average_rating     10000 non-null  float64
 13  rating_number      10000 non-null  int64  
 14  features           10000 non-null  object 
 15  description        10000 non-null  object 
 16  price              1000

Import the Electronics review file and store it as the df_review DataFrame. Only records with non-missing parent_asin values are included. A potential future feature of this project could involve querying a targeted subset of 10,000 documents to improve recommendation quality after model training, for instance by filtering to users (user_id) with more than five reviews or to reviews with ratings of four or higher.

The source file is in JSON Lines format (.jsonl) and gzip-compressed (.gz). For testing, only 10,000 records are temporarily loaded due to file size; this limit will be removed prior to the final submission.
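The loading pattern described above can be sketched on a tiny stand-in file. This is a minimal illustration, not the notebook's loader: the helper name `load_jsonl_gz`, its `limit` and `required_key` parameters, and the sample records are all assumptions made for the example.

```python
import gzip
import json
import os
import tempfile


def load_jsonl_gz(path, limit=None, required_key="parent_asin"):
    """Read up to `limit` lines from a gzip-compressed JSON Lines file.

    Records whose `required_key` is missing or empty are skipped.
    (Hypothetical helper written for this sketch.)
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        for i, line in enumerate(fp):
            if limit is not None and i >= limit:
                break
            record = json.loads(line)
            if record.get(required_key):
                records.append(record)
    return records


# Build a tiny stand-in file so the sketch runs without the real dataset.
sample = [
    {"parent_asin": "A1", "rating": 5.0},
    {"parent_asin": None, "rating": 3.0},  # dropped: missing parent_asin
    {"parent_asin": "A2", "rating": 4.0},
    {"parent_asin": "A3", "rating": 1.0},  # never read when limit=3
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fp:
    for row in sample:
        fp.write(json.dumps(row) + "\n")

loaded = load_jsonl_gz(tmp_path, limit=3)
print([r["parent_asin"] for r in loaded])
```

Note that the limit counts lines read, not records kept, so a limited load can return fewer rows than the limit when some records lack a parent_asin.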

In [None]:
# Import 'Electronics' review data
## NOTE: UPDATE PATH TO YOUR SOURCE DATA
#file = '/content/Electronics.jsonl.gz'
#file = '/content/drive/MyDrive/Graduate School/Natural Language Processing/Electronics.jsonl.gz'

# Disabled: the review data is read from MongoDB above, so this loader is kept for reference only.
# data = []
# with gzip.open(file, 'rt', encoding='utf-8') as fp:
#     for i, line in enumerate(fp):
#         if i >= 5000:
#             break
#         record = json.loads(line.strip())
#         if record.get('parent_asin'):
#             data.append(record)
#
# # Convert to dataframe
# df_review = pd.DataFrame(data)
# df_review.head()

Import the Electronics metadata file and store it as the df_meta DataFrame. Only records with non-missing parent_asin values are pulled in.
The file is in JSON Lines format and gzip-compressed. For now, only the parent_asin values that appear in the review DataFrame are loaded; this restriction will be removed prior to the final submission.

In [None]:
# Import 'Electronics' metadata
## NOTE: UPDATE PATH TO YOUR SOURCE DATA
#file = '/content/meta_Electronics.jsonl.gz'
#file = '/content/drive/MyDrive/Graduate School/Natural Language Processing/meta_Electronics.jsonl.gz'
file = "meta_Electronics.jsonl.gz"
data = []

# Create a set of valid ASINs from the review DataFrame (df)
valid_asins = set(df['asin'])

with gzip.open(file, 'rt', encoding='utf-8') as fp:
    for line in fp:
        record = json.loads(line.strip())
        parent_asin = record.get('parent_asin')
        # Keep only products with a non-missing parent_asin that appears in the reviews
        if parent_asin and parent_asin in valid_asins:
            data.append(record)

# Convert to dataframe
df_meta = pd.DataFrame(data)
df_meta.head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author
0,Computers,Plugable USB 3.0 Sharing Switch for One-Button...,4.2,1325,[SIMPLE DEVICE SHARING - Compact design provid...,[],,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'How to Switch Between Two Computer...,Plugable,"[Electronics, Computers & Accessories, Compute...","{'Operation Mode': 'Toggle', 'Operating Voltag...",B00JX3Q28Y,,,
1,Amazon Home,"Aproca Hard Storage Travel Case, for AKASO EK7...",4.6,489,[Eco-friendly Material: Made of High-density E...,[],14.99,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'LTGEM EVA Hard Case for AKASO EK70...,Aproca,"[Electronics, Camera & Photo, Bags & Cases, Ca...",{'Package Dimensions': '9.1 x 5.8 x 3.6 inches...,B07ZZ595TG,,,
2,Camera & Photo,"Teleprompter,Desview T3S Promoting for Device ...",4.0,210,[👍【Latest Teleprompter from Desview Official】C...,[],129.95,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'The Teleprompter You've Been Waiti...,Desview,"[Electronics, Camera & Photo, Lighting & Studi...",{'Package Dimensions': '9.09 x 9.02 x 5.39 inc...,B0952X8NBY,,,
3,Camera & Photo,Fujifilm Instax Mini 8 Instant Film Camera (Po...,4.5,120,"[New slimmer and lighter body, Works with Fuji...",[The Fujifilm Instax Mini 8 Instant Film Camer...,,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'Don't buy it light switch won't ch...,Fujifilm,"[Electronics, Camera & Photo, Film Photography...","{'Product Dimensions': '3 x 5 x 5.5 inches', '...",B075S8H5LY,,,
4,Office Products,Allstate 3-Year Office Protection Plan ($75-99...,4.0,749,[],"[From the Manufacturer, Let's face it - warranti...",,[{'thumb': 'https://m.media-amazon.com/images/...,[],SquareTrade,"[Electronics, Electronics Warranties]","{'Manufacturer': 'SquareTrade', 'Brand': 'Squa...",B008I63HDA,,,


In [None]:
df_meta['main_category'].value_counts()

Unnamed: 0_level_0,count
main_category,Unnamed: 1_level_1
All Electronics,2740
Computers,2169
Camera & Photo,896
Cell Phones & Accessories,871
Home Audio & Theater,694
Amazon Devices,267
Industrial & Scientific,218
Tools & Home Improvement,150
Office Products,141
Amazon Home,114


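The largest category is All Electronics (2,740 products), followed by Computers (2,169). Together they account for more than half of the products in the metadata.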

This distribution raises some questions:
- What is the overlap in product recommendations by product type? For instance, would I be recommended a camera if I bought a computer?  
- Would it recommend a computer camera instead of a digital photography camera?
- How does the data distribution influence the model's behavior?

In [None]:
df_meta.count()

Unnamed: 0,0
main_category,8712
title,8896
average_rating,8896
rating_number,8896
features,8896
description,8896
price,3254
images,8896
videos,8896
store,8874


In [None]:
# Merge the review data (key: asin) with the metadata (key: parent_asin)
comb_df = pd.merge(df,
                   df_meta,
                   how='inner', # Inner join to only keep records that are in both reviews and meta
                   left_on='asin', # Review-side key
                   right_on='parent_asin' # Metadata-side key
)

pd.set_option('display.max_columns', None)
display(comb_df.head())
print(len(comb_df))

Unnamed: 0,rating,title_x,text,images_x,asin,parent_asin_x,user_id,timestamp,helpful_vote,verified_purchase,main_category_x,title_y,average_rating_x,rating_number_x,features_x,description_x,price_x,images_y,videos_x,store_x,categories_x,details_x,parent_asin_y,bought_together_x,subtitle_x,author_x,main_category_y,title,average_rating_y,rating_number_y,features_y,description_y,price_y,images,videos_y,store_y,categories_y,details_y,parent_asin,bought_together_y,subtitle_y,author_y
0,5.0,Great laptop backpack!,I was searching for a sturdy backpack for scho...,[],B001OC5JKY,B001OC5JKY,AGGZ357AO26RQZVRLGU4D4N52DZQ,1290278495000,18,True,,"Targus Air Traveler Laptop Backpack, Professio...",4.2,265,[The Targus Zip-Thru Air Traveler Backpack is ...,"[Product Description, The Targus Checkpoint-Fr...",,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Durable And Spacious Backpack!', '...",Targus,"[Electronics, Computers & Accessories, Laptop ...","{""Product Dimensions"": ""17.8 x 14.8 x 3.8 inch...",B001OC5JKY,,,,,"Targus Air Traveler Laptop Backpack, Professio...",4.2,265,[The Targus Zip-Thru Air Traveler Backpack is ...,"[Product Description, The Targus Checkpoint-Fr...",,[{'thumb': 'https://m.media-amazon.com/images/...,"[{'title': 'Durable And Spacious Backpack!', '...",Targus,"[Electronics, Computers & Accessories, Laptop ...",{'Product Dimensions': '17.8 x 14.8 x 3.8 inch...,B001OC5JKY,,,
1,1.0,Not sure why this has good reviews. My devices...,This devices brought my wifi network to its kn...,[],B00L0YLRUW,B00L0YLRUW,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,1439226089000,0,True,Computers,NETGEAR Wi-Fi Range Extender EX2700 - Coverage...,3.9,110468,[N300 WI-FI speed: Provides up to 300 Mbps per...,"[Say goodbye to Wi-Fi dead zones. Convenient, ...",59.99,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Clear set up instructions, great r...",NETGEAR,"[Electronics, Computers & Accessories, Network...","{""Wireless Type"": ""802.11ac"", ""Brand"": ""NETGEA...",B00L0YLRUW,,,,Computers,NETGEAR Wi-Fi Range Extender EX2700 - Coverage...,3.9,110468,[N300 WI-FI speed: Provides up to 300 Mbps per...,"[Say goodbye to Wi-Fi dead zones. Convenient, ...",59.99,[{'thumb': 'https://m.media-amazon.com/images/...,"[{'title': 'Clear set up instructions, great r...",NETGEAR,"[Electronics, Computers & Accessories, Network...","{'Wireless Type': '802.11ac', 'Brand': 'NETGEA...",B00L0YLRUW,,,
2,5.0,Great Case,Great case. Fits both the iPad Air & Air 2. Co...,[],B00PQBDQPO,B00PQBDQPO,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,1432344054000,0,True,Computers,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ...",3.4,55,[Automatic connecting(without entering passwor...,[Product Description Color: Aluminium Silver- ...,,"{'hi_res': [None, 'https://m.media-amazon.com/...","{'title': [], 'url': [], 'user_id': []}",BORIYUAN,"[Electronics, Computers & Accessories, Tablet ...","{""Package Dimensions"": ""11.7 x 7.7 x 2.2 inche...",B00PQBDQPO,,,,Computers,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ...",3.4,55,[Automatic connecting(without entering passwor...,[Product Description Color: Aluminium Silver- ...,,[{'thumb': 'https://m.media-amazon.com/images/...,[],BORIYUAN,"[Electronics, Computers & Accessories, Tablet ...",{'Package Dimensions': '11.7 x 7.7 x 2.2 inche...,B00PQBDQPO,,,
3,5.0,Great Product,AWESOME sound! Easy to install. The wireless ...,[],B00CRF9V2O,B00CRF9V2O,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,1432342639000,0,True,Home Audio & Theater,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...,3.9,177,[Best in class audio performance: 101dB of roo...,[Upgrade to premium audio with VIZIO's 42 inch...,,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['VIZIO V-Series 5.1 REAL Review fro...,VIZIO,"[Electronics, Home Audio, Speakers, Sound Bars]","{""Product Dimensions"": ""42.4 x 3.2 x 3.8 inche...",B00CRF9V2O,,,,Home Audio & Theater,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...,3.9,177,[Best in class audio performance: 101dB of roo...,[Upgrade to premium audio with VIZIO's 42 inch...,,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'VIZIO V-Series 5.1 REAL Review fro...,VIZIO,"[Electronics, Home Audio, Speakers, Sound Bars]",{'Product Dimensions': '42.4 x 3.2 x 3.8 inche...,B00CRF9V2O,,,
4,4.0,Worthy Purchase!,Arrived sooner than the original really long e...,[],B00852JEPW,B00852JEPW,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,1419885573000,0,True,Home Audio & Theater,Best Choice Products 100 Diagonal 16:9 Electri...,3.8,80,[],[],,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['AWESOME! Installed yesterday and i...,Best Choice Products,"[Electronics, Television & Video, Accessories,...","{""Package Dimensions"": ""103.5 x 6 x 4.5 inches...",B00852JEPW,,,,Home Audio & Theater,Best Choice Products 100 Diagonal 16:9 Electri...,3.8,80,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[{'title': 'AWESOME! Installed yesterday and i...,Best Choice Products,"[Electronics, Television & Video, Accessories,...",{'Package Dimensions': '103.5 x 6 x 4.5 inches...,B00852JEPW,,,


10000


The combined DataFrame of 10,000 rows is a reasonable starting point for the preprocessing steps and for training the model.
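The `_x` and `_y` suffixes in the merged column list come from pandas disambiguating column names that exist in both inputs. The sketch below reproduces that behaviour on two toy frames; the frame names and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the review and metadata frames (values are made up).
reviews = pd.DataFrame({
    "asin": ["A1", "A2", "A9"],
    "rating": [5.0, 1.0, 4.0],
    "title": ["Great", "Bad", "Fine"],
})
meta = pd.DataFrame({
    "parent_asin": ["A1", "A2", "A3"],
    "title": ["Backpack", "Range extender", "Camera"],
})

merged = pd.merge(
    reviews,
    meta,
    how="inner",             # keep only ASINs present in both frames
    left_on="asin",
    right_on="parent_asin",
)

# 'title' exists in both inputs, so pandas emits title_x (left) and title_y (right).
print(merged.columns.tolist())
print(len(merged))
```

Rows for "A9" and "A3" are dropped because the inner join keeps only keys found on both sides.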

## Part 2. Preprocessing  



In [None]:
print(comb_df.columns.tolist())

['rating', 'title_x', 'text', 'images_x', 'asin', 'parent_asin_x', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase', 'main_category_x', 'title_y', 'average_rating_x', 'rating_number_x', 'features_x', 'description_x', 'price_x', 'images_y', 'videos_x', 'store_x', 'categories_x', 'details_x', 'parent_asin_y', 'bought_together_x', 'subtitle_x', 'author_x', 'main_category_y', 'title', 'average_rating_y', 'rating_number_y', 'features_y', 'description_y', 'price_y', 'images', 'videos_y', 'store_y', 'categories_y', 'details_y', 'parent_asin', 'bought_together_y', 'subtitle_y', 'author_y']


In [None]:
# Only keep columns required for recommender system
keep_columns = ['rating','text','user_id','title_y','categories_y','parent_asin_y','description_x']

# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_preprocess = comb_df[keep_columns].copy()
df_preprocess['org_title_y'] = df_preprocess['title_y']

df_preprocess.head()


Unnamed: 0,rating,text,user_id,title_y,categories_y,parent_asin_y,description_x,org_title_y
0,5.0,I was searching for a sturdy backpack for scho...,AGGZ357AO26RQZVRLGU4D4N52DZQ,"Targus Air Traveler Laptop Backpack, Professio...","[Electronics, Computers & Accessories, Laptop ...",B001OC5JKY,"[Product Description, The Targus Checkpoint-Fr...","Targus Air Traveler Laptop Backpack, Professio..."
1,1.0,This devices brought my wifi network to its kn...,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,NETGEAR Wi-Fi Range Extender EX2700 - Coverage...,"[Electronics, Computers & Accessories, Network...",B00L0YLRUW,"[Say goodbye to Wi-Fi dead zones. Convenient, ...",NETGEAR Wi-Fi Range Extender EX2700 - Coverage...
2,5.0,Great case. Fits both the iPad Air & Air 2. Co...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ...","[Electronics, Computers & Accessories, Tablet ...",B00PQBDQPO,[Product Description Color: Aluminium Silver- ...,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ..."
3,5.0,AWESOME sound! Easy to install. The wireless ...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...,"[Electronics, Home Audio, Speakers, Sound Bars]",B00CRF9V2O,[Upgrade to premium audio with VIZIO's 42 inch...,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...
4,4.0,Arrived sooner than the original really long e...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,Best Choice Products 100 Diagonal 16:9 Electri...,"[Electronics, Television & Video, Accessories,...",B00852JEPW,[],Best Choice Products 100 Diagonal 16:9 Electri...


##### Text Cleaning

- Removing English stop words and HTML or special character artifacts (e.g., "nbsp").

- Tokenizing the text into manageable word units for later feature extraction and embedding.
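The two steps above can be sketched without downloading any model. This is a simplified stand-in, not the notebook's cleaning function: it assumes a regex tokenizer in place of the BERT tokenizer and a small hand-written stop-word set in place of NLTK's English list, and the name `clean_text_sketch` is hypothetical.

```python
import re
import string

# Small illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "its", "for", "to", "of", "this"}


def clean_text_sketch(text):
    """Lowercase, strip apostrophes and 'nbsp', then filter tokens."""
    text = text.lower().replace("'", "").replace("nbsp", "")
    # Split into word tokens and single punctuation tokens (stand-in for WordPiece).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [
        tok for tok in tokens
        if tok not in STOP_WORDS             # drop stop words
        and tok not in string.punctuation    # drop punctuation tokens
        and len(tok) > 2                     # drop very short tokens
    ]
    return " ".join(kept)


print(clean_text_sketch("It's a sturdy backpack&nbsp;for the office!"))
```

The length filter is a blunt instrument: it removes noise such as stray letters, but it also removes short meaningful tokens, which is worth keeping in mind when reading the cleaned output.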

In [None]:
# Clean text. This was modified from Week 8 Recommender Systems Demo
# Load BERT tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def clean_bert(text):
    text = re.sub("'", "", text) # Remove "'"
    text = text.replace("nbsp", "") # Remove html

    tokens = tokenizer.tokenize(text) # Tokenize using Bert

    filtered_tokens = [
        token for token in tokens
        if token not in stop_words and token not in string.punctuation and len(token) > 2 # Remove stop words and short tokens
    ]

    return " ".join(filtered_tokens)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Combine text columns into a single column for data preprocessing
df_preprocess = df_preprocess.copy()

df_preprocess.loc[:, "prod_desc"] = (
    df_preprocess["title_y"].astype(str)
    + " "
    + df_preprocess["categories_y"].astype(str)
    + " "
    + df_preprocess["description_x"].astype(str)
)

df_preprocess["categories_y"] = df_preprocess["categories_y"].astype(str)
df_preprocess.head()

Unnamed: 0,rating,text,user_id,title_y,categories_y,parent_asin_y,description_x,org_title_y,prod_desc
0,5.0,I was searching for a sturdy backpack for scho...,AGGZ357AO26RQZVRLGU4D4N52DZQ,"Targus Air Traveler Laptop Backpack, Professio...","['Electronics', 'Computers & Accessories', 'La...",B001OC5JKY,"[Product Description, The Targus Checkpoint-Fr...","Targus Air Traveler Laptop Backpack, Professio...","Targus Air Traveler Laptop Backpack, Professio..."
1,1.0,This devices brought my wifi network to its kn...,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,NETGEAR Wi-Fi Range Extender EX2700 - Coverage...,"['Electronics', 'Computers & Accessories', 'Ne...",B00L0YLRUW,"[Say goodbye to Wi-Fi dead zones. Convenient, ...",NETGEAR Wi-Fi Range Extender EX2700 - Coverage...,NETGEAR Wi-Fi Range Extender EX2700 - Coverage...
2,5.0,Great case. Fits both the iPad Air & Air 2. Co...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ...","['Electronics', 'Computers & Accessories', 'Ta...",B00PQBDQPO,[Product Description Color: Aluminium Silver- ...,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ...","iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ..."
3,5.0,AWESOME sound! Easy to install. The wireless ...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...,"['Electronics', 'Home Audio', 'Speakers', 'Sou...",B00CRF9V2O,[Upgrade to premium audio with VIZIO's 42 inch...,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...
4,4.0,Arrived sooner than the original really long e...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,Best Choice Products 100 Diagonal 16:9 Electri...,"['Electronics', 'Television & Video', 'Accesso...",B00852JEPW,[],Best Choice Products 100 Diagonal 16:9 Electri...,Best Choice Products 100 Diagonal 16:9 Electri...


In [None]:
# Apply clean_bert function to relevant text columns
df_preprocess['prod_desc'] = df_preprocess['prod_desc'].apply(clean_bert)
df_preprocess['text'] = df_preprocess['text'].apply(clean_bert)
df_preprocess['title_y'] = df_preprocess['title_y'].apply(clean_bert)
df_preprocess['categories_y'] = df_preprocess['categories_y'].apply(clean_bert)

df_preprocess.head()

Unnamed: 0,rating,text,user_id,title_y,categories_y,parent_asin_y,description_x,org_title_y,prod_desc
0,5.0,searching sturdy backpack school would allow c...,AGGZ357AO26RQZVRLGU4D4N52DZQ,tar ##gus air traveler laptop backpack profess...,electronics computers accessories laptop acces...,B001OC5JKY,"[Product Description, The Targus Checkpoint-Fr...","Targus Air Traveler Laptop Backpack, Professio...",tar ##gus air traveler laptop backpack profess...
1,1.0,devices brought ##fi network knees sure good r...,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,net ##ge ##ar range extend ##er ##27 ##00 cove...,electronics computers accessories networking p...,B00L0YLRUW,"[Say goodbye to Wi-Fi dead zones. Convenient, ...",NETGEAR Wi-Fi Range Extender EX2700 - Coverage...,net ##ge ##ar range extend ##er ##27 ##00 cove...
2,5.0,great case fits ipad air air comes ##movable i...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,ipad air keyboard case ##ri ##yuan ultra thin ...,electronics computers accessories tablet acces...,B00PQBDQPO,[Product Description Color: Aluminium Silver- ...,"iPad Air 2 Keyboard Case, BoriYuan Ultra Thin ...",ipad air keyboard case ##ri ##yuan ultra thin ...
3,5.0,awesome sound easy install wireless sub ##wo #...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,viz ##io ##42 ##21 ##w ##4 inch home theater s...,electronics home audio speakers sound bars,B00CRF9V2O,[Upgrade to premium audio with VIZIO's 42 inch...,VIZIO S4221w-C4 42-inch 2.1 Home Theater Sound...,viz ##io ##42 ##21 ##w ##4 inch home theater s...
4,4.0,arrived sooner original really long estimated ...,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,best choice products 100 diagonal electric pro...,electronics television video accessories proje...,B00852JEPW,[],Best Choice Products 100 Diagonal 16:9 Electri...,best choice products 100 diagonal electric pro...


## Feature Extraction  
In this stage, we use BERT embeddings to transform product descriptions and review text into numerical vector representations. These embeddings capture the semantic meaning of the text, allowing the recommender system to compare products based on contextual similarity rather than simple keyword matching.
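To make "contextual similarity" concrete, the sketch below compares toy vectors with cosine similarity. Cosine similarity is one common choice for comparing embeddings and is an assumption here, not something this notebook states it uses; the three-dimensional vectors are made up, and real sentence embeddings have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings" for three products.
laptop_bag = [0.9, 0.1, 0.0]
backpack = [0.8, 0.2, 0.1]
soundbar = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(laptop_bag, backpack)
sim_far = cosine_similarity(laptop_bag, soundbar)
print(sim_close > sim_far)
```

Products whose descriptions point in a similar direction score near 1, while unrelated products score near 0, regardless of whether they share exact keywords.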

In [None]:
import os
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"


In [None]:
#!pip install tf-keras

In [None]:
#!pip install --user tf-keras

In [None]:
from sentence_transformers import SentenceTransformer

bert_model = SentenceTransformer('all-MiniLM-L6-v2')
item_embeddings = bert_model.encode(
    df_preprocess['prod_desc'].tolist(),
    show_progress_bar=True
)

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# Encode the user_id and item_id
df_preprocess['user_idx'] = df_preprocess['user_id'].astype('category').cat.codes

# Store the categorical series of prod_desc before getting codes
item_categorical = df_preprocess['prod_desc'].astype('category')
df_preprocess['item_idx'] = item_categorical.cat.codes

# Count the unique users and items
num_users = df_preprocess['user_idx'].nunique()
num_items = df_preprocess['item_idx'].nunique()

# Prepare training data
X = df_preprocess[['user_idx', 'item_idx']].values
y = df_preprocess['rating'].values

# Generate train and test sets and split the data at 0.2 for test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pytorch class representing a dataset. This is PyTorch documentation found in reference [6]. See 'class torch.utils.data.Dataset' section
class CF_Dataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, index):
        user, item = self.data[index]
        return torch.tensor(user, dtype=torch.long), torch.tensor(item, dtype=torch.long), torch.tensor(self.labels[index], dtype=torch.float32)

    def __len__(self):
        return len(self.data)

train_dataset = CF_Dataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

# Collaborative Filtering Model
## Note: Collaborative filtering code template was pulled from reference [5] and modified for our needs
class CF_Model(nn.Module):
    def __init__(self, num_users, num_items, latent_dim=32):
        super(CF_Model, self).__init__()
        self.user_embedding = nn.Embedding(num_users, latent_dim)
        self.item_embedding = nn.Embedding(num_items, latent_dim)
        self.fc = nn.Sequential(
            nn.Linear(latent_dim * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, user, item):
        user_vec = self.user_embedding(user)
        item_vec = self.item_embedding(item)
        x = torch.cat([user_vec, item_vec], dim=-1)
        return self.fc(x).squeeze()

# Initialize model
model = CF_Model(num_users, num_items)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop - set the number of epochs to 50 for now
for epoch in range(50):
    model.train()
    total_loss = 0
    for user, item, rating in train_loader:
        optimizer.zero_grad()
        output = model(user, item)
        loss = criterion(output, rating)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

Epoch 1, Loss: 514.9764
Epoch 2, Loss: 308.4510
Epoch 3, Loss: 155.9459
Epoch 4, Loss: 84.6810
Epoch 5, Loss: 60.4614
Epoch 6, Loss: 51.9712
Epoch 7, Loss: 47.9183
Epoch 8, Loss: 46.1429
Epoch 9, Loss: 44.5077
Epoch 10, Loss: 42.8551
Epoch 11, Loss: 41.3522
Epoch 12, Loss: 40.0995
Epoch 13, Loss: 38.4258
Epoch 14, Loss: 37.2046
Epoch 15, Loss: 35.9994
Epoch 16, Loss: 34.9043
Epoch 17, Loss: 34.0490
Epoch 18, Loss: 32.6130
Epoch 19, Loss: 31.7676
Epoch 20, Loss: 30.9996
Epoch 21, Loss: 29.6768
Epoch 22, Loss: 28.9083
Epoch 23, Loss: 27.7534
Epoch 24, Loss: 27.0783
Epoch 25, Loss: 25.8426
Epoch 26, Loss: 25.4034
Epoch 27, Loss: 24.6044
Epoch 28, Loss: 23.4462
Epoch 29, Loss: 22.9892
Epoch 30, Loss: 21.9246
Epoch 31, Loss: 21.2476
Epoch 32, Loss: 20.4539
Epoch 33, Loss: 19.7372
Epoch 34, Loss: 19.0683
Epoch 35, Loss: 18.0740
Epoch 36, Loss: 17.3590
Epoch 37, Loss: 16.8072
Epoch 38, Loss: 15.8432
Epoch 39, Loss: 15.1987
Epoch 40, Loss: 14.6645
Epoch 41, Loss: 13.7939
Epoch 42, Loss: 13.088

In [None]:
#Measure how well the model performs
from sklearn.metrics import mean_squared_error
import numpy as np

# Get predictions on the test set
model.eval() # Set the model to evaluation mode
with torch.no_grad(): # Disable gradient calculation
    user_test = torch.tensor(X_test[:, 0], dtype=torch.long)
    item_test = torch.tensor(X_test[:, 1], dtype=torch.long)
    model_predictions = model(user_test, item_test).numpy()

# Calculate RMSE as the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, model_predictions))
print(f"RMSE on test set: {rmse:.4f}")

RMSE on test set: 1.4030


This is a preliminary result: a test RMSE of 1.4030 means the typical prediction error is about 1.4 rating points. We plan to run more experiments and tune the model to improve this metric.
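
One way to put the RMSE in context is to compare it against a trivial baseline that always predicts the mean training rating. The sketch below uses small made-up rating lists (`train_ratings`, `test_ratings`, and `model_preds` are hypothetical stand-ins, not this notebook's data) to show the calculation. A model adds value only if its RMSE is below the baseline's.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data for illustration only
train_ratings = [5.0, 4.0, 5.0, 1.0, 5.0]
test_ratings = [5.0, 3.0, 4.0, 1.0]
model_preds = [4.5, 3.5, 4.0, 2.0]

# Baseline: predict the mean training rating for every test example
global_mean = sum(train_ratings) / len(train_ratings)
baseline_preds = [global_mean] * len(test_ratings)

model_rmse = rmse(test_ratings, model_preds)
baseline_rmse = rmse(test_ratings, baseline_preds)
print(f"Model RMSE:    {model_rmse:.4f}")
print(f"Baseline RMSE: {baseline_rmse:.4f}")
```

In this notebook, the equivalent check would use the mean of the training ratings as the prediction for every row in the test set.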

## Main Functionality  
This section implements the key recommendation logic of the system. After training, the model uses the learned user and item embeddings to generate personalized recommendations.

The code maps user IDs to user indices and item indices to product titles, and builds each user's history of rated products so that those items can be excluded. The recommend_top_n_cf() function puts the PyTorch model in evaluation mode, predicts a rating score for every unrated item from the user and item embeddings, and prints the N highest-scoring product titles for the given user.

In [None]:
# Map user_id to user_idx
user_id_to_idx = dict(zip(df_preprocess['user_id'], df_preprocess['user_idx']))

# Map item_idx to org_title_y for display
item_titles = df_preprocess.drop_duplicates('item_idx')[['item_idx', 'org_title_y']]
idx_to_title = dict(zip(item_titles['item_idx'], item_titles['org_title_y']))

# Build user-item history to avoid recommending already rated items
user_rated_items = df_preprocess.groupby('user_idx')['item_idx'].apply(set).to_dict()

In [None]:
# Sets are not JSON serializable, so convert each user's item set to a list
temp_user_rated_items = {key: list(value) for key, value in user_rated_items.items()}

In [None]:
import json  # needed for json.dump below

with open('user_id_to_idx.json', 'w') as json_file:
    json.dump(user_id_to_idx, json_file, indent=4)

with open('idx_to_title.json', 'w') as json_file:
    json.dump(idx_to_title, json_file, indent=4)

with open('user_rated_items.json', 'w') as json_file:
    json.dump(temp_user_rated_items, json_file, indent=4)
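
When these files are loaded back, keep in mind that JSON object keys are always strings and JSON has no set type, so the integer keys and item sets need to be restored by hand. A minimal sketch of the round trip, using a tiny made-up history and an in-memory string instead of the files above (the `restore_history` helper is a hypothetical name):

```python
import json

def restore_history(raw):
    """Convert JSON-loaded {str: list} back to {int: set}."""
    return {int(user_idx): set(items) for user_idx, items in raw.items()}

# Tiny made-up history in the same shape as user_rated_items
history = {0: {3, 7}, 1: {2}}

# Sets are not JSON serializable, so convert them to lists before dumping
serialized = json.dumps({k: sorted(v) for k, v in history.items()})

loaded = json.loads(serialized)       # integer keys come back as strings
restored = restore_history(loaded)    # keys are ints and values are sets again

print(loaded)
print(restored == history)
```

The same key conversion applies to `idx_to_title`, whose keys are item indices.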

In [None]:
# Recommend the top N items for a specified user
def recommend_top_n_cf(user_id, model, N=10):
    model.eval()
    user_idx = user_id_to_idx.get(user_id)
    if user_idx is None:  # Unknown user_id: report it and stop
        print(f"User ID {user_id} not found.")
        return

    rated_items = user_rated_items.get(user_idx, set())
    scores = []

    for item_idx in range(num_items):
        if item_idx in rated_items:
            continue  # Skip items already rated

        with torch.no_grad():
            score = model(torch.tensor(user_idx), torch.tensor(item_idx)).item()
        scores.append((item_idx, score))

    top_items = sorted(scores, key=lambda x: x[1], reverse=True)[:N]

    # Print the top N items, with a blank line after each title for readability
    print(f"\nTop {N} recommended products for user id: {user_id}:\n")
    for item_idx, score in top_items:
        print(f"{idx_to_title.get(item_idx, 'Unknown Item')}\n")  # Line break after each title
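
For reuse outside the notebook, it can help to separate ranking from printing. The sketch below is one possible variant, not the function used above: `top_n_unrated` and `score_fn` are hypothetical names, and `score_fn` stands in for a call to the trained model. It uses `heapq.nlargest`, which avoids sorting the full score list when N is small.

```python
import heapq

def top_n_unrated(user_idx, num_items, rated_items, score_fn, n=10):
    """Return the n highest-scoring (item_idx, score) pairs the user has not rated."""
    candidates = (
        (item_idx, score_fn(user_idx, item_idx))
        for item_idx in range(num_items)
        if item_idx not in rated_items
    )
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Stand-in scorer for illustration; a real one would wrap model(user, item)
def toy_score(user_idx, item_idx):
    return (item_idx * 7) % 5  # arbitrary deterministic scores

recs = top_n_unrated(user_idx=0, num_items=6, rated_items={4}, score_fn=toy_score, n=3)
print(recs)
```

Returning the `(item_idx, score)` pairs leaves the caller free to look up titles in `idx_to_title`, print them, or serialize them.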

In [None]:
# Generate the top 5 recommendations for a sample user
recommend_top_n_cf('AFKZENTNBQ7A7V7UXW5JJI6UGRYQ', model, N=5)


Top 5 recommended products for user id: AFKZENTNBQ7A7V7UXW5JJI6UGRYQ:

Belkin F4U047-RS USB 2.0 Ethernet Adapter 10/100MBPS

Probrother Anti Slip Silicone Shock Proof Cover Case for Amazon 5.9'' Fire TV with 4K Alexa Voice Remote 2nd Generation Amazon Fire TV Stick Alexa Voice Remote (Green)

Cobra ACXT390 Walkie Talkies for Adults - Rechargeable, Lightweight, 22 Channels, 23-Mile Range Two-Way Radios with VOX (2-Pack)

GoPro Fusion Case, Keten Deluxe Travel Carrying Case with Buffer Sponge, Hard Shell Case, Large Storage for Adapter and Other GoPro Accessories, Best Protective Case for GoPro Camera

New Samsung AA59-00637A Replacement HDTV, LCD, LED, 3D, Smart TV Remote Control



In [None]:
torch.save(model.state_dict(), './model_weights.pth')

In [None]:
# Enter a user_id to list the items that user has already rated
user_id = 'AFKZENTNBQ7A7V7UXW5JJI6UGRYQ'

# Filter rows for this user
user_purchases = df_preprocess[df_preprocess['user_id'] == user_id]

# Print each purchased item's org_title_y, followed by a blank line
print(f"\nItems purchased by user {user_id}:\n")
for title in user_purchases['org_title_y']:
    print(f"{title}\n")


Items purchased by user AFKZENTNBQ7A7V7UXW5JJI6UGRYQ:

Senso Bluetooth Headphones, Best Wireless Sports Earbuds w/Mic IPX7 Waterproof HD Stereo Sweatproof Earphones for Gym Running Workout Noise Cancelling Earphones Earbuds Noise Cancelling Headsets

Binoculars, 12x42 Binoculars for Adults, Binoculars for Hunting, Compact Binoculars with Tripod, Smartphone Adapter for Hunting, Bird Watching, Hiking, Traveling and Sports

Toys for 4-5 Year Old Boys, Mom&myaboys 8 X 21 Kids Binoculars for Children,Compact Telescope Boys Gifts 4-8 Years Old to Bird Watching &Scenery(Yellow)



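As with the metrics, the recommendations are preliminary: the suggested products (for example, an Ethernet adapter and remote-control accessories) have no obvious connection to this user's history of headphones and binoculars. We aim to refine the model to produce more coherent output.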

## Personal Contribution Statement  

#### Laine Close  
- Imported libraries and downloaded required tools such as stopwords.
- Created connection to the remote DB to allow for CRUD interactions.
- Loaded 10,000 of the 24 million records so we could run experiments on a subset of the data.
- Transformed tokens into numerical embeddings using BERT to capture text semantics.
- Ran experiments using the recommender function to test the functionality and accuracy of the model.

#### Marcos Fernandez
- Created the MongoDB instance and imported the data so it can be retrieved remotely.
- Filtered data so that only records with a parent_asin mapping are used for the project.
- Filtered data to only the columns required to create our recommender system.
- Performed text cleaning: stop-word removal, special-character removal, tokenization, etc.

#### Owen Randolph
- Created the connection to the MongoDB cluster to retrieve the data dynamically without downloading it to local environments.
- Checked the remote data to make sure it can be queried dynamically.
- Performed preliminary data analysis and extracted questions from the data that will help keep our models on track with expectations.
- Set up the data for the recommender system by mapping users and items to their indices and building a user-item history.
- Created the function that returns the top N recommended items for a user.

## References

McAuley, J. (2023). Amazon Reviews 2023 Dataset. Hugging Face.
Retrieved from https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

Codegenes. (n.d.). Collaborative Filtering Using PyTorch.
Retrieved from https://www.codegenes.net/blog/collaborative-filtering-pytorch/

GeeksforGeeks. (n.d.). Build a Recommendation Engine with Collaborative Filtering.
Retrieved from https://www.geeksforgeeks.org/machine-learning/build-a-recommendation-engine-with-collaborative-filtering/

PyTorch. (n.d.). Introducing TorchRec.
Retrieved from https://pytorch.org/blog/introducing-torchrec/

Sling Academy. (n.d.). Combining Content-Based and Collaborative Approaches in PyTorch Recommenders.
Retrieved from https://www.slingacademy.com/article/combining-content-based-and-collaborative-approaches-in-pytorch-recommenders/

PyTorch Documentation. (n.d.). Data API Reference.
Retrieved from https://docs.pytorch.org/docs/stable/data.html
