# Project-3: Recommendation System Using Amazon Data

In this project, Recommendation System was performed on meta data that contains `title` and `details` features from the **All Beauty**, **Digital Music**, and **Health and Personal Care** categories in the Amazon marketplace for the year 2023.

In this study, the SentenceTransformer was used for vectorization, employing the BERT-based `all-MiniLM-L6-v2` model. For the recommendation system, `FAISS`, developed by Facebook, and `ChromaDB`, developed by Chroma AI, were used for similarity search. A recommendation system was developed that takes a product title as input and lists similar products.


# Data Specifications

| Category                  | #User   | #Item   | #Rating  | #R_Token | #M_Token |
|---------------------------|---------|---------|----------|----------|----------|
| All_Beauty               | 632.0K  | 112.6K  | 701.5K   | 31.6M    | 74.1M    |
| Digital_Music            | 101.0K  | 70.5K   | 130.4K   | 11.4M    | 22.3M    |
| Health_and_Personal_Care | 461.7K  | 60.3K   | 494.1K   | 23.9M    | 40.3M    |

[Data Source](https://amazon-reviews-2023.github.io/)

# Table of Contents

- [1. Data Loading and Validation](#1-data-loading-and-validation)
- [2. Text Preprocessing](#2-text-preprocessing)
  - [2.1. Concatenating of Title and Details Features](#21-concatenating-of-title-and-details-features)
  - [2.2. Text Cleaning](#22-text-cleaning)
- [3. Vectorization](#3-vectorization)
- [4. Recommendation System](#4-recommendation-system)
  - [4.1. Converting Vectors to NumPy Array](#41-converting-vectors-to-numpy-array)
  - [4.2. FAISS is Used for Similarity Search](#42-faiss-is-used-for-similarity-search)
  - [4.3. Listing Similar Product with FAISS](#43-listing-similar-product-with-faiss)
  - [4.4. CHROMADB is Used for Cosine Similarity Search](#44-chromadb-is-used-for-cosine-similarity-search)
  - [4.5. Listing Similar Products with CHROMADB](#45-listing-similar-products-with-chromadb)
- [5. Conclusion](#5-conclusion)


In [None]:
from warnings import filterwarnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from typing import Union, Dict, Tuple, List
from sklearn.base import BaseEstimator
import faiss
import chromadb
import re
from sentence_transformers import SentenceTransformer

filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_colwidth', None)

plt.style.use("ggplot")
sns.set_palette(sns.diverging_palette(220, 20))
tqdm.pandas()

  from .autonotebook import tqdm as notebook_tqdm


# 1. Data loading and Validation

The datasets containing Amazon images from three different categories have been downloaded from the source link, and the row counts have been verified. The `All Beauty` category contains `112590` records, the `Digital Music` category contains `70537` records, and the `Health and Personal Care` category contains `60293` records.

A missing value check was performed, and no missing data was found in `title` and `details` features.  

In [4]:
all_beauty_meta_file_path = "./data/all_beauty/meta_All_Beauty.jsonl.gz"
digital_music_meta_file_path = "./data/digital_music/meta_Digital_Music.jsonl.gz"
health_and_personal_care_meta_file_path = "./data/health_and_personal_care/meta_Health_and_Personal_Care.jsonl.gz"

all_beauty_meta_df = pd.read_json(all_beauty_meta_file_path, lines=True, compression="gzip")
digital_music_meta_df = pd.read_json(digital_music_meta_file_path, lines=True, compression="gzip")
health_and_personal_care_meta_df = pd.read_json(health_and_personal_care_meta_file_path, lines=True,compression="gzip")

all_beauty_meta_df["category"] = "all_beauty"
digital_music_meta_df["category"] = "digital_music"
health_and_personal_care_meta_df["category"] = "health_and_personal_care"

print("# Missing value check of each dataset:\n")
meta_names = ["all_beauty_meta_df", "digital_music_meta_df", "health_and_personal_care_meta_df"]
meta_dataframes = [all_beauty_meta_df, digital_music_meta_df, health_and_personal_care_meta_df]

for name, df in zip(meta_names, meta_dataframes):
    print(f"Missing values in the {name} dataset: {df.isnull().sum().sum()}")
    print(50 * "-")

print("# Shape check of each dataset:\n")
for name, df in zip(meta_names, meta_dataframes):
    print(f"Shape of the {name} dataset: {df.shape}")
    print(50 * "-")

print(50 * "*")
meta_data = pd.concat([all_beauty_meta_df, digital_music_meta_df, health_and_personal_care_meta_df], axis= 0)
meta_df = meta_data.copy()

print("The datasets from three different categories have been successfully concatenated into a single dataframe.")
print(f"Shape of the concatenated dataframe is {meta_df.shape}")
print(50 * "*")

# Missing value check of each dataset:

Missing values in the all_beauty_meta_df dataset: 218807
--------------------------------------------------
Missing values in the digital_music_meta_df dataset: 105187
--------------------------------------------------
Missing values in the health_and_personal_care_meta_df dataset: 112396
--------------------------------------------------
# Shape check of each dataset:

Shape of the all_beauty_meta_df dataset: (112590, 15)
--------------------------------------------------
Shape of the digital_music_meta_df dataset: (70537, 15)
--------------------------------------------------
Shape of the health_and_personal_care_meta_df dataset: (60293, 15)
--------------------------------------------------
**************************************************
The datasets from three different categories have been successfully concatenated into a single dataframe.
Shape of the concatenated dataframe is (243420, 15)
************************************************

In [5]:
meta_df = meta_data.copy()
meta_df = meta_df[["title","details"]]
meta_df.head()


Unnamed: 0,title,details
0,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)","{'Package Dimensions': '7.1 x 5.5 x 3 inches; 2.38 Pounds', 'UPC': '617390882781'}"
1,"Yes to Tomatoes Detoxifying Charcoal Cleanser (Pack of 2) with Charcoal Powder, Tomato Fruit Extract, and Gingko Biloba Leaf Extract, 5 fl. oz.","{'Item Form': 'Powder', 'Skin Type': 'Acne Prone', 'Brand': 'Yes To', 'Age Range (Description)': 'Adult', 'Unit Count': '10 Fl Oz', 'Is Discontinued By Manufacturer': 'No', 'Item model number': 'SG_B076WQZGPM_US', 'UPC': '653801351125', 'Manufacturer': 'Yes to Tomatoes'}"
2,Eye Patch Black Adult with Tie Band (6 Per Pack),{'Manufacturer': 'Levine Health Products'}
3,"Tattoo Eyebrow Stickers, Waterproof Eyebrow, 4D Imitation Eyebrow Tattoos, 4D Hair-like Authentic Eyebrows Waterproof Long Lasting for Woman & Man Makeup Tool","{'Brand': 'Cherioll', 'Item Form': 'Powder', 'Finish Type': 'Natural', 'Product Benefits': 'Long Lasting', 'Skin Type': 'All', 'Package Dimensions': '8.43 x 5.91 x 0.87 inches; 8.78 Ounces', 'Item model number': 'eyebrow sticker001'}"
4,Precision Plunger Bars for Cartridge Grips – 93mm – Bag of 10 Plungers,{'UPC': '644287689178'}


**There is no any missing values in the main_category and images**

In [6]:
meta_df.isnull().sum()

title      0
details    0
dtype: int64

# 2. Text Preprocessing

## 2.1. Concatenating of Title and Details Features

In this section, the `title` and `details` features have been merged. The purpose of this is to generate richer text data for each product and use this data to identify similar products more accurately.

In [7]:
meta_df["combined_text"] = meta_df.progress_apply(
    lambda x: f"{x['title']} {str(x['details']) if isinstance(x['details'], dict) else x['details']}",
    axis=1)
meta_df.head(2)

100%|██████████| 243420/243420 [00:02<00:00, 95818.13it/s] 


Unnamed: 0,title,details,combined_text
0,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)","{'Package Dimensions': '7.1 x 5.5 x 3 inches; 2.38 Pounds', 'UPC': '617390882781'}","Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack) {'Package Dimensions': '7.1 x 5.5 x 3 inches; 2.38 Pounds', 'UPC': '617390882781'}"
1,"Yes to Tomatoes Detoxifying Charcoal Cleanser (Pack of 2) with Charcoal Powder, Tomato Fruit Extract, and Gingko Biloba Leaf Extract, 5 fl. oz.","{'Item Form': 'Powder', 'Skin Type': 'Acne Prone', 'Brand': 'Yes To', 'Age Range (Description)': 'Adult', 'Unit Count': '10 Fl Oz', 'Is Discontinued By Manufacturer': 'No', 'Item model number': 'SG_B076WQZGPM_US', 'UPC': '653801351125', 'Manufacturer': 'Yes to Tomatoes'}","Yes to Tomatoes Detoxifying Charcoal Cleanser (Pack of 2) with Charcoal Powder, Tomato Fruit Extract, and Gingko Biloba Leaf Extract, 5 fl. oz. {'Item Form': 'Powder', 'Skin Type': 'Acne Prone', 'Brand': 'Yes To', 'Age Range (Description)': 'Adult', 'Unit Count': '10 Fl Oz', 'Is Discontinued By Manufacturer': 'No', 'Item model number': 'SG_B076WQZGPM_US', 'UPC': '653801351125', 'Manufacturer': 'Yes to Tomatoes'}"


## 2.2. Text Cleaning

In the vectorization process, `SentenceTransformer` will be used. When working with models like Sentence-BERT, the following steps are typically sufficient:

- **Converting text to lowercase**: This ensures consistency by standardizing the text.
  
- **Cleaning punctuation**: Only punctuation marks and extra spaces are removed during the cleaning process, while numbers are preserved. In recommendation systems, meaningful insights can be derived from numbers.
  
- **Removing stopwords**: This is not necessary, as the model automatically detects contextual meaning.
  
- **Lemmatization and stemming**: Not required, as Sentence-BERT handles these processes internally.
  
- **Removing rare words**: Not necessary, as the model evaluates rare words within their context.



In [8]:
def clean_text(text: str) -> str:
    """
    Clean and normalize text by converting it to lowercase and removing special characters 
    (punctuation), extra spaces while keeping numbers.

    Parameters
    ----------
    text : str
        The input text string to be cleaned.

    Returns
    -------
    str
        The cleaned and normalized text string.
    """
    text = text.lower()
    # Remove special characters (punctuation) but keep numbers and spaces
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    
    # Collapse multiple spaces into a single space and strips
    text = re.sub(r"\s+", " ", text).strip()
    
    return text

In [9]:
meta_df["combined_text"] = meta_df["combined_text"].progress_apply(clean_text)
meta_df.head(2)

100%|██████████| 243420/243420 [00:05<00:00, 45463.79it/s]


Unnamed: 0,title,details,combined_text
0,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)","{'Package Dimensions': '7.1 x 5.5 x 3 inches; 2.38 Pounds', 'UPC': '617390882781'}",howard lc0008 leather conditioner 8ounce 4pack package dimensions 71 x 55 x 3 inches 238 pounds upc 617390882781
1,"Yes to Tomatoes Detoxifying Charcoal Cleanser (Pack of 2) with Charcoal Powder, Tomato Fruit Extract, and Gingko Biloba Leaf Extract, 5 fl. oz.","{'Item Form': 'Powder', 'Skin Type': 'Acne Prone', 'Brand': 'Yes To', 'Age Range (Description)': 'Adult', 'Unit Count': '10 Fl Oz', 'Is Discontinued By Manufacturer': 'No', 'Item model number': 'SG_B076WQZGPM_US', 'UPC': '653801351125', 'Manufacturer': 'Yes to Tomatoes'}",yes to tomatoes detoxifying charcoal cleanser pack of 2 with charcoal powder tomato fruit extract and gingko biloba leaf extract 5 fl oz item form powder skin type acne prone brand yes to age range description adult unit count 10 fl oz is discontinued by manufacturer no item model number sgb076wqzgpmus upc 653801351125 manufacturer yes to tomatoes



# 3. Vectorization

In the vectorization stage, the Sentence Transformers library was used to convert text data into numerical format. The `all-MiniLM-L6-v2` model was employed by this library to transform each text into meaningful vectors. The all-MiniLM-L6-v2 model, which is a small yet efficient `BERT-based` model, was optimized to accurately capture similarities between texts. This model converts sentences and texts into numerical vectors while preserving their meaning, making it suitable for measuring similarities between texts and for applications such as recommendation systems. This vectorization process is essential for calculating similarities between texts in subsequent steps.

In [10]:
model = SentenceTransformer("all-MiniLM-L6-v2")
meta_df["vector"] = meta_df["combined_text"].progress_apply(lambda x: model.encode(x))
meta_df["vector"].head()

100%|██████████| 243420/243420 [1:18:44<00:00, 51.53it/s]


0     [-0.0918413, 0.06003494, 0.004900872, -0.0073167305, -0.0465304, 0.028341604, 0.0024836974, 0.07730959, -0.013703487, -0.032785356, -0.018591268, -0.08558699, 0.10394323, -0.051345274, -0.05043488, 0.0047271713, 0.029379848, -0.02028668, -0.06343625, 0.0006807128, -0.06766593, -0.015997058, -0.026054906, 0.021870762, -0.11063664, 0.0291975, 0.06551853, -0.0002637189, -0.04165343, -0.0038086483, 0.057719193, -0.020869533, 0.028144227, 0.065324865, -0.0019984734, 0.050205432, -0.041322928, -0.036801253, 0.064071886, 0.036421824, -0.0022212465, -0.04859935, 0.004728827, 0.03915579, -0.045157462, -0.03938245, -0.034714114, 0.06453106, 0.045751784, -0.023349473, 0.046039216, 0.034843195, -0.062064715, -0.014443702, 0.013906781, 0.0036212008, -0.022480816, -0.027430534, -0.038602434, -0.014063306, 0.062229816, -0.028353779, -0.0043628886, -0.029414691, 0.0053261686, 0.0883889, -0.034931876, -0.059152935, -0.06094775, -0.021268213, -0.07645427, 0.028868686, 0.046392813, 0.013261544, 0.0

In [11]:
meta_df.to_csv("vectorized_df.csv") 

In [2]:
# meta_df = pd.read_csv("vectorized_df.csv")

# 4. Recommendation System

## 4.1. Converting Vectors to NumPy Array

First, the vector column in meta_df was converted to a `numpy` array. This was done because numpy arrays offer performance benefits, especially when handling large datasets. Mathematical operations and memory management are optimized when numpy arrays are used, which is essential when working with high-dimensional data such as text embeddings. The embeddings, which were initially in the form of lists or pandas series objects, were converted to numpy arrays to enable fast computations, particularly for tasks like cosine similarity or distance-based searches. This conversion was also performed to ensure compatibility with libraries like `FAISS`.

In [12]:
text_embeddings = np.array(meta_df["vector"].progress_apply(lambda x: x).tolist()).astype('float32')
print(f"Vectors Shape: {text_embeddings.shape}")

100%|██████████| 243420/243420 [00:00<00:00, 1060078.47it/s]


Vectors Shape: (243420, 384)


## 4.2. FAISS is Used for Similarity Search

Instead of using traditional methods like sklearn.metrics.pairwise.cosine_similarity, `FAISS (Facebook AI Similarity Search)` was used for similarity search. This choice was made because FAISS is highly optimized for nearest neighbor searches in large datasets, whereas methods like cosine_similarity can become computationally expensive and memory-intensive as the dataset grows. For instance, using sklearn for cosine similarity calculation across thousands or millions of vectors may require a substantial amount of memory (e.g., 400 GB).

FAISS was designed to perform high-speed similarity searches efficiently. While cosine similarity is a standard method for calculating the similarity between vectors, it is not directly calculated by FAISS. Instead, FAISS uses the `L2 distance (Euclidean distance)` between vectors to perform similarity searches. However, this is equivalent to calculating cosine similarity when vectors are normalized, which was the case in this project.

Rather than keeping the entire similarity matrix in memory, FAISS computes only the necessary parts on demand, which significantly reduces memory usage.

An index was created in FAISS using the IndexFlatL2 algorithm, which computes the L2 distance between vectors. The index was then populated with the vectors.

In [41]:
index = faiss.IndexFlatL2(text_embeddings.shape[1])
index.add(text_embeddings)
print(f"Number of vectors in the index: {index.ntotal}")

Number of vectors in the index: 243420


## 4.3. Listing Similar Product with FAISS

After the FAISS index was set up, a function get_similar_products_faiss was implemented to retrieve the most similar products based on a given product’s title. The steps involved are as follows:

- The index of the product title in the DataFrame (meta_df) was located.

- The FAISS index was used to perform a similarity search for the product’s embedding, returning the top n most similar products.
  
- The input product was excluded from the results to prevent it from appearing in its own similarity list.
  
- The similarity scores were added to the products and sorted in descending order to retrieve the top n most similar products.

Upon examining the results, it is observed that similar products are successfully listed.

In [50]:
def get_similar_products_faiss(product_title: str, meta_df: pd.DataFrame, index, top_n: int = 10) -> pd.Series:
    """
    Use Faiss to find products similar to a given product based on its title.

    Parameters
    ----------
    product_title : str
        The title of the product for which similar products are to be found.
    meta_df : pd.DataFrame
        A DataFrame containing product information (titles and vectors).
    index : faiss.Index
        The Faiss index object used for similarity search.
    top_n : int, optional
        The number of similar products to return, by default 10.

    Returns
    -------
    pd.Series
        A Series containing the titles of the most similar products and their similarity scores, sorted in descending order.
    """

    product_index = meta_df[meta_df["title"] == product_title].index[0]
    distances, indices = index.search(text_embeddings[product_index:product_index+1], top_n)
    similar_products = meta_df.iloc[indices[0]]
    similar_products = similar_products[similar_products.index != product_index]
    num_similar = len(similar_products)
    
    if num_similar == 0:
        return pd.Series([], dtype="object")
    
    similar_products["similarity"] = distances[0][:num_similar]
    similar_products = similar_products.sort_values(by="similarity", ascending=False)
    
    return similar_products[["title", "similarity"]]


In [58]:
meta_df[["title"]].head(10)

Unnamed: 0,title
0,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)"
1,"Yes to Tomatoes Detoxifying Charcoal Cleanser (Pack of 2) with Charcoal Powder, Tomato Fruit Extract, and Gingko Biloba Leaf Extract, 5 fl. oz."
2,Eye Patch Black Adult with Tie Band (6 Per Pack)
3,"Tattoo Eyebrow Stickers, Waterproof Eyebrow, 4D Imitation Eyebrow Tattoos, 4D Hair-like Authentic Eyebrows Waterproof Long Lasting for Woman & Man Makeup Tool"
4,Precision Plunger Bars for Cartridge Grips – 93mm – Bag of 10 Plungers
5,Lurrose 100Pcs Full Cover Fake Toenails Artificial Transparent Nail Tips Nail Art for DIY
6,Stain Bonnet For Baby Bonnet Silk Sleep Cap For Toddler Child Shower Cap Teens Kids
7,50 Pieces False Eyelash Packaging Box Empty Eyelash Box Plastic Eyelash Storage Case with Glitter Paper and Clear Tray for Women Girls Eyelash (Holographic)
8,Gold extatic Musk EDT 90ml
9,"4 Pieces Satin Bonnet Adjustable Sleep Cap Double Layer Printed Hair Bonnet Large Reversible Sleeping Silk Night Hair Cap for Women Natural Curly Hair (Floral, Flamingo, Flower)"


In [62]:
product_title = meta_df["title"].iloc[0]
similar_products = get_similar_products_faiss(product_title, meta_df, index, top_n=10)
print(f"Similar products for '{product_title}':")
similar_products

Similar products for 'Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)':


Unnamed: 0,title,similarity
24869,Black Leather Jewelry Box w/ Travel Case,0.65
26120,English Leather Deodorant Stick - 3 Oz (85g) (2 Pack),0.63
42363,LEATHERMAN 831961 Juice(R) C2 Multi-Tool (Sunrise Yellow),0.63
84289,"Nine WEST Women's Wearable Leather Oxford Flat, Black, 5.5",0.63
69124,Quantum Moisturizing Conditioner 15oz/443ml,0.62
26969,Genuine Leather Travel Grooming Cosmetic Kit Color: Malibu Blue,0.61
41850,Gunnar Authentic Leather Case,0.44
89502,"Howard LC0008 Leather Conditioner, 8-Ounce (2-pack)",0.38
52885,Workman’s Friend All-Natural Leather Conditioner,0.0


## 4.4. CHROMADB is Used for Cosine Similarity Search

`Chroma` is another powerful tool used for vector-based similarity searches, similar to FAISS. However, Chroma allows for more efficient and scalable vector searches, including cosine similarity, without needing a custom FAISS index.

In this section, we will use Chroma to perform cosine similarity searches. The procedure is similar to FAISS but with easier setup and integration. Chroma stores embeddings and allows for direct querying of vectors for finding similar items. The process includes setting up the Chroma client, inserting embeddings, and performing searches on those embeddings.

In [13]:
client = chromadb.Client()
collection_name = "product_collection"
collection = client.create_collection(collection_name) if collection_name not in client.list_collections() else client.get_collection(collection_name)

In [None]:
for idx, embedding in enumerate(text_embeddings):
    collection.add(ids=[str(idx)], documents=[meta_df["title"].iloc[idx]], embeddings=[embedding])

## 4.5. Listing Similar Products with CHROMADB

After setting up the Chroma collection, you can now query for the most similar products based on the `cosine similarity` of their embeddings. The get_similar_products_chroma function will help retrieve products that are similar to the given product title.

The function will perform the following steps:

- Retrieve the embedding for the given product title.
  
- Use Chroma to search for the most similar products by querying the collection.
  
- Return the top N most similar products.

When comparing Chroma and FAISS, it was observed that Chroma produced slightly better results for the top recommended product. 

In [21]:
def get_similar_products_chroma(product_title: str, meta_df: pd.DataFrame, collection, top_n: int = 10) -> pd.DataFrame:
    """
    Use Chroma to find products similar to a given product based on its title.

    Parameters
    ----------
    product_title : str
        The title of the product for which similar products are to be found.
    meta_df : pd.DataFrame
        A DataFrame containing product information (titles and vectors).
    collection : chromadb.Collection
        The Chroma collection object used for similarity search.
    top_n : int, optional
        The number of similar products to return, by default 10.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the titles of the most similar products and their similarity scores, sorted in descending order.
    """

    product_index = meta_df[meta_df["title"] == product_title].index[0]
    product_vector = text_embeddings[product_index]

    results = collection.query(query_embeddings=[product_vector], n_results=top_n)

    if len(results["documents"][0]) != len(results["distances"][0]):
        raise ValueError("Mismatch between documents and distances lengths in query results.")

    similar_products = pd.DataFrame({
        'title': results['documents'][0],
        'similarity': results['distances'][0]
    })

    similar_products = similar_products.sort_values(by="similarity", ascending=False)

    return similar_products


In [24]:
meta_df[["title"]].head(10)

Unnamed: 0,title
0,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)"
1,"Yes to Tomatoes Detoxifying Charcoal Cleanser (Pack of 2) with Charcoal Powder, Tomato Fruit Extract, and Gingko Biloba Leaf Extract, 5 fl. oz."
2,Eye Patch Black Adult with Tie Band (6 Per Pack)
3,"Tattoo Eyebrow Stickers, Waterproof Eyebrow, 4D Imitation Eyebrow Tattoos, 4D Hair-like Authentic Eyebrows Waterproof Long Lasting for Woman & Man Makeup Tool"
4,Precision Plunger Bars for Cartridge Grips – 93mm – Bag of 10 Plungers
5,Lurrose 100Pcs Full Cover Fake Toenails Artificial Transparent Nail Tips Nail Art for DIY
6,Stain Bonnet For Baby Bonnet Silk Sleep Cap For Toddler Child Shower Cap Teens Kids
7,50 Pieces False Eyelash Packaging Box Empty Eyelash Box Plastic Eyelash Storage Case with Glitter Paper and Clear Tray for Women Girls Eyelash (Holographic)
8,Gold extatic Musk EDT 90ml
9,"4 Pieces Satin Bonnet Adjustable Sleep Cap Double Layer Printed Hair Bonnet Large Reversible Sleeping Silk Night Hair Cap for Women Natural Curly Hair (Floral, Flamingo, Flower)"


In [26]:
product_title = meta_df["title"].iloc[0]
similar_products = get_similar_products_chroma(product_title, meta_df, collection, top_n=10)
print(f"Similar products for '{product_title}':")
similar_products

Similar products for 'Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)':


Unnamed: 0,title,similarity
9,COBBLER'S CHOICE CO. FINEST QUALITY All-Natural Leather Restorer Conditioner,0.66
8,Black Leather Jewelry Box w/ Travel Case,0.66
7,LEATHERMAN 831961 Juice(R) C2 Multi-Tool (Sunrise Yellow),0.63
6,"Nine WEST Women's Wearable Leather Oxford Flat, Black, 5.5",0.63
5,Quantum Moisturizing Conditioner 15oz/443ml,0.63
4,Genuine Leather Travel Grooming Cosmetic Kit Color: Malibu Blue,0.62
3,Gunnar Authentic Leather Case,0.61
2,"Howard LC0008 Leather Conditioner, 8-Ounce (2-pack)",0.44
1,Workman’s Friend All-Natural Leather Conditioner,0.38
0,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",0.0


# 5. Conclusion

In this project, a recommendation system was developed using the SentenceTransformer model for text vectorization, converting product titles into vector representations. Faiss and Chroma were used for similarity search, with Euclidean distance applied by Faiss and Cosine Similarity by Chroma. It was found that Chroma provided slightly better results, likely due to its optimization for textual data. Both tools enabled efficient similarity search, with Chroma being more effective for high-dimensional text embeddings. This system offers product recommendations based on semantic similarity, enhancing the user experience.