<a href="https://www.kaggle.com/code/mohsinal/example-context-based-recommender-engine?scriptVersionId=122364634" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Sentence transformers 
Sentence transformers are a popular choice for building recommender systems that rely on text data. These models take raw text and transform it into fixed-length vectors (embeddings) that capture its semantic meaning. The vectors can then be compared to measure how similar two pieces of text are, which is the basis for recommendation.

To build a naive context-based recommender system using sentence transformers, follow these general steps:

### Prepare your data: 
Collect and clean the text data that you will be using for your recommender system. This could be a collection of product reviews, news articles, or any other type of text data that you want to use for recommendations.

### Train your sentence transformer:
You can either train your own sentence transformer from scratch or use a pre-trained model. Pre-trained models built on BERT, RoBERTa, and DistilBERT are already available and can be used as they are or fine-tuned on your specific dataset to generate embeddings for your text data. This notebook uses the pre-trained all-MiniLM-L6-v2 model without fine-tuning.

### Generate embeddings:
Use the trained sentence transformer to generate embeddings for each piece of text in your dataset. These embeddings should capture the semantic meaning of the text and be of a fixed length.

### Find similarities:
Use a similarity measure, such as cosine similarity, to find the similarity between embeddings for different pieces of text. This will allow you to identify similar pieces of text that could be recommended to users.
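As a minimal sketch of the measure named above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The three toy vectors below are hypothetical stand-ins for embeddings; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
suit_a = [0.9, 0.1, 0.0]
suit_b = [0.8, 0.2, 0.1]
handbag = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1,
# vectors pointing in unrelated directions score close to 0.
print(cosine_similarity(suit_a, suit_b))
print(cosine_similarity(suit_a, handbag))
```

The score depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.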

### Create a recommendation engine:
Based on the similarity measure, create a recommendation engine that recommends similar pieces of text to users based on their input.

Keep in mind that this is a very basic approach and there are many ways to improve the performance of your recommender system, such as using more advanced algorithms or incorporating additional features like user preferences or ratings.

Let's install the sentence-transformers package.

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=d8b8a54cd79cdd293b60735c6cc7d415d97c237ba3210daba556270dcd5ee070
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/fashion-clothing-products-catalog/myntra_products_catalog.csv


In [3]:
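# Load the catalog by its explicit path instead of relying on the loop variables left over from the previous cell
data = pd.read_csv('/kaggle/input/fashion-clothing-products-catalog/myntra_products_catalog.csv')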

In [4]:
df=data.copy()

In [5]:
df.head()

Unnamed: 0,ProductID,ProductName,ProductBrand,Gender,Price (INR),NumImages,Description,PrimaryColor
0,10017413,DKNY Unisex Black & Grey Printed Medium Trolle...,DKNY,Unisex,11745,7,"Black and grey printed medium trolley bag, sec...",Black
1,10016283,EthnoVogue Women Beige & Grey Made to Measure ...,EthnoVogue,Women,5810,7,Beige & Grey made to measure kurta with churid...,Beige
2,10009781,SPYKAR Women Pink Alexa Super Skinny Fit High-...,SPYKAR,Women,899,7,Pink coloured wash 5-pocket high-rise cropped ...,Pink
3,10015921,Raymond Men Blue Self-Design Single-Breasted B...,Raymond,Men,5599,5,Blue self-design bandhgala suitBlue self-desig...,Blue
4,10017833,Parx Men Brown & Off-White Slim Fit Printed Ca...,Parx,Men,759,5,"Brown and off-white printed casual shirt, has ...",White


# Prepare your data:
Let's clean the data by removing stop words and punctuation. We will use NLTK and spaCy for this.

These lines first import the necessary libraries for text processing, spaCy and NLTK. The 'spacy.load' function loads a pre-trained English language pipeline, and its 'Defaults.stop_words' attribute provides a set of common English stop words. A few punctuation symbols are then added to this set so that they are filtered out together with the stop words.

In [6]:
import spacy
from nltk.tokenize import word_tokenize

# Load spaCy's small English pipeline and take its built-in stop-word set
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words

# Treat these punctuation symbols as stop words so they are filtered out as well
all_stopwords.update({'&', ',', '.', '@', '/', ':', '?'})

The 'remove_Stopingwords_Punctuation' function is defined to remove stopwords and punctuation from text. The function first tokenizes the text using NLTK's 'word_tokenize' method, and then removes stopwords using a list comprehension. The resulting list of words is then joined back into a string using the 'join' method.

In [7]:
def remove_Stopingwords_Punctuation(text):
    # Split the text into tokens, then keep only the tokens that are not in the stop-word set
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]
    return " ".join(tokens_without_sw)
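One detail worth noting: the membership test is case-sensitive, and spaCy's stop words are, to my knowledge, stored in lower case. That would explain why capitalised words such as "What" and "So" survive in the cleaned text printed later in this notebook. The sketch below illustrates the effect with a hypothetical miniature stop-word set, and with `str.split` standing in for NLTK's `word_tokenize` so that it runs without extra dependencies. The `ignore_case` parameter is an illustrative addition, not part of the notebook's function.

```python
# Hypothetical miniature stop-word set; the notebook uses spaCy's much larger one.
toy_stopwords = {"what", "so", "the", "to", ","}

def drop_stopwords(text, stopwords, ignore_case=False):
    """Remove stop words from whitespace-separated text."""
    kept = []
    for token in text.split():
        # Optionally compare in lower case so "What" matches "what".
        key = token.lower() if ignore_case else token
        if key not in stopwords:
            kept.append(token)
    return " ".join(kept)

sample = "What size to pick So refer the Size Chart"
case_sensitive = drop_stopwords(sample, toy_stopwords)
case_insensitive = drop_stopwords(sample, toy_stopwords, ignore_case=True)

print(case_sensitive)    # What size pick So refer Size Chart
print(case_insensitive)  # size pick refer Size Chart
```

Whether lower-casing before the lookup is desirable depends on the data; for product titles written in title case it would remove noticeably more words.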


The next three lines apply the 'remove_Stopingwords_Punctuation' function to the 'ProductName' and 'Description' columns of the DataFrame, and concatenate the resulting strings with other relevant columns to form a 'Feature_Set' column.

Apply the remove_Stopingwords_Punctuation function to the ProductName column of the df DataFrame to remove stop words and punctuation from the text.

In [8]:
df["ProductName"] = df["ProductName"].apply(remove_Stopingwords_Punctuation)

Apply the same function to the Description column.

In [9]:
df["Description"] = df["Description"].apply(remove_Stopingwords_Punctuation)

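Create a new column Feature_Set in the df DataFrame by concatenating ProductBrand, ProductName, Gender, Description, and PrimaryColor. The columns are joined with `+` and no separator, so adjacent fields run together (for example "DKNYDKNY Unisex ..." in the output further down).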

In [10]:
df["Feature_Set"]=df["ProductBrand"]+df["ProductName"]+df["Gender"]+df["Description"]+df["PrimaryColor"]
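Joining with `+` has two side effects: neighbouring fields fuse into a single token, and a missing value in any column turns the whole result into NaN. The sketch below shows both on toy data with hypothetical values, together with one possible alternative that joins the fields with a space and treats missing values as empty strings. This is not what the notebook does, and adopting it would change the outputs shown in the rest of the notebook.

```python
import pandas as pd

# Toy rows with hypothetical values; the second row has no colour.
toy = pd.DataFrame({
    "ProductBrand": ["DKNY", "Parx"],
    "ProductName": ["Unisex Trolley Bag", "Men Casual Shirt"],
    "PrimaryColor": ["Black", None],
})
cols = ["ProductBrand", "ProductName", "PrimaryColor"]

# Plain `+` glues the fields together and gives NaN if any field is missing.
glued = toy["ProductBrand"] + toy["ProductName"] + toy["PrimaryColor"]

# Joining with a space keeps word boundaries; missing fields become "".
spaced = toy[cols].fillna("").apply(" ".join, axis=1).str.strip()

print(glued.tolist())
print(spaced.tolist())
```

With the space-separated variant no row is lost to NaN, so the later `dropna` step would keep products that only lack, say, a colour.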

Creates a new DataFrame subset with only two columns, ProductID and Feature_Set, by selecting these columns from the df DataFrame.

In [11]:
subset=df[["ProductID","Feature_Set"]]

In [12]:
subset

Unnamed: 0,ProductID,Feature_Set
0,10017413,DKNYDKNY Unisex Black Grey Printed Medium Trol...
1,10016283,EthnoVogueEthnoVogue Women Beige Grey Made Mea...
2,10009781,SPYKARSPYKAR Women Pink Alexa Super Skinny Fit...
3,10015921,RaymondRaymond Men Blue Self-Design Single-Bre...
4,10017833,ParxParx Men Brown Off-White Slim Fit Printed ...
...,...,...
12486,10262843,Pepe JeansPepe Jeans Men Black Hammock Slim Fi...
12487,10261721,MochiMochi Women Gold-Toned Solid HeelsWomenA ...
12488,10261607,612 league612 league Girls Navy Blue White Pri...
12489,10266621,


Drop every row with a missing value from the subset DataFrame. Concatenating strings with `+` yields NaN whenever one of the source columns is missing, so this step removes the products whose Feature_Set could not be built. Assigning the result back, rather than using inplace=True on a slice of df, avoids pandas' SettingWithCopyWarning.

In [13]:
subset = subset.dropna(axis=0)


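Reset the index of the subset DataFrame after dropping the rows, so that the row labels run from 0 to n-1 again and line up with the row positions of the embeddings computed below. The later lookup subset['ProductID'][maxx] relies on this.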

In [14]:
subset.reset_index(drop=True,inplace=True)

Loading a pre-trained Sentence Transformer model all-MiniLM-L6-v2.

In [15]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')


Encode the Feature_Set column of the subset DataFrame with the pre-trained model and store the embeddings in sentence_embeddings.

In [16]:
# Pass a plain list of strings so that encoding does not depend on the DataFrame's index labels
sentence_embeddings = model.encode(subset["Feature_Set"].tolist())

In [17]:
sentence_embeddings.shape

(11597, 384)

In [18]:
subset["Feature_Set"].values

array(['DKNYDKNY Unisex Black Grey Printed Medium Trolley BagUnisexBlack grey printed medium trolley bag secured TSA lockOne handle trolley retractable handle corner mounted inline skate wheelsOne main zip compartment zip lining compression straps click clasps zip compartment flap zip pocketsWarranty 5 yearsWarranty provided Brand Owner Manufacturer Black',
       'EthnoVogueEthnoVogue Women Beige Grey Made Measure Custom Made Kurta Set JacketWomenBeige Grey measure kurta churidar dupattaBeige measure calf length kurta V-neck three-quarter sleeves lightly padded bust flared hem concealed zip closureGrey solid measure churidar drawstring closureGrey net sequined dupatta printed tapingWhat Made Measure Customised Kurta Set according Bust Length So refer Size Chart pick perfect size.How measure bust Measure arms chest find bust size inchesHow measure Kurta length Measure shoulder till barefoot find kurta length Beige',
       'SPYKARSPYKAR Women Pink Alexa Super Skinny Fit High-Rise Clean

Compute the cosine similarity between the embedding of the specified product_id and every embedding in sentence_embeddings.

In [19]:
product_id = 10015921
# Embed the query product's feature text, then score it against every product embedding
query_text = subset[subset["ProductID"] == product_id]["Feature_Set"].values[0]
cosine_scores = util.cos_sim(model.encode(query_text), sentence_embeddings)

Loop five times over score: each pass picks the product ID with the highest cosine similarity score, appends it to recomendations_cos, and sets that score to -1 so that it is not picked again.

In [20]:
score=cosine_scores[0].tolist()
recomendations_cos=[]
for i in range(0,5):
    maxx=score.index(max(score))
    recomendations_cos.append(subset['ProductID'][maxx])
    score[maxx]=-1
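The repeated max-and-overwrite loop can also be expressed as a single top-k selection. The following is a sketch using the standard library's `heapq.nlargest`; the function name `top_k_similar` and its `exclude_id` parameter are hypothetical helpers introduced here, and the scores and IDs are toy values rather than data from the catalog.

```python
import heapq

def top_k_similar(scores, product_ids, k=5, exclude_id=None):
    """Return the k product IDs with the highest scores, best first.

    scores and product_ids are parallel lists. exclude_id (optional) drops
    the query product so that it is not recommended to itself.
    """
    candidates = [
        (score, pid)
        for score, pid in zip(scores, product_ids)
        if pid != exclude_id
    ]
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [pid for _, pid in best]

# Toy scores and IDs (hypothetical values, not taken from the dataset).
toy_scores = [1.0, 0.2, 0.9, 0.5, 0.7]
toy_ids = [101, 102, 103, 104, 105]

print(top_k_similar(toy_scores, toy_ids, k=3))                  # [101, 103, 105]
print(top_k_similar(toy_scores, toy_ids, k=3, exclude_id=101))  # [103, 105, 104]
```

In this notebook the call would presumably look like `top_k_similar(score, subset['ProductID'].tolist(), k=5, exclude_id=product_id)`, which would return five products other than the query itself.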

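Print the list of recommended product IDs. The first entry is the query product itself (10015921), since a product's embedding has the highest possible cosine similarity with itself.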

In [21]:
recomendations_cos

[10015921, 10025539, 10025555, 10025541, 10217093]

Select the rows of the original df DataFrame whose ProductID is among the recommended IDs and display them.

In [22]:
df[df["ProductID"].isin(recomendations_cos)]

Unnamed: 0,ProductID,ProductName,ProductBrand,Gender,Price (INR),NumImages,Description,PrimaryColor,Feature_Set
3,10015921,Raymond Men Blue Self-Design Single-Breasted B...,Raymond,Men,5599,5,Blue self-design bandhgala suitBlue self-desig...,Blue,RaymondRaymond Men Blue Self-Design Single-Bre...
1434,10025539,Raymond Men Blue Solid Regular-Fit Single-Brea...,Raymond,Men,6299,5,Blue solid party suitBlue solid regular-fit bl...,Blue,RaymondRaymond Men Blue Solid Regular-Fit Sing...
1822,10025555,Raymond Men Blue Solid Regular-Fit Single-Brea...,Raymond,Men,5849,5,Blue solid party suitBlue solid regular-fit bl...,Blue,RaymondRaymond Men Blue Solid Regular-Fit Sing...
1951,10025541,Raymond Men Blue Solid Regular-Fit Single-Brea...,Raymond,Men,4274,5,Blue solid formal suitBlue solid regular-fit b...,Blue,RaymondRaymond Men Blue Solid Regular-Fit Sing...
9674,10217093,Raymond Men Blue Solid Contemporary Fit Formal...,Raymond,Men,4049,5,Blue solid single-breasted formal suitBlue sol...,Blue,RaymondRaymond Men Blue Solid Contemporary Fit...
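Filtering with `isin` returns the rows in the DataFrame's own order, not in ranking order; in the output above the two happen to coincide. If the ranking should be preserved in general, one option is to index by ProductID and select with `.loc`. The sketch below uses a toy catalog with hypothetical IDs and assumes that ProductID values are unique.

```python
import pandas as pd

# Toy catalog with hypothetical IDs and names.
catalog = pd.DataFrame({
    "ProductID": [11, 22, 33, 44],
    "ProductName": ["shirt", "suit", "jeans", "blazer"],
})

# Recommendations listed best first (hypothetical ranking).
ranked_ids = [44, 22, 11]

# isin keeps the catalog's row order.
by_catalog_order = catalog[catalog["ProductID"].isin(ranked_ids)]

# Selecting by label with .loc keeps the order of ranked_ids.
by_rank = catalog.set_index("ProductID").loc[ranked_ids].reset_index()

print(by_catalog_order["ProductID"].tolist())  # [11, 22, 44]
print(by_rank["ProductID"].tolist())           # [44, 22, 11]
```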
