In this notebook, I will attempt to implement a **Content-Based Recommendation System**.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
import re
import os
%matplotlib inline

In [None]:
pre_df=pd.read_csv("/Users/neirinzaralwin/Developer/personal/product_recommendation_model_api/data/raw/flipkart_sample.csv", na_values=["No rating available"])
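
As a small illustration of the `na_values` argument used above, the sketch below reads a made-up two-row CSV from memory. The column names and values are hypothetical and are not taken from the Flipkart file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = (
    "product_name,product_rating\n"
    "Sofa Bed,4.5\n"
    "Cycling Shorts,No rating available\n"
)

# Strings listed in na_values are parsed as NaN instead of being kept as text
demo_df = pd.read_csv(io.StringIO(csv_text), na_values=["No rating available"])
print(demo_df["product_rating"].isna().tolist())
```

Because the placeholder text becomes NaN, the rating column can be parsed as a numeric column rather than as strings.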

In [None]:
pre_df.head()

In [None]:
pre_df.info()

## ***Data Preprocessing***

In [None]:
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.strip('[]'))
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.strip('"'))
pre_df['product_category_tree']=pre_df['product_category_tree'].map(lambda x:x.split('>>'))
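
To see what the three steps above do to a single value, here is a minimal sketch on one made-up string. The sample value is an assumption about the column's format (a bracketed, quoted, `>>`-separated path), inferred from the code rather than copied from the dataset.

```python
# Hypothetical value in the bracketed, ">>"-separated format the cell above expects
raw = '["Clothing >> Women\'s Clothing >> Shorts"]'

# Same three steps as above: drop the brackets, drop the quotes, split on ">>"
parts = raw.strip('[]').strip('"').split('>>')
print(parts)

# The pieces keep their surrounding spaces; an optional extra step trims them
trimmed = [p.strip() for p in parts]
print(trimmed)
```

The trimming step is not part of the notebook's pipeline; it is shown only because the split leaves whitespace around each category name.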

A note on `DataFrame.drop`: `axis=0` drops rows by index label, while `axis=1` drops columns by name.
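
The difference between the two axes can be sketched on a tiny, made-up DataFrame (the labels `r0`, `r1`, `a`, and `b` are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r0", "r1"])

rows_dropped = demo.drop(["r0"], axis=0)  # removes the row labelled "r0"
cols_dropped = demo.drop(["b"], axis=1)   # removes the column labelled "b"

print(rows_dropped.shape, cols_dropped.shape)
```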

In [None]:
#delete unwanted columns
del_list=['crawl_timestamp','product_url',"retail_price","discounted_price","is_FK_Advantage_product","product_rating","overall_rating","product_specifications"]
pre_df=pre_df.drop(del_list,axis=1)

In [None]:
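import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# Download the NLTK resources used below (skipped if already up to date).
# Newer NLTK releases may additionally need nltk.download("punkt_tab") for word_tokenize.
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
# `string` was already imported in the first cell
exclude = set(string.punctuation)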

In [None]:
pre_df.head()

In [None]:
pre_df.shape

In [None]:
smd=pre_df.copy()
# drop duplicate products
smd.drop_duplicates(subset ="product_name", 
                     keep = "first", inplace = True)
smd.shape
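
A minimal sketch of how `drop_duplicates` with `keep="first"` behaves, using a made-up three-row frame (names and brands are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "product_name": ["Sofa Bed", "Sofa Bed", "Cycling Shorts"],
    "brand": ["A", "B", "C"],
})

# Only the first row for each product_name survives
deduped = demo.drop_duplicates(subset="product_name", keep="first")
print(deduped.index.tolist())
print(deduped["brand"].tolist())
```

Note that the surviving rows keep their original index labels, so the index can contain gaps until it is reset.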

### Data Cleaning

As an example, take `doc = "The cats are running fast!"`. The cleaning function below transforms it step by step:

- Lowercased: "the cats are running fast!"
- Removed stopwords: "cats running fast!"
- Removed punctuation: "cats running fast"
- Tokenized: ["cats", "running", "fast"]
- Lemmatized: ["cat", "run", "fast"]
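
The steps above can be traced with a dependency-free sketch. It is not the notebook's implementation: the tiny stop-word set and lemma table are hand-written stand-ins for NLTK's stop-word list and WordNet lemmatizer, and a whitespace split stands in for `word_tokenize`.

```python
import string

# Tiny stand-ins for NLTK's resources (assumptions for this demo only)
DEMO_STOP_WORDS = {"the", "are"}
DEMO_LEMMAS = {"cats": "cat", "running": "run"}

def demo_filter_keywords(doc):
    doc = doc.lower()
    # Drop stop words, then strip punctuation character by character
    stop_free = " ".join(w for w in doc.split() if w not in DEMO_STOP_WORDS)
    punc_free = "".join(ch for ch in stop_free if ch not in string.punctuation)
    tokens = punc_free.split()  # whitespace split stands in for word_tokenize
    # Look each token up in the toy lemma table, keeping it unchanged if absent
    return [DEMO_LEMMAS.get(t, t) for t in tokens]

print(demo_filter_keywords("The cats are running fast!"))
```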

In [None]:
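def filter_keywords(doc):
    """Lowercase, drop stop words and punctuation, tokenize, then lemmatize as verbs."""
    doc = doc.lower()
    # Remove stop words first, then strip punctuation character by character
    stop_free = " ".join([i for i in doc.split() if i not in stop_words])
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    word_tokens = word_tokenize(punc_free)
    # pos="v" lemmatizes each token as a verb, e.g. "running" -> "run"
    filtered_sentence = [lem.lemmatize(w, "v") for w in word_tokens]
    return filtered_sentence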

In [None]:

smd['product'] = smd['product_name'].apply(filter_keywords)
smd['description'] = smd['description'].astype("str").apply(filter_keywords)
smd['brand'] = smd['brand'].astype("str").apply(filter_keywords)

In [None]:
# Use smd's own category column so every operand comes from the de-duplicated frame
smd["all_meta"] = smd['product'] + smd['brand'] + smd['product_category_tree'] + smd['description']
smd["all_meta"] = smd["all_meta"].apply(lambda x: ' '.join(x))
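
The cell above relies on `+` between Series whose elements are lists. A minimal sketch with made-up token lists shows the row-wise concatenation followed by the join:

```python
import pandas as pd

demo = pd.DataFrame({
    "product": [["sofa", "bed"], ["cycling", "shorts"]],
    "brand": [["fabhomedecor"], ["alisha"]],
})

# "+" on two Series of lists concatenates the lists row by row
combined = demo["product"] + demo["brand"]

# Join each token list into a single space-separated string
joined = combined.apply(lambda tokens: " ".join(tokens))
print(joined.tolist())
```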

In [None]:
smd["all_meta"].head()

<strong>TfidfVectorizer</strong> will:

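- Tokenize the text into single words and bigrams (`ngram_range=(1, 2)`).
- Remove English stop words (`stop_words='english'`).
- Compute the Term Frequency-Inverse Document Frequency (TF-IDF) score for each token and, by default, L2-normalize each document's vector.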

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['all_meta'])

### Cosine Similarity
I will use cosine similarity to quantify how similar two products are.
Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of two TF-IDF vectors already equals their cosine similarity. `linear_kernel` would therefore give the same scores as `cosine_similarity` while skipping the redundant normalization step.
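
The relationship between the dot product and cosine similarity can be checked numerically. The sketch below uses a made-up three-document corpus and the default `TfidfVectorizer` settings (`norm="l2"`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["red sofa bed", "blue sofa cover", "cycling shorts"]
demo_matrix = TfidfVectorizer().fit_transform(docs)

# Each row has unit L2 norm because of the default norm="l2"
row_norms = np.linalg.norm(demo_matrix.toarray(), axis=1)

dot = linear_kernel(demo_matrix, demo_matrix)      # plain dot products
cos = cosine_similarity(demo_matrix, demo_matrix)  # normalizes, then dot products
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and only `cosine_similarity` would return cosine scores.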

<strong>Example Use Case:</strong>
Suppose you're building a document recommendation system.

- Compute the cosine similarity between all documents in your dataset.
- Use the similarity scores to recommend the most similar documents to a given document.
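
The second step can be sketched with NumPy alone. The similarity matrix below is hand-written for four hypothetical documents, and `top_k_similar` is an illustrative helper, not a function used elsewhere in this notebook.

```python
import numpy as np

# Hand-written similarity matrix for four hypothetical documents
demo_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.7],
    [0.3, 0.4, 0.7, 1.0],
])

def top_k_similar(sim, idx, k):
    """Return the positions of the k documents most similar to document idx."""
    order = np.argsort(sim[idx])[::-1]  # positions sorted by descending score
    order = order[order != idx]         # drop the query document itself
    return order[:k].tolist()

print(top_k_similar(demo_sim, 0, 2))
```

Filtering out `idx` explicitly avoids relying on the query document always sorting first, which may not hold when another document has an identical vector.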

In [None]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

We now have a pairwise cosine similarity matrix for all the products in our dataset. The next step is to write a function that returns the most similar products based on the cosine similarity score.

In [None]:
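def get_recommendations(title):
    # Relies on the globals `cosine_sim`, `indices`, and `titles`, which must be defined before the function is called
    # Step 1: Find the index of the given product/document by its title
    idx = indices[title]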
    
    # Step 2: Get the cosine similarity scores of the given product/document with all other products/documents
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Step 3: Sort the similarity scores in descending order (most similar to least similar)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Step 4: Select the top 30 most similar products/documents (excluding the first one, which is the document itself)
    sim_scores = sim_scores[1:31]
    
    # Step 5: Extract the indices of the recommended products/documents
    product_indices = [i[0] for i in sim_scores]
    
    # Step 6: Return the titles (or names) of the recommended products/documents using the extracted indices
    return titles.iloc[product_indices]


In [None]:
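# Reset to a 0..n-1 index so that index labels match row positions in cosine_sim
smd = smd.reset_index()
titles = smd['product_name']
# Map each product name to its row position
indices = pd.Series(smd.index, index=smd['product_name'])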
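
The name-to-position lookup built above can be sketched on a made-up two-product frame. The sketch also shows one possible way to handle a title that is not in the data, which the notebook's function does not do.

```python
import pandas as pd

demo = pd.DataFrame({"product_name": ["Sofa Bed", "Cycling Shorts"]})

# Series whose index is the product name and whose values are row positions
demo_indices = pd.Series(demo.index, index=demo["product_name"])

print(demo_indices["Cycling Shorts"])  # row position of a known title

# demo_indices["Unknown Product"] would raise KeyError;
# Series.get returns a default value instead
print(demo_indices.get("Unknown Product", -1))
```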

Let us now try and get the top recommendations for a few products.

In [None]:
get_recommendations("FabHomeDecor Fabric Double Sofa Bed").head(5)

In [None]:
get_recommendations("Alisha Solid Women's Cycling Shorts").head(5)

In [None]:
# get_recommendations("Alisha Solid Women's Cycling Shorts").head(5).to_csv("Alisha Solid Women's Cycling Shorts recommendations",index=False,header=True)