In [1]:
import pandas as pd
import csv
from IPython.display import Image, HTML
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
listings = pd.read_csv("bigbasket.csv", usecols = ['product', 'category', 'description'])
listings.head(10)

Unnamed: 0,product,category,description
0,Garlic Oil - Vegetarian Capsule 500 mg,Beauty & Hygiene,This Product contains Garlic Oil that is known...
1,Water Bottle - Orange,"Kitchen, Garden & Pets","Each product is microwave safe (without lid), ..."
2,"Brass Angle Deep - Plain, No.2",Cleaning & Household,"A perfect gift for all occasions, be it your m..."
3,Cereal Flip Lid Container/Storage Jar - Assort...,Cleaning & Household,Multipurpose container with an attractive desi...
4,Creme Soft Soap - For Hands & Body,Beauty & Hygiene,Nivea Creme Soft Soap gives your skin the best...
5,Germ - Removal Multipurpose Wipes,Cleaning & Household,Stay protected from contamination with Multipu...
6,Multani Mati,Beauty & Hygiene,Satinance multani matti is an excellent skin t...
7,Hand Sanitizer - 70% Alcohol Base,Beauty & Hygiene,70%Alcohol based is gentle of hand leaves skin...
8,Biotin & Collagen Volumizing Hair Shampoo + Bi...,Beauty & Hygiene,"An exclusive blend with Vitamin B7 Biotin, Hyd..."
9,"Scrub Pad - Anti- Bacterial, Regular",Cleaning & Household,Scotch Brite Anti- Bacterial Scrub Pad thoroug...


In [3]:
listings['product']=listings['product'].astype('str')
listings['description']=listings['description'].astype('str')
listings['category']=listings['category'].astype('str')

.astype('str') casts each column to strings, in case some entries hold non-string values such as integers.


In [4]:
listings['content']=listings[['product','category','description']].astype(str).apply(lambda x:'//'.join(x),axis=1)

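The values must be strings before .join() can concatenate them, hence the .astype(str).

.apply() calls the given function once per column or once per row of the DataFrame, depending on "axis":

axis=0 passes each column to the function *and*
axis=1 passes each row to the function

Here axis=1 is used, so each row's product, category and description are joined into a single string separated by "//".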
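
As a rough illustration of the cast-then-join step, the sketch below runs the same expression on a tiny made-up DataFrame. The name `toy` and its rows are assumptions for the example and are not taken from bigbasket.csv; one product name is deliberately an integer to show why the cast is needed.

```python
import pandas as pd

# Toy data (hypothetical); one product name is an integer on purpose
toy = pd.DataFrame({
    "product": ["Water Bottle", 500],
    "category": ["Kitchen", "Beauty"],
    "description": ["Microwave safe", "Garlic oil capsule"],
})

# axis=1: the lambda receives one row at a time and joins its three values
toy["content"] = (
    toy[["product", "category", "description"]]
    .astype(str)
    .apply(lambda row: "//".join(row), axis=1)
)

print(toy["content"].tolist())
```

Without the .astype(str), the join would fail on the second row, because str.join() only accepts strings.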

In [5]:
listings['content']

0        Garlic Oil - Vegetarian Capsule 500 mg//Beauty...
1        Water Bottle - Orange//Kitchen, Garden & Pets/...
2        Brass Angle Deep - Plain, No.2//Cleaning & Hou...
3        Cereal Flip Lid Container/Storage Jar - Assort...
4        Creme Soft Soap - For Hands & Body//Beauty & H...
                               ...                        
27550    Wottagirl! Perfume Spray - Heaven, Classic//Be...
27551    Rosemary//Gourmet & World Food//Puramate rosem...
27552    Peri-Peri Sweet Potato Chips//Gourmet & World ...
27553    Green Tea - Pure Original//Beverages//Tetley G...
27554    United Dreams Go Far Deodorant//Beauty & Hygie...
Name: content, Length: 27555, dtype: object

In [6]:
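listings['content'] = listings['content'].fillna('Null')

Replaces any null values in "content" with the string "Null". Note that the earlier .astype(str) calls typically turn missing values into the literal string "nan", in which case nothing is left for fillna() to replace.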


---



  Training the recommender: TfidfVectorizer() converts a collection of raw documents into a matrix of TF-IDF features
  ----------

In [7]:
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 2), min_df = 1, stop_words = 'english')  # an integer min_df below 1 is rejected by recent scikit-learn
tfidf_matrix = tf.fit_transform(listings['content'])

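analyzer: whether features are built from words or from characters. 'word' means the raw text is tokenized into words and the n-grams are formed from those words.

ngram_range: the lower and upper bound on the n-gram sizes to extract. (1, 2) means both unigrams (single words) and bigrams (pairs of adjacent words) are extracted.

min_df: drops terms that appear in fewer than this many documents, which removes terms that are too infrequent. At the lowest setting, as here, no term is dropped.

stop_words: 'english' removes common English words (such as "the" and "and") using scikit-learn's built-in list.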


---



In [8]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

Cosine similarity: "cosine_similarities" contains the pairwise cosine similarity score for every pair of entries (that is, of their TF-IDF vectors). For n entries it is an n × n matrix, where the value at row i and column j (i.e. cosine_similarities[i][j]) is the similarity score between the ith and jth vectors.

linear_kernel() computes plain dot products, while cosine_similarity() first normalizes each vector to unit length. TfidfVectorizer already L2-normalizes every row by default (norm='l2'), so both produce the same result here; linear_kernel() simply skips the redundant normalization, which makes it faster on a larger dataset.
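
The equivalence can be checked on a small example. The sketch below uses a made-up corpus (`docs` is an assumption, not the real data) and compares the two functions on its TF-IDF matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (hypothetical)
docs = ["aloe vera gel", "aloe vera soap", "green tea"]
tfidf = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes first, then dot products

same = np.allclose(dot, cos)
print(same)
print(np.round(dot, 3))
```

The diagonal is 1 because every document is identical to itself, and documents with no words in common score 0.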

In [9]:
results = {}
for idx, row in listings.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]  # indices of the 99 highest scores, most similar first
    similar_items = [(cosine_similarities[idx][i], listings['product'][i]) for i in similar_indices]
    results[row['product']] = similar_items[1:]



---

Iterate through each product's row of similarity scores and store its most similar products.

"idx" is just a variable name for the row's index label. DataFrame.iterrows() yields (index, row) pairs, so the first variable in the loop receives the index whatever it is called.

DataFrame.iterrows() lets us iterate over the DataFrame one row at a time. Each iteration produces the index label and the row as a Series. Since "listings" has the default 0..n-1 index, idx also works as a position into cosine_similarities.

.argsort() returns the array of indices that would sort the given array in ascending order. It doesn't return the sorted values themselves.

"cosine_similarities" is the 2D matrix; cosine_similarities[idx] is the single row holding the similarity of product idx to every product in the dataset. "similar_indices" is the 1D array obtained by applying .argsort() to that row and slicing the result.

The slice [:-100:-1] walks the argsort result backwards from the end and stops before the 100th-from-last position, so it yields the indices of the 99 highest similarities (those closest to 1), most similar first. The first of these is normally the product itself, which is why similar_items[1:] drops it.
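
The argsort-and-slice idiom is easier to follow on a short array. This sketch uses four made-up scores (an assumption for illustration) and a slice of [:-3:-1], which behaves like [:-100:-1] but on a smaller scale.

```python
import numpy as np

scores = np.array([0.1, 1.0, 0.4, 0.7])  # made-up similarity row

order = scores.argsort()  # indices in ascending order of score
top = order[:-3:-1]       # walk backwards, stop before the 3rd-from-last
top_k = order[::-1][:3]   # alternative: reverse fully, then take exactly 3

print(order.tolist())  # [0, 2, 3, 1]
print(top.tolist())    # [1, 3]
print(top_k.tolist())  # [1, 3, 2]
```

Note that [:-k:-1] returns k-1 indices, not k. Reversing first and then taking [:k] is one way to get exactly k.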

In [10]:
similar_indices

array([27554, 23332,   275, 25677, 17564, 25315, 18348,  2783, 18457,
        9639, 26297, 25419, 20719, 15524, 19701,  2255,  6208,  2012,
        6198, 13247,  4945, 13126, 10800, 24365, 13795,  8192, 23940,
       26006,   601, 22195, 27515,  4113, 15921, 20494, 16291, 18631,
        9957,   343,  9061, 20269, 11283, 13561,  9979,  8191,  1181,
        7026,  5036, 19806,  6166,  5641, 16540, 18559,  5996,  1482,
       20396,   432,  7991, 10192,  5065, 20105,  8600, 11818,  3420,
       19347, 24639, 17005,  6922,  8709, 23849, 21415, 20589, 18275,
       10661,  7692, 15951, 20477, 21714,  1004, 17157, 17197, 19536,
        3578, 15395, 13886, 14698, 14762, 12343,  6960,  8887,  1215,
        7348, 18174,  2713, 25554,  4050,  2776,  9325, 15111, 18479])

similar_indices gives the sorted order of the list in terms of index positions, not the values. Since it is reassigned on every pass of the loop, after the loop it holds the 99 closest indices for the last row only (index 27554, which is why that number comes first).

In [11]:
similar_items

[(1.0, 'United Dreams Go Far Deodorant'),
 (1.0, 'United Dreams Go Far Deodorant'),
 (0.33636029391641487, 'United Dreams Go Far Eau De Toilette'),
 (0.2612607779944558, 'United Dreams Be Strong Eau De Toilette'),
 (0.260232331276576, 'United Dreams Aim High Eau De Toilette'),
 (0.19991846413770484, 'United Dreams Love Yourself Eau De Toilette'),
 (0.19647723974763515, 'United Dreams Live Free Eau De Toilette'),
 (0.09883079615395249, 'United Dreams For Men Big Eau De Toilette'),
 (0.09413701647116703, 'United Dreams Big For Her Eau De Toilette'),
 (0.09083663010357802, 'Marmalade - Three Fruit (Orange, Grapefruit & lemon)'),
 (0.07196334009296147, 'Body Wash - Bitter Orange & Mandarin'),
 (0.0638826281657799, 'Classic Black Eau De Toilette'),
 (0.06265819833139988, 'Patchouli & Vetiver Soap'),
 (0.056902376263072145, 'After Shave Spray - Musk For Men'),
 (0.054636241138693316, 'On Pocket Perfume - Man, Citrus Fresh'),
 (0.05345268467036848, 'Code Iridium - Body Perfume'),
 (0.05191704

similar_items is a list of tuples, where each tuple has 2 values: the cosine similarity and the corresponding product name.

For each index in similar_indices, similar_items stores the cosine similarity at that index along with the product name at that index. Like similar_indices, it is rebuilt on every pass of the loop, so the output above shows the 99 pairs for the last row in the dataset. The score 1.0 appears twice, presumably because this product is listed more than once with identical content.

In [12]:
results

{'Garlic Oil - Vegetarian Capsule 500 mg': [(1.0000000000000002,
   'Garlic Oil - Vegetarian Capsule 500 mg'),
  (0.31465929636187984, 'Evening Primrose Oil - Vegetarian Capsule (500 mg)'),
  (0.23634118445284674, 'Almond Oil B P'),
  (0.21511752716888652, 'Natural Spray'),
  (0.19673866526367226, 'Hair Oil - Amla'),
  (0.19516222266407784, 'Cotton Buds'),
  (0.18652178780422246, 'Vaporisateur Natural Spray'),
  (0.18652178780422246, 'Vaporisateur Natural Spray'),
  (0.18389659911729267, 'Classic Deodorant Spray for Men'),
  (0.18389659911729267, 'Classic Deodorant Spray for Men'),
  (0.18373847824784906, 'Imunohills - Capsule'),
  (0.1835772917869305, 'Cotton Balls'),
  (0.16821589714197216, 'Hair Conditioner - Damage Control'),
  (0.16815377362690226, 'Energy Vaporisateur Natural Spray'),
  (0.16552722285627078, 'Coconut Oil - 100 % Pure'),
  (0.16327230428747821, 'Shanti - Badam Amla Hair Oil'),
  (0.16258602407903677, 'Neem Oil'),
  (0.16123176095640598, 'Bathing Soap (Lavender & M

"results" maps each product name to its list of closest matches: the pairs from similar_items with the first entry dropped, each pair being a cosine similarity and the corresponding product name. Because the dictionary is keyed by product name, products that appear more than once in the dataset share a single entry, and the last occurrence overwrites the earlier ones.

# Prediction

In [13]:
def item(id):
    # Look up the combined "product//category//description" string once
    content = listings.loc[listings['product'] == id]['content'].tolist()[0]
    parts = content.split('//')
    name   = parts[0]
    # Keep only the first 165 characters of the description for display
    desc   = ' \nDescription: ' + parts[2][0:165] + '...'
    prediction = name  + desc
    return prediction

.loc is the indexer used to filter the "listings" DataFrame down to the rows where the "product" column matches the id parameter. The "content" column of those rows is then extracted and converted to a plain Python list with .tolist(), so that the first match can be picked out by position regardless of its index label.

The [0] index is used to access the first item of the list, which is the value of the "content" column. Thus we get "product//category//description" format

The value from ".tolist()[0]" (i.e, the value of the "content" column) is split around "//". The function .split() returns a list of the substrings

The [0] index of the list returned by .split() consists of the product name.


The [1] index of the list returned by .split() consists of the category.


The [2] index of the list returned by .split() consists of the description.

The [0:165] is to extract the first 165 characters of the description. This is for presentation purpose only
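
The splitting and truncation can be tried on a single string. The sketch below uses a shortened, made-up "content" value and a 10-character preview instead of 165; the second string is a hypothetical edge case where the description itself contains "//".

```python
# Shortened example value in the "product//category//description" format
content = "Water Bottle - Orange//Kitchen, Garden & Pets//Each product is microwave safe"

parts = content.split("//")
name, category, description = parts[0], parts[1], parts[2]
preview = description[0:10] + "..."
print(name, "|", category, "|", preview)

# Hypothetical edge case: the description contains "//" itself
tricky = "A//B//see http://x"
print(tricky.split("//"))     # 4 pieces, description is cut short
print(tricky.split("//", 2))  # maxsplit=2 keeps the description whole
```

Passing a maxsplit of 2 is one way to guard against descriptions that contain the separator; whether the real data needs this is not established here.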

In [14]:
def recommend(product, num):
    print('Recommending ' + str(num) + ' products similar to ' + item(product))
    print('---')
    recs = results[product][:num]
    for rec in recs:
        print('\nRecommended: ' + item(rec[1]) + '\n(score:' + str(rec[0]) + ')')

recs stores the first "num" recommendations for the product. "product" is the argument taken from user input. Assume num=5 for further explanations

"for rec in recs" iterates through each of the 5 recommendations extracted for the product (which is given by the user). Each of these recommendations is a tuple having 2 values: cosine similarity and product name

rec[1] is the product name of the recommended product


rec[0] is the cosine similarity value of the recommended product
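
The lookup-and-slice pattern can be seen in isolation with a hand-written dictionary. Everything in `results_demo` below is hypothetical (names and scores are invented for the sketch); only the shape matches the real "results".

```python
# Hypothetical, hand-written stand-in for the "results" dictionary
results_demo = {
    "Pure Aloe Vera Gel": [
        (0.42, "Aloevera Skin Gel"),
        (0.37, "Aloevera Beauty Gel"),
        (0.33, "Hydro Aloe Vera Gel"),
    ],
}

num = 2
recs = results_demo["Pure Aloe Vera Gel"][:num]  # first num (score, name) pairs

# Unpack each tuple: position 0 is the score, position 1 is the name
lines = [f"{name} (score: {score})" for score, name in recs]
print(lines)
```

Because the stored list is already sorted from most to least similar, taking the first num entries is enough to get the top matches.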

# OVERALL
item() extracts the name and description of a particular product from its "content" value

recommend() calls item() for each entry of the selected set of recommended products. These recommended products are found by ranking all products by cosine similarity and keeping the closest matches

In [28]:
recommend(product = "Pure Aloe Vera Gel", num = 5)

Recommending 5 products similar to Pure Aloe Vera Gel 
Description: Bring the lost glow back to your face with this Jeva Pure Aloe Vera Gel. It is your one-stop solution to prevent aging and add a natural glow. Enriched with Aloe Ver...
---

Recommended: Aloevera Skin Moistrurising Gel 
Description: This non-greasy formulation works well to hydrate and nourishes your skin making it feel revitalized and fresh. Made with real natural Aloe Vera, Jeva Aloe Vera Gel ...
(score:0.42032321668008416)

Recommended: Aloevera Moisturising Beauty Gel 
Description: A unique gel to moisturize skin, heal burns and cuts, treat acne and also maintains skin texture for sensitive skin.Tip: Following a regular skin care regime can hel...
(score:0.367425585675334)

Recommended: Hydro Replenish Light Aloe Vera Gel 
Description: Hydration is a must for skin! With time, our skin cells lose the capacity to retain hydration, leading to dull, tired and dry skin. Kaya Youth Hydro Replenish, enric...
(score:0.3288