# Product matching

***Using ML/DL techniques, match similar products from the Flipkart dataset with the Amazon dataset. Once
similar products are matched, display the retail price from FK and AMZ side by side.***

***Data extraction***

In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits
%matplotlib inline

In [2]:
# Reading data in csv format to dataframe
amazon_products = pd.read_csv("DS - Assignment Part 2 data set/amz_com-ecommerce_sample.csv", encoding= 'unicode_escape')
flipkart_products = pd.read_csv("DS - Assignment Part 2 data set/flipkart_com-ecommerce_sample.csv", encoding= 'unicode_escape')
# Creating new column to provide its website
amazon_products['website']='amazon'
flipkart_products['website']='flipkart'

***Data Info***

In [3]:
# Shows various statistics of features (columns)
amazon_products.describe()

Unnamed: 0,retail_price,discounted_price
count,20000.0,20000.0
mean,2957.09515,2364.59705
std,8993.993257,8994.62368
min,-20.0,0.0
25%,647.0,424.0
50%,999.0,663.0
75%,1986.0,1235.0
max,571223.0,726879.0


In [4]:
# Shows various statistics of features (columns)
flipkart_products.describe()

Unnamed: 0,retail_price,discounted_price
count,19922.0,19922.0
mean,2979.206104,1973.401767
std,9009.639341,7333.58604
min,35.0,35.0
25%,666.0,350.0
50%,1040.0,550.0
75%,1999.0,999.0
max,571230.0,571230.0


In [5]:
# Cross checking all columns, for findind NAN/null values to data cleaning 
amazon_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   uniq_id                  20000 non-null  object
 1   crawl_timestamp          20000 non-null  object
 2   product_url              20000 non-null  object
 3   product_name             20000 non-null  object
 4   product_category_tree    20000 non-null  object
 5   pid                      20000 non-null  object
 6   retail_price             20000 non-null  int64 
 7   discounted_price         20000 non-null  int64 
 8   image                    19997 non-null  object
 9   is_FK_Advantage_product  20000 non-null  bool  
 10  description              19998 non-null  object
 11  product_rating           20000 non-null  object
 12  overall_rating           20000 non-null  object
 13  brand                    14136 non-null  object
 14  product_specifications   19986 non-nul

In [6]:
# Cross checking all columns, for findind NAN/null values to data cleaning 
flipkart_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   uniq_id                  20000 non-null  object 
 1   crawl_timestamp          20000 non-null  object 
 2   product_url              20000 non-null  object 
 3   product_name             20000 non-null  object 
 4   product_category_tree    20000 non-null  object 
 5   pid                      20000 non-null  object 
 6   retail_price             19922 non-null  float64
 7   discounted_price         19922 non-null  float64
 8   image                    19997 non-null  object 
 9   is_FK_Advantage_product  20000 non-null  bool   
 10  description              19998 non-null  object 
 11  product_rating           20000 non-null  object 
 12  overall_rating           20000 non-null  object 
 13  brand                    14136 non-null  object 
 14  product_specifications

Below we are using product name as feature, which is non empty for all rows hence data cleaning is not required

***Creating variables***

In [7]:
amazon_products_names_list=[]

In [8]:
flipkart_products_names_list=[]

In [9]:
count=0
for i in amazon_products['product_name']:
    temp=[]
    temp.append(i)
    temp.append(count)
    count+=1
    amazon_products_names_list.append(temp)

In [10]:
count=0
for i in flipkart_products['product_name']:
    temp=[]
    temp.append(i)
    temp.append(count)
    count+=1
    flipkart_products_names_list.append(temp)

In [11]:
flipkart_products['product_name']

0            Alisha Solid Women's Cycling Shorts
1            FabHomeDecor Fabric Double Sofa Bed
2                                     AW Bellies
3            Alisha Solid Women's Cycling Shorts
4          Sicons All Purpose Arnica Dog Shampoo
                          ...                   
19995             WallDesign Small Vinyl Sticker
19996    Wallmantra Large Vinyl Stickers Sticker
19997    Elite Collection Medium Acrylic Sticker
19998    Elite Collection Medium Acrylic Sticker
19999    Elite Collection Medium Acrylic Sticker
Name: product_name, Length: 20000, dtype: object

In [12]:
products=amazon_products_names_list+flipkart_products_names_list

In [13]:
sentences=[]
for i in products:
    sentences.append(i[0])

***Using Transformers with hugging face library***

***Finding similairty with product name as feature***

In [14]:
# Importing model
from sentence_transformers import SentenceTransformer

In [15]:
# Loading model
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [16]:
# Encoding the sentences
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(40000, 768)

***Using Cosine similarity***

In [24]:
# Finding similairy using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Taking item name as input
input_item=input("Enter product name: ")

In [162]:
input_item="FDT Women's Leggings"

In [163]:
# Collecting all the instances of item in amazon dataset
amazon_dump=[]
for i in amazon_products_names_list:
    if input_item in i:
        amazon_dump.append(i[1])  

In [164]:
# Collecting all the instances of item in flipkart dataset
flipkart_dump=[]
for i in flipkart_products_names_list:
    if input_item in i:
        flipkart_dump.append(i[1])  

In [165]:
# Comparing prices
if(len(amazon_dump) == len(flipkart_dump) == 0):
    print("Item is not present in both the datsets")
else:
    if(len(amazon_dump)==0 and len(flipkart_dump)!=0):
        similaity_value_list=(cosine_similarity([sentence_embeddings[flipkart_dump[0]]],sentence_embeddings)).tolist()
        similaity_value_list=similaity_value_list[0]
        all_similar_amazon=[]
        count=0
        for i in similaity_value_list:
            # Only taking products which have similairty value greater than 0.9
            if i>0.9:
                if count>20000:
                    all_similar_amazon.append(count)
            count+=1
        #print(all_similar_amazon)
        print("Prices comparison between amazon and flipkart")
        for i in flipkart_dump:
            print("-----------------------------------------------")
            print(f"Product name in flipkart: {flipkart_products['product_name'][i]}")
            print(f"Retail price in flipkart: {flipkart_products['retail_price'][i]}")
            print(f"Discounted price in flipkart: {flipkart_products['discounted_price'][i]}")
        for i in all_similar_amazon:
            print("-----------------------------------------------")
            print(f"Product name in amazon: {amazon_products['product_name'][i-20000]}")
            print(f"Retail price in amazon: {amazon_products['retail_price'][i-20000]}")
            print(f"Discounted price in amzon: {amazon_products['discounted_price'][i-20000]}")
    else:
        similaity_value_list=(cosine_similarity([sentence_embeddings[amazon_dump[0]]],sentence_embeddings)).tolist()
        similaity_value_list=similaity_value_list[0]
        all_similar_flipkart=[]
        count=0
        for i in similaity_value_list:
            # Only taking products which have similairty value greater than 0.9
            if i>0.9:
                if count>20000:
                    all_similar_flipkart.append(count)
            count+=1
        #print(all_similar_flipkart)
        print("Prices comparison between amazon and flipkart")
        #print(amazon_dump)
        #print(all_similar_flipkart)
        for i in amazon_dump:
            print("-----------------------------------------------")
            print(f"Product name in amazon: {amazon_products['product_name'][i]}")
            print(f"Retail price in amazon: {amazon_products['retail_price'][i]}")
            print(f"Discounted price in amazon: {amazon_products['discounted_price'][i]}")
        for i in all_similar_flipkart:
            print("-----------------------------------------------")
            print(f"Product name in flipkart: {flipkart_products['product_name'][i-20000]}")
            print(f"Retail price in flipkart: {flipkart_products['retail_price'][i-20000]}")
            print(f"Discounted price in flipkart: {flipkart_products['discounted_price'][i-20000]}")
        
        

Prices comparison between amazon and flipkart
-----------------------------------------------
Product name in flipkart: FDT Women's Leggings
Retail price in flipkart: 699.0
Discounted price in flipkart: 309.0
-----------------------------------------------
Product name in amazon: FDT WOMEN'S Leggings Pants
Retail price in amazon: 698
Discounted price in amzon: 362
-----------------------------------------------
Product name in amazon: Addline Women's Leggings
Retail price in amazon: 1784
Discounted price in amzon: 740
-----------------------------------------------
Product name in amazon: Addline Women's Leggings
Retail price in amazon: 991
Discounted price in amzon: 474
-----------------------------------------------
Product name in amazon: Addline Women's Leggings
Retail price in amazon: 994
Discounted price in amzon: 415
-----------------------------------------------
Product name in amazon: Leebonee Women's Leggings
Retail price in amazon: 486
Discounted price in amzon: 423
-------

Here we can see that using the product name as a feature compares the product with the only item, not the brand.
This can be used when we compare similar product prices across all brands.

Note: I have done the comparison vertically, it can be done sideways too

***Using product description as feature***

Here we are using product category tree as feature, which no NAN rows, so we data cleaning is not required

In [40]:
amazon_products_desc_list=[]
flipkart_products_desc_list=[]

In [41]:
count=0
for i in amazon_products['product_category_tree']:
    temp=[]
    temp.append(i)
    temp.append(count)
    count+=1
    amazon_products_desc_list.append(temp)

In [42]:
count=0
for i in flipkart_products['product_category_tree']:
    temp=[]
    temp.append(i)
    temp.append(count)
    count+=1
    flipkart_products_desc_list.append(temp)

In [43]:
desc=amazon_products_desc_list+flipkart_products_desc_list

In [51]:
catg_treees=[]
for i in desc:
    catg_treees.append(i[0])

In [103]:
count=0
flip_desc_ind=[]
for i in flipkart_products['product_category_tree']:
    temp=[]
    temp.append(i)
    temp.append(count)
    temp.append(flipkart_products['product_name'][count])
    flip_desc_ind.append(temp)
    count+=1

In [104]:
count=0
amzn_desc_ind=[]
for i in amazon_products['product_category_tree']:
    temp=[]
    temp.append(i)
    temp.append(count)
    temp.append(amazon_products['product_name'][count])
    amzn_desc_ind.append(temp)
    count+=1

***Using the previous used model on product_category_tree feature***

In [46]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [53]:
# Encoding the sentences
catg_treees_embeddings = model.encode(catg_treees)
catg_treees_embeddings.shape

(40000, 768)

In [151]:
input_item_desc=''

In [152]:
for i in flip_desc_ind:
    if input_item in i:
        input_item_desc=i[0]

In [153]:
for i in  amzn_desc_ind:
    if input_item in i:
        input_item_desc=i[0]

In [154]:
# Collecting all the instances of item in amazon dataset
amazon_dump_2=[]
for i in amazon_products_desc_list:
    if input_item_desc in i:
        amazon_dump_2.append(i[1])  

In [155]:
# Collecting all the instances of item in flipkart dataset
flipkart_dump_2=[]
for i in flipkart_products_desc_list:
    if input_item_desc in i:
        amazon_dump_2.append(i[1])  

In [156]:
# Comparing prices
if(len(amazon_dump_2) == len(flipkart_dump_2) == 0):
    print("Item is not present in both the datsets")
else:
    if(len(amazon_dump_2)==0 and len(flipkart_dump_2_2)!=0):
        similaity_value_list=(cosine_similarity([catg_treees_embeddings[flipkart_dump_2[0]]],catg_treees_embeddings)).tolist()
        similaity_value_list=similaity_value_list[0]
        all_similar_amazon=[]
        count=0
        for i in similaity_value_list:
            # Only taking products which have similairty value greater than 0.9
            if i>0.9:
                if count>20000:
                    all_similar_amazon.append(count)
            count+=1
        #print(all_similar_amazon)
        print("Prices comparison between amazon and flipkart")
        for i in flipkart_dump_2:
            print("-----------------------------------------------")
            print(f"Product name in flipkart: {flipkart_products['product_name'][i]}")
            print(f"Retail price in flipkart: {flipkart_products['retail_price'][i]}")
            print(f"Discounted price in flipkart: {flipkart_products['discounted_price'][i]}")
        for i in all_similar_amazon:
            print("-----------------------------------------------")
            print(f"Product name in amazon: {amazon_products['product_name'][i-20000]}")
            print(f"Retail price in amazon: {amazon_products['retail_price'][i-20000]}")
            print(f"Discounted price in amzon: {amazon_products['discounted_price'][i-20000]}")
    else:
        similaity_value_list=(cosine_similarity([catg_treees_embeddings[amazon_dump_2[0]]],catg_treees_embeddings)).tolist()
        similaity_value_list=similaity_value_list[0]
        all_similar_flipkart=[]
        count=0
        for i in similaity_value_list:
            # Only taking products which have similairty value greater than 0.9
            if i>0.97:
                if count>20000:
                    all_similar_flipkart.append(count)
            count+=1
        #print(all_similar_flipkart)
        print("Prices comparison between amazon and flipkart")
        for i in amazon_dump_2:
            print("-----------------------------------------------")
            print(f"Product name in amazon: {amazon_products['product_name'][i]}")
            print(f"Retail price in amazon: {amazon_products['retail_price'][i]}")
            print(f"Discounted price in amazon: {amazon_products['discounted_price'][i]}")
        for i in all_similar_flipkart:
            print("-----------------------------------------------")
            print(f"Product name in flipkart: {flipkart_products['product_name'][i-20000]}")
            print(f"Retail price in flipkart: {flipkart_products['retail_price'][i-20000]}")
            print(f"Discounted price in flipkart: {flipkart_products['discounted_price'][i-20000]}")

Prices comparison between amazon and flipkart
-----------------------------------------------
Product name in amazon: FDT WOMEN'S Leggings Pants
Retail price in amazon: 698
Discounted price in amazon: 362
-----------------------------------------------
Product name in flipkart: FDT Women's Leggings
Retail price in flipkart: 699.0
Discounted price in flipkart: 309.0


***Comparison of models***

Instead of using transformer library from hugging face we can use Transformers amd pytorch, in which we do the background functions in above method such as creating word embeddingds, masking etc; . But despite doing all that the difference in similarity will be very less like 0.0001 (not brings huge change in results). 
So i have sticked with directly using library from hugging face

Even using hugging face we can load several modles.
The model i loaded above is 'bert-base-nli-mean-tokens', it maps sentences to 768 dimensional dense vector space
It is also one of the recently updated models
I have used other models which maps sentences to 384 dimensional dense vector space  such as 'multi-qa-MiniLM-L6-cos-v1', 'paraphrase-MiniLM-L12-v2'
Although these displays almost same result, but the similairty value between two sentences will be reduced may be due to reduce in vector space

I haven't included them as the body remains the same only loading of the model changes, and each loading takes around 30min.