# <center>Similarity-Based Approach for Product Name Matching</center>

## Introduction:

The key idea behind this approach is to establish a **mathematical similarity measure between strings** and match product names with the highest scores.

<img src="notebook_image/similarity_based.png" alt="similarity_based" style="width: 650px;"/>

We started by developing a Naïve model to test the effectiveness of using statistical string similarity as a fuzzy-matching approach. We then analyzed the performance of that initial model to build a more robust multi-step version.

## 0. Importing Libraries:

In [2]:
import numpy as np
import pandas as pd
from difflib import SequenceMatcher
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from scipy.sparse import lil_matrix
import sparse_dot_topn.sparse_dot_topn as ct
import time

import warnings
warnings.filterwarnings('ignore')

## 1. Importing Datasets:

In [3]:
# Unlabeled Products

instacart_products = pd.read_csv('../raw_data/third_party/instacart/products.csv')
grocery_products = pd.read_excel('../raw_data/third_party/grocery.com/Grocery_UPC_Database.xlsx')

In [4]:
# USDA Food Dataset

food = pd.read_csv('../raw_data/third_party/food.csv')

In [5]:
# NCR Dataset

ncr =  pd.read_csv('../raw_data/ncr/items_descriptions.csv')

In [6]:
# Master Catalog generated with the "Self-Verifying Clustering Method"

master_catalog =  pd.read_csv('../raw_data/model_outputs/master_catalog.csv')

## 2. Data Pre-processing

In [7]:
# Adding fdc_id columns to unlabeled datasets:

instacart_products['fdc_id'] = 0
grocery_products['fdc_id'] = 0

# Lower-casing all product names and removing non-alphanumeric characters besides periods:

instacart_products["product_name"] = instacart_products["product_name"].str.lower().replace(r'[^A-Za-z0-9. ]+', '', regex=True)
grocery_products["name"] = grocery_products["name"].str.lower().str.replace('/',' ').str.replace('-',' ').replace(r'[^A-Za-z0-9. ]+', '', regex=True)
food["description"] = food["description"].str.lower().str.replace('/',' ').str.replace('-',' ').replace(r'[^A-Za-z0-9. ]+', '', regex=True)
ncr["description"] = ncr["description"].str.lower().str.replace('/',' ').str.replace('-',' ').replace(r'[^A-Za-z0-9. ]+', '', regex=True)
master_catalog["item_name"] = master_catalog["item_name"].str.lower().str.replace('/',' ').str.replace('-',' ').replace(r'[^A-Za-z0-9. ]+', '', regex=True)
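The same cleaning rule can be written for a single string, which makes its effect easier to inspect. The sketch below is illustrative only: `normalize_name` is a hypothetical helper that does not appear in the notebook, and it follows the variant applied to the Grocery.com, USDA, NCR and Master Catalog names (slashes and hyphens become spaces first).

```python
import re

def normalize_name(name):
    """Hypothetical helper mirroring the column-wise cleaning above for one string."""
    # Lower-case, then turn slashes and hyphens into spaces
    name = name.lower().replace('/', ' ').replace('-', ' ')
    # Drop every character that is not a letter, digit, period or space
    return re.sub(r'[^A-Za-z0-9. ]+', '', name)

print(normalize_name("French's Yellow-Mustard 14.5oz"))  # frenchs yellow mustard 14.5oz
print(normalize_name("Empanada de Piña"))                # empanada de pia
```

Note that non-ASCII letters are removed rather than transliterated, so accented names lose characters under this rule.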

In [8]:
display(instacart_products.head())
display(grocery_products.head())
display(food.head())
display(ncr.head())
display(master_catalog.head())

| | product_id | product_name | aisle_id | department_id | fdc_id |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | chocolate sandwich cookies | 61 | 19 | 0 |
| 1 | 2 | allseasons salt | 104 | 13 | 0 |
| 2 | 3 | robust golden unsweetened oolong tea | 94 | 7 | 0 |
| 3 | 4 | smart ones classic favorites mini rigatoni wit... | 38 | 1 | 0 |
| 4 | 5 | green chile anytime sauce | 5 | 13 | 0 |


| | grp_id | upc14 | upc12 | brand | name | fdc_id |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 35200264013 | 35200264013 | Riceland | riceland american jazmine rice | 0 |
| 1 | 2 | 11111065925 | 11111065925 | Caress | caress velvet bliss ultra silkening beauty bar... | 0 |
| 2 | 3 | 23923330139 | 23923330139 | Earth's Best | earths best organic fruit yogurt smoothie mixe... | 0 |
| 3 | 4 | 208528800007 | 208528800007 | Boar's Head | boars head sliced white american cheese 120 ct | 0 |
| 4 | 5 | 759283100036 | 759283100036 | Back To Nature | back to nature gluten free white cheddar rice ... | 0 |


| | fdc_id | data_type | description | food_category_id | publication_date |
| --- | --- | --- | --- | --- | --- |
| 0 | 344604 | branded_food | tutturosso green 14.5oz. nsa italian diced tom... | | 2019-04-01 |
| 1 | 344605 | branded_food | tutturosso green 14.5oz. italian diced tomatoes | | 2019-04-01 |
| 2 | 344606 | branded_food | honeysuckle white fresh 97 ground white turkey | | 2019-04-01 |
| 3 | 344607 | branded_food | honeysuckle white 97 ground white turkey | | 2019-04-01 |
| 4 | 344608 | branded_food | honeysuckle whtie 85 ground turkey | | 2019-04-01 |


| | item_id | description | ecomm_description | category | item_type | upc |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | pan dulce sencillo | Mexican Sweet Bread/Pan Dulce Mexicano, 1 Count | 20101020 | 0 | 10 |
| 1 | 2 | bolillo french rolls | Bolillo, French Rolls, 1 Count | 20101210 | 0 | 20 |
| 2 | 3 | bolillo queso chile jalap | Jalapeño and Cheese Bolillo, 1 Count | 20101210 | 0 | 30 |
| 3 | 4 | empanada | | 20101020 | 0 | 40 |
| 4 | 5 | mini bolillo | BOLILLO SMALL, 2 OZ | 20101210 | 0 | 50 |


| | item_name |
| --- | --- |
| 0 | kelloggs poptarts wild cherry 14.1oz |
| 1 | kellogg poptarts frosted strawberry 22oz |
| 2 | kellogg poptarts frosted strawberry slurpee 3.... |
| 3 | sunshine cheezit crackers original 3oz |
| 4 | sunshine cheezit crackers hot spicy 7oz |


## 3. Naïve Model:

For the Naïve model, each new product name is compared to all product names in the Master Catalog based on the number of substring matches. The new product is then matched to the product in the Master Catalog that generates the highest similarity score.

In our experiments, we used the Grocery.com and NCR datasets to represent the new product names and the USDA dataset to represent the Master Catalog. We also applied some string pre-processing techniques to improve the quality of the match results.

<img src="notebook_image/naive.png" alt="naive" style="width: 450px;"/>

### 3.1 Similarity Method:

$\textbf{Implementation:}$ *SequenceMatcher* class from the *difflib* Python library.

$\textbf{Formula:}$

$$
score = \frac{2M}{T},
$$

where $M$ is the number of matching elements (summed over all matching substrings) and $T$ is the total number of elements in both strings.

In [8]:
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
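As a sanity check on the score, the ratio can be recomputed by hand from the matching blocks that *SequenceMatcher* reports. This is a minimal sketch: `ratio_by_hand` is a hypothetical helper introduced only for illustration, and the example pair is taken from the NCR results shown in this section.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    """Recompute the similarity score as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # M: total number of characters covered by the matching blocks
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both strings
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

a, b = "bolillo french rolls", "bolillo rolls"
print(ratio_by_hand(a, b))                  # 2 * 13 / 33
print(SequenceMatcher(None, a, b).ratio())  # same value from the library
```

Here the matching blocks are "bolillo " (8 characters) and "rolls" (5 characters), so M = 13 and T = 20 + 13 = 33.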

In [18]:
'''for i, product in grocery_products.iterrows():
    similarities = []
    for j, row in food.iterrows():
        similarities.append(similar(product['name'], str(row['description'])))
    max_index = np.argmax(similarities, axis=0)
    #grocery_products.iloc[i]['fdc_id'] = max_index
    print()
    print("Grocery.com Product Name: " + product['name'])
    print("USDA Product Name: " + food.iloc[max_index]['description'])
    print("Similarity Score: " + str(similarities[max_index]))
    print()'''

for i, product in ncr.iterrows():
    similarities = []
    for j, row in food.iterrows():
        similarities.append(similar(product['description'], str(row['description'])))
    max_index = np.argmax(similarities, axis=0)
    #grocery_products.iloc[i]['fdc_id'] = max_index
    print()
    print("NCR Product Name: " + product['description'])
    print("USDA Product Name: " + food.iloc[max_index]['description'])
    print("Similarity Score: " + str(similarities[max_index]))
    print()


NCR Product Name: pan dulce sencillo
USDA Product Name: pan dulce no topping
Similarity Score: 0.631578947368421


NCR Product Name: bolillo french rolls
USDA Product Name: bolillo rolls
Similarity Score: 0.7878787878787878


NCR Product Name: bolillo queso chile jalap
USDA Product Name: bouillon cubes chicken
Similarity Score: 0.6382978723404256


NCR Product Name: empanada
USDA Product Name: empanada
Similarity Score: 1.0


NCR Product Name: mini bolillo
USDA Product Name: mini rolls
Similarity Score: 0.7272727272727273


NCR Product Name: bolillo large
USDA Product Name: bolillo rolls
Similarity Score: 0.6923076923076923


NCR Product Name: storebrand  tortillas single
USDA Product Name: restaurant style tortilla triangles
Similarity Score: 0.6666666666666666



Based on the preliminary results, we empirically established some match quality thresholds. Scores over 0.75 were considered high-probability matches while scores under 0.6 were considered non-matches. We named the range of scores between ***0.6*** and ***0.75*** the ***"Danger Zone"*** because, while the confidence level for that range was lower, correct matches were still observed fairly often.

<img src="notebook_image/thresholds.png" alt="naive" style="width: 450px;"/>
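The thresholds can be applied with a small helper such as the one sketched below. `classify_score` and its labels are hypothetical names, and placing scores of exactly 0.6 or 0.75 in the Danger Zone is an assumption, since the text only specifies "over" and "under".

```python
def classify_score(score, low=0.6, high=0.75):
    """Hypothetical helper: map a similarity score to a match-quality label."""
    if score > high:
        return "high-probability match"
    if score < low:
        return "non-match"
    # Everything from low to high (inclusive, by assumption) is the Danger Zone
    return "danger zone"

print(classify_score(0.7878787878787878))  # high-probability match
print(classify_score(0.631578947368421))   # danger zone
print(classify_score(0.5))                 # non-match
```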

This model produced satisfactory preliminary results and had the advantage of being simple and highly interpretable. Additionally, matches outputted by the model could potentially be used as labeled training data for more advanced models. 

However, the Naïve Model also possessed some significant limitations: it was **extremely slow**, since each new product is compared to every item in the master catalog, and the **similarity measure was overly simplistic.**

## 4. Multi-Step Model:

This model improves on the Naïve approach by using a more sophisticated string similarity measure while reducing the overall computation time. It computes the cosine similarity between vectorized versions of the product names and gains speed from the fast matrix operations provided by numerical libraries like NumPy and SciPy. The three steps are outlined below:

<img src="notebook_image/multi-step.png" alt="multi-step" style="width: 900px;"/>
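To make the idea concrete on a toy scale, the sketch below computes TF-IDF vectors over character trigrams and compares them with the cosine similarity, using only the standard library. It is a simplified stand-in, not the notebook's pipeline: all function names are hypothetical, and the smoothed IDF with L2 normalization is an assumption chosen to resemble common TF-IDF defaults, so the scores will not necessarily equal those reported later.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(names, n=3):
    """One L2-normalized TF-IDF vector (dict: n-gram -> weight) per name."""
    counts = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names each n-gram occurs
    doc_freq = Counter(gram for count in counts for gram in count)
    n_docs = len(names)
    vectors = []
    for count in counts:
        weights = {
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0)
            for gram, tf in count.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())

names = ["empanada", "empanadas", "mini bolillo"]
vectors = tfidf_vectors(names)
print(round(cosine(vectors[0], vectors[1]), 3))  # high: most trigrams are shared
print(cosine(vectors[0], vectors[2]))            # 0.0: no trigram in common
```

Because the vectors are normalized to unit length, the cosine similarity reduces to a dot product, which is what allows the full comparison to be expressed as a single sparse matrix multiplication.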

### 4.1 Combining the Master Catalog with the NCR Dataset:

In [9]:
catalog = master_catalog["item_name"].dropna()
catalog = catalog.values.astype('U')
# dropna() returns a new Series, so its result has to be assigned
ncr_products = ncr["description"].dropna()
ncr_products = ncr_products.values.astype('U')

all_products = np.concatenate((ncr_products, catalog), axis=None)

# To run the experiments only with the NCR data set (without including the master catalog), uncomment the next line
#all_products = ncr_products.copy()

### 4.2 Splitting Product Names into N-grams: 

In [10]:
# Function to generate character n-grams; n defaults to 3 (trigrams)

def ngrams(string, n=3):

    # Strip commas, hyphens, periods, slashes and the token " BD" before slicing
    string = re.sub(r'[,\-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
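A quick usage example shows what the function returns. The block repeats the definition so that it runs on its own (same pattern, with the hyphen escaped); the sample inputs are chosen for illustration.

```python
import re

def ngrams(string, n=3):
    # Same cleaning and slicing as the notebook function
    string = re.sub(r'[,\-./]|\sBD', r'', string)
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

print(ngrams("empanada"))  # ['emp', 'mpa', 'pan', 'ana', 'nad', 'ada']
print(ngrams("14.5oz"))    # ['145', '45o', '5oz'] (the period is stripped first)
print(ngrams("ab"))        # [] (shorter than n, so no n-gram is produced)
```

Names shorter than `n` characters yield no n-grams at all, so they end up with an all-zero TF-IDF vector and cannot match anything.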

### 4.3 TF-IDF Vectorization: 

In [11]:
# Creating the TF-IDF Vectorizer object. If the analyzer parameter is not set, it will use words instead of n-grams

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)

# Generating the TF-IDF Matrix

t1 = time.time()
tf_idf_matrix = vectorizer.fit_transform(all_products)
t = time.time()-t1

print("TF-IDF Matrix Computation Time: " + str(round(t, 3)) + " seconds")
print("TF-IDF Matrix Shape:", tf_idf_matrix.shape)

TF-IDF Matrix Computation Time: 6.2 seconds
TF-IDF Matrix Shape: (275206, 20114)


### 4.4 Optimized Cosine Similarity: 

In [12]:
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data,indices,indptr),shape=(M,N))
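Conceptually, the function multiplies `A` by `B` but keeps, for each row, only the `ntop` largest products that clear `lower_bound`. The sketch below shows that selection rule on a single row of already-computed scores. `top_n_above_threshold` is a hypothetical name, the scores are made up, and treating the bound as strict (`>`) is an assumption about the library's behavior.

```python
import heapq

def top_n_above_threshold(scores, ntop, lower_bound=0.0):
    """Return the ntop largest (column, score) pairs with score > lower_bound."""
    candidates = [(col, s) for col, s in enumerate(scores) if s > lower_bound]
    # nlargest returns the pairs sorted from highest to lowest score
    return heapq.nlargest(ntop, candidates, key=lambda pair: pair[1])

row_scores = [1.0, 0.89, 0.10, 0.80, 0.79, 0.50]
print(top_n_above_threshold(row_scores, ntop=3, lower_bound=0.78))
# [(0, 1.0), (1, 0.89), (3, 0.8)]
```

Discarding everything else is what keeps the result sparse: at most `M * ntop` values are stored instead of the full `M x N` product.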

In [13]:
t1 = time.time()
similarity_score_matrix = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.78)
t = time.time()-t1

print("Similarity-Score Matrix Computation Time: " + str(round(t, 3)) + " seconds")
print("Similarity-Score Matrix Shape:", similarity_score_matrix.shape)

Similarity-Score Matrix Computation Time: 622.114 seconds
Similarity-Score Matrix Shape: (275206, 275206)


### 4.5 Generating the Product Name Match Dictionary

In [15]:
def get_matches_dict(sparse_matrix, name_vector, top):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
    # Visit only the first `top` stored entries if given, otherwise all of them
    nr_matches = top if top else sparse_matrix.size

    match_dict = {}
    
    # Iterating through the non-zero values of the Sparse Matrix
    for index in range(0, nr_matches):
        
        # Inserting the keys into the dictionary if they are new
        if name_vector[sparserows[index]] not in match_dict:
            match_dict[name_vector[sparserows[index]]] = [[], []]
        if name_vector[sparsecols[index]] not in match_dict:
            match_dict[name_vector[sparsecols[index]]] = [[], []]
            
        # Appending product name matches and similarity scores to the dictionary
        if sparse_matrix.data[index] != -1.0:
            if name_vector[sparsecols[index]] not in match_dict[name_vector[sparserows[index]]][0]:
                match_dict[name_vector[sparserows[index]]][0].append(name_vector[sparsecols[index]])
                match_dict[name_vector[sparserows[index]]][1].append(sparse_matrix.data[index])
                
            if name_vector[sparserows[index]] not in match_dict[name_vector[sparsecols[index]]][0]:
                match_dict[name_vector[sparsecols[index]]][0].append(name_vector[sparserows[index]])
                match_dict[name_vector[sparsecols[index]]][1].append(sparse_matrix.data[index])

    return match_dict

### <center>Visual representation of the matrix operations in the Multi-Step Model:</center>

<img src="notebook_image/matrices.png" alt="matrices" style="width: 800px;"/>

In [18]:
# Setting the diagonal elements to -1:
similarity_score_matrix.setdiag(-1)

# Generating the Product Match Dictionary:
matches_dict = get_matches_dict(similarity_score_matrix, all_products, None)

i = 0
for key in matches_dict:
    print("\"" + key + "\" matching product names: ")
    display(matches_dict[key])
    print()
    if i > 5:
        break
    i += 1

"pan dulce sencillo" matching product names: 


[[], []]


"bolillo french rolls" matching product names: 


[[], []]


"bolillo queso chile jalap" matching product names: 


[[], []]


"empanada" matching product names: 


[['empanadas',
  'empanada de pia empanada',
  'empanada de crema empanada',
  'empanada de manzana empanada',
  'empanada de queso empanada'],
 [0.8948590055129169,
  0.8516954277053159,
  0.8426104563823023,
  0.8174387749878556,
  0.8122090782207405]]


"empanadas" matching product names: 


[['empanada', 'empanadas box', 'cheese empanadas', 'baked empanadas'],
 [0.8948590055129169,
  0.8372629369851776,
  0.7959864337490576,
  0.7924383338105727]]


"empanada de pia empanada" matching product names: 


[['empanada',
  'empanada de pina',
  'empanada de crema empanada',
  'empanada de manzana empanada',
  'empanada de queso empanada'],
 [0.8516954277053159,
  0.8314628629978196,
  0.8900006309051466,
  0.8535773796216122,
  0.8198666682060988]]


"empanada de crema empanada" matching product names: 


[['empanada',
  'empanada de crema',
  'empanada de pia empanada',
  'empanada de manzana empanada',
  'empanada de queso empanada'],
 [0.8426104563823023,
  0.8907084020056475,
  0.8900006309051466,
  0.844472333658493,
  0.8111212118762282]]
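Each dictionary value is a pair of parallel lists, `[names, scores]`, and the scores are not necessarily sorted. A minimal sketch of how the best match could be read from such an entry follows; `best_match` is a hypothetical helper, and the sample entries are copied (and truncated) from the output above.

```python
def best_match(matches_dict, name):
    """Hypothetical helper: return the (match, score) pair with the highest score, or None."""
    names, scores = matches_dict.get(name, [[], []])
    if not names:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return names[best], scores[best]

# Entries copied from the output above, truncated to the first three matches
sample = {
    "pan dulce sencillo": [[], []],
    "empanada de pia empanada": [
        ["empanada", "empanada de pina", "empanada de crema empanada"],
        [0.8516954277053159, 0.8314628629978196, 0.8900006309051466],
    ],
}

print(best_match(sample, "empanada de pia empanada"))
# ('empanada de crema empanada', 0.8900006309051466)
print(best_match(sample, "pan dulce sencillo"))  # None
```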




### 4.6 Matching New Product Names

After combining the product names provided by NCR with the new Master Catalog generated from the Self-Verifying Clustering Method, the data set used to train the Multi-Step Model contained a total of 275,206 names. This caused the program to run considerably slower than when dealing exclusively with the NCR data.

We concluded that the optimized cosine similarity calculation is the main bottleneck, whereas the time needed to compute the TF-IDF matrix is not significantly affected by the size of the dataset. The running time of that function went from ~5 seconds to ~15 minutes after combining both data sets.

To solve this problem, we devised an approach for quickly obtaining matches for new product names once the Product Name Dictionary has been generated. If the new name is one of the keys in the dictionary, it returns the existing matches for that name. If not, it quickly re-computes the TF-IDF Matrix and calculates the cosine similarity exclusively for the new product name. It then returns the newly found matches and updates the Product Name Dictionary.

### <center>Visual representation of the approach for obtaining matches for a new product name:</center>

<img src="notebook_image/dict.png" alt="dict" style="width: 650px;"/>

In [64]:
def get_single_match_dict(sparse_matrix, name_vector, top, new_product):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
    # Visit only the first `top` stored entries if given, otherwise all of them
    nr_matches = top if top else sparse_matrix.size

    match_dict = {}
    
    match_dict[new_product] = [[], []]
    
    # Iterating through the non-zero values of the Sparse Matrix
    for index in range(0, nr_matches):
        if name_vector[sparsecols[index]] != new_product:
            match_dict[new_product][0].append(name_vector[sparsecols[index]])
            match_dict[new_product][1].append(sparse_matrix.data[index])

    return match_dict

The **insertNewProduct** function lets you choose between the quick version outlined above and re-computing the entire similarity matrix, by setting the *recompute_sim_matrix* parameter to False (the default) or True, respectively.

In [65]:
def insertNewProduct(new_product, 
                    vectorizer, 
                    product_names, 
                    matches_dict, 
                    top, 
                    threshold,
                    recompute_sim_matrix=False):
    new_product_original = new_product
    # Same cleaning as the pre-processing step; str.replace has no regex support, so re.sub is used
    new_product = re.sub(r'[^A-Za-z0-9. ]+', '', new_product_original.lower().replace('/',' ').replace('-',' '))
    
    if new_product in matches_dict:
        print("\"" + new_product_original + "\" matching product names: ")
        display(matches_dict[new_product])
        return matches_dict, product_names
        
    if recompute_sim_matrix:
        
        new_product_list = np.append(product_names,new_product)
        tf_idf_matrix = vectorizer.fit_transform(new_product_list)
        matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), top, threshold)
        matches.setdiag(-1)
        matches_dict = get_matches_dict(matches, new_product_list, None)
        
        print("\"" + new_product_original + "\" matching product names: ")
        display(matches_dict[new_product])
        
        return matches_dict, new_product_list
    
    else:
        new_product_list = np.append(product_names,new_product)
        new_tf_idf_matrix = vectorizer.fit_transform(new_product_list)

        new_matches = awesome_cossim_top(new_tf_idf_matrix[-1], new_tf_idf_matrix.transpose(), top, threshold)

        new_matches_dict = get_single_match_dict(new_matches, new_product_list, None, new_product)
        print("\"" + new_product_original + "\" matching product names: ")
        display(new_matches_dict[new_product])

        matches_dict[new_product] = new_matches_dict[new_product]
        # Record the new product under each of its matches, keeping the [names, scores] layout
        for match_name, score in zip(*new_matches_dict[new_product]):
            entry = matches_dict.setdefault(match_name, [[], []])
            entry[0].append(new_product)
            entry[1].append(score)
        
        return matches_dict, new_product_list
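The quick path stores the matches of the new name and also records the new name under each matched product, so that lookups work in both directions. A minimal sketch of that bookkeeping on a toy dictionary follows. `add_symmetric_matches` is a hypothetical helper, not part of the notebook, and the sample score is the one reported in the test run of this section.

```python
def add_symmetric_matches(matches_dict, new_name, match_names, match_scores):
    """Hypothetical helper: insert new_name and mirror it under each of its matches."""
    matches_dict[new_name] = [list(match_names), list(match_scores)]
    for name, score in zip(match_names, match_scores):
        # Each value keeps the [names, scores] layout of parallel lists
        entry = matches_dict.setdefault(name, [[], []])
        if new_name not in entry[0]:
            entry[0].append(new_name)
            entry[1].append(score)
    return matches_dict

toy_dict = {"empanada argentina de carne": [[], []]}
add_symmetric_matches(
    toy_dict,
    "empanada de carne",
    ["empanada argentina de carne"],
    [0.781471383192331],
)
print(toy_dict["empanada de carne"])
print(toy_dict["empanada argentina de carne"])
```

After the call, both names point at each other with the same score, which matches the symmetric structure produced by the full dictionary-building step.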

In [66]:
# Testing a new product name without recomputing the similarity score matrix:

print("Inserting new product name without recomputing the similarity score matrix")
print()
print()

t1 = time.time()
matches_dict, all_products = insertNewProduct('Empanada de Carne', vectorizer, all_products, matches_dict, 10, 0.78)
t = time.time()-t1

print("Elapsed Time: " + str(round(t, 3)) + " seconds")

print()
print()
print("Checking that same product name again:")
print()
print()

t1 = time.time()
matches_dict, all_products = insertNewProduct('Empanada de Carne', vectorizer, all_products, matches_dict, 10, 0.78)
t = time.time()-t1

print("Elapsed Time: " + str(round(t, 3)) + " seconds")

Inserting new product name without recomputing the similarity score matrix


"Empanada de Carne" matching product names: 


[['empanada argentina de carne'], [0.781471383192331]]

Elapsed Time: 6.656 seconds


Checking that same product name again:


"Empanada de Carne" matching product names: 


[['empanada argentina de carne'], [0.781471383192331]]

Elapsed Time: 0.002 seconds


In [67]:
# Testing a new product name recomputing the similarity score matrix:
print("Inserting new product name recomputing the similarity score matrix")
print()
print()

t1 = time.time()
matches_dict, all_products = insertNewProduct('Frenchs Yellow Mustard', 
                                               vectorizer, 
                                               all_products, 
                                               matches_dict, 
                                               10, 0.78,
                                               recompute_sim_matrix=True)
t = time.time()-t1

print("Elapsed Time: "+ str(round(t, 3)) + " seconds")

print()
print()
print("Checking that same product name again: ")
print()
print()

t1 = time.time()
matches_dict, all_products = insertNewProduct('Frenchs Yellow Mustard', 
                                               vectorizer, 
                                               all_products, 
                                               matches_dict, 
                                               10, 0.78,
                                               recompute_sim_matrix=True)
t = time.time()-t1

print("Elapsed Time: " + str(round(t, 3)) + " seconds")

Inserting new product name recomputing the similarity score matrix


"Frenchs Yellow Mustard" matching product names: 


[['frenchs classic yellow mustard'], [0.7899182426290459]]

Elapsed Time: 680.05 seconds


Checking that same product name again: 


"Frenchs Yellow Mustard" matching product names: 


[['frenchs classic yellow mustard'], [0.7899182426290459]]

Elapsed Time: 0.001 seconds
