# Surprisingly Effective Way To Name Matching In Python



![Image](headerroomtype.png)

These are the same room types recorded in different forms, i.e., we have to deal with different versions of the same name.


**In this workbook, we will match the different versions of a room type name so that we can create a master record for further analysis.**


**Get the Dataset** : [Room Types Dataset](https://github.com/maladeep/Name-Matching-In-Python/blob/master/room_type.csv)

**Detailed Explanation** : [Surprisingly Effective Way To Name Matching In Python](https://medium.com/@maladeep.upadhaya/surprisingly-effective-way-to-name-matching-in-python-1a67328e670e)

In [14]:
#  Importing libraries and modules, plus some settings for the notebook

import pandas as pd 
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct  #Cosine Similarity
import time
# Show full column contents (None replaces the deprecated -1 in pandas >= 1.0)
pd.set_option('display.max_colwidth', None)


# reading dataset as df

df =  pd.read_csv('room_type.csv')

# printing first five rows

df.head(5)

| | RoomTypes |
|---|---|
| 0 | Deluxe Room, 1 King Bed |
| 1 | Standard Room, 1 King Bed, Accessible |
| 2 | Grand Corner King Room, 1 King Bed |
| 3 | Suite, 1 King Bed (Parlor) |
| 4 | High-Floor Premium Room, 1 King Bed |


## ngrams 

In [15]:
#  ngrams: here we take n = 3, i.e. character 3-grams (trigrams), as most room types only contain two or three words.
#  The function first cleans the string by removing some punctuation (comma, hyphen, dot, slash) and the token " BD",
#  then generates and collects all n-grams of the string.


def ngrams(string, n=3):
    # Inside the character class, ",-." is a range that covers ',', '-' and '.'
    string = re.sub(r'[,-./]|\sBD', r'', string)
    # Zip n shifted copies of the string to slide a window of width n over it
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]





# Testing that ngrams works as expected

print('All 3-grams in "Deluxroom":')
ngrams('Deluxroom')

All 3-grams in "Deluxroom":


['Del', 'elu', 'lux', 'uxr', 'xro', 'roo', 'oom']
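
To see what the cleaning step buys us, here is a small sketch that repeats the `ngrams` function from the cell above so it runs on its own. The sample strings are made up for illustration and are not taken from the dataset. It shows that punctuation differences disappear before the trigrams are built, and that a string shorter than `n` yields no n-grams at all.

```python
import re

def ngrams(string, n=3):
    # Same cleaning as in the cell above: drop , - . / and the token " BD"
    string = re.sub(r'[,-./]|\sBD', r'', string)
    return [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]

# Made-up variants of one room type: with and without the comma
with_comma = ngrams('Deluxe Room, 1 King Bed')
without_comma = ngrams('Deluxe Room 1 King Bed')
print(with_comma == without_comma)                # True

# The " BD" token is stripped as well
print(ngrams('Queen BD') == ngrams('Queen'))      # True

# A string shorter than n has no n-grams
print(ngrams('ab'))                               # []
```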

## TF-IDF and Vectorization

In [16]:
# Each room type is split into tokens (here, the n-grams generated above) and turned into a vector.
# Scikit-learn's TfidfVectorizer does exactly this: it converts a collection of raw documents to a matrix of TF-IDF features.
# Generate the matrix of TF-IDF (Term Frequency-Inverse Document Frequency) values for each room type.

room_types = df['RoomTypes']
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(room_types)
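
One property of this step is worth checking, because the cosine similarity section relies on it: with its default settings (`norm='l2'`), `TfidfVectorizer` scales every row to unit length, so a plain dot product between two rows already equals their cosine similarity. The sketch below verifies this on a made-up list, `toy_names`, which is an assumption for illustration and not part of the dataset.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=3):
    # Same trigram analyzer as in the cell above
    string = re.sub(r'[,-./]|\sBD', r'', string)
    return [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]

# Hypothetical sample, not taken from room_type.csv
toy_names = ['Deluxe Room, 1 King Bed', 'Deluxe Room', 'Suite, 1 King Bed']

toy_vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
toy_matrix = toy_vectorizer.fit_transform(toy_names)

# One row per name, one column per distinct trigram
print(toy_matrix.shape)

# Euclidean length of every row; the default norm='l2' makes each one 1.0
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(np.allclose(row_norms, 1.0))
```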


In [17]:
# View sparse CSR matrix.
print(tf_idf_matrix[0])


  (0, 113)	0.1605304445953455
  (0, 29)	0.16141635519747793
  (0, 318)	0.22675894223265677
  (0, 452)	0.18208940066082585
  (0, 371)	0.17370536296468006
  (0, 140)	0.17886291412226296
  (0, 46)	0.18887063512800203
  (0, 99)	0.2517854902990057
  (0, 24)	0.23022653423352535
  (0, 412)	0.2609327210548879
  (0, 473)	0.15046703253147248
  (0, 480)	0.13094460588705156
  (0, 168)	0.1363445719787321
  (0, 62)	0.16141635519747793
  (0, 279)	0.28518161479728127
  (0, 612)	0.2882757113194961
  (0, 596)	0.2882757113194961
  (0, 409)	0.2882757113194961
  (0, 297)	0.2882757113194961
  (0, 120)	0.2882757113194961


## Optimized Cosine Similarity  by  sparse_dot_topn 
####  Created by [ING Wholesale Banking Advanced Analytics team](https://medium.com/wbaa/what-does-ing-wb-advanced-analytics-do-707a09175530)

In [28]:
# To calculate the similarity between two vectors of TF-IDF values, the cosine similarity is usually used.
# The matrices involved are very sparse, and Scikit-learn deals with this nicely by returning a sparse CSR matrix.

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data,indices,indptr),shape=(M,N))
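
For intuition, here is a dense NumPy sketch of what this function is meant to compute: multiply the two matrices, then keep only the `ntop` largest entries per row that exceed `lower_bound`. The name `dense_cossim_top` and the toy vectors are hypothetical. It is assumed here that the library keeps only scores strictly greater than the bound. This version is only practical for small inputs, since it materialises the full similarity matrix, which is exactly what `sparse_dot_topn` avoids.

```python
import numpy as np

def dense_cossim_top(A, B, ntop, lower_bound=0.0):
    """Dense reference: per row of A @ B, keep the ntop largest scores above lower_bound."""
    scores = A @ B
    result = np.zeros_like(scores)
    for row in range(scores.shape[0]):
        # Column indices of the ntop largest scores in this row
        top_cols = np.argsort(scores[row])[::-1][:ntop]
        for col in top_cols:
            # Assumption: strictly greater than the bound
            if scores[row, col] > lower_bound:
                result[row, col] = scores[row, col]
    return result

# Hypothetical L2-normalised vectors, so dot products are cosine similarities
vectors = np.array([[1.0, 0.0],
                    [0.6, 0.8],
                    [0.0, 1.0]])

top = dense_cossim_top(vectors, vectors.T, ntop=2, lower_bound=0.5)
print(top)
```

In the middle row, the score of 0.6 against the first vector is dropped because only the two best scores per row are kept.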


In [29]:
#  Run the optimized cosine similarity function. 
#  Only stores the top 10 most similar items with a similarity above 0.8

t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.8)
t = time.time()-t1
print("SELFTIMED:", t)

SELFTIMED: 0.0019731521606445312


In [25]:
# unpacks the resulting sparse matrix

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
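    # Cap at the number of stored matches so a large `top` cannot cause an IndexError
    if top:
        nr_matches = min(top, sparsecols.size)
    else:
        nr_matches = sparsecols.size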
    
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    
    return pd.DataFrame({'left_side': left_side,
                          'right_side': right_side,
                           'similairity': similairity})

In [26]:
# store the matches in a new dataframe called matches_df and print 10 random samples

matches_df = get_matches_df(matches, room_types, top=200)
matches_df = matches_df[matches_df['similairity'] < 0.99999] # For removing all exact matches
matches_df.sample(10)

| | left_side | right_side | similairity |
|---|---|---|---|
| 121 | Room, Ocean View | Room, Ocean View | 0.946453 |
| 6 | Grand Corner King Room, 1 King Bed | Grand Corner King Room | 0.933038 |
| 105 | Premium Room, 1 Queen Bed | Premium Room, 2 Queen Beds | 0.816652 |
| 151 | Junior Suite, 1 King Bed, Accessible (Roll-in Shower) | Junior Suite - Accessible Roll-in Shower | 0.807198 |
| 57 | Luxury Room, 1 Queen Bed, Non Smoking | Luxury Room, 1 King Bed, Non Smoking | 0.864465 |
| 36 | Suite, 1 Bedroom | Deluxe Suite, 1 Bedroom | 0.813644 |
| 184 | Premium Two Queen | Club Premium Two Queen | 0.84057 |
| 40 | Club Room, City View (Club Lounge Access for 2 guests) | Club Room, Lake View (Club Lounge Access for 2 guests) | 0.875506 |
| 59 | Luxury Room, 1 King Bed, Non Smoking | Luxury Room, 1 Queen Bed, Non Smoking | 0.864465 |
| 25 | Signature Room, 1 King Bed | Signature King | 0.829876 |


The matches look pretty satisfying!

The cosine similarity gives a good indication of the similarity between the two room types.

A high score does not guarantee a true match, though: **Deluxe Suite, 1 Bedroom** and **Suite, 1 Bedroom** are probably not the same room type, yet we got a **similarity measure of 0.81.**

In [22]:
# printing the matches sorted by similarity, highest first

matches_df.sort_values(['similairity'], ascending=False).head(10)

| | left_side | right_side | similairity |
|---|---|---|---|
| 121 | Room, Ocean View | Room, Ocean View | 0.946453 |
| 138 | Room, Ocean View | Room, Ocean View | 0.946453 |
| 6 | Grand Corner King Room, 1 King Bed | Grand Corner King Room | 0.933038 |
| 158 | Grand Corner King Room | Grand Corner King Room, 1 King Bed | 0.933038 |
| 167 | King Room - Disability Access | Queen Room - Disability Access | 0.919632 |
| 168 | King Room - Disability Access | Queen Room - Disability Access | 0.919632 |
| 135 | Standard Room, Ocean View (Waikiki Tower) - No Resort Fee | Standard Room, Partial Ocean View (Waikiki Tower) - No Resort Fee | 0.90516 |
| 133 | Standard Room, Partial Ocean View (Waikiki Tower) - No Resort Fee | Standard Room, Ocean View (Waikiki Tower) - No Resort Fee | 0.90516 |
| 175 | Two Double Beds - Location Room (19th to 25th Floors) | King Bed - Location Room (19th to 25th Floors) | 0.875607 |
| 177 | King Bed - Location Room (19th to 25th Floors) | Two Double Beds - Location Room (19th to 25th Floors) | 0.875607 |
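
In the sorted output most pairs appear twice, once in each direction, because every name is compared against every other name. If only one row per pair is wanted, one possible approach is to build an order-independent key and drop repeats. The sketch below uses a hypothetical miniature of `matches_df` with the same column names; `toy_matches`, `pair_key` and `unique_matches` are names made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of matches_df, with the same column names as above
toy_matches = pd.DataFrame({
    'left_side': ['A Room', 'B Room', 'C Suite'],
    'right_side': ['B Room', 'A Room', 'D Suite'],
    'similairity': [0.93, 0.93, 0.85],
})

# Sort each pair so that (x, y) and (y, x) produce the same key
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(toy_matches['left_side'],
                                         toy_matches['right_side'])],
    index=toy_matches.index,
)

# Keep only the first occurrence of every unordered pair
unique_matches = toy_matches[~pair_key.duplicated()]
print(unique_matches)
```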


### So, on visual assessment, the matches made with this strategy are very satisfying.

Using **ngrams** with **TF-IDF** and **cosine similarity ([sparse_dot_topn library](https://github.com/ing-bank/sparse_dot_topn))**, we can **speed up the string matching process even for a large dataset** (for our case: 572000*12).

**For a detailed understanding**: [Surprisingly Effective Way To Name Matching In Python](https://medium.com/@maladeep.upadhaya/surprisingly-effective-way-to-name-matching-in-python-1a67328e670e)




Do not forget to clap for the [Medium article](https://medium.com/@maladeep.upadhaya/surprisingly-effective-way-to-name-matching-in-python-1a67328e670e), and if you have any questions, ping me on [LinkedIn](https://www.linkedin.com/in/maladeep/).

# Thank you for reading!
