The problem with Fuzzy Matching on large data

Many algorithms can provide fuzzy matching, but they quickly fall down on even modest data sets of more than a few thousand records.
The reason is that they compare each record with every other record in the data set. In computer science this is known as quadratic time, and it quickly becomes a barrier when dealing with larger data sets.
A relatively small data set of 10k records would require on the order of 100m comparisons (10,000 × 10,000).
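
To make the growth concrete, here is a small sketch that counts the comparisons an all-against-all approach needs as the number of records grows. The helper name `pairwise_comparisons` is made up for this illustration.

```python
def pairwise_comparisons(n_records):
    """Comparisons needed when every record is scored against every record."""
    return n_records * n_records

# Each tenfold increase in records means a hundredfold increase in work.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} records -> {pairwise_comparisons(n):>14,} comparisons")
```

Skipping self-comparisons and mirrored pairs roughly halves the count, to n(n-1)/2, but the growth is still quadratic.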

How a well-known NLP algorithm can help solve the issue  
The solution to this problem comes from a well-known NLP algorithm: term frequency, inverse document frequency (tf-idf), which has been used in language problems since 1972.  
It is a simple algorithm that splits text into ‘chunks’ (or ngrams), counts the occurrences of each chunk in a given sample, and then weights each count by how rare the chunk is across all the samples in the data set. As a result, distinctive words stand out from the ‘noise’ of the more common words that occur within text.
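
The weighting described above can be sketched in a few lines of plain Python. This uses the textbook form, tf × log(N / df), on a made-up toy corpus. The function name `tf_idf` and the sample data are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (invented for this example): each sample is already split into chunks.
samples = [
    ["department", "for", "transport"],
    ["department", "for", "education"],
    ["department", "for", "work", "and", "pensions"],
]

def tf_idf(sample, corpus):
    """Return {chunk: weight} using tf * log(N / df), the textbook form."""
    n_samples = len(corpus)
    counts = Counter(sample)  # term frequency within this sample
    weights = {}
    for chunk, tf in counts.items():
        df = sum(1 for other in corpus if chunk in other)  # samples containing the chunk
        weights[chunk] = tf * math.log(n_samples / df)
    return weights

weights = tf_idf(samples[0], samples)
print(weights)
```

Chunks that appear in every sample ("department", "for") receive a weight of zero, while the chunk unique to the first sample ("transport") receives the largest weight.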
Whilst these chunks are normally whole words, there is no reason why the same technique cannot be applied to sets of characters within words. For example, we could split each word into 3-character ngrams. For the word ‘Department’, padded with a space on each side, this would output: ' De', 'Dep', 'epa', 'par', 'art', 'rtm', 'tme', 'men', 'ent', 'nt '
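
As a minimal sketch of the character-level split, the snippet below pads a word with one space on each side and slides a window of `n` characters across it. The name `char_ngrams` is hypothetical. The notebook's own `ngrams` function, defined later, also cleans the text before splitting.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping n-character chunks, padded with spaces."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("Department"))
```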


https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

https://colab.research.google.com/drive/1qhBwDRitrgapNhyaHGxCW8uKK5SWJblW
    

https://bergvca.github.io/2017/10/14/super-fast-string-matching.html

Data for this case obtained from:

https://www.gov.uk/contracts-finder

In [1]:
!pip install sparse_dot_topn 



In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
import sparse_dot_topn.sparse_dot_topn as ct
import numpy as np
from scipy.sparse import csr_matrix
import pandas as pd
import re


In [3]:
df=pd.read_csv("./test_data/notices.csv")
df.shape

(933, 43)

In [4]:
df.head()

Unnamed: 0,Notice Identifier,Notice Type,Organisation Name,Status,Published Date,Title,Description,Nationwide,Postcode,Region,...,Value High,Awarded Date,Awarded Value,Supplier [Name|Address|Ref type|Ref Number|Is SME|Is VCSE],Supplier's contact name,Contract start date,Contract end date,OJEU Procedure Type,Accelerated Justification,Closing Time
0,FSCS SS 031,Contract,FINANCIAL SERVICES COMPENSATION SCHEME LIMITED,Awarded,2020-09-11T19:37:46Z,Real-time GBR address verification,Data capture solution that offers real-time GB...,,,United Kingdom,...,,28/08/2020,81933.0,[Experian Limited|Sir John Peace Building Expe...,,01/09/2020,31/08/2021,SingleTenderActionNonOJEU,,00:00
1,tender_248561/886898,Contract,Public Health England,Awarded,2020-09-11T17:17:55Z,Purchase of three centrifuges for Pillar 3 Cov...,Contract has been awarded to ThermoFisher Scie...,,,Any region,...,,24/08/2020,28285.0,[ThermoFisher Scientific|Bishop Meadow Road |L...,,25/08/2020,24/08/2021,SingleTenderActionNonOJEU,,12:00
2,CCLL20A14.,Contract,Department for Transport : Department for Tran...,Awarded,2020-09-11T15:26:30Z,Provision of Legal Advisers for South Eastern ...,The Department for Transport invites proposals...,,,"United Kingdom,Isle of Man,Channel Islands",...,,13/08/2020,350000.0,[Eversheds Sutherland (International) LLP|One ...,,24/08/2020,23/08/2022,CallOffFromFrameworkAgreement,,11:00
3,GLOSCC001-DN489914-49558633,Contract,Gloucestershire County Council,Awarded,2020-09-11T15:08:53Z,C336AB - Churchdown School to Sandhurst,Home to school transport.\r\nPassenger transpo...,,,South West,...,63745.0,18/08/2020,63745.0,[FIRST ASSOCIATED TAXIS|GL1 2EZ|NONE||Yes|No],Mr Rashid Khan,04/09/2020,31/07/2023,Other,,23:59
4,20-36,Contract,UNIVERSITY OF WOLVERHAMPTON ENTERPRISE LIMITED,Awarded,2020-09-11T14:27:41Z,20-36 E-Textbook access,The University of Wolverhampton has awarded a ...,,,West Midlands,...,,11/08/2020,300000.0,"[KORTEXT LIMITED|26-32 Oxford Road,Suite B, 6t...",,11/08/2020,10/08/2021,CallOffFromFrameworkAgreement,,09:00


In [None]:
#!pip install ftfy # amazing text cleaning for decode issues.. TO INVESTIGATE
#from ftfy import fix_text

The ngram function  
The function below both cleans the text data and splits it into ngrams.

In [5]:
def ngrams(string, n=3):
    #string = fix_text(string) # fix text encoding issues
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower() #make lower case
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string) #remove the list of chars defined above
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single space
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,\-./]|\sBD', '', string) # strip any leftover punctuation (and the token ' BD')
    char_ngrams = zip(*[string[i:] for i in range(n)]) # sliding window of n characters
    return [''.join(ngram) for ngram in char_ngrams]

A useful feature of the tf-idf implementation in scikit-learn is that it accepts a custom analyzer function. We can therefore plug in the function created above and build the matrix in just a few lines of code:

In [7]:
target_field = df['Organisation Name'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(target_field)


Now we are going to find similar names using cosine similarity  
While you could use the cosine similarity function from scikit-learn here, it is not the most efficient way of finding close matches, because it returns a similarity score for every item in the data set for each sample. Instead, we are going to use a faster implementation that keeps only the top matches for each sample, which can be found here:
https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
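
To show what is being kept, here is a slow, pure-Python reference sketch of "top-n cosine similarity above a threshold". It is not the library's implementation and is still quadratic; it only illustrates the result. Sparse vectors are represented as dicts, and the helper names (`normalise`, `dot`, `top_n_matches`) and toy weights are assumptions. The parameter names `ntop` and `lower_bound` mirror those used in the notebook's function.

```python
import heapq
import math

def normalise(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors; for unit vectors this is the cosine similarity."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)

def top_n_matches(rows, ntop, lower_bound):
    """For each row, keep at most ntop (score, column) pairs scoring above lower_bound."""
    result = []
    for a in rows:
        scored = ((dot(a, b), j) for j, b in enumerate(rows))
        kept = [(s, j) for s, j in scored if s > lower_bound]
        result.append(heapq.nlargest(ntop, kept))  # highest scores first
    return result

# Toy ngram weights, invented for this example.
rows = [normalise(v) for v in (
    {"dep": 1.0, "tra": 2.0},
    {"dep": 1.0, "tra": 2.0, "the": 0.5},
    {"edu": 3.0},
)]
for i, row_matches in enumerate(top_n_matches(rows, ntop=2, lower_bound=0.85)):
    print(i, [(j, round(s, 3)) for s, j in row_matches])
```

Every row matches itself with a score of about 1, which is why the notebook later filters out exact matches.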

In [8]:
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # Make sure A and B are CSR matrices.
    # If they already are, this adds no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)
    return csr_matrix((data,indices,indptr),shape=(M,N))

In [17]:
def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
    if top:
        nr_matches = min(top, sparsecols.size) # never read past the available matches
    else:
        nr_matches = sparsecols.size
    
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)
    
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]
    
    return pd.DataFrame({'left_side': left_side,
                          'right_side': right_side,
                           'similarity': similarity})

In [18]:
import time
t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.85)
t = time.time()-t1
print("SELFTIMED:", t)

SELFTIMED: 0.0022437572479248047


In [24]:
matches_df = get_matches_df(matches, target_field, top=100)
matches_df = matches_df[matches_df['similarity'] < 0.99999] # Remove all exact matches
matches_df.shape

(7, 3)

In [23]:
matches_df.head()

Unnamed: 0,left_side,right_side,similarity
3,Department for Transport : Department for Tran...,Department For Transport,0.97424
4,Department for Transport : Department for Tran...,THE DEPARTMENT FOR TRANSPORT,0.898558
9,"MINISTRY OF HOUSING, COMMUNITIES & LOCAL GOVER...","Ministry of Housing, Communities & Local Gover...",0.989177
30,Business Energy and Industrial Strategy,Department for Business Energy & Industrial St...,0.890011
34,THE DEPARTMENT FOR TRANSPORT,Department For Transport,0.922317


## Record linkage and a different approach

If we want to use this technique to match against another data source, we can reuse most of our code. The section below shows how this is done and also uses the K Nearest Neighbours algorithm as an alternative closeness measure.
The data set we would like to join on is a set of ‘clean’ organisation names created by the Office for National Statistics (ONS):

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import time

# Clean reference names to match against
clean_org_names = pd.read_excel('Gov Orgs ONS.xlsx')
clean_org_names = clean_org_names.iloc[:, 0:6]
org_name_clean = clean_org_names['Institutions'].unique()

print('Vectorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(org_name_clean)
print('Vectorizing completed...')

nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

# NOTE: `names` is not defined anywhere in this notebook; it is assumed to be the
# DataFrame of messy records, loaded beforehand, with the names in `org_column`.
org_column = 'buyer' # column to match against in the messy data
unique_org = list(set(names[org_column].values)) # de-duplicate, then fix the order as a list

### matching query:
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_org)
t = time.time()-t1
print("COMPLETED IN:", t)

print('finding matches...')
matches = []
for i, j in enumerate(indices):
    # indices refer to rows of the fitted matrix, i.e. positions in org_name_clean
    temp = [round(distances[i][0], 2), org_name_clean[j[0]], unique_org[i]]
    matches.append(temp)

print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched name', 'Original name'])
print('Done')
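
The 'Match confidence (lower is better)' column holds the distance returned by `kneighbors`. Assuming the defaults are in effect (`TfidfVectorizer` L2-normalises each row and `NearestNeighbors` uses Euclidean distance), a distance d and a cosine similarity s are related by d = sqrt(2 - 2s). The sketch below converts between the two so that results can be compared with the cosine scores from the first half of the notebook. The function names are made up for this illustration.

```python
import math

def distance_to_cosine(distance):
    """Cosine similarity implied by a Euclidean distance between unit vectors."""
    return 1.0 - (distance ** 2) / 2.0

def cosine_to_distance(similarity):
    """Euclidean distance between unit vectors with the given cosine similarity."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * similarity))

for d in (0.0, 0.5, 1.0, math.sqrt(2)):
    print(f"distance {d:.2f} -> cosine similarity {distance_to_cosine(d):.3f}")
```

Under these assumptions, the 0.85 similarity threshold used earlier corresponds to a distance of roughly 0.55, which could serve as a starting point for a cut-off on the nearest-neighbour results.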