![deduplication](https://www.teimouri.net/wp-content/uploads/2019/10/deduplication.png)

## Fast fuzzy matching for large datasets

Deduplication is a common and necessary task for a lot of datasets, and often requires fuzzy matching between strings to help identify the duplicate records. 
<br><br>
A well-known package for this is the [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/) Python package, which uses [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to calculate the difference between two strings. Unfortunately, this method requires comparing each string to every other string in the dataset, which takes quadratic time, O(n<sup>2</sup>). For a large dataset this quickly leads to very long runtimes.
<br><br>
This notebook explores an alternative approach, using methods from Natural Language Processing and matrix multiplication to speed up the problem by over **6000x**.
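
To make the quadratic cost concrete, the sketch below pairs a minimal pure-Python Levenshtein distance with a count of the distinct pairs an all-against-all comparison has to visit. It is an illustration only: fuzzywuzzy's scorers normalise and weight the raw distance differently, and the function names here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n strings, which grows as O(n^2)."""
    return n * (n - 1) // 2


print(levenshtein("ROYAL BANK", "ROYAL BNK"))
print(pairwise_comparisons(50_000))
```

Each call to `levenshtein` is itself proportional to the product of the two string lengths, so the total work grows quickly with both the number of names and their length.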

### Importing packages

In [1]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

### Getting the data 

Company data is notorious for containing duplicates. With organisations creating multiple official registrations, or even just users creating new records, CRMs and other databases can soon be flooded with many versions of the same company.
<br><br>
To use as example data, a list of companies registered with Companies House can be downloaded from [their website](http://download.companieshouse.gov.uk/en_output.html).

In [2]:
df = pd.read_csv('./BasicCompanyDataAsOneFile-2021-01-01.csv')

In [3]:
df.sample(n=5)

Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE,PreviousName_1.CompanyName,PreviousName_2.CONDATE,PreviousName_2.CompanyName,PreviousName_3.CONDATE,PreviousName_3.CompanyName,PreviousName_4.CONDATE,PreviousName_4.CompanyName,PreviousName_5.CONDATE,PreviousName_5.CompanyName,PreviousName_6.CONDATE,PreviousName_6.CompanyName,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,ConfStmtNextDueDate,ConfStmtLastMadeUpDate
4375058,THIRD WAY CONSULTING LIMITED,11487238,,,23A FORE STREET,,HERTFORD,,UNITED KINGDOM,SG14 1DJ,Private Limited Company,Active,United Kingdom,,27/07/2018,31.0,7.0,30/04/2021,31/07/2019,,24/08/2019,,0,0,0,0,70229 - Management consultancy activities othe...,,,,0,0,http://business.data.gov.uk/id/company/11487238,,,,,,,,,,,,,,,,,,,,,14/04/2021,31/03/2020
4270456,THE CEDAR FINEST MEATS & DELI LIMITED,10626054,,,THE CEDAR,31 MALVERN ROAD,LONDON,,ENGLAND,NW6 5PS,Private Limited Company,Active,United Kingdom,,17/02/2017,28.0,2.0,28/02/2021,28/02/2019,,17/03/2018,,0,0,0,0,47110 - Retail sale in non-specialised stores ...,,,,0,0,http://business.data.gov.uk/id/company/10626054,,,,,,,,,,,,,,,,,,,,,03/04/2021,20/02/2020
3858539,SF PROFESSIONAL SOLUTIONS LIMITED,12451432,,,11 11 THE MOAT HOUSE,COMMONS ROAD,PEMBROKE,PEMBROKESHIRE,UNITED KINGDOM,SA71 4EA,Private Limited Company,Active,United Kingdom,,10/02/2020,29.0,2.0,10/11/2021,,NO ACCOUNTS FILED,10/03/2021,,0,0,0,0,82990 - Other business support service activit...,,,,0,0,http://business.data.gov.uk/id/company/12451432,,,,,,,,,,,,,,,,,,,,,23/03/2021,
2506897,LAUNCELOT PARTNERS II NOMINEES LIMITED,11356508,,,64 NEW CAVENDISH STREET,,LONDON,,UNITED KINGDOM,W1G 8TB,Private Limited Company,Active,United Kingdom,,11/05/2018,31.0,5.0,31/05/2021,31/05/2019,DORMANT,08/06/2019,,0,0,0,0,68100 - Buying and selling of own real estate,,,,0,0,http://business.data.gov.uk/id/company/11356508,,,,,,,,,,,,,,,,,,,,,20/07/2021,06/07/2020
2782498,MAXKOTE LIMITED,11766787,,,20 KIRKGATE,SHERBURN IN ELMET,LEEDS,,UNITED KINGDOM,LS25 6BL,Private Limited Company,Active,United Kingdom,,15/01/2019,31.0,1.0,31/10/2021,31/01/2020,,12/02/2020,,0,0,0,0,46180 - Agents specialized in the sale of othe...,,,,0,0,http://business.data.gov.uk/id/company/11766787,,,,,,,,,,,,,,,,,,,,,25/02/2021,14/01/2020


### Initial data evaluation

In [4]:
df.shape

(4837425, 55)

In [5]:
df.columns

Index(['CompanyName', ' CompanyNumber', 'RegAddress.CareOf',
       'RegAddress.POBox', 'RegAddress.AddressLine1',
       ' RegAddress.AddressLine2', 'RegAddress.PostTown', 'RegAddress.County',
       'RegAddress.Country', 'RegAddress.PostCode', 'CompanyCategory',
       'CompanyStatus', 'CountryOfOrigin', 'DissolutionDate',
       'IncorporationDate', 'Accounts.AccountRefDay',
       'Accounts.AccountRefMonth', 'Accounts.NextDueDate',
       'Accounts.LastMadeUpDate', 'Accounts.AccountCategory',
       'Returns.NextDueDate', 'Returns.LastMadeUpDate',
       'Mortgages.NumMortCharges', 'Mortgages.NumMortOutstanding',
       'Mortgages.NumMortPartSatisfied', 'Mortgages.NumMortSatisfied',
       'SICCode.SicText_1', 'SICCode.SicText_2', 'SICCode.SicText_3',
       'SICCode.SicText_4', 'LimitedPartnerships.NumGenPartners',
       'LimitedPartnerships.NumLimPartners', 'URI', 'PreviousName_1.CONDATE',
       ' PreviousName_1.CompanyName', ' PreviousName_2.CONDATE',
       ' PreviousName_2.C

<br><br>The data from Companies House comes with a range of features including the address and [Standard Industrial Classification](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) which could be useful features for deduplication, but first let's just consider the similarity of the company names.

In [6]:
df = df[['CompanyName', ' CompanyNumber']]

In [7]:
df[['CompanyName', ' CompanyNumber']].nunique()

CompanyName       4836627
 CompanyNumber    4837425
dtype: int64

Straight away it can be seen that some of the company names in the dataset are direct duplicates of other rows, without even considering any fuzzy matching.

In [8]:
print('Top 10 direct duplicates:')
df['CompanyName'].value_counts(ascending=False)[:10]

Top 10 direct duplicates:


DEVON FUEL ASSOCIATES LTD. T/A SERVICETECH SOUTH WEST PARTNERSHIP        10
FOXLEY COURT EQUESTRIAN                                                   4
COTFIELD FARMERS                                                          3
BONSHAW FARMS                                                             3
JACK KAY & SONS                                                           3
FARRANS (CONSTRUCTION) LIMITED                                            3
ROYAL INSURANCE (U.K.) LIMITED                                            3
ASSOCIATED SUB-CONTRACTORS IN LIMITED PARTNERSHIP WITH MICHAEL MURPHY     3
ASSOCIATED SUB-CONTRACTORS IN LIMITED PARTNERSHIP WITH LEE SMITH          3
JAMES MCDOUGALL                                                           3
Name: CompanyName, dtype: int64

**For the sake of runtime, the following examples will use a subset of the overall data**

In [9]:
df = df.sample(n=50000)

### Fuzzywuzzy 

As mentioned, one of the simplest ways to get the similarity between the company names is to use the fuzzywuzzy Python package.

In [10]:
company_list = df['CompanyName'].unique()
print(process.extractOne('AVOCHIE HOME FARM PARTNERSHIP', company_list))
time_result = %timeit -o process.extractOne('AVOCHIE HOME FARM PARTNERSHIP', company_list)

('ASSOCIATED SUB-CONTRACTORS IN LIMITED PARTNERSHIP WITH DAVID KELLY', 86)
2.36 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
print('Rough time for full dataset: {} hours'.format(int((time_result.average/3600)*len(company_list))))

Rough time for full dataset: 32 hours


Nearly 2.4 seconds just to find the closest match for one record! For our relatively small sample of 50,000 companies, finding every duplicate match would take about *33 hours* (2.36 s × 50,000 ≈ 32.8 hours, which the cell above truncates to 32).
<br><br>
The sample dataset used here is about 1/96th of the full dataset. If this method were applied to all 4.8 million companies, it would take roughly 96<sup>2</sup> times longer: over 300,000 hours, or about **_35 years_**.
<br><br>
While this could be scaled up or out with greater compute power, matrix multiplication offers another approach.<br><br>
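
The runtime estimate above can be reproduced with a few lines of arithmetic. The inputs are the timing and dataset sizes reported earlier in this notebook; the variable names are illustrative only.

```python
seconds_per_lookup = 2.36   # measured by %timeit above
sample_size = 50_000        # names in the sample
full_size = 4_837_425       # rows in the full Companies House file

# One lookup per name in the sample
sample_hours = seconds_per_lookup * sample_size / 3600

# Both the number of lookups and the cost of each lookup grow with n,
# so the total scales with the square of the size ratio.
ratio = full_size / sample_size
full_hours = sample_hours * ratio ** 2
full_years = full_hours / (24 * 365)

print(f"Sample: {sample_hours:.1f} hours")
print(f"Full dataset: {full_hours:,.0f} hours, roughly {full_years:.0f} years")
```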

### Fuzzy matching with TF-IDF

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (**Term Frequency-Inverse Document Frequency**) is a statistical measure used in Natural Language Processing to evaluate how important a word is to a document in a collection of documents.
<br><br>
An example of this would be in collecting a large list of books, and giving a score to each word in every book that describes how useful that word is in identifying the book itself.
<br><br>
**Term Frequency** is the raw count of how many times a term appears within the document, i.e. how many times a word appears in the book.
<br><br>
**Inverse Document Frequency** is a measure of how valuable that term is overall to the identity of the document, i.e. if the word is one which is very common across all of the books (documents), or if it is rare and so likely to be specific to one book.
<br><br>
For this example, _"the"_ is likely to have a high Term Frequency, but it appears in every single book in the collection, so it will have a low overall TF-IDF score. _"Mockingbird"_ is unlikely to appear in many books, so it is more useful for identifying a specific book and will therefore have a far greater TF-IDF score.
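
The book example can be worked through by hand. The sketch below uses the textbook formula (raw count multiplied by the log of the inverse document frequency) on three made-up "books". scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF and normalises each row, so its numbers will differ; the data and function name here are purely illustrative.

```python
import math
from collections import Counter

# Three made-up "books", each reduced to a handful of words
books = {
    "book_a": "the mockingbird sang while the children watched the house",
    "book_b": "the whale pulled the ship across the sea",
    "book_c": "the detective searched the house",
}

tokenised = {title: text.split() for title, text in books.items()}
n_docs = len(tokenised)

# Document frequency: in how many books does each word appear?
doc_freq = Counter(word for words in tokenised.values() for word in set(words))


def tf_idf(word: str, title: str) -> float:
    """Raw term count multiplied by log(N / document frequency)."""
    tf = tokenised[title].count(word)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf


print(tf_idf("the", "book_a"))          # appears in every book, so idf = log(1)
print(tf_idf("mockingbird", "book_a"))  # appears in one book only, so idf = log(3)
```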


<br>**How does this help with fuzzy matching or deduplication?**
<br><br>[scikit-learn](https://scikit-learn.org/stable/) has a module to carry out the TF-IDF transformation. Normally, the "terms" of TF-IDF would be whole words within a document, but for the case of the company names within this dataset the terms can be a collection of n-grams that compose the name instead.

Using the following function **ngram**, a company name can be broken up into every consecutive 3-character combination.
<br><br>The name is first cleaned using regular expressions to remove irrelevant characters, and is then padded with spaces so that the start and end of the name each produce an extra n-gram.

In [12]:
def ngram(s):
    n=3
    s = re.sub(r'[^a-zA-Z0-9\-\,&\s!]','',s)
    s = re.sub(r'[\-\,]',' ', s)
    s = re.sub(r'\s+',' ', s)
    s = s.replace('&','and')
    s = s.title()
    s = s.strip()
    s = ' '+s+' '
    ngrams = zip(*[s[i:] for i in range(n)])
    ngrams = [''.join(ngram) for ngram in ngrams]
    return ngrams

print(ngram('AVOCHIE HOME FARM PARTNERSHIP'))

[' Av', 'Avo', 'voc', 'och', 'chi', 'hie', 'ie ', 'e H', ' Ho', 'Hom', 'ome', 'me ', 'e F', ' Fa', 'Far', 'arm', 'rm ', 'm P', ' Pa', 'Par', 'art', 'rtn', 'tne', 'ner', 'ers', 'rsh', 'shi', 'hip', 'ip ']


With the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class of scikit-learn, this ngram function can replace the default behaviour of splitting on whole words by passing it as the *analyzer* parameter.
<br><br>
In order to demonstrate this, a TF-IDF matrix is created for a list of 4 example company names below.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

example = ['Royal Bank', 'The Royal Bank','Clydesdale Bank','Clydesdale Banking']

vectorizer_ex = TfidfVectorizer(min_df=1, analyzer=ngram)
tfidf_matrix_ex = vectorizer_ex.fit_transform(example)
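# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
pd.DataFrame(tfidf_matrix_ex.toarray(), index=example, columns=vectorizer_ex.get_feature_names_out()).round(decimals=3)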

Unnamed: 0,Ba,Cl,Ro,Th,Ban,Cly,Roy,The,al,ale,ank,dal,des,e B,e R,esd,he,ing,kin,l B,le,lyd,ng,nk,nki,oya,sda,yal,yde
Royal Bank,0.234,0.0,0.354,0.0,0.234,0.0,0.354,0.0,0.354,0.0,0.234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.354,0.0,0.0,0.0,0.287,0.0,0.354,0.0,0.354,0.0
The Royal Bank,0.174,0.0,0.263,0.334,0.174,0.0,0.263,0.334,0.263,0.0,0.174,0.0,0.0,0.0,0.334,0.0,0.334,0.0,0.0,0.263,0.0,0.0,0.0,0.213,0.0,0.263,0.0,0.263,0.0
Clydesdale Bank,0.184,0.278,0.0,0.0,0.184,0.278,0.0,0.0,0.0,0.278,0.184,0.278,0.278,0.278,0.0,0.278,0.0,0.0,0.0,0.0,0.278,0.278,0.0,0.225,0.0,0.0,0.278,0.0,0.278
Clydesdale Banking,0.153,0.231,0.0,0.0,0.153,0.231,0.0,0.0,0.0,0.231,0.153,0.231,0.231,0.231,0.0,0.231,0.0,0.293,0.293,0.0,0.231,0.231,0.293,0.0,0.293,0.0,0.231,0.0,0.231


It can be seen that the vectorizer has split all 4 company names into a unique list of n-grams, and given a TF-IDF score to each one for each of the 4 companies.
<br><br>
The score measures how important each n-gram is to identifying that specific company name.<br><br>**"Ban"** is present in all 4 of the companies in this example, so it has a low score for every one of them.<br><br>On the other hand, **"Roy"** only appears within the two Royal Bank examples, and therefore has a higher score for these two entries and a score of 0 for the two Clydesdale examples.

<br>**Finding close matches**
<br><br>
Once this TF-IDF matrix is prepared, the cosine similarity metric can be used to compare the matrix against itself and highlight company names that have a strong similarity to others within the dataset.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

In [15]:
cosine_similarity(tfidf_matrix_ex, tfidf_matrix_ex)

array([[1.        , 0.74382013, 0.19373958, 0.10751709],
       [0.74382013, 1.        , 0.1441074 , 0.07997338],
       [0.19373958, 0.1441074 , 1.        , 0.78967928],
       [0.10751709, 0.07997338, 0.78967928, 1.        ]])

In [16]:
pd.DataFrame(cosine_similarity(tfidf_matrix_ex, tfidf_matrix_ex), index=example, columns=example).round(decimals=3)

Unnamed: 0,Royal Bank,The Royal Bank,Clydesdale Bank,Clydesdale Banking
Royal Bank,1.0,0.744,0.194,0.108
The Royal Bank,0.744,1.0,0.144,0.08
Clydesdale Bank,0.194,0.144,1.0,0.79
Clydesdale Banking,0.108,0.08,0.79,1.0


As seen above, the metric has done a great job at identifying similar company names.
<br><br>**"Royal Bank"** and **"The Royal Bank"** have been given a similarity score of 0.744. Likewise, **"Clydesdale Bank"** and **"Clydesdale Banking"** have a score of 0.79.
<br><br>Conversely, the scores between the Royal Bank and Clydesdale entries all sit below 0.2, correctly identifying these as different.
<br><br>The scores produced by the cosine similarity can then be filtered as needed to suit the level of precision or recall for the task at hand.
<br><br>*Note: As the TF-IDF matrix is compared against itself, every row will have at least 1 perfect match with itself. These can be filtered out by only considering scores < 1.*
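
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on raw trigram counts rather than TF-IDF weights, so its numbers will not match the table above; the helper names are hypothetical. When every row already has unit length, which is TfidfVectorizer's default (`norm='l2'`), the division is a no-op and the whole similarity matrix reduces to a matrix product of the TF-IDF matrix with its transpose.

```python
import math
from collections import Counter


def trigrams(name: str) -> Counter:
    """Count the 3-character n-grams of a space-padded, lower-cased name."""
    padded = f" {name.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Dot product of two sparse vectors divided by the product of their lengths."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


royal = trigrams("Royal Bank")
the_royal = trigrams("The Royal Bank")
clydesdale = trigrams("Clydesdale Bank")

print(round(cosine(royal, the_royal), 3))
print(round(cosine(royal, clydesdale), 3))
```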

### How much faster is the TF-IDF method than fuzzywuzzy?

Creating a TF-IDF matrix for our sample dataset of 50,000 companies:

In [17]:
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngram)
tfidf_matrix = vectorizer.fit_transform(company_list)

In [18]:
tfidf_matrix.shape

(50000, 19525)

Something to note is that the TF-IDF matrix will be very sparse, as each row will likely only have a value above 0 for a small set of the n-grams that make up the thousands of columns of the overall matrix.

In [19]:
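# Take the cell count from the shape; calling .toarray() here would build the full dense matrix in memory
n_cells = tfidf_matrix.shape[0] * tfidf_matrix.shape[1]
print('Percentage of matrix that is non-zero: {:.2f}%'.format(tfidf_matrix.count_nonzero() / n_cells * 100))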

Percentage of matrix that is non-zero: 0.12%
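
A back-of-the-envelope estimate shows why this sparsity matters for memory. The shape and non-zero fraction come from the outputs above; the 8-byte values and 4-byte indices are assumptions about the storage types, which in practice depend on the SciPy version and matrix size.

```python
rows, cols = 50_000, 19_525   # shape of the TF-IDF matrix above
non_zero_fraction = 0.0012    # 0.12%, from the output above

total_cells = rows * cols
non_zero_cells = round(total_cells * non_zero_fraction)

# Dense storage keeps every cell as an 8-byte float
dense_bytes = total_cells * 8

# CSR keeps one value and one column index per non-zero cell,
# plus one row pointer per row (4-byte indices assumed here)
csr_bytes = non_zero_cells * (8 + 4) + (rows + 1) * 4

print(f"Dense: {dense_bytes / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
```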


Rather than use the cosine similarity function from scikit-learn, data scientists at ING developed a more efficient method to perform the matrix multiplication for such a sparse matrix (described [here](https://medium.com/wbaa/https-medium-com-ingwbaa-boosting-selection-of-the-most-similar-entities-in-large-scale-datasets-450b3242e618)) and published this [package](https://github.com/ing-bank/sparse_dot_topn) to implement it.
<br><br>Included in the package is the function *awesome_cossim_topn*, which performs ING's implementation of cosine similarity. There are two extra parameters in this function to restrict the results to the top N matches for each row and to set a lower bound for the score.
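
The effect of those two parameters can be sketched in plain Python for a single row of scores. This is not the package's implementation, which applies the selection as part of the sparse multiplication; the function name is hypothetical, and treating the lower bound as exclusive is an assumption. The scores are the first row of the example similarity table above.

```python
import heapq


def top_n_above(scores: dict, n: int, lower_bound: float) -> list:
    """Keep the n highest-scoring candidates whose score exceeds lower_bound."""
    kept = [(name, score) for name, score in scores.items() if score > lower_bound]
    return heapq.nlargest(n, kept, key=lambda pair: pair[1])


# Similarity of "Royal Bank" to each of the four example names
row_scores = {
    "ROYAL BANK": 1.0,
    "THE ROYAL BANK": 0.744,
    "CLYDESDALE BANK": 0.194,
    "CLYDESDALE BANKING": 0.108,
}

print(top_n_above(row_scores, n=2, lower_bound=0.6))
```

Discarding low scores early is what keeps the result sparse: only a handful of candidates per row are ever stored.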

In [20]:
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn

print('Time taken to create the TF-IDF matrix:')
%timeit -r 1 -n 1 tfidf_matrix = vectorizer.fit_transform(company_list)

print('\nTime taken to calculate the similarity matrix:')
%timeit -r 1 -n 1 matches = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.transpose(), 10, 0.6)

Time taken to create the TF-IDF matrix:
1.15 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Time taken to calculate the similarity matrix:
17.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


**Less than 20 seconds** for the whole sample dataset of 50,000 rows! Compared to the fuzzywuzzy method from earlier that was going to take an estimated 33 hours, that's a speed-up of over _**6000x**_.

### Using the TF-IDF results

The final step is to convert the matrix result into tabular form to match the result format of fuzzywuzzy, so that records can be linked as needed.

In [21]:
matches = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.transpose(), 10, 0.6)
matches

<50000x50000 sparse matrix of type '<class 'numpy.float64'>'
	with 79046 stored elements in Compressed Sparse Row format>

Using the *get_matches_df* function described in [this post](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html), the matrix results can be unpacked into a DataFrame.

In [22]:
def get_matches_df(sparse_matrix, name_vector, top=100):
    """Unpack the stored entries of a sparse similarity matrix into a DataFrame."""
    # Row and column indices of every stored (non-zero) similarity score
    sparserows, sparsecols = sparse_matrix.nonzero()

    # Unpack the first `top` entries, or all of them when top is None or 0
    nr_matches = min(top, sparsecols.size) if top else sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})

In [23]:
matches_df = get_matches_df(matches, company_list, top=100)

The resulting DataFrame is filtered to scores below 1 to remove the exact matches arising from comparing the TF-IDF matrix against itself. The cut-off used is 0.99999 rather than exactly 1, which leaves room for floating-point rounding.

In [24]:
matches_df = matches_df[matches_df['similarity'] < 0.99999]

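Looking at the top-scoring matches, the results are now in the format that would be expected after using the fuzzywuzzy package. Note that with `top=100` only the first 100 stored entries of the matrix were unpacked, so these are the best matches within that subset rather than across the whole sample.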

In [25]:
matches_df.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,left_side,right_side,similarity
78,KNIGHTSBRIDGE HIGHRISE 1 LIMITED,KNIGHTSBRIDGE HOMES LIMITED,0.709281
1,BEAUTY CARE SUPPLIES LIMITED,BEAUTY SALON SUPPLIES LTD,0.704602
19,PRO GLAZE WINDOWS DOORS & CONSERVATORIES LIMITED,CROWN CONSERVATORIES WINDOWS AND DOORS LTD,0.693704
25,KGF PLUMBING AND HEATING LTD,DSF PLUMBING AND HEATING LTD,0.674433
72,HEATHWOOD CARPETS AND FLOORING LIMITED,HALLS CARPETS & FLOORING LTD,0.672994
73,HEATHWOOD CARPETS AND FLOORING LIMITED,GLOBAL CARPETS & FLOORING LTD,0.661707
26,KGF PLUMBING AND HEATING LTD,S A PLUMBING AND HEATING LTD,0.657188
27,KGF PLUMBING AND HEATING LTD,IVE PLUMBING & HEATING LTD,0.655919
28,KGF PLUMBING AND HEATING LTD,HILL PLUMBING & HEATING LIMITED,0.655301
29,KGF PLUMBING AND HEATING LTD,INVITE PLUMBING AND HEATING LTD,0.654454


The quality of the results will vary depending on the dataset used. As this was just a small sample of the overall dataset from Companies House, there were few real duplicates.

### Possible next steps

**Combining other features**
<br><br>
The original dataset came with a number of other data points, including each company's registered address and SIC description, which could be used in combination with the similarity score found by fuzzy matching. This could allow for finding duplicates below the score threshold that would otherwise be considered the limit for acceptable accuracy.
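
As a hedged sketch of how a second feature might relax the name threshold, the function below accepts a lower name similarity when two records share a postcode. The function name, both thresholds and the choice of postcode as the extra feature are assumptions for illustration, not values tested in this notebook.

```python
def is_probable_duplicate(name_similarity: float, same_postcode: bool,
                          strict: float = 0.8, relaxed: float = 0.6) -> bool:
    """Flag a pair as a likely duplicate, trusting the name less when the postcode differs."""
    # Hypothetical rule: a shared postcode is extra evidence, so a lower score is accepted
    threshold = relaxed if same_postcode else strict
    return name_similarity >= threshold


print(is_probable_duplicate(0.70, same_postcode=True))
print(is_probable_duplicate(0.70, same_postcode=False))
```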
<br><br><br><br>
**Removing company extensions**
<br><br>
Many of the company names in the dataset include extensions not important to identifying the company itself, such as "Ltd" or "Limited". The TF-IDF metric will already take this into account somewhat, but it may be worth testing whether removing the extensions outright improves the overall quality of the similarity results. (Full list of extensions found [here](https://www.corporateinformation.com/Company-Extensions-Security-Identifiers.aspx).) 
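
One way to test this would be to strip a trailing extension before building the n-grams. The sketch below is a minimal example with a deliberately short, illustrative list of extensions; the names `EXTENSIONS` and `strip_extension` are hypothetical, and a real list would be much longer.

```python
import re

# A deliberately short, illustrative list of legal-form suffixes
EXTENSIONS = ["LIMITED", "LTD", "PLC", "LLP"]

# Match one extension at the very end of the name, with an optional trailing full stop
_SUFFIX = re.compile(r"[\s,]+(?:" + "|".join(EXTENSIONS) + r")\.?\s*$", re.IGNORECASE)


def strip_extension(name: str) -> str:
    """Remove one trailing legal-form suffix such as 'Ltd' or 'Limited'."""
    return _SUFFIX.sub("", name).strip()


print(strip_extension("KGF PLUMBING AND HEATING LTD"))
print(strip_extension("HILL PLUMBING & HEATING LIMITED"))
```

Only a suffix at the end of the name is removed, so a name such as "LIMITED EDITION PRINTS" is left unchanged.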
<br><br><br><br>
**Encoding the names to remove misspellings**
<br><br>
If the dataset is likely to contain a large number of misspellings (e.g. some CRM data), then splitting the names into n-grams may not be enough to overcome the issue. An example of an encoding method that could assist in this case is [Double Metaphone](https://en.wikipedia.org/wiki/Metaphone#Double_Metaphone), an algorithm that produces an approximate phonetic representation of words.

In [26]:
from metaphone import doublemetaphone
print(doublemetaphone('The Royal Bank'))
print(doublemetaphone('The Ryoal Bnk'))

('0RLPNK', 'TRLPNK')
('0RLPNK', 'TRLPNK')


As seen above, the Double Metaphone algorithm produces the same output for both the correct name and the misspelling. There is some risk of similar-sounding names being encoded to the same output, so the value of encoding the names in this form depends heavily on the dataset in question and on whether any other features can be used in combination.

### Inspirations and further reading:

[Datacamp - Fuzzy String Matching in Python](https://www.datacamp.com/community/tutorials/fuzzy-string-python)
<br><br>
[Super Fast String Matching in Python](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html)
<br><br>
[Boosting the selection of the most similar entities in large scale datasets](https://medium.com/wbaa/https-medium-com-ingwbaa-boosting-selection-of-the-most-similar-entities-in-large-scale-datasets-450b3242e618)
<br><br>
[Fuzzy search using Double Metaphone](https://blog.cloudant.com/2019/08/08/fuzzy-search-using-the-double-metaphone-algorithm.html)