# NLP 07: Fuzzy Matching

Frequencies get me counts on exact matches. Now I'm interested in fuzzy matches: matches that are similar but not exact. What I want out of this are groups of words that mean the same thing but are misspelled or abbreviated.

To get this information, I'll need to compare every word to every other word, which is computationally expensive. I'll need to keep that in mind when I get to comparing addresses for all the rows. For example, while comparing each word to every other word in the 2,000+ row word list may be entirely feasible, doing the same for all 400,000+ addresses at the same time would take too long. Partitioning the data into chunks, perhaps even smaller than one chunk per country, will be necessary.
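To get a feel for the scale, here is a back-of-the-envelope sketch. The row counts are the round figures quoted above, and the chunk size of 2,000 is a hypothetical choice for illustration, not something decided in this notebook.

```python
def ordered_pairs(n: int) -> int:
    """Number of (a, b) comparisons with a != b among n items."""
    return n * (n - 1)


def chunked_pairs(n: int, chunk_size: int) -> int:
    """Comparisons needed if items are only compared within fixed-size chunks."""
    full_chunks, remainder = divmod(n, chunk_size)
    return full_chunks * ordered_pairs(chunk_size) + ordered_pairs(remainder)


print(f"{ordered_pairs(2_000):,}")           # word list: 3,998,000
print(f"{ordered_pairs(400_000):,}")         # all addresses at once: 159,999,600,000
print(f"{chunked_pairs(400_000, 2_000):,}")  # addresses in chunks of 2,000: 799,600,000
```

The trade-off is that chunking only finds matches that fall inside the same chunk, so how the data is partitioned matters as much as the chunk size.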

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from time import gmtime, strftime
import sys
import os
import io

import string
import re
# import itertools
# import nltk
# nltk.download('stopwords')

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from rapidfuzz import fuzz as rfuzz
import jaro

In [2]:
def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] +=1
    return freq_dict

In [3]:
df = pd.read_csv('data/parsed_bahamas_addresses.csv')

df['address_wordlist'] = df['working_address'].fillna('').str.split()

freq_df = pd.DataFrame.from_dict(
    frequency_ct(df['address_wordlist'].sum()
                ), orient='index').reset_index().rename(
    columns={'index':'word', 0:'count'}).sort_values('count', ascending=False)
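Summing a column of lists concatenates them one at a time, which may get slow as the data grows. A possible alternative, sketched here on a small hypothetical stand-in for `df['address_wordlist']`, is to flatten with `itertools.chain` and count with `collections.Counter`:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df['address_wordlist']: one list of words per row.
address_wordlists = [
    ["po", "box", "n", "3933", "nassau", "bahamas"],
    ["bay", "street", "nassau", "bahamas"],
    [],  # a row whose address was missing
]

# Flatten lazily, then count every word in a single pass.
word_counts = Counter(chain.from_iterable(address_wordlists))

print(word_counts["nassau"], word_counts["box"])  # 2 1
```

`Counter` is a `dict` subclass, so it could be passed to `pd.DataFrame.from_dict` in the same way as the output of `frequency_ct`.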

In [4]:
freq_df.head(3)

Unnamed: 0,word,count
9,bahamas,2324
8,nassau,2043
6,box,1484


## Fuzzy match metrics

There are two main ways to measure string similarity:

- Levenshtein Distance: uses the number of single-character edits needed to convert the first string into the second
- Jaro-Winkler Distance: uses the number of matching characters and the number of transpositions

Edit operations include the following:


- Addition: Adding a character
- Deletion: Removing a character
- Substitution: Replacing a character
- Transposition: Swapping two adjacent characters

Levenshtein Distance uses the first three. An extension of this distance metric, Damerau-Levenshtein Distance, uses all four edit operations.
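The difference between the two is easiest to see on a transposed pair of letters. Below is a minimal pure-Python sketch, not the implementation used by any of the libraries in this notebook. The `allow_transpositions` flag is my own naming, and the transposition variant shown is the restricted "optimal string alignment" form of Damerau-Levenshtein.

```python
def edit_distance(a: str, b: str, allow_transpositions: bool = False) -> int:
    """Levenshtein distance, or the optimal-string-alignment variant of
    Damerau-Levenshtein when allow_transpositions is True."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        dist[0][j] = j  # add all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # addition
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (
                allow_transpositions
                and i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("street", "stret"))                              # 1
print(edit_distance("street", "steret"))                             # 2
print(edit_distance("street", "steret", allow_transpositions=True))  # 1
```

"steret" swaps the "r" and "e" of "street", so it costs two substitutions under Levenshtein but a single edit once transpositions are allowed.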

For more information on these metrics, Moosa Ali has a good write-up on [Medium](https://medium.com/) in his [Best Libraries for Fuzzy Matching In Python](https://medium.com/codex/best-libraries-for-fuzzy-matching-in-python-cbb3e0ef87dd) article.

### `fuzzywuzzy`

The main library for performing fuzzy matching with Python is the `fuzzywuzzy` package. It uses Levenshtein Distance to calculate how similar or dissimilar two strings are.

`fuzzywuzzy` comes with four metrics:

- Ratio: compares the entire string with the characters in order
- Partial ratio: compares the shorter string with a substring of the same length from the longer string
- Token sort ratio: splits each string into tokens (words), sorts them, and compares the results, so word order is ignored
- Token set ratio: compares the sets of tokens (words) in each string, so duplicate words and word order are ignored

For more information on these metrics, Catherine Gitau has a good explanation in her [Fuzzy String Matching](https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe) article on [Towards Data Science](https://towardsdatascience.com/).

`fuzzywuzzy` also has several functions in the `process` module:

- `extract`: compares a single string to a list of strings. When used with a series, returns three values
    - Second string that is being compared to the first string
    - Score
    - Index of the second string
- `dedupe`: removes duplicate values from a list of strings based on a specified score threshold

There are several limitations to these methods.

`extract` can only take a single string for the first value. I'd need to set this up in a for loop to compare every item to every other item. For loops are notoriously slow, so this doesn't appear to be a good solution given the amount of data I want to process.

`dedupe` removes the duplicates without providing any information on what was removed. For my use case, I'll need to know what is matching with what. For this initial example using the word list, I'll want to replace misspellings with the correct word to update the values in the `working_address` column. When I apply this to the full addresses, I'll need to group rows that should be the same address and give them a new node id. I can then use the original node id and the newly assigned node id to correctly associate addresses with their counterparts in the rest of the Offshore Leaks data.
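As a rough illustration of the word-level replacement step, here is a minimal sketch. The `corrections` table and the `correct_address` helper are hypothetical names for illustration; in practice the table would be built from whichever matches survive review.

```python
# Hypothetical corrections table: misspelling -> correct word.
corrections = {"stret": "street", "streeet": "street", "bahams": "bahamas"}


def correct_address(working_address: str, corrections: dict) -> str:
    """Replace whole words that appear in the corrections table."""
    words = working_address.split()
    return " ".join(corrections.get(word, word) for word in words)


print(correct_address("charlotte house charlotte stret nassau bahams", corrections))
# charlotte house charlotte street nassau bahamas
```

Working on whole words rather than substrings avoids rewriting a longer word that merely contains a misspelling.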

### `rapidfuzz`

There is another package, `rapidfuzz`, that is based on `fuzzywuzzy` but supposedly runs faster. It also returns more detailed scores: `fuzzywuzzy` rounds to the nearest integer, while `rapidfuzz` returns the decimal value. Occasionally the two also produce different results, as with the partial ratio in the comparison below.

In [5]:
str1 = freq_df.iloc[0,0]
str2 = freq_df.iloc[1,0]

print(f'Comparison strings: String 1: "{str1}", String 2: "{str2}"', '\n')
print('Metric', '\t\t\tfuzzywuzzy', '\trapidfuzz')
print('Ratio:', '\t\t\t', fuzz.ratio(str1, str2), '\t\t', rfuzz.ratio(str1, str2))
print('Partial ratio:', '\t\t', fuzz.partial_ratio(str1, str2), '\t\t', rfuzz.partial_ratio(str1, str2))
print('Token sort ratio:', '\t', fuzz.token_sort_ratio(str1, str2), '\t\t', rfuzz.token_sort_ratio(str1, str2))
print('Token set ratio:', '\t', fuzz.token_set_ratio(str1, str2), '\t\t', rfuzz.token_set_ratio(str1, str2))
print('Jaro-Winkler:', '\t', jaro.jaro_winkler_metric(str1, str2))

Comparison strings: String 1: "bahamas", String 2: "nassau" 

Metric 			fuzzywuzzy 	rapidfuzz
Ratio: 			 31 		 30.76923076923077
Partial ratio: 		 33 		 50.0
Token sort ratio: 	 31 		 30.76923076923077
Token set ratio: 	 31 		 30.769230769230774
Jaro-Winkler: 	 0.5396825396825397


## Storage format

Before deciding how to process all the data, I need to know what information to keep. Based on what I wanted to see above I need:

- The original string
- The match string
- All five metrics

Additionally, since I'm working with dataframes, I'll also want the index of both the original and match values so that I can easily find them in the dataframe.

Ultimately, when I'm working with the full address string, it should look something like this:

<table>
    <tr>
        <td>address_index</td>
        <td>address</td>
        <td>match_index</td>
        <td>match</td>
        <td>ratio_score</td>
        <td>partial_ratio_score</td>
        <td>token_sort_score</td>
        <td>token_set_score</td>
        <td>jaro_winkler_score</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>1975</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>100</td>
        <td>100</td>
        <td>100</td>
        <td>100</td>
        <td>1.0</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>2068</td>
        <td>'goodmans bay corporate centre po box cb10976 nassau bahamas'</td>
        <td>78</td>
        <td>87</td>
        <td>75</td>
        <td>85</td>
        <td>0.78</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>548</td>
        <td>'second  floor goodmans bay corporate centre suite 261 po box cb12762 nassau bahamas'</td>
        <td>77</td>
        <td>69</td>
        <td>72</td>
        <td>85</td>
        <td>0.74</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>268</td>
        <td>'goodmans bay corporate centre po box cb12407 nassau bahamas'</td>
        <td>76</td>
        <td>85</td>
        <td>75</td>
        <td>85</td>
        <td>0.78</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>1511</td>
        <td>'goodmans bay corporate centre west bay street po box n3933 nassau bahamas'</td>
        <td>76</td>
        <td>70</td>
        <td>70</td>
        <td>79</td>
        <td>0.77</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>1707</td>
        <td>'first floor goodmans bay corporate centre bay street nassau bahamas'</td>
        <td>76</td>
        <td>77</td>
        <td>73</td>
        <td>81</td>
        <td>0.73</td>
    </tr>
</table>

In [6]:
def calc_ffuzz_df(df, column):
    row_list = []
    
    for o_i, o_v in enumerate(df[column].sort_index()):
        for m_i, m_v in enumerate(df[column].sort_index()):
            if o_i != m_i:
                dict1 = {
                    'original_index': o_i,
                    'original_value': o_v,
                    'match_index': m_i,
                    'match_value': m_v,
                    'ratio_score': fuzz.ratio(o_v, m_v),
                    'partial_ratio_score': fuzz.partial_ratio(o_v, m_v),
                    'token_sort_score': fuzz.token_sort_ratio(o_v, m_v),
                    'token_set_score': fuzz.token_set_ratio(o_v, m_v),
                    'jaro_winkler_score': jaro.jaro_winkler_metric(o_v, m_v)
                }
                if (dict1['ratio_score']>0) | (dict1['partial_ratio_score']>0) | (dict1['token_sort_score']>0) | (dict1['token_set_score']>0) | (dict1['jaro_winkler_score']>0):
                    row_list.append(dict1)
    score_df = pd.DataFrame(row_list)
        
    return score_df

def calc_rfuzz_df(df, column):
    row_list = []
    
    for o_i, o_v in enumerate(df[column].sort_index()):
        for m_i, m_v in enumerate(df[column].sort_index()):
            if o_i != m_i:
                dict1 = {
                    'original_index': o_i,
                    'original_value': o_v,
                    'match_index': m_i,
                    'match_value': m_v,
                    'ratio_score': rfuzz.ratio(o_v, m_v),
                    'partial_ratio_score': rfuzz.partial_ratio(o_v, m_v),
                    'token_sort_score': rfuzz.token_sort_ratio(o_v, m_v),
                    'token_set_score': rfuzz.token_set_ratio(o_v, m_v),
                    'jaro_winkler_score': jaro.jaro_winkler_metric(o_v, m_v)
                }
                if (dict1['ratio_score']>0) | (dict1['partial_ratio_score']>0) | (dict1['token_sort_score']>0) | (dict1['token_set_score']>0) | (dict1['jaro_winkler_score']>0):
                    row_list.append(dict1)
    score_df = pd.DataFrame(row_list)
        
    return score_df

*Notes* on the `calc_ffuzz_df` and `calc_rfuzz_df` functions:

1. I collect each row of scores as a dictionary in a list and build the dataframe once at the end. Initially, I used `.iloc` to concatenate a row onto the dataframe. However, this was unacceptably slow (I was too impatient to even let it finish).
1. I don't store any row where all metrics are 0. If any metric has a value greater than 0, the whole row is captured.
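The nested loops score both (a, b) and (b, a). If the scorer is symmetric, which is an assumption worth checking for each metric, every pair only needs to be scored once. A minimal sketch of that idea follows. `stand_in_score` is a placeholder built on the standard library's `difflib`, not one of the metrics used in this notebook, and `score_pairs` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_score(a: str, b: str) -> float:
    """Placeholder scorer (stdlib difflib), not one of the notebook's metrics."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def score_pairs(values, scorer=stand_in_score):
    """Score each unordered pair once, then emit a row for both directions."""
    rows = []
    for (o_i, o_v), (m_i, m_v) in combinations(enumerate(values), 2):
        score = scorer(o_v, m_v)  # computed once per pair
        if score > 0:
            rows.append({"original_index": o_i, "original_value": o_v,
                         "match_index": m_i, "match_value": m_v, "score": score})
            rows.append({"original_index": m_i, "original_value": m_v,
                         "match_index": o_i, "match_value": o_v, "score": score})
    return rows


rows = score_pairs(["street", "stret", "box"])
print(len(rows))  # 2: "street"/"stret" in both directions; "box" shares no letters
```

This keeps the same row layout while roughly halving the number of scorer calls.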

### Function speed

Running the `%%timeit` magic function reveals that the `rapidfuzz` library is indeed significantly faster. As such, I'll be using the metric functions from the `rapidfuzz` library.

In [15]:
# %%timeit
# calc_rfuzz_df(freq_df, 'word')

43.4 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
# %%timeit
# calc_ffuzz_df(freq_df, 'word')

2min 42s ± 934 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
fuzzy_words_df = calc_rfuzz_df(freq_df, 'word')
fuzzy_words_df['jaro_winkler_score'] = fuzzy_words_df['jaro_winkler_score']*100
fuzzy_words_df

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
0,0,annex,1,frederick,14.285714,25.000000,14.285714,14.285714,43.703704
1,0,annex,2,and,50.000000,80.000000,50.000000,50.000000,68.888889
2,0,annex,3,shirley,16.666667,28.571429,16.666667,16.666667,44.761905
3,0,annex,4,street,18.181818,28.571429,18.181818,18.181818,45.555556
4,0,annex,6,box,25.000000,50.000000,25.000000,25.000000,0.000000
...,...,...,...,...,...,...,...,...,...
2272337,2038,2ntl,2029,tortola,36.363636,50.000000,36.363636,36.363636,59.523810
2272338,2038,2ntl,2031,switzerland,26.666667,33.333333,26.666667,26.666667,56.060606
2272339,2038,2ntl,2033,montagne,33.333333,50.000000,33.333333,33.333333,58.333333
2272340,2038,2ntl,2034,sterline,33.333333,50.000000,33.333333,33.333333,58.333333


## Most matched

I'm really not interested in sifting through 14-43% matches like "annex" matched with "frederick"; they just aren't useful. What I really want are the words that occur frequently and have variations in spelling. To find them, I'll count the number of matches above a certain threshold for each unique word in my dataset, then start looking at those with the most matches. I chose a threshold of 60.

In [8]:
index_col = 'original_index'
metric_cts = pd.DataFrame(fuzzy_words_df[index_col].unique(), columns=[index_col])

for metric in ['ratio_score', 'partial_ratio_score', 'token_sort_score', 'token_set_score', 'jaro_winkler_score']:
    met_df = fuzzy_words_df.loc[fuzzy_words_df[metric]>60, [index_col, metric]].groupby(index_col).count().reset_index()
    metric_cts = metric_cts.merge(met_df, on=index_col, how='outer')
    
metric_cts = fuzzy_words_df[[index_col, 'original_value']].drop_duplicates().merge(metric_cts, on=index_col, how='outer')
metric_cts.columns = ['original_index', 'original_value', 'ratio_match_ct', 'partial_ratio_match_ct', 'token_sort_match_ct', 'token_set_match_ct', 'jaro_winkler_match_ct']
metric_cts

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
0,0,annex,5.0,58,5.0,5.0,55
1,1,frederick,3.0,63,3.0,3.0,67
2,2,and,14.0,239,14.0,14.0,133
3,3,shirley,19.0,60,19.0,19.0,122
4,4,street,18.0,81,18.0,18.0,179
...,...,...,...,...,...,...,...
2034,2034,sterline,21.0,127,21.0,21.0,238
2035,2035,bav,7.0,97,7.0,7.0,91
2036,2036,hast,6.0,109,6.0,6.0,114
2037,2037,coj,1.0,100,1.0,1.0,84


### Remove noise

One of the first things I noticed was how many of the "words" are actually PO Boxes, suite numbers, etc. I won't be changing any of these values, so they aren't particularly interesting in this analysis. As such, I removed values that are entirely digits, as well as values where "n" or any two leading characters are followed by digits.
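To check what the filter pattern catches, here is a small sketch using the standard `re` module. The sample tokens are taken from addresses and tables shown in this notebook. Since pandas' `str.contains` searches for the pattern anywhere in the string, `re.search` is used here.

```python
import re

# Same pattern as the notebook's filter, written as a raw string.
noise_pattern = re.compile(r'^n\d+|^\w\w\d+|^\d+$')

samples = ["n4805", "cb12263", "cr567", "3242", "25", "n", "e2", "3f", "street"]
flagged = [token for token in samples if noise_pattern.search(token)]

print(flagged)  # ['n4805', 'cb12263', 'cr567', '3242', '25']
```

Only the all-digits alternative is anchored at the end of the string, so the first two alternatives also match values with trailing characters after the digits.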

In [9]:
metric_cts[metric_cts['original_value'].str.contains(r'^n\d+|^\w\w\d+|^\d+$')]

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
7,7,n4805,16.0,113,16.0,16.0,123
15,15,n8188,17.0,97,17.0,17.0,88
19,19,n7785,18.0,102,18.0,18.0,88
20,20,n3708,8.0,120,8.0,8.0,113
26,26,n3024,25.0,138,25.0,25.0,135
...,...,...,...,...,...,...,...
2019,2019,3242,24.0,121,24.0,24.0,137
2020,2020,25,13.0,252,13.0,13.0,55
2022,2022,875,6.0,114,6.0,6.0,58
2023,2023,cr567,2.0,39,2.0,2.0,20


In [11]:
metric_cts = metric_cts[~metric_cts['original_value'].str.contains(r'^n\d+|^\w\w\d+|^\d+$')]

## Metrics

### Ratio

The `ratio` metric returns interesting results right off the bat. The top 10 results (sorted in descending order) include two misspellings of "bahamas" and three spelling variations of "street"/"streets".

In [22]:
metric_cts.sort_values(['ratio_match_ct', 'original_value'], ascending=False).head(10)

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
1137,1137,ste,24.0,243,24.0,24.0,149
1579,1579,steret,23.0,113,23.0,23.0,198
1136,1136,slite,23.0,94,23.0,23.0,156
1225,1225,charlote,23.0,92,23.0,23.0,230
662,662,center,23.0,100,23.0,23.0,189
516,516,bahams,23.0,70,23.0,23.0,83
464,464,suites,22.0,80,22.0,22.0,169
947,947,strees,22.0,87,22.0,22.0,179
1968,1968,stre,22.0,205,22.0,22.0,148
659,659,bhamas,22.0,75,22.0,22.0,94


### Partial ratio

The first thing that stands out to me for `partial_ratio` is how short the original values are: only 1-2 letters. The number of matches is also significantly higher than for the other metrics. This may just mean that I need a higher threshold for this metric. That would make sense because the `partial_ratio` metric is comparing smaller, more similar portions of the two strings. Alternatively, this may mean that `partial_ratio` isn't a good metric for this use case. I'll explore this more in later analysis.

In [14]:
metric_cts.sort_values('partial_ratio_match_ct', ascending=False).head(10)

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
49,49,n,8.0,838,8.0,9.0,524
884,884,an,16.0,727,16.0,16.0,101
1755,1755,na,5.0,677,5.0,5.0,85
419,419,e,11.0,648,11.0,11.0,291
238,238,a,15.0,637,15.0,15.0,385
1150,1150,in,10.0,607,10.0,10.0,79
711,711,on,9.0,595,9.0,9.0,59
336,336,np,2.0,573,2.0,2.0,26
1045,1045,se,10.0,568,10.0,10.0,64
563,563,no,4.0,551,4.0,4.0,43


### Token sort and set

For the single word use case I'm exploring, `token_sort` and `token_set` are returning very, very similar (if not exactly the same) results. Just like with `ratio`, I'm getting very good similarity matching on common terms like "bahamas" and "street". In fact, the top 10 results are the same across all three metrics.
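One way to see why the single-word scores coincide is that a string with one token is unchanged by sorting its tokens. The sketch below is not the library code. `ratio_sketch` computes a normalized Indel similarity (2 × longest common subsequence ÷ combined length), which reproduces the `ratio_score` values shown in this notebook, and `token_sort_sketch` is a simplified stand-in for the token sort preprocessing.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ratio_sketch(a: str, b: str) -> float:
    """Normalized Indel similarity on a 0-100 scale: 2 * LCS / total length."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100 * 2 * lcs_length(a, b) / total


def token_sort_sketch(a: str, b: str) -> float:
    """Sort the whitespace-separated tokens of each string, then compare."""
    return ratio_sketch(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


# Single words: sorting one token changes nothing, so the scores are equal.
print(ratio_sketch("street", "stret"), token_sort_sketch("street", "stret"))

# Multiple words: token sort ignores the order, plain ratio does not.
print(ratio_sketch("bay street", "street bay"), token_sort_sketch("bay street", "street bay"))
```

The metrics should start to diverge once the comparison moves from single words to full address strings, where word order and repeated words come into play.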

In [20]:
metric_cts.sort_values(['token_sort_match_ct', 'original_value'], ascending=False).head(10)

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
1137,1137,ste,24.0,243,24.0,24.0,149
1579,1579,steret,23.0,113,23.0,23.0,198
1136,1136,slite,23.0,94,23.0,23.0,156
1225,1225,charlote,23.0,92,23.0,23.0,230
662,662,center,23.0,100,23.0,23.0,189
516,516,bahams,23.0,70,23.0,23.0,83
464,464,suites,22.0,80,22.0,22.0,169
947,947,strees,22.0,87,22.0,22.0,179
1968,1968,stre,22.0,205,22.0,22.0,148
659,659,bhamas,22.0,75,22.0,22.0,94


In [21]:
metric_cts.sort_values(['token_set_match_ct', 'original_value'], ascending=False).head(10)

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
1137,1137,ste,24.0,243,24.0,24.0,149
1579,1579,steret,23.0,113,23.0,23.0,198
1136,1136,slite,23.0,94,23.0,23.0,156
1225,1225,charlote,23.0,92,23.0,23.0,230
662,662,center,23.0,100,23.0,23.0,189
516,516,bahams,23.0,70,23.0,23.0,83
464,464,suites,22.0,80,22.0,22.0,169
947,947,strees,22.0,87,22.0,22.0,179
1968,1968,stre,22.0,205,22.0,22.0,148
659,659,bhamas,22.0,75,22.0,22.0,94


### Jaro-Winkler

Like `partial_ratio`, the `jaro_winkler` metric is returning many shorter (single letter) words with higher counts. I'll explore whether Jaro-Winkler needs a higher threshold or isn't a good fit for this use case in later analysis.
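The pull toward single letters follows from how the score is built. Below is a pure-Python sketch of Jaro-Winkler, not the `jaro` package's code. It assumes the common Winkler settings (prefix weight 0.1, prefix capped at 4 characters, boost applied only when the Jaro score exceeds 0.7), which are consistent with the scores shown in this notebook but are not taken from the package source.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1,
                 boost_threshold: float = 0.7) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


for letter in ["s", "r", "t", "e"]:
    print(letter, round(100 * jaro_winkler("street", letter), 2))
```

For a one-letter string that matches at all, the Jaro formula reduces to (1 + 1/L + 1) / 3, where L is the length of the other word. That is always above 0.667, so on the 0-100 scale any such match clears the threshold of 60 regardless of how long the other word is.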

In [17]:
metric_cts.sort_values('jaro_winkler_match_ct', ascending=False).head(10)

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
49,49,n,8.0,838,8.0,9.0,524
238,238,a,15.0,637,15.0,15.0,385
419,419,e,11.0,648,11.0,11.0,291
1088,1088,o,6.0,480,6.0,6.0,277
123,123,s,9.0,491,9.0,9.0,270
433,433,r,4.0,497,4.0,4.0,264
1282,1282,alrite,8.0,99,8.0,8.0,255
441,441,c,9.0,360,9.0,9.0,255
1145,1145,i,3.0,430,3.0,3.0,247
1249,1249,eastern,12.0,93,12.0,12.0,247


## Analysis

Initially, I looked at the groupings with the highest match counts. For example, the top result under `ratio_match_ct`, `token_sort_match_ct`, and `token_set_match_ct` was "ste". However, what I'm really interested in are groupings around common words such as "street", "bahamas", and "nassau". I also found that starting from the correct spelling gave better matches at the top of the results.

### Street

#### Ratio, Token sort, Token set

Just like with the highest frequency matches, `ratio`, `token_sort`, and `token_set` are returning the same scores. All results scoring 80 or higher look relevant. If I wanted to replace misspellings with the correct word, this would be a great place to start. It also found more misspellings than I did while reviewing the data manually.

In [29]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['ratio_score']>60)].sort_values(['ratio_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333
5971,4,street,1743,treetops,71.428571,90.909091,71.428571,71.428571,81.944444


In [32]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['token_sort_score']>75)].sort_values(['token_sort_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333


In [33]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['token_set_score']>75)].sort_values(['token_set_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333


#### Partial ratio

There is a lot of noise in the top scores. This includes single-letter results ("s", "e", "r", and "t", which is every unique letter in the original word), two-letter results ("ee", "st", and "et"), and "treetops" in the over-90 group.

This metric may be appropriate for longer strings, but it doesn't appear useful with my short, single word use case. I'll double check the results for the longer address strings, but I suspect there is a minimum string length required for this metric to be useful.

In [40]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['partial_ratio_score']>80)].sort_values(['partial_ratio_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5101,4,street,123,s,28.571429,100.0,28.571429,28.571429,75.0
5210,4,street,419,e,28.571429,100.0,28.571429,28.571429,0.0
5212,4,street,433,r,28.571429,100.0,28.571429,28.571429,72.222222
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5307,4,street,713,ee,50.0,100.0,50.0,50.0,55.555556
5321,4,street,743,st,50.0,100.0,50.0,50.0,82.222222
5975,4,street,1751,t,28.571429,100.0,28.571429,28.571429,72.222222
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333
6085,4,street,1969,et,50.0,100.0,50.0,50.0,55.555556
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333


#### Jaro-Winkler

The top `jaro_winkler` results are similar to those of `ratio`, `token_sort`, and `token_set`. Of note is that the scores are higher for `jaro_winkler`. It appears that the trick to getting good results from `jaro_winkler` is a higher threshold.

In [37]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['jaro_winkler_score']>80)].sort_values(['jaro_winkler_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333
5532,4,street,1137,ste,66.666667,80.0,66.666667,66.666667,86.666667


#### Compare top results

I don't really want to go through with a fine-tooth comb and try to figure out if the top matches for `ratio`, `token_sort`, `token_set`, and `jaro_winkler` actually match exactly, so I'm going to make the computer do it for me. Using a threshold of 75 for the first three metrics and 90 for `jaro_winkler`, we can see below that the best results for all four metrics are indeed exactly the same and in the same order.

In [43]:
st_ratio_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['ratio_score']>75), 'match_value'].to_list()
st_token_sort_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['token_sort_score']>75), 'match_value'].to_list()
st_token_set_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['token_set_score']>75), 'match_value'].to_list()
st_jaro_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['jaro_winkler_score']>90), 'match_value'].to_list()

In [51]:
[len(x) for x in [st_ratio_list, st_token_sort_list, st_token_set_list, st_jaro_list]]

[9, 9, 9, 9]

In [48]:
pd.DataFrame([st_ratio_list, st_token_sort_list, st_token_set_list, st_jaro_list]).T

Unnamed: 0,0,1,2,3
0,stret,stret,stret,stret
1,stree,stree,stree,stree
2,strees,strees,strees,strees
3,streer,streer,streer,streer
4,streeet,streeet,streeet,streeet
5,steret,steret,steret,steret
6,strets,strets,strets,strets
7,streeets,streeets,streeets,streeets
8,stre,stre,stre,stre
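Reading the columns side by side works at this size. For longer lists, a programmatic check may be easier. Here is a minimal sketch with two hypothetical helpers, `all_identical` and `same_members`. The first list is copied from the table above, and the reordered list is made up to show the difference between the two checks.

```python
def all_identical(*lists) -> bool:
    """True when every list has the same items in the same order."""
    first = lists[0]
    return all(other == first for other in lists[1:])


def same_members(*lists) -> bool:
    """True when every list has the same items, ignoring order and repeats."""
    first = set(lists[0])
    return all(set(other) == first for other in lists[1:])


# Match values for "street", copied from the table above.
street_matches = ["stret", "stree", "strees", "streer", "streeet",
                  "steret", "strets", "streeets", "stre"]
# Hypothetical reordering, e.g. if one list had been sorted by score first.
reordered = sorted(street_matches)

print(all_identical(street_matches, list(street_matches)))  # True
print(all_identical(street_matches, reordered))             # False
print(same_members(street_matches, reordered))              # True
```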


## Review results in data

Now that I have a set of results that I'm interested in, I want to see what they look like in the actual data. Based on the below results, the nine values the fuzzy matching scores turned up are all typos or misspellings of the word "street".

In [53]:
pd.set_option('display.max_colwidth', 1000)

In [54]:
word_list = fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['jaro_winkler_score']>90)].sort_values(['jaro_winkler_score', 'original_value'], ascending=False)['match_value']
for word in word_list:
    print(word)
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')

streeet


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1031,14079989,Suite E-2; Union Court Building; Elizabeth Avenue and Shirley Streeet; Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenue and shirley streeet nassau bahamas,"[suite, e, 2, union, court, building, elizabeth, avenue, and, shirley, streeet, nassau, bahamas]"




stret


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
156,24000157,"SG HAMBROS BUILDING WEST BAY STRET, P.O. BOX CB-12263, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,sg hambros building west bay stret po box cb12263 nassau bahamas,"[sg, hambros, building, west, bay, stret, po, box, cb12263, nassau, bahamas]"
694,14035596,Charlotte House; Charlotte Stret; POB N-65; nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,charlotte house charlotte stret pob n 65 nassau bahamas,"[charlotte, house, charlotte, stret, pob, n, 65, nassau, bahamas]"
1383,286502,"SUITE 306, 3/F CENTRE OF COMMERCE 1 BAY STRET NASSAU BAHAMAS",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,suite 306 3f centre of commerce 1 bay stret nassau bahamas,"[suite, 306, 3f, centre, of, commerce, 1, bay, stret, nassau, bahamas]"




stree


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
350,24000351,"2ND. FL. GOLD CIRCLE HSE. EAST BAY STREE, P.O. BOX N-3726, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,second fl gold circle hse east bay stree po box n3726 nassau bahamas,"[second, fl, gold, circle, hse, east, bay, stree, po, box, n3726, nassau, bahamas]"
546,14012243,2ND FLOOR; ANSBACHER HOUSE; BANK LANE AN EAST STREE; PO BOX N-9934; NASSAU; BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,second floor ansbacher house bank lane an east stree po box n9934 nassau bahamas,"[second, floor, ansbacher, house, bank, lane, an, east, stree, po, box, n9934, nassau, bahamas]"
1032,14079990,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVENUE AND SHIRLEY STREE; NASSAU; THE BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenue and shirley stree nassau the bahamas,"[suite, e, 2, union, court, building, elizabeth, avenue, and, shirley, stree, nassau, the, bahamas]"
1862,33000052,"2ND FL GOLD CIRCLE HSE EAST BAY STREE, PO BOX N-3726, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,second fl gold circle hse east bay stree po box n3726 nassau bahamas,"[second, fl, gold, circle, hse, east, bay, stree, po, box, n3726, nassau, bahamas]"




steret


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1412,245979,"1st Floor Norfolk House Frederick Steret, Nassau BAHAMAS",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,first floor norfolk house frederick steret nassau bahamas,"[first, floor, norfolk, house, frederick, steret, nassau, bahamas]"




streeets


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1677,81058625,Bahamas Financial Centre; Shirley & Charlotte Streeets; P.O. Box CB-13136; Nassau; Bahamas,Bahamas Financial Centre,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,bahamas financial centre shirley and charlotte streeets po box cb13136 nassau bahamas,"[bahamas, financial, centre, shirley, and, charlotte, streeets, po, box, cb13136, nassau, bahamas]"




strees


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
607,14030183,BAHAMAS FINANCIAL CENTRE; P.O. BOX N-3023 SHIRLEY & CHARLOTTE STREES; NASSAU; BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bahamas financial centre po box n3023 shirley and charlotte strees nassau bahamas,"[bahamas, financial, centre, po, box, n3023, shirley, and, charlotte, strees, nassau, bahamas]"
775,14044324,"he Bahamas Financial Centre Fourth Floor Shirley & Charlotte Strees Nassau, Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,he bahamas financial centre fourth floor shirley and charlotte strees nassau bahamas,"[he, bahamas, financial, centre, fourth, floor, shirley, and, charlotte, strees, nassau, bahamas]"
1082,14080632,The Bahamas Financial centre; Charlotte and Shirley Strees; Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre charlotte and shirley strees nassau bahamas,"[the, bahamas, financial, centre, charlotte, and, shirley, strees, nassau, bahamas]"
1086,14080636,"The Bahamas Financial Centre Fourth Floor Shirley & Charlotte Strees Nassau, Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre fourth floor shirley and charlotte strees nassau bahamas,"[the, bahamas, financial, centre, fourth, floor, shirley, and, charlotte, strees, nassau, bahamas]"




streer


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
995,14079952,Suite E-2; Unioin Court Building; Elizabeth Avenue and Shirley Streer; P.O. Box N-8188; Nassau; Bahamas.,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 unioin court building elizabeth avenue and shirley streer po box n8188 nassau bahamas,"[suite, e2, unioin, court, building, elizabeth, avenue, and, shirley, streer, po, box, n8188, nassau, bahamas]"
999,14079956,Suite E-2; Union Court Buiding; Elizabeth Avenue and Shirley Streer; P.O. Box N-8188; Nassao; Bahamas.,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court buiding elizabeth avenue and shirley streer po box n8188 nassao bahamas,"[suite, e2, union, court, buiding, elizabeth, avenue, and, shirley, streer, po, box, n8188, nassao, bahamas]"




strets


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1475,287068,"THE BAHAMAS FINANCIAL CENTRE FOUTH FLOOR SHIRLEY & CHARLOTTE STRETS NASSAU, BAHAMAS",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,the bahamas financial centre fouth floor shirley and charlotte strets nassau bahamas,"[the, bahamas, financial, centre, fouth, floor, shirley, and, charlotte, strets, nassau, bahamas]"




stre


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2153,120019222,"THE CHAMBERS OF MESSRS, MCKINNEY, BANCRO FT & HUGHES, MAREVA HOUSE, 4 GEORGE STRE ET, NASSAU, BAHAMAS.","THE CHAMBERS OF MESSRS, MCKINNEY, BANCRO FT & HUGHES, MAREVA HOUSE, 4 GEORGE STRE ET, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,the chambers of messrs mckinney bancro ft and hughes mareva house 4 george stre et nassau bahamas,"[the, chambers, of, messrs, mckinney, bancro, ft, and, hughes, mareva, house, 4, george, stre, et, nassau, bahamas]"






## Replicate results

To ensure my findings aren't a fluke, I like to be able to replicate them at least once. To do this, I'll use the most frequent word in the dataset, "bahamas".

### Bahamas

Starting with an initial threshold of 60, I can see that `ratio`, `token_sort` and `token_set` again return the same values with the same scores. The relevant values are again all over a threshold of 75. In this case, I can go as high as 80, but I don't see anything relevant under 75, just as in the "street" example.

`partial_ratio` returns many relevant results, but these are again mixed with more noise than I'm willing to parse.

`jaro_winkler` again returns results very similar to those of `ratio`, `token_sort` and `token_set`. However, in this example, there is more noise in the lower-end results.
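
The identical `ratio`, `token_sort` and `token_set` scores are expected here: each value is a single word, so sorting or de-duplicating tokens has nothing to change. As a rough sketch of where the numbers come from, and assuming `ratio_score` is the normalized Indel similarity that rapidfuzz documents for `fuzz.ratio`, the score can be rebuilt from the longest common subsequence (LCS). The helper names `lcs_length` and `indel_ratio` are hypothetical and not part of any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Assumed formula: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


# Values taken from the match_value column in the tables of this section.
for match in ["bahamasc", "bahama", "bahaams", "bah", "ba"]:
    print(match, round(indel_ratio("bahamas", match), 2))
# bahamasc 93.33
# bahama 92.31
# bahaams 85.71
# bah 60.0
# ba 44.44
```

For a seven-letter word, one added or dropped character costs about 7 points and a substitution or swap about 14, which is consistent with the relevant misspellings all landing above 75.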

In [55]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['ratio_score']>60)].sort_values(['ratio_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10300,9,bahamas,659,bhamas,92.307692,90.909091,92.307692,92.307692,85.714286
10420,9,bahamas,904,bahmas,92.307692,83.333333,92.307692,92.307692,92.777778


In [56]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['token_sort_score']>80)].sort_values(['token_sort_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10300,9,bahamas,659,bhamas,92.307692,90.909091,92.307692,92.307692,85.714286
10420,9,bahamas,904,bahmas,92.307692,83.333333,92.307692,92.307692,92.777778


In [57]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['token_set_score']>80)].sort_values(['token_set_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10300,9,bahamas,659,bhamas,92.307692,90.909091,92.307692,92.307692,85.714286
10420,9,bahamas,904,bahmas,92.307692,83.333333,92.307692,92.307692,92.777778


In [37]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['partial_ratio_score']>70)].sort_values(['partial_ratio_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10543,9,bahamas,1100,ba,44.444444,100.0,44.444444,44.444444,80.952381
10367,9,bahamas,806,bah,60.0,100.0,60.0,60.0,86.666667
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10551,9,bahamas,1125,bahamaspo,87.5,100.0,87.5,87.5,95.555556
10082,9,bahamas,123,s,25.0,100.0,25.0,25.0,0.0
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10432,9,bahamas,926,as,44.444444,100.0,44.444444,44.444444,54.761905
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095
10780,9,bahamas,1425,m,25.0,100.0,25.0,25.0,0.0


In [40]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['jaro_winkler_score']>75)].sort_values(['jaro_winkler_score', 'original_value'], ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10491,9,bahamas,1018,bahaams,85.714286,85.714286,85.714286,85.714286,97.142857
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10551,9,bahamas,1125,bahamaspo,87.5,100.0,87.5,87.5,95.555556
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095


#### Compare results

In [69]:
bah_ratio_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['ratio_score']>75), 'match_value'].to_list()
bah_token_sort_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['token_sort_score']>75), 'match_value'].to_list()
bah_token_set_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['token_set_score']>75), 'match_value'].to_list()
bah_jaro_list = fuzzy_words_df.loc[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['jaro_winkler_score']>80), 'match_value'].to_list()

In [70]:
[len(x) for x in [bah_ratio_list, bah_token_sort_list, bah_token_set_list, bah_jaro_list]]

[17, 17, 17, 18]

In [71]:
pd.DataFrame([bah_ratio_list, bah_token_sort_list, bah_token_set_list, bah_jaro_list]).T

Unnamed: 0,0,1,2,3
0,bahama,bahama,bahama,bahama
1,bahams,bahams,bahams,bahams
2,bhamas,bhamas,bhamas,bhamas
3,bahmas,bahmas,bahmas,bah
4,bahamasc,bahamasc,bahamasc,bahmas
5,ahamas,ahamas,ahamas,bahamasc
6,bahamas1,bahamas1,bahamas1,ahamas
7,bahaams,bahaams,bahaams,bahamas1
8,bahamaas,bahamaas,bahamaas,bahaams
9,bahanas,bahanas,bahanas,ba


As the dataframe above shows, the results are very similar across the four metrics, but not identical. Inserting a couple of NaNs to align the rows makes the differences clearer: `jaro_winkler` (column "3") has two results that the others don't ("bah" and "ba") and is missing one that the others have ("hamas").

In [72]:
for x in [bah_ratio_list, bah_token_sort_list, bah_token_set_list]:
    x.insert(3, np.nan)
    x.insert(9, np.nan)

In [73]:
pd.DataFrame([bah_ratio_list, bah_token_sort_list, bah_token_set_list, bah_jaro_list]).T

Unnamed: 0,0,1,2,3
0,bahama,bahama,bahama,bahama
1,bahams,bahams,bahams,bahams
2,bhamas,bhamas,bhamas,bhamas
3,,,,bah
4,bahmas,bahmas,bahmas,bahmas
5,bahamasc,bahamasc,bahamasc,bahamasc
6,ahamas,ahamas,ahamas,ahamas
7,bahamas1,bahamas1,bahamas1,bahamas1
8,bahaams,bahaams,bahaams,bahaams
9,,,,ba
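
The two extra `jaro_winkler` matches are both prefixes of "bahamas", which is the kind of match the Winkler prefix bonus rewards, while "hamas" drops the leading letter and gets no bonus. The following is a from-scratch sketch of the textbook Jaro-Winkler formula (prefix weight 0.1, prefix capped at four characters), not the `jaro` package's own code; the function names are hypothetical. Some implementations only apply the bonus above a Jaro score of 0.7, which would not change these examples.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity in the range 0..1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    used1 = [False] * len1
    used2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_score(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler on a 0..100 scale, boosting a shared prefix (max 4)."""
    base = jaro_similarity(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return 100 * (base + prefix * prefix_weight * (1 - base))


for match in ["bahamasc", "bah", "ba", "hamas"]:
    print(match, f"{jaro_winkler_score('bahamas', match):.2f}")
# bahamasc 97.50
# bah 86.67
# ba 80.95
# hamas 77.14
```

With the `jaro_winkler_score > 80` filter used above, "bah" and "ba" pass and "hamas" does not, even though "hamas" shares five of seven characters with "bahamas".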


### Review results in data

The in-data fuzzy match results for "bahamas" are a bit more interesting than those for "street".

- "bahama" generally refers to "Grand Bahama", a specific island in the Bahamas island group. This value occurs both at the end of the original string and closer to the middle.
- Node id 14064257, which gives me "bahamaspo", has a duplicated address, resulting in a missing space between "bahamas" and "po box".
- "Bahamas" refers to more than just the country; it commonly occurs in the building name "Bahamas Financial Centre".
- One misspelling, "baham", is actually for the island Grand Bahama.
- "bah" is a shortening of "bahamas".
- "ba" appears to be extraneous.

In [80]:
word_list = fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['jaro_winkler_score']>80)].sort_values(['jaro_winkler_score', 'original_value'], ascending=False)['match_value'].to_list() + ['hamas']
for word in word_list:
    print(word)
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')

bahamasc


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
570,14018521,51 Frederick Street; P.O. Box N-1136; Nassau; BahamasC,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,51 frederick street po box n1136 nassau bahamasc,"[51, frederick, street, po, box, n1136, nassau, bahamasc]"




bahamas1


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
587,14028501,Atlantic House; 3rd Floor; Collins Avenue & 2nd Terrace; Nassau; Bahamas1,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,atlantic house third floor collins avenue and second terrace nassau bahamas1,"[atlantic, house, third, floor, collins, avenue, and, second, terrace, nassau, bahamas1]"
1177,14085238,WINTERBOTHAM PLACE; MARLBOROUGH & QUEEN STREETS; P.O. BOX N-7523; NASSAU; BAHAMAS1,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,winterbotham place marlborough and queen street po box n7523 nassau bahamas1,"[winterbotham, place, marlborough, and, queen, street, po, box, n7523, nassau, bahamas1]"




bahamaas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
834,14051201,"NASSAU, BAHAMAAS",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,nassau bahamaas,"[nassau, bahamaas]"




bahamas6


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1041,14080001,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVENUE AND SHIRLEY STREET; NASSAU; THE BAHAMAS6,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenue and shirley street nassau the bahamas6,"[suite, e, 2, union, court, building, elizabeth, avenue, and, shirley, street, nassau, the, bahamas6]"




bahamasa


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1442,240054,"Winterbotham Place Marlborough & Queen Streets PO Box CB 11343 Nassau, Bahamasa",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,winterbotham place marlborough and queen street po box cb 11343 nassau bahamasa,"[winterbotham, place, marlborough, and, queen, street, po, box, cb, 11343, nassau, bahamasa]"




bahama


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
76,24000077,"P.O. BOX F-40773, FREEPORT, GR. BAHAMA 242-352-7291",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,po box f40773 freeport gr bahama 2423527291,"[po, box, f40773, freeport, gr, bahama, 2423527291]"
79,24000080,"REGENT CENTRE, P.O. BOX F-40132 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,regent centre po box f40132 freeport grand bahama,"[regent, centre, po, box, f40132, freeport, grand, bahama]"
83,24000084,"CHANCERY HOUSE, P.O. BOX F-42578 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,chancery house po box f42578 freeport grand bahama,"[chancery, house, po, box, f42578, freeport, grand, bahama]"
87,24000088,"CHANCERY COURT THE MALL, P.O. BOX F-42643 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,chancery court the mall po box f42643 freeport grand bahama,"[chancery, court, the, mall, po, box, f42643, freeport, grand, bahama]"
104,24000105,"SUITE A, REGENT CENTRE, P.O. BOX F-42682 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,suite a regent centre po box f42682 freeport grand bahama,"[suite, a, regent, centre, po, box, f42682, freeport, grand, bahama]"
...,...,...,...,...,...,...,...,...,...,...
2083,33000290,"REGENT CENTRE PO BOX F-40132 FREEPORT, GR BAHAMA, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,regent centre po box f40132 freeport gr bahama bahamas,"[regent, centre, po, box, f40132, freeport, gr, bahama, bahamas]"
2084,33000291,"REGENT CENTRE PO BOX F-40132 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,regent centre po box f40132 freeport grand bahama,"[regent, centre, po, box, f40132, freeport, grand, bahama]"
2085,33000293,"SUITE 10 SEVENTEEN CENTRE, BANK LANE PO BOX F-43018 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,suite 10 seventeen centre bank lane po box f43018 freeport grand bahama,"[suite, 10, seventeen, centre, bank, lane, po, box, f43018, freeport, grand, bahama]"
2091,33000299,"FIRST COMMERCIAL CENTRE SUITE 1, 2ND FL PO BOX F-42411 FREEPORT, GRAND BAHAMA",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,first commercial centre suite 1 second fl po box f42411 freeport grand bahama,"[first, commercial, centre, suite, 1, second, fl, po, box, f42411, freeport, grand, bahama]"




bahams


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
274,24000275,"P.O. BOX N 8680, NASSAU, BAHAMS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,po box n 8680 nassau bahams,"[po, box, n, 8680, nassau, bahams]"
559,14018044,4TH FLOOR THE BAHAMAS FINANCIAL CENTRE SHIRLEY & CHARLOTTE STREET NASSAU BAHAMS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,fourth floor the bahamas financial centre shirley and charlotte street nassau bahams,"[fourth, floor, the, bahamas, financial, centre, shirley, and, charlotte, street, nassau, bahams]"
631,14030207,BAHAMS FINANCILA CENTRE PO BOX N-3023 SHIRLEY & CHARLOTTE STREETSNASSAU BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bahams financila centre po box n3023 shirley and charlotte street nassau bahamas,"[bahams, financila, centre, po, box, n3023, shirley, and, charlotte, street, nassau, bahamas]"
826,14050608,MOSSACK FONSECA & CO (BAHAMS) LIMITED SAFFREY SQUARE; SUITE 205; BANK LANE; P.O.BOX N-8188; NASSAU; COMMONWEALTH OF THE BAHAMAS.,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,mossack fonseca and co bahams limited saffrey square suite 205 bank lane po box n8188 nassau commonwealth of the bahamas,"[mossack, fonseca, and, co, bahams, limited, saffrey, square, suite, 205, bank, lane, po, box, n8188, nassau, commonwealth, of, the, bahamas]"
867,14064246,P.O.BOX N-3944; PROVIDENCE HOUSE; EAST HILL STREET; NASSAU; BAHAMS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n3944 providence house east hill street nassau bahams,"[po, box, n3944, providence, house, east, hill, street, nassau, bahams]"
889,14064268,P O BOX N8188 NASSAU BAHAMS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n8188 nassau bahams,"[po, box, n8188, nassau, bahams]"
1910,33000104,"NASSAU, BAHAMS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,nassau bahams,"[nassau, bahams]"




bahaams


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
725,14038328,Elizabeth Avenue and Shirley Street; Union Court Building; Suite E-2; N-8188; Nassau; Bahaams,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,elizabeth avenue and shirley street union court building suite e 2 n 8188 nassau bahaams,"[elizabeth, avenue, and, shirley, street, union, court, building, suite, e, 2, n, 8188, nassau, bahaams]"




bahamaspo


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
878,14064257,P.O. Box N-7768; Nassau; BahamasP.O. Box N-7768; Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n7768 nassau bahamaspo box n7768 nassau bahamas,"[po, box, n7768, nassau, bahamaspo, box, n7768, nassau, bahamas]"




ahamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
576,14025414,ahamas Financial Centre; 4th Floor; Shirley & Charlotte Street; Nassau Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,ahamas financial centre fourth floor shirley and charlotte street nassau bahamas,"[ahamas, financial, centre, fourth, floor, shirley, and, charlotte, street, nassau, bahamas]"




bahanas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
875,14064254,P.O. Box N-7757; East Bay Street; Nassau; Bahanas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n7757 east bay street nassau bahanas,"[po, box, n7757, east, bay, street, nassau, bahanas]"




baham


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1930,33000127,"CHANCERY COURT, THE MALL PO BOX F-42519 FREEPORT, GRAND BAHAM BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,chancery court the mall po box f42519 freeport grand baham bahamas,"[chancery, court, the, mall, po, box, f42519, freeport, grand, baham, bahamas]"




bahmas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
563,14018385,50 Shirley Street; Nassau; Bahmas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,50 shirley street nassau bahmas,"[50, shirley, street, nassau, bahmas]"
668,14033053,c/o Morgan Trust Company of The Bahamas Limited P.O. Box N-4899; Nassau; Bahmas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,co morgan trust company of the bahamas limited po box n4899 nassau bahmas,"[co, morgan, trust, company, of, the, bahamas, limited, po, box, n4899, nassau, bahmas]"
749,14042830,FOURTH FLOOR; THE BAHAMAS FINANCIAL CENTRE; SHIRLEY & CHARLOTTE STREETS; P.O.BOX N-3023; NASSAU; BAHMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,fourth floor the bahamas financial centre shirley and charlotte street po box n3023 nassau bahmas,"[fourth, floor, the, bahamas, financial, centre, shirley, and, charlotte, street, po, box, n3023, nassau, bahmas]"
932,14077074,Saffrey Square; Suite 205; Bank Lane; P.O. Box N-8188; Nassau; Bahmas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,saffrey square suite 205 bank lane po box n8188 nassau bahmas,"[saffrey, square, suite, 205, bank, lane, po, box, n8188, nassau, bahmas]"
1152,14085026,"WEST BAY STREET NASSAU, BAHMAS",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,west bay street nassau bahmas,"[west, bay, street, nassau, bahmas]"




bah


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
495,24000496,"SHIRLEY & CHARLOTTE STS BAH. FIN. CENTRE, P.O. BOX SS-6373, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,shirley and charlotte street bah fin centre po box ss6373 nassau bahamas,"[shirley, and, charlotte, street, bah, fin, centre, po, box, ss6373, nassau, bahamas]"
760,14043538,GOODMAN S BAY CORPORATE CENTER WEST BAY STREET NASSAU BAH,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,goodman s bay corporate center west bay street nassau bah,"[goodman, s, bay, corporate, center, west, bay, street, nassau, bah]"




brahmas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1120,14080679,"The Brahmas Financial Centre, Shirley and Charlotte Streets P O Box N - 3023 Nassau Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the brahmas financial centre shirley and charlotte street po box n 3023 nassau bahamas,"[the, brahmas, financial, centre, shirley, and, charlotte, street, po, box, n, 3023, nassau, bahamas]"




bhamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
390,24000391,"P.O. BOX N-4485, NASSAU BHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,po box n4485 nassau bhamas,"[po, box, n4485, nassau, bhamas]"




abahamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1901,33000091,NEW PROVIDENCE ABAHAMAS,,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,new providence abahamas,"[new, providence, abahamas]"




ba


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
833,14051200,Nassau-BA-Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,nassau ba bahamas,"[nassau, ba, bahamas]"
951,14077696,"Sede Nassau-BA (capital), Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,sede nassau ba capital bahamas,"[sede, nassau, ba, capital, bahamas]"




hamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2243,240492153,"MONTAGUE STERLING CENTRE. EAST BAY STREET, NASSAU, HAMAS, SWITZERLAND, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,montague sterling centre east bay street nassau hamas switzerland bahamas,"[montague, sterling, centre, east, bay, street, nassau, hamas, switzerland, bahamas]"






## Conclusion

The ultimate goal of this use case is to ensure that there is only one node id associated with a single address, i.e. deduplicate the values.

If I wanted the results to reflect things like country, I'd want to leave in "Bahamas". This would allow me to run the same fuzzy matching on the full dataset without partitioning by country. However, this increases the computational expense dramatically, since the number of pairwise comparisons grows with the square of the number of rows. As such, I've already decided to partition by country (and maybe by city, if I can pull cities out of all the addresses).
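
To put rough numbers on that cost, here is a small sketch that counts unordered pairs. The sizes 2,000 and 400,000 are the approximate figures from the introduction; the partition sizes are purely hypothetical, chosen only to show the effect of splitting the data.

```python
from math import comb


def pair_count(n):
    """Number of unordered pairs when every item is compared to every other."""
    return comb(n, 2)


word_list_size = 2_000      # approximate size of the Bahamas word list
address_count = 400_000     # approximate size of the full address set

print(pair_count(word_list_size))   # 1999000
print(pair_count(address_count))    # 79999800000

# Hypothetical split: the same 400,000 rows in 200 partitions of 2,000 rows.
partitions = [2_000] * 200
print(sum(pair_count(n) for n in partitions))  # 399800000
```

In this made-up split, partitioning cuts the comparisons by a factor of about 200, at the price of never comparing rows that sit in different partitions.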

While this will reduce the computational complexity, the scores will be more similar if I don't apply any parsing because "bahamas" will occur in almost every result for this subset of my address data. As such, I plan to use my fuzzy match results and data parsing to pull out similar information (country, city, island, po box), then apply fuzzy matching to the remaining information.
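
A minimal sketch of that parsing step is below, under several assumptions: the regular expression only covers the "po box" shapes visible in the output above, `COUNTRY_VARIANTS` is a hand-picked subset of the misspellings found in this notebook, and `split_address` is a hypothetical helper rather than code from the pipeline. Only a trailing country token is removed, so "bahamas" inside a building name such as "Bahamas Financial Centre" is left alone.

```python
import re

# Hypothetical subset of the country misspellings surfaced by the fuzzy match.
COUNTRY_VARIANTS = {
    "bahamas", "bahams", "bahmas", "bhamas", "bahaams", "bahamaas", "bahanas",
}

# Matches e.g. "po box n3726", "po box n 8680", "po box cb 11343".
PO_BOX = re.compile(r"\bpo box ([a-z]{1,2}) ?(\d+)\b")


def split_address(working_address):
    """Pull the PO box and a trailing country token out of a cleaned address."""
    po_box = None
    found = PO_BOX.search(working_address)
    if found:
        po_box = found.group(1) + found.group(2)
        working_address = (
            working_address[:found.start()] + working_address[found.end():]
        )
    tokens = working_address.split()
    country = None
    if tokens and tokens[-1] in COUNTRY_VARIANTS:
        country = "bahamas"
        tokens = tokens[:-1]
    return {"po_box": po_box, "country": country, "remainder": " ".join(tokens)}


print(split_address("po box n 8680 nassau bahams"))
# {'po_box': 'n8680', 'country': 'bahamas', 'remainder': 'nassau'}
```

The `remainder` string is what would then go through fuzzy matching, so a shared "bahamas" no longer inflates every score in the partition.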

### Nassau

In [42]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['ratio_score']>60)].sort_values('ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9350,8,nassau,1083,massau,83.333333,90.909091,83.333333,83.333333,88.888889
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333


In [1]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['token_sort_score']>=80)].sort_values('token_sort_score', ascending=False)

NameError: name 'fuzzy_words_df' is not defined

In [45]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['token_set_score']>75)].sort_values('token_set_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9350,8,nassau,1083,massau,83.333333,90.909091,83.333333,83.333333,88.888889
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333


In [43]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['partial_ratio_score']>70)].sort_values('partial_ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
8654,8,nassau,49,n,28.571429,100.0,28.571429,28.571429,75.0
8792,8,nassau,238,a,28.571429,100.0,28.571429,28.571429,72.222222
8995,8,nassau,530,ss,50.0,100.0,50.0,50.0,77.777778
9853,8,nassau,1755,na,50.0,100.0,50.0,50.0,82.222222
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9249,8,nassau,926,as,50.0,100.0,50.0,50.0,77.777778
9290,8,nassau,993,343nassau,80.0,100.0,80.0,80.0,88.888889
8707,8,nassau,123,s,28.571429,100.0,28.571429,28.571429,72.222222
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857


In [46]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['jaro_winkler_score']>85)].sort_values('jaro_winkler_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333
9431,8,nassau,1201,nassan,83.333333,90.909091,83.333333,83.333333,93.333333


In [47]:
# Likely variants of 'nassau' (Jaro-Winkler > 85), best match first
word_list = fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'nassau')
    & (fuzzy_words_df['jaro_winkler_score'] > 85)
].sort_values('jaro_winkler_score', ascending=False)['match_value']

# Show every address row whose token list contains the variant
for word in word_list:
    print(word)
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')

nasssau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
420,24000421,"3RD FLOOR, GEORGE HOUSE, GEORGE STREET, P.O. B...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,third floor george house george street po box ...,"[third, floor, george, house, george, street, ..."




nassaub


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
583,14026897,ANSBACHER (BAHAMAS) LIMITED P.O. BOX N 7768 AN...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,ansbacher bahamas limited po box n 7768 ansbac...,"[ansbacher, bahamas, limited, po, box, n, 7768..."




nassaau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
966,14078961,Suite 102; Saffrey Square; Bay Street and Bank...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite 102 saffrey square bay street and bank l...,"[suite, 102, saffrey, square, bay, street, and..."




nassaus


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1106,14080656,The Bahamas Financial Centre; Shirley & Charlo...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre shirley and charl...,"[the, bahamas, financial, centre, shirley, and..."




nassua


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
933,14077075,SAFFREY SQUARE; SUITE 205; BANK LANE; P.O. BOX...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,saffrey square suite 205 bank lane po box n818...,"[saffrey, square, suite, 205, bank, lane, po, ..."
969,14078964,Suite 102; Saffrey Square; Bay Street and Bank...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite 102 saffrey square bay street and bank l...,"[suite, 102, saffrey, square, bay, street, and..."




nassu


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1021,14079979,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1117,14080667,The Bahamas Financial Centre; Shirley and Char...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre shirley and charl...,"[the, bahamas, financial, centre, shirley, and..."




naussau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1428,252371,"43 Elizabeth Avenue, P.O.Box CB-13022 Naussau ...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,43 elizabeth avenue po box cb13022 naussau bah...,"[43, elizabeth, avenue, po, box, cb13022, naus..."




nasau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
536,14000678,"101 East Hill Street, Nasau Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,101 east hill street nasau bahamas,"[101, east, hill, street, nasau, bahamas]"
612,14030188,Bahamas Financial Centre; Shirley & Charlotte ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bahamas financial centre shirley and charlotte...,"[bahamas, financial, centre, shirley, and, cha..."
682,14035228,"CB 11-343 Nasau, Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb 11 343 nasau bahamas,"[cb, 11, 343, nasau, bahamas]"
724,14038327,Elizabeth Avenue and Shirley Street; Union Cou...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,elizabeth avenue and shirley street union cour...,"[elizabeth, avenue, and, shirley, street, unio..."
965,14078960,Suite 102; Saffrey Square; Bay Street and Bank...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite 102 saffrey square bay street and bank l...,"[suite, 102, saffrey, square, bay, street, and..."
1440,239867,"UBS Trustees (Bahamas) Ltd, UBS House, East Ba...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,ubs trustees bahamas ltd ubs house east bay st...,"[ubs, trustees, bahamas, ltd, ubs, house, east..."




nassao


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
999,14079956,Suite E-2; Union Court Buiding; Elizabeth Aven...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court buiding elizabeth avenue ...,"[suite, e2, union, court, buiding, elizabeth, ..."




nassan


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1043,14080003,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1050,14080011,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




massau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
810,14049672,"MASSAU, BAHAMAS",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,massau bahamas,"[massau, bahamas]"




343nassau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
681,14035227,CB 11.343/Nassau Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb 11 343nassau bahamas,"[cb, 11, 343nassau, bahamas]"
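
Once the variants have been reviewed by eye, one way to use them is a lookup from each misspelling to the canonical spelling. This is only a sketch: the set below is a hand-picked subset of the matches shown above, and `corrections` and `normalize_words` are hypothetical names. A token such as `343nassau` is left out because it looks fused rather than misspelled, so it needs splitting, not substitution.

```python
# Hypothetical, hand-reviewed variants taken from the output above
NASSAU_VARIANTS = {
    'nasssau', 'nassaau', 'nassaus', 'nassua', 'nassu',
    'naussau', 'nasau', 'nassao', 'nassan', 'massau',
}
corrections = {variant: 'nassau' for variant in NASSAU_VARIANTS}


def normalize_words(words, corrections):
    """Return a new token list with known misspellings replaced."""
    return [corrections.get(word, word) for word in words]


# Token list of row 536 from the 'nasau' output above
tokens = ['101', 'east', 'hill', 'street', 'nasau', 'bahamas']
print(normalize_words(tokens, corrections))
# ['101', 'east', 'hill', 'street', 'nassau', 'bahamas']
```

In the notebook this could be applied with `df['address_wordlist'].apply(lambda words: normalize_words(words, corrections))`.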






### Shirley

In [48]:
fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'shirley')
    & (fuzzy_words_df['ratio_score'] > 60)
].sort_values('ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.5
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4366,3,shirley,1087,shorley,85.714286,85.714286,85.714286,85.714286,92.380952
4424,3,shirley,1193,shitley,85.714286,85.714286,85.714286,85.714286,93.333333
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
4069,3,shirley,525,shirlaw,71.428571,83.333333,71.428571,71.428571,88.571429
4645,3,shirley,1444,haley,66.666667,75.0,66.666667,66.666667,79.047619
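
The `ratio_score` values above are consistent with a normalized indel similarity: 2 × LCS(a, b) / (len(a) + len(b)) × 100, where LCS is the length of the longest common subsequence. That is how rapidfuzz documents `fuzz.ratio`, so I am assuming the column was computed that way. A from-scratch sketch (the function names are mine, not the library's) that reproduces a few rows:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Normalized indel similarity on a 0-100 scale."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 200.0 * lcs_length(a, b) / total


for match in ['shiriley', 'shirlaw', 'haley']:
    print(match, round(indel_ratio('shirley', match), 6))
```

For example, `shirley` and `haley` share the subsequence `hley` (length 4), so the score is 2 × 4 / 12 × 100 ≈ 66.67, matching the last row of the table.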


In [49]:
fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'shirley')
    & (fuzzy_words_df['partial_ratio_score'] > 60)
].sort_values('partial_ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4675,3,shirley,1479,y,25.0,100.0,25.0,25.0,0.0
4029,3,shirley,419,e,25.0,100.0,25.0,25.0,0.0
4379,3,shirley,1113,l,25.0,100.0,25.0,25.0,0.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
3909,3,shirley,123,s,25.0,100.0,25.0,25.0,74.285714
4388,3,shirley,1145,i,25.0,100.0,25.0,25.0,71.428571
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4937,3,shirley,1858,hi,44.444444,100.0,44.444444,44.444444,76.190476
4032,3,shirley,433,r,25.0,100.0,25.0,25.0,0.0
4000,3,shirley,355,h,25.0,100.0,25.0,25.0,71.428571
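
The Jaro-Winkler column looks odd here: `s`, `h`, and `i` score above 70, but `r`, `l`, `e`, and `y` score 0. That follows from the Jaro match window. Two characters only count as matching if their positions differ by at most `max(len) // 2 - 1`, which is 2 for a 7-letter word compared with a single letter, so only the first three letters of `shirley` are reachable. Below is a from-scratch sketch of the textbook algorithm, not the code of the `jaro` package; `prefix_weight=0.1` and `max_prefix=4` are the usual Winkler defaults and are an assumption about what produced the scores.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Only look at positions of s2 within the match window of i
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


for match in ['y', 'h', 's', 'hi', 'shirely']:
    print(match, round(100 * jaro_winkler('shirley', match), 6))
```

The sketch reproduces the table values for these rows: 0 for `y`, about 71.43 for `h`, and about 74.29 for `s`, which gets the extra prefix bonus.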


In [50]:
fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'shirley')
    & (fuzzy_words_df['token_sort_score'] > 60)
].sort_values('token_sort_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.5
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4366,3,shirley,1087,shorley,85.714286,85.714286,85.714286,85.714286,92.380952
4424,3,shirley,1193,shitley,85.714286,85.714286,85.714286,85.714286,93.333333
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
4069,3,shirley,525,shirlaw,71.428571,83.333333,71.428571,71.428571,88.571429
4645,3,shirley,1444,haley,66.666667,75.0,66.666667,66.666667,79.047619


In [51]:
fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'shirley')
    & (fuzzy_words_df['token_set_score'] > 60)
].sort_values('token_set_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.5
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4366,3,shirley,1087,shorley,85.714286,85.714286,85.714286,85.714286,92.380952
4424,3,shirley,1193,shitley,85.714286,85.714286,85.714286,85.714286,93.333333
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
4069,3,shirley,525,shirlaw,71.428571,83.333333,71.428571,71.428571,88.571429
4645,3,shirley,1444,haley,66.666667,75.0,66.666667,66.666667,79.047619


In [52]:
fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'shirley')
    & (fuzzy_words_df['jaro_winkler_score'] > 60)
].sort_values('jaro_winkler_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.500000
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4380,3,shirley,1119,shirleyand,82.352941,100.000000,82.352941,82.352941,94.000000
...,...,...,...,...,...,...,...,...,...
4544,3,shirley,1333,sociedad,40.000000,50.000000,40.000000,40.000000,60.119048
4561,3,shirley,1353,property,40.000000,42.857143,40.000000,40.000000,60.119048
4822,3,shirley,1676,highland,40.000000,50.000000,40.000000,40.000000,60.119048
4619,3,shirley,1416,vinicole,40.000000,50.000000,40.000000,40.000000,60.119048
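
At a threshold of 60 the Jaro-Winkler list fills up with unrelated words such as `sociedad` and `property`, and even above 85 it still admits `shirlaw`, which the addresses show is a separate building name (Shirlaw House). One option is to require two metrics to agree. The sketch below uses rows copied from the `ratio_score` output above; both cutoffs are illustrative guesses, not tuned values.

```python
# (match_value, ratio_score, jaro_winkler_score) copied from the output above
rows = [
    ('shiriley', 93.333333, 97.5),
    ('shirly', 92.307692, 97.142857),
    ('shorley', 85.714286, 92.380952),
    ('shitley', 85.714286, 93.333333),
    ('shirely', 85.714286, 97.142857),
    ('shriley', 85.714286, 96.190476),
    ('andshirley', 82.352941, 90.0),
    ('shirleyand', 82.352941, 94.0),
    ('shirlaw', 71.428571, 88.571429),
    ('haley', 66.666667, 79.047619),
]

# Hypothetical cutoffs: both scores must clear their threshold
MIN_RATIO = 80
MIN_JARO_WINKLER = 85

kept = [
    word for word, ratio, jaro_winkler in rows
    if ratio > MIN_RATIO and jaro_winkler > MIN_JARO_WINKLER
]
print(kept)
```

With these cutoffs `shirlaw` and `haley` drop out while the eight closer variants remain. The equivalent notebook filter would combine both conditions on `fuzzy_words_df` with `&`.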


In [53]:
# Likely variants of 'shirley' (Jaro-Winkler > 85), best match first
word_list = fuzzy_words_df[
    (fuzzy_words_df['original_value'] == 'shirley')
    & (fuzzy_words_df['jaro_winkler_score'] > 85)
].sort_values('jaro_winkler_score', ascending=False)['match_value']

# Show every address row whose token list contains the variant
for word in word_list:
    print(word)
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')

shiriley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1029,14079987,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court building elizabeth avenue...,"[suite, e2, union, court, building, elizabeth,..."




shirely


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1027,14079985,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1028,14079986,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court building elizabeth avenue...,"[suite, e2, union, court, building, elizabeth,..."




shirly


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1050,14080011,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1476,287069,THE BAHAMAS FINANCIAL CENTRE SHIRLY AND CHARLO...,,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,the bahamas financial centre shirly and charlo...,"[the, bahamas, financial, centre, shirly, and,..."




shriley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1140,14082122,Union Court Building; Suiete E-2; Elizabeth Av...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,union court building suiete e 2 elizabeth aven...,"[union, court, building, suiete, e, 2, elizabe..."




shirleyand


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
860,14064239,P O BOX N-3023 BAHAMAS FINANCIAL CENTRE; SHIRL...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n3023 bahamas financial centre shirleya...,"[po, box, n3023, bahamas, financial, centre, s..."




shitley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1017,14079975,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




shorley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
818,14050515,Morgan Trust Co. of Bahamas Ltd.; The Bahamas ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,morgan trust co of bahamas ltd the bahamas fin...,"[morgan, trust, co, of, bahamas, ltd, the, bah..."
943,14077452,Sasson House Building; 107 Shorley Street; P.O...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,sasson house building 107 shorley street po bo...,"[sasson, house, building, 107, shorley, street..."




andshirley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1037,14079997,Suite E - 2; Union Court Building; Elizabeth A...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




shirlaw


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
281,24000282,"SHIRLEY STREET, SHIRLAW HOUSE, P.O. BOX N-4839...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,shirley street shirlaw house po box n4839 nass...,"[shirley, street, shirlaw, house, po, box, n48..."
1591,81031328,Higgs & Johnson Corporate Services Ltd.; Shirl...,Higgs & Johnson Corporate Services Ltd.,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,higgs and johnson corporate services ltd shirl...,"[higgs, and, johnson, corporate, services, ltd..."
1880,33000070,"SHIRLEY STREET, SHIRLAW HOUSE, PO BOX N-4839, ...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,shirley street shirlaw house po box n4839 nass...,"[shirley, street, shirlaw, house, po, box, n48..."




shorline


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
573,14020532,64; Shorline; Double Road; Freeport; Grand Bah...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,64 shorline double road freeport grand bahamas...,"[64, shorline, double, road, freeport, grand, ..."






# Resources

## Fuzzy Matching

- [Fuzzing matching in pandas with fuzzywuzzy](https://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/)
- [Best Libraries for Fuzzy Matching In Python](https://medium.com/codex/best-libraries-for-fuzzy-matching-in-python-cbb3e0ef87dd)
- [Fuzzy String Matching](https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe)
- [Fuzzy String Comparison](https://stackoverflow.com/a/28467760)
- [How to do Fuzzy Matching on Pandas Dataframe Column Using Python?](https://www.geeksforgeeks.org/how-to-do-fuzzy-matching-on-pandas-dataframe-column-using-python/)

## Timeit

- [Timeit in Jupyter Notebook](https://linuxhint.com/timeit-jupyter-notebook/)