# NLP 07: Fuzzy Matching

Frequencies get me counts on exact matches. Now I'm interested in fuzzy matches: strings that are similar but not exact. What I want out of this are groups of words that refer to the same thing but are misspelled or abbreviated.

To get this information, I'll need to compare every word to every other word, which is computationally expensive. I'll need to keep that in mind when I get to comparing addresses for all the rows. For example, while comparing each word to every other word in the 2,000+ row word list may be entirely feasible, doing the same for all 400,000+ addresses at the same time would take too long. Partitioning the data into chunks, perhaps even smaller than processing by country, will be necessary.
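To put rough numbers on that cost, here is a minimal sketch that counts the comparisons involved. The sizes are the rounded figures from the paragraph above (2,000 words, 400,000 addresses), and the function names are my own, not from any library.

```python
def ordered_pair_count(n: int) -> int:
    """Comparisons needed when every item is compared to every other item."""
    return n * (n - 1)


def unordered_pair_count(n: int) -> int:
    """Comparisons needed if each pair is only scored once."""
    return n * (n - 1) // 2


# Rounded sizes taken from the text; the real counts are slightly higher.
for label, n in [("word list", 2_000), ("all addresses", 400_000)]:
    print(f"{label}: {ordered_pair_count(n):,} ordered pairs, "
          f"{unordered_pair_count(n):,} unordered pairs")
```

The count grows with the square of the number of items, so splitting the data into chunks and only comparing within each chunk cuts the total work roughly in proportion to the number of chunks.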

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from time import gmtime, strftime
import sys
import os
import io

import string
import re
# import itertools
# import nltk
# nltk.download('stopwords')

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from rapidfuzz import fuzz as rfuzz
import jaro

In [2]:
def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] +=1
    return freq_dict

In [3]:
df = pd.read_csv('data/parsed_bahamas_addresses.csv')

df['address_wordlist'] = df['working_address'].fillna('').str.split()

freq_df = pd.DataFrame.from_dict(
    frequency_ct(df['address_wordlist'].sum()
                ), orient='index').reset_index().rename(
    columns={'index':'word', 0:'count'}).sort_values('count', ascending=False)

In [29]:
freq_df.head(3)

Unnamed: 0,word,count
9,bahamas,2324
8,nassau,2043
6,box,1484


## Fuzzy match metrics

There are two main ways to measure string similarity:

- Levenshtein Distance: uses the number of single-character edits needed to convert the first string into the second
- Jaro-Winkler Distance: uses the number of matching characters and the number of transpositions

Edit operations include the following:


- Addition: Adding a character
- Deletion: Removing a character
- Substitution: Replacing a character
- Transposition: Swapping two adjacent characters

Levenshtein Distance uses the first three. An extension of this distance metric, Damerau-Levenshtein Distance, uses all four edit operations.
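To make the difference between the two distances concrete, here is a minimal pure-Python sketch. It is not the implementation used by any of the libraries in this notebook, and the second function is the restricted "optimal string alignment" variant of Damerau-Levenshtein, which I'm assuming is close enough for illustration. The function names are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using addition, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # addition
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein plus transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Transposition: the last two characters are swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("street", "steret"))   # swapped "re" costs two edits
print(osa_distance("street", "steret"))  # the same swap costs one edit
```

A swap like "steret" for "street" is a common typing error, which is why counting it as a single edit can be useful for misspellings.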

For more information on these metrics, Moosa Ali has a good write-up on [Medium](https://medium.com/) in his [Best Libraries for Fuzzy Matching In Python](https://medium.com/codex/best-libraries-for-fuzzy-matching-in-python-cbb3e0ef87dd) article.
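For the Jaro-Winkler side, here is a minimal sketch of the common formulation: count characters that match within a sliding window, penalize matched characters that appear out of order, then boost the score for a shared prefix of up to four characters. The prefix scale of 0.1 and the rule of only boosting scores above 0.7 are the usual conventions. It is an assumption on my part that the `jaro` package follows them, although they are consistent with the scores shown in this notebook.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between 0 and 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are this close in position
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 boost_threshold: float = 0.7) -> float:
    """Jaro similarity with a bonus for a shared prefix (up to 4 characters)."""
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(jaro_winkler("bahamas", "nassau"))
print(jaro_winkler("street", "stret"))
```

Because of the prefix bonus, this metric favors strings that start the same way, which suits truncated or abbreviated words.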

### `fuzzywuzzy`

The main library for performing fuzzy matching with Python is the `fuzzywuzzy` package. It uses Levenshtein Distance to calculate how similar or dissimilar two strings are.

`fuzzywuzzy` comes with four metrics:

- Ratio: compares the entire string with the characters in order
- Partial ratio: compares the shorter string with a substring of the same length from the longer string
- Token sort ratio: splits each string into words (tokens), sorts them, and compares the results, so word order is ignored
- Token set ratio: compares the sets of unique words in each string, so duplicate words and word order are ignored

For more information on these metrics, Catherine Gitau has a good explanation in her [Fuzzy String Matching](https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe) article on [Towards Data Science](https://towardsdatascience.com/).
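To show what the partial and token-based variants change, here is a simplified sketch built on the standard library's `difflib.SequenceMatcher` as a stand-in scorer. These are not the `fuzzywuzzy` implementations: the real partial ratio and token set ratio do more work than this, and the `_sketch` function names are my own.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in similarity score from 0 to 100."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def partial_ratio_sketch(a: str, b: str) -> float:
    """Best score of the shorter string against same-length slices of the longer."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    n = len(shorter)
    return max(ratio(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))


def token_sort_ratio_sketch(a: str, b: str) -> float:
    """Sort the words in each string before comparing."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set_ratio_sketch(a: str, b: str) -> float:
    """Drop duplicate words and sort before comparing (simplified)."""
    return ratio(" ".join(sorted(set(a.split()))),
                 " ".join(sorted(set(b.split()))))


a, b = "bay street west", "west bay street"
print(ratio(a, b), token_sort_ratio_sketch(a, b))
print(token_set_ratio_sketch("bay bay street", "bay street"))
print(partial_ratio_sketch("stree", "street"))
```

For single-word inputs, sorting or de-duplicating words changes nothing, which is consistent with the ratio, token sort, and token set columns being identical in the word-level results in this notebook. The token-based metrics should matter more once I compare full addresses.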

`fuzzywuzzy` also has several functions in its `process` module:

- `extract`: compares a single string to a list of strings. When used with a series, returns three values for each match
    - Second string that is being compared to the first string
    - Score
    - Index of the second string
- `dedupe`: removes duplicate values from a list of strings using a specified score threshold

There are several limitations to these methods.

`extract` can only take a single string for the first value. I'd need to set this up in a for loop to compare every item to every other item. For loops are notoriously slow, so this doesn't appear to be a good solution given the amount of data I want to process.

`dedupe` removes the duplicates without providing any information on what was removed. For my use case, I'll need to know what is matching with what. For this initial example using the word list, I'll want to replace misspellings with the correct word to update the values in the `working_address` column. When I apply this to the full addresses, I'll need to group rows that should be the same address and give them a new node id. I can then use the original node id and the newly assigned node id to correctly associate addresses with their counterparts in the rest of the Offshore Leaks data.
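As a rough idea of the "what matches with what" output I'm after for the word list, here is a minimal sketch that maps each word to its most frequent close match. It uses `difflib.SequenceMatcher` as a stand-in scorer rather than any of the fuzzy matching libraries, and the function names, the threshold of 80, and the word counts are all hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score from 0 to 100."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def build_replacements(word_counts: dict, threshold: float = 80) -> dict:
    """Map each word to the most frequent similar word that outranks it."""
    replacements = {}
    for word, count in word_counts.items():
        candidates = [
            (other_count, other)
            for other, other_count in word_counts.items()
            if other != word
            and other_count > count
            and similarity(word, other) >= threshold
        ]
        if candidates:
            # Assume the most frequent close match is the correct spelling
            replacements[word] = max(candidates)[1]
    return replacements


# Hypothetical counts for illustration only
counts = {"street": 1000, "stret": 3, "streeet": 1, "nassau": 2000}
print(build_replacements(counts))
```

Treating the most frequent spelling as the correct one is itself an assumption. It would need a manual check for words that are similar but legitimately different.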

### `rapidfuzz`

There is another package, `rapidfuzz`, that is based on `fuzzywuzzy` but supposedly runs faster. It also returns more detailed scores: `fuzzywuzzy` rounds to the nearest integer while `rapidfuzz` returns the unrounded decimal value. Occasionally the two libraries output different results, as with the partial ratio in the comparison below.

In [10]:
str1 = freq_df.iloc[0,0]
str2 = freq_df.iloc[1,0]

print(f'Comparison strings: String 1: "{str1}", String 2: "{str2}"', '\n')
print('Metric', '\t\t\tfuzzywuzzy', '\trapidfuzz')
print('Ratio:', '\t\t\t', fuzz.ratio(str1, str2), '\t\t', rfuzz.ratio(str1, str2))
print('Partial ratio:', '\t\t', fuzz.partial_ratio(str1, str2), '\t\t', rfuzz.partial_ratio(str1, str2))
print('Token sort ratio:', '\t', fuzz.token_sort_ratio(str1, str2), '\t\t', rfuzz.token_sort_ratio(str1, str2))
print('Token set ratio:', '\t', fuzz.token_set_ratio(str1, str2), '\t\t', rfuzz.token_set_ratio(str1, str2))
print('Jaro-Winkler:', '\t', jaro.jaro_winkler_metric(str1, str2))

Comparison strings: String 1: "bahamas", String 2: "nassau" 

Metric 			fuzzywuzzy 	rapidfuzz
Ratio: 			 31 		 30.76923076923077
Partial ratio: 		 33 		 50.0
Token sort ratio: 	 31 		 30.76923076923077
Token set ratio: 	 31 		 30.769230769230774
Jaro-Winkler: 	 0.5396825396825397
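The `rapidfuzz` ratio in the output above can be reproduced by hand. As far as I understand it, that ratio is a normalized insertion/deletion similarity: twice the length of the longest common subsequence divided by the combined length of both strings. The sketch below is my own check of that arithmetic, not the library's code, and the function names are my own.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity from 0 to 100 based on insertions and deletions only."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100 * 2 * lcs_length(a, b) / total


score = indel_ratio("bahamas", "nassau")
print(score)         # unrounded, in the style of rapidfuzz
print(round(score))  # rounded to an integer, in the style of fuzzywuzzy
```

"bahamas" and "nassau" share a common subsequence of length 2 and have 13 characters between them, giving 4/13, or about 30.77, which rounds to 31.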


## Storage format

Before deciding how to process all the data, I need to know what information to keep. Based on what I wanted to see above, I need:

- The original string
- The match string
- All five metrics

Additionally, since I'm working with dataframes, I'll also want the index of both the original and match values so that I can easily find them in the dataframe.

Ultimately, when I'm working with the full address string, it should look something like this:

<table>
    <tr>
        <td>address_index</td>
        <td>address</td>
        <td>match_index</td>
        <td>match</td>
        <td>ratio_score</td>
        <td>partial_ratio_score</td>
        <td>token_sort_score</td>
        <td>token_set_score</td>
        <td>jaro_winkler_score</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>1975</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>100</td>
        <td>100</td>
        <td>100</td>
        <td>100</td>
        <td>1.0</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>2068</td>
        <td>'goodmans bay corporate centre po box cb10976 nassau bahamas'</td>
        <td>78</td>
        <td>87</td>
        <td>75</td>
        <td>85</td>
        <td>0.78</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>548</td>
        <td>'second  floor goodmans bay corporate centre suite 261 po box cb12762 nassau bahamas'</td>
        <td>77</td>
        <td>69</td>
        <td>72</td>
        <td>85</td>
        <td>0.74</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>268</td>
        <td>'goodmans bay corporate centre po box cb12407 nassau bahamas'</td>
        <td>76</td>
        <td>85</td>
        <td>75</td>
        <td>85</td>
        <td>0.78</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>1511</td>
        <td>'goodmans bay corporate centre west bay street po box n3933 nassau bahamas'</td>
        <td>76</td>
        <td>70</td>
        <td>70</td>
        <td>79</td>
        <td>0.77</td>
    </tr>
    <tr>
        <td>9</td>
        <td>'ground floor goodmans bay corporate ce po box n 3933 nassau bahamas'</td>
        <td>1707</td>
        <td>'first floor goodmans bay corporate centre bay street nassau bahamas'</td>
        <td>76</td>
        <td>77</td>
        <td>73</td>
        <td>81</td>
        <td>0.73</td>
    </tr>
</table>

In [17]:
def calc_ffuzz_df(df, column):
    row_list = []
    
    for o_i, o_v in enumerate(df[column].sort_index()):
        for m_i, m_v in enumerate(df[column].sort_index()):
            if o_i != m_i:
                dict1 = {
                    'original_index': o_i,
                    'original_value': o_v,
                    'match_index': m_i,
                    'match_value': m_v,
                    'ratio_score': fuzz.ratio(o_v, m_v),
                    'partial_ratio_score': fuzz.partial_ratio(o_v, m_v),
                    'token_sort_score': fuzz.token_sort_ratio(o_v, m_v),
                    'token_set_score': fuzz.token_set_ratio(o_v, m_v),
                    'jaro_winkler_score': jaro.jaro_winkler_metric(o_v, m_v)
                }
                if (dict1['ratio_score']>0) | (dict1['partial_ratio_score']>0) | (dict1['token_sort_score']>0) | (dict1['token_set_score']>0) | (dict1['jaro_winkler_score']>0):
                    row_list.append(dict1)
    score_df = pd.DataFrame(row_list)
        
    return score_df

def calc_rfuzz_df(df, column):
    row_list = []
    
    for o_i, o_v in enumerate(df[column].sort_index()):
        for m_i, m_v in enumerate(df[column].sort_index()):
            if o_i != m_i:
                dict1 = {
                    'original_index': o_i,
                    'original_value': o_v,
                    'match_index': m_i,
                    'match_value': m_v,
                    'ratio_score': rfuzz.ratio(o_v, m_v),
                    'partial_ratio_score': rfuzz.partial_ratio(o_v, m_v),
                    'token_sort_score': rfuzz.token_sort_ratio(o_v, m_v),
                    'token_set_score': rfuzz.token_set_ratio(o_v, m_v),
                    'jaro_winkler_score': jaro.jaro_winkler_metric(o_v, m_v)
                }
                if (dict1['ratio_score']>0) | (dict1['partial_ratio_score']>0) | (dict1['token_sort_score']>0) | (dict1['token_set_score']>0) | (dict1['jaro_winkler_score']>0):
                    row_list.append(dict1)
    score_df = pd.DataFrame(row_list)
        
    return score_df

*Notes* on the `calc_ffuzz_df` and `calc_rfuzz_df` functions:

1. I collect each comparison as a dictionary in a list and build the dataframe once at the end. Initially, I used `.iloc` to concatenate a row onto the dataframe for each comparison. However, this was unacceptably slow (I was too impatient to even let it finish).
1. I don't store any row where all metrics are 0. If any metric has a value greater than 0, the whole row is captured.
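One possible refinement, sketched here rather than tested on the real data: if a scorer gives the same result in both directions, each pair only needs to be scored once, which halves the work. The sketch below uses `itertools.combinations` for that and keeps the list-of-dictionaries pattern. The scorer is a `difflib` stand-in rather than `rapidfuzz`, and the function names and the symmetry of the metrics are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_scorer(a: str, b: str) -> float:
    """Stand-in similarity score from 0 to 100 (not a rapidfuzz metric)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def score_pairs(values, scorer, min_score=0):
    """Score each unordered pair once and keep rows above min_score."""
    rows = []
    for (i, a), (j, b) in combinations(enumerate(values), 2):
        score = scorer(a, b)
        if score > min_score:
            rows.append({
                'original_index': i,
                'original_value': a,
                'match_index': j,
                'match_value': b,
                'score': score,
            })
    return rows


words = ['street', 'stret', 'nassau', 'box']
rows = score_pairs(words, stand_in_scorer)
for row in rows:
    print(row)
```

The trade-off is that each pair appears only once, so looking up all matches for a given word means checking both the original and match columns.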

### Function speed

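Running the `%%timeit` cell magic shows that the `rapidfuzz` version is indeed significantly faster: about 43 seconds per run versus about 2 minutes 42 seconds for the `fuzzywuzzy` version, or roughly 3.7 times faster on this word list. As such, I'll be using `rapidfuzz`.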

In [15]:
%%timeit
calc_rfuzz_df(freq_df, 'word')

43.4 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%%timeit
calc_ffuzz_df(freq_df, 'word')

2min 42s ± 934 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
fuzzy_words_df = calc_rfuzz_df(freq_df, 'word')
fuzzy_words_df['jaro_winkler_score'] = fuzzy_words_df['jaro_winkler_score']*100
fuzzy_words_df

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
0,0,annex,1,frederick,14.285714,25.000000,14.285714,14.285714,43.703704
1,0,annex,2,and,50.000000,80.000000,50.000000,50.000000,68.888889
2,0,annex,3,shirley,16.666667,28.571429,16.666667,16.666667,44.761905
3,0,annex,4,street,18.181818,28.571429,18.181818,18.181818,45.555556
4,0,annex,6,box,25.000000,50.000000,25.000000,25.000000,0.000000
...,...,...,...,...,...,...,...,...,...
2272337,2038,2ntl,2029,tortola,36.363636,50.000000,36.363636,36.363636,59.523810
2272338,2038,2ntl,2031,switzerland,26.666667,33.333333,26.666667,26.666667,56.060606
2272339,2038,2ntl,2033,montagne,33.333333,50.000000,33.333333,33.333333,58.333333
2272340,2038,2ntl,2034,sterline,33.333333,50.000000,33.333333,33.333333,58.333333


In [25]:
index_col = 'original_index'
metric_cts = pd.DataFrame(fuzzy_words_df[index_col].unique(), columns=[index_col])

for metric in ['ratio_score', 'partial_ratio_score', 'token_sort_score', 'token_set_score', 'jaro_winkler_score']:
    met_df = fuzzy_words_df.loc[fuzzy_words_df[metric]>60, [index_col, metric]].groupby(index_col).count().reset_index()
    metric_cts = metric_cts.merge(met_df, on=index_col, how='outer')
    
metric_cts = fuzzy_words_df[[index_col, 'original_value']].drop_duplicates().merge(metric_cts, on=index_col, how='outer')
metric_cts.columns = ['original_index', 'original_value', 'ratio_match_ct', 'partial_ratio_match_ct', 'token_sort_match_ct', 'token_set_match_ct', 'jaro_winkler_match_ct']
metric_cts

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
0,0,annex,5.0,58,5.0,5.0,55
1,1,frederick,3.0,63,3.0,3.0,67
2,2,and,14.0,239,14.0,14.0,133
3,3,shirley,19.0,60,19.0,19.0,122
4,4,street,18.0,81,18.0,18.0,179
...,...,...,...,...,...,...,...
2034,2034,sterline,21.0,127,21.0,21.0,238
2035,2035,bav,7.0,97,7.0,7.0,91
2036,2036,hast,6.0,109,6.0,6.0,114
2037,2037,coj,1.0,100,1.0,1.0,84


In [26]:
metric_cts[metric_cts['original_value'].str.contains(r'^n\d+|cb\d+|no\d+|\d+$')]

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
7,7,n4805,16.0,113,16.0,16.0,123
11,11,e2,3.0,412,3.0,3.0,24
15,15,n8188,17.0,97,17.0,17.0,88
19,19,n7785,18.0,102,18.0,18.0,88
20,20,n3708,8.0,120,8.0,8.0,113
...,...,...,...,...,...,...,...
2019,2019,3242,24.0,121,24.0,24.0,137
2020,2020,25,13.0,252,13.0,13.0,55
2022,2022,875,6.0,114,6.0,6.0,58
2023,2023,cr567,2.0,39,2.0,2.0,20


In [27]:
metric_cts = metric_cts[~metric_cts['original_value'].str.contains(r'^n\d+|^\w\w\d+|^\d+$')]
metric_cts

Unnamed: 0,original_index,original_value,ratio_match_ct,partial_ratio_match_ct,token_sort_match_ct,token_set_match_ct,jaro_winkler_match_ct
0,0,annex,5.0,58,5.0,5.0,55
1,1,frederick,3.0,63,3.0,3.0,67
2,2,and,14.0,239,14.0,14.0,133
3,3,shirley,19.0,60,19.0,19.0,122
4,4,street,18.0,81,18.0,18.0,179
...,...,...,...,...,...,...,...
2034,2034,sterline,21.0,127,21.0,21.0,238
2035,2035,bav,7.0,97,7.0,7.0,91
2036,2036,hast,6.0,109,6.0,6.0,114
2037,2037,coj,1.0,100,1.0,1.0,84


### Street

In [30]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['ratio_score']>50)].sort_values('ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333
5971,4,street,1743,treetops,71.428571,90.909091,71.428571,71.428571,81.944444


In [31]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['partial_ratio_score']>50)].sort_values('partial_ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5307,4,street,713,ee,50.000000,100.000000,50.000000,50.000000,55.555556
5975,4,street,1751,t,28.571429,100.000000,28.571429,28.571429,72.222222
5212,4,street,433,r,28.571429,100.000000,28.571429,28.571429,72.222222
5275,4,street,607,stree,90.909091,100.000000,90.909091,90.909091,96.666667
5101,4,street,123,s,28.571429,100.000000,28.571429,28.571429,75.000000
...,...,...,...,...,...,...,...,...,...
5760,4,street,1442,systems,46.153846,54.545455,46.153846,46.153846,64.285714
5759,4,street,1441,netware,46.153846,54.545455,46.153846,46.153846,53.174603
5748,4,street,1429,microsystems,33.333333,54.545455,33.333333,33.333333,50.000000
5709,4,street,1380,ventures,42.857143,54.545455,42.857143,42.857143,63.888889


In [32]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['token_sort_score']>50)].sort_values('token_sort_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333
5971,4,street,1743,treetops,71.428571,90.909091,71.428571,71.428571,81.944444


In [33]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['token_set_score']>50)].sort_values('token_set_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.0,90.909091,90.909091,96.666667
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.0
5902,4,street,1629,strets,83.333333,90.909091,83.333333,83.333333,93.333333
5556,4,street,1179,streer,83.333333,90.909091,83.333333,83.333333,93.333333
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5428,4,street,947,strees,83.333333,90.909091,83.333333,83.333333,93.333333
6084,4,street,1968,stre,80.0,100.0,80.0,80.0,93.333333
5971,4,street,1743,treetops,71.428571,90.909091,71.428571,71.428571,81.944444


In [34]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['jaro_winkler_score']>50)].sort_values('jaro_winkler_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
5568,4,street,1198,streeet,92.307692,90.909091,92.307692,92.307692,97.142857
5171,4,street,322,stret,90.909091,88.888889,90.909091,90.909091,96.666667
5275,4,street,607,stree,90.909091,100.000000,90.909091,90.909091,96.666667
5870,4,street,1579,steret,83.333333,83.333333,83.333333,83.333333,95.555556
5993,4,street,1790,streeets,85.714286,90.909091,85.714286,85.714286,95.000000
...,...,...,...,...,...,...,...,...,...
5788,4,street,1478,consultores,47.058824,54.545455,47.058824,47.058824,50.505051
5647,4,street,1304,consultants,35.294118,36.363636,35.294118,35.294118,50.505051
5553,4,street,1171,redomiciled,35.294118,50.000000,35.294118,35.294118,50.505051
5691,4,street,1356,association,23.529412,33.333333,23.529412,23.529412,50.505051


In [35]:
word_list = fuzzy_words_df[(fuzzy_words_df['original_value']=='street') & (fuzzy_words_df['jaro_winkler_score']>75)].sort_values('jaro_winkler_score', ascending=False)['match_value']
for word in word_list:
    print(word)
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')

streeet


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1031,14079989,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




stret


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
156,24000157,"SG HAMBROS BUILDING WEST BAY STRET, P.O. BOX C...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,sg hambros building west bay stret po box cb12...,"[sg, hambros, building, west, bay, stret, po, ..."
694,14035596,Charlotte House; Charlotte Stret; POB N-65; na...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,charlotte house charlotte stret pob n 65 nassa...,"[charlotte, house, charlotte, stret, pob, n, 6..."
1383,286502,"SUITE 306, 3/F CENTRE OF COMMERCE 1 BAY STRET ...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,suite 306 3f centre of commerce 1 bay stret na...,"[suite, 306, 3f, centre, of, commerce, 1, bay,..."




stree


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
350,24000351,"2ND. FL. GOLD CIRCLE HSE. EAST BAY STREE, P.O....",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,second fl gold circle hse east bay stree po bo...,"[second, fl, gold, circle, hse, east, bay, str..."
546,14012243,2ND FLOOR; ANSBACHER HOUSE; BANK LANE AN EAST ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,second floor ansbacher house bank lane an east...,"[second, floor, ansbacher, house, bank, lane, ..."
1032,14079990,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1862,33000052,"2ND FL GOLD CIRCLE HSE EAST BAY STREE, PO BOX ...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,second fl gold circle hse east bay stree po bo...,"[second, fl, gold, circle, hse, east, bay, str..."




steret


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1412,245979,"1st Floor Norfolk House Frederick Steret, Nass...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,first floor norfolk house frederick steret nas...,"[first, floor, norfolk, house, frederick, ster..."




streeets


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1677,81058625,Bahamas Financial Centre; Shirley & Charlotte ...,Bahamas Financial Centre,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,bahamas financial centre shirley and charlotte...,"[bahamas, financial, centre, shirley, and, cha..."




streer


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
995,14079952,Suite E-2; Unioin Court Building; Elizabeth Av...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 unioin court building elizabeth avenu...,"[suite, e2, unioin, court, building, elizabeth..."
999,14079956,Suite E-2; Union Court Buiding; Elizabeth Aven...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court buiding elizabeth avenue ...,"[suite, e2, union, court, buiding, elizabeth, ..."




stre


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2153,120019222,"THE CHAMBERS OF MESSRS, MCKINNEY, BANCRO FT & ...","THE CHAMBERS OF MESSRS, MCKINNEY, BANCRO FT & ...",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current th...,,the chambers of messrs mckinney bancro ft and ...,"[the, chambers, of, messrs, mckinney, bancro, ..."




strees


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
607,14030183,BAHAMAS FINANCIAL CENTRE; P.O. BOX N-3023 SHIR...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bahamas financial centre po box n3023 shirley ...,"[bahamas, financial, centre, po, box, n3023, s..."
775,14044324,he Bahamas Financial Centre Fourth Floor Shirl...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,he bahamas financial centre fourth floor shirl...,"[he, bahamas, financial, centre, fourth, floor..."
1082,14080632,The Bahamas Financial centre; Charlotte and Sh...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre charlotte and shi...,"[the, bahamas, financial, centre, charlotte, a..."
1086,14080636,The Bahamas Financial Centre Fourth Floor Shir...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre fourth floor shir...,"[the, bahamas, financial, centre, fourth, floo..."




strets


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1475,287068,THE BAHAMAS FINANCIAL CENTRE FOUTH FLOOR SHIRL...,,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,the bahamas financial centre fouth floor shirl...,"[the, bahamas, financial, centre, fouth, floor..."




ste


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
916,14077057,Saffrey Square; Ste 205; Bank Lane; Nassau; Ba...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,saffrey square ste 205 bank lane nassau bahamas,"[saffrey, square, ste, 205, bank, lane, nassau..."
2220,240452852,"P O BOX CR56766 STE 875 — NASSAU, BAHAMAS — NE...",,Bahamas,BHS,Pandora Papers - SFM Corporate Services,Provider data is current through 2012,,po box cr56766 ste 875 nassau bahamas new prov...,"[po, box, cr56766, ste, 875, nassau, bahamas, ..."




secret


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1722,81077816,Charlotte House; Charlotte Secret; POBOX N-65;...,Charlotte House,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,charlotte house charlotte secret po box n65 na...,"[charlotte, house, charlotte, secret, po, box,..."




st


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
450,24000451,"ST. MALCOLM BUILDING, VICTORIA & BAY STS, P.O....",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,st malcolm building victoria and bay street po...,"[st, malcolm, building, victoria, and, bay, st..."
960,14078480,St. Andrew's Court; Frederick Street Steps; Na...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,st andrews court frederick street steps nassau...,"[st, andrews, court, frederick, street, steps,..."
961,14078481,St. Andrew's Court; Frederick Street Steps; P....,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,st andrews court frederick street steps p o bo...,"[st, andrews, court, frederick, street, steps,..."
1935,33000132,ST ANDREW'S COURT FREDERICK ST STEPS PO BOX N-...,,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,st andrews court frederick street steps po box...,"[st, andrews, court, frederick, street, steps,..."
2161,120015591,"ST. ANDREW'S COURT FREDERICK STREET, STEPS, NA...","ST. ANDREW'S COURT FREDERICK STREET, STEPS, NA...",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current th...,,st andrews court frederick street steps nassau...,"[st, andrews, court, frederick, street, steps,..."




treetops


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1618,81038069,Treetops; Lyford Cay; Nassau; Bahamas,Treetops,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,treetops lyford cay nassau bahamas,"[treetops, lyford, cay, nassau, bahamas]"




trustees


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
585,14028452,as trustees of the Wave Trust - Charlotte Hous...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,as trustees of the wavenue trust charlotte hou...,"[as, trustees, of, the, wavenue, trust, charlo..."
773,14043947,Guernsey as trustees of the Archon Trust (Baha...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,guernsey as trustees of the archon trust bahamas,"[guernsey, as, trustees, of, the, archon, trus..."
1132,14081947,UBS Trustees (Bahamas) Ltd.; East Bay Street; ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,ubs trustees bahamas ltd east bay street po bo...,"[ubs, trustees, bahamas, ltd, east, bay, stree..."
1384,287385,"UBS Trustees (Bahamas) Ltd East Bay Street, PO...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,ubs trustees bahamas ltd east bay street po bo...,"[ubs, trustees, bahamas, ltd, east, bay, stree..."
1440,239867,"UBS Trustees (Bahamas) Ltd, UBS House, East Ba...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,ubs trustees bahamas ltd ubs house east bay st...,"[ubs, trustees, bahamas, ltd, ubs, house, east..."
1750,81083245,Rhone Trustees (Bahamas) Ltd.; PO Box SP 63131...,Rhone Trustees (Bahamas) Ltd.,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,rhone trustees bahamas ltd po box sp 63131 bay...,"[rhone, trustees, bahamas, ltd, po, box, sp, 6..."




sterline


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2254,240492536,"MONTAGNE STERLINE CENTRE. EAST BAV STREET, NAS...",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,montagne sterline centre east bav street nassa...,"[montagne, sterline, centre, east, bav, street..."




se


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
764,14043542,GOODMANS BAY; CORPORATE CENTRE; WEST BAY STREE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,goodmans bay corporate centre west bay street ...,"[goodmans, bay, corporate, centre, west, bay, ..."




trustee


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1193,14087509,(THE PRIVATE CORPORATION LIMITED AS TRUSTEE OF...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the private corporation limited as trustee of ...,"[the, private, corporation, limited, as, trust..."
1580,81028921,As Trustee of The Tuleu Family Settlement; 3 F...,As Trustee of The Tuleu Family Settlement,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,as trustee of the tuleu family settlement 3 fl...,"[as, trustee, of, the, tuleu, family, settleme..."
1629,81039653,"Trustee of Settlement, T-737, Box N 3933, Shir...","Trustee of Settlement, T-737, Box N 3933, Shir...",Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,trustee of settlement t 737 box n 3933 shirley...,"[trustee, of, settlement, t, 737, box, n, 3933..."




stephane


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1685,81061466,Mr. Stephane Pizzo; Mignon (Nassau) Limited; N...,Mr. Stephane Pizzo,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,mr stephane pizzo mignon nassau limited nassau...,"[mr, stephane, pizzo, mignon, nassau, limited,..."




regent


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
79,24000080,"REGENT CENTRE, P.O. BOX F-40132 FREEPORT, GRAN...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,regent centre po box f40132 freeport grand bahama,"[regent, centre, po, box, f40132, freeport, gr..."
104,24000105,"SUITE A, REGENT CENTRE, P.O. BOX F-42682 FREEP...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite a regent centre po box f42682 freeport g...,"[suite, a, regent, centre, po, box, f42682, fr..."
140,24000141,"SUITE 2A SECOND FLOOR REGENT CENTRE, P.O. BOX ...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite 2a second floor regent centre po box f60...,"[suite, 2a, second, floor, regent, centre, po,..."
161,24000162,"SUITE 6,7 & 8 REGENT CENTRE, P.O. BOX F-502 FR...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite 6 7 and 8 regent centre po box f502 free...,"[suite, 6, 7, and, 8, regent, centre, po, box,..."
181,24000182,"P.O. BOX F-40210, FREEPORT, GRAND BAHAMA 2B RE...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,po box f40210 freeport grand bahama 2b regent ...,"[po, box, f40210, freeport, grand, bahama, 2b,..."
198,24000199,"SUITE 2B REGENT CENTRE, EXPLORERS WAY, P.O. BO...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite 2b regent centre explorers way po box f4...,"[suite, 2b, regent, centre, explorers, way, po..."
354,24000355,"SUITE G, REGENT CENTRE, P.O. BOX F-60217 FREEP...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite g regent centre po box f60217 freeport g...,"[suite, g, regent, centre, po, box, f60217, fr..."
442,24000443,"SUITE B REGENT CENTRE EXPLORERS WAY, P.O. BOX ...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite b regent centre explorers way po box f 6...,"[suite, b, regent, centre, explorers, way, po,..."
461,24000462,"1 REGENT CENTRE 4A, P.O. BOX F-60127 FREEPORT,...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,1 regent centre 4a po box f60127 freeport gran...,"[1, regent, centre, 4a, po, box, f60127, freep..."
1336,14105087,REGENT INTERNATIONAL MANAGEMENT S.A. SUITE E-2...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,regent international management s a suite e 2 ...,"[regent, international, management, s, a, suit..."




sede


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
951,14077696,"Sede Nassau-BA (capital), Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,sede nassau ba capital bahamas,"[sede, nassau, ba, capital, bahamas]"




seventh


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
334,24000335,"7TH TERRACE CENTREVILLE, P.O. BOX N-10095, NAS...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,seventh terrace centreville po box n10095 nass...,"[seventh, terrace, centreville, po, box, n1009..."
498,24000499,"7TH TERRACE, CENTERVILLE (WEST), P.O. BOX EE-1...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,seventh terrace centerville west po box ee1595...,"[seventh, terrace, centerville, west, po, box,..."
1877,33000067,"7TH TERRACE CENTREVILLE, PO BOX N-10095, NASSA...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,seventh terrace centreville po box n10095 nass...,"[seventh, terrace, centreville, po, box, n1009..."




steps


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
960,14078480,St. Andrew's Court; Frederick Street Steps; Na...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,st andrews court frederick street steps nassau...,"[st, andrews, court, frederick, street, steps,..."
961,14078481,St. Andrew's Court; Frederick Street Steps; P....,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,st andrews court frederick street steps p o bo...,"[st, andrews, court, frederick, street, steps,..."
1569,81026720,"51 Frederick Street Steps,; P.O. Box N - 1136;...","51 Frederick Street Steps,",Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,51 frederick street steps po box n 1136 nassau...,"[51, frederick, street, steps, po, box, n, 113..."
1935,33000132,ST ANDREW'S COURT FREDERICK ST STEPS PO BOX N-...,,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,st andrews court frederick street steps po box...,"[st, andrews, court, frederick, street, steps,..."
2161,120015591,"ST. ANDREW'S COURT FREDERICK STREET, STEPS, NA...","ST. ANDREW'S COURT FREDERICK STREET, STEPS, NA...",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current th...,,st andrews court frederick street steps nassau...,"[st, andrews, court, frederick, street, steps,..."




summerset


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1533,81021574,90 Summerset House; Thomson Blvd.; NASSAU; Bah...,90 Summerset House,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,90 summerset house thomson boulevard nassau ba...,"[90, summerset, house, thomson, boulevard, nas..."




sommerset


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1801,81013692,Sommerset House; Thomson Blvd.; NASSAU; Bahamas,Sommerset House,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,sommerset house thomson boulevard nassau bahamas,"[sommerset, house, thomson, boulevard, nassau,..."




s


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
46,24000047,"SUITE D, S.G. HAMBROS BLDG., P.O. BOX N-3741, ...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite d s g hambros bldg po box n3741 nassau b...,"[suite, d, s, g, hambros, bldg, po, box, n3741..."
213,24000214,"F.E.S. BUILDING - MILTON STREET, P.O. BOX F-44...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,f e s building milton street po box f44181 fre...,"[f, e, s, building, milton, street, po, box, f..."
307,24000308,"SUITE 1, K. S. DARLING BUILDING, P.O. BOX N-49...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite 1 k s darling building po box n4922 dowd...,"[suite, 1, k, s, darling, building, po, box, n..."
493,24000494,"K S DARLING BLDG. DOWDESWELL STREET, P.O. BOX ...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,k s darling bldg dowdeswell street po box n948...,"[k, s, darling, bldg, dowdeswell, street, po, ..."
759,14043537,GOODMAN S BAY CORPORATE CENTER - WEST BAY STRE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,goodman s bay corporate center west bay street...,"[goodman, s, bay, corporate, center, west, bay..."
760,14043538,GOODMAN S BAY CORPORATE CENTER WEST BAY STREE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,goodman s bay corporate center west bay street...,"[goodman, s, bay, corporate, center, west, bay..."
928,14077070,SAFFREY SQUARE; SUITE 205; BANK LANE P.O. BOX ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,saffrey square suite 205 bank lane po box n818...,"[saffrey, square, suite, 205, bank, lane, po, ..."
1129,14081728,TRUSBAN INTERNATIONAL S.A. SUITE E-2 UNION COU...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,trusban international s a suite e 2 union cour...,"[trusban, international, s, a, suite, e, 2, un..."
1200,14090904,ADMINOTIS S.A. SUITE E-2 UNION COURT BUILDING ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,adminotis s a suite e 2 union court building e...,"[adminotis, s, a, suite, e, 2, union, court, b..."
1208,14091417,ANATOLE TRADING S.A. SUITE E-2 UNION COURT BUI...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,anatole trading s a suite e 2 union court buil...,"[anatole, trading, s, a, suite, e, 2, union, c..."






### Bahamas

In [36]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['ratio_score']>60)].sort_values('ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10420,9,bahamas,904,bahmas,92.307692,83.333333,92.307692,92.307692,92.777778
10300,9,bahamas,659,bhamas,92.307692,90.909091,92.307692,92.307692,85.714286
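
The `ratio_score` values above are consistent with a normalized indel similarity: twice the length of the longest common subsequence (LCS), divided by the combined length of the two strings, scaled to 0-100. As far as I know this is how rapidfuzz defines `fuzz.ratio`, but the cell that built `fuzzy_words_df` is not shown in this part of the notebook, so the sketch below is a plain-Python illustration and not the library's code. The names `lcs_length` and `indel_ratio` are my own.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Assumed formula: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


# Pairs taken from the table above
for other in ["bahamaas", "bahmas", "s"]:
    print(other, round(indel_ratio("bahamas", other), 6))
```

For `bahamas` and `bahamaas` the LCS is 7 characters out of 15 in total, which gives 93.33 and matches the table. It also shows why every single-character insertion ties at 93.33 and every single-character deletion ties at 92.31: this score only depends on how many characters are shared, not on where the difference sits.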


In [37]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['partial_ratio_score']>70)].sort_values('partial_ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10543,9,bahamas,1100,ba,44.444444,100.0,44.444444,44.444444,80.952381
10367,9,bahamas,806,bah,60.0,100.0,60.0,60.0,86.666667
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10551,9,bahamas,1125,bahamaspo,87.5,100.0,87.5,87.5,95.555556
10082,9,bahamas,123,s,25.0,100.0,25.0,25.0,0.0
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10432,9,bahamas,926,as,44.444444,100.0,44.444444,44.444444,54.761905
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095
10780,9,bahamas,1425,m,25.0,100.0,25.0,25.0,0.0


In [38]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['token_sort_score']>65)].sort_values('token_sort_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095
10420,9,bahamas,904,bahmas,92.307692,83.333333,92.307692,92.307692,92.777778


In [39]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['token_set_score']>65)].sort_values('token_set_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
11087,9,bahamas,1911,abahamas,93.333333,100.0,93.333333,93.333333,81.547619
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095
10420,9,bahamas,904,bahmas,92.307692,83.333333,92.307692,92.307692,92.777778


In [40]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['jaro_winkler_score']>75)].sort_values('jaro_winkler_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
10421,9,bahamas,907,bahamasc,93.333333,100.0,93.333333,93.333333,97.5
10909,9,bahamas,1602,bahamasa,93.333333,100.0,93.333333,93.333333,97.5
10603,9,bahamas,1200,bahamas6,93.333333,100.0,93.333333,93.333333,97.5
10544,9,bahamas,1101,bahamaas,93.333333,92.307692,93.333333,93.333333,97.5
10436,9,bahamas,930,bahamas1,93.333333,100.0,93.333333,93.333333,97.5
10107,9,bahamas,188,bahama,92.307692,100.0,92.307692,92.307692,97.142857
10491,9,bahamas,1018,bahaams,85.714286,85.714286,85.714286,85.714286,97.142857
10233,9,bahamas,516,bahams,92.307692,90.909091,92.307692,92.307692,97.142857
10551,9,bahamas,1125,bahamaspo,87.5,100.0,87.5,87.5,95.555556
10424,9,bahamas,914,ahamas,92.307692,100.0,92.307692,92.307692,95.238095
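
Jaro-Winkler treats short tokens very differently from the partial ratio: against `bahamas`, the token `s` scores 100 on `partial_ratio_score` but 0 on `jaro_winkler_score`. The sketch below is a textbook Jaro-Winkler in plain Python, assuming the usual prefix weight of 0.1 and a maximum prefix length of 4. It is not the code of the `jaro` or rapidfuzz packages, and the function names are hypothetical.

```python
def jaro(a, b):
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` positions apart
    window = max(max(la, lb) // 2 - 1, 0)
    a_matched = [False] * la
    b_matched = [False] * lb
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(lb, i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    a_seq = [a[i] for i in range(la) if a_matched[i]]
    b_seq = [b[j] for j in range(lb) if b_matched[j]]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / la + matches / lb + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the shared prefix."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


for other in ["bahama", "abahamas", "s"]:
    print(other, round(100 * jaro_winkler("bahamas", other), 6))
```

With a 7-character word the match window is 2 positions, and the only `s` in `bahamas` sits at position 6, so the one-character token `s` at position 0 is out of reach and the score is 0. The prefix bonus is also why `abahamas` (no shared prefix) lands at 81.5 while `bahamasa` reaches 97.5, even though both have the same `ratio_score`.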


In [41]:
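# Words that fuzzy-match 'bahamas' on Jaro-Winkler (> 75), best matches first
is_match = (fuzzy_words_df['original_value']=='bahamas') & (fuzzy_words_df['jaro_winkler_score']>75)
word_list = fuzzy_words_df[is_match].sort_values('jaro_winkler_score', ascending=False)['match_value']
# For each matched word, show every address whose token list contains it
for word in word_list:
    print(word)
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')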

bahamasc


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
570,14018521,51 Frederick Street; P.O. Box N-1136; Nassau; ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,51 frederick street po box n1136 nassau bahamasc,"[51, frederick, street, po, box, n1136, nassau..."




bahamasa


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1442,240054,Winterbotham Place Marlborough & Queen Streets...,,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,winterbotham place marlborough and queen stree...,"[winterbotham, place, marlborough, and, queen,..."




bahamas6


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1041,14080001,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




bahamaas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
834,14051201,"NASSAU, BAHAMAAS",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,nassau bahamaas,"[nassau, bahamaas]"




bahamas1


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
587,14028501,Atlantic House; 3rd Floor; Collins Avenue & 2n...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,atlantic house third floor collins avenue and ...,"[atlantic, house, third, floor, collins, avenu..."
1177,14085238,WINTERBOTHAM PLACE; MARLBOROUGH & QUEEN STREET...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,winterbotham place marlborough and queen stree...,"[winterbotham, place, marlborough, and, queen,..."




bahama


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
76,24000077,"P.O. BOX F-40773, FREEPORT, GR. BAHAMA 242-352...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,po box f40773 freeport gr bahama 2423527291,"[po, box, f40773, freeport, gr, bahama, 242352..."
79,24000080,"REGENT CENTRE, P.O. BOX F-40132 FREEPORT, GRAN...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,regent centre po box f40132 freeport grand bahama,"[regent, centre, po, box, f40132, freeport, gr..."
83,24000084,"CHANCERY HOUSE, P.O. BOX F-42578 FREEPORT, GRA...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,chancery house po box f42578 freeport grand ba...,"[chancery, house, po, box, f42578, freeport, g..."
87,24000088,"CHANCERY COURT THE MALL, P.O. BOX F-42643 FREE...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,chancery court the mall po box f42643 freeport...,"[chancery, court, the, mall, po, box, f42643, ..."
104,24000105,"SUITE A, REGENT CENTRE, P.O. BOX F-42682 FREEP...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,suite a regent centre po box f42682 freeport g...,"[suite, a, regent, centre, po, box, f42682, fr..."
...,...,...,...,...,...,...,...,...,...,...
2083,33000290,"REGENT CENTRE PO BOX F-40132 FREEPORT, GR BAHA...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,regent centre po box f40132 freeport gr bahama...,"[regent, centre, po, box, f40132, freeport, gr..."
2084,33000291,"REGENT CENTRE PO BOX F-40132 FREEPORT, GRAND B...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,regent centre po box f40132 freeport grand bahama,"[regent, centre, po, box, f40132, freeport, gr..."
2085,33000293,"SUITE 10 SEVENTEEN CENTRE, BANK LANE PO BOX F-...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,suite 10 seventeen centre bank lane po box f43...,"[suite, 10, seventeen, centre, bank, lane, po,..."
2091,33000299,"FIRST COMMERCIAL CENTRE SUITE 1, 2ND FL PO BOX...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,first commercial centre suite 1 second fl po b...,"[first, commercial, centre, suite, 1, second, ..."




bahaams


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
725,14038328,Elizabeth Avenue and Shirley Street; Union Cou...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,elizabeth avenue and shirley street union cour...,"[elizabeth, avenue, and, shirley, street, unio..."




bahams


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
274,24000275,"P.O. BOX N 8680, NASSAU, BAHAMS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,po box n 8680 nassau bahams,"[po, box, n, 8680, nassau, bahams]"
559,14018044,4TH FLOOR THE BAHAMAS FINANCIAL CENTRE SHIRLEY...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,fourth floor the bahamas financial centre shir...,"[fourth, floor, the, bahamas, financial, centr..."
631,14030207,BAHAMS FINANCILA CENTRE PO BOX N-3023 SHIRLEY ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bahams financila centre po box n3023 shirley a...,"[bahams, financila, centre, po, box, n3023, sh..."
826,14050608,MOSSACK FONSECA & CO (BAHAMS) LIMITED SAFFREY ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,mossack fonseca and co bahams limited saffrey ...,"[mossack, fonseca, and, co, bahams, limited, s..."
867,14064246,P.O.BOX N-3944; PROVIDENCE HOUSE; EAST HILL ST...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n3944 providence house east hill street...,"[po, box, n3944, providence, house, east, hill..."
889,14064268,P O BOX N8188 NASSAU BAHAMS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n8188 nassau bahams,"[po, box, n8188, nassau, bahams]"
1910,33000104,"NASSAU, BAHAMS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,nassau bahams,"[nassau, bahams]"




bahamaspo


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
878,14064257,P.O. Box N-7768; Nassau; BahamasP.O. Box N-776...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n7768 nassau bahamaspo box n7768 nassau...,"[po, box, n7768, nassau, bahamaspo, box, n7768..."




ahamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
576,14025414,ahamas Financial Centre; 4th Floor; Shirley & ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,ahamas financial centre fourth floor shirley a...,"[ahamas, financial, centre, fourth, floor, shi..."




bahanas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
875,14064254,P.O. Box N-7757; East Bay Street; Nassau; Bahanas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n7757 east bay street nassau bahanas,"[po, box, n7757, east, bay, street, nassau, ba..."




baham


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1930,33000127,"CHANCERY COURT, THE MALL PO BOX F-42519 FREEPO...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,chancery court the mall po box f42519 freeport...,"[chancery, court, the, mall, po, box, f42519, ..."




bahmas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
563,14018385,50 Shirley Street; Nassau; Bahmas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,50 shirley street nassau bahmas,"[50, shirley, street, nassau, bahmas]"
668,14033053,c/o Morgan Trust Company of The Bahamas Limite...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,co morgan trust company of the bahamas limited...,"[co, morgan, trust, company, of, the, bahamas,..."
749,14042830,FOURTH FLOOR; THE BAHAMAS FINANCIAL CENTRE; SH...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,fourth floor the bahamas financial centre shir...,"[fourth, floor, the, bahamas, financial, centr..."
932,14077074,Saffrey Square; Suite 205; Bank Lane; P.O. Box...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,saffrey square suite 205 bank lane po box n818...,"[saffrey, square, suite, 205, bank, lane, po, ..."
1152,14085026,"WEST BAY STREET NASSAU, BAHMAS",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,west bay street nassau bahmas,"[west, bay, street, nassau, bahmas]"




bah


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
495,24000496,"SHIRLEY & CHARLOTTE STS BAH. FIN. CENTRE, P.O....",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,shirley and charlotte street bah fin centre po...,"[shirley, and, charlotte, street, bah, fin, ce..."
760,14043538,GOODMAN S BAY CORPORATE CENTER WEST BAY STREE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,goodman s bay corporate center west bay street...,"[goodman, s, bay, corporate, center, west, bay..."




brahmas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1120,14080679,"The Brahmas Financial Centre, Shirley and Char...",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the brahmas financial centre shirley and charl...,"[the, brahmas, financial, centre, shirley, and..."




bhamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
390,24000391,"P.O. BOX N-4485, NASSAU BHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,po box n4485 nassau bhamas,"[po, box, n4485, nassau, bhamas]"




abahamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1901,33000091,NEW PROVIDENCE ABAHAMAS,,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,new providence abahamas,"[new, providence, abahamas]"




ba


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
833,14051200,Nassau-BA-Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,nassau ba bahamas,"[nassau, ba, bahamas]"
951,14077696,"Sede Nassau-BA (capital), Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,sede nassau ba capital bahamas,"[sede, nassau, ba, capital, bahamas]"




bazaar


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
549,14012324,2nd Floor; International Bazaar; Bay Street; P...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,second floor international bazaar bay street p...,"[second, floor, international, bazaar, bay, st..."




bosham


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
199,24000200,"#6 BOSHAM CLOSE, CAMPERDOWN HEIGHTS P.O. BOX S...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,6 bosham close camperdown heights po box sp 63...,"[6, bosham, close, camperdown, heights, po, bo..."
1887,33000077,"#6 BOSHAM CLOSE, CAMPERDOWN HEIGHTS PO BOX SP ...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,6 bosham close camperdown heights po box sp 63...,"[6, bosham, close, camperdown, heights, po, bo..."
2180,240003759,"NO. 6 BOSHAM CLOSE, CAMPERDOWN HEIGHTS NEW PRO...",,Bahamas,BHS,"Pandora Papers - Alemán, Cordero, Galindo & Le...",Provider data is current through 2018,,no 6 bosham close camperdown heights new provi...,"[no, 6, bosham, close, camperdown, heights, ne..."




hamas


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2243,240492153,"MONTAGUE STERLING CENTRE. EAST BAY STREET, NAS...",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,montague sterling centre east bay street nassa...,"[montague, sterling, centre, east, bay, street..."






### Nassau

In [42]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['ratio_score']>60)].sort_values('ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9350,8,nassau,1083,massau,83.333333,90.909091,83.333333,83.333333,88.888889
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333


In [43]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['partial_ratio_score']>70)].sort_values('partial_ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
8654,8,nassau,49,n,28.571429,100.0,28.571429,28.571429,75.0
8792,8,nassau,238,a,28.571429,100.0,28.571429,28.571429,72.222222
8995,8,nassau,530,ss,50.0,100.0,50.0,50.0,77.777778
9853,8,nassau,1755,na,50.0,100.0,50.0,50.0,82.222222
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9249,8,nassau,926,as,50.0,100.0,50.0,50.0,77.777778
9290,8,nassau,993,343nassau,80.0,100.0,80.0,80.0,88.888889
8707,8,nassau,123,s,28.571429,100.0,28.571429,28.571429,72.222222
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857


In [44]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['token_sort_score']>80)].sort_values('token_sort_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9350,8,nassau,1083,massau,83.333333,90.909091,83.333333,83.333333,88.888889
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333


In [45]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['token_set_score']>75)].sort_values('token_set_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9350,8,nassau,1083,massau,83.333333,90.909091,83.333333,83.333333,88.888889
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333


In [46]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['jaro_winkler_score']>85)].sort_values('jaro_winkler_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
9110,8,nassau,698,nasssau,92.307692,83.333333,92.307692,92.307692,97.142857
9248,8,nassau,925,nassaub,92.307692,100.0,92.307692,92.307692,97.142857
9398,8,nassau,1160,nassaau,92.307692,90.909091,92.307692,92.307692,97.142857
9453,8,nassau,1224,nassaus,92.307692,100.0,92.307692,92.307692,97.142857
9387,8,nassau,1146,nassua,83.333333,90.909091,83.333333,83.333333,96.666667
9424,8,nassau,1194,nassu,90.909091,88.888889,90.909091,90.909091,96.666667
9746,8,nassau,1593,naussau,92.307692,83.333333,92.307692,92.307692,96.190476
9217,8,nassau,872,nasau,90.909091,80.0,90.909091,90.909091,96.111111
9415,8,nassau,1185,nassao,83.333333,90.909091,83.333333,83.333333,93.333333
9431,8,nassau,1201,nassan,83.333333,90.909091,83.333333,83.333333,93.333333
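
Jaro-Winkler ranks `nassua` (a transposition) well above `massau` (a wrong first letter), while the ratio score treats them the same. To see where that comes from, here is a minimal pure-Python sketch of the textbook formula. It is not the code of the `jaro` or rapidfuzz packages; `jaro_similarity` and `jaro_winkler_similarity` are hypothetical names, and it applies the prefix bonus unconditionally, which is an assumption (some implementations only apply it above a Jaro threshold).

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the shared prefix (up to 4)."""
    base = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


for variant in ['nasssau', 'nassua', 'massau']:
    print(variant, round(100 * jaro_winkler_similarity('nassau', variant), 6))
```

The prefix bonus is why a typo in the first letter costs so much more than one near the end of the word.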


In [47]:
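# Variants of 'nassau' with a Jaro-Winkler score above 85, best matches first
is_nassau_match = (fuzzy_words_df['original_value']=='nassau') & (fuzzy_words_df['jaro_winkler_score']>85)
word_list = fuzzy_words_df[is_nassau_match].sort_values('jaro_winkler_score', ascending=False)['match_value']
for word in word_list:
    print(word)
    # Show every address whose token list contains this variant
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')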

nasssau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
420,24000421,"3RD FLOOR, GEORGE HOUSE, GEORGE STREET, P.O. B...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,third floor george house george street po box ...,"[third, floor, george, house, george, street, ..."




nassaub


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
583,14026897,ANSBACHER (BAHAMAS) LIMITED P.O. BOX N 7768 AN...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,ansbacher bahamas limited po box n 7768 ansbac...,"[ansbacher, bahamas, limited, po, box, n, 7768..."




nassaau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
966,14078961,Suite 102; Saffrey Square; Bay Street and Bank...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite 102 saffrey square bay street and bank l...,"[suite, 102, saffrey, square, bay, street, and..."




nassaus


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1106,14080656,The Bahamas Financial Centre; Shirley & Charlo...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre shirley and charl...,"[the, bahamas, financial, centre, shirley, and..."




nassua


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
933,14077075,SAFFREY SQUARE; SUITE 205; BANK LANE; P.O. BOX...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,saffrey square suite 205 bank lane po box n818...,"[saffrey, square, suite, 205, bank, lane, po, ..."
969,14078964,Suite 102; Saffrey Square; Bay Street and Bank...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite 102 saffrey square bay street and bank l...,"[suite, 102, saffrey, square, bay, street, and..."




nassu


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1021,14079979,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1117,14080667,The Bahamas Financial Centre; Shirley and Char...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,the bahamas financial centre shirley and charl...,"[the, bahamas, financial, centre, shirley, and..."




naussau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1428,252371,"43 Elizabeth Avenue, P.O.Box CB-13022 Naussau ...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,43 elizabeth avenue po box cb13022 naussau bah...,"[43, elizabeth, avenue, po, box, cb13022, naus..."




nasau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
536,14000678,"101 East Hill Street, Nasau Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,101 east hill street nasau bahamas,"[101, east, hill, street, nasau, bahamas]"
612,14030188,Bahamas Financial Centre; Shirley & Charlotte ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bahamas financial centre shirley and charlotte...,"[bahamas, financial, centre, shirley, and, cha..."
682,14035228,"CB 11-343 Nasau, Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb 11 343 nasau bahamas,"[cb, 11, 343, nasau, bahamas]"
724,14038327,Elizabeth Avenue and Shirley Street; Union Cou...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,elizabeth avenue and shirley street union cour...,"[elizabeth, avenue, and, shirley, street, unio..."
965,14078960,Suite 102; Saffrey Square; Bay Street and Bank...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite 102 saffrey square bay street and bank l...,"[suite, 102, saffrey, square, bay, street, and..."
1440,239867,"UBS Trustees (Bahamas) Ltd, UBS House, East Ba...",,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,ubs trustees bahamas ltd ubs house east bay st...,"[ubs, trustees, bahamas, ltd, ubs, house, east..."




nassao


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
999,14079956,Suite E-2; Union Court Buiding; Elizabeth Aven...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court buiding elizabeth avenue ...,"[suite, e2, union, court, buiding, elizabeth, ..."




nassan


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1043,14080003,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1050,14080011,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




massau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
810,14049672,"MASSAU, BAHAMAS",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,massau bahamas,"[massau, bahamas]"




343nassau


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
681,14035227,CB 11.343/Nassau Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb 11 343nassau bahamas,"[cb, 11, 343nassau, bahamas]"
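
The loop above rescans every address once per variant. That is fine for a dozen words, but for the larger address sets it might be cheaper to build a word-to-rows index once and look variants up in it. A minimal sketch of that idea, using four token lists copied from the output above in place of `df['address_wordlist']`; `build_word_index` is a hypothetical helper, not something defined earlier in this notebook.

```python
from collections import defaultdict


def build_word_index(labelled_wordlists):
    """Map each word to the sorted row labels whose token list contains it.

    `labelled_wordlists` is any iterable of (row_label, list_of_words) pairs.
    """
    index = defaultdict(set)
    for row_label, words in labelled_wordlists:
        for word in words:
            index[word].add(row_label)
    return {word: sorted(rows) for word, rows in index.items()}


# Token lists taken from the lookups above, keyed by DataFrame index.
wordlists = {
    536: ['101', 'east', 'hill', 'street', 'nasau', 'bahamas'],
    681: ['cb', '11', '343nassau', 'bahamas'],
    682: ['cb', '11', '343', 'nasau', 'bahamas'],
    810: ['massau', 'bahamas'],
}
word_index = build_word_index(wordlists.items())

for word in ['nasau', 'massau', '343nassau']:
    print(word, word_index.get(word, []))
```

With the real data, `df['address_wordlist'].items()` should yield the same kind of pairs, and `df.loc[word_index[word]]` would then pull the matching rows without another full scan.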






### Shirley

In [48]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='shirley') & (fuzzy_words_df['ratio_score']>60)].sort_values('ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.5
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4366,3,shirley,1087,shorley,85.714286,85.714286,85.714286,85.714286,92.380952
4424,3,shirley,1193,shitley,85.714286,85.714286,85.714286,85.714286,93.333333
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
4069,3,shirley,525,shirlaw,71.428571,83.333333,71.428571,71.428571,88.571429
4645,3,shirley,1444,haley,66.666667,75.0,66.666667,66.666667,79.047619


In [49]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='shirley') & (fuzzy_words_df['partial_ratio_score']>60)].sort_values('partial_ratio_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4675,3,shirley,1479,y,25.0,100.0,25.0,25.0,0.0
4029,3,shirley,419,e,25.0,100.0,25.0,25.0,0.0
4379,3,shirley,1113,l,25.0,100.0,25.0,25.0,0.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
3909,3,shirley,123,s,25.0,100.0,25.0,25.0,74.285714
4388,3,shirley,1145,i,25.0,100.0,25.0,25.0,71.428571
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4937,3,shirley,1858,hi,44.444444,100.0,44.444444,44.444444,76.190476
4032,3,shirley,433,r,25.0,100.0,25.0,25.0,0.0
4000,3,shirley,355,h,25.0,100.0,25.0,25.0,71.428571


In [50]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='shirley') & (fuzzy_words_df['token_sort_score']>60)].sort_values('token_sort_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.5
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4366,3,shirley,1087,shorley,85.714286,85.714286,85.714286,85.714286,92.380952
4424,3,shirley,1193,shitley,85.714286,85.714286,85.714286,85.714286,93.333333
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
4069,3,shirley,525,shirlaw,71.428571,83.333333,71.428571,71.428571,88.571429
4645,3,shirley,1444,haley,66.666667,75.0,66.666667,66.666667,79.047619


In [51]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='shirley') & (fuzzy_words_df['token_set_score']>60)].sort_values('token_set_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.5
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4366,3,shirley,1087,shorley,85.714286,85.714286,85.714286,85.714286,92.380952
4424,3,shirley,1193,shitley,85.714286,85.714286,85.714286,85.714286,93.333333
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4429,3,shirley,1199,andshirley,82.352941,100.0,82.352941,82.352941,90.0
4380,3,shirley,1119,shirleyand,82.352941,100.0,82.352941,82.352941,94.0
4069,3,shirley,525,shirlaw,71.428571,83.333333,71.428571,71.428571,88.571429
4645,3,shirley,1444,haley,66.666667,75.0,66.666667,66.666667,79.047619


In [52]:
fuzzy_words_df[(fuzzy_words_df['original_value']=='shirley') & (fuzzy_words_df['jaro_winkler_score']>60)].sort_values('jaro_winkler_score', ascending=False)

Unnamed: 0,original_index,original_value,match_index,match_value,ratio_score,partial_ratio_score,token_sort_score,token_set_score,jaro_winkler_score
4427,3,shirley,1197,shiriley,93.333333,85.714286,93.333333,93.333333,97.500000
4426,3,shirley,1196,shirely,85.714286,85.714286,85.714286,85.714286,97.142857
4432,3,shirley,1203,shirly,92.307692,90.909091,92.307692,92.307692,97.142857
4458,3,shirley,1237,shriley,85.714286,85.714286,85.714286,85.714286,96.190476
4380,3,shirley,1119,shirleyand,82.352941,100.000000,82.352941,82.352941,94.000000
...,...,...,...,...,...,...,...,...,...
4544,3,shirley,1333,sociedad,40.000000,50.000000,40.000000,40.000000,60.119048
4561,3,shirley,1353,property,40.000000,42.857143,40.000000,40.000000,60.119048
4822,3,shirley,1676,highland,40.000000,50.000000,40.000000,40.000000,60.119048
4619,3,shirley,1416,vinicole,40.000000,50.000000,40.000000,40.000000,60.119048


In [53]:
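# Variants of 'shirley' with a Jaro-Winkler score above 85, best matches first
is_shirley_match = (fuzzy_words_df['original_value']=='shirley') & (fuzzy_words_df['jaro_winkler_score']>85)
word_list = fuzzy_words_df[is_shirley_match].sort_values('jaro_winkler_score', ascending=False)['match_value']
for word in word_list:
    print(word)
    # Show every address whose token list contains this variant
    display(df[df['address_wordlist'].apply(lambda x: word in x)])
    print('\n')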

shiriley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1029,14079987,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court building elizabeth avenue...,"[suite, e2, union, court, building, elizabeth,..."




shirely


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1027,14079985,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1028,14079986,SUITE E-2; UNION COURT BUILDING; ELIZABETH AVE...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e2 union court building elizabeth avenue...,"[suite, e2, union, court, building, elizabeth,..."




shirly


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1050,14080011,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."
1476,287069,THE BAHAMAS FINANCIAL CENTRE SHIRLY AND CHARLO...,,Bahamas,BHS,Offshore Leaks,The Offshore Leaks data is current through 2010,,the bahamas financial centre shirly and charlo...,"[the, bahamas, financial, centre, shirly, and,..."




shriley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1140,14082122,Union Court Building; Suiete E-2; Elizabeth Av...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,union court building suiete e 2 elizabeth aven...,"[union, court, building, suiete, e, 2, elizabe..."




shirleyand


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
860,14064239,P O BOX N-3023 BAHAMAS FINANCIAL CENTRE; SHIRL...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,po box n3023 bahamas financial centre shirleya...,"[po, box, n3023, bahamas, financial, centre, s..."




shitley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1017,14079975,Suite E-2; Union Court Building; Elizabeth Ave...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




shorley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
818,14050515,Morgan Trust Co. of Bahamas Ltd.; The Bahamas ...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,morgan trust co of bahamas ltd the bahamas fin...,"[morgan, trust, co, of, bahamas, ltd, the, bah..."
943,14077452,Sasson House Building; 107 Shorley Street; P.O...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,sasson house building 107 shorley street po bo...,"[sasson, house, building, 107, shorley, street..."




andshirley


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1037,14079997,Suite E - 2; Union Court Building; Elizabeth A...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,suite e 2 union court building elizabeth avenu...,"[suite, e, 2, union, court, building, elizabet..."




shirlaw


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
281,24000282,"SHIRLEY STREET, SHIRLAW HOUSE, P.O. BOX N-4839...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,,shirley street shirlaw house po box n4839 nass...,"[shirley, street, shirlaw, house, po, box, n48..."
1591,81031328,Higgs & Johnson Corporate Services Ltd.; Shirl...,Higgs & Johnson Corporate Services Ltd.,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,higgs and johnson corporate services ltd shirl...,"[higgs, and, johnson, corporate, services, ltd..."
1880,33000070,"SHIRLEY STREET, SHIRLAW HOUSE, PO BOX N-4839, ...",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current thr...,,shirley street shirlaw house po box n4839 nass...,"[shirley, street, shirlaw, house, po, box, n48..."




shorline


Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
573,14020532,64; Shorline; Double Road; Freeport; Grand Bah...,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,64 shorline double road freeport grand bahamas...,"[64, shorline, double, road, freeport, grand, ..."
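
Not everything above the threshold is a misspelling: `shirlaw` comes from "Shirlaw House", and `shirleyand` and `andshirley` are two words fused together rather than one word spelled wrong. A rough sketch of how accepted variants could be folded back into one spelling while holding such words out for manual review. `build_replacements`, `normalize`, and the `keep_as_is` set are hypothetical names of my own; the scores are copied from the Jaro-Winkler column above.

```python
def build_replacements(canonical, scored_matches, threshold, keep_as_is=()):
    """Map each accepted variant to the canonical spelling."""
    return {
        variant: canonical
        for variant, score in scored_matches
        if score > threshold and variant not in keep_as_is
    }


def normalize(words, replacements):
    """Replace known variants in a token list, leaving other words alone."""
    return [replacements.get(word, word) for word in words]


# (match_value, jaro_winkler_score) pairs from the 'shirley' tables above.
shirley_matches = [
    ('shiriley', 97.5), ('shirely', 97.142857), ('shirly', 97.142857),
    ('shriley', 96.190476), ('shirleyand', 94.0), ('shitley', 93.333333),
    ('shorley', 92.380952), ('andshirley', 90.0), ('shirlaw', 88.571429),
    ('haley', 79.047619),
]
# Words that score high but should not be rewritten to 'shirley'.
keep_as_is = {'shirlaw', 'shirleyand', 'andshirley'}

replacements = build_replacements('shirley', shirley_matches, 85, keep_as_is)
print(normalize(['the', 'bahamas', 'financial', 'centre', 'shirly', 'and'], replacements))
```

The held-out set has to be curated by hand, which is one reason to do this kind of review per word before scaling up.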






# Resources

## Fuzzy Matching

- [Fuzzing matching in pandas with fuzzywuzzy](https://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/)
- [Best Libraries for Fuzzy Matching In Python](https://medium.com/codex/best-libraries-for-fuzzy-matching-in-python-cbb3e0ef87dd)
- [Fuzzy String Matching](https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe)
- [Fuzzy String Comparison](https://stackoverflow.com/a/28467760)
- [How to do Fuzzy Matching on Pandas Dataframe Column Using Python?](https://www.geeksforgeeks.org/how-to-do-fuzzy-matching-on-pandas-dataframe-column-using-python/)

## Timeit

- [Timeit in Jupyter Notebook](https://linuxhint.com/timeit-jupyter-notebook/)