# NLP 06: Word Frequencies

To be honest, I did a lot of this the hard way to start. It involved manually filtering an Excel spreadsheet version of the addresses, scrolling through thousands of rows, copying and pasting, and making lots of processing mistakes. It occurred to me that much of that could have been accomplished by looking at word frequencies, then searching the dataset for the most frequent terms.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from time import gmtime, strftime
import sys
import os
import io

import string
import re

In [2]:
df = pd.read_csv('data/edited/addresses_Bahamas.csv')

## Create a new column to work with

It's important to keep a copy of the original values. This makes it much easier to revert changes that don't perform as expected, and it gives me something to reference if data is lost or needs to be put back into context. I've lost count of the number of times I've needed to walk back a certain processing technique.

In [3]:
df['working_address'] = df['address']

## Word counts

I frequently use word counts to determine any words I need to add to the stopword list (stopwords are words that occur frequently, such as 'the' or 'a' that don't add value to the analysis and should be removed).

In this case, I'm hoping it will bubble up things like cities, states, islands, and other address features that occur frequently, as well as punctuation usage.

First I'll fill blank values with an empty string to help with full dataset processing (i.e. I really don't want to deal with NaNs). Next I lowercase everything (there is no use for "Annex" and "annex" to be two different values in this analysis). Then I split on whitespace. Lastly, I reuse the frequency function from [Entry NLP4: Frequencies and Comparison](https://julielinx.github.io/blog/nlp04_vocal_auth_freq_compare/) to get the word counts.

In [4]:
df['address_wordlist'] = df['working_address'].fillna('').str.lower().str.split()

In [5]:
df.address_wordlist

0       [annex, frederick, &, shirley, sts,, p.o., box...
1       [suite, e-2,union, court, building,, p.o., box...
2       [lyford, cay, house,, lyford, cay,, p.o., box,...
3       [p.o., box, n-3708, bahamas, financial, centre...
4       [lyford, cay, house,, 3rd, floor,, lyford, cay...
                              ...                        
2253    [j.p.morgan, trust, company, (bahamas), limite...
2254    [montagne, sterline, centre., east, bav, stree...
2255    [deltec, house,, lyford, cay,, po, box, n-3229...
2256    [providence, house,, hast, wing,, east, hill, ...
2257    [c/oj.p., morgan, trust, company, (bahamas), l...
Name: address_wordlist, Length: 2258, dtype: object

In [6]:
def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] +=1
    return freq_dict

In [7]:
freq_df = pd.DataFrame.from_dict(
    frequency_ct(df['address_wordlist'].sum()
                ), orient='index').reset_index().rename(
    columns={'index':'word', 0:'count'}).sort_values('count', ascending=False)
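
As an alternative sketch (not the approach used in this entry), pandas can build the same kind of table without summing lists: `explode` gives each word its own row and `value_counts` tallies them. The `demo` frame and its rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the address data (made-up rows, not the real dataset)
demo = pd.DataFrame({'address': ['P.O. Box N-1, Nassau, Bahamas',
                                 None,
                                 'Bay Street; Nassau; Bahamas']})
demo['address_wordlist'] = demo['address'].fillna('').str.lower().str.split()

# explode() puts each word on its own row; value_counts() tallies them,
# most frequent first (rows with an empty list become NaN and are dropped)
word_counts = demo['address_wordlist'].explode().value_counts()

demo_freq_df = word_counts.rename_axis('word').reset_index(name='count')
print(demo_freq_df.head())
```

On a large column this avoids repeatedly concatenating Python lists, which is what `.sum()` does under the hood.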

In [8]:
freq_df.head()

Unnamed: 0,word,count
9,bahamas,2140
6,box,1447
8,"nassau,",974
5,p.o.,889
1008,nassau;,772


**Note**: When applied to a column of lists, `.sum()` concatenates them into a single list. If that column is saved to a `.csv` and then reloaded, each list is read back in as a string, so `.sum()` returns one long string that looks like a run of lists but is really the stringified lists concatenated together.

In [9]:
df['address_wordlist'].sum()[:15]

['annex',
 'frederick',
 '&',
 'shirley',
 'sts,',
 'p.o.',
 'box',
 'n-4805,',
 'nassau,',
 'bahamas',
 'suite',
 'e-2,union',
 'court',
 'building,',
 'p.o.']

In [10]:
df.head(10).to_csv('stringified_list.csv', index=False)
str_lst_df = pd.read_csv('stringified_list.csv')

In [11]:
str_lst_df['address_wordlist'].sum()

'[\'annex\', \'frederick\', \'&\', \'shirley\', \'sts,\', \'p.o.\', \'box\', \'n-4805,\', \'nassau,\', \'bahamas\'][\'suite\', \'e-2,union\', \'court\', \'building,\', \'p.o.\', \'box\', \'n-8188,\', \'nassau,\', \'bahamas\'][\'lyford\', \'cay\', \'house,\', \'lyford\', \'cay,\', \'p.o.\', \'box\', \'n-7785,\', \'nassau,\', \'bahamas\'][\'p.o.\', \'box\', \'n-3708\', \'bahamas\', \'financial\', \'centre,\', \'p.o.\', \'box\', \'n-3708\', \'shirley\', \'&\', \'charlotte\', \'sts,\', \'nassau,\', \'bahamas\'][\'lyford\', \'cay\', \'house,\', \'3rd\', \'floor,\', \'lyford\', \'cay,\', \'p.o.\', \'box\', \'n-3024,\', \'nassau,\', \'bahamas\'][\'303\', \'shirley\', \'street,\', \'p.o.\', \'box\', \'n-492,\', \'nassau,\', \'bahamas\'][\'ocean\', \'centre,\', \'montagu\', \'foreshore,\', \'p.o.\', \'box\', \'ss-19084\', \'east\', \'bay\', \'street,\', \'nassau,\', \'bahamas\'][\'providence\', \'house,\', \'east\', \'wing\', \'east\', \'hill\', \'st,\', \'p.o.\', \'box\', \'cb-12399,\', \'nass
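
If the lists do need to survive a round trip through a `.csv`, one option is to parse the strings back into lists with the standard library's `ast.literal_eval`. This is a minimal sketch on two made-up strings; it assumes every value in the column is a well-formed Python list literal.

```python
import ast

# Two made-up values as they would look after being read back from a csv
stringified = ["['annex', 'frederick', '&']", "['suite', 'e-2,union']"]

# literal_eval safely parses a Python literal; it does not execute arbitrary code
restored = [ast.literal_eval(value) for value in stringified]

# With real lists again, flattening gives individual words rather than characters
flat = [word for words in restored for word in words]
print(flat)

# On a DataFrame column the same idea would be:
# str_lst_df['address_wordlist'] = str_lst_df['address_wordlist'].apply(ast.literal_eval)
```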

## Non-standardizations

I only need to see the top 20 results to start identifying ways to standardize the data.

- PO Boxes are written both as "p.o." and "po"
- There are both "&" and "and"

I'll double check to make sure these are what I think they are.

In [12]:
freq_df.head(20)

Unnamed: 0,word,count
9,bahamas,2140
6,box,1447
8,"nassau,",974
5,p.o.,889
1008,nassau;,772
3,shirley,484
417,po,461
10,suite,445
35,bay,405
1009,street;,362


In [13]:
pd.set_option('display.max_colwidth', 1000)

In [14]:
df[df['working_address'].str.lower().str.contains(r'p\.?o\.?', regex=True)].tail()

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2225,240491356,"P.O. BOX N- 3944, SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, NASSAU, BAHAMAS, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"P.O. BOX N- 3944, SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, NASSAU, BAHAMAS, NASSAU, BAHAMAS","[p.o., box, n-, 3944,, suite, 200b,, 2nd, floor,, centre, of, commerce,, one, bay, street,, nassau,, bahamas,, nassau,, bahamas]"
2227,240491474,"SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, PO BOX N-3944, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, PO BOX N-3944, NASSAU, BAHAMAS","[suite, 200b,, 2nd, floor,, centre, of, commerce,, one, bay, street,, po, box, n-3944,, nassau,, bahamas]"
2229,240491518,"RBC TRUST COMPANY (BAHAMAS) LIMITED, BAYSIDE EXECUTIVE PARK BUILDING 3, P.O. BOX NO. 30-24, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"RBC TRUST COMPANY (BAHAMAS) LIMITED, BAYSIDE EXECUTIVE PARK BUILDING 3, P.O. BOX NO. 30-24, NASSAU, BAHAMAS","[rbc, trust, company, (bahamas), limited,, bayside, executive, park, building, 3,, p.o., box, no., 30-24,, nassau,, bahamas]"
2255,240491733,"DELTEC HOUSE, LYFORD CAY, PO BOX N-3229, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"DELTEC HOUSE, LYFORD CAY, PO BOX N-3229, NASSAU, BAHAMAS","[deltec, house,, lyford, cay,, po, box, n-3229,, nassau,, bahamas]"
2256,240491778,"PROVIDENCE HOUSE, HAST WING, EAST HILL STREET, P.O. BOX CB-12399, NASSAU, CB-12399, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"PROVIDENCE HOUSE, HAST WING, EAST HILL STREET, P.O. BOX CB-12399, NASSAU, CB-12399, BAHAMAS","[providence, house,, hast, wing,, east, hill, street,, p.o., box, cb-12399,, nassau,, cb-12399,, bahamas]"


Even in just these five rows, I can see "po box" and "p.o. box" are both represented.

In [15]:
df[df['working_address'].str.lower().str.contains(r'p\.?o\.?b', regex=True)].tail()

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1807,88017491,"BOLAM HOUSE, KING AND GEORGE STREETS P.O.BOX N-514, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current through 2016,,"BOLAM HOUSE, KING AND GEORGE STREETS P.O.BOX N-514, NASSAU, BAHAMAS","[bolam, house,, king, and, george, streets, p.o.box, n-514,, nassau,, bahamas]"
2123,120000350,"SAFFREY SQUARE, SUITE 205 BANK LANE, P.O.BOX N, 8188, NASSAU, BAHAMAS.","SAFFREY SQUARE, SUITE 205 BANK LANE, P.O.BOX N, 8188, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,"SAFFREY SQUARE, SUITE 205 BANK LANE, P.O.BOX N, 8188, NASSAU, BAHAMAS.","[saffrey, square,, suite, 205, bank, lane,, p.o.box, n,, 8188,, nassau,, bahamas.]"
2143,120010247,"LENNOX PATON CORPORATE SERVICES LIMITED, P.O.BOX N-4875, NASSAU, BAHAMAS.","LENNOX PATON CORPORATE SERVICES LIMITED, P.O.BOX N-4875, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,"LENNOX PATON CORPORATE SERVICES LIMITED, P.O.BOX N-4875, NASSAU, BAHAMAS.","[lennox, paton, corporate, services, limited,, p.o.box, n-4875,, nassau,, bahamas.]"
2150,120006606,"OCEAN CENTRE, MONTAGU FORESHORE, EAST BAY STREET, P.O.BOX SS-19084, NASSAU, BAHAMAS.","OCEAN CENTRE, MONTAGU FORESHORE, EAST BAY STREET, P.O.BOX SS-19084, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,"OCEAN CENTRE, MONTAGU FORESHORE, EAST BAY STREET, P.O.BOX SS-19084, NASSAU, BAHAMAS.","[ocean, centre,, montagu, foreshore,, east, bay, street,, p.o.box, ss-19084,, nassau,, bahamas.]"
2193,240001242,"DOMINION HOUSE60,MONTROSE AVENUE, P.O.BOX N-9932, NASSAU, BAHAMAS",,Bahamas,BHS,"Pandora Papers - Alemán, Cordero, Galindo & Lee (Alcogal)",Provider data is current through 2018,,"DOMINION HOUSE60,MONTROSE AVENUE, P.O.BOX N-9932, NASSAU, BAHAMAS","[dominion, house60,montrose, avenue,, p.o.box, n-9932,, nassau,, bahamas]"


If I dive a little further, I also see there are some instances where the two words are run together, punctuation other than "." separates the letters, and even a few where the "po" or "po box" portion has been left off entirely.

In [16]:
df[~df['working_address'].str.lower().str.contains(r'p\.?o\.?', regex=True)].head(10)

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
235,24000236,"#4 PINEAPPLE GROVE,OLD FORT BAY, NEW PRODIVENCE, BOX SP-60063, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#4 PINEAPPLE GROVE,OLD FORT BAY, NEW PRODIVENCE, BOX SP-60063, NASSAU, BAHAMAS","[#4, pineapple, grove,old, fort, bay,, new, prodivence,, box, sp-60063,, nassau,, bahamas]"
255,24000256,"P,O, BOX N-4759, NASSAU",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"P,O, BOX N-4759, NASSAU","[p,o,, box, n-4759,, nassau]"
290,24000291,"3RD FLOOR TRADE WINDS BLDG, BAY ST P>O. BOX CB 12724",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"3RD FLOOR TRADE WINDS BLDG, BAY ST P>O. BOX CB 12724","[3rd, floor, trade, winds, bldg,, bay, st, p>o., box, cb, 12724]"
484,24000485,"#70 WULFF ROAD, NASSAU, BAHAMAS N-989",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#70 WULFF ROAD, NASSAU, BAHAMAS N-989","[#70, wulff, road,, nassau,, bahamas, n-989]"
485,24000486,DEVEAUX STREET,,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,DEVEAUX STREET,"[deveaux, street]"
517,24000518,"3RD FL. BRITISH COLONIAL CENTRE OF COMM, SUITE 304, 1 BAY STREET, SP 63776, NASSAU, NP, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"3RD FL. BRITISH COLONIAL CENTRE OF COMM, SUITE 304, 1 BAY STREET, SP 63776, NASSAU, NP, BAHAMAS","[3rd, fl., british, colonial, centre, of, comm,, suite, 304,, 1, bay, street,, sp, 63776,, nassau,, np,, bahamas]"
529,14000065,#1 Bay Street; Centre of Commerce Nassau; Bahamas.,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,#1 Bay Street; Centre of Commerce Nassau; Bahamas.,"[#1, bay, street;, centre, of, commerce, nassau;, bahamas.]"
530,14000073,#1 Venetian Villa N492; Old Fort Day; Nassau; New Providence; Bahamas.,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,#1 Venetian Villa N492; Old Fort Day; Nassau; New Providence; Bahamas.,"[#1, venetian, villa, n492;, old, fort, day;, nassau;, new, providence;, bahamas.]"
535,14007586,1st FLOOR; EURO CANADIAN CENTRE; MARLBOROUGHT STREET; NASSAU; BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,1st FLOOR; EURO CANADIAN CENTRE; MARLBOROUGHT STREET; NASSAU; BAHAMAS,"[1st, floor;, euro, canadian, centre;, marlborought, street;, nassau;, bahamas]"
536,14000678,"101 East Hill Street, Nasau Bahamas",,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,"101 East Hill Street, Nasau Bahamas","[101, east, hill, street,, nasau, bahamas]"
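
The PO Box variants seen above ("p.o. box", "po box", "p.o.box", "p,o, box", "p>o. box") could be caught with a single, more tolerant pattern. The sketch below is hypothetical and not part of the original notebook; `po_box_pattern` and the sample strings (adapted from rows shown above) are only for illustration, and the pattern has not been checked against the full dataset.

```python
import re

# Hypothetical pattern: "p" and "o", each optionally followed by one
# non-word character, then optional whitespace and the word "box"
po_box_pattern = re.compile(r'\bp\W?o\W?\s*box\b', flags=re.IGNORECASE)

samples = [
    'P.O. BOX N-4805, NASSAU',
    'PO BOX N-3229, NASSAU',
    'P.O.BOX N-514, NASSAU',
    'P,O, BOX N-4759, NASSAU',
    'BAY ST P>O. BOX CB 12724',
    'NEW PRODIVENCE, BOX SP-60063',  # "po" left off entirely
]

matched = [bool(po_box_pattern.search(sample)) for sample in samples]
for sample, hit in zip(samples, matched):
    print(hit, sample)
```

Anchoring on "box" keeps the pattern from firing on words that merely contain "po", such as "corporate".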


In [17]:
df[df['working_address'].str.lower().str.contains('&|and')]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
0,24000001,"ANNEX FREDERICK & SHIRLEY STS, P.O. BOX N-4805, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"ANNEX FREDERICK & SHIRLEY STS, P.O. BOX N-4805, NASSAU, BAHAMAS","[annex, frederick, &, shirley, sts,, p.o., box, n-4805,, nassau,, bahamas]"
3,24000004,"P.O. BOX N-3708 BAHAMAS FINANCIAL CENTRE, P.O. BOX N-3708 SHIRLEY & CHARLOTTE STS, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"P.O. BOX N-3708 BAHAMAS FINANCIAL CENTRE, P.O. BOX N-3708 SHIRLEY & CHARLOTTE STS, NASSAU, BAHAMAS","[p.o., box, n-3708, bahamas, financial, centre,, p.o., box, n-3708, shirley, &, charlotte, sts,, nassau,, bahamas]"
8,24000009,"BAYSIDE EXECUTIVE PARK, WEST BAY & BLAKE, P.O. BOX N-4875, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"BAYSIDE EXECUTIVE PARK, WEST BAY & BLAKE, P.O. BOX N-4875, NASSAU, BAHAMAS","[bayside, executive, park,, west, bay, &, blake,, p.o., box, n-4875,, nassau,, bahamas]"
10,24000011,"TK HOUSE, BAYSIDE EXECUTIVE PARK, P.O. BOX AP-59213 WEST BAY & BLAKE ROAD, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"TK HOUSE, BAYSIDE EXECUTIVE PARK, P.O. BOX AP-59213 WEST BAY & BLAKE ROAD, NASSAU, BAHAMAS","[tk, house,, bayside, executive, park,, p.o., box, ap-59213, west, bay, &, blake, road,, nassau,, bahamas]"
11,24000012,"BAYSIDE HOUSE WEST BAY & BLAKE ROAD, P.O. BOX AP-59213, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"BAYSIDE HOUSE WEST BAY & BLAKE ROAD, P.O. BOX AP-59213, NASSAU, BAHAMAS","[bayside, house, west, bay, &, blake, road,, p.o., box, ap-59213,, nassau,, bahamas]"
...,...,...,...,...,...,...,...,...,...,...
2244,240492203,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, NEW PROVIDENCE, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, NEW PROVIDENCE, BAHAMAS","[j.p., morgan, trust, company, (bahamas), limited,, 2nd, floor, bahamas, financial, centre,, shirley, and, charlotte, street,, nassau,, n-4899,, new, providence,, bahamas]"
2245,240492204,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, BAHAMAS","[j.p., morgan, trust, company, (bahamas), limited,, 2nd, floor, bahamas, financial, centre,, shirley, and, charlotte, street,, nassau,, n-4899,, bahamas]"
2246,240492207,"SHIRLEY AND CHARLOTTE STREETS, NASSAU, COUNTRY BAHAMAS, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"SHIRLEY AND CHARLOTTE STREETS, NASSAU, COUNTRY BAHAMAS, BAHAMAS","[shirley, and, charlotte, streets,, nassau,, country, bahamas,, bahamas]"
2248,240492221,"JPMORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR, BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"JPMORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR, BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, BAHAMAS","[jpmorgan, trust, company, (bahamas), limited,, 2nd, floor,, bahamas, financial, centre,, shirley, and, charlotte, street,, nassau,, bahamas]"


### St

"st" is a complicated one. It's short for "street"; appears at the end of "east", "west", "first", and "trust"; is at the start of the word "street"; and is an abbreviation for "saint."

In [18]:
df[df['address_wordlist'].apply(lambda x: 'st' in x)]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
225,24000226,"1 BAY ST 3RD FL BRITISH COLONIAL CENTRE, P.O. BOX 7115, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"1 BAY ST 3RD FL BRITISH COLONIAL CENTRE, P.O. BOX 7115, NASSAU, BAHAMAS","[1, bay, st, 3rd, fl, british, colonial, centre,, p.o., box, 7115,, nassau,, bahamas]"
290,24000291,"3RD FLOOR TRADE WINDS BLDG, BAY ST P>O. BOX CB 12724",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"3RD FLOOR TRADE WINDS BLDG, BAY ST P>O. BOX CB 12724","[3rd, floor, trade, winds, bldg,, bay, st, p>o., box, cb, 12724]"
374,24000375,"#2 DEWGARD PLAZA BRADLEY ST PALMDALE, P.O. BOX SS-5062, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#2 DEWGARD PLAZA BRADLEY ST PALMDALE, P.O. BOX SS-5062, NASSAU, BAHAMAS","[#2, dewgard, plaza, bradley, st, palmdale,, p.o., box, ss-5062,, nassau,, bahamas]"
1595,81031897,Sassoon House; Shirley St and Victoria Ave; Nassau; NP; Bahamas,Sassoon House,Bahamas,BHS,Paradise Papers - Appleby,Appleby data is current through 2014,,Sassoon House; Shirley St and Victoria Ave; Nassau; NP; Bahamas,"[sassoon, house;, shirley, st, and, victoria, ave;, nassau;, np;, bahamas]"
1814,33000002,"31B, ANNEX BUILDING EAST BAY ST 2ND FL, PO BOX N-3930, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"31B, ANNEX BUILDING EAST BAY ST 2ND FL, PO BOX N-3930, NASSAU, BAHAMAS","[31b,, annex, building, east, bay, st, 2nd, fl,, po, box, n-3930,, nassau,, bahamas]"
1832,33000020,"#10 PETRONA HOUSE, FOWLER ST EAST, PO BOX N 1375, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"#10 PETRONA HOUSE, FOWLER ST EAST, PO BOX N 1375, NASSAU, BAHAMAS","[#10, petrona, house,, fowler, st, east,, po, box, n, 1375,, nassau,, bahamas]"
1923,33000119,"LYFORD MANOR, WEST BUILDING, WEST BAY ST PO BOX CB-13007, NASSAU, NP, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"LYFORD MANOR, WEST BUILDING, WEST BAY ST PO BOX CB-13007, NASSAU, NP, BAHAMAS","[lyford, manor,, west, building,, west, bay, st, po, box, cb-13007,, nassau,, np,, bahamas]"
1935,33000132,"ST ANDREW'S COURT FREDERICK ST STEPS PO BOX N-4805, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"ST ANDREW'S COURT FREDERICK ST STEPS PO BOX N-4805, NASSAU, BAHAMAS","[st, andrew's, court, frederick, st, steps, po, box, n-4805,, nassau,, bahamas]"
1958,33000157,"#308 EAST BAY ST 4TH FLOOR PO BOX N-7768, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"#308 EAST BAY ST 4TH FLOOR PO BOX N-7768, NASSAU, BAHAMAS","[#308, east, bay, st, 4th, floor, po, box, n-7768,, nassau,, bahamas]"
1959,33000158,"WINTERBOTHAM PLC MARLBOROUGH & QUEEN ST PO BOX CB-11343, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"WINTERBOTHAM PLC MARLBOROUGH & QUEEN ST PO BOX CB-11343, NASSAU, BAHAMAS","[winterbotham, plc, marlborough, &, queen, st, po, box, cb-11343,, nassau,, bahamas]"


In [19]:
df[df['address_wordlist'].apply(lambda x: 'st.' in x)]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
12,24000013,"#308 EAST BAY ST. 4TH FLOOR, P.O. BOX N-7768, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#308 EAST BAY ST. 4TH FLOOR, P.O. BOX N-7768, NASSAU, BAHAMAS","[#308, east, bay, st., 4th, floor,, p.o., box, n-7768,, nassau,, bahamas]"
16,24000017,"SASSOON HOUSE SHIRLEY ST. & VICTORIA AVE, P.O. BOX SS-5383, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"SASSOON HOUSE SHIRLEY ST. & VICTORIA AVE, P.O. BOX SS-5383, NASSAU, BAHAMAS","[sassoon, house, shirley, st., &, victoria, ave,, p.o., box, ss-5383,, nassau,, bahamas]"
21,24000022,"3RD FLOOR, ONE MONTAGUE PLACE, EAST BAY ST. P.O. BOX N3231, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"3RD FLOOR, ONE MONTAGUE PLACE, EAST BAY ST. P.O. BOX N3231, NASSAU, BAHAMAS","[3rd, floor,, one, montague, place,, east, bay, st., p.o., box, n3231,, nassau,, bahamas]"
34,24000035,"31B, ANNEX BUILDING EAST BAY ST. 2ND FL., P.O. BOX N-3930, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"31B, ANNEX BUILDING EAST BAY ST. 2ND FL., P.O. BOX N-3930, NASSAU, BAHAMAS","[31b,, annex, building, east, bay, st., 2nd, fl.,, p.o., box, n-3930,, nassau,, bahamas]"
37,24000038,"SUITE #102 SAFFREY SQUARE 1ST FLOOR, P.O. BOX CB-13937 BAY ST. & BANK LANE, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"SUITE #102 SAFFREY SQUARE 1ST FLOOR, P.O. BOX CB-13937 BAY ST. & BANK LANE, NASSAU, BAHAMAS","[suite, #102, saffrey, square, 1st, floor,, p.o., box, cb-13937, bay, st., &, bank, lane,, nassau,, bahamas]"
78,24000079,"#10 PETRONA HOUSE, FOWLER ST. EAST, P.O. BOX N 1375, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#10 PETRONA HOUSE, FOWLER ST. EAST, P.O. BOX N 1375, NASSAU, BAHAMAS","[#10, petrona, house,, fowler, st., east,, p.o., box, n, 1375,, nassau,, bahamas]"
93,24000094,"308 EAST BAY ST. 3RD FL., P.O. BOX N-7527, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"308 EAST BAY ST. 3RD FL., P.O. BOX N-7527, NASSAU, BAHAMAS","[308, east, bay, st., 3rd, fl.,, p.o., box, n-7527,, nassau,, bahamas]"
125,24000126,"3RD FLOOR, MARITIME HOUSE, FREDERICK ST. P.O. BOX N-4584, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"3RD FLOOR, MARITIME HOUSE, FREDERICK ST. P.O. BOX N-4584, NASSAU, BAHAMAS","[3rd, floor,, maritime, house,, frederick, st., p.o., box, n-4584,, nassau,, bahamas]"
343,24000344,"DOWDESWELL ST. KI-MALEX HOUSE, P.O. BOX SS-6836, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"DOWDESWELL ST. KI-MALEX HOUSE, P.O. BOX SS-6836, NASSAU, BAHAMAS","[dowdeswell, st., ki-malex, house,, p.o., box, ss-6836,, nassau,, bahamas]"
406,24000407,"MALBOROUGH ST. & NAVY LION RD., P.O. BOX SS-19051, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"MALBOROUGH ST. & NAVY LION RD., P.O. BOX SS-19051, NASSAU, BAHAMAS","[malborough, st., &, navy, lion, rd.,, p.o., box, ss-19051,, nassau,, bahamas]"


## Standardize values

After some trial and error, I settled on the following standardizations to start:

- Lowercase everything
- Replace "&" with "and"
- Remove punctuation
- Expand abbreviations

### Lowercase everything

Different capitalization strategies quickly complicate an analysis as most value matching is based on exact matches.

In [20]:
df['working_address'] = df['address'].str.lower()

### Replace "&"

In the Bahamas addresses "&" and "and" are used interchangeably.

I did a quick sanity check to ensure that "&" wasn't being used in another way. There are only 326 rows that use "&". A quick perusal shows that most, if not all, are used to connect street names.
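
A sanity check along those lines might look like the sketch below. The three addresses are made up for illustration, so the counts here say nothing about the real dataset.

```python
import pandas as pd

# Made-up, already-lowercased addresses
addresses = pd.Series([
    'annex frederick & shirley sts',
    'west bay & blake road',
    'deveaux street',
])

# Literal match (regex=False) so "&" is never treated as a pattern
has_ampersand = addresses.str.contains('&', regex=False)
print(has_ampersand.sum())

# Pull out the word on each side of "&" to see what it is joining
context = addresses[has_ampersand].str.findall(r'\w+\s*&\s*\w+')
print(context.tolist())
```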

In [21]:
df['working_address'] = df['working_address'].str.replace('&', 'and')

### Remove punctuation

Punctuation can be particularly helpful in splitting formatted text blocks into smaller pieces. However, there is no standardized format for these addresses, so the punctuation actually makes pulling out relevant information and standardizing values harder. For example, "-" is generally used in PO Box numbers, but it separates city and country in one row.

After some trial and error, the best process appears to be: first remove "-" from every row that has a PO Box number, then replace the characters matched by `\.|,|-`, as well as newlines, with a space, and finally replace everything else with an empty string (i.e. delete it).

In [22]:
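# Rows with a hyphenated PO Box number (e.g. "box n-4805"): drop the hyphen so the number stays together
has_box_hyphen = df['working_address'].str.contains(r'box\s?\w+-\d+', regex=True)
df.loc[has_box_hyphen, 'working_address'] = df.loc[has_box_hyphen, 'working_address'].str.replace('-', '', regex=False)

# Replace periods, commas, and any remaining hyphens with a space
df['working_address'] = df['working_address'].str.replace(r'\.|,|-', ' ', regex=True)

# Map tabs and newlines to a space (the third argument of maketrans would delete them instead)
newline_list = '\t\r\n'
replace_newline = str.maketrans({char: ' ' for char in newline_list})
# Delete all remaining punctuation
punct_list = string.punctuation + '—¿–'
nopunct = str.maketrans('', '', punct_list)

df['working_address'] = df['working_address'].str.translate(replace_newline).str.translate(nopunct)
df['working_address']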

0                                                                               annex frederick and shirley sts  p o  box n4805  nassau  bahamas
1                                                                                 suite e2 union court building  p o  box n8188  nassau  bahamas
2                                                                                  lyford cay house  lyford cay  p o  box n7785  nassau  bahamas
3                                             p o  box n3708 bahamas financial centre  p o  box n3708 shirley and charlotte sts  nassau  bahamas
4                                                                       lyford cay house  3rd floor  lyford cay  p o  box n3024  nassau  bahamas
                                                                          ...                                                                   
2253                                                                       j p morgan trust company bahamas limited  nassau  n 489
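
One detail worth checking in this step is how the translation tables are built, because `str.maketrans` can either delete characters or map them to something else. A minimal sketch on a made-up string:

```python
text = "bay street\nnassau"

# Three-argument form: every character in the third argument is deleted outright
delete_table = str.maketrans('', '', '\t\r\n')

# Dict form: each listed character is mapped to a space instead
space_table = str.maketrans({char: ' ' for char in '\t\r\n'})

deleted = text.translate(delete_table)
spaced = text.translate(space_table)

print(repr(deleted))
print(repr(spaced))
```

Deleting a newline fuses the words on either side of it, while mapping it to a space keeps them apart.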

### Standardize abbreviations

I got the majority of the abbreviations the hard way: I went through the dataset by hand. I was attempting to pull out the city name for each address that had a street in it. The only way to do this for many addresses was to look at the street, resulting in a lot of `contains` searches.

In the frequency count I also noticed that "p" and "o" occur rather frequently. A quick peek shows that in 84 rows "po box" is listed as "p o box". When doing replacements like these, it's important to do sanity checks, as the results won't always be what you expect. One of my favorite examples: when I searched for "demon", I also got "demonstrate". Watch out for these kinds of things. Fortunately, in the Bahamas addresses "p o" only occurs for PO boxes.
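
The "demon" versus "demonstrate" problem is easy to reproduce. This sketch uses made-up strings and compares a plain substring search with one that uses the regex word boundary `\b`:

```python
import pandas as pd

# Made-up strings for illustration
notes = pd.Series(['a demon appears', 'let me demonstrate', 'demons everywhere'])

# Plain substring search: matches anywhere inside a word
loose = notes[notes.str.contains('demon', regex=False)]

# Word-boundary search: only matches "demon" as a whole word
strict = notes[notes.str.contains(r'\bdemon\b', regex=True)]

print(loose.tolist())
print(strict.tolist())
```

Comparing the two result sets before applying a replacement is a cheap way to spot unintended matches.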

There are also occurrences of "pobox", and joining the "p" and "o" from "p o" produces one more "pobox", so the "pobox" replacement needs to be applied after the main abbreviation changes.
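
To illustrate why the order matters, here is a small sketch using plain `re.sub` rather than pandas. The helper `apply_in_order` and the sample string are hypothetical, and the sketch assumes the replacements run one after another in the order given.

```python
import re

def apply_in_order(text, rules):
    """Apply each (pattern, replacement) pair to the text, one after another."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

# Join "p o" first, then split "pobox"
join_first = [(r'p o', 'po'), (r'pobox', 'po box')]
# Same rules, opposite order
split_first = [(r'pobox', 'po box'), (r'p o', 'po')]

sample = 'p obox n123'  # made-up value
print(apply_in_order(sample, join_first))
print(apply_in_order(sample, split_first))
```

With the "pobox" rule first, the "pobox" created by the join is never split again.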

#### The difficulties of "street"

Replacing "st" was particularly challenging. Simply replacing "st" with "street" will also alter the following words:

- street: streetreet
- sts: streets
- west: westreet
- east: eastreet
- st: street
- first: firstreet
- trust: trustreet

I only discovered this after applying several iterations of "st" replacements and re-running word counts. I ended up looking for " st ", essentially ensuring there was a space before and after "st". This also had the side effect of ensuring the "saint" rows remain the same.

I also made the mistake of replacing " st " with "street", essentially connecting it with the words around it. I got a lot of "unionstreet" and "streetnassau" values. Fortunately, the solution is simple: just add the spaces back in.
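
Both pitfalls can be reproduced on a couple of made-up strings. This sketch compares a naive substring replacement with one that requires whitespace on each side of "st" and puts the spaces back in the replacement:

```python
import re

# Made-up, already-lowercased examples
samples = ['east bay st nassau', 'first trust street']

# Naive substring replacement also rewrites "east", "first", "trust", and "street"
naive = [s.replace('st', 'street') for s in samples]

# Requiring whitespace on both sides only touches the standalone token;
# the replacement restores the spaces so neighbouring words stay separate
spaced = [re.sub(r'\sst\s', ' street ', s) for s in samples]

print(naive)
print(spaced)
```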

In [23]:
abbrev_dict = {r'\sst\s': ' street ',
               r'str\s': 'street ',
               'streets': 'street',
               'sts': 'street',
               'blvd': 'boulevard',
               r'sq\s': 'square ',
               r'dr\s': 'drive ',
               r'ave\s': 'avenue ',
               r'\sln': ' lane',
               'lanes': 'lane',
               'hwy': 'highway',
               '1st': 'first',
               '2nd': 'second',
               '2 nd': 'second',
               '3rd': 'third',
               '4th': 'fourth',
               '5th': 'fifth',
               '6th': 'sixth',
               '7th': 'seventh',
               '8th': 'eighth',
               '9th': 'ninth',
               'p o': 'po',
               'pobox': 'po box',
               r'\s\s+': ' ',
               'nassaubahamas': 'nassau bahamas'}

In [24]:
df['working_address'] = df['working_address'].replace(abbrev_dict, regex=True)

In [25]:
# Re-insert the space where "street" is fused to the following word (e.g. "streetnassau")
df['working_address'] = df['working_address'].str.replace(r'street(?=\w)', 'street ', regex=True)

# Re-insert the space where "street" is fused to the preceding word (e.g. "unionstreet")
df['working_address'] = df['working_address'].str.replace(r'(?<=\w)street', ' street', regex=True)

## Random extras

In [28]:
df[df['working_address'].str.contains("343nass")]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
681,14035227,CB 11.343/Nassau Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb 11 343nassau bahamas,"[cb, 11.343/nassau, bahamas]"


In [29]:
df.loc[df['working_address'].str.contains("343nass"), 'working_address'] = 'cb 11 343 nassau bahamas'

In [30]:
df[df['working_address'].str.contains("343nass")]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist


In [32]:
df[df['working_address'].str.contains(" 343 nass")]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
577,14026445,ALEMAN; CORDERO; GALINDO & LEE (BAHAMAS) LIMITED; BOLAM HOUSE; KING & GEORGES STREETS; PO BOX CB 11.343; NASSAU; BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,aleman cordero galindo and lee bahamas limited bolam house king and georges street po box cb 11 343 nassau bahamas,"[aleman;, cordero;, galindo, &, lee, (bahamas), limited;, bolam, house;, king, &, georges, streets;, po, box, cb, 11.343;, nassau;, bahamas]"
578,14026446,Aleman; Cordero; Galindo & Lee (Bahamas) Limited; Bolam House; King & George Streets; PO Box CB 11.343; Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,aleman cordero galindo and lee bahamas limited bolam house king and george street po box cb 11 343 nassau bahamas,"[aleman;, cordero;, galindo, &, lee, (bahamas), limited;, bolam, house;, king, &, george, streets;, po, box, cb, 11.343;, nassau;, bahamas]"
655,14031493,BOLAM HOUSE KING & GEORGE STREETS P O BOX CB 11.343 NASSAU BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bolam house king and george street po box cb 11 343 nassau bahamas,"[bolam, house, king, &, george, streets, p, o, box, cb, 11.343, nassau, bahamas]"
656,14031494,Bolam House; King & George Streets; PO BOX CB 11-343; NASSAU; BAHAMAS,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bolam house king and george street po box cb 11 343 nassau bahamas,"[bolam, house;, king, &, george, streets;, po, box, cb, 11-343;, nassau;, bahamas]"
657,14031495,Bolam House; King and George Street; P.O. Box 11; 343; Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bolam house king and george street po box 11 343 nassau bahamas,"[bolam, house;, king, and, george, street;, p.o., box, 11;, 343;, nassau;, bahamas]"
658,14031496,Bolam House; King and George Streets. P.O. Box CB 11.343 Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,bolam house king and george street po box cb 11 343 nassau bahamas,"[bolam, house;, king, and, george, streets., p.o., box, cb, 11.343, nassau;, bahamas]"
681,14035227,CB 11.343/Nassau Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb 11 343 nassau bahamas,"[cb, 11.343/nassau, bahamas]"
683,14035229,CB11.343; Nassau; Bahamas,,Bahamas,BHS,Panama Papers,The Panama Papers data is current through 2015,,cb11 343 nassau bahamas,"[cb11.343;, nassau;, bahamas]"
2208,240450064,"WINTERBOTHAM PLACE, MARLBOROUGH & QUEEN STREETS, P.O. BOX CB 11.343, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - SFM Corporate Services,Provider data is current through 2015,,winterbotham place marlborough and queen street po box cb 11 343 nassau bahamas,"[winterbotham, place,, marlborough, &, queen, streets,, p.o., box, cb, 11.343,, nassau,, bahamas]"


## Save to csv

In [100]:
df.drop('address_wordlist', axis=1).to_csv('data/parsed_bahamas_addresses.csv', index=False)