# NLP 0?: Word Frequencies

To be honest, I did a lot of this the hard way to start. It involved manually filtering an Excel spreadsheet version of the addresses, scrolling through thousands of addresses, copying and pasting, and making lots of processing mistakes. It occurred to me that a lot of that could have been accomplished by looking at word frequencies, then searching the dataset for the most frequent terms.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from time import gmtime, strftime
import sys
import os
import io

import string
import re

In [2]:
df = pd.read_csv('data/edited/addresses_Bahamas.csv')

## Create a new column to work with

It's important to keep a copy of the original values. This makes it much easier to revert changes that don't perform as expected, and gives me something to reference if data is lost or needs to be put back into context. I've lost count of the number of times I've needed to walk back a certain processing technique.

In [3]:
df['working_address'] = df['address']

## Word counts

I frequently use word counts to determine any words I need to add to the stopword list (stopwords are words that occur frequently, such as 'the' or 'a' that don't add value to the analysis and should be removed).

In this case, I'm hoping it will bubble up things like cities, states, islands, other address features that occur frequently, and punctuation usage. 

First I'll fill blank lines with an empty string to make processing the full dataset easier (i.e. I really don't want to deal with NaNs). Next I lowercase everything (there is no reason for "Annex" and "annex" to be two different values in this analysis). Then I split on whitespace. Lastly, I reuse the frequency function from [Entry NLP4: Frequencies and Comparison](https://julielinx.github.io/blog/nlp04_vocal_auth_freq_compare/) to get the word counts.

In [6]:
df['address_wordlist'] = df['working_address'].fillna('').str.lower().str.split()

In [8]:
df.address_wordlist

0       [annex, frederick, &, shirley, sts,, p.o., box...
1       [suite, e-2,union, court, building,, p.o., box...
2       [lyford, cay, house,, lyford, cay,, p.o., box,...
3       [p.o., box, n-3708, bahamas, financial, centre...
4       [lyford, cay, house,, 3rd, floor,, lyford, cay...
                              ...                        
2253    [j.p.morgan, trust, company, (bahamas), limite...
2254    [montagne, sterline, centre., east, bav, stree...
2255    [deltec, house,, lyford, cay,, po, box, n-3229...
2256    [providence, house,, hast, wing,, east, hill, ...
2257    [c/oj.p., morgan, trust, company, (bahamas), l...
Name: address_wordlist, Length: 2258, dtype: object

In [9]:
def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] +=1
    return freq_dict

In [10]:
freq_df = pd.DataFrame.from_dict(
    frequency_ct(df['address_wordlist'].sum()
                ), orient='index').reset_index().rename(
    columns={'index':'word', 0:'count'}).sort_values('count', ascending=False)
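As a side note, the same tally can be sketched with `collections.Counter` from the standard library. This is an alternative, not the function reused from the earlier entry, and the `toy_addresses` list below is made up for illustration; the real input would be the `address_wordlist` column.

```python
from collections import Counter

# Made-up addresses standing in for the real column (illustration only)
toy_addresses = [
    'Annex Frederick & Shirley Sts, P.O. Box N-4805, Nassau, Bahamas',
    'PO Box N-877, Nassau, Bahamas',
    '',  # a blank row, as produced by fillna('')
]

# Same preparation as above: lowercase, then split on whitespace
wordlists = [address.lower().split() for address in toy_addresses]

# Count every token across all rows without first building one giant list
counts = Counter(word for words in wordlists for word in words)

# most_common() returns (word, count) pairs sorted by descending count
for word, count in counts.most_common(3):
    print(word, count)
```

Because `Counter` consumes the words one row at a time, it avoids the intermediate list that `.sum()` builds, which may matter on larger datasets.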

In [11]:
freq_df

|  | word | count |
|---|---|---|
| 9 | bahamas | 2140 |
| 6 | box | 1447 |
| 8 | nassau, | 974 |
| 5 | p.o. | 889 |
| 1008 | nassau; | 772 |
| ... | ... | ... |
| 1428 | 13.253; | 1 |
| 1426 | l | 1 |
| 1423 | 12102; | 1 |
| 1422 | 42498 | 1 |


**Note**: when applied to a column of lists, `.sum()` concatenates them into a single list. If that column is saved to a `.csv` and then reloaded, each list is read back in as a string, so `.sum()` returns one long string that looks like a run of lists but is really the stringified lists concatenated together.

In [13]:
df['address_wordlist'].sum()[:20]

['annex',
 'frederick',
 '&',
 'shirley',
 'sts,',
 'p.o.',
 'box',
 'n-4805,',
 'nassau,',
 'bahamas',
 'suite',
 'e-2,union',
 'court',
 'building,',
 'p.o.',
 'box',
 'n-8188,',
 'nassau,',
 'bahamas',
 'lyford']

In [24]:
df.head(10).to_csv('stringified_list.csv', index=False)
str_lst_df = pd.read_csv('stringified_list.csv')

In [25]:
str_lst_df['address_wordlist'].sum()

'[\'annex\', \'frederick\', \'&\', \'shirley\', \'sts,\', \'p.o.\', \'box\', \'n-4805,\', \'nassau,\', \'bahamas\'][\'suite\', \'e-2,union\', \'court\', \'building,\', \'p.o.\', \'box\', \'n-8188,\', \'nassau,\', \'bahamas\'][\'lyford\', \'cay\', \'house,\', \'lyford\', \'cay,\', \'p.o.\', \'box\', \'n-7785,\', \'nassau,\', \'bahamas\'][\'p.o.\', \'box\', \'n-3708\', \'bahamas\', \'financial\', \'centre,\', \'p.o.\', \'box\', \'n-3708\', \'shirley\', \'&\', \'charlotte\', \'sts,\', \'nassau,\', \'bahamas\'][\'lyford\', \'cay\', \'house,\', \'3rd\', \'floor,\', \'lyford\', \'cay,\', \'p.o.\', \'box\', \'n-3024,\', \'nassau,\', \'bahamas\'][\'303\', \'shirley\', \'street,\', \'p.o.\', \'box\', \'n-492,\', \'nassau,\', \'bahamas\'][\'ocean\', \'centre,\', \'montagu\', \'foreshore,\', \'p.o.\', \'box\', \'ss-19084\', \'east\', \'bay\', \'street,\', \'nassau,\', \'bahamas\'][\'providence\', \'house,\', \'east\', \'wing\', \'east\', \'hill\', \'st,\', \'p.o.\', \'box\', \'cb-12399,\', \'nass
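If the lists do need to survive a round trip through a `.csv`, one possible fix is to parse each string back into a list with `ast.literal_eval`. The sketch below uses two made-up cells rather than the real file, so treat the values as placeholders.

```python
import ast

# What a list column looks like after a CSV round trip: every cell is a string
stringified = ["['annex', 'frederick', '&']", "['po', 'box', 'n-877,']"]

# Concatenating the cells (what .sum() does here) yields one long string
as_string = ''.join(stringified)

# ast.literal_eval safely parses each string back into a real Python list
restored = [ast.literal_eval(cell) for cell in stringified]

# Flatten the restored lists into a single list of words
all_words = [word for words in restored for word in words]
print(all_words)
```

With pandas, the same idea would be `str_lst_df['address_wordlist'].apply(ast.literal_eval)` before calling `.sum()`.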

## Non-standardizations

I only need to see the top 20 results to start identifying ways to standardize the data.

- PO Boxes are written both as "p.o." and "po"
- There are both "&" and "and"

I'll double check to make sure these are what I think they are.

In [27]:
freq_df.head(20)

|  | word | count |
|---|---|---|
| 9 | bahamas | 2140 |
| 6 | box | 1447 |
| 8 | nassau, | 974 |
| 5 | p.o. | 889 |
| 1008 | nassau; | 772 |
| 3 | shirley | 484 |
| 417 | po | 461 |
| 10 | suite | 445 |
| 35 | bay | 405 |
| 1009 | street; | 362 |


In [31]:
pd.set_option('display.max_colwidth', 1000)

In [50]:
df[df['working_address'].str.lower().str.contains(r'p\.?o\.?', regex=True)].tail()

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2225,240491356,"P.O. BOX N- 3944, SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, NASSAU, BAHAMAS, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"P.O. BOX N- 3944, SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, NASSAU, BAHAMAS, NASSAU, BAHAMAS","[p.o., box, n-, 3944,, suite, 200b,, 2nd, floor,, centre, of, commerce,, one, bay, street,, nassau,, bahamas,, nassau,, bahamas]"
2227,240491474,"SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, PO BOX N-3944, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, PO BOX N-3944, NASSAU, BAHAMAS","[suite, 200b,, 2nd, floor,, centre, of, commerce,, one, bay, street,, po, box, n-3944,, nassau,, bahamas]"
2229,240491518,"RBC TRUST COMPANY (BAHAMAS) LIMITED, BAYSIDE EXECUTIVE PARK BUILDING 3, P.O. BOX NO. 30-24, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"RBC TRUST COMPANY (BAHAMAS) LIMITED, BAYSIDE EXECUTIVE PARK BUILDING 3, P.O. BOX NO. 30-24, NASSAU, BAHAMAS","[rbc, trust, company, (bahamas), limited,, bayside, executive, park, building, 3,, p.o., box, no., 30-24,, nassau,, bahamas]"
2255,240491733,"DELTEC HOUSE, LYFORD CAY, PO BOX N-3229, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"DELTEC HOUSE, LYFORD CAY, PO BOX N-3229, NASSAU, BAHAMAS","[deltec, house,, lyford, cay,, po, box, n-3229,, nassau,, bahamas]"
2256,240491778,"PROVIDENCE HOUSE, HAST WING, EAST HILL STREET, P.O. BOX CB-12399, NASSAU, CB-12399, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"PROVIDENCE HOUSE, HAST WING, EAST HILL STREET, P.O. BOX CB-12399, NASSAU, CB-12399, BAHAMAS","[providence, house,, hast, wing,, east, hill, street,, p.o., box, cb-12399,, nassau,, cb-12399,, bahamas]"


Even in just these five rows, I can see "po box" and "p.o. box" are both represented.

In [53]:
df[df['working_address'].str.lower().str.contains(r'p\.?o\.?b', regex=True)].tail()

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
1807,88017491,"BOLAM HOUSE, KING AND GEORGE STREETS P.O.BOX N-514, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current through 2016,,"BOLAM HOUSE, KING AND GEORGE STREETS P.O.BOX N-514, NASSAU, BAHAMAS","[bolam, house,, king, and, george, streets, p.o.box, n-514,, nassau,, bahamas]"
2123,120000350,"SAFFREY SQUARE, SUITE 205 BANK LANE, P.O.BOX N, 8188, NASSAU, BAHAMAS.","SAFFREY SQUARE, SUITE 205 BANK LANE, P.O.BOX N, 8188, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,"SAFFREY SQUARE, SUITE 205 BANK LANE, P.O.BOX N, 8188, NASSAU, BAHAMAS.","[saffrey, square,, suite, 205, bank, lane,, p.o.box, n,, 8188,, nassau,, bahamas.]"
2143,120010247,"LENNOX PATON CORPORATE SERVICES LIMITED, P.O.BOX N-4875, NASSAU, BAHAMAS.","LENNOX PATON CORPORATE SERVICES LIMITED, P.O.BOX N-4875, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,"LENNOX PATON CORPORATE SERVICES LIMITED, P.O.BOX N-4875, NASSAU, BAHAMAS.","[lennox, paton, corporate, services, limited,, p.o.box, n-4875,, nassau,, bahamas.]"
2150,120006606,"OCEAN CENTRE, MONTAGU FORESHORE, EAST BAY STREET, P.O.BOX SS-19084, NASSAU, BAHAMAS.","OCEAN CENTRE, MONTAGU FORESHORE, EAST BAY STREET, P.O.BOX SS-19084, NASSAU, BAHAMAS.",Bahamas,BHS,Paradise Papers - Barbados corporate registry,Barbados corporate registry data is current through 2016,,"OCEAN CENTRE, MONTAGU FORESHORE, EAST BAY STREET, P.O.BOX SS-19084, NASSAU, BAHAMAS.","[ocean, centre,, montagu, foreshore,, east, bay, street,, p.o.box, ss-19084,, nassau,, bahamas.]"
2193,240001242,"DOMINION HOUSE60,MONTROSE AVENUE, P.O.BOX N-9932, NASSAU, BAHAMAS",,Bahamas,BHS,"Pandora Papers - Alemán, Cordero, Galindo & Lee (Alcogal)",Provider data is current through 2018,,"DOMINION HOUSE60,MONTROSE AVENUE, P.O.BOX N-9932, NASSAU, BAHAMAS","[dominion, house60,montrose, avenue,, p.o.box, n-9932,, nassau,, bahamas]"


If I dive a little further, I also see there are some instances where the two words are run together, punctuation other than "." separates the letters, and even a few where the "po" or "po box" portion has been left off entirely.

In [65]:
df[~df['working_address'].str.lower().str.contains(r'p\.?o\.?', regex=True)]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
235,24000236,"#4 PINEAPPLE GROVE,OLD FORT BAY, NEW PRODIVENCE, BOX SP-60063, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#4 PINEAPPLE GROVE,OLD FORT BAY, NEW PRODIVENCE, BOX SP-60063, NASSAU, BAHAMAS","[#4, pineapple, grove,old, fort, bay,, new, prodivence,, box, sp-60063,, nassau,, bahamas]"
255,24000256,"P,O, BOX N-4759, NASSAU",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"P,O, BOX N-4759, NASSAU","[p,o,, box, n-4759,, nassau]"
290,24000291,"3RD FLOOR TRADE WINDS BLDG, BAY ST P>O. BOX CB 12724",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"3RD FLOOR TRADE WINDS BLDG, BAY ST P>O. BOX CB 12724","[3rd, floor, trade, winds, bldg,, bay, st, p>o., box, cb, 12724]"
484,24000485,"#70 WULFF ROAD, NASSAU, BAHAMAS N-989",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"#70 WULFF ROAD, NASSAU, BAHAMAS N-989","[#70, wulff, road,, nassau,, bahamas, n-989]"
485,24000486,DEVEAUX STREET,,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,DEVEAUX STREET,"[deveaux, street]"
...,...,...,...,...,...,...,...,...,...,...
2251,240492292,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, NASSAU, N-4899, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, NASSAU, N-4899, BAHAMAS","[j.p., morgan, trust, company, (bahamas), limited,, 2nd, floor, bahamas, financial, centre,, nassau,, n-4899,, bahamas]"
2252,240492375,"DELTEC HOUSE, LYFORD CAY, NASSAU, COUNTRY BAHAMAS, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"DELTEC HOUSE, LYFORD CAY, NASSAU, COUNTRY BAHAMAS, BAHAMAS","[deltec, house,, lyford, cay,, nassau,, country, bahamas,, bahamas]"
2253,240492525,"J.P.MORGAN TRUST COMPANY (BAHAMAS) LIMITED, NASSAU, N-4899, ZH, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"J.P.MORGAN TRUST COMPANY (BAHAMAS) LIMITED, NASSAU, N-4899, ZH, BAHAMAS","[j.p.morgan, trust, company, (bahamas), limited,, nassau,, n-4899,, zh,, bahamas]"
2254,240492536,"MONTAGNE STERLINE CENTRE. EAST BAV STREET, NASSAU, COUNTRY BAHAMAS, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"MONTAGNE STERLINE CENTRE. EAST BAV STREET, NASSAU, COUNTRY BAHAMAS, BAHAMAS","[montagne, sterline, centre., east, bav, street,, nassau,, country, bahamas,, bahamas]"


In [55]:
df[df['working_address'].str.lower().str.contains(r'^p\.?o\.?.?box', regex=True)].tail()

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
2118,33000328,"PO BOX N-877, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"PO BOX N-877, NASSAU, BAHAMAS","[po, box, n-877,, nassau,, bahamas]"
2121,33000331,"PO BOX EE-17971, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"PO BOX EE-17971, NASSAU, BAHAMAS","[po, box, ee-17971,, nassau,, bahamas]"
2122,33000332,"PO BOX N-1000, NASSAU, BAHAMAS",,Bahamas,BHS,Paradise Papers - Bahamas corporate registry,Bahamas corporate registry data is current through 2016,,"PO BOX N-1000, NASSAU, BAHAMAS","[po, box, n-1000,, nassau,, bahamas]"
2169,240360721,"PO BOX AP59223, SLOT# 308, PALATIAL ESTATES, LOT# 4, NEW PROVIDENCE, BAHAMAS",,Bahamas,BHS,Pandora Papers - Fidelity Corporate Services,Provider data is current through 2017,,"PO BOX AP59223, SLOT# 308, PALATIAL ESTATES, LOT# 4, NEW PROVIDENCE, BAHAMAS","[po, box, ap59223,, slot#, 308,, palatial, estates,, lot#, 4,, new, providence,, bahamas]"
2225,240491356,"P.O. BOX N- 3944, SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, NASSAU, BAHAMAS, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"P.O. BOX N- 3944, SUITE 200B, 2ND FLOOR, CENTRE OF COMMERCE, ONE BAY STREET, NASSAU, BAHAMAS, NASSAU, BAHAMAS","[p.o., box, n-, 3944,, suite, 200b,, 2nd, floor,, centre, of, commerce,, one, bay, street,, nassau,, bahamas,, nassau,, bahamas]"


In [52]:
df[df['working_address'].str.lower().str.contains('&|and')]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note,working_address,address_wordlist
0,24000001,"ANNEX FREDERICK & SHIRLEY STS, P.O. BOX N-4805, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"ANNEX FREDERICK & SHIRLEY STS, P.O. BOX N-4805, NASSAU, BAHAMAS","[annex, frederick, &, shirley, sts,, p.o., box, n-4805,, nassau,, bahamas]"
3,24000004,"P.O. BOX N-3708 BAHAMAS FINANCIAL CENTRE, P.O. BOX N-3708 SHIRLEY & CHARLOTTE STS, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"P.O. BOX N-3708 BAHAMAS FINANCIAL CENTRE, P.O. BOX N-3708 SHIRLEY & CHARLOTTE STS, NASSAU, BAHAMAS","[p.o., box, n-3708, bahamas, financial, centre,, p.o., box, n-3708, shirley, &, charlotte, sts,, nassau,, bahamas]"
8,24000009,"BAYSIDE EXECUTIVE PARK, WEST BAY & BLAKE, P.O. BOX N-4875, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"BAYSIDE EXECUTIVE PARK, WEST BAY & BLAKE, P.O. BOX N-4875, NASSAU, BAHAMAS","[bayside, executive, park,, west, bay, &, blake,, p.o., box, n-4875,, nassau,, bahamas]"
10,24000011,"TK HOUSE, BAYSIDE EXECUTIVE PARK, P.O. BOX AP-59213 WEST BAY & BLAKE ROAD, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"TK HOUSE, BAYSIDE EXECUTIVE PARK, P.O. BOX AP-59213 WEST BAY & BLAKE ROAD, NASSAU, BAHAMAS","[tk, house,, bayside, executive, park,, p.o., box, ap-59213, west, bay, &, blake, road,, nassau,, bahamas]"
11,24000012,"BAYSIDE HOUSE WEST BAY & BLAKE ROAD, P.O. BOX AP-59213, NASSAU, BAHAMAS",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,"BAYSIDE HOUSE WEST BAY & BLAKE ROAD, P.O. BOX AP-59213, NASSAU, BAHAMAS","[bayside, house, west, bay, &, blake, road,, p.o., box, ap-59213,, nassau,, bahamas]"
...,...,...,...,...,...,...,...,...,...,...
2244,240492203,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, NEW PROVIDENCE, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, NEW PROVIDENCE, BAHAMAS","[j.p., morgan, trust, company, (bahamas), limited,, 2nd, floor, bahamas, financial, centre,, shirley, and, charlotte, street,, nassau,, n-4899,, new, providence,, bahamas]"
2245,240492204,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"J.P. MORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, N-4899, BAHAMAS","[j.p., morgan, trust, company, (bahamas), limited,, 2nd, floor, bahamas, financial, centre,, shirley, and, charlotte, street,, nassau,, n-4899,, bahamas]"
2246,240492207,"SHIRLEY AND CHARLOTTE STREETS, NASSAU, COUNTRY BAHAMAS, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2016,,"SHIRLEY AND CHARLOTTE STREETS, NASSAU, COUNTRY BAHAMAS, BAHAMAS","[shirley, and, charlotte, streets,, nassau,, country, bahamas,, bahamas]"
2248,240492221,"JPMORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR, BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, BAHAMAS",,Bahamas,BHS,Pandora Papers - Trident Trust,Provider data is current through 2017,,"JPMORGAN TRUST COMPANY (BAHAMAS) LIMITED, 2ND FLOOR, BAHAMAS FINANCIAL CENTRE, SHIRLEY AND CHARLOTTE STREET, NASSAU, BAHAMAS","[jpmorgan, trust, company, (bahamas), limited,, 2nd, floor,, bahamas, financial, centre,, shirley, and, charlotte, street,, nassau,, bahamas]"


## Standardize values

Most matching algorithms are based on exact matches. Small changes can make using these common matching functions useless. Such changes include abbreviations like "st" and "blvd" for "street" and "boulevard", capitalization differences such as "Annex Frederick" vs "ANNEX FREDERICK", and punctuation such as "P.O. Box" vs "PO Box". As a general rule, the use of multiple representations for the same thing makes it harder to match like values.

Standardizing datasets makes finding patterns and trends much easier.

### Replace "&"

In the Bahamas addresses "&" and "and" are used interchangeably.

I did a quick sanity check to ensure that "&" wasn't being used in another way. There are only 326 rows that use "&". A quick perusal shows that most, if not all, are used to connect street names.
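A sanity check along these lines can be sketched in plain Python. The three sample rows are made up for illustration (the real check would run against the `address` column), and the word-boundary pattern is my suggestion rather than part of the search used above.

```python
import re

# Made-up sample rows (illustration only)
sample = [
    'ANNEX FREDERICK & SHIRLEY STS, NASSAU, BAHAMAS',
    'SHIRLEY AND CHARLOTTE STREET, NASSAU, BAHAMAS',
    'PARADISE ISLAND, BAHAMAS',
]

# Rows that contain a literal ampersand
amp_rows = [row for row in sample if '&' in row]

# A plain substring search for "and" also hits words such as "ISLAND"
loose = [row for row in sample if 'and' in row.lower()]

# \b word boundaries restrict the match to the standalone word "and"
strict = [row for row in sample if re.search(r'\band\b', row.lower())]

print(len(amp_rows), len(loose), len(strict))
```

Comparing the loose and strict counts is a quick way to see how many hits come from "and" buried inside another word.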

### Lowercase everything

Different capitalization strategies quickly complicate an analysis as most value matching is based on exact matches.

In [None]:
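# Replace "&" with "and", then lowercase everything
df['working_address'] = df['address'].str.replace('&', 'and', regex=False).str.lower()

# In rows with a hyphenated box number (e.g. "box n-4805"), remove every hyphen
# so the box number stays together as one token
box_mask = df['working_address'].str.contains(r'box\s?\w+-\d+', na=False)
df.loc[box_mask, 'working_address'] = df.loc[box_mask, 'working_address'].str.replace('-', '', regex=False)

# Replace the remaining periods, commas, and hyphens with spaces
df['working_address'] = df['working_address'].str.replace(r'\.|,|-', ' ', regex=True)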

### Remove punctuation

Punctuation can be particularly helpful in splitting formatted text blocks into smaller pieces. However, there is no standardized format for these addresses, so the punctuation actually makes pulling out relevant information harder in this dataset. As such, I'm going to get rid of it.

In [None]:
newline_list = '\t\r\n'
remove_newline = str.maketrans(' ', ' ', newline_list)
punct_list = string.punctuation + '—¿–'
nopunct = str.maketrans('', '', punct_list)

In [None]:
df['working_address'] = df['working_address'].str.translate(remove_newline).str.translate(nopunct)
df['working_address']
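To make the effect of the two translation tables concrete, here is a small sketch on a made-up string. It also shows one caveat: deleting tabs and newlines outright glues neighbouring words together, so mapping them to spaces may be preferable. The `whitespace_to_space` table is my own addition, not part of the cells above.

```python
import string

# Same tables as above
remove_newline = str.maketrans(' ', ' ', '\t\r\n')
punct_list = string.punctuation + '—¿–'
nopunct = str.maketrans('', '', punct_list)

# Hypothetical alternative: turn tabs and newlines into spaces instead of deleting them
whitespace_to_space = str.maketrans('\t\r\n', '   ')

# Made-up address fragment (illustration only)
raw = 'suite #4\tbay st\n(nassau)'

# Deleting the tab and newline merges "4" with "bay" and "st" with "nassau"
deleted = raw.translate(remove_newline).translate(nopunct)

# Replacing them with spaces keeps the tokens apart
spaced = raw.translate(whitespace_to_space).translate(nopunct)

print(repr(deleted))
print(repr(spaced))
```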

### Standardize abbreviations

I got the majority of the abbreviations the hard way: I went through the dataset by hand. I was attempting to pull out the city name for each address that had a street in it. The only way to do this for many addresses was to look at the street, resulting in a lot of `contains` searches.

In the frequency count I also noticed that "p" and "o" occur rather frequently. A quick peek shows that in 84 rows "po box" is listed as "p o box". When doing replacements like these, it's important to do sanity checks, as the results won't always be what you expect. One of my favorite examples was a search for "demon" that also returned "demonstrate". Watch out for these kinds of things. Fortunately, in the Bahamas addresses "p o" only occurs for PO boxes.

There are also occurrences of "pobox", and joining "p" and "o" from "p o" produces one more "pobox", so I'll need to apply that replacement after the main abbreviation changes.
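One way to run this kind of sanity check is to capture the pattern together with the word characters around it and tally what comes back. The sketch below uses made-up rows, including a deliberately planted false positive, so the counts say nothing about the real dataset.

```python
import re
from collections import Counter

# Made-up lowercase addresses (illustration only)
sample = [
    'p o box n 4805 nassau bahamas',
    'shop one bay street nassau',
    'po box n 877 nassau bahamas',
]

# Match "p o" plus any word characters attached on either side,
# so unintended hits such as "shop one" show up in the tally
pattern = re.compile(r'\w*p o\w*')
hits = Counter(match for row in sample for match in pattern.findall(row))

print(hits)
```

If anything other than the bare "p o" appears in the tally, the replacement pattern needs tightening before it is applied.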

#### The difficulties of "street"

Speaking of tricky searches, "st" was particularly interesting. Simply replacing "st" with "street" will also alter the following words:

- street: streetreet
- sts: streets
- west: westreet
- east: eastreet
- st: street

I compromised by looking for " st ", essentially ensuring there was a space before and after "st".

Added complication: "st" is also an abbreviation for "saint".
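The difference between a naive replacement and a bounded one can be sketched with the standard `re` module. The `\b` word-boundary pattern here is an alternative to the " st " compromise, not what the dictionary below uses, and the sample address is made up. It does nothing to resolve the "saint" ambiguity.

```python
import re

# Made-up address (illustration only)
address = 'west bay st and east st'

# Naive substring replacement corrupts every word that merely contains "st"
naive = address.replace('st', 'street')

# \b matches at word boundaries, so only the standalone token "st" changes,
# including one at the very end of the string where " st " would miss it
bounded = re.sub(r'\bst\b', 'street', address)

print(naive)
print(bounded)
```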

In [None]:
abbrev_dict = {r'\sst\s': ' street ',
               r'str\s': 'street ',
               'streets': 'street',
               'sts': 'street',
               'blvd': 'boulevard',
               r'sq\s': 'square ',
               r'dr\s': 'drive ',
               r'ave\s': 'avenue ',
               r'\sln': ' lane',
               'lanes': 'lane',
               'hwy': 'highway',
               '1st': 'first',
               '2nd': 'second',
               '2 nd': 'second',
               '3rd': 'third',
               '4th': 'fourth',
               '5th': 'fifth',
               '6th': 'sixth',
               '7th': 'seventh',
               '8th': 'eighth',
               '9th': 'ninth',
               'p o': 'po',
               'pobox': 'po box',
               r'\s\s+': ' ',
               'nassaubahamas': 'nassau bahamas'}

In [None]:
df['working_address'] = df['working_address'].replace(abbrev_dict, regex=True)

In [None]:
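# In rows where "street" runs into the following word, add a space after every "street"
glued_after = df['working_address'].str.contains(r'street\w+', na=False)
df.loc[glued_after, 'working_address'] = df.loc[glued_after, 'working_address'].str.replace('street', 'street ', regex=False)

# In rows where "street" runs into the preceding word, add a space before every "street"
glued_before = df['working_address'].str.contains(r'\w+street', na=False)
df.loc[glued_before, 'working_address'] = df.loc[glued_before, 'working_address'].str.replace('street', ' street', regex=False)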

In [None]:
df.to_csv('data/parsed_bahamas_addresses.csv', index=False)