# Download deepparse address data

The https://github.com/GRAAL-Research/deepparse-address-data repository provides easily accessible address data that originally comes from libpostal.
Because these are real addresses, fields such as city and state are correlated with each other the way they are in practice.

Here, we convert it from its rather obscure format (a zip archive of LZMA-compressed pickle files, each holding a Python list) into a dataframe for simulation use.

In [1]:
import pandas as pd, numpy as np
import zipfile
import lzma
import pickle

! whoami
! date

zmbc
Wed Dec  7 16:29:01 PST 2022


In [2]:
! wget https://graal.ift.ulaval.ca/public/deepparse/dataset/data.zip -P ../data/raw/deepparse_address_data

--2022-12-07 16:29:01--  https://graal.ift.ulaval.ca/public/deepparse/dataset/data.zip
Resolving graal.ift.ulaval.ca (graal.ift.ulaval.ca)... 132.203.210.68
Connecting to graal.ift.ulaval.ca (graal.ift.ulaval.ca)|132.203.210.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 438983300 (419M) [application/zip]
Saving to: ‘../data/raw/deepparse_address_data/data.zip.1’


2022-12-07 16:29:12 (39.3 MB/s) - ‘../data/raw/deepparse_address_data/data.zip.1’ saved [438983300/438983300]
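The log above saves to `data.zip.1`, which is what `wget` does when `data.zip` already exists in the target directory, so re-running the cell fetches another 419M copy that the next cell never reads. A minimal sketch of a download step that skips existing files follows; `download_if_missing` is a hypothetical helper, not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if dest.exists():
        # Reuse the earlier download instead of creating data.zip.1.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so that an interrupted download
    # is not mistaken for a complete file on the next run.
    tmp = dest.with_name(dest.name + '.part')
    with urllib.request.urlopen(url) as response, open(tmp, 'wb') as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

It could be called with the URL from the cell above and `'../data/raw/deepparse_address_data/data.zip'` as `dest`.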



In [3]:
with zipfile.ZipFile('../data/raw/deepparse_address_data/data.zip', 'r') as zip_ref:
    zip_ref.extractall('../data/raw/deepparse_address_data/data_extracted')
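If disk space matters, the extraction step could be skipped by reading a member straight out of the archive. This is a sketch only: `load_from_zip` is a hypothetical helper, and the member name in the usage note is inferred from the extracted path used later in this notebook.

```python
import lzma
import pickle
import zipfile


def load_from_zip(zip_path, member):
    """Unpickle one LZMA-compressed member of a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        with zip_ref.open(member) as compressed:
            # lzma.open accepts an already-open binary file object.
            with lzma.open(compressed, 'rb') as f:
                # Only unpickle data from a source you trust.
                return pickle.load(f)
```

For example, `load_from_zip('../data/raw/deepparse_address_data/data.zip', 'data/clean_data/test/us.lzma')` should return the same list as the extracted file, assuming that member name is right.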

**NOTE:** We use only the test split here (the reason is not recorded).
It holds the vast majority of the rows, but we could get an extra 100k rows for free if we also included the training split.
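A sketch of how both splits could be combined, should we want those extra rows. `load_splits` is a hypothetical helper, and it assumes the training file sits at `train/us.lzma` next to `test/us.lzma`, which this notebook does not verify.

```python
import lzma
import pickle
from pathlib import Path


def load_splits(clean_data_dir, splits=('test', 'train'), country='us'):
    """Concatenate the pickled address lists of several splits."""
    records = []
    for split in splits:
        path = Path(clean_data_dir) / split / f'{country}.lzma'
        if not path.exists():
            # The train layout is an assumption, so skip rather than fail.
            print(f'skipping missing file: {path}')
            continue
        with lzma.open(path, 'rb') as f:
            records.extend(pickle.load(f))
    return records
```

Here `clean_data_dir` would be `'../data/raw/deepparse_address_data/data_extracted/data/clean_data'`.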

In [4]:
%%time

with lzma.open('../data/raw/deepparse_address_data/data_extracted/data/clean_data/test/us.lzma', 'rb') as f:
    us_test_data = pickle.load(f)

CPU times: user 19.8 s, sys: 1.24 s, total: 21 s
Wall time: 21 s


In [5]:
len(us_test_data)

7999999

In [6]:
us_test_data[0]

('780 46th st nort birmingham al 35212',
 ['StreetNumber',
  'StreetName',
  'StreetName',
  'StreetName',
  'Municipality',
  'Province',
  'PostalCode'])

In [7]:
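def my_tokenize(x):
    """Group the tokens of one (address string, labels) pair by label."""
    address_str, labels = x
    # One empty token list per distinct label, in order of first appearance.
    token_dict = {label: [] for label in labels}

    # Tokens are space-separated and assumed to align one-to-one with the labels.
    tokens = address_str.split(' ')
    for token, label in zip(tokens, labels):
        token_dict[label].append(token)
    return token_dict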

In [8]:
%%time

address_list = list(map(my_tokenize, us_test_data))

CPU times: user 47.3 s, sys: 4.22 s, total: 51.5 s
Wall time: 51.5 s


In [9]:
len(address_list)

7999999

In [10]:
address_list[0]

{'StreetNumber': ['780'],
 'StreetName': ['46th', 'st', 'nort'],
 'Municipality': ['birmingham'],
 'Province': ['al'],
 'PostalCode': ['35212']}
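One thing to keep in mind: `zip` stops at the shorter of its inputs, so a record whose token count differs from its label count (for instance because of a doubled space) would be truncated silently. Whether such records exist in this dataset is not checked here. A sketch of a stricter variant, with `tokenize_checked` as a hypothetical name:

```python
def tokenize_checked(record):
    """Group tokens by label, failing loudly when tokens and labels disagree."""
    address_str, labels = record
    tokens = address_str.split(' ')
    if len(tokens) != len(labels):
        raise ValueError(
            f'{len(tokens)} tokens but {len(labels)} labels: {address_str!r}'
        )
    token_dict = {}
    for token, label in zip(tokens, labels):
        # setdefault creates the list the first time a label is seen.
        token_dict.setdefault(label, []).append(token)
    return token_dict


example = ('780 46th st nort birmingham al 35212',
           ['StreetNumber', 'StreetName', 'StreetName', 'StreetName',
            'Municipality', 'Province', 'PostalCode'])
print(tokenize_checked(example))
```

On the first record this gives the same grouping as `my_tokenize`.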

In [11]:
%%time

df_address = pd.DataFrame([{k: ' '.join(v) for k, v in address.items()} for address in address_list]).fillna('')
df_address

CPU times: user 16.6 s, sys: 1.52 s, total: 18.1 s
Wall time: 18.1 s


| | StreetNumber | StreetName | Municipality | Province | PostalCode | Unit |
|---|---|---|---|---|---|---|
| 0 | 780 | 46th st nort | birmingham | al | 35212 | |
| 1 | 106 | saranac dr | missoula | mt | 59803 | |
| 2 | 4140 | edgemere ct | indianapolis | in | 46205 | |
| 3 | 1616 | nw 4th st | boca raton | fl | 33486 | |
| 4 | n87w 36989 | mapleton road | oconomowoc | wi | 53066 | |
| ... | ... | ... | ... | ... | ... | ... |
| 7999994 | 6160 | gln oak st | los angeles | ca | 90068 | |
| 7999995 | 7128 | marbella ct | cape canaveral | fl | 32920 | |
| 7999996 | 215 | spring st | martinsville | virginia | 24112 | |
| 7999997 | 77 | centennial avenue | gloucester | ma | 01930 | |


In [12]:
df_address.to_csv('deepparse_address_data_usa.csv.bz2')
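When this file is read back, pandas will by default parse `PostalCode` as a number (turning `01930` into `1930`) and turn the empty `Unit` fields into `NaN`. A sketch of a reader that keeps everything as text; `read_address_csv` is a hypothetical helper, not something the downstream code is known to use.

```python
import pandas as pd


def read_address_csv(path):
    """Read the saved address table back with every field kept as text."""
    return pd.read_csv(
        path,
        index_col=0,            # first column is the index that to_csv wrote
        dtype=str,              # keep '01930' from becoming the integer 1930
        keep_default_na=False,  # keep empty fields as '' instead of NaN
    )
```

The `.bz2` compression is inferred from the file extension on both write and read, so no extra argument is needed for it.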

In [13]:
! diff deepparse_address_data_usa.csv.bz2 /home/j/Project/simulation_science/prl/data/deepparse_address_data_usa.csv.bz2

In [14]:
# ! mv deepparse_address_data_usa.csv.bz2 /home/j/Project/simulation_science/prl/data/deepparse_address_data_usa.csv.bz2
! rm deepparse_address_data_usa.csv.bz2