# Get a list of addresses to geocode

In [1]:
import pandas as pd

In [2]:
oakland311 = pd.read_csv('../oakland_311_cleaned.csv')
oakland311.tail()

Unnamed: 0,REQUESTID,DATETIMEINIT,SOURCE,DESCRIPTION,REQCATEGORY,REQADDRESS,STATUS,REFERREDTO,DATETIMECLOSED,SRX,SRY,COUNCILDISTRICT,BEAT,PROBADDRESS,City,State,ELAPSED_TIME
935469,1114706,2021-05-19 16:27:17,SeeClickFix,Parking - Abandoned Vehicle,POLICE,"(37.773695001577835, -122.17537597640717)",CLOSED,,2021-08-02 13:20:17,6077464.481,2108516.009,CCD6,29X,6618 LAIRD AVE,Oakland,CA,74 days 20:53:00
935470,1124225,2021-06-24 12:07:46,SeeClickFix,"Illegal Dumping - debris, appliances, etc.",ILLDUMP,"(37.849167834332334, -122.28460742401488)",CLOSED,,2021-06-25 17:38:07,6046423.116,2136576.035,CCD1,10X,6628 HELEN CT,Oakland,CA,1 days 05:30:21
935471,1119508,2021-06-07 15:25:51,SeeClickFix,Parking - Abandoned Vehicle,POLICE,"(37.802163003247536, -122.23499896883865)",CANCEL,,,6060426.881,2119194.029,CCD2,17Y,3219 ELLIOT ST,Oakland,CA,
935472,1114657,2021-05-19 14:35:57,Voicemail,Recycling - Repair/Replace Cart,RECYCLING,"(37.81240377066922, -122.27655475245274)",CLOSED,,2021-05-21 15:42:05,6048493.657,2123147.497,CCD3,05X,2131 WEST ST,Oakland,CA,2 days 01:06:08
935473,1108146,2021-04-26 14:37:02,Phone,"Illegal Dumping - debris, appliances, etc.",ILLDUMP,"(37.84761984869444, -122.25216142316818)",CLOSED,,2021-04-30 15:35:32,6055779.117,2135835.46,CCD1,12Y,5951 COLLEGE AV,Oakland,CA,4 days 00:58:30


## What are you noticing about geographic point data and street addresses?

- `REQADDRESS` holds geographic coordinates as a `(latitude, longitude)` string
- `PROBADDRESS` holds the street address
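
If you ever need those coordinates as numbers, here is a minimal sketch of one way to split the `REQADDRESS` string into two float columns. The toy Series and the column names `lat` and `lon` are my own assumptions for illustration; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample in the same "(lat, lon)" string format as REQADDRESS
req = pd.Series(['(37.773695001577835, -122.17537597640717)', None])

# Capture the two signed decimal numbers; rows with no match become NaN
pattern = r'\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)'
coords = req.str.extract(pattern).astype(float)
coords.columns = ['lat', 'lon']  # assumed names, pick whatever suits you

print(coords)
```

Rows without coordinates come out as `NaN` in both columns, and those are the rows we will want to geocode.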


I like to use `df.sample(n)` to get random rows of data. Let's look at the addresses we have to clean!

In [3]:
oakland311.sample(20)[['PROBADDRESS']]

Unnamed: 0,PROBADDRESS
309077,5620 HARMON AV
12675,646 BOULEVARD WAY
784778,14TH AVE & E 12TH ST
222449,FIRE STATION 21
148245,3318 MAPLE AVE.
747491,DIMOND LIBRARY
783080,45TH AVE & SAN LEANDRO ST
587490,9332 E ST
377562,JACK LONDON AQUATICS CENTER
559989,MCDONELL AV & FIRE RD


We learned some string normalizing ideas in [Lecture 1031](../lecture1031/lecture1031_pt3_cleaning.ipynb). String normalizing can help you spot duplicate data. Why would we want to do this?

Let's take the address `2505 Hearst Ave, Berkeley, CA 94709`. It could be written in the data as:
- 2505 HEARST AVE
- 2505 Hearst Ave

In this case, if we uppercase everything, pandas will treat the 2 addresses as identical, so they count as duplicates. Then when you request the geographic points from a geocoder, you won't have to submit both `2505 HEARST AVE` and `2505 Hearst Ave`.
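
Here is a tiny sketch of that idea, using a made-up two-row Series rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data: the same address typed two different ways
addresses = pd.Series(['2505 HEARST AVE', '2505 Hearst Ave'])

unique_before = addresses.nunique()             # case-sensitive comparison
unique_after = addresses.str.upper().nunique()  # compare after uppercasing

print(unique_before, unique_after)
```

Before uppercasing, pandas counts two distinct strings. After uppercasing, it counts one.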


## Clean up the `PROBADDRESS` field

Let's create a new column called `ADDRESS_CLEANED`.

In [4]:
# Create the new column as an uppercased copy of PROBADDRESS:
oakland311['ADDRESS_CLEANED'] = oakland311['PROBADDRESS'].str.upper()

# Collapse each run of spaces in the new column into a single space:
oakland311['ADDRESS_CLEANED'] = oakland311['ADDRESS_CLEANED'].str.replace(r'[ ]+', ' ', regex=True)

# Remove leading and trailing whitespaces from the new column:
oakland311['ADDRESS_CLEANED'] = oakland311['ADDRESS_CLEANED'].str.strip()

There are other techniques we could use, like converting every `STREET` into `ST`, every `ROAD` into `RD`, and so forth. To keep this lecture simple, we're not going to do that.
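
If you want to try it on your own, here is a rough sketch of what suffix abbreviation could look like. The `SUFFIXES` mapping and the `abbreviate_suffixes` helper are hypothetical names I made up for this example, and the mapping is nowhere near complete.

```python
import pandas as pd

# Hypothetical mapping; extend it with whichever suffixes appear in your data
SUFFIXES = {'STREET': 'ST', 'ROAD': 'RD', 'AVENUE': 'AVE', 'BOULEVARD': 'BLVD'}

def abbreviate_suffixes(addresses: pd.Series) -> pd.Series:
    """Replace spelled-out street suffixes with abbreviations, whole words only."""
    for long_form, short_form in SUFFIXES.items():
        # \b keeps us from touching longer words that merely contain the suffix
        addresses = addresses.str.replace(rf'\b{long_form}\b', short_form, regex=True)
    return addresses

example = pd.Series(['7044 NORFOLK ROAD', '646 BOULEVARD WAY', '2131 WEST ST'])
cleaned = abbreviate_suffixes(example)
print(cleaned.tolist())
```

Notice that `646 BOULEVARD WAY` becomes `646 BLVD WAY`, even though `BOULEVARD` is the street's name there, not its suffix. A blanket replacement like this can damage real street names, so check the results before you rely on them.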

In [5]:
# Keep rows that have no coordinates (REQADDRESS) but do have a street address (PROBADDRESS):
addresses_to_geocode = oakland311[
    ( oakland311['REQADDRESS'].isnull() ) &
    ( oakland311['PROBADDRESS'].notnull() )
].reset_index(drop=True)
addresses_to_geocode

Unnamed: 0,REQUESTID,DATETIMEINIT,SOURCE,DESCRIPTION,REQCATEGORY,REQADDRESS,STATUS,REFERREDTO,DATETIMECLOSED,SRX,SRY,COUNCILDISTRICT,BEAT,PROBADDRESS,City,State,ELAPSED_TIME,ADDRESS_CLEANED
0,1119698,2021-06-08 09:06:47,Phone,Illegal Dumping – mattress/boxspring,ILLDUMP,,CLOSED,,2021-06-08 20:16:06,,,,,99TH AVE & BIRCH,Oakland,CA,0 days 11:09:19,99TH AVE & BIRCH
1,1112327,2021-05-11 11:50:04,Phone,Parking - Meter Maintenance,METER_REPAIR,,CLOSED,,2021-05-13 07:30:38,,,,,GRAND AV & SUNNYSLOPE AV,Oakland,CA,1 days 19:40:34,GRAND AV & SUNNYSLOPE AV
2,1113081,2021-05-13 13:46:02,Phone,Streets - Street Deterioration,STREETSW,,OPEN,,,,,,,94TH & LAWLER,Oakland,CA,,94TH & LAWLER
3,1123941,2021-06-23 11:17:40,Voicemail,Recycling Hotline - Miscellaneous,RECYCLING,,CLOSED,,2021-06-23 11:17:55,,,,,ZZ,Oakland,CA,0 days 00:00:15,ZZ
4,1120645,2021-06-11 10:28:54,Phone,"Illegal Dumping - debris, appliances, etc.",ILLDUMP,,CLOSED,,2021-06-14 19:57:27,,,,,94TH AVE & BANCROFT AVE,Oakland,CA,3 days 09:28:33,94TH AVE & BANCROFT AVE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17069,1120609,2021-06-11 08:53:57,Phone,Sewers - Blockage,SEWERS,,CLOSED,,2021-06-29 07:09:33,,,,,7044 NORFOLK ROAD,Oakland,CA,17 days 22:15:36,7044 NORFOLK ROAD
17070,1113573,2021-05-15 17:29:28,Phone,City Bldg - Other/Complex,BLDGMAINT,,CLOSED,"GETWOOD, ROY",2021-06-02 13:43:29,,,,,Cross streets are skyline and juaqiem miller m...,Oakland,CA,17 days 20:14:01,CROSS STREETS ARE SKYLINE AND JUAQIEM MILLER M...
17071,1120590,2021-06-11 08:14:03,Phone,"Illegal Dumping - debris, appliances, etc.",ILLDUMP,,CLOSED,,2021-06-21 16:03:04,,,,,1310 76TH,Oakland,CA,10 days 07:49:01,1310 76TH
17072,1120594,2021-06-11 08:22:46,Phone,Parking - Abandoned Vehicle,POLICE,,CANCEL,,,,,,21X,2343 EAST 24TH ST,Oakland,CA,,2343 EAST 24TH ST


In [6]:
# Count distinct addresses before and after cleaning:
num_addresses_og = addresses_to_geocode['PROBADDRESS'].nunique()
num_addresses_cleaned = addresses_to_geocode['ADDRESS_CLEANED'].nunique()
reduced = num_addresses_og - num_addresses_cleaned

In [7]:
print(f"Reduced total addresses to geocode from { num_addresses_og } addresses to { num_addresses_cleaned } addresses. That's { reduced } addresses we deduped.")

Reduced total addresses to geocode from 8298 addresses to 8141 addresses. That's 157 addresses we deduped.


## Exports

### Export cleaned up dataset with new column

Export the rows that still need geocoding, along with the new `ADDRESS_CLEANED` column, so we have a column to match on later.

In [9]:
# create an exports folder
!mkdir -p exports

In [10]:
addresses_to_geocode.to_csv('exports/oakland311_cleaned.csv', index=False)
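
To show what "matching on later" could look like, here is a sketch of joining geocoder results back onto the requests by `ADDRESS_CLEANED`. The `geocoded` table, its `lat` and `lon` columns, and the coordinate values are placeholders I invented. Your geocoder's output will look different.

```python
import pandas as pd

# Hypothetical stand-ins: a slice of the exported dataset and a geocoder's results
requests = pd.DataFrame({
    'REQUESTID': [1119698, 1120645, 1120590],
    'ADDRESS_CLEANED': ['99TH AVE & BIRCH', '94TH AVE & BANCROFT AVE', '1310 76TH'],
})
geocoded = pd.DataFrame({
    'ADDRESS_CLEANED': ['99TH AVE & BIRCH', '94TH AVE & BANCROFT AVE'],
    'lat': [37.0, 37.1],      # placeholder values, not real coordinates
    'lon': [-122.0, -122.1],  # placeholder values, not real coordinates
})

# A left join keeps every request; indicator=True adds a _merge column
# that flags which rows found a match in the geocoder results
merged = requests.merge(geocoded, on='ADDRESS_CLEANED', how='left', indicator=True)
print(merged)
```

Rows marked `left_only` in `_merge` are the addresses the geocoder did not return, which makes failures easy to spot.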

### Export addresses to geocode

The code below is commented out because I do not want to geocode thousands of addresses for this lecture!

In [11]:
# addresses_to_geocode[['ADDRESS_CLEANED']].drop_duplicates().to_csv('exports/addresses_to_geocode.csv', index=False)

For the purposes of this lecture, I only want to geocode 10 addresses. Otherwise it's going to take forever, and some of the requests will likely fail!

In [12]:
deduped = addresses_to_geocode[['ADDRESS_CLEANED']].drop_duplicates()
addresses_sample = deduped.sample(10).reset_index(drop=True)
addresses_sample[['ADDRESS_CLEANED']]

Unnamed: 0,ADDRESS_CLEANED
0,500 E. 22ND ST
1,900 36TH AV
2,47TH ST & DOVER
3,850 PINE ST
4,5300 BLOCK OF JAMES AVE
5,INTERNATIONAL BLVD & 42ND AV
6,2000 CAMPBELL ST
7,611 OLD QUARRY LOOP
8,2045 EAST 15TH ST
9,MUNSON & E 15TH ST
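
A side note: `sample()` picks different rows every time you run it, so your 10 addresses won't match mine. If you want the same sample on every run, one option is to pass `random_state`. This sketch uses made-up addresses, and the seed `42` is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the deduped addresses
deduped = pd.DataFrame({'ADDRESS_CLEANED': [f'{n} EXAMPLE ST' for n in range(100, 200)]})

# The same seed gives the same rows on every run
sample_a = deduped.sample(10, random_state=42).reset_index(drop=True)
sample_b = deduped.sample(10, random_state=42).reset_index(drop=True)

print(sample_a.equals(sample_b))
```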


In [13]:
# export
addresses_sample[['ADDRESS_CLEANED']].to_csv('exports/random_addresses_10.csv', index=False)