# Cleaning the street addresses
This notebook standardizes the street addresses used in the Monroe County crash data. Two fields, `Roadway Id` and `Intersecting Road`, indicate the location of each crash. These fields are not standardized when law enforcement enters them, so each road name shows up in many variations. For example, `ST RD 37` could be `SR37`, `STATE ROAD 37`, `STRD37`, `S R 37`, `OLD STATE ROAD 37`, etc.

For most analysis, looking at a map of the points encoded with the `Latitude` and `Longitude` fields is the best approach, because the `Roadway Id` and `Intersecting Road` fields aren't always filled out based on the same information. For example, if a crash occurred on DUNN ST, 200 feet from the nearest intersection of 2ND ST & DUNN ST, some law enforcement officers might note the `Intersecting Road` as 2ND ST, while others might leave it blank and rely on geolocation to show the precise spot. 

While `Latitude` and `Longitude` are usually the better choice, cleaning the intersection names is still worthwhile, because it makes it easier to analyze the most dangerous intersections.

This notebook explains the logic behind the cleaning script `clean_addresses.py` in the `cleaning-workflow/cleaning-scripts` folder.

In [1]:
import pandas as pd
import numpy as np
import re

pd.set_option('display.max_columns', None)

In [6]:
# let's start with the 2022 data
df = pd.read_csv('../source-data/moco-crash-2022.csv', 
                 usecols=['Roadway Id','Intersecting Road'],
                )
df

Unnamed: 0,Roadway Id,Intersecting Road
0,I69N,STATE RD 37
1,SR46W,DEER PARK
2,W REEVES,
3,THIRD,S HAWTHORNE
4,S HENDERSON,E HILLSIDE
...,...,...
2245,W COUNTRY CLUB DR,S OLD STATE ROAD 37
2246,E 10TH ST,E 10TH ST
2247,S COLLEGE MALL RD,E BUICK CADILLAC BLVD
2248,S OLD SR 37,MORMAN RD


In [30]:
# what are the unique roadway id values? what duplicates/bad formatting can we notice? 
df.sort_values('Roadway Id')['Roadway Id'].unique()

array(['', '100 E. 6TH ST', '1011 W. 2ND ST', '104 E PETE ELLIS DR',
       '10TH', '10TH ST', '10TH STREET', '1100 S COLLEGE MALL RD',
       '1107 W 3RD ST BLOOMINGTON, IN', '1110', '1119 N. MADISON',
       '115 S STATE ROAD 46', '1150 S CLARIZZ BLVD', '116 N GRANT ST',
       '1175 S COLLEGE MALL ROAD', '1199 S COLLEGE MALL ROAD',
       '120 N. GATES DR', '1255 S COLLEGE MALL ROAD',
       '1284 S. LIBERTY DR.', '1285 S. COLLEGE MALL RD',
       '130 W. GRIMES LN', '1300 BLK E. 2ND ST', '1300 S. PATTERSON DR',
       '1308 E. 3RD ST', '1348 N ARLINGTON PARK DRIVE',
       '1400 S. BRENDA LN', '1421 N WILLIS DR', '1519 S. PIAZZA DR',
       '1600 E HILLSIDE DR', '1605 S PECAN LN', '1615 S BUFFSTONE CT',
       '1616 S. HENDERSON ST', '1623 W. ARLINGTON RD',
       '1700 BLK S. HUNTINGTON DR', '1709 W. 8TH ST',
       '1710 N KINSER PK BLOOMINGTON, IN', '17TH', '17TH ST.',
       '1800 BLK. S.OLIVE ST', '1823 S HIGHLAND AVE', '1825 N. KINSER PK',
       '1870 S WALNUT STBLOOMINGTON 

In [31]:
# how many unique values are there?
df.sort_values('Roadway Id')['Roadway Id'].unique().shape

(748,)
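
One way to spot the variants of a single road is to filter on a distinctive token and count the spellings. The sketch below uses a small invented series (`roads`), not the crash data, and assumes that a plain substring match on `37` is enough to surface candidates.

```python
import pandas as pd

# invented sample values, not taken from the crash data
roads = pd.Series(['SR37', 'STATE ROAD 37', 'S R 37', 'SR37', 'E 10TH ST', 'STRD37'])

# keep only the values that mention "37", then count each spelling
variants = roads[roads.str.contains('37', regex=False)].value_counts()
print(variants)
```

Sorting by frequency puts the most common spellings first, which helps decide which replacements are worth writing.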

In [7]:
# replace NA values with empty strings to make cleaning easier
df['Roadway Id'] = df['Roadway Id'].fillna('')
df['Intersecting Road'] = df['Intersecting Road'].fillna('')

To standardize the street addresses, I use a string replace method, which is essentially find and replace applied to every cell. Each pattern is treated as a regular expression and can match anywhere inside a value. I compiled the replacement strings by looking through the most common spellings of different roads.

In [9]:
def replace_str(df, str1, str2):
    # str1 is a regex pattern; every match in every column is replaced with str2
    return df.replace(to_replace=str1, value=str2, regex=True)
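
As a quick illustration of how `replace_str` behaves, the sketch below runs it on a two-row toy frame. The frame `toy` and its values are invented for the example; the function body is the one defined above.

```python
import pandas as pd

def replace_str(df, str1, str2):
    # regex=True makes pandas substitute the pattern wherever it occurs in a cell
    return df.replace(to_replace=str1, value=str2, regex=True)

# invented sample rows
toy = pd.DataFrame({
    'Roadway Id': ['N. WALNUT STREET', 'STATE RD 37'],
    'Intersecting Road': ['E. 3RD STREET', ''],
})

toy = replace_str(toy, r'N\. ', 'N ')
toy = replace_str(toy, 'STREET', 'ST')
toy = replace_str(toy, 'STATE RD', 'ST RD')
print(toy)
```

Because the replacement runs over the whole frame, both address columns are updated in one call.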

In [10]:
# identify strings to replace 
strs_to_replace = [
    ['BLOOMINGTON IN',''],
    ['BLOOMINGTON, IN',''],
    [r'S\. ', 'S '],
    [r'N\. ', 'N '],
    [r'W\. ', 'W '],
    [r'E\.','E'],
    ['SOUTH ', 'S '],
    ['NORTH ', 'N '],
    ['WEST ', 'W '],
    ['EAST ', 'E '],
    [r' AVE\.', ' AVE'],
    ['AVENUE', 'AVE'],
    ['STREET','ST'],
    [r'ST\.','ST'],
    ['PIKE', 'PK'],
    ['ROAD','RD'],
    ['STATE RD','ST RD'],
    ['SR4','SR 4'],
    ['THIRD', '3RD'],
    [' 47401',''],
    ['SR ','ST RD '],
    ['46W', '46 W'],
    ['45W', '45 W'],
    ['STATE 46','ST RD 46'],
    [r'S\.R\.','ST RD'],
    ['I 69','I-69'],
    ['I69','I-69'],
    ['INTERSTATE 69','I-69'],
    ['I-69 SOUTH','I-69 S'],
    ['SBOUND','S'],
    ['S BOUND','S'],
    ['S I-69','I-69 S'],
    ['I-69N','I-69 N'],
    ['JORDAN AVE','EAGLESON AVE'],
    ['JORDAN','EAGLESON AVE'],
    ['DRIVE','DR'],
    ['LANE','LN'],
    ['IN-45','ST RD 45'],
    ['W ST RD 45/46 BYPASS','W ST RD 45/46'],
    ['WMAIN','W MAIN'],
    ['ROGERS RD','ROGERS ST'],
    ['BLK ',''],
    ['N JORDAN','N EAGLESON AVE'],
    ['S JORDAN','S EAGLESON AVE'],
    ['E3RD','E 3RD'],
    ['W3RD','W 3RD'],
    ['BLOCK ',''],
    ['STRE','ST'],
    ['E E','E '],
    ['3RD AVE','3RD ST'],
    ['PARKING LOT',''],
    ['ST ST','ST'],
    ['2ND AVE','2ND ST'],
    ['W2ND','W 2ND'],
    ['E2ND','E 2ND'],
    ['E10TH','E 10TH'],
    ['45-46','45/46'],
    ['SR37','ST RD 37'],
    ['SR37S','S ST RD 37'],
    ['OLD SR37','ST RD 37'],
    ['SR37N','N ST RD 37'],
    ['OLD ST RD','ST RD'],
    ['37 BUSINESS','37'],
    ['37 HWY','37'],
    ['HWY 37','ST RD 37'],
    ['OLDSR37','ST RD 37'],
    ['OLD 37','ST RD 37'],
    ['ST RD 37S','S ST RD 37'],
    ['ST RD 37N','N ST RD 37'],
    ['37 RD','37'],
    ['ST RD 37 S','S ST RD 37'],
    ['ST RD 37 N','N ST RD 37'],
    ['OLST RD 37','ST RD 37'],
    ['BUSINESS 37','ST RD 37'],
    ['US37','ST RD 37'],
    ['ST RD 37 S HWY','S ST RD 37'],
    ['ST RD 37 N RD','N ST RD 37'],
    
]
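
Since these patterns match anywhere in a value, a short pattern can also hit the middle of an unrelated word. The sketch below compares a plain pattern with a word-bounded one (`\b`) on invented road names. It is a possible refinement, not what `clean_addresses.py` does, and bounded patterns would no longer catch run-together values such as `SR37`.

```python
import pandas as pd

# invented names chosen to show the side effect; not taken from the crash data
roads = pd.Series(['E BROADWAY ST', 'S OLD STATE ROAD 37', 'DELANEY LANE'])

# plain patterns also rewrite the inside of BROADWAY and DELANEY
naive = (roads.replace('ROAD', 'RD', regex=True)
              .replace('LANE', 'LN', regex=True))

# \b restricts each match to a whole word
bounded = (roads.replace(r'\bROAD\b', 'RD', regex=True)
                .replace(r'\bLANE\b', 'LN', regex=True))

print(pd.DataFrame({'raw': roads, 'naive': naive, 'bounded': bounded}))
```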

In [23]:
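clean_df = df.copy()

# apply each [pattern, replacement] pair in order; later rules rely on earlier ones
for pattern, replacement in strs_to_replace:
    clean_df = replace_str(clean_df, pattern, replacement)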

In [17]:
# extract the `101` from `101 E 2ND ST` and create a new column
def extract_house_nums(road):
    words = road.split(" ") if road else []
    # a lone number such as `1110` is not treated as an address number,
    # so require at least one more word after an all-numeral first word
    if len(words) > 1 and re.fullmatch(r"\d+", words[0]):
        return words[0]
    # no address number found
    return ''


In [18]:
# All entries like `101 E 2ND ST` should be `E 2ND ST`
def remove_house_nums(road):
    words = road.split(" ") if road else []
    # if the address num exists, remove it from the original address
    if len(words) > 1 and re.fullmatch(r"\d+", words[0]):
        return " ".join(words[1:])
    # otherwise, return the original address
    return road
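
The two functions above share the same test, so the split can also be done in one vectorized step. The sketch below is an alternative under assumptions, not the author's method: it uses `Series.str.extract` with two named groups (`number` and `street`, names chosen for this example) on invented values.

```python
import pandas as pd

# invented sample values
roads = pd.Series(['101 E 2ND ST', 'E 3RD ST', '1110', ''])

# group 1: leading digits; group 2: the rest of the name after whitespace.
# Rows that do not match (no leading number, or a lone number) come back as NaN.
parts = roads.str.extract(r'^(?P<number>\d+)\s+(?P<street>.+)$')

address_number = parts['number'].fillna('')
street = parts['street'].fillna(roads)  # fall back to the original value

print(pd.DataFrame({'Address Number': address_number, 'Roadway Id': street}))
```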

In [19]:
# All entries like `S 17TH` should be `S 17TH ST`
def clean_numbered_streets(road):
    if road:
        # an ordinal at the end of the name (`17TH`, `3RD`) is missing its `ST`
        if re.search(r'\d{1,2}(TH|ST|ND|RD)$', road.strip()):
            road = road.strip() + " ST"
    return road.strip()

In [20]:
clean_numbered_streets('E 3RD ')

'E 3RD ST'
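
To see which inputs the ordinal pattern does and does not touch, the sketch below applies the same regular expression to a handful of invented values. It uses `Series.str.replace` with a backreference instead of the function above, which is an assumption made to keep the example self-contained.

```python
import pandas as pd

# invented sample values
roads = pd.Series(['S 17TH', 'E 3RD ', '10TH ST', 'ST RD 37', '1ST'])

# append " ST" only when the trimmed name ends in an ordinal such as 17TH or 3RD
fixed = roads.str.strip().str.replace(
    r'(\d{1,2}(?:TH|ST|ND|RD))$', r'\1 ST', regex=True
)
print(fixed.tolist())
```

Names that already end in `ST`, and state roads ending in a bare number, are left alone because the pattern requires digits followed by an ordinal suffix at the very end.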

In [26]:
# remove info after semicolons or colons
def remove_colons(road):
    if road:
        road = road.split(";")[0]
        road = road.split(":")[0]
    return road 

In [27]:
clean_df['Address Number'] = clean_df['Roadway Id'].apply(extract_house_nums)
clean_df['Roadway Id'] = clean_df['Roadway Id'].apply(remove_house_nums)
clean_df['Roadway Id'] = clean_df['Roadway Id'].apply(clean_numbered_streets)
clean_df['Roadway Id'] = clean_df['Roadway Id'].apply(remove_colons)
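
The cell above applies the per-value functions to `Roadway Id` only. If `Intersecting Road` should get the same treatment, one option is a small helper that runs a list of steps over several columns. The helper `apply_steps` and the two stand-in steps below are hypothetical names for this sketch; in the notebook the steps would be the cleaning functions defined earlier.

```python
import pandas as pd

def apply_steps(df, columns, steps):
    """Return a copy of df with each step applied, in order, to each column."""
    out = df.copy()
    for col in columns:
        for step in steps:
            out[col] = out[col].apply(step)
    return out

# stand-in steps for illustration only
def strip_after_semicolon(road):
    return road.split(';')[0]

def collapse_spaces(road):
    return ' '.join(road.split())

# invented sample rows
toy = pd.DataFrame({
    'Roadway Id': ['E 3RD ST; NEAR LOT', 'W  2ND ST'],
    'Intersecting Road': ['S  WALNUT ST', 'N DUNN ST;ALLEY'],
})

cleaned = apply_steps(
    toy,
    ['Roadway Id', 'Intersecting Road'],
    [strip_after_semicolon, collapse_spaces],
)
print(cleaned)
```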

In [28]:
clean_df.sort_values('Roadway Id')['Roadway Id'].unique()

array(['', '10TH ST', '1110', '17TH ST', '1ST ST',
       "3200' S OF SMITHVILLE RD", '3RD ST', '45', '45/46 BYPASS', '46',
       '69', '7TH ST', 'ACCESS DR TO HILLTOP GARDENS', 'ALEXANDER DR',
       'ALLEN ST', 'ATWATER AVE', 'BAYLES', 'BEECHWOOD DR',
       'BLK. S.OLIVE ST', 'BLOOMFIELD RD', 'CATHERINE ST',
       'CENTENNIAL DR', 'CHEEKWOOD LN', 'CHURCH LN',
       'CLEAR CREEK TRAILHEAD NORTH', 'COLLEGE MALL RD', 'CONNAUGHT',
       'COUNTRY CLUB', 'CURRY PK', 'DANIELS WAY', 'DISCOVERY PARKWAY',
       'DISCOVERY PKWY', 'DITTEMORE RD', 'DORCHESTER DR', 'E 10TH ST',
       'E 11TH ST', 'E 12TH ST', 'E 13TH ST', 'E 14TH ST', 'E 15TH ST',
       'E 16TH ST', 'E 17TH ST', 'E 19TH ST', 'E 1ST ST', 'E 2ND ST',
       'E 3RD ST', 'E 4TH ST', 'E 6TH ST', 'E 7TH ST', 'E 8TH ST',
       'E 9TH ST', 'E ALLEN ST', 'E ATWATER', 'E ATWATER AVE',
       'E AUTO MALL RD', 'E BETHEL LN', 'E BILL MALLORY BLVD',
       'E BUICK CADILLAC BLVD', 'E BURKS DR', 'E COTTAGE GROVE AVE',
       'E COVENAN

In [37]:
# how many unique values are there before/after cleaning?
df['Roadway Id'].nunique(), clean_df['Roadway Id'].nunique()


(748, 509)
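
A drop in the unique count shows that values were merged, but not whether they were merged correctly. One way to check is to list the raw spellings that collapsed into each cleaned name. The sketch below does this on invented `raw` and `clean` series; in the notebook these would correspond to the `Roadway Id` column before and after cleaning.

```python
import pandas as pd

# invented before/after values, aligned row by row
raw = pd.Series(['10TH', '10TH ST', '10TH STREET', 'E 3RD ST'])
clean = pd.Series(['10TH ST', '10TH ST', '10TH ST', 'E 3RD ST'])

# one row per distinct (raw, clean) pair
audit = pd.DataFrame({'raw': raw, 'clean': clean}).drop_duplicates()

# collect the raw spellings behind each cleaned name
merged = audit.groupby('clean')['raw'].agg(list)

# keep only cleaned names that absorbed more than one spelling
merged = merged[merged.apply(len) > 1]
print(merged)
```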

In [47]:
# what are the top intersections now that it's clean?
clean_df['Intersection Id'] = clean_df['Roadway Id'] + ' | ' + clean_df['Intersecting Road']
# keep only rows where an intersecting road was recorded
clean_df[clean_df['Intersecting Road'].str.len() > 0]['Intersection Id'].value_counts()[:20]
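
An intersection key built by concatenation depends on which road was entered first, so `E 3RD ST | S WALNUT ST` and `S WALNUT ST | E 3RD ST` would be counted separately. The sketch below sorts the two names before joining them, so both orders share one key. It uses invented rows and is a suggestion, not part of `clean_addresses.py`.

```python
import pandas as pd

# invented sample rows
toy = pd.DataFrame({
    'Roadway Id': ['E 3RD ST', 'S WALNUT ST', 'E 3RD ST', 'W REEVES'],
    'Intersecting Road': ['S WALNUT ST', 'E 3RD ST', 'S WALNUT ST', ''],
})

# keep only rows where an intersecting road was recorded
has_cross = toy['Intersecting Road'].str.len() > 0
pairs = toy.loc[has_cross, ['Roadway Id', 'Intersecting Road']]

# sort the two road names so the key is the same in either order
keys = pairs.apply(lambda row: ' | '.join(sorted(row)), axis=1)

counts = keys.value_counts()
print(counts)
```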