# P3: Wrangle OpenStreetMap Data

## Data

The map area I chose is Austin, TX. As suggested in the class, I downloaded an already-prepared extract from the link below:

https://mapzen.com/data/metro-extracts/metro/austin_texas/

I chose the 66 MB raw OpenStreetMap OSM XML dataset. Unzipping the file produced a dataset of about 1.4 GB, which took a while to open in Sublime.

### Preliminary examination of the dataset

This first pass shows what the data looks like by tallying the last word of every street name.

In [1]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re

In [2]:
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
street_types = defaultdict(int)

In [3]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        street_types[street_type] += 1

In [4]:
def print_sorted_dict(d):
    keys = d.keys()
    keys = sorted(keys, key=lambda s: s.lower())
    for k in keys:
        v = d[k]
        print("%s: %d" % (k, v))

In [5]:
def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")

In [6]:
osmfile = "austin_texas.osm"

In [9]:
for event, element in ET.iterparse(osmfile):
    if is_street_name(element):
        audit_street_type(street_types, element.attrib['v'])
print_sorted_dict(street_types)

#100: 2
#101: 1
#104: 1
#150: 1
#203: 2
#260: 1
#300: 2
#3000a: 1
#306: 1
#4: 1
#406: 1
#600: 1
#8: 1
#B100: 1
#F-4: 1
#G-145: 1
#L2: 1
100: 2
104: 1
1100: 45
117: 1
12: 8
120: 1
129: 11
1327: 61
138: 3
1431: 121
150: 2
1625: 76
1626: 91
163: 1
170: 1
1805: 1
1825: 1
1826: 57
183: 7
213: 1
2222: 68
2243: 2
2244: 1
275: 1
2769: 163
280: 3
290: 333
298: 1
3: 1
301: 2
3177: 1
320: 1
35: 25
400: 1
414: 1
45: 1
452: 1
459: 6
535: 2
6: 1
619: 1
620: 551
685: 5
7: 1
71: 17
79: 1
8: 1
812: 176
969: 2
973: 170
A: 76
A-15: 1
A500: 1
Acres: 16
Adventurer: 2
Affirmed: 7
Alley: 44
Alps: 15
Alto: 28
Amistad: 26
Apache: 6
Arbolago: 21
Arrow: 17
Atlantic: 11
Austin: 1
Ave: 33
Ave.: 1
Avene: 1
Avenue: 15891
B: 105
Barrhead: 12
Bend: 1777
Birch: 12
Blackfoot: 7
Bluff: 41
Blvd: 25
Blvd.: 6
Boggy: 4
Bonanza: 20
Bonita: 18
Bottom: 1
Boulevard: 8759
Branch: 17
Bridge: 26
Buckskin: 1
C: 127
C-200: 1
C1-100: 1
Caliche: 5
Calle: 24
Camelback: 6
Camino: 27
Cannon: 1
Cantera: 11
Canterwood: 27
Canyon: 79
Capri: 

##### The output above shows street names that need to be fixed:

Avenue, Ave., Ave, and Avene

Boulevard, Blvd, Blvd.

Circle, Cc(?)

Costa, Corta(?)

Court, court, Ct

Cove, cove, Cv

Drive, Dr, Dr.

"Drive/Rd"?

Highway, Hwy

FM1431, 1431, RM1431

I35, IH-35, IH35, IH35,

Lane, lane, Lanes(?), Ln

Pass, pass

Parkway, Pkwy

Place, Pl

Overlook, Ovlk

North, N(?)

Ps(?)

Road, Rd, "Road,1100"

SB?

St, St., street, Street

Trail, Tr, Trl

West, W

Way, way
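Several of the odd entries in this list, such as `"Road,1100"` and `IH35,`, follow from how `street_type_re` works: it captures the final run of non-whitespace characters in the street name, whatever that happens to be. The sketch below reuses the same pattern on a few illustrative address strings. The helper name `last_token` and the sample list are assumptions made for this example and are not part of the notebook.

```python
import re

# Same pattern as the first audit: the last run of non-whitespace characters.
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)

def last_token(street_name):
    """Return what the audit would tally as the 'street type', or None."""
    m = street_type_re.search(street_name)
    return m.group() if m else None

# Illustrative address strings (hypothetical sample, not read from the file).
samples = [
    "Farm-to-Market Road 1431",
    "Avery Ranch Blvd Building A #100",
    "IH-35 South, #150",
    "Congress Ave.",
]
for name in samples:
    print("%-35s -> %s" % (name, last_token(name)))
```

Suite numbers and route numbers end up as "street types" because they are the last token, which is why a plain last-word mapping cannot fix every entry in the list.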


## High Level Tags

To count how many of each high-level tag the dataset contains, the file is parsed iteratively.

In [10]:
import pprint

In [11]:
def count_tags(filename):
    tag_counts = defaultdict(int)
    for event, elem in ET.iterparse(filename):
        tag_counts[elem.tag] += 1
    return tag_counts

In [12]:
tags = count_tags(osmfile)

In [14]:
pprint.pprint(dict(tags))

{'bounds': 1,
 'member': 20197,
 'nd': 6985591,
 'node': 6356394,
 'osm': 1,
 'relation': 2357,
 'tag': 2377504,
 'way': 666390}
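With a file of about 1.4 GB, memory use during iterative parsing can become a concern, because `iterparse` still builds the element tree as it goes. One possible variant, sketched below, clears each element once it has been counted. The function name `count_tags_low_memory` and the tiny in-memory document are assumptions for illustration. The sketch imports `xml.etree.ElementTree` rather than `cElementTree` so that it also runs on recent Python 3 versions.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def count_tags_low_memory(source):
    """Count element tags, discarding each element's contents after counting."""
    tag_counts = defaultdict(int)
    # The default "end" event fires once an element has been fully parsed.
    for _, elem in ET.iterparse(source):
        tag_counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that are no longer needed
    return dict(tag_counts)

# Tiny made-up document standing in for the real .osm file.
sample = io.BytesIO(b"""<osm>
  <node id="1"><tag k="name" v="A"/></node>
  <node id="2"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>""")
print(count_tags_low_memory(sample))
```

Clearing reduces the memory held per element but does not remove the emptied elements from the root, so this is a partial measure rather than a guarantee of constant memory.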


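## Checking the k values

The `k` attribute of each `tag` element is sorted into one of four groups: all lowercase (`lower`), lowercase with a single colon (`lower_colon`), containing problematic characters (`problemchars`), or anything else (`other`).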

In [15]:
import re

In [16]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

In [20]:
def key_type(element, keys):
    # Classify the 'k' attribute of a <tag> element into one of four buckets.
    if element.tag == 'tag':
        k = element.attrib['k']
        if lower.search(k):
            keys["lower"] += 1
        elif lower_colon.search(k):
            keys["lower_colon"] += 1
        elif problemchars.search(k):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys

In [21]:
def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other":0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

In [22]:
keys = process_map(osmfile)

In [23]:
pprint.pprint(keys)

{'lower': 1297812, 'lower_colon': 1067727, 'other': 11964, 'problemchars': 1}
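To make the counts easier to interpret, the sketch below applies the same three patterns to a handful of keys and reports the bucket each one lands in. The helper `classify_key` and the sample keys are assumptions chosen for illustration. They were not extracted from the dataset.

```python
import re

# The same three patterns used in the notebook.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(k):
    """Return the bucket name for a key, testing the patterns in order."""
    if lower.search(k):
        return "lower"
    if lower_colon.search(k):
        return "lower_colon"
    if problemchars.search(k):
        return "problemchars"
    return "other"

# Hypothetical keys, one or two per bucket.
for key in ["highway", "addr:street", "addr.street",
            "tiger:name_base_1", "addr:street:name", "FIXME"]:
    print("%-20s -> %s" % (key, classify_key(key)))
```

Keys containing digits, uppercase letters, or more than one colon match none of the first three patterns, which is one plausible reason the `other` bucket is non-trivial in size.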


## Exploring Users

In [24]:
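def get_user(element):
    # Return the contributor's uid, or None if the element has no 'uid' attribute.
    return element.attrib.get('uid')

In [25]:
def process_map_users(filename):
    # Collect the set of unique contributor uids found in the file.
    users = set()
    for _, element in ET.iterparse(filename):
        uid = get_user(element)
        if uid is not None:
            users.add(uid)
    return users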

In [26]:
users = process_map_users(osmfile)

In [27]:
len(users)

1155

## Auditing and Improving Street Names 

Here I audit the OSM file for street types that are not in the `expected` list, and define a `mapping` from each unexpected variant to its corrected form.

In [28]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

In [29]:
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
            "Trail", "Parkway", "Commons", "Cove", "Highway", "IH-35", "North", "Overlook", "Pass"]

In [30]:
mapping = { "St": "Street",
            "St.": "Street",
            "st": "Street",
            "street": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "Avene": "Avenue",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Dr": "Drive",
            "Dr.": "Drive",
            "Ct": "Court",
            "Ct.": "Court",
            "court": "Court",
            "Cv": "Cove",
            "cove": "Cove",
            "Pl": "Place",
            "Pl.": "Place",
            "lane": "Lane",
            "Ln": "Lane",
            "Rd": "Road", 
            "Rd.": "Road",
            "Trl": "Trail",
            "Tr": "Trail",
            "Pkwy": "Parkway",
            "Hwy": "Highway",
            "I35": "IH-35",
            "IH35": "IH-35",
            "IH35,": "IH-35",
            "N": "North",
            "Ovlk": "Overlook",
            "pass": "Pass",
            "Ps": "Pass",
            "W": "West",
            "texas": "Texas",
            "1431": "FM1431",
            "RM1431": "FM1431"}

In [36]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

In [32]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

In [33]:
def audit(osmfile):
    # Group every unexpected street type with the street names that use it.
    street_types = defaultdict(set)
    with open(osmfile, "rb") as osm_file:
        # Use "end" events: on "start", an element's child tags may not be parsed yet.
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types

In [34]:
def update_name(name, mapping):
    # Replace the last word of the street name if it has an entry in mapping.
    parts = name.split()
    if parts and parts[-1] in mapping:
        parts[-1] = mapping[parts[-1]]
    return ' '.join(parts)
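As a rough illustration of what a last-word replacement does and does not fix, the sketch below applies the same idea to a few sample names. The function `update_last_token`, the reduced `sample_mapping`, and the sample names are assumptions for this example. The notebook's own `update_name` and full `mapping` are the ones actually used.

```python
def update_last_token(name, mapping):
    """Replace the final whitespace-separated token if the mapping knows it."""
    parts = name.split()
    if parts and parts[-1] in mapping:
        parts[-1] = mapping[parts[-1]]
    return ' '.join(parts)

# A small subset of the notebook's mapping, for illustration only.
sample_mapping = {"St": "Street", "Ave.": "Avenue", "Blvd": "Boulevard", "Rd": "Road"}

# Illustrative names: the first two end in an abbreviation, the last two do not.
samples = [
    "Congress Ave.",
    "Burnet Rd",
    "S 1st St, Suite 104",
    "Avery Ranch Blvd Building A #100",
]
for name in samples:
    print("%-35s => %s" % (name, update_last_token(name, sample_mapping)))
```

Names whose abbreviation sits in the middle of the string, such as those followed by a suite or building number, pass through unchanged, so they would need separate handling.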

In [37]:
st_types = audit(osmfile)

In [38]:
pprint.pprint(dict(st_types))

{'100': set(['Avery Ranch Blvd Building A #100',
             'Jollyville Road Suite 100',
             'Old Jollyville Road, Suite 100']),
 '101': set(['4207 James Casey st #101']),
 '104': set(['11410 Century Oaks Terrace Suite #104', 'S 1st St, Suite 104']),
 '1100': set(['Farm-to-Market Road 1100']),
 '117': set(['County Road 117']),
 '12': set(['Ranch to Market Road 12']),
 '120': set(['Building B Suite 120']),
 '129': set(['County Road 129']),
 '1327': set(['FM 1327', 'Farm-to-Market Road 1327']),
 '138': set(['County Road 138']),
 '1431': set(['Farm-to-Market Road 1431', 'Old Farm-to-Market 1431']),
 '150': set(['Farm-to-Market Road 150', 'IH-35 South, #150']),
 '1625': set(['Farm-to-Market Road 1625']),
 '1626': set(['F.M. 1626', 'FM 1626', 'Farm-to-Market Road 1626']),
 '163': set(['Bee Cave Road Suite 163']),
 '170': set(['County Road 170']),
 '1805': set(['N Interstate 35, Suite 1805']),
 '1825': set(['FM 1825']),
 '1826': set(['Farm To Market Road 1826', 'Ranch to Market Ro