# P3: Wrangle OpenStreetMap Data

## Data

I chose the Austin, TX area for this project. As described in the class, I obtained the data by downloading a prepared metro extract from the link below:

https://mapzen.com/data/metro-extracts/metro/austin_texas/

I chose the 66 MB raw OpenStreetMap (OSM) XML download. Unzipped, it expands to a dataset of about 1.4 GB, which took a while to open in Sublime.

### Preliminary examination of the dataset

This first pass shows what the data looks like.

In [1]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re

In [2]:
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
street_types = defaultdict(int)

In [3]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        street_types[street_type] += 1

In [4]:
def print_sorted_dict(d):
    keys = d.keys()
    keys = sorted(keys, key=lambda s: s.lower())
    for k in keys:
        v = d[k]
        print "%s: %d" % (k,v)

In [5]:
def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")

In [6]:
osmfile = "austin_texas.osm"

In [9]:
for event, element in ET.iterparse(osmfile):
    if is_street_name(element):
        audit_street_type(street_types, element.attrib['v'])
print_sorted_dict(street_types)

#100: 2
#101: 1
#104: 1
#150: 1
#203: 2
#260: 1
#300: 2
#3000a: 1
#306: 1
#4: 1
#406: 1
#600: 1
#8: 1
#B100: 1
#F-4: 1
#G-145: 1
#L2: 1
100: 2
104: 1
1100: 45
117: 1
12: 8
120: 1
129: 11
1327: 61
138: 3
1431: 121
150: 2
1625: 76
1626: 91
163: 1
170: 1
1805: 1
1825: 1
1826: 57
183: 7
213: 1
2222: 68
2243: 2
2244: 1
275: 1
2769: 163
280: 3
290: 333
298: 1
3: 1
301: 2
3177: 1
320: 1
35: 25
400: 1
414: 1
45: 1
452: 1
459: 6
535: 2
6: 1
619: 1
620: 551
685: 5
7: 1
71: 17
79: 1
8: 1
812: 176
969: 2
973: 170
A: 76
A-15: 1
A500: 1
Acres: 16
Adventurer: 2
Affirmed: 7
Alley: 44
Alps: 15
Alto: 28
Amistad: 26
Apache: 6
Arbolago: 21
Arrow: 17
Atlantic: 11
Austin: 1
Ave: 33
Ave.: 1
Avene: 1
Avenue: 15891
B: 105
Barrhead: 12
Bend: 1777
Birch: 12
Blackfoot: 7
Bluff: 41
Blvd: 25
Blvd.: 6
Boggy: 4
Bonanza: 20
Bonita: 18
Bottom: 1
Boulevard: 8759
Branch: 17
Bridge: 26
Buckskin: 1
C: 127
C-200: 1
C1-100: 1
Caliche: 5
Calle: 24
Camelback: 6
Camino: 27
Cannon: 1
Cantera: 11
Canterwood: 27
Canyon: 79
Capri: 

##### The output above shows several street names that need to be fixed:

Avenue, Ave., Ave, and Avene

Boulevard, Blvd, Blvd.

Circle, Cc(?)

Costa, Corta(?)

Court, court, Ct

Cove, cove, Cv

Drive, Dr, Dr.

"Drive/Rd"?

Highway <= Hwy

FM1431, 1431, RM1431

I35, IH-35, IH35, IH35,

Lane, lane, Lanes(?), Ln

Pass, pass

Parkway, Pkwy

Place, Pl

Overlook, Ovlk

North, N(?)

Ps(?)

Road, Rd, "Road,1100"

SB?

St, St., street, Street

Trail, Tr, Trl

West, W

Way, way
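
Every variant in this list comes from the last whitespace-delimited token of an `addr:street` value, which is what `street_type_re` captures. A minimal sketch of that tally on a handful of names that appear in this notebook's output (the `sample_names` list is only an illustration, not the full dataset):

```python
import re
from collections import defaultdict

# Same pattern as the notebook: the last run of non-space characters
# at the end of the string, optionally followed by a period.
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)

# Illustrative sample only; these names appear in the notebook's output.
sample_names = [
    "Royal Birkdale Ovlk",
    "W Hwy 290",
    "East Highway 290",
    "Avery Ranch Blvd Building A #100",
]

street_types = defaultdict(int)
for name in sample_names:
    match = street_type_re.search(name)
    if match:
        street_types[match.group()] += 1

for street_type in sorted(street_types, key=lambda s: s.lower()):
    print("%s: %d" % (street_type, street_types[street_type]))
```

This is why numbers such as `290` and unit suffixes such as `#100` show up as "street types": the pattern only looks at the final token.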


## High Level Tags

To count how often each high-level tag occurs in the dataset, the file is parsed iteratively.

In [10]:
import pprint

In [11]:
def count_tags(filename):
    tag_counts = defaultdict(int)
    for event, elem in ET.iterparse(filename):
        tag_counts[elem.tag] += 1
    return tag_counts

In [12]:
tags = count_tags(osmfile)

In [14]:
pprint.pprint(dict(tags))

{'bounds': 1,
 'member': 20197,
 'nd': 6985591,
 'node': 6356394,
 'osm': 1,
 'relation': 2357,
 'tag': 2377504,
 'way': 666390}
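
With its default settings, `ET.iterparse` reports each element once, when its closing tag has been read, so every element is counted exactly once. A small self-contained sketch of the same count on an in-memory sample (the XML snippet and the name `SAMPLE_OSM` are made up for illustration, and `xml.etree.ElementTree` is used here instead of the `cElementTree` alias):

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature OSM document, for illustration only.
SAMPLE_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <bounds minlat="30.0" minlon="-98.0" maxlat="30.5" maxlon="-97.5"/>
  <node id="1" lat="30.1" lon="-97.7">
    <tag k="addr:street" v="Example Ave"/>
  </node>
  <node id="2" lat="30.2" lon="-97.8"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>
"""

def count_tags(source):
    """Count how often each element name occurs in an OSM XML source."""
    tag_counts = Counter()
    # Default events: each element is reported once, at its end tag.
    for _, elem in ET.iterparse(source):
        tag_counts[elem.tag] += 1
    return tag_counts

sample_counts = count_tags(io.BytesIO(SAMPLE_OSM))
print(dict(sample_counts))
```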


## Checking the k values

In [15]:
import re

In [16]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

In [20]:
def key_type(element, keys):
    if element.tag == 'tag':
        k = element.attrib['k']
        # Test the patterns in order; the first one that matches decides the category
        if lower.search(k):
            keys["lower"] += 1
        elif lower_colon.search(k):
            keys["lower_colon"] += 1
        elif problemchars.search(k):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys

In [21]:
def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other":0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

In [22]:
keys = process_map(osmfile)

In [23]:
pprint.pprint(keys)

{'lower': 1297812, 'lower_colon': 1067727, 'other': 11964, 'problemchars': 1}
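
The three patterns are tried in order: all-lowercase keys first, then lowercase keys with exactly one colon, then keys containing a problematic character; anything left falls into `other`. A sketch of that classification on a few keys (the sample keys and the helper name `classify_key` are assumptions for illustration, not taken from the dataset):

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the category a tag key falls into (hypothetical helper)."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"

# Illustrative keys only.
for key in ["highway", "addr:street", "addr:street:name", "name_1", "step count"]:
    print("%s -> %s" % (key, classify_key(key)))
```

Keys with digits, uppercase letters, or a second colon match none of the patterns and land in `other`.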


## Exploring Users

In [24]:
def get_user(element):
    # Return the element's 'uid' attribute, or None if it has none
    return element.attrib.get('uid')

In [25]:
def process_map_users(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        # Collect the uid of every element that has one
        if 'uid' in element.attrib:
            users.add(element.attrib['uid'])
    return users

In [26]:
users = process_map_users(osmfile)

In [27]:
len(users)

1155

## Auditing and Improving Street Names 

Audit the OSM file for street types that are not in the `expected` list, then use the `mapping` dictionary to replace the unexpected forms with the expected ones.

In [28]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

In [29]:
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Cove", "Highway", "IH-35", "North", "Overlook", "Pass"]

In [79]:
mapping = { "St": "Street",
            "St.": "Street",
            "st": "Street",
            "street": "Street",
            "Street,": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "Avene": "Avenue",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Boulevard,": "Boulevard",
            "Blvd,": "Boulevard",
            "Dr": "Drive",
            "Dr.": "Drive",
            "Ct": "Court",
            "Ct.": "Court",
            "court": "Court",
            "Cv": "Cove",
            "cove": "Cove",
            "1st": "First",
            "Pl": "Place",
            "Pl.": "Place",
            "lane": "Lane",
            "Ln": "Lane",
            "Rd": "Road", 
            "Rd.": "Road",
            "R": "Road",
            "Trl": "Trail",
            "Tr": "Trail",
            "Pkwy": "Parkway",
            "Hwy": "Highway",
            "HWY": "Highway",
            "Hwy,": "Highway",
            "I35": "Interstate Highway 35",
            "IH35": "Interstate Highway 35",
            "IH35,": "Interstate Highway 35",
            "IH-35": "Interstate Highway 35",
            "I-35": "Interstate Highway 35",
            "IH": "Interstate Highway",
            "I": "Interstate Highway",
            "35,": "35",
            "main": "Main",
            "N": "North",
            "N.": "North",
            "Ovlk": "Overlook",
            "pass": "Pass",
            "Ps": "Pass",
            "W": "West",
            "W.": "West",
            "E": "East",
            "E.": "East",
            "texas": "Texas",
            "FM": "Farm-to-Market Road",
            "F.M.": "Farm-to-Market Road",
            "U.S.": "United States",
            "US": "United States",
            "RM": "Ranch-to-Market Road",
            "S": "South",
            "south": "South",
            "Bldg": "Building",
            "Bldg.": "Building",
            "Ste": "Suite",
            "Ste,": "Suite",
            "C": "Country",
            "church": "Church",
            "brigadoon": "Brigadoon"}

In [36]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

In [32]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

In [33]:
def audit(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "r") as osm_file:
        # Use "end" events so each node/way already has all of its child tags
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types

In [63]:
def update_name(name, mapping):
    # Replace each whitespace-separated token that has an entry in mapping
    parts = name.split()
    newparts = [mapping.get(item, item) for item in parts]
    new_name = ' '.join(newparts)
    return new_name

In [37]:
st_types = audit(osmfile)

In [38]:
pprint.pprint(dict(st_types))

{'100': set(['Avery Ranch Blvd Building A #100',
             'Jollyville Road Suite 100',
             'Old Jollyville Road, Suite 100']),
 '101': set(['4207 James Casey st #101']),
 '104': set(['11410 Century Oaks Terrace Suite #104', 'S 1st St, Suite 104']),
 '1100': set(['Farm-to-Market Road 1100']),
 '117': set(['County Road 117']),
 '12': set(['Ranch to Market Road 12']),
 '120': set(['Building B Suite 120']),
 '129': set(['County Road 129']),
 '1327': set(['FM 1327', 'Farm-to-Market Road 1327']),
 '138': set(['County Road 138']),
 '1431': set(['Farm-to-Market Road 1431', 'Old Farm-to-Market 1431']),
 '150': set(['Farm-to-Market Road 150', 'IH-35 South, #150']),
 '1625': set(['Farm-to-Market Road 1625']),
 '1626': set(['F.M. 1626', 'FM 1626', 'Farm-to-Market Road 1626']),
 '163': set(['Bee Cave Road Suite 163']),
 '170': set(['County Road 170']),
 '1805': set(['N Interstate 35, Suite 1805']),
 '1825': set(['FM 1825']),
 '1826': set(['Farm To Market Road 1826', 'Ranch to Market Ro

In [43]:
type(st_types)

collections.defaultdict

In [80]:
for st_type, ways in st_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        print name, "=>", better_name

Merimac => Merimac
Clara Van => Clara Van
Capri => Capri
Chelsea Moor => Chelsea Moor
Royal Birkdale Ovlk => Royal Birkdale Overlook
Lions Lair => Lions Lair
Apache => Apache
Farm-to-Market Road 812 => Farm-to-Market Road 812
Bee Cave Road Suite 163 => Bee Cave Road Suite 163
Adventurer => Adventurer
Affirmed => Affirmed
West 35th Street Cutoff => West 35th Street Cutoff
Ferguson Cutoff => Ferguson Cutoff
House Wren => House Wren
N I-35 Suite 298 => North Interstate Highway 35 Suite 298
Melody => Melody
East Highway 290 => East Highway 290
Highway 290 => Highway 290
W. Highway 290 => West Highway 290
East Hwy 290 => East Highway 290
C R 290 => Country Road 290
West Highway 290 => West Highway 290
W Hwy 290 => West Highway 290
U.S. 290 => United States 290
West US Highway 290 => West United States Highway 290
E Hwy 290 => East Highway 290
US Highway 290 => United States Highway 290
County Road 290 => County Road 290
W Highway 290 => West Highway 290
W HWY 290 => West Highway 290
Helios 
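
`update_name` replaces every whitespace-separated token that has an entry in `mapping` and leaves all other tokens untouched. A self-contained sketch with a small subset of the mapping (the subset is chosen only to cover the sample names, and the function body is a compact equivalent of the notebook's version):

```python
# Subset of the notebook's mapping, just enough for the samples below.
mapping = {
    "N": "North",
    "W.": "West",
    "Hwy": "Highway",
    "I-35": "Interstate Highway 35",
    "Ovlk": "Overlook",
}

def update_name(name, mapping):
    """Replace each token found in mapping; keep all other tokens as they are."""
    return ' '.join(mapping.get(token, token) for token in name.split())

for name in ["N I-35 Suite 298", "W. Highway 290", "Royal Birkdale Ovlk", "Clara Van"]:
    print("%s => %s" % (name, update_name(name, mapping)))
```

Because tokens are matched exactly, a token with attached punctuation (for example `Hwy,`) only changes if that exact spelling is a key in `mapping`, which is why the full mapping lists such variants separately.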

##### More things to fix:

- Postal codes: even the `k` value for postal codes varies (some tags use a "postal code" key, others use "postcode").
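
One possible way to handle this is sketched below. The postal code audit is not shown in this notebook, so everything here is an assumption: the key spellings in `POSTCODE_KEYS`, the sample values, and the helper names `is_postcode` and `clean_postcode` are hypothetical.

```python
import re

# Assumed key spellings; adjust to whatever the audit actually finds.
POSTCODE_KEYS = ("addr:postcode", "postal_code")

# A standalone run of exactly five digits, e.g. the ZIP part of "78704-1234".
FIVE_DIGITS = re.compile(r'\b(\d{5})\b')

def is_postcode(key):
    """True if the tag key is one of the assumed postal code spellings."""
    return key in POSTCODE_KEYS

def clean_postcode(value):
    """Return the first 5-digit code in value, or None if there is none."""
    match = FIVE_DIGITS.search(value)
    return match.group(1) if match else None

# Hypothetical sample values.
for value in ["78704", "78704-1234", "TX 78704", "Texas"]:
    print("%s => %s" % (value, clean_postcode(value)))
```

Returning `None` for values without a 5-digit code leaves the decision about what to do with them (drop, keep, or fix by hand) to the caller.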


## Preparing for Database

In [81]:
import csv
import codecs
import cerberus
import schema

In [82]:
NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

In [83]:
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

In [84]:
SCHEMA = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}

Make sure the field order in the CSVs matches the column order in the SQL table schema.

In [85]:
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS= ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
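
The order matters when CSV rows are inserted positionally into a table. A hedged sketch using Python's built-in `sqlite3` with an in-memory database, written for Python 3's `csv` module (the table definition, the column types, and the sample rows are assumptions; the notebook does not show its SQL schema):

```python
import csv
import io
import sqlite3

NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']

# Made-up CSV content in the same column order as NODE_TAGS_FIELDS.
sample_csv = (
    "id,key,value,type\n"
    "1,street,Example Avenue,addr\n"
    "1,amenity,cafe,regular\n"
)

conn = sqlite3.connect(":memory:")
# Assumed schema: columns declared in the same order as the CSV fields.
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")

reader = csv.DictReader(io.StringIO(sample_csv))
# Build each row tuple in field-list order so it lines up with the table columns.
rows = [tuple(row[field] for field in NODE_TAGS_FIELDS) for row in reader]
conn.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)", rows)
conn.commit()

result = conn.execute(
    "SELECT key, value FROM nodes_tags WHERE type = 'addr'"
).fetchall()
print(result)
conn.close()
```

If the field list and the table columns were in different orders, the positional `INSERT` would silently put values into the wrong columns.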

In [86]:
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, lower_colon=LOWER_COLON, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements

    if element.tag not in ('node', 'way'):
        return None

    for tag in element.iter("tag"):
        key = tag.attrib['k']
        # Skip tags whose key contains problematic characters
        if problem_chars.search(key):
            continue
        if lower_colon.search(key):
            # "addr:street" -> type "addr", key "street"; further colons stay in the key
            tag_type, tag_key = key.split(":", 1)
        else:
            tag_type, tag_key = default_tag_type, key
        tags.append({'id': element.attrib['id'],
                     'key': tag_key,
                     'value': tag.attrib['v'],
                     'type': tag_type})

    if element.tag == 'node':
        for field in node_attr_fields:
            node_attribs[field] = element.attrib[field]
        return {'node': node_attribs, 'node_tags': tags}

    for field in way_attr_fields:
        way_attribs[field] = element.attrib[field]

    # Record each referenced node with its 0-based position within the way
    for position, nd in enumerate(element.iter("nd")):
        way_nodes.append({'id': element.attrib['id'],
                          'node_id': nd.attrib['ref'],
                          'position': position})
    return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
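
The key handling inside `shape_element` boils down to splitting on the first colon: the part before it becomes the tag `type` and everything after it becomes the `key`. A minimal sketch of just that step (the helper name `split_key` and the sample keys are hypothetical):

```python
import re

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')

def split_key(k, default_tag_type='regular'):
    """Return (type, key) for a tag's k attribute."""
    if LOWER_COLON.search(k):
        # Split only on the first colon; any later colons stay in the key.
        tag_type, key = k.split(":", 1)
        return tag_type, key
    return default_tag_type, k

# Illustrative keys only.
for k in ["highway", "addr:street", "addr:street:name"]:
    print("%s -> %s" % (k, split_key(k)))
```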

In [87]:
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
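
`get_element` listens for both `start` and `end` events: the first `start` event hands over the root element, and each node or way is yielded on its `end` event, when its children are complete. Calling `root.clear()` after each yield drops elements that have already been processed, which keeps memory use from growing with the size of the file. A sketch of the same pattern on a made-up in-memory document (the XML and the name `SAMPLE_OSM` are illustrative):

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature OSM document, for illustration only.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="30.1" lon="-97.7"/>
  <node id="2" lat="30.2" lon="-97.8"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
  <relation id="20"/>
</osm>
"""

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # runs when the caller asks for the next element

summary = []
for elem in get_element(io.BytesIO(SAMPLE_OSM), tags=('node', 'way')):
    # Read what is needed before the generator resumes and clears the tree.
    summary.append((elem.tag, elem.attrib['id'], len(list(elem.iter('nd')))))
print(summary)
```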

In [88]:
def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_strings = (
            "{0}: {1}".format(k, v if isinstance(v, str) else ", ".join(v))
            for k, v in errors.iteritems()
        )
        raise cerberus.ValidationError(
            message_string.format(field, "\n".join(error_strings))
        )

In [89]:
class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

In [90]:
def process_map_db(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:
        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])

#### Processing the whole Austin, TX map:

In [92]:
process_map_db(osmfile, validate=False)