# Case Study: OpenStreetMap Data

## Iterative Parsing

"""
Your task is to use iterative parsing to process the map file and
find out not only which tags are present, but also how many of each, to get
a feel for how much of each kind of data you can expect in the map.
Fill out the count_tags function. It should return a dictionary with the
tag name as the key and the number of times that tag occurs in
the map as the value.

Note that your code will be tested with a different data file than 'example.osm'.
"""

In [1]:
import xml.etree.cElementTree as ET
import pprint
from collections import defaultdict

In [2]:
def count_tags(filename):
    tag_counts = defaultdict(int)
    for event, element in ET.iterparse(filename):
        tag_counts[element.tag] += 1
    return tag_counts

In [3]:
tags = count_tags('example.osm')

In [4]:
pprint.pprint(tags)

defaultdict(<type 'int'>, {'node': 20, 'nd': 4, 'bounds': 1, 'member': 3, 'tag': 7, 'relation': 1, 'way': 1, 'osm': 1})
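
To try the counting idea without a map file on disk, the sketch below runs the same loop over a small inline document. The sample XML and the name `count_tags_stream` are made up for illustration, and it imports `xml.etree.ElementTree` rather than `cElementTree`, on the assumption that a recent Python 3 is in use (the `cElementTree` alias was removed in Python 3.9).

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical inline sample standing in for 'example.osm'
SAMPLE_OSM = b"""<osm>
  <node id="1"><tag k="amenity" v="cafe"/></node>
  <node id="2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""


def count_tags_stream(source):
    """Count element names from a file path or a file-like object."""
    counts = Counter()
    # With no 'events' argument, iterparse reports only 'end' events,
    # so each element is counted once, when its closing tag is read.
    for _, element in ET.iterparse(source):
        counts[element.tag] += 1
    return counts


sample_counts = count_tags_stream(io.BytesIO(SAMPLE_OSM))
```

A `Counter` behaves like the `defaultdict(int)` used above, with the small convenience that `sample_counts.most_common()` lists the tags by frequency.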


## Ways in the OpenStreetMap

Nodes and ways both carry tags. This section looks at cleaning the street names stored in those tags.

To get the street tags, loop through the child "tag" elements under each top-level element. Using iterparse (chosen so that the whole file is never loaded into memory as a tree), the code is:

for _, element in ET.iterparse(filename, events=("start",)):
    if element.tag == "way":
        for tag in element.iter("tag"):
            if is_street_name(tag):
                audit_street_type(street_types, tag.attrib['v'])
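
The same loop can be exercised on a small inline document. This is a sketch under a few assumptions: the sample XML and the function name `street_names` are invented for illustration, and it listens for "end" events, because the `ElementTree` documentation only guarantees that an element's children are present once its "end" event has been reported.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the street names are borrowed from later outputs
SAMPLE = b"""<osm>
  <way id="1">
    <nd ref="5"/>
    <tag k="addr:street" v="North Lincoln Ave"/>
    <tag k="building" v="yes"/>
  </way>
  <way id="2">
    <tag k="addr:street" v="Baldwin Rd."/>
  </way>
</osm>"""


def street_names(source):
    """Collect the 'v' value of every addr:street tag found under a way."""
    names = []
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag == "way":
            # At the 'end' event all child tags of the way have been parsed
            for tag in element.iter("tag"):
                if tag.attrib.get("k") == "addr:street":
                    names.append(tag.attrib["v"])
    return names


found = street_names(io.BytesIO(SAMPLE))
```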

## Tag Types

"""
Your task is to explore the data a bit more.
Before you process the data and add it into your database, you should check the
"k" value for each "<tag>" and see if there are any potential problems.

We have provided you with 3 regular expressions to check for certain patterns
in the tags. As we saw in the quiz earlier, we would like to change the data
model and expand the "addr:street" type of keys to a dictionary like this:
{"address": {"street": "Some value"}}
So, we have to see if we have such tags, and if we have any tags with
problematic characters.

Please complete the function 'key_type', such that we have a count of each of
four tag categories in a dictionary:
  "lower", for tags that contain only lowercase letters and are valid,
  "lower_colon", for otherwise valid tags with a colon in their names,
  "problemchars", for tags with problematic characters, and
  "other", for other tags that do not fall into the other three categories.
See the 'process_map' and 'test' functions for examples of the expected format.
"""

In [5]:
import xml.etree.cElementTree as ET

In [6]:
import re

In [7]:
import pprint

In [8]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

In [9]:
def key_type(element, keys):
    if element.tag == 'tag':
        k = element.attrib['k']
        # Test the patterns in order; the first one that matches wins
        if lower.search(k):
            keys["lower"] += 1
        elif lower_colon.search(k):
            keys["lower_colon"] += 1
        elif problemchars.search(k):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys

In [10]:
def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other":0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

In [11]:
keys = process_map('example.osm')

In [12]:
pprint.pprint(keys)

{'lower': 5, 'lower_colon': 0, 'other': 1, 'problemchars': 1}
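
To see which category a given key falls into, the three patterns can be applied to a handful of keys directly. The sketch below reuses the regular expressions from above; the helper name `classify_key` and the sample keys are assumptions chosen to hit each category.

```python
import re
from collections import Counter

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category name for one tag key, testing patterns in order."""
    if lower.search(k):
        return "lower"
    if lower_colon.search(k):
        return "lower_colon"
    if problemchars.search(k):
        return "problemchars"
    return "other"


# Hypothetical keys: two colons, uppercase letters and digits all land in "other"
sample_keys = ["highway", "addr:street", "addr:street:name",
               "street name", "FIXME", "name_1"]
sample_categories = Counter(classify_key(k) for k in sample_keys)
```

Note that a key with two colons, such as "addr:street:name", does not match `lower_colon`, which allows exactly one colon, so it is counted under "other".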


## Exploring Users


"""
Your task is to explore the data a bit more.
The first task is a fun one - find out how many unique users
have contributed to the map in this particular area!

The function process_map should return a set of unique user IDs ("uid")
"""

In [1]:
import xml.etree.cElementTree as ET
import pprint
import re

In [2]:
def get_user(element):
    # Return the element's "uid" attribute, or None if it has none
    return element.get('uid')

In [3]:
def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        uid = get_user(element)
        if uid is not None:
            users.add(uid)
    return users

In [4]:
users = process_map('example.osm')

In [5]:
pprint.pprint(users)

set(['1219059', '147510', '26299', '451048', '567034', '939355'])


In [6]:
len(users)

6
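
A set answers "how many unique users", but not how much each one contributed. As a possible extension, the sketch below counts elements per "uid" with a `Counter`; the inline sample and the name `contributions_by_uid` are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample with two distinct contributors
SAMPLE = b"""<osm>
  <node id="1" uid="26299" user="uboot"/>
  <node id="2" uid="451048" user="bbmiller"/>
  <way id="3" uid="451048" user="bbmiller"><nd ref="1"/></way>
</osm>"""


def contributions_by_uid(source):
    """Count how many elements carry each 'uid' attribute value."""
    counts = Counter()
    for _, element in ET.iterparse(source):
        uid = element.get("uid")  # None for elements such as <nd> or <osm>
        if uid is not None:
            counts[uid] += 1
    return counts


uid_counts = contributions_by_uid(io.BytesIO(SAMPLE))
unique_users = set(uid_counts)
```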

## Auditing and Improving Street Names 

"""
Your task in this exercise has two steps:

- audit the OSMFILE and change the variable 'mapping' to reflect the changes needed to fix 
    the unexpected street types to the appropriate ones in the expected list.
    You have to add mappings only for the actual problems you find in this OSMFILE,
    not a generalized solution, since that may and will depend on the particular area you are auditing.
- write the update_name function, to actually fix the street name.
    The function takes a string with the street name as an argument and should return the fixed name.
    We have provided a simple test so that you can see exactly what is expected.
"""

In [7]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint

In [8]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

In [9]:
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

In [10]:
mapping = { "St": "Street",
            "St.": "Street",
            "st": "Street",
            "street": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "Avene": "Avenue",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Dr": "Drive",
            "Dr.": "Drive",
            "Ct": "Court",
            "Ct.": "Court",
            "Pl": "Place",
            "Pl.": "Place",
            "lane": "Lane",
            "Rd": "Road", 
            "Rd.": "Road",
            "Trl": "Trail",
            "Pkwy": "Parkway"}

In [11]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
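
The effect of `street_type_re` is easiest to see on a plain list of names, with no XML involved. The following sketch groups names by their last token, using a shortened `expected` set and a hypothetical helper `group_unexpected`; the names are taken from the outputs shown later in this section, plus one made-up "Main Street".

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = {"Street", "Avenue", "Road"}  # shortened list, for illustration only


def group_unexpected(names):
    """Group street names by their last token when it is not an expected type."""
    groups = defaultdict(set)
    for name in names:
        m = street_type_re.search(name)
        if m and m.group() not in expected:
            groups[m.group()].add(name)
    return dict(groups)


sample_groups = group_unexpected([
    "North Lincoln Ave", "N. Lincoln Ave",
    "West Lexington St.", "Baldwin Rd.", "Main Street",
])
```

The pattern keeps a trailing period as part of the match, which is why "St." and "Rd." appear as keys with the dot included and need their own entries in `mapping`.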

In [12]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

In [13]:
def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

In [14]:
def update_name(name, mapping):
    parts = name.split()
    # Replace only the last word, and only if it is a known abbreviation
    if parts and parts[-1] in mapping:
        parts[-1] = mapping[parts[-1]]
    return ' '.join(parts)

In [16]:
st_types = audit("example2.osm")

In [17]:
len(st_types)

3

In [21]:
print st_types

defaultdict(<type 'set'>, {'Ave': set(['N. Lincoln Ave', 'North Lincoln Ave']), 'St.': set(['West Lexington St.']), 'Rd.': set(['Baldwin Rd.'])})


In [18]:
pprint.pprint(dict(st_types))

{'Ave': set(['N. Lincoln Ave', 'North Lincoln Ave']),
 'Rd.': set(['Baldwin Rd.']),
 'St.': set(['West Lexington St.'])}


In [20]:
for st_type, ways in st_types.iteritems():
    print "ways"
    print ways
    print "st_type"
    print st_type

ways
set(['N. Lincoln Ave', 'North Lincoln Ave'])
st_type
Ave
ways
set(['West Lexington St.'])
st_type
St.
ways
set(['Baldwin Rd.'])
st_type
Rd.


In [23]:
for st_type, ways in st_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        print name, "=>", better_name

N. Lincoln Ave => N. Lincoln Avenue
North Lincoln Ave => North Lincoln Avenue
West Lexington St. => West Lexington Street
Baldwin Rd. => Baldwin Road
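
An alternative to splitting on whitespace is to let the audit pattern do the replacement as well, so that auditing and fixing agree on what counts as the street type. This is only a sketch: `update_name_re` is a hypothetical name, and the mapping is cut down to the three abbreviations found in this file.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
mapping = {"Ave": "Avenue", "St.": "Street", "Rd.": "Road"}  # subset of the full mapping


def update_name_re(name, mapping):
    """Replace the trailing street type using the same regex as the audit."""
    def fix(match):
        street_type = match.group()
        # Leave the token unchanged when no correction is known
        return mapping.get(street_type, street_type)
    return street_type_re.sub(fix, name)


fixed = [update_name_re(n, mapping)
         for n in ["N. Lincoln Ave", "West Lexington St.", "Baldwin Rd.", "Main Street"]]
```

Unlike the `split`/`join` version, this variant leaves the spacing inside the name untouched.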


## Preparing for Database - SQL

In [1]:
"""
After auditing is complete the next step is to prepare the data to be inserted into a SQL database.
To do so you will parse the elements in the OSM XML file, transforming them from document format to
tabular format, thus making it possible to write to .csv files.  These csv files can then easily be
imported to a SQL database as tables.

The process for this transformation is as follows:
- Use iterparse to iteratively step through each top level element in the XML
- Shape each element into several data structures using a custom function
- Utilize a schema and validation library to ensure the transformed data is in the correct format
- Write each data structure to the appropriate .csv files

We've already provided the code needed to load the data, perform iterative parsing and write the
output to csv files. Your task is to complete the shape_element function that will transform each
element into the correct format. To make this process easier we've already defined a schema (see
the schema.py file in the last code tab) for the .csv files and the eventual tables. Using the 
cerberus library we can validate the output against this schema to ensure it is correct.

#### Shape Element Function
The function should take as input an iterparse Element object and return a dictionary.

#### If the element top level tag is "node":
The dictionary returned should have the format {"node": .., "node_tags": ...}

The "node" field should hold a dictionary of the following top level node attributes:
- id
- user
- uid
- version
- lat
- lon
- timestamp
- changeset

All other attributes can be ignored

The "node_tags" field should hold a list of dictionaries, one per secondary tag. Secondary tags are
child tags of node which have the tag name/type: "tag". Each dictionary should have the following
fields from the secondary tag attributes:
- id: the top level node id attribute value
- key: the full tag "k" attribute value if no colon is present or the characters after the colon if one is.
- value: the tag "v" attribute value
- type: either the characters before the colon in the tag "k" value or "regular" if a colon is not present.

Additionally,

- if the tag "k" value contains problematic characters, the tag should be ignored
- if the tag "k" value contains a ":" the characters before the ":" should be set as the tag type
  and characters after the ":" should be set as the tag key
- if there are additional ":" in the "k" value, they should be ignored and kept as part of
  the tag key. For example:

  <tag k="addr:street:name" v="Lincoln"/>

should be turned into

{'id': 12345, 'key': 'street:name', 'value': 'Lincoln', 'type': 'addr'}

- If a node has no secondary tags then the "node_tags" field should just contain an empty list.

The final return value for a "node" element should look something like:

{'node': {'id': 757860928,
          'user': 'uboot',
          'uid': 26299,
          'version': '2',
          'lat': 41.9747374,
          'lon': -87.6920102,
          'timestamp': '2010-07-22T16:16:51Z',
          'changeset': 5288876},
 'node_tags': [{'id': 757860928,
                'key': 'amenity',
                'value': 'fast_food',
                'type': 'regular'},
               {'id': 757860928,
                'key': 'cuisine',
                'value': 'sausage',
                'type': 'regular'},
               {'id': 757860928,
                'key': 'name',
                'value': "Shelly's Tasty Freeze",
                'type': 'regular'}]}

#### If the element top level tag is "way":
The dictionary should have the format {"way": ..., "way_tags": ..., "way_nodes": ...}

The "way" field should hold a dictionary of the following top level way attributes:
- id
- user
- uid
- version
- timestamp
- changeset

All other attributes can be ignored

The "way_tags" field should again hold a list of dictionaries, following the exact same rules as
for "node_tags".

Additionally, the dictionary should have a field "way_nodes". "way_nodes" should hold a list of
dictionaries, one for each nd child tag.  Each dictionary should have the fields:
- id: the top level element (way) id
- node_id: the ref attribute value of the nd tag
- position: the index starting at 0 of the nd tag i.e. what order the nd tag appears within
            the way element

The final return value for a "way" element should look something like:

{'way': {'id': 209809850,
         'user': 'chicago-buildings',
         'uid': 674454,
         'version': '1',
         'timestamp': '2013-03-13T15:58:04Z',
         'changeset': 15353317},
 'way_nodes': [{'id': 209809850, 'node_id': 2199822281, 'position': 0},
               {'id': 209809850, 'node_id': 2199822390, 'position': 1},
               {'id': 209809850, 'node_id': 2199822392, 'position': 2},
               {'id': 209809850, 'node_id': 2199822369, 'position': 3},
               {'id': 209809850, 'node_id': 2199822370, 'position': 4},
               {'id': 209809850, 'node_id': 2199822284, 'position': 5},
               {'id': 209809850, 'node_id': 2199822281, 'position': 6}],
 'way_tags': [{'id': 209809850,
               'key': 'housenumber',
               'type': 'addr',
               'value': '1412'},
              {'id': 209809850,
               'key': 'street',
               'type': 'addr',
               'value': 'West Lexington St.'},
              {'id': 209809850,
               'key': 'street:name',
               'type': 'addr',
               'value': 'Lexington'},
              {'id': 209809850,
               'key': 'street:prefix',
               'type': 'addr',
               'value': 'West'},
              {'id': 209809850,
               'key': 'street:type',
               'type': 'addr',
               'value': 'Street'},
              {'id': 209809850,
               'key': 'building',
               'type': 'regular',
               'value': 'yes'},
              {'id': 209809850,
               'key': 'levels',
               'type': 'building',
               'value': '1'},
              {'id': 209809850,
               'key': 'building_id',
               'type': 'chicago',
               'value': '366409'}]}
"""

import csv
import codecs
import re
import xml.etree.cElementTree as ET
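
The key-splitting rule from the docstring above (type before the first colon, everything after it kept as the key) can be sketched in isolation. The helper name `split_tag_key` is an assumption, and the sketch leaves out the problematic-character check.

```python
def split_tag_key(k, default_type="regular"):
    """Split a tag 'k' value into (type, key) at the first colon only."""
    if ":" in k:
        # maxsplit=1 keeps any further colons as part of the key
        tag_type, key = k.split(":", 1)
        return tag_type, key
    return default_type, k


examples = {k: split_tag_key(k) for k in ["amenity", "addr:street", "addr:street:name"]}
```

Using `split(":", 1)` removes the need to rejoin the remaining pieces by hand when a key contains more than one colon.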

In [2]:
import cerberus

In [3]:
import schema

In [4]:
OSM_PATH = "example3.osm"

In [5]:
NODES_PATH = "nodes.csv"

In [6]:
NODE_TAGS_PATH = "nodes_tags.csv"

In [7]:
WAYS_PATH = "ways.csv"

In [8]:
WAY_NODES_PATH = "ways_nodes.csv"

In [9]:
WAY_TAGS_PATH = "ways_tags.csv"

In [10]:
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

In [11]:
SCHEMA = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}
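
The 'coerce' entries in the schema name the Python type each field is meant to be converted to before its 'type' rule is checked. As a plain-Python illustration of that conversion (it does not use cerberus, and `NODE_COERCIONS` and `coerce_fields` are hypothetical names), the string attributes read from the XML could be converted like this:

```python
# Hypothetical mapping mirroring the 'coerce' entries of SCHEMA['node']
NODE_COERCIONS = {'id': int, 'lat': float, 'lon': float, 'uid': int, 'changeset': int}


def coerce_fields(row, coercions):
    """Return a copy of row with the listed fields converted to their target type."""
    return {k: (coercions[k](v) if k in coercions else v) for k, v in row.items()}


# Attribute values arrive from the XML parser as strings
raw_node = {'id': '757860928', 'user': 'uboot', 'uid': '26299', 'version': '2',
            'lat': '41.9747374', 'lon': '-87.6920102',
            'timestamp': '2010-07-22T16:16:51Z', 'changeset': '5288876'}
typed_node = coerce_fields(raw_node, NODE_COERCIONS)
```

Fields without a coercion, such as 'version' and 'timestamp', stay as strings, matching their 'string' type in the schema.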

##### Make sure the field order in the csvs matches the column order in the SQL table schema

In [12]:
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS= ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
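
The field lists matter because `csv.DictWriter` writes columns in the order of its `fieldnames` argument, regardless of the key order of each row dictionary. A small sketch, written for Python 3 (it relies on `io.StringIO`) and using a hypothetical helper `rows_to_csv`:

```python
import csv
import io

NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']


def rows_to_csv(rows, fieldnames):
    """Write row dictionaries to CSV text with columns in 'fieldnames' order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# The dictionary keys are deliberately in a different order than the field list
csv_text = rows_to_csv(
    [{'type': 'addr', 'value': 'Lincoln', 'id': '12345', 'key': 'street:name'}],
    NODE_TAGS_FIELDS,
)
```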

In [13]:
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, lower_colon=LOWER_COLON, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements

    if element.tag not in ('node', 'way'):
        return None

    for tag in element.iter("tag"):
        key = tag.attrib['k']
        # Ignore tags whose key contains problematic characters
        if problem_chars.search(key):
            continue
        if lower_colon.search(key):
            # Text before the first colon is the type; the rest is the key
            tag_type, tag_key = key.split(":", 1)
        else:
            tag_type, tag_key = default_tag_type, key
        tags.append({'id': element.attrib['id'],
                     'key': tag_key,
                     'value': tag.attrib['v'],
                     'type': tag_type})

    if element.tag == 'node':
        for field in node_attr_fields:
            node_attribs[field] = element.attrib[field]
        return {'node': node_attribs, 'node_tags': tags}

    for field in way_attr_fields:
        way_attribs[field] = element.attrib[field]
    # position is the 0-based order in which each nd appears within the way
    for position, nd in enumerate(element.iter("nd")):
        way_nodes.append({'id': element.attrib['id'],
                          'node_id': nd.attrib['ref'],
                          'position': position})
    return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}

### Helper Functions

In [14]:
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
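
To check what this generator yields, the sketch below applies the same pattern to a tiny inline document. The function is renamed `iter_top_level` and the sample XML is made up, so treat it as an illustration of the technique rather than a drop-in replacement.

```python
import io
import xml.etree.ElementTree as ET


def iter_top_level(source, tags=("node", "way")):
    """Yield completed elements whose tag is in 'tags', freeing memory as we go."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the 'start' of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            # Runs when the consumer asks for the next item: drop the
            # elements collected under the root so far
            root.clear()


# Hypothetical sample: the relation is skipped because it is not in 'tags'
SAMPLE = b"""<osm><node id="1"/><way id="2"><nd ref="1"/></way><relation id="3"/></osm>"""
seen = [(e.tag, e.get("id")) for e in iter_top_level(io.BytesIO(SAMPLE))]
```

Because `root.clear()` runs as soon as the consumer requests the next element, each yielded element should be fully processed before advancing the generator.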

In [15]:
def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_strings = (
            "{0}: {1}".format(k, v if isinstance(v, str) else ", ".join(v))
            for k, v in errors.iteritems()
        )
        raise cerberus.ValidationError(
            message_string.format(field, "\n".join(error_strings))
        )

In [16]:
class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

### Main Function

In [17]:
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:
        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])

#### Trying the whole thing:

In [18]:
process_map(OSM_PATH, validate=True)

In [25]:
coll = []
for element in get_element(OSM_PATH, tags=('node', 'way')):
    el = shape_element(element)
    coll.append(el)
    
pprint.pprint(coll)

[{'node': {'changeset': '11129782',
           'id': '261114295',
           'lat': '41.9730791',
           'lon': '-87.6866303',
           'timestamp': '2012-03-28T18:31:23Z',
           'uid': '451048',
           'user': 'bbmiller',
           'version': '7'},
  'node_tags': []},
 {'node': {'changeset': '8448766',
           'id': '261114296',
           'lat': '41.9730416',
           'lon': '-87.6878512',
           'timestamp': '2011-06-15T17:04:54Z',
           'uid': '451048',
           'user': 'bbmiller',
           'version': '6'},
  'node_tags': []},
 {'node': {'changeset': '8581395',
           'id': '261114299',
           'lat': '41.9729565',
           'lon': '-87.6939548',
           'timestamp': '2011-06-29T14:14:14Z',
           'uid': '451048',
           'user': 'bbmiller',
           'version': '5'},
  'node_tags': []},
 {'node': {'changeset': '8581395',
           'id': '261146436',
           'lat': '41.9707380',
           'lon': '-87.6976025',
           'ti

In [24]:
import pprint