# OpenStreetMap Project - Chicago

This project uses the OpenStreetMap data for Chicago, IL, United States, a beautiful city where I have lived since graduating from college. I am curious to see what the map database reveals. After unzipping, the dataset is a little more than 2 GB.

I will analyze this dataset in the following steps:

* Extract a sample from the database.
* Identify the problems in this dataset.
* Clean up the data and import it into SQL.
* Explore the data by querying it in SQLite.
* Discuss additional ideas that come up while exploring the dataset.

Reference:

* A summary of the Chicago area can be found on the [OpenStreetMap website](https://www.openstreetmap.org/relation/122604).
* The data can be downloaded from [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/metro/chicago_illinois/).
* The [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page) explains the OpenStreetMap database in detail.

## Extract a sample

As mentioned above, this dataset is quite large, more than 2 GB. Opening it directly or parsing it all at once can exhaust memory and crash the computer, so it is a good idea to extract a sample first.

In [1]:
import csv
import codecs
import pprint
import re
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import lxml
import cerberus
from collections import defaultdict

I will use the extract-sample-data.py file to extract the data.

In [2]:
%%timeit
%run extract-sample-data.py

1 loop, best of 3: 3min 30s per loop


I will write a function that finds the elements I want in the original .osm file and writes them to a sample .osm file.
After reading through the wiki, I think the most important tags in this dataset are "node", "way", and "relation". Therefore, the function focuses on elements with these three tags.

In [2]:
osm_file = 'chicago_illinois.osm'
sample_file = 'sample_chicago.osm'

tag = ['node', 'way', 'relation']

In [3]:
def get_element(osm_file, tags=('node', 'way', 'relation')):
    '''
    Read an XML file and yield the elements with the desired tags.

    Parameters
    ----------
    osm_file: str
        path to the .xml or .osm file to be parsed

    tags: tuple or list of str
        the tag names of the elements to yield.
        default is ('node', 'way', 'relation')

    Yields
    ------
    xml.etree.ElementTree.Element
        one element for each matching closing tag
    '''

    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)

    for event, elem in context:
        if (event == 'end') and (elem.tag in tags):
            yield elem
            root.clear()
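
To see how this streaming pattern behaves, here is a minimal sketch that runs the same `iterparse` loop on a tiny in-memory document. The XML snippet and the helper name `stream_elements` are made up for illustration; they are not part of the Chicago extract.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature OSM document, used only for this illustration.
TINY_OSM = b"""<osm>
  <node id="1" lat="41.88" lon="-87.63"><tag k="name" v="A"/></node>
  <node id="2" lat="41.89" lon="-87.62"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""


def stream_elements(source, tags=('node', 'way', 'relation')):
    """Yield matching elements one at a time, clearing the root as we go."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of <osm>
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # drop processed children so memory stays bounded


found = [(elem.tag, elem.get('id'))
         for elem in stream_elements(io.BytesIO(TINY_OSM))]
print(found)
```

Only the top-level `node` and `way` elements are yielded; the nested `tag` and `nd` children travel along inside their parents.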

After generating the elements, it is time to write them into another file.

k is the sampling interval: one element is exported for every k elements. The bigger k is, the smaller the sample will be. Since the dataset is large, I chose k = 1000.

In [4]:
k = 1000
    
with open(sample_file, 'wb') as output:
    output.write(bytes('<?xml version="1.0" encoding="UTF-8"?>\n', 'UTF-8'))
    output.write(bytes('<osm>\n  ', 'UTF-8'))

    for i, element in enumerate(get_element(osm_file)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write(bytes('</osm>', 'UTF-8'))
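
The `i % k == 0` test keeps elements 0, k, 2k, and so on, so the sample holds roughly one k-th of the elements. As a small illustration on plain integers rather than XML elements, the same selection can be written with `itertools.islice`; the helper name `every_kth` is hypothetical.

```python
from itertools import islice


def every_kth(iterable, k):
    """Return an iterator over items 0, k, 2k, ... of `iterable`."""
    return islice(iterable, 0, None, k)


# Stand-in for the element stream: 10 integers instead of parsed XML elements.
items = range(10)
by_modulo = [x for i, x in enumerate(items) if i % 3 == 0]
by_islice = list(every_kth(items, 3))
print(by_modulo, by_islice)
```

Both approaches consume the stream lazily, so neither needs the full dataset in memory.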

After extracting the sample, it is a good idea to look at the big picture and check whether the sample contains enough data. Therefore, I will write a function that reports which tags appear in the sample and how many times each occurs.

In [4]:
def count_tag(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        tag = elem.tag
        if tag not in tags:
            tags[tag] = 1
        else:
            tags[tag] += 1
    return tags

In [5]:
count_tag(sample_file)

{'member': 69,
 'nd': 10728,
 'node': 8718,
 'osm': 1,
 'relation': 5,
 'tag': 6761,
 'way': 1233}

The sample appears to contain a good amount of data.
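
The same tally can also be kept with `collections.Counter`, which removes the explicit membership check. The sketch below runs on a small made-up XML snippet, not on the sample file, so its counts are illustrative only; `count_tags` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up snippet for illustration; not taken from the Chicago data.
SNIPPET = b"""<osm>
  <node id="1"><tag k="name" v="A"/></node>
  <node id="2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
</osm>"""


def count_tags(source):
    """Count how many times each element name occurs in an XML source."""
    return Counter(elem.tag for _, elem in ET.iterparse(source))


counts = count_tags(io.BytesIO(SNIPPET))
print(dict(counts))
```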

## Problems in this dataset

After getting the sample data, we can look through it, find the problems, and clean them up.

From reading the documentation and looking through the sample data in a text editor, I learned that `<tag>` elements are used to store all the values, as key (k) and value (v) attributes.

I noticed the following potential problems while reading the data:

* The `<tag>` k attribute values are not consistent. Some contain only lowercase letters, like "ele". Some contain lowercase letters and a colon, like "gnis:id". Others contain special characters.
* The street names are not consistent. Some spell out the street type, like "street" and "avenue", while others use abbreviations, like "St" and "St.".

### k attribute issue

I will use regular expressions to find the patterns mentioned above. Then I will define a function that counts how many keys in the sample file match each pattern.

In [13]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}

In [14]:
def key_type(filename, keys):
    '''
    Read through the k attribute of every <tag> and count its category.

    Parameters
    ---
    filename: .xml or .osm file
        the file that is going to be analyzed
    keys: a dictionary
        a dictionary of counts per category; it is updated in place

    Return
    ---
    the updated keys (a dictionary)
    '''
    for event, element in ET.iterparse(filename):
        if element.tag == 'tag':
            key = element.get('k')
            # Test the patterns in order; the first match decides the category.
            if lower.search(key):
                keys['lower'] += 1
            elif lower_colon.search(key):
                keys['lower_colon'] += 1
            elif problemchars.search(key):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1

    return keys

In [15]:
key_type(sample_file, keys)

{'lower': 2016, 'lower_colon': 3135, 'other': 1610, 'problemchars': 0}
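
To make the four categories concrete, the sketch below applies the same three patterns to a handful of keys. The keys are chosen for illustration and are not a listing from the sample, and `classify_key` is a hypothetical helper. It suggests one reason the "other" bucket can be large: keys with uppercase letters, digits, or a second colon match none of the three patterns.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for a single tag key."""
    if lower.search(key):
        return 'lower'
    if lower_colon.search(key):
        return 'lower_colon'
    if problemchars.search(key):
        return 'problemchars'
    return 'other'


# Illustrative keys only (assumed, not read from the sample file).
examples = ['ele', 'addr:street', 'street name',
            'name_1', 'addr:street:name', 'FIXME']
categories = {key: classify_key(key) for key in examples}
print(categories)
```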

### Street name issue

As with the k attribute, I will use a regular expression to find the street type at the end of each street name. I will build a list of expected street types and collect the street names whose type is not in that list.

In [16]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

In [48]:
def audit_street_type(street_types, street_name):
    '''
    This function records a street_name whose street type is not in the expected list

    Parameters
    ---
    street_types: a dictionary
        it is a dictionary that contains the unique key of street types
    street_name: strings
        the street name found in .xml or .osm file

    Return
    ---
    None
    '''
    match = street_type_re.search(street_name)
    if match:
        street_type = match.group(0)
        if street_type not in expected:
            street_types[street_type].add(street_name)

In [49]:
def audit(filename):
    '''
    This function reads a file and collects the unexpected street types

    Parameters
    ---
    filename: .xml or .osm file

    Return
    ---
    street_types dictionary
    '''
    street_types = defaultdict(set)

    # Use "end" events: on "start", an element's children may not be parsed yet.
    for event, elem in ET.iterparse(filename, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == 'addr:street':
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types

In [50]:
audit(sample_file)

defaultdict(set,
            {'Ave': {'New York Ave'},
             'Dr': {'Breckenridge Dr'},
             'E': {'South Avenue E'},
             'Highway': {'Lincoln Highway'}})
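
A quick way to see what `street_type_re` captures is to run it on a few names. The sketch assumes the same pattern and reuses three names from the output above plus one made-up name ending in a period; `last_word` is a hypothetical helper.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def last_word(street_name):
    """Return the trailing word of a street name, or None if nothing matches."""
    match = street_type_re.search(street_name)
    return match.group(0) if match else None


# 'Main St.' is a made-up example; the others appear in the audit output.
names = ['New York Ave', 'South Avenue E', 'Lincoln Highway', 'Main St.']
street_types_found = [last_word(n) for n in names]
print(street_types_found)
```

Note that the pattern returns whatever word comes last, so a direction suffix such as "E" is reported as if it were a street type.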

## Fix the problems

We have audited the Chicago osm file, and now it is time to clean it.

Based on the previous analysis, there are no problematic characters in the 'k' attributes, so I will leave that part unchanged.

The street types are inconsistent and contain many abbreviations. I will create a mapping to update them.

In [51]:
mapping = {"St": "Street",
           "St.": "Street",
           "Ave": 'Avenue',
           'Rd.': 'Road',
           'Dr': 'Drive',
           'E': 'East',
           'Highway': 'Highway'
            }

In [52]:
def update_name(name, mapping):
    '''
    This function will update the name based on the given mapping

    Parameters
    ---
    name: the unexpected street name found in the file
    mapping: the mapping for updating the name

    Return
    ---
    the updated name
    '''
    words = name.split(' ')
    # Only the last word is the street type; replacing it by position avoids
    # touching the same text elsewhere in the name.
    if words[-1] in mapping:
        words[-1] = mapping[words[-1]]

    return ' '.join(words)
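
One detail worth checking when expanding abbreviations is that only the final word should change. The sketch below, using a hypothetical helper `expand_last_word`, a cut-down mapping, and a made-up street name, contrasts replacing every occurrence of the abbreviation with replacing just the last word.

```python
# Cut-down mapping for illustration only.
MAPPING = {'St': 'Street', 'Ave': 'Avenue', 'Dr': 'Drive'}


def expand_last_word(name, mapping):
    """Expand the final word of `name` if it is a known abbreviation."""
    head, _, last = name.rpartition(' ')
    last = mapping.get(last, last)
    return f'{head} {last}' if head else last


name = 'St Clair St'  # made-up name where the abbreviation appears twice
replaced_everywhere = name.replace('St', MAPPING['St'])
replaced_last_only = expand_last_word(name, MAPPING)
print(replaced_everywhere)
print(replaced_last_only)
```

With `str.replace`, the leading "St" is expanded as well, which is usually not what is wanted for a name like this.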

In [63]:
def update_file(filename):
    '''
    This function brings the audit() and update_name() functions together to
    compute consistent street names

    Parameters
    ---
    filename: the .xml or .osm file that needs to be updated

    Return
    ---
    a dictionary mapping each unexpected street name to its updated name
    '''
    street_types = audit(filename)
    updated_names = {}
    for street_type, ways in street_types.items():
        for name in ways:
            updated_names[name] = update_name(name, mapping)
    return updated_names

In [64]:
update_file(sample_file)

In [66]:
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']


def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements
    if element.tag == 'node':
        # Copy the top-level node attributes in schema order.
        for item in node_attr_fields:
            node_attribs[item] = element.get(item)
        for child in element:
            if child.tag == 'tag':
                tag_dict = {}
                # Look up the key only after confirming this is a <tag> child.
                colon = child.get('k').find(':')
                tag_dict['id'] = element.get('id')
                tag_dict['value'] = child.get('v')
                if colon != -1:
                    # "addr:street" becomes type "addr" and key "street".
                    tag_dict['type'] = child.get('k')[:colon]
                    tag_dict['key'] = child.get('k')[colon+1:]
                else:
                    tag_dict['key'] = child.get('k')
                    tag_dict['type'] = default_tag_type
                tags.append(tag_dict)
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        for item in WAY_FIELDS:
            way_attribs[item] = element.get(item)
            
        n = 0
        for child in element:
            if child.tag == 'nd':
                nd_dict = {}
                nd_dict['id'] = element.get('id')
                nd_dict['node_id'] = child.get('ref')
                nd_dict['position'] = n
                n += 1
                way_nodes.append(nd_dict)
            
            if (child.tag == 'tag'):
                way_tag_dict = {}
                colon = child.get('k').find(':')
                way_tag_dict['id'] = element.get('id')
                way_tag_dict['value'] = child.get('v')
                if (colon != -1):
                    type_value = child.get('k')[:colon]
                    key_value = child.get('k')[colon+1:]
                    way_tag_dict['type'] = type_value
                    way_tag_dict['key'] = key_value
                else:
                    way_tag_dict['key'] = child.get('k')
                    way_tag_dict['type'] = 'regular'
                tags.append(way_tag_dict)
                
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
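
The key handling in `shape_element` splits a key such as `addr:street` at its first colon. A compact sketch of just that step, using a hypothetical helper `split_key`, is shown here; it is meant to illustrate the rule, not to replace the function above.

```python
def split_key(k, default_type='regular'):
    """Split an OSM tag key at the first colon into (type, key)."""
    tag_type, sep, key = k.partition(':')
    if not sep:
        # No colon at all: the whole string is the key.
        return default_type, k
    return tag_type, key


print(split_key('addr:street'))
print(split_key('amenity'))
print(split_key('addr:street:name'))
```

Because the split happens at the first colon only, any further colons stay inside the key.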

In [67]:
OSM_PATH = "example.osm"

NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

In [68]:
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])