# OpenStreetMap Project - Chicago

This project uses the map of a beautiful city: Chicago, IL, United States. I have lived here since graduating from college, and I am very interested to see what the map database reveals. After unzipping, the full dataset is a little more than 2GB.

I will analyze this dataset by doing the following:

* Extract a sample from the database.
* Find the problems in this dataset.
* Clean up the data and import it into SQL.
* Explore the data by querying in SQLite.
* Suggest additional ideas after exploring the dataset.

Reference:

* A summary of the Chicago area can be found on the [OpenStreetMap website](https://www.openstreetmap.org/relation/122604).
* The data can be downloaded from [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/metro/chicago_illinois/).
* The [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page) gives a detailed explanation of the OpenStreetMap database.

## Import Libraries

In [17]:
import csv
import codecs
import pprint
import re
import xml.etree.cElementTree as ET
import lxml
import cerberus
from collections import defaultdict

## Extract a sample

As mentioned before, this database is quite large (more than 2GB). Opening it directly or parsing it all at once can exhaust the memory of a typical computer. Therefore, it is a good idea to extract a sample from this dataset.

I will use the extract-sample-data.py file to extract 1% of the original data. This only needs to run once. The final sample file is around 20MB.

In [2]:
%%timeit
%run extract-sample-data.py

1 loop, best of 3: 3min 30s per loop
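
The extract-sample-data.py script itself is not shown here. As a rough illustration of one way such a sample could be taken, the sketch below keeps every k-th top-level element while streaming the file, so the full 2GB never has to sit in memory. It assumes Python 3; the names `iter_top_level`, `write_sample`, and the parameter `k` are hypothetical and are not taken from the author's script.

```python
import xml.etree.ElementTree as ET


def iter_top_level(osm_path, tags=('node', 'way', 'relation')):
    """Yield each top-level element whose tag is in `tags`, one at a time."""
    context = iter(ET.iterparse(osm_path, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the <osm> root
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # drop elements already handled to keep memory flat


def write_sample(osm_path, sample_path, k=100):
    """Write every k-th top-level element to sample_path (k=100 is about 1%)."""
    with open(sample_path, 'w', encoding='utf-8') as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        for i, element in enumerate(iter_top_level(osm_path)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding='unicode'))
        output.write('</osm>')
```

Taking every k-th element rather than the first 1% of the file keeps nodes, ways, and relations all represented, since an .osm file usually lists them in that order.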


After extracting the sample, it is a good idea to look at the big picture and check that the sample contains enough data. Therefore, I wrote a function that counts how many times each tag occurs in the sample dataset.

In [9]:
sample_file = 'sample_chicago.osm'

In [10]:
def count_tag(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        tag = elem.tag
        if tag not in tags:
            tags[tag] = 1
        else:
            tags[tag] += 1
    return tags

In [26]:
count_tag(sample_file)

{'member': 349,
 'nd': 103077,
 'node': 87172,
 'osm': 1,
 'relation': 48,
 'tag': 67876,
 'way': 12337}

The sample appears to contain a good amount of data.

## Problems in this dataset

After getting the sample data, we can look through it, find the problems, and clean them up.

From reading the documentation and looking through the sample data in a text editor, I learned that each `<tag>` element stores a value (`v`) under a key (`k`).

I noticed the following potential problems while reading the sample data:

* The `<tag>` k attribute values are not consistent. Some contain only lowercase letters, like "ele". Some contain lowercase letters and a colon, like "gnis:id". Others contain special characters.
* Street names are not consistent. Some spell the street type out in full, like "street" and "avenue", while others use abbreviations, like "Ave".
* Phone number formats are not consistent. Some use (XXX) XXX-XXXX while others use XXX-XXX-XXXX.

### k attribute issue

The k attribute values fall into four groups:

* Values that contain only lowercase letters, e.g. "building".
* Values that contain lowercase letters and a colon, e.g. "addr:city".
* Problematic values that contain special characters such as "&".
* Everything else, labeled "other".

The first two patterns and "other" are fine; they will not affect later analysis. The third one, however, would need some clean-up. I will run the k-attribute-issues.py file, which provides `k_attrib_type`, to count each pattern in my sample file.
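
The script is not reproduced in this notebook. A minimal sketch of how the four groups could be told apart is given below. `LOWER_COLON` and `PROBLEMCHARS` are copied from the regexes that appear later in this notebook; the `LOWER` pattern and the function names `key_type` and `classify_keys` are assumptions made for illustration.

```python
import re

LOWER = re.compile(r'^([a-z]|_)*$')  # assumed pattern for "lowercase only"
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(key):
    """Return the group name for a single k attribute value."""
    if LOWER.match(key):
        return 'lower'
    if LOWER_COLON.match(key):
        return 'lower_colon'
    if PROBLEMCHARS.search(key):
        return 'problemchars'
    return 'other'


def classify_keys(keys):
    """Count how many of the given k values fall into each group."""
    counts = {'lower': 0, 'lower_colon': 0, 'problemchars': 0, 'other': 0}
    for key in keys:
        counts[key_type(key)] += 1
    return counts
```

The order of the checks matters: a key is tested against the two "good" patterns first, and only keys that match neither are checked for problematic characters.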

In [23]:
%run "k-attribute-issues.py"

k_attrib_type(sample_file, keys)

{'lower': 20677, 'lower_colon': 31308, 'other': 15891, 'problemchars': 0}

Based on this result, there are no problematic characters in the k attributes. Therefore, we do not need to clean the k attributes for later analysis.

### v attribute issue

The v attribute holds the value for the corresponding k attribute. After looking through the sample file in a text editor, I found two kinds of v attribute values with potential problems.

* Many street names in this file use abbreviations, for example 'Dr' instead of 'Drive'. This may cause problems in later analysis, so I need to find the abbreviations and fix them.
* Phone numbers in this .osm file are not consistent. In a small sample of the file, I found at least four formats: "XXX-XXX-XXXX", "(XXX) XXX-XXXX", "+1-XXX-XXX-XXXX", and "(XXX)XXX-XXXX". There might be other formats as well.

In [48]:
%run audit-street-types.py

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ['Street', 'Avenue', 'Boulevard', 'Drive', 'Court', 'Place',
            'Square', 'Lane', 'Road', 'Trail', 'Parkway', 'Commons']

v_attrib_types(sample_file, "addr:street", street_type_re, expected)

defaultdict(set,
            {'14': {'U.S. 14'},
             'Ave': {'Alabama Ave', 'New York Ave'},
             'Ave.': {'Ogden Ave.'},
             'B': {'South Avenue B'},
             'Broadway': {'North Broadway'},
             'C': {'South Avenue C'},
             'C405': {'S Williams St #C405'},
             'Circle': {'Woodland Park Circle'},
             'Ct': {'Boulder Ct', 'Timber Ct', 'Vail Ct'},
             'Dr': {'Breckenridge Dr',
              'Greenbriar Dr',
              'Gregory M Sears Dr',
              'John M Boor Dr',
              'Summit Dr'},
             'E': {'South Avenue E'},
             'F': {'South Avenue F'},
             'G': {'South Avenue G'},
             'H': {'South Avenue H'},
             'Highway': {'Lincoln Highway', 'North Northwest Highway'},
             'J': {'South Avenue J'},
             'L': {'South Avenue L'},
             'Ln': {'Leadville Ln'},
             'M': {'South Avenue M'},
             'N': {'900 N', 'South Avenue N'}
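
`v_attrib_types` is defined in audit-street-types.py and is not shown in this notebook. The grouping behind the output above could look roughly like the sketch below, which works on a plain list of street names rather than on the .osm file. The function name `audit_street_names` is hypothetical; the regex and the `expected` list are the ones used in the cell above.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ['Street', 'Avenue', 'Boulevard', 'Drive', 'Court', 'Place',
            'Square', 'Lane', 'Road', 'Trail', 'Parkway', 'Commons']


def audit_street_names(names, expected):
    """Group street names by their last word when that word is not expected."""
    unexpected = defaultdict(set)
    for name in names:
        match = street_type_re.search(name)
        if match and match.group() not in expected:
            unexpected[match.group()].add(name)
    return unexpected
```

Because the regex simply takes the last run of non-space characters, it also picks up endings that are not street types at all, such as the unit number in 'S Williams St #C405'.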

In [51]:
# Inside a character class '|' is a literal pipe, so [- ] is enough to allow '-' or ' '
phone_type_re = re.compile(r'(\+1-)?\(?\d\d\d\)?[- ]?\d\d\d[- ]?\d\d\d\d')
expected = re.compile(r'^\d\d\d-\d\d\d-\d\d\d\d$')

v_attrib_types(sample_file, "phone", phone_type_re, expected)

defaultdict(set,
            {'(312) 369-7900': {'(312) 369-7900'},
             '(708) 749-0895': {'(708) 749-0895'},
             '(847)434-0300': {'(847)434-0300'},
             '(847)806-1230': {'(847)806-1230'},
             '+1-708-715-7746': {'+1-708-715-7746'}})

## Fix the problems

We have audited the Chicago .osm file, and it is time to clean it.

Based on the previous analysis, there are no problematic characters in the 'k' attributes, so I will not update them.

The street types are inconsistent and include many abbreviations. I will create a mapping to update them.

In [51]:
mapping = {"St": "Street",
           "St.": "Street",
           "Ave": 'Avenue',
           'Rd.': 'Road',
           'Dr': 'Drive',
           'E': 'East',
           'Highway': 'Highway'
            }

In [52]:
def update_name(name, mapping):
    '''
    This function will update the name based on the given mapping

    Parameters:
    ---
    name: the unexpected street name found in the file
    mapping: the mapping for updating the name

    Return:
    the updated name
    '''
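    # Only the last word (the street type) is checked against the mapping
    words = name.split(' ')
    if words[-1] in mapping:
        # Replace just the last word; str.replace would also rewrite matching
        # substrings earlier in the name (e.g. the 'St' in 'Stone St')
        words[-1] = mapping[words[-1]]
        name = ' '.join(words)

    return name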

In [63]:
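def update_file(filename):
    '''
    This function will bring audit() and update_name() functions together to
    update the street names to make them consistent

    Parameters
    ---
    filename: the .xml or .osm file that needs to be updated

    Return
    ---
    a dict mapping each unexpected street name to its updated name
    '''
    # audit() is not defined in this section; it is assumed to be provided
    # by audit-street-types.py, which was run earlier
    street_types = audit(filename)
    updated = {}
    for street_type, ways in street_types.items():
        for name in ways:
            updated[name] = update_name(name, mapping)
    return updated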

In [64]:
update_file(sample_file)
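
The phone numbers were audited earlier, but no fix for them is shown in this section. One possible approach, sketched below, is to strip everything except digits and rebuild the number in the expected XXX-XXX-XXXX form. The function name `normalize_phone` is hypothetical, and the sketch assumes every valid number is a 10-digit US number with an optional leading country code 1; anything else is returned unchanged so it can be reviewed by hand.

```python
import re


def normalize_phone(raw):
    """Return raw as XXX-XXX-XXXX when it holds a 10-digit US number."""
    digits = re.sub(r'\D', '', raw)  # keep digits only
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]  # drop the +1 country code
    if len(digits) != 10:
        return raw  # unexpected length: leave the value untouched
    return '{}-{}-{}'.format(digits[:3], digits[3:6], digits[6:])
```

Applied to the formats found in the audit, `normalize_phone('(847)434-0300')` gives `'847-434-0300'` and `normalize_phone('+1-708-715-7746')` gives `'708-715-7746'`.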

In [66]:
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']


def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements
    if element.tag == 'node':
        for item in NODE_FIELDS:
            node_attribs[item] = element.get(item)
        for child in element:
            # Check the child is a <tag> before reading its 'k' attribute
            if (child.tag == 'tag'):
                tag_dict = {}
                colon = child.get('k').find(':')
                tag_dict['id'] = element.get('id')
                tag_dict['value'] = child.get('v')
                if (colon != -1):
                    type_value = child.get('k')[:colon]
                    key_value = child.get('k')[colon+1:]
                    tag_dict['type'] = type_value
                    tag_dict['key'] = key_value
                else:
                    tag_dict['key'] = child.get('k')
                    tag_dict['type'] = 'regular'
                tags.append(tag_dict)
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        for item in WAY_FIELDS:
            way_attribs[item] = element.get(item)
            
        n = 0
        for child in element:
            if child.tag == 'nd':
                nd_dict = {}
                nd_dict['id'] = element.get('id')
                nd_dict['node_id'] = child.get('ref')
                nd_dict['position'] = n
                n += 1
                way_nodes.append(nd_dict)
            
            if (child.tag == 'tag'):
                way_tag_dict = {}
                colon = child.get('k').find(':')
                way_tag_dict['id'] = element.get('id')
                way_tag_dict['value'] = child.get('v')
                if (colon != -1):
                    type_value = child.get('k')[:colon]
                    key_value = child.get('k')[colon+1:]
                    way_tag_dict['type'] = type_value
                    way_tag_dict['key'] = key_value
                else:
                    way_tag_dict['key'] = child.get('k')
                    way_tag_dict['type'] = 'regular'
                tags.append(way_tag_dict)
                
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
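
The node and way branches above split the `k` value in the same way: everything before the first colon becomes the tag type, and the rest becomes the key. That rule can be isolated into a small helper, sketched here with `str.partition`. The name `split_key` is hypothetical and is not part of the original code.

```python
def split_key(k, default_type='regular'):
    """Split a k value into (type, key) at the first colon, if there is one."""
    tag_type, sep, key = k.partition(':')
    if not sep:
        # No colon: the whole value is the key and the type is the default
        return default_type, k
    return tag_type, key
```

With this rule, `split_key('addr:street')` returns `('addr', 'street')`, `split_key('amenity')` returns `('regular', 'amenity')`, and a key with several colons such as `'addr:street:name'` keeps the remainder intact as `('addr', 'street:name')`.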

In [67]:
OSM_PATH = "example.osm"

NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

In [68]:
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])
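
`UnicodeDictWriter`, `get_element`, and `validate_element` are used above but are not defined in this section. As a hedged aside, if the notebook were run under Python 3, the standard `csv.DictWriter` already handles Unicode text, so a custom writer class would not be needed. The sketch below shows the header-then-rows pattern used by `process_map` on an in-memory buffer; the helper name `write_rows` and the sample row are made up for illustration.

```python
import csv
import io


def write_rows(rows, fieldnames, out):
    """Write a header line followed by one CSV line per dict in rows."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Illustrative row with a non-ASCII value; not taken from the dataset
buffer = io.StringIO()
write_rows([{'id': '1', 'key': 'name', 'value': 'Café', 'type': 'regular'}],
           ['id', 'key', 'value', 'type'], buffer)
```

When writing to a real file instead of a buffer, opening it with `open(path, 'w', newline='', encoding='utf-8')` avoids blank lines between rows on Windows.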