# OpenStreetMap Project - Chicago

This project uses the OpenStreetMap data for Chicago, IL, United States, a beautiful city where I have lived since graduating from college. I am curious to see what the map database reveals. After unzipping, the dataset is a little more than 2 GB.

I will analyze this dataset by doing the following:

* Extract a sample from the database.
* Find the problems in this dataset.
* Clean up the data and import it into SQL.
* Explore the data by querying it in SQLite.
* Suggest additional ideas after exploring the dataset.

Reference:

* A summary of the Chicago area can be found on the [OpenStreetMap website](https://www.openstreetmap.org/relation/122604).
* The data can be downloaded from [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/metro/chicago_illinois/).
* The [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page) gives a detailed explanation of the OpenStreetMap database.

## Extract a sample

As mentioned before, this database is quite large, more than 2 GB. Opening it directly or parsing it in one pass may crash the computer, so it is a good idea to extract a sample first.

As with any analysis, the first step is importing the necessary modules.

In [23]:
import csv
import codecs
import pprint
import re
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import lxml
import cerberus
from collections import defaultdict

I will write a function that finds the elements I want in the original .osm file so that they can be written to a sample .osm file.
After reading through the wiki, I think the most important tags in this dataset are "node", "way", and "relation". Therefore, the function focuses on getting the elements with these three tags.

In [2]:
osm_file = 'chicago_illinois.osm'
sample_file = 'sample_chicago.osm'

tag = ['node', 'way', 'relation']

In [3]:
def get_element(osm_file, tags = ('node', 'way', 'relation')):
    '''
    Stream through an XML file and yield each element whose tag is in `tags`.
    
    Parameters
    ----------
    osm_file: str
        path to the XML or OSM file to be parsed
    
    tags: tuple or list of str
        the tag names to yield elements for.
        default is ('node', 'way', 'relation')
    
    Yields
    ------
    xml.etree.ElementTree.Element
        one matching element at a time
    '''
    
    context = iter(ET.iterparse(osm_file, events = ('start', 'end')))
    _, root = next(context)
    
    for event, elem in context:
        if (event == 'end') and (elem.tag in tags):
            yield elem
            root.clear()

Once the generator yields the elements, the next step is to write them to another file.

`k` is the sampling interval: one element is exported for every `k` elements, so a larger `k` produces a smaller sample. Since the data is large, I use 1000.

In [5]:
k = 1000
    
with open(sample_file, 'wb') as output:
    output.write(bytes('<?xml version="1.0" encoding="UTF-8"?>\n', 'UTF-8'))
    output.write(bytes('<osm>\n  ', 'UTF-8'))

    for i, element in enumerate(get_element(osm_file)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write(bytes('</osm>', 'UTF-8'))
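
The streaming and sampling steps can be tried on a tiny in-memory document before running them on the full extract. The sketch below is illustrative only: `stream_elements`, `sample_every_k`, and the four-element XML string are hypothetical names and data made up for this example, not part of the project code.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature OSM document, used only for illustration.
TINY_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node id="1" lat="41.88" lon="-87.63"/>
  <node id="2" lat="41.89" lon="-87.62"><tag k="name" v="A"/></node>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
  <node id="4" lat="41.90" lon="-87.61"/>
</osm>"""


def stream_elements(source, tags=('node', 'way', 'relation')):
    """Yield matching elements one at a time without keeping the whole tree."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of <osm>
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # drop already-processed children to free memory


def sample_every_k(source, k):
    """Return the ids of every k-th element (positions 0, k, 2k, ...)."""
    return [elem.get('id')
            for i, elem in enumerate(stream_elements(source))
            if i % k == 0]


print(sample_every_k(io.BytesIO(TINY_OSM), 2))  # prints ['1', '3']
```

With `k = 2`, the elements at positions 0 and 2 are kept, which shows why a larger `k` gives a smaller sample.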

After extracting the sample, it is worth checking the big picture to see whether it contains enough data. The function below counts which tags appear in the sample and how many of each there are.

In [13]:
def count_tag(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        tag = elem.tag
        if tag not in tags:
            tags[tag] = 1
        else:
            tags[tag] += 1
    return tags

In [14]:
count_tag(sample_file)

{'member': 69,
 'nd': 10728,
 'node': 8718,
 'osm': 1,
 'relation': 5,
 'tag': 6761,
 'way': 1233}

The sample appears to contain a reasonable amount of data: 8,718 nodes, 1,233 ways, and 5 relations.
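
The same tally can be written more compactly with `collections.Counter`. This is an optional alternative sketch, not the code that produced the counts above; `count_tags_counter` and the inline XML are hypothetical and introduced only for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_counter(source):
    """Count how often each tag name occurs while streaming through the file."""
    # iterparse emits one 'end' event per element by default
    return Counter(elem.tag for _, elem in ET.iterparse(source))


# Hypothetical two-element document, used only for illustration.
demo = io.BytesIO(b'<osm><node id="1"><tag k="a" v="b"/></node><way id="2"/></osm>')
print(dict(count_tags_counter(demo)))
```

A `Counter` behaves like a dictionary, so the result can be compared or printed the same way as the output of `count_tag`.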

## Problems in this dataset

With the sample in hand, we can look through the data, find the problems, and clean them up.

From reading the documentation and looking through the sample data in a text editor, I found that `<tag>` elements store the descriptive key/value pairs.

I noticed the following potential problems while reading the data:

* The `<tag>` element's k attribute is not consistent. Some values contain only lowercase letters, like "ele". Some contain lowercase letters and a colon, like "gnis:id". Others contain special characters.
* The street names are not consistent. Some spell out the street type, like "Street" and "Avenue", while others use abbreviations, like "St" and "St.".

### k attribute issue

I will use regular expressions to find the patterns mentioned above, then define a function that counts how many keys in the sample file match each pattern.

In [29]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}

In [30]:
def key_type(filename, keys):
    
    for event, element in ET.iterparse(filename):
        if element.tag == 'tag':
            key = element.get('k')
            if lower.search(key):
                keys['lower'] += 1
            elif re.findall(lower_colon, key):
                keys['lower_colon'] += 1
            elif re.findall(problemchars, key):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
        
    return keys

In [31]:
key_type(sample_file, keys)

{'lower': 2016, 'lower_colon': 3135, 'other': 1610, 'problemchars': 0}
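
To see why a key falls into each bucket, it helps to run the three patterns against a few hand-picked keys. The keys below are illustrative examples chosen for this sketch, not values taken from the sample file, and `classify_key` is a hypothetical helper that mirrors the branching in `key_type`.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the bucket for one key, checking patterns in the same order as key_type."""
    if lower.search(key):
        return 'lower'
    if lower_colon.search(key):
        return 'lower_colon'
    if problemchars.search(key):
        return 'problemchars'
    return 'other'


# Hypothetical keys, used only to exercise each branch.
for key in ['ele', 'addr:street', 'name_1', 'addr:street:name', 'Street Name']:
    print(key, '->', classify_key(key))
```

Keys with digits, uppercase letters, or more than one colon match neither of the first two patterns, which is one plausible explanation for the large `other` count.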

### Street name issue

As with the k attribute, I will use a regular expression to extract the street type. I will build a list of expected values and collect the street types that are not in that list.

In [62]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

In [65]:
def audit_street_type(street_types, street_name):
    '''Record street_name under its street type if that type is not expected.'''
    match = street_type_re.search(street_name)
    if match:
        street_type = match.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
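
To see what `street_type_re` captures, the pattern can be run against a few names. The street names below are made up for illustration and are not taken from the dataset.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical street names, used only to show what the pattern captures.
for name in ['North Clark St.', 'West Fullerton Ave', 'South State Street']:
    match = street_type_re.search(name)
    print(name, '->', match.group() if match else None)
```

The pattern returns the last whitespace-delimited word of the name, including a trailing period, so "St" and "St." are reported as two different street types.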

In [73]:
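def audit(filename):
    '''Collect street names whose street type is not in the expected list.'''
    street_types = defaultdict(set)
    
    # Use the default "end" events so each element's child <tag>s are fully parsed
    for event, elem in ET.iterparse(filename):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                # Read attributes via .attrib; indexing an Element with a string,
                # as in tag['k'], raises "TypeError: element indices must be integers"
                if tag.attrib['k'] == 'addr:street':
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types

In [74]:
audit(sample_file)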

In [None]:
mapping = { "St": "Street",
            "St.": "Street",
            "Ave": 'Avenue',
            'Rd.': 'Road'
            }

In [None]:
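def update_name(name, mapping):
    '''Replace an abbreviated street type at the end of name with its full form.'''
    parts = name.rsplit(' ', 1)
    street_type = parts[-1]
    if street_type in mapping:
        # Replace only the final word; str.replace would also rewrite
        # earlier occurrences, e.g. the first "St" in "St Clair St"
        parts[-1] = mapping[street_type]

    return ' '.join(parts)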
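
An alternative way to apply the mapping is to let the street-type pattern do the substitution, so that only the final word is ever touched. The sketch below assumes the same pattern and `mapping` as above; `expand_street_type` is a hypothetical name, and the sample street names are made up.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
mapping = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Rd.": "Road"}


def expand_street_type(name, mapping):
    """Expand an abbreviated street type at the end of name, if it is in mapping."""
    # The replacement function receives the matched last word and looks it up;
    # street types that are not in the mapping are returned unchanged.
    return street_type_re.sub(lambda m: mapping.get(m.group(), m.group()), name)


print(expand_street_type('St Clair St', mapping))            # St Clair Street
print(expand_street_type('North Clark St.', mapping))        # North Clark Street
print(expand_street_type('West Fullerton Avenue', mapping))  # West Fullerton Avenue
```

Because the pattern is anchored at the end of the string, the leading "St" in "St Clair St" is left alone.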