# OpenStreetMap Project - Chicago

This project will use the map of a beautiful city, Chicago, IL, United States. I have lived here since graduating from college. I am very interested to see what the map database reveals. After unziping, the total database is a little more than 2GB.

I will analyze this dataset by doing the following:

* Extract a sample from the database.
* Find the problems encountered in this dataset. 
* Clean up the data and import them to SQL.
* Explore the data by querying in SQLite.
* Additional ideas I have after exploring the dataset.

Reference:

* The summary of Chicago area can be found at [OpenStreetMap website](https://www.openstreetmap.org/relation/122604). 
* This data can be downloaded at [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/metro/chicago_illinois/). 
* [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page) shows the detail explanation of OpenStreetMap database.

## Import Libraries

In [17]:
import csv
import codecs
import pprint
import re
import xml.etree.cElementTree as ET
import lxml
import cerberus
from collections import defaultdict

## Extract a sample

As mentioned before, this database is quite large, more than 2GB. Directly opening it or parsing it will crash the computer. Therefore, it is a good idea to extract a sample from this dataset. 

I will use the extract-sample-data.py file to extract 1% of the original data. This only needs to run once. The final sample file is around 20MB.

In [2]:
%%timeit
%run extract-sample-data.py

1 loop, best of 3: 3min 30s per loop


After getting the sample from the database, it is a good idea to see the big picture of this sample to see if we have had enough data within the sample. Therefore, I want to write a function to check what tags are in the sample dataset, and how many of them.

In [9]:
sample_file = 'sample_chicago.osm'

In [10]:
def count_tag(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        tag = elem.tag
        if tag not in tags:
            tags[tag] = 1
        else:
            tags[tag] += 1
    return tags

In [26]:
count_tag(sample_file)

{'member': 349,
 'nd': 103077,
 'node': 87172,
 'osm': 1,
 'relation': 48,
 'tag': 67876,
 'way': 12337}

It seems to be that we have a good amount of data within the sample. 

## Problem in this dataset

After getting the sample data, we can look through the dataset, find the problems and clean it up.

Through reading the documente and look through the sample data in a text editor, `<tag>` is used to save all the values. 

Here are some problems I noticed the following potential problems through reading the sample data:

* The `<tag>`'s k attribute value is not consistent. Some only have lower case like "ele". Some have both lower case and colon, like "gnis: id". Others have special characters like.
* The street name is not consistent. Some uses the whole spell, like "street" and "avenue", while others use abbreviation, like "Ave".
* The phone number format is not consistent. Some have (XXX) XXX-XXXX while others have XXX-XXX-XXXX.

### k attribute issue

The k-attribute has three main patterns:

* The k-attribute values only contain lowercase letter, i.e. "building".
* The k-attribute values contains both lowercase letter and colon, i.e. "addr: city".
* The problematic pattern will contains special characters like "&".'
* The rest will be "others".

The first two patterns and "others" are good. They will not influence future analysis. However, the third one needs some clean-up. I will run the k_attrib_type.py file to find the patterns within my sample file.

In [23]:
%run "k-attribute-issues.py"

k_attrib_type(sample_file, keys)

{'lower': 20677, 'lower_colon': 31308, 'other': 15891, 'problemchars': 0}

Based on this result, there is no problematic characters within k attributes. Therefore, we do not need to clean k attribute for future analysis.

### v attribute issue

The v-attribute contains the value for k-attribute. There are two v attributes that I found have some potential problems after looking through the sample file in a text editor.

* Many of the street name in this file use abbreviation. For example, it uses 'Dr' instead of 'Drive'. It may causes problems in later analysis. Therefore, I need to find abbreviation and fix them.
* The phone number in this .osm file is not consistent. After looking through a small sample of this file, I found at least four kinds of format. Some phone numbers look like "XXX-XXX-XXXX", some look like "(XXX) XXX-XXXX", some look like "+1-XXX-XXX-XXXX" while others look like "(XXX)XXX-XXXX". There might be other formats as well. 

In [178]:
%run update-v-value.py

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ['Street', 'Avenue', 'Boulevard', 'Drive', 'Court', 'Place',
            'Square', 'Lane', 'Road', 'Trail', 'Parkway', 'Commons', 'Broadway', 'Circle'
           'Park', 'Path', 'Terrace', 'West', 'Highway']

audit(sample_file, "addr:street", street_type_re, expected)

defaultdict(set,
            {'14': {'U.S. 14'},
             'Ave': {'Alabama Ave', 'New York Ave'},
             'Ave.': {'Ogden Ave.'},
             'B': {'South Avenue B'},
             'C': {'South Avenue C'},
             'C405': {'S Williams St #C405'},
             'Circle': {'Woodland Park Circle'},
             'Ct': {'Boulder Ct', 'Timber Ct', 'Vail Ct'},
             'Dr': {'Breckenridge Dr',
              'Greenbriar Dr',
              'Gregory M Sears Dr',
              'John M Boor Dr',
              'Summit Dr'},
             'E': {'South Avenue E'},
             'F': {'South Avenue F'},
             'G': {'South Avenue G'},
             'H': {'South Avenue H'},
             'J': {'South Avenue J'},
             'L': {'South Avenue L'},
             'Ln': {'Leadville Ln'},
             'M': {'South Avenue M'},
             'N': {'900 N', 'South Avenue N'},
             'O': {'South Avenue O'},
             'Park': {'West Midway Park'},
             'St': {'Kathleen St',

In [181]:
%run update-v-value.py

phone_type_re = re.compile(r'(\+1-)?\(?\d\d\d\)?[-| ]?\d\d\d[-| ]?\d\d\d\d')
expected = re.compile(r'^\d\d\d-\d\d\d-\d\d\d\d$')

audit(sample_file, "phone", phone_type_re, expected)

defaultdict(set,
            {'(312) 369-7900': {'(312) 369-7900'},
             '(708) 749-0895': {'(708) 749-0895'},
             '(847)434-0300': {'(847)434-0300'},
             '(847)806-1230': {'(847)806-1230'},
             '+1-708-715-7746': {'+1-708-715-7746'},
             '219-988-2111': {'+1 219-988-2111'}})

## Generate the csv files

We have audited the Chicago osm file, and it is time to clean it and generate the csv files we need.

Based on the previous analysis, there is no problematic characters within 'k' attributes. I will not update this part.

The street type is inconsistent. There are many abbreviations inside. I will creating a mapping to update these parts.

There are only six phone numbers that have format issues. It is too little compared to the whole datasets. Therefore, I will not clean them.

In [210]:
%run audit-v-value.py

process_map(sample_file)