In [1]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint
import time
import re
import sqlite3
import pandas as pd
import schema
import data
import cerberus

In [3]:
tree = ET.parse('torrance.xml')
sqlite_file = 'torrance.db'  # SQLite database file (name assumed), kept separate from the XML source
root = tree.getroot()
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# OpenStreetMap Project - Torrance, CA

At first, I used a map of Los Angeles, CA, which is where I grew up. Unfortunately, the file was so large that it was tough to sort through the data manually whenever I wanted to check something myself. As a result, I cut the data down by extracting only the neighborhood where I spent most of my childhood: Torrance, CA.

[https://mapzen.com/data/metro-extracts/your-extracts/eb4ca3779a47](https://mapzen.com/data/metro-extracts/your-extracts/eb4ca3779a47 "My hometown!")

In [4]:
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

In [5]:
def count_tags(filename):
    tagsDict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag in tagsDict.keys():
            tagsDict[elem.tag]+=1
        else:
            tagsDict[elem.tag]=1
    return tagsDict

def test():

    tags = count_tags('torrance.xml')
    pprint.pprint(tags)
    

if __name__ == "__main__":
    test()

{'bounds': 1,
 'member': 5950,
 'nd': 1701210,
 'node': 1540164,
 'osm': 1,
 'relation': 1652,
 'tag': 997540,
 'way': 150132}


In [6]:
import csv
from collections import defaultdict

___
To begin the audit of the data file, let's gather some general information first. I wanted to see how often each tag appears in the dataset, which gives a rough picture of the area before digging in. For example, there are 150,132 *ways*, which represent linear features and outlines on the map, such as streets and building footprints. The 1,540,164 *nodes* are individual points in space. For a map of just one neighborhood, that is a massive amount of information!
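As a rough sketch of the same count, `collections.Counter` can replace the manual dictionary bookkeeping, and clearing each element after counting it helps keep memory use down on a large file. The function name `count_tags_streaming` and the tiny in-memory sample are made up for illustration; they are not part of the project files.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_streaming(source):
    """Count how often each element tag appears in an OSM XML source."""
    counts = Counter()
    # "end" events fire once an element and all of its children are parsed
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and already-counted children
    return counts

# Hypothetical miniature document standing in for torrance.xml
sample = io.BytesIO(b"""<osm>
  <node id="1"/>
  <node id="2"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>""")

tag_counts = count_tags_streaming(sample)
print(tag_counts.most_common())
```

Passing `'torrance.xml'` instead of the sample should produce the same kind of tally as the dictionary version above.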

In [7]:
comp = re.compile(r'\b\S+\.?$', re.IGNORECASE)
default = defaultdict(set)
expected = []

In [8]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
street_types = defaultdict(set)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Way", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit():
    # "end" events ensure each way's child tags are fully parsed before we inspect them
    for event, elem in ET.iterparse('torrance.xml', events=("end",)):
        if elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    pprint.pprint(dict(street_types))
    
if __name__ == "__main__":
    audit()

{'Arno': {'Plaza del Arno'},
 'Ave': {'Normandie Ave', 'Pier Ave', 'Inglewood Ave'},
 'Ave.': {'Ocean Ave.', 'South Western Ave.'},
 'Blvd': {'Crenshaw Blvd',
          'Hawthorne Blvd',
          'Torrance Blvd',
          'West Redondo Beach Blvd'},
 'Blvd.': {'Palos Verdes Blvd.', 'West Redondo Beach Blvd.'},
 'Ctr.': {'Peninsula Ctr.'},
 'East': {'Palos Verdes Drive East'},
 'Highway': {'E Pacific Coast Highway',
             'Pacific Coast Highway',
             'West Pacific Coast Highway'},
 'Monte': {'Via del Monte'},
 'Ness': {'Van Ness'},
 'St': {'Carson St'},
 'Torrance': {'Pacific Coast Highway Torrance'},
 'street': {'W. 190th street'}}


___
One of the first audits was to check for irregularities in how the streets are named. Depending on where you live, you are probably used to a typical set of street types:

- Street
- Avenue
- Boulevard
- Drive
- Court
- Place
- Way
- Square
- Lane
- Road
- Trail
- Parkway
- Commons

I wanted to check the `addr:street` value of every tag under a **way** element against the list above, and see only the street names that did not match. The results show a few discrepancies. Some abbreviations had a period at the end, while others didn't. Unusual street endings, whether a direction ("East") or a name with no street type at all ("Van Ness"), also made the list. When I ran the same audit on my smaller *northtorrance.xml* file before moving on to *torrance.xml*, it found no errors. Interesting to see how a smaller neighborhood can have no errors, while the bigger city still does.
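One possible way to clean up the abbreviations found above is a small lookup table applied to the last word of each street name. This is only a sketch: the `mapping` dictionary and the `update_name` helper are assumptions built from the audit output, and expanding "Ctr." to "Center" in particular is a guess about what the mapper intended.

```python
import re

# Same pattern as the audit: the last whitespace-delimited token of the name
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Assumed replacements, based on the abbreviations the audit surfaced
mapping = {
    "Ave": "Avenue",
    "Ave.": "Avenue",
    "Blvd": "Boulevard",
    "Blvd.": "Boulevard",
    "Ctr.": "Center",
    "St": "Street",
    "street": "Street",
}

def update_name(name, mapping):
    """Return the street name with a known abbreviation expanded."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name  # leave names such as 'Van Ness' untouched

for raw in ["Ocean Ave.", "Crenshaw Blvd", "W. 190th street", "Van Ness"]:
    print(raw, "=>", update_name(raw, mapping))
```

Names ending in a direction or lacking a street type, such as "Palos Verdes Drive East", pass through unchanged, since they are not really errors.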

In [12]:
zipcode_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
zipcode = defaultdict(set)

expecteds = ["90248", "90278", "90501", "90502", "90503", "90504", "90505", "90506", "90507", "90508", "90509", 
            "90510", "90717", "90277"]

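def audit_zip(zipcode, zip_value):
    m = zipcode_re.search(zip_value)
    if m:
        zcode = m.group()
        # Compare against the zip code list, not the street-type list
        if zcode not in expecteds:
            zipcode[zcode].add(zip_value)

def is_zip(elem):
    # Matches the bare "postcode" key only; OSM addresses usually use "addr:postcode"
    return (elem.attrib['k'] == "postcode")

def zc():
    # "end" events ensure each way's child tags are fully parsed before we inspect them
    for event, elem in ET.iterparse('torrance.xml', events=("end",)):
        if elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_zip(tag):
                    audit_zip(zipcode, tag.attrib['v'])
    pprint.pprint(dict(zipcode))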
    
if __name__ == "__main__":
    zc()

{}


The street name audit alone did not make me confident enough to call the dataset clean, so I ran the same kind of check on zip codes to see whether every zip code reported in the dataset is correct. The audit on way tags came back empty. The only discrepancy that I found was the zip code for a neighboring city: Lawndale. By searching for that value in the dataset, I noticed that it was stored against the keys *zip_left* and *zip_right*, which led me to believe that the information describes regions close to the border of the Torrance neighborhood.
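To catch values stored under *zip_left* and *zip_right* as well, the check could look at every tag whose key ends in one of a few suffixes. The sketch below assumes those suffixes, a `tiger:` key prefix, and a made-up three-tag sample; the out-of-area value in the sample is illustrative only, and `audit_zip_keys` is a hypothetical helper rather than part of the project code.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

expected_zips = {"90248", "90278", "90501", "90502", "90503", "90504", "90505",
                 "90506", "90507", "90508", "90509", "90510", "90717", "90277"}

# Assumption: postal codes may sit under any key ending in one of these
ZIP_KEY_SUFFIXES = ("postcode", "zip_left", "zip_right")

def audit_zip_keys(source, expected):
    """Map each unexpected zip value to the set of keys it appeared under."""
    unexpected = defaultdict(set)
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tag":
            key = elem.attrib.get("k", "")
            value = elem.attrib.get("v", "")
            if key.endswith(ZIP_KEY_SUFFIXES) and value not in expected:
                unexpected[value].add(key)
    return dict(unexpected)

# Hypothetical sample; the real input would be 'torrance.xml'
sample = io.BytesIO(b"""<osm>
  <way id="1">
    <tag k="addr:postcode" v="90503"/>
    <tag k="tiger:zip_left" v="90260"/>
    <tag k="tiger:zip_right" v="90504"/>
  </way>
</osm>""")

unexpected_zips = audit_zip_keys(sample, expected_zips)
print(unexpected_zips)
```

Recording the key alongside each value makes it easier to tell an address error from a boundary road that legitimately touches a neighboring zip code.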

## Some Additional Info
***

In [13]:
religion_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
religious = defaultdict(set)

expect = []

def audit_rel(religious, rel):
    m = religion_re.search(rel)
    if m:
        rcode = m.group()
        # Use the (empty) religion list so every value is reported
        if rcode not in expect:
            religious[rcode].add(rel)

def is_rel(elem):
    return (elem.attrib['k'] == "religion")

def rc():
    # "end" events ensure each way's child tags are fully parsed before we inspect them
    for event, elem in ET.iterparse('torrance.xml', events=("end",)):
        if elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_rel(tag):
                    audit_rel(religious, tag.attrib['v'])
    pprint.pprint(dict(religious))
    
if __name__ == "__main__":
    rc()

{'christian': {'christian'}}


I wanted to use the same structure to investigate some additional info about my hometown. Here, we can see that every place of worship mapped as a way with a *religion* tag around Torrance is Christian!
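Since the street, zip code, and religion audits share one structure, a single helper that takes the key as a parameter might cover all of them and report how often each value occurs. This is a sketch under assumptions: `count_tag_values` and its `parents` parameter are hypothetical names, and the sample data below is invented, not taken from the Torrance extract.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tag_values(source, key, parents=("node", "way")):
    """Count the values of one tag key across the given parent element types."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in parents:
            for tag in elem.iter("tag"):
                if tag.attrib.get("k") == key:
                    counts[tag.attrib.get("v")] += 1
            elem.clear()  # release the element once its tags are counted
    return counts

# Invented sample; the real input would be 'torrance.xml'
sample = io.BytesIO(b"""<osm>
  <node id="1"><tag k="religion" v="christian"/></node>
  <node id="2"><tag k="religion" v="buddhist"/></node>
  <way id="3"><tag k="religion" v="christian"/></way>
</osm>""")

religion_counts = count_tag_values(sample, "religion")
print(religion_counts.most_common())
```

Including nodes as well as ways matters here, because many small points of interest are mapped as single nodes and would be missed by a way-only audit.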

In [14]:
def audit_buildings(default, x):
    m = comp.search(x)
    if m:
        amenity = m.group()
        # expected still holds the street types, so amenity values pass through
        if amenity not in expected:
            default[amenity].add(x)

def is_place(elem):
    return (elem.attrib['k'] == "amenity")

def audit_b():
    # "end" events ensure each way's child tags are fully parsed before we inspect them
    for event, elem in ET.iterparse('torrance.xml', events=("end",)):
        if elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_place(tag):
                    audit_buildings(default, tag.attrib['v'])
    pprint.pprint(dict(default))
    
if __name__ == "__main__":
    audit_b()

{'atm': {'atm'},
 'bank': {'bank'},
 'bar': {'bar'},
 'cafe': {'cafe'},
 'car_rental': {'car_rental'},
 'childcare': {'childcare'},
 'cinema': {'cinema'},
 'clinic': {'clinic'},
 'college': {'college'},
 'community_centre': {'community_centre'},
 'doctors': {'doctors'},
 'fast_food': {'fast_food'},
 'fire_station': {'fire_station'},
 'fuel': {'fuel'},
 'grave_yard': {'grave_yard'},
 'hospital': {'hospital'},
 'kindergarten': {'kindergarten'},
 'language_school': {'language_school'},
 'library': {'library'},
 'marketplace': {'marketplace'},
 'parking': {'parking'},
 'parking_space': {'parking_space'},
 'pharmacy': {'pharmacy'},
 'place_of_worship': {'place_of_worship'},
 'police': {'police'},
 'post_office': {'post_office'},
 'restaurant': {'restaurant'},
 'school': {'school'},
 'shelter': {'shelter'},
 'social_centre': {'social_centre'},
 'social_facility': {'social_facility'},
 'spa': {'spa'},
 'swimming_pool': {'swimming_pool'},
 'theatre': {'theatre'},
 'toilets': {'toilets'},
 'vet

It's pretty funny to see that they marked toilets on this map. Also, there's a whirlpool?!

# Conclusion

After some analysis of the Torrance data, there is some room for improvement. The street names are affected by user input: some users left a period after abbreviating, others didn't. Postal codes, on the other hand, seem to be accurate, which suggests that users entered them carefully and that this part of the data can be used as is.

### Additional Thoughts

Extracting the map data from the website was rather difficult. The instructions provided seemed a bit outdated, and the process required a couple more steps than they described. It made more sense after I ordered the data, but I could see how it could be overwhelming to someone using the website for the first time.

With that being said, I would like to suggest a tutorial, or tour, of the site and how to pull the data.

The benefit could be more web traffic. Saving new users keystrokes can improve their experience and increase retention. With more users, the data can be updated more frequently. The site could also choose to connect users with one another.

The downside of the tutorial would fall on those already familiar with how the data extraction works. For them, it would add an unnecessary barrier to entry, which can drive users away. I would also anticipate some users feeling that the tutorial is of no help, which would affect all visitors to the site.

In [17]:
from hurry.filesize import size
import os

dirpath = '/Users/kangsankim/Desktop/Projects/UdacityDAND/DANDP4'

files_list = []
for path, dirs, files in os.walk(dirpath):
    files_list.extend([(filename, size(os.path.getsize(os.path.join(path, filename)))) for filename in files])

# Use a distinct loop name so the imported size() function is not overwritten
for filename, filesize in files_list:
    print('{:.<40s}: {:5s}'.format(filename, filesize))

.DS_Store...............................: 6K   
DAND P4 - OpenStreetMap Project.ipynb...: 16K  
data.py.................................: 10K  
nodes.csv...............................: 0B   
nodes_tags.csv..........................: 0B   
northtorrance.xml.......................: 41M  
schema.py...............................: 2K   
torrance.xml............................: 337M 
ways.csv................................: 0B   
ways_nodes.csv..........................: 0B   
ways_tags.csv...........................: 0B   
DAND P4 - OpenStreetMap Project-checkpoint.ipynb: 14K  
DAND P4 - OpenStreetMap-checkpoint.ipynb: 1K   
OpenStreetMap DAND P3-checkpoint.ipynb..: 72B  
osm.py-checkpoint.ipynb.................: 72B  
data.cpython-35.pyc.....................: 10K  
schema.cpython-35.pyc...................: 1K   
