# Cleaning and Wrangling: Seattle Open Street Map

We now analyze the Open Street Map (OSM) data for the City of Seattle.

## Explore a Subset of Data

Because the full dataset is large, we need a systematic way to slice it into a workable sample to explore. I used the code below, which keeps every **k**-th top-level element, so the size of the resulting *SAMPLE_FILE* depends on **k**: a larger **k** gives a smaller file. When starting out, try a larger **k**, then move on to an intermediate **k** before processing your whole dataset.

In [24]:
# cElementTree was deprecated in Python 3.3 and removed in 3.9;
# ElementTree uses the C accelerator automatically. Try lxml if too slow.
import xml.etree.ElementTree as ET
import pprint
import pickle
from collections import defaultdict
import re

In [29]:
OSM_FILE = "seattle_washington.osm"  # Replace this with your osm file
SAMPLE_FILE = "test.osm"

k = 5000 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n'.encode('utf-8'))
    output.write('<osm>\n  '.encode('utf-8'))

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>'.encode('utf-8'))

In [None]:
output.close()  # Redundant but harmless: the with block above already closed the file

The code above leaves us with a file, *test.osm*, that we can use to explore the dataset.
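
To make the effect of **k** concrete, here is a minimal sketch that applies the same every-k-th rule to a tiny in-memory document. The XML snippet and the helper name `sample_ids` are made up for illustration and are not part of the Seattle data or the code above.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for an OSM extract (hypothetical data, not from Seattle).
TOY_OSM = b"""<osm>
  <node id="1"/><node id="2"/><node id="3"/>
  <way id="4"><nd ref="1"/></way>
  <node id="5"/><relation id="6"/>
</osm>"""


def sample_ids(xml_bytes, k, tags=('node', 'way', 'relation')):
    """Return the ids of every k-th top-level element, in document order."""
    kept = []
    # iterparse emits 'end' events by default; keep only top-level element types.
    top_level = (elem for _, elem in ET.iterparse(io.BytesIO(xml_bytes))
                 if elem.tag in tags)
    for i, elem in enumerate(top_level):
        if i % k == 0:
            kept.append(elem.get('id'))
    return kept


ids_k2 = sample_ids(TOY_OSM, 2)  # every 2nd element: ids 1, 3, 5
ids_k3 = sample_ids(TOY_OSM, 3)  # every 3rd element: ids 1, 4
print(ids_k2, ids_k3)
```

With six top-level elements, `k = 2` keeps three of them and `k = 3` keeps two, which is why a larger **k** produces a smaller sample file.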

### Develop a Dictionary for All Tags In the Original Dataset

Our goal here is to build a Python dictionary that counts every tag in the original dataset, so that we know what needs to be wrangled. The following achieves this.

In [2]:
def count_tags(filename):
    """Return a dict mapping each element tag to its number of occurrences."""
    tags = {}

    for event, elem in ET.iterparse(filename):
        if elem.tag not in tags:
            tags[elem.tag] = 1
        else:
            tags[elem.tag] += 1

    return tags

In [4]:
tags = count_tags('seattle_washington.osm')

In [5]:
import pickle 

with open('tags.pickle', 'wb') as tagsPickle:
    pickle.dump(tags, tagsPickle, protocol=pickle.HIGHEST_PROTOCOL)

In [6]:
with open('tags.pickle', 'rb') as tagsPickle:
    unserialized_tags = pickle.load(tagsPickle)


In [7]:
unserialized_tags

{'bounds': 1,
 'member': 88068,
 'nd': 8453162,
 'node': 7580046,
 'osm': 1,
 'relation': 9411,
 'tag': 4708553,
 'way': 750242}
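
As an aside, the same count can be written more compactly with `collections.Counter`. The sketch below is an alternative, not the code that produced the numbers above; the name `count_tags_counter` and the toy input are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_counter(source):
    """Count element tags in an XML file path or file-like object."""
    counts = Counter()
    for _, elem in ET.iterparse(source):  # default: 'end' events only
        counts[elem.tag] += 1
        # Drop this element's attributes and children once counted.
        # The parent still holds the emptied element, so memory use is
        # reduced rather than eliminated.
        elem.clear()
    return dict(counts)


# Hypothetical toy document standing in for the real OSM file.
toy = io.BytesIO(b'<osm><node id="1"><tag k="a" v="b"/></node><node id="2"/></osm>')
toy_counts = count_tags_counter(toy)
print(toy_counts)
```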

### Exploring What Is Contained Within Each Tag Type

To get a better sense of which attributes each type of tag carries, we use the following code to collect every distinct combination of attribute names per tag type.

In [4]:
bounds_subtags = []
member_subtags = []
nd_subtags = []
node_subtags = []
osm_subtags = []
relation_subtags = []
tag_subtags = []
way_subtags = []

for _, element in ET.iterparse('seattle_washington.osm'):
    if element.tag == 'bounds' and element.attrib.keys() not in bounds_subtags:
        bounds_subtags.append(element.attrib.keys())
    elif element.tag == 'member' and element.attrib.keys() not in member_subtags:
        member_subtags.append(element.attrib.keys())
    elif element.tag == 'nd' and element.attrib.keys() not in nd_subtags:
        nd_subtags.append(element.attrib.keys())
    elif element.tag == 'node' and element.attrib.keys() not in node_subtags:
        node_subtags.append(element.attrib.keys())
    elif element.tag == 'osm' and element.attrib.keys() not in osm_subtags:
        osm_subtags.append(element.attrib.keys())
    elif element.tag == 'relation' and element.attrib.keys() not in relation_subtags:
        relation_subtags.append(element.attrib.keys())
    elif element.tag == 'tag' and element.attrib.keys() not in tag_subtags:
        tag_subtags.append(element.attrib.keys())
    elif element.tag == 'way' and element.attrib.keys() not in way_subtags:
        way_subtags.append(element.attrib.keys())
    else:
        pass

In [5]:
bounds_subtags

[dict_keys(['maxlon', 'maxlat', 'minlat', 'minlon'])]

In [6]:
member_subtags

[dict_keys(['type', 'ref', 'role'])]

In [7]:
nd_subtags

[dict_keys(['ref'])]

In [8]:
node_subtags

[dict_keys(['lon', 'version', 'changeset', 'lat', 'timestamp', 'id', 'uid', 'user']),
 dict_keys(['lon', 'version', 'changeset', 'lat', 'timestamp', 'id'])]

In [9]:
osm_subtags

[dict_keys(['generator', 'version', 'timestamp'])]

In [10]:
relation_subtags

[dict_keys(['version', 'changeset', 'timestamp', 'id', 'uid', 'user'])]

In [11]:
way_subtags

[dict_keys(['version', 'changeset', 'timestamp', 'id', 'uid', 'user'])]

In [12]:
tag_subtags

[dict_keys(['v', 'k'])]
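
The eight near-identical branches above could also be folded into a single dictionary keyed by tag name. The following is a sketch of that idea under the assumption that only the attribute names matter, not their order; `attribute_sets` and the toy input are hypothetical and not part of the original analysis.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def attribute_sets(source):
    """Map each tag name to the distinct attribute-name combinations seen."""
    seen = defaultdict(set)
    for _, elem in ET.iterparse(source):
        # frozenset copies the names, so the result does not depend on
        # the element staying alive, and duplicates collapse automatically.
        seen[elem.tag].add(frozenset(elem.attrib))
    return seen


# Hypothetical toy document: one node with a user attribute, one without.
toy = io.BytesIO(
    b'<osm version="0.6">'
    b'<node id="1" lat="47.6" lon="-122.3" user="a"/>'
    b'<node id="2" lat="47.7" lon="-122.4"/>'
    b'</osm>'
)
toy_attrs = attribute_sets(toy)
for tag_name, combos in sorted(toy_attrs.items()):
    print(tag_name, sorted(sorted(c) for c in combos))
```

As with the `node` output above, a tag type that appears with and without some attributes ends up with more than one combination.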

### Audit Plan: Addresses

From a visual inspection of the subset of Seattle OSM, we get the sense that **<tag>** elements contain address information. In particular, tags whose **k** attribute is **addr:street** contain street names that tend to be written inconsistently in the dataset. Therefore, our next goal is to develop a data audit plan that works specifically on tags with addresses.

The following chunks of code achieve this goal.

In [25]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

In [18]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

In [21]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
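
To see which token the audit treats as the street type, here is a small sketch that applies the same regular expression to a few names. The helper `last_token` is a hypothetical name introduced only for this illustration.

```python
import re

# Same pattern as above: the last run of non-whitespace characters,
# starting at a word boundary, optionally ending in a period.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def last_token(street_name):
    """Return the trailing token the audit treats as the street type, or None."""
    m = street_type_re.search(street_name)
    return m.group() if m else None


samples = ['Pine Street', 'NE Northgate Way #1109', 'South 11th', 'N Main St.']
tokens = {name: last_token(name) for name in samples}
for name, token in tokens.items():
    print(repr(name), '->', repr(token))
```

Because the match must start at a word boundary, a trailing `#1109` is captured as `1109` without the `#`. Suite numbers and bare ordinals therefore show up as unexpected "street types" in the audit.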

In [33]:
def audit(osmfile):
    street_types = defaultdict(set)
    # Open in binary mode so the parser honors the encoding declared in the XML header
    with open(osmfile, 'rb') as osm_file:
        # Use 'end' events: children are only guaranteed to be parsed once the element closes
        for event, elem in ET.iterparse(osm_file, events=('end',)):
            if elem.tag == 'node' or elem.tag == 'way':
                for tag in elem.iter('tag'):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types

In [23]:
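def update_name(name, mapping):
    m = street_type_re.search(name)
    # Leave the name unchanged if there is no match or no mapping for the street type
    if m and m.group() in mapping:
        # Replace only the trailing street type, not earlier occurrences of the same text
        name = name[:m.start()] + mapping[m.group()]
    return name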

In [34]:
st_types = audit(OSM_FILE)

In [35]:
st_types

defaultdict(set,
            {'1': {'228th St SE Suite 1', 'Southeast 132nd Street #1'},
             '100': {'Northwest Byron Street #100',
              'Old Highway 9 Southwest  #100',
              'S 196th St #100',
              'Southeast 38th Street Suite 100'},
             '101': {'156th Street East #101',
              '5th Street #101',
              'East Highway 101',
              'East US Highway 101'},
             '102': {'15th Street Southwest  #102'},
             '104': {'Northeast State Highway 104', 'State Highway 104'},
             '105': {'State Route 105'},
             '110': {'Northeast 4th Street, Suite 110'},
             '1109': {'NE Northgate Way #1109'},
             '112': {'Craftsman Way Suite 112'},
             '11th': {'South 11th'},
             '12': {'HWY 12', 'State Route 12', 'US Highway 12'},
             '125': {'Better Way SE Ste 125'},
             '12th': {'South 12th'},
             '13th': {'South 13th'},
             '140': {'Highland