# Cleaning and Wrangling: Seattle Open Street Map

We now analyze the Open Street Map data for the City of Seattle.

## Explore a Subset of Data

Because the dataset is large, we need a systematic way to slice it into a workable sample to explore. I use the code below, which writes every **k**-th top-level element to *SAMPLE_FILE*, so changing the **k** value from large to small produces samples of different sizes. When starting out, try using a larger k, then move on to an intermediate k before processing your whole dataset.
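To get a feel for how k controls the sample size, here is a minimal sketch. The helper `sample_size` is a hypothetical name introduced only for illustration, and the element counts are the node, way, and relation totals from the tag tally reported later in this section.

```python
def sample_size(total, k):
    """Number of indices i in range(total) with i % k == 0."""
    return -(-total // k)  # Ceiling division

# Toy check: keeping every 5th of 20 items keeps indices 0, 5, 10, 15
toy = [i for i in range(20) if i % 5 == 0]

# Top-level counts from the tag tally reported later in this section
top_level = 7580046 + 750242 + 9411  # node + way + relation
estimate = sample_size(top_level, 5000)
print(toy, estimate)
```

With k = 5000 this works out to 1,668 top-level elements, which is small enough to inspect by hand.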

In [1]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9; use lxml if this is too slow
import pprint

In [15]:
OSM_FILE = "seattle_washington.osm"  # Replace this with your osm file
SAMPLE_FILE = "test.osm"

k = 5000 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n'.encode('utf-8'))
    output.write('<osm>\n  '.encode('utf-8'))

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>'.encode('utf-8'))

In [None]:
output.close()  # Redundant but harmless: the with block above already closed the file

At the end of the above code, we end up with a file *test.osm* that we can use to explore the dataset.
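Before exploring the sample, it can help to confirm that the file is well-formed XML and to see how many top-level elements it holds. The sketch below is one way to do that; `summarize_sample` is a hypothetical helper, and the tiny document written to a temporary file is only a stand-in for *test.osm*.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def summarize_sample(path):
    """Return (size in bytes, number of top-level elements) for an OSM file."""
    size = os.path.getsize(path)
    root = ET.parse(path).getroot()  # Raises ParseError if the file is malformed
    return size, len(root)  # len(root) counts the direct children of <osm>

# Stand-in for test.osm: one node and one way (illustrative data only)
demo = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<osm>\n  <node id="1" /><way id="2"><nd ref="1" /></way></osm>')
with tempfile.NamedTemporaryFile(suffix=".osm", delete=False) as tmp:
    tmp.write(demo)

demo_size, demo_count = summarize_sample(tmp.name)
os.remove(tmp.name)
print(demo_size, demo_count)
```

Calling `summarize_sample("test.osm")` on the real sample should work the same way, assuming the file was written by the cell above.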

### Develop a Dictionary for All Tags In the Original Dataset

Our goal here is to end up with a Python dictionary that counts every tag in the original dataset, so that we know what needs to be wrangled in the data. The following achieves this.

In [2]:
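def count_tags(filename):
    """Count how many times each tag name occurs in the file."""
    tags = {}
    # iterparse streams the file; by default it yields only 'end' events
    for event, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
        elem.clear()  # Drop the element's contents to limit memory use

    return tags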

In [4]:
tags = count_tags('seattle_washington.osm')

In [5]:
import pickle 

with open('tags.pickle', 'wb') as tagsPickle:
    pickle.dump(tags, tagsPickle, protocol=pickle.HIGHEST_PROTOCOL)

In [6]:
with open('tags.pickle', 'rb') as tagsPickle:
    unserialized_tags = pickle.load(tagsPickle)


In [7]:
unserialized_tags

{'bounds': 1,
 'member': 88068,
 'nd': 8453162,
 'node': 7580046,
 'osm': 1,
 'relation': 9411,
 'tag': 4708553,
 'way': 750242}
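The counts are easier to read when ranked by frequency. A minimal sketch, with the dictionary copied from the output above so that it runs without the pickle file (the names `tag_counts` and `ranked` are introduced here for illustration):

```python
# Tag counts copied from the output above
tag_counts = {
    'bounds': 1,
    'member': 88068,
    'nd': 8453162,
    'node': 7580046,
    'osm': 1,
    'relation': 9411,
    'tag': 4708553,
    'way': 750242,
}

# Sort (tag, count) pairs from most to least frequent
ranked = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)
for name, count in ranked:
    print(f"{name:<10}{count:>10,}")
```

Ranked this way, `nd` (node references inside ways) and `node` dominate the file, followed by `tag`, which holds the key-value metadata.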

### Exploring What Is Contained Within Each Tag Type