# Cleaning Las Vegas

My flight to arrive in the USA went to Las Vegas, so it was the first city I spent some time in.

I got myself a cheap bed in the suburbs close to the airport, and started working on my DAND P3, cleaning OSM data. Since I was here, and found the idea exciting, I downloaded the MetroExtract of Las Vegas: https://s3.amazonaws.com/metro-extracts.mapzen.com/las-vegas_nevada.osm.bz2 and went on my task to discover the city I was staying in through Data Science. :)

Here are the results of a wrangling process that lasted much longer than my stay.

---

## Exploring

In [5]:
import os

las_vegas_osm = 'las-vegas_nevada.osm'
## for testing and developing purposes, uncomment the truncated version:
#las_vegas_osm = 'LV_truncated.osm'
file_size = os.path.getsize(las_vegas_osm)
print 'File Size in Bytes:', file_size
print 'File Size in MB:   ', file_size / (2**20)

File Size in Bytes: 184636412
File Size in MB:    176


In [6]:
import xml.etree.cElementTree as ET
import pprint

def count_tags(filename):
    '''Creates a dictionary with the tags present in the dataset, alongside a count for each'''
    tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tag_dict:
            tag_dict[elem.tag] = 1
        elif elem.tag in tag_dict:
            tag_dict[elem.tag] += 1
    return tag_dict

las_vegas_osm_dict = count_tags(las_vegas_osm)

In [7]:
import pandas as pd

las_vegas_osm_tags = pd.Series(las_vegas_osm_dict, name='tags and their amounts')
las_vegas_osm_tags

bounds           1
member        3158
nd          995111
node        824219
osm              1
relation       316
tag         545013
way          92487
Name: tags and their amounts, dtype: int64

The city is big. Not only in bytes.

For quite a long time I took digital walks, checking through querying the dataset which places exist and what could be an interesting spot. Sometimes I took a real-life walk to find those places, often I remained online and went to check them out with an online map utility.

During my exploration I wondered what and where are the **bays** in Las Vegas, whether there is **grass**(parks) to find anywhere, where is that one lonely **picnic_table**, and where would I be able to get my longed-for fatty US-style **pizza**.

I also discovered the **TIGER** data that a lot of OSM's street data had been imported from. At first I had no idea what this is about, but with the help of the OSM wiki and some additional research, I started to understand. I also spent a while writing functions that were fetching the different parts of the TIGER data and concatenating it properly to write the missing ` addr:street ` tags and fill their values with what I had programatically stuck together. Shortly after managing, I realized that there was a `name` attribute to one tag in each way Element, that held exactly this data that I had created...

If you are interested in this phase of my explorations, you can find more talking (and code!) here:
https://github.com/martin-martin/cleaning-las-vegas/blob/master/las_vegas.ipynb. It reads a little bit like a blog, I think :)

---

## More focused Exploring

So I started a new notebook. I focused on looking for the street types as suggested in the course material. Now I also knew in which tag to search for them... :)

I audited the street names and ended up with a long list of messiness.

Then, step by step, I took out those street types that were common, and explored further examples of other "street types", or "ways", that seemed suspicious to me for maybe _not being ways_.

OSM uses the "way" tag also for something called an **area**, which can e.g. be a building or a park - basically anything for which it is interesting to preserve its shape. An area is constructed of closed ways: http://wiki.openstreetmap.org/wiki/Area. Here's, as an example, the description for buildings: http://wiki.openstreetmap.org/wiki/Relations/Proposed/Buildings

Using the IDs extracted from the dataset and tapped together with the suspicious "street" name into a dictionary, I went to check some of those way tags online, such as: http://www.openstreetmap.org/way/134574757

This is, yes indeed: a **golf course**. One that, however, nowhere mentions that it is a golf course...

Here's a link to that new file of exploration:

And the following code is the revisited version of what I was doing there:

In [8]:
import xml.etree.cElementTree as ET
import pprint
import re

def audit_street_type(street_types, expected, street_name):
    street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
    found = street_type_re.search(street_name)
    if found:
        street_type = found.group()
        if street_type not in expected:
            if street_type not in street_types:
                street_types[street_type] = [street_name]
            else:
                street_types[street_type].append(street_name)

def collect_way_types(filename, expected_types):
    street_types = {}
    for event, elem in ET.iterparse(filename, events=('start',)):
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == 'name':
                    street_name = tag.attrib['v']
                    audit_street_type(street_types, expected_types, street_name)                       
    return street_types

In [90]:
# choosing to exclude the common street types
common_types = []
street_types = collect_way_types(las_vegas_osm, common_types)
# While working with the truncated version of the dataset,
# I chose the threshold of 7 through checking the results. 
# 10 returned an empty list, 5 included 'Vegas' :)
threshold = 7
for key, value in street_types.items():
    if len(value) > threshold:
        common_types.append(key)   
    
street_types = collect_way_types(las_vegas_osm, common_types)

In [91]:
len(street_types)

1032

It becomes obvious that this is a crazy amount of Elements that are considered 'ways', but are obviously not streets. While working on my project with the truncated version of the dataset, these "special" instances were fewer, but still quite many.

I went down the path of trying to programmatically exclude and/or clean those special cases to individually reduce this list of 'ways' that have not a street type as their name ending.

**I would not do this again.** It is exciting, because one can discover a lot, but it is very ineffective.

Instead I would adapt the `audit_street_type()` function in order to be better adapted to my dataset. Here there were few Elements labeled with the `addr:street` key, but the street name data was often saved in the `name` attribute. However, as the length of the "`street_types`" dictionary shows, also many other things utilize the `name` attribute.

Anyways. I paid with time, that I invested in a journey of learning and discovery. :)
This is not the worst currency, at all.

For my investigation of the individual suspicious entries, I've used the following function in combination with checking the ID online with OpenStreetMap.

In [11]:
def find_something(filename, regex):
    '''Prints all XML elements matching the specified regex somewhere in their tags, 
    and a link to the specific OSM way. Returns None.'''
    import re
    flag = False
    for event, elem in ET.iterparse(filename, events=('start',)):
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == 'name':
                    if re.search(regex, ET.tostring(tag)):
                        print "Check ID online at: http://www.openstreetmap.org/way/" + elem.attrib['id'] + '\n'
                        ET.dump(elem)
                        flag = True
    if not flag:
        print "No matching Element was found."

This would produce e.g. the following result - allowing me to check the Element (and especially its attributes) also if I had no internet connection:

In [12]:
find_something(las_vegas_osm, 'Perfect Waters')

Check ID online at: http://www.openstreetmap.org/way/14308393

<way changeset="631433" id="14308393" timestamp="2007-11-27T16:21:07Z" uid="7168" user="DaveHansenTiger" version="1">
		<nd ref="137515455" />
		<nd ref="137432296" />
		<tag k="name" v="Perfect Waters" />
		<tag k="highway" v="residential" />
		<tag k="tiger:cfcc" v="A41" />
		<tag k="tiger:tlid" v="201947443" />
		<tag k="tiger:county" v="Clark, NV" />
		<tag k="tiger:source" v="tiger_import_dch_v0.6_20070813" />
		<tag k="tiger:reviewed" v="no" />
		<tag k="tiger:name_base" v="Perfect Waters" />
		<tag k="tiger:separated" v="no" />
		<tag k="tiger:upload_uuid" v="bulk_upload.pl-fa98df75-5974-4c49-9081-f3ca4b3c7383" />
	</way>
	


I've run quite a few of those queries, which are of course very labour-intensive.
This is what I call already my _more focused_ exploration :)

But I felt I needed to understand what are some of these places, so that I'd know how to deal with them later on. Also, it helped me to better get to know Las Vegas and the OpenStreetMap project.

Here's an intermediate result, that allowed me to reduce the size of the street type dictionary a bit more:

In [13]:
# these can be safely excluded, because they represent (most probably) valid ways
valid_ways = ['Aisle', 'Alley', 'Bypass', 'Channel', 'Highway', 'Interconnect', 'Loop', 'Monorail', 'Path', 'Paths',
             'Route', 'Speedway', 'Walk']
nature_ways = ['Falls', 'Forest', 'Lake', 'Shore', 'Spillway', 'Stream', 'River', 'Thrust', 'Wash']

In [92]:
exclude = common_types + valid_ways + nature_ways
street_types =  collect_way_types(las_vegas_osm, exclude)
len(street_types)

1019

I've come to encounter quite some beautiful street names along my exploration. Some are emotional, others telling, or also sad - because their names are so opposite of how they probably look in reality.

Here are some examples of those that I -saw and liked:

- Willow Wisp Terrace
- Wonderful Day Drive
- Perfect Waters
- Wanderlust
- Whisper Reef

---

## Time to Clean a bit

So I decided it was time to try to reduce the mess I had found, at least a tiny bit.

There were two possibilities I saw myself deciding between:

1. performing all cleaning on the original document, or 
2. filtering the database step by step, reducing its size, then performing cleaning steps on the singled-out Elements, and finally writing them back into the original Element Tree.

It felt that I could learn more with the second approach. I also thought that it would allow me to discover problems that I wouldn't have thought about myself. This makes the cleaning task more thorough ( - but also much more tedious...).

### Reducing to 'way' tags

I reduced the size of the Element Tree by 

- first extracting only the 'way' tags, 
- then by excluding common and well-spelled street names, 
- and even further by excluding some other uses of 'way' tags that were actually not streets.

In [15]:
# Reference: https://discussions.udacity.com/t/changing-attribute-value-in-xml/44575/6
from pprint import pprint
import xml.etree.cElementTree as ET
import re
import codecs

OSM_FILE = las_vegas_osm
NEW_FILE = 'cleaning_1.osm'

def get_ways(osm_file, tags=('node', 'way', 'relation')):
    """Filters an OSM file and yields the 'way' elements."""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            if elem.tag == 'way':
                yield elem
                root.clear()

with open(NEW_FILE, 'w') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    for i, element in enumerate(get_ways(OSM_FILE)):
        output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

### Excluding common street types

In [16]:
# setting the input file to the previous output file
OSM_FILE = NEW_FILE
NEW_FILE = 'cleaning_2.osm'
common_and_valid_ways = exclude

def select_some_way_elems(osm_file, excluded_ways):
    """Yields way elements which last word (usually the street type) is not in a list to exclude."""
    import re
    street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
    
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'way':
            for tag in elem.iter():
                try:
                    if tag.attrib['k'] == 'name':
                        street_name = tag.attrib['v'] 
                        found = street_type_re.search(street_name)
                        street_type = found.group()
                        if street_type not in excluded_ways:
                            yield elem
                            root.clear()
                except:
                    continue

# here I write a new document consisting only of those way elements that select_some_way_elems() yields.
with open(NEW_FILE, 'w') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    for i, element in enumerate(select_some_way_elems(OSM_FILE, common_and_valid_ways)):
        output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

### Cleaning the street names

Having reduced the size of my dataset, I started to write the functions which eventually would allow me to clean it a little bit.

In [17]:
def modify_file(filename, function, *args):
    """Modifies a file according to the output of a function.
    
    Takes as input a file name, a function and its arguments.
    Runs the (cleaning) function and writes the output back into the file,
    using a temporary file object as intermediate step.
    Reference:
    http://stackoverflow.com/questions/17646680/writing-back-into-the-same-file-after-reading-from-the-file
    """
    import tempfile
    import sys
    temp_file = tempfile.NamedTemporaryFile(mode = 'r+')
    input_file = open(filename, 'r')
    for i, element in enumerate(function(*args)):
        temp_file.write(ET.tostring(element, encoding='utf-8'))
    input_file.close()
    temp_file.seek(0)
    with open(filename, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<osm>\n  ')
        for line in temp_file:
            f.write(line)
        f.write('</osm>')
    temp_file.close() 

In [18]:
def substitute_attrib_value(osm_file, before, after, attrib_key, tags=('way', 'node', 'relation')):
    """Changes text in an attribute to a string defined in 'after'.

    Changes the text in a specified attribute of a 'way' tag 
    that contains the string variable defined in 'before' for a new string defined in 'after'.
    Reference:
    https://discussions.udacity.com/t/changing-attribute-value-in-xml/44575/6
    """
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            if elem.tag == 'way':
                for tag in elem.iter('tag'):
                    # not changing the original TIGER data
                    if ('tiger:' not in tag.attrib['k'] and 
                        re.search(before, ET.tostring(tag))):
                        tag.set(attrib_key, after)
            yield elem
            root.clear()

In [19]:
def substitute_smth(osm_file, before, after, attrib_key):
    """Wrapper function: Calls substitute_attrib_value() and modify_file().
    
    Substitutes a 'way' tag attribute for another and writes the changes back.
    """
    substitute_attrib_value(osm_file, before, after, attrib_key, tags=('way', 'node', 'relation'))
    modify_file(osm_file, substitute_attrib_value, osm_file, before, after, attrib_key)

Now this wrapper function allows me to do corrections for misspelled street names, such as:

In [20]:
# setting the input file to the previous output file
OSM_FILE = 'cleaning_2.osm'

substitute_smth(OSM_FILE, 'Wonderful Day Driive', 'Wonderful Day Drive', 'v')

After doing the adaptation, here's the check:

In [21]:
find_something(OSM_FILE, 'Wonderful Day Driive')

No matching Element was found.


In [22]:
find_something(OSM_FILE, 'Wonderful Day Drive')

Check ID online at: http://www.openstreetmap.org/way/98572517

<way changeset="24413704" id="98572517" timestamp="2014-07-29T02:09:10Z" uid="3392" user="SimMoonXP" version="4">
		<nd ref="1140354752" />
		<nd ref="1140354845" />
		<nd ref="1140354771" />
		<nd ref="1140354747" />
		<nd ref="1140354413" />
		<nd ref="1140354415" />
		<nd ref="1140354456" />
		<nd ref="1140354488" />
		<nd ref="1140354517" />
		<nd ref="1140354819" />
		<nd ref="1140354655" />
		<nd ref="1140354561" />
		<nd ref="1140354881" />
		<nd ref="1140354859" />
		<nd ref="1140354611" />
		<nd ref="1140354861" />
		<nd ref="1140354615" />
		<nd ref="1140354610" />
		<nd ref="1140354609" />
		<nd ref="1140354497" />
		<nd ref="1140354490" />
		<nd ref="1140354498" />
		<nd ref="1140354864" />
		<nd ref="1140354863" />
		<nd ref="1140354867" />
		<nd ref="1140354633" />
		<nd ref="1140354821" />
		<nd ref="1140354895" />
		<nd ref="1140354865" />
		<nd ref="1140354678" />
		<nd ref="1140354547" />
		<nd ref="114035

In [23]:
def add_attribute(osm_file, elem_id, attrib_key, attrib_value, tags=('way', 'node', 'relation')):
    '''Adds a tag Element with attribute and value to a 'way' Element specified through an ID.

    Reference:
    https://discussions.udacity.com/t/changing-attribute-value-in-xml/44575/6
    '''
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            if elem.attrib['id'] == elem_id:
                try:
                    for tag in elem.iter('tag'):
                        if tag.attrib['k'] == attrib_key and tag.attrib['v'] == attrib_value:
                            raise Exception('AttributePresentError')
                    ET.SubElement(elem, 'tag', k=attrib_key, v=attrib_value)
                except Exception:
                    print "The attributes %s=%s are already present in this Element."%(attrib_key, attrib_value)
                    continue
            yield elem
            root.clear()

In [24]:
def add_smth(osm_file, elem_id, attrib_key, attrib_value):
    '''Wrapper function: Calls add_attribute() and modify_file().
    
    Adds an attribute with value to an existing "way" tag, writes the changed ET back to the file.'''
    
    add_attribute(osm_file, elem_id, attrib_key, attrib_value, tags=('way', 'node', 'relation'))
    modify_file(osm_file, add_attribute, osm_file, elem_id, attrib_key, attrib_value)

These functions allow me to add tags to describe that some 'ways' are actually _buildings_ or other _areas_. E.g:

In [25]:
find_something(OSM_FILE, 'Green Valley Country Club')

Check ID online at: http://www.openstreetmap.org/way/27575068

<way changeset="15877547" id="27575068" timestamp="2013-04-26T22:01:01Z" uid="63936" user="MojaveNC" version="4">
		<nd ref="2282598954" />
		<nd ref="2282599039" />
		<nd ref="302766150" />
		<nd ref="1495619638" />
		<nd ref="1495619718" />
		<nd ref="302766151" />
		<nd ref="302766153" />
		<nd ref="302766154" />
		<nd ref="302766156" />
		<nd ref="302766157" />
		<nd ref="302766158" />
		<nd ref="1431998076" />
		<nd ref="2282598955" />
		<nd ref="2282598945" />
		<nd ref="2282598949" />
		<nd ref="2282598954" />
		<tag k="name" v="Green Valley Country Club Apts." />
		<tag k="landuse" v="residential" />
	</way>
	


In [26]:
add_smth(OSM_FILE, '27575073', 'building', 'yes')

In [27]:
find_something(OSM_FILE, 'Green Valley Country Club')

Check ID online at: http://www.openstreetmap.org/way/27575068

<way changeset="15877547" id="27575068" timestamp="2013-04-26T22:01:01Z" uid="63936" user="MojaveNC" version="4">
		<nd ref="2282598954" />
		<nd ref="2282599039" />
		<nd ref="302766150" />
		<nd ref="1495619638" />
		<nd ref="1495619718" />
		<nd ref="302766151" />
		<nd ref="302766153" />
		<nd ref="302766154" />
		<nd ref="302766156" />
		<nd ref="302766157" />
		<nd ref="302766158" />
		<nd ref="1431998076" />
		<nd ref="2282598955" />
		<nd ref="2282598945" />
		<nd ref="2282598949" />
		<nd ref="2282598954" />
		<tag k="name" v="Green Valley Country Club Apts." />
		<tag k="landuse" v="residential" />
	</way>
	


In [30]:
def get_id(filename, regex):
    '''Returns a list of the IDs of the element(s) matching the specified regex somewhere in their tags.'''
    import re
    elem_id_list = []
    for event, elem in ET.iterparse(filename, events=('start',)):
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == 'name':
                    if re.search(regex, ET.tostring(tag)):
                        elem_id_list.append(elem.attrib['id'])
    return elem_id_list

In [31]:
for area in street_types['Estates']:
    for elem_id in get_id(OSM_FILE, area):
        add_smth(OSM_FILE, elem_id, 'place', 'suburb')
        add_smth(OSM_FILE, elem_id, 'area', 'yes')

### Individual changes for specific streets

Some street names are simply missing their street type as an ending. I've double-checked these places with OSM through their ID and with GoogleMaps. If it addressed the same street and there was a street type extension in GoogleMaps, I added it to the data (not sure whether this is legal?).

In [32]:
find_something(OSM_FILE, 'Wanderlust')

Check ID online at: http://www.openstreetmap.org/way/203447533

<way changeset="15393235" id="203447533" timestamp="2013-03-17T09:49:18Z" uid="12434" user="nm7s9" version="2">
		<nd ref="2134638188" />
		<nd ref="2134638182" />
		<nd ref="2134638168" />
		<nd ref="2134638153" />
		<nd ref="2134638146" />
		<nd ref="2134638141" />
		<nd ref="2134638136" />
		<nd ref="2134638135" />
		<nd ref="2134638139" />
		<nd ref="2134638144" />
		<nd ref="2134638150" />
		<nd ref="2134638159" />
		<nd ref="2134638173" />
		<nd ref="2134638183" />
		<nd ref="2134638189" />
		<nd ref="2134638191" />
		<nd ref="2134638194" />
		<nd ref="2134638193" />
		<nd ref="2134638188" />
		<tag k="name" v="Wanderlust" />
		<tag k="oneway" v="yes" />
		<tag k="review" v="no" />
		<tag k="source" v="Bing" />
		<tag k="highway" v="residential" />
	</way>
	
Check ID online at: http://www.openstreetmap.org/way/203447566

<way changeset="15393235" id="203447566" timestamp="2013-03-17T09:49:18Z" uid="12434" user="nm7s9

In [33]:
substitute_smth(OSM_FILE, 'Wanderlust', 'Wanderlust Court', 'v')

In [34]:
find_something(OSM_FILE, 'Wanderlust')

Check ID online at: http://www.openstreetmap.org/way/203447533

<way changeset="15393235" id="203447533" timestamp="2013-03-17T09:49:18Z" uid="12434" user="nm7s9" version="2">
		<nd ref="2134638188" />
		<nd ref="2134638182" />
		<nd ref="2134638168" />
		<nd ref="2134638153" />
		<nd ref="2134638146" />
		<nd ref="2134638141" />
		<nd ref="2134638136" />
		<nd ref="2134638135" />
		<nd ref="2134638139" />
		<nd ref="2134638144" />
		<nd ref="2134638150" />
		<nd ref="2134638159" />
		<nd ref="2134638173" />
		<nd ref="2134638183" />
		<nd ref="2134638189" />
		<nd ref="2134638191" />
		<nd ref="2134638194" />
		<nd ref="2134638193" />
		<nd ref="2134638188" />
		<tag k="name" v="Wanderlust Court" />
		<tag k="oneway" v="yes" />
		<tag k="review" v="no" />
		<tag k="source" v="Bing" />
		<tag k="highway" v="residential" />
	</way>
	
Check ID online at: http://www.openstreetmap.org/way/203447566

<way changeset="15393235" id="203447566" timestamp="2013-03-17T09:49:18Z" uid="12434" user=

In [35]:
substitute_smth(OSM_FILE, 'Seven Oaks', 'Seven Oaks Way', 'v')

In [36]:
substitute_smth(OSM_FILE, 'Padero', 'North Padero Drive', 'v')

In [37]:
substitute_smth(OSM_FILE, 'Scottyboy', 'Scottyboy Drive', 'v')

In [38]:
substitute_smth(OSM_FILE, 'Seashore', 'Seashore Drive', 'v')

In [39]:
substitute_smth(OSM_FILE, 'S FLore del Sol', 'S Flore del Sol Street', 'v')

In [40]:
substitute_smth(OSM_FILE, street_types['Avenmue'][0], 'West Fenway Park Avenue', 'v')

Okay! Now my cleaning functions are working fine, it seems : )

Next step is that I'd have to automate them to apply the necessary changes to batches of the special exceptions that I found. Otherwise I'd have to go through it one-by-one, which is very tedious. 

For this I will have to classify the information I have into different connected groups. E.g., that all those Elements that need `area=yes` added may be collected together, so I can do one action that will run the same task on them all. In order to be able to group them, I will, however, have to know what they are and what are their specific issues.

### Grouping the Exceptions

**DISCLAIMER**: There's one problem with my approach, which is that I developed it with the **truncated version** of the OSM file. Since the decisions how to group the Elements with the specific name value endings was often taken after individually investigating all the returned Elements, it might cause troubles when applied to the complete file. Because there might be instances that have the same ending, however should be treated differently!

In [41]:
# Many street names have their 'type' at the start, when they are non-english type names
other_langs = ['Avenida', 'Via', 'Camino', 'Calle', 'Plaza', 'Calle', 'Vista'] 

# the following have to be corrected:
misspelled = ['Avenmue', 'Driive']
shortenings = ['Hwy', 'Mhp', 'Rd']
prefixed = ['Avenue']
wrong_suffixed = ['North', '(Difficult)', ', Lower', 'S', 'South']
# Lower': ['Las Vegas Wash Trail, Lower']
# things including 'Trail' somewhere

The following were found to be actual streets with uncommon name endings:

In [48]:
all_fine = {'Access' : street_types['Access'],
            'Oak' : street_types['Oak'],
            'Oasis' : street_types['Oasis'],
            'Paseo' : street_types['Paseo'],
            'Pines' : street_types['Pines'],
            'Cottage' : street_types['Cottage'],
            'Point' : street_types['Point'],
            'Portico' : street_types['Portico'],
            'Reef' : street_types['Reef'],
            'Sawtooth' : street_types['Sawtooth'],
            'Sierra' : street_types['Sierra'],
            'Solano' : street_types['Solano'],
            'Star' : street_types['Star']}

The following were found to be living quaters (therefore: `area=yes`), that according to the OSM wiki should have a tag with the `place=suburb` attribute.

In [59]:
add_area_suburb = {'Homestretch' : street_types['Homestretch'],
                  'Homes' : street_types['Homes'],
                  'Paradise' : street_types['Paradise'],
                  'Somerset' : street_types['Somerset']}

The following individual buildings recorded as closed ways (therefore: `area=yes` and `building=yes`).

In [60]:
add_area_building = {'Alex' : street_types['Alex']}

The following are some kind of areas, so it makes sense to add: `area=yes`

In [65]:
add_area = {'P' : street_types['P'],
            'Wilderness' : street_types['Wilderness']}

And finally these ones represent abbreviations for common street types, and should be expanded to the fully spelled version.

In [66]:
substitute = {'Ave' : street_types['Ave'],
              'Hwy' : street_types['Hwy'],
              'Rd' : street_types['Rd']}

I have to state here again that these steps of cleaning are not exhaustive. Already in the truncated version there are special cases that I didn't address, and there will be many more when running this file with the full dataset.

However, it gave me an insight into the troubles that one encounters when dealing with cleaning data. There are a lot of human generated imprecisions, different mappings and opinions on what to do, different (human generated) inconsistencies already in the reality of the data that gets recorded (e.g. not all streets have a street type at their end, or at all) etc.

Therefore I decided that I dove deep enough into this dataset, did my part of cleaning it a bit and learning about the parts involved, and that it's enough with this for now :)

So here I will apply the cleanings that I have devised, and then I'll move on.

In [67]:
# add area=yes
type_dict = add_area

for key, value in type_dict.items():
    for v in enumerate(value):
        name = v[1]
        for elem_id in get_id(OSM_FILE, name):
            add_smth(OSM_FILE, elem_id, 'area', 'yes')

In [68]:
# add area=yes, building=yes
type_dict = add_area_building

for key, value in type_dict.items():
    for v in enumerate(value):
        name = v[1]
        for elem_id in get_id(OSM_FILE, name):
            add_smth(OSM_FILE, elem_id, 'area', 'yes')
            add_smth(OSM_FILE, elem_id, 'building', 'yes')

The attributes area=yes are already present in this Element.


In [69]:
# add area=yes, place=suburb
type_dict = add_area_suburb

for key, value in type_dict.items():
    for v in enumerate(value):
        name = v[1]
        for elem_id in get_id(OSM_FILE, name):
            add_smth(OSM_FILE, elem_id, 'area', 'yes')
            add_smth(OSM_FILE, elem_id, 'place', 'suburb')

The attributes area=yes are already present in this Element.


Note: Could scrape data for common US street type abbreviations and add the mappings to the list. E.g. from here: http://pe.usps.gov/text/pub28/28apc_002.htm

In [70]:
# extend street name abbreviations
import re
map_dict = {'Rd' : 'Road', 'Hwy' : 'Highway', 'Ave' : 'Avenue'}
type_dict = substitute
street_re = re.compile(r'[^ ]+[ ]', re.IGNORECASE)

for key, value in type_dict.items():
    for v in enumerate(value):
        old_name = v[1]
        re_li = re.findall(street_re, old_name)
        new_name = ''.join(re_li) + map_dict[key]
        substitute_smth(OSM_FILE, old_name, new_name, 'v')

And finally I will apply the street types I learned I could exclude back onto my original function, to create a new and smaller `street_type` dictionary. This dict will contain the special cases that would need further attention and more thorough and individual cleaning.

In [71]:
for key in all_fine.keys():
    if key not in exclude:
        exclude.append(key)

In [72]:
# adapting the function with the newly learned aspects
def collect_way_types(filename, expected_types):
    street_types = {}
    # added these common non-english street names that appear at the beginning rather than the end
    non_eng_street_names = ['Avenida', 'Via', 'Camino', 'Calle', 'Vista', 'Placida']
    # here are some attributes that I found define non-street ways, so I exclude Elements containing them
    non_street_attribs = ['area', 'building', 'amenity', 'golf', 'railway']
    for event, elem in ET.iterparse(filename, events=('start',)):
        flag = False
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                if (tag.attrib['k'] in non_street_attribs) and (tag.attrib['v'] != 'no'):
                        flag = True
                for non_eng_name in non_eng_street_names:
                    # if a street starts with one of the non-eng names, it is excluded
                    if tag.attrib['v'].startswith(non_eng_name):
                        flag = True
                
            if flag == False:
                for tag in elem.iter('tag'):
                    if tag.attrib['k'] == 'name':
                        street_name = tag.attrib['v']
                        audit_street_type(street_types, expected_types, street_name)                       
    return street_types

street_types = collect_way_types(OSM_FILE, exclude)

In [73]:
len(street_types)

959

But actually this is all mostly crap, because it's little single adaptations that don't make a big difference. Even if I'd work this file down to zero - I realized that this is only a truncated file! Therefore, there are surely many more such single-cases, that would need personal attendance (that I am not willing to give for much longer)...

So, and that's also my task as a programmer, I shall abstract more and make some functions that do many cleanings.
And then that's it.
Doesn't make it clean, but makes it clea**er**.

Which is maybe good enough.

I must say also that I don't feel too bad about doing it, because I was also still exploring the city a bit, and also exploring OSM a bit, and XML and python and all.
So it's not wasted time and effort, but I think of those things it's enough now, and it's time to wrap it up.

### Merging changes with original XML

In [74]:
tree_changes = ET.ElementTree(file=OSM_FILE)
chang_root = tree_changes.getroot()
changed_elems = chang_root.findall('way')

In [75]:
try_elem = chang_root[0]
print try_elem.attrib
ET.dump(try_elem)

{'changeset': '4250464', 'uid': '20587', 'timestamp': '2010-03-27T22:52:27Z', 'version': '2', 'user': 'balrog-kun', 'id': '14278349'}
<way changeset="4250464" id="14278349" timestamp="2010-03-27T22:52:27Z" uid="20587" user="balrog-kun" version="2">
		<nd ref="137032566" />
		<nd ref="137032567" />
		<tag k="name" v="Dallas" />
		<tag k="name_1" v="Dallas Court" />
		<tag k="highway" v="residential" />
		<tag k="tiger:cfcc" v="A41" />
		<tag k="tiger:tlid" v="201902597" />
		<tag k="tiger:county" v="Clark, NV" />
		<tag k="tiger:source" v="tiger_import_dch_v0.6_20070813" />
		<tag k="tiger:reviewed" v="no" />
		<tag k="tiger:name_base" v="Dallas" />
		<tag k="tiger:separated" v="no" />
		<tag k="tiger:name_base_1" v="Dallas" />
		<tag k="tiger:name_type_1" v="Ct" />
	</way>
	


- maybe removing all tags
- reinserting the new ones?

So I got some xml elements saved in this findall() returned list, that can now be nicely queried :)

Maybe this could have saved me lots of work? Well, now I have this, maybe I can work with it!

In [76]:
changes_dict = {}
for elem in changed_elems:
    changes_dict[elem.attrib['id']] = elem        

In [77]:
len(changes_dict)

2106

Now with the following function I'll be merging the changes with the original Element Tree and create a new file.

In [78]:
ORIG_FILE = 'LV_truncated.osm'
NEW_FILE = 'LV_applied_changes.osm'

def merge_changes(osm_file, changes):
    '''Merges the changes applied on the street names back into the original OSM file structure, creating a new file.'''
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'start' and elem.tag == 'way':
            current_id = elem.attrib['id']
            if current_id in changes.keys():
                elem = changes[current_id]
        if event == 'end':
            yield elem
            root.clear()
                
with open(NEW_FILE, 'w') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')
    for i, element in enumerate(merge_changes(ORIG_FILE, changes_dict)):
        output.write(ET.tostring(element, encoding='utf-8'))
    output.write('</osm>')

In [79]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

find_something(NEW_FILE, 'Eldorado')

Check ID online at: http://www.openstreetmap.org/way/226135717

<way changeset="16595288" id="226135717" timestamp="2013-06-17T20:05:18Z" uid="3392" user="SimMoonXP" version="1">
		<nd ref="137538171" />
		<nd ref="276935081" />
		<nd ref="2121261466" />
		<tag k="name" v="Eldorado Lane" />
		<tag k="lanes" v="2" />
		<tag k="highway" v="residential" />
		<tag k="maxspeed" v="25 mph" />
		<tag k="tiger:cfcc" v="A41" />
		<tag k="tiger:county" v="Clark, NV" />
		<tag k="tiger:zip_left" v="89123" />
		<tag k="tiger:name_base" v="Eldorado" />
		<tag k="tiger:name_type" v="Ln" />
		<tag k="tiger:zip_right" v="89123" />
		<tag k="tiger:name_direction_prefix" v="E" />
	</way>
	


---

## Porting to MongoDB

Now, after having performed some cleaning action, I'll edit the data so I'll be able to transfer it into MongoDB.

In [82]:
# Code taken from Lesson 6 and adapted to my situation
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json

problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    node = {}
    # you should process only 2 types of top level tags: "node" and "way"
    if element.tag == "node" or element.tag == "way":
        node['created'] = {}
        node['visible'] = 'true'
        node['type'] = element.tag
        for key, value in element.attrib.iteritems():
            if key in CREATED:
                node['created'][key] = value
            elif key != 'lat' and key != 'lon':
                node[key] = value
        try:
            node['pos'] = [float(element.attrib['lat']), float(element.attrib['lon'])]
        except:
            pass
        if element.tag == 'way':
            node['node_refs'] = {}
            nd_list = []
            for nd in element.iter('nd'):
                nd_list.append(nd.attrib['ref'])
            node['node_refs'] = nd_list

        # creating the additional dicts
        for child in element:
            if child.tag == 'tag':
                attrib_key = child.attrib['k']
                attrib_value = child.attrib['v']
                if re.search(r'(\w+:){2}', attrib_key):
                    continue
                if re.search(r':', attrib_key):
                    separate_by_colon_re = re.compile(r'([\w]+[^:\n])')
                    key_parts_list = re.findall(separate_by_colon_re, attrib_key)
                    main_key = key_parts_list.pop(0)
                    # removing the main key
                    if len(key_parts_list) == 1:
                        secondary_key = key_parts_list.pop(0)
                        if main_key == 'addr':
                            if 'address' in node:
                                node['address'][secondary_key] = attrib_value
                            else:
                                node['address'] = {}
                                node['address'][secondary_key] = attrib_value
                        else:
                            if main_key in node and type(node[main_key]) == dict:
                                node[main_key][secondary_key] = attrib_value
                            ### NOTE: Some keys I create with regex as keys for dict might already exist as
                            ### keys one level up. Therefore I added this to not lose the information from there
                            else:
                                main_key = main_key+'dict'
                                node[main_key] = {}
                                node[main_key][secondary_key] = attrib_value
                            if main_key not in node:
                                node[main_key] = {}
                                node[main_key][secondary_key] = attrib_value
                else:
                    node[attrib_key] = attrib_value
        return node
    else:
        return None
        
def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
            for _, element in ET.iterparse(file_in):
                    el = shape_element(element)
                    if el:
                            data.append(el)
                            if pretty:
                                    fo.write(json.dumps(el, indent=2)+"\n")
                            else:
                                    fo.write(json.dumps(el) + "\n")
    return data

In [83]:
json_struct = process_map('las-vegas_nevada.osm')

In [84]:
# take a look at the data
pprint.pprint(json_struct[0])

{'created': {'changeset': '21953362',
             'timestamp': '2014-04-26T12:32:58Z',
             'uid': '85673',
             'user': 'Bored',
             'version': '8'},
 'id': '31551114',
 'is_in': 'Nevada',
 'is_indict': {'continent': 'North America'},
 'name': 'Las Vegas',
 'namedict': {'zh': u'\u62c9\u65af\u7ef4\u52a0\u65af'},
 'place': 'city',
 'population': '567641',
 'pos': [36.1662859, -115.149225],
 'type': 'node',
 'visible': 'true'}


In [85]:
# doing a random query
count = 0
for i in json_struct:
    if i['id'] == '137225011':
        pprint.pprint(json_struct[count])
    count += 1

{'created': {'changeset': '15375615',
             'timestamp': '2013-03-15T17:14:07Z',
             'uid': '12434',
             'user': 'nm7s9',
             'version': '3'},
 'highway': 'turning_circle',
 'id': '137225011',
 'pos': [36.231476, -115.17462],
 'type': 'node',
 'visible': 'true'}


After finally having the data saved in an exported .json file, I was ready to import it to MongoDB. For this I used the mongoimport command in the terminal, after installing MongoDB on my computer.

---

## Statistical Overview using MongoDB

Here are some glimpses into the commands I ran on my data, using pymongo, but also the mongodb shell.

As references for these look-ups I used the respective docs for **pymongo** and the **mongodb shell** found here: https://docs.mongodb.org/getting-started/python/

```
> db.lasvegas.stats({ dbStats: 1, scale: 1 })
{
	"ns" : "udacity.lasvegas",
	"count" : 916706,
	"size" : 234245793,
	"avgObjSize" : 255, ...
```

This is a (truncated) result of the stats that the `db.collection.stats()` command run in the mongodb shell returns.
The size of the data is presented in **bytes** (to change this, the `scale` parameter can be set e.g. to `2048` to return the result in MB).

Here are the stats regarding the amount of posts per user for the top users:

```
> db.lasvegas.aggregate([{"$group":{ "_id":"$created.user", "count":{"$sum":1}}},{"$sort":{"count":-1}},{"$limit":10}]).pretty()
{ "_id" : "alimamo", "count" : 253804 }
{ "_id" : "woodpeck_fixbot", "count" : 75622 }
{ "_id" : "alecdhuse", "count" : 66729 }
{ "_id" : "abellao", "count" : 49629 }
{ "_id" : "gMitchellD", "count" : 47377 }
{ "_id" : "robgeb", "count" : 43289 }
{ "_id" : "nmixter", "count" : 40250 }
{ "_id" : "MojaveNC", "count" : 30173 }
{ "_id" : "nm7s9", "count" : 26712 }
{ "_id" : "balrog-kun", "count" : 15051 }
```

I wanted to get to know how to look for specific values. The following query returns the values of all _highway_ fields in all _documents_ of the Las Vegas data:

(I chose this query, because the random query I ran further up returned me a wonderfully fitting `turning_circle` in "Godbey Court" :)

```
> db.lasvegas.distinct("highway", {highway : {$exists : true}})
[
	"motorway_junction",
	"turning_circle",
	"traffic_signals",
	"crossing",
	"passing_place",
	"mini_roundabout",
	"stop",
	"turning_loop",
	"overhead_sign",
	"trailhead",
	"bus_stop",
	"street_lamp",
	"give_way",
	"intersection",
	"elevator",
	"motorway",
	"residential",
	"service",
	"secondary",
	"track",
	"tertiary",
	"motorway_link",
	"footway",
	"road",
	"unclassified",
	"path",
	"secondary_link",
	"trunk_link",
	"trunk",
	"proposed",
	"pedestrian",
	"living_street",
	"tertiary_link",
	"primary",
	"steps",
	"cycleway",
	"raceway",
	"construction",
	"bridleway",
	"escalator",
	"primary_link"
]
```

Sticking with `turning_circle`s, I've used the cursor-query results to see how many of similar nodes referencing a `turning_circle` exist in the dataset:

```
> db.lasvegas.find({'highway': 'turning_circle'}).count()
6800
```

That's a lot :)

To take a short look back into the aspect of the dataset that I tried to clean a little bit during my exploration, I've constructed this query using a regex that matches any value of the `name` fields that ends with a "." (which might indicate an abbreviated street name).

This is the result:

```
> db.lasvegas.find({'name': { $regex: /( \w+\.)$/, $options: '<options>' } }).count()
13
```

It might be interesting to take a look at these 13 documents and eventually write a function to clean them. 13 documents does actually sound manageable :)
A short peek (by removing the `.count()` and running it again) already shows that most of them denote a fast-food chain, and some might be interesting for further, more focused street type cleaning.

---

## Conclusion

I've taken the chance of this DAND project to explore Las Vegas in a for me very interesting way. I surely took quite some detours regarding the aim of this project, however this offered me on one hand some beautiful discoveries, and on the other hand a lot of learning to handle XML (and especially OSM) data, and gave me some insight into how a process of auditing and cleaning a database might look like.

### Overview of the Data

Some initial statistics can be found at the top of this document, that were performed using python analysis of the file I used and the ElementTree that was built from it.

Later there is some more statistical analysis performed with the mongodb shell at the end of the document, before the Conclusion.

### Problems encountered in my map

Throughout the description of my process and the steps that I took, I keep explaining the problems that I encountered. They were manyfold! Some are caused by the users who were entering the data, some others by me, my initial lack of knowledge, and some inefficient approaches that I took.

- street names were not saved in `addr:street` but in the `name` attribute, that also contains names of other places
- 'way' tags can also be structures called 'closed ways' that represent e.g. buildings. often the suggested flags to indicate this are not present in the tags
- trying to exclude speciality tags on a case-by-case basis is very ineffective, especially because it does not account for a changing dataset
- ...

I've learned not to go down paths of cleaning individual examples, and at the same time that there will be many of them to find if I'm dealing with human-inputted data.

### Other ideas about the dataset

A lot of my time working on my dataset I spent on learning about and interacting with the XML structure's Element Tree. There's more to do there:

- adding flagging tags with appropriate attributes (e.g. `area=yes`) to those 'way' tags that are **closed ways**
- adapt the `audit_street_type()` function to return a more useful street type dictionary that excludes speciality cases
- scrape the TIGER data online to extract a valid mapping of street type abbreviations to their full-length versions, then use it for a mapping

Then it could be very interesting to explore the data more with MongoDB. There are many different ways of creating cursor objects (with different operators and pipelines), that can allow a relatively easy and efficient way to analyze the dataset. Probably easier than it was trying to work it with the ET module. As a short immediate idea, I could adapt the regex pattern or add logical operators to exclude the fast-food chain from the results, take a closer look at the documents returned in the cursor object, and eventually consider cleaning it there.

---

However, I believe that I've worked enough on this project, so I will not dive into these possible further steps now.
I hope that you'll have a bit of fun or an interesting time looking into my project, and am looking forward to your feedback! :)