# Cleaning the streets of Las Vegas
A digital walk to get to know my surroundings through the computer, using Data Wrangling.

## Getting the data
I chose to investigate the **MapZen metro extract** of **Las Vegas, Nevada** found here: https://mapzen.com/data/metro-extracts
(it probably covers approximately this area: https://www.openstreetmap.org/export#map=12/36.1750/-115.1372)

My flight to the USA landed here, and I felt like trying an approach that is new to me for discovering a place I am staying in. So I decided to programmatically investigate and clean the OSM data available for Las Vegas.

I want to get to know which streets surround the place where I am staying, and to learn about the size of the city, its sights, and what people find important enough to map here.

I'll be spending my days here mostly inside, though, taking digital walks and discovery tours by diving into the OSM data. On my way I may also do something good for this city by cleaning its streets in a way that lets me actually contribute a little. :)

## Inspecting
First I'll take a look at the data I will be working with.

In [29]:
import os
#las_vegas_osm = 'las-vegas_nevada.osm'
# for testing and developing purposes, here's the truncated version:
las_vegas_osm = 'LV_truncated.osm'
file_size = os.path.getsize(las_vegas_osm)
print 'File Size in Bytes:', file_size
print 'File Size in MB:   ', file_size / (2**20)

File Size in Bytes: 18702424
File Size in MB:    17


In [25]:
import xml.etree.cElementTree as ET
import pprint

def count_tags(filename):
    '''Creates a dictionary with the tags present in the dataset, alongside a count for each'''
    tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tag_dict:
            tag_dict[elem.tag] = 1
        else:
            tag_dict[elem.tag] += 1
    return tag_dict


#las_vegas_osm_dict = count_tags('las-vegas_nevada.osm')
las_vegas_osm_dict = count_tags(las_vegas_osm)
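The same tally can be written more compactly with `collections.Counter`. The following is only a minimal sketch: the function name `count_tags_counter` and the tiny inline XML sample are made up for illustration, and it imports `xml.etree.ElementTree` rather than `cElementTree` so that it also runs on newer Python versions.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_counter(source):
    """Count element tags; `source` is a filename or a file-like object."""
    # iterparse yields (event, element) pairs; by default one per closing tag
    return Counter(elem.tag for _, elem in ET.iterparse(source))

# Tiny made-up sample standing in for the real extract
sample = io.BytesIO(b"""<osm>
  <node id="1"><tag k="leisure" v="park"/></node>
  <node id="2"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>""")

tag_counts = count_tags_counter(sample)
```

A `Counter` behaves like a dictionary, so it can be passed to `pd.Series` in the same way as the dictionary returned by `count_tags`.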

Which tags are present in the dataset, and how many of them?

In [26]:
import pandas as pd

# use a new name here, so that the filename stored in las_vegas_osm is not overwritten
las_vegas_osm_tags = pd.Series(las_vegas_osm_dict, name='tags and their amounts')
las_vegas_osm_tags

member         281
nd          100995
node         82011
osm              1
relation        31
tag          54515
way           9187
Name: tags and their amounts, dtype: int64

In [30]:
way_keys = {}
for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
    if elem.tag == 'way':
        for tag in elem.iter('tag'):
            if tag.attrib['k'] not in way_keys:
                way_keys[tag.attrib['k']] = 1
            else:
                way_keys[tag.attrib['k']] += 1

In [50]:
# making a general function for this
def counting_attributes(way_or_node):
    way_or_node_keys = {}
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            for tag in elem.iter('tag'):
                if tag.attrib['k'] not in way_or_node_keys:
                    way_or_node_keys[tag.attrib['k']] = 1
                else:
                    way_or_node_keys[tag.attrib['k']] += 1
    return way_or_node_keys
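One caveat about the function above: on a `start` event, `iterparse` only guarantees that the opening tag of an element has been read, so its child `tag` elements may or may not be available yet. Listening for `end` events avoids that uncertainty. Below is a minimal sketch of that variant; the name `count_tag_keys` and the inline sample are assumptions made for illustration, and the counts in this notebook were not produced with it.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tag_keys(source, parent_tag):
    """Count the "k" attributes of all <tag> children of `parent_tag` elements."""
    key_counts = Counter()
    # on an 'end' event the element and all of its children are fully parsed
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == parent_tag:
            key_counts.update(tag.attrib['k'] for tag in elem.iter('tag'))
            elem.clear()  # drop the children that are no longer needed
    return key_counts

# Tiny made-up sample standing in for the real extract
sample = io.BytesIO(b"""<osm>
  <node id="1"><tag k="leisure" v="park"/></node>
  <way id="2"><tag k="highway" v="residential"/><tag k="name" v="Pecos Rd"/></way>
  <way id="3"><tag k="highway" v="footway"/></way>
</osm>""")

way_key_counts = count_tag_keys(sample, 'way')
```

Calling `elem.clear()` after an element has been processed keeps memory use lower on the full extract.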

In [54]:
all_way_keys = pd.Series(counting_attributes('way'), name='types of tags on ways')

In [55]:
# displaying the more common way key attributes in alphabetical order
most_used_way_keys = all_way_keys[all_way_keys.values > 500]
most_used_way_keys

building            583
color               557
footway             772
highway            6793
name               4433
natural             704
oneway              613
review             1463
source             2370
tiger:cfcc         2763
tiger:county       2772
tiger:name_base    2664
tiger:name_type    2518
tiger:reviewed     2571
tiger:separated    1855
tiger:source       1957
tiger:tlid         1972
tiger:zip_left     2313
tiger:zip_right    2262
Name: types of tags on ways, dtype: int64

There's a lot of `tiger:` data. I did not know what this is and looked it up on the OSM wiki:
http://wiki.openstreetmap.org/wiki/TIGER

So let's check which TIGER data is in my map section, and how much of it:

In [76]:
def number_of_specific_attribs(way_or_node, regex):
    '''Returns the attributes specified through the input and how often they occur
    
    Takes as input a primary XML tag that holds tags in this dataset ("way" or "node")
    and a compiled regular expression to match the desired attributes of these tags.
    Returns a pandas Series object mapping the attributes to the number of their occurrences.
    '''
    import re
    import pandas as pd
    regex_keys = {}
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            for tag in elem.iter('tag'):
                if re.search(regex, tag.attrib['k']):
                    if tag.attrib['k'] not in regex_keys:
                        regex_keys[tag.attrib['k']] = 1
                    else:
                        regex_keys[tag.attrib['k']] += 1
    regex_series = pd.Series(regex_keys, name='tag attributes for "%s" matching "%s"' %(way_or_node, regex.pattern))
    return regex_series

In [77]:
import re
tiger_attribs = re.compile(r'^tiger:[a-z_]*$')
all_tiger = number_of_specific_attribs('way', tiger_attribs)
all_tiger

tiger:cfcc                     2763
tiger:county                   2772
tiger:mtfcc                      56
tiger:name_base                2664
tiger:name_direction_prefix     358
tiger:name_direction_suffix       2
tiger:name_full                  54
tiger:name_type                2518
tiger:reviewed                 2571
tiger:separated                1855
tiger:source                   1957
tiger:tlid                     1972
tiger:upload_uuid               254
tiger:zip_left                 2313
tiger:zip_right                2262
Name: tag attributes for "way" matching "^tiger:[a-z_]*$", dtype: int64

In [78]:
# I've written this function to better inspect what type of data the different attributes contain
def get_attrib_values(way_or_node, attribute):
    '''Collects all the "v" values for the given "k" attribute and counts their occurrences
    
    Takes as input a primary XML tag that holds tags in this dataset ("way" or "node")
    and a string naming an existing attribute.
    Returns a pandas Series object mapping the "v" values of the attribute to the number of their occurrences.
    '''
    attribute_values = {}
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == attribute:
                    if tag.attrib['v'] not in attribute_values:
                        attribute_values[tag.attrib['v']] = 1
                    else:
                        attribute_values[tag.attrib['v']] += 1
    attribute_values_series = pd.Series(attribute_values, name='amounts of values for "%s"' %(attribute))
    return attribute_values_series

In [80]:
tiger_values = get_attrib_values('way', 'tiger:name_base')
# look at a few of the values
tiger_values_top = tiger_values[tiger_values.values > 3]
tiger_values_top

Arroyo Grande                4
Buffalo                      6
Decatur                      5
Desert Inn                   4
Durango                      4
Fort Apache                  5
Heritage                     4
Jones                        7
Martin L King                4
Park                         4
Pecos                        8
Sahara                       7
Stewart                      4
Tenaya                       5
Torrey Pines                 4
Town Center                  5
Union Pacific Railroad      13
United States Highway 95     9
Name: amounts of values for "tiger:name_base", dtype: int64

These look like street names or place names (there's also a lake somewhere among them). But none of them carries a street type ending.

But from reading the OSM Wiki page about TIGER data, I know that TIGER recorded the street name in this form: `"#{fedirp} #{fename} #{fetype} #{fedirs}".strip`. When that data was imported into OSM, the aim was to split the road name into separate attributes.

Therefore there exist attributes for `name_direction_prefix_1`, `name_base_1`, `name_type_1` and `name_direction_suffix_1`, that together can form e.g. a street name.
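To illustrate how these parts fit together, here is a minimal sketch that joins them back into one name. The helper `assemble_tiger_name` is hypothetical, it uses the un-suffixed key names that appear in this dataset (`tiger:name_base` and so on), and the way the example name is split into parts is made up for illustration.

```python
# Keys in the order prefix, base, type, suffix (as found in this dataset)
TIGER_NAME_PARTS = (
    'tiger:name_direction_prefix',
    'tiger:name_base',
    'tiger:name_type',
    'tiger:name_direction_suffix',
)

def assemble_tiger_name(tags):
    """Join the TIGER name parts found in `tags` (a dict of k -> v) into one string."""
    parts = (tags.get(key, '').strip() for key in TIGER_NAME_PARTS)
    # skip parts that are missing or empty, so no double spaces appear
    return ' '.join(part for part in parts if part)

# Hypothetical tag values for a single way
example_tags = {
    'tiger:name_direction_prefix': 'E',
    'tiger:name_base': 'Blue Rosalie',
    'tiger:name_type': 'Pl',
}
street_name = assemble_tiger_name(example_tags)  # 'E Blue Rosalie Pl'
```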

I also checked how many distinct values there are:

In [81]:
print len(tiger_values)
# all_tiger is the Series of TIGER key counts computed above
all_tiger['tiger:name_base'] > len(tiger_values)

2425


True

Since this number is lower than the one I found when counting all occurrences, some values must appear more than once. I might investigate this further. These duplicates might be legitimate, but they could also be redundant.
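One possible way to start that investigation is to collect, for every value, the ids of the ways that carry it. This is only a sketch: the function `ways_by_tag_value`, the way ids, and the inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def ways_by_tag_value(source, key):
    """Map each value of the tag `key` to the ids of the ways carrying it."""
    ways = defaultdict(list)
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == key:
                    ways[tag.attrib['v']].append(elem.attrib['id'])
            elem.clear()  # the way has been processed, free its children
    return ways

# Tiny made-up sample: two ways share the same name_base
sample = io.BytesIO(b"""<osm>
  <way id="10"><tag k="tiger:name_base" v="Pecos"/></way>
  <way id="11"><tag k="tiger:name_base" v="Pecos"/></way>
  <way id="12"><tag k="tiger:name_base" v="Jones"/></way>
</osm>""")

ways = ways_by_tag_value(sample, 'tiger:name_base')
# keep only the values that occur on more than one way
shared = {value: ids for value, ids in ways.items() if len(ids) > 1}
```

With the way ids at hand, each group can be looked up on the map to judge whether the ways are sections of one street or genuinely redundant entries.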

Next I looked at the actual street name endings to see what is there.

In [82]:
nametype_list = get_attrib_values('way', 'tiger:name_type')

In [90]:
# Series.order() was removed from newer pandas versions; sort_values() replaces it
nametype_list.sort_values(ascending=False)

Ave               491
Dr                468
St                431
Ct                376
Ln                190
Way               165
Rd                129
Cir               120
Blvd               52
Pl                 51
Pky                18
Ter                13
Trl                 3
Dr; Dr; Dr; Rd      3
Ctr                 2
Xing                1
Cv                  1
Way; Rd; Way        1
Rd; Blvd            1
St:Trl              1
Aly                 1
Name: amounts of values for "tiger:name_type", dtype: int64

Most of these abbreviations seem to be legitimate (according to the TIGER Appendix D (2000) that I found here: http://cugir.mannlib.cornell.edu/metadata/cens2000/TIGER2000.pdf).
However, there are some strange values, such as `Dr; Dr; Dr; Rd`, that might stem from a difference in how TIGER and OSM represent ways.

My assumption is that these values describe several sections of one way element, since each individual part is a valid abbreviation in itself. The example above could therefore represent a Drive leading to a Drive leading to a Drive leading to a Road.

Similarly, `St:Trl`, which uses yet another separator, could stand for a Street leading to a Trail.

I don't really know how to deal with these values, because I am uncertain which part of the information to discard and which to keep.
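Without deciding what to keep, the combined values can at least be split and checked part by part. The sketch below assumes that `;` and `:` are the only separators in use. The set `KNOWN_TYPES` is simply the list of single abbreviations observed in the output above, not an authoritative list from the TIGER documentation, and both helper names are hypothetical.

```python
import re

# Single abbreviations observed in this dataset (an assumption, not an official list)
KNOWN_TYPES = {'Ave', 'Dr', 'St', 'Ct', 'Ln', 'Way', 'Rd', 'Cir', 'Blvd',
               'Pl', 'Pky', 'Ter', 'Trl', 'Ctr', 'Xing', 'Cv', 'Aly'}

# a ';' or ':' with optional surrounding whitespace
SEPARATORS = re.compile(r'\s*[;:]\s*')

def split_name_types(value):
    """Split a tiger:name_type value such as 'Dr; Dr; Dr; Rd' into its parts."""
    return [part for part in SEPARATORS.split(value.strip()) if part]

def unknown_parts(value):
    """Return the parts of `value` that are not in KNOWN_TYPES."""
    return [part for part in split_name_types(value) if part not in KNOWN_TYPES]

parts_semicolon = split_name_types('Dr; Dr; Dr; Rd')  # ['Dr', 'Dr', 'Dr', 'Rd']
parts_colon = split_name_types('St:Trl')              # ['St', 'Trl']
```

An empty result from `unknown_parts` means that every part of a combined value is a known abbreviation, which is the observation made above.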

---

Another attribute that looked interesting to me was `tiger:name_full`, because it isn't mentioned in the OSM wiki and its name suggests that it holds the whole TIGER street name in one place (that is, in a single value, just like "normal" OSM attributes such as `addr:street`).

In [91]:
namefull_list = get_attrib_values('way', 'tiger:name_full')
namefull_list.head()

Autumn King Ave                    1
Bay Course Ct                      1
Bethel Mill St:S Bethel Mill St    1
Crooked Putter Dr                  1
E Blue Rosalie Pl                  2
Name: amounts of values for "tiger:name_full", dtype: int64

And indeed: it does!

In [92]:
all_tiger['tiger:name_full']

54

A quick check of how many there are makes it clear that this is not a full copy of the split-up data: there are only 54 entries, compared to 2664 for `tiger:name_base`. The entries might still be copies of **some** entries that were split up, or they might be unique entries.

To determine this, I'll need to investigate further.
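One way this check could look: for every way that has a `tiger:name_full` tag, rebuild the name from the split-up parts on the same way and compare the two. This is a sketch under assumptions. It expects each way as a plain dict of its tags, the helper names are hypothetical, and the sample tag values are made up around names from the output above.

```python
TIGER_NAME_PARTS = (
    'tiger:name_direction_prefix',
    'tiger:name_base',
    'tiger:name_type',
    'tiger:name_direction_suffix',
)

def assembled_name(tags):
    """Rebuild a street name from the split-up TIGER parts in `tags`."""
    return ' '.join(tags[key].strip() for key in TIGER_NAME_PARTS
                    if tags.get(key, '').strip())

def compare_full_names(ways):
    """Sort name_full values by whether the split-up parts reproduce them."""
    matching, differing = [], []
    for tags in ways:
        full = tags.get('tiger:name_full')
        if full is None:
            continue  # this way has no tiger:name_full tag at all
        if full.strip() == assembled_name(tags):
            matching.append(full)
        else:
            differing.append(full)
    return matching, differing

# Made-up tag dictionaries for three ways
sample_ways = [
    {'tiger:name_full': 'Autumn King Ave',
     'tiger:name_base': 'Autumn King', 'tiger:name_type': 'Ave'},
    {'tiger:name_full': 'Bay Course Ct'},  # no split-up parts present
    {'tiger:name_base': 'Pecos', 'tiger:name_type': 'Rd'},  # no name_full
]
matching, differing = compare_full_names(sample_ways)
```

Entries that end up in `differing` would be the ones worth inspecting by hand.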

Another thing I can see here relates to the issue with the `tiger:name_type` values above: the entry `Bethel Mill St:S Bethel Mill St`. I looked this street up on Google Maps (https://goo.gl/maps/8onJzAYjZoK2) and found that it leads *S*. However, there seems to be no _S Bethel Mill St_ recorded.

This makes me suspect that the `:` notation might be used to indicate **alternative namings** for certain ways.

Haha, still don't know what to do with it. :)

---

Okay. This is too complicated for me. I'll cut it down to cleaning some other part of the dataset.

I like nature. So here we go.

## Cleaning nature

In [96]:
leisure = re.compile(r'leisure')
number_of_specific_attribs('node', leisure)

leisure    18
Name: tag attributes for "node" matching "leisure", dtype: int64

In [97]:
get_attrib_values('node', leisure.pattern)

park             12
picnic_table      1
playground        1
sports_centre     1
swimming_pool     3
Name: amounts of values for "leisure", dtype: int64

In [98]:
number_of_specific_attribs('way', leisure)

leisure    162
Name: tag attributes for "way" matching "leisure", dtype: int64

In [99]:
get_attrib_values('way', leisure.pattern)

common                1
garden                2
golf_course           1
hockey                1
nature_reserve        1
park                 50
pitch                71
playground            7
recreation_ground     1
stadium               3
swimming_pool        23
track                 1
Name: amounts of values for "leisure", dtype: int64