# Cleaning Las Vegas
A digital walk to get to know my surroundings through the computer, using Data Wrangling.

## Getting the data
I chose to investigate the **MapZen metro extract** of **Las Vegas, Nevada** found here: https://mapzen.com/data/metro-extracts
(roughly covering this area: https://www.openstreetmap.org/export#map=12/36.1750/-115.1372)

My flight to the USA landed here, and I felt like trying an approach that is new to me for discovering the place I am staying in. So I've decided to use the Las Vegas dataset to programmatically investigate and clean the available OSM data.

I want to get to know which streets surround the place where I am staying, and to learn about the size of the city, its sights, and what people find important enough to map here.

I'll be spending my days here mostly inside, taking digital walks and discovery tours by diving into the OSM data. On my way I may also do something good for this city by cleaning some of the data well enough that I can actually contribute a little. :)

## Inspecting
First I'll take a look at the data I will be working with.

In [29]:
import os
#las_vegas_osm = 'las-vegas_nevada.osm'
# for testing and developing purposes, here's the truncated version:
las_vegas_osm = 'LV_truncated.osm'
file_size = os.path.getsize(las_vegas_osm)
print 'File Size in Bytes:', file_size
print 'File Size in MB:   ', file_size / (2**20)

File Size in Bytes: 18702424
File Size in MB:    17


In [25]:
import xml.etree.cElementTree as ET
import pprint

def count_tags(filename):
    '''Creates a dictionary with the tags present in the dataset, alongside a count for each'''
    tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tag_dict:
            tag_dict[elem.tag] = 1
        else:
            tag_dict[elem.tag] += 1
    return tag_dict


#las_vegas_osm_dict = count_tags('las-vegas_nevada.osm')
las_vegas_osm_dict = count_tags(las_vegas_osm)

Which tags are present in the dataset, and how many of them?

In [26]:
import pandas as pd

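# use a separate name, so that las_vegas_osm keeps holding the file name for the cells below
las_vegas_tag_counts = pd.Series(las_vegas_osm_dict, name='tags and their amounts')
las_vegas_tag_counts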

member         281
nd          100995
node         82011
osm              1
relation        31
tag          54515
way           9187
Name: tags and their amounts, dtype: int64
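
The same count can be written more compactly with `collections.Counter`. This is only a sketch, not the code that produced the numbers above: the name `count_tags_counter` and the tiny sample document are made up for illustration, and clearing each element after it has been counted is an optional extra that should keep memory use low on the full extract. It uses `xml.etree.ElementTree`, which is available on both Python 2 and 3.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_counter(source):
    '''Counts how often each XML tag occurs in an OSM file (path or file object).'''
    tag_counts = Counter()
    # without an events argument, iterparse only reports "end" events
    for event, elem in ET.iterparse(source):
        tag_counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that are no longer needed
    return tag_counts

# tiny made-up OSM snippet, just to show the call
sample = io.BytesIO(b'<osm><node id="1"/><node id="2"/><way id="3"><nd ref="1"/></way></osm>')
sample_counts = count_tags_counter(sample)
```

A `Counter` converts to a pandas Series the same way a plain dictionary does, so the rest of the notebook would not need to change.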

In [30]:
way_keys = {}
for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
    if elem.tag == 'way':
        for tag in elem.iter('tag'):
            if tag.attrib['k'] not in way_keys:
                way_keys[tag.attrib['k']] = 1
            else:
                way_keys[tag.attrib['k']] += 1

In [50]:
# making a general function for this
def counting_attributes(way_or_node):
    way_or_node_keys = {}
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            for tag in elem.iter('tag'):
                if tag.attrib['k'] not in way_or_node_keys:
                    way_or_node_keys[tag.attrib['k']] = 1
                else:
                    way_or_node_keys[tag.attrib['k']] += 1
    return way_or_node_keys
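
One caveat about `events=('start',)`: according to the Python documentation, a `start` event only guarantees that the element's attributes are available, while its children may or may not have been parsed yet. `elem.iter('tag')` can therefore miss tags, and counts may come out too low. Below is a hedged sketch of the same counter listening for `end` events instead. The function name and the sample data are made up, and it has not been run on the Las Vegas file, so the numbers shown in this notebook still come from the original functions.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def counting_attributes_on_end(source, way_or_node):
    '''Counts the "k" attributes of all <tag> children of <way> or <node> elements.'''
    key_counts = Counter()
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == way_or_node:
            # at the "end" event all children of elem have been parsed
            for tag in elem.iter('tag'):
                key_counts[tag.attrib['k']] += 1
            elem.clear()  # free memory once the element is processed
    return key_counts

# made-up sample: two ways and one node
sample = io.BytesIO(
    b'<osm>'
    b'<way id="1"><tag k="highway" v="residential"/><tag k="name" v="Penny Ln"/></way>'
    b'<way id="2"><tag k="highway" v="service"/></way>'
    b'<node id="3" lat="36.1" lon="-115.1"><tag k="highway" v="crossing"/></node>'
    b'</osm>'
)
way_key_counts = counting_attributes_on_end(sample, 'way')
```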

In [54]:
all_way_keys = pd.Series(counting_attributes('way'), name='types of tags on ways')

In [55]:
# displaying the more common way key attributes in alphabetical order
most_used_way_keys = all_way_keys[all_way_keys.values > 500]
most_used_way_keys

building            583
color               557
footway             772
highway            6793
name               4433
natural             704
oneway              613
review             1463
source             2370
tiger:cfcc         2763
tiger:county       2772
tiger:name_base    2664
tiger:name_type    2518
tiger:reviewed     2571
tiger:separated    1855
tiger:source       1957
tiger:tlid         1972
tiger:zip_left     2313
tiger:zip_right    2262
Name: types of tags on ways, dtype: int64

There's a lot of `tiger:` data. I did not know what this was, so I looked it up on the OSM wiki:
http://wiki.openstreetmap.org/wiki/TIGER

So let's check which TIGER tags appear in my map section, and how many of each:

In [76]:
def number_of_specific_attribs(way_or_node, regex):
    '''Returns the attributes specified through the input and how often they occur
    
    Takes as input a primary XML tag that holds tags in this dataset ("way" or "node")
    and a regular expression to match the desired attributes of these tags
    Returns a pandas Series object mapping the attributes to the number of their occurrences.
    '''
    import re
    import pandas as pd
    regex_keys = {}
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            for tag in elem.iter('tag'):
                if re.search(regex, tag.attrib['k']):
                    if tag.attrib['k'] not in regex_keys:
                        regex_keys[tag.attrib['k']] = 1
                    else:
                        regex_keys[tag.attrib['k']] += 1
    regex_series = pd.Series(regex_keys, name='tag attributes for "%s" matching "%s"' %(way_or_node, regex.pattern))
    return regex_series

In [77]:
import re
tiger_attribs = re.compile(r'^tiger:[a-z_]*$')
all_tiger = number_of_specific_attribs('way', tiger_attribs)
all_tiger

tiger:cfcc                     2763
tiger:county                   2772
tiger:mtfcc                      56
tiger:name_base                2664
tiger:name_direction_prefix     358
tiger:name_direction_suffix       2
tiger:name_full                  54
tiger:name_type                2518
tiger:reviewed                 2571
tiger:separated                1855
tiger:source                   1957
tiger:tlid                     1972
tiger:upload_uuid               254
tiger:zip_left                 2313
tiger:zip_right                2262
Name: tag attributes for "way" matching "^tiger:[a-z_]*$", dtype: int64

In [78]:
# I've written this function to better inspect what type of data the different attributes contain
def get_attrib_values(way_or_node, attribute):
    '''Collects all the "v" values for the given "k" attribute and counts their occurrences
    
    Takes as input a primary XML tag that holds tags in this dataset ("way" or "node")
    and a string pertaining to an existing attribute 
    Returns a pandas Series object mapping the "v" values of the attribute to the number of their occurrences.
    '''
    attribute_values = {}
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == attribute:
                    if tag.attrib['v'] not in attribute_values:
                        attribute_values[tag.attrib['v']] = 1
                    else:
                        attribute_values[tag.attrib['v']] += 1
    attribute_values_series = pd.Series(attribute_values, name='amounts of values for "%s"' %(attribute))
    return attribute_values_series

In [80]:
tiger_values = get_attrib_values('way', 'tiger:name_base')
# look at a few of the values
tiger_values_top = tiger_values[tiger_values.values > 3]
tiger_values_top

Arroyo Grande                4
Buffalo                      6
Decatur                      5
Desert Inn                   4
Durango                      4
Fort Apache                  5
Heritage                     4
Jones                        7
Martin L King                4
Park                         4
Pecos                        8
Sahara                       7
Stewart                      4
Tenaya                       5
Torrey Pines                 4
Town Center                  5
Union Pacific Railroad      13
United States Highway 95     9
Name: amounts of values for "tiger:name_base", dtype: int64

These look like street names or place names (there's also a lake somewhere among them), but none of them carries a street type such as `St` or `Ave`.

From the OSM wiki page on TIGER data, however, I know that the street names were put together like this: `"#{fedirp} #{fename} #{fetype} #{fedirs}".strip`. When the data was imported into OSM, the aim was to split the road information into separate attributes.

That is why there are attributes for `name_direction_prefix`, `name_base`, `name_type` and `name_direction_suffix`, which together can form, for example, a street name.
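
As a rough Python counterpart of that format string, the four parts could be joined like this. It is a sketch under the assumption that the tags of one way are available as a plain dictionary; `join_tiger_name` is a made-up helper name, and the example values are only for illustration.

```python
def join_tiger_name(tags):
    '''Joins the split TIGER name parts of one way into a single street name.'''
    part_order = [
        'tiger:name_direction_prefix',  # fedirp
        'tiger:name_base',              # fename
        'tiger:name_type',              # fetype
        'tiger:name_direction_suffix',  # fedirs
    ]
    # skip parts that are missing or empty, so no stray spaces appear
    parts = [tags[key].strip() for key in part_order if tags.get(key, '').strip()]
    return ' '.join(parts)

example = join_tiger_name({
    'tiger:name_direction_prefix': 'W',
    'tiger:name_base': 'Owens',
    'tiger:name_type': 'Ave',
})
# example is now 'W Owens Ave'
```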

I also checked how many distinct values there are:

In [81]:
print len(tiger_values)
all_tiger['tiger:name_base'] > len(tiger_values)

2425


True

Since this number is lower than the total count of occurrences, some names appear more than once. I might investigate this further: the duplicates might be legitimate, but they could also be redundant.

Next I looked at the actual street type values, to see what is there.

In [82]:
nametype_list = get_attrib_values('way', 'tiger:name_type')

In [90]:
# Series.order() only exists in pandas < 0.20; sort_values() replaces it from 0.17 on
nametype_list.sort_values(ascending=False)

Ave               491
Dr                468
St                431
Ct                376
Ln                190
Way               165
Rd                129
Cir               120
Blvd               52
Pl                 51
Pky                18
Ter                13
Trl                 3
Dr; Dr; Dr; Rd      3
Ctr                 2
Xing                1
Cv                  1
Way; Rd; Way        1
Rd; Blvd            1
St:Trl              1
Aly                 1
Name: amounts of values for "tiger:name_type", dtype: int64

Most of these abbreviations seem to be legitimate (according to the TIGER Appendix D (2000) that I found here: http://cugir.mannlib.cornell.edu/metadata/cens2000/TIGER2000.pdf).
However, there are some strange values, such as `Dr; Dr; Dr; Rd`, that might stem from a difference in how TIGER and OSM represent ways.

My assumption is that these values describe several sections of one way element, since each individual part is valid on its own. The example above could then represent a Drive leading to a Drive leading to a Drive leading to a Road.

Similarly, the differently formatted `St:Trl` could stand for a Street leading to a Trail.

I don't really know how to deal with these values, because I am uncertain which information to discard and which to keep.
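
Even without deciding what to keep, it may help to at least see the distinct parts of such a value. Here is a small sketch that splits on `;` and `:` and drops repeats. Treating both characters as separators is an assumption on my side, and `split_multi_value` is a made-up helper name.

```python
import re

def split_multi_value(value):
    '''Splits a multi-valued TIGER string and returns its distinct parts in order.'''
    # assumption: both ";" and ":" separate alternative values
    parts = [part.strip() for part in re.split(r'[;:]', value)]
    distinct = []
    for part in parts:
        if part and part not in distinct:
            distinct.append(part)
    return distinct

drive_road = split_multi_value('Dr; Dr; Dr; Rd')   # ['Dr', 'Rd']
street_trail = split_multi_value('St:Trl')         # ['St', 'Trl']
```

A value that comes back with more than one part could then be set aside for a manual look instead of being cleaned automatically.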

---

Another attribute that looked interesting to me was `tiger:name_full`, because it isn't mentioned in the OSM wiki and its name suggests that it holds the whole TIGER street name in one place (that is, just like the "normal" OSM attribute `addr:street`).

In [91]:
namefull_list = get_attrib_values('way', 'tiger:name_full')
namefull_list.head()

Autumn King Ave                    1
Bay Course Ct                      1
Bethel Mill St:S Bethel Mill St    1
Crooked Putter Dr                  1
E Blue Rosalie Pl                  2
Name: amounts of values for "tiger:name_full", dtype: int64

And indeed, it does!

In [92]:
all_tiger['tiger:name_full']

54

A quick check of how many there are makes it clear to me that this is not a copy of all the split-up data. The entries might still be copies of **some** entries that were split up, or they might be unique entries.

To determine this, I'll need to investigate further.
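
One way to approach this could look like the sketch below. It assumes the TIGER tags are available as one dictionary per way, collected in a list, and it sorts every `tiger:name_full` value by whether the split-up parts on the same way reproduce it. The function name is made up and the sample dictionaries are invented for illustration (only the street names are borrowed from the output above).

```python
def compare_name_full(street_name_dicts):
    '''Sorts tiger:name_full values by how they relate to the split-up name parts.'''
    part_order = ['tiger:name_direction_prefix', 'tiger:name_base',
                  'tiger:name_type', 'tiger:name_direction_suffix']
    result = {'same': [], 'different': [], 'no_parts': []}
    for tags in street_name_dicts:
        if 'tiger:name_full' not in tags:
            continue
        joined = ' '.join(tags[key] for key in part_order if key in tags)
        if not joined:
            # the way has only the full name, so it is not a copy of split-up data
            result['no_parts'].append(tags['tiger:name_full'])
        elif joined == tags['tiger:name_full']:
            result['same'].append(tags['tiger:name_full'])
        else:
            result['different'].append(tags['tiger:name_full'])
    return result

# invented tag combinations, only to show the three cases
sample = [
    {'tiger:name_full': 'Autumn King Ave', 'tiger:name_base': 'Autumn King', 'tiger:name_type': 'Ave'},
    {'tiger:name_full': 'Bay Course Ct'},
    {'tiger:name_base': 'Penny', 'tiger:name_type': 'Ln'},
]
comparison = compare_name_full(sample)
```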

Another thing I can see here, which relates to the issue with the `tiger:name_type` values above, is the entry `Bethel Mill St:S Bethel Mill St`. I looked this street up on Google Maps (https://goo.gl/maps/8onJzAYjZoK2) and found that it runs south (*S*). However, there seems to be no _S Bethel Mill St_ recorded.

This makes me suspect that the `:` notation might be used to indicate **alternative namings** for certain ways.

Haha, still don't know what to do with it. :)

---

Okay. This is too complicated for me. I'll cut it down to cleaning some other part of the dataset.

I like nature. So here we go.

## Cleaning nature

In [96]:
leisure = re.compile(r'leisure')
number_of_specific_attribs('node', leisure)

leisure    18
Name: tag attributes for "node" matching "leisure", dtype: int64

In [97]:
get_attrib_values('node', leisure.pattern)

park             12
picnic_table      1
playground        1
sports_centre     1
swimming_pool     3
Name: amounts of values for "leisure", dtype: int64

In [98]:
number_of_specific_attribs('way', leisure)

leisure    162
Name: tag attributes for "way" matching "leisure", dtype: int64

In [99]:
get_attrib_values('way', leisure.pattern)

common                1
garden                2
golf_course           1
hockey                1
nature_reserve        1
park                 50
pitch                71
playground            7
recreation_ground     1
stadium               3
swimming_pool        23
track                 1
Name: amounts of values for "leisure", dtype: int64

---

Ah, I made a mistake. The attribute I was looking for is called "natural", not "nature". Let's try again.

In [100]:
natural = re.compile(r'natural')
number_of_specific_attribs('node', natural)

natural    95
Name: tag attributes for "node" matching "natural", dtype: int64

In [101]:
get_attrib_values('node', natural.pattern)

bay       11
beach      2
cliff      1
peak       2
spring     6
tree      73
Name: amounts of values for "natural", dtype: int64

In [102]:
number_of_specific_attribs('way', natural)

natural    704
Name: tag attributes for "way" matching "natural", dtype: int64

In [103]:
get_attrib_values('way', natural.pattern)

beach         1
cliff       558
desert       21
heath         9
mud           1
sand         62
scrub         9
tree_row      1
water        38
wetland       2
wood          2
Name: amounts of values for "natural", dtype: int64

... :')
This all looks good, nothing to clean here...

In [115]:
def get_coord_of_values(value):
    '''Collects the [lat, lon] coordinates of all nodes that carry a tag with the given value'''
    coordinates = []
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == 'node':
            for tag in elem.iter('tag'):
                if tag.attrib['v'] == value:
                    coordinates.append([elem.attrib['lat'], elem.attrib['lon']])
    return coordinates

In [116]:
get_coord_of_values('picnic_table')

[['36.1158199', '-115.0314977']]

In [117]:
get_coord_of_values('beach')

[['36.0385906', '-114.791374'], ['36.4302555', '-114.3635864']]
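
Note that `get_coord_of_values` matches on the value alone, so a node with, say, `name=beach` would be returned as well. A slightly stricter variant could match key and value together and return the coordinates as floats. This is a sketch with a made-up function name and made-up sample nodes; it also listens for `end` events, so that the child tags are sure to be parsed.

```python
import io
import xml.etree.ElementTree as ET

def get_coord_of_key_value(source, key, value):
    '''Returns (lat, lon) float pairs of all nodes tagged with the given key and value.'''
    coordinates = []
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'node':
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == key and tag.attrib['v'] == value:
                    coordinates.append((float(elem.attrib['lat']), float(elem.attrib['lon'])))
            elem.clear()  # free memory once the node is processed
    return coordinates

# made-up sample: only the first node is a natural=beach
sample = io.BytesIO(
    b'<osm>'
    b'<node id="1" lat="36.5" lon="-114.25"><tag k="natural" v="beach"/></node>'
    b'<node id="2" lat="36.25" lon="-115.5"><tag k="name" v="beach"/></node>'
    b'</osm>'
)
beaches = get_coord_of_key_value(sample, 'natural', 'beach')
```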

I'm really wondering what the **bay**s around Las Vegas are, so I'll get their coordinates and take a look.

In [118]:
get_coord_of_values('bay')

[['36.1105386', '-114.4069172'],
 ['36.0052641', '-114.2452436'],
 ['36.1091485', '-114.5171988'],
 ['36.1274801', '-114.6202586'],
 ['36.4469221', '-114.3344185'],
 ['36.2991464', '-114.4124759'],
 ['36.1647032', '-114.499977'],
 ['35.8244294', '-114.7013665'],
 ['36.0919224', '-114.8174876'],
 ['36.1235891', '-114.7824871'],
 ['36.2760918', '-114.373863']]

Okay, these are all bays in the lake that was created when the **Colorado River** was dammed for electricity by the **Hoover Dam**. Some are actually already in Arizona, since the state border follows the original course of the Colorado River.

---

Okay. I figured out that the roads and streets are all under the attribute 'highway'. So I'll take another look at it.

(Currently sitting by a tiny stream next to the huge Hard Rock Hotel complex, sunshine on my back :)

In [120]:
number_of_specific_attribs('way', re.compile('highway'))

highway    6793
Name: tag attributes for "way" matching "highway", dtype: int64

In [121]:
all_paths = get_attrib_values('way', 'highway')

In [122]:
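# in-place Series.sort() was removed from later pandas versions; sort_values() returns a sorted copy
all_paths = all_paths.sort_values(ascending=False)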

In [123]:
all_paths

residential       3975
service           1014
footway            894
path               165
tertiary           157
track              151
secondary          144
motorway_link       96
motorway            56
unclassified        33
cycleway            26
secondary_link      15
construction        14
trunk               11
pedestrian           8
steps                7
proposed             7
bridleway            5
tertiary_link        5
road                 3
raceway              3
trunk_link           2
primary              1
primary_link         1
Name: amounts of values for "highway", dtype: int64

In [124]:
number_of_specific_attribs('node', re.compile('highway'))

highway    1499
Name: tag attributes for "node" matching "highway", dtype: int64

In [125]:
all_node_paths = get_attrib_values('node', 'highway')

In [126]:
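# in-place Series.sort() was removed from later pandas versions; sort_values() returns a sorted copy
all_node_paths = all_node_paths.sort_values(ascending=False)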
all_node_paths

crossing             705
turning_circle       692
traffic_signals       59
motorway_junction     22
street_lamp            8
stop                   5
bus_stop               5
passing_place          1
overhead_sign          1
intersection           1
Name: amounts of values for "highway", dtype: int64

I'm kinda confused as to where to find the street names (if there are any others than the TIGER-coded ones).

In [128]:
get_attrib_values('node', 'attrib:street_name')

Series([], name: amounts of values for "attrib:street_name", dtype: float64)

In [131]:
street = re.compile(r'[a-z0-9_:]*street[a-z0-9_:]*')
number_of_specific_attribs('way', street)

addr:street    33
Name: tag attributes for "way" matching "[a-z0-9_:]*street[a-z0-9_:]*", dtype: int64

In [135]:
get_attrib_values('way', 'addr:street')

Birtcher Drive                1
Boulder Highway               1
Conestoga Way                 1
East Horizon Drive            1
East Horizon Ridge Parkway    1
El Camino Rd                  1
Fremont Street                2
Humboldt North Drive          1
Las Vegas Blvd. South         1
Las Vegas Boulevard South     2
Losee Rd                      1
Main Street                   1
Nevada Highway                1
South 4th Street              1
South 7th Street              1
South Boulder Highway         4
South Durango Drive           1
South Las Vegas Boulevard     1
South Rainbow Boulevard       1
Syracuse Drive                1
West Charleston Boulevard     1
West Flamingo Road            1
West Hacienda Avenue          1
West Horizon Ridge Parkway    1
West Lake Mead Blvd.          1
West Smoke Ranch Road         1
Wild Chive Avenue             2
Name: amounts of values for "addr:street", dtype: int64

In [136]:
number_of_specific_attribs('node', street)

addr:street    33
Name: tag attributes for "node" matching "[a-z0-9_:]*street[a-z0-9_:]*", dtype: int64

In [137]:
get_attrib_values('node', 'addr:street')

A Sahara Avenue               1
Arville Street                1
Boulder Highway               2
Citadel Circle                1
Dean Martin Drive             1
East Desert Inn Road          1
East Flamingo Road            2
East Sahara Avenue            1
East Tropicana Avenue         1
Executive Airport Drive       1
Fremont Street                1
Industrial Road               1
Nevada Way                    1
North Green Valley Parkway    1
Paradise Road                 1
Polaris Avenue                1
S Las Vegas Blvd Suite 390    1
S Paradise Road               1
S. Eastern Ave                1
South Decatur Boulevard       1
South Durango Drive           1
South Jones Boulevard         1
South Las Vegas Boulevard     1
South Nellis Boulevard        1
Spiced Wine Avenue            1
Via Bel Canto                 1
West Charleston Boulevard     1
West Cheyenne Avenue          1
West Sahara Avenue            1
West Tropicana Avenue         1
Wetlands Park Lane            1
Name: amounts of values for "addr:street", dtype: int64

Whoa! So strange! Both the nodes and the ways have exactly 33 `addr:street` tags. At first I assumed there was a mistake in my code, but the two results show different street names!

Weird. So what should I do with this?

Well, one thing is obvious (and it's what I came to check): most street names seem to be coded **only** in the TIGER data format.

To do something good for the dataset, I'm considering collecting all this data and inserting it, unified, into the XML under the suggested attribute `addr:street`.
I guess I could do this for both the nodes and the ways. I would otherwise leave the TIGER data untouched, because it may still be useful (the wiki page mentions something about GPS tracking).

Yep. This could work. I could write the street names gathered from the TIGER data back into the nodes and ways by simply adding a tag with the corresponding k and v values.

Now the only question is how to do it :)

In [141]:
all_tiger.keys()

Index([u'tiger:cfcc', u'tiger:county', u'tiger:mtfcc', u'tiger:name_base', u'tiger:name_direction_prefix', u'tiger:name_direction_suffix', u'tiger:name_full', u'tiger:name_type', u'tiger:reviewed', u'tiger:separated', u'tiger:source', u'tiger:tlid', u'tiger:upload_uuid', u'tiger:zip_left', u'tiger:zip_right'], dtype='object')

In [149]:
all_street_tiger = list(all_tiger.keys()[3:8])
all_street_tiger

['tiger:name_base',
 'tiger:name_direction_prefix',
 'tiger:name_direction_suffix',
 'tiger:name_full',
 'tiger:name_type']

In [156]:
# locate the values for all existing TIGER tags
def get_tiger_values(way_or_node):
    '''Collects all the "v" values for the TIGER street attributes in a list of dicts'''
    tiger_values = []
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == way_or_node:
            street_name_dict = {}
            for tag in elem.iter('tag'):
                if tag.attrib['k'] in all_street_tiger:
                    street_name_dict[tag.attrib['k']] = tag.attrib['v']
            if street_name_dict != {}:
                tiger_values.append(street_name_dict)
    return tiger_values

In [184]:
street_name_list = get_tiger_values('way')

In [185]:
# put them into the right order and concatenate as a string
def make_string_list(street_name_list):
    '''Concatenates the TIGER street name information into a string.
    
    Takes as input a list of dictionaries containing TIGER street attributes and corresponding values
    Concatenates each dictionary into a properly ordered string and adds it to a list
    Returns a list of street names as strings.
    '''
    all_name_list = []
    for name_dict in street_name_list:
        street_name = ""
        if 'tiger:name_direction_prefix' in name_dict.keys():
            street_name += name_dict['tiger:name_direction_prefix'] +' '
        if 'tiger:name_base' in name_dict.keys():
            street_name += name_dict['tiger:name_base'] +' '
        if 'tiger:name_type' in name_dict.keys():
            street_name += name_dict['tiger:name_type'] +' '
        if 'tiger:name_direction_suffix' in name_dict.keys():
            # the previous part already ends with a space, so no extra one is needed
            street_name += name_dict['tiger:name_direction_suffix']
        all_name_list.append(street_name.rstrip())
    return all_name_list

In [187]:
all_name_list = make_string_list(street_name_list)
all_name_list[0:5]

['Rutgers Dr',
 'Fairfield Ave',
 'Pebble Grey Ln',
 'E Airport Sht Term Park Rd',
 'Penny Ln']
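
The concatenated names use the TIGER abbreviations (`Dr`, `Ave`, `Ln`), whereas most of the existing `addr:street` values listed earlier spell the street type out. If the two should match, the last word could be expanded with a lookup table. The sketch below assumes a small, incomplete mapping that I made up from the `tiger:name_type` values seen above. It only touches the last word, so a leading `St` is left alone.

```python
# Assumed mapping, covering only a few of the street types seen in tiger:name_type
TYPE_EXPANSIONS = {
    'Ave': 'Avenue', 'Blvd': 'Boulevard', 'Ct': 'Court', 'Dr': 'Drive',
    'Ln': 'Lane', 'Rd': 'Road', 'St': 'Street',
}

def expand_street_type(street_name):
    '''Expands an abbreviated street type at the end of a name, if it is in the mapping.'''
    words = street_name.split()
    if not words:
        return street_name
    last = words[-1].rstrip('.')  # also catches forms like "Blvd."
    if last in TYPE_EXPANSIONS:
        words[-1] = TYPE_EXPANSIONS[last]
    return ' '.join(words)

expanded = [expand_street_type(name) for name in ['Rutgers Dr', 'Penny Ln', 'Fremont Street']]
```

Names whose street type is not the last word, such as `Las Vegas Blvd. South`, are not changed by this and would need their own handling.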

In [177]:
len(all_name_list)

2664

In [188]:
# create this structure: <tag k="addr:street" v="concatenated_TIGER_street_name" />
def create_OSM_street_name_tag(street_name):
    '''Creates a string representing a valid OSM tag containing street name information'''
    return '<tag k="addr:street" v="%s" />' %(street_name)

In [189]:
# checking out the results of this...
import random
for i in range(5):
    print create_OSM_street_name_tag(random.choice(all_name_list))

<tag k="addr:street" v="Strutz Ave" />
<tag k="addr:street" v="W Owens Ave" />
<tag k="addr:street" v="Crooked Creek Ave" />
<tag k="addr:street" v="S; N Lamb Blvd" />
<tag k="addr:street" v="Pink Cliff Ct" />
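
A side note on building the tag with string formatting: a street name containing `&` or `"` would end up unescaped and produce invalid XML. Letting ElementTree build the element avoids this, because it escapes attribute values when serializing. This is a sketch, with a made-up function name and a made-up street name.

```python
import xml.etree.ElementTree as ET

def create_osm_street_name_element(street_name):
    '''Builds a <tag> element for addr:street; the value is escaped on serialization.'''
    return ET.Element('tag', {'k': 'addr:street', 'v': street_name})

# invented name, chosen because it contains a character that needs escaping
tag_xml = ET.tostring(create_osm_street_name_element('Smith & Jones Rd'))
# the "&" is written as "&amp;" in tag_xml
```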


This causes some problems! E.g.: `<tag k="addr:street" v="S; N Lamb Blvd" />`.

It seems that some TIGER data has more than one value for some of the listed attributes, resulting in names like `S; N`.
Maybe I'm not gonna deal with this... :)

But I guess it should be noted!

I could add a check at the end that flags such entries (the keys `todo` or `fixme` offer themselves), with a short explanation of the issue.

In [190]:
# create a new tag in the same parent tag (<node> or <way>) and add it with the previously computed structure
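
A possible shape for this step, as a sketch only: it assumes the `<way>` or `<node>` is available as an ElementTree element, skips elements that already have `addr:street`, and adds a `fixme` tag when the name contains `;` or `:`. The function name, the wording of the `fixme` text and the sample element are all made up.

```python
import xml.etree.ElementTree as ET

PART_ORDER = ['tiger:name_direction_prefix', 'tiger:name_base',
              'tiger:name_type', 'tiger:name_direction_suffix']

def add_street_tag(element):
    '''Adds an addr:street tag built from the TIGER name parts to a <way> or <node>.

    Returns True if a tag was added. Elements that already carry addr:street,
    or that have no TIGER name parts, are left untouched.
    '''
    tags = dict((tag.attrib['k'], tag.attrib['v']) for tag in element.findall('tag'))
    if 'addr:street' in tags:
        return False
    street_name = ' '.join(tags[key] for key in PART_ORDER if key in tags)
    if not street_name:
        return False
    ET.SubElement(element, 'tag', {'k': 'addr:street', 'v': street_name})
    if ';' in street_name or ':' in street_name:
        # mark names that were built from multi-valued TIGER tags for manual review
        ET.SubElement(element, 'tag',
                      {'k': 'fixme', 'v': 'addr:street built from multi-valued TIGER tags'})
    return True

# made-up way that reproduces the "S; N Lamb Blvd" case
sample_way = ET.fromstring(
    '<way id="1">'
    '<tag k="tiger:name_direction_prefix" v="S; N"/>'
    '<tag k="tiger:name_base" v="Lamb"/>'
    '<tag k="tiger:name_type" v="Blvd"/>'
    '</way>'
)
added = add_street_tag(sample_way)
```

For the whole file, one option would be to load it with `ET.parse`, call the function on every way and node, and save the result with the tree's `write` method. On the full extract this needs enough memory to hold the entire tree.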


