# Cleaning the streets of Las Vegas
A digital walk to get to know my surroundings through the computer, using Data Wrangling.

## Getting the data
I chose to investigate the **MapZen metro extract** of **Las Vegas, Nevada** found here: https://mapzen.com/data/metro-extracts
(it probably covers roughly this area: https://www.openstreetmap.org/export#map=12/36.1750/-115.1372)

My flight to the USA landed here, and I felt like trying an approach that is new to me for discovering the place I am staying in. So I decided to take the OSM data available for Las Vegas and investigate and clean it programmatically.

I want to get to know which streets are around the place where I am staying, and learn about the size of the city, its sights, and what people find important enough to map here.

I'll be spending my days here mostly inside, taking digital walks and discovery tours by diving into the OSM data. On my way I may also do something good for this city by cleaning its streets in a way that lets me actually contribute a little. :)

## Inspecting
First I'll take a look at the data I will be working with.

In [29]:
import os
#las_vegas_osm = 'las-vegas_nevada.osm'
# for testing and developing purposes, here's the truncated version:
las_vegas_osm = 'LV_truncated.osm'
file_size = os.path.getsize(las_vegas_osm)
print 'File Size in Bytes:', file_size
print 'File Size in MB:   ', file_size / (2**20)

File Size in Bytes: 18702424
File Size in MB:    17


In [25]:
import xml.etree.cElementTree as ET
import pprint

def count_tags(filename):
    '''Creates a dictionary with the tags present in the dataset, alongside a count for each'''
    tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tag_dict:
            tag_dict[elem.tag] = 1
        else:
            tag_dict[elem.tag] += 1
    return tag_dict


#las_vegas_osm_dict = count_tags('las-vegas_nevada.osm')
las_vegas_osm_dict = count_tags(las_vegas_osm)
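
The same count can be written a little more compactly with `collections.Counter`. The following is only a sketch of that idea: the name `count_tags_counter` and the tiny sample document are made up for illustration and are not part of the Las Vegas extract. It also clears each element after counting it, which may help keep memory use low on the full file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_counter(source):
    '''Counts element tags in an XML file (path or file-like object).'''
    counts = Counter()
    # iterparse yields 'end' events by default, i.e. once an element is complete
    for _, elem in ET.iterparse(source):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that are no longer needed
    return dict(counts)

# Tiny hand-written sample, NOT taken from the Las Vegas data
sample = b'''<osm>
  <node id="1"/>
  <node id="2"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>'''

sample_counts = count_tags_counter(io.BytesIO(sample))
print(sample_counts)
```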

Which tags are present in the dataset, and how many of them?

In [26]:
import pandas as pd

# a separate name keeps the file name stored in las_vegas_osm intact for the cells below
tag_counts = pd.Series(las_vegas_osm_dict, name='tags and their amounts')
tag_counts

member         281
nd          100995
node         82011
osm              1
relation        31
tag          54515
way           9187
Name: tags and their amounts, dtype: int64

In [30]:
way_keys = {}
for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
    if elem.tag == 'way':
        for tag in elem.iter('tag'):
            if tag.attrib['k'] not in way_keys:
                way_keys[tag.attrib['k']] = 1
            else:
                way_keys[tag.attrib['k']] += 1

In [31]:
all_way_keys = pd.Series(way_keys, name='types of tags on ways')
for key, value in sorted(way_keys.items()):
    if value > 500:
        print key, ':', value

building : 583
color : 557
footway : 772
highway : 6793
name : 4433
natural : 704
oneway : 613
review : 1463
source : 2370
tiger:cfcc : 2763
tiger:county : 2772
tiger:name_base : 2664
tiger:name_type : 2518
tiger:reviewed : 2571
tiger:separated : 1855
tiger:source : 1957
tiger:tlid : 1972
tiger:zip_left : 2313
tiger:zip_right : 2262
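
One caveat about the loop above: it reads the children of each `way` on `start` events. The ElementTree documentation notes that at a `start` event the children of an element may or may not be present yet, so some `tag` children could be missed. Below is a minimal sketch of the same count on `end` events instead. The function name `count_way_keys` and the sample XML are assumptions for illustration, and I have not checked whether the numbers for the Las Vegas file change with it.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_way_keys(source):
    '''Counts the "k" attributes of all <tag> children of <way> elements.'''
    counts = Counter()
    # on 'end' events an element and all of its children are fully parsed
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                counts[tag.attrib['k']] += 1
            elem.clear()  # free the way once it has been counted
    return counts

# Hand-written sample, NOT taken from the Las Vegas data
sample = b'''<osm>
  <node id="1"><tag k="name" v="ignored, this is a node"/></node>
  <way id="2"><tag k="highway" v="residential"/><tag k="name" v="A St"/></way>
  <way id="3"><tag k="highway" v="service"/></way>
</osm>'''

sample_way_keys = count_way_keys(io.BytesIO(sample))
print(dict(sample_way_keys))
```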


There's a lot of `tiger:` data. I did not know what this was, so I looked it up on the OSM wiki:
http://wiki.openstreetmap.org/wiki/TIGER

So let's check which TIGER keys occur in my map section, and how often:

In [39]:
import re

tiger_keys = {}
for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
    if elem.tag == 'way':
        for tag in elem.iter('tag'):
            if re.search(r'^tiger:[a-z_]*$', tag.attrib['k']):
                if tag.attrib['k'] not in tiger_keys:
                    tiger_keys[tag.attrib['k']] = 1
                else:
                    tiger_keys[tag.attrib['k']] += 1
tiger_keys

{'tiger:cfcc': 2763,
 'tiger:county': 2772,
 'tiger:mtfcc': 56,
 'tiger:name_base': 2664,
 'tiger:name_direction_prefix': 358,
 'tiger:name_direction_suffix': 2,
 'tiger:name_full': 54,
 'tiger:name_type': 2518,
 'tiger:reviewed': 2571,
 'tiger:separated': 1855,
 'tiger:source': 1957,
 'tiger:tlid': 1972,
 'tiger:upload_uuid': 254,
 'tiger:zip_left': 2313,
 'tiger:zip_right': 2262}
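
A note on the pattern used above: `[a-z_]*` only allows lowercase letters and underscores, so a key that contains a digit (for example a numbered variant like `tiger:name_base_1`) would not be matched. Whether such keys occur in this extract is not checked here. The sketch below only compares the original pattern with a relaxed one on a hand-made list of keys.

```python
import re

original_pattern = re.compile(r'^tiger:[a-z_]*$')
# relaxed variant that also accepts digits (an assumption about what is wanted)
relaxed_pattern = re.compile(r'^tiger:[a-z0-9_]*$')

# hand-made example keys, not read from the data
keys = ['tiger:name_base', 'tiger:name_base_1', 'name', 'tiger:zip_left']

matched_original = [k for k in keys if original_pattern.search(k)]
matched_relaxed = [k for k in keys if relaxed_pattern.search(k)]

print(matched_original)
print(matched_relaxed)
```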

In [44]:
# I've written this function to better inspect what type of data the different attributes contain
def get_attrib_values(attribute):
    '''Collects all the "v" values for the given "k" attribute in a set'''
    attribute_values = set()
    for event, elem in ET.iterparse(las_vegas_osm, events=('start',)):
        if elem.tag == 'way':
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == attribute:
                    attribute_values.add(tag.attrib['v'])
    return attribute_values

name_list = get_attrib_values('tiger:name_base')
# look at a few of the values
list(name_list)[:5]

['Villa Granada', 'Cross Creek', 'Rutland', 'Hibbetts', 'Grand Anacapri']

These look like street names or place names (there's also a lake somewhere among them). But none of them comes with a street type ending such as St or Ave.

However, from reading the OSM wiki page about TIGER data, I know that TIGER recorded street names organized like this: `"#{fedirp} #{fename} #{fetype} #{fedirs}".strip`. When that data was imported into OSM, the aim was to split the road information into separate attributes.

That is why attributes exist for `name_direction_prefix_1`, `name_base_1`, `name_type_1` and `name_direction_suffix_1`, which together can form e.g. a street name.
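
To make the composition concrete, here is a small sketch that joins the four parts back into one name, in the order prefix, base, type, suffix. The function `build_tiger_name` and the example dictionary are hypothetical. I am assuming the tags of one way are available as a plain dict and that missing parts are simply left out.

```python
def build_tiger_name(tags):
    '''Joins the TIGER name parts of one way into a single street name.'''
    parts = [
        tags.get('tiger:name_direction_prefix', ''),
        tags.get('tiger:name_base', ''),
        tags.get('tiger:name_type', ''),
        tags.get('tiger:name_direction_suffix', ''),
    ]
    # skip empty parts so that no double spaces appear
    return ' '.join(part for part in parts if part)

# hypothetical tags of a single way
example_tags = {
    'tiger:name_direction_prefix': 'E',
    'tiger:name_base': 'Erie',
    'tiger:name_type': 'Ave',
}
full_name = build_tiger_name(example_tags)
print(full_name)
```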

In [35]:
len(name_list)

2425

In [37]:
nametype_list = get_attrib_values('tiger:name_type')

In [38]:
nametype_list

{'Aly',
 'Ave',
 'Blvd',
 'Cir',
 'Ct',
 'Ctr',
 'Cv',
 'Dr',
 'Dr; Dr; Dr; Rd',
 'Ln',
 'Pky',
 'Pl',
 'Rd',
 'Rd; Blvd',
 'St',
 'St:Trl',
 'Ter',
 'Trl',
 'Way',
 'Way; Rd; Way',
 'Xing'}
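
Some of these values, such as `'Dr; Dr; Dr; Rd'` or `'St:Trl'`, appear to hold several street types in one field. A minimal sketch for splitting them follows, assuming that both `;` and `:` act as separators here. The helper names `split_multi_value` and `unique_name_types` are made up for this example.

```python
import re

def split_multi_value(value):
    '''Splits a tag value on ";" or ":" and strips surrounding whitespace.'''
    return [part.strip() for part in re.split(r'[;:]', value) if part.strip()]

def unique_name_types(values):
    '''Collects the distinct single street types from possibly combined values.'''
    types = set()
    for value in values:
        types.update(split_multi_value(value))
    return types

print(split_multi_value('Dr; Dr; Dr; Rd'))
print(sorted(unique_name_types(['Rd; Blvd', 'St:Trl', 'Way'])))
```

Applied to `nametype_list`, this would reduce the combined entries to the plain street types they consist of.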

In [40]:
namefull_list = get_attrib_values('tiger:name_full')
namefull_list

{'Autumn King Ave',
 'Bay Course Ct',
 'Bethel Mill St:S Bethel Mill St',
 'Crooked Putter Dr',
 'E Blue Rosalie Pl',
 'E Cantabria Heights Ave',
 'E Cortez Bank Way',
 'E Erie Ave',
 'E Grand Cerritos Ave',
 'E Jasmine Grove Way',
 'E Levi St',
 'E Liberty Heights Ave',
 'E Oak Village Ave',
 'E Quaint Acres Ave',
 'E Sheerwater Ave',
 'E Siddall Ave',
 'E Socorro Song Ln',
 'E Tillman Falls Ave',
 'E Via Greca Ave',
 'Edwardian St',
 'Even Par Dr',
 'Glen Iris St',
 'Green Falls Ave',
 'Haplin Ave',
 'Jeffreys St',
 'Laying Up Ct',
 'Osterville St',
 'Real Long Way',
 'Rusty Springs Ct',
 'S Adams Chase St',
 'S African Sunset St',
 'S Amigo St',
 'S Arcadia Sunrise Dr',
 'S Cassleman Ct',
 'S Cherry Brook St',
 'S Corte Sierra St',
 'S Gwynns Falls St',
 'S Ledroit St',
 'S Montana Mountain St',
 'S Orchard St',
 'S Phesant Brook St',
 'S Sunshine Village Pl',
 'S Tawny Buck Ct',
 'S Timber Stand St',
 'S Tranquil Breeze St',
 'S Via Scula',
 'S Viterbo Ave',
 'Spansih Sky Ave',
 'S

In [None]:
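def is_street_name(tag, street_key):
    '''Returns True if the "k" attribute of the given tag element equals street_key'''
    # the key to compare against was still open in the first draft, so it is a parameter here
    return tag.attrib['k'] == street_key

def audit_tag(tag_name, street_key, street_types):
    '''
    Goes through all elements called tag_name (e.g. 'way') and hands the value
    of every tag whose key equals street_key to audit_street_types,
    a helper that still has to be written.
    '''
    # 'end' events make sure the child tags of an element have been parsed
    for event, elem in ET.iterparse(las_vegas_osm, events=('end',)):
        if elem.tag == tag_name:
            for tag in elem.iter('tag'):
                if is_street_name(tag, street_key):
                    audit_street_types(street_types, tag.attrib['v'])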
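
The helper `audit_street_types` called in this cell is not defined anywhere in the notebook yet. One possible shape for it is sketched below, under my own assumptions: the street type is taken to be the last word of the name, and `EXPECTED_TYPES` is a hypothetical list of endings that need no attention. Names with an unexpected ending are grouped by that ending.

```python
import re
from collections import defaultdict

# Hypothetical set of street types that are considered fine as they are
EXPECTED_TYPES = set(['Ave', 'Blvd', 'Ct', 'Dr', 'Ln', 'Pl', 'Rd', 'St', 'Way'])

# last whitespace-delimited word of a name, e.g. "Ave" in "E Erie Ave"
street_type_re = re.compile(r'\S+$')

def audit_street_types(street_types, street_name):
    '''Records street_name under its last word if that word is unexpected.'''
    match = street_type_re.search(street_name.strip())
    if match:
        street_type = match.group()
        if street_type not in EXPECTED_TYPES:
            street_types[street_type].add(street_name)

street_types = defaultdict(set)
for name in ['E Erie Ave', 'Real Long Way', 'S Via Scula', 'Bay Course Ct']:
    audit_street_types(street_types, name)

print(dict(street_types))
```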
    