# OpenStreetMap Project 

OpenStreetMap (OSM)

Include:
- Snippets of code  
- Problematic tags  
- Visualizations if appropriate

Describe the different code files.

---

## Section 1: Problems encountered in the map

### 1a: Cursory audit of 'tag' elements

To develop a rough sense of the data contained in the 'tag' elements of nodes, ways, and relations in the OSM file, a script was run to summarize their contents. In total, the OSM file contained 154,102 tag elements with 446 _unique_ keys. A manual scan of the unique keys revealed at least three large clusters: (i) tags with 'tiger' data, (ii) tags with 'gnis' data, and (iii) tags labeled with 'fixme'. A fourth category was also investigated, namely tags whose keys contain problematic characters as defined in one of the Udacity lessons (_Case study: OpenStreetMap data [SQL] / Quiz: Tag types_). All other keys were classified as 'other'. The tag elements were then sorted into these five categories and counted programmatically, yielding the following results:

|Category |Number of tags          |Fraction of total|
|:--------|:----------------------:|----------------:|
|tiger    |61901                   |0.402            |
|gnis     |1873                    |0.012            |
|fixme    |36                      |0.000            |
|problem  |1                       |0.000            |
|other    |90291                   |0.586            |
|**TOTAL**|154102                  |1.000            |
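The counting step described above could be sketched roughly as follows. This is not the project's actual script: the function names, the prefix-matching rules, the problem-character pattern, and the inline sample XML are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

# Assumed character class for "problematic" keys (modeled on the lesson's idea)
PROBLEM_CHARS = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')

def categorize_key(key):
    """Assign a tag key to one of the five categories (rules are assumptions)."""
    lowered = key.lower()
    if lowered.startswith('tiger'):
        return 'tiger'
    if lowered.startswith('gnis'):
        return 'gnis'
    if 'fixme' in lowered:
        return 'fixme'
    if PROBLEM_CHARS.search(key):
        return 'problem'
    return 'other'

def count_tag_categories(source):
    """Stream through an OSM XML source and count 'tag' elements per category."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=('start',)):
        if elem.tag == 'tag':
            counts[categorize_key(elem.get('k'))] += 1
    return counts

# Hypothetical sample standing in for the real OSM file
sample = BytesIO(b"""<osm>
  <node id="1"><tag k="gnis:feature_id" v="123"/></node>
  <way id="2">
    <tag k="tiger:county" v="X"/>
    <tag k="Hours of Operation" v="9-5"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>""")
counts = count_tag_categories(sample)
total = sum(counts.values())
fractions = {cat: n / float(total) for cat, n in counts.items()}
```

Dividing each count by the total gives the fractions reported in the last column of the table.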

Forty percent of the tag data is from the United States Census Bureau's Topologically Integrated Geographic Encoding and Referencing (TIGER) system. According to OSM's wiki regarding [TIGER fixup][1], a number of issues may be encountered with TIGER data. Since the TIGER database was created for the purpose of guiding census surveys, many of the issues deal with the accuracy of nodes representing roads and boundaries. Also, since the data was uploaded in 2007/2008, some of it is outdated.

The next largest cluster of data, making up about 1% of the tags, is from the United States Geological Survey's Geographic Names Information System (GNIS). According to OSM's wiki regarding [USGS GNIS][2], this data was bulk imported like the TIGER data, and hence contains a number of errors. Many of those errors relate to features that no longer exist.

While issues with the accuracy of geographic locations and with outdated data are beyond the scope of this project, three problems were identified that could be addressed programmatically: (1) keys with problematic characters, (2) overabbreviation of street names, and (3) incorrect zip codes.

### 1b: Keys with problematic characters

To identify keys that contain problematic characters (characters other than alphanumeric and underscore), a regex was run against a dictionary of all the keys aggregated from the OSM file. Only one key with problematic characters was identified - 'Hours of Operation' - which contains spaces. During the conversion of the OSM data to .csv files, the spaces were replaced with underscores with a call to the following function:

~~~~ python
def fix_prob_chars(key):
    '''Eliminate problematic characters from keys'''

    # str.replace returns the key unchanged when it contains no spaces,
    # so every key gets a valid return value
    return key.replace(' ', '_')
~~~~
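The audit that precedes this fix, running a regex over the aggregated keys, might look something like the sketch below. The character class is an assumption modeled on the lesson's notion of problematic characters, and `key_counts` is a hypothetical stand-in for the dictionary of keys built from the OSM file.

```python
import re

# Assumed set of problematic characters; colons are deliberately allowed
# so that namespaced keys such as 'addr:street' are not flagged
PROBLEM_CHARS = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')

def find_problem_keys(keys):
    """Return the keys that contain at least one problematic character."""
    return sorted(k for k in keys if PROBLEM_CHARS.search(k))

# Hypothetical key -> count dictionary aggregated from the OSM file
key_counts = {'addr:street': 120, 'Hours of Operation': 1, 'tiger:cfcc': 45}
problem_keys = find_problem_keys(key_counts)
```

Iterating over the dictionary yields its keys, so the aggregated counts can be passed in directly.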

### 1c: Overabbreviation of street names

A larger issue was found with abbreviations in street names. First, a script was run to compile all the tags with address-related fields. From that compilation, keys named 'addr:street' were identified as the most relevant. A second script was run to capture the last word of each street name string, similar to the approach in the Udacity lesson _Case study: OpenStreetMap data [SQL] / Auditing Street Names_. After the collected candidates were reviewed manually, a mapping dictionary was developed to associate each abbreviation with its full form. During the conversion of the OSM data to .csv files, each street name was checked word by word against the mapping, and any abbreviations found were expanded:

~~~~ python
def fix_street_abbrevs(street):
    '''Expand abbreviations in street names'''
    
    mapping = {
        'ave': 'Avenue',
        'Ave': 'Avenue',
        # ...
        # See code for complete mapping dict
    }
    
    elements = street.split()
    for i in range(len(elements)):
        if elements[i] in mapping:
            elements[i] = mapping[elements[i]]
    updated_street = ' '.join(elements)
    return updated_street
~~~~
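For reference, the last-word audit that produced the candidate abbreviations could be sketched as below. The regex, the `EXPECTED` set, and the sample street names are illustrative assumptions, not the project's actual audit script.

```python
import re
from collections import defaultdict

# Matches the final whitespace-delimited token of a street name
STREET_TYPE_RE = re.compile(r'\S+\.?$')

# Hypothetical list of street types that need no correction
EXPECTED = {'Street', 'Avenue', 'Road', 'Drive'}

def audit_street_types(street_names, expected=EXPECTED):
    """Group street names by their last word when it is not an expected type."""
    unexpected = defaultdict(set)
    for name in street_names:
        match = STREET_TYPE_RE.search(name.strip())
        if match and match.group() not in expected:
            unexpected[match.group()].add(name)
    return dict(unexpected)

# Hypothetical 'addr:street' values
streets = ['West Main Street', 'Elm Ave', 'Oak St.', 'Maple Avenue']
suspects = audit_street_types(streets)
```

Each key of `suspects` is a candidate abbreviation to review by hand before adding it to the mapping dictionary.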

### 1d: Incorrect zip codes

Finally, the OSM file was audited for correct zip codes. A script was run to compile the values of tags with the key 'addr:postcode'. Only two instances were problematic - one with the value '1--', and a second with the value 'West Main Street'. During the conversion to .csv files, both of these zip codes were converted to 'fixme':

~~~~ python
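import re

def fix_zipcode(zipcode):
    '''Check the zipcode for the proper format'''

    # Accept 5-digit codes with an optional 4-digit extension; the trailing
    # '$' rejects values with extra characters after a valid prefix
    zipformat = re.compile(r"^[0-9]{5}(-[0-9]{4})?$")
    if zipformat.match(zipcode):
        return zipcode
    else:
        return 'fixme'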
~~~~
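The audit step that surfaced the two bad values could be approximated as follows. This is a sketch under assumptions: the function name and the valid sample values are hypothetical, and the pattern is anchored at both ends so that only 5-digit or ZIP+4 values pass.

```python
import re
from collections import Counter

# 5 digits, optionally followed by a hyphen and 4 more digits
ZIP_RE = re.compile(r'^[0-9]{5}(-[0-9]{4})?$')

def audit_postcodes(postcodes):
    """Count the 'addr:postcode' values that do not look like ZIP codes."""
    return Counter(pc for pc in postcodes if not ZIP_RE.match(pc))

# The two malformed values come from the audit; the others are made up
values = ['12345', '12345-6789', '1--', 'West Main Street', '12345']
bad = audit_postcodes(values)
```

A `Counter` is used so that repeated bad values show up with their frequency, which helps decide whether a fix should be automated or handled case by case.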

---

## Section 2: Overview of the data





## Section 3: Other ideas about the dataset

Notes:
 - Aggregate keys once for efficiency (done)
 
 
 [1]: http://wiki.openstreetmap.org/wiki/TIGER_fixup "http://wiki.openstreetmap.org/wiki/TIGER_fixup"
 [2]: http://wiki.openstreetmap.org/wiki/USGS_GNIS "http://wiki.openstreetmap.org/wiki/USGS_GNIS"

