# OpenStreetMap Data Wrangling with MongoDB

## Overview
### Project
description
### Data
what and where

## Data Audit

### Issues
problems found

**Street names**

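The abbreviation `St.` is ambiguous: it can stand for "Street" (as a suffix) or for "Saint" (as a prefix), so it cannot be expanded with a blanket substitution.

Street types and directions are also abbreviated, as in 'Queen St E'.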
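One way to handle the `St.` ambiguity is to use the word's position in the name. The sketch below is a rough heuristic, not the cleaning code used in this project: it assumes a leading `St`/`St.` means "Saint" and any later one means "Street". The mappings `STREET_TYPES` and `DIRECTIONS` and the function `clean_street_name` are hypothetical names introduced for illustration.

```python
# Hypothetical abbreviation tables; extend them to match the data being audited.
STREET_TYPES = {"St": "Street", "St.": "Street",
                "Ave": "Avenue", "Ave.": "Avenue",
                "Rd": "Road", "Rd.": "Road"}
DIRECTIONS = {"E": "East", "W": "West", "N": "North", "S": "South"}


def clean_street_name(name):
    """Expand abbreviations, treating a leading 'St'/'St.' as 'Saint'."""
    words = name.split()
    last = len(words) - 1
    cleaned = []
    for i, word in enumerate(words):
        if i == 0 and word in ("St", "St."):
            # Assumption: a street name never starts with its street type
            cleaned.append("Saint")
        elif i > 0 and word in STREET_TYPES:
            cleaned.append(STREET_TYPES[word])
        elif i == last and i > 0 and word in DIRECTIONS:
            # Only a trailing single letter is read as a direction
            cleaned.append(DIRECTIONS[word])
        else:
            cleaned.append(word)
    return " ".join(cleaned)


print(clean_street_name("Queen St E"))
print(clean_street_name("St. Clair Ave W"))
```

The position rule fails for names where "Saint" appears mid-name, so the cleaned values would still need a manual spot check.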

### Analysis

**Identify data tags and total count**

- member: 277
- nd: 7330
- node: 6575
- osm: 1
- relation: 13
- tag: 8691
- way: 1239

In [6]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint
import re

In [11]:
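def count_tags(filename):
    """Return a dict mapping each element tag to its number of occurrences."""
    tags = {}
    # iterparse streams the file and yields each element once it is fully parsed
    for _, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags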


def test():
    tags = count_tags('src/sample.osm')
    pprint.pprint(tags)

test()

{'member': 277,
 'nd': 7330,
 'node': 6575,
 'osm': 1,
 'relation': 13,
 'tag': 8691,
 'way': 1239}
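For a full-size OSM extract, the same count can be made with less memory by clearing each element after it has been counted. The following is a minimal sketch assuming Python 3 and the standard `xml.etree.ElementTree` module; the name `count_tags_streaming` is illustrative and not part of the original notebook.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_streaming(source):
    """Count element tags, releasing each element's content once counted."""
    counts = Counter()
    # "end" events fire when an element has been read completely
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children, text and attributes
    return dict(counts)
```

`source` can be a filename such as `'src/sample.osm'` or an open binary file object. The counts should match those of `count_tags`, since clearing happens only after an element has been tallied.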


**Identifying unique users**

Number of unique contributors: 183

In [12]:
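def get_user(element):
    """Return the element's 'uid' attribute, or None if it has none."""
    return element.get('uid')


def process_map(filename):
    """Return the set of unique contributor uids found in the file."""
    users = set()
    for _, element in ET.iterparse(filename):
        uid = get_user(element)
        # Some elements (e.g. nd, tag, member) don't have a uid attribute
        if uid is not None:
            users.add(uid)
    return users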


def test():
    users = process_map('src/sample.osm')
    print('Number of unique contributors: {}'.format(len(users)))
    
test()

Number of unique contributors: 183


**Validating data tag 'k' attribute**

{'lower': 5112, 'lower_colon': 3483, 'other': 96, 'problemchars': 0}

Each `k` value is checked against the categories in this order:

- "lower", for keys that contain only lowercase letters and underscores and are valid,
- "lower_colon", for otherwise valid keys with a single colon in their names,
- "problemchars", for keys with problematic characters, and
- "other", for keys that do not fall into any of the other three categories.

In [13]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        # search() returns a match object (truthy) on success, or None if there is no match
        if lower.search(element.attrib['k']):
            keys["lower"] += 1
        elif lower_colon.search(element.attrib['k']):
            keys["lower_colon"] += 1
        elif problemchars.search(element.attrib['k']):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1

    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


def test():
    keys = process_map('src/sample.osm')
    pprint.pprint(keys)

test()

{'lower': 5112, 'lower_colon': 3483, 'other': 96, 'problemchars': 0}
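The counts alone do not show which keys ended up in "other". A possible follow-up, sketched below under the assumption of Python 3, collects the distinct keys per category so they can be inspected by eye. The patterns are copied from the cell above; `classify_key` and `keys_by_category` are hypothetical helper names, not functions from the original notebook.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Same patterns as in the audit code above
LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for a single 'k' value."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


def keys_by_category(source):
    """Map each category to the set of distinct keys that fall into it."""
    groups = defaultdict(set)
    for _, element in ET.iterparse(source):
        if element.tag == "tag":
            key = element.attrib["k"]
            groups[classify_key(key)].add(key)
    return groups
```

For example, `sorted(keys_by_category('src/sample.osm')['other'])` would list the distinct keys behind the "other" count. Keys with uppercase letters, digits, or more than one colon would be expected there, since none of the first three patterns matches them.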


## Conclusion
### Additional ideas
stuff
### Final thoughts
stuff

## References
- http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree
- https://docs.python.org/2/library/re.html
- http://wiki.openstreetmap.org/wiki/OSM_XML