# OpenStreetMap Data Wrangling with MongoDB

## Project Overview
The goal of this project is to choose any area of the world on https://www.openstreetmap.org and apply data munging techniques, such as assessing the data for validity, accuracy, completeness, consistency and uniformity, to clean the OpenStreetMap data for that area. The cleaned data is then stored in MongoDB, where it is queried and aggregated.

### OSM Dataset
The area selected for this project is the "old" Toronto area. It was chosen primarily because I currently live in the city of Toronto, specifically within the "old" Toronto area and not the "new" Toronto or the "Greater Toronto Area" (GTA).

Data was directly exported from OpenStreetMap (link provided below)
- https://www.openstreetmap.org/relation/2989349#map=12/43.6789/-79.3851

## Data Audit

### Issues
After an initial review of a sample of the data, I noticed four main problems. Each is discussed below.

- Abbreviated street names (St. instead of Street)
- The same abbreviation used for different words (St. for Street vs St. for Saint)
- Multiple abbreviations in one name (Queen St E instead of Queen Street East)
- Incorrectly or inconsistently formatted postal codes

**Street names**

The audit surfaced several malformed street names, for example:

- Robertoway should have been "Roberto Way"
- StreetE should have been "Street East"
- AvenueE should have been "Avenue East"

The expected abbreviation variants, such as St vs St. and Ave vs Ave., were also found.
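One way such malformed or abbreviated endings can be surfaced is to group street names by their last word and flag any ending that is not on an allow-list. The sketch below is a minimal illustration and not the project's `data/audit.py`; the `EXPECTED_ENDINGS` set and the `audit_street_names` helper are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical allow-list of acceptable final words (an assumption, kept short on purpose).
EXPECTED_ENDINGS = {"Street", "Avenue", "Road", "Way", "East", "West"}


def audit_street_names(street_names, expected=EXPECTED_ENDINGS):
    """Group street names by their last word when that word is not expected."""
    unexpected = defaultdict(set)
    for name in street_names:
        words = name.split()
        if not words:
            continue  # skip empty or whitespace-only values
        ending = words[-1]
        if ending not in expected:
            unexpected[ending].add(name)
    return unexpected


flagged = audit_street_names([
    "Queen Street East",
    "King St",
    "Spadina Ave.",
    "Robertoway",
    "Bloor StreetE",
])
```

Here `flagged` ends up keyed by `St`, `Ave.`, `Robertoway` and `StreetE`, each mapped to the street names that use that ending, while "Queen Street East" passes because `East` is on the allow-list.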

**Postal codes**

Talk about how you cleaned things up - data/audit.py file

Toronto, like all other cities in Canada, uses a strictly structured postal code. Unlike some other parts of an address, postal codes are still quite important, so this field was validated to make sure every postal code follows the correct format.

Postal codes were validated with a regular expression. The general format is `[A-Z][0-9][A-Z]`, a space, then `[0-9][A-Z][0-9]`. However, letter case typically doesn't matter, the space between the first and last three characters is optional, and some letters are excluded because of their resemblance to numbers. The full regular expression is below:
> ^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][\s]?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]
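
To make the pattern's behaviour concrete, here is a small sketch that runs it against a few inputs. The helper name `is_valid_postal_code` and the sample values are assumptions made for illustration; the pattern itself is copied from above, and the input is upper-cased first so that letter case does not matter.

```python
import re

# Same pattern as above, split across two string literals for readability.
POSTAL_CODE_RE = re.compile(
    r'^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][\s]?'
    r'[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]'
)


def is_valid_postal_code(value):
    """Return True when the upper-cased value starts with a well-formed postal code."""
    return POSTAL_CODE_RE.match(value.upper()) is not None


samples = ["M5V 2T6", "m5v2t6", "M6K", "M36 0H7", "D5V 2T6", "M5V 2T6 extra"]
results = {code: is_valid_postal_code(code) for code in samples}
```

The first two samples pass, and `M6K`, `M36 0H7` and `D5V 2T6` fail (too short, a digit where a letter belongs, and an excluded first letter). The last sample also passes: the pattern has no `$` end anchor, so trailing characters after a valid code are accepted. Appending `$` would be one way to tighten the check if that is not desired.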

Auditing the entire dataset turned up only a handful of incorrectly formatted postal codes. To confirm that the check itself worked, I deliberately modified a number of postal codes in a sample dataset and made sure the algorithm flagged them as expected.

The postal codes **M6K, M36 0H7 and M5E** were found to be incorrectly formatted.


TODO: could try to look up and correct incorrect item above

### Analysis

#### Data Overview
- Old_Toronto.osm: 176MB
- Old_Toronto.osm.json: 185.6MB


> Imported 781332 documents

> db.stats( )
-	"db" : "udacity",
-	"collections" : 1,
-	"objects" : 781332,
-	"avgObjSize" : 254.41478142454167,
-	"dataSize" : 198782410,
-	"storageSize" : 62021632,
-	"numExtents" : 0,
-	"indexes" : 1,
-	"indexSize" : 6918144,
-	"ok" : 1


**Identifying tags and total counts**

- "lower", for tags that contain only lowercase letters and are valid,
- "lower_colon", for otherwise valid tags with a colon in their names,
- "problemchars", for tags with problematic characters, and
- "other", for other tags that do not fall into the other three categories.

> {'lower': 510975, 'lower_colon': 355594, 'other': 10557, 'problemchars': 2}


**Number of documents**
                                                
> db.getCollection('open_street_map').find().count()

> 781332
                                                
**Number of nodes**
                                                
> db.getCollection('open_street_map').find({"type":"node"}).count()

> 657172
                                                
**Number of ways**
                                                
> db.getCollection('open_street_map').find({"type":"way"}).count()

> 123868
                                                
**Number of unique users**
                                                
> db.getCollection('open_street_map').distinct("created.user").length

> 799
                                                
**Top contributing user**
                                                
> db.char.aggregate...

**Number of users appearing only once (having 1 post)**
                                                
> db.char.aggregate...
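
The two aggregation queries above are elided in the source, so the following is not a reconstruction of them. It is only a sketch of how such pipelines could be expressed as PyMongo-style stage lists, assuming documents carry the contributor name in `created.user` (as in the `distinct` query above). The constant names and the `summarize_users` helper are hypothetical; the helper mirrors the pipelines in plain Python so the logic can be checked without a database.

```python
from collections import Counter

# Group documents by contributor, sort by document count, keep the largest group.
TOP_USER_PIPELINE = [
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 1},
]

# Group by contributor, keep contributors with exactly one document, count them.
SINGLE_POST_USERS_PIPELINE = [
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$match": {"count": 1}},
    {"$group": {"_id": None, "num_users": {"$sum": 1}}},
]


def summarize_users(documents):
    """In-memory mirror of the two pipelines, for a list of dict documents."""
    counts = Counter(doc.get("created", {}).get("user") for doc in documents)
    top_user = counts.most_common(1)[0] if counts else None
    single_post_users = sum(1 for n in counts.values() if n == 1)
    return top_user, single_post_users


docs = [
    {"type": "node", "created": {"user": "alice"}},
    {"type": "node", "created": {"user": "alice"}},
    {"type": "way", "created": {"user": "bob"}},
]
top_user, single_post_users = summarize_users(docs)
```

With PyMongo, each pipeline would be passed to the collection's `aggregate()` method, for example on the `open_street_map` collection used in the queries above.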

### Code details

This code is for sample testing only.

In [7]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint
import re

In [1]:
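def count_tags(filename):
    """Count how often each element tag occurs in the OSM XML file."""
    tags = {}
    # iterparse yields each element once its closing tag has been read
    for _, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags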


def test():
    tags = count_tags('src/sample.osm')
    pprint.pprint(tags)

# test()

**Identifying unique users**

In [2]:
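def get_user(element):
    """Return the element's 'uid' attribute, or None if it has none."""
    return element.attrib.get('uid')


def process_map(filename):
    """Collect the set of unique contributor ids found in the file."""
    users = set()
    for _, element in ET.iterparse(filename):
        uid = get_user(element)
        if uid is not None:
            users.add(uid)
    return users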


def test():
    users = process_map('src/sample.osm')
    print('Number of unique contributors: {}'.format(len(users)))
    
# test()

**Validating data tag 'k' attribute**

In [8]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        # search() returns a match object (truthy) on success and None otherwise
        if lower.search(element.attrib['k']):
            keys["lower"] += 1
        elif lower_colon.search(element.attrib['k']):
            keys["lower_colon"] += 1
        elif problemchars.search(element.attrib['k']):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1

    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


def test():
    keys = process_map('src/sample.osm')
    pprint.pprint(keys)

# test()

**Postal code**

In [9]:
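postal_codes = re.compile(r'^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][\s]?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]')

# Postal codes that failed validation, collected for later review
bad_postal_codes = []


def audit_postal_code(postal_code):
    """Upper-case the postal code and record it if it does not match the expected format."""
    postal_code = postal_code.upper()
    if not postal_codes.match(postal_code):
        bad_postal_codes.append(postal_code)
    return postal_code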


def is_postal_code(address_key):
    return address_key == 'addr:postcode'
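
The two helpers above can be wired into a pass over the OSM file. The following sketch is one possible arrangement rather than the project's actual audit script: `find_bad_postal_codes` and the `SAMPLE` document are hypothetical, and the pattern is repeated so the block runs on its own.

```python
import io
import re
import xml.etree.ElementTree as ET

POSTAL_CODE_RE = re.compile(
    r'^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][\s]?'
    r'[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]'
)


def find_bad_postal_codes(osm_source):
    """Return the upper-cased addr:postcode values that fail the pattern.

    osm_source may be a filename or a binary file object.
    """
    bad = []
    for _, element in ET.iterparse(osm_source):
        if element.tag == "tag" and element.attrib.get("k") == "addr:postcode":
            code = element.attrib.get("v", "").upper()
            if not POSTAL_CODE_RE.match(code):
                bad.append(code)
    return bad


# Made-up sample data standing in for a real OSM extract.
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:postcode" v="M5V 2T6"/></node>
  <node id="2"><tag k="addr:postcode" v="m6k"/></node>
  <node id="3"><tag k="addr:postcode" v="M36 0H7"/></node>
  <node id="4"><tag k="amenity" v="cafe"/></node>
</osm>"""

bad_codes = find_bad_postal_codes(io.BytesIO(SAMPLE))
```

Returning the list, instead of appending to a module-level variable, keeps the function easy to test on small samples before running it on the full extract.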

**Street name validation**

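The `expected` list holds the street-name endings treated as valid, and `street_mapping` maps malformed or abbreviated forms to their corrected replacements.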

In [10]:
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
            "Trail", "Parkway", "Commons", "Crescent", "West", "South", "East", "North", "Vista",
            "Gardens", "Circle", "Gate", "Heights", "Park", "Way", "Mews", "Keep", "Westway", "Glenway",
            "Queensway", "Wood", "Path", "Terrace", "Appleway"]

street_mapping = {"Ave ": "Avenue",
                   "St. ": "Street",
                   "Rd.": "Road",
                   "StreetE": "Street East",
                   "AvenueE": "Avenue East",
                   "W. ": "West",
                   "E. ": "East",
                   "StreetW": "Street West",
                   "StreetW.": "Street West",
                   "StreetE.": "Street East",
                   "Robertoway": "Roberto Way"
                   }
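
A mapping like this can be applied word by word. The sketch below is an assumption about how that could work, not the project's cleaning code: it uses its own small `TOKEN_MAPPING` keyed on whole words (without the trailing spaces seen in `street_mapping`) and a hypothetical `update_street_name` helper. A leading `St` or `St.` is left alone on the guess that it stands for "Saint".

```python
# Hypothetical whole-word mapping, loosely based on the entries above.
TOKEN_MAPPING = {
    "St": "Street", "St.": "Street",
    "Ave": "Avenue", "Ave.": "Avenue",
    "Rd": "Road", "Rd.": "Road",
    "E": "East", "E.": "East",
    "W": "West", "W.": "West",
    "StreetE": "Street East", "StreetW": "Street West",
    "AvenueE": "Avenue East",
    "Robertoway": "Roberto Way",
}


def update_street_name(name, mapping=TOKEN_MAPPING):
    """Expand abbreviated or fused words in a street name."""
    cleaned = []
    for index, word in enumerate(name.split()):
        if index == 0 and word in ("St", "St."):
            # Assumption: a leading "St." means "Saint", so keep it unchanged
            cleaned.append(word)
        else:
            cleaned.append(mapping.get(word, word))
    return " ".join(cleaned)


examples = {
    name: update_street_name(name)
    for name in ["Queen St E", "St. Clair Ave. W.", "Bloor StreetW", "Robertoway"]
}
```

This turns "Queen St E" into "Queen Street East" and "St. Clair Ave. W." into "St. Clair Avenue West". The leading-word rule is only a heuristic: a "St." for Saint in the middle of a name would still be expanded to "Street", so such cases would need separate handling.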

**MongoDB import**

JSON import code/commands
> mongoimport --db udacity --collection open_street_map --drop --file old_toronto_canada.osm.json
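
`mongoimport` reads the JSON documents from the file given with `--file`. As a rough sketch of how an OSM element might be turned into such a document, the code below assumes the layout implied by the queries in the Analysis section: a top-level `type` and a nested `created.user`. Everything else (the `pos`, `address` and `tags` fields, the `CREATED_FIELDS` tuple and the `shape_element` name) is an assumption made for illustration, not the conversion actually used for this project.

```python
import json
import xml.etree.ElementTree as ET

# Attributes grouped under "created"; only "user" is confirmed by the queries above.
CREATED_FIELDS = ("version", "changeset", "timestamp", "user", "uid")


def shape_element(element):
    """Convert a <node> or <way> element into a dict; return None for other tags."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for key, value in element.attrib.items():
        if key in CREATED_FIELDS:
            doc["created"][key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        # Hypothetical field: coordinates stored as [latitude, longitude]
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if key.startswith("addr:"):
            doc.setdefault("address", {})[key[len("addr:"):]] = value
        else:
            doc.setdefault("tags", {})[key] = value
    return doc


# Made-up element standing in for one entry of the OSM extract.
sample = ET.fromstring(
    '<node id="1" lat="43.65" lon="-79.38" user="alice" uid="7" version="1">'
    '<tag k="addr:postcode" v="M5V 2T6"/><tag k="amenity" v="cafe"/></node>'
)
document = shape_element(sample)
json_line = json.dumps(document)  # one JSON document per line of the output file
```

Writing one `json.dumps` result per line produces a file in the newline-delimited form that `mongoimport` accepts by default.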

## Conclusion
### Additional ideas
Other useful informative queries

### Final thoughts
final thoughts

## References
- http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree
- https://docs.python.org/2/library/re.html
- http://wiki.openstreetmap.org/wiki/OSM_XML
- https://www.openstreetmap.org/relation/2989349#map=12/43.6789/-79.3851