# OpenStreetMap Data Case Study

## Problems Encountered in the Map
The main problems with the data are discussed in the following order:

- Over-abbreviated street names (“S Tryon St Ste 105”)

- Second-level “k” tags with the value "type" (which overwrites the element’s previously processed node[“type”] field)

- Street names in second-level “k” tags pulled from TIGER GPS data and divided into segments

- Unstructured unique IDs (1, 42653, 2321, 5030230)


### Map Area - Dataset

For this project I chose San Jose, a large city surrounded by rolling hills in Silicon Valley, a major technology hub in California's Bay Area. I wanted to see what querying the data reveals about the area. It is also one of the places I would most like to work, since many world-class tech companies are based there.

San Jose, United States (OSM XML: 364.6 MB)
- https://mapzen.com/data/metro-extracts/metro/san-jose_california/ 


In [1]:
# -*- coding: utf-8 -*-

import pprint
import xml.etree.ElementTree as ET
from collections import defaultdict
import re
import os

DATASET = "san-jose_california.osm"  # OSM filename
PATH = "./"  # directory containing the OSM file
OSMFILE = PATH + DATASET
print('Dataset folder:', OSMFILE)

Dataset folder: ./san-jose_california.osm


### Iterative Parsing of the OSM File

In [2]:
# mapparser.py
# iterative parsing
from mapparser import count_tags, count_tags_total

tags = count_tags(OSMFILE)
print('Numbers of tag: ', len(tags))
print('Numbers of tag elements: ', count_tags_total(tags))
pprint.pprint(tags)

Numbers of tag:  8
Numbers of tag elements:  4599618
{'bounds': 1,
 'member': 18333,
 'nd': 1965111,
 'node': 1679378,
 'osm': 1,
 'relation': 1759,
 'tag': 705634,
 'way': 229401}
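The `mapparser` module is not shown in this notebook, so the following is only a minimal sketch of how `count_tags` and `count_tags_total` could work, assuming they are built on `ET.iterparse`. The in-memory sample is made up for illustration and stands in for the real OSM file.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def count_tags(source):
    """Return a dict mapping each tag name to the number of times it occurs."""
    counts = defaultdict(int)
    # iterparse yields each element once its closing tag has been read
    for _, element in ET.iterparse(source, events=("end",)):
        counts[element.tag] += 1
        element.clear()  # drop attributes and children to limit memory use
    return dict(counts)


def count_tags_total(tags):
    """Sum the per-tag counts into one overall element count."""
    return sum(tags.values())


# Tiny made-up sample standing in for the real OSM file
sample = io.BytesIO(
    b"<osm><node id='1'><tag k='name' v='x'/></node><node id='2'/></osm>"
)
tags = count_tags(sample)
total = count_tags_total(tags)
```

Because the file is read as a stream, the whole document never has to fit in memory at once, which matters for an extract of several hundred megabytes.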


### Categorize the Tag Keys
Each tag key is assigned to one of the following categories:
- "lower", for tags that contain only lowercase letters and are valid,
- "lower_colon", for otherwise valid tags with a colon in their names,
- "problemchars", for tags with problematic characters, and
- "other", for other tags that do not fall into the other three categories.

In [3]:
# tags.py
from tags import key_type
def process_map_tags(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys
keys = process_map_tags(OSMFILE)
pprint.pprint(keys)

{'lower': 459030, 'lower_colon': 224633, 'other': 21969, 'problemchars': 2}
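The `tags` module is not included here either. As a rough sketch, `key_type` could classify each key with three regular expressions, checked in the order of the categories above. The patterns below are my assumption about what the module does, not a copy of it.

```python
import re
import xml.etree.ElementTree as ET

# Assumed patterns, one per category
LOWER = re.compile(r"^[a-z_]*$")
LOWER_COLON = re.compile(r"^[a-z_]*:[a-z_]*$")
PROBLEMCHARS = re.compile(r"[=+/&<>;'\"?%#$@,. \t\r\n]")


def key_type(element, keys):
    """Increment the counter of the category that the element's k attribute falls into."""
    if element.tag == "tag":
        k = element.get("k", "")
        if LOWER.search(k):
            keys["lower"] += 1
        elif LOWER_COLON.search(k):
            keys["lower_colon"] += 1
        elif PROBLEMCHARS.search(k):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys


# Made-up keys, one per category
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for k in ["highway", "addr:street", "bad key", "FIXME"]:
    keys = key_type(ET.Element("tag", {"k": k, "v": ""}), keys)
```

Note that the four counts in the real output add up to 705634, the number of `tag` elements found by the iterative parse.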


### Number of Unique Users
Each user has a unique ID, but the IDs vary in length (for example 1, 1005885, 1030, 100744). I padded every user ID with leading zeros so that they all have the same width.

In [17]:
# users.py
from users import unique_user_id, max_length_user_id, structure_user_id

def test():
    users = unique_user_id(OSMFILE)
    # structured = structure_user_id(users)
    # pprint.pprint(structured)
    max_length = max_length_user_id(users)
    print('Number of users: ', len(users))
    print('User ID maximum length', max_length)

    print_limit = 10
    for user_id in users:
        if len(user_id) < max_length:
            structured_id = user_id
            while len(structured_id) < max_length:
                structured_id = str('0' + structured_id)

            if print_limit > 0:
                print_limit -= 1
                print(user_id, "=>", structured_id)
            else:
                break

if __name__ == '__main__':
    test()

Number of users:  1359
User ID maximum length 7
21694 => 0021694
396203 => 0396203
573196 => 0573196
446933 => 0446933
253748 => 0253748
158628 => 0158628
177477 => 0177477
183795 => 0183795
500006 => 0500006
161873 => 0161873
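The same padding can be written more compactly with `str.zfill`. This is a small sketch; `pad_user_ids` is a hypothetical helper name and is not part of `users.py`.

```python
def pad_user_ids(user_ids):
    """Map each user ID to a copy left-padded with zeros to the longest ID's width."""
    width = max((len(uid) for uid in user_ids), default=0)
    return {uid: uid.zfill(width) for uid in user_ids}


# Sample IDs taken from the text above, plus one from the output
padded = pad_user_ids(["1", "1005885", "1030", "21694"])
```

IDs that already have the maximum length are returned unchanged.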


### Over-abbreviated Street Names
Some street names are over-abbreviated. I updated the problematic address strings as follows:

- Seaboard Ave => Seaboard Avenue
- Cherry Ave => Cherry Avenue

In [18]:
#audit.py
from audit import audit, update_name, street_type_re, mapping

def test():
    st_types = audit(OSMFILE)
    # pprint.pprint(dict(st_types))  # print the dictionary of potentially incorrect street types
    print_limit = 10
    for st_type, ways in st_types.items():  # .iteritems() for Python 2
        for name in ways:
            match = street_type_re.search(name)
            # skip names without a recognizable street type instead of raising AttributeError
            if match and match.group() in mapping:
                better_name = update_name(name, mapping)
                if print_limit > 0:
                    print_limit -= 1
                    print(name, "=>", better_name)
                else:
                    break
        
if __name__ == '__main__':
    test()

Seaboard Ave => Seaboard Avenue
Cherry Ave => Cherry Avenue
Greenbriar Ave => Greenbriar Avenue
Blake Ave => Blake Avenue
Walsh Ave => Walsh Avenue
E Duane Ave => E Duane Avenue
Meridian Ave => Meridian Avenue
The Alameda Ave => The Alameda Avenue
Hollenbeck Ave => Hollenbeck Avenue
Foxworthy Ave => Foxworthy Avenue
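Since `audit.py` is not reproduced in this notebook, here is a minimal sketch of how `update_name` might replace the last word of a street name. The regular expression and the short `mapping` below are assumptions for illustration, and the real module may cover more abbreviations.

```python
import re

# Assumed pattern: the last whitespace-delimited token of the street name
street_type_re = re.compile(r"\b\S+\.?$", re.IGNORECASE)

# Assumed, deliberately short mapping of abbreviations to full street types
mapping = {"Ave": "Avenue", "St": "Street", "Rd": "Road", "Blvd": "Boulevard"}


def update_name(name, mapping):
    """Expand an abbreviated street type at the end of name, if it is in mapping."""
    match = street_type_re.search(name)
    if match and match.group() in mapping:
        return name[: match.start()] + mapping[match.group()]
    return name  # leave names without a known abbreviation untouched


better = update_name("Seaboard Ave", mapping)
unchanged = update_name("Cherry Avenue", mapping)
```

Only the final token is replaced, so an abbreviation in the middle of a name (such as the "E" in "E Duane Ave") is left as it is, which matches the output above.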


### Insert Data into MongoDB

In [10]:
# data.py
from data import process_map

data = process_map(OSMFILE, True)

In [11]:
data[0]

{'created': {'changeset': '11686320',
  'timestamp': '2012-05-24T03:24:59Z',
  'uid': '14293',
  'user': 'KindredCoda',
  'version': '10'},
 'id': '25457954',
 'pos': [37.1582245, -121.6574737],
 'type': 'node',
 'visible': None}
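The document above suggests how `process_map` shapes each element. The following sketch reproduces that structure for a single node, under the assumption that `data.py` works roughly this way. The `k="type"` tag in the sample and the `tag_type` key are hypothetical; they only illustrate one way to keep such a tag from overwriting the element's own `type` field.

```python
import xml.etree.ElementTree as ET

CREATED = ("version", "changeset", "timestamp", "user", "uid")


def shape_element(element):
    """Turn a <node> or <way> element into a dict; return None for other elements."""
    if element.tag not in ("node", "way"):
        return None
    doc = {
        "id": element.get("id"),
        "type": element.tag,
        "visible": element.get("visible"),  # None when the attribute is absent
        "created": {key: element.get(key) for key in CREATED},
    }
    if element.get("lat") is not None and element.get("lon") is not None:
        doc["pos"] = [float(element.get("lat")), float(element.get("lon"))]
    for tag in element.iter("tag"):
        key = tag.get("k")
        # Assumption: rename a k="type" tag so it cannot overwrite the element type
        doc["tag_type" if key == "type" else key] = tag.get("v")
    return doc


# Attributes copied from the document above; the <tag> child is made up
node = ET.fromstring(
    '<node id="25457954" lat="37.1582245" lon="-121.6574737" version="10" '
    'changeset="11686320" timestamp="2012-05-24T03:24:59Z" '
    'user="KindredCoda" uid="14293"><tag k="type" v="peak"/></node>'
)
doc = shape_element(node)
```

This sketch ignores problematic keys, keys containing colons, and the `nd` references of ways, all of which a complete version would need to handle.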

## Data Overview

In [12]:
from pymongo import MongoClient
client = MongoClient('localhost:27017')
db = client.SanJose
collection = db.SanJoseMAP
# collection.insert_many(data)  # insert() was removed in PyMongo 4

In [13]:
collection

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'SanJose'), 'SanJoseMAP')

In [14]:
print('Size of the original xml file: ',os.path.getsize(OSMFILE)/(1024*1024.0), 'MB')
print('Size of the processed json file: ',os.path.getsize(os.path.join(PATH, "san-jose_california.osm.json"))/(1024*1024.0), 'MB')
# count_documents() replaces cursor.count(), which was removed in PyMongo 4
print('Number of documents: ' + str(collection.count_documents({})))
print('Number of nodes: ' + str(collection.count_documents({"type":"node"})))
print('Number of ways: ' + str(collection.count_documents({"type":"way"})))
print('Number of relations: ' + str(collection.count_documents({"type":"relation"})))
print('Number of unique users: ' + str(len(collection.distinct("created.user"))))
print('Number of pizza places: ' + str(collection.count_documents({"cuisine":"pizza"})))

Size of the original xml file:  348.08773612976074 MB
Size of the processed json file:  510.8454399108887 MB
Number of documents: 19761132
Number of nodes: 17470242
Number of ways: 2290674
Number of relations: 0
Number of unique users: 1356
Number of pizza places: 636


## Additional Ideas 

In [36]:
# Top 10 users with most contributions
pipeline = [{"$group":{"_id": "$created.user", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
# the cursor is iterable; $limit already caps it at 10 documents
for doc in collection.aggregate(pipeline):
    print(doc)

{'_id': 'nmixter', 'count': 2980568}
{'_id': 'andygol', 'count': 2961664}
{'_id': 'mk408', 'count': 1615791}
{'_id': 'Bike Mapper', 'count': 969105}
{'_id': 'samely', 'count': 813227}
{'_id': 'RichRico', 'count': 768741}
{'_id': 'dannykath', 'count': 752101}
{'_id': 'MustangBuyer', 'count': 646129}
{'_id': 'karitotp', 'count': 645535}
{'_id': 'Minh Nguyen', 'count': 517383}


In [16]:
# Number of users appearing only once (having 1 post)
pipeline = [{"$group":{"_id":"$created.user", "count":{"$sum":1}}},
                      {"$group":{"_id":"$count", "num_users":{"$sum":1}}},
                      {"$sort":{"_id":1}}, {"$limit":1}]

# the cursor is iterable; $limit already caps it at 1 document
for doc in collection.aggregate(pipeline):
    print(doc)

{'_id': 1, 'num_users': 1}
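To make the two `$group` stages easier to follow, here is the same logic in plain Python on a handful of made-up documents. This is only an illustration of what the pipeline computes; it does not query MongoDB, and `users_by_contribution_count` is a hypothetical helper name.

```python
from collections import Counter


def users_by_contribution_count(docs):
    """Map each contribution count to the number of users who have exactly that count."""
    # First $group: documents per user
    per_user = Counter(doc["created"]["user"] for doc in docs)
    # Second $group: users per document count
    return Counter(per_user.values())


# Made-up documents: "a" contributed twice, "b" and "c" once each
docs = [{"created": {"user": u}} for u in ["a", "a", "b", "c"]]
histogram = users_by_contribution_count(docs)

# Equivalent of {"$sort": {"_id": 1}}, {"$limit": 1}: the smallest contribution count
smallest = min(histogram)
result = {"_id": smallest, "num_users": histogram[smallest]}
```

The sort and limit at the end pick the smallest contribution count, which is 1 only if at least one user has a single document.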


## Conclusion
I believe the data has been cleaned well enough for the purposes of this exercise. However, some areas of the San Jose data are clearly far from complete. Some problems have not been cleaned yet, such as inconsistent postal codes (“NC28226”, “28226-0783”, “28226”) and “incorrect” postal codes (Charlotte area zip codes all begin with “282”, yet a large portion of the documented zip codes fall outside this region).
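As a possible next step, the inconsistent postal codes could be reduced to their five-digit form. The sketch below assumes US ZIP codes with an optional state prefix and an optional ZIP+4 suffix; `clean_postcode` is a hypothetical helper that is not part of this project.

```python
import re

# Five digits not preceded by another digit, optionally followed by a ZIP+4 suffix
ZIP_RE = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?$")


def clean_postcode(raw):
    """Return the five-digit ZIP code found at the end of raw, or None if there is none."""
    match = ZIP_RE.search(raw.strip())
    return match.group(1) if match else None


# The three inconsistent forms quoted above
cleaned = [clean_postcode(p) for p in ["NC28226", "28226-0783", "28226"]]
```

Values that yield `None` would still need manual review, and checking whether a cleaned code actually belongs to the mapped area is a separate step.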