
# OpenStreetMap
***Data Wrangling with MongoDB by NK Zhehua Zou***
  
Map Area: San Jose, CA, United States  
https://mapzen.com/data/metro-extracts/metro/san-jose_california/  
  
***Table of Contents***
1. Data Audit
2. Problems Encountered in the Map
   - Abbreviated Street Names
   - Postal Codes
3. Data Overview
4. Additional Ideas
   - Contributor statistics and gamification suggestion
   - Additional data exploration using MongoDB
5. Conclusion

# 1. Data Audit

In [2]:
# Load packages and libraries
import sys
sys.path.append("script/")
import xml.etree.cElementTree as ET
import re

### cleaning ###
from collections import defaultdict
import string

### osm to json ###
from pymongo import MongoClient
import os
import codecs
import json

In [3]:
# Load data
# This file is only a sample for testing the code.
# See the HTML file for the entire analysis.
data = 'data/sanjose.osm'

### Tags
To get an overall picture of the data, the count_tags function parses the San Jose dataset with ElementTree and counts how many times each element type occurs.

In [4]:
# Parse through the data with ElementTree
def count_tags(data):
    tags={}
    for event, elem in ET.iterparse(data):
        if elem.tag in tags:
            tags[elem.tag]+=1
        else:
            tags[elem.tag]=1
    return tags

count_tags(data)

{'bounds': 1,
 'member': 14382,
 'nd': 1508760,
 'node': 1291540,
 'osm': 1,
 'relation': 1363,
 'tag': 693140,
 'way': 171911}

### Keys Type
***The following functions, key_type and process_key, check the "k" value of each tag and place it in one of these categories:***  
"lower", for tags that contain only lowercase letters and are valid.  
"lower_colon", for otherwise valid tags with a colon in their names.  
"problemchars", for tags with problematic characters.  
"other", for tags that do not fall into any of the categories above.

In [5]:
# Count the tags in each key category using regular expressions
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
# This regex represents invalid MongoDB characters for keys.
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    if element.tag == 'tag':
        k = element.get('k')
        if lower.match(k) is not None:
            keys['lower'] += 1
        elif lower_colon.match(k) is not None:
            keys['lower_colon'] += 1
        # match() anchors at the start of the key, so only a leading
        # problematic character is counted here; the rest fall into 'other'
        elif problemchars.match(k) is not None:
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_key(data):
    keys = {'lower': 0, 'lower_colon': 0, 'problemchars': 0, 'other': 0}
    for _, element in ET.iterparse(data):
        keys = key_type(element, keys)
    return keys

process_key(data)

{'lower': 378290, 'lower_colon': 291114, 'other': 23736, 'problemchars': 0}
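
The choice between `re.match` and `re.search` matters for the "problemchars" category: `re.match` only tests the start of the key, while `re.search` scans the whole key. The sketch below reuses the three patterns from the cell above in a hypothetical helper, `classify_key`, and runs it on made-up keys rather than on the San Jose data, so it says nothing about how the counts above would change.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for a single tag key (hypothetical helper)."""
    if lower.match(key):
        return 'lower'
    if lower_colon.match(key):
        return 'lower_colon'
    # search() scans the whole key; match() would only test its first character
    if problemchars.search(key):
        return 'problemchars'
    return 'other'


# Made-up keys, chosen only to exercise each branch
sample_keys = ['highway', 'addr:street', 'name 1', 'tiger:name_base_1']
categories = [classify_key(k) for k in sample_keys]
print(categories)
```

With `search`, a key such as 'name 1' is flagged because of the space in the middle, whereas `problemchars.match('name 1')` returns None.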

### Users

In [6]:
# Collect the unique user IDs (uid) with ElementTree
def process_people(data):
    users = set()
    for _, element in ET.iterparse(data):
        for e in element:
            if 'uid' in e.attrib:
                users.add(e.attrib['uid'])
    return users

number_contributors = len(process_people(data))

print str(number_contributors) + ' people involved in the map editing.'

1265 people involved in the map editing.


# 2. Problems Encountered in the Map
After downloading a small sample of the San Jose area and running the audit on it, I noticed two main problems with the data, which I will discuss in the following order:  
1) Abbreviated street names ('Branham Ln')  
2) Inconsistent postal codes ('CA950543', '95014-1899')  

Two scripts, street.py and zipcode.py, do the cleaning in this section.

### Abbreviated Street Names
Once the data was imported to MongoDB, some basic querying revealed street name abbreviations. I updated all substrings in problematic address strings, such that 'Branham Ln' becomes 'Branham Lane'.

1) The main problem in this dataset is the inconsistent abbreviation of street types. We build a regex that matches the last word of the string, where the street type usually sits, and keep an 'expected' list of street types that do not need cleaning.  
2) The audit_street_type function searches the input string with the regex. If there is a match and it is not in the 'expected' list, it adds the match as a key and adds the street name to that key's set.  
3) The is_street_name function checks whether the tag's k attribute is 'addr:street'.  
4) The audit function returns the street types collected by the previous two functions.  
5) We then pretty-print the output of the audit. The list of abbreviated street types tells us how to fill in the 'mapping' dictionary, which prepares the conversion of these street names into their proper form.  
6) update_name is the last step of the process: it takes the old name and replaces the abbreviation with the full street type.
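
The audit and update steps above can be sketched as follows. This is a minimal illustration, not the contents of street.py: the regex, the entries of `expected` and `mapping`, and the sample street names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Matches the last word of a street name, e.g. "Ln" in "Branham Ln"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Illustrative values only; the real lists would be built from the audit output
expected = ['Street', 'Avenue', 'Lane', 'Road', 'Drive', 'Boulevard']
mapping = {'Ln': 'Lane', 'St': 'Street', 'St.': 'Street',
           'Ave': 'Avenue', 'Rd': 'Road'}


def audit_street_type(street_types, street_name):
    """Record street names whose last word is not an expected street type."""
    match = street_type_re.search(street_name)
    if match and match.group() not in expected:
        street_types[match.group()].add(street_name)


def update_name(name, mapping):
    """Replace an abbreviated trailing street type with its full form."""
    match = street_type_re.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name


street_types = defaultdict(set)
for street in ['Branham Ln', 'Stevens Creek Boulevard', 'N 1st St.']:
    audit_street_type(street_types, street)

cleaned = update_name('Branham Ln', mapping)
print(cleaned)
```

Names that already end in an expected street type, or whose ending is missing from `mapping`, are returned unchanged, so the mapping can be extended step by step as the audit reveals new abbreviations.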

### Zip Codes
Postal code strings posed a different sort of problem, forcing a decision to strip all characters before and after the main 5-digit zip code. This effectively dropped all leading state characters (as in 'CA950543') and the 4-digit zip code extensions following a hyphen ('95014-1899'). Restricting the codes to 5 digits benefits MongoDB aggregation calls on postal codes.  
1) Although most zip codes are correct, many still do not follow the plain 5-digit format. We process them in the same way as the street names.  
2) The cleaned zip code is either a 5-digit string or the string 'None'.
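
A minimal sketch of this cleaning rule is shown below. The function name update_zipcode comes from zipcode.py, but the body here is an assumption: it simply keeps the first run of five digits it finds, which is one plausible reading of "the main 5-digit zip code".

```python
import re

# First run of five consecutive digits anywhere in the string
zip_re = re.compile(r'\d{5}')


def update_zipcode(zipcode):
    """Return the 5-digit code found in the input, or the string 'None'."""
    match = zip_re.search(zipcode)
    return match.group() if match else 'None'


# The first two inputs are the examples quoted above; the last two are made up
cleaned = [update_zipcode(z)
           for z in ['CA950543', '95014-1899', '95014', 'San Jose']]
print(cleaned)
```

For an input like 'CA950543', which contains six digits, this sketch keeps the first five; whether that is the intended code would need to be checked against the address itself.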

# 3. Data Overview
This section contains basic statistics about the dataset and the MongoDB queries used to gather them.  
From street.py we import is_street_name, update_street, mapping_street and mapping_abbrev to clean the street names.  
From zipcode.py we import is_zipcode and update_zipcode to clean the zip codes.  
We use the shape_element function to wrangle and parse the data.  
We use process_map to write the JSON and output it to MongoDB. 
  
### Preparing for MongoDB by converting XML to JSON
To transform the data from XML to JSON, we follow these rules:  
1) Process only 2 types of top-level tags: "node" and "way".  
2) All attributes of "node" and "way" should be turned into regular key/value pairs, except: attributes in the CREATED array should be added under a key "created", and attributes for latitude and longitude should be added to a "pos" array for use in geospatial indexing. Make sure the values inside the "pos" array are floats and not strings.  
3) If a second-level tag's "k" value contains problematic characters, it should be ignored.  
4) If a second-level tag's "k" value starts with "addr:", it should be added to a dictionary "address".  
5) If a second-level tag's "k" value does not start with "addr:" but contains ":", it can be processed the same as any other tag.  
6) If there is a second ":" that separates the type/direction of a street, the tag should be ignored.  
After all the cleaning and data transformation are done, we use the last function, process_map, to convert the file from XML into JSON format.
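
The rules above can be sketched as a single function. This is an illustration under stated assumptions, not the script used for this project: the members of CREATED are assumed, the sample node is made up, and the handling of "nd" references inside ways is left out because the rules do not describe it.

```python
import re
import xml.etree.ElementTree as ET

problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
# Assumed list of attributes that are grouped under "created"
CREATED = ['version', 'changeset', 'timestamp', 'user', 'uid']


def shape_element(element):
    """Convert a <node> or <way> element into a dict, or return None."""
    if element.tag not in ('node', 'way'):
        return None  # rule 1: only nodes and ways are processed
    doc = {'type': element.tag}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc.setdefault('created', {})[key] = value
        elif key not in ('lat', 'lon'):
            doc[key] = value
    if 'lat' in element.attrib and 'lon' in element.attrib:
        # rule 2: coordinates are stored as floats in a "pos" array
        doc['pos'] = [float(element.attrib['lat']),
                      float(element.attrib['lon'])]
    for tag in element.iter('tag'):
        k, v = tag.attrib['k'], tag.attrib['v']
        if problemchars.search(k):
            continue  # rule 3: skip keys with problematic characters
        if k.startswith('addr:'):
            parts = k.split(':')
            if len(parts) == 2:  # rule 6: ignore keys like addr:street:name
                doc.setdefault('address', {})[parts[1]] = v  # rule 4
        else:
            doc[k] = v  # rule 5: other keys, with or without ':', are kept
    return doc


# Made-up node used only to exercise the rules
sample = '''<node id="1" lat="37.33" lon="-121.89" user="alice" uid="7"
      version="2" changeset="11" timestamp="2016-01-01T00:00:00Z">
  <tag k="addr:street" v="Branham Lane"/>
  <tag k="addr:street:name" v="Branham"/>
  <tag k="amenity" v="cafe"/>
</node>'''
doc = shape_element(ET.fromstring(sample))
print(doc)
```

Each returned dictionary can be serialized with json.dumps and written one document per line, which is the shape that process_map is described as producing.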

### File sizes

In [7]:
client = MongoClient()
db=client.project

In [8]:
print 'The original OSM file is ' + str(os.path.getsize(data)/1.0e6) + ' MB'

The original OSM file is 286.056458 MB


In [9]:
print 'The JSON file is ' + str(os.path.getsize(data + '.json')/1.0e6) + ' MB'

The JSON file is 327.883245 MB


In [10]:
# Number of documents (stored for reuse in the next section)
number_document = db.doc.find().count()
print 'The number of documents is ' + str(number_document)

The number of documents is 1463451


In [11]:
# Number of nodes
print 'The number of nodes is ' + str(db.doc.find({'type':'node'}).count())

The number of nodes is 1291532


In [12]:
# Number of ways
print 'The number of ways is ' + str(db.doc.find({'type':'way'}).count())

The number of ways is 171880


In [13]:
# Number of unique users (stored for reuse in the next section)
number_unique_users = len(db.doc.distinct('created.user'))
print 'The number of unique users is ' + str(number_unique_users)

The number of unique users is 1257


In [14]:
# Top 1 contributing user
cursor = db.doc.aggregate([{'$group':{'_id':'$created.user', 'count':{'$sum':1}}}, {'$sort':{'count':-1}}, {'$limit':1}])
for res in cursor:
    user1=res['_id']
    user1_count=res['count']
print 'The first contributor is ' + user1 + ' with '+ str(user1_count) + ' contributions.'

The first contributor is nmixter with 288570 contributions.


In [15]:
# Number of users appearing only once (having 1 post); stored for reuse in the next section
user_once=db.doc.aggregate([{'$group':{'_id':'$created.user', 'count':{'$sum':1}}}, 
                       {'$sort':{'count':1}},
                       {'$match':{'count':1}},
                       {'$group':{'_id':'null','total':{'$sum':'$count'}}}
                        ])
for res in user_once:
    number_user_once=res['total']

print 'There are ' + str(number_user_once) + ' users appearing only once.'

There are 272 users appearing only once.


# 4. Additional Ideas
### Contributor statistics and suggestion
The results below show how unevenly the contributions are distributed.  
1) The top contributor created 19% of the documents, almost 1/5 of all contributions.  
2) The top four contributors together account for 41% of all contributions, so the second, third and fourth contributors are each far behind the first.  
3) The top 100 contributors account for 95% of all documents. The remaining users contribute very little, and 21% of users made only one post.  
4) The average is 1164 documents per contributor, but most contributors come nowhere near this number.  
5) What incentives could increase participation? One reference is Waze, a popular navigation app. Contributors could be divided into levels according to their contributions, with each level earning different privileges, badges and rewards.

In [16]:
def topn_contrib(n, user=False):
    # Names of the n users with the most documents (only needed when user=True)
    if user:
        topuser=db.doc.aggregate([{'$group':{'_id':'$created.user', 'count':{'$sum':1}}}, 
                                 {'$sort':{'count':-1}}, {'$limit':n}
                                 ])
        top_n_users=[]
        for res in topuser:
            top_n_users.append(res['_id'])

    # Sum the document counts of the top n users; '_id': None puts them all in one group
    top_n_contrib=db.doc.aggregate([{'$group':{'_id':'$created.user', 'count':{'$sum':1}}}, 
                         {'$sort':{'count':-1}}, {'$limit':n},
                         {'$group':{'_id':None,'total':{'$sum':'$count'}}}
                        ])

    for res in top_n_contrib:
        top_n_contrib_count=res['total']

    # Integer division under Python 2, so the percentage is truncated
    percent_contrib_topn=(top_n_contrib_count*100)/number_document
    
    if user:
        return top_n_users,percent_contrib_topn
    else:
        return percent_contrib_topn

In [17]:
top1,top1_percent_contrib=topn_contrib(1,user=True)
print 'Top1 Contributor is ' + str(top1) + ', contribution percentage is ' + str(top1_percent_contrib) + '%.'

Top1 Contributor is [u'nmixter'], contribution percentage is 19%.


In [18]:
top4,top4_percent_contrib=topn_contrib(4,user=True)
print 'These contributors: ' + str(top4) + ' have ' + str(top4_percent_contrib) + '% contribution rate in this area.'

These contributors: [u'nmixter', u'mk408', u'Bike Mapper', u'samely'] have 41% contribution rate in this area.


In [19]:
top100,top100_percent_contrib=topn_contrib(100, user=True)
print 'Contribution percentage from top 100 users is ' + str(top100_percent_contrib) + '%.'

Contribution percentage from top 100 users is 95%.


In [20]:
percent_user_1post=(number_user_once*100)/number_unique_users
print str(percent_user_1post) + '% of users contribute with one post.'

21% of users contribute with one post.


In [21]:
average = number_document/number_unique_users
print 'Average number of documents per contributor is ' + str(average)

Average number of documents per contributor is 1164
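
The whole-number results above come from Python 2 integer division, which truncates rather than rounds. The check below uses only counts printed earlier in this notebook and is written with `//` so that it behaves the same under Python 3; the variable names ending in `_exact` are introduced here for comparison only.

```python
# Counts copied from the outputs above
number_document = 1463451
number_unique_users = 1257
number_user_once = 272
top1_count = 288570

# Truncated values, as produced by Python 2 integer division
top1_percent = top1_count * 100 // number_document
percent_user_1post = number_user_once * 100 // number_unique_users
average = number_document // number_unique_users

# Untruncated values for comparison, rounded to one decimal place
top1_percent_exact = round(top1_count * 100.0 / number_document, 1)
average_exact = round(number_document / float(number_unique_users), 1)

print(top1_percent, percent_user_1post, average)
print(top1_percent_exact, average_exact)
```

The top contributor's share is therefore closer to 20% than the printed 19% suggests.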


### Additional data exploration using MongoDB queries
1) The first query groups on the literal string 'population' rather than on a field, so its count of 1463451 is the total number of documents, not the number of people living in this area.  
2) The most common amenities are parking and restaurants, which makes sense for a metropolitan area.  
3) I am not surprised to find city bus stations in this metropolitan area, although only eight features carry the bus_station tag.  
4) Shell, 76, Valero and Chevron have the most gas stations in this area. This is not much of a surprise, as they are everywhere in the Bay Area.  
5) Pizza My Heart is the most frequent restaurant name in this area, with 9 locations. I have been there before and their pizza tastes really good, but the result still surprised me a little, since I thought Thai food was the most popular food in San Jose.  

In [22]:
population = db.doc.aggregate([{'$group':{'_id':'population', 'count':{'$sum':1}}},
                    {'$sort':{'count':-1}}, {'$limit':10}])

print list(population)

[{u'count': 1463451, u'_id': u'population'}]


In [25]:
# Count the most common amenity types first
amenity = db.doc.aggregate([{'$match':{'amenity':{'$exists':1}}},
                               {'$group':{'_id':'$amenity', 'count':{'$sum':1}}},
                               {'$sort':{'count':-1}}, {'$limit':10}])
for doc in amenity:
    print doc

{u'count': 1835, u'_id': u'parking'}
{u'count': 937, u'_id': u'restaurant'}
{u'count': 532, u'_id': u'school'}
{u'count': 477, u'_id': u'fast_food'}
{u'count': 343, u'_id': u'place_of_worship'}
{u'count': 238, u'_id': u'cafe'}
{u'count': 233, u'_id': u'fuel'}
{u'count': 201, u'_id': u'bench'}
{u'count': 183, u'_id': u'toilets'}
{u'count': 182, u'_id': u'bicycle_parking'}


In [26]:
bus_station = db.doc.aggregate([{'$match':{'amenity':'bus_station'}},
                               {'$group':{'_id':'$name', 'count':{'$sum':1}}},
                               {'$sort':{'count':-1}}, {'$limit':10}])
for doc in bus_station:
    print doc

{u'count': 2, u'_id': None}
{u'count': 1, u'_id': u'San Jose Diridon Transit Center'}
{u'count': 1, u'_id': u'Valley Fair'}
{u'count': 1, u'_id': u'VTA Route 22'}
{u'count': 1, u'_id': u'VTA Route 55 Stop#62327'}
{u'count': 1, u'_id': u'Santa Clara Transit Center'}
{u'count': 1, u'_id': u'VTA Route 55 Stop#62391'}


In [27]:
gas_station = db.doc.aggregate([{'$match':{'amenity':'fuel'}},
                    {'$group':{'_id':'$name', 'count':{'$sum':1}}},
                    {'$sort':{'count':-1}}, {'$limit':10}])

for doc in gas_station:
    print doc

{u'count': 71, u'_id': None}
{u'count': 25, u'_id': u'Shell'}
{u'count': 23, u'_id': u'76'}
{u'count': 22, u'_id': u'Valero'}
{u'count': 20, u'_id': u'Chevron'}
{u'count': 14, u'_id': u'Arco'}
{u'count': 5, u'_id': u'Rotten Robbie'}
{u'count': 2, u'_id': u'Spartan'}
{u'count': 2, u'_id': u'Beacon'}
{u'count': 2, u'_id': u'Cal Gas'}


In [28]:
restaurant = db.doc.aggregate([{'$match':{'amenity':'restaurant'}}, 
                    {'$group':{'_id':'$name', 'count':{'$sum':1}}},
                    {'$sort':{'count':-1}}, {'$limit':10}])

for doc in restaurant:
    print doc

{u'count': 20, u'_id': None}
{u'count': 9, u'_id': u'Pizza My Heart'}
{u'count': 7, u'_id': u"Denny's"}
{u'count': 6, u'_id': u'Panera Bread'}
{u'count': 6, u'_id': u'Round Table Pizza'}
{u'count': 6, u'_id': u'Round Table'}
{u'count': 5, u'_id': u'Subway'}
{u'count': 5, u'_id': u'IHOP'}
{u'count': 4, u'_id': u'Outback Steakhouse'}
{u'count': 4, u'_id': u'Pizza Hut'}


# 5. Conclusion
1) The map of the city of San Jose is relatively clean, so I could retrieve some interesting content, but the data is still not entirely clean.  
2) The data contains some mistakes and different representations of the same feature, so I had to clean the street names and postal codes programmatically.  
3) The audit made it clear that, although there are minor errors caused by human input, the dataset is fairly clean. With over a thousand contributors to this map, some human error is inevitable. I would recommend a structured input form so that everyone enters data in the same format, which would reduce these errors.  
4) We can incentivize users by gamifying the contribution process, and then build a recommendation engine that leverages this data (e.g. restaurant recommendations, buildings, etc.).  
5) OpenStreetMap is an open source project, and many areas remain unexplored because people tend to focus on a few key areas and leave other parts outdated. This is the biggest difference between OpenStreetMap and Google Maps: OpenStreetMap allows everyone to create or modify data, even if that means much of the data is missing.