# OpenStreetMap Data Case Study

## Problems Encountered in the Map
Discuss the five main problems with the data in the following order:

- Over-abbreviated street names (“S Tryon St Ste 105”)

- Inconsistent postal codes (“NC28226”, “28226-0783”, “28226”)

- “Incorrect” postal codes (Charlotte area zip codes all begin with “282”; however, a large portion of all documented zip codes were outside this region.)

- Second-level “k” tags with the value "type" (which overwrites the element’s previously processed node[“type”] field).

- Street names in second-level “k” tags pulled from Tiger GPS data and divided into segments, in the following format:

### Map Area - Dataset

For this project I chose San Jose, a large city surrounded by rolling hills in Silicon Valley, a major technology hub in California's Bay Area. I want to learn more about the place and see what database querying reveals. It is also one of my dream places to work, since world-class tech corporations are located all around it.

San Jose, United States (OSM XML: 364.6 MB)
- https://mapzen.com/data/metro-extracts/metro/san-jose_california/ 


In [1]:
# -*- coding: utf-8 -*-

import pprint
import xml.etree.ElementTree as ET
from collections import defaultdict
import re
import os

DATASET = "san-jose_california.osm" # OSM filename
PATH = "./" # directory containing the OSM file
OSMFILE = PATH + DATASET
print('Dataset folder:', OSMFILE)

Dataset folder: ./san-jose_california.osm


### Iteratively Parsing the OSM File

In [2]:
#mapparser.py
# iterative parsing
from mapparser import count_tags, count_tags_total

tags = count_tags(OSMFILE)
print('Numbers of tag: ', len(tags))
print('Numbers of tag elements: ', count_tags_total(tags))
pprint.pprint(tags)

Numbers of tag:  8
Numbers of tag elements:  4593100
{'bounds': 1,
 'member': 18278,
 'nd': 1963083,
 'node': 1677768,
 'osm': 1,
 'relation': 1756,
 'tag': 703164,
 'way': 229049}
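
`mapparser.py` itself is not shown here, so the following is only a minimal sketch of how such a counter could be written with `ET.iterparse`. The names `count_tags_sketch` and `count_tags_total_sketch` and the tiny in-memory sample are assumptions made for illustration, not the project's actual code.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_sketch(source):
    """Count how often each element tag occurs, one element at a time."""
    counts = Counter()
    # iterparse yields each element when its closing tag is read ("end" event)
    for _, element in ET.iterparse(source):
        counts[element.tag] += 1
        element.clear()  # drop the element's contents to limit memory use
    return dict(counts)


def count_tags_total_sketch(tag_counts):
    """Sum the per-tag counts into one overall element total."""
    return sum(tag_counts.values())


# Hypothetical miniature OSM document, used only to exercise the sketch
sample = io.BytesIO(
    b'<osm><node id="1"><tag k="name" v="A"/></node><node id="2"/></osm>'
)
sample_counts = count_tags_sketch(sample)
print(sample_counts)
print(count_tags_total_sketch(sample_counts))
```

Parsing iteratively matters for a file of this size because the whole tree never has to be held in memory at once.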


### Categorize the tag keys.
Each tag key is placed in one of the following categories:
- "lower", for tags that contain only lowercase letters and are valid,
- "lower_colon", for otherwise valid tags with a colon in their names,
- "problemchars", for tags with problematic characters, and
- "other", for other tags that do not fall into the other three categories.

In [3]:
#tags.py
from tags import key_type
def process_map_tags(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys
keys = process_map_tags(OSMFILE)
pprint.pprint(keys)

{'lower': 457197, 'lower_colon': 224021, 'other': 21945, 'problemchars': 1}
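
The four categories can be expressed as three regular expressions plus a fallback. `tags.py` is not reproduced in this notebook, so the patterns and the helper name `classify_key` below are assumptions; this is a sketch of one plausible classification, not necessarily the one that produced the counts above.

```python
import re

# Assumed patterns; the project's tags.py may define them differently
LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')


def classify_key(key):
    """Return the category name for a single tag key."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


# Hypothetical keys, one per category
for sample_key in ["highway", "addr:street", "addr street", "name_1"]:
    print(sample_key, "->", classify_key(sample_key))
```

Under these patterns, keys containing digits or uppercase letters fall into "other", since they are neither purely lowercase nor problematic.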


### Number of Unique Users

In [4]:
#users.py
from users import unique_user_ID

users = unique_user_ID(OSMFILE)
print('Number of users: ',len(users))
pprint.pprint(users)

Number of users:  1358
{'1',
 '1001936',
 '1005885',
 '1007178',
 '1007194',
 '100744',
 '1007528',
 '100901',
 '101857',
 '102862',
 '1030',
 '103253',
 '1035693',
 '103574',
 '103769',
 '104583',
 '1051550',
 '105839',
 '1058397',
 '1058666',
 '106914',
 '1069163',
 '10786',
 '108265',
 '1087647',
 '108775',
 '1090211',
 '110046',
 '110263',
 '1105193',
 '110639',
 '110723',
 '1107533',
 '1110270',
 '1113906',
 '111511',
 '1125516',
 '1131634',
 '113396',
 '1135274',
 '113696',
 '113972',
 '1149057',
 '115894',
 '116029',
 '1163754',
 '1174563',
 '11776',
 '117975',
 '118021',
 '1183113',
 '1188270',
 '119204',
 '1195290',
 '1198669',
 '119881',
 '11991',
 '1203357',
 '120468',
 '1205991',
 '1207173',
 '1209932',
 '1211556',
 '1211640',
 '121241',
 '1212745',
 '1213904',
 '1214881',
 '121721',
 '1219875',
 '1219941',
 '1222064',
 '1222341',
 '1224430',
 '12303',
 '1233337',
 '1234011',
 '123459',
 '123633',
 '1238906',
 '1239933',
 '123995',
 '1240849',
 '1241936',
 '12434',
 '12448'
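
The printed values are `uid` attributes collected into a set. As `users.py` is not shown, here is a hedged sketch of how that collection step might look; `unique_uids_sketch` and the sample XML are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def unique_uids_sketch(source):
    """Collect the distinct 'uid' attribute values found in an OSM file."""
    uids = set()
    for _, element in ET.iterparse(source):
        uid = element.get("uid")  # None for elements without a uid (e.g. <tag>)
        if uid is not None:
            uids.add(uid)
        element.clear()  # drop the element's contents to limit memory use
    return uids


# Hypothetical sample: two contributors, one of them editing twice
sample = io.BytesIO(
    b'<osm><node id="1" uid="10"/><node id="2" uid="10"/>'
    b'<way id="3" uid="20"><tag k="name" v="A"/></way></osm>'
)
sample_uids = unique_uids_sketch(sample)
print(len(sample_uids), sorted(sample_uids))
```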

### Auditing Street Type

In [5]:
#audit.py
from audit import audit, update_name, street_type_re, mapping

def test():
    st_types = audit(OSMFILE)
    pprint.pprint(dict(st_types))  # print the dictionary of potentially incorrect street types
    for st_type, ways in st_types.items():  # use .iteritems() on Python 2
        for name in ways:
            if street_type_re.search(name).group() in mapping:
                better_name = update_name(name, mapping)
                print(name, "=>", better_name)

if __name__ == '__main__':
    test()

{'0.1': {'Ala 680 PM 0.1'},
 '1': {'Prospect Rd #1', 'Stewart Drive Suite #1'},
 '109A': {'Kato Road #109A'},
 '114': {'West Evelyn Avenue Suite #114'},
 '201': {'Great America Pkwy Ste 201'},
 '4A': {'Saratoga Avenue Bldg 4A'},
 '6': {'Martin Avenue #6', 'Pruneridge Ave #6'},
 '7.1': {'Hwy 17 PM 7.1'},
 '81': {'Concourse Dr #81'},
 'Alameda': {'The Alameda'},
 'Alley': {'Fountain Alley'},
 'Ave': {'1425 E Dunne Ave',
         'Blake Ave',
         'Cabrillo Ave',
         'Cherry Ave',
         'E Duane Ave',
         'Foxworthy Ave',
         'Greenbriar Ave',
         'Hillsdale Ave',
         'Hollenbeck Ave',
         'Meridian Ave',
         'N Blaney Ave',
         'Saratoga Ave',
         'Seaboard Ave',
         'The Alameda Ave',
         'W Washington Ave',
         'Walsh Ave',
         'Westfield Ave'},
 'Barcelona': {'Calle de Barcelona'},
 'Bascom': {'S. Bascom'},
 'Bellomy': {'Bellomy'},
 'Blvd': {'Los Gatos Blvd',
          'McCarthy Blvd',
          'Mission College B
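
The audit groups street names by their last token, and the update step swaps abbreviated tokens for full names. `audit.py` is not included, so the regular expression, the `SAMPLE_MAPPING` entries, and `update_name_sketch` below are assumptions chosen to be consistent with the output above; treat this as an illustrative sketch only.

```python
import re

# Assumed pattern: the last whitespace-delimited token, optionally ending in "."
STREET_TYPE_RE = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical subset of an abbreviation mapping
SAMPLE_MAPPING = {
    "Ave": "Avenue",
    "Blvd": "Boulevard",
    "Rd": "Road",
    "Dr": "Drive",
}


def update_name_sketch(name, mapping):
    """Replace an abbreviated trailing street type with its full form."""
    match = STREET_TYPE_RE.search(name)
    if match is None:
        return name
    street_type = match.group()
    if street_type in mapping:
        return name[:match.start()] + mapping[street_type]
    return name  # unknown or already correct street type: leave unchanged


for street in ["Blake Ave", "Los Gatos Blvd", "The Alameda",
               "Great America Pkwy Ste 201"]:
    print(street, "=>", update_name_sketch(street, SAMPLE_MAPPING))
```

A last-token rule like this leaves names such as 'Great America Pkwy Ste 201' untouched, because the final token is a suite number rather than a street type; abbreviations in the middle of a name would need separate handling.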

### Insert Data into MongoDB

In [6]:
# data.py
from data import process_map

data = process_map(OSMFILE, True)

In [7]:
data[0]

{'created': {'changeset': '11686320',
  'timestamp': '2012-05-24T03:24:59Z',
  'uid': '14293',
  'user': 'KindredCoda',
  'version': '10'},
 'id': '25457954',
 'pos': [37.1582245, -121.6574737],
 'type': 'node',
 'visible': None}
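
To make the document shape above concrete, here is a rough sketch of how one XML element could be turned into such a dictionary. `data.py` is not shown, so `shape_element_sketch` and the `CREATED` list are assumptions; unlike the output above, this simplified version does not add a `visible` key when the attribute is absent, and it ignores child tags.

```python
import xml.etree.ElementTree as ET

# Attributes grouped under the "created" sub-document (assumed grouping)
CREATED = ["version", "changeset", "timestamp", "user", "uid"]


def shape_element_sketch(element):
    """Convert a <node> or <way> element into a JSON-like dictionary."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc["created"][key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    # Store coordinates as a [latitude, longitude] pair of floats
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.get("lat")), float(element.get("lon"))]
    return doc


# Rebuilt from the attribute values shown in data[0] above
node = ET.fromstring(
    '<node id="25457954" lat="37.1582245" lon="-121.6574737" version="10" '
    'changeset="11686320" timestamp="2012-05-24T03:24:59Z" '
    'user="KindredCoda" uid="14293"/>'
)
shaped = shape_element_sketch(node)
print(shaped)
```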

# Data Overview

In [9]:
from pymongo import MongoClient
client = MongoClient('localhost:27017')
db = client.SanJose
collection = db.SanJoseMAP
collection.insert(data)

[ObjectId('59745cca20318e0bbd91e625'),
 ObjectId('59745cca20318e0bbd91e626'),
 ObjectId('59745cca20318e0bbd91e627'),
 ObjectId('59745cca20318e0bbd91e628'),
 ObjectId('59745cca20318e0bbd91e629'),
 ObjectId('59745cca20318e0bbd91e62a'),
 ObjectId('59745cca20318e0bbd91e62b'),
 ObjectId('59745cca20318e0bbd91e62c'),
 ObjectId('59745cca20318e0bbd91e62d'),
 ObjectId('59745cca20318e0bbd91e62e'),
 ObjectId('59745cca20318e0bbd91e62f'),
 ObjectId('59745cca20318e0bbd91e630'),
 ObjectId('59745cca20318e0bbd91e631'),
 ObjectId('59745cca20318e0bbd91e632'),
 ObjectId('59745cca20318e0bbd91e633'),
 ObjectId('59745cca20318e0bbd91e634'),
 ObjectId('59745cca20318e0bbd91e635'),
 ObjectId('59745cca20318e0bbd91e636'),
 ObjectId('59745cca20318e0bbd91e637'),
 ObjectId('59745cca20318e0bbd91e638'),
 ObjectId('59745cca20318e0bbd91e639'),
 ObjectId('59745cca20318e0bbd91e63a'),
 ObjectId('59745cca20318e0bbd91e63b'),
 ObjectId('59745cca20318e0bbd91e63c'),
 ObjectId('59745cca20318e0bbd91e63d'),
 ObjectId('59745cca20318e

In [10]:
collection

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'SanJose'), 'SanJoseMAP')

In [11]:
print('Size of the original xml file: ',os.path.getsize(OSMFILE)/(1024*1024.0), 'MB')
print('Size of the processed json file: ',os.path.getsize(os.path.join(PATH, "san-jose_california.osm.json"))/(1024*1024.0), 'MB')
print('Number of documents: ' + str(collection.find().count()))
print('Number of nodes: ' + str(collection.find({"type":"node"}).count()))
print('Number of ways: ' + str(collection.find({"type":"way"}).count()))
print('Number of relations: ' + str(collection.find({"type":"relation"}).count()))
print('Number of unique users: ' + str(len(collection.distinct("created.user"))))
print('Number of pizza places: ' + str(collection.find({"cuisine":"pizza"}).count()))

Size of the original xml file:  347.6639804840088 MB
Size of the processed json file:  513.6455898284912 MB
Number of documents: 17852353
Number of nodes: 15790864
Number of ways: 2061273
Number of relations: 0
Number of unique users: 1351
Number of pizza places: 636


In [12]:
# Top 10 users with most contributions
pipeline = [{"$group":{"_id": "$created.user", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{'_id': 'nmixter', 'count': 2695371}
{'_id': 'andygol', 'count': 2666013}
{'_id': 'mk408', 'count': 1468349}
{'_id': 'Bike Mapper', 'count': 878665}
{'_id': 'samely', 'count': 732090}
{'_id': 'RichRico', 'count': 692573}
{'_id': 'dannykath', 'count': 677561}
{'_id': 'karitotp', 'count': 581877}
{'_id': 'MustangBuyer', 'count': 581084}
{'_id': 'Minh Nguyen', 'count': 465092}


In [16]:
# Smallest per-user contribution count and the number of users who have it
# (users with exactly one post would appear as '_id': 1)
pipeline = [{"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
            {"$group": {"_id": "$count", "num_users": {"$sum": 1}}},
            {"$sort": {"_id": 1}},
            {"$limit": 1}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{'_id': 9, 'num_users': 169}
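
The pipeline groups twice: first documents per user, then users per document count, and finally keeps the smallest count. In the result above that smallest count is 9, so no user appears exactly once in the collection as loaded. The same two-stage grouping can be sketched in plain Python; the function name and the sample user list below are hypothetical and serve only to illustrate the logic.

```python
from collections import Counter


def users_per_contribution_count(user_names):
    """Map each contribution count to the number of users having that count."""
    per_user = Counter(user_names)          # stage 1: documents per user
    per_count = Counter(per_user.values())  # stage 2: users per document count
    return dict(sorted(per_count.items()))  # ascending, like {"$sort": {"_id": 1}}


# Hypothetical 'created.user' values, one entry per document
sample_users = ["a", "a", "a", "b", "c", "d", "d"]
histogram = users_per_contribution_count(sample_users)
print(histogram)

# Equivalent of {"$limit": 1}: the smallest count and how many users have it
smallest_count, num_users = next(iter(histogram.items()))
print({"_id": smallest_count, "num_users": num_users})
```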


## Conclusion
I believe the data has been cleaned well enough for the purposes of this exercise. However, some areas of the San Jose data are clearly far from complete.