In [2]:
from pymongo import MongoClient
from pymongo import cursor
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import os
import datetime
import pprint
import csv
import json
import io
import codecs
import pymongo

client = MongoClient('localhost:27017')
db = client.osmstreetmap

### Area Dataset - Boston, Massachusetts
- boston_massachusetts.osm ... 211 MB
- boston_massachusetts.json ... 245 MB

#### Boston, MA
Boston is the capital and most populous municipality of the Commonwealth of Massachusetts, USA. With a population of over 670,000 in 2016, it is the largest city in the New England region of the Northeastern United States. Founded in 1630 by Puritan settlers from England, it is one of the oldest cities in the country. With universities such as MIT, Harvard University, Berklee College of Music and the University of Massachusetts, Boston is considered an international center of higher education and a world leader in innovation and entrepreneurship, with nearly 2,000 start-ups.

#### Why Boston?
I have wanted to visit Boston for a long time, and choosing this dataset is my first step towards getting to know the place beforehand.

#### Note
The dataset can be found at https://mapzen.com/data/metro-extracts/metro/boston_massachusetts/. The original dataset is over 400 MB, and unfortunately my computer wasn't able to process a file of that size, so I trimmed the dataset down to about 200 MB.

### Skimming through it, I encountered a few problems:
- Understanding and cleaning up a dataset of this size.
- Inconsistent zip codes, such as ones with more than five digits ("02110-1301") or zip codes outside the Boston area.
- Abbreviated street types, such as "Ave" for Avenue and "St" for Street.
- The address format differs between documents. Some documents store the entire address in a single tag:
        <tag k="address" v="40 Thorndike St, Cambridge, MA, 02141" />
     Others use separate tags for each part of the address:
        <tag k="addr:city" v="Hyde Park" />
        <tag k="addr:street" v="Metropolitan Avenue" />
        <tag k="addr:postcode" v="02136" />
        <tag k="addr:housenumber" v="655" />
    Storing the address in separate tags makes it easy to query cities, streets or postcodes individually. With a one-line address it is harder to get accurate counts, for example of the most mentioned streets or cities. So I identify the zip code within the one-line address and save it separately, like so: "address": {"location": "40 Thorndike St, Cambridge, MA, 02141", "postcode": "02141"}, which makes it easy to analyze all the zip codes in one go.
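
The postcode extraction described above can be sketched as follows. This is only an illustration, not the code used later in this notebook: the name `extract_postcode` and the regular expression are assumptions, and the sketch simply takes the last five-digit group (optionally followed by a ZIP+4 suffix) in the string.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix such as "-1301".
# Assumption: the postcode is the last such group in the address string.
POSTCODE_RE = re.compile(r'\b(\d{5})(?:-\d{4})?\b')

def extract_postcode(address):
    """Return the 5-digit postcode found in a one-line address, or None."""
    matches = POSTCODE_RE.findall(address)
    return matches[-1] if matches else None

one_line = "40 Thorndike St, Cambridge, MA, 02141"
record = {"location": one_line, "postcode": extract_postcode(one_line)}
```

Taking the last match rather than the first is a design choice: it keeps a five-digit house number at the start of the address from being mistaken for the postcode.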


### After the initial observation, my next steps are to:
- Count the tags, such as the number of nodes, ways etc.
- Identify any problem characters
- Clean up the dataset, convert it to JSON and upload it to MongoDB
- Calculate the total number of documents
- Count the unique contributors
- Find the users with the highest contributions
- Find the users with 10 or fewer contributions
- Drill down on the zip codes to examine whether they are valid and whether they all belong to the Boston area
- Find the most mentioned cities
- List the amenities and their counts

### Ideas for improvement:
- For addresses, there should be one standard way to store the data, like the following:
        <tag k="type" v="multipolygon" />
        <tag k="building" v="commercial" />
        <tag k="addr:city" v="Boston" />
        <tag k="addr:state" v="MA" />
        <tag k="addr:street" v="Summer Street" />
        <tag k="building:height" v="34.1376" />
        <tag k="building:levels" v="10" />
        <tag k="addr:housenumber" v="280" />
    This way of representing the address is quite convenient: not only does it lay everything out clearly, it also makes it easy to examine each of the entities separately.
- It would be a good addition to have the date or year of establishment for buildings, universities, libraries and other places of interest or historical value.
- A capacity attribute for restaurants, buildings and essential public places would also be useful.

Although it might be difficult for some contributors to gather such minute details, and of course to edit the existing data, these improvements might help people looking for specific data.

### Conclusion:
The dataset is quite well maintained for the most part, and it was interesting and challenging to work with data of this scale and examine the outcomes. There is always room to improve the quality of the content and make it more accessible, even to a novice.

Before this lesson, I had little to no idea what OpenStreetMap was, but these lessons and this project have brought me several steps closer to understanding the fundamentals of this phenomenal open-source collaborative mapping service. OpenStreetMap today has over 2 million registered users (source: OpenStreetMap Wiki), and with more awareness, people all over the globe can gain the knowledge to build a map of the world.

### Resources used:

- https://en.wikipedia.org/wiki/OpenStreetMap
- https://api.mongodb.com/python/2.0/tutorial.html
- https://docs.mongodb.com/manual/reference/method/
- https://en.wikipedia.org/wiki/Boston

In [2]:
OSM_FILE = "boston_massachusetts.osm"  # Replace this with your osm file
SAMPLE_FILE = "boston_massachusetts_sample.osm"

k = 100 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

In [3]:
# To count tags - Nodes, Ways, Relations
def count_tags(filename):
    tag_dictionary = {}
    for event, element in ET.iterparse(filename, events = ("start",)):
        key = element.tag
        if key in tag_dictionary:
            tag_dictionary[key] += 1
        else:
            tag_dictionary[key] = 1
    
    return tag_dictionary

tags = count_tags("boston_massachusetts.osm")
print "Nodes =", tags['node']
print "Ways =", tags['way']
print "Relations =", tags['relation']

Nodes = 976373
Ways = 155519
Relations = 663
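
Since the full extract was too large for my machine, a lower-memory variant of the tag count may be worth sketching. The version below is an assumption-laden illustration rather than the code that produced the numbers above: the name `count_tags_streaming` is hypothetical, it imports `xml.etree.ElementTree` instead of `cElementTree`, and it clears each element once its end tag has been seen. Clearing reduces memory use but does not eliminate it, because the emptied elements stay attached to the root.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_streaming(source):
    """Count element tags, emptying each element once it has been closed."""
    counts = Counter()
    for _, element in ET.iterparse(source, events=("end",)):
        counts[element.tag] += 1
        element.clear()  # drop the children and attributes we no longer need
    return counts

# Tiny made-up document standing in for the real .osm file
sample = io.BytesIO(b'<osm><node id="1"/><node id="2"/>'
                    b'<way id="3"><nd ref="1"/><nd ref="2"/></way></osm>')
sample_counts = count_tags_streaming(sample)
```

`ET.iterparse` accepts either a filename or a file object, so the same function could be pointed at "boston_massachusetts.osm" directly.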


In [18]:
# Identifying problem characters

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    if element.tag == "tag":
        k_value = element.attrib['k']
        if lower.search(k_value):
            keys["lower"] += 1
        elif lower_colon.search(k_value):
            keys["lower_colon"] += 1
        elif problemchars.search(k_value):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

def test():
    keys = process_map('boston_massachusetts.osm')
    for key, value in keys.iteritems():
        print key, "--", value

test()

problemchars -- 3
lower -- 398383
other -- 19787
lower_colon -- 37812


In [4]:
# Data cleanup and convert it to JSON
CREATED = ["version", "changeset", "timestamp", "user", "uid"]
mapping = {"St": "Street",
            "St.": "Street",
            "Ave" : "Avenue",
            "Ave." : "Avenue",
            "Rd." : "Road",
            "Rd" : "Road",
            "Blvd" : "Boulevard",
            "Dr" : "Drive",
            "pkwy" : "Parkway",
            "Pkwy" : "Parkway",
            "Trl" : "Trail",
            "Ln" : "Lane",
            "ct" : "Court",
            "Ct" : "Court"}

def street_cleanup(street):
    # Replace only whole words that exactly match a key in "mapping", so that
    # "St" in "Stuart St" becomes "Street" without touching "Stuart"
    words = street.split(' ')
    return ' '.join(mapping.get(word, word) for word in words)

def zip_code_setup(address):
    zip_code_1 = re.compile("MA, ")
    zip_code_2 = re.compile("MA ")
    zip_pos = 10000
    if zip_code_1.search(address):
        zip_pos = zip_code_1.search(address).start()+4
    if zip_code_2.search(address):
        zip_pos = zip_code_2.search(address).start()+3
    zip_code = address[zip_pos:].strip(' ')
    return zip_code

def capacity_conversion(value):
    special_pattern = re.compile("~")
    if special_pattern.match(value):
        return int(value[1:])
    return int(value)

def create_dictionary(element):
    node = {}
    if element.tag == "node" or element.tag == "way" or element.tag == "relation":
        node['node_id'] = element.attrib['id']
        node['node_type'] = element.tag
        node['created'] = {}
        
        # The "created" attribute will store info on the 'id', 'date-time stamp', 'version' etc.
        for attributes in CREATED:
            if attributes in element.attrib:
                node['created'][attributes] = element.attrib[attributes]
        
        # If the element has both "lat" and "lon" attributes, store them as a [latitude, longitude] list
        if 'lat' in element.attrib and 'lon' in element.attrib:
            node['pos'] = [element.attrib['lat'], element.attrib['lon']]
        
        # The "address" attribute
        node['address'] = {}
        for tag in element.iter("tag"):
            
            # We'll use the following, "addr_pattern" & "colon_pattern", regexes to filter out the "tags" that might
            # contain any address related data
            
            # Addresses that match the "addr:" prefix, for e.g., <tag k="addr:postcode" v="02126" />
            addr_pattern = re.compile("addr:")
            
            # It will check if the address type begins with a colon, like <tag k=":postcode" v="02126" /> and
            # also to check for any additional colons in the address type
            colon_pattern = re.compile(":")
            
            # To store the k-attribute
            k_attrib = tag.attrib['k']
            
            # Stores the value for the k-attribute
            value = tag.attrib['v']
            
            # This condition will perform any kind of clean up related to street name abbreviation based on the
            # "mapping" dictionary provided, like converting "St." to "Street" or "Ave" to "Avenue"
            if k_attrib == "addr:street":
                value = street_cleanup(value)
                            
            # If the regexes don't match with the attributes
            if addr_pattern.match(k_attrib) is None and colon_pattern.match(k_attrib) is None:
                
                # The following condition is to check for the attributes stored as "address" instead of "addr:",
                # like <tag k="address" v="888 Broadway, Everett MA 02149-3199" />, here we save the postcode as
                # a separate entity
                if k_attrib == "address":
                    node['address']['location'] = value
                    zip_code = zip_code_setup(value)
                    if len(zip_code) >= 5:
                        node['address']['postcode'] = zip_code

                else:
                    # To convert the value for "capacity" attribute from String type to Integer
                    if k_attrib == "capacity":
                        node[k_attrib] = capacity_conversion(value)
                    else:
                        node[k_attrib] = value
            
            # If the regexes do match, then the matched prefixes will be removed saving just the address type with their
            # corresponding values
            else:
                if addr_pattern.match(k_attrib):
                    addr_type = k_attrib[5:]
                elif colon_pattern.match(k_attrib):
                    addr_type = k_attrib[1:]
                if not colon_pattern.search(addr_type):
                    node['address'][addr_type] = value
        
        # If no addresses are found for a particular tag, then it will be dropped from the dictionary
        if node['address'] == {}:
            node.pop('address', None)
        
        # If the tag is "way", then all of its node references, like <nd ref="240167956" />, will be saved as a list
        if element.tag == "way":
            node['node_refs'] = []
            for tag in element.iter("nd"):
                node_refs = tag.attrib['ref']
                node['node_refs'].append(node_refs)
        
        # Same with "relation": all its members, like <member ref="244444287" role="across" type="way" />, will be saved
        # as a list of dictionaries, one per member, so that earlier members are not overwritten by later ones
        if element.tag == "relation":
            node['member_values'] = []
            for tag in element.iter("member"):
                node['member_values'].append({'ref' : tag.attrib['ref'],
                                              'role' : tag.attrib['role'],
                                              'type' : tag.attrib['type']})
        return node
    else:
        return None

def process_map(file_in):
    file_out = "boston_massachusetts.json"
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = create_dictionary(element)
            if el:
                fo.write(json.dumps(el) + "\n")

process_map('boston_massachusetts.osm')
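
The step that loads the JSON file into MongoDB is not shown in this notebook. One possible way to do it from Python is sketched below. This is an assumption about how the upload could be done, not a record of how it was done: `load_json_lines` is a hypothetical helper, and it relies only on the collection's `insert_many` method from PyMongo 3.

```python
import json

def load_json_lines(lines, collection, batch_size=1000):
    """Insert one document per non-empty line into `collection`, in batches.

    `lines` can be any iterable of strings, such as an open file.
    Returns the number of documents inserted.
    """
    inserted = 0
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush whatever is left after the last full batch
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted

# Usage, assuming a local MongoDB server is running (not executed here):
# with open("boston_massachusetts.json") as handle:
#     load_json_lines(handle, db.boston_massachusetts)
```

Batching keeps memory use bounded, which matters for a file with over a million lines.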

In [6]:
# Total no. of documents
total = db.boston_massachusetts.find().count()
print "Total no. of documents:", total

Total no. of documents: 1132555


In [7]:
# No. of unique contributors
unique_contributors = len(db.boston_massachusetts.distinct("created.uid"))
print "No. of unique contributors:", unique_contributors

No. of unique contributors: 1197


In [8]:
# Users with highest contributions & percentage
max_contributors = db.boston_massachusetts.aggregate([{"$group" : {"_id" : "$created.user",
                                                               "contributions" : {"$sum" : 1}}},
                                                      {"$sort" : {"contributions" : -1}},
                                                      {"$limit" : 10}])

print "DOCS SUBMITTED, (contribution %ge) -->  USERS"
print "----------------------------------      -----"
top_ten_docs = 0
for contributor in max_contributors:
    top_ten_docs += contributor['contributions']
    print contributor['contributions'], ", (%.2f)" % float(contributor['contributions'] * 100.0/total), '                  -->   ',contributor['_id']

print
print "TOTAL DOCS SUBMITTED BY THE TOP 10 USERS, (contribution %ge)"
print "---------------------------------------------------------------"
print top_ten_docs, ", (%.2f)" % float(top_ten_docs * 100.0/total)

DOCS SUBMITTED, (contribution %ge) -->  USERS
----------------------------------      -----
597677 , (52.77)                   -->    crschmidt
214329 , (18.92)                   -->    jremillard-massgis
55587 , (4.91)                   -->    wambag
45447 , (4.01)                   -->    OceanVortex
33690 , (2.97)                   -->    morganwahl
32991 , (2.91)                   -->    ryebread
29291 , (2.59)                   -->    MassGIS Import
16220 , (1.43)                   -->    ingalls_imports
14125 , (1.25)                   -->    Ahlzen
7363 , (0.65)                   -->    mapper999

TOTAL DOCS SUBMITTED BY THE TOP 10 USERS, (contribution %ge)
---------------------------------------------------------------
1046720 , (92.42)


In [9]:
# No. of users with 10 or less contributions
min_contributors = db.boston_massachusetts.aggregate([{"$group" : {"_id" : "$created.user",
                                                                   "contributions" : {"$sum" : 1}}},
                                                      {"$group" : {"_id" : "$contributions",
                                                                   "users_count" : {"$sum" : 1}}},
                                                      {"$sort" : {"_id" : 1}},
                                                      {"$limit" : 10}])
print "DOCUMENTS --> NO. OF USERS, (%ge of users)"
print "---------     ----------------------------"
bottom_ten_users = 0
for contributor in min_contributors:
    bottom_ten_users += contributor['users_count']
    print contributor['_id'], '        -->        ', contributor['users_count'], ", (%.2f)" % float(contributor['users_count'] * 100.0/unique_contributors)

print
print "TOTAL, (%ge of users with 10 docs or less)"
print "------------------------------------------"
print bottom_ten_users, ", (%.2f)" % float(bottom_ten_users * 100.0/unique_contributors)

DOCUMENTS --> NO. OF USERS, (%ge of users)
---------     ----------------------------
1         -->         338 , (28.24)
2         -->         128 , (10.69)
3         -->         88 , (7.35)
4         -->         60 , (5.01)
5         -->         50 , (4.18)
6         -->         36 , (3.01)
7         -->         24 , (2.01)
8         -->         23 , (1.92)
9         -->         20 , (1.67)
10         -->         15 , (1.25)

TOTAL, (%ge of users with 10 docs or less)
------------------------------------------
782 , (65.33)
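
The query above sorts the contribution counts and keeps the ten smallest, which gives "10 or fewer" only because every count from 1 to 10 happens to occur. A sketch of a pipeline that states the threshold explicitly is shown below. The helper name `low_contributor_pipeline` is hypothetical, and the pipeline has not been run against this collection.

```python
def low_contributor_pipeline(max_contributions=10):
    """Build a pipeline counting users with at most `max_contributions` documents."""
    return [
        # One result per user, with the number of documents they created
        {"$group": {"_id": "$created.user", "contributions": {"$sum": 1}}},
        # Keep only the users at or below the threshold
        {"$match": {"contributions": {"$lte": max_contributions}}},
        # Collapse the remaining users into a single count
        {"$group": {"_id": None, "users_count": {"$sum": 1}}},
    ]

pipeline = low_contributor_pipeline(10)

# Usage against the collection used in this notebook (not executed here):
# result = list(db.boston_massachusetts.aggregate(pipeline))
```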


In [10]:
# Zip codes

# All the zip codes in the Boston area
boston_zip_codes = ['01841', '02101', '02108', '02109', '02110', '02111', '02112', '02113', '02114', '02115', '02116', '02117',
                    '02118', '02119', '02120', '02121', '02122', '02123', '02124', '02125', '02126', '02127', '02128', '02129',
                    '02130', '02131', '02132', '02133', '02134', '02135', '02136', '02137', '02141', '02149', '02150', '02151',
                    '02152', '02163', '02171', '02196', '02199', '02201', '02203', '02204', '02205', '02206', '02210', '02211',
                    '02212', '02215', '02217', '02222', '02228', '02241', '02266', '02283', '02284', '02293', '02297', '02298',
                    '02445', '02467']

# Documents with "postcode" in the "address" section
docs_with_postcode = db.boston_massachusetts.find({'address.postcode' : {"$exists" : True}}).count()
print "Documents with \"postcode\" in the \"address\" section:", docs_with_postcode

# Documents with "postcode" in Boston area
docs_with_postcode_in_boston = db.boston_massachusetts.find({'address' : {"$exists" : True},
                                                             'address.postcode' : {"$in" : boston_zip_codes}}).count()
print "Documents with \"postcode\" in the Boston area:", docs_with_postcode_in_boston

# Documents with "postcode" not in the Boston area
docs_with_postcode_outside = db.boston_massachusetts.find({'address' : {"$exists" : True},
                                                           'address.postcode' : {"$exists" : True,
                                                                                 "$nin" : boston_zip_codes}}).count()
print "Documents with \"postcode\" outside the Boston area:", docs_with_postcode_outside

docs_with_postcode_outside_details = db.boston_massachusetts.find({'address' : {"$exists" : True},
                                                                   'address.postcode' : {"$exists" : True,
                                                                                         "$nin" : boston_zip_codes}})

# Looking closer at the documents with postcodes outside the Boston area, we can see that some of the postcodes have been
# formatted differently, e.g. "02110-1301", and thus need to be edited to get an accurate count.
set_code = set()
for index, area in zip(range(200), docs_with_postcode_outside_details):
    address = area['address']
    if 'postcode' in address:
        set_code.add(address['postcode'])

# You can uncomment the below print statement to view a sample of such zip codes
# pprint.pprint(set_code)

# Edit the zip codes to 5-digit values by removing any hyphenated suffix or any preceding characters like "MA ", to get a count of unique zip codes from the documents
set_postcodes = set()
for area in db.boston_massachusetts.find({"address.postcode" : {"$exists" : True}}):
    postcode = area['address']['postcode']
    type_1 = re.compile("-")
    type_2 = re.compile("MA ")
    if type_1.search(postcode):
        pos = type_1.search(postcode).start()
        postcode = postcode[:pos]
    if type_2.search(postcode):
        postcode = postcode[3:]
    set_postcodes.add(postcode)

print
print "Unique postcodes in the documents:", len(set_postcodes)
print "All of the Boston area postcodes:", len(boston_zip_codes)
print "Postcodes from the documents belonging to the Boston area:", len(set_postcodes.intersection(boston_zip_codes))
print "Postcodes not belonging to the Boston area:", len((set_postcodes) - set(boston_zip_codes))
print

# Most mentioned postal codes in the documents
most_mentioned_postal_codes = db.boston_massachusetts.aggregate([{"$match" : {"address.postcode" : {"$exists" : True}}},
                                                                 {"$group" : {"_id" : "$address.postcode",
                                                                              "count" : {"$sum" : 1}}},
                                                                 {"$sort" : {"count" : -1}},
                                                                 {"$limit" : 10}])

print "10 most mentioned zip codes:"
print "----------------------------"
for data in most_mentioned_postal_codes:
    print data['_id'], " --> ", data['count']

# The most mentioned postal code - 02139, doesn't come under Boston but is actually in Cambridge, MA.

Documents with "postcode" in the "address" section: 1593
Documents with "postcode" in the Boston area: 833
Documents with "postcode" outside the Boston area: 760

Unique postcodes in the documents: 75
All of the Boston area postcodes: 62
Postcodes from the documents belonging to the Boston area: 40
Postcodes not belonging to the Boston area: 35

10 most mentioned zip codes:
----------------------------
02139  -->  218
02135  -->  161
02130  -->  105
02144  -->  63
02474  -->  63
02114  -->  49
02215  -->  46
02116  -->  43
02143  -->  40
02138  -->  39
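
Note that the "10 most mentioned zip codes" aggregation groups on the raw `address.postcode` value, so variants such as "02139-4307" or "MA 02139" would be counted separately from "02139". A sketch of counting after normalization is shown below. The function names and the sample values are made up for illustration, and the sketch assumes the first run of five digits is the postcode.

```python
import re
from collections import Counter

FIVE_DIGITS = re.compile(r'\d{5}')

def normalize_postcode(raw):
    """Return the first 5-digit run in `raw`, or None if there is none."""
    match = FIVE_DIGITS.search(raw)
    return match.group(0) if match else None

def count_postcodes(raw_values):
    """Count postcodes after normalization, skipping unparseable values."""
    counts = Counter()
    for raw in raw_values:
        code = normalize_postcode(raw)
        if code is not None:
            counts[code] += 1
    return counts

# Made-up sample values in the formats mentioned above
sample = ["02139", "02139-4307", "MA 02139", "02110-1301", "Cambridge"]
sample_counts = count_postcodes(sample)
```

With the real data, the raw values could be pulled with a `find` on `address.postcode` and passed to `count_postcodes`, and `sample_counts.most_common(10)` would then give the ranking.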


In [12]:
# Most mentioned cities
most_mentioned_city = db.boston_massachusetts.aggregate([{"$match" : {"address.city" : {"$exists" : True}}},
                                                         {"$group" : {"_id" : "$address.city",
                                                                      "count" : {"$sum" : 1}}},
                                                         {"$sort" : {"count" : -1}},
                                                         {"$limit" : 10}])

print "10 Most mentioned cities"
print "------------------------"
for data in most_mentioned_city:
    print data['_id'], " --> ", data['count']

# True to the dataset, "Boston" has the most mentions in the documents, but we also see the names of surrounding
# cities like "Cambridge" and "Malden", which explains why about 35 of the zip codes are outside the Boston area.

10 Most mentioned cities
------------------------
Boston  -->  572
Cambridge  -->  287
Malden  -->  191
Arlington  -->  136
Somerville  -->  134
Jamaica Plain  -->  50
Chelsea  -->  37
Quincy  -->  28
Medford  -->  25
Brookline  -->  22


In [14]:
# List of amenities and their count
amenities = db.boston_massachusetts.aggregate([{"$match" : {"amenity" : {"$exists" : True}}},
                                               {"$group" : {"_id" : "$amenity",
                                                            "count" : {"$sum" : 1}}},
                                               {"$sort" : {"count" : -1}},
                                               {"$limit" : 10}])

print
print "LIST OF AMENITIES AND COUNT:"
print "----------------------------"
for data in amenities:
    print data['_id'], " --> ", data['count']
print

# Places of worship
places_of_worship = db.boston_massachusetts.aggregate([{"$match" : {"amenity" : "place_of_worship", "religion" : {"$exists" : True}}},
                                                       {"$group" : {"_id" : "$religion",
                                                                    "count" : {"$sum" : 1}}},
                                                       {"$sort" : {"count" : -1}}])

print "PLACES OF WORSHIP:"
print "------------------"
for data in places_of_worship:
    print data['_id'], " --> ", data['count']
print

# Restaurants & cuisine
restaurants = db.boston_massachusetts.aggregate([{"$match" : {"amenity" : "restaurant", "cuisine" : {"$exists" : True}}},
                                                 {"$group" : {"_id" : "$cuisine",
                                                              "count" : {"$sum" : 1}}},
                                                 {"$sort" : {"count" : -1}},
                                                 {"$limit" : 5}])

print "TOP 5 CUISINES"
print "--------------"
for data in restaurants:
    print data['_id'], " --> ", data['count']

# Top 5 restaurants with highest capacity
top_capacity_restaurants = db.boston_massachusetts.find({"amenity" : "restaurant", "capacity" : {"$exists" : True}}).sort("capacity", -1).limit(5)

print
print "TOP 5 RESTAURANTS WITH HIGHEST CAPACITY"
print "---------------------------------------"
for data in top_capacity_restaurants:
    print data['name'], "-->", data['capacity']

print
res_with_no_capacity_info = db.boston_massachusetts.find({"amenity" : "restaurant", "capacity" : {"$exists" : False}}).count()
print "Restaurants with no capacity info:", res_with_no_capacity_info
    
# As you can see, there are still over 300 restaurants with no capacity data. Having this info for all the restaurants would have made the capacity ranking above more meaningful.

LIST OF AMENITIES AND COUNT:
----------------------------
parking  -->  742
bench  -->  527
restaurant  -->  376
school  -->  339
parking_space  -->  224
place_of_worship  -->  212
library  -->  157
bicycle_parking  -->  155
cafe  -->  151
fast_food  -->  127

PLACES OF WORSHIP:
------------------
christian  -->  183
jewish  -->  9
unitarian_universalist  -->  4
muslim  -->  2
buddhist  -->  1

CUISINES
--------
pizza  -->  32
mexican  -->  20
chinese  -->  17
indian  -->  15
italian  -->  15

RESTAURANTS WITH HIGHEST CAPACITY
---------------------------------
The Haven --> 100
Chicago Pizza --> 40
Yume Wo Katare --> 20
Taco Party --> 20
All Star Pizza Bar --> 20

Restaurants with no capacity info: 367
