# OpenStreetMap Sample Project
# Data Wrangling with MongoDB

In this project I analysed the OpenStreetMap data for my home town, Milan, Italy, using the link in the section "Sources/Materials".

##Problems encountered in the map

The map covers a large area of Italy, where naming conventions differ from those in the US. 
I first analysed the street names and found nothing that needed fixing. In Italy the street type (for example "Via" or "Piazza") comes first in the name and is usually not abbreviated, so the data are easier to keep standardized.

I then analysed the postcodes: every code should have exactly 5 digits. In the map data I found a few codes with only 4 digits, most probably due to human error.
Below is the regex used to find the wrong codes:

In [1]:
import re

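# Postcodes made of only 1 to 4 digits are too short to be valid.
postcode_type_re = re.compile(r'^\d{1,4}$')


def audit_postcode_type(wrong_postcodes, postcode):
    # Count each too-short postcode in the wrong_postcodes counter
    # (for example a collections.defaultdict(int)).
    r = postcode_type_re.search(postcode)
    if r:
        wrong_postcodes[r.group()] += 1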
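
To show how such an audit could be driven from the OSM file, here is a minimal sketch. It assumes the postcode is stored under the standard `addr:postcode` key and that the counter is a `defaultdict(int)`. The function name `audit_postcodes` and the sample XML are hypothetical and not taken from the Milan extract.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Matches values made only of 1 to 4 digits, i.e. too short for a 5-digit postcode.
postcode_type_re = re.compile(r'^\d{1,4}$')

# Hypothetical OSM fragment used only for illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1"><tag k="addr:postcode" v="20121"/></node>
  <node id="2"><tag k="addr:postcode" v="2090"/></node>
  <node id="3"><tag k="addr:postcode" v="2090"/></node>
  <way id="4"><tag k="addr:postcode" v="2121"/></way>
</osm>"""


def audit_postcodes(osm_file):
    """Count every too-short postcode found in an OSM XML file object."""
    wrong_postcodes = defaultdict(int)
    # iterparse streams the file, so a large extract is not loaded into memory at once.
    for _, elem in ET.iterparse(osm_file):
        if elem.tag == "tag" and elem.attrib.get("k") == "addr:postcode":
            postcode = elem.attrib["v"]
            if postcode_type_re.search(postcode):
                wrong_postcodes[postcode] += 1
    return dict(wrong_postcodes)


result = audit_postcodes(io.BytesIO(SAMPLE_OSM))
print(result)
```

The resulting counts make it easy to review the short list of suspicious codes by hand before deciding how to correct them.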
        

This was corrected manually by building a short dictionary that maps wrong codes to correct ones. This was an easy task: since the map focuses only on Milan, it was not difficult to figure out which code was intended. 

In [2]:
postcodes_tofix = {
    "2090":"20090",
    "2121":"20121",
    "2043":"20143",
    "2014":"20124",
    "2009":"20092",
    "2003":"20030"
}
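
The mapping could then be applied with a small lookup helper. This is only a sketch: the function name `fix_postcode` is hypothetical, and it assumes that codes missing from the dictionary should be left unchanged.

```python
# Mapping of wrong 4-digit codes to the intended 5-digit codes (copied from the cell above).
postcodes_tofix = {
    "2090": "20090",
    "2121": "20121",
    "2043": "20143",
    "2014": "20124",
    "2009": "20092",
    "2003": "20030",
}


def fix_postcode(postcode, mapping=postcodes_tofix):
    """Return the corrected postcode, or the input unchanged if no fix is known."""
    return mapping.get(postcode, postcode)


print(fix_postcode("2090"))
print(fix_postcode("20121"))
```

Leaving unknown values untouched keeps the fix conservative: only the codes that were reviewed by hand are ever rewritten.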

In the data I also found that several different keys were used to describe the same thing. In particular, some users used "telephone" for the phone number of a place, while others used "contact:phone". The same applies to website, url, fax, mail and phone. All those tags were changed to the form "contact:x". During the import of the data, a new dictionary "contact" was created to hold all this information together. 
Below is some of the code used for the analysis and to fix the issue:


In [3]:
def is_mail_contact(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "mail")

def is_fax_contact(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "fax")

def is_phone_contact(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "phone")

def is_website_contact(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "website")

def is_url_contact(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "url")

def audit_contact_type(contact_types, name):
    contact_types[name] += 1
    
def correct_contact_type(contact_type):
    return "contact:"+ contact_type
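
As an illustration of how the renamed tags could be gathered into a single "contact" dictionary during import, here is a minimal sketch. The helper `collect_contacts`, the `CONTACT_KEYS` set and the sample element are hypothetical, and the exact shape of the dictionary used in the real import is an assumption.

```python
import xml.etree.ElementTree as ET

# Keys checked by the is_*_contact helpers above, stored without the "contact:" prefix.
CONTACT_KEYS = {"mail", "fax", "phone", "website", "url"}


def correct_contact_type(contact_type):
    return "contact:" + contact_type


def collect_contacts(element):
    """Return a dict with all contact details of one node or way element."""
    contact = {}
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        # Normalise bare keys such as "phone" to "contact:phone".
        if key in CONTACT_KEYS:
            key = correct_contact_type(key)
        # Store everything under "contact:x" as contact["x"].
        if key.startswith("contact:"):
            contact[key[len("contact:"):]] = value
    return contact


# Hypothetical element used only for illustration.
sample = ET.fromstring(
    '<node id="1">'
    '<tag k="phone" v="+39 02 0000000"/>'
    '<tag k="contact:website" v="http://example.com"/>'
    '<tag k="name" v="Bar Esempio"/>'
    '</node>'
)
contacts = collect_contacts(sample)
print(contacts)
```

A set of keys replaces the five separate predicate functions here only to keep the example short; the behaviour for each key is the same.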


Another problem found in the map, which was not fixed, concerns the classification used to label some places. Analysing the leisure and amenity places, I found some that belong in one category rather than the other.
In particular, all places labeled as gym, lotto, nightclub or picnic_table under amenity should be moved to leisure.

This was not done because the more the data are analysed, the more inconsistencies of this kind can be found. 
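
Although this fix was not applied in the project, the reclassification could be sketched as follows. The function `reclassify_amenity` is hypothetical, and it assumes each document is a plain dictionary with optional "amenity" and "leisure" fields, as in the imported JSON.

```python
# Amenity values that the analysis suggests belong under "leisure" instead.
LEISURE_VALUES = {"gym", "lotto", "nightclub", "picnic_table"}


def reclassify_amenity(doc):
    """Return a copy of doc with a misplaced amenity value moved under leisure."""
    value = doc.get("amenity")
    # Only move the value if no leisure field exists, to avoid overwriting data.
    if value in LEISURE_VALUES and "leisure" not in doc:
        fixed = dict(doc)
        fixed["leisure"] = fixed.pop("amenity")
        return fixed
    return doc


print(reclassify_amenity({"type": "node", "amenity": "gym"}))
print(reclassify_amenity({"type": "node", "amenity": "restaurant"}))
```

The list of values would still have to be maintained by hand, which is the same limitation that led to skipping this fix.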

##Data overview

File sizes:
- Milan.osm  780MB
- Milan.json 1.14GB

Number of documents
coll.find().count()
4016658

Number of nodes
coll.find({"type": "node"}).count()
3484193

Number of ways
coll.find({"type": "way"}).count()
532392

Number of unique users
len(coll.distinct("created.user"))
2410

Most contributing user
coll.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":1}])
[{u'_id': u'Alecs01', u'count': 594237}]

#Find amenities 
print "Count of amenities in the data"
aggr = coll.aggregate( [{"$group":{"_id": "$amenity", "count":{"$sum":1}}}, {"$sort":{"count":-1}}])
pprint(list(aggr))

#Find all restaurants in Milan 
print "Count of different cuisines in restaurants in Milan city"
aggr = coll.aggregate( [{"$match":{"address.city":"Milano", 
                                  "amenity":"restaurant"}},
                       {"$group":{"_id": "$cuisine", "count":{"$sum":1}}}, {"$sort":{"count":-1}}])
pprint(list(aggr))


##Future ideas

Since OpenStreetMap is an open source project and contributions are voluntary, I think it would be nice to add a new way of contributing information. 
At the moment, users have added 1666 documents each on average, which is not a lot compared to the most contributing user. 
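
The average can be checked against the figures in the data overview. The short calculation below uses only the counts reported there; the variable names are illustrative.

```python
# Counts reported in the "Data overview" section.
total_documents = 4016658
unique_users = 2410
top_user_documents = 594237  # documents created by the most contributing user

# float() keeps the division exact on both Python 2 and Python 3.
average_per_user = total_documents / float(unique_users)
top_user_share = top_user_documents / float(total_documents)

print("Average documents per user: %d" % average_per_user)
print("Share of the top contributor: %.1f%%" % (100 * top_user_share))
```

The gap between the average and the top contributor shows how unevenly the contributions are distributed.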

Why not allow users to add information in an easier way? Instead of having to add data from a laptop, it would be nice to be able to use a smartphone to add information automatically. How? While having a stroll in your city, you could simply open the OpenStreetMap app and take a picture of the element you want to add:
- if you want to add a new street, simply take a picture of the street name
- if you want to add a shop, take a picture of the shop sign.

The software would then automatically recognize the items in the picture, let the user choose the most appropriate one from a list of suggested categories, and ask for more details if needed (like contact info, opening hours, ...).
Having an app to populate the map would simplify the process of adding information, making it more fun and also more accurate.
What are the problems with this approach?
First of all, software that reliably recognizes items in a picture is not straightforward to build. Changing the way people add information might also be counter-productive: users who are not used to the OpenStreetMap app might continue to add information in the usual way. Another point is that OpenStreetMap, being an open source project, would need to create and maintain the application.
What would be the pros?
Nowadays almost everyone has a smartphone and people are used to using apps in their day-to-day life. Giving users the opportunity to simply take a picture, instead of having to type a street name, also prevents typos and other human errors. At the same time, contributors may enjoy adding more information, increasing the average number of contributions per user. 