# Project 3 - Wrangle OpenStreetMap Data

## Introduction

OpenStreetMap.org is an open source online map service. It uses tags to represent physical features on the ground. Its structure consists of nodes, ways and relations. 

The nodes represent single points on the map and are defined by their latitude, longitude and id. The ways are ordered lists of nodes (2 to 2,000). The relations define logical or geographical relationships between other elements. They are ordered lists of one or more nodes, ways and/or relations.

The nodes, ways and relations usually have at least one tag to define their purpose.

**Example of a node**:

    <node id="25496583" lat="51.5173639" lon="-0.140043" version="1"
          changeset="203496" user="80n" uid="1238" visible="true"
          timestamp="2007-01-28T11:40:26Z">
        <tag k="highway" v="traffic_signals"/>
    </node>

**Example of a way**:

    <way id="5090250" visible="true" timestamp="2009-01-19T19:07:25Z" version="8" changeset="816806" user="Blumpsy" uid="64226">
        <nd ref="822403"/>
        <nd ref="21533912"/>
        <nd ref="821601"/>
        <nd ref="21533910"/>
        <nd ref="135791608"/>
        <nd ref="333725784"/>
        <nd ref="333725781"/>
        <nd ref="333725774"/>
        <nd ref="333725776"/>
        <nd ref="823771"/>
        <tag k="highway" v="residential"/>
        <tag k="name" v="Clipstone Street"/>
        <tag k="oneway" v="yes"/>
    </way>
  
**Example of a relation**:

    <relation id="1">
        <tag k="type" v="boundary" />
        <tag k="boundary" v="administrative" />
        <tag k="land_area" v="administrative" />
        <tag k="admin_level" v="2" />
        <tag k="name" v="light green country" />
        <member type="way" id="AB" role="outer" />
        <member type="way" id="AC" role="inner" />
    </relation>

    <relation id="2">
        <tag k="type" v="boundary" />
        <tag k="boundary" v="administrative" />
        <tag k="land_area" v="administrative" />
        <tag k="admin_level" v="2" />
        <tag k="name" v="dark green country" />
        <member type="way" id="AB" role="outer" />
        <member type="way" id="AC" role="outer" />
    </relation>
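
As a minimal sketch of how these element types can be read with ElementTree, the snippet below parses a shortened version of the node and way examples above. It is written for Python 3, and the helper name `summarize` is hypothetical and not part of the project code.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the node and way examples above.
SNIPPET = """
<osm>
  <node id="25496583" lat="51.5173639" lon="-0.140043">
    <tag k="highway" v="traffic_signals"/>
  </node>
  <way id="5090250">
    <nd ref="822403"/>
    <nd ref="21533912"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Clipstone Street"/>
  </way>
</osm>
""".strip()


def summarize(element):
    """Collect the id, the tags and the type-specific parts of an element."""
    summary = {
        "type": element.tag,
        "id": element.get("id"),
        "tags": {t.get("k"): t.get("v") for t in element.findall("tag")},
    }
    if element.tag == "node":
        # A node is a single point: latitude first, then longitude.
        summary["pos"] = [float(element.get("lat")), float(element.get("lon"))]
    if element.tag == "way":
        # A way is an ordered list of references to nodes.
        summary["node_refs"] = [nd.get("ref") for nd in element.findall("nd")]
    return summary


root = ET.fromstring(SNIPPET)
summaries = [summarize(child) for child in root]
for s in summaries:
    print(s)
```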

## Methods

In this project I explored the OpenStreetMap data for the city of Athens, Greece. I followed the methodology and code presented in the MongoDB case study of the "Wrangle OpenStreetMap Data" course.


I used Python's ElementTree library to parse the XML file that contained the raw data. Since some of the data were written in Greek characters, I converted them to Latin characters where necessary.


Finally, I loaded the data into a MongoDB database and ran queries against it to find, among other things, the unique contributors, cities and amenities included in the data.

## Results

#### Import python libraries

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET
from copy import copy as cp
from collections import defaultdict
from pymongo import MongoClient  
import os
import sys
import re
import codecs
import json
import string
import pprint

#### Data parsing

Here I parse the OSM file of the city of Athens. The data cover the whole area of Athens and were obtained directly from: https://mapzen.com/data/metro-extracts/

In [2]:
OSM_FILE = "athens_greece.osm"
SAMPLE_FILE = "athens_sample.osm"

k = 1 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    #Yield element if it is the right type of tag    
    
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

#### Counting tag occurrences in the xml file

Here I counted all the tag occurrences in the XML file. The dataset contains $2,198,514$ nodes, $266,412$ ways and $7,522$ relations.

In [4]:
def count_tags(filename):
        tree = ET.parse(filename)
        tags = dict()
        for childElem in tree.iter():
            child = childElem.tag
            if child not in tags:
                tags[child] = 1
            else:
                tags[child] += 1
        return tags

count_tags(SAMPLE_FILE)

{'member': 72506,
 'nd': 2666195,
 'node': 2198514,
 'osm': 1,
 'relation': 7522,
 'tag': 734021,
 'way': 266412}
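
The count_tags function above loads the whole tree with ET.parse. For a large extract, a streaming variant may be lighter on memory. The following is only a sketch under that assumption, written for Python 3; the name `count_tags_streaming` and the tiny in-memory sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_streaming(source):
    """Count element tags while iterating, instead of building the full tree first."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the attributes and children of an element that is already counted.
        # The emptied element stays attached to its parent, so memory use is
        # reduced rather than eliminated.
        elem.clear()
    return dict(counts)


# Tiny illustrative document; a file name such as SAMPLE_FILE works the same way.
sample = io.BytesIO(
    b"<osm><node id='1'><tag k='a' v='b'/></node><way id='2'><nd ref='1'/></way></osm>"
)
tag_counts = count_tags_streaming(sample)
print(tag_counts)
```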

#### Auditing the k values of all tags

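In this section I audited the k values of the tags. Most of them were in an expected format; only $7$ contained problematic special characters and $3,382$ fell into the "other" category, which catches every key that matches none of the three patterns (for example keys with uppercase letters, digits or more than one colon, and probably keys containing Greek characters).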

In [6]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib["k"]
        if lower.search(k):
            keys["lower"] += 1
        elif lower_colon.search(k):
            keys["lower_colon"] += 1
        elif problemchars.search(k):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
        pass
        
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    
    return keys

process_map(SAMPLE_FILE)

{'lower': 614528, 'lower_colon': 116104, 'other': 3382, 'problemchars': 7}
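
To make the four categories concrete, the sketch below applies the same three regular expressions to a few hand-picked keys. The keys are illustrative examples and are not taken from the dataset, and `classify_key` is a hypothetical helper written for Python 3.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category that key_type would increment for this key."""
    if lower.search(k):
        return "lower"
    if lower_colon.search(k):
        return "lower_colon"
    if problemchars.search(k):
        return "problemchars"
    return "other"


for key in ["highway", "addr:street", "name 1", "addr:street:name", "FIXME"]:
    print(key, "=>", classify_key(key))
```

Keys with two colons or with uppercase letters match none of the patterns and therefore end up in "other".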

#### Identifying the unique contributors

Here I identified the unique contributors to the dataset. Counting the distinct "uid" values of the node elements gives $1,362$ unique users.

In [8]:
def get_user(element):
    return

def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        if element.tag == "node":
            users.add(element.attrib["uid"])
    #pprint.pprint(users)
    return users, len(users)
    
users, users_total = process_map(SAMPLE_FILE)
users_total

1362
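
The count above looks only at node elements. A possible variant that also considers ways and relations is sketched below for Python 3. The function name `unique_uids` and the sample document are hypothetical, and elements without a uid attribute are skipped instead of raising a KeyError.

```python
import io
import xml.etree.ElementTree as ET


def unique_uids(source, element_types=("node", "way", "relation")):
    """Collect the distinct uid values of the given top-level element types."""
    uids = set()
    for _, elem in ET.iterparse(source):
        if elem.tag in element_types:
            uid = elem.get("uid")
            if uid is not None:
                uids.add(uid)
    return uids


# Illustrative document: two nodes by user 10, one way by user 20,
# and one relation without a uid attribute.
SAMPLE = (
    b"<osm>"
    b"<node id='1' uid='10'/><node id='2' uid='10'/>"
    b"<way id='3' uid='20'/><relation id='4'/>"
    b"</osm>"
)

all_uids = unique_uids(io.BytesIO(SAMPLE))
node_uids = unique_uids(io.BytesIO(SAMPLE), element_types=("node",))
print(len(all_uids), len(node_uids))
```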

#### Improving street names

The names of Greek streets very often start with the street type, so I extended the code to check for the street type at the beginning of the street name as well. I also added the common Greek street types "Odos" and "Leoforos" (which translate to "Street" and "Avenue", respectively). When one of these terms is encountered, its translation is added in parentheses. Finally, I converted all street names written in Greek characters to their Latin equivalents.

Some of the street names are given in both Greek and English characters, for example Τζων Κένεντυ (John Kennedy). These cases were treated like all the rest (modified to "Tzon Kenentu (John Kennedy)"). Note that the Greek character "υ" on its own is pronounced like the English letter "e", but after 'ο' or 'α' it can be pronounced in different ways ("ou", "av" and "af" are some of them).

Certain Greek letters correspond to two characters in English, such as 'θ', which is pronounced 'th'. Typical Greek-to-"Greeklish" converters, however, represent it as an "8". This is understandable to Greeks, but probably not to non-Greeks. I slightly modified the code I used to convert Greek to Latin characters so that the result is easier for non-Greeks to pronounce.
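
The character-by-character conversion used in this notebook does not treat letter pairs such as "ου" or "αυ" specially. As a rough sketch of how that could be done, the Python 3 code below checks a small digraph table before falling back to single letters. The digraph rules are a simplifying assumption (the "af"/"ef" pronunciations are ignored), all accents and diaereses are dropped, and the names `SINGLE`, `DIGRAPHS`, `strip_accents` and `transliterate` are hypothetical.

```python
import unicodedata

# Lowercase Greek letters; the values follow the mapping used in this notebook.
SINGLE = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "i",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "ks",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "u",
    "φ": "f", "χ": "ch", "ψ": "ps", "ω": "o",
}
# Assumed, simplified rules for letter pairs.
DIGRAPHS = {"ου": "ou", "αυ": "av", "ευ": "ev"}


def strip_accents(text):
    """Remove combining marks (accents, diaereses) after decomposing the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text):
    """Lowercase the text and convert Greek letters, trying letter pairs first."""
    text = strip_accents(text.lower())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters that are not Greek letters pass through unchanged.
            out.append(SINGLE.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(transliterate("σταυρού").title())
print(transliterate("Αθηνών 12").title())
```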

Some street names are obviously street numbers. These were marked as "Invalid Street Name" when encountered. The code treats a name as a number when digits make up roughly half or more of its characters.
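
The digit check can be isolated as a small predicate. This is a sketch for Python 3 with a hypothetical function name. It compares twice the digit count with the length to avoid division altogether; in Python 2, `len(name)/2` is integer division, so the check in update_name rounds the threshold down for names of odd length and is slightly more permissive than this version.

```python
def looks_like_number(name):
    """True when at least half of the characters in the name are digits."""
    digits = sum(c.isdigit() for c in name)
    return 2 * digits >= len(name)


for candidate in ["44", "Acharnon 330", "Troias 34"]:
    print(candidate, "=>", looks_like_number(candidate))
```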

A few of the street names also start with a lowercase character. I capitalized the first character of every word for nicer presentation.

Moreover, after repeatedly inspecting the dictionary file produced in the "Preparing for database" section, I found other fields that contain Unicode characters. Some of these fields contain Russian or French characters. For this project I chose to convert only the Greek characters to Latin and to leave the rest as they are. The methods 'convert()' and 'get_conversion_pool()' defined in this section perform the conversion.

The other fields in which I identified greek characters are:

    1. 'name':'el'
    2. 'wikipedia':'el'
    3. 'created':'user'
    4. 'alt_name':
    5. 'name':
    6. 'addr':'city'

Finally, some of the street names are duplicates. For example "Leoforos Pasidonos Avenue" is the same as "Poseidonos Avenue" and "Liosion (Liosion)" is the same as "Liossion". I chose not to account for duplicates in this project.
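
Although duplicates are not handled in this project, candidate pairs could be flagged for manual review with a string similarity measure. The sketch below uses difflib from the standard library and is written for Python 3. The threshold of 0.85 is an arbitrary assumption, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def near_duplicates(names, threshold=0.85):
    """Return sorted pairs of distinct names whose similarity reaches the threshold."""
    unique = sorted(set(names))
    pairs = []
    for i, first in enumerate(unique):
        for second in unique[i + 1:]:
            if similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs


candidates = near_duplicates(["Liosion", "Liossion", "Michalakopoulou"])
print(candidates)
```

Comparing every pair is quadratic in the number of names, so this is only practical for a list of street names, not for the whole dataset.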

In [53]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Expected types
expected = ["Odos", "Leoforos", "Street", "Avenue", "Boulevard", "Place", "Square", "Road", 
            "Parkway", "Way"]

# Conversion of abbreviations
mapping = { "St": "Street",
            "St.": "Street",
            "Str.": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "Rd.": "Road",
            "Leof.": "Leoforos (Avenue)",
            "Leof": "Leoforos (Avenue)",
            "Leoforos": "Leoforos (Avenue)",
            "L.": "Leoforos (Avenue)",
            "Λ.": "Leoforos (Avenue)",
            "Λ": "Leoforos (Avenue)",
            "Odos": "Odos (Street)"
            }

# The next two methods are responsible for transforming greek characters
# that exist in words into latin characters.

def get_conversion_pool():
    poolGR = u"αβγδεζηικλμνοπρστυφωΑΒΓΔΕΖΗΙΚΛΜΝΟΠΡΣΤΥΦΩςάέήίϊΐόύϋΰώΆΈΉΊΪΌΎΫΏ"
    poolGL =  "avgdeziiklmnoprstufoAVGDEZIIKLMNOPRSTYFOsaeiiiiouuuoAEIIIOYYO"
    special_chars = [[u'θ','th'],[u'ξ','ks'],[u'ψ','ps'],[u'χ','ch'],[u'Θ','Th'],[u'Ξ','Ks'],[u'Ψ','Ps'],[u'Χ','Ch']]
    return dict(zip(poolGR, poolGL) + special_chars)

                   
def convert(datasource):
    pool = get_conversion_pool()
    output_line = []
    for character in datasource:
        if pool.has_key(character):
            output_line.append(pool[character])
        else:
            output_line.append(character)
    return "".join(output_line)

# This method checks whether a street type is expected or not
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

# This method identifies street tags inside nodes or ways
def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

# This method is responsible for converting the street names to their
# expected formats, as seen in the "mapping" dictionary
def update_name(name, mapping):
    fixed_name = None
    if isinstance(name, unicode):
        name = convert(name)
    
    name_start = cp(name).split(" ")[0]
    len_start = len(name_start)
    name_end = cp(name).split(" ")[-1]
    len_end = len(name_end)
    
    #clean_name_end = re.sub('[^A-Za-z0-9]+', '', name_end)
    #clean_name_start = re.sub('[^A-Za-z0-9]+', '', name_start)
    
    if (name_start in mapping) and (name_end in mapping):
        end = len(name) - len_end
        fixed_name = mapping[name_start] +\
                     name[len_start:end] +\
                     mapping[name_end]
    elif (name_start in mapping):
        fixed_name = mapping[name_start] + name[len_start:]
    elif name_end in mapping:
        end = len(name) - len_end
        fixed_name = name[0:end] + mapping[name_end]
    elif (name[0:2] in mapping) and (name[1] == '.'):
        fixed_name = mapping[name[0:2]] + ' ' + name[2:]
    elif (sum(c.isdigit() for c in name) >= len(name)/2):
        fixed_name = "Invalid Street Name"
    else:
        fixed_name = name
    return fixed_name.title()

# This method calls all the street names and converts them. It is used
# here for presentation purposes.
def correct_street_name():
    st_types = audit(SAMPLE_FILE)
    better_name = {}
    for st_type, ways in st_types.iteritems():
        for name in ways:
            better_name[name] = update_name(name, mapping)
            print name, "=>", better_name[name]
    return 

correct_street_name()

Stratigou Dimitriou Petriti => Stratigou Dimitriou Petriti
44 => Invalid Street Name
Voltairou => Voltairou
Agias Paraskeyis 2-3 => Agias Paraskeyis 2-3
N. Kazantzaki => N. Kazantzaki
Ελευθερίου Βενιζέλου (Eleftheriou Venizelou) => Eleutheriou Venizelou (Eleftheriou Venizelou)
Τζων Κένεντυ (John Kennedy) => Tzon Kenentu (John Kennedy)
Liossion => Liossion
Αχαρνών 330 => Acharnon 330
miaovyn => Miaovyn
Erithrou Stavrou => Erithrou Stavrou
Παλαιά Εθνική Οδός 89 => Palaia Ethniki Odos 89
Evripidou Str. => Evripidou Street
Filadelfeos Str. => Filadelfeos Street
Λ. Βεΐκου 83 => Leoforos (Avenue) Veikou 83
Evaggelistrias => Evaggelistrias
Alexandrou Panagouli => Alexandrou Panagouli
Kaisareias & Kountouriotou => Kaisareias & Kountouriotou
Τροίας 34 => Troias 34
Michalakopoulou => Michalakopoulou
Leof. Acheon => Leoforos (Avenue) Acheon
Leoforos Dimarchou Angelou Metaxa => Leoforos (Avenue) Dimarchou Angelou Metaxa
Dimarchou Angelou Metaxa => Dimarchou Angelou Metaxa
Νομικού Μιχαήλ 26 => Nomi

#### Preparing for database

Here I transformed the XML file into Python dictionaries in order to load it into the database.

In [54]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
double_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


# This method converts the data from the xml format to python dictionaries
# according to the following desired format:

'''
{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}
'''

def shape_element(element):
    node = {}
    
    if element.tag == "node" or element.tag == "way":
        node["type"] = element.tag
        for attr_name, attr_value in element.attrib.items():
            if attr_name == "lat":
                if "pos" not in node:
                    node["pos"] = []
                node["pos"].insert(0, float(attr_value))
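            elif attr_name == "lon":
                if "pos" not in node:
                    node["pos"] = []
                # append instead of insert(-1, ...): insert(-1, x) places x
                # before the last item, which would give [lon, lat] whenever
                # "lat" happens to be read before "lon"
                node["pos"].append(float(attr_value))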
            elif attr_name  in ["id", "visible", "type"]:
                node[attr_name] = convert(attr_value)
            else:
                if "created" not in node:
                    node["created"] = {}
                node["created"][attr_name] = convert(attr_value)
        for tag in element.iter("tag"):
            if problemchars.search(tag.attrib["k"]) or \
               double_colon.search(tag.attrib["k"]):
                pass
            else:
                if lower_colon.search(tag.attrib["k"]) and \
                    tag.attrib["k"].startswith("addr:"):
                    if "address" not in node:
                        node["address"] = {}
                    address_element = tag.attrib["k"][tag.attrib["k"].index(":") + 1:]
                    if address_element == "street":
                        name = update_name(tag.attrib["v"], mapping)
                        node["address"]["street"] = name
                    elif address_element == "city":
                        name = convert(tag.attrib["v"])
                        node["address"][address_element] = name.title()
                    else:
                        node["address"][address_element] = tag.attrib["v"].title()
                elif lower_colon.search(tag.attrib["k"]):
                    pre_colon = tag.attrib["k"][:tag.attrib["k"].index(":") + 1]
                    post_colon = tag.attrib["k"][tag.attrib["k"].index(":") + 1:]
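                    # pre_colon still ends with ":", so test the bare prefix.
                    # Create the sub-dictionary only when it is missing, so
                    # that earlier sub-keys of the same prefix are kept.
                    if not isinstance(node.get(pre_colon[:-1]), dict):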
                        node[pre_colon[:-1]] = {}
                    node[pre_colon[:-1]][post_colon] = convert(tag.attrib["v"]).title()
                else:
                    node[tag.attrib["k"]] = convert(tag.attrib["v"]).title()
        for tag in element.iter("nd"):
            if "node_refs" not in node:
                node["node_refs"] = []
            node["node_refs"].append(convert(tag.attrib["ref"]))
        return node
    else:
        return None

# This method converts the python dictionaries to JSON and passes it
# to a data variable, which later can be imported to the database.
def process_map(file_in, pretty = False):
    # Write one JSON document per line and keep the documents in memory
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

data = process_map(SAMPLE_FILE, False)

In [55]:
# I print some of the data just for reference
data[0:100]

[{'created': {'changeset': '10662021',
   'timestamp': '2012-02-12T11:51:59Z',
   'uid': '497621',
   'user': 'NikosSkouteris',
   'version': '5'},
  'id': '78695',
  'pos': [37.5964742, 23.0709818],
  'type': 'node'},
 {'created': {'changeset': '10662021',
   'timestamp': '2012-02-12T11:51:59Z',
   'uid': '497621',
   'user': 'NikosSkouteris',
   'version': '4'},
  'id': '78696',
  'pos': [37.5961011, 23.0711918],
  'type': 'node'},
 {'created': {'changeset': '10662021',
   'timestamp': '2012-02-12T11:51:59Z',
   'uid': '497621',
   'user': 'NikosSkouteris',
   'version': '4'},
  'id': '78697',
  'pos': [37.5958087, 23.071361],
  'type': 'node'},
 {'created': {'changeset': '10662021',
   'timestamp': '2012-02-12T11:51:59Z',
   'uid': '497621',
   'user': 'NikosSkouteris',
   'version': '4'},
  'id': '78698',
  'pos': [37.5956932, 23.0716514],
  'type': 'node'},
 {'created': {'changeset': '2335200',
   'timestamp': '2009-09-01T10:25:48Z',
   'uid': '136321',
   'user': 'Teddy73',
   'v
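
The attribute handling of the desired format can also be written so that it does not depend on the order in which the attributes are read. The sketch below is written for Python 3, uses a hypothetical helper name `shape_attributes`, and covers only the "created", "pos", "id" and "visible" parts of a document.

```python
CREATED = ["version", "changeset", "timestamp", "user", "uid"]


def shape_attributes(attrib):
    """Build the attribute-derived part of a document from an attribute mapping."""
    doc = {"created": {key: attrib[key] for key in CREATED if key in attrib}}
    if "lat" in attrib and "lon" in attrib:
        # Read both coordinates explicitly: latitude first, then longitude.
        doc["pos"] = [float(attrib["lat"]), float(attrib["lon"])]
    for key in ("id", "visible"):
        if key in attrib:
            doc[key] = attrib[key]
    return doc


example = shape_attributes(
    {"lon": "23.0709818", "lat": "37.5964742", "id": "78695", "uid": "497621"}
)
print(example)
```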

##### Import to MongoDB

Here I created the database and I inserted the clean data in it.

In [56]:
def insert_data(infile, db):
    db.athens_map.insert_many(infile)

client = MongoClient("mongodb://localhost:27017")
db = client.athens_map

insert_data(data, db)

##### Example of a datapoint

In [57]:
db.athens_map.find_one()

{u'_id': ObjectId('583ee56a8a99fb16002d0bef'),
 u'created': {u'changeset': u'10662021',
  u'timestamp': u'2012-02-12T11:51:59Z',
  u'uid': u'497621',
  u'user': u'NikosSkouteris',
  u'version': u'5'},
 u'id': u'78695',
 u'pos': [37.5964742, 23.0709818],
 u'type': u'node'}

##### Counting the data points in the database

In [58]:
db.athens_map.find().count()

2464926

##### Contributors

Here we can see the number of contributions per user. Previously we saw that the total number of unique contributors is $1,362$. What is very interesting is that the top contributor alone has contributed more than the next $10$ contributors combined. Even a quick look at the results shows a strong imbalance in the contribution effort.

In [77]:
city = db.athens_map.aggregate([{'$group': {"_id": '$created.user',
                                            "count": {"$sum": 1}}},
                                {'$sort': {"count": -1}}, 
                                {"$limit":50}])

# Note: {a, b} builds a set, so the order inside each printed pair is arbitrary
pprint.pprint([{c['_id'], c['count']} for c in city])

[set([1161365, u'makmar']),
 set([372124, u'greecemapper']),
 set([134356, u'mtraveller']),
 set([111571, u'Amaroussi']),
 set([77301, u'NikosSkouteris']),
 set([63682, u'Chris Makridis']),
 set([61411, u'aitolos']),
 set([54137, u'Kanenas']),
 set([30202, u'athinaios']),
 set([29896, u'AiNikolas']),
 set([21382, u'GeorgeKM']),
 set([14858, u'Ori952']),
 set([14319, u'Nautic']),
 set([13890, u'Alex111X']),
 set([13418, u'ika-chan!']),
 set([12561, u'nikolakis']),
 set([12333, u'mappas']),
 set([11227, u'xenofondas']),
 set([11041, u'SophoM']),
 set([8962, u'asyr']),
 set([8555, u'gmar55']),
 set([8438, u'JayCBR']),
 set([8375, u'concartman']),
 set([8061, u'hugelsepp']),
 set([7804, u'gata_osm']),
 set([7639, u'0000FF berries']),
 set([7553, u'nikpet67']),
 set([7312, u'Nikos1961']),
 set([6791, u'Peter D']),
 set([6445, u'Spyros Ligouras']),
 set([6419, u'YalCat']),
 set([4277, u'armitatz']),
 set([4203, u'Antre']),
 set([3771, u'drimiskos']),
 set([3733, u'Willem1']),
 set([3681, u'G
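
The imbalance can be checked with simple arithmetic on the counts printed above and on the total number of documents reported earlier ($2,464,926$). This is a small Python 3 sketch; the variable names are hypothetical.

```python
# Counts of the eleven most active users, copied from the output above.
top_counts = [1161365, 372124, 134356, 111571, 77301, 63682,
              61411, 54137, 30202, 29896, 21382]
total_documents = 2464926  # result of db.athens_map.find().count()

top_share = top_counts[0] / total_documents
next_ten = sum(top_counts[1:11])

print("top contributor share: {:.1%}".format(top_share))
print("next ten combined:", next_ten)
```

The top contributor accounts for about 47% of all documents, while the next ten together account for 956,062 documents.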

##### The city field

Here I grouped all occurrences of the same city field found in the data and sorted them in descending order.

The first issue I encountered is that $2,449,144$ data points have no city tag. This is more than $99\%$ of the data points and shows that a lot of work is still needed on the OpenStreetMap data for the area of Athens.

The rest of the results might look a bit strange to those not familiar with the city. Athens and Piraeus are the two major cities in the area; the rest are suburbs and municipalities that are usually referred to as parts of the closest major city. Also, some of the results are parts of the greater area called 'Attiki' or 'Attica', but it is fine to consider most of them suburbs of Athens.

The results for 'Athina', 'Athens', 'Athen' and 'Athina (Athens)' all refer to Athens. 'Athina - Alimos' refers to the area of Alimos in Athens. Most of the remaining results are as expected, apart from 'Thiva' and 'Marmari', which are not in the area of Athens at all and are probably in the dataset by mistake.

It is notable that the suburb of 'Acharnes' ($3^{rd}$ place) appears more often than 'Peiraias' ($4^{th}$ place). This is an odd result because the latter is bigger and more populous than the former. The most reasonable explanation is that some of the contributors might be from 'Acharnes' and thus put more work into documenting that area.

In [59]:
city = db.athens_map.aggregate([{'$group': {"_id": '$address.city',
                                            "count": {"$sum": 1}}},
                                {'$sort': {"count": -1}}])

pprint.pprint([{c['_id'], c['count']} for c in city])

[set([None, 2449144]),
 set([9873, u'Athina']),
 set([1783, u'Acharnes']),
 set([1628, u'Peiraias']),
 set([556, u'Kallithea']),
 set([235, u'Arguroupoli']),
 set([134, u'Marousi']),
 set([93, u'Moschato']),
 set([83, u'Galatsi']),
 set([72, u'Nea Ionia']),
 set([64, u'Aigaleo']),
 set([59, u'Chalandri']),
 set([59, u'Agios Ioannis Renti']),
 set([58, u'Alimos']),
 set([55, u'Elliniko']),
 set([54, u'Zografou']),
 set([48, u'Thiva']),
 set([46, u'Petroupoli']),
 set([46, u'Dimotiki Enotita Ellinikou']),
 set([40, u'Nea Filadelfeia']),
 set([37, u'Peuki']),
 set([35, u'Keratsini']),
 set([32, u'Glufada']),
 set([31, u'Athens']),
 set([31, u'Peristeri']),
 set([23, u'Nikaia']),
 set([23, u'Korudallos']),
 set([21, u'Ilion']),
 set([21, u'Neo Irakleio']),
 set([18, u'Nea Smurni']),
 set([18, u'Agios Dimitrios']),
 set([18, u'Vuronas']),
 set([17, u'Tauros']),
 set([17, u'Palaio Faliro']),
 set([16, u'Chaidari']),
 set([14, u'Agia Paraskeui']),
 set([14, u'Kaisariani']),
 set([13, u'Papago
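
One way to reduce the double counting of the same city under different spellings is to map known variants to a single name before counting. The sketch below is written for Python 3. The alias table is a hypothetical example covering only the spellings of Athens mentioned above, and the three counts are copied from the output.

```python
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical name.
CITY_ALIASES = {
    "Athens": "Athina",
    "Athen": "Athina",
    "Athina (Athens)": "Athina",
}


def merge_city_counts(counts, aliases=CITY_ALIASES):
    """Add up the counts of all spellings that map to the same canonical name."""
    merged = Counter()
    for city, count in counts.items():
        merged[aliases.get(city, city)] += count
    return merged


observed = {"Athina": 9873, "Acharnes": 1783, "Athens": 31}
merged = merge_city_counts(observed)
print(merged.most_common())
```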

##### Counting the top amenities

From the amenity results one can see that cafes are very frequent. This matches reality in Athens and shouldn't be a surprise.

A very surprising result is the number of places of worship. In the area of Attica there are only Greek Orthodox churches and very few other Christian churches. There are no mosques. A search on Greek websites shows that there are $960$ Greek Orthodox churches in the area of Attica, while the data contain $1,016$ places of worship. This means that this amenity is documented quite accurately in OpenStreetMap for Athens. Some of the other results seem under-documented to me, especially those for 'Atm', 'Bar', 'Fast_Food' and 'School'. Interestingly, archaeological sites are missing from the top 30, which also doesn't seem normal.

In [78]:
amenity = db.athens_map.aggregate([ {"$match":{"amenity":{"$exists":1}}},
                                    {'$group': {"_id": '$amenity',
                                                "count": {"$sum": 1}}},
                                    {'$sort': {"count": -1}}, 
                                    {"$limit":30}])

pprint.pprint([{a['_id'], a['count']} for a in amenity])

[set([1650, u'Bench']),
 set([1296, u'Cafe']),
 set([1266, u'Parking']),
 set([1137, u'Restaurant']),
 set([1016, u'Place_Of_Worship']),
 set([986, u'School']),
 set([771, u'Pharmacy']),
 set([725, u'Fuel']),
 set([673, u'Fast_Food']),
 set([506, u'Bank']),
 set([349, u'Telephone']),
 set([271, u'Waste_Basket']),
 set([245, u'Bar']),
 set([165, u'Kindergarten']),
 set([152, u'Theatre']),
 set([146, u'Atm']),
 set([125, u'Toilets']),
 set([123, u'Parking_Entrance']),
 set([120, u'Recycling']),
 set([105, u'Cinema']),
 set([101, u'Public_Building']),
 set([94, u'Drinking_Water']),
 set([92, u'Post_Office']),
 set([83, u'Fountain']),
 set([79, u'Hospital']),
 set([77, u'Bicycle_Parking']),
 set([75, u'Police']),
 set([73, u'Social_Facility']),
 set([67, u'Taxi']),
 set([62, u'Grave_Yard'])]
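
For readers less familiar with MongoDB, the four pipeline stages used above can be mirrored in plain Python. This is only an in-memory sketch for Python 3 with hypothetical documents; it illustrates what the stages do and is not a replacement for the database query.

```python
from collections import Counter


def top_amenities(documents, limit=30):
    """Mirror the $match, $group, $sort and $limit stages on a list of dicts."""
    # $match: keep documents that have an "amenity" field.
    # $group: count the documents per amenity value.
    counts = Counter(doc["amenity"] for doc in documents if "amenity" in doc)
    # $sort (descending count) and $limit.
    return counts.most_common(limit)


# Hypothetical documents, not taken from the dataset.
docs = [
    {"amenity": "Bench"}, {"amenity": "Cafe"}, {"amenity": "Bench"},
    {"type": "node"}, {"amenity": "Bench"}, {"amenity": "Cafe"},
]
ranking = top_amenities(docs)
print(ranking)
```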


## Discussion

In this project I analyzed the OpenStreetMap dataset for the area of Athens. I parsed the XML file, cleaned it to an extent, converted it into Python dictionaries and loaded it into a MongoDB database.

Even though I converted all the Greek characters to Latin, this is not enough. There are data points that include Russian, French and other characters, which also need conversion to Latin.

From the queries I ran against the database, it seems that a lot of additional work is needed on the "city" field of the data points, especially to reduce double counting of the same areas. For that purpose a decision is needed on whether areas should be indexed with only Greek characters (e.g. "Αθήνα"), both Greek and English characters (e.g. "Αθήνα (Athina)"), both Greek and English names (e.g. "Αθήνα (Athens)") or only in English (e.g. "Athens"). Also, the various areas and suburbs can be presented either with only their name (e.g. "Kifisia") or with both their name and the name of the nearest bigger city or area (e.g. "Kifisia (Athens)").

In my opinion the best way to represent such data is in the following format:
    
    Κηφισιά - Αθήνα (Kifisia - Athens)
    
Finally, there is a strong imbalance in the contributions by the users. This means that very few people are responsible for most of the work in the dataset. In my opinion this is a major reason for having many incomplete data points and an over-representation of some areas, such as 'Acharnes'.

## References

OpenStreetMap information: https://www.openstreetmap.org/

OpenStreetMap wiki: https://wiki.openstreetmap.org/wiki/Main_Page

get_element method of the xml parser: http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python

Python documentation: https://www.python.org/doc/

Pymongo documentation: https://api.mongodb.com/python/current/

Greek to latin converter: https://www.g-loaded.eu/2006/12/18/pygr2gl-greek-to-greeklish-converter/

Count digits in string: https://stackoverflow.com/questions/24878174/how-to-count-digits-letters-spaces-for-a-string-in-python

Churches in Attica: http://parganews.com/%CF%80%CF%8C%CF%83%CE%B5%CF%82-%CE%B5%CE%BA%CE%BA%CE%BB%CE%B7%CF%83%CE%AF%CE%B5%CF%82-%CF%85%CF%80%CE%AC%CF%81%CF%87%CE%BF%CF%85%CE%BD-%CF%83%CF%84%CE%B7%CE%BD-%CF%85%CF%80%CE%AD%CF%81%CE%BF%CF%87/