# Wrangling OpenStreetMap Data

### Map area: Jakarta, Indonesia

**Data source:** https://s3.amazonaws.com/metro-extracts.mapzen.com/jakarta_indonesia.osm.bz2

---

## Overview

After downloading the map data for Jakarta, I ran some initial checks on the file.

In [1]:
from osm_datadescriptor import OSMDataDescriptor 

osm_data = OSMDataDescriptor('jakarta_indonesia.osm')

In [2]:
# Basic element check
osm_data.count_element()

{'bounds': 1,
 'member': 2083,
 'nd': 2522657,
 'node': 1994318,
 'osm': 1,
 'relation': 420,
 'tag': 700156,
 'way': 364030}
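The `OSMDataDescriptor` class is not shown in this report. As a rough idea of how a count like the one above could be produced with the standard library alone, here is a minimal sketch. The function name `count_elements` and the tiny in-memory sample are assumptions made for illustration, not the actual implementation.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source):
    """Count XML elements by tag name while streaming through the file."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=('end',)):
        counts[elem.tag] += 1
        elem.clear()  # release the element's attributes and children
    return dict(counts)


# Hypothetical miniature OSM document, used in place of the real extract.
sample = io.BytesIO(b"""<osm>
  <node id="1" lat="-6.2" lon="106.8"><tag k="name" v="A"/></node>
  <node id="2" lat="-6.3" lon="106.9"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>""")

counts = count_elements(sample)
```

`ET.iterparse` also accepts a file name, so the same function could be pointed at `jakarta_indonesia.osm` directly.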

OSM allows a very flexible tagging system, which gives users some freedom but causes problems with consistency.
I counted the number of occurrences of every tag key in the document.

In [3]:
# Check the tag key and element
osm_data.get_tag_keys()

[('1', 2),
 ('ALAMAT', 2),
 ('BABAKAN', 2),
 ('CIKEAS', 2),
 ('Company', 2),
 ('FIXME', 39),
 ('FLOODPRONE', 27),
 ('Gedung Fasilitas Remaja', 2),
 ('Gerbang Bukit Pelangi', 2),
 ('ID', 2),
 ('ISO3166-1', 2),
 ('ISO3166-1:alpha2', 2),
 ('ISO3166-1:alpha3', 2),
 ('ISO3166-1:numeric', 2),
 ('ISO3166-2', 6),
 ('Id', 882),
 ('JENIS', 2),
 ('Jalan', 2),
 ('Jenis', 4),
 ('Jenis Atap', 2),
 ('Jenis Tembok', 2),
 ('KAB_NAME', 28),
 ('KEC_NAME', 29),
 ('KEL_NAME', 29),
 ('KODE', 2),
 ('Kab.', 2),
 ('Kabupaten', 2),
 ('Kereta Api', 2),
 ('Keterangan', 2),
 ('Latitude', 2),
 ('Longitude', 2),
 ('NAMA', 2),
 ('NMR', 2),
 ('Nama', 92),
 ('Name', 2),
 ('Note', 20),
 ('OBJECTID', 3716),
 ('Propinsi', 2),
 ('Province', 2),
 ('RT', 2),
 ('RW', 53),
 ('Region', 2),
 ('Render', 2),
 ('SDN', 4),
 ('SDT', 2),
 ('SLTA_', 2),
 ('SLTA_ID', 2),
 ('SMU_', 2),
 ('SMU_ID', 2),
 ('School Building', 2),
 ('Struktur', 2),
 ('Sukaraja', 2),
 ('TINGKAT', 2),
 ('Taman', 2),
 ('Use', 4),
 ('access', 818),
 ('access:roof

Looking at the tags above, we can see a lot of inconsistency. For instance, some tag keys are uppercase and others are lowercase. For 'name' we have three different keys: 'NAMA', 'Nama', and 'Name'. For province we have 'Propinsi' and 'Province'.
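One way to surface this kind of inconsistency automatically is to group keys that differ only by letter case. The sketch below does that for a few of the `(key, count)` pairs listed above; the function name `group_key_variants` is hypothetical.

```python
from collections import defaultdict


def group_key_variants(key_counts):
    """Group tag keys that differ only by letter case."""
    groups = defaultdict(list)
    for key, count in key_counts:
        groups[key.lower()].append((key, count))
    # Keep only keys that appear in more than one spelling.
    return {k: v for k, v in groups.items() if len(v) > 1}


# A few pairs taken from the tag key listing above.
sample = [('ID', 2), ('Id', 882), ('JENIS', 2), ('Jenis', 4),
          ('NAMA', 2), ('Nama', 92), ('Name', 2), ('RW', 53)]

variants = group_key_variants(sample)
```

This only catches case differences. Pairs such as 'Nama' and 'Name', or 'Propinsi' and 'Province', are the same concept in two languages and would still need a hand-written mapping.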

With so many inconsistencies, I focused on addresses, simply because I am more familiar with this information and can quickly verify it.

For addresses there are three related tags: 'ALAMAT' (Indonesian for "address"), 'addr:street', and 'addr:full'.
Both 'addr:street' and 'addr:full' are valid tags, so we cannot merge them.
The [OSM wiki](http://wiki.openstreetmap.org/wiki/Key:addr) implies that using 'addr:street' and the other supporting fields is better than using 'addr:full', but our data contains more 'addr:full' than 'addr:street' tags (10038 vs 1441).

Okay, let's import the data into MongoDB.

In [4]:
from osm_dataimporter import OSMDataImporter

importer = OSMDataImporter(db_name='osm_data_import', db_collection_name='jakarta')
importer.import_data('jakarta_indonesia.osm')
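The `OSMDataImporter` class is not shown here. The queries later in this report rely on a `type` field, a `created.user` field, and tag keys such as `amenity` stored at the top level of each document, so a conversion along the following lines would be consistent with them. This is only a sketch: the function name `shape_element` and the `id`, `pos`, and `node_refs` fields are assumptions.

```python
import xml.etree.ElementTree as ET

CREATED_FIELDS = ('version', 'changeset', 'timestamp', 'user', 'uid')
# Field names the converter sets itself; tags must not overwrite them.
RESERVED = {'type', 'id', 'created', 'pos', 'node_refs'}


def shape_element(elem):
    """Convert a <node> or <way> element into a dict; return None otherwise."""
    if elem.tag not in ('node', 'way'):
        return None
    doc = {'type': elem.tag, 'id': elem.get('id'), 'created': {}}
    for field in CREATED_FIELDS:
        if field in elem.attrib:
            doc['created'][field] = elem.get(field)
    if 'lat' in elem.attrib and 'lon' in elem.attrib:
        doc['pos'] = [float(elem.get('lat')), float(elem.get('lon'))]
    for tag in elem.iter('tag'):
        key = tag.get('k')
        if key in RESERVED:
            continue  # e.g. a 'type' tag would otherwise replace 'way'
        doc[key] = tag.get('v')
    if elem.tag == 'way':
        doc['node_refs'] = [nd.get('ref') for nd in elem.iter('nd')]
    return doc


# Hypothetical way element, including a 'type' tag that collides with our field.
way = ET.fromstring(
    '<way id="3" user="example_user" uid="7" version="1">'
    '<nd ref="1"/><nd ref="2"/>'
    '<tag k="type" v="multipolygon"/><tag k="highway" v="residential"/>'
    '</way>'
)
doc = shape_element(way)
```

Skipping reserved keys is a deliberate choice in this sketch. Without it, a way carrying a `type` tag would no longer match `{'type': 'way'}`, which is one possible reason a query on `type` could return fewer ways than the raw element count.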

---

## Problems Encountered in the Map

The problems with the addresses:

* Abbreviated street names (Jl. Masjid Almunawarah, Jln Perintis, etc.)
* Abbreviated alley names (Gg. Kembang)

The street abbreviation comes in several variations: 'jl.', 'jln.', 'jl', and 'jln'.
Some entries are all upper case, some are all lower case, and some are mixed.

In [5]:
# Clean up street names
importer.cleanup_address_street()
# Clean up alley names
importer.cleanup_address_alley()
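The two cleanup methods live in the importer and are not reproduced in this report. A minimal sketch of the normalization they describe, assuming the abbreviations are expanded to 'Jalan' (street) and 'Gang' (alley), might look like this. The function name `expand_street_prefix` is hypothetical, and the casing of the rest of the name is left untouched.

```python
import re

# Assumed canonical forms: 'Jalan' for streets and 'Gang' for alleys.
_PREFIXES = [
    (re.compile(r'^\s*(?:jln|jl)\b\.?\s*', re.IGNORECASE), 'Jalan '),
    (re.compile(r'^\s*gg\b\.?\s*', re.IGNORECASE), 'Gang '),
]


def expand_street_prefix(name):
    """Expand a leading 'jl', 'jln' or 'gg' abbreviation, with or without a dot."""
    for pattern, replacement in _PREFIXES:
        if pattern.match(name):
            return pattern.sub(replacement, name, count=1).strip()
    return name.strip()


cleaned = [expand_street_prefix(n) for n in
           ['Jl. Masjid Almunawarah', 'Jln Perintis', 'Gg. Kembang']]
```

The word boundary (`\b`) keeps the pattern from touching names that merely start with the same letters, and names that already begin with 'Jalan' pass through unchanged.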

---

## Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them.
                                                
File sizes:
                                                
jakarta_indonesia.osm ......... 449.2 MB


In [6]:
from pymongo import MongoClient

client = MongoClient()
db = client['osm_data_import']

In [7]:
# Number of documents
db.jakarta.find().count()

2358348

In [8]:
# Number of nodes
db.jakarta.find({'type': 'node'}).count()

1994318

In [9]:
# Number of ways
db.jakarta.find({'type': 'way'}).count()

363954

In [10]:
# Number of unique users
len(db.jakarta.distinct('created.user'))

1365

In [11]:
# Top 10 contributing users
list(db.jakarta.aggregate([{'$group': {'_id': '$created.user', 'count': {'$sum': 1}}}, {'$sort':{'count':-1}}, {'$limit':10}]))

[{u'_id': u'Alex Rollin', u'count': 409359},
 {u'_id': u'PutriRachiemnys', u'count': 171520},
 {u'_id': u'zahrabanu', u'count': 124793},
 {u'_id': u'Dosandriani', u'count': 114818},
 {u'_id': u'miftajnh', u'count': 114544},
 {u'_id': u'dfo', u'count': 110296},
 {u'_id': u'naomiangelia', u'count': 104560},
 {u'_id': u'Firman Hadi', u'count': 96807},
 {u'_id': u'anisa berliana', u'count': 89299},
 {u'_id': u'ceyockey', u'count': 70948}]

In [12]:
# Place of worship breakdown
list(db.jakarta.aggregate([
        {"$match":{"amenity":"place_of_worship"}},
        {"$group":{"_id":"$religion", "count":{"$sum":1}}},
        {"$sort":{"count":-1}}
    ]))

[{u'_id': u'muslim', u'count': 3438},
 {u'_id': u'christian', u'count': 374},
 {u'_id': u'buddhist', u'count': 68},
 {u'_id': None, u'count': 68},
 {u'_id': u'hindu', u'count': 15},
 {u'_id': u'confucian', u'count': 4}]

---

## Additional ideas

Jakarta experiences flooding every year. Residents commonly believe there is a cycle: a small flood every year and a big one every five years.

Looking at the tag list, I saw this:

    ('flood:overflow', 2619),
    ('flood:rain', 4859),
    ('flood:rob', 1049),
    ('flood:send', 3362),
    ('flood_cause:overflowing_river', 2),
    ('flood_depth', 5860),
    ('flood_duration', 5696),
    ('flood_latest', 5845),
    ('flood_prone', 21051),
    ('floodprone', 19)

This is great: we already have some flooding information.
But I imagine it is difficult to add this information manually.

Fortunately, Indonesians love Twitter, and they tweet about flooding every time it happens.
Some users turn on geolocation, so we can probably use their tweets to add more flooding information to our data: use the Twitter API to fetch flood-related tweets, read the geolocation (if needed, use a Google API for reverse geocoding), and add the corresponding entries to the OSM data.
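To make the idea a bit more concrete, here is a rough sketch of the filtering step only. It does not call any real API: each report is a hypothetical dict with `text`, `lat`, and `lon` keys standing in for whatever a Twitter client would return, and the `flood_prone` value and `source` key written to each document are assumptions.

```python
FLOOD_KEYWORDS = ('banjir', 'flood')  # 'banjir' is Indonesian for flood


def reports_to_flood_nodes(reports):
    """Turn geotagged flood reports into node-like documents.

    Reports without coordinates or without a flood keyword are skipped.
    """
    nodes = []
    for report in reports:
        lat, lon = report.get('lat'), report.get('lon')
        if lat is None or lon is None:
            continue
        text = report.get('text', '').lower()
        if not any(word in text for word in FLOOD_KEYWORDS):
            continue
        nodes.append({
            'type': 'node',
            'pos': [lat, lon],
            'flood_prone': 'yes',  # assumed value
            'source': 'twitter',   # assumed provenance marker
        })
    return nodes


# Hypothetical reports, made up for illustration.
reports = [
    {'text': 'Banjir lagi di sini', 'lat': -6.21, 'lon': 106.85},
    {'text': 'Banjir parah', 'lat': None, 'lon': None},
    {'text': 'Macet total', 'lat': -6.18, 'lon': 106.82},
]
flood_nodes = reports_to_flood_nodes(reports)
```

In practice a single tweet is weak evidence, so it would probably make sense to require several reports near the same location before marking an area as flood prone.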

---

## Conclusion and Notes

The data we obtained from OSM is far from perfect. For the purpose of this exercise, however, I have cleaned up the addresses.

Some notes:

Indonesia has a somewhat complex administrative subdivision. It is divided as follows:

* Province
* Regency (Kabupaten) or City (Kota)
* District (Kecamatan)
* Village

There are also non-administrative divisions such as RT (Rukun Tetangga) and RW (Rukun Warga). Although RT and RW are considered non-administrative subdivisions, they are widely used (the ID card includes and requires this information). From a data point of view they are also interesting to know: an RT has a maximum of 30 households.

These divisions do not map cleanly onto the OSM address tags described [here](http://wiki.openstreetmap.org/wiki/Key:addr), which leaves room for multiple interpretations.
Given the complexity of the divisions, not many people can fill in this information easily.
A typical user will not immediately know the regency or district of a location.

I believe this is what leads users to simply put the whole address in 'addr:full', as this is much simpler.
But as the OSM wiki warns, putting everything in 'addr:full' makes it harder for software to parse.
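As a small illustration of that parsing burden, the sketch below tries to pull the RT and RW numbers out of a free-form address string. The function name `extract_rt_rw` and the sample address are hypothetical, and the pattern only covers a few common spellings such as 'RT 05/RW 03' and 'RT.005 RW.003'.

```python
import re

# Matches e.g. 'RT 05/RW 03', 'RT.005/RW.003', 'rt 5 rw 3'.
_RT_RW = re.compile(
    r'\bRT\.?\s*0*(\d+)\s*[/,]?\s*RW\.?\s*0*(\d+)',
    re.IGNORECASE,
)


def extract_rt_rw(full_address):
    """Return (rt, rw) as integers, or None if no RT/RW pair is found."""
    match = _RT_RW.search(full_address)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical 'addr:full' value, made up for illustration.
sample = 'Jl. Kebon Jeruk No. 12 RT 05/RW 03, Kebon Jeruk, Jakarta Barat'
rt_rw = extract_rt_rw(sample)
```

Every additional component (district, village, postcode) would need its own pattern and its own list of exceptions, which is exactly the cost that separate 'addr:*' tags avoid.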

There have been some efforts by the community to add the divisions to the data, but the results are inconsistent.
For instance, for regency we have 'KAB_NAME', 'Kabupaten', 'kab.', etc. Some ways and nodes instead use the 'admin_level' tag with 'kabupaten' as the value.

I believe the community needs to agree on a convention for writing addresses,
and find a way to convince the rest of the community to follow it.