# Data Wrangling | Kolkata OpenStreetMap
<hr>

## 1. Introduction

<br><center>"<b><i>It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.</i></b>"</center><br>

The first and most important step of data preparation is cleaning: correcting inconsistent data, filling in missing values and smoothing out noisy data. A dataset may contain many rows that lack values for attributes of interest, as well as inconsistent data, duplicate records or other random errors. All of these data quality issues are tackled in this first step.

Data from sensors, documents, the web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.

Missing values are handled in various ways depending on the requirement: by ignoring the tuple, by filling in the mean value of the attribute or a global constant, or by using techniques such as decision trees or Bayesian formulas. Noisy data is handled manually or through regression or clustering techniques.

In this project we aim to look into the data wrangling phase by cleaning the OpenStreetMap data of the city of joy, Kolkata.

## 2. Familiarization with the dataset

* We need to extract a region from [OpenStreetMap](https://www.openstreetmap.org). I chose my home city, Kolkata, for this data wrangling exercise.

* We start by downloading the [OSM XML](https://s3.amazonaws.com/metro-extracts.mapzen.com/kolkata_india.osm.bz2) of Kolkata from [Metro Extracts | Mapzen](https://mapzen.com/data/metro-extracts/metro/kolkata_india/).

* However, the popular extract for this city also includes part of Bangladesh, which is undesirable. So I made a [Custom Extract](https://mapzen.com/data/metro-extracts/your-extracts/320cb2360f25), which can be accessed if you have a Mapzen developer account.

* The file becomes 680.5 MB on uncompressing with a total of 8870629 lines in the XML. Before we start exploring, let us get used to the structure of the XML. [The wiki of OSM](http://wiki.openstreetmap.org/wiki/OSM_XML) suggests that they use three data primitives <b>node</b>, <b>way</b> and <b>relation</b>.

* **Node**: It consists of a single point in space defined by its **latitude**, **longitude** and **node id**. Nodes can define their own features using **tag** and its attributes **k** (key) and **v** (value). Here's a [guide](http://wiki.openstreetmap.org/wiki/Map_Features) to all possible key-value pairs. Let's see an [example](https://www.openstreetmap.org/node/3409030014) of this kind of node.
```xml
<node id="3409030014" visible="true" version="2" changeset="49141539" timestamp="2017-05-31T17:00:28Z" user="Srihari Thalla" uid="2815653" lat="22.5194936" lon="88.3772832">
<tag k="amenity" v="bank"/>
<tag k="brand" v="ICICI"/>
...
<tag k="website" v="https://www.icicibank.com"/>
<tag k="wikipedia" v="en:ICICI Bank"/>
</node>
```
Nodes can also have no tags at all. Such nodes are points used to define the shape of a **way**. Here's an [example](https://www.openstreetmap.org/node/419272549) of a node that is used in a way.
```xml
 <node id="419272549" visible="true" version="1" changeset="1498157" timestamp="2009-06-12T17:57:15Z" user="thevikas" uid="17429" lat="22.4975510" lon="88.3743585"/>
```
* **Way**: A way is an ordered list of nodes which normally also has at least one tag or is included within a Relation. A way can have between 2 and 2,000 nodes, *although it's possible that faulty ways with zero or a single node exist*. 
    * Ways can [have](https://www.openstreetmap.org/way/120004528) or [not have](https://www.openstreetmap.org/way/120004531) a name.
    * Ways may be [open](https://www.openstreetmap.org/way/120004544) or [closed](https://www.openstreetmap.org/way/120004545). An open way generally represents a road, and its start node and end node differ. A closed way represents a [bounded polygon](http://www.openstreetmap.org/way/357985873#map=19/22.65737/88.49060).
    

* **Relation**: It is an ordered list of one or more nodes, ways and/or relations as members, which is used to define [logical or geographic](https://www.openstreetmap.org/relation/5341780) relationships between other elements. An example, the V9 bus route, looks like this:
```xml
<relation id="4147368" visible="true" version="7" changeset="43126343" timestamp="2016-10-24T15:18:55Z" user="ediyes" uid="1240849">
    <member type="node" ref="3157361872" role="stop"/>
    <member type="node" ref="2211350535" role="stop"/>
    ...
    <tag k="ref" v="V9"/>
    <tag k="route" v="bus"/>
    ...
</relation>
```

## 3. Parsing the OSM XML

Let us use the ***map_parser.py*** to parse the XML file. It will give us a general idea of the available tags along with their count. I have re-arranged the actual output for a better understanding.

In [None]:
{'node': 3100069,    # total count of nodes, with (stand-alone) and without (used in ways) point features
 'way': 640011,      # total number of ways
 'nd': 3767632,      # total node references used in the creation of ways
 'relation': 1247,   # total number of relations describing a geographic or logical relationship
 'member': 5271,     # total number of members (nodes, ways or relations) taking part in relations
 'tag': 699513}      # total number of tags used for describing features
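
The script itself is not reproduced in this write-up. As a rough illustration of how such a count could be produced with the standard library, here is a minimal sketch; the function name `count_tags` and the tiny inline sample are assumptions made for this example, not the contents of ***map_parser.py***.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element name occurs in an OSM XML stream."""
    counts = Counter()
    # iterparse streams the document instead of loading it all at once
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that are no longer needed
    return counts


# Tiny inline sample standing in for the real extract (hypothetical data)
SAMPLE = b"""<osm>
  <node id="1" lat="22.5" lon="88.3"><tag k="amenity" v="bank"/></node>
  <node id="2" lat="22.6" lon="88.4"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

tag_counts = count_tags(io.BytesIO(SAMPLE))
print(dict(tag_counts))
```

For the real file, `count_tags` would be given the path of the uncompressed extract instead of the in-memory sample.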

> **Observation**: A simple subtraction might seem to give the number of stand-alone nodes.

>$node - nd = 3100069 - 3767632 = -667563$

>The negative result shows that this does not work: many nodes are referenced more than once, either because a closed way repeats its first node (bounded areas) or because a node is part of more than one way.
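
A more direct way to count stand-alone nodes is to compare the set of node ids with the set of ids referenced by ways. The sketch below is only an illustration under that assumption; `node_reference_stats` is a hypothetical helper, it ignores nodes that are referenced solely by relations, and the sample data is made up.

```python
import io
import xml.etree.ElementTree as ET


def node_reference_stats(source):
    """Compare node ids with the node ids referenced by ways."""
    node_ids = set()
    referenced = set()
    nd_total = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            node_ids.add(elem.get("id"))
        elif elem.tag == "nd":
            nd_total += 1
            referenced.add(elem.get("ref"))
        elem.clear()
    return {
        "nodes": len(node_ids),
        "nd_refs": nd_total,                 # every reference, repeats included
        "distinct_refs": len(referenced),    # each referenced node counted once
        "standalone": len(node_ids - referenced),
    }


# Hypothetical sample: a closed way that repeats node 1, plus one tagged node
SAMPLE = b"""<osm>
  <node id="1"/><node id="2"/>
  <node id="3"><tag k="amenity" v="bank"/></node>
  <way id="10"><nd ref="1"/><nd ref="2"/><nd ref="1"/></way>
</osm>"""

stats = node_reference_stats(io.BytesIO(SAMPLE))
print(stats)
```

In the sample, the number of `nd` references (3) exceeds the number of distinct referenced nodes (2), which is the same effect that makes the subtraction above go negative.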

## 4. Auditing "tag" Keys

**Tag**s describe specific features of map elements (nodes, ways, or relations) or changesets. A tag consists of a key and a value. Both are free-format text fields, but they often represent numeric or other structured items. Conventions on the meaning and use of tags are agreed upon and captured on the OSM wiki.


### 4.1 Finding Unique "tag" Keys

Now we are interested in the **keys of tag**s of the **node**s, **way**s and **relation**s. So, let's see which unique tag keys our dataset contains.
>I'll run ***unique_tag_key_finder.py*** to find the list of unique tag keys. The dataset contains **359** unique tag keys.

### 4.2 Finding Unusual "tag" Keys

Now with all these 359 keys, we can try to find the unusual or native ones, specifically those not defined in [TagInfo Wiki](http://wiki.openstreetmap.org/wiki/Taginfo/Sources). I also used the [List of ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) to verify the sub-keys of the key called ***name***.
>Let us run ***unusual_tag_key_finder.py*** to list all of them.

In [None]:
"""
------------------------------------------------------------
TAG KEYS WITHOUT/WEAK WIKI IN taginfo: (contains 48 entries)
------------------------------------------------------------
"""
unusual_tag_key = set(['City', 'PARK', 'via', 'gns:dsg', 'ref:new', 'alt_name:pl', 'ISO3166-2', 'is_capital', 'route_refs', 'foot_1', 'seamark:mooring:colour', 'IRrouterank', 'seamark:construction', 'phone_2', 'phone_3', 'phone_1', 'mini_roundabout', 'from', 'district', 'section', 'kindergarten', 'to', 'Road', 'gns:uni', 'GNS:id', 'taluk', 'Tank', 'seamark:information', 'park', 'building_1', 'abandoned:aeroway', 'name:abbr', 'currency:INR', 'ref:old', 'payment:bitcoin', 'orphanage', 'IR:zone', 'AND_a_nosr_p', 'AND_a_c', 'AND_a_nosr_r', 'place:cca', 'alt_name:eo', 'AND_a_w', 'leisure_1', 'leisure_2', 'seamark:status', 'source:tracer', 'is_in:iso_3166_2'])

### 4.3 Grouping and Fixing of Unusual "tag" keys according to RegEx Format

As we can see, the 48 keys in the above list either have no wiki page or are only weakly defined.

Let's identify some faulty tag keys based on regular expressions.
1. **Case 1:** Presence of any Upper Case Letters (wiki suggests all should be in lowercase)
2. **Case 2:** Presence of any special symbols other than ":" and "_"
3. **Case 3:** Presence of any colons, e.g. `name:en`. We need to fix them for the JSON format.
4. **Good Keys**: Keys whose format is accepted by OSM wiki.

>We will use ***bad_regex_tag_key_finder.py***
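
The grouping can be sketched with a few regular expressions. This is an illustration only, not the contents of ***bad_regex_tag_key_finder.py***: the function `classify_key` and the pattern names are hypothetical, and treating digits as Case 2 is an assumption based on where `phone_1` and `building_1` are handled later in this section.

```python
import re

HAS_UPPER = re.compile(r"[A-Z]")
# Anything other than lower-case letters, "_" and ":" (digits, "-", spaces, ...)
HAS_OTHER = re.compile(r"[^a-z_:]")


def classify_key(key):
    """Assign a tag key to one of the four groups described above."""
    if HAS_UPPER.search(key):
        return "case_1"
    if HAS_OTHER.search(key):
        return "case_2"
    if ":" in key:
        return "case_3"
    return "good"


for sample_key in ["City", "IR:zone", "phone_1", "payment:bitcoin", "park"]:
    print(sample_key, "->", classify_key(sample_key))
```

The checks run in order, so a key such as `IR:zone` is reported under Case 1 even though it also contains a colon.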

#### 4.3.1 Fixing Case 1 "tag" keys

Now we are sure that Case 1 entries are not suggested according to OSM wiki. Before I can suggest changes, I need to get the **sample values of these keys** so that I can manually check them against the [Overpass API](http://overpass-turbo.eu/). 

>To populate the sample values, I will be running ***unusual_key_value_gen.py***. However, I won't display the results here as they would clutter the report.

Here's some data wrangling I want to perform from this group.

* ['AND_a_c': '10001023'](http://www.openstreetmap.org/way/165014015#map=11/23.0706/88.5333), ['AND_a_nosr_p': '10013997'](http://www.openstreetmap.org/node/3785672730), ['AND_a_nosr_r': '10013997'](http://www.openstreetmap.org/way/397660190) and ['AND_a_w': '15064585'](http://www.openstreetmap.org/way/28573377) all refer to **[AUTOMOTIVE NAVIGATION DATA](https://wiki.openstreetmap.org/wiki/AND_Data)**, data donated by a company based in the Netherlands. Although these keys do not comply with the OSM wiki key guidelines, *I keep them untouched*.
* I will keep ['currency:INR': 'yes'](http://wiki.openstreetmap.org/wiki/Tag:amenity%3Datm), [GNS:id](http://wiki.openstreetmap.org/wiki/GEOnet_Names_Server) and ISO3166-2 *untouched* as they are already proposed in the OSM wiki.
* I will *lowercase* **City, PARK, Road**.
* I will *rename* **IR** to **indian_railways** in **IR:zone** and **IRrouterank**.
* I will *rename* **Tank** to **man_made** and change its value to **water_tower** as suggested in [OSM wiki (water_tower)](https://wiki.openstreetmap.org/wiki/Tag:man_made%3Dwater_tower).
> We will create a dictionary **map_tag_key** and add the key mappings for Case 1, also in **map_tag_value** I will add the new value mappings.
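
As a rough picture of how the two dictionaries could work together, here is a minimal sketch. Only the names **map_tag_key** and **map_tag_value** come from this report; keying the value overrides by the original tag key and the helper `remap_tag` are assumptions made for illustration. **IRrouterank** is left out because its exact target spelling is not stated above.

```python
# Key renames taken from the Case 1 bullets; any other key passes through unchanged.
map_tag_key = {
    "City": "city",
    "PARK": "park",
    "Road": "road",
    "IR:zone": "indian_railways:zone",
    "Tank": "man_made",
}

# Value overrides, keyed by the ORIGINAL tag key (an assumed layout).
map_tag_value = {
    "Tank": "water_tower",
}


def remap_tag(key, value):
    """Return the (key, value) pair after applying both mappings."""
    new_key = map_tag_key.get(key, key)
    new_value = map_tag_value.get(key, value)
    return new_key, new_value


print(remap_tag("Tank", "yes"))
print(remap_tag("AND_a_c", "10001023"))
```

Keys that are deliberately kept, such as the `AND_*` ones, simply have no entry and therefore come back unchanged.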

#### 4.3.2 Fixing Case 2 "tag" keys

* I will *rename* **[building_1=Wifi](http://www.openstreetmap.org/way/383928062)** to **[internet_access=wlan](http://wiki.openstreetmap.org/wiki/Key:internet_access)**
* [OSM wiki](http://wiki.openstreetmap.org/wiki/Multiple_values) strongly discourages the use of a **numbered suffix**. So, I will *rename* keys with a numbered suffix such as **phone_1**, **phone_2** ... to **phone:1**, **phone:2** ... as discussed in [OSM forums](https://forum.openstreetmap.org/viewtopic.php?id=32037).
> I will add these mappings in **map_tag_key** and **map_tag_value**.
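
The numbered-suffix rename follows a regular pattern, so it can also be expressed as a single substitution. The helper `colon_suffix` below is hypothetical and only sketches the idea; it would rewrite any `name_N` key, so keys that get a different treatment (such as **building_1**) would have to be handled before it is applied.

```python
import re

# A lower-case base name, an underscore, then a number at the very end of the key
NUMBERED_SUFFIX = re.compile(r"^(?P<base>[a-z]+)_(?P<n>\d+)$")


def colon_suffix(key):
    """Rewrite keys such as phone_1 to phone:1; leave other keys unchanged."""
    return NUMBERED_SUFFIX.sub(r"\g<base>:\g<n>", key)


print(colon_suffix("phone_1"))
print(colon_suffix("alt_name"))
```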

#### 4.3.3 Fixing Case 3 "tag" keys

* [**payment:bitcoin**](http://wiki.openstreetmap.org/wiki/Bitcoin) already has an OSM wiki page, but *taginfo is not updated*. I will keep it.
* I will *rename* **name:abbr** to **alt_name**.
> I will add these mappings in **map_tag_key** and **map_tag_value**.

#### 4.3.4 Fixing good "tag" keys

* I will *rename* **park** to [**leisure=park**](http://wiki.openstreetmap.org/wiki/Tag:leisure%3Dpark). However, there's already a *renaming of PARK → park suggested in 4.3.1*. So, I'll **wrangle** this **in MongoDB** to avoid conflicts.
> I will add these mappings in **map_tag_key** and **map_tag_value**.

## 5. Auditing "tag" Values

Among many others here are some major erroneous tag values that I found. They are grouped according to their keys.

### 5.1 IR:zone
The values are **shortened** too much and use a **non-conventional form** compared with what the OSM wiki describes.
* *"ER" → "eastern_railway"*.
> Fixed using function called **fix_irzone**.

### 5.2 PARK
The values are often in a **mixed case format**. Uniformity can be maintained.
* The values are converted using **change_case_Aa**.
> Fixed using function called **fix_park**.

### 5.3 access
The values are often in a **mixed case format**. Uniformity can be maintained.
* The values are converted using **change_case_lower**.
> Fixed using function called **fix_lowercase**.

### 5.4 addr:city
The values are often in a **mixed case format**. There are several **near-similar ambiguous** entries, some values carry **too much detail** for a city (e.g. the state is also given), and there are **spelling mistakes**.
* The values are converted using **change_case_Aa**.
* *"DUM DUM CANTT.,KOLKATA" → "Dum Dum Cantonment, Kolkata"*
* *"Kolkata, West Bengal" → "Kolkata"*
* *"New Town", "New Town, Kolkata", "Newtown, Kolkata", "Rajarhat" → "New Town, Rajarhat"*
* *"Salt Lake", "Salt Lake City, Kolkata", "Saltlake (Bidhannagar)" → "Salt Lake (Bidhannagar)"*
* *"Kolkatta" → "Kolkata"*
> Fixed using function called **fix_addr_city**.
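
A lookup table combined with case normalisation is one way such a fix could be written. The sketch below is not the actual **fix_addr_city**; the name `fix_city_sketch`, the table `CITY_FIXES`, and the use of title case as a stand-in for **change_case_Aa** are all assumptions.

```python
# Known variants (lower-cased) mapped to the preferred spelling from the list above
CITY_FIXES = {
    "dum dum cantt.,kolkata": "Dum Dum Cantonment, Kolkata",
    "kolkata, west bengal": "Kolkata",
    "new town": "New Town, Rajarhat",
    "new town, kolkata": "New Town, Rajarhat",
    "newtown, kolkata": "New Town, Rajarhat",
    "rajarhat": "New Town, Rajarhat",
    "salt lake": "Salt Lake (Bidhannagar)",
    "salt lake city, kolkata": "Salt Lake (Bidhannagar)",
    "saltlake (bidhannagar)": "Salt Lake (Bidhannagar)",
    "kolkatta": "Kolkata",
}


def fix_city_sketch(value):
    """Normalise whitespace, apply a known fix if any, otherwise title-case."""
    cleaned = " ".join(value.split())
    return CITY_FIXES.get(cleaned.lower(), cleaned.title())


print(fix_city_sketch("Kolkatta"))
print(fix_city_sketch("KOLKATA"))
```

Matching on the lower-cased value means the table does not need a separate entry for every letter-case variant of the same misspelling.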

### 5.5 addr:housename
There are **non-uniform abbreviations**, and the values are often in a **mixed case format**.
* *"WB" → "West Bengal"*
* The values which are not part of a house number are converted using **change_case_Aa**.
> Fixed using function called **fix_addr_housename**.


### 5.6 addr:housenumber
The values are often in a **mixed case format** and contain meaningless **unicode** characters and **special** characters.
* Remove Unicode Chars
* Remove Special Chars other than "/\,-&."
> Fixed using function called **fix_addr_housenumber**.

### 5.7 addr:postcode
Indian postal codes have 6 digits, and codes for Kolkata are mainly of the form 700xxx. Some entries have the 6 digits **grouped** in 3s and **separated by a space**. Another common mistake is a wrong number of zeros, either one too few or one too many.
* Join if separated by space
* Inflate 0 if len < 6
* Deflate 0 if len > 6
* Remove unicode
* Examples of errors: '7000 026', '700 050', '700 156', '70014'
> Fixed using function called **fix_addr_postcode**.
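
The steps above can be sketched as follows. This is not the actual **fix_addr_postcode**: the helper name `fix_postcode_sketch` is hypothetical, and repairing the zero count only for codes with the Kolkata-style `700` prefix is an assumption made to keep the example safe.

```python
import re


def fix_postcode_sketch(raw):
    """Return a 6-digit postcode, or None if the value cannot be repaired."""
    # Keep ASCII digits only: joins space-separated groups and drops stray characters
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) == 6:
        return digits
    # One zero too few, e.g. "70014" -> "700014"
    if len(digits) == 5 and digits.startswith("70"):
        return "700" + digits[2:]
    # One zero too many, e.g. "7000026" -> "700026"
    if len(digits) == 7 and digits.startswith("7000"):
        return "700" + digits[4:]
    return None


for raw_code in ["7000 026", "700 050", "700 156", "70014"]:
    print(repr(raw_code), "->", fix_postcode_sketch(raw_code))
```

Returning `None` for anything else leaves the decision about unrepairable values to the caller rather than guessing.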

### 5.8 addr:state
There are **non-uniform abbreviations**.
* *"WB" → "West Bengal"*
> Fixed using function called **fix_addr_state**.

### 5.9 addr:street
There are **ambiguous entries** for the same street, **abbreviated** forms of **Lane (Ln)** and **Road (Rd)**, and **incorrect spellings** of "Road".
* *"A. J. C. Bose Road", "Acharya Jagadish Chandra Bose Road" → "AJC Bose Road"*
* *"Karbala Tank Ln" → "Karbala Tank Lane"*
* *"Barrackpore Trunk Rd" → "Barrackpore Trunk Road"*
* *"D.r A.k paul raod"* (an example of a misspelled entry)
> Fixed using function called **fix_addr_street**.
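
A minimal sketch of the street fixes, assuming that only the last word of a street name needs expanding. The names `fix_street_sketch`, `STREET_SUFFIXES` and `FULL_NAME_FIXES` are hypothetical, and the actual **fix_addr_street** may work differently.

```python
# Abbreviated or misspelled last words mapped to the full form
STREET_SUFFIXES = {"ln": "Lane", "rd": "Road", "raod": "Road"}

# Whole-name replacements for streets known under several spellings
FULL_NAME_FIXES = {
    "A. J. C. Bose Road": "AJC Bose Road",
    "Acharya Jagadish Chandra Bose Road": "AJC Bose Road",
}


def fix_street_sketch(value):
    """Apply whole-name fixes first, then expand the final word if abbreviated."""
    value = " ".join(value.split())
    if value in FULL_NAME_FIXES:
        return FULL_NAME_FIXES[value]
    words = value.split(" ")
    last = words[-1].rstrip(".").lower()
    if last in STREET_SUFFIXES:
        words[-1] = STREET_SUFFIXES[last]
    return " ".join(words)


print(fix_street_sketch("Karbala Tank Ln"))
print(fix_street_sketch("Barrackpore Trunk Rd"))
```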

### 5.10 brand
There are **incorrect** brand names (missing spaces or indicating an amenity).
* *"IndianOil" → "Indian Oil"*
* *"State Bank of India ATM" → "State Bank of India"*
> Fixed using function called **fix_brand**.

There are **some other** minor data cleaning for some keys. I have **not documented** them here, however ***functions.py*** contains all the helper functions to wrangle the values.


## 6. Preparation for Data Analysis

Once we have finished auditing the data, we need to create a JSON file for insertion into MongoDB for data analysis. I have coded ***json_maker.py*** for that purpose. The JSON file abides by the following rules.

- Only 2 types of top level tags, "node" and "way", are processed
- All attributes of "node" and "way" are turned into regular key/value pairs, except:
    - attributes in the CREATED array are added under a key "created"
    - attributes for latitude and longitude are added to a "pos" array for use in geospatial indexing. The values inside the "pos" array are floats and not strings.
- if the second level tag "k" value contains problematic characters, it is ignored
- if the second level tag "k" value starts with "addr:", it is added to a dictionary "address"
- if the second level tag "k" value does not start with "addr:" but contains ":", it is split into a two-level dictionary.
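
The rules above can be sketched as a single shaping function. This is not the code of ***json_maker.py***: the contents of `CREATED`, the `PROBLEM_CHARS` pattern and the function name `shape_element` are assumptions, the `addr:city` and `name:en` tags in the sample are added for illustration, and collisions between plain and colon-separated keys are handled only in the simplest way.

```python
import re
import xml.etree.ElementTree as ET

# Assumed members of the CREATED array; the report only names the array
CREATED = ["version", "changeset", "timestamp", "user", "uid"]
# Assumed definition of "problematic characters" in a tag key
PROBLEM_CHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def shape_element(element):
    """Turn one <node> or <way> element into a dict ready for JSON export."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"element_type": element.tag, "created": {}}
    for name, value in element.attrib.items():
        if name in CREATED:
            doc["created"][name] = value
        elif name not in ("lat", "lon"):
            doc[name] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if PROBLEM_CHARS.search(key):
            continue  # keys with problematic characters are ignored
        if key.startswith("addr:"):
            doc.setdefault("address", {})[key[len("addr:"):]] = value
        elif ":" in key:
            outer, inner = key.split(":", 1)
            # Nest only when the outer key is free or already holds a dict
            if isinstance(doc.setdefault(outer, {}), dict):
                doc[outer][inner] = value
        else:
            doc[key] = value
    return doc


SAMPLE = """<node id="3409030014" visible="true" version="2" changeset="49141539"
 timestamp="2017-05-31T17:00:28Z" user="Srihari Thalla" uid="2815653"
 lat="22.5194936" lon="88.3772832">
  <tag k="amenity" v="bank"/>
  <tag k="brand" v="ICICI"/>
  <tag k="addr:city" v="Kolkata"/>
  <tag k="name:en" v="ICICI Bank"/>
</node>"""

doc = shape_element(ET.fromstring(SAMPLE))
print(doc)
```

The field name `element_type` is chosen to match the one queried later in the analysis section.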

The python script creates a file called custom_kolkata.osm.json in  ***./maps_uncompressed***.

Once this is done, I have used the following code to import the data in MongoDB

In [None]:
mongoimport --db OSM --collection kolkata --file custom_kolkata.osm.json

We can now use the queries of ***mongo_test_commands.txt*** in mongo shell to check if the entries in MongoDB are as expected.

## 7. Data Wrangling in MongoDB

Now we have an unfinished job. We have to wrangle ***park*** into ***leisure: park***. It's best to use mongo shell queries to perform this wrangling. The following query updates this key-value pair.

In [None]:
db.kolkata.update(
    {"park": {"$exists": 1}},
    {
        "$unset":
            {"park": ""},
        "$set":
            {"leisure": "park"}
    },
    {multi: true}
)

## 8. Data Analysis with MongoDB

### 8.1 Data Size
We can get brief statistics on the space this collection uses in MongoDB. We'll also discuss the other file sizes in the observation.
#### 8.1.1 Query

In [None]:
> db.kolkata.dataSize()

#### 8.1.2 Output

In [None]:
888513656

> **Observation**: This states the size of the collection (~888.5 MB) which is bigger than the JSON (~818.8 MB) as well as the actual OSM (~680.5 MB) and the sample (~8.1 MB).

### 8.2 Total Records
This will help us to understand how many documents we have and if document type (node/way) sums up to this number.
#### 8.2.1 Query

In [None]:
> db.kolkata.count()

#### 8.2.2 Output

In [None]:
3740080

> **Observation**: This shows the total number of documents present in the collection.

### 8.3 Unique Users
Using this analysis will give an insight on the active participation of users.
#### 8.3.1 Query

In [None]:
> db.kolkata.distinct("created.user").length

#### 8.3.2 Output

In [None]:
582

> **Observation**: This shows quite a high number of contributing users.

### 8.4 Total Nodes
Simple analysis to see how many nodes are created to represent this map.
#### 8.4.1 Query

In [None]:
> db.kolkata.find({element_type: "node"}).count()

#### 8.4.2 Output

In [None]:
3100069

> **Observation**: This shows total number of nodes.

### 8.5 Total Ways
This will show total number of ways discovered and marked in OSM for Kolkata.
#### 8.5.1 Query

In [None]:
> db.kolkata.find({element_type: "way"}).count()

#### 8.5.2 Output

In [None]:
640011

> **Observation**: Adding up the total nodes and ways (3100069 + 640011 = 3740080) gives back the total number of documents. This means there's no document without an ***element_type***. That's good.

### 8.6 Top 6 Amenities
We can have an insight into the culture of the people and their interest in Kolkata.
#### 8.6.1 Query

In [None]:
> db.kolkata.aggregate([
    {"$match": {"amenity": { "$exists": 1}}},
    {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 6}],
    {allowDiskUse: true}
)

#### 8.6.2 Output

In [None]:
{ "_id" : "school", "count" : 175 }
{ "_id" : "place_of_worship", "count" : 111 }
{ "_id" : "restaurant", "count" : 101 }
{ "_id" : "hospital", "count" : 95 }
{ "_id" : "bank", "count" : 91 }
{ "_id" : "college", "count" : 70 }

> **Observation**: That's a good trend. The state ministry's claims of attempts to improve the literacy rates can be justified with so many schools and colleges set up.

> Bengalis are no doubt [religious](https://www.quora.com/What-is-the-genetic-history-and-origin-of-Bengali-people) and [foodie](https://www.quora.com/Why-do-Bengalis-love-to-eat-so-much) and we can see that too.

### 8.7 Top 10 Users and their Percentage
This analysis can predict the confidence of the data submitted.
#### 8.7.1 Query

In [None]:
> var total = db.kolkata.find({"created.user" : {"$exists": 1}}).count();
db.kolkata.aggregate([
    {"$match": {"created.user": { "$exists": 1}}},
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$project": {
        "count":1,
        "percentage": {"$multiply": [100, {"$divide": ["$count", total]}]}}},
    {"$sort": {"count": -1}},
    {"$limit": 10}],
    {allowDiskUse: true}
)

#### 8.7.2 Output

In [None]:
{ "_id" : "Rondon237", "count" : 222867, "percentage" : 5.958883232444226 }
{ "_id" : "saikumar", "count" : 129408, "percentage" : 3.4600329404718617 }
{ "_id" : "samuelmj", "count" : 122381, "percentage" : 3.272149258839383 }
{ "_id" : "Apreethi", "count" : 120552, "percentage" : 3.2232465615708756 }
{ "_id" : "anthony1", "count" : 105324, "percentage" : 2.816089495411863 }
{ "_id" : "harisha", "count" : 91874, "percentage" : 2.4564715193257896 }
{ "_id" : "venugopal009", "count" : 90498, "percentage" : 2.419680862441445 }
{ "_id" : "jasvinderkaur", "count" : 87910, "percentage" : 2.3504844816153665 }
{ "_id" : "shiva05", "count" : 86632, "percentage" : 2.3163140895381917 }
{ "_id" : "ravikumar1", "count" : 82775, "percentage" : 2.213187953198862 }

> **Observation**: The data shows quite a decent distribution of contributions among the **582** users. The data does not depend solely on one particular user but on the community in general, so confidence in the data can be considered high.
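
For readers less familiar with the aggregation pipeline, the same group, percentage, sort and limit steps can be pictured in plain Python. The sketch below works on an in-memory list of user names with made-up data; `top_contributors` is a hypothetical helper and is not part of the project scripts.

```python
from collections import Counter


def top_contributors(users, limit=10):
    """Return (user, count, percentage) for the most frequent users."""
    counts = Counter(users)               # plays the role of $group with $sum
    total = sum(counts.values())          # plays the role of the separate count()
    return [
        (user, n, 100 * n / total)        # plays the role of $project
        for user, n in counts.most_common(limit)  # $sort descending, then $limit
    ]


# Hypothetical data: one entry per document's "created.user"
sample_users = ["a", "a", "a", "b"]
ranking = top_contributors(sample_users, limit=2)
print(ranking)
```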

### 8.8 Top 10 Data Source and their Percentage
We can get an insight into how the data is acquired.
#### 8.8.1 Query

In [None]:
> var total = db.kolkata.find({"source" : {"$exists": 1}}).count();
db.kolkata.aggregate([
    {"$match": {"source": { "$exists": 1}}},
    {"$group": {"_id": "$source", "count": {"$sum": 1}}},
    {"$project": {
        "count":1,
        "percentage": {"$multiply": [100, {"$divide": ["$count", total]}]}}},
    {"$sort": {"count": -1}},
    {"$limit": 10}],
    {allowDiskUse: true}
)

#### 8.8.2 Output

In [None]:
{ "_id" : "PGS", "count" : 2929, "percentage" : 53.26422985997454 }
{ "_id" : "Bing", "count" : 1798, "percentage" : 32.69685397344972 }
{ "_id" : "Yahoo hires", "count" : 319, "percentage" : 5.80105473722495 }
{ "_id" : "AND", "count" : 180, "percentage" : 3.273322422258593 }
{ "_id" : "GPS", "count" : 152, "percentage" : 2.7641389343517 }
{ "_id" : "Survey", "count" : 29, "percentage" : 0.5273686124749954 }
{ "_id" : "bing", "count" : 17, "percentage" : 0.30914711765775593 }
{ "_id" : "Bing HiRes", "count" : 15, "percentage" : 0.27277686852154936 }
{ "_id" : "survey", "count" : 11, "percentage" : 0.20003637024913623 }
{ "_id" : "landsat", "count" : 8, "percentage" : 0.14548099654482632 }

> **Observation**: This shows how the data is acquired. The share of data from GPS seems very low (~2.8%). However, if GPS data were collected while in transit, more roads could be discovered automatically.

> Example: If the system gets data above a minimum threshold (say 100) for a particular sequence of unlisted GPS co-ordinates from different devices, we may conclude that the sequence is a way which is not auto-identified from image processing algorithms (maybe due to trees above the road as satellite images see it).

### 8.9 Top 5 Cuisines
With this analysis we can find out what Bengalis prefer when they hang out in restaurants (which seemed a pretty popular amenity in the city). This can be useful information for people who want to enter the business and need to decide which cuisine to focus on.
#### 8.9.1 Query

In [None]:
db.kolkata.aggregate([
    {"$match": {"cuisine": { "$exists": 1}}},
    {"$group": {"_id": "$cuisine", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5}],
    {allowDiskUse: true}
)

#### 8.9.2 Output

In [None]:
{ "_id" : "indian", "count" : 29 }
{ "_id" : "regional", "count" : 4 }
{ "_id" : "multicuisine", "count" : 3 }
{ "_id" : "chinese", "count" : 3 }
{ "_id" : "international", "count" : 3 }

> **Observation**: We can clearly see that Indian cuisine is by far the most popular, followed by regional, with multicuisine, Chinese and international tied after that. However, a more detailed cuisine value (North Indian, South Indian, Bengali, Gujarati etc.) would make the restaurants more accurate and search-friendly. I have a feeling that many restaurants don't have a cuisine defined for them.

#### 8.9.3 Restaurants without cuisine
However the data here in OSM seems unclear. Let's investigate a bit more.
#### 8.9.3.1 Query

In [None]:
db.kolkata.find({"$and": [{"amenity": "restaurant"}, {"cuisine": {"$exists": 0}}]}).count()

#### 8.9.3.2 Output

In [None]:
60

> **Observation**: Well, it seems that I'm correct. Most of the restaurants (60 of the 101) don't have a cuisine defined. So intelligent searches such as "Find South-Indian restaurants near me" won't be of much use for OSM in Kolkata.

> This can be solved with review features like **Question-Answers** for places like restaurants. If users are given points for contributions, they may take an interest in filling out these small details to showcase their contribution level.

> However, an **anticipated problem** with this improvement is that QAs may not be uniform across countries. Example: *Salad bars* or *drinks* are **absent** from roughly 90% of restaurants in Bengal, but they may be an important feature in other countries/regions. So, repetitive QAs asking whether Kolkata restaurants have *Salad bars* or *drinks* can be irritating.

### 8.10 Religion and their Percentage
We can try to compare the religions actually present in Kolkata with the trend shown by OSM.
#### 8.10.1 Query

In [None]:
> var total = db.kolkata.find({"religion" : {"$exists": 1}}).count();
db.kolkata.aggregate([
    {"$match": {"religion": { "$exists": 1}}},
    {"$group": {"_id": "$religion", "count": {"$sum": 1}}},
    {"$project": {
        "count":1,
        "percentage": {"$multiply": [100, {"$divide": ["$count", total]}]}}},
    {"$sort": {"count": -1}}],
    {allowDiskUse: true}
)

#### 8.10.2 Output

In [None]:
{ "_id" : "hindu", "count" : 38, "percentage" : 42.22222222222222 }
{ "_id" : "christian", "count" : 25, "percentage" : 27.77777777777778 }
{ "_id" : "muslim", "count" : 13, "percentage" : 14.444444444444443 }
{ "_id" : "buddhist", "count" : 8, "percentage" : 8.88888888888889 }
{ "_id" : "jain", "count" : 3, "percentage" : 3.3333333333333335 }
{ "_id" : "jewish", "count" : 2, "percentage" : 2.2222222222222223 }
{ "_id" : "sikh", "count" : 1, "percentage" : 1.1111111111111112 }

> **Observation**: This output is misleading. Undoubtedly, Hindus form the largest group in Kolkata. However, the ranking of Christians in second place does not reflect reality: Kolkata has more Muslims and more mosques than churches. This suggests that either mosques have not been mapped or there is some kind of bias among the users.

>However this data clearly shows India and specifically Kolkata has diverse religions and praises its secularity.

### 8.11 Most and Least Discovered areas
We may try to find if there's any particular region in Kolkata that map-pointers have pointed out more frequently with minute details such as postcode. Also we can reverse it to find those areas which have not been explored so much or miss the details.
#### 8.11.1 Query

In [None]:
> var total = db.kolkata.find({"address.postcode" : {"$exists": 1}}).count();
> db.kolkata.aggregate([
    {"$match": {"address.postcode": { "$exists": 1}}},
    {"$group": {"_id": "$address.postcode", "count": {"$sum": 1}}},
    {"$project": {
        "count":1,
        "percentage": {"$multiply": [100, {"$divide": ["$count", total]}]}}},
    {"$sort": {"count": -1}},
    {"$limit": 5}],
    {allowDiskUse: true}
)

> db.kolkata.aggregate([
    {"$match": {"address.postcode": { "$exists": 1}}},
    {"$group": {"_id": "$address.postcode", "count": {"$sum": 1}}},
    {"$project": {
        "count":1,
        "percentage": {"$multiply": [100, {"$divide": ["$count", total]}]}}},
    {"$match": {"count": {"$eq": 1}}},
    {"$limit": 10}],
    {allowDiskUse: true}
)

#### 8.11.2 Output (most discovered)

In [None]:
{ "_id" : "700064", "count" : 911, "percentage" : 65.49245147375989 }
{ "_id" : "741235", "count" : 71, "percentage" : 5.104241552839683 }
{ "_id" : "743363", "count" : 57, "percentage" : 4.097771387491013 }
{ "_id" : "712306", "count" : 31, "percentage" : 2.2286125089863407 }
{ "_id" : "700107", "count" : 22, "percentage" : 1.5815959741193386 }

#### 8.11.3 Output (least discovered)

In [None]:
{ "_id" : "700001", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "700003", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "700040", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "700124", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "721134", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "700028", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "700145", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "712248", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "721603", "count" : 1, "percentage" : 0.07189072609633358 }
{ "_id" : "700095", "count" : 1, "percentage" : 0.07189072609633358 }

> **Observation**: I was a bit curious to know if the users preferred any particular region for map-pointing. I find that the most frequently recorded pincode is that of [BidhanNagar](https://www.google.co.in/maps/place/Salt+Lake+City,+Kolkata,+West+Bengal+700064/@22.5886106,88.4061628,15z/data=!4m5!3m4!1s0x3a0275db63029127:0x1ffc973fa16fd18!8m2!3d22.5924571!4d88.412665) (Kolkata 700064). However, I know of no reason for such heavy pointing in this region, as the place is not especially important. I would like to see the distribution of users who marked this postcode.

#### 8.11.4 Distribution of users in address.postcode = 700064
If we see the contribution of various users in this place, then it must be really some important place I don't know of. However if there's not a varying distribution, then we can say that someone made detailed discoveries in this particular area only.
#### 8.11.4.1 Query

In [None]:
> db.kolkata.aggregate([
    {"$match": {"address.postcode": "700064"}},
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$project": {
        "count":1}},
    {"$sort": {"count": -1}}],
    {allowDiskUse: true}
)

#### 8.11.4.2 Output

In [None]:
{ "_id" : "Japa", "count" : 490 }
{ "_id" : "sujandeb", "count" : 406 }
{ "_id" : "Narsimulu", "count" : 12 }
{ "_id" : "Pc2", "count" : 1 }
{ "_id" : "dazz99", "count" : 1 }
{ "_id" : "Oberaffe", "count" : 1 }

> **Observation**: This clearly demonstrates that the users *Japa* and *sujandeb* are contributors who know this place very well (and possibly live there) and that's why have pointed out the details.

## 9. Conclusion

In this project I have audited, cleaned and analyzed OpenStreetMap data for Kolkata, West Bengal, India. During the process I found many excess or deprecated tags without a proper wiki page.

### 9.1 Ideas for additional improvements

I then got curious to know how places are actually created in OSM. I can easily create **any key**, not just those listed in the dropdown. The screenshot makes it clear.

![title](../Screenshots/1.png)

Now you can easily find my edit in OSM (if still available) in this [link](https://www.openstreetmap.org/way/519564562). Here's a screenshot showing my update is live.
![title](../Screenshots/2.png)

So, I think OSM should not allow the creation of **any tag** and should be restrictive to some extent. Since this organization is community-driven, it is good to have a review policy (by high-contributing users of that locality) to make things more accurate. Also, various issues that I faced, like variation in letter case and abbreviation of street names, states and brands, can be eliminated with some simple hints and validations before the actual entry in the database. Anomalies in postcodes can be eliminated by hinting nearby postcodes to users (so that in case of a hit, the value can be auto-filled) and by validating the digit count defined for that country before entry. Also, there is no need to store redundant data like country, state and city for every node if its latitude and longitude fall within the area defined by the respective country, state and city.

A comparable competitor to this **editable** feature of OSM is the **Google Local Guides** program, which is not as flexible but attracts more users due to its level-based **rewards**. Rewards like these could be used as a motivating factor for more contributions to OSM.

Also, I found [taginfo](https://taginfo.openstreetmap.org/) really helpful. But if there are specific tags that a particular country is using (like **IR:zone**, which means **Indian Railways - zone**), a proper wiki page should be provided before their usage. This makes OSM more documented and understandable.

### 9.2 Anticipated problems in implementing the improvements

* ***The idea of getting an edit reviewed by some high-contributing user may not always work***

    Example: A user who has a large number of edits does not necessarily know my neighborhood better than I do. He might reject my edits to a road or to the phone number of a store. It is not quite feasible for each and every edit to be reviewed fairly. The Google LG program gives an example of the negative effects of getting every edit reviewed: although I know my locality best and have given the best possible suggestions, they were not approved.

![title](../Screenshots/3.png)

* ***There can be collisions of edits/creation of new places if they had to pass through a review queue***

    Example: Suppose a new shop opens and the business owner has created a node for it. I would not know this, so suppose I create the same place again. The system won't reject it if there's some variation in the name, e.g. the owner used "ABC & sons Pvt Ltd." while I used "ABC". The same place is now duplicated, and yet another request, this time for removal of the duplicate entry, needs to be addressed.
    
    
* ***However the proposals for new tags are administered, they still carry the risk of either being absurd or being rejected***

    Example: Even I did not understand the meaning of IR:zone until I saw the combination of its values like ER (Eastern Railway), SER (South-Eastern Railway) and so on. If such proposals are administered from outside India, they may be rejected due to lack of clarity, and that is acceptable. However, "THANA_NAME" is unique to Bangladesh and equivalent to our district. If they are forced to use district instead, it destroys the whole idea of localization.
    
    
* ***To the best of my knowledge, redundancy and validation improvements can be carried out without any problems***

    At least in Kolkata, all postal-code areas are solid polygons and mutually exclusive. If the areas were not solid and had holes (of other postal codes) in them, the improvements might be hard to implement, and if they were not mutually exclusive, it would be impossible.

    However, if the data is collected from individuals, it may be prone to errors. The best demarcations can be provided by the Municipality.

## 10. References

* https://www.stackoverflow.com
* https://www.datacamp.com
* https://taginfo.openstreetmap.org
* https://www.overpass-turbo.eu
* https://www.openstreetmap.org
* https://en.wikipedia.org
* https://mapzen.com/