# OpenStreetMap case study

The area chosen for this case study is my home city of Mumbai. The data was obtained from the following source: https://mapzen.com/data/metro-extracts/metro/mumbai_india/

A smaller sample was extracted from this large file (>400 MB) to allow faster processing on a local system, and the Python code was first validated against that sample. This document describes the problems found in the dataset and how I handled them. It also gives an overview of the data and presents some findings about the city of Mumbai.
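The sampling step is not shown in this report. A minimal sketch of one way to do it, assuming every k-th top-level element is kept, is given below. The name `write_sample` and the parameter `k` are hypothetical and chosen only for illustration; `get_element` here is a generic streaming helper, not the cleaning version discussed later.

```python
import xml.etree.ElementTree as ET


def get_element(osm_file, tags=("node", "way", "relation")):
    """Yield each top-level element whose tag is in `tags`, one at a time."""
    context = ET.iterparse(osm_file, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # release elements that have already been processed


def write_sample(osm_file, sample_file, k=10):
    """Write every k-th top-level element of `osm_file` to `sample_file`."""
    with open(sample_file, "wb") as out:
        out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        for i, elem in enumerate(get_element(osm_file)):
            if i % k == 0:
                out.write(ET.tostring(elem, encoding="utf-8"))
        out.write(b"</osm>\n")
```

A larger `k` gives a smaller sample; because the file is streamed, memory use stays low even for the full extract.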

In [146]:
import sqlite3
import os
import pandas as pd
db = sqlite3.connect("Mumbai")
c = db.cursor()

## Problems encountered in the map

* PROBLEM 1: Difference in k values for tags of postal code:

    ```xml
      <tag k="addr:postcode" v="400005" /> 
      <tag k="postal_code" v="410205" />
    ```
* PROBLEM 2: Inconsistent format of postal codes

    ```xml
       <tag k="addr:postcode" v="400 071" />
       <tag k="addr:postcode" v="400018" />
    ```

* PROBLEM 3: Postal codes out of range. Postal codes of the Mumbai region begin with 4 and have six digits. However, some postal codes outside this range crept into the dataset:

    ```xml
       <tag k="addr:postcode" v="500053" />
       <tag k="addr:postcode" v="40049" /> 
    ```
    
* PROBLEM 4: Inconsistent values for k=city. The city name is written in a number of different ways: sometimes a smaller locality within Mumbai is substituted for the city name, and sometimes Mumbai is misspelt. These values were cleaned so that only Mumbai appears as the city name. However, even after cleaning the city values, the dataset still includes data from the areas surrounding Mumbai City.



* PROBLEM 5: Difference in sources of data

    ```sql
       SELECT DISTINCT tags.value
       FROM (SELECT key, value FROM nodes_tags UNION SELECT key, value FROM ways_tags) tags
       WHERE tags.key = 'source'
       LIMIT 10; 
    ```
    
The following shows the first 10 distinct sources returned by this query when it is run from Python. This problem is mentioned to make readers aware of the wide array of data sources, so that they stay vigilant when using this data. An unreliable data source can lead to incorrect conclusions.

In [147]:
query1 = '''
    SELECT DISTINCT tags.value
    FROM (SELECT key, value FROM nodes_tags UNION SELECT key, value FROM ways_tags) tags
    WHERE tags.key = 'source'
    LIMIT 10;
    '''
c.execute(query1)
c.fetchall()

[(u'402',),
 (u'502,504,353',),
 (u'AND',),
 (u'AND;PGS',),
 (u'AND;US NGA Pub. 112. 2009-11-12.',),
 (u'Autooptions',),
 (u'BEST',),
 (u'Bing',),
 (u'Chembur',),
 (u'GNS',)]
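Some of the values above, such as `AND;PGS`, combine several sources in one tag, following the OpenStreetMap convention of separating multiple values with a semicolon. A minimal sketch of how such values could be split before counting is shown below; the function name `count_sources` is hypothetical and this step was not part of the original analysis.

```python
from collections import Counter


def count_sources(values):
    """Count individual sources, splitting semicolon-separated values."""
    counts = Counter()
    for value in values:
        for part in value.split(";"):
            part = part.strip()
            if part:  # skip empty pieces such as a trailing ';'
                counts[part] += 1
    return counts


# A few of the distinct values from the output above
example = count_sources(["AND", "AND;PGS", "Bing"])
```

Because the input here is the list of distinct values, the counts say in how many distinct tag values a source appears, not how many map elements use it.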

## The postal code problems
* The key identifier of any area in a city like Mumbai is its postal code, so it was vital to clean the postal codes first. The k value of the postal code child tags of nodes was normalized to "addr:postcode", resolving problem 1.
* The postal code values themselves were standardized by removing the spaces inside them, i.e. a postal code value of "400 071" was transformed to "400071", resolving problem 2.
* However, no cleaning was done for codes that were incorrectly included in the dataset (problem 3). Readers should be aware that such anomalies remain in the data.

The following part of the **get_element()** function summarizes the cleaning process for the postal code problems:

```python
import xml.etree.ElementTree as ET

def get_element(osm_file, tags=('node', 'way', 'relation')):
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        for child in elem:
            # Problem 1: use a single key for all postal code tags
            if child.tag == "tag" and child.attrib["k"] in ("postal_code", "addr:postcode"):
                child.attrib["k"] = "addr:postcode"
                # Problem 2: remove whitespace, e.g. "400 071" -> "400071"
                child.attrib["v"] = "".join(child.attrib["v"].split())
```
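The out-of-range codes of problem 3 were left untouched, but they can at least be flagged. The following is a minimal sketch, assuming the rule stated earlier (six digits, starting with 4); the names `normalize_postcode` and `is_mumbai_postcode` are hypothetical and not part of the original code.

```python
import re

# Assumption taken from the text: Mumbai-region codes have six digits and start with 4
MUMBAI_POSTCODE = re.compile(r"^4\d{5}$")


def normalize_postcode(value):
    """Remove all whitespace, e.g. '400 071' -> '400071'."""
    return "".join(value.split())


def is_mumbai_postcode(value):
    """Return True if the normalized value is a six-digit code starting with 4."""
    return MUMBAI_POSTCODE.match(normalize_postcode(value)) is not None


# Values taken from the examples in the problem list
suspicious = [v for v in ["400 071", "400018", "500053", "40049"]
              if not is_mumbai_postcode(v)]
```

Here `suspicious` collects the two problem 3 examples, which could then be reviewed by hand rather than silently dropped.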

## Surrounding area problem
The city values were cleaned, again in the **get_element()** function. The part of the code doing this job is as follows:

```python
def get_element(osm_file, tags=('node', 'way', 'relation')):
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        for child in elem:
            # Only <tag> children carry k/v attributes
            if (child.tag == "tag" and child.attrib["k"] == "city"
                    and child.attrib["v"] in unwanted_cities_list):
                # Replace the variant with the canonical city name
                child.attrib["v"] = "Mumbai"
```
To confirm that surrounding areas are included in the dataset, consider the following query and its output when run from Python. The output clearly shows that a number of surrounding areas, including the Thane and Navi Mumbai suburban areas, are included in the dataset. (Kharghar and Seawoods are smaller localities in Navi Mumbai.)

```sql
   SELECT tags.value, COUNT(tags.value) as num
   FROM (SELECT key, value FROM nodes_tags UNION ALL SELECT key, value FROM ways_tags) tags
   WHERE tags.key = 'city'
   GROUP BY tags.value
   ORDER BY num DESC;
```

In [150]:
query2 = '''
    SELECT tags.value, COUNT(tags.value) as num
    FROM (SELECT key, value FROM nodes_tags UNION ALL SELECT key, value FROM ways_tags) tags
    WHERE tags.key = 'city'
    GROUP BY tags.value
    ORDER BY num DESC
    LIMIT 10;
'''
c.execute(query2)
c.fetchall()

[(u'Mumbai', 1454),
 (u'Navi Mumbai', 49),
 (u'Kharghar', 43),
 (u'Thane (West)', 38),
 (u'Thane', 33),
 (u'navi mumbai', 10),
 (u'kamothe, navi mumbai', 9),
 (u'Kurla West, Mumbai', 6),
 (u'Mulind (East)', 5),
 (u'Sanpada', 5)]
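The output above still lists `Navi Mumbai` and `navi mumbai` as separate values. A minimal sketch of how variants that differ only in case or surrounding spaces could be merged before counting is given below; `merge_case_variants` is a hypothetical helper and was not part of the original cleaning.

```python
from collections import Counter


def merge_case_variants(rows):
    """Sum the counts of values that differ only in case or surrounding spaces."""
    merged = Counter()
    for value, num in rows:
        merged[value.strip().lower()] += num
    return merged.most_common()


# A few (value, count) rows from the output above
rows = [("Mumbai", 1454), ("Navi Mumbai", 49), ("Kharghar", 43), ("navi mumbai", 10)]
merged = merge_case_variants(rows)
# 'navi mumbai' now has 49 + 10 = 59 entries
```

Values such as `kamothe, navi mumbai` would still need a separate rule, since they differ by more than case.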

## Overview of the dataset

#### 1. File Sizes

In [151]:
print "mumbai_india.osm:", round(os.stat('mumbai_india.osm').st_size/float(1000000),2), "MB" # Change the sample file name here to mumbai.osm
print "Mumbai.db:", round(os.stat('Mumbai').st_size/float(1000000),2), "MB"
print "nodes.csv:", round(os.stat('nodes.csv').st_size/float(1000000),2), "MB"
print "nodes_tags.csv:", round(os.stat('nodes_tags.csv').st_size/float(1000000),2), "MB"
print "ways.csv:", round(os.stat('ways.csv').st_size/float(1000000),2), "MB"
print "ways_tags.csv:", round(os.stat('ways_tags.csv').st_size/float(1000000),2), "MB"
print "ways_nodes.csv:", round(os.stat('ways_nodes.csv').st_size/float(1000000),2), "MB"

mumbai_india.osm: 422.77 MB
Mumbai.db: 283.06 MB
nodes.csv: 171.22 MB
nodes_tags.csv: 1.88 MB
ways.csv: 17.44 MB
ways_tags.csv: 10.84 MB
ways_nodes.csv: 58.76 MB


#### 2. Number of unique users

In [152]:
query3 = '''
    SELECT COUNT(tags.user)
    FROM (SELECT user FROM nodes UNION SELECT user FROM ways) tags; 
'''
c.execute(query3)
print "number of distinct users are:", c.fetchall()[0][0]

number of distinct users are: 1411
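The count is distinct because `UNION` removes duplicate rows, whereas `UNION ALL` (used in the per-user statistics) keeps them. A small self-contained sketch on an in-memory database with made-up rows illustrates the difference; the table contents and the helper `count_users` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE nodes (id INTEGER, user TEXT)")
cur.execute("CREATE TABLE ways (id INTEGER, user TEXT)")
# Made-up rows for illustration only
cur.executemany("INSERT INTO nodes VALUES (?, ?)",
                [(1, "alice"), (2, "bob"), (3, "alice")])
cur.executemany("INSERT INTO ways VALUES (?, ?)",
                [(10, "bob"), (11, "carol")])


def count_users(cursor, union):
    """Count rows of the combined user column using UNION or UNION ALL."""
    query = ("SELECT COUNT(*) FROM "
             "(SELECT user FROM nodes {} SELECT user FROM ways) u").format(union)
    return cursor.execute(query).fetchone()[0]


distinct_users = count_users(cur, "UNION")    # 3: alice, bob, carol
total_rows = count_users(cur, "UNION ALL")    # 5: every node and way row
```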


#### 3. Number of nodes and ways

In [153]:
query4 = "SELECT COUNT(*) FROM nodes"
query5 = "SELECT COUNT(*) FROM ways"
c.execute(query4)
print "number of nodes are:", c.fetchall()[0][0]
c.execute(query5)
print "number of ways are:", c.fetchall()[0][0]

number of nodes are: 2029610
number of ways are: 281693


#### 4. Number of restaurants and clinics

In [154]:
query6 = '''
        SELECT nodes_tags.value, count(nodes_tags.value) as num
        FROM nodes_tags
        WHERE nodes_tags.value = 'restaurant' OR nodes_tags.value = 'clinic'
        GROUP BY nodes_tags.value
'''
c.execute(query6)
for i in c.fetchall():
    print "number of", i[0], "=",i[1]

number of clinic = 35
number of restaurant = 341


## Some important statistics on users

In [155]:
query7 = '''
    SELECT tags.user, COUNT(tags.user) as num
    FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) tags
    GROUP BY tags.user
    ORDER BY num DESC; 
'''
c.execute(query7)
df=pd.DataFrame(c.fetchall())
df = df.set_index(df[0])
df = df.drop(0,1)
df[1]=df[1]/df[1].sum(axis=0)*100
df = df.rename(columns={ 1: "Percentage"})
df.head(10)

| User | Percentage |
|---|---|
| anushap | 3.22965 |
| PlaneMad | 3.222944 |
| parambyte | 3.138533 |
| Ashok09 | 2.840303 |
| premkumar | 2.722317 |
| Srikanth07 | 2.632887 |
| Narsimulu | 2.583132 |
| sampath reddy | 2.401286 |
| Naresh08 | 2.393152 |
| ravikumar1 | 2.209317 |


* The top 10 users contributed 27.37% of all entries (nodes and ways).
* The top contributor accounts for 3.23% of the total, and the second for 3.22%.
* The contributions are spread fairly evenly among the top users, which does not indicate any involvement of bots.
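The 27.37% figure can be checked by summing the ten percentages shown above. The sketch below also includes a small hypothetical helper, `top_share`, for computing the same share directly from raw per-user counts.

```python
def top_share(counts, k=10):
    """Percentage of all contributions made by the k largest contributors."""
    total = sum(counts)
    top = sorted(counts, reverse=True)[:k]
    return 100.0 * sum(top) / total


# Percentages of the ten users listed above
shown = [3.22965, 3.222944, 3.138533, 2.840303, 2.722317,
         2.632887, 2.583132, 2.401286, 2.393152, 2.209317]
top10_percent = round(sum(shown), 2)  # 27.37
```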

## Other ideas about the datasets

#### Top 10 appearing amenities

```sql
    SELECT value, COUNT(*) as num
    FROM nodes_tags
    WHERE key='amenity'
    GROUP BY value
    ORDER BY num DESC
    LIMIT 10;
```

In [156]:
query8 = '''
    SELECT value, COUNT(*) as num
    FROM nodes_tags
    WHERE key='amenity'
    GROUP BY value
    ORDER BY num DESC
    LIMIT 10 
'''
c.execute(query8)
c.fetchall()

[(u'restaurant', 341),
 (u'place_of_worship', 287),
 (u'bank', 269),
 (u'cafe', 133),
 (u'school', 118),
 (u'fast_food', 110),
 (u'atm', 103),
 (u'fuel', 102),
 (u'hospital', 101),
 (u'toilets', 76)]

#### Religions in Mumbai

```sql
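    SELECT nodes_tags.value, COUNT(*) as num
    FROM nodes_tags
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='place_of_worship') i
    ON nodes_tags.id=i.id
    WHERE nodes_tags.key='religion'
    GROUP BY nodes_tags.value
    ORDER BY num DESC;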
```

In [157]:
query9 = '''
    SELECT nodes_tags.value, COUNT(*) as num
    FROM nodes_tags 
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='place_of_worship') i
    ON nodes_tags.id=i.id
    WHERE nodes_tags.key='religion'
    GROUP BY nodes_tags.value
    ORDER BY num DESC;
'''
c.execute(query9)
c.fetchall()

[(u'hindu', 109),
 (u'muslim', 69),
 (u'christian', 34),
 (u'buddhist', 13),
 (u'jain', 6),
 (u'sikh', 4),
 (u'zoroastrian', 2),
 (u'jewish', 1)]

# Conclusions
Throughout this report, and the project at large, an effort has been made to clean the data and to draw insights from the cleaned data. No claim is made that the data wrangling and cleaning performed here is perfect, but an adequate attempt has been made. The analysis shows that a high percentage of the data was edited by humans and that a large number of users contributed to it, which could be one reason for the poor quality of the data. The data has also been accumulated from many different sources, which has further deteriorated its quality. Checks at the data collection stage itself could improve the data quality considerably, and enhanced analytical tools and techniques would allow the data to be analyzed more efficiently.