# Data Wrangling Project with Python and MongoDB
## 1.What is Data Wrangling?
## 2.Project Explanation
## 3.Data Exploration
## 4.Conclusion
<hr/>

## 1.What is data wrangling?
#### Definition: 
The process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
#### Why do we need it?
Once we have the data, its shape is usually not consistent or clean, so we need to reshape it and remove what is unnecessary. This is especially true when the data comes from web scraping or when multiple data sources are combined.
#### Tools used:
There are several ways to clean (wrangle) data. The approaches listed below are the most common today.
<ol>
    <li>Manually, by hand [not practical for large datasets].</li>
    <li>Programming [scripts that complete a specific task; excellent in most cases].</li>
    <li>Applications [commercial or open source; most of them work with only a small set of data types].</li>
</ol>


## 2.Project Explanation:
###### The project aims to help data analysts understand the wrangling process, from getting the data and cleaning it to storing it in a local database.
This project is about OpenStreetMap (OSM), a collaborative Geographical Information System (GIS). OpenStreetMap allows the community to add, update and delete data on its maps. It was launched by Steve Coast 12 years ago and has more than 3 million users.

###### How does the project work?
To complete the project you need to download an OSM XML dataset for an area you are interested in, then audit the data in several ways, for example:
<ul>
    <li>Languages</li>
    <li>Data formats</li>
    <li>Abbreviation</li>
    <li>Etc...</li>
</ul>

Also, statistical overview of the dataset must be provided for:
<ul>
    <li>Size of the file</li>
    <li>Total elements</li>
    <li>Number of unique users</li>
    <li>Number of nodes and ways</li>
    <li>Number of chosen type of nodes, like cafes, shops etc.</li>
    <li>Top 5 religions</li>
    <li>Top 5 food types</li>
    <li>Total Nando's restaurants</li>
    <li>Total number of Restaurants and Cafes</li>
</ul>

Finally, you need to provide suggestions for improving and analyzing the data, and include a thoughtful discussion of the benefits as well as some anticipated problems in implementing the improvements.
<hr/>

## 3.Data Exploration
I selected Cape Town, South Africa as the area to clean. I will go through the following steps:
<ol>
    <li>Download map's data</li>
    <li>Prepare workspace</li>
    <li>Select dataset file</li>
    <li>Check file size</li>
    <li>Calculate total tags</li>
    <li>Calculate total users</li>
    <li>Audit abbreviations</li>
    <li>Convert data to JSON</li>
    <li>Store data in MongoDB</li>
    <li>Statistical overview</li>
</ol>

#### 1.Download Map's Data
I downloaded the Cape Town extract from the https://mapzen.com/data/metro-extracts datasets. The dataset path is https://s3.amazonaws.com/metro-extracts.mapzen.com/cape-town_south-africa.osm.bz2
<hr/>

#### 2.Prepare Workspace
I used **Python** and **MongoDB** to clean and query the data. **Python** is well suited to this task: scripts are quick to write, and the standard library already covers XML parsing and regular expressions. **MongoDB**, on the other hand, is a document-oriented NoSQL database that stores JSON-like documents, so I used it to store my dataset after cleaning it.

#### 3.Select Dataset File
My dataset is huge, so I created a sample to make the process faster. A Python script called **_sampler.py_** produces the sample, which lets us test our code more quickly. The script's final output is a **_sample.osm_** file.
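
The sampling step could look roughly like the sketch below. It assumes the sample keeps every k-th top-level element (`node`, `way` or `relation`); the function names and the default value of `k` are illustrative and are not taken from the original **_sampler.py_**.

```python
import xml.etree.ElementTree as ET

TOP_LEVEL_TAGS = ('node', 'way', 'relation')


def get_elements(osm_file, tags=TOP_LEVEL_TAGS):
    """Yield each top-level element whose tag is in `tags`."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # free memory held by elements already processed


def write_sample(osm_file, sample_file, k=50):
    """Write every k-th top-level element of osm_file to sample_file."""
    with open(sample_file, 'wb') as output:
        output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write(b'<osm>\n')
        for i, element in enumerate(get_elements(osm_file)):
            if i % k == 0:
                # ET.tostring returns ASCII bytes by default
                output.write(ET.tostring(element))
        output.write(b'</osm>\n')
```

A larger `k` gives a smaller sample, so `k` can be tuned until the sample is small enough to iterate on quickly.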

    # The main file
    OSM_FILE = 'cape-town_south-africa.osm'
    # The sample file
    SAMPLE_FILE = 'sample.osm'

#### 4.Check File Size
The main dataset file's size is **283.0 MB**, while the sample dataset file's size is **5.7 MB**. The project requires a dataset larger than **50.00 MB**.
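
A file size like the ones above can be read with the standard library. This is a minimal sketch; the helper name is hypothetical, and it assumes 1 MB means 10^6 bytes, which may differ from how the figures above were measured.

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (assuming 1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 1e6
```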

<hr/>

#### 5.Calculate Total Tags
The dataset is in an XML-like format, which means it is made up of many opening and closing tags, like this: < tag >Content</ tag >. Knowing the tag counts helps us understand the dataset's structure and details.
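
Tag counts of this kind can be produced with one streaming pass over the file. The sketch below is one possible way to do it with `xml.etree.ElementTree.iterparse`; the function name is illustrative and this is not necessarily the code used for this project.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(osm_file):
    """Return a Counter mapping each XML tag name to how often it occurs."""
    counts = Counter()
    # By default iterparse reports an element once its closing tag is read.
    for _, elem in ET.iterparse(osm_file):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children we no longer need
    return counts
```

Calling `count_tags(OSM_FILE)` would return a mapping from tag name to count.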

The table below lists all tags in the dataset with their counts.

_To know more about tags you can check OpenStreetMap wiki._


|    Tag   |  Count  |
|:--------:|:-------:|
|  bounds  |    1    |
|  member  |  30724  |
|    nd    | 1554908 |
|   node   | 1350871 |
|    osm   |    1    |
| relation |   3033  |
|    tag   |  604348 |
|    way   |  212159 | 


The key of each `tag` element falls into one of the following categories:

    1- lower: valid tags in lowercase
    
    2- lower_colon: valid tags with a colon in their names
    
    3- problemchars: tags with problematic characters
    
    4- other: other tags that do not fall into the other three categories
    


|      Tag     |  Count |
|:------------:|:------:|
|     lower    | 555062 |
|  lower_colon |  48113 |
| problemchars |    7   |
|     other    |  1166  |


The dataset contains the numbers of tags shown above, which means it is fairly large.
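
The four categories above can be assigned with three regular expressions. The patterns and names below are assumptions made for illustration; the exact expressions used for the table are not shown in this write-up.

```python
import re
from collections import Counter

# Assumed patterns: lowercase letters and underscores only, the same with
# one colon, and a set of characters that are awkward in database keys.
LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(key):
    """Classify a tag key as lower, lower_colon, problemchars or other."""
    if LOWER.search(key):
        return 'lower'
    if LOWER_COLON.search(key):
        return 'lower_colon'
    if PROBLEMCHARS.search(key):
        return 'problemchars'
    return 'other'


def count_key_types(keys):
    """Count how many keys fall into each category."""
    return Counter(key_type(key) for key in keys)
```

For example, `highway` is classified as `lower`, `addr:street` as `lower_colon`, and a key containing a space as `problemchars`.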

<hr/>

#### 6.Calculate Total Users
The number of users who edited the map is a good indicator of how active the community is. A larger number suggests a broader community, while a smaller number usually suggests that much of the data came from automated bots.

Our dataset contains **1538 users** which is a good number.
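
One way to arrive at such a number is to collect the distinct `uid` attributes while streaming through the file. This is a sketch under the assumption that users are identified by `uid`; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def unique_users(osm_file):
    """Return the set of distinct uid values found in the OSM file."""
    users = set()
    for _, elem in ET.iterparse(osm_file):
        uid = elem.get('uid')  # None for elements without a uid attribute
        if uid is not None:
            users.add(uid)
        elem.clear()
    return users
```

The number of users is then `len(unique_users(OSM_FILE))`.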

<hr/>

#### 7.Audit Abbreviations
##### 7.1 Audit Street Types:
As noted earlier, our dataset contains many abbreviations. Here we list the abbreviations found in the dataset and map each one to its full form.

For example, **Rd.** will be replaced with **Road**. This gives the data a more consistent look and feel.

    import re

    # A regular expression that matches the last word of a street name,
    # which is usually its type [street, road, etc...]
    street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

    # What we want to see
    expected = ["Avenue", "Boulevard", "Commons", "Court", "Drive", "Lane", "Parkway",
                "Place", "Road", "Square", "Street", "Trail", "Circle", "Highway"]

    # Mapping abbreviations to expected street types.
    mapping = {'Ave'  : 'Avenue',
               'Blvd' : 'Boulevard',
               'Dr'   : 'Drive',
               'Ln'   : 'Lane',
               'Pkwy' : 'Parkway',
               'Rd'   : 'Road',
               'Rd.'  : 'Road',
               'St'   : 'Street',
               'st'   : 'Street',
               'street' : "Street",
               'stre' : "Street",
               'stree' : "Street",
               'Ct'   : "Court",
               'Cir'  : "Circle",
               'Cr'   : "Court",
               'ave'  : 'Avenue',
               'Hwg'  : 'Highway',
               'Hwy'  : 'Highway',
               'Sq'   : "Square"}


    def audit_street(value):
        """Return the street name with an abbreviated type replaced by its full form."""
        m = street_type_re.search(value)
        if m:
            street_type = m.group()
            # Skip types that are already expected or have no known replacement,
            # so an unknown type no longer raises a KeyError.
            if street_type not in expected and street_type in mapping:
                # Replace only the trailing match, not every occurrence in the name.
                value = value[:m.start()] + mapping[street_type]
        return value


After auditing all items, a sample result is listed below. 
    
    Old Name                      => New Name
    Main Rd                       => Main Road
    Foam Rd                       => Foam Road
    Solan st                      => Solan Street
    Caledon St                    => Caledon Street
    De Villiers St                => De Villiers Street
    New Church street             => New Church Street
    Main Rd                       => Main Road
    Barrack St                    => Barrack Street
    Rhine Road & Eindhoven street => Rhine Road & Eindhoven Street
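
Before rewriting names, it helps to see which street types in the data are not expected. The sketch below groups street names by their unexpected last word; the function name and the shortened `EXPECTED` set are illustrative assumptions, while the regular expression is the one used above.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Shortened list for the example; the full list is given above.
EXPECTED = {'Avenue', 'Drive', 'Road', 'Street'}


def collect_street_types(street_names, expected=EXPECTED):
    """Group street names by their last word when it is not an expected type."""
    unexpected = defaultdict(set)
    for name in street_names:
        m = street_type_re.search(name)
        if m and m.group() not in expected:
            unexpected[m.group()].add(name)
    return unexpected
```

The keys of the result are the candidates for new entries in `mapping`.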



<hr />

##### 7.2 Audit Cities Names
Because the extract covers the Cape Town area, it also includes some smaller neighbouring towns. We are going to clean up how city names are written. City names must be in title case, which means cape Town, cape town, Cape town and cape-town are not accepted. It has to be Cape Town.

I went over the data and built a set of all the city names used. Then I replaced the wrong names with corrected ones.

    # Names in the dataset.
    expected_city_names = ['Abbotsdale', 'Athlone', 'Atlantis', 'Belgravia', 'Bellville',
         'Bergvliet', 'Blouberg', 'Blue Downs', 'Bothasig', 'Brackenfell', 'Camps Bay', 'Cape Town',
         'Capetown', 'Capricorn', 'Century City', 'Claremont', 'Constantia', 'De Tijger',
         'Delft Cape Town', 'Diep River', 'Durbanville', 'Epping', 'Filippi',
         'Foreshore, Cape Town', 'Gardens', 'Glencairn Heights', 'Goodwood', 'Gordons Bay',
         'Grassy Park', 'Green Point', 'Hout Bay', 'Hout Bay Harbour', 'Hout Bay Heights Estate',
         'Kalbaskraal', 'Kapstaden', 'Kenilworth', 'Khayelitsha', 'Killarney Gardens, Cape Town',
         'Kommetjie', 'Kuilsriver', 'Kuilsrivier', 'Lansdowne', 'Loevenstein','Maitland',
         'Makahza', 'Manenberg', 'Marina Da Gama', 'Melkbosstrand', 'Mfuleni',
         'Milnerton', 'Milnerton,  Cape Town', 'Mitchells Plain', 'Mowbray', 'Mowbray, Cape Town',
         'Muizenberg', 'Nerina Lane', 'Newlands', 'Noordhoek', 'Noordhoek, Cape Town', 'Nyanga',
         'Observatory', 'Paarden Eiland' ,'Paarl' ,'Parklands' ,'Parow','Philadelphia',
         'Pinelands','Plumstead','Pniel','Pringle Bay','Richwood','Rondebosch','Rondebosch East','Rondebosh East',
         'Rosebank','Salt River','Scarborough','Sea Point','Sea Point, Cape Town',"Simon's Town",
         'Somerset West','Sonnekuil','Steenberg','Stellenbosch','Stellenbosch Farms','Strand',
         'Strandfotein','Suider Paarl','Sybrand Park','Table View','Techno Park','Technopark',
         'Test city','Vredehoek','Vrygrond','Welgelegen 2','Welgemoed','Wellington','Woodstock',
         'Woodstock, Cape Town','Wynberg','Zonnebloem']

    # Wrong names with better names.
    city_names_mapping = {
             'cape Town' : "Cape Town",
             'cape town' : "Cape Town",
             'cape-town' : "Cape Town",
             'Cape town' : "Cape Town",
             'muizenberg' : "Muizenberg",
             'rylands' : "Rylands"
    }

    def audit_city(value):
        """Return the corrected city name, or the value unchanged if no fix is known."""
        if value not in expected_city_names:
            # .get avoids a KeyError for names that are missing from the mapping.
            return city_names_mapping.get(value, value)
        return value


After auditing all items, a sample result is listed below.

    Old Name   => New Name
    cape town  => Cape Town
    cape Town  => Cape Town
    muizenberg => Muizenberg
    cape Town  => Cape Town
    cape Town  => Cape Town
    rylands    => Rylands

#### 8.Convert Data to JSON
Our dataset is in XML format. We are going to store the data in MongoDB, but first we have to convert the dataset from XML to JSON.

To start doing that, we have to set the structure of the object, then start changing items based on it. 

A sample JSON object is below:

    {
        'pos':[-33.9322555, 18.8587291], 
        'type': 'node', 
        'id': '18401303', 
        'highway': 'traffic_signals', 
        'created': {
            'changeset': '19306159', 
            'user': 'kaiD', 
            'version': '4', 
            'uid': '282726', 
            'timestamp': '2013-12-06T13:30:06Z'
        }
    }
    
The dataset was converted to JSON and saved locally.
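
A conversion that produces objects of the shape shown above might look like the following sketch. It assumes that only `node` and `way` elements are converted and that way references are kept in a `node_refs` list; both choices, and the function name, are assumptions rather than the exact code used here.

```python
import xml.etree.ElementTree as ET

# Attributes that are grouped under the 'created' key.
CREATED = ('version', 'changeset', 'timestamp', 'user', 'uid')


def shape_element(element):
    """Convert a node or way element to a dict; return None for other tags."""
    if element.tag not in ('node', 'way'):
        return None
    doc = {'type': element.tag, 'created': {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc['created'][key] = value
        elif key not in ('lat', 'lon'):
            doc[key] = value
    if 'lat' in element.attrib and 'lon' in element.attrib:
        doc['pos'] = [float(element.get('lat')), float(element.get('lon'))]
    # Each child <tag k="..." v="..."/> becomes a top-level field.
    for tag in element.iter('tag'):
        doc[tag.get('k')] = tag.get('v')
    if element.tag == 'way':
        doc['node_refs'] = [nd.get('ref') for nd in element.iter('nd')]
    return doc
```

One caveat of this simple shape is that a tag whose key is `type` or `id` would overwrite the element's own field, so a real script may want to rename or skip such keys.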

<hr/>

#### 9.Store Data in MongoDB
After cleaning the data and converting it to a clean structure, we save it in a local MongoDB database.


A long process ends with adding all items to the database.
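
With more than a million documents, inserting in batches keeps memory use flat. The helper below is a sketch that works with any object offering an `insert_many` method, such as a PyMongo collection; the helper name, the batch size and the database name `osm` in the usage comment are assumptions.

```python
def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents in batches; return how many were inserted."""
    batch = []
    inserted = 0
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:  # insert whatever is left over
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


# Hypothetical usage with PyMongo (the database name 'osm' is an assumption):
#   from pymongo import MongoClient
#   db = MongoClient('localhost', 27017)['osm']
#   insert_in_batches(db.map, documents)
```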

<hr/>

#### 10.Statistical Overview
The last step is a statistical overview of the database, covering:
<ul>
    <li>Size of the file</li>
    <li>Total elements</li>
    <li>Number of unique users</li>
    <li>Number of nodes and ways</li>
    <li>Number of chosen type of nodes, like cafes, shops etc.</li>
    <li>Top 5 religions</li>
    <li>Top 5 food types</li>
    <li>Total Nando's restaurants</li>
    <li>Total number of Restaurants and Cafes</li>
</ul>

#### Size of the file
What do you think, which file is bigger, the old or the new one?

    The old file size is: 283.0 MB.
    The new file size is: 324.6 MB.


#### Total elements
Here is how we can count the number of documents in the database.

    # total documents in DB
    # (count_documents replaces cursor.count(), which newer PyMongo versions removed)
    db.map.count_documents({})

Total DB documents:  1563030


#### Number of unique users
Now we are going to calculate the number of users who participated in the map.

    # total number of users
    unique_users = len(db.map.distinct('created.user'))
    print ("Total users who participated in the map is: {} users.".format(unique_users))
    
Total users who participated in the map is: 1529 users.


#### Number of nodes and ways
We are going to count the number of nodes and ways in the database.

    # count items where type = way
    total_ways = db.map.count_documents({'type':'way'})

    # count items where type = node
    total_nodes = db.map.count_documents({'type':'node'})

    print ("Total ways in the map is: {} ways.".format(total_ways))
    print ("Total nodes in the map is: {} nodes.".format(total_nodes))

Total ways in the map is: 212127 ways.
Total nodes in the map is: 1350860 nodes.


#### Number of chosen type of nodes, like cafes, shops etc.
Let us see together which amenities are the most common in the selected area.

    import pprint

    amenities = db.map.aggregate([{"$match":{"amenity":{"$exists":1}}}, {"$group":{"_id":"$amenity",
        "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":10}])

    pprint.pprint(list(amenities))


    [
         {u'_id': u'parking', u'count': 1719},
         {u'_id': u'restaurant', u'count': 556},
         {u'_id': u'school', u'count': 523},
         {u'_id': u'toilets', u'count': 441},
         {u'_id': u'drinking_water', u'count': 323},
         {u'_id': u'place_of_worship', u'count': 307},
         {u'_id': u'fuel', u'count': 302},
         {u'_id': u'fast_food', u'count': 252},
         {u'_id': u'waste_basket', u'count': 187},
         {u'_id': u'atm', u'count': 174}
    ]


#### Top 5 religions
We are going to look at the **place_of_worship** amenities to find the top religions.

    # Matching on the value already implies that the field exists.
    religions = db.map.aggregate([{"$match":{"amenity":"place_of_worship"}},
                     {"$group":{"_id":"$religion", "count":{"$sum":1}}},{"$sort":{"count":-1}}, {"$limit":5}])
    list(religions)
    
    [
     {u'_id': u'christian', u'count': 239},
     {u'_id': u'muslim', u'count': 35},
     {u'_id': None, u'count': 23},
     {u'_id': u'hindu', u'count': 5},
     {u'_id': u'jewish', u'count': 5}
    ]
    

#### Top 5 food types
We are going to check the top 5 food types [restaurant cuisines]. A count with `_id` of `None` means the restaurant has no cuisine recorded.

    # Matching on the value already implies that the field exists.
    places = db.map.aggregate([{"$match":{"amenity":"restaurant"}},
                               {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},
                               {"$sort":{"count":-1}}, {"$limit":5}])
    list(places)
    
    [
        {u'_id': None, u'count': 311},
        {u'_id': u'regional', u'count': 42},
        {u'_id': u'italian', u'count': 24},
        {u'_id': u'pizza', u'count': 21},
        {u'_id': u'steak_house', u'count': 14}
    ]


#### Total Nando's restaurants
Nando's is one of the most famous restaurant chains in South Africa. Here we count how many of them exist in Cape Town.

    nandos_count = db.map.count_documents({"$or":[ {"name": "Nando's"}, {"name": "Nandos"}]})

    print("Total number of Nando's restaurants: {}".format(nandos_count))

Total number of Nando's restaurants: 12


#### Total number of Restaurants and Cafes
For a city to be a good place for visitors, it should have plenty of restaurants and cafes.

    eating_places_count = db.map.count_documents({"$or":[ {"amenity": "cafe"}, {"amenity": "restaurant"}]})
    print("Number of restaurants and cafes: {}".format(eating_places_count))

Number of restaurants and cafes: 708


<hr/>

## 4.Conclusion
In conclusion, OpenStreetMap is an amazing project. Because it is an open, community-driven project, it helps the community and can improve many applications. I like how people participate in the project and help each other.

I suggest that the site's maintainers help people develop automated bots that correct the data via the Google Maps API. This would let OpenStreetMap draw on Google Maps data, and Google Maps would also benefit from the community. Using Google Maps together with OpenStreetMap, for example via the Google Places API, would help OpenStreetMap improve its place details.

**Python Google Places** is a library to work with the API. You can find it here:

https://github.com/slimkrazy/python-google-places

This would allow us to search for a location and fill in its empty details.

    # This is an imagined sample of how place details could be filled in using the API.
    '''
    for place in places:
        if place.phone is None:
            place.phone = API.find(place.geolocation).phone
        if place.logo is None:
            place.logo = API.find(place.geolocation).logo
        ...
    '''


This solution works well, but we cannot be 100% sure that the new data is correct. The old data may be better than the new, so we are going to fill only the empty attributes.
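
The "fill only empty attributes" rule can be sketched as a small merge function over plain dictionaries. The function name and the idea of treating `None` and empty strings as empty are assumptions made for this example.

```python
def fill_missing(existing, candidate, fields):
    """Return a copy of `existing` with only its empty fields taken from `candidate`."""
    merged = dict(existing)
    for field in fields:
        is_empty = merged.get(field) in (None, '')
        has_new_value = candidate.get(field) not in (None, '')
        if is_empty and has_new_value:
            merged[field] = candidate[field]
    return merged
```

Values that are already present are never overwritten, so existing data that the community entered by hand stays untouched.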


Another improvement could come from running competitions to improve the data, either manually or with automated tools. This idea may backfire, because many people may add spam content in order to win.


For the dataset's structure, I think adding a logo attribute to places is a good idea, as well as adding images of the place.

*Refs:*
<ul>
    <li>Wikipedia</li>
    <li>Udacity</li>
    <li>Stack Overflow</li>
</ul>