# Data Wrangling Project with Python and MongoDB
## 1.What is Data Wrangling?
## 2.Project Explanation
## 3.Data Exploration
## 4.Conclusion
<hr/>

## 1.What is data wrangling?
#### Definition: 
The process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
#### Why do we need it?
Freshly collected data is rarely consistent or clean, so we need to reshape it and remove what we do not need. This is especially true when the data comes from web scraping or when several data sources are combined.
#### Tools used:
There are several ways to clean (wrangle) data. The approaches listed below are the most common today.
<ol>
    <li>Manually by hand [not practical for large datasets].</li>
    <li>Programming [scripts that complete a specific task; works well in most cases].</li>
    <li>Applications [commercial or open-source applications; most of them work with only a small set of data types].</li>
</ol>


## 2.Project Explanation:
###### The project aims to help the data analyst understand the wrangling process, from getting the data, to cleaning it, to storing it in a local database.
This project is about OpenStreetMap (OSM), a collaborative Geographical Information System (GIS). OpenStreetMap lets the community add, update and delete map data. It was launched by Steve Coast 12 years ago and has more than 3 million users.

###### How does the project work?
To complete the project, you need to download an OSM XML dataset for an area you are interested in, and then audit the data in several ways:
<ul>
    <li>Languages</li>
    <li>Data formats</li>
    <li>Abbreviation</li>
    <li>Etc...</li>
</ul>

Also, statistical overview of the dataset must be provided for:
<ul>
    <li>Size of the file</li>
    <li>Number of unique users</li>
    <li>Number of nodes and ways</li>
    <li>Number of chosen type of nodes, like cafes, shops etc.</li>
</ul>

Finally, you need to provide suggestions for improving and analyzing the data, and include a thoughtful discussion of the benefits as well as some anticipated problems in implementing the improvements.
<hr/>

## 3.Data Exploration
I selected Cape Town, South Africa, as the area to clean. I will follow these steps:
<ol>
    <li>Download map's data</li>
    <li>Prepare workspace</li>
    <li>Select dataset file</li>
    <li>Check file size</li>
    <li>Calculate total tags</li>
    <li>Calculate total users</li>
    <li>Audit abbreviations</li>
    <li>Apply changes</li>
    <li>Convert data to JSON</li>
    <li>Store data in MongoDB</li>
    <li>Statistical overview</li>
</ol>

#### 1.Download Map's Data
I downloaded the Cape Town extract from the https://mapzen.com/data/metro-extracts datasets. The dataset path is https://s3.amazonaws.com/metro-extracts.mapzen.com/cape-town_south-africa.osm.bz2
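
The download is a bz2 archive, so it has to be decompressed before parsing. The following is a minimal sketch of one way to do that with the standard library; the function name `decompress_bz2` and the local file names are assumptions for illustration and are not part of the original project.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file into dst_path without loading it all into memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copy in chunks so that a large extract does not have to fit in RAM
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Hypothetical usage, assuming the archive sits in the working directory:
# decompress_bz2("cape-town_south-africa.osm.bz2", "cape-town_south-africa.osm")
```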
<hr/>

#### 2.Prepare Workspace
I used **Python** and **MongoDB** to clean and query the data. **Python** is well suited to this task because it is lightweight and quick to script in. **MongoDB**, on the other hand, is a fast NoSQL database. I used it to store my dataset after cleaning it.

#### 3.Select Dataset File
My dataset is huge, so I created a sample to speed up the process. A Python script called **_sampler.py_** builds the sample, which lets us check our code faster. The script's final output is a **_sample.osm_** file.
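
The sampler script itself is not shown here. As a rough sketch of how such a script could work, the code below keeps every k-th top-level element (node, way or relation) and writes it to a new file. The function names and the default `k=50` are assumptions, not the actual contents of **_sampler.py_**.

```python
import xml.etree.ElementTree as ET


def get_element(osm_file, tags=("node", "way", "relation")):
    """Yield each top-level element of the given types, freeing memory as we go."""
    context = iter(ET.iterparse(osm_file, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # drop elements that were already processed


def write_sample(osm_file, sample_file, k=50):
    """Write every k-th top-level element of osm_file to sample_file."""
    with open(sample_file, "w", encoding="utf-8") as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        for i, element in enumerate(get_element(osm_file)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding="unicode"))
        output.write("</osm>\n")


# Hypothetical usage:
# write_sample("cape-town_south-africa.osm", "sample.osm", k=50)
```

A larger `k` gives a smaller sample, so it can be tuned until the sample file is small enough to iterate on quickly.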

In [32]:
# The main file
OSM_FILE = 'cape-town_south-africa.osm'
# The sample file
#OSM_FILE = 'sample.osm'

#### 4.Check File Size
The main dataset file is **283.0 MB**, while the sample dataset file is **5.7 MB**. The project requires a dataset larger than **50.00 MB**.

In [31]:
import os

# This function takes a file size in bytes and converts it to a human-readable size
def convert_bytes_to_size(num):
    for x in ['Bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, x)
        num = num / 1024.0

# Return the human-readable size of a file, or None if the path is not a file
def get_file_size(file_path):
    if os.path.isfile(file_path):
        file_info = os.stat(file_path)
        return convert_bytes_to_size(file_info.st_size)

size = get_file_size(OSM_FILE)
print('File size: {}'.format(size))


File size: 5.7 MB


The output above was produced with the sample file. The full dataset is 283 MB, which means it is bigger than the required 50 MB.


<hr/>

#### 5.Calculate Total Tags
The dataset is in XML format, which means it is made up of many opening and closing tags like this: `<tag>Content</tag>`. Knowing the tag counts helps us understand the structure and details of the dataset.

The table below lists all tags in the dataset with their counts.

_To know more about tags you can check OpenStreetMap wiki._


|    Tag   |  Count  |
|:--------:|:-------:|
|  bounds  |    1    |
|  member  |  30724  |
|    nd    | 1554908 |
|   node   | 1350871 |
|    osm   |    1    |
| relation |   3033  |
|    tag   |  604348 |
|    way   |  212159 | 


Tag keys in the dataset fall into one of the following categories:

    1- lower: valid tags in lowercase
    
    2- lower_colon: valid tags with a colon in their names
    
    3- problemchars: tags with problematic characters
    
    4- other: other tags that do not fall into the other three categories
    


|      Tag     |  Count |
|:------------:|:------:|
|     lower    | 555062 |
|  lower_colon |  48113 |
| problemchars |    7   |
|     other    |  1166  |


In [33]:
import pprint
import xml.etree.ElementTree as ET

# Using ElementTree, loop over the dataset and count how many times each tag appears.
def get_tags_count(filename):
        tags_dictionary = {}
        for event, element in ET.iterparse(filename):
            if element.tag in tags_dictionary: 
                tags_dictionary[element.tag] = tags_dictionary[element.tag] + 1
            else:
                tags_dictionary[element.tag] = 1
        return tags_dictionary

pprint.pprint(get_tags_count(OSM_FILE))

{'bounds': 1,
 'member': 30724,
 'nd': 1554908,
 'node': 1350871,
 'osm': 1,
 'relation': 3033,
 'tag': 604348,
 'way': 212159}


The dataset contains the tag counts shown above, which means it is fairly large.
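
For a file of this size, memory use during parsing can become a concern. The sketch below is one possible variant of the counting step that uses `collections.Counter` and clears each element once it has been counted. The function name `count_tags` is an assumption and is not used elsewhere in this project.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(filename):
    """Count how often each tag name occurs while keeping memory use low."""
    counts = Counter()
    # "end" events fire once an element and all of its children are fully parsed
    for _, element in ET.iterparse(filename, events=("end",)):
        counts[element.tag] += 1
        element.clear()  # children were counted on their own "end" events
    return counts
```

The result behaves like a dictionary, so it can be printed with `pprint.pprint(dict(count_tags(OSM_FILE)))` in the same way as above.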

In [34]:
import re

'''
    Regular expressions used to check all tag keys:
    "lower", valid tags in lowercase,
    "lower_colon", valid tags with a colon in their names,
    "problemchars", tags with problematic characters, and
    "other", other tags that do not fall into the other three categories.
'''
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
    return keys


def check_key_types(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for event, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

keys = check_key_types(OSM_FILE)
pprint.pprint(keys)

{'lower': 555062, 'lower_colon': 48113, 'other': 1166, 'problemchars': 7}


Most tag keys in the dataset are valid lowercase keys. Only 7 keys contain problematic characters.
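
To make the four categories more concrete, here is a small sketch that applies the same three regular expressions to a handful of keys. The example keys and the helper name `classify_key` are made up for illustration.

```python
import re

LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for a single tag key, checked in the same order as above."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


# Hypothetical keys, one per category (the last two both fall into "other")
for key in ["highway", "addr:street", "my key", "name:en:alt", "FIXME"]:
    print(key, "->", classify_key(key))
```

Keys with more than one colon or with uppercase letters end up in "other", which is why that category is larger than "problemchars".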

<hr/>

#### 6.Calculate Total Users
The number of users who helped edit the map is a very good indicator of how active the community is. A bigger number suggests a larger community, whereas a smaller number usually suggests that the edits came from automated bots.

Our dataset contains **1538 users**, which is a good number.

In [35]:
# Check every element and who edited it. A set is used so that each user is counted only once [no duplicates].
def get_users_set(filename):
    users_set = set()
    for event, element in ET.iterparse(filename):
        for item in element:
            if 'uid' in item.attrib:
                users_set.add(item.attrib['uid'])
    return users_set
users = get_users_set(OSM_FILE)
len(users)


1538

Our dataset was populated and edited by 1538 users.
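
The function above looks at the children of every element. A possible alternative, sketched below under the assumption that only elements carrying a `uid` attribute matter, reads the attribute directly from each element as it is parsed. The name `get_user_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET


def get_user_ids(filename):
    """Collect the distinct uid values found on any element in the file."""
    user_ids = set()
    for _, element in ET.iterparse(filename, events=("end",)):
        uid = element.get("uid")
        if uid is not None:
            user_ids.add(uid)
        element.clear()  # the attributes were already read, so free the element
    return user_ids
```

Calling `len(get_user_ids(OSM_FILE))` would then give the number of distinct contributors.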

<hr/>

#### 7.Audit Abbreviations
Here we list the common abbreviations and match them against the abbreviations found in the dataset.

In [21]:
# A regular expression to check the last word in an element name, which is usually its type [street, road, etc...]
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# What we want to see
expected = ["Avenue", "Boulevard", "Commons", "Court", "Drive", "Lane", "Parkway", 
                         "Place", "Road", "Square", "Street", "Trail", "Circle", "Highway"]

# Mapping abbreviations to expected.
mapping = {'Ave'  : 'Avenue',
           'Blvd' : 'Boulevard',
           'Dr'   : 'Drive',
           'Ln'   : 'Lane',
           'Pkwy' : 'Parkway',
           'Rd'   : 'Road',
           'Rd.'   : 'Road',
           'St'   : 'Street',
           'st'   : 'Street',
           'street' :"Street",
           'stre' :"Street",
           'stree' :"Street",
           'Ct'   : "Court",
           'Cir'  : "Circle",
           'Cr'   : "Court",
           'ave'  : 'Avenue',
           'Hwg'  : 'Highway',
           'Hwy'  : 'Highway',
           'Sq'   : "Square"}


In [28]:
from collections import defaultdict

# This function takes a dictionary of street types and a street name.
# If the name's ending is not in the expected list, the name is added to the dictionary.
def audit_street_type(street_types_dict, value):
    m = street_type_re.search(value)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types_dict[street_type].add(value)

# Take a street name and, if its ending is a known abbreviation, return the corrected name.
def audit_street(value):
    m = street_type_re.search(value)
    if m:
        street_type = m.group()
        if street_type not in expected and street_type in mapping:
            # replace only the trailing street type, not every occurrence in the name
            value = value[:m.start()] + mapping[street_type]
    return value

# This function takes a filename and loops over nodes and ways that have a street address.
# It records unexpected street types and prints each name that would be corrected.
def audit_street_names(filename):
    street_types_dict = defaultdict(set)
    # "end" events guarantee that the child <tag> elements are fully parsed
    for event, element in ET.iterparse(filename, events=("end",)):
        if element.tag == "node" or element.tag == "way":
            for tag in element.iter("tag"):
                if tag.attrib['k'] == "addr:street":
                    audit_street_type(street_types_dict, tag.attrib['v'])
                    newval = audit_street(tag.attrib['v'])
                    if newval != tag.attrib['v']:
                        print(tag.attrib['v'] + " => " + newval)

    return street_types_dict

types = audit_street_names(OSM_FILE)
#pprint.pprint(dict(types)['street'])


1 Beach Rd. => 1 Beach Road
Main Rd => Main Road
Foam Rd => Foam Road
Solan st => Solan Street
Caledon St => Caledon Street
De Villiers St => De Villiers Street
New Church street => New Church Street
Main Rd => Main Road
Barrack St => Barrack Street
Rhine Road & Eindhoven street => Rhine Road & Eindhoven Street
test stree => test Street
Main Rd => Main Road
Doncaster St => Doncaster Street
Prospect Rd => Prospect Road
Pienaar Rd => Pienaar Road
Bank St => Bank Street
Bank St => Bank Street
Bank St => Bank Street
Bank St => Bank Street
tygerberg street => tygerberg Street


Street names are grouped by their endings in the returned dictionary.
The output above lists the names whose endings were rewritten, including those with **street** endings.
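
One detail worth illustrating is why only the trailing word should be rewritten. The sketch below compares a plain `str.replace` with a replacement limited to the regex match. The reduced mapping and the sample name are assumptions chosen to show the difference.

```python
import re

STREET_TYPE_RE = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# A reduced, illustrative mapping; the project uses the larger one defined above
SAMPLE_MAPPING = {"St": "Street", "Rd": "Road", "Rd.": "Road"}


def fix_suffix(name, mapping=SAMPLE_MAPPING):
    """Replace only the last word of the name when it is a known abbreviation."""
    m = STREET_TYPE_RE.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name


name = "St Georges St"
print(name.replace("St", "Street"))  # also rewrites the leading "St"
print(fix_suffix(name))              # rewrites only the ending
```

With `str.replace`, a name that starts with "St" (as in Saint) would be changed at both ends, while slicing at the match position leaves the beginning untouched.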

<hr />

#### 8.Apply changes
##### 8.1 Change Abbreviations
Now we will change old and wrong abbreviations to the expected ones.

In [None]:
# This function takes the old name, the mapping of abbreviations, and the street type regular expression.
# It tries to match the regex in the old name.
# If found, it updates the name and returns the new name.
# If not found, it returns the old name as it is.
def update_name(name, mapping, regex):
    m = regex.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            name = re.sub(regex, mapping[street_type], name)

    return name

# For each type in the types dictionary, print the old name and the updated name.
# Ex. Foam Rd => Foam Road
for street_type, ways in types.items():
    for name in ways:
        better_name = update_name(name, mapping, street_type_re)
        # compare the strings themselves: "street" and "Street" have the same length
        if better_name != name:
            print(name + " => " + better_name)


Running the cell prints a sample of the changed items after processing them.
<hr/>

##### 8.2 Change Cities Names
Because we selected Cape Town, some smaller nearby cities were included in the extract. So we are going to clean up how city names are written. City names must be capitalized, which means cape Town, cape town, Cape town or cape-town are not accepted. It has to be Cape Town.

I went over the data and built a set of city names to get a list of all the names used. Then I replaced the wrong names with better ones.

In [None]:
# Names in the dataset.
expected_city_names = ['Abbotsdale', 'Athlone', 'Atlantis', 'Belgravia', 'Bellville',
     'Bergvliet', 'Blouberg', 'Blue Downs', 'Bothasig', 'Brackenfell', 'Camps Bay', 'Cape Town',
     'Capetown', 'Capricorn', 'Century City', 'Claremont', 'Constantia', 'De Tijger',
     'Delft Cape Town', 'Diep River', 'Durbanville', 'Epping', 'Filippi',
     'Foreshore, Cape Town', 'Gardens', 'Glencairn Heights', 'Goodwood', 'Gordons Bay',
     'Grassy Park', 'Green Point', 'Hout Bay', 'Hout Bay Harbour', 'Hout Bay Heights Estate',
     'Kalbaskraal', 'Kapstaden', 'Kenilworth', 'Khayelitsha', 'Killarney Gardens, Cape Town',
     'Kommetjie', 'Kuilsriver', 'Kuilsrivier', 'Lansdowne', 'Loevenstein','Maitland',
     'Makahza', 'Manenberg', 'Marina Da Gama', 'Melkbosstrand', 'Mfuleni',
     'Milnerton', 'Milnerton,  Cape Town', 'Mitchells Plain', 'Mowbray', 'Mowbray, Cape Town',
     'Muizenberg', 'Nerina Lane', 'Newlands', 'Noordhoek', 'Noordhoek, Cape Town', 'Nyanga',
     'Observatory', 'Paarden Eiland' ,'Paarl' ,'Parklands' ,'Parow','Philadelphia',
     'Pinelands','Plumstead','Pniel','Pringle Bay','Richwood','Rondebosch','Rondebosch East','Rondebosh East',
     'Rosebank','Salt River','Scarborough','Sea Point','Sea Point, Cape Town',"Simon's Town",
     'Somerset West','Sonnekuil','Steenberg','Stellenbosch','Stellenbosch Farms','Strand',
     'Strandfotein','Suider Paarl','Sybrand Park','Table View','Techno Park','Technopark',
     'Test city','Vredehoek','Vrygrond','Welgelegen 2','Welgemoed','Wellington','Woodstock',
     'Woodstock, Cape Town','Wynberg','Zonnebloem']

# Wrong names with better names.
city_names_mapping = {
         'cape Town' : "Cape Town",
         'cape town' : "Cape Town",
         'cape-town' : "Cape Town",
         'Cape town' : "Cape Town",
         'muizenberg' : "Muizenberg",
         'rylands' : "Rylands"
}




In [None]:
# This function takes a city name.
# If the name is in the expected names, return it. Otherwise, replace it with a better one if we know one.
def audit_city(value):
    if value not in expected_city_names:
        # fall back to the original value to avoid a KeyError for unmapped names
        return city_names_mapping.get(value, value)
    return value

# This function takes a filename and loops over nodes and ways that have an address with a city name.
# For each one, it audits the name and changes it.
def audit_city_name(filename):
    cities_list = []
    # "end" events guarantee that the child <tag> elements are fully parsed
    for event, element in ET.iterparse(filename, events=("end",)):
        if element.tag == "node" or element.tag == "way":
            for tag in element.iter("tag"):
                if tag.attrib['k'] == "addr:city":
                    new_name = audit_city(tag.attrib['v'])
                    if new_name != tag.attrib['v']:
                        print(tag.attrib['v'] + " => " + new_name)
                    tag.attrib['v'] = new_name
                    cities_list.append(tag)
    return cities_list

cities = audit_city_name(OSM_FILE)


Running the cell prints a sample of the changed city names, with the wrong names replaced by better ones. Writing Cape Town as cape town is not accepted, so we went through all items and changed them. This is the second cleaning step in this project.
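
The mapping above has to list every wrong spelling by hand. As a complementary idea, the sketch below normalizes capitalization generically with `string.capwords`. Treating hyphens as spaces is an assumption that fits "cape-town" but may be wrong for genuinely hyphenated names, and the helper name `normalize_city` is hypothetical.

```python
import string


def normalize_city(name):
    """Capitalize each word and collapse repeated whitespace.

    Assumption: a hyphen separates words, as in "cape-town".
    """
    return string.capwords(name.replace("-", " "))


for raw in ["cape town", "cape-town", "Cape town", "simon's town"]:
    print(raw, "=>", normalize_city(raw))
```

`string.capwords` is used instead of `str.title` because `title` would turn "simon's" into "Simon'S". Note that it lowercases the rest of each word, so a name with inner capitals would still need an entry in the manual mapping.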

#### 9.Convert Data to JSON
Our dataset is in XML format. We are going to store the data in MongoDB, but before that we have to convert the dataset from XML to JSON.

To start doing that, we have to set the structure of the object, then start changing items based on it. 

In [None]:
import codecs
import json

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
address_regex = re.compile(r'^addr\:')
street_regex = re.compile(r'^street')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

'''
This function takes an element and checks whether it is a node or a way. If not, it skips it.
It maps the XML element to an object that we can store as JSON.
'''
def convert_element(element):
    node = {}
    if element.tag == "node" or element.tag == "way" :
        node['type'] = element.tag
        # address details of element
        address = {}
        
        # for each attribute in the element, parse it into object's variable
        for attribute in element.attrib:
            if attribute in CREATED:
                if 'created' not in node:
                    node['created'] = {}
                node['created'][attribute] = element.get(attribute)
            elif attribute in ['lat', 'lon']:
                continue
            else:
                node[attribute] = element.get(attribute)
                
        # store position coordinates if the element has lat and lon [GPS details]
        if 'lat' in element.attrib and 'lon' in element.attrib:
            node['pos'] = [float(element.get('lat')), float(element.get('lon'))]

        # for each sub element of the root one
        for sub in element:
            # parse second-level tags for ways and populate `node_refs`
            if sub.tag == 'nd':
                if 'node_refs' not in node:
                    node['node_refs'] = []
                if 'ref' in sub.attrib:
                    node['node_refs'].append(sub.get('ref'))

            # skip the sub if it has no k or v
            if sub.tag != 'tag' or 'k' not in sub.attrib or 'v' not in sub.attrib:
                continue
                
            key = sub.get('k')
            val = sub.get('v')

            # skip the key if it is not well prepared
            if problemchars.search(key):
                continue

            # if it is an address, store it in clean way
            elif address_regex.search(key):
                key = key.replace('addr:', '')
                address[key] = val

            # for others
            else:
                node[key] = val
                
        # clean address and store it in the node
        if len(address) > 0:
            node['address'] = {}
            street_full = None
            street_dict = {}
            street_format = ['prefix', 'name', 'type']
            # for each key in address
            for key in address:
                val = address[key]
                if street_regex.search(key):
                    if key == 'street':
                        street_full = val
                    elif 'street:' in key:
                        street_dict[key.replace('street:', '')] = val
                else:
                    node['address'][key] = val
            # assign street_full or fallback to compile street dict
            if street_full:
                node['address']['street'] = street_full
            elif len(street_dict) > 0:
                # join only the street parts that are present to avoid a KeyError
                node['address']['street'] = ' '.join([street_dict[key] for key in street_format if key in street_dict])
        return node
    else:
        return None
    

def convert_file(filename):
    output = "{0}.json".format(filename)
    data = []
    with codecs.open(output, "w") as fw:
        for event, element in ET.iterparse(filename):
            obj = convert_element(element)
            if obj:
                data.append(obj)
                fw.write(json.dumps(obj) + "\n")
    return data

data_objects = convert_file(OSM_FILE)

In [None]:
# Sample of result
data_objects[10]

The dataset is now converted to JSON format and saved in the local folder.
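
Because `convert_file` writes one JSON object per line, the output file is not a single JSON array and cannot be loaded with one `json.load` call. A minimal sketch for reading it back line by line follows. The function name `read_json_lines` is an assumption.

```python
import json


def read_json_lines(path):
    """Yield one decoded object per non-empty line of a JSON-lines file."""
    with open(path, "r", encoding="utf-8") as fr:
        for line in fr:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical usage with the file produced above:
# for obj in read_json_lines("cape-town_south-africa.osm.json"):
#     print(obj["type"])
```

Reading lazily like this avoids holding the whole dataset in memory a second time.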

<hr/>

#### 10.Store Data in MongoDB
After cleaning the data and storing it in a clean structure, we are going to save it in a local database using MongoDB.

In [None]:
from pymongo import MongoClient

# start connection
client = MongoClient('localhost:27017')
db = client['map']
# start from an empty collection, then add items one by one to the DB
db.map.drop()
for item in data_objects:
    db.map.insert_one(item)

A long process ends with adding all items to the database.
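
Inserting documents one by one means one round trip to the server per document. A possible way to shorten the process is to send the documents in batches. The sketch below assumes a collection object that exposes `insert_many`, as PyMongo collections do; the helper names and the batch size of 1000 are illustrative choices.

```python
def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # the last, possibly shorter, batch


def bulk_insert(collection, items, size=1000):
    """Insert items into the collection in batches and return how many were sent."""
    total = 0
    for batch in batches(items, size):
        collection.insert_many(batch)
        total += len(batch)
    return total


# Hypothetical usage with the objects created above:
# bulk_insert(db.map, data_objects)
```

The best batch size depends on the document size and the server, so it is worth measuring rather than assuming.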

<hr/>

#### 11.Statistical Overview
The last step is to show a statistical overview of the database.
<ul>
    <li>Size of the file</li>
    <li>Total elements</li>
    <li>Number of unique users</li>
    <li>Number of nodes and ways</li>
    <li>Number of chosen type of nodes, like cafes, shops etc.</li>
</ul>

#### Size of the file
Which file do you think is bigger, the old one or the new one?

In [None]:
# get the old file bytes and convert it to size
old_size = convert_bytes_to_size(os.path.getsize(OSM_FILE))

# get the new file bytes and convert it to size
new_size = convert_bytes_to_size(os.path.getsize(OSM_FILE + ".json"))


print ("The old file size is: {}.".format(old_size))
print ("The new file size is: {}.".format(new_size))

#### Total elements
Here is how we can count the number of elements in the database.

In [None]:
# total elements in DB (count_documents replaces the deprecated cursor count())
db.map.count_documents({})

#### Number of unique users
Now we are going to calculate the number of users who participated in the map.

In [None]:
# total number of users
unique_users = len(db.map.distinct('created.user'))
print ("Total users who participated in the map is: {} users.".format(unique_users))

#### Number of nodes and ways
We are going to count the number of nodes and ways in the database.

In [None]:
# count items where type = way
total_ways = db.map.count_documents({'type':'way'})

# count items where type = node
total_nodes = db.map.count_documents({'type':'node'})

print ("Total ways in the map is: {} ways.".format(total_ways))
print ("Total nodes in the map is: {} nodes.".format(total_nodes))


#### Number of chosen type of nodes, like cafes, shops etc.
We are going to see together which node types are the most popular in the selected area.

In [None]:
amenities = db.map.aggregate([{"$match":{"amenity":{"$exists":1}}}, {"$group":{"_id":"$amenity",
    "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":10}])

pprint.pprint(list(amenities))
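
The pipeline above runs four stages in order: `$match` keeps documents that have an `amenity` field, `$group` counts them per amenity value, `$sort` orders the groups by count, and `$limit` keeps the top ten. To illustrate the logic without a database, here is a rough pure-Python equivalent. The sample documents are made up and `top_values` is a hypothetical helper.

```python
from collections import Counter


def top_values(documents, field, limit=10):
    """Mimic $match (field exists), $group (count per value), $sort and $limit."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return [{"_id": value, "count": count}
            for value, count in counts.most_common(limit)]


# Hypothetical documents shaped like the converted nodes
sample_docs = [
    {"type": "node", "amenity": "cafe"},
    {"type": "node", "amenity": "cafe"},
    {"type": "node", "amenity": "cafe"},
    {"type": "node", "amenity": "restaurant"},
    {"type": "node", "amenity": "restaurant"},
    {"type": "node", "amenity": "school"},
    {"type": "way"},  # no amenity, so it is filtered out
]

print(top_values(sample_docs, "amenity", limit=2))
```

For the full dataset the aggregation should stay in MongoDB. This version is only meant to show what each stage contributes.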


#### Top 5 religions
We are going to investigate **place_of_worship** entries to see the top religions.

In [None]:
# match places of worship, group them by religion, then keep the 5 most common
religions = db.map.aggregate([{"$match":{"amenity":"place_of_worship"}},
                 {"$group":{"_id":"$religion", "count":{"$sum":1}}},{"$sort":{"count":-1}}, {"$limit":5}])
list(religions)

#### Top 5 food types
We are going to check the top 5 food types [restaurant cuisine].

In [None]:
# match restaurants, group them by cuisine, then keep the 5 most common
places = db.map.aggregate([{"$match":{"amenity":"restaurant"}},
                           {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},
                           {"$sort":{"count":-1}}, {"$limit":5}])
list(places)

#### Total Nando's
Nando's is one of the most famous restaurants in South Africa. Here we are going to count how many of them exist in Cape Town.

In [None]:
nandos_count = db.map.count_documents({"$or":[ {"name": "Nando's"}, {"name": "Nandos"}]})

print("Total number of Nando's restaurants: {}".format(nandos_count))


#### Total number of Restaurants and Cafes
For a city to be a better place for visitors, it has to contain as many restaurants and cafes as possible.


In [None]:
eating_places_count = db.map.count_documents({"$or":[ {"amenity": "cafe"}, {"amenity": "restaurant"}]})
print("Number of restaurants and cafes: {}".format(eating_places_count))


<hr/>

## 4.Conclusion
In conclusion, OpenStreetMap is an amazing project. It is going to help the community and improve many applications because it is an open source project. I like the way people participate in the project and help each other.

I suggest that the site managers help people develop automated bots to correct the data via the Google Maps API. This would allow OpenStreetMap to draw on Google Maps data, and Google Maps would also benefit from the community. Using Google Maps with OpenStreetMap, via the Google Places API for example, would help OpenStreetMap improve place details.

**Python Google Places** is a library to work with the API. You can find it here:

https://github.com/slimkrazy/python-google-places

This will allow us to search for a location and fill in empty details from it.

In [None]:
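# This is a hypothetical sketch of how place details could be filled in using an API.
# `lookup` stands in for an API call that takes a position and returns a dict of details.
def fill_missing_details(places, lookup, fields=('phone', 'logo')):
    for place in places:
        details = None
        for field in fields:
            # only fill attributes that are empty, never overwrite existing data
            if place.get(field) is None:
                if details is None:
                    details = lookup(place['pos'])
                place[field] = details.get(field)
    return places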

This solution works well, but we cannot be 100% sure that the data is correct. The old data may be better than the new data, so we are going to fill only empty attributes.


Another improvement could be to run competitions to improve the data, either manually or with automated tools. This may turn out to be a bad idea because many people may add spam content to win.


For the dataset structure, I think adding a logo attribute to places is a good idea, as well as adding images of the place.

*Refs:*
<ul>
    <li>Wikipedia</li>
    <li>Udacity</li>
    <li>Stack Overflow</li>
</ul>