# Index

* Introduction
    * MongoDB or SQL?
    * Map Area
    * Why This Area?
* Study on OpenStreetMap Wiki
    * Elements
        * Node
        * Ways
        * Relation
        * Tag
* Questions
* Data Cleaning
    * Using an algorithm to get a sample
    * Auditing the data
        * Insert the data into local MongoDB Database
    * Convert .osm to .json
* Data Overview
* Conclusion
* Reference

# Introduction

## MongoDB or SQL?

After reading [Should-I-learn-SQL-or-NoSQL-MySQL-or-MongoDB-And-why](https://www.quora.com/Should-I-learn-SQL-or-NoSQL-MySQL-or-MongoDB-And-why) on Quora and weighing my own situation, I chose **MongoDB to wrangle the OpenStreetMap data.**

**Why MongoDB?**

I had already learned SQL in junior high school, and Udacity suggested that MongoDB is worth learning.

So I chose MongoDB to complete my project.

## Map Area

[Miami, Florida, United States of America](https://mapzen.com/data/metro-extracts/metro/miami_florida/)

**Why this area?**

Miami is a great place, and I want to visit it.

# Question 

# Data Cleaning
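## Using this algorithm to get a sample

### Set **k=10**

The script keeps every k-th top-level element (node, way, or relation) of the full extract, so with k = 10 the sample holds roughly one tenth of the elements.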

In [2]:
import xml.etree.ElementTree as ET

OSM_FILE = "miami_florida.osm"
SAMPLE_FILE = "miami_example.osm"

k = 10

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

get_element(OSM_FILE,tags=('node', 'way', 'relation'))

<generator object get_element at 0x0000000004530948>
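
The cell above targets Python 2, where a `str` can be written to a file opened in binary mode. Below is a minimal sketch of the same every-k-th sampling for Python 3, using only the standard library and writing byte strings instead. `write_sample` is a hypothetical helper name that does not appear in the original notebook.

```python
import xml.etree.ElementTree as ET


def get_element(osm_file, tags=("node", "way", "relation")):
    """Yield each top-level element whose tag is in `tags`."""
    context = iter(ET.iterparse(osm_file, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # free elements that were already processed


def write_sample(osm_file, sample_file, k):
    """Write every k-th top-level element of osm_file to sample_file."""
    with open(sample_file, "wb") as output:
        output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write(b"<osm>\n  ")
        for i, element in enumerate(get_element(osm_file)):
            if i % k == 0:
                # ET.tostring returns bytes when an encoding is given
                output.write(ET.tostring(element, encoding="utf-8"))
        output.write(b"</osm>")


# Usage (file names as in the cell above):
# write_sample("miami_florida.osm", "miami_example.osm", 10)
```
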

With the sample file written, the project can begin.

## Auditing the data

Load libraries

In [3]:
import os
import xml.etree.cElementTree as cET
from collections import defaultdict
import pprint
import re
import codecs
import json
import string
from pymongo import MongoClient

Set up map file path

In [4]:
# set up map file path
filename = "miami_example.osm" # osm filename
path = r"E:\CS\Data Science\P3" # directory containing the osm file (raw string keeps the backslashes literal)

miamiOSM = os.path.join(path, filename)

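Following the Python `re` documentation, I compile three patterns for tag keys: `lower` (lowercase letters and underscores only), `lower_colon` (the same, with one colon), and `problemchars` (keys that contain problematic characters). A fourth pattern, `street_type_re`, captures the last word of a street name.

I also set up an initial list of expected street types.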
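
As a quick illustration of how these patterns behave, the sketch below repeats the definitions from the next cell so that it runs on its own. `key_type` and `street_type` are hypothetical helper names. The street names are taken from the audit output shown later, and the tag keys are made-up examples.

```python
import re

# Same patterns as in the notebook cell.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def key_type(key):
    """Classify a tag key by the first pattern that matches it."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"


def street_type(name):
    """Return the trailing token of a street name, or None if there is none."""
    m = street_type_re.search(name)
    return m.group() if m else None


for key in ["amenity", "addr:street", "addr street", "FIXME"]:
    print(key, "->", key_type(key))

for name in ["Bedford Ave", "State Road 9", "Commercial Blvd #101"]:
    print(name, "->", street_type(name))
```

Note that the trailing token is not always a street type: for "State Road 9" it is "9", which is why such names show up in the audit.
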

In [5]:
# some regular expression
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# initial version of expected street names
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Square", "Lane", "Road", "Trail", "Parkway", "Commons","River"]

First, audit whether each street type is one we expect. The function below looks at a street name and, if its street type is not in the expected list, records the name.

Unexpected street names are collected in a set, keyed by street type.

In [6]:
# Look at the street names, print out all the street names that is with
# a unexpected street type
def audit_street_type(street_types, street_name):
    # add unexpected street name to a list
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

To audit the street data, we need to determine whether an element is a street name.

In [7]:
def is_street_name(elem):
    # determine whether a element is a street name
    return (elem.attrib['k'] == "addr:street")

def audit_street(osmfile):
    # iter through all street name tag under node or way and audit
    # the street name value
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in cET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types

Now we begin the audit.

In [8]:
st_types = audit_street(miamiOSM)

With the audit done, print the unexpected street names found in the map.

In [9]:
# print out unexpected street names
pprint.pprint(dict(st_types))

{'101': set(['Commercial Blvd #101']),
 '337': set(['NW 53rd Street, Suite 337']),
 '9': set(['State Road 9']),
 'Atlanta': set(['Atlanta']),
 'Ave': set(['Bedford Ave',
             'Corporate Ave',
             'E Gardenia Ave',
             'E Seneca Ave',
             'E Whitewater Ave',
             'Granada Ave',
             'Islewood Ave',
             'N Gardenia Ave',
             'NW 73rd Ave',
             'Park Ave',
             'Rainbow Ave',
             'S Gardenia Ave',
             'S Whitewater Ave',
             'SE 1st Ave',
             'SW 148th Ave',
             'SW 167th Ave',
             'W Gardenia Ave',
             'W Whitewater Ave',
             'Westgate Ave']),
 'Birkdale': set(['Birkdale']),
 'Blvd': set(['Blatt Blvd',
              'Bonaventure Blvd',
              'Corporate Lakes Blvd',
              'Crandon Blvd',
              'Falls Blvd',
              'Forest Hill Blvd',
              'Gables Blvd',
              'Indian Trace Blvd',
      

Based on the audit results, I came up with the mapping dictionary below, which addresses the abbreviations and the incorrect names.

I also created a function that corrects street names with it.

So here are the **problems encountered in the map**:

1. The data in the .osm file has over-abbreviated street names, such as "S Tryon St Ste 105".
* The data in the .osm file does not use a uniform format: both 'Ave' and 'Ave.' appear, differing only by the trailing period.

To solve these problems, I decided to create a dictionary for correcting street names.

In [10]:
# creating a dictionary for correcting street names
mapping = { "Ct": "Court",
            "STREET": "Street",
            "street": "Street",
            "St": "Street",
            "St.": "Street",
            "St,": "Street",
            "Trce": "Trace",
            "Ter": "Terrace",
            "Cir": "Circle",
            "Pkwy": "Parkway",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "road.": "Road",
            "Riad": "Road",
            "Ride": "Road",
            "Rd": "Road",
            "Hwy": "Highway",
            "HIghway": "Highway",
            "Pl": "Place",
            "place": "Place",
            "Sq.": "Square"}

# function that corrects incorrect street names
def update_name(name, mapping):
    # Replace only the trailing street type. A plain substring replace
    # would also rewrite correct names, e.g. "St" inside "Street".
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name
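
One way to apply a mapping like this is to rewrite only the trailing street type, so that names that are already spelled out are not changed by accident. The sketch below is a self-contained illustration under that assumption. `replace_street_type` and `sample_mapping` are hypothetical names, the mapping is a shortened copy, and the street names come from the audit output above.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Shortened copy of the mapping, for illustration only.
sample_mapping = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Rd": "Road"}


def replace_street_type(name, mapping):
    """Expand the last token of `name` if it is a known abbreviation."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name


# A trailing abbreviation is expanded.
print(replace_street_type("Bedford Ave", sample_mapping))

# A name whose last token is not in the mapping is left unchanged.
print(replace_street_type("NW 53rd Street, Suite 337", sample_mapping))

# For comparison, a plain substring replace damages the same name,
# because "St" also occurs inside "Street".
print("NW 53rd Street, Suite 337".replace("St", "Street"))
```

The trade-off is that abbreviations in the middle of a name are not expanded. Handling those would need token-by-token matching.
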

Audit ZIP codes

The ZIP codes still contain errors. According to the [Miami Zip Code](http://www.zipmap.net/Florida/Miami-Dade_County/Z_Downtown.htm) map, ZIP codes in Miami should start with 331, so the audit collects every code that does not.

![](zipcode.jpg)

In [11]:
def audit_zipcodes(osmfile):
    # iter through all zip codes, collect all the zip codes that does not
    # start with 331
    osm_file = open(osmfile, "r")
    zip_codes = {}
    for event, elem in cET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "addr:postcode" and not tag.attrib['v'].startswith('331'):
                    if tag.attrib['v'] not in zip_codes:
                        zip_codes[tag.attrib['v']] = 1
                    else:
                        zip_codes[tag.attrib['v']] += 1
    return zip_codes

zipcodes = audit_zipcodes(miamiOSM)

for zipcode in zipcodes:
    print zipcode, zipcodes[zipcode]

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

33027 2
33030 58
33496 2
33498 1
33054 57
33321 1
33326 639
33327 665
33324 1
33470 1
33406-1507 1
33014 25
33015 64
33016 74
33010 114
33012 56
33013 27
3301811 1
33411 2
33018 44
33055 5
33056 13
33311 5
33487-3502 1
33314 1
33313 4
33409 1
33406 2
33034 32
33035 9
33032 24
33441 1
33401 1
33031 3
33351 2
33067-3176 1
33331 289
33442 2
33461-3021 1
33332 202
33460 1
33462 1
33410 3
33404 3
33309 2
11890 1
33064-1324 1
33388 1
33063 1
33413 1
33065 3
33064 1
33067 4
33069 2
33068 1
33033 44
33432 3
33301 2
33009 1
33431 1
33304 9
33319 3
33308 5
33024 1
33412 1
33026 2
33414 10
33020 1
33004 2
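
The audit output above includes ZIP+4 values such as `33406-1507` and malformed values such as `3301811`. A possible cleaning step, which is not part of the original notebook, is to keep only the five-digit part and reject everything else. In this sketch, `clean_zipcode` and its `prefix` parameter are hypothetical, and the default prefix `"33"` is an assumption based on the codes listed above.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix.
zip_re = re.compile(r'^(\d{5})(?:-\d{4})?$')


def clean_zipcode(value, prefix="33"):
    """Return the 5-digit ZIP code, or None if the value looks wrong."""
    m = zip_re.match(value.strip())
    if m and m.group(1).startswith(prefix):
        return m.group(1)
    return None


for raw in ["33326", "33406-1507", "3301811", "11890"]:
    print(raw, "->", clean_zipcode(raw))
```

Values that come back as `None` could be dropped from the address or reviewed by hand before the data is loaded.
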


Process the node and way tags.

In [12]:
def shape_element(element):
    node = {}
    node["created"]={}
    node["address"]={}
    node["pos"]=[]
    refs=[]

    # we only process the node and way tags
    if element.tag == "node" or element.tag == "way" :
        if "id" in element.attrib:
            node["id"]=element.attrib["id"]
        node["type"]=element.tag

        if "visible" in element.attrib.keys():
            node["visible"]=element.attrib["visible"]

        # the key-value pairs with attributes in the CREATED list are
        # added under key "created"
        for elem in CREATED:
            if elem in element.attrib:
                node["created"][elem]=element.attrib[elem]

        # attributes for latitude and longitude are added to a "pos" array
        # include latitude value
        if "lat" in element.attrib:
            node["pos"].append(float(element.attrib["lat"]))
        # include longitude value
        if "lon" in element.attrib:
            node["pos"].append(float(element.attrib["lon"]))


        for tag in element.iter("tag"):
            if not(problemchars.search(tag.attrib['k'])):
                if tag.attrib['k'] == "addr:housenumber":
                    node["address"]["housenumber"]=tag.attrib['v']

                if tag.attrib['k'] == "addr:postcode":
                    node["address"]["postcode"]=tag.attrib['v']

                # handling the street attribute, update incorrect names using
                # the strategy developed before
                if tag.attrib['k'] == "addr:street":
                    node["address"]["street"]=tag.attrib['v']
                    node["address"]["street"] = update_name(node["address"]["street"], mapping)

                if tag.attrib['k'].find("addr")==-1:
                    node[tag.attrib['k']]=tag.attrib['v']

        for nd in element.iter("nd"):
            refs.append(nd.attrib["ref"])

        if node["address"] =={}:
            node.pop("address", None)

        if refs != []:
            node["node_refs"]=refs

        return node
    else:
        return None
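
To make the target document shape concrete, here is a simplified, self-contained sketch. `shape_node` is a hypothetical stand-in for `shape_element`: it handles nodes only, accepts any `addr:*` tag, and skips the `problemchars` check and the street name correction. The sample node is made up for illustration and is not taken from the Miami extract.

```python
import json
import xml.etree.ElementTree as ET

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

# Hypothetical node, invented for this example.
SAMPLE_XML = """
<node id="1" lat="25.77" lon="-80.19" version="2" changeset="10"
      timestamp="2016-01-01T00:00:00Z" user="example_user" uid="42">
  <tag k="addr:housenumber" v="100"/>
  <tag k="addr:street" v="Example Avenue"/>
  <tag k="amenity" v="restaurant"/>
</node>
"""


def shape_node(element):
    """Turn a <node> element into a nested dictionary."""
    doc = {"id": element.attrib["id"], "type": element.tag, "created": {}}
    for key in CREATED:
        if key in element.attrib:
            doc["created"][key] = element.attrib[key]
    doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]

    address = {}
    for tag in element.iter("tag"):
        k, v = tag.attrib["k"], tag.attrib["v"]
        if k.startswith("addr:"):
            address[k[len("addr:"):]] = v  # group address parts together
        else:
            doc[k] = v
    if address:
        doc["address"] = address
    return doc


doc = shape_node(ET.fromstring(SAMPLE_XML.strip()))
print(json.dumps(doc, indent=2, sort_keys=True))
```
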

Process the OpenStreetMap XML file, write a JSON output file, and return a list of dictionaries.

In [13]:
def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in cET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

Convert the .osm file to .json

In [14]:
# process the file
data = process_map(miamiOSM, True)

### Insert the data into local MongoDB Database

In [15]:
# Insert the data into local MongoDB Database
client = MongoClient()
db = client.miamiOSM
collection = db.miamiMap
collection.insert_many(data)  # insert() is deprecated in PyMongo 3

collection



Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'miamiOSM'), u'miamiMap')

![](微信截图_20161208150702.png)

# Data Overview 

## From the file, we obtained the following information:

1. Size of the original XML file
* Size of the processed JSON file
* The number of documents
* The number of unique users
* The number of nodes
* The number of ways
* The number of methods used to create data entries
* Most popular cuisines
* Universities

In [16]:
# size of the original xml file
os.path.getsize(miamiOSM)/1024/1024

55L

In [17]:
# size of the processed json file
os.path.getsize(os.path.join(path, "miami_florida.osm.json"))/1024/1024

822L

In [18]:
# The Number of documents
collection.find().count()

389097

In [19]:
# The Number of Nodes
collection.find({"type":"node"}).count()

345418

In [20]:
# The Number of Ways
collection.find({"type":"way"}).count()

43555

In [21]:
# The Number of Methods Used to Create Data Entry

pipeline = [{"$group":{"_id": "$created_by","count": {"$sum": 1}}}]
result = collection.aggregate(pipeline)
for doc in result:
    pprint.pprint(doc)

{u'_id': u'Potlatch 0.10e', u'count': 2}
{u'_id': None, u'count': 389068}
{u'_id': u'JOSM', u'count': 15}
{u'_id': u'Potlatch 0.8a', u'count': 3}
{u'_id': u'Potlatch 0.8b', u'count': 4}
{u'_id': u'Potlatch 0.10f', u'count': 5}


In [22]:
# Proportions of top users' contributions
pipeline = [{"$group":{"_id": "$created.user",
                       "count": {"$sum": 1}}},
            {"$project": {"proportion": {"$divide" :["$count",collection.find().count()]}}},
            {"$sort": {"proportion": -1}},
            {"$limit": 3}]
result = collection.aggregate(pipeline)
for doc in result:
    pprint.pprint(doc)

{u'_id': u'MiamiBuildingsImport', u'proportion': 0.3089640886462759}
{u'_id': u'grouper', u'proportion': 0.11397929051110649}
{u'_id': u'woodpeck_fixbot', u'proportion': 0.08299729887405968}


In [23]:
# Most popular cuisines
pipeline = [{"$match":{"amenity":"restaurant", "cuisine":{"$exists":1}}},
            {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":10}]
result = collection.aggregate(pipeline)
for doc in result:
    pprint.pprint(doc)

{u'_id': u'pizza', u'count': 12}
{u'_id': u'italian', u'count': 5}
{u'_id': u'spanish', u'count': 5}
{u'_id': u'mexican', u'count': 5}
{u'_id': u'american', u'count': 3}
{u'_id': u'chinese', u'count': 2}
{u'_id': u'sushi', u'count': 2}
{u'_id': u'greek', u'count': 2}
{u'_id': u'barbecue', u'count': 2}
{u'_id': u'sandwich', u'count': 1}


In [24]:
#  Universities
pipeline = [{"$match":{"amenity": "university", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 2, u'_id': u'Northwood University - West Palm Beach Campus'}
{u'count': 2, u'_id': u'University of Phoenix'}
{u'count': 1, u'_id': u'Florida International University - Biscayne Bay Campus'}
{u'count': 1, u'_id': u'DeVry University'}


# Additional ideas

### Amenities that appear least often

In [25]:
# NOTE: {"count": 1} sorts in ascending order, so this returns ten of the
# rarest amenities; {"count": -1} would return the ten most common ones.
Top10_appearing=collection.aggregate([{"$match":{"amenity":{"$exists":1}}},
{"$group":{"_id":"$amenity",
"count":{"$sum":1}}}, {"$sort":{"count": 1}}, {"$limit":10}])
for doc in Top10_appearing:
    print(doc)

{u'count': 1, u'_id': u'parking_space'}
{u'count': 1, u'_id': u'community_centre'}
{u'count': 1, u'_id': u'childcare'}
{u'count': 1, u'_id': u'weighbridge'}
{u'count': 1, u'_id': u'car_wash'}
{u'count': 1, u'_id': u'ice_cream'}
{u'count': 1, u'_id': u'nightclub'}
{u'count': 1, u'_id': u'Optical'}
{u'count': 1, u'_id': u'food_court'}
{u'count': 1, u'_id': u'ferry_terminal'}


Because the pipeline sorts by count in ascending order, these are ten amenities that each appear only once in the data, not the ten most common ones.

Even so, the list gives a small picture of life in Miami.

It contains a childcare facility and a nightclub, which hints at working parents who rely on childcare and at a city with a night life. A single occurrence of each is too little to say how strong those industries are.

Car wash and parking space are both places related to cars.

This suggests that the car plays a notable role in Miami, although more data would be needed to confirm it.
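
In a MongoDB `$sort` stage, `1` means ascending and `-1` means descending, so the pipeline above returns the least frequent amenities first. A minimal sketch of a helper that builds the descending version is shown below. `top_values_pipeline` is a hypothetical name, and the code only constructs the pipeline, so it runs without a database connection.

```python
def top_values_pipeline(field, match=None, limit=10):
    """Build a pipeline that counts documents per value of `field`,
    most common value first."""
    match_stage = {field: {"$exists": True}}
    if match:
        match_stage.update(match)  # extra filter conditions, if any
    return [
        {"$match": match_stage},
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},  # -1 = descending, largest count first
        {"$limit": limit},
    ]


pipeline = top_values_pipeline("amenity")
print(pipeline)

# With a live connection (not run here):
# for doc in collection.aggregate(pipeline):
#     print(doc)
```

The same helper could be reused for other fields, for example `top_values_pipeline("cuisine", {"amenity": "restaurant"})`.
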

### Religion of places of worship

In [29]:
# NOTE: {"count": 1} sorts in ascending order, so the rarest values come first.
Biggest_religion=collection.aggregate([{"$match":{"amenity":"place_of_worship"}},
{"$group":{"_id":"$religion", "count":{"$sum":1}}},
{"$sort":{"count": 1}}, {"$limit":3}])

for doc in Biggest_religion:
    print(doc)

{u'count': 1, u'_id': u'muslim'}
{u'count': 2, u'_id': u'jewish'}
{u'count': 10, u'_id': None}


The pipeline sorts in ascending order and keeps three rows, so the output shows the three least frequent values: one place of worship tagged as Muslim, two tagged as Jewish, and ten with no religion tag. Any more frequent value is cut off by the `$limit` stage.

This result therefore does not show that Islam is the biggest religion in Miami. Among the rows shown, it is the rarest tag.

With so few places of worship in the output, and ten of them carrying no religion tag, the data cannot support conclusions about Miami's religious makeup or about social or political tension between religious groups. Answering that question would need the full extract, a descending sort, and better religion tagging.

### Most popular cuisines

In [31]:
# Most popular cuisines
pipeline = [{"$match":{"amenity":"restaurant", "cuisine":{"$exists":1}}},
            {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":10}]
result = collection.aggregate(pipeline)
for doc in result:
    pprint.pprint(doc)

{u'_id': u'pizza', u'count': 12}
{u'_id': u'italian', u'count': 5}
{u'_id': u'spanish', u'count': 5}
{u'_id': u'mexican', u'count': 5}
{u'_id': u'american', u'count': 3}
{u'_id': u'chinese', u'count': 2}
{u'_id': u'sushi', u'count': 2}
{u'_id': u'greek', u'count': 2}
{u'_id': u'barbecue', u'count': 2}
{u'_id': u'sandwich', u'count': 1}


According to [Wikipedia, Miami](https://en.wikipedia.org/wiki/Miami), Miami is an international city with strong ties to other nations.

So people there have a strong demand for cuisines from all over the world.

Pizza and Italian food are closely related, so the counts suggest that people there like Italian food most. Pizza is convenient and delicious: white-collar workers tend to order it to the office when they are too busy to go out for lunch or dinner, and it is easy to share with friends.

# Conclusion

After this review of the data, it is obvious that the data for this area is incomplete, though I believe it has been well cleaned for the purposes of this exercise. It interests me that a fair amount of GPS data makes it into OpenStreetMap.org through users' efforts, whether by scripting a map-editing bot or otherwise. With a rough GPS data processor working together with a more robust data processor similar to data.py, I think it would be possible to contribute a great amount of cleaned data to OpenStreetMap.org.

Because I do not have full knowledge of locations in Miami, I only corrected the problems that were obvious.

## Suggestions

1. This dataset has many mistakes, such as inconsistent ZIP code formats.
* The data is not detailed enough, so we can't draw conclusions about the demand for particular cuisines and foods.
* The data hints that cars play a notable role in the lives of people in Miami. To learn more about this, I would need more data to answer these questions:
    1. How many hours does a typical Miami resident spend in a car?
    * Where is the most popular place that people visit by car?
    * How many miles does a typical Miami resident drive each day?
    * Is there any relationship between culture and the cuisines that people like?
    * Since the cuisines the public likes can change all the time, is there any relationship between cuisine and culture, education, or social status? Fashion often starts with the upper class, so does the group that currently dominates the upper class shape the public's choices?
    * Why do people like to spend time in their cars? Is there any reliable replacement?
    * People choose different food at different times of day. As a Chinese person, I would be less likely to choose a hamburger for breakfast because I think it is unsuitable or unhealthy. So can we quantify cuisine by time of day, and answer the question of what food is in demand at different times?

Once we have data on the miles people drive and the most popular places they visit by car, we would be able to offer a car service that runs only at those times and to those places. The potential profit of such a service could be considerable, while the cost would be comparatively low.

Once we have data on the exact demand for cuisines and foods, we could produce a map of food demand and sell it to merchants who are hesitating over whether a location is suitable for a food business. Such a business would then be backed by solid data.

## What potential issues could you see that may arise from the implementation of the solution?

Here are the questions that would need to be solved.

1. How can we get the data and make sure it is of good quality?
* Some people might choose a cuisine at random, which would add a lot of noise to our data.
* The best cuisine takes more than delicious food; location also affects the choice. Sometimes people find it hard to choose a restaurant, so they just pick one nearby to save time. Pizza is a likely example, because it can be delivered to the office.
* On particular holidays such as Christmas or Thanksgiving, people are more likely to stay at home, so those days could also affect the reliability of our conclusions.

# Reference

1. MongoDB: The Definitive Guide
* SciPy Reference Guide, Release 0.18.1
* [Udacity project details](https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075463/lessons/3168208620239847/concepts/77135319070923)
* [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/OSM_XML)
* [Inserting newlines in XML file generated via xml.etree.ElementTree in Python](http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python)
* [PyMongo documentation](https://api.mongodb.com/python/current/)
* [GitHub](https://github.com/ryancheunggit/Udacity/blob/master/P3/code.py)
* [Python regular expressions](http://www.runoob.com/python/python-reg-expressions.html)