# P3: Wrangle OpenStreetMap Data

## For this project, we will look at OpenStreetMap data for Culver City, California, where the city motto is "The Heart of Screenland". I expect the Culver City dataset to be rich with amenities.
 
## We start by getting familiar with the data. [Here](https://mapzen.com/data/metro-extracts/) is a link to the metro-extracts page where extracts are available for download. As mentioned above, I chose to extract the data for Culver City.

## <font color = red> This project applies data wrangling and munging techniques to an OSM XML extract with pymongo </font>

## After the zip file is downloaded and extracted, we see the OSM file is 604M in size
> $ ls -sh culver-city_ca.osm 
<br/> 604M culver-city_ca.osm

## The next few cells show how to programmatically clean street names and audit zip codes. To clean street names, we create a list of expected street types and flag any `addr:street` value whose last word is not in that list.

In [44]:
# Hello pymongo
import os
import xml.etree.cElementTree as cET
from collections import defaultdict
import pprint
import re
import codecs
import json
import string
from pymongo import MongoClient

# create filename and path
filename = "culver-city_ca.osm"
path = "/Users/JasonMedina/Downloads"
culverCityOSM = os.path.join(path, filename)

# common regex
lower = re.compile(r'^([a-z]|_)*$') 
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# valid ending street names
expected = ["Street", "Avenue", "Boulevard", "Drive", "Way", "Court", 
            "Place", "Circle", "Lane", "Road", "Trail", "Parkway"]

## This script iterates through node and way elements, looks for the street value, and prints each unexpected street type together with the set of street names that use it.

In [45]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit_street(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    # "end" events fire after an element's children are parsed, so its tags are complete
    for event, elem in cET.iterparse(osm_file, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types

st_types = audit_street(culverCityOSM)
pprint.pprint(dict(st_types))

{'200': set(['North Camden Drive, Suite 200']),
 u'246-0756': set([u'Greenleaf Gourmet Chopshopmore info\u200e9671 Wilshire BoulevardBeverly Hills, CA 90212  (310) 246-0756']),
 u'276-1562': set([u'Subwaymore info\u200e9673 Wilshire BoulevardBeverly Hills, CA 90212 (310) 276-1562']),
 u'308': set([u'499 North Ca\xf1on Drive, Suite 308']),
 '3190': set(['W Jefferson Blvd Suite 3190']),
 u'777-5877': set([u'Barney Greengrass Restaunt, \u200e9570 Wilshire Boulevard Beverly Hills, CA 90212 (310) 777-5877']),
 u'858-1383': set([u"Capriotti'smore info\u200e9683 Wilshire BoulevardBeverly Hills, CA 90212(310) 858-1383"]),
 'A': set(['North Roxbury Drive Penthouse Suite A']),
 'Ave': set(['Centinela Ave',
             'Glencoe Ave',
             'Le Conte Ave',
             'Ohio Ave',
             'Olive Ave',
             'Pacific Ave',
             'S Centinela Ave',
             'South Fairfax Ave',
             'Watseka Ave']),
 'Ave,': set(['Marlton Ave,']),
 'Bl.': set(['National Bl.']),
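
To make the audit loop easier to follow, here is a minimal, self-contained sketch that runs the same idea on a tiny in-memory XML snippet. The sample data and the name `audit_sample` are made up for illustration. It uses `xml.etree.ElementTree` and listens for `end` events, on the assumption that we want each element's child tags fully parsed before inspecting them.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample data, not taken from the Culver City extract.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="34.0" lon="-118.4">
    <tag k="addr:street" v="Centinela Ave"/>
  </node>
  <way id="2">
    <nd ref="1"/>
    <tag k="addr:street" v="Venice Boulevard"/>
  </way>
  <node id="3" lat="34.1" lon="-118.3">
    <tag k="addr:street" v="National Bl."/>
  </node>
</osm>"""

EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Way", "Court",
            "Place", "Circle", "Lane", "Road", "Trail", "Parkway"}
STREET_TYPE_RE = re.compile(r'\b\S+\.?$')


def audit_sample(xml_bytes):
    """Group street names by their unexpected trailing street type."""
    unexpected = defaultdict(set)
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag in ("node", "way"):
            for tag in elem.iter("tag"):
                if tag.get("k") == "addr:street":
                    match = STREET_TYPE_RE.search(tag.get("v"))
                    if match and match.group() not in EXPECTED:
                        unexpected[match.group()].add(tag.get("v"))
            # Free the element's children once it has been inspected.
            elem.clear()
    return dict(unexpected)


result = audit_sample(SAMPLE_OSM)
print(result)
```

Clearing each element after it is processed is one way to keep memory use down on a file of several hundred megabytes.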

## A few sets need cleaning, mostly abbreviations of Avenue. Most of the data looks good; nonetheless, we can create a mapping to fix the incorrect street names.

In [46]:
# Dictionary to fix street names
mapping = { "Ave": "Avenue",
            "Ave,": "Avenue",
            "Bl.": "Boulevard",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Bvd": "Boulevard",
            "Dr": "Drive",
            "Ln": "Lane",
            "St": "Street",
            "ave": "Avenue",
            "avenue": "Avenue"}
           
# update street names
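def update_name(name, mapping):
    # Replace only the trailing street-type token. A plain substring
    # replace would also rewrite "Ave" inside "Avenue" or "St" inside "Stewart".
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name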
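
As a quick check of the mapping idea, the sketch below applies a suffix-only replacement to a few names from the audit output. The function name `fix_street_suffix` is hypothetical, and the regex and a subset of the mapping are repeated so the example runs on its own. The point it illustrates is that anchoring the replacement to the last token leaves already-correct names such as "Washington Avenue" alone.

```python
import re

STREET_TYPE_RE = re.compile(r'\b\S+\.?$')

# Subset of the mapping above, repeated so the example is self-contained.
SUFFIX_MAPPING = {
    "Ave": "Avenue",
    "Ave,": "Avenue",
    "Bl.": "Boulevard",
    "Blvd": "Boulevard",
    "St": "Street",
}


def fix_street_suffix(name, mapping=SUFFIX_MAPPING):
    """Replace the trailing street type only if it is a known abbreviation."""
    match = STREET_TYPE_RE.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name


samples = ["Centinela Ave", "Marlton Ave,", "National Bl.", "Washington Avenue"]
fixed = [fix_street_suffix(s) for s in samples]
for before, after in zip(samples, fixed):
    print("{0!r} -> {1!r}".format(before, after))
```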

## Let's continue to audit the data by looking at the zip codes to identify any potential issues.

In [47]:
def audit_zipcodes(osmfile):
    osm_file = open(osmfile, "r")
    zip_codes = {}
    # "end" events fire after an element's children are parsed, so its tags are complete
    for event, elem in cET.iterparse(osm_file, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "addr:postcode" and not tag.attrib['v'].startswith('90'):
                    if tag.attrib['v'] not in zip_codes:
                        zip_codes[tag.attrib['v']] = 1
                    else:
                        zip_codes[tag.attrib['v']] += 1
    return zip_codes

zipcodes = audit_zipcodes(culverCityOSM)
for zipcode in zipcodes:
    print zipcode, zipcodes[zipcode]

CA 90045 1
CA 90291 1
CA 90405 1
CA 90404 1
CA 90034 1
CA 90036 1
CA 90024 1
91108 1
CA 90232 1


### A few zip codes are prefixed with CA, and one (91108) falls outside the expected '90' prefix. The zip codes for our area start with '90' [source](http://www.zipmap.net/California/Los_Angeles_County/Culver_City.htm). Only a handful of entries carry the CA prefix, so the field is mostly clean.
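
The notebook only audits the zip codes, but if we wanted to clean them as well, one possible approach is to pull out the five-digit code with a regular expression. This is a sketch under that assumption; `clean_postcode` is a hypothetical helper, not part of the original script.

```python
import re

# Five digits, optionally followed by a ZIP+4 extension that we drop.
ZIP_RE = re.compile(r'\b(\d{5})(?:-\d{4})?\b')


def clean_postcode(value):
    """Return the five-digit zip code found in value, or None if there is none."""
    match = ZIP_RE.search(value)
    return match.group(1) if match else None


samples = ["CA 90045", "90232", "91108", "90232-1234", "Culver City"]
cleaned = [clean_postcode(s) for s in samples]
print(cleaned)
```

Values with no recognizable zip code come back as `None`, so they can be reviewed by hand instead of being silently rewritten.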

### Now it's time to convert the OSM file into a JSON file via cygwin :
> $ python data.py culver-city_ca.osm

## We can use pymongo once the OSM data is reshaped from the original XML to JSON format.  The original [OSM XML extract is here](https://mapzen.com/data/metro-extracts/your-extracts/31eb1a32d44d).

In [48]:
# py script to output JSON file given OSM XML file

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    node = {}
    node["created"]={}
    node["address"]={}
    node["pos"]=[]
    refs=[]
    
    # only process node and way tags
    if element.tag == "node" or element.tag == "way" :
        if "id" in element.attrib:
            node["id"]=element.attrib["id"]
        node["type"]=element.tag

        if "visible" in element.attrib.keys():
            node["visible"]=element.attrib["visible"]
      
        # key-value pairs with attributes from CREATED list
        for elem in CREATED:
            if elem in element.attrib:
                node["created"][elem]=element.attrib[elem]
                
        # appending lat and lon to pos array
        
        if "lat" in element.attrib:
            node["pos"].append(float(element.attrib["lat"]))
        
        if "lon" in element.attrib:
            node["pos"].append(float(element.attrib["lon"]))

        
        for tag in element.iter("tag"):
            if not(problemchars.search(tag.attrib['k'])):
                if tag.attrib['k'] == "addr:housenumber":
                    node["address"]["housenumber"]=tag.attrib['v']
                    
                if tag.attrib['k'] == "addr:postcode":
                    node["address"]["postcode"]=tag.attrib['v']
                
                if tag.attrib['k'] == "addr:street":
                    node["address"]["street"]=tag.attrib['v']
                    node["address"]["street"] = update_name(node["address"]["street"], mapping)

                if tag.attrib['k'].find("addr")==-1:
                    node[tag.attrib['k']]=tag.attrib['v']
                    
        for nd in element.iter("nd"):
             refs.append(nd.attrib["ref"])
                
        if node["address"] =={}:
            node.pop("address", None)

        if refs != []:
           node["node_refs"]=refs
            
        return node
    else:
        return None

def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in cET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data
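
To show what a reshaped document looks like, here is a reduced sketch of the same shaping logic applied to a single made-up node. The sample XML, the user name, and the function name `shape_sample` are all hypothetical; the field layout (`created`, `pos`, `address`) follows the script above.

```python
import json
import xml.etree.ElementTree as ET

CREATED_KEYS = ["version", "changeset", "timestamp", "user", "uid"]
ADDRESS_KEYS = ["housenumber", "postcode", "street"]

# Made-up node for illustration, not taken from the extract.
SAMPLE_NODE = """
<node id="1001" lat="34.0211" lon="-118.3965" version="2" changeset="42"
      timestamp="2016-01-01T00:00:00Z" user="example_user" uid="7">
  <tag k="amenity" v="cafe"/>
  <tag k="name" v="Example Cafe"/>
  <tag k="addr:street" v="Main Street"/>
  <tag k="addr:postcode" v="90232"/>
</node>
""".strip()


def shape_sample(element):
    """Build a nested dict from one node element."""
    doc = {
        "id": element.get("id"),
        "type": element.tag,
        "created": {k: element.get(k) for k in CREATED_KEYS
                    if element.get(k) is not None},
        "pos": [float(element.get("lat")), float(element.get("lon"))],
    }
    address = {}
    for tag in element.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key.startswith("addr:"):
            field = key.split(":", 1)[1]
            if field in ADDRESS_KEYS:
                address[field] = value
        else:
            doc[key] = value
    if address:
        doc["address"] = address
    return doc


doc = shape_sample(ET.fromstring(SAMPLE_NODE))
print(json.dumps(doc, indent=2, sort_keys=True))
```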

In [49]:
# this reshaping can take several minutes
data = process_map(culverCityOSM, True)

### The next step requires we run mongod.exe from cygwin
> $ run mongod.exe

In [50]:
# inserting data can take several minutes
client = MongoClient()
db = client.culverCityOSM
collection = db.culverCityMAP
collection.insert(data)



[ObjectId('57e200606df1501eb837dc82'),
 ObjectId('57e200606df1501eb837dc83'),
 ObjectId('57e200606df1501eb837dc84'),
 ObjectId('57e200606df1501eb837dc85'),
 ObjectId('57e200606df1501eb837dc86'),
 ObjectId('57e200606df1501eb837dc87'),
 ObjectId('57e200606df1501eb837dc88'),
 ObjectId('57e200606df1501eb837dc89'),
 ObjectId('57e200606df1501eb837dc8a'),
 ObjectId('57e200606df1501eb837dc8b'),
 ObjectId('57e200606df1501eb837dc8c'),
 ObjectId('57e200606df1501eb837dc8d'),
 ObjectId('57e200606df1501eb837dc8e'),
 ObjectId('57e200606df1501eb837dc8f'),
 ObjectId('57e200606df1501eb837dc90'),
 ObjectId('57e200606df1501eb837dc91'),
 ObjectId('57e200606df1501eb837dc92'),
 ObjectId('57e200606df1501eb837dc93'),
 ObjectId('57e200606df1501eb837dc94'),
 ObjectId('57e200606df1501eb837dc95'),
 ObjectId('57e200606df1501eb837dc96'),
 ObjectId('57e200606df1501eb837dc97'),
 ObjectId('57e200606df1501eb837dc98'),
 ObjectId('57e200606df1501eb837dc99'),
 ObjectId('57e200606df1501eb837dc9a'),
 ObjectId('57e200606df150

In [51]:
collection

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'culverCityOSM'), u'culverCityMAP')
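
The insert step above loads all three million documents in one call. One alternative, sketched here as an assumption and not as what the notebook did, is to feed the documents in fixed-size batches. The helper name `batches` is hypothetical; in a real run each chunk could be passed to pymongo's `insert_many`.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in documents; with a live database each chunk could go to
# collection.insert_many(chunk) instead of being measured.
docs = ({"id": str(i)} for i in range(10))
sizes = [len(chunk) for chunk in batches(docs, 4)]
print(sizes)
```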

### Here is the size for original OSM XML file

In [52]:
os.path.getsize(culverCityOSM)/1024/1024

603L

### Here is the size for reshaped JSON output file

In [53]:
os.path.getsize(os.path.join(path, "culver-city_ca.osm.json"))/1024/1024

903L

### Count of documents in collection

In [54]:
collection.count()

3027238

### Here is the number of unique users

In [55]:
# took a few minutes to run
len(collection.group(["created.uid"], {}, {"count":0}, "function(o, p){p.count++}"))

569

### Number of nodes

In [56]:
collection.find({"type":"node"}).count()

2757124

### Number of ways

In [57]:
collection.find({"type":"way"}).count()

270091

### Now that our data is in a more query-friendly format, we can explore a few questions with aggregation pipelines.

### <font color = grey> Question 1: </font> What are the top 20 sources for this OSM data?

In [58]:
pipeline = [{"$match":{"source":{"$exists":1}}},
            {"$group":{"_id": "$source",
                       "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit":20}]
result = collection.aggregate(pipeline)

for doc in result:
    print(doc)

{u'count': 541, u'_id': u'usgs_imagery'}
{u'count': 432, u'_id': u'TIGER, Bing'}
{u'count': 427, u'_id': u'Bing'}
{u'count': 278, u'_id': u'Yahoo!, local knowledge'}
{u'count': 223, u'_id': u'survey;image;usgs_imagery'}
{u'count': 195, u'_id': u'Yahoo'}
{u'count': 161, u'_id': u'Yahoo,TIGER'}
{u'count': 124, u'_id': u'bing_imagery_0.06m_200801'}
{u'count': 123, u'_id': u'survey, image, usgs_imagery'}
{u'count': 107, u'_id': u'USGS Geonames'}
{u'count': 95, u'_id': u'survey'}
{u'count': 85, u'_id': u'Bing, local knowledge'}
{u'count': 72, u'_id': u'yahoo_imagery'}
{u'count': 62, u'_id': u'Yahoo!, Bing, local knowledge'}
{u'count': 39, u'_id': u'survey;image;usgs_imagery;CDOT'}
{u'count': 31, u'_id': u'bing_imagery_0.06m_200801;LACA'}
{u'count': 26, u'_id': u'Bing, TIGER'}
{u'count': 17, u'_id': u'usgs_imagery;survey;image'}
{u'count': 13, u'_id': u'bing'}
{u'count': 12, u'_id': u'Los Angeles Fire Department'}


### <font color = blue> Answer 1: </font> Where a source tag exists, usgs_imagery appears twice in the top 5 and several more times overall, because the same sources are written in different ways. The other sources include Yahoo, Bing and even the Los Angeles Fire Department.
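
Since the same source shows up under several spellings, one way to get a fairer picture is to split each source string on `;` and `,`, lowercase the pieces, and add up the counts. The sketch below does that in plain Python with the top-20 counts copied from the output above, so the totals cover only those 20 rows and not the whole collection.

```python
import re
from collections import Counter

# Top-20 source counts copied from the aggregation output above.
source_counts = {
    "usgs_imagery": 541,
    "TIGER, Bing": 432,
    "Bing": 427,
    "Yahoo!, local knowledge": 278,
    "survey;image;usgs_imagery": 223,
    "Yahoo": 195,
    "Yahoo,TIGER": 161,
    "bing_imagery_0.06m_200801": 124,
    "survey, image, usgs_imagery": 123,
    "USGS Geonames": 107,
    "survey": 95,
    "Bing, local knowledge": 85,
    "yahoo_imagery": 72,
    "Yahoo!, Bing, local knowledge": 62,
    "survey;image;usgs_imagery;CDOT": 39,
    "bing_imagery_0.06m_200801;LACA": 31,
    "Bing, TIGER": 26,
    "usgs_imagery;survey;image": 17,
    "bing": 13,
    "Los Angeles Fire Department": 12,
}

token_counts = Counter()
for source, count in source_counts.items():
    # Split multi-source strings and normalize case and whitespace.
    for token in re.split(r"[;,]", source):
        token = token.strip().lower()
        if token:
            token_counts[token] += count

for token, count in token_counts.most_common(5):
    print("{0}: {1}".format(token, count))
```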

### <font color = grey> Question 2: </font> Who are the top ten users?

In [59]:
pipeline = [{"$group":{"_id": "$created.user",
                       "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 521459, u'_id': u'schleuss_imports'}
{u'count': 367727, u'_id': u'manings_labuildings'}
{u'count': 348528, u'_id': u'calfarome_labuilding'}
{u'count': 308393, u'_id': u'ridixcr_import'}
{u'count': 178890, u'_id': u'karitotp_labuildings'}
{u'count': 161989, u'_id': u'kingrollo'}
{u'count': 138848, u'_id': u'dannykath_labuildings'}
{u'count': 119205, u'_id': u'Luis36995_labuildings'}
{u'count': 108652, u'_id': u'RichRico_labuildings'}
{u'count': 105593, u'_id': u'schleuss'}


### <font color = blue> Answer 2: </font> Note that the first and last entries both contain schleuss. A quick internet search leads me to suspect this could be data from Jon Schleuss, a data visualization artist for the LA Times. I could not come up with any ideas for kingrollo or the users ending in _labuildings.

### <font color = grey> Question 3: </font> What is the ratio of top 10 user contributions?

In [60]:
pipeline = [{"$group":{"_id": "$created.user",
                       "count": {"$sum": 1}}},
            {"$project": {"ratio": {"$divide" :["$count",collection.find().count()]}}},
            {"$sort": {"ratio": -1}},
            {"$limit": 10}]
result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'_id': u'schleuss_imports', u'ratio': 0.1722556997500692}
{u'_id': u'manings_labuildings', u'ratio': 0.12147277485285267}
{u'_id': u'calfarome_labuilding', u'ratio': 0.11513069008779621}
{u'_id': u'ridixcr_import', u'ratio': 0.10187273019167968}
{u'_id': u'karitotp_labuildings', u'ratio': 0.05909347068185587}
{u'_id': u'kingrollo', u'ratio': 0.05351049372398206}
{u'_id': u'dannykath_labuildings', u'ratio': 0.04586623185887598}
{u'_id': u'Luis36995_labuildings', u'ratio': 0.039377478744651064}
{u'_id': u'RichRico_labuildings', u'ratio': 0.03589146277894239}
{u'_id': u'schleuss', u'ratio': 0.03488097070663093}


### <font color = blue> Answer 3: </font> Roughly 21 percent of contributions come from schleuss_imports and schleuss combined (0.172 + 0.035).
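
The arithmetic behind this answer can be checked directly from the counts in Question 2 and the document total reported earlier. This is a small worked example in plain Python; it also adds up the share held by all ten users together, which comes to about 78 percent.

```python
# Document total and top-ten counts copied from the outputs above.
total_docs = 3027238
top_users = [
    ("schleuss_imports", 521459),
    ("manings_labuildings", 367727),
    ("calfarome_labuilding", 348528),
    ("ridixcr_import", 308393),
    ("karitotp_labuildings", 178890),
    ("kingrollo", 161989),
    ("dannykath_labuildings", 138848),
    ("Luis36995_labuildings", 119205),
    ("RichRico_labuildings", 108652),
    ("schleuss", 105593),
]

# Share of documents from the two accounts whose names start with "schleuss".
schleuss_share = sum(count for user, count in top_users
                     if user.startswith("schleuss")) / float(total_docs)

# Share of documents from all ten users together.
top10_share = sum(count for _, count in top_users) / float(total_docs)

print(round(schleuss_share, 4))
print(round(top10_share, 4))
```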

### <font color = grey> Question 4: </font> What amenities are near me?

In [61]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "name":{"$exists":1}}},  # exclude any document without amenity or name
            {"$group":{"_id":"$amenity", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":30}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 206, u'_id': u'restaurant'}
{u'count': 192, u'_id': u'school'}
{u'count': 181, u'_id': u'place_of_worship'}
{u'count': 120, u'_id': u'fast_food'}
{u'count': 93, u'_id': u'cafe'}
{u'count': 54, u'_id': u'hospital'}
{u'count': 51, u'_id': u'parking'}
{u'count': 42, u'_id': u'fuel'}
{u'count': 36, u'_id': u'library'}
{u'count': 31, u'_id': u'bank'}
{u'count': 28, u'_id': u'post_office'}
{u'count': 24, u'_id': u'pharmacy'}
{u'count': 23, u'_id': u'fire_station'}
{u'count': 20, u'_id': u'theatre'}
{u'count': 16, u'_id': u'bar'}
{u'count': 12, u'_id': u'cinema'}
{u'count': 10, u'_id': u'police'}
{u'count': 8, u'_id': u'social_facility'}
{u'count': 8, u'_id': u'doctors'}
{u'count': 6, u'_id': u'car_rental'}
{u'count': 5, u'_id': u'college'}
{u'count': 5, u'_id': u'arts_centre'}
{u'count': 5, u'_id': u'university'}
{u'count': 4, u'_id': u'bicycle_rental'}
{u'count': 4, u'_id': u'fountain'}
{u'count': 4, u'_id': u'townhall'}
{u'count': 4, u'_id': u'marketplace'}
{u'count': 4, u'_id':

### <font color = blue> Answer 4: </font> Three of the top five amenities (restaurant, fast_food, and cafe) are food related. Plus there are 16 bars, which may also serve food. Seeing ice_cream near the bottom means people around me like to eat.

### <font color = grey> Question 5: </font> What does a document in the fast_food amenity look like?

In [62]:
# matching on the value already implies the key exists; a repeated dict key would be overwritten anyway
pipeline = [{"$match":{"amenity":"fast_food", "name":{"$exists":1}}},
            {"$sort":{"count":-1}},
            {"$limit":5}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'amenity': u'fast_food', u'drive_through': u'yes', u'name': u'In-n-Out Burger', u'created': {u'changeset': u'16865124', u'version': u'3', u'uid': u'1519787', u'timestamp': u'2013-07-07T19:45:41Z', u'user': u'kdano'}, u'wheelchair': u'yes', u'pos': [34.0631487, -118.4480856], u'_id': ObjectId('57e200626df1501eb83837b6'), u'type': u'node', u'id': u'672875635'}
{u'amenity': u'fast_food', u'name': u'Subway', u'created': {u'changeset': u'5978908', u'version': u'1', u'uid': u'36489', u'timestamp': u'2010-10-07T15:03:57Z', u'user': u'jerjozwik'}, u'pos': [34.0521995, -118.3443903], u'source': u'survey', u'_id': ObjectId('57e200626df1501eb8384d9d'), u'type': u'node', u'id': u'940358499'}
{u'amenity': u'fast_food', u'name': u'Panda Express', u'created': {u'changeset': u'1830721', u'version': u'1', u'uid': u'28775', u'timestamp': u'2009-07-14T21:47:44Z', u'user': u'StellanL'}, u'pos': [34.0174651, -118.4084395], u'_id': ObjectId('57e200626df1501eb8382635'), u'type': u'node', u'id': u'441936006

### <font color = blue> Answer 5: </font> Note the drive_through tag.  We could look at the fast food locations by name.

### <font color = grey> Question 6: </font> What are the top 10 different fast food names?

In [63]:
pipeline = [{"$match":{"amenity":"fast_food", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":10}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 13, u'_id': u'Subway'}
{u'count': 8, u'_id': u"McDonald's"}
{u'count': 5, u'_id': u'KFC'}
{u'count': 4, u'_id': u'Burger King'}
{u'count': 4, u'_id': u'Chipotle'}
{u'count': 4, u'_id': u'Taco Bell'}
{u'count': 3, u'_id': u'Jamba Juice'}
{u'count': 3, u'_id': u"Noah's Bagels"}
{u'count': 3, u'_id': u'Panda Express'}
{u'count': 2, u'_id': u'Jack in the Box'}


### <font color = blue> Answer 6: </font> Subway has the top spot for fast food. I wonder what attributes and features are used to classify amenities like restaurants, fast food, bars and pubs. I read [this page here](http://wiki.openstreetmap.org/wiki/Key:amenity), although these sorts of definitions are often subjective and difficult to crowdsource.

### <font color = grey> Question 7: </font> How else can we group the restaurant amenities?

In [64]:
pipeline = [{"$match":{"amenity":"restaurant", "name":{"$exists":1}}},
            {"$sort":{"count":-1}},
            {"$limit":3}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'cuisine': u'sushi', u'amenity': u'restaurant', u'name': u'Yokohama Sushi', u'created': {u'changeset': u'412604', u'version': u'2', u'uid': u'28775', u'timestamp': u'2008-07-06T10:19:10Z', u'user': u'StellanL'}, u'pos': [34.0169087, -118.4066108], u'created_by': u'Potlatch 0.9c', u'_id': ObjectId('57e200616df1501eb8380cda'), u'type': u'node', u'id': u'276597746'}
{u'amenity': u'restaurant', u'name': u"Jerry's Famous Deli", u'created': {u'changeset': u'652741', u'version': u'1', u'uid': u'100465', u'timestamp': u'2009-02-24T01:05:48Z', u'user': u'emc1x'}, u'pos': [34.0770679, -118.3806155], u'created_by': u'Potlatch 0.10f', u'_id': ObjectId('57e200616df1501eb8381a8e'), u'type': u'node', u'id': u'350742224'}
{u'amenity': u'restaurant', u'name': u'Bossa Nova', u'created': {u'changeset': u'10875502', u'version': u'2', u'uid': u'81983', u'timestamp': u'2012-03-05T00:23:14Z', u'user': u'Ogmios'}, u'pos': [34.0832812, -118.3858855], u'_id': ObjectId('57e200616df1501eb8381acf'), u'type': u'n

### Let's group the data to count by cuisine.

In [65]:
pipeline = [{"$match":{"amenity":"restaurant", "cuisine":{"$exists":1}}},
            {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},        
            {"$sort":{"count":-1}}, 
            {"$limit":10}]
result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 15, u'_id': u'mexican'}
{u'count': 12, u'_id': u'italian'}
{u'count': 9, u'_id': u'american'}
{u'count': 7, u'_id': u'burger'}
{u'count': 7, u'_id': u'sushi'}
{u'count': 6, u'_id': u'pizza'}
{u'count': 5, u'_id': u'thai'}
{u'count': 5, u'_id': u'indian'}
{u'count': 5, u'_id': u'chinese'}
{u'count': 4, u'_id': u'japanese'}


### <font color = blue> Answer 7: </font> Combined, japanese and sushi (11) would rank third behind mexican (15) and italian (12). Likewise, pizza and italian (18 combined) and american and burger (16 combined) are similar cuisines. So even though mexican is the number one cuisine tag, the counts are fairly close once like cuisines are combined.
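
To make the comparison concrete, the sketch below merges like cuisines in plain Python using the counts from the output above. The grouping in `MERGE` (sushi into japanese, pizza into italian, burger into american) is my own assumption about which tags belong together.

```python
from collections import Counter

# Top-10 cuisine counts copied from the aggregation output above.
cuisine_counts = {
    "mexican": 15, "italian": 12, "american": 9, "burger": 7, "sushi": 7,
    "pizza": 6, "thai": 5, "indian": 5, "chinese": 5, "japanese": 4,
}

# Assumed grouping of similar cuisine tags.
MERGE = {"sushi": "japanese", "pizza": "italian", "burger": "american"}

merged = Counter()
for cuisine, count in cuisine_counts.items():
    merged[MERGE.get(cuisine, cuisine)] += count

for cuisine, count in merged.most_common():
    print("{0}: {1}".format(cuisine, count))
```

With this grouping italian moves to the top, followed by american, mexican, and japanese.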

### <font color = grey> Question 8: </font> Are the opening hours for the cafe amenity in a standard format?

In [66]:
pipeline = [{"$match":{"amenity":"cafe", "opening_hours":{"$exists":1}}},
            {"$group":{"_id":"$opening_hours", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":20}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 1, u'_id': u'Mo-Fr 07:00-16:00'}
{u'count': 1, u'_id': u'Mo-Fr 5:00-21:00; Sa-Su 6:00-21:00'}
{u'count': 1, u'_id': u'Monday 11:00 am - 9:00 pm, Tuesday 11:00 am - 9:00 pm, Wednesday 11:00 am - 9:00 pm, Thursday 11:00 am - 10:00 pm, Friday 11:00 am - 10:00 pm, Saturday 11:00 am - 10:00 pm, Sunday 4:00 pm - 9:00 pm'}
{u'count': 1, u'_id': u'Mo-Su 05:00-21:00'}
{u'count': 1, u'_id': u'Mo-Su 08:00-17:00'}
{u'count': 1, u'_id': u'Mo-Fr 09:00-15:00'}
{u'count': 1, u'_id': u'Mo-Su 16:00-22:00'}
{u'count': 1, u'_id': u'Sun - Sat 5 am - 8 pm'}
{u'count': 1, u'_id': u'M-F 4:30AM-9:00PM, Sa-Su 5:00AM-9:00PM'}
{u'count': 1, u'_id': u'Mo-Su 5:00-20:00'}
{u'count': 1, u'_id': u'M-F 7am - 11pm Sa-Su 8am - 11p'}
{u'count': 1, u'_id': u'Mo-Fr 04:30-22:00; Sa-Su 05:00-22:00'}


### <font color = blue> Answer 8: </font> No, the hours do not look standardized.  Maybe an open and close tag containing open and close hours in a 24-hr format could help improve the data.
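
To put a number on how mixed the formats are, the sketch below checks each value from the output above against a simplified pattern of the form `Mo-Fr 07:00-16:00; Sa-Su 06:00-21:00`. The regular expression is my own simplification for this audit and covers only a small part of what OSM's `opening_hours` syntax allows.

```python
import re

DAY = r"(?:Mo|Tu|We|Th|Fr|Sa|Su)"
RULE = DAY + r"(?:-" + DAY + r")? \d{1,2}:\d{2}-\d{1,2}:\d{2}"
# One or more "days hours" rules separated by "; ".
SIMPLE_HOURS_RE = re.compile(r"^" + RULE + r"(?:; " + RULE + r")*$")

# Values copied from the aggregation output above.
hours = [
    "Mo-Fr 07:00-16:00",
    "Mo-Fr 5:00-21:00; Sa-Su 6:00-21:00",
    "Monday 11:00 am - 9:00 pm, Tuesday 11:00 am - 9:00 pm, Wednesday 11:00 am - 9:00 pm, Thursday 11:00 am - 10:00 pm, Friday 11:00 am - 10:00 pm, Saturday 11:00 am - 10:00 pm, Sunday 4:00 pm - 9:00 pm",
    "Mo-Su 05:00-21:00",
    "Mo-Su 08:00-17:00",
    "Mo-Fr 09:00-15:00",
    "Mo-Su 16:00-22:00",
    "Sun - Sat 5 am - 8 pm",
    "M-F 4:30AM-9:00PM, Sa-Su 5:00AM-9:00PM",
    "Mo-Su 5:00-20:00",
    "M-F 7am - 11pm Sa-Su 8am - 11p",
    "Mo-Fr 04:30-22:00; Sa-Su 05:00-22:00",
]

nonconforming = [h for h in hours if not SIMPLE_HOURS_RE.match(h)]
conforming_count = len(hours) - len(nonconforming)
print("{0} of {1} values match the simple pattern".format(conforming_count, len(hours)))
for value in nonconforming:
    print("needs cleaning: " + value)
```

The values that fail the check are the free-text ones, which would be the natural candidates for cleaning.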

### <font color = grey> Question 9: </font> Which denominations are most represented among places of worship?

In [67]:
pipeline = [{"$match":{"amenity":"place_of_worship", "denomination":{"$exists":1}}},
            {"$group":{"_id":"$denomination", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":10}]

result = collection.aggregate(pipeline)
for doc in result:
    print(doc)

{u'count': 21, u'_id': u'baptist'}
{u'count': 18, u'_id': u'lutheran'}
{u'count': 15, u'_id': u'catholic'}
{u'count': 11, u'_id': u'presbyterian'}
{u'count': 10, u'_id': u'methodist'}
{u'count': 5, u'_id': u'mormon'}
{u'count': 3, u'_id': u'Jewish'}
{u'count': 2, u'_id': u'pentecostal'}
{u'count': 1, u'_id': u'jewish'}
{u'count': 1, u'_id': u'jehovahs_witness'}


### <font color = blue> Answer 9: </font> Baptist is the most frequent denomination within place of worship.  Note the Jewish denomination is on the list twice due to capitalization.
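
The capitalization issue is easy to fix by lowercasing the denomination before counting. The sketch below shows the idea in plain Python with the counts from the output above; it illustrates the cleaning step and is not something the notebook ran against the database.

```python
from collections import Counter

# Denomination counts copied from the aggregation output above.
denomination_counts = [
    ("baptist", 21), ("lutheran", 18), ("catholic", 15), ("presbyterian", 11),
    ("methodist", 10), ("mormon", 5), ("Jewish", 3), ("pentecostal", 2),
    ("jewish", 1), ("jehovahs_witness", 1),
]

normalized = Counter()
for denomination, count in denomination_counts:
    # Lowercase so that "Jewish" and "jewish" land in the same bucket.
    normalized[denomination.lower()] += count

for denomination, count in normalized.most_common():
    print("{0}: {1}".format(denomination, count))
```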

### Conclusion: The focus of this project was data wrangling and munging on a selected XML extract from OpenStreetMap (OSM). Because OSM data is entered by people, we cleaned up street names and audited zip codes before inserting the data into MongoDB. Once the data is in MongoDB, we can count the unique users, look at the sources for our data and investigate amenities. In Culver City, places to eat and drink, such as restaurants, fast food and cafes, are in no shortage. The opening hours for the cafe amenity are non-uniform, so there could be an opportunity to improve the schema by adding tags such as time_in and time_out.