# Manhattan OpenStreetMap Data Exploration - P3 Data Analyst

### Marie Leaf | February 2016


Map Area: [Manhattan, New York, United States](https://www.openstreetmap.org/relation/2552485#map=11/40.7811/-73.9779)  
XML OSM Download: [Mapzen Direct](https://s3.amazonaws.com/metro-extracts.mapzen.com/new-york_new-york.osm.bz2)

I chose the New York (Manhattan) map because it is the city where I currently live and one I am interested in exploring.

Manhattan has a population of 1.6 million people (2014), with an influx of commuters on business days that raises the total to 3.9 million, or more than 170,000 people per square mile. Manhattan is considered the economic and cultural center of the United States and is home to the United Nations Headquarters and the world's two largest stock exchanges. The city's real estate market consistently ranks among the most expensive in the world.

### 1. Problems Encountered in the Map

After downloading the New York City metro area data, extracting a small sample (1sample.py), and running it through 2parse.py, 3explore.py, and 4audit.py, I noticed some issues with the data:

- When auditing the street types, directions came at the end of some street names. I solved this by classifying these directions as street types. 
- Sometimes the street type came at the beginning of the string, e.g. Avenue C. I solved this by including the conditions `and 'Avenue' not in street_name` and `and 'Road' not in street_name` in the `audit_street_type` function
- State abbreviations needed to be standardized with capitalization
- Some street names were in all caps, so I added `.lower().title()` to the regex match
- There was a lot of zipcode redundancy with the TIGER data, so I normalized these keys as zipcode keys, i.e. `tiger:zip_right` and `tiger:zip_left` keys were converted to `zipcode`
- I skipped the GNIS data
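
The fixes listed above can be sketched roughly as follows. This is a minimal illustration rather than the contents of 4audit.py: the regex, the `EXPECTED` set, and the helper names `is_unexpected_street_type`, `clean_street_name`, and `normalize_zip_keys` are assumptions. Only the `'Avenue' not in street_name` / `'Road' not in street_name` conditions, the `.lower().title()` call, and the `tiger:zip_left` / `tiger:zip_right` to `zipcode` mapping come from the notes above.

```python
import re

# Matches the last word of a street name, e.g. "Street" in "Wall Street".
# (Assumed pattern; the project's real regex may differ.)
STREET_TYPE_RE = re.compile(r"\b\S+\.?$", re.IGNORECASE)

# Hypothetical whitelist. Directions are treated as street types.
EXPECTED = {"Street", "Avenue", "Road", "Place", "Broadway",
            "North", "South", "East", "West"}


def is_unexpected_street_type(street_name):
    """Return True when the trailing word is not an expected street type."""
    match = STREET_TYPE_RE.search(street_name)
    if not match:
        return False
    street_type = match.group()
    # Names such as "Avenue C" carry the type at the front, so skip them.
    return (street_type not in EXPECTED
            and "Avenue" not in street_name
            and "Road" not in street_name)


def clean_street_name(street_name):
    """Normalize all-caps names such as 'WALL STREET' to 'Wall Street'."""
    return street_name.lower().title()


def normalize_zip_keys(tags):
    """Fold the TIGER zip keys into a single 'zipcode' key."""
    cleaned = {}
    for key, value in tags.items():
        if key in ("tiger:zip_left", "tiger:zip_right"):
            # Keep the first value seen; later duplicates are dropped.
            cleaned.setdefault("zipcode", value)
        else:
            cleaned[key] = value
    return cleaned
```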



### 2. Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them. The data is a snapshot of the latest version available on Mapzen.

To load the JSON into MongoDB:

```> mongoimport -d nanodegree -c manhattan --file new-york_mini.osm.json```   
```> use nanodegree ```
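
The queries in this section assume documents shaped roughly like the ones below, with element metadata nested under `created` and tags promoted to top-level fields. The sketch writes one JSON document per line, a layout `mongoimport` accepts by default. The field values and the helper name `to_json_lines` are made up for illustration; only the field names are taken from the queries in this report.

```python
import json

# Hypothetical documents; field names mirror those used by the queries
# in this report (type, created.user, created.timestamp, amenity, ...).
sample_docs = [
    {
        "type": "node",
        "created": {"user": "example_user",
                    "timestamp": "2014-03-02T12:00:00Z"},
        "amenity": "cafe",
        "cuisine": "coffee_shop",
    },
    {
        "type": "way",
        "created": {"user": "example_user",
                    "timestamp": "2013-11-20T08:30:00Z"},
        "height": 12,
    },
]


def to_json_lines(docs):
    """Serialize documents as newline-delimited JSON, one per line."""
    return "\n".join(json.dumps(doc) for doc in docs) + "\n"


json_lines = to_json_lines(sample_docs)
print(json_lines)
```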


__File Sizes__

new-york_mini.osm : 221.5 MB  
new-york_mini.osm.json : 243.6 MB

__Number of documents__

```
> db.manhattan.find().count()  
```

__Number of nodes and ways__

```
> db.manhattan.aggregate( [ { $group : { "_id" : "$type", "count" : { $sum : 1 } } } ] )
```

__Number of unique users__

```
> db.manhattan.distinct("created.user").length
```

__Top one contributing user__

```
> db.manhattan.aggregate( [
    { $group : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
    { $sort : {"count" : -1} }, 
    { $limit : 1 } ] )
```
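
This pipeline and the top-five query later in the report differ only in the `$limit` value, so when driving MongoDB from Python it may be convenient to build the stages with a small helper. The sketch below is an assumption about how that could look; `top_users_pipeline` is a hypothetical name, not a function from the project scripts.

```python
def top_users_pipeline(limit):
    """Build an aggregation pipeline ranking users by document count."""
    return [
        # Count documents per contributing user.
        {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
        # Largest contributors first.
        {"$sort": {"count": -1}},
        # Keep only the requested number of users.
        {"$limit": limit},
    ]


top_one = top_users_pipeline(1)
top_five = top_users_pipeline(5)
```

With pymongo, the returned list could be passed to `db.manhattan.aggregate(...)`; that call is not made here because it needs a running MongoDB instance.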

__Number of users appearing only once (having one post)__

```
> db.manhattan.aggregate( [
    { $group : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
    { $group : {"_id" : "$count",
                "num_users": { "$sum" : 1} } },
    { $sort : {"_id" : 1} },
    { $limit : 1} ] )
```
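
The two `$group` stages can be hard to follow at first, so here is a plain Python sketch of what they compute. The user names are made up; the real pipeline runs over `created.user` in the collection.

```python
from collections import Counter

# Made-up contributor names, one entry per document in the collection.
doc_users = ["alice", "bob", "alice", "carol", "alice", "dave", "bob"]

# Stage 1 ($group by "$created.user"): number of documents per user.
posts_per_user = Counter(doc_users)

# Stage 2 ($group by "$count"): number of users for each post count.
users_per_post_count = Counter(posts_per_user.values())

# $sort by _id ascending, then $limit 1: the smallest post count.
smallest_count, num_users = min(users_per_post_count.items())

print(smallest_count, num_users)
```

Note that the pipeline reports the group with the smallest post count. That group corresponds to single-post users only when at least one such user exists, so it is worth checking that the returned `_id` is 1.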

__Results__

```
Overview:
---------
Document count:  1007733
Node count:  865393
Ways count:  142340
Number of unique users: 1908
Top contributing user: {u'_id': u'Rub21_nycbuildings', u'count': 489487}
Single post users: 589
```


### 3. Additional Ideas and MongoDB Queries


__Top contributing users__

```
> db.manhattan.aggregate( [
    { $group : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
    { $sort : {"count" : -1} }, 
    { $limit : 5 } ] )
```

Results:
```
{u'_id': u'Rub21_nycbuildings', u'count': 489487}
{u'_id': u'ingalls_nycbuildings', u'count': 93466}
{u'_id': u'woodpeck_fixbot', u'count': 65174}
{u'_id': u'ediyes_nycbuildings', u'count': 27356}
{u'_id': u'lxbarth_nycbuildings', u'count': 23541}
```


__Variability in height (min, max, avg building heights)__  
After running 2parse.py, I noticed that 'height' is the second most frequent tag key. I used it to practice setting up queries for more analytical tasks.

```
> db.manhattan.aggregate([
    {'$match': {'height': {'$gt': 0}}},
    {'$group': {
        '_id': null,
        'count': {'$sum': 1},
        'min': {'$min': '$height'},
        'max': {'$max': '$height'},
        'avg': {'$avg': '$height'},
        'stdD': {'$stdDevPop': '$height'}}}])
```

Results:
```
{u'_id': None,
 u'avg': 8.403078202995008,
 u'max': 242,
 u'min': 1,
 u'stdD': 9.609090287255432}
```

__Timestamps__  
Buildings, streets, and Manhattan in general develop and change at a rapid pace. After querying the timestamps and seeing that most documents were added in 2013 or 2014, I suggest that the dataset would benefit from more current data.

```
> db.manhattan.aggregate([
    {'$match': {'created.timestamp': {'$exists': 1}}},
    {'$group': {'_id': {'$substr': ['$created.timestamp', 0, 4]},
                'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10}])
```
Results:
```
Timestamps: 
{u'_id': u'2013', u'count': 405109}
{u'_id': u'2014', u'count': 379748}
{u'_id': u'2009', u'count': 103358}
{u'_id': u'2015', u'count': 61488}
{u'_id': u'2012', u'count': 18883}
{u'_id': u'2010', u'count': 18549}
{u'_id': u'2011', u'count': 10272}
{u'_id': u'2016', u'count': 7385}
{u'_id': u'2008', u'count': 2350}
{u'_id': u'2007', u'count': 591}
```
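
For readers less familiar with `$substr`, the year grouping can be mimicked in plain Python. This sketch assumes ISO 8601 timestamps, so the first four characters are the year; the sample values are made up.

```python
from collections import Counter

# Made-up ISO 8601 timestamps in the style of "created.timestamp".
timestamps = [
    "2013-05-01T10:00:00Z",
    "2013-09-14T18:22:05Z",
    "2014-01-30T07:45:10Z",
    "2009-06-11T23:59:59Z",
]

# Equivalent of {'$substr': ['$created.timestamp', 0, 4]}: keep the year.
docs_per_year = Counter(ts[:4] for ts in timestamps)

# Equivalent of sorting by count (descending) and keeping the top 10.
for year, count in docs_per_year.most_common(10):
    print(year, count)
```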

__Most Common Amenities__

```
> db.manhattan.aggregate([
    {'$match': {'amenity': {'$exists': 1}}},
    {'$group': {'_id': '$amenity', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10}])
```
Results:
```
Top amenities: 
{u'_id': u'parking', u'count': 635}
{u'_id': u'bicycle_parking', u'count': 484}
{u'_id': u'place_of_worship', u'count': 461}
{u'_id': u'school', u'count': 453}
{u'_id': u'restaurant', u'count': 264}
{u'_id': u'fast_food', u'count': 93}
{u'_id': u'cafe', u'count': 90}
{u'_id': u'bank', u'count': 89}
{u'_id': u'fire_station', u'count': 70}
{u'_id': u'bench', u'count': 66}
```

__Most Common Cuisines__

```
> db.manhattan.aggregate( [ 
    { $match : { "cuisine" : { $exists : 1} } },
    { $group : { "_id" : "$cuisine", "count" : { $sum : 1} } }, 
    { $sort : { "count" : -1 } },
    { $limit : 10} ] )
```
Results:
```
Top cuisines: 
{u'_id': u'coffee_shop', u'count': 21}
{u'_id': u'burger', u'count': 21}
{u'_id': u'pizza', u'count': 20}
{u'_id': u'italian', u'count': 19}
{u'_id': u'sandwich', u'count': 14}
{u'_id': u'american', u'count': 12}
{u'_id': u'chinese', u'count': 12}
{u'_id': u'donut', u'count': 10}
{u'_id': u'indian', u'count': 9}
{u'_id': u'mexican', u'count': 8}
```

### Conclusion                        

After reviewing the data, I noticed that the bulk of the dataset (78%) dates from 2013 or 2014 and may be outdated. As New York is a fast-developing city, the dataset would benefit from current 2016 data. Further research into how user Rub21_nycbuildings uploads data would help determine an appropriate way to incentivize that user to keep providing current data.

Furthermore, 49% of the data comes from this one user (Rub21_nycbuildings), while 31% of all unique users are single-post users who collectively contribute about 0.06% of the documents in the set. This skew between one power contributor and many single-post users suggests a lot of untapped potential among the latter. It would be worth studying who the single-post users are in order to retain them and encourage more frequent contributions, possibly through gamification or a minimum contribution threshold before recognition (e.g. a points system like Stack Overflow's). Perhaps a user would need to contribute at least 5 new items/documents, or 10 updated ones, before their data is published.
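
The percentages quoted in this conclusion follow from the counts reported in the Results and Timestamps output earlier in the report. A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Figures copied from the Results and Timestamps output in this report.
total_docs = 1007733
docs_2013, docs_2014 = 405109, 379748
top_user_docs = 489487       # Rub21_nycbuildings
unique_users = 1908
single_post_users = 589      # each contributed exactly one document


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


share_2013_2014 = percent(docs_2013 + docs_2014, total_docs)
share_top_user = percent(top_user_docs, total_docs)
share_single_post_users = percent(single_post_users, unique_users)
# One document per single-post user, so the user count is also the doc count.
share_single_post_docs = percent(single_post_users, total_docs)

print("Documents from 2013-2014: %.1f%%" % share_2013_2014)
print("Documents from top user: %.1f%%" % share_top_user)
print("Users with a single post: %.1f%%" % share_single_post_users)
print("Documents from single-post users: %.3f%%" % share_single_post_docs)
```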

A possible further study would regress building heights against parking availability, with zipcode or even city block as a dummy variable. This kind of analysis could help in understanding city density and parking optimization, for example to support a business case for opening a new parking garage.
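
As a starting point for such a regression, the dummy-variable encoding could look like the sketch below. Everything here is hypothetical: the observations are invented, and `build_design_matrix` is an illustrative helper, not part of the project. The resulting rows could then be handed to any least-squares routine.

```python
# Hypothetical observations: (zipcode, parking_spots, building_height).
observations = [
    ("10001", 3, 12.0),
    ("10002", 1, 8.5),
    ("10003", 0, 20.0),
    ("10001", 2, 15.0),
]


def build_design_matrix(rows):
    """Return (X, y, dummy_zipcodes) for a height-on-parking regression.

    Each row of X holds an intercept, the parking count, and one 0/1
    dummy per zipcode. The first zipcode in sorted order is the baseline
    and gets no column, which avoids collinearity with the intercept.
    """
    zipcodes = sorted({zipcode for zipcode, _, _ in rows})
    dummy_zipcodes = zipcodes[1:]
    X, y = [], []
    for zipcode, parking_spots, height in rows:
        dummies = [1 if zipcode == z else 0 for z in dummy_zipcodes]
        X.append([1, parking_spots] + dummies)
        y.append(height)
    return X, y, dummy_zipcodes


X, y, dummy_zipcodes = build_design_matrix(observations)
```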