# Manhattan OpenStreetMap Data Exploration - P3 Data Analyst

### Marie Leaf | February 2016


Map Area: [Manhattan, New York, United States](https://www.openstreetmap.org/relation/2552485#map=11/40.7811/-73.9779)  
XML OSM Download: [Mapzen Direct](https://s3.amazonaws.com/metro-extracts.mapzen.com/new-york_new-york.osm.bz2)

I chose the New York (Manhattan) map because it is the city where I currently live and one I am interested in exploring.

Manhattan has a population of 1.6 million people (2014), and an influx of commuters on business days increases the total population to 3.9 million, or more than 170,000 people per sq. mile. Manhattan is considered the economic and cultural center of the United States and is home to the United Nations Headquarters and the world's two largest stock exchanges. The city's real estate market consistently ranks among the most expensive in the world.

### 1. Problems Encountered in the Map

After downloading the New York City metro area data, converting it to a small sample (1sample.py), and running that sample through the processing files 2parse.py, 3explore.py, and 4audit.py, I noticed some issues with the data:

- When auditing the street types, I found that directions came at the end of some street names. I solved this by classifying these directions as street types.
- Sometimes the street type came at the beginning of the string, e.g. Avenue C. I solved this by including the conditions `and 'Avenue' not in street_name` and `and 'Road' not in street_name` in the `audit_street_type` function.
- State abbreviations needed to be standardized with consistent capitalization.
- Some street names were in all caps, so I added `.lower().title()` to the regex match.
- There was a lot of zipcode redundancy with the TIGER data, so I normalized these keys as zipcode keys, i.e. the `tiger:zip_right` and `tiger:zip_left` keys were converted to `zipcode`.
- I skipped past the GNIS data.
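
The fixes listed above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the code in 4audit.py: the regex, the `EXPECTED` set, and the helper names `clean_street_name`, `is_unexpected_street_type`, and `normalize_key` are all hypothetical. One deliberate swap: `string.capwords` is used in place of `.lower().title()`, because `str.title()` turns ordinals such as "42ND" into "42Nd".

```python
import re
import string

# Assumed pattern: the last whitespace-delimited word of a street name.
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Hypothetical whitelist; directions are treated as street types.
EXPECTED = {"Street", "Avenue", "Road", "Place", "Broadway",
            "North", "South", "East", "West"}

# Hypothetical mapping of the redundant TIGER keys onto one zipcode key.
ZIP_KEYS = {"tiger:zip_left": "zipcode", "tiger:zip_right": "zipcode"}


def clean_street_name(street_name):
    """Convert an all-caps or mixed-case street name to capitalized words."""
    return string.capwords(street_name)


def is_unexpected_street_type(street_name):
    """Return True when the trailing word is not a known street type.

    Names such as "Avenue C" carry the type at the front, so any name
    containing "Avenue" or "Road" is not flagged.
    """
    if "Avenue" in street_name or "Road" in street_name:
        return False
    match = STREET_TYPE_RE.search(street_name)
    return bool(match) and match.group() not in EXPECTED


def normalize_key(key):
    """Map TIGER zip keys to 'zipcode'; leave every other key unchanged."""
    return ZIP_KEYS.get(key, key)
```

With these helpers, "Central Park West" and "Avenue C" pass the audit, while an abbreviated name such as "Wall St" is flagged for review.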



### 2. Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them. The data is a snapshot of the latest version available on Mapzen. Only the last user who contributed to each node or way is counted.

To load the JSON into MongoDB:

```
> mongoimport -d nanodegree -c manhattan --file sample.osm.json
> use nanodegree
```


__File Sizes__

small_sample.osm : 4.9 MB  
small_sample.osm.json : 5.4 MB

__Number of documents__

```
> db.manhattan.find().count()  
```
Result:
```
104996
```

__Number of nodes and ways__

```
> db.manhattan.aggregate( [ { $group : { "_id" : "$type", "count" : { $sum : 1 } } } ] )
```
Result:
```
{ "_id" : "way", "count" : 101849 }
{ "_id" : "node", "count" : 3147 }
```

__Number of unique users__

```
> db.manhattan.distinct("created.user").length
```
Result:
```
769
```

__Top contributing user__

```
> db.manhattan.aggregate( [
    { $group : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
    { $sort : {"count" : -1} }, 
    { $limit : 1 } ] )
```
Result:
```
{ "_id" : "woodpeck_fixbot", "count" : 66873 }
```

__Number of users appearing only once (having one post)__

```
> db.manhattan.aggregate( [
    { $group : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
    { $group : {"_id" : "$count",
                "num_users": { "$sum" : 1} } },
    { $sort : {"_id" : 1} },
    { $limit : 1} ] )
```
Result:
```
{ "_id" : 1, "num_users" : 258 }
```
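
The two `$group` stages above first count documents per user and then count users per contribution total. A minimal Python sketch of the same idea on in-memory documents is given below; the function name `contribution_histogram` and the sample documents are made up for illustration. Note that the pipeline's ascending `$sort` with `$limit : 1` returns the smallest contribution count, which is 1 only when at least one user has a single document.

```python
from collections import Counter


def contribution_histogram(docs):
    """Map each contribution count to the number of users with that count."""
    # First grouping: number of documents per user.
    per_user = Counter(doc["created"]["user"] for doc in docs)
    # Second grouping: number of users per document count.
    return Counter(per_user.values())


# Made-up sample: user "a" has three documents, "b" and "c" have one each.
sample_docs = [
    {"created": {"user": "a"}},
    {"created": {"user": "a"}},
    {"created": {"user": "a"}},
    {"created": {"user": "b"}},
    {"created": {"user": "c"}},
]

histogram = contribution_histogram(sample_docs)
print(histogram[1])  # number of users appearing only once in the sample
```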

__Top unique data sources__

```
> db.manhattan.aggregate( [
    { "$match" : { "source" : { "$exists" : 1 } } },
    { "$group" : { "_id" : "$source", "count" : { "$sum" : 1 } } },
    { "$sort" : { "count" : -1 } },
    { "$limit" : 5 } ] )
```
Result:
```
{ "_id" : "Yahoo", "count" : 11 }
{ "_id" : "county_import_v0.1", "count" : 8 }
{ "_id" : "tiger:boundaries", "count" : 7 }
{ "_id" : "Microsoft Bing orbital imagery", "count" : 6 }
{ "_id" : "yahoo", "count" : 4 }
```
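
The result lists "Yahoo" and "yahoo" as separate sources. A small sketch of merging labels that differ only in letter case is shown below, using the five counts from the result as input; the helper name `merge_sources` is hypothetical, and whether these labels should really be treated as one source is an assumption.

```python
from collections import Counter


def merge_sources(counts):
    """Sum counts for source labels that differ only in letter case."""
    merged = Counter()
    for label, count in counts.items():
        merged[label.strip().casefold()] += count
    return merged


# Counts copied from the query result above.
top_sources = {
    "Yahoo": 11,
    "county_import_v0.1": 8,
    "tiger:boundaries": 7,
    "Microsoft Bing orbital imagery": 6,
    "yahoo": 4,
}

merged = merge_sources(top_sources)
print(merged.most_common(1))  # [('yahoo', 15)]
```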

### 3. Additional Ideas and MongoDB Queries


__Top 5 Contributors__

```
> db.manhattan.aggregate( [
        { $group : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
        { $sort : {"count" : -1} }, 
        { $limit : 5 } ] )
```
Results:
```
{ "_id" : "woodpeck_fixbot", "count" : 66873 }
{ "_id" : "Rub21_nycbuildings", "count" : 10953 }
{ "_id" : "KindredCoda", "count" : 2819 }
{ "_id" : "ingalls_nycbuildings", "count" : 2093 }
{ "_id" : "dufekin", "count" : 1792 }
```


__Variability in height (min, max, avg building heights)__  
After running 2parse.py, I noticed that 'height' is the second most frequent tag key. 

```
> db.manhattan.aggregate([{'$match': {'height': {'$gt': 0}}}, {'$group': {
            '_id': null,
            'count': {'$sum': 1},
            'min': {'$min': '$height'},
            'max': {'$max': '$height'},
            'avg': {'$avg': '$height'},
            'stdD': {'$stdDevPop': '$height'}
            }}])
```
Results:
```
```
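
The `$gt : 0` match above only selects numeric `height` values. OSM tag values are strings in the raw XML, so if the parser kept them as strings they would need converting first. The following Python sketch computes the same summary under that assumption; the helper names `parse_height` and `height_summary` and the sample documents are hypothetical, and unit suffixes such as "m" or "ft" are simply skipped rather than interpreted.

```python
import statistics


def parse_height(value):
    """Return a positive height as a float, or None when it cannot be parsed."""
    try:
        height = float(value)
    except (TypeError, ValueError):
        return None
    return height if height > 0 else None


def height_summary(docs):
    """Compute count, min, max, mean, and population std dev of heights."""
    parsed = (parse_height(doc.get("height")) for doc in docs)
    heights = [h for h in parsed if h is not None]
    if not heights:
        return None
    return {
        "count": len(heights),
        "min": min(heights),
        "max": max(heights),
        "avg": statistics.mean(heights),
        "stdD": statistics.pstdev(heights),  # population std dev, like $stdDevPop
    }


# Made-up sample: one string height, one numeric, one unparseable, one missing.
sample_docs = [
    {"height": "10"},
    {"height": 20.0},
    {"height": "tall"},
    {"name": "no height tag"},
]

summary = height_summary(sample_docs)
```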

__Timestamps__  
Buildings, streets, and the city in general develop and change at a rapid pace. After querying the timestamps and seeing that the majority of documents were added in 2009, I suggest that the dataset would benefit from more current data.

```
> db.manhattan.aggregate([{'$match': {'cuisine': {'$exists': 1}}},
                {'$group': {'_id': '$cuisine', 'count': {'$sum': 1}}},
                {'$sort': {'count': -1}}, {'$limit': 10}])
```
Results:
```
```
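
One way to tally documents by the year of their timestamp is sketched below in Python. It assumes each document stores an ISO 8601 string under `created.timestamp`, by analogy with the `created.user` field used in the queries above; that field name, the helper `documents_per_year`, and the sample documents are assumptions for illustration.

```python
from collections import Counter


def documents_per_year(docs):
    """Return (year, count) pairs sorted from most to least frequent."""
    years = Counter()
    for doc in docs:
        timestamp = doc.get("created", {}).get("timestamp")
        # Keep only strings that start with a four-digit year.
        if isinstance(timestamp, str) and timestamp[:4].isdigit() and len(timestamp) >= 4:
            years[timestamp[:4]] += 1
    return years.most_common()


# Made-up sample timestamps in the format OSM uses.
sample_docs = [
    {"created": {"timestamp": "2009-09-27T14:08:33Z"}},
    {"created": {"timestamp": "2009-01-01T00:00:00Z"}},
    {"created": {"timestamp": "2015-03-02T10:00:00Z"}},
    {"created": {}},
]

per_year = documents_per_year(sample_docs)
```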

__Most Common Amenities__

```
> db.manhattan.aggregate([{'$match': {'amenity': {'$exists': 1}}},
                {'$group': {'_id': '$amenity', 'count': {'$sum': 1}}},
                {'$sort': {'count': -1}}, {'$limit': 10}])
```
Results:
```
```

__Most Common Cuisines__

```
> db.manhattan.aggregate( [ 
    { $match : { "cuisine" : { $exists : 1} } },
    { $group : { "_id" : "$cuisine", "count" : { $sum : 1} } }, 
    { $sort : { "count" : -1 } },
    { $limit : 10} ] )
```
Results:
```
```

### Conclusion

After reviewing the data, I noticed that the bulk of the dataset is outdated, though I believe it has been well cleaned for the purposes of this exercise. It interests me that a fair amount of GPS data makes it into OpenStreetMap.org through users' efforts, whether by scripting a map-editing bot or otherwise. With a rough GPS data processor in place, working together with a more robust data processor similar to data.py, I think it would be possible to input a great amount of cleaned data to OpenStreetMap.org.