# Project 3: Data Wrangling with MongoDB

* [Map Coordinates Link](https://www.openstreetmap.org/relation/162069)
* [Map Overpass Download Link (Auto-Download)](http://overpass-api.de/api/map?bbox=-77.3637,38.7337,-76.6660,39.0533)

## 0. Contents
1. [Problems Found in Map Data](#Problems)
2. [Data Overview](#Overview)
3. [Other Observations](#Observations)

<a id='Problems'></a>

## 1. Problems Found in Map Data

An initial audit of the DC data revealed three systematic issues:

* Abbreviated / directional street types
* Nodes with "type" attribute overwriting node type
* Some "name" elements contain street name

### Abbreviated / directional street types

Abbreviated street types were fixed during the conversion to JSON. Washington DC street names commonly end with a direction (for example, a quadrant such as NW). To maintain the *name-street_type-direction* format, I had to modify my function for updating street names to account for the trailing direction.
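The idea can be sketched as follows. This is a minimal illustration, not the project's actual function: the name `update_street_name`, the `STREET_TYPE_MAPPING` table, and the `DIRECTIONS` set are all assumptions made for the example.

```python
# Hypothetical abbreviation table; the mapping used in the project is not shown.
STREET_TYPE_MAPPING = {
    "St": "Street", "St.": "Street",
    "Ave": "Avenue", "Ave.": "Avenue",
    "Rd": "Road", "Rd.": "Road",
    "Dr": "Drive", "Ct": "Court",
}

# Assumed set of trailing directions (DC quadrants plus single letters).
DIRECTIONS = {"NW", "NE", "SW", "SE", "N", "S", "E", "W"}


def update_street_name(name, mapping=STREET_TYPE_MAPPING):
    """Expand an abbreviated street type, keeping a trailing direction last."""
    words = name.split()
    if not words:
        return name
    direction = []
    # Set a trailing direction aside so the street type becomes the last word.
    if len(words) > 1 and words[-1].upper().replace(".", "") in DIRECTIONS:
        direction = [words.pop().upper().replace(".", "")]
    # Expand the street type only if it is a known abbreviation.
    words[-1] = mapping.get(words[-1], words[-1])
    return " ".join(words + direction)


print(update_street_name("Pennsylvania Ave NW"))  # Pennsylvania Avenue NW
```

Names without a direction, such as "K St", go through the same path and only have their street type expanded.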

### Nodes with "type" attribute overwriting node type

Several elements already had an attribute called "type", which collides with the "type" field that records whether a document is a node or a way. After running a query on "type", I could see that this attribute was commonly used to describe a place.

```
db.sample.aggregate([{$match: {type: {$exists: true}}}, {$group: {_id: '$type', count: {$sum: 1}}}])

{ "_id" : "node", "count" : 26425 }
{ "_id" : "national", "count" : 1 }
{ "_id" : "statue", "count" : 2 }
{ "_id" : "newspaper", "count" : 1 }
{ "_id" : "way", "count" : 2948 }
{ "_id" : "Coffee Shop", "count" : 1 }
```

I renamed this attribute to "place_type" so that "type" can only be *way* or *node*.
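A minimal sketch of how such a rename could be applied while building each document is shown below. The function name `shape_tags` and its arguments are hypothetical; the project's actual conversion code is not shown here.

```python
def shape_tags(element_type, tags):
    """Build a document from an element type ('node' or 'way') and its tags.

    A tag named "type" is stored as "place_type" so that it cannot
    overwrite the element type.
    """
    doc = {"type": element_type}
    for key, value in tags.items():
        if key == "type":
            doc["place_type"] = value
        else:
            doc[key] = value
    return doc


print(shape_tags("node", {"type": "statue", "name": "Example"}))
# {'type': 'node', 'place_type': 'statue', 'name': 'Example'}
```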

### Some "name" elements contain street name

The "name" attribute has been used to hold the street name for some ways.

```
db.washdc.aggregate( [ { "$match": { "name": { "$exists": 1 } } },
                               { "$match": { "type": "way" } },
                               { "$group": { "_id": "$name_type", "count": { "$sum": 1 } } },
                               { "$sort": {"count": -1}},
                               { "$limit": 5 }] )
                           
{'count': 22239, '_id': None}
{'count': 7490, '_id': 'St'}
{'count': 5376, '_id': 'Dr'}
{'count': 5270, '_id': 'Ct'}
{'count': 4923, '_id': 'Rd'}
```

In these cases, "name_type" holds the street type suffix taken from "name".
This makes the use of the "name" key inconsistent, but using "name" this way does not violate OSM conventions, so I did not modify these documents.

<a id='Overview'></a>

## 2. Data Overview

#### Basic data
```
washdc.osm | 758 MB  
washdc.osm.json | 828 MB
```

#### Number of documents
```
> db.washdc.find().count()
3821100
```

#### Number of unique users
```
> len(db.washdc.distinct( "created.user" ))
2300
```
#### Number of nodes
```
> db.washdc.find({"type": "node"}).count()
3430696
```
#### Number of ways
```
> db.washdc.find({ "type": "way"}).count()
390404
```

#### Number of amenities
```
> db.washdc.find({"amenity": {"$exists": 1}}).count()
24193
```

#### Five most frequent amenities
```
> db.washdc.aggregate([{"$match": {"amenity": {"$exists": 1}}},
                                  {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
                                  {"$sort": {"count": -1}},
                                  {"$limit": 5}])
{'_id': 'parking', 'count': 10695}
{'_id': 'restaurant', 'count': 1985}
{'_id': 'school', 'count': 1883}
{'_id': 'place_of_worship', 'count': 1706}
{'_id': 'fast_food', 'count': 805}
```                                 
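The "most frequent values" queries in this report share one shape: match documents where a field exists, group by that field, sort by count, and limit. A small helper can build that pipeline; the name `top_values_pipeline` and its parameters are hypothetical and are not part of the original project code.

```python
def top_values_pipeline(field, limit=5, doc_type=None):
    """Return an aggregation pipeline for the most frequent values of a field."""
    match = {field: {"$exists": True}}
    if doc_type is not None:
        # Optionally restrict to one element type, e.g. "node" or "way".
        match["type"] = doc_type
    return [
        {"$match": match},
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


print(top_values_pipeline("amenity"))
```

Assuming `db` is a pymongo database handle, the result can be passed directly to `db.washdc.aggregate(top_values_pipeline("amenity"))`.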

<a id='Observations'></a>

## 3. Other Observations

OpenStreetMap has a community-driven, loosely enforced set of conventions for tagging physical features of locations and ways. These conventions are only suggestions, which leaves room for creating new conventions or classifying niche locations. This is both a feature and a drawback of OSM. If a limited standard for common keys were enforced, issues with keys such as "type" or "name" would be less likely to occur and easier to fix.

For example, these are the most common "place_type" values ("type" was renamed to "place_type", as explained in section 1):
```
db.washdc.aggregate([{"$match": {"place_type": {"$exists": 1}}},
                                {"$group": {"_id": "$place_type", "count": {"$sum": 1}}},
                                {"$sort": {"count": -1}},
                                {"$limit": 5}])
{'_id': 'pillar', 'count': 296}
{'_id': 'communication', 'count': 140}
{'_id': 'ADDRESS', 'count': 112}
{'_id': 'address', 'count': 62}
{'_id': 'apartment', 'count': 30}
```

174 nodes had a key:value pair of "type":"address" (112 as "ADDRESS" and 62 as "address"). That is a very small percentage of all nodes in the dataset, but such inconsistencies create noise. In this particular example, "pillar" and "apartment" are descriptions that are lost for 326 nodes. Fixing these issues would be very time-consuming, and the result would still be open to subjective ideas about the best characterization terms.
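As a small illustration of the case inconsistency, the counts from the query output above can be merged by lowercasing the values. This is only a sketch of one possible cleanup step, not something applied in the project; `merge_case_variants` is a hypothetical helper, and the node total is the figure reported in section 2.

```python
from collections import Counter

# Counts copied from the "place_type" query output above.
place_type_counts = {
    "pillar": 296,
    "communication": 140,
    "ADDRESS": 112,
    "address": 62,
    "apartment": 30,
}


def merge_case_variants(counts):
    """Sum counts for values that differ only by letter case."""
    merged = Counter()
    for value, count in counts.items():
        merged[value.lower()] += count
    return merged


merged = merge_case_variants(place_type_counts)
print(merged["address"])  # 174

# Share of all nodes, using the node count from section 2.
total_nodes = 3430696
print(f"{merged['address'] / total_nodes:.4%}")
```

Lowercasing handles only case variants; deciding whether values such as "pillar" belong under a different key would still require manual judgment.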