# OpenStreetMap Data Wrangling for Hampton Roads Virginia
Paulo Black  
Map Area: Hampton Roads, Virginia, U.S.A.

A link to the area covered:  
https://www.openstreetmap.org/search?query=hampton%20roads#map=9/36.9707/-76.1284  
A link to download the OSM file I used:  
https://s3.amazonaws.com/metro-extracts.mapzen.com/hampton-roads_virginia.osm.bz2


This notebook examines geographical data for the Hampton Roads, Virginia (HRVA) area, taken from OpenStreetMap's XML export tool. I cleaned the data in Python, imported it into MongoDB, and ran a number of simple analyses on it. I began by testing a sample area much smaller than the total area to see whether any syntax or format issues would require wrangling. After uploading the data to MongoDB I spent some time familiarizing myself with the dataset, finding basic statistics, and then looking at a few more specific stats. Some of my findings are listed below!

I grew up in HRVA and used to live in the sample area where I did the provisional data wrangling. It was really good fun getting to see the names of all the old places I used to hang out as a kid while I did the exercise!

## Contents  
[Problems requiring wrangling](#problems)   
[Data Overview](#overview)  
[General information from the data set](#geninf)  
[Analysis of users](#users)  
[Conclusion](#conclusion)


## Problems requiring wrangling
<a id='problems'></a>

Fortunately for me, there were relatively few errors or formatting mismatches in the data. I expected the postcodes and street names to be a mess, but as we will see later, the majority of contributed user information comes from relatively few bots whose owners have clearly taken time to communicate with one another. I did choose to reformat several of the entries into smaller subdictionaries for the sake of the exercise, such as adding all address information to a subdictionary 'address' and putting latitude and longitude coordinates together in an array.
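
As a rough illustration of that reshaping, here is a minimal sketch and not the exact code used for this project. The field names `address`, `created` and `type` match the MongoDB queries later in this notebook; the name `pos` for the coordinate array, the `[lat, lon]` order, and the function name `shape_element` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Attributes that describe who created the element and when (grouped under 'created').
CREATED_KEYS = ("user", "uid", "version", "changeset", "timestamp")


def shape_element(element):
    """Turn one OSM <node> or <way> element into a nested dictionary."""
    if element.tag not in ("node", "way"):
        return None

    attrs = element.attrib
    doc = {"type": element.tag}
    if "id" in attrs:
        doc["id"] = attrs["id"]

    # Latitude and longitude are stored together in one array (assumed name: 'pos').
    if "lat" in attrs and "lon" in attrs:
        doc["pos"] = [float(attrs["lat"]), float(attrs["lon"])]

    created = {key: attrs[key] for key in CREATED_KEYS if key in attrs}
    if created:
        doc["created"] = created

    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if key.startswith("addr:"):
            sub_key = key[len("addr:"):]
            if ":" in sub_key:
                continue  # skip deeper keys such as addr:street:name
            doc.setdefault("address", {})[sub_key] = value
        else:
            # setdefault keeps fields set above (e.g. 'type') from being overwritten.
            doc.setdefault(key, value)
    return doc
```

Calling `shape_element` on a parsed `<node>` with an `addr:city` tag yields a document with `address.city` set, which is the shape the city and postcode queries below rely on.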

When dealing with the addresses, however, I came across the prefix "gnis" in more than half of the entries, which initially threw me for a loop. After some quick googling I learned that this refers to the USGS Geographic Names Information System, which all but a minority of the bots must have drawn on frequently. I added some steps to the wrangling process to deal with these entries.
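
The exact handling is not spelled out here, so the following is only one plausible approach: collect every `gnis:`-prefixed key into its own sub-dictionary so it stays out of the way of the address fields. The function name `split_gnis_tags` and the sample tag values are hypothetical.

```python
def split_gnis_tags(tags):
    """Move 'gnis:'-prefixed keys into a nested 'gnis' dictionary.

    `tags` is a flat mapping of OSM tag keys to values. Keys without the
    prefix are returned unchanged.
    """
    gnis = {}
    shaped = {}
    for key, value in tags.items():
        prefix, _, rest = key.partition(":")
        if prefix == "gnis" and rest:
            gnis[rest] = value
        else:
            shaped[key] = value
    if gnis:
        shaped["gnis"] = gnis
    return shaped


# Made-up sample input, for illustration only.
sample = {"gnis:feature_id": "123", "name": "Example Park"}
print(split_gnis_tags(sample))
```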

Unfortunately, there was also one major formatting error, specifically in the Cities dictionary. Several cities appeared in up to three different forms, for example Chesapeake, CHESAPEAKE and Chesapeake (city). 

Here are a few examples (I've excluded some intermediate entries for clarity):  
{ "_id" : "Virginia Beach", "count" : 121589 }   
{ "_id" : "Newport News", "count" : 51198 }    
{ "_id" : "Chesapeake", "count" : 46033 }  
{ "_id" : "CHESAPEAKE", "count" : 260 }  
{ "_id" : "Norfolk", "count" : 154 }  
{ "_id" : "Virginia Beach (city)", "count" : 150 }  
{ "_id" : "Suffolk (city)", "count" : 122 }  
{ "_id" : "Norfolk (city)", "count" : 107 }  
{ "_id" : "Chesapeake (city)", "count" : 81 }  

This was readily dealt with through some more wrangling, and we can see a much nicer final printout:

{ "_id" : "Virginia Beach", "count" : 121739 }  
{ "_id" : "Hampton", "count" : 57937 }  
{ "_id" : "Newport News", "count" : 51247 }  
{ "_id" : "Chesapeake", "count" : 46374 }  
{ "_id" : "Suffolk", "count" : 28781 }  
{ "_id" : "York County", "count" : 17321 }  
{ "_id" : "Poquoson", "count" : 5979 }  
{ "_id" : "Norfolk", "count" : 279 }  
{ "_id" : "Scappoose", "count" : 176 }  
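
The kind of cleanup that merges those variants can be sketched as follows. This is an illustrative version, not necessarily the code used here: it strips a trailing "(city)" and converts all-uppercase names to title case, which covers the three forms listed above. The helper name `clean_city` is an assumption.

```python
import re

# Matches a trailing "(city)" suffix, e.g. "Chesapeake (city)".
CITY_SUFFIX_RE = re.compile(r"\s*\(city\)\s*$", re.IGNORECASE)


def clean_city(name):
    """Normalize a city name to a single canonical spelling."""
    cleaned = CITY_SUFFIX_RE.sub("", name.strip())
    # Only re-case names that are entirely uppercase, so mixed-case
    # names pass through untouched.
    if cleaned.isupper():
        cleaned = cleaned.title()
    return cleaned


for raw in ["Chesapeake", "CHESAPEAKE", "Chesapeake (city)"]:
    print(raw, "->", clean_city(raw))
```

All three inputs map to "Chesapeake", so the grouped counts collapse into one entry per city.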

I expected the street names to be in a horrific state based on the amount of time the Udacity lessons spent covering street name auditing. I was over the moon to find that there wasn't a single weirdly formatted street name, whether I sorted in ascending or descending order, in either my provisional data or my larger data set.
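
For readers who want to run this kind of check programmatically rather than by sorting, a street-type audit in the style of the Udacity lessons might look like the sketch below. The list of expected street types and the sample street names are assumptions for illustration, not values taken from the HRVA data.

```python
import re
from collections import defaultdict

# Captures the last word of a street name, including a trailing period ("St.").
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Assumed whitelist of acceptable street types; extend as needed.
EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Road", "Lane", "Parkway", "Highway", "Circle", "Way", "Trail"}


def audit_street_names(street_names):
    """Group street names by any street type not found in EXPECTED."""
    unexpected = defaultdict(set)
    for name in street_names:
        match = STREET_TYPE_RE.search(name.strip())
        if match and match.group() not in EXPECTED:
            unexpected[match.group()].add(name)
    return dict(unexpected)


# Made-up sample input, for illustration only.
print(audit_street_names(["Main Street", "Ocean Drive", "First Ave", "Second St."]))
```

An empty result means every name ended in an expected street type, which is consistent with what I saw when sorting by hand.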


## Data Overview
<a id='overview'></a>

Some basic info from the dataset that I pulled with MongoDB:

In [None]:
#File Sizes
$ ls -lh
-rw-r--r--@ 1 pauloblack  staff   1.0G Apr  5 11:43 hrva.osm
-rw-r--r--  1 pauloblack  staff   1.1G Apr  5 14:40 hrva.osm.json
#I picked up a fatty! I did not use pretty print in the JSON file so it is comparably large.    
    
#Number of documents:
> db.hrva.find().count()
5041659

#Number of nodes
> db.hrva.find({'type':'node'}).count()
4470090

#Number of ways
> db.hrva.find({'type':'way'}).count()
571569

#Number of unique users
> db.hrva.distinct('created.user').length
6358

#Number of users with only one entry
> db.hrva.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, 
                     {"$group":{"_id":"$count", "num_users":{"$sum":1}}}, {"$sort":{"_id":1}}, {"$limit":1}])
{ "_id" : 1, "num_users" : 2184 }
#Quite a few!



In [None]:
#Number of documents per city, most common first
> db.hrva.aggregate([{"$match":{"address.city":{"$exists":1}}}, {"$group":{"_id":"$address.city", 
                    "count":{"$sum":1}}}, {"$sort":{"count":-1}}])

## General information from the data set
<a id='geninf'></a>

Most prolific users:

In [3]:
> db.hrva.aggregate([{'$group':{'_id':'$created.user', 'count' : {'$sum': 1}}},{'$sort' : {'count' : -1}}])

{ "_id" : "Jonah Adkins", "count" : 1725557 }  
{ "_id" : "jonahadkins_vabeach_imports", "count" : 1326536 }  
{ "_id" : "jonahadkins_hampton_imports", "count" : 520208 }  
{ "_id" : "woodpeck_fixbot", "count" : 319978 }  
{ "_id" : "jonahadkins_suffolk_import", "count" : 319893 }  
{ "_id" : "jonahadkins_yorkcounty_import", "count" : 185658 }  
{ "_id" : "CynicalDooDad", "count" : 106179 }  
{ "_id" : "jumbanho", "count" : 62708 }  
{ "_id" : "RoadGeek_MD99", "count" : 61092 }  

Most populated zip codes. I used to live in 23465! Glad to see my neighbors repping!

In [None]:
> db.hrva.aggregate([{'$group':{'_id':'$address.postcode', 'count' : {'$sum': 1}}},{'$sort' : {'count' : -1}}])

{ "_id" : "23464", "count" : 22339 }  
{ "_id" : "23666", "count" : 18569 }  
{ "_id" : "23452", "count" : 18291 }  
{ "_id" : "23456", "count" : 18127 }  
{ "_id" : "23454", "count" : 17965 }  
{ "_id" : "23669", "count" : 17467 }  
{ "_id" : "23434", "count" : 16366 }  

Most popular cuisines. Standard East Coast fare.

In [None]:
> db.hrva.aggregate([{"$match":{"amenity":"restaurant"}}, {"$group":{"_id":"$cuisine",
                    "count":{"$sum":1}}}, {"$sort":{"count":-1}}])

{ "_id" : "american", "count" : 19 }  
{ "_id" : "pizza", "count" : 15 }  
{ "_id" : "italian", "count" : 12 }  
{ "_id" : "seafood", "count" : 12 }  
{ "_id" : "mexican", "count" : 12 }  

Most popular overall amenities. Gotta put your car somewhere.

In [None]:
db.hrva.aggregate([{"$match":{"amenity":{"$exists":1}}}, {"$group":{"_id":"$amenity",
                    "count":{"$sum":1}}}, {"$sort":{"count":-1}}])

{ "_id" : "parking", "count" : 5877 }  
{ "_id" : "place_of_worship", "count" : 1161 }  
{ "_id" : "school", "count" : 694 }  
{ "_id" : "restaurant", "count" : 456 }  
{ "_id" : "fast_food", "count" : 352 }  
{ "_id" : "fuel", "count" : 264 }  
{ "_id" : "bank", "count" : 146 }  
{ "_id" : "fountain", "count" : 93 }  
{ "_id" : "grave_yard", "count" : 69 }  
{ "_id" : "pharmacy", "count" : 64 }  

More graveyards than pharmacies, more places of worship than schools... time to get out of Dodge...

## Analysis of Users
<a id='users'></a>

Following tradition, I poked around with the user statistics.

In [None]:
#Top user Jonah Adkins contributed 34% of all info. But that's not the whole story: jonahadkins shows up in several
#other entries for different cities within the region. If we consider all of them together we see that all bots 
#(presumably) under Jonah Adkins contributed 81%. A whopper!

#Considering the jonahadkins entries only make up 50% of the top 10 out of 6358 users, it's fair to assert that 
#we'll find the usual distribution of power in the 1%

#Top 1% of users (64) (this one was a doozy to figure out)
> db.hrva.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, 
                     {"$sort":{"count":-1}}, {"$limit":64}, {'$group':{'_id':null,"total":{'$sum':'$count'}}}])
{ "_id" : null, "total" : 4932661 }
#So 97.8% of the datawealth is input by the automated 1% elite. Not quite up to Occupy standards but still
#enough to spur us organic masses to action lest our cartographers be run out of business by these perfidious 
#data-wrangle-ready robots.
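
The percentages quoted in this section can be recomputed directly from the counts printed in this notebook. The short script below is a sanity check rather than part of the original analysis. It assumes every account whose name starts with "jonahadkins" (ignoring case and spaces) belongs to Jonah Adkins.

```python
import math

# Counts copied from the query results shown in this notebook.
TOTAL_DOCUMENTS = 5041659
UNIQUE_USERS = 6358
TOP_ONE_PERCENT_TOTAL = 4932661
TOP_USERS = {
    "Jonah Adkins": 1725557,
    "jonahadkins_vabeach_imports": 1326536,
    "jonahadkins_hampton_imports": 520208,
    "woodpeck_fixbot": 319978,
    "jonahadkins_suffolk_import": 319893,
    "jonahadkins_yorkcounty_import": 185658,
    "CynicalDooDad": 106179,
    "jumbanho": 62708,
    "RoadGeek_MD99": 61092,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def is_jonah_account(user):
    # Assumption: the name prefix identifies accounts run by the same person.
    return user.lower().replace(" ", "").startswith("jonahadkins")


top_user_share = percent(TOP_USERS["Jonah Adkins"], TOTAL_DOCUMENTS)
jonah_total = sum(count for user, count in TOP_USERS.items() if is_jonah_account(user))
jonah_share = percent(jonah_total, TOTAL_DOCUMENTS)
one_percent_size = math.ceil(UNIQUE_USERS / 100)
one_percent_share = percent(TOP_ONE_PERCENT_TOTAL, TOTAL_DOCUMENTS)

print(top_user_share, jonah_share, one_percent_size, one_percent_share)
```

With these inputs the top account works out to about 34.2% of all documents, the combined jonahadkins accounts to about 80.9%, and the top 64 users to about 97.8%.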

## Conclusion
<a id='conclusion'></a>
I'm pleasantly surprised with how thoroughly the HRVA OSM data is maintained. The entries seem to be routine and relatively homogeneous, all things considered. That said, I think I've cleaned the data and examined some basic statistics in such a fashion that the data is left open-ended for more rigorous analysis if desired. I did notice that a few of my favorite new restaurants are missing, and Norfolk is significantly under-represented for its size. I'd chalk this up to the main contributors all seeming to be bots with references to cities other than Norfolk in their names, and presumably the bulk of the data collection was done quite some time ago.

Finally, in the example document provided for the Udacity project, I noticed the top two contributors for Charlotte, NC were 'jumbanho' and 'woodpeck_fixbot'. I imagine they are region-wide bots and probably cover an area even larger than the 300 or so miles between Charlotte and HRVA. I'm glad our local hero Jonah Adkins managed to outdo those outsiders and keep HRVA pride intact!