# OpenStreetMap Data Import and Analysis

# Intro

Data for Raleigh, North Carolina, was downloaded from Mapzen (https://mapzen.com/data/metro-extracts).

It was converted to JSON, with fields restructured and corrected/normalized along the way, then imported into MongoDB.
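As a rough illustration of the conversion step, the sketch below shapes one OSM XML element into a JSON-ready dictionary. The document layout is an assumption: top-level `amenity` and `name` keys match the queries used later, but the `address` sub-document, the `pos` field, the `shape_element` name, and the sample XML are hypothetical and not taken from the actual import script.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample element; a real extract would be read from the .osm file.
SAMPLE_OSM = """<osm>
  <node id="1" lat="35.78" lon="-78.64">
    <tag k="amenity" v="university"/>
    <tag k="name" v="Example University"/>
    <tag k="addr:street" v="Example Street"/>
    <tag k="addr:postcode" v="27604"/>
  </node>
</osm>"""


def shape_element(element):
    """Turn one <node> or <way> element into a dict (assumed layout)."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "id": element.get("id")}
    lat, lon = element.get("lat"), element.get("lon")
    if lat is not None and lon is not None:
        doc["pos"] = [float(lat), float(lon)]
    for tag in element.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key.startswith("addr:"):
            # Group address parts under one sub-document (an assumption).
            doc.setdefault("address", {})[key[len("addr:"):]] = value
        else:
            # Other tags become top-level keys, e.g. doc["amenity"].
            doc[key] = value
    return doc


root = ET.fromstring(SAMPLE_OSM)
docs = [shape_element(el) for el in root if el.tag in ("node", "way")]
print(json.dumps(docs[0], indent=2))
```

Corrections such as the street name and postal code fixes described below would be applied inside the tag loop before the value is stored.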

In [3]:
from pymongo import MongoClient
client = MongoClient()
db = client.osm
collection = db.raleigh

# Problems encountered

The data has multiple problems and areas for improvement; three issues stood out:

1. Inconsistent Street Names
2. Inconsistent Postal Codes
3. Questionable Validity of Education Amenity Tags

### Street Names

Aside from the expected standardization (such as "st" to "Street"), some nodes had street names that were easily correctable but highly specific. These required manual checking (for consistency with Google Maps) and one-off corrections:

* "Meadowmont Village CIrcle" becomes "Meadowmont Village Circle"
* "LaurelcherryStreet" becomes "Laurel Cherry Street"
* "Garrett Driver" becomes "Garrett Drive"

Others had street names that could not be verified:

* "Triangle Family Practice"
* Multiple similar names with no best choice:
    * "NC Highway 55 West"
    * "NC Highway 55"
    * "Highway West"
    * "Highway 55 West"
    * "Highway 55"
    * "US 55"
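The two kinds of street name fixes described above (general standardization plus one-off corrections) could be sketched roughly as follows. Only the "st" to "Street" mapping and the three one-off corrections come from the text; the other abbreviations and the `correct_street_name` name are assumptions for illustration. Names that could not be verified, such as the Highway 55 variants, are left unchanged.

```python
# Assumed abbreviation table; only "st" -> "Street" is named in the text.
ABBREVIATIONS = {
    "st": "Street",
    "st.": "Street",
    "ave": "Avenue",
    "rd": "Road",
    "dr": "Drive",
}

# One-off corrections that were checked manually.
ONE_OFF = {
    "Meadowmont Village CIrcle": "Meadowmont Village Circle",
    "LaurelcherryStreet": "Laurel Cherry Street",
    "Garrett Driver": "Garrett Drive",
}


def correct_street_name(name):
    """Apply a one-off fix if one exists, else expand a trailing abbreviation."""
    name = name.strip()
    if name in ONE_OFF:
        return ONE_OFF[name]
    words = name.split()
    if not words:
        return name
    # Only the last word is treated as the street type.
    replacement = ABBREVIATIONS.get(words[-1].lower())
    if replacement:
        words[-1] = replacement
    return " ".join(words)


print(correct_street_name("Main st"))         # Main Street
print(correct_street_name("Garrett Driver"))  # Garrett Drive
print(correct_street_name("NC Highway 55"))   # NC Highway 55 (unchanged)
```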


### Postal Codes

Postal codes were surprisingly clean after standardization to a 5-digit ZIP code. A significant number of entries had a more specific 9-digit postal code in the form #####-####. These were easily corrected with the following function during import:

In [5]:
import re
postcode_re = re.compile(r'^[0-9]{5}$')
extended_postcode_re = re.compile(r'^[0-9]{5}-[0-9]{4}$')

def correct_postcode(postcode):
    """Try to convert postcode to 5 digit int"""
    if extended_postcode_re.match(postcode): #strip extended postcode with "-####"
        postcode = postcode[0:5]
        return int(postcode)
    elif postcode_re.match(postcode): #normal 5 digit postcode
        return int(postcode)
    else:
        return None
    
#For example:
print(correct_postcode("27705")) #5-digit
print(correct_postcode("27612-5947")) #9-digit
print(correct_postcode("277030")) #invalid

27705
27612
None


Only 6 postal code entries (out of 6,570 total) were invalid after this correction:

* Two 9-digit codes written without a hyphen ("275198404", "275609194"), unlike the majority of entries
* Three with an incorrect total number of digits ("277030", "275199", "2612-6401")
* One ('NC') that was seemingly transposed from a "State" field.
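To see how entries fall into these groups, a small classifier along the following lines could tally the raw values by format. This is only a sketch: the `classify_postcode` name and the category labels are made up here, and the sample list contains just the values quoted in this section rather than the full data set.

```python
import re
from collections import Counter

# Patterns are checked in order; labels are illustrative.
PATTERNS = [
    ("5-digit", re.compile(r"^[0-9]{5}$")),
    ("9-digit with hyphen", re.compile(r"^[0-9]{5}-[0-9]{4}$")),
    ("9-digit without hyphen", re.compile(r"^[0-9]{9}$")),
    ("not numeric", re.compile(r"^[^0-9]*$")),
]


def classify_postcode(postcode):
    """Return a label describing the format of a raw postcode string."""
    for label, pattern in PATTERNS:
        if pattern.match(postcode):
            return label
    return "other"  # e.g. wrong number of digits


# Values quoted in this section.
samples = ["27705", "27612-5947", "277030", "275198404",
           "275609194", "275199", "2612-6401", "NC"]

tally = Counter(classify_postcode(p) for p in samples)
for label, count in tally.most_common():
    print(label, count)
```

A tally like this also suggests that the two unhyphenated 9-digit entries could be recovered by taking their first five digits, if that were desired.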

### Use of Amenity Tags for Education

After importing the data, I looked at the education-related amenity tags ('university', 'college', and 'school') to see how the major colleges and universities in the area are marked.

In [6]:
#Example Code: List of 'university' tags
pipeline = [
    {'$match':{'amenity':'university'}},
    {'$group':{'_id':'$name', 'count':{'$sum':1}}},
    {'$sort':{'count':-1}},
]
# PyMongo 3+ returns a cursor from aggregate(); PyMongo 2.x returned a dict with a 'result' key
for r in collection.aggregate(pipeline):
    print(r)

{'_id': 'Duke University East Campus', 'count': 3}
{'_id': 'Duke University Medical Center', 'count': 1}
{'_id': 'Campbell University: Norman Adrian Wiggins School of Law', 'count': 1}
{'_id': "St. Augustine's University", 'count': 1}
{'_id': 'Duke University Central Campus', 'count': 1}
{'_id': 'JC Raulston Arboretum at NC State University', 'count': 1}
{'_id': 'North Carolina Central University', 'count': 1}
{'_id': 'Duke University West Campus', 'count': 1}
{'_id': 'North Carolina State University (Centennial Campus)', 'count': 1}
{'_id': None, 'count': 1}
{'_id': 'William Peace University', 'count': 1}
{'_id': 'Campbell University RTP Campus', 'count': 1}


These tags seem to be infrequently used and are inconsistent when they are used. According to the specifications (https://wiki.openstreetmap.org/wiki/Map_Features#Education):

* "university" indicates a university campus.
* "college" indicates a college campus or building
* "school" indicates a school and grounds.

While some tags are as expected ("Duke University East Campus" appears multiple times as "university"), there are issues.

* There is no name for 1 "university", 28 "college", and 21 "school" entries.
* "Durham Tech Community College" appears as a "college" while "Durham Technical Community College" appears twice as a "school".
* "Duke University" is listed as a "school" while various campus regions are listed as "university".

I believe this inconsistency is at least partially attributable to the unclear documentation for these tags.
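One way such cases might be surfaced automatically is to compare names across different amenity tags and flag near-duplicates. The sketch below uses `difflib.SequenceMatcher` from the standard library on a handful of (tag, name) pairs taken from this section; the 0.85 threshold and the `similar_names_across_tags` name are arbitrary choices for illustration, not part of the original analysis.

```python
from difflib import SequenceMatcher
from itertools import combinations

# (amenity tag, name) pairs mentioned in this section.
entries = [
    ("college", "Durham Tech Community College"),
    ("school", "Durham Technical Community College"),
    ("school", "Duke University"),
    ("university", "Duke University East Campus"),
    ("university", "William Peace University"),
]


def similar_names_across_tags(entries, threshold=0.85):
    """Flag pairs of similar names that carry different amenity tags."""
    flagged = []
    for (tag_a, name_a), (tag_b, name_b) in combinations(entries, 2):
        if tag_a == tag_b:
            continue  # only cross-tag pairs are of interest here
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            flagged.append((name_a, tag_a, name_b, tag_b, round(score, 2)))
    return flagged


for match in similar_names_across_tags(entries):
    print(match)
```

With this threshold the two Durham Tech spellings are flagged, while "Duke University" versus its campus-specific entries is not; catching that case would need a different rule, such as a prefix check.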

# Overview of the Data

In [7]:
#filesize

In [8]:
#number of different kinds of tags

In [9]:
#visualization

# Additional Ideas

In [10]:
#Consistency Check of Street Names (Nodes) vs Tiger Data street names (Ways)

In [11]:
#Aggregation2

In [12]:
#Aggregation3

* Update time vs. location: a very large scatterplot; subsample?
* Types of data input by each user, e.g. whether the same user entered most of the ways
* Kinds of tags by node, way, and relation
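For the last idea, a minimal sketch of counting tag keys per element type directly from the OSM XML might look like this. The sample XML is made up, and the `tag_keys_by_element_type` name is hypothetical; on the real extract the function would be given the open file instead of an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Made-up sample; the real input would be the downloaded .osm file.
SAMPLE = b"""<osm>
  <node id="1"><tag k="amenity" v="school"/><tag k="name" v="Example School"/></node>
  <node id="2"><tag k="amenity" v="college"/></node>
  <way id="3"><nd ref="1"/><tag k="highway" v="residential"/><tag k="name" v="Example Street"/></way>
  <relation id="4"><member type="way" ref="3" role="outer"/><tag k="type" v="multipolygon"/></relation>
</osm>"""


def tag_keys_by_element_type(source):
    """Count tag keys separately for nodes, ways, and relations."""
    counts = defaultdict(Counter)
    # iterparse streams the file, which matters for large extracts.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            for tag in elem.findall("tag"):
                counts[elem.tag][tag.get("k")] += 1
            elem.clear()  # release children that are no longer needed
    return counts


counts = tag_keys_by_element_type(io.BytesIO(SAMPLE))
for element_type, keys in counts.items():
    print(element_type, keys.most_common())
```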

# Conclusion