# OpenStreetMap (OSM) Data Wrangling Project

## Map area 

For this project, I have chosen to clean and explore the OpenStreetMap data of Boston, where I currently reside:  
- https://www.openstreetmap.org/relation/2315704
- http://metro.teczno.com/#boston

In [1]:
import sqlite3
import pandas as pd 
import re
db = sqlite3.connect("OSMBoston.db")
c = db.cursor()

## Issues with map data

After downloading a sample of the OSM data, converting it to CSV files using data.py, and loading those files into a SQLite database, I noticed several issues with the map data that warrant attention: 
- Overabbreviated street names
- Inconsistent state names
- Inconsistent city names 
- Inconsistent and problematic postal codes 

### Overabbreviated street names  

In [7]:
QUERY = """
        SELECT value, COUNT(*) 
        FROM (SELECT * FROM way_tags UNION ALL SELECT * FROM node_tags) t
        WHERE type = 'addr' AND key = 'street' 
        GROUP BY value ORDER BY value;
        """
c.execute(QUERY)
rows = c.fetchall()
st_names = pd.DataFrame(rows, columns = ['st_name', 'count'])

st_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
st_types_good = ["Street", "Avenue", "Drive", "Square", "Broadway", "Place", 
                 "Park", "Center", "Road", "Way", "Boulevard", "Lane"]  
st_types_other = {}

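# Tally every trailing street type that is not on the expected list
for name in st_names['st_name']: 
    match = st_type_re.search(name)
    if match is None:  # no recognizable trailing token
        continue
    st_type = match.group() 
    if st_type not in st_types_good: 
        st_types_other[st_type] = st_types_other.get(st_type, 0) + 1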

st_types_other

{u'1100': 1,
 u'1702': 1,
 u'3': 1,
 u'303': 1,
 u'6': 1,
 u'846028': 1,
 u'Ave': 7,
 u'Ave.': 4,
 u'Boylston': 1,
 u'Cambrdige': 1,
 u'Federal': 1,
 u'Fenway': 1,
 u'Floor': 1,
 u'Hall': 1,
 u'Hampshire': 1,
 u'Highway': 1,
 u'Hwy': 1,
 u'LEVEL': 1,
 u'Lafayette': 1,
 u'Mall': 1,
 u'Newbury': 1,
 u'Pl': 1,
 u'Row': 1,
 u'ST': 1,
 u'South': 1,
 u'Sq.': 1,
 u'St': 19,
 u'St.': 13,
 u'Terrace': 1,
 u'Wharf': 2,
 u'Windsor': 1,
 u'Winsor': 1,
 u'ave': 1,
 u'floor': 2,
 u'st': 1,
 u'street': 1}
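
A minimal sketch of how these abbreviations could be expanded, assuming only the trailing token of each street name needs fixing. `ST_TYPE_MAPPING` and `clean_street_name` are hypothetical names, and the mapping covers only abbreviations visible in the output above; entries such as 'Cambrdige' or bare suite numbers would need separate handling.

```python
import re

# Matches the last whitespace-delimited token of a street name.
ST_TYPE_RE = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical mapping (an assumption), limited to abbreviations seen in the audit.
ST_TYPE_MAPPING = {
    "ave": "Avenue", "ave.": "Avenue",
    "hwy": "Highway",
    "pl": "Place",
    "sq.": "Square",
    "st": "Street", "st.": "Street", "street": "Street",
}

def clean_street_name(name):
    """Replace an abbreviated street type at the end of `name` with its full form."""
    match = ST_TYPE_RE.search(name)
    if match is None:
        return name
    full = ST_TYPE_MAPPING.get(match.group().lower())
    if full is None:
        return name  # unknown suffix: leave the name untouched
    return name[:match.start()] + full
```

For example, `clean_street_name("Massachusetts Ave.")` yields `"Massachusetts Avenue"`, while a name that already ends in an unmapped type, such as "Kendall Square", is returned unchanged.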

### Inconsistent state names

The state name of Massachusetts was inconsistently listed in the data surveyed. While 'MA' was the most common spelling, other entries listed it as 'MA- MASSACHUSETTS', 'Massachusetts', 'ma', and 'Ma'. We will clean these alternative spellings and standardize them as 'MA' across the board. 

In [8]:
QUERY = """
        SELECT value, COUNT(*) FROM 
        (SELECT * FROM way_tags UNION ALL SELECT * FROM node_tags) t
        WHERE type = 'addr' AND key = 'state'
        GROUP BY value ORDER BY COUNT(*) DESC
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['key', 'count'])

|   | key | count |
|---|---|---|
| 0 | MA | 665 |
| 1 | MA- MASSACHUSETTS | 62 |
| 2 | Massachusetts | 11 |
| 3 | ma | 2 |
| 4 | Ma | 1 |
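
The standardization described above could be sketched as follows. `STATE_VARIANTS` and `clean_state` are hypothetical names, and the set contains only the spellings observed in this query; any other value is passed through unchanged.

```python
# Spellings observed in the query output (lowercased); all mean Massachusetts.
STATE_VARIANTS = {"ma", "massachusetts", "ma- massachusetts"}

def clean_state(value):
    """Return 'MA' for any known Massachusetts spelling, else the value unchanged."""
    if value.strip().lower() in STATE_VARIANTS:
        return "MA"
    return value
```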


### Inconsistent city names 

In the same vein, cities are also inconsistently named in the map data. For instance, several entries list the state along with the city (e.g. 'Boston, MA'). We will clean these by stripping everything from the first comma onward. 

In [9]:
QUERY = """
        SELECT value, COUNT(*) 
        FROM (SELECT * FROM way_tags UNION ALL SELECT * FROM node_tags) t
        WHERE type = 'addr' AND key = 'city'
        GROUP BY value ORDER BY COUNT(*) DESC
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['key', 'count'])

|   | key | count |
|---|---|---|
| 0 | Cambridge | 347 |
| 1 | Boston | 260 |
| 2 | Somerville | 32 |
| 3 | Brookline | 9 |
| 4 | Boston, MA | 6 |
| 5 | Charlestown | 6 |
| 6 | Cambridge, MA | 5 |
| 7 | Cambridge, Massachusetts | 5 |
| 8 | Roxbury Crossing | 2 |
| 9 | South Boston | 2 |
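
A minimal sketch of the comma-stripping rule, assuming the city always precedes the first comma. `clean_city` is a hypothetical helper name.

```python
def clean_city(value):
    """Drop everything from the first comma onward, then trim whitespace."""
    return value.split(",", 1)[0].strip()
```

With this rule, 'Boston, MA' and 'Cambridge, Massachusetts' collapse to 'Boston' and 'Cambridge', while names without a comma are left as they are.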


### Problematic postal codes

Not surprisingly, postal codes are inconsistent as well. While most use the 5-digit convention, others use the full 9-digit (ZIP+4) convention. Worse, some entries include the state in the postal code. For the sake of standardization, we will remove non-numeric characters and keep only the first 5 digits as the ZIP code. 

In [93]:
QUERY = """
        SELECT tags.value, COUNT(*) AS count 
        FROM (SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags) tags
        WHERE tags.key = 'postcode' 
        GROUP BY tags.value ORDER BY count DESC LIMIT 30; 
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['zipcode', 'count'])

|   | zipcode | count |
|---|---|---|
| 0 | 02139 | 282 |
| 1 | 02114 | 61 |
| 2 | 02215 | 50 |
| 3 | 02116 | 42 |
| 4 | 02138 | 34 |
| 5 | 02142 | 32 |
| 6 | 02143 | 30 |
| 7 | 02210 | 24 |
| 8 | 02111 | 17 |
| 9 | 02141 | 16 |
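
The postal code rule could be sketched like this. `clean_postcode` is a hypothetical helper name, and returning `None` for values with fewer than five digits is an assumption about how unusable entries should be flagged, not something the rule itself specifies.

```python
import re

def clean_postcode(value):
    """Strip non-digits and keep the first five digits.

    Returns None when fewer than five digits remain (an assumption:
    such values cannot be turned into a 5-digit ZIP code).
    """
    digits = re.sub(r"\D", "", value)
    if len(digits) < 5:
        return None
    return digits[:5]
```

Under these assumptions, '02139-4307' and 'MA 02116' become '02139' and '02116' respectively.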


## Data Overview and Additional Ideas

This section contains basic statistics about the dataset, the SQL queries used to gather them, as well as some additional ideas about improving the dataset and deploying it for other purposes. 

### File Size 

In [123]:
file_size = {'boston.osm': '53.5MB', 'nodes_tags.csv': '2.6MB', 'nodes.csv': '17.5MB',
             'ways_nodes.csv': '6.3MB', 'ways_tags.csv': '3.6MB', 'ways.csv': '2.0MB',
             'OSMBoston.db': '34.9MB'}
file_size_df = pd.DataFrame.from_dict(file_size, orient='index')
file_size_df.rename(columns={0: 'file_size'}, inplace = True)
file_size_df

|   | file_size |
|---|---|
| boston.osm | 53.5MB |
| ways_nodes.csv | 6.3MB |
| nodes.csv | 17.5MB |
| OSMBoston.db | 34.9MB |
| ways.csv | 2.0MB |
| nodes_tags.csv | 2.6MB |
| ways_tags.csv | 3.6MB |


### Number of Nodes

In [69]:
QUERY = "SELECT COUNT(*) FROM nodes"
c.execute(QUERY)
print(c.fetchone()[0])

216691


### Number of Ways 

In [17]:
QUERY = "SELECT COUNT(DISTINCT id) FROM ways"
c.execute(QUERY)
print(c.fetchone()[0])

31551


### Number of Unique Users 

In [32]:
QUERY = """
        SELECT COUNT(DISTINCT uid) 
        FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) t  
        """
c.execute(QUERY)
print(c.fetchone()[0])

636


### Top Contributors

In [54]:
QUERY = """
        SELECT user, 
               COUNT(*), 
               -- denominator: all node and way records combined
               ROUND(CAST(COUNT(*) AS FLOAT) * 100 / 
                     (SELECT COUNT(*) FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) all_users), 1)   
        FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) t
        GROUP BY user ORDER BY COUNT(*) DESC LIMIT 10
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['user', 'contrib_count', 'contrib_percent'])

|   | user | contrib_count | contrib_percent |
|---|---|---|---|
| 0 | crschmidt | 117658 | 47.4 |
| 1 | wambag | 25220 | 10.2 |
| 2 | jremillard-massgis | 24180 | 9.7 |
| 3 | mapper999 | 14139 | 5.7 |
| 4 | morganwahl | 12139 | 4.9 |
| 5 | OceanVortex | 9253 | 3.7 |
| 6 | MassGIS Import | 3962 | 1.6 |
| 7 | JasonWoof | 3663 | 1.5 |
| 8 | Ahlzen | 2396 | 1.0 |
| 9 | fiveisalive | 2145 | 0.9 |


### Amenity Types 

The table below shows the top amenity categories tagged in OpenStreetMap. Not surprisingly, 'restaurant' is the most common category. However, related categories such as 'cafe', 'fast_food', 'pub', and 'bar' are listed separately. One could argue that there is little distinction among these categories (especially between 'pub' and 'bar'), so they should all fall under a broader 'restaurant & bar' category, whether as a single category or as an umbrella one (similar to 'addr' in 'addr:street'). 

This would help users mining for all F&B establishments in an area. Instead of painstakingly combing through every category to avoid missing a relevant one, they could simply search for tags within that one top-level category. 

Making that change shouldn't be overly complicated (remapping existing tag values and restricting the amenity types users can pick going forward), but it raises questions about how far amenity types (and other tag types) should be aggregated and subsumed into larger categories, which may have broader implications for the data structure of OSM. 

In [77]:
QUERY = """
        SELECT value, COUNT(*) FROM node_tags WHERE key = 'amenity' 
        GROUP BY value ORDER BY COUNT(*) DESC LIMIT 15
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['value', 'count'], index=range(1,16))

|   | value | count |
|---|---|---|
| 1 | restaurant | 291 |
| 2 | bench | 228 |
| 3 | bicycle_parking | 152 |
| 4 | library | 140 |
| 5 | school | 130 |
| 6 | cafe | 124 |
| 7 | bicycle_rental | 99 |
| 8 | place_of_worship | 88 |
| 9 | fast_food | 87 |
| 10 | fountain | 57 |
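
To illustrate the umbrella-category idea, here is a rough sketch that rolls related amenity values into one group before counting. The `food_and_beverage` label, the membership of `FOOD_AND_BEVERAGE`, and both helper names are assumptions made for illustration; they are not existing OSM tags. The sample rows reuse counts from the table above.

```python
from collections import Counter

# Hypothetical grouping: which amenity values belong together is an assumption.
FOOD_AND_BEVERAGE = {"restaurant", "cafe", "fast_food", "pub", "bar"}

def umbrella_category(amenity):
    """Map an amenity value to its umbrella category, or return it unchanged."""
    return "food_and_beverage" if amenity in FOOD_AND_BEVERAGE else amenity

def aggregate(rows):
    """Sum (amenity, count) pairs by umbrella category."""
    totals = Counter()
    for amenity, count in rows:
        totals[umbrella_category(amenity)] += count
    return totals

# A few rows taken from the amenity table above.
rows = [("restaurant", 291), ("bench", 228), ("cafe", 124), ("fast_food", 87)]
totals = aggregate(rows)
```

Here `totals["food_and_beverage"]` is 502 (291 + 124 + 87), so a single lookup covers what would otherwise take three separate category searches.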


### Cuisine Types 

With that in mind, let's explore the cuisine types of all the nodes tagged as 'restaurant'. Usual suspects like 'pizza' and 'american' rank among the top cuisine types, but it was somewhat surprising to see 'mexican' at the top, as Boston doesn't have a particularly strong Mexican influence. I would have expected 'italian' to rank higher given the city's heritage, although with counts this small (16 versus 13) the ordering is not very robust. 

In [88]:
QUERY = """
        SELECT value, COUNT(*) FROM node_tags 
        WHERE id IN (SELECT id FROM node_tags WHERE key = 'amenity' and value = 'restaurant') 
              AND key = 'cuisine'
        GROUP BY value ORDER BY COUNT(*) DESC LIMIT 10
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['value', 'count'], index=range(1,11))

|   | value | count |
|---|---|---|
| 1 | mexican | 16 |
| 2 | pizza | 15 |
| 3 | american | 13 |
| 4 | italian | 13 |
| 5 | chinese | 11 |
| 6 | indian | 11 |
| 7 | thai | 9 |
| 8 | japanese | 8 |
| 9 | asian | 7 |
| 10 | international | 7 |


## Conclusion 