# OpenStreetMap (OSM) Data Wrangling Project

## Map area 

For this project, I have chosen to clean and explore the OpenStreetMap data of Boston, where I currently reside:  
- https://www.openstreetmap.org/relation/2315704
- http://metro.teczno.com/#boston

In [118]:
import sqlite3
import pandas as pd 
import re
db = sqlite3.connect("OSMBoston.db")
c = db.cursor()

## Issues with map data

After downloading a sample of the OSM data, converting it to CSV files using data.py, and loading those files into a SQLite database, I noticed several issues with the map data that warrant attention:
- Inconsistent state names
- Overabbreviated street names
- Issue 3 
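
The loading step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual import script: the column layout (`id`, `key`, `value`, `type`) is an assumption inferred from the queries in this notebook, and the helper name `load_tags_csv` and the two demo rows are hypothetical.

```python
import csv
import io
import sqlite3

def load_tags_csv(conn, table, csv_file):
    """Create `table` if needed and bulk-insert the rows of an open CSV file.

    Assumes the CSV has the columns id, key, value, type (an assumption,
    not confirmed by data.py).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS {} "
        "(id INTEGER, key TEXT, value TEXT, type TEXT)".format(table)
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["id"], r["key"], r["value"], r["type"]) for r in reader]
    conn.executemany("INSERT INTO {} VALUES (?, ?, ?, ?)".format(table), rows)
    conn.commit()
    return len(rows)

# Demo: an in-memory database and a two-row stand-in for the real CSV file.
demo_conn = sqlite3.connect(":memory:")
demo_csv = io.StringIO("id,key,value,type\n1,state,MA,addr\n2,state,ma,addr\n")
n_loaded = load_tags_csv(demo_conn, "way_tags", demo_csv)
```

With a real file, the same helper could be called with `open(path, newline="")` in place of the `io.StringIO` stand-in.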

### Inconsistent state names

The state name of Massachusetts was inconsistently listed in the data surveyed. While 'MA' was the most common spelling, other entries listed it as 'MA- MASSACHUSETTS', 'Massachusetts', 'ma', and 'Ma'. We will clean these alternative spellings and standardize them as 'MA' across the board. 

In [160]:
QUERY = """
        SELECT value, COUNT(*) FROM way_tags WHERE type = 'addr' AND key = 'state'
        GROUP BY value ORDER BY COUNT(*) DESC
        """
c.execute(QUERY)
rows = c.fetchall()
pd.DataFrame(rows, columns=['value', 'count'])

| | value | count |
|---|---|---|
| 0 | MA | 225 |
| 1 | MA- MASSACHUSETTS | 33 |
| 2 | Massachusetts | 3 |
| 3 | ma | 2 |
| 4 | Ma | 1 |
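
One possible way to standardize these spellings is sketched below. It is only an illustration under stated assumptions: the helper name `clean_state` and the set `STATE_VARIANTS` are hypothetical, and the variants listed are just the ones observed in the sample above.

```python
# Lower-cased spellings of Massachusetts seen in the audit above.
STATE_VARIANTS = {"ma", "ma- massachusetts", "massachusetts"}

def clean_state(value):
    """Return 'MA' for any known spelling of Massachusetts, else the value unchanged."""
    if value.strip().lower() in STATE_VARIANTS:
        return "MA"
    return value

# The five spellings found in way_tags.
observed = ["MA", "MA- MASSACHUSETTS", "Massachusetts", "ma", "Ma"]
cleaned = [clean_state(v) for v in observed]
```

Values that are not recognized as a Massachusetts spelling are returned untouched, so entries for other states would not be overwritten by mistake.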


### Overabbreviated street names  

In [119]:
QUERY = "SELECT value, COUNT(*) FROM way_tags WHERE key = 'name' GROUP BY value ORDER BY COUNT(*) DESC;"
c.execute(QUERY)
rows = c.fetchall()
street_names = pd.DataFrame(rows, columns = ['street_name', 'count'])
street_names.head(10)

| | street_name | count |
|---|---|---|
| 0 | Boston HarborWalk | 52 |
| 1 | Massachusetts Avenue | 48 |
| 2 | Washington Street | 48 |
| 3 | Green Line | 44 |
| 4 | Boston University | 42 |
| 5 | Cambridge Street | 32 |
| 6 | Beacon Street | 31 |
| 7 | Massachusetts Turnpike | 30 |
| 8 | Orange Line | 27 |
| 9 | Boylston Street | 26 |


In [144]:
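good_st_types = []   # expected street types; left empty here, so every type is counted
unexp_st_types = {}  # street type -> number of distinct names ending in it
st_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)  # last whitespace-free token of a name

for name in street_names['street_name']:
    match = st_type_re.search(name)
    if match is None:  # e.g. an empty name or one with trailing whitespace
        continue
    street_type = match.group()
    if street_type in good_st_types:
        continue
    unexp_st_types[street_type] = unexp_st_types.get(street_type, 0) + 1

unexp_st_types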




{u'1': 5,
 u'10': 2,
 u'1001': 1,
 u'101': 1,
 u'102': 1,
 u'11': 3,
 u'12': 3,
 u'13': 1,
 u'14': 1,
 u'16': 1,
 u'17': 1,
 u'1849-1855)': 1,
 u'2': 5,
 u'2013': 1,
 u'24': 1,
 u'26': 1,
 u'3': 4,
 u'301': 1,
 u'303': 1,
 u'31': 1,
 u'33': 1,
 u'34': 1,
 u'35': 1,
 u'351': 1,
 u'36': 1,
 u'37': 1,
 u'38': 1,
 u'388': 1,
 u'39': 2,
 u'4': 4,
 u'402': 1,
 u'403': 1,
 u'404': 1,
 u'405': 1,
 u'414': 1,
 u'415': 1,
 u'416': 1,
 u'417': 1,
 u'418': 1,
 u'419': 1,
 u'420': 1,
 u'421': 1,
 u'422': 1,
 u'423': 1,
 u'424': 1,
 u'425': 1,
 u'426': 1,
 u'427': 1,
 u'428': 1,
 u'429': 1,
 u'430': 1,
 u'431': 1,
 u'432': 1,
 u'433': 1,
 u'434': 1,
 u'435': 1,
 u'436': 1,
 u'437': 1,
 u'438': 1,
 u'439': 1,
 u'440': 1,
 u'441': 1,
 u'442': 1,
 u'443': 1,
 u'444': 1,
 u'46': 1,
 u'5': 4,
 u'50': 1,
 u'502': 1,
 u'503': 1,
 u'539': 1,
 u'54': 1,
 u'542': 1,
 u'543': 1,
 u'56': 1,
 u'6': 5,
 u'62': 1,
 u'64': 1,
 u'6B': 1,
 u'7': 2,
 u'701': 1,
 u'705': 1,
 u'706': 1,
 u'710': 1,
 u'714': 1,
 u'715': 
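
Once abbreviated street types have been identified in an audit like the one above, they could be expanded with a small mapping. The sketch below reuses the same regular expression. The `ABBREVIATIONS` mapping and the helper name `expand_street_type` are hypothetical and not taken from the audit output, so the mapping would need to be filled in from the abbreviations actually found in the data.

```python
import re

# Same pattern as in the audit: the last whitespace-free token of the name.
st_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical mapping; extend it with the abbreviations found in the audit.
ABBREVIATIONS = {
    "St": "Street", "St.": "Street",
    "Ave": "Avenue", "Ave.": "Avenue",
    "Rd": "Road", "Rd.": "Road",
}

def expand_street_type(name):
    """Replace an abbreviated trailing street type with its full form."""
    match = st_type_re.search(name)
    if match is None:
        return name
    street_type = match.group()
    if street_type in ABBREVIATIONS:
        # Keep everything before the street type and append the full form.
        return name[:match.start()] + ABBREVIATIONS[street_type]
    return name

examples = ["Boylston St.", "Massachusetts Ave", "Beacon Street", "Route 1"]
expanded = [expand_street_type(n) for n in examples]
```

Names whose last token is not in the mapping, including the numeric endings seen above, pass through unchanged.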

In [145]:
QUERY = """
        SELECT key, COUNT(*) FROM way_tags WHERE type = 'addr'
        GROUP BY key ORDER BY COUNT(*) DESC LIMIT 20;
        """
c.execute(QUERY)
rows = c.fetchall()
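pd.DataFrame(rows, columns=['key', 'count'])

| | key | count |
|---|---|---|
| 0 | housenumber | 1400 |
| 1 | street | 531 |
| 2 | postcode | 297 |
| 3 | city | 280 |
| 4 | state | 264 |
| 5 | housename | 80 |
| 6 | interpolation | 20 |
| 7 | inclusion | 17 |
| 8 | country | 11 |
| 9 | full | 3 |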
