# OpenStreetMap Data Wrangling with Python and SQL
### by Sergei Neviadomski

## Project Summary
### Map area: Pittsburgh, PA, United States

https://mapzen.com/data/metro-extracts/metro/pittsburgh_pennsylvania/

This map covers the place where I currently live. I'd like to explore the open-source map of this area, reveal some inconsistencies in the data, and contribute to its improvement on OpenStreetMap.org.

## Data auditing and processing

After downloading and auditing the Pittsburgh area dataset, I noticed several inconsistencies in how the data was represented:
 
1) Street names were inconsistent.  
Abbreviations St -> Street  
Dots at the end St. -> Street  
Lowercase street -> Street  

2) Zip codes had different formats.  
5-4 digit format 15220-4152 -> 15220  
State abbreviation in zip code PA15220 -> 15220  

3) Phone numbers had different formats.  
+4129999999 -> +1-412-999-9999  
1412-999-9999 -> +1-412-999-9999  
(412)999-9999 -> +1-412-999-9999  

First, I expanded all abbreviated street types to their full form.
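
The street type expansion could look roughly like the sketch below. The mapping and the function name `update_street_name` are assumptions made for illustration; the actual code in OsmData.py may differ.

```python
# Hypothetical mapping from lowercase street types to their full form.
STREET_TYPE_MAPPING = {
    "st": "Street",
    "street": "Street",
    "ave": "Avenue",
    "rd": "Road",
    "blvd": "Boulevard",
    "dr": "Drive",
}


def update_street_name(name):
    """Expand the last word of a street name if it is a known street type."""
    words = name.strip().split()
    if not words:
        return name
    # Drop a trailing dot ("St." -> "St") and ignore case ("street" -> "Street").
    street_type = words[-1].rstrip(".").lower()
    full_form = STREET_TYPE_MAPPING.get(street_type)
    if full_form is None:
        # Unknown street type: leave the name untouched.
        return name
    words[-1] = full_form
    return " ".join(words)


print(update_street_name("Forbes St."))
```

Only the last word is checked, so a name such as "St Clair Avenue" keeps its leading "St".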
    
Then I brought all zip codes to a single 5-digit format by extracting the 5-digit sequence from the original value.
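
A minimal sketch of this extraction with a regular expression is shown below. The function name `clean_zip_code` and the choice to return `None` when no 5-digit sequence is found are assumptions, not necessarily what OsmData.py does.

```python
import re

# The first run of five digits in the value is taken as the zip code.
FIVE_DIGITS_RE = re.compile(r"\d{5}")


def clean_zip_code(value):
    """Return the first 5-digit sequence in value, or None if there is none."""
    match = FIVE_DIGITS_RE.search(value)
    return match.group(0) if match else None


for raw in ["15220-4152", "PA15220", "15220"]:
    print("{} -> {}".format(raw, clean_zip_code(raw)))
```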

Finally, I changed phone numbers to the +1-412-999-9999 format. I standardized the formatting by first removing all spaces, hyphens and parentheses, then prepending +1 as the country code and separating the phone blocks with hyphens.
You can take a look at the code in the OsmData.py file.
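
The phone clean-up described above might be sketched as follows. This version strips every non-digit character (a slightly broader rule than removing only spaces, hyphens and parentheses), and both the function name `clean_phone` and the `None` return for unexpected lengths are assumptions.

```python
import re


def clean_phone(value):
    """Normalize a US phone number to the +1-XXX-XXX-XXXX format."""
    # Keep digits only; this also drops a leading "+".
    digits = re.sub(r"\D", "", value)
    # Drop an existing country code so that exactly ten digits remain.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        # Assumption: values that are not ten digits are left for manual review.
        return None
    return "+1-{}-{}-{}".format(digits[:3], digits[3:6], digits[6:])


for raw in ["+4129999999", "1412-999-9999", "(412)999-9999"]:
    print("{} -> {}".format(raw, clean_phone(raw)))
```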

## Preparing for the SQL database

Once auditing is complete, the next step is to prepare the data for insertion into a SQL database. To do so, I parsed all elements in the OSM XML file and transformed them from document format to tabular format, which makes it possible to write them to .csv files. These csv files can then easily be imported into a SQL database as tables.
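
To illustrate the document-to-tabular step, here is a minimal Python 3 sketch for node elements only. The sample XML, the column lists and the helper names `shape_node` and `to_csv` are all made up for this example, and so is the convention of splitting a key such as `addr:postcode` into a type and a short key. The real project writes to files, while this sketch keeps the CSV text in memory.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layouts; the actual schema may differ.
NODE_FIELDS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]
NODE_TAG_FIELDS = ["id", "key", "value", "type"]

# Made-up sample standing in for the real OSM XML extract.
SAMPLE_OSM = """<osm>
  <node id="1" lat="40.44" lon="-79.99" user="alice" uid="10"
        version="1" changeset="100" timestamp="2016-01-01T00:00:00Z">
    <tag k="addr:postcode" v="15220"/>
    <tag k="amenity" v="cafe"/>
  </node>
</osm>"""


def shape_node(element):
    """Turn one <node> element into a node row and a list of tag rows."""
    node_row = {field: element.attrib.get(field) for field in NODE_FIELDS}
    tag_rows = []
    for tag in element.iter("tag"):
        # Assumed convention: "addr:postcode" becomes type "addr", key "postcode".
        prefix, colon, rest = tag.attrib["k"].partition(":")
        tag_rows.append({
            "id": node_row["id"],
            "key": rest if colon else prefix,
            "value": tag.attrib["v"],
            "type": prefix if colon else "regular",
        })
    return node_row, tag_rows


def to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


node_rows, tag_rows = [], []
for node in ET.fromstring(SAMPLE_OSM).iter("node"):
    node_row, node_tags = shape_node(node)
    node_rows.append(node_row)
    tag_rows.extend(node_tags)

print(to_csv(node_rows, NODE_FIELDS))
print(to_csv(tag_rows, NODE_TAG_FIELDS))
```

For a file of several hundred megabytes, an incremental parser such as `xml.etree.ElementTree.iterparse` is likely a better fit than loading the whole document at once.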

Finally, I built the SQL database and imported the tables from the csv files produced in the previous step. I used the sqlite3 shell for this purpose.
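
The import itself was done in the sqlite3 shell, but the same step can be sketched with the Python `sqlite3` module (Python 3 here). The table definition and the CSV content below are hypothetical stand-ins, and an in-memory database is used so that the example does not touch any real file.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a nodes_tags csv file.
NODES_TAGS_CSV = "id,key,value,type\n1,postcode,15220,addr\n1,amenity,cafe,regular\n"

# In-memory database for the example; a file name would be used for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);")

# Read the CSV rows and insert them with bound parameters.
reader = csv.DictReader(io.StringIO(NODES_TAGS_CSV))
rows = [(row["id"], row["key"], row["value"], row["type"]) for row in reader]
conn.executemany(
    "INSERT INTO nodes_tags (id, key, value, type) VALUES (?, ?, ?, ?);", rows
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM nodes_tags;").fetchone()[0]
print("Imported {} rows.".format(row_count))
```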

## Querying the SQL database

Here, I present some basic statistics about the data.

Original data file size: 407 MB

SQLite DB file size: 227 MB

To query the database I'll use the sqlite3 Python API rather than the sqlite3 shell.

1) Number of nodes

2) Number of ways

    # Import the SQLite3 API
    import sqlite3

    # Establish the connection and cursor
    conn = sqlite3.connect("osm.db")
    cursor = conn.cursor()

    # Execute the queries and print the results
    cursor.execute("select count(id) from nodes;")
    print('There are {} nodes in database.'.format(cursor.fetchall()[0][0]))
    cursor.execute("select count(id) from ways;")
    print('There are {} ways in database.'.format(cursor.fetchall()[0][0]))

Output:

    There are 47792 nodes in database.
    There are 4887 ways in database.


3) Number of unique users

    cursor.execute("select count(distinct(uid)) from (select uid from nodes union select uid from ways);")
    print('There are {} unique users in database.'.format(cursor.fetchall()[0][0]))

Output:

    There are 662 unique users in database.


4) Way with the largest node count

    import pprint

    # Find the way that references the most nodes
    cursor.execute("select id, count(*) as nodes_count from ways_nodes group by id order by nodes_count desc limit 1;")
    way_id, count = cursor.fetchall()[0]
    print("There're {} nodes in the biggest way in database. Way id is {}.".format(count, way_id))
    # Look up the tags of that way, passing the id as a bound parameter
    cursor.execute("select * from ways_tags where id = ?;", (way_id,))
    print('This way is:')
    pprint.pprint(cursor.fetchall())

Output:

    There're 382 nodes in the biggest way in database. Way id is 384745032.
    This way is:
    []

The empty list means this way has no rows in ways_tags.


5) Number of bridges in Pittsburgh

    cursor.execute("select count(key) from ways_tags where key = 'bridge' and value != 'yes' group by key;")
    print("There are {} bridges in Pitt. That's a second Venice.".format(cursor.fetchall()[0][0]))

Output:

    There are 410 bridges in Pitt. That's a second Venice.


6) Top 5 zip codes in Pittsburgh

    cursor.execute("select value, count(*) as count from nodes_tags where key = 'postcode' group by value order by count desc limit 5;")
    pprint.pprint(cursor.fetchall())

Output:

    [(u'15206', 6653),
     (u'15044', 5333),
     (u'15025', 4950),
     (u'15216', 4853),
     (u'15017', 3454)]


## Conclusion

During my analysis I saw a large amount of data that had not been correctly formatted and cleaned. I successfully parsed this data and corrected the formatting of streets, zip codes and phone numbers. The bigger issue is that OSM data has a lot of inconsistencies, and sometimes it's difficult even to find them. There is a lot of work to be done to complete this map.

OpenStreetMap data is not perfect, like any human-edited project. It'll take a lot of time to find and clean all human-made errors, but we've made our first step. We made street names more consistent and uniform. Then we transformed the XML to CSV format and imported it into a SQL database. Finally, we answered some interesting questions using SQL queries.

Additional ideas:  
In my opinion there are two ways to improve the OpenStreetMap project.  
First, it's extremely important to attract more people to improving the maps. My suggestion would be to use gamification, for example a ranking system like the one on Kaggle or a badge system like the one on Khan Academy.  
Benefits of this:
 * increases productivity
 * helps to retain high performers by involving them in moderation
  
Anticipated problems:
 * may have only a very small effect on results
 * creates competition that can be counterproductive
 
Second, use different sources to cross-validate inconsistencies and fill empty spots on OSM maps.