# OpenStreetMap Data Wrangling and Case Study
***

### Map Area
The data covers all of lower Manhattan (south of about 40th Street) and parts of Brooklyn, including portions of DUMBO, Vinegar Hill, the Navy Yard, and Williamsburg.  The area spans latitudes 40.6976 to 40.7562 and longitudes -74.0199 to -73.9661.

<img src = "map_area.png">

In [1]:
%load_ext sql
%sql sqlite:///osm.db

'Connected: None@osm.db'

There are three types of entities in the map: nodes, ways, and relations.  In addition, nodes have tags, ways have tags and nodes, and relations have members and tags.  Each of these is represented in a separate table.  The vast majority of the interesting data lies within the tags.

In [12]:
%%sql
SELECT name
FROM sqlite_master
WHERE type='table'
ORDER BY name

Done.


name
Node
NodeTag
Relation
RelationMember
RelationTag
Way
WayNode
WayTag


## Problems Encountered in the Map Data

* Inconsistent zip codes
* Inconsistent city names
* Inconsistent country codes
* Invalid state values


### Zip Codes
Values for the key addr:postcode were represented inconsistently in the data.  Some included a prefix of NY or were in the ZIP+4 format.  These zip codes were fixed programmatically using regular expressions.

I also manually assembled a list of valid zip codes for the areas included in the map.  Some postcodes in the data set were not present in the list.  For instance, 10023, which represents the Lincoln Square area of Manhattan, was erroneously assigned to an address near City Hall.  These zip codes were manually corrected with a dictionary mapping.  Typos in postcodes, such as '100014', which has six digits, were also corrected manually.

Example errors in zip codes:
* 'NY 10007'
* '10002-1013'
* '100014'
* '10023'
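
The two cleaning steps described above (a regular expression for prefixes and ZIP+4 suffixes, then a check against a list of valid zip codes with a dictionary of manual corrections) could look roughly like the sketch below.  This is a minimal illustration, not the code used for this project: the names `VALID_ZIPS`, `MANUAL_FIXES`, and `clean_postcode` are hypothetical, the valid list is abbreviated, and the corrected value for '100014' is an assumption.

```python
import re

# Hypothetical, abbreviated list of valid zip codes for the map area.
VALID_ZIPS = {"10002", "10007", "10014"}

# Hypothetical manual corrections for values a regex cannot repair.
# The target value for '100014' is an assumption, not taken from the project.
MANUAL_FIXES = {"100014": "10014"}

# Optional 'NY' prefix, five digits, optional ZIP+4 suffix.
ZIP_RE = re.compile(r"^(?:NY\s*)?(\d{5})(?:-\d{4})?$", re.IGNORECASE)


def clean_postcode(value):
    """Return a five-digit zip code, or None if the value needs manual review."""
    value = value.strip()
    if value in MANUAL_FIXES:
        return MANUAL_FIXES[value]
    match = ZIP_RE.match(value)
    if match is None:
        return None
    zip5 = match.group(1)
    # A well-formed zip that is not in the valid list is flagged, not guessed.
    return zip5 if zip5 in VALID_ZIPS else None
```

Returning `None` for a well-formed but out-of-area code such as '10023' keeps the decision about its replacement a manual one, which matches how those cases were handled here.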

In [13]:
%%sql
SELECT tags.value 'Zip Code', COUNT(*) as count 
FROM (SELECT * FROM NodeTag 
      UNION ALL
      SELECT * FROM WayTag) tags
WHERE tags.key='addr:postcode'
GROUP BY tags.value
ORDER BY count DESC
LIMIT 10;

Done.


Zip Code,count
10011,2819
10003,2557
10014,2549
10002,2484
10013,2172
10016,1923
10009,1673
10001,1665
10012,1568
10010,970


### Cities

Cities were represented inconsistently.  The only two acceptable cities are New York and Brooklyn.  Some addr:city values erroneously included the state, some referred to the neighborhood instead of the city, and some had inconsistent capitalization.  All values referring to New York, Manhattan, or its neighborhoods were mapped to New York; there were no problems with those marked 'Brooklyn'.

Example errors in city:
* 'Manhattan NYC'
* 'NEW YORK CITY'
* 'New York City'
* 'New York, NY'
* 'Tribeca'
* 'York City'
* 'new york'
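
A mapping of this kind could be sketched as follows.  This is only an illustration under stated assumptions: `NEW_YORK_VARIANTS` and `clean_city` are hypothetical names, and the set contains only the example values listed above rather than every variant found in the data.

```python
# Variants from the example errors above, lowercased for comparison.
NEW_YORK_VARIANTS = {
    "manhattan nyc",
    "new york city",
    "new york, ny",
    "tribeca",
    "york city",
    "new york",
}


def clean_city(value):
    """Map known New York variants to 'New York'; leave other values unchanged."""
    # Collapse repeated whitespace and ignore case when comparing.
    normalized = " ".join(value.split()).lower()
    if normalized == "brooklyn":
        return "Brooklyn"
    if normalized in NEW_YORK_VARIANTS:
        return "New York"
    # Unknown values are returned as-is so they can be reviewed by hand.
    return value
```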

In [14]:
%%sql
SELECT tags.value 'City', COUNT(*) as count 
FROM (SELECT * FROM NodeTag 
      UNION ALL
      SELECT * FROM WayTag) tags
WHERE tags.key='addr:city'
GROUP BY tags.value
ORDER BY count DESC
LIMIT 10;

Done.


City,count
New York,3301
Brooklyn,77


### Countries
453 of the values for the key addr:country were 'US', and only five were 'USA'.  The latter were corrected with a simple mapping.
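
Such a mapping could be as small as the sketch below.  It assumes the 'USA' values were rewritten to the majority form 'US'; the names `COUNTRY_MAPPING` and `clean_country` are hypothetical.

```python
# Assumed direction of the correction: the minority form maps to the majority form.
COUNTRY_MAPPING = {"USA": "US"}


def clean_country(value):
    """Return the mapped country code, or the stripped value if no mapping applies."""
    value = value.strip()
    return COUNTRY_MAPPING.get(value, value)
```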

### States
Almost all of the values for addr:state were 'NY'; the only errors are listed below.

Errors in state:
* '10009'
* 'New York'
* 'New York State'
* 'ny'

Since 'NY' is the only valid value, no cleaning function is required: every value for the key 'addr:state' was set to 'NY'.
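
In code, this amounts to overwriting the value whenever the key matches.  The sketch below assumes each tag is held as a dictionary with `key` and `value` entries, mirroring the columns used in the queries above; the dictionary shape and the name `fix_state_tag` are hypothetical.

```python
def fix_state_tag(tag):
    """Return a copy of the tag with addr:state forced to 'NY'; other tags pass through."""
    if tag["key"] == "addr:state":
        return {**tag, "value": "NY"}
    return tag


# The erroneous values listed above all end up as 'NY'.
for bad in ["10009", "New York", "New York State", "ny"]:
    fixed = fix_state_tag({"key": "addr:state", "value": bad})
    print(bad, "->", fixed["value"])
```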