# Parse and clean OpenStreetMap data

### Identify, archive, document data file:

Data File: New Orleans [MapZen Extract](https://mapzen.com/data/metro-extracts/metro/new-orleans_louisiana/)

### Inspect and document general structure:

The essential task is to parse the data file and clean it. What am I parsing?

First, looking at the documentation, the gist that I get is:
* node is a lon/lat point, possibly with 'tag' child elements
* way has nodes as children (as nd elements) and defines things like roads, buildings, etc. It may also have 'tag' children
* relation has nested 'member' elements which refer to ways and nodes - defines relationships among elements
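
To make that structure concrete, here is a minimal sketch that parses a tiny hand-written snippet. The IDs, coordinates and tag values are invented for illustration and are not taken from the New Orleans extract; `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; every id, coordinate and tag value here is made up.
SAMPLE_OSM = """<osm>
  <node id="1" lat="29.95" lon="-90.07">
    <tag k="natural" v="tree"/>
  </node>
  <node id="2" lat="29.96" lon="-90.08"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="waterway" v="stream"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>"""


def summarize(elem):
    """Return (element name, child element names, tag key/value pairs)."""
    children = [child.tag for child in elem]
    tags = {t.attrib['k']: t.attrib['v'] for t in elem.findall('tag')}
    return elem.tag, children, tags


root = ET.fromstring(SAMPLE_OSM)
summaries = [summarize(elem) for elem in root]
for summary in summaries:
    print(summary)
```

The way lists its nodes as `nd` children and the relation lists `member` children, while all three element types can carry `tag` children.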

Next, open the doc in vim to get acquainted. Why? Because I can and I want to. Visually I see (screenshots):
* standard opening: `<?xml version='1.0' encoding='UTF-8'?>`
* root is an 'osm' tag
* doc is 16,082,009 lines long in total - wow!
* skipping down some pages, the top looks like a bunch of nodes with no child elements defined
* skipping down from middle of doc, looks like a bunch of 'way' tags of natural features (streams, other waterway)
* skipping up from the end of the doc, looks like a bunch of 'relations' of things like gardens and public transportation stations
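
The same survey can be done programmatically rather than by paging through vim. This is a rough sketch, assuming only the top-level node/way/relation elements are of interest; `count_top_level` is a hypothetical helper and the sample bytes are invented so the block runs without the real extract.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_top_level(source, names=('node', 'way', 'relation')):
    """Count completed top-level OSM elements while streaming the file."""
    counts = Counter()
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag in names:
            counts[elem.tag] += 1
            root.clear()  # drop finished elements so memory use stays flat
    return counts


# Invented sample; with the real data, pass the path to the .osm file instead.
SAMPLE = b"""<osm>
  <node id="1" lat="29.95" lon="-90.07"/>
  <node id="2" lat="29.96" lon="-90.08"><tag k="natural" v="tree"/></node>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
  <relation id="100"><member type="way" ref="10" role=""/></relation>
</osm>"""

counts = count_top_level(io.BytesIO(SAMPLE))
print(counts)
```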

### Speculate on what to work with specifically:

So what should I clean here? 'Tag' elements seem to be an important part where details about map features are stored. This should also be where a lot of user-generated/added content is stored.

In the middle, what is NHD? Strikes me as interesting. Googling it: ah, the US Geological Survey's National Hydrography Dataset (Watershed Boundary Dataset). The data should be damn near perfect. Simply put, each feature should be a way with node child element references and tags holding all the relevant info (so a polygon boundary and some metadata). These are mostly 'natural' features - auditing them sounds interesting.

First, the NHD data shouldn't really have many errors, given that it is taken from the NHD and uploaded to OpenStreetMap in bulk (as opposed to the piecemeal, user-generated content of most OpenStreetMap contributions). Still, it's worth checking a few things to be sure. Some inspection is in order:

### Inspect/audit/plan out cleaning methodology:

First, what keys exist in natural feature tag elements as a whole?

In [None]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def count_elem(osm_file):
    """Collect every tag key used on elements that carry a 'natural' tag."""
    tag_set = set()

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)  # grab the root so it can be cleared as we go

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            keys = [tag_elem.attrib['k'] for tag_elem in elem.iter("tag")]
            if 'natural' in keys:
                tag_set.update(keys)
            root.clear()  # free processed elements to keep memory flat
    return tag_set

keys_list = list(count_elem(OSM_FILE))
print(keys_list)

Hard to tell visually - sort it:

In [None]:
sorted(keys_list)

Some of the k values, like 'golf', seem pretty arbitrary, but they may have a purpose. Others, like NHD:FCode, seem redundant with FCode. Maybe the US Geological Survey changed their naming convention at one point. There may be something to clean here... I also see a 'fixme' key, which may be something. Worth noting for now, but what will really help is to create a dictionary of the features and their attribute contents.

Some notes:
* FCODE is used just once, with '46600' as its contents. Jumping down to NHD:FCode I see more contents and, again, '46600'. I'm betting FCODE and NHD:FCode are the same. Looking at the NHD poster, I find that 46600 is "Swamp/Marsh". The different keys are definitely trying to achieve the same thing, so they can be merged. Also, I'm not worried about a key having multiple FCodes - for example, 33600;46600 is a Canal/Ditch, but it's also a Swamp/Marsh.
* The next redundant tag key is FDATE, which has 'Mon May 23 00:00:00 CEST 2011' as its contents. Looks like a timestamp. I could write a script to check whether other elements with an FDATE tag have a different timestamp, but there's a faster way, since all I need is to see a few others: in vim, /search for 'FDATE'. Sure enough, it's found on line 13,356,160, and n finds no other instances! What about NHD:Fdate? Yes, all kinds of others, but their dates are simpler - for example, "2005/12/05". Fair enough to say that the rogue element's date can be changed to 2011/05/23.
* Now thinking, there only seems to be a single rogue elem. Inspect it to make sure it's not wildly different. Nothing stands out. /search for the user, Aleks-Berlin: he also seems to have edited a few others, adding some tags called "FIXME" which seem a little haphazard. I'm wondering who this person is and whether they are entering data haphazardly. Check the element on the map: /search for one of the nodes, then enter its lat/lon coordinates on the map: 29.3557244, -90.0538116. Looks like a valid natural feature, so the way elem should stay in place.
* Next, FTYPE. 466 corresponds to the Feature Type in the PDF.
* 'natural' all good
* 'PERMANENT_' is redundant. It has a strange value: {A0AFF249-A7D2-44F4-AD8A-0A4A68F99450}
    * Search for NHD:Permanent_. They all have much simpler values like "151098380". What is it? In the PDF there is a 'Permanent_Identifier' feature that is a 40-character string, but all the ones I find are 9 characters. Could this elem be flagged somehow? Not enough info here, so leaving this one alone.
* RESOLUTION is '2'. From the NHD documentation, this should be set to "High" (Code of source resolution: 1=Local resolution, 2=High resolution, 3=Medium Resolution.)
* SHAPE_AREA and SHAPE_LENG exist in the PDF guide as 'Shape_Area' and 'Shape_Length', but they have no guidelines and are not used in any other elements in the XML file. Again, wondering if it's best to just remove this element!
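
A few of these fixes can be sketched as small functions. This is only an illustration under stated assumptions: the target key spellings in `KEY_RENAMES` (in particular `NHD:FDate`) are my guess at the convention, the helper names are hypothetical, and the timezone token in the rogue date is simply dropped rather than converted.

```python
from datetime import datetime

# Assumed renames for the rogue upper-case keys; the exact target spelling
# should be checked against the keys actually present in the file.
KEY_RENAMES = {'FCODE': 'NHD:FCode', 'FDATE': 'NHD:FDate'}

# Source resolution codes as listed in the NHD documentation quoted above.
RESOLUTION_CODES = {'1': 'Local', '2': 'High', '3': 'Medium'}


def normalize_fdate(value):
    """Turn 'Mon May 23 00:00:00 CEST 2011' into '2011/05/23'.

    Values that do not have the six-token timestamp shape are returned as-is.
    """
    parts = value.split()
    if len(parts) != 6:
        return value
    month, day, year = parts[1], parts[2], parts[5]
    parsed = datetime.strptime('{} {} {}'.format(month, day, year), '%b %d %Y')
    return parsed.strftime('%Y/%m/%d')


def clean_tag(key, value):
    """Return a (key, value) pair with the fixes above applied."""
    if key == 'FDATE':
        value = normalize_fdate(value)
    elif key == 'RESOLUTION':
        value = RESOLUTION_CODES.get(value, value)
    return KEY_RENAMES.get(key, key), value


print(clean_tag('FDATE', 'Mon May 23 00:00:00 CEST 2011'))
print(clean_tag('RESOLUTION', '2'))
```

The date is rebuilt from the month, day and year tokens instead of parsing the whole string, because `strptime` is not guaranteed to recognize a zone abbreviation like CEST.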

    
Now onto auditing some of the other keys. For this it's best to build the dictionary and see what kind of information they hold:

In [None]:
import xml.etree.ElementTree as ET
import collections as col
import pprint

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def make_elem_dict(osm_file):
    """Map each tag key on 'natural' elements to the set of values it takes."""
    elem_dict = col.defaultdict(set)

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    # TODO: pull the 'natural' filter out into a shared helper function
    for event, elem in context:
        # skip everything except completed top-level elements
        if event != 'end' or elem.tag not in ('node', 'way', 'relation'):
            continue
        tags = list(elem.iter("tag"))
        if any(tag.attrib['k'] == 'natural' for tag in tags):
            for tag in tags:
                elem_dict[tag.attrib['k']].add(tag.attrib['v'])
        root.clear()  # free processed elements to keep memory flat
    return elem_dict

pprint.pprint(make_elem_dict(OSM_FILE))

Some notes on checking the other fields starting at the end:
* 'wikipedia' - not going to touch; connects an elem to a Wikipedia article
* 'wikidata' - connects to Wikimedia Commons. Googling 'wikidata' and one of the values from the dict gives a link to an [article](https://commons.wikimedia.org/wiki/File:Bayou_St_John_by_Spanish_Fort_2009.jpg) - not touching this
* Scanning through the others doesn't seem to throw any flags (trying to keep the scope of the investigation targeted). However:
* will fix capitalization in 'water' tag
* in source - change any 'bing' to 'Bing', 'landsat' to 'Landsat'
* Look at all 'note'. There are many non-'natural' elems with notes as well, so I'll have to write a script to pull all 'natural' elems that also have a 'note' tag elem:

In [None]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def find_note_tag(osm_file):
    """Dump every 'natural' element that also carries a 'note' tag."""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            keys = {tag_elem.attrib['k'] for tag_elem in elem.iter("tag")}
            if 'natural' in keys and 'note' in keys:
                ET.dump(elem)  # dump() writes to stdout itself and returns None
            root.clear()  # free processed elements to keep memory flat

find_note_tag(OSM_FILE)

What I'm learning: (add PDF descriptions for each field when introducing them)
* The comment "I have altered the natural:coatline tag as this way duplicates existing coastline ways" tells me that I shouldn't mess with either of the 'coastline' or '_coastline' keys, as the difference is purposeful. The other notes don't indicate any other action needed.

Back to dict:
* 'name' - fix capitalization on lower-case words. There are enough that I can see they all need it. 'xxx' seems like an error, but I'm not going to change it...
    * discard 'yes' in node 4506654389
* 'fixme' - not touching. Seems like a way for people to know what needs to be changed due to construction, etc. ("Needs survey", "tempoary way, whilst coastline is sorted out", etc.). Still slightly problematic, as some just state "temporary fix". Not doing anything with this.
* NHD:ReachCode - "Unique identifier composed of two parts, first eight digits = subbasin code as defined by FIPS 103, and next six digits = random-assigned sequential number unique within a Cataloguing Unit."
    * issues that I see but can't fix:
        * It's supposed to be a unique identifier, but some elements have more than one. Manually inspecting one such element (WayID: 43393226) doesn't reveal why, but I noticed it also has a duplicate NHD:ComID.
* NHD:FDate has one or more instances with two dates. Nothing I can do though.
* NHD:FCode - ok to have more than one.
* NHD:ComID has more than one id for some, but more importantly the PDF indicates "ComID field deleted from all feature classes/tables" in the Model Changes section. Should this data still be here? This is the most updated model documentation, from August 2016. Waiting on email sent to NHD...
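
Even though the multi-valued ReachCodes can't be fixed here, they can at least be flagged. A rough sketch, assuming multiple values are joined with ';' the way the multi-valued FCodes are, and that a well-formed ReachCode is exactly 14 digits (8 + 6, per the description quoted above). `audit_reachcode` is a hypothetical helper and the sample values are invented.

```python
import re

# 8-digit subbasin code followed by a 6-digit sequence number
REACHCODE_RE = re.compile(r'\d{14}')


def audit_reachcode(value, sep=';'):
    """Split a ReachCode value and report anything suspicious about it."""
    parts = [part.strip() for part in value.split(sep)]
    return {
        'parts': parts,
        'multiple': len(parts) > 1,
        'malformed': [p for p in parts if not REACHCODE_RE.fullmatch(p)],
    }


# Invented values, for illustration only
single = audit_reachcode('08090100001234')
double = audit_reachcode('08090100001234;0809010000567')
print(single)
print(double)
```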

### Build basic parser/CSV compiler:

See data.py in folder

### Confirm basic parser/CSV compiler working properly:

In [None]:
# open each CSV and check manually - VIM search in large OSM and check
    # √nodes_tags.csv
    # √nodes.csv
    # √relations_members.csv
    # √relations_tags.csv
    # √relations.csv
    # √ways_nodes.csv
    # √ways_tags.csv
    # √ways.csv

### Write cleaning scripts - insert into data.py:

In [None]:
#√ Fix wayID 321535489
    # No need to fix - it's inefficient to parse every line to clean a single element.
        # better approach to do manually with access to the actual database
#√ All NHD:FTYPE should be NHD:FType to conform to the NHD data model; likewise, RESOLUTION should be Resolution
#√ Fix capitalization in tag elems with 'water' key
#√ Change 'bing' to 'Bing' and 'landsat' to 'Landsat' in tag elems with 'source' key
    # maybe regex compare based on capitalization - then fix to first letter capitalized?
#√ Fix tags with 'name' key (capitalization of first letter of each word)
#not doing: discard node tag elem 4506654389
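
The actual cleaning code lives in data.py; the block below is only a sketch of how the 'source' and 'name' fixes from the checklist might look. The function names are hypothetical, and the regex approach is one option among several.

```python
import re

# Canonical spellings for the imagery sources named in the checklist
SOURCE_FIXES = {'bing': 'Bing', 'landsat': 'Landsat'}
_SOURCE_RE = re.compile(r'\b(bing|landsat)\b', re.IGNORECASE)


def fix_source(value):
    """Normalize the capitalization of 'bing' and 'landsat' in a source value."""
    return _SOURCE_RE.sub(lambda m: SOURCE_FIXES[m.group(0).lower()], value)


def capitalize_words(value):
    """Upper-case the first letter of each space-separated word.

    Only the first character is touched, so existing capitals and
    punctuation inside a word are left alone.
    """
    return ' '.join(word[:1].upper() + word[1:] for word in value.split(' '))


print(fix_source('bing; landsat'))
print(capitalize_words('bayou st. john'))
```

`str.title()` was avoided on purpose: it also lower-cases the rest of each word and capitalizes after apostrophes, which would mangle some names.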

### Update schema - implement schema validation:

In [None]:
#√ decipher schema file and validation functionality
    #√ validator class object is created
    #√ dict and validator class object fed into validate_element()
        # validates and throws error (stops execution) if validation not True
        # else, execution continues and all items written to the csv
#√ update schema doc and run/verify
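
The validate-or-stop flow described above can be illustrated with a standard-library stand-in. This is not the validator class used in data.py; `NODE_SCHEMA`, `find_errors` and this `validate_element` are simplified, hypothetical versions, with field types guessed from the node columns and the query output later in this notebook.

```python
# Hypothetical schema: field name -> expected Python type
NODE_SCHEMA = {
    'id': int, 'lat': float, 'lon': float, 'user': str,
    'uid': int, 'version': str, 'changeset': int, 'timestamp': str,
}


def find_errors(element, schema):
    """Return a dict of field -> problem; empty if the element conforms."""
    errors = {}
    for field, expected in schema.items():
        if field not in element:
            errors[field] = 'missing'
        elif not isinstance(element[field], expected):
            errors[field] = 'expected {}, got {}'.format(
                expected.__name__, type(element[field]).__name__)
    for field in element:
        if field not in schema:
            errors[field] = 'unexpected field'
    return errors


def validate_element(element, schema):
    """Raise (stopping execution) if the element does not match the schema."""
    errors = find_errors(element, schema)
    if errors:
        raise ValueError('element failed validation: {}'.format(errors))
    return True


# Invented element, for illustration only
good = {'id': 1, 'lat': 29.3557244, 'lon': -90.0538116, 'user': 'example_user',
        'uid': 42, 'version': '2', 'changeset': 100,
        'timestamp': '2015-08-21T17:49:07Z'}
bad = dict(good, uid='42')
del bad['lat']

print(validate_element(good, NODE_SCHEMA))
print(find_errors(bad, NODE_SCHEMA))
```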

### Perform statistical analysis using database queries

##### Port CSV files into database:

In [1]:
import sqlite3
import csv
from pprint import pprint
from supporting_files import data_wrangling_schema as dws

In [2]:
sqlite_file = 'supporting_files/exports/osm_db.db'
conn = sqlite3.connect(sqlite_file)

In [3]:
cur = conn.cursor()

In [4]:
sql_schema = dws.sql_schema

In [5]:
# Create the table, specifying the schema
cur.executescript(sql_schema)
# commit the changes
conn.commit()

In [6]:
with open('supporting_files/exports/nodes.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'],i['lat'],i['lon'],i['user'],i['uid'],i['version'],i['changeset'],i['timestamp']) for i in dr]
cur.executemany("INSERT INTO nodes(id, lat, lon, user, uid, version, changeset, timestamp) VALUES (?,?,?,?,?,?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/nodes_tags.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'], i['key'],i['value'], i['type']) for i in dr]
cur.executemany("INSERT INTO nodes_tags(id, key, value, type) VALUES (?,?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/ways.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'],i['user'],i['uid'],i['version'],i['changeset'],i['timestamp']) for i in dr]
cur.executemany("INSERT INTO ways(id, user, uid, version, changeset, timestamp) VALUES (?,?,?,?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/ways_nodes.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'], i['node_id'],i['position']) for i in dr]
cur.executemany("INSERT INTO ways_nodes(id, node_id, position) VALUES (?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/ways_tags.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'], i['key'],i['value'], i['type']) for i in dr]
cur.executemany("INSERT INTO ways_tags(id, key, value, type) VALUES (?,?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/relations.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'],i['user'],i['uid'],i['version'],i['timestamp'],i['changeset']) for i in dr]
cur.executemany("INSERT INTO relations(id, user, uid, version, timestamp, changeset) VALUES (?,?,?,?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/relations_members.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'],i['mem_id'],i['type'],i['role'],i['position']) for i in dr]
cur.executemany("INSERT INTO relations_members(id, mem_id, type, role, position) VALUES (?,?,?,?,?);", to_db)
conn.commit()

with open('supporting_files/exports/relations_tags.csv','r') as fin:
    dr = csv.DictReader(fin)
    to_db = [(i['id'], i['key'],i['value'], i['type']) for i in dr]
cur.executemany("INSERT INTO relations_tags(id, key, value, type) VALUES (?,?,?,?);", to_db)
conn.commit()
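
The eight load steps above all follow the same pattern, so they could be folded into one helper. A sketch, assuming the table and column names come from trusted code (they are interpolated into the SQL string, so they must never come from user input). `load_csv` is a hypothetical name, and the demo uses an in-memory database with a simplified table definition rather than the real schema.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, columns, csv_file):
    """Insert the named columns of an open CSV file into a table."""
    reader = csv.DictReader(csv_file)
    rows = [tuple(row[col] for col in columns) for row in reader]
    placeholders = ','.join('?' * len(columns))
    sql = 'INSERT INTO {}({}) VALUES ({});'.format(
        table, ', '.join(columns), placeholders)
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


# Demo with invented data and a simplified table definition
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE ways_nodes (id INTEGER, node_id INTEGER, position INTEGER)')
demo_csv = io.StringIO('id,node_id,position\n10,1,0\n10,2,1\n')

inserted = load_csv(demo_conn, 'ways_nodes', ['id', 'node_id', 'position'], demo_csv)
loaded = demo_conn.execute(
    'SELECT id, node_id, position FROM ways_nodes ORDER BY position').fetchall()
print(inserted, loaded)
demo_conn.close()
```

With the real files, each call would be wrapped in `with open(path, 'r') as fin:` exactly as in the cell above.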

In [7]:
cur.execute('SELECT * FROM relations')
all_rows = cur.fetchall()
print('1):')
pprint(all_rows)

1):
[(38763, 'sctrojan79', 2601744, '3', 40703435, '2016-07-13T05:40:26Z'),
 (304272, 'Maarten Deen', 9176, '5', 33489234, '2015-08-21T17:49:07Z'),
 (304314, 'Maarten Deen', 9176, '2', 33136718, '2015-08-05T18:54:47Z'),
 (304325, 'eric22', 160949, '2', 18699468, '2013-11-03T18:23:55Z'),
 (304344, 'eric22', 160949, '2', 18699468, '2013-11-03T18:23:59Z'),
 (304359, 'eric22', 160949, '5', 18770230, '2013-11-07T19:16:00Z'),
 (304400, 'Maarten Deen', 9176, '7', 33489234, '2015-08-21T17:49:09Z'),
 (304415, 'Maarten Deen', 9176, '3', 25399335, '2014-09-12T20:28:15Z'),
 (304417, 'cart0', 2941385, '4', 35636229, '2015-11-28T21:13:06Z'),
 (304449, 'ELadner', 22925, '3', 33773806, '2015-09-03T14:03:04Z'),
 (304465, 'ELadner', 22925, '3', 33773806, '2015-09-03T14:02:49Z'),
 (304484, 'ELadner', 22925, '3', 33773806, '2015-09-03T14:02:44Z'),
 (304488, 'eric22', 160949, '2', 18699468, '2013-11-03T18:23:54Z'),
 (304491, 'Maarten Deen', 9176, '6', 33489234, '2015-08-21T17:49:13Z'),
 (304493, 'Maarten D

 (307594, 'Maarten Deen', 9176, '10', 32868317, '2015-07-25T10:48:58Z'),
 (307596, 'Maarten Deen', 9176, '6', 32800953, '2015-07-22T12:37:18Z'),
 (307598, 'Maarten Deen', 9176, '2', 33530996, '2015-08-23T19:03:29Z'),
 (307600, 'Maarten Deen', 9176, '2', 32394460, '2015-07-03T16:50:29Z'),
 (307601, 'Maarten Deen', 9176, '4', 32547727, '2015-07-10T16:13:06Z'),
 (307603, 'Maarten Deen', 9176, '3', 34055654, '2015-09-16T07:07:29Z'),
 (307604, 'Maarten Deen', 9176, '2', 32849174, '2015-07-24T13:16:12Z'),
 (307619, 'Maarten Deen', 9176, '6', 33530996, '2015-08-23T19:03:48Z'),
 (307621, 'Maarten Deen', 9176, '2', 33197535, '2015-08-08T09:36:53Z'),
 (307622, 'ELadner', 22925, '4', 46884249, '2017-03-15T23:48:04Z'),
 (307623, 'Maarten Deen', 9176, '3', 33552485, '2015-08-24T17:20:52Z'),
 (307625, 'Maarten Deen', 9176, '5', 33069741, '2015-08-03T13:30:57Z'),
 (307627, 'Maarten Deen', 9176, '2', 32543518, '2015-07-10T13:57:03Z'),
 (307628, 'Andre Engels', 4054, '3', 41851588, '2016-09-01T15:27:55

 (4175726, 'wvdp', 436419, '2', 26696217, '2014-11-10T19:57:47Z'),
 (4177814, 'wvdp', 436419, '2', 26777194, '2014-11-14T13:55:24Z'),
 (4178308, 'Maarten Deen', 9176, '1', 26665303, '2014-11-09T15:13:38Z'),
 (4178390, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:35Z'),
 (4178391, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:35Z'),
 (4178392, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:35Z'),
 (4178393, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:35Z'),
 (4178394, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:36Z'),
 (4178395, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:36Z'),
 (4178396, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:36Z'),
 (4178397, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:36Z'),
 (4178398, 'ELadner', 22925, '1', 26666277, '2014-11-09T15:52:36Z'),
 (4178595, 'ELadner', 22925, '3', 26779839, '2014-11-14T15:59:19Z'),
 (4179368, 'Maarten Deen', 9176, '1', 26672257, '2014-11-09T19:33:48Z'),
 (4181318, 'wvdp', 436419, '1'

In [8]:
conn.close()
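
As an example of the kind of statistical query this database supports, the sketch below counts relations per contributor. It runs against an in-memory copy built from the first six rows of the output above; the table definition is a guess based on that output's column order, not the real `dws.sql_schema`.

```python
import sqlite3

stats_conn = sqlite3.connect(':memory:')
# Assumed column layout, inferred from the SELECT * output above
stats_conn.execute(
    'CREATE TABLE relations (id INTEGER PRIMARY KEY, user TEXT, uid INTEGER, '
    'version TEXT, changeset INTEGER, timestamp TEXT)')

sample_rows = [
    (38763, 'sctrojan79', 2601744, '3', 40703435, '2016-07-13T05:40:26Z'),
    (304272, 'Maarten Deen', 9176, '5', 33489234, '2015-08-21T17:49:07Z'),
    (304314, 'Maarten Deen', 9176, '2', 33136718, '2015-08-05T18:54:47Z'),
    (304325, 'eric22', 160949, '2', 18699468, '2013-11-03T18:23:55Z'),
    (304344, 'eric22', 160949, '2', 18699468, '2013-11-03T18:23:59Z'),
    (304359, 'eric22', 160949, '5', 18770230, '2013-11-07T19:16:00Z'),
]
stats_conn.executemany('INSERT INTO relations VALUES (?,?,?,?,?,?);', sample_rows)

# Relations per contributor, most active first
top_users = stats_conn.execute(
    'SELECT user, COUNT(*) AS n FROM relations '
    'GROUP BY user ORDER BY n DESC, user;').fetchall()
print(top_users)
stats_conn.close()
```

Against the full database, the same query can be run on the `cur` cursor before the connection is closed.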