# OpenStreetMap - A Data Cleaning Case Study

# Cleaning [OpenStreetMap](http://www.openstreetmap.org/) Data

If you're wondering what OpenStreetMap is, think of it as Wikipedia but for meta-data-filled maps of the world. I would argue that it's one of the most important technologies you've probably never heard of. From the [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page):

> "Welcome to OpenStreetMap, the project that creates and distributes free geographic data for the world. We started it because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive, or unexpected ways."

If you're like me and you're refining your XML parsing/Python scripting/SQL querying skills, why not use said skills to clean some OpenStreetMap data? I'm going to be as detailed as possible below so anyone interested can follow along as an exercise. 

According to [The New York Times' 'What Could Disappear'](http://www.nytimes.com/interactive/2012/11/24/opinion/sunday/what-could-disappear.html?_r=2&) interactive article, New Orleans could be 88% below sea level in as little as 100 years. I'm personally interested to see what we're set to lose as a consequence of global warming, so I'll focus on an approximately 100x150 mile area [surrounding New Orleans](https://mapzen.com/data/metro-extracts/metro/new-orleans_louisiana/).

___

### Getting to know the data:
OpenStreetMap data is available as XML bearing the file type '.osm'. Reading through the [documentation](https://wiki.openstreetmap.org/wiki/Main_Page), the gist of the main components of an osm file is:
* a **'node'** element essentially represents a latitude/longitude coordinate. It may have 'tag' child elements with other data points comprising things like an 'address' or features such as 'school'.
* a **'way'** element has nodes as children (with their element type shortened to 'nd'). It defines things like roads, buildings, natural areas, etc. It may also have 'tag' children for the same purpose as a node.
* a **'relation'** has nested 'member' elements which reference existing ways and nodes. It defines relationships among the other elements such as an extended hiking trail made up by a number of way elements.
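
To make the three element types concrete, here is a minimal sketch that parses a tiny hand-written OSM fragment with Python's `xml.etree.ElementTree`. The ids, coordinates and tag values in the sample are invented for illustration and are not taken from the New Orleans extract; `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: ids, coordinates and tag values are made up.
SAMPLE_OSM = """<osm>
  <node id="1" lat="29.95" lon="-90.07"/>
  <node id="2" lat="29.96" lon="-90.08">
    <tag k="amenity" v="school"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role=""/>
    <tag k="type" v="route"/>
  </relation>
</osm>"""

def summarize(osm_xml):
    """Return (element type, id, referenced ids, tags) for each top-level element."""
    root = ET.fromstring(osm_xml)
    summary = []
    for elem in root:
        # 'tag' children hold the key/value meta-data of the element
        tags = {t.attrib["k"]: t.attrib["v"] for t in elem.findall("tag")}
        # 'nd' (in ways) and 'member' (in relations) point at other elements by id
        refs = [c.attrib["ref"] for c in elem if c.tag in ("nd", "member")]
        summary.append((elem.tag, elem.attrib["id"], refs, tags))
    return summary

for row in summarize(SAMPLE_OSM):
    print(row)
```

Loading the whole document with `ET.fromstring` is fine for a sample this small, but it would not be a good fit for the full extract.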

The osm file unzips to a whopping 1.28 GB and crashes both Atom and SublimeText on my machine. [VIM](http://www.vim.org/), on the other hand, is a text editor designed to be used in a bare-bones command line interface (like the Terminal app on a Mac) and can easily handle the job, allowing me to jump around and explore freely. Some VIM commands that let me do this are:
* Jump to bottom: `shift+g`
* Jump to top: `gg`
* Scroll half a page down: `Control+d`
* Scroll half a page up: `Control+u`
* Jump five million lines down: type `5000000` and then hit `j`
* Jump five million lines up: type `5000000` and then hit `k`
* Search the entire massive file for anything you want: type `/` followed by your search
* Go to the next instance of what you searched for: `n`
    
*Check out [this little gem](https://vim-adventures.com/) to get started with VIM in a fun way.*

Here is what I learned from exploring the file:
* the file has the standard opening: `<?xml version='1.0' encoding='UTF-8'?>`
* the XML root element is 'osm' and is parent to all other elements
* there are 16,082,009 lines in total - wow!
* an example of a node element with no tags:

![node](supporting_files/screenshots/node.png "node")

* a node element with tags:

![node with tags](supporting_files/screenshots/node_tags.png "node with tags")

* a way element with its nested nd (node) element references and tags:

![way](supporting_files/screenshots/way.png "way")

* a relation element with its nested nodes, ways and tags:

![relation](supporting_files/screenshots/relation.png "relation")
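
A file this size is easier to summarize in code than by scrolling. As a minimal sketch (not part of the original analysis), the hypothetical helper `count_elements` below streams through an OSM document with `iterparse` and tallies every element by tag name. It is shown against a small made-up in-memory sample; passing the path of the real file instead should work the same way, assuming the file follows the structure described above.

```python
import collections
import io
import xml.etree.ElementTree as ET

# Made-up sample document standing in for the real extract.
SAMPLE = (b"<osm><node id='1'/><node id='2'><tag k='a' v='b'/></node>"
          b"<way id='3'><nd ref='1'/><nd ref='2'/></way></osm>")

def count_elements(source):
    """Count elements by tag name while streaming, so memory use stays small."""
    counts = collections.Counter()
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    depth = 0                # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
        else:
            counts[elem.tag] += 1
            depth -= 1
            if depth == 0:
                # a direct child of the root just closed; free it
                root.clear()
    return counts

print(count_elements(io.BytesIO(SAMPLE)))
```

The root element itself is included in the tally because its closing tag also produces an 'end' event.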

### Reflecting upon the data - what should I clean?:
'tag' elements seem to constitute the bulk of a 'way' or 'relation' element's meta-data. From what I understand about OpenStreetMap, this should also be where a lot of user-generated/added content is stored. I'm sure there will be cleaning to do there. One other thing catches my eye - the large number of 'tag' elements whose 'k' attribute contains the acronym 'NHD'. What is that about?

It turns out NHD is an acronym for the US Geological Survey's ['National Hydrography Dataset'](https://nhd.usgs.gov/) which maps out and documents the nation's watershed boundaries and their features. New Orleans is surrounded by a massive amount of such natural features:


![new orleans watershed](supporting_files/screenshots/new_orleans_watershet.png "new orleans watershed")


I would assume the data to be damn near perfect given it was most likely adopted directly from the NHD. The wonderful thing here is that, without much trouble, I can programmatically verify whether or not that is true despite the over 16 million lines of data to deal with. This could be another opportunity for data cleaning.

Here we go!!!

### Auditing the data:
Looking again at the sample 'way' and 'relation' elements in the 'Getting to know the data:' section above, 'NHD' stands out as a prefix for the field names of the NHD data (ex. 'ComID', 'FCode' and 'FTYPE'). 'FTYPE' appears to define a type of natural feature, which in the case of our examples is a stream/river or a swamp/marsh. This is exactly what I'm interested in for my cleaning so I'm going to leverage it to isolate the 'node', 'way' or 'relation' elements I want and have a look at all the possible keys: 

In [1]:
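import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def count_elem(osm_file, tags=('node', 'way', 'relation')):
    """Collect every tag key used on elements that carry an 'NHD:ComID' tag."""
    tag_set = set()

    # Stream the file so the whole document never sits in memory at once.
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root 'osm' element

    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            keys = [tag_elem.attrib['k'] for tag_elem in elem.iter("tag")]
            # Only elements with NHD data are of interest here.
            if 'NHD:ComID' in keys:
                tag_set.update(keys)
        # Detach everything parsed so far from the root to keep memory usage flat.
        root.clear()
    return tag_set

keys_list = list(count_elem(OSM_FILE))
sorted(keys_list)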

['NHD:ComID',
 'NHD:Elevation',
 'NHD:FCode',
 'NHD:FDate',
 'NHD:FTYPE',
 'NHD:FType',
 'NHD:GNIS_ID',
 'NHD:GNIS_Name',
 'NHD:Permanent_',
 'NHD:RESOLUTION',
 'NHD:ReachCode',
 'NHD:Resolution',
 'NHD:way_id',
 'admin_level',
 'attribution',
 'boat',
 'boundary',
 'created_by',
 'culvert',
 'ele',
 'gnis:county_id',
 'gnis:created',
 'gnis:feature_id',
 'gnis:state_id',
 'history',
 'landuse',
 'layer',
 'leisure',
 'lock',
 'man_made',
 'name',
 'natural',
 'note',
 'place',
 'source',
 'tunnel',
 'type',
 'water',
 'waterway',
 'wetland']

Looking over the list I can already see some possible redundancies in the field naming convention for some of the NHD keys. I can reference these field names against the official [NHD data model](https://nhd.usgs.gov/NHDv2.2.1_poster_081216.pdf) (last updated on August 1st 2016) and [NHD data dictionary](https://nhd.usgs.gov/userGuide/Robohelpfiles/NHD_User_Guide/Feature_Catalog/Data_Dictionary/Data_Dictionary.htm) in order to clarify. I'll investigate the validity of each one and note here those that I find problematic or noteworthy:

##### NHD:ComID:
The 'Model Changes' section of the NHD data model notes, "ComID field deleted from all feature classes/tables". This field has been replaced by one called 'Permanent_Identifier', yet it is still valid as the NHD data dictionary clarifies - "features already assigned a ComID retain that value as the Permanent_ Identifier". While this is an important note for future data entry, it will not require cleaning here.

##### NHD:FTYPE and NHD:FType
Manually searching through my osm file with VIM (ex. `/NHD:FTYPE`) reveals that both 'FTYPE' and 'FType' refer to the same category of things - 'SwampMarsh', 'StreamRiver', etc. It's pretty clear they can be united into a single field name which, according to the NHD data model, should be 'FType'.

##### NHD:GNIS_ID
'GNIS' stands for the '[Geographic Names Information System](https://nhd.usgs.gov/gnis.html)', which is another data set from the U.S. Geological Survey:

>"The Geographic Names Information System (GNIS), developed by the U.S. Geological Survey in cooperation with the U.S. Board on Geographic Names, contains information about physical and cultural geographic features in the United States and associated areas..."

Each instance of 'NHD:GNIS_ID' that I find in the osm file seems to be followed by another 'tag' element containing the key 'gnis:feature_id' that bears the exact same number as its value. It's fair to say that this is redundant data. Because fixing this means removing data, the best I can do here is to point out the pairs so that OpenStreetMap can remove the data should they choose. I'll do this via SQL queries later in this case study.

##### NHD:Permanent_
When I manually search the osm file for this key, I find that its value is not a "40-char GUID value that uniquely identifies the occurrence of each feature in The National Map. National Database primary key." as stated in the NHD Data Model dictionary. In fact, if I look again at the other 'tag' elements, I can see that the same value is repeated in the 'NHD:ComID' tag element. I can find elements with duplicate values for NHD:ComID and NHD:Permanent_ using SQL later in this case study.

##### NHD:RESOLUTION
Just as 'NHD:FTYPE' is redundant with 'NHD:FType', 'NHD:RESOLUTION' is redundant with 'NHD:Resolution'. 
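
The two case-variant pairs above ('NHD:FTYPE'/'NHD:FType' and 'NHD:RESOLUTION'/'NHD:Resolution') could be unified with a simple lookup. The sketch below is only an assumption about how such a cleaning step might look; `KEY_FIXES`, `normalize_key` and `normalize_tags` are hypothetical names, and choosing 'NHD:Resolution' as the canonical spelling is a guess made by analogy with 'FType'.

```python
# Hypothetical mapping from the variant spelling to the spelling to keep.
# 'NHD:Resolution' as the canonical form is an assumption, not confirmed above.
KEY_FIXES = {
    "NHD:FTYPE": "NHD:FType",
    "NHD:RESOLUTION": "NHD:Resolution",
}

def normalize_key(key):
    """Return the canonical spelling of a tag key, or the key unchanged."""
    return KEY_FIXES.get(key, key)

def normalize_tags(tags):
    """Apply normalize_key to a list of (key, value) pairs; values are untouched."""
    return [(normalize_key(key), value) for key, value in tags]

# Made-up tag values, for illustration only.
print(normalize_tags([("NHD:FTYPE", "SwampMarsh"), ("name", "Example Bayou")]))
```

Keeping the fixes in a dictionary means further variants can be added without touching the functions.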

##### NHD:way_id
Just as 'NHD:Permanent_' bears the same value as 'NHD:ComID', so does 'NHD:way_id'. There is something going on here. Also, 'way_id' is not an official feature of the NHD data model. I'll use an SQL query to show where it duplicates 'NHD:ComID' and can be removed.
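
The duplicate-value checks described in this section lend themselves to a self-join. As a rough sketch, assuming the tags have been loaded into a hypothetical table `tags(elem_id, key, value)` (the schema actually used for the SQL queries may differ), the helper below lists elements where a second key repeats the value of a first key. The rows are made-up sample data, and `find_duplicates` is a name introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per 'tag' child of a node, way or relation.
conn.execute("CREATE TABLE tags (elem_id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [  # invented sample rows
        (1, "NHD:ComID", "111"),
        (1, "NHD:Permanent_", "111"),
        (1, "NHD:way_id", "111"),
        (2, "NHD:ComID", "222"),
        (2, "NHD:Permanent_", "999"),
        (3, "NHD:GNIS_ID", "555"),
        (3, "gnis:feature_id", "555"),
    ],
)

def find_duplicates(conn, key_a, key_b):
    """Return (elem_id, value) rows where key_a and key_b share a value on one element."""
    query = """
        SELECT a.elem_id, a.value
        FROM tags AS a
        JOIN tags AS b
          ON a.elem_id = b.elem_id AND a.value = b.value
        WHERE a.key = ? AND b.key = ?
        ORDER BY a.elem_id
    """
    return conn.execute(query, (key_a, key_b)).fetchall()

print(find_duplicates(conn, "NHD:ComID", "NHD:Permanent_"))
print(find_duplicates(conn, "NHD:ComID", "NHD:way_id"))
print(find_duplicates(conn, "NHD:GNIS_ID", "gnis:feature_id"))
```

Passing the keys as query parameters lets the same function cover each of the redundant pairs noted above.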

As for the remaining fields, there isn't an official source against which I can verify that they are structurally correct or what data they correspond to. So I'll just print out each one with the set of all the unique values it takes in the osm file:

In [2]:
import xml.etree.ElementTree as ET
import collections as col
import pprint

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def make_elem_dict(osm_file, tags=('node', 'way', 'relation')):
    """Map each tag key found on NHD elements to the set of values it takes."""
    elem_dict = col.defaultdict(set)

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root 'osm' element

    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            pairs = [(t.attrib['k'], t.attrib['v']) for t in elem.iter("tag")]
            # Only keep elements that carry NHD data.
            if any(key == 'NHD:ComID' for key, _ in pairs):
                for key, value in pairs:
                    elem_dict[key].add(value)
        # Detach everything parsed so far from the root to keep memory usage flat.
        root.clear()
    return elem_dict

pprint.pprint(make_elem_dict(OSM_FILE))

defaultdict(<class 'set'>,
            {'NHD:ComID': {'121369572',
                           '121376748',
                           '121376910',
                           '121376927',
                           '121379126',
                           '131339718',
                           '131339719',
                           '131339720',
                           '131339721',
                           '133385498',
                           '137031874',
                           '137032010',
                           '137032289',
                           '137032340',
                           '139142583',
                           '139142584',
                           '139142585',
                           '139142586',
                           '139142587',
                           '139142588',
                           '139142590',
                           '139142591',
                           '139142595',
                           '139142597',
             

                           '139145348',
                           '139145350',
                           '139145352',
                           '139145354',
                           '139145355',
                           '139145358',
                           '139145359',
                           '139145361',
                           '139145364',
                           '139145365',
                           '139145366',
                           '139145367',
                           '139145369',
                           '139145370',
                           '139145372',
                           '139145375',
                           '139145376',
                           '139145377',
                           '139145378',
                           '139145379',
                           '139145381',
                           '139145382',
                           '139145386',
                           '139145387',
                           '139145389',


                           '139147884',
                           '139147887',
                           '139147888',
                           '139147890',
                           '139147891',
                           '139147892',
                           '139147893',
                           '139147894',
                           '139147897',
                           '139147898',
                           '139147899',
                           '139147900',
                           '139147901',
                           '139147902',
                           '139147903',
                           '139147905',
                           '139147907',
                           '139147909',
                           '139147910',
                           '139147911',
                           '139147912',
                           '139147914',
                           '139147916',
                           '139147917',
                           '139147918',


                           '139151903',
                           '139151905',
                           '139151906',
                           '139151907',
                           '139151908',
                           '139151909',
                           '139151911',
                           '139151912',
                           '139151913',
                           '139151914',
                           '139151915',
                           '139151917',
                           '139151918',
                           '139151919',
                           '139151920',
                           '139151921',
                           '139151922',
                           '139151923',
                           '139151924',
                           '139151925',
                           '139151926',
                           '139151927',
                           '139151929',
                           '139151930',
                           '139151932',


                           '139155779',
                           '139155780',
                           '139155781',
                           '139155782',
                           '139155784',
                           '139155786',
                           '139155787',
                           '139155788',
                           '139155789',
                           '139155790',
                           '139155791',
                           '139155792',
                           '139155793',
                           '139155799',
                           '139155801',
                           '139155802',
                           '139155803',
                           '139155804',
                           '139155806',
                           '139155807',
                           '139155808',
                           '139155809',
                           '139155810',
                           '139155811',
                           '139155813',


                           '139183580',
                           '139183581',
                           '139183582',
                           '139183583',
                           '139183584',
                           '139183585',
                           '139183586',
                           '139183587',
                           '139183588',
                           '139183589',
                           '139183590',
                           '139183591',
                           '139183592',
                           '139183593',
                           '139183594',
                           '139183596',
                           '139183597',
                           '139183598',
                           '139183601',
                           '139183602',
                           '139183604',
                           '139183606',
                           '139183609',
                           '139183610',
                           '139183611',


                           '139186026',
                           '139186028',
                           '139186029',
                           '139186031',
                           '139186032',
                           '139186033',
                           '139186035',
                           '139186036',
                           '139186037',
                           '139186038',
                           '139186039',
                           '139186040',
                           '139186042',
                           '139186043',
                           '139186046',
                           '139186047',
                           '139186049',
                           '139186050',
                           '139186052',
                           '139186053',
                           '139186054',
                           '139186056',
                           '139186058',
                           '139186060',
                           '139186061',


                           '139189004',
                           '139189005',
                           '139189006',
                           '139189007',
                           '139189008',
                           '139189009',
                           '139189010',
                           '139189012',
                           '139189013',
                           '139189014',
                           '139189015',
                           '139189016',
                           '139189017',
                           '139189018',
                           '139189019',
                           '139189020',
                           '139189021',
                           '139189022',
                           '139189023',
                           '139189024',
                           '139189025',
                           '139189026',
                           '139189027',
                           '139189028',
                           '139189029',


                           '139191653',
                           '139191654',
                           '139191656',
                           '139191657',
                           '139191658',
                           '139191659',
                           '139191660',
                           '139191661',
                           '139191662',
                           '139191663',
                           '139191664',
                           '139191665',
                           '139191666',
                           '139191667',
                           '139191668',
                           '139191670',
                           '139191672',
                           '139191673',
                           '139191674',
                           '139191676',
                           '139191677',
                           '139191678',
                           '139191679',
                           '139191680',
                           '139191681',


                           '139194120',
                           '139194121',
                           '139194122',
                           '139194123',
                           '139194124',
                           '139194125',
                           '139194126',
                           '139194127',
                           '139194130',
                           '139194131',
                           '139194132',
                           '139194133',
                           '139194134',
                           '139194135',
                           '139194136',
                           '139194137',
                           '139194138',
                           '139194139',
                           '139194140',
                           '139194141',
                           '139194142',
                           '139194143',
                           '139194144',
                           '139194145',
                           '139194146',


                           '139196912',
                           '139196913',
                           '139196914',
                           '139196915',
                           '139196916',
                           '139196919',
                           '139196920',
                           '139196922',
                           '139196923',
                           '139196924',
                           '139196925',
                           '139196926',
                           '139196927',
                           '139196928',
                           '139196929',
                           '139196930',
                           '139196931',
                           '139196932',
                           '139196933',
                           '139196934',
                           '139196935',
                           '139196937',
                           '139196938',
                           '139196939',
                           '139196940',


                           '139200034',
                           '139200035',
                           '139200036',
                           '139200037',
                           '139200038',
                           '139200039',
                           '139200040',
                           '139200041',
                           '139200042',
                           '139200043',
                           '139200044',
                           '139200045',
                           '139200046',
                           '139200048',
                           '139200049',
                           '139200050',
                           '139200051',
                           '139200052',
                           '139200053',
                           '139200054',
                           '139200055',
                           '139200056',
                           '139200057',
                           '139200058',
                           '139200059',


                           '139202975',
                           '139202976',
                           '139202977',
                           '139202978',
                           '139202979',
                           '139202980',
                           '139202981',
                           '139202983',
                           '139202984',
                           '139202985',
                           '139202986',
                           '139202987',
                           '139202988',
                           '139202989',
                           '139202990',
                           '139202993',
                           '139202994',
                           '139202996',
                           '139202997',
                           '139202998',
                           '139202999',
                           '139203000',
                           '139203001',
                           '139203002',
                           '139203003',


                           '139205566',
                           '139205567',
                           '139205568',
                           '139205569',
                           '139205570',
                           '139205571',
                           '139205572',
                           '139205573',
                           '139205574',
                           '139205575',
                           '139205576',
                           '139205577',
                           '139205578',
                           '139205579',
                           '139205580',
                           '139205581',
                           '139205582',
                           '139205583',
                           '139205584',
                           '139205585',
                           '139205586',
                           '139205587',
                           '139205588',
                           '139205589',
                           '139205590',


                           '139207885',
                           '139207886',
                           '139207887',
                           '139207888',
                           '139207889',
                           '139207890',
                           '139207891',
                           '139207892',
                           '139207893',
                           '139207894',
                           '139207895',
                           '139207896',
                           '139207897',
                           '139207898',
                           '139207899',
                           '139207900',
                           '139207901',
                           '139207902',
                           '139207903',
                           '139207904',
                           '139207905',
                           '139207906',
                           '139207907',
                           '139207908',
                           '139207909',


                           '141719587',
                           '141719589',
                           '141719591',
                           '141719592',
                           '141719593',
                           '141719595',
                           '141719596',
                           '141719598',
                           '141719603',
                           '141719605',
                           '141719607',
                           '141719608',
                           '141719609',
                           '141719610',
                           '141719612',
                           '141719615',
                           '141719616',
                           '141719619',
                           '141719620',
                           '141719621',
                           '141719622',
                           '141719623',
                           '141719624',
                           '141719625',
                           '141719626',


                           '141724096',
                           '141724097',
                           '141724098',
                           '141724099',
                           '141724100',
                           '141724102',
                           '141724103',
                           '141724107',
                           '141724108',
                           '141724109',
                           '141724110',
                           '141724111',
                           '141724114',
                           '141724117',
                           '141724118',
                           '141724120',
                           '141724122',
                           '141724124',
                           '141724126',
                           '141724127',
                           '141724128',
                           '141724134',
                           '141724135',
                           '141724137',
                           '141724139',


                           '143843420',
                           '143843421',
                           '143843422',
                           '143843426',
                           '143843437',
                           '143843452',
                           '143843453',
                           '143843465',
                           '143843467',
                           '143843470',
                           '143843471',
                           '143843472',
                           '143843473',
                           '143843475',
                           '143843476',
                           '143843480',
                           '143843481',
                           '143843484',
                           '143843487',
                           '143843488',
                           '143843490',
                           '143843491',
                           '143843498',
                           '143843499',
                           '143843501',


                           '143858110',
                           '143858112',
                           '143858113',
                           '143858116',
                           '143858117',
                           '143858118',
                           '143858119',
                           '143858120',
                           '143858123',
                           '143858125',
                           '143858127',
                           '143858128',
                           '143858130',
                           '143858131',
                           '143858133',
                           '143858135',
                           '143858137',
                           '143858138',
                           '143858141',
                           '143858143',
                           '143858144',
                           '143858153',
                           '143858155',
                           '143858159',
                           '143858165',


                           '148742837',
                           '148742847',
                           '148742849',
                           '148742850',
                           '148742851',
                           '148742852',
                           '148742859',
                           '148742863',
                           '148742865',
                           '148742873',
                           '148742879',
                           '148742884',
                           '148742889',
                           '148742894',
                           '148742895',
                           '148742896',
                           '148742901',
                           '148742916',
                           '148742917',
                           '148742922',
                           '148742923',
                           '148742924',
                           '148742925',
                           '148742926',
                           '148742929',


                           '148750880',
                           '148750881',
                           '148750883',
                           '148750885',
                           '148750886',
                           '148750889',
                           '148750893',
                           '148750894',
                           '148750895',
                           '148750896',
                           '148750897',
                           '148750898',
                           '148750899',
                           '148750901',
                           '148750902',
                           '148750903',
                           '148750904',
                           '148750905',
                           '148750906',
                           '148750907',
                           '148750908',
                           '148750909',
                           '148750910',
                           '148750911',
                           '148750912',


                           '151082037',
                           '151082038',
                           '151082039',
                           '151082040',
                           '151082041',
                           '151082042',
                           '151082043',
                           '151082044',
                           '151082045',
                           '151082046',
                           '151082047',
                           '151082048',
                           '151082049',
                           '151082050',
                           '151082051',
                           '151082052',
                           '151082053',
                           '151082054',
                           '151082055',
                           '151082056',
                           '151082057',
                           '151082058',
                           '151082059',
                           '151082060',
                           '151082061',


                           '151098303',
                           '151098304',
                           '151098305',
                           '151098306',
                           '151098307',
                           '151098308',
                           '151098309',
                           '151098310',
                           '151098311',
                           '151098312',
                           '151098313',
                           '151098314',
                           '151098315',
                           '151098316',
                           '151098317',
                           '151098318',
                           '151098319',
                           '151098320',
                           '151098321',
                           '151098322',
                           '151098323',
                           '151098324',
                           '151098325',
                           '151098327',
                           '151098328',


                                '151097986',
                                '151097987',
                                '151097988',
                                '151097989',
                                '151097990',
                                '151097991',
                                '151097992',
                                '151097993',
                                '151097994',
                                '151097995',
                                '151097996',
                                '151097997',
                                '151097998',
                                '151097999',
                                '151098000',
                                '151098001',
                                '151098002',
                                '151098003',
                                '151098004',
                                '151098005',
                                '151098006',
                                '151098007',
          

                               '08090201007644',
                               '08090201007645',
                               '08090201007646',
                               '08090201007647',
                               '08090201007648',
                               '08090201007649',
                               '08090201007650',
                               '08090201007651',
                               '08090201007652',
                               '08090201007653',
                               '08090201007654',
                               '08090201007656',
                               '08090201007657',
                               '08090201007658',
                               '08090201007659',
                               '08090201007660',
                               '08090201007661',
                               '08090201007662',
                               '08090201007663',
                               '08090201007664',
                    

                               '08090203031224',
                               '08090203031225',
                               '08090203031226',
                               '08090203031229',
                               '08090203031230',
                               '08090203031231',
                               '08090203031232',
                               '08090203031234',
                               '08090203031235',
                               '08090203031237',
                               '08090203031238',
                               '08090203031240',
                               '08090203031241',
                               '08090203031242',
                               '08090203031243',
                               '08090203031244',
                               '08090203031245',
                               '08090203031246',
                               '08090203031248',
                               '08090203031249',
                    

                               '08090203033082',
                               '08090203033083',
                               '08090203033084',
                               '08090203033085',
                               '08090203033086',
                               '08090203033087',
                               '08090203033088',
                               '08090203033089',
                               '08090203033090',
                               '08090203033091',
                               '08090203033092',
                               '08090203033094',
                               '08090203033095',
                               '08090203033097',
                               '08090203033098',
                               '08090203033099',
                               '08090203033100',
                               '08090203033101',
                               '08090203033102',
                               '08090203033103',
                    

                               '08090203035625',
                               '08090203035626',
                               '08090203035627',
                               '08090203035628',
                               '08090203035630',
                               '08090203035631',
                               '08090203035632',
                               '08090203035636',
                               '08090203035637',
                               '08090203035638',
                               '08090203035639',
                               '08090203035640',
                               '08090203035641',
                               '08090203035642',
                               '08090203035643',
                               '08090203035646',
                               '08090203035648',
                               '08090203035649',
                               '08090203035650',
                               '08090203035651',
                    

                               '08090203038686',
                               '08090203038687',
                               '08090203038688',
                               '08090203038689',
                               '08090203038690',
                               '08090203038691',
                               '08090203038692',
                               '08090203038693',
                               '08090203038694',
                               '08090203038695',
                               '08090203038696',
                               '08090203038697',
                               '08090203038698',
                               '08090203038699',
                               '08090203038700',
                               '08090203038701',
                               '08090203038702',
                               '08090203038703',
                               '08090203038704',
                               '08090203038705',
                    

                               '08090203041628',
                               '08090203041629',
                               '08090203041630',
                               '08090203041631',
                               '08090203041633',
                               '08090203041634',
                               '08090203041635',
                               '08090203041636',
                               '08090203041639',
                               '08090203041640',
                               '08090203041641',
                               '08090203041642',
                               '08090203041643',
                               '08090203041644',
                               '08090203041646',
                               '08090203041647',
                               '08090203041648',
                               '08090203041649',
                               '08090203041650',
                               '08090203041651',
                    

                               '08090203044552',
                               '08090203044553',
                               '08090203044554',
                               '08090203044555',
                               '08090203044556',
                               '08090203044557',
                               '08090203044558',
                               '08090203044559',
                               '08090203044560',
                               '08090203044561',
                               '08090203044562',
                               '08090203044563',
                               '08090203044565',
                               '08090203044566',
                               '08090203044567',
                               '08090203044568',
                               '08090203044569',
                               '08090203044570',
                               '08090203044571',
                               '08090203044572',
                    

                               '08090203047269',
                               '08090203047270',
                               '08090203047271',
                               '08090203047272',
                               '08090203047274',
                               '08090203047275',
                               '08090203047277',
                               '08090203047279',
                               '08090203047281',
                               '08090203047282',
                               '08090203047284',
                               '08090203047285',
                               '08090203047286',
                               '08090203047288',
                               '08090203047289',
                               '08090203047290',
                               '08090203047291',
                               '08090203047292',
                               '08090203047293',
                               '08090203047294',
                    

                               '08090301036435',
                               '08090301036439',
                               '08090301036440',
                               '08090301036442',
                               '08090301036443',
                               '08090301036447',
                               '08090301036450',
                               '08090301036452',
                               '08090301036461',
                               '08090301036462',
                               '08090301036463',
                               '08090301036464',
                               '08090301036466',
                               '08090301036471',
                               '08090301036473',
                               '08090301036475',
                               '08090301036477',
                               '08090301036481',
                               '08090301036483',
                               '08090301036485',
                    

                               '3180004000649',
                               '3180004000650',
                               '3180004000651',
                               '3180004000653',
                               '3180004000658',
                               '3180004000661',
                               '3180004000662',
                               '3180004000663',
                               '3180004000664',
                               '3180004000665',
                               '3180004000671',
                               '3180004000672',
                               '3180004000686',
                               '3180004000688',
                               '3180004000689',
                               '3180004000690',
                               '3180004000693',
                               '3180004000694',
                               '3180004000695',
                               '3180004000696',
                               '31800040

                               '8090100007644',
                               '8090100007645',
                               '8090100007647',
                               '8090100007651',
                               '8090100007652',
                               '8090100007653',
                               '8090100007662',
                               '8090100007663',
                               '8090100007673',
                               '8090100007674',
                               '8090100007676',
                               '8090100007677',
                               '8090100007678',
                               '8090100007680',
                               '8090100007681',
                               '8090100007682',
                               '8090100007683',
                               '8090100007684',
                               '8090100007685',
                               '8090100007688',
                               '80901000

                               '8090203007936',
                               '8090203007937',
                               '8090203007941',
                               '8090203007942',
                               '8090203007946',
                               '8090203007950',
                               '8090203007956',
                               '8090203007957',
                               '8090203007959',
                               '8090203007962',
                               '8090203007963',
                               '8090203007965',
                               '8090203007966',
                               '8090203007973',
                               '8090203007978',
                               '8090203007981',
                               '8090203007985',
                               '8090203007989',
                               '8090203007990',
                               '8090203007996',
                               '80902030

                               '8090203012937',
                               '8090203012939',
                               '8090203012940',
                               '8090203012941',
                               '8090203012942',
                               '8090203012946',
                               '8090203012947',
                               '8090203012948',
                               '8090203012951',
                               '8090203012952',
                               '8090203012959',
                               '8090203012961',
                               '8090203012962',
                               '8090203012964',
                               '8090203012965',
                               '8090203012966',
                               '8090203012968',
                               '8090203012971',
                               '8090203012973',
                               '8090203012974',
                               '80902030

                               '8090203017352',
                               '8090203017353',
                               '8090203017364',
                               '8090203017375',
                               '8090203017378',
                               '8090203017380',
                               '8090203017384',
                               '8090203017389',
                               '8090203017390',
                               '8090203017391',
                               '8090203017392',
                               '8090203017394',
                               '8090203017395',
                               '8090203017396',
                               '8090203017397',
                               '8090203017399',
                               '8090203017401',
                               '8090203017402',
                               '8090203017404',
                               '8090203017407',
                               '80902030

                               '8090203020664',
                               '8090203020665',
                               '8090203020666',
                               '8090203020667',
                               '8090203020670',
                               '8090203020672',
                               '8090203020674',
                               '8090203020676',
                               '8090203020677',
                               '8090203020680',
                               '8090203020681',
                               '8090203020683',
                               '8090203020686',
                               '8090203020688',
                               '8090203020690',
                               '8090203020694',
                               '8090203020699',
                               '8090203020701',
                               '8090203020702',
                               '8090203020705',
                               '80902030

                            '139145127',
                            '139145129',
                            '139145131',
                            '139145132',
                            '139145133',
                            '139145134',
                            '139145137',
                            '139145138',
                            '139145142',
                            '139145143',
                            '139145144',
                            '139145145',
                            '139145147',
                            '139145151',
                            '139145153',
                            '139145154',
                            '139145155',
                            '139145157',
                            '139145158',
                            '139145160',
                            '139145163',
                            '139145164',
                            '139145166',
                            '139145168',
                

                            '139148174',
                            '139148175',
                            '139148176',
                            '139148177',
                            '139148179',
                            '139148181',
                            '139148184',
                            '139148185',
                            '139148187',
                            '139148188',
                            '139148189',
                            '139148190',
                            '139148191',
                            '139148194',
                            '139148196',
                            '139148197',
                            '139148198',
                            '139148199',
                            '139148200',
                            '139148201',
                            '139148202',
                            '139148204',
                            '139148205',
                            '139148207',
                

                            '139151889',
                            '139151890',
                            '139151892',
                            '139151894',
                            '139151897',
                            '139151899',
                            '139151901',
                            '139151903',
                            '139151905',
                            '139151906',
                            '139151907',
                            '139151908',
                            '139151909',
                            '139151911',
                            '139151912',
                            '139151913',
                            '139151914',
                            '139151915',
                            '139151917',
                            '139151918',
                            '139151919',
                            '139151920',
                            '139151921',
                            '139151922',
                

                            '139155663',
                            '139155664',
                            '139155665',
                            '139155666',
                            '139155668',
                            '139155671',
                            '139155672',
                            '139155673',
                            '139155674',
                            '139155675',
                            '139155676',
                            '139155678',
                            '139155680',
                            '139155681',
                            '139155682',
                            '139155684',
                            '139155685',
                            '139155688',
                            '139155691',
                            '139155692',
                            '139155693',
                            '139155694',
                            '139155695',
                            '139155698',
                

                            '148744168',
                            '148744169',
                            '148744170',
                            '148744171',
                            '148744172',
                            '148744173',
                            '148744174',
                            '148744175',
                            '148744176',
                            '148744177',
                            '148744178',
                            '148744179',
                            '148744180',
                            '148744181',
                            '148744182',
                            '148744183',
                            '148744184',
                            '148744185',
                            '148744186',
                            '148744187',
                            '148744188',
                            '148744189',
                            '148744190',
                            '148744191',
                

                                 '00556228',
                                 '00556414',
                                 '00556441',
                                 '00556548',
                                 '00558149',
                                 '00558153',
                                 '00558225',
                                 '00558306',
                                 '00558625',
                                 '00558682',
                                 '00558710',
                                 '00558722',
                                 '00558730',
                                 '00558765',
                                 '00559628',
                                 '00559695',
                                 '00559711',
                                 '00559750',
                                 '00559752',
                                 '00559823',
                                 '00559851',
                                 '00559853',
          

Starting from the top of the list, the standouts are:

##### NHD:ComID
Some of these tags hold more than one ComID value. I tracked one down (way 43261226), but there's really nothing I can do about it here.

##### NHD:FCode
Again, some tags hold more than one entry, but multiple values seem acceptable for this key.


##### NHD:FDate
Duplicates here too, and nothing I can do about them.

##### NHD:FTYPE and NHD:FType
These two keys differ only in capitalization, so they will be merged.
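
A minimal sketch of how such a merge could work, assuming `NHD:FType` is kept as the canonical spelling. The `KEY_ALIASES` table and the `normalize_key` helper are hypothetical names for illustration, not taken from data.py:

```python
# Hypothetical alias table: variant key -> canonical key.
KEY_ALIASES = {"NHD:FTYPE": "NHD:FType"}

def normalize_key(key):
    """Return the canonical spelling of a tag key, or the key unchanged."""
    return KEY_ALIASES.get(key, key)

print(normalize_key("NHD:FTYPE"))  # NHD:FType
print(normalize_key("NHD:FCode"))  # NHD:FCode
```

Keeping the aliases in a dictionary means another variant spelling only needs one more entry.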

##### NHD:ReachCode
* "Unique identifier composed of two parts, first eight digits = subbasin code as defined by FIPS 103, and next six digits = random-assigned sequential number unique within a Cataloguing Unit."
    * Issues that I see but can't fix:
        * It is supposed to be a unique identifier, but some elements have more than one. Manually inspecting one such element (way ID 43393226) doesn't reveal why, though I noticed it also has two NHD:ComID values.
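
The two-part structure in the quoted definition can be checked mechanically. This is only a sketch: `split_reach_code` is a hypothetical helper, and treating a 13-digit value as a 14-digit code that lost its leading zero is my assumption based on the values listed above, not something the NHD definition states.

```python
import re

def split_reach_code(code):
    """Split a ReachCode into (subbasin, sequence), or return None if malformed.

    Assumption: a 13-digit value is a 14-digit code that lost its leading zero.
    """
    if not re.fullmatch(r"\d{13,14}", code):
        return None
    padded = code.zfill(14)
    # First eight digits = subbasin code, next six = sequential number.
    return padded[:8], padded[8:]

print(split_reach_code("08090201007644"))  # ('08090201', '007644')
print(split_reach_code("8090100007644"))   # ('08090100', '007644')
print(split_reach_code("31800040"))        # None
```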
           
##### NHD:way_id 
This is not an official field, and it duplicates other data.

##### name
Scanning through 'name', the only inconsistency I can see is capitalization: some names capitalize the first letter of each word and others don't. I'll fix that.
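
One way the capitalization fix could look, as a sketch rather than the exact code in data.py (`capitalize_words` is a hypothetical name):

```python
def capitalize_words(name):
    """Upper-case the first letter of each whitespace-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in name.split())

print(capitalize_words("yellow lake bayou"))  # Yellow Lake Bayou
print(capitalize_words("Abita creek"))        # Abita Creek
```

This leaves the rest of each word untouched, unlike `str.title()`, which would also upper-case the letter after an apostrophe. Note that `split()` collapses runs of whitespace into a single space.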

##### natural
There appears to be a redundant value between '_coastline' and 'coastline'. A quick Vim search of the osm document for '_coastline' reveals a 'note' key in a 'tag' element reading "I have altered the natural:coatline tag as this way duplicates existing coastline ways". Not touching this one.

##### note
I'm not really sure what this is about, so I'll write a little script that pulls out every element carrying both an 'NHD:ComID' tag and a 'note' tag so I can inspect them:



In [3]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def find_note_tag(osm_file):
    """Dump every node/way/relation that has both an 'NHD:ComID' and a 'note' tag."""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            keys = {tag_elem.attrib['k'] for tag_elem in elem.iter("tag")}
            if 'NHD:ComID' in keys and 'note' in keys:
                # ET.dump() writes to stdout itself and returns None, so
                # wrapping it in print() adds a 'None' line after each element.
                print(ET.dump(elem))
            # Drop the elements parsed so far to keep memory use low.
            root.clear()
find_note_tag(OSM_FILE)

<way changeset="12512084" id="46179177" timestamp="2012-07-27T14:15:07Z" uid="22925" user="ELadner" version="9">
		<nd ref="549574941" />
		<nd ref="549574939" />
		<nd ref="1122849628" />
		<nd ref="1121382828" />
		<nd ref="549574933" />
		<nd ref="1121161585" />
		<nd ref="549574926" />
		<nd ref="549574925" />
		<nd ref="549574922" />
		<nd ref="549574921" />
		<nd ref="1122910760" />
		<nd ref="1121424207" />
		<nd ref="1120689016" />
		<nd ref="1120757160" />
		<nd ref="549574913" />
		<nd ref="1121246452" />
		<nd ref="549574906" />
		<nd ref="549574904" />
		<nd ref="549574902" />
		<nd ref="549574900" />
		<nd ref="549574898" />
		<nd ref="549574649" />
		<nd ref="549574647" />
		<nd ref="549574646" />
		<nd ref="1122150873" />
		<nd ref="1120558234" />
		<nd ref="549574642" />
		<nd ref="1121289595" />
		<nd ref="549574639" />
		<nd ref="549574638" />
		<nd ref="549574637" />
		<nd ref="549574636" />
		<nd ref="1120666323" />
		<nd ref="549574633" />
		<nd ref="549574632" />


		<tag k="NHD:FCode" v="44500" />
		<tag k="NHD:FDate" v="2009/01/15" />
		<tag k="NHD:FTYPE" v="SeaOcean" />
		<tag k="NHD:RESOLUTION" v="High" />
	</way>
	
None
<way changeset="12506377" id="173109934" timestamp="2012-07-27T03:21:58Z" uid="22925" user="ELadner" version="1">
		<nd ref="549575326" />
		<nd ref="1839421429" />
		<nd ref="1839421436" />
		<nd ref="1839421434" />
		<nd ref="1839421433" />
		<nd ref="1839421438" />
		<nd ref="1839421441" />
		<nd ref="1839421428" />
		<nd ref="1839421412" />
		<nd ref="1839421410" />
		<nd ref="1839421406" />
		<nd ref="1839421401" />
		<nd ref="1839421399" />
		<nd ref="1839421396" />
		<nd ref="1839421392" />
		<nd ref="1839421388" />
		<nd ref="1839421386" />
		<nd ref="1839421385" />
		<nd ref="1839421381" />
		<nd ref="1839421380" />
		<nd ref="1839421362" />
		<nd ref="1839421360" />
		<nd ref="1839421354" />
		<nd ref="1839421356" />
		<nd ref="1839421333" />
		<nd ref="1839421330" />
		<nd ref="1839421320" />
		<nd ref="1839421312"

I'm not going to do anything about these parent elements or their 'tag' children that have 'note' as their key.

##### source
Fix: change 'NHD & bing' to 'NHD & Bing'.

##### water
Minor, but 'Bayou' should be 'bayou'.
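
The 'source' and 'water' fixes are both one-to-one replacements, so a lookup table is enough. A sketch, with `VALUE_FIXES` and `clean_value` as hypothetical names rather than what data.py actually uses:

```python
# Hypothetical per-key corrections, taken from the notes above.
VALUE_FIXES = {
    "source": {"NHD & bing": "NHD & Bing"},
    "water": {"Bayou": "bayou"},
}

def clean_value(key, value):
    """Return the corrected value for a tag, or the value unchanged."""
    return VALUE_FIXES.get(key, {}).get(value, value)

print(clean_value("source", "NHD & bing"))  # NHD & Bing
print(clean_value("water", "Bayou"))        # bayou
print(clean_value("name", "Bayou"))         # Bayou (only 'water' values are lowered)
```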

### Parse, clean and compile database:

run data.py

Now that I've noted above how to fix each thing, the fixes are implemented as cleaning algorithms.

Next, the data is parsed, the cleaning algorithms are applied along the way, and the database is compiled.
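
To give a rough idea of the parsing step, here is a sketch of turning an element's 'tag' children into flat rows. It is not the contents of data.py: `shape_tags` is a hypothetical helper, the row layout (`id`, `key`, `value`, `type`) is assumed, and splitting a key such as `NHD:FTYPE` at the first colon is an assumption suggested by the later queries filtering on the key `FType`. The sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

def shape_tags(elem):
    """Turn the 'tag' children of a node/way/relation into row dicts."""
    rows = []
    for tag in elem.iter("tag"):
        raw_key = tag.attrib["k"]
        # Assumption: the part before the first colon becomes the tag type.
        if ":" in raw_key:
            tag_type, key = raw_key.split(":", 1)
        else:
            tag_type, key = "regular", raw_key
        # A cleaning step for key and value would be applied here.
        rows.append({"id": elem.attrib["id"], "key": key,
                     "value": tag.attrib["v"], "type": tag_type})
    return rows

# Illustrative element, not copied from the real file.
sample = ET.fromstring(
    '<way id="46179177">'
    '<tag k="NHD:FTYPE" v="SeaOcean" />'
    '<tag k="note" v="example" />'
    '</way>'
)
for row in shape_tags(sample):
    print(row)
```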

### Examine the data efficiently with database queries

First I build the database:

In [4]:
import sqlite3
import csv
from pprint import pprint
from supporting_files import database_schema as ds
import os

sqlite_file = 'supporting_files/exports_databases/osm_db.db'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

In [5]:
sql_schema = ds.sql_schema

In [6]:
# Create the table, specifying the schema
c.executescript(sql_schema)
# commit the changes
conn.commit()

In [7]:
# first row of CSVs MUST be fieldnames
import os
from csv import DictReader

def add_table_from_csv(filename):
    """Insert every row of a CSV into the table named after the file."""
    if not filename.endswith('.csv'):
        return

    # 'some/dir/nodes_tags.csv' -> 'nodes_tags'
    tablename = os.path.splitext(os.path.basename(filename))[0]
    with open(filename, 'r', newline='') as fin:
        dr = DictReader(fin)
        fieldnames = dr.fieldnames
        to_db = [tuple(row[k] for k in fieldnames) for row in dr]
    # One '?' placeholder per column
    placeholders = ",".join("?" * len(fieldnames))
    c.executemany("INSERT INTO {}({}) VALUES({});".format(
            tablename, ",".join(fieldnames), placeholders), to_db)
    conn.commit()
    
def add_to_db(directory):
    for f_name in os.listdir(directory):
        full_path = os.path.join(directory, f_name)
        add_table_from_csv(full_path)

add_to_db('supporting_files/exports')
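
After a bulk load like this, it can be worth checking that each table ended up with as many rows as its CSV. A sketch with two hypothetical helpers, demonstrated on a throwaway CSV and an in-memory database instead of the real files:

```python
import csv
import os
import sqlite3
import tempfile

def csv_row_count(path):
    """Number of data rows in a CSV whose first row is the header."""
    with open(path, newline="") as fin:
        return sum(1 for _ in csv.DictReader(fin))

def table_row_count(connection, tablename):
    """Number of rows in a table (the name must come from trusted code)."""
    return connection.execute(
        "SELECT COUNT(*) FROM {}".format(tablename)).fetchone()[0]

# Toy demonstration: a two-row table and a matching two-row CSV.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER, user TEXT)")
demo_conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "a"), (2, "b")])

with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, "demo.csv")
    with open(demo_csv, "w", newline="") as fout:
        fout.write("id,user\n1,a\n2,b\n")
    print(csv_row_count(demo_csv) == table_row_count(demo_conn, "demo"))  # True
```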

##### Investigate any other questions I may have

Database file size:

In [8]:
database_size = os.path.getsize(sqlite_file) / 1000000
print("The size of the database is: %s mb" %round(database_size, 2))

The size of the database is: 77.13 mb


A small helper function for rendering a database query as a pandas DataFrame:

In [9]:
import pandas as pd
pd.set_option('display.max_rows', 6)
def df_query(query):
    df = pd.read_sql(query, conn)
    return df

Top 10 contributors:

In [10]:
query = """
SELECT
    all_tables.user,
    SUM(Total) Total
FROM
    (
    SELECT
        nodes.user,
        COUNT(*) AS Total
    FROM nodes
    GROUP BY nodes.user

    UNION ALL

    SELECT
        relations.user,
        COUNT(*) AS Total
    FROM relations
    GROUP BY relations.user
    
    UNION ALL
    
    SELECT
        ways.user,
        COUNT(*) AS Total
    FROM ways
    GROUP BY ways.user
    
    ) AS all_tables

GROUP BY user
ORDER BY Total DESC
LIMIT 10
"""

df_query(query)

| | user | Total |
|---|---|---|
| 0 | Matt Toups | 25243 |
| 1 | Maarten Deen | 17221 |
| 2 | Andre68 | 4370 |
| ... | ... | ... |
| 7 | wvdp | 455 |
| 8 | eric22 | 367 |
| 9 | DKNOTT | 201 |


Total number of nodes, ways, relations:

In [11]:
# nodes
query = """
SELECT COUNT(*) AS Total FROM nodes
"""

df_query(query)

| | Total |
|---|---|
| 0 | 2 |


In [12]:
# ways
query = """
SELECT COUNT(*) AS Total FROM ways
"""

df_query(query)

| | Total |
|---|---|
| 0 | 53562 |


In [13]:
# relations
query = """
SELECT COUNT(*) AS Total FROM relations
"""

df_query(query)

| | Total |
|---|---|
| 0 | 406 |


In [14]:
# nodes, ways and relations
query = """
SELECT
    SUM(all_tables.Total) AS Total_All
FROM 
    (
        SELECT COUNT(*) AS Total FROM nodes

        UNION ALL

        SELECT COUNT(*) AS Total FROM ways

        UNION ALL

        SELECT COUNT(*) AS Total FROM relations
    ) AS all_tables
"""

df_query(query)

| | Total_All |
|---|---|
| 0 | 53970 |
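
The per-table counts and the grand total can also come back from a single query. A sketch against a toy in-memory database: the rows are made up for illustration and only an `id` column is created, so the numbers are not the real ones.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE nodes (id INTEGER);
CREATE TABLE ways (id INTEGER);
CREATE TABLE relations (id INTEGER);
INSERT INTO nodes VALUES (1), (2);
INSERT INTO ways VALUES (10), (11), (12);
INSERT INTO relations VALUES (100);
""")

# One labelled row per table instead of three separate queries.
COUNTS_QUERY = """
SELECT 'nodes' AS tbl, COUNT(*) AS Total FROM nodes
UNION ALL
SELECT 'ways', COUNT(*) FROM ways
UNION ALL
SELECT 'relations', COUNT(*) FROM relations
"""

counts = dict(demo.execute(COUNTS_QUERY).fetchall())
counts["all"] = sum(counts.values())
print(counts["all"])  # 6
```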


Unnamed features:
* Way ID 43829974 references node 555906935, which has the coordinates lat="29.2658771" lon="-89.4146036". OpenStreetMap shows those coordinates indeed to be just some random natural feature [here](http://www.openstreetmap.org/search?query=29.2658771%2C%20-89.4146036#map=18/29.26588/-89.41460).
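
A lookup like this can also be done in SQL by joining a way to its nodes. The sketch below assumes a `ways_nodes(id, node_id, position)` table and `lat`/`lon` columns on `nodes`; that schema is my assumption, since the actual one lives in `database_schema`. It runs against a toy in-memory database seeded with the single way/node pair mentioned above, and the `position` value is made up.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL);
CREATE TABLE ways_nodes (id INTEGER, node_id INTEGER, position INTEGER);
INSERT INTO nodes VALUES (555906935, 29.2658771, -89.4146036);
INSERT INTO ways_nodes VALUES (43829974, 555906935, 0);
""")

WAY_COORDS_QUERY = """
SELECT nodes.id, nodes.lat, nodes.lon
FROM ways_nodes
JOIN nodes ON nodes.id = ways_nodes.node_id
WHERE ways_nodes.id = ?
ORDER BY ways_nodes.position
"""

def way_coordinates(connection, way_id):
    """Return (node_id, lat, lon) for each node of a way, in way order."""
    return connection.execute(WAY_COORDS_QUERY, (way_id,)).fetchall()

print(way_coordinates(demo, 43829974))
```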

What about all of the unnamed features?
First, to review the named ones:

In [15]:
# List of named features
query = """

SELECT
    all_tables.value
FROM
    (
    SELECT
        nodes_tags.value
    FROM nodes_tags
    WHERE nodes_tags.key="name"
    GROUP BY nodes_tags.value
    
    UNION ALL
    
    SELECT
        ways_tags.value
    FROM ways_tags
    WHERE ways_tags.key="name"
    GROUP BY ways_tags.value
    
    UNION ALL
    
    SELECT
        relations_tags.value
    FROM relations_tags
    WHERE relations_tags.key="name"
    GROUP BY relations_tags.value
    
    ) AS all_tables

GROUP BY all_tables.value
ORDER BY all_tables.value
"""

df_query(query)

Unnamed: 0,value
0,Abita Creek
1,Abita River
2,Adema Pond
...,...
265,Yankee Pond
266,Yellow Bayou
267,Yellow Lake Bayou


Number of natural features of each type (FType), named or not:

In [16]:
# Count of node/way/relation tags with 'FType' as their key, grouped by value
query = """
SELECT
    all_tables.value,
    COUNT(*) AS Total
FROM
    (
    SELECT
        nodes_tags.value
    FROM nodes_tags
    WHERE nodes_tags.key="FType"
    
    UNION ALL
    
    SELECT
        ways_tags.value
    FROM ways_tags
    WHERE ways_tags.key="FType"
    
    UNION ALL
    
    SELECT
        relations_tags.value
    FROM relations_tags
    WHERE relations_tags.key="FType"
    
    ) AS all_tables
GROUP BY all_tables.value
ORDER BY Total DESC
"""

df_query(query)

Unnamed: 0,value,Total
0,LakePond,22278
1,SwampMarsh,16200
2,StreamRiver,12371
...,...,...
10,Gate,6
11,Pipeline,6
12,Wall,6


Note: some cleaning may be needed here, so circle back to the data.py algorithm and adjust where necessary. Then again, this may not be a cleaning issue at all, just the best way to describe the data.

Count of features, by type, that are not already named:

In [17]:
# Count of FType features whose id has no 'name' tag.
# The id column shows one arbitrary id per group (SQLite bare-column behavior).
query = """
SELECT *, COUNT(*) as Total
    FROM (
    SELECT
        all_tables.id,
        all_tables.value
    FROM
        (
        SELECT
            *
        FROM nodes_tags
        WHERE nodes_tags.key="FType"
        
        UNION ALL
        
        SELECT
            *
        FROM ways_tags
        WHERE ways_tags.key="FType"

        UNION ALL

        SELECT
            *
        FROM relations_tags
        WHERE relations_tags.key="FType"

        ) AS all_tables
    ) AS ftype_table
WHERE ftype_table.id NOT IN (
    SELECT
        all_tables.id
    FROM
        (
        SELECT
            *
        FROM nodes_tags
        WHERE nodes_tags.key="name"
        
        UNION ALL
        
        SELECT
            *
        FROM ways_tags
        WHERE ways_tags.key="name"

        UNION ALL

        SELECT
            *
        FROM relations_tags
        WHERE relations_tags.key="name"

        ) AS all_tables
)
GROUP BY ftype_table.value
ORDER BY Total DESC
"""

df_query(query)

Unnamed: 0,id,value,Total
0,7077413,LakePond,21686
1,7077411,SwampMarsh,16188
2,2313282,StreamRiver,11758
...,...,...,...
10,43275245,Pipeline,6
11,368947130,Wall,6
12,97140992,Lock Chamber,5
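
One caveat with the query above: OpenStreetMap numbers nodes, ways and relations separately, so the same id can belong to both a node and a way. Pooling all ids in the `NOT IN` list could therefore drop an unnamed way just because a node with the same id has a name. A possible variation, assuming each tags table has `id`, `key` and `value` columns, checks for a name within the same table only. `unnamed_ftype_counts` and the demo rows are hypothetical.

```python
import sqlite3

TAG_TABLES = ("nodes_tags", "ways_tags", "relations_tags")

def unnamed_ftype_counts(conn, tables=TAG_TABLES):
    """Count FType values on elements lacking a 'name' tag, checked per table."""
    parts = [
        f"""
        SELECT t.value FROM {table} AS t
        WHERE t.key = 'FType'
          AND NOT EXISTS (
              SELECT 1 FROM {table} AS n
              WHERE n.id = t.id AND n.key = 'name')
        """
        for table in tables
    ]
    query = (
        "SELECT value, COUNT(*) AS Total FROM ("
        + " UNION ALL ".join(parts)
        + ") GROUP BY value ORDER BY Total DESC, value"
    )
    return conn.execute(query).fetchall()

# Made-up rows: node 1 and way 1 share an id, but only the node is named.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT);
CREATE TABLE ways_tags (id INTEGER, key TEXT, value TEXT);
CREATE TABLE relations_tags (id INTEGER, key TEXT, value TEXT);
INSERT INTO nodes_tags VALUES
    (1, 'FType', 'LakePond'), (1, 'name', 'Example Pond'), (2, 'FType', 'Gate');
INSERT INTO ways_tags VALUES
    (1, 'FType', 'LakePond'), (3, 'FType', 'LakePond'),
    (3, 'name', 'Another Pond'), (4, 'FType', 'SwampMarsh');
""")

result = unnamed_ftype_counts(demo)
print(result)
demo.close()
```

In the demo, way 1 is still counted as an unnamed `LakePond` even though node 1 carries a name, which is the case the pooled `NOT IN` would miss.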


In [None]:
# Close the connection only after all queries have run; the cells below still use conn.
conn.close()

### Consider how the data could be improved:

Investigate (programmatically) the date the data was uploaded, then compare it to the date when the NHD switched over to the Permanent identifier.
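
A rough sketch of how that investigation might start, assuming the `ways` table kept the `timestamp` attribute from the XML as an ISO 8601 string (an assumption about the schema, not something shown above). The `upload_dates_for_key` helper, the demo rows, and the cutoff date are all placeholders; the real NHD switchover date would still need to be looked up.

```python
import sqlite3

def upload_dates_for_key(conn, key, cutoff):
    """Summarize timestamps of ways carrying tag `key`, split at `cutoff`.

    ISO 8601 strings in one consistent format sort chronologically,
    so plain string comparison is enough here.
    """
    query = """
    SELECT
        MIN(w.timestamp) AS first_edit,
        MAX(w.timestamp) AS last_edit,
        SUM(w.timestamp < :cutoff) AS before_cutoff,
        SUM(w.timestamp >= :cutoff) AS on_or_after_cutoff
    FROM ways AS w
    JOIN ways_tags AS t ON t.id = w.id
    WHERE t.key = :key
    """
    return conn.execute(query, {"key": key, "cutoff": cutoff}).fetchone()

# Made-up rows standing in for the real database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE ways (id INTEGER, timestamp TEXT);
CREATE TABLE ways_tags (id INTEGER, key TEXT, value TEXT);
INSERT INTO ways VALUES
    (1, '2012-01-15T10:00:00Z'),
    (2, '2014-06-01T08:30:00Z'),
    (3, '2015-02-02T12:00:00Z');
INSERT INTO ways_tags VALUES
    (1, 'ComID', '111'), (2, 'ComID', '222'), (3, 'name', 'x');
""")

# Placeholder cutoff, NOT the actual NHD switchover date.
summary = upload_dates_for_key(demo, "ComID", "2013-01-01T00:00:00Z")
print(summary)
demo.close()
```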

All elements whose NHD:GNIS_ID value duplicates their gnis:feature_id value:

In [18]:
query = """
SELECT
    x.id,
    x.key,
    x.value,
    y.key,
    y.value
FROM (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="GNIS_ID"

    UNION ALL
    
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="GNIS_ID"

    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="GNIS_ID"
) AS x
JOIN (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="feature_id"

    UNION ALL

    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="feature_id"

    UNION ALL

    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="feature_id"
) AS y
    ON x.id=y.id
WHERE x.value=y.value
"""

df_query(query)

Unnamed: 0,id,key,value,key.1,value.1
0,558006771,GNIS_ID,00559711,feature_id,00559711
1,3301246063,GNIS_ID,00554877,feature_id,00554877
2,22516599,GNIS_ID,00538745,feature_id,00538745
...,...,...,...,...,...
924,4626652,GNIS_ID,00532593,feature_id,00532593
925,4626653,GNIS_ID,00555004,feature_id,00555004
926,5380366,GNIS_ID,00532757,feature_id,00532757


All elements whose NHD:Permanent_ value duplicates their NHD:ComID value:

In [19]:
query = """
SELECT
    x.id,
    x.key,
    x.value,
    y.key,
    y.value
FROM (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="ComID"
) AS x
JOIN (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="Permanent_"

    UNION ALL

    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="Permanent_"

    UNION ALL

    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="Permanent_"
) AS y
    ON x.id=y.id
WHERE x.value=y.value
"""

df_query(query)

Unnamed: 0,id,key,value,key.1,value.1
0,43239879,ComID,148751000,Permanent_,148751000
1,43244825,ComID,148750992,Permanent_,148750992
2,43246802,ComID,151098269,Permanent_,151098269
...,...,...,...,...,...
1292,4493454,ComID,151099663,Permanent_,151099663
1293,4493455,ComID,151099653,Permanent_,151099653
1294,4493460,ComID,151097798,Permanent_,151097798


All elements whose NHD:way_id value duplicates their NHD:ComID value:

In [20]:
query = """
SELECT
    x.id,
    x.key,
    x.value,
    y.key,
    y.value
FROM (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="ComID"
) AS x
JOIN (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="way_id"

    UNION ALL

    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="way_id"

    UNION ALL

    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="way_id"
) AS y
    ON x.id=y.id
WHERE x.value=y.value
"""

df_query(query)

Unnamed: 0,id,key,value,key.1,value.1
0,41225956,ComID,139142734,way_id,139142734
1,41226082,ComID,139142736,way_id,139142736
2,43166285,ComID,148743907,way_id,148743907
...,...,...,...,...,...
12980,481376674,ComID,148740191,way_id,148740191
12981,482476279,ComID,148740240,way_id,148740240
12982,482476281,ComID,148740240,way_id,148740240
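
The three queries above differ only in the two tag keys, so they could be generated by one helper. A minimal sketch: `duplicate_tag_query` is a hypothetical name, and the tags tables are assumed to have `id`, `key` and `value` columns. Unlike the queries above, it joins within each tags table, so a node id is never matched against a way id. The demo rows are made up.

```python
import sqlite3

TAG_TABLES = ("nodes_tags", "ways_tags", "relations_tags")

def duplicate_tag_query(key_a, key_b):
    """Build SQL and params finding elements whose key_a and key_b tags share a value."""
    parts = [
        f"""
        SELECT '{table}' AS source, x.id, x.key, x.value, y.key, y.value
        FROM {table} AS x
        JOIN {table} AS y ON x.id = y.id
        WHERE x.key = ? AND y.key = ? AND x.value = y.value
        """
        for table in TAG_TABLES
    ]
    # Two placeholders per table, in the same order as they appear in the SQL.
    return " UNION ALL ".join(parts), (key_a, key_b) * len(TAG_TABLES)

# Made-up rows standing in for the real database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT);
CREATE TABLE ways_tags (id INTEGER, key TEXT, value TEXT);
CREATE TABLE relations_tags (id INTEGER, key TEXT, value TEXT);
INSERT INTO nodes_tags VALUES (7, 'GNIS_ID', '001'), (7, 'feature_id', '001');
INSERT INTO ways_tags VALUES
    (1, 'ComID', '111'), (1, 'way_id', '111'),
    (2, 'ComID', '222'), (2, 'way_id', '333');
""")

sql, params = duplicate_tag_query("ComID", "way_id")
rows = demo.execute(sql, params).fetchall()
print(rows)
demo.close()
```

With the real database, the same SQL and parameters could presumably be handed to pandas with `pd.read_sql(sql, conn, params=params)` to get a DataFrame like the ones above.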
