# OpenStreetMap - A Data Cleaning Case Study

# Cleaning [OpenStreetMap](http://www.openstreetmap.org/) Data

If you're wondering what OpenStreetMap is, think of it as Wikipedia, but for metadata-rich maps of the world. I would argue that it's one of the most important technologies you've probably never heard of. From the [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page):

> "Welcome to OpenStreetMap, the project that creates and distributes free geographic data for the world. We started it because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive, or unexpected ways."

If you're like me and you're refining your XML parsing/Python scripting/SQL querying skills, why not use said skills to clean some OpenStreetMap data? I'm going to be as detailed as possible below so anyone interested can follow along as an exercise. 

According to [The New York Times' 'What Could Disappear'](http://www.nytimes.com/interactive/2012/11/24/opinion/sunday/what-could-disappear.html?_r=2&) interactive article, New Orleans could be 88% below sea level in as little as 100 years. I'm personally interested to see what we're set to lose as a consequence of global warming, so I'll focus on an approximately 100 x 150 mile area [surrounding New Orleans](https://mapzen.com/data/metro-extracts/metro/new-orleans_louisiana/).

___

### Getting to know the data:
OpenStreetMap data is available as XML with the file extension '.osm'. Reading through the [documentation](https://wiki.openstreetmap.org/wiki/Main_Page), the gist of the main components of an .osm file is:
* a **'node'** element essentially represents a latitude/longitude coordinate. It may have 'tag' child elements holding other data points such as an 'address' or features such as 'school'.
* a **'way'** element has nodes as children (with their element type shortened to 'nd'). It defines things like roads, buildings, natural areas, etc. It may also have 'tag' children for the same purpose as a node.
* a **'relation'** has nested 'member' elements which reference existing ways and nodes. It defines relationships among the other elements such as an extended hiking trail made up by a number of way elements.
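
To make the three element types concrete, here is a minimal sketch that parses a tiny hand-written document. The sample XML, its ids, coordinates and tag values are all made up for illustration, and `summarize` is a hypothetical helper, not part of this project's code.

```python
import xml.etree.ElementTree as ET

# Made-up sample; ids, coordinates and tag values are illustrative only
SAMPLE_OSM = """<?xml version='1.0' encoding='UTF-8'?>
<osm>
  <node id="1" lat="29.95" lon="-90.07">
    <tag k="amenity" v="school"/>
  </node>
  <node id="2" lat="29.96" lon="-90.08"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="natural" v="water"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>"""


def summarize(root):
    """Return a one-line description for each node, way and relation."""
    lines = []
    for elem in root:
        tags = {t.attrib['k']: t.attrib['v'] for t in elem.findall('tag')}
        if elem.tag == 'node':
            detail = "at ({}, {})".format(elem.attrib['lat'], elem.attrib['lon'])
        elif elem.tag == 'way':
            # A way lists its nodes through 'nd' children
            detail = "through nodes {}".format(
                [nd.attrib['ref'] for nd in elem.findall('nd')])
        elif elem.tag == 'relation':
            # A relation references existing elements through 'member' children
            detail = "with members {}".format(
                [(m.attrib['type'], m.attrib['ref']) for m in elem.findall('member')])
        else:
            continue
        lines.append("{} {} {} tags={}".format(elem.tag, elem.attrib['id'], detail, tags))
    return lines


root = ET.fromstring(SAMPLE_OSM.encode('utf-8'))
summary = summarize(root)
for line in summary:
    print(line)
```

Each printed line shows where the interesting data lives: coordinates on the node itself, node references under a way, and member references under a relation.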

The osm file unzips to a whopping 1.28 GB and crashes both Atom and Sublime Text on my machine. [Vim](http://www.vim.org/), on the other hand, is a text editor designed to be used in a bare-bones command line interface (like the Terminal app on a Mac) and can easily handle the job, allowing me to jump around and explore freely. Some Vim commands that let me do this are:
* Jump to bottom: `shift+g`
* Jump to top: `gg`
* Page down: `Control+d`
* Page up: `Control+u`
* Jump five million lines down: type `5000000` and then hit `j`
* Jump five million lines up: type `5000000` and then hit `k`
* Search the entire massive file for anything you want: type `/` followed by your search
* Go to the next instance of what you searched for: `n`
    
*Check out [this little gem](https://vim-adventures.com/) to get started with VIM in a fun way.*

Here is what I learned from exploring the file:
* the file has the standard opening: `<?xml version='1.0' encoding='UTF-8'?>`
* the XML root element is 'osm' and is parent to all other elements
* there are 16,082,009 lines in total - wow!
* an example of a node element with no tags:

![node](supporting_files/screenshots/node.png "node")

* a node element with tags:

![node with tags](supporting_files/screenshots/node_tags.png "node with tags")

* a way element with its nested nd (node) element references and tags:

![way](supporting_files/screenshots/way.png "way")

* a relation element with its nested nodes, ways and tags:

![relation](supporting_files/screenshots/relation.png "relation")

### Reflecting upon the data - what should I clean?:
'tag' elements seem to constitute the bulk of a 'way' or 'relation' element's metadata. From what I understand about OpenStreetMap, this should also be where a lot of user-generated content is stored. I'm sure there will be cleaning to do there. One other thing catches my eye - the large number of 'tag' elements whose 'k' attribute contains the acronym 'NHD'. What is that about?

It turns out NHD is an acronym for the US Geological Survey's ['National Hydrography Dataset'](https://nhd.usgs.gov/) which maps out and documents the nation's watershed boundaries and their features. New Orleans is surrounded by a massive amount of such natural features:


![new orleans watershed](supporting_files/screenshots/new_orleans_watershet.png "new orleans watershed")


I would assume the data to be damn near perfect given it was most likely adopted directly from the NHD. The wonderful thing here is that, without much trouble, I can programmatically verify whether or not that is true despite the over 16 million lines of data to deal with. This could be another opportunity for data cleaning.

Here we go!!!

### Auditing the data:
Looking again at the sample 'way' and 'relation' elements from the 'Getting to know the data' section, it's clear that data from the NHD is prefixed by 'NHD', which is followed by a colon and what appears to be the field name for each data point (e.g. 'ComID', 'FCode' and 'FTYPE'). 'FTYPE' appears to define what kind of natural feature we have - in the case of our examples a stream/river or a swamp/marsh. I want to find all of these natural features as documented by the NHD and present in OpenStreetMap. To do this I can simply find any 'node', 'way' or 'relation' element that has a child 'tag' element with a 'k' attribute of 'NHD:FTYPE'. Finding each of these elements and collecting the 'k' attributes of all their tags into a set gives me an idea of what exists here:

In [7]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def count_elem(osm_file):
    """Collect every tag key used on elements that carry an 'NHD:FTYPE' tag."""
    tag_set = set()

    # Stream the file so the whole document is never held in memory at once
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            keys = [tag_elem.attrib['k'] for tag_elem in elem.iter("tag")]
            if 'NHD:FTYPE' in keys:
                tag_set.update(keys)
        # Drop elements that have already been processed
        root.clear()
    return tag_set

keys_list = list(count_elem(OSM_FILE))
print(keys_list)

['ele', 'landuse', 'water', 'layer', 'leisure', 'note', 'NHD:RESOLUTION', 'NHD:Resolution', 'NHD:Elevation', 'wetland', 'gnis:county_id', 'NHD:ComID', 'NHD:FDate', 'NHD:way_id', 'gnis:created', 'man_made', 'source', 'NHD:FTYPE', 'boundary', 'type', 'admin_level', 'natural', 'NHD:FCode', 'NHD:GNIS_Name', 'name', 'created_by', 'attribution', 'NHD:ReachCode', 'gnis:feature_id', 'NHD:Permanent_', 'import_uuid', 'place', 'history', 'waterway', 'NHD:FType', 'NHD:GNIS_ID', 'gnis:state_id']


Hard to tell visually - sort it:

In [8]:
sorted(keys_list)

['NHD:ComID',
 'NHD:Elevation',
 'NHD:FCode',
 'NHD:FDate',
 'NHD:FTYPE',
 'NHD:FType',
 'NHD:GNIS_ID',
 'NHD:GNIS_Name',
 'NHD:Permanent_',
 'NHD:RESOLUTION',
 'NHD:ReachCode',
 'NHD:Resolution',
 'NHD:way_id',
 'admin_level',
 'attribution',
 'boundary',
 'created_by',
 'ele',
 'gnis:county_id',
 'gnis:created',
 'gnis:feature_id',
 'gnis:state_id',
 'history',
 'import_uuid',
 'landuse',
 'layer',
 'leisure',
 'man_made',
 'name',
 'natural',
 'note',
 'place',
 'source',
 'type',
 'water',
 'waterway',
 'wetland']

Some of the k values like 'golf' seem pretty arbitrary, but they may have a purpose. Others, like NHD:FCode, seem redundant with FCode. Maybe the US Geological Survey changed their naming convention at one point. There may be something to clean here... I also see a 'fixme' key which may be something. Worth noting for now, but what will really help is to create a dictionary of the features and their attribute contents.
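
Keys that differ only by capitalization are easy to miss in a long sorted list. A minimal sketch for surfacing them, using a subset of the keys printed above; `case_variants` is a hypothetical helper, not something from this project:

```python
from collections import defaultdict

# Subset of the keys printed above
keys = ['NHD:ComID', 'NHD:FCode', 'NHD:FTYPE', 'NHD:FType',
        'NHD:RESOLUTION', 'NHD:Resolution', 'name', 'natural']


def case_variants(keys):
    """Group keys that are identical once case is ignored."""
    groups = defaultdict(set)
    for key in keys:
        groups[key.lower()].add(key)
    # Keep only the groups that contain more than one spelling
    return {lowered: sorted(variants)
            for lowered, variants in groups.items() if len(variants) > 1}


variants = case_variants(keys)
print(variants)
```

On this subset the sketch flags the 'NHD:FTYPE'/'NHD:FType' and 'NHD:RESOLUTION'/'NHD:Resolution' pairs.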

Some notes:
* FCODE is used just once, with '46600' as its contents. Jumping down to NHD:FCode I see more contents and, again, '46600'. I'm betting FCODE and NHD:FCode are the same. Looking at the NHD poster I find that 46600 is for "Swamp/Marsh". The different keys here are definitely trying to achieve the same thing, so they can be merged. Also, I'm not worried about a key having multiple FCodes - for example, 33600;46600 is a Canal/Ditch, but it's also a Swamp/Marsh.
* The next redundant tag key is FDATE. 'FDATE' has 'Mon May 23 00:00:00 CEST 2011' as its contents, which looks like a timestamp. I could programmatically check whether other elements with the FDATE tag have a different timestamp, but there's a faster way, since all I need is to see a few others to verify it's a timestamp: in Vim, `/FDATE`. Sure enough, it's found on line 13,356,160, and `n` finds no other instances! What about NHD:FDate? Yes, there are all kinds of others, but their dates are simpler - for example, "2005/12/05". Fair enough to say that the rogue element's date can be changed to 2011/05/23.
* Now thinking, there only seems to be a single rogue element. Inspect it to make sure it's not wildly different. Nothing stands out. Searching for its user, Aleks-Berlin, shows he also seems to have edited a few others, including some tags called "FIXME" which seem a little haphazard. I'm wondering who this person is and whether they are entering data haphazardly. Check the element on the map: search for one of its nodes, then enter the lat/lon coordinates on the map: 29.3557244, -90.0538116. It looks like a valid natural feature, so the way element should stay in place.
* Next, FTYPE. 466 corresponds to Feature Type from pdf. 
* 'natural' all good
* 'PERMANENT_' is redundant. It has a strange value: {A0AFF249-A7D2-44F4-AD8A-0A4A68F99450}
    * Search for NHD:Permanent_. They all have much simpler values like "151098380". What is it? In the PDF there is a 'Permanent_Identifier' feature that is a 40-character string, but all the ones I find are 9 characters. There's not enough info here and I'm not sure what this is, so I'm leaving this one alone. Can this element be flagged somehow?
* RESOLUTION is '2'. From the NHD documentation, this should be set to "High" (Code of source resolution: 1=Local resolution, 2=High resolution, 3=Medium Resolution.)
* SHAPE_AREA and SHAPE_LENG exist in the PDF guide as 'Shape_Area' and 'Shape_Length' but have no guidelines and are not used in any other elements in the XML file. Again, I'm wondering if it's best to just remove this element!
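
Two of the fixes in these notes are mechanical enough to sketch: rewriting the verbose FDATE timestamp into the simpler `YYYY/MM/DD` form used by the other elements, and translating the numeric resolution code. The function names below are hypothetical, and the date parser assumes the value always has the six space-separated parts seen in the one rogue element.

```python
from datetime import datetime

# Codes quoted in the notes above (1=Local, 2=High, 3=Medium)
RESOLUTION_CODES = {'1': 'Local', '2': 'High', '3': 'Medium'}


def normalize_fdate(value):
    """Convert 'Mon May 23 00:00:00 CEST 2011' to '2011/05/23'.

    Values that do not match the verbose form are returned unchanged.
    """
    parts = value.split()
    if len(parts) != 6:
        return value
    _, month, day, _, _, year = parts
    try:
        # The time zone token is skipped on purpose: only the date is kept
        parsed = datetime.strptime("{} {} {}".format(month, day, year), "%b %d %Y")
    except ValueError:
        return value
    return parsed.strftime("%Y/%m/%d")


def normalize_resolution(value):
    """Translate a numeric resolution code to its label; pass other values through."""
    return RESOLUTION_CODES.get(value, value)


print(normalize_fdate('Mon May 23 00:00:00 CEST 2011'))
print(normalize_fdate('2005/12/05'))
print(normalize_resolution('2'))
```

Returning unrecognized values unchanged keeps the sketch safe to run over every tag, since only the one odd format is rewritten.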

    
Now onto auditing some of the other keys. For this it's best to build the dictionary and see what kind of information they hold:

In [None]:
import xml.etree.ElementTree as ET
import collections as col
import pprint

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def make_elem_dict(osm_file):
    """Map each tag key to the set of values seen on elements with a 'natural' tag."""
    elem_dict = col.defaultdict(set)

    # Stream the file so the whole document is never held in memory at once
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    # TODO: pull the element filter out into its own function
    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            tags = list(elem.iter("tag"))
            # Only record tags from elements that carry a 'natural' tag
            if any(tag.attrib['k'] == 'natural' for tag in tags):
                for tag in tags:
                    elem_dict[tag.attrib['k']].add(tag.attrib['v'])
        # Drop elements that have already been processed
        root.clear()
    return elem_dict

pprint.pprint(make_elem_dict(OSM_FILE))

Some notes on checking the other fields starting at the end:
* 'wikipedia' - not going to touch; it connects an element to a Wikipedia article
* 'wikidata' - connects to Wikimedia Commons. Googling 'wikidata' and one of the values from the dict gives a link to an [article](https://commons.wikimedia.org/wiki/File:Bayou_St_John_by_Spanish_Fort_2009.jpg) - not touching this
* Scanning through the others doesn't seem to throw any flags (trying to keep the scope of the investigation targeted). However:
* will fix capitalization in 'water' tag
* in 'source' - change any 'bing' to 'Bing', 'landsat' to 'Landsat'
* Look at all 'note' tags. There are many non-'natural' elements with notes as well, so I'll have to write a script to pull all 'natural' elements that also have a 'note' tag element:

In [None]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def find_note_tag(osm_file):
    """Dump every element that has both a 'natural' tag and a 'note' tag."""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            keys = {tag_elem.attrib['k'] for tag_elem in elem.iter("tag")}
            if 'natural' in keys and 'note' in keys:
                # ET.dump writes the element to stdout itself and returns None
                ET.dump(elem)
        # Drop elements that have already been processed
        root.clear()

find_note_tag(OSM_FILE)

What I'm learning: (add PDF descriptions for each field when introducing them)
* The comment "I have altered the natural:coatline tag as this way duplicates existing coastline ways" tells me that I shouldn't mess with either of the 'coastline' or '_coastline' keys as it's purposeful. The other notes don't indicate any other action needed.

Back to dict:
* 'name' - fix capitalization on lower-case words. There are enough that I can see they all need it. 'xxx' seems like an error, but I'm not going to change it...
    * discard 'yes' in node 4506654389
* 'fixme' - not touching. Seems like a way for people to know what needs to be changed due to construction, etc. ("Needs survey", "tempoary way, whilst coastline is sorted out", etc.) Still, slightly problematic as it just states "temporary fix" for some. Not doing anything with this.
* NHD:ReachCode - "Unique identifier composed of two parts, first eight digits = subbasin code as defined by FIPS 103, and next six digits = random-assigned sequential number unique within a Cataloguing Unit."
    * issues that I see but can't fix:
        * It is supposed to be a unique identifier, but some elements have more than one. Manually inspecting one such element doesn't reveal why, but I noticed it also has a duplicate NHD:ComID. (WayID: 43393226)
* NHD:FDate has one or more instances with two dates. Nothing I can do though.
* NHD:FCode - OK to have more than one.
* NHD:ComID has more than one id for some, but more importantly the PDF indicates "ComID field deleted from all feature classes/tables" in the Model Changes section. Should this data still be here? This is the most updated model documentation from August 2016. Waiting on email sent to NHD...
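
The ReachCode definition quoted above (eight digits of subbasin code followed by six digits of sequence number) can be checked mechanically. A minimal sketch, assuming multiple values are joined with ';' the way the multi-valued FCode is; the sample values are made up and `split_reach_codes` is a hypothetical helper:

```python
import re

# Eight digits of subbasin code followed by six digits of sequence number
REACH_CODE_RE = re.compile(r'^(\d{8})(\d{6})$')


def split_reach_codes(value):
    """Split a possibly multi-valued ReachCode into (subbasin, sequence) pairs.

    Any part that is not exactly 14 digits is reported as None.
    """
    result = []
    for part in value.split(';'):
        match = REACH_CODE_RE.match(part.strip())
        result.append(match.groups() if match else None)
    return result


# Made-up 14-digit values, not taken from the data set
print(split_reach_codes('08090203001234'))
print(split_reach_codes('08090203001234;08090203005678'))
print(split_reach_codes('12345'))
```

A result with more than one pair points at an element carrying several ReachCodes, and a `None` entry points at a malformed one.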

### Build basic parser/CSV compiler:

See data.py in folder
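
For readers without the folder at hand, here is a rough sketch of what such a parser/CSV compiler does for a single node. It is not the contents of data.py: the column names are inferred from the query output later in this write-up, and the choice to split a key like 'NHD:FType' into a type ('NHD') and a key ('FType') is an assumption based on the later queries filtering on `key="FType"`. `shape_node` and the sample element are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layouts, inferred from the query results shown later on
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
TAG_FIELDS = ['id', 'key', 'value', 'type']


def shape_node(elem, default_type='regular'):
    """Turn one 'node' element into a node row plus one row per child tag."""
    node_row = {field: elem.attrib[field] for field in NODE_FIELDS}
    tag_rows = []
    for tag in elem.iter('tag'):
        key = tag.attrib['k']
        if ':' in key:
            # Assumption: the prefix before the first colon becomes the tag type
            tag_type, key = key.split(':', 1)
        else:
            tag_type = default_type
        tag_rows.append({'id': elem.attrib['id'], 'key': key,
                         'value': tag.attrib['v'], 'type': tag_type})
    return node_row, tag_rows


# Made-up node for illustration
sample = ET.fromstring(
    '<node id="1" lat="29.95" lon="-90.07" user="someone" uid="7" '
    'version="1" changeset="42" timestamp="2011-07-26T21:21:19Z">'
    '<tag k="natural" v="water"/><tag k="NHD:FType" v="LakePond"/></node>')

node_row, tag_rows = shape_node(sample)

# Write the tag rows as CSV with a header row, here into memory instead of a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=TAG_FIELDS)
writer.writeheader()
writer.writerows(tag_rows)
print(buffer.getvalue())
```

Writing the header row matters because the import step further down expects the first row of each CSV to hold the field names.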

### Confirm basic parser/CSV compiler working properly:

In [None]:
# open each CSV and check manually - VIM search in large OSM and check
    # √nodes_tags.csv
    # √nodes.csv
    # √relation_members.csv
    # √relations_tags.csv
    # √relations.csv
    # √ways_nodes.csv
    # √ways_tags.csv
    # √ways.csv

### Write cleaning scripts - insert into data.py:

In [None]:
#√ Fix wayID 321535489
    # No need to fix - it's inefficient to parse every line to clean a single element.
        # better approach to do manually with access to the actual database
#√ All NHD:FTYPE should be NHD:FType to conform to NHD data model - same with RESOLUTION, which should be Resolution
#√ Fix capitalization in tag elems with 'water' key
#√ Change 'bing' to 'Bing' and 'landsat' to 'Landsat' in tag elems with 'source' key
    # maybe regex compare based on capitalization - then fix to first letter capitalized?
#√ Fix tags with 'name' elem (capitalization of first letter of each word)
#not doing: discard node tag elem 4506654389
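
A minimal sketch of how the checked items above could look as code. These helpers are hypothetical and not copied from data.py. The 'water' fix is left out because the target capitalization isn't spelled out in these notes.

```python
import re

# Hypothetical helpers; the names are not taken from data.py
KEY_RENAMES = {'NHD:FTYPE': 'NHD:FType', 'NHD:RESOLUTION': 'NHD:Resolution'}
SOURCE_FIXES = {'bing': 'Bing', 'landsat': 'Landsat'}
SOURCE_RE = re.compile(r'\b(bing|landsat)\b')


def capitalize_words(text):
    """Upper-case the first letter of each word, leaving the rest untouched."""
    return ' '.join(word[:1].upper() + word[1:] for word in text.split(' '))


def clean_tag(key, value):
    """Return the cleaned (key, value) pair for a single tag."""
    key = KEY_RENAMES.get(key, key)
    if key == 'source':
        # Only whole lower-case words are replaced, so 'Bing' stays as it is
        value = SOURCE_RE.sub(lambda m: SOURCE_FIXES[m.group(1)], value)
    elif key == 'name':
        value = capitalize_words(value)
    return key, value


print(clean_tag('NHD:FTYPE', 'SwampMarsh'))
print(clean_tag('source', 'landsat; bing'))
print(clean_tag('name', 'bayou st. john'))
```

Keeping each fix as a small lookup or function makes it easy to call from the element-shaping step, one tag at a time, rather than re-parsing the file per fix.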

### Update schema - implement schema validation:

In [None]:
#√ decipher schema file and validation functionality
    #√ validator class object is created
    #√ dict and validator class object fed into validate_element()
        # validates and throws error (stops execution) if validation not True
        # else, execution continues and all items written to the csv
#√ update schema doc and run/verify
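
The flow in these notes (create a validator, feed it a dict, stop on failure) can be sketched with a tiny hand-rolled validator. `SimpleValidator` and `NODE_SCHEMA` below are stand-ins written for illustration; they are not the validator class or schema file this project actually uses, and only the `validate_element` name comes from the notes.

```python
class SimpleValidator:
    """Tiny stand-in validator: checks required fields and their types."""

    def __init__(self):
        self.errors = {}

    def validate(self, row, schema):
        """Return True when every schema field is present with the right type."""
        self.errors = {}
        for field, expected_type in schema.items():
            if field not in row:
                self.errors[field] = 'missing'
            elif not isinstance(row[field], expected_type):
                self.errors[field] = 'expected {}'.format(expected_type.__name__)
        return not self.errors


def validate_element(row, validator, schema):
    """Raise if the row does not match the schema, otherwise return quietly."""
    if not validator.validate(row, schema):
        raise ValueError('Row failed validation: {}'.format(validator.errors))


# Hypothetical, shortened schema for a node row
NODE_SCHEMA = {'id': int, 'lat': float, 'lon': float, 'user': str}

validator = SimpleValidator()

# A valid row passes silently, so execution would continue to the CSV writing
validate_element({'id': 1, 'lat': 29.95, 'lon': -90.07, 'user': 'someone'},
                 validator, NODE_SCHEMA)

# An invalid row stops execution with an error describing the problem
try:
    validate_element({'id': 'not-a-number', 'lat': 29.95, 'lon': -90.07},
                     validator, NODE_SCHEMA)
except ValueError as err:
    print(err)
```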

### Perform statistical analysis using database queries

##### Port CSV files into database:

In [1]:
import sqlite3
import csv
from pprint import pprint
from supporting_files import data_wrangling_schema as dws

sqlite_file = 'supporting_files/exports_databases/osm_db.db'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

In [2]:
sql_schema = dws.sql_schema

In [3]:
# Create the table, specifying the schema
c.executescript(sql_schema)
# commit the changes
conn.commit()

In [4]:
# first row of CSVs MUST be fieldnames
import os
from csv import DictReader

def add_table_from_csv(filename):
    if not filename.endswith('.csv'):
        return
    
    tablename = filename.split(".")[0].split('/')[-1]
    with open(filename, 'r') as fin:
        dr = DictReader(fin)
        fieldnames = dr.fieldnames
        to_db = [tuple( [i[k] for k in dr.fieldnames] ) for i in dr]
    c.executemany("INSERT INTO {}({}) VALUES({}?);".format(
            tablename, ",".join(dr.fieldnames), "?," * (len(dr.fieldnames)-1)), to_db)
    conn.commit()
    
def add_to_db(directory):
    for f_name in os.listdir(directory):
        full_path = os.path.join(directory, f_name)
        add_table_from_csv(full_path)

add_to_db('supporting_files/exports')

In [5]:
c.execute('SELECT * FROM nodes')
all_rows = c.fetchall()
print('1):')
pprint(all_rows)

1):
[(358366452,
  30.3979277,
  -90.5765609,
  'wvdp',
  436419,
  2,
  21098612,
  '2014-03-14T12:41:10Z'),
 (358372166,
  29.6627131,
  -90.2675782,
  'iandees',
  4732,
  1,
  777367,
  '2009-03-10T12:58:40Z'),
 (358377272,
  30.3826938,
  -90.5659234,
  'iandees',
  4732,
  1,
  777367,
  '2009-03-10T13:07:02Z'),
 (358394403,
  29.9274242,
  -89.3108775,
  'iandees',
  4732,
  1,
  777367,
  '2009-03-10T13:39:32Z'),
 (358394528,
  29.8885374,
  -89.5342203,
  'iandees',
  4732,
  1,
  777367,
  '2009-03-10T13:39:54Z'),
 (358394976,
  30.1854738,
  -89.7522833,
  'iandees',
  4732,
  1,
  777367,
  '2009-03-10T13:41:13Z'),
 (358395580,
  29.8774267,
  -89.5945001,
  'iandees',
  4732,
  1,
  777367,
  '2009-03-10T13:42:33Z'),
 (368385864,
  30.3485276,
  -89.5447775,
  '1248',
  170672,
  2,
  3916547,
  '2010-02-19T16:02:44Z'),
 (368385866,
  30.417971,
  -89.5870026,
  '1248',
  170672,
  2,
  3916547,
  '2010-02-19T16:02:44Z'),
 (368385896,
  30.280752,
  -89.4031042,
  '1248',
  ...
 (369012699,
  29.2663311,
  -89.1786602,
  'amillar',
  28145,
  1,
  95394,
  '2009-04-02T20:00:04Z'),
 (369012700,
  29.3993884,
  -90.0647961,
  'amillar',
  28145,
  2,
  150679,
  '2009-04-03T22:04:44Z'),
 (369012704,
  29.5052152,
  -89.620617,
  'amillar',
  28145,
  1,
  95394,
  '2009-04-02T20:00:06Z'),
 (369012705,
  29.2174452,
  -89.3294972,
  'amillar',
  28145,
  1,
  95394,
  '2009-04-02T20:00:06Z'),
 (369012707,
  29.3377239,
  -90.443973,
  'amillar',
  28145,
  2,
  150679,
  '2009-04-03T22:04:49Z'),
 (369012708,
  29.4146617,
  -89.5370029,
  'amillar',
  28145,
  1,
  95394,
  '2009-04-02T20:00:08Z'),
 (369012712,
  29.1796679,
  -89.1969939,
  'amillar',
  28145,
  1,
  95394,
  '2009-04-02T20:00:09Z'),
 (369012713,
  29.28967,
  -90.4373063,
  'amillar',
  28145,
  2,
  150679,
  '2009-04-03T22:04:58Z'),
 (369012716,
  29.1874503,
  -90.2703573,
  'amillar',
  28145,
  2,
  150679,
  '2009-04-03T22:05:04Z'),
 (369012717,
  29.2596647,
  -89.1511594,
  'amillar',
  ...
  29.9460956,
  -90.0778844,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:19Z'),
 (1375274218,
  29.9461199,
  -90.0779494,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:19Z'),
 (1375274220,
  29.9461272,
  -90.0778131,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:19Z'),
 (1375274224,
  29.946156,
  -90.0779096,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:19Z'),
 (1375274225,
  29.9461651,
  -90.0774177,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:20Z'),
 (1375274227,
  29.9461651,
  -90.0780243,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:20Z'),
 (1375274229,
  29.9462102,
  -90.0779214,
  'Matt Toups',
  41187,
  1,
  8840207,
  '2011-07-26T21:21:20Z'),
 (3301246063,
  29.2535917,
  -90.1904187,
  'wvdp',
  436419,
  1,
  28292689,
  '2015-01-20T20:01:31Z'),
 (3378142470,
  30.2932822,
  -89.8259963,
  'Scott Lincoln',
  101788,
  2,
  29188449,
  '2015-03-01T23:27:16Z'),
 (4064549580,
  ...

In [None]:
conn.close()

##### Investigate any other questions I may have

File size:

In [8]:
import sqlite3
import csv
from pprint import pprint
from supporting_files import data_wrangling_schema as dws

sqlite_file = 'supporting_files/exports/osm_db.db'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

In [9]:
# first get database file size
import os
database_size = os.path.getsize(sqlite_file) / 1000000
print("The size of the database is: %s mb" %round(database_size, 2))

The size of the database is: 69.51 mb


In [10]:
import pandas as pd
# pd.set_option('display.max_rows', 6)
# my own printing function
def df_query(query):
    df = pd.read_sql(query, conn)
    return df

Top 10 contributors:

In [11]:
query = """
SELECT
    all_tables.user,
    SUM(Total) Total
FROM
    (
    SELECT
        nodes.user,
        COUNT(*) AS Total
    FROM nodes
    GROUP BY nodes.user

    UNION ALL

    SELECT
        relations.user,
        COUNT(*) AS Total
    FROM relations
    GROUP BY relations.user
    
    UNION ALL
    
    SELECT
        ways.user,
        COUNT(*) AS Total
    FROM ways
    GROUP BY ways.user
    ORDER BY Total DESC   
    
    ) AS all_tables

GROUP BY user
ORDER BY Total DESC
LIMIT 10
"""

df_query(query)

| | user | Total |
|---|---|---|
| 0 | Maarten Deen | 17203 |
| 1 | Matt Toups | 16735 |
| 2 | ELadner | 4589 |
| 3 | Andre68 | 3134 |
| 4 | wvdp | 2010 |
| 5 | OSMF Redaction Account | 1752 |
| 6 | woodpeck_repair | 994 |
| 7 | dmgroom_ct | 465 |
| 8 | Hartmut Holzgraefe | 345 |
| 9 | amillar | 296 |


Total number of nodes, ways, relations:

In [12]:
# nodes
query = """
SELECT COUNT(*) AS Total FROM nodes
"""

df_query(query)

| | Total |
|---|---|
| 0 | 521 |


In [13]:
# ways
query = """
SELECT COUNT(*) AS Total FROM ways
"""

df_query(query)

| | Total |
|---|---|
| 0 | 45280 |


In [14]:
# relations
query = """
SELECT COUNT(*) AS Total FROM relations
"""

df_query(query)

| | Total |
|---|---|
| 0 | 3803 |


In [15]:
# nodes, ways and relations
query = """
SELECT
    SUM(all_tables.Total) AS Total_All
FROM 
    (
        SELECT COUNT(*) AS Total FROM nodes

        UNION ALL

        SELECT COUNT(*) AS Total FROM ways

        UNION ALL

        SELECT COUNT(*) AS Total FROM relations
    ) AS all_tables
"""

df_query(query)

| | Total_All |
|---|---|
| 0 | 49604 |


Additional analysis: all the unnamed features. For example:
* Way ID 43829974 connects to one of its nodes, 555906935, which has the coordinates lat="29.2658771" lon="-89.4146036". OpenStreetMap shows those coordinates indeed to be just some random natural feature [here](http://www.openstreetmap.org/search?query=29.2658771%2C%20-89.4146036#map=18/29.26588/-89.41460).

What about all of the unnamed features?
First, to review the named ones:

In [16]:
query = """

SELECT
    all_tables.value
FROM
    (
    SELECT
        nodes_tags.value
    FROM nodes_tags
    WHERE nodes_tags.key="name"
    GROUP BY nodes_tags.value
    
    UNION ALL
    
    SELECT
        ways_tags.value
    FROM ways_tags
    WHERE ways_tags.key="name"
    GROUP BY ways_tags.value
    
    UNION ALL
    
    SELECT
        relations_tags.value
    FROM relations_tags
    WHERE relations_tags.key="name"
    GROUP BY relations_tags.value
    ORDER BY relations_tags.value
    
    ) AS all_tables

GROUP BY all_tables.value
ORDER BY all_tables.value
"""

df_query(query)

| | value |
|---|---|
| 0 | Abita Springs Park Splash Pad |
| 1 | Adema Pond |
| 2 | Alberts Pond |
| 3 | Alexis Bay |
| 4 | Allen Bay |
| 5 | Alligator Bend |
| 6 | American Bay |
| 7 | Anderson Bay |
| 8 | Andres Pond |
| 9 | Ashton Plantation Pond 1 |


Now, how to quantify the unnamed features? How many unnamed lakes/ponds, etc. are there? And do way elements rely on a tag or a relation to carry their name?

In [None]:
# NOTES
    # db contains only elems that had a 'natural' tag
    # but note, elems such as 'way' 49613025 don't have a 'natural' tag elem.
        # so goal here is to fix NHD data - not comprehensively ALL the data!
    # what if a way is used as a name in a relation?
        # 

# REQS
    # Filter to elements that are described by a relevant NHD:FType but don't have a 'name' tag element describing them (unnamed elems)
# RULES
    # Must be a way or relation elem id that is represented by a set from ways_tags or relations_tags tables
    # Set must NOT have a 'name' key in it
    # Set MUST have an 'FTYPE' key in it
# ALGO
    # EXPERIMENT: What computational logic does sql use (that I know of so far)?
        # join tables
        # filter
        # sort
    # Create list of all way/relation tag elem ids with name in them
    # Find each FType elem id that is NOT in the previous list
    # Branch off/aggregate accordingly
# DECOMP
    # Create list of all way/relation tag elems with 'name' as their key
    # Create a list of all way/relation tag elems with 'FType' as their element
    # Simplify the lists
    # Filter FType list using name list
    # Pull other insights
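
The decomposition above can be tried out in plain Python before writing the SQL. A minimal sketch on made-up rows shaped like the tags tables (element id, key, value); none of these values come from the database:

```python
from collections import Counter

# (element id, key, value) rows; made-up values mirroring the tags tables
tag_rows = [
    (1, 'FType', 'LakePond'),
    (1, 'name', 'Round Lake'),
    (2, 'FType', 'SwampMarsh'),
    (3, 'FType', 'LakePond'),
    (4, 'name', 'Gum Swamp'),
]

# Step 1: ids of every element that has a 'name' tag
named_ids = {elem_id for elem_id, key, _ in tag_rows if key == 'name'}

# Step 2: feature type for every element that has an 'FType' tag
ftype_by_id = {elem_id: value for elem_id, key, value in tag_rows if key == 'FType'}

# Step 3: keep the FType elements whose id is not in the named list
unnamed = {elem_id: ftype for elem_id, ftype in ftype_by_id.items()
           if elem_id not in named_ids}

# Step 4: aggregate the unnamed elements by feature type
unnamed_counts = Counter(unnamed.values())
print(unnamed)
print(unnamed_counts)
```

Step 3 is the part that maps onto a `NOT IN` subquery in SQL, and step 4 onto a `GROUP BY` on the feature type.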

In [17]:
# All way/relation ids with 'name' as their key
query = """

SELECT
    all_tables.id,
    all_tables.value
FROM
    (
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="name"
    
    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="name"
    
    ) AS all_tables
"""

df_query(query)

| | all_tables.id | all_tables.value |
|---|---|---|
| 0 | 22500204 | Halliburton Slip |
| 1 | 22516599 | Round Lake |
| 2 | 22517455 | Round Lake |
| 3 | 43232567 | Gum Swamp |
| 4 | 43232570 | Gum Swamp |
| 5 | 43240505 | Fritchie Marsh |
| 6 | 43240520 | Fritchie Marsh |
| 7 | 43240525 | Fritchie Marsh |
| 8 | 43240970 | Boggy Bay |
| 9 | 43241104 | Savanna Lake |


In [23]:
# All way/relation ids with 'FType' as their key
query = """
SELECT
    all_tables.id,
    COUNT(*) AS Total,
    all_tables.value
FROM
    (
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="FType"
    
    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="FType"
    
    ) AS all_tables
GROUP BY all_tables.value
"""

df_query(query)

| | id | Total | value |
|---|---|---|---|
| 0 | 321535489 | 1 | 466 |
| 1 | 2082789 | 3 | CanalDitch |
| 2 | 191631476 | 2 | CanalDitch;SwampMarsh |
| 3 | 7077413 | 21895 | LakePond |
| 4 | 475760275 | 480 | SeaOcean |
| 5 | 4496553 | 25 | StreamRiver |
| 6 | 7077411 | 15234 | SwampMarsh |
| 7 | 190765058 | 19 | SwampMarsh;LakePond |


Note the cleaning that needs to be done here - circle back to the data.py algo and adjust where necessary... or, this is not cleaning - just the best way to describe the data...

Filter out any way or relation that has a 'name' tag, leaving only the unnamed features:

In [20]:
query = """
SELECT *, COUNT(*) as Total
    FROM (
    SELECT
        all_tables.id,
        all_tables.value
    FROM
        (
        SELECT
            *
        FROM ways_tags
        WHERE ways_tags.key="FType"

        UNION ALL

        SELECT
            *
        FROM relations_tags
        WHERE relations_tags.key="FType"

        ) AS all_tables
    ) AS ftype_table
WHERE ftype_table.id NOT IN (
    SELECT
        all_tables.id
    FROM
        (
        SELECT
            *
        FROM ways_tags
        WHERE ways_tags.key="name"

        UNION ALL

        SELECT
            *
        FROM relations_tags
        WHERE relations_tags.key="name"

        ) AS all_tables
)
GROUP By ftype_table.value
"""

df_query(query)

| | id | value | Total |
|---|---|---|---|
| 0 | 321535489 | 466 | 1 |
| 1 | 2082789 | CanalDitch | 3 |
| 2 | 191631476 | CanalDitch;SwampMarsh | 2 |
| 3 | 7077413 | LakePond | 21392 |
| 4 | 475760275 | SeaOcean | 480 |
| 5 | 1972439 | StreamRiver | 21 |
| 6 | 7077411 | SwampMarsh | 15222 |
| 7 | 190765058 | SwampMarsh;LakePond | 19 |


### Report on how data could be improved

Investigate (programmatically) the date the data was uploaded, then compare to the date when the NHD switched over to the Permanent identifier
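
A minimal sketch of the first half of that investigation: counting elements per upload year from the ISO timestamps. It runs against an in-memory table with a few illustrative rows, and it assumes the `ways` table stores a `timestamp` column in the same format as the `nodes` rows shown earlier. The date the NHD switched to the permanent identifier isn't stated in this write-up, so the comparison step is left out.

```python
import sqlite3

# In-memory stand-in for the real database; rows are illustrative only
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE ways (id INTEGER PRIMARY KEY, timestamp TEXT)')
conn_demo.executemany('INSERT INTO ways VALUES (?, ?)', [
    (1, '2009-03-10T12:58:40Z'),
    (2, '2009-04-02T20:00:04Z'),
    (3, '2011-07-26T21:21:19Z'),
])

# The first four characters of an ISO timestamp are the year
rows = conn_demo.execute("""
    SELECT substr(timestamp, 1, 4) AS year, COUNT(*) AS total
    FROM ways
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)
conn_demo.close()
```

Once the switch-over date is known, the same query could be narrowed with a `WHERE timestamp < ...` clause, since ISO timestamps sort correctly as text.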