# [OpenStreetMap](http://www.openstreetmap.org/) - A Data Cleaning Case Study

If you're wondering what OpenStreetMap is, think of it as Wikipedia for free, editable maps of the world. From the [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Main_Page):

> "Welcome to OpenStreetMap, the project that creates and distributes free geographic data for the world. We started it because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive, or unexpected ways."

This case study into data cleaning should give any reader a sufficient introduction to understanding the OpenStreetMap data format, basic XML parsing, Python scripting to audit and clean the data, and SQL database queries to learn things from it.

### Focus of the case study:
According to [The New York Times' 'What Could Disappear'](http://www.nytimes.com/interactive/2012/11/24/opinion/sunday/what-could-disappear.html?_r=2&) interactive article, New Orleans could be 88% below sea level in as little as 100 years. I'm personally interested to see what we're set to lose as a consequence of global warming, so I'll focus on an approximately 100x150 mile area surrounding New Orleans.

Download the data from [Mapzen Extracts](https://mapzen.com/data/metro-extracts/metro/new-orleans_louisiana/).

Download a small sample osm file of the same area I created [HERE](https://github.com/rancherobeans/udacity/raw/master/P3/PROJECT/supporting_files/samples/new-orleans_samplek10) (made with [this script](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/make_osm.py)).

![map area](supporting_files/screenshots/mapzen_extract_map.png "map area")

### Getting to know the data:
OpenStreetMap data is available in the XML data format bearing the file type '.osm'. Reading through the [documentation](https://wiki.openstreetmap.org/wiki/Main_Page), I find the gist of osm data to be:
* A **'node'** element essentially represents a latitude/longitude coordinate. It may have 'tag' child elements with other data points indicating things like an 'address' or a 'school'.
* A **'way'** element has nodes as children (with their element type shortened to 'nd'). It defines things like roads, buildings, natural areas, etc. It also may have 'tag' children for the same reasons as a node.
* A **'relation'** has nested 'member' elements which reference existing ways and nodes. It defines logical or geographic relationships between other elements such as an extended hiking trail made up by a number of way elements.
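
To make the three element types concrete, here is a minimal sketch that parses a tiny, hand-written OSM fragment with Python's ElementTree. The fragment, its IDs and the `summarize` helper are hypothetical and only illustrate the structure described above; they are not taken from the New Orleans extract.

```python
import xml.etree.ElementTree as ET

# Hypothetical OSM fragment; all IDs, coordinates and tags are made up.
SAMPLE_OSM = """<osm>
  <node id="1" lat="29.95" lon="-90.07"/>
  <node id="2" lat="29.96" lon="-90.08">
    <tag k="amenity" v="school"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="waterway" v="canal"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>"""


def summarize(xml_text):
    """Return (element type, id, child element names) for each top-level element."""
    root = ET.fromstring(xml_text)
    summary = []
    for elem in root:
        children = [child.tag for child in elem]
        summary.append((elem.tag, elem.get("id"), children))
    return summary


for row in summarize(SAMPLE_OSM):
    print(row)
```

Each printed row shows how a node carries only optional tags, a way carries `nd` references plus tags, and a relation carries `member` references plus tags.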

The osm file unzips to a whopping 1.28gb and crashes both Atom and SublimeText on my machine. [VIM](http://www.vim.org/), on the other hand, is a wonderfully simple text editor designed to be used in a bare-bones command line interface (like the Terminal app on a mac). It can easily handle the job. I opened the file with VIM and used a few basic VIM commands to jump around and explore the osm file freely:
* Jump to bottom: `shift+g`
* Jump to top: `gg`
* page down: `Control+d`
* page up: `Control+u`
* Jump five million lines down (or any number of lines you want): type `5000000` and then hit `j`
* Jump five million lines up: type `5000000` and then hit `k`
* Search the entire massive file for anything you want: type `/` followed by your search
* Go to the next instance of what you searched for: `n`
    
*Side note: check out [this little gem](https://vim-adventures.com/) if you are interested in getting started with VIM in a fun way.*

Here is what I learned from exploring the file:
* The file has the standard opening: `<?xml version='1.0' encoding='UTF-8'?>`. The same declaration is used in the script that makes the smaller sample osm file.
* The XML root element is 'osm' and is parent to all other elements.
* There are 16,082,009 lines in total - wow!
* an example of a node element with no tags:

![node](supporting_files/screenshots/node.png "node")

* a node element with tags:

![node with tags](supporting_files/screenshots/node_tags.png "node with tags")

* a way element with its nested nd (node) element references and tags:

![way](supporting_files/screenshots/way.png "way")

* a relation element with its nested member elements that reference nodes and ways. It also has tag elements:

![relation](supporting_files/screenshots/relation.png "relation")

### Reflecting upon the data - what should I clean?:
Thinking out loud (writing actually), I can see that tag elements seem to constitute the bulk of a way or relation element's meta-data. From what I understand about OpenStreetMap, this should also be where a lot of user-generated/added content is stored. I'll probably have to parse entire elements (node, way, relation) in order to create a well-structured database of the data, but the focus of my cleaning should be in the way tags. 

One other thing catches my eye - the large number of tag elements whose 'k' attribute contains the acronym 'NHD'. What is that about? It turns out NHD stands for the US Geological Survey's ['National Hydrography Dataset'](https://nhd.usgs.gov/), which maps out and documents the nation's watershed boundaries and their features. New Orleans is surrounded by a massive number of such natural features. These will be among the first to go as ocean levels rise, and in my opinion they are worth knowing about. This is my focus - the New Orleans watershed:


![new orleans watershed](supporting_files/screenshots/new_orleans_watershet.png "new orleans watershed")


I would assume the data to be damn near perfect given it was most likely adopted directly from the NHD. The wonderful thing here is that, without much trouble, I can programmatically verify whether or not that is true despite the over 16 million lines of data to deal with.

Ok, here we go!!!

### Auditing the data:
Within the tag elements, 'NHD' stands out as a prefix for the field names of the NHD data (ex. 'ComID', 'FCode' and 'FTYPE'). I need to rely on one of these to isolate the elements in the osm file that I'm interested in when investigating or parsing. I used the official [NHD data model](https://nhd.usgs.gov/NHDv2.2.1_poster_081216.pdf) and the [NHD data dictionary](https://nhd.usgs.gov/userGuide/Robohelpfiles/NHD_User_Guide/Feature_Catalog/Data_Dictionary/Data_Dictionary.htm) to find that ComID is a unique identifier for a natural feature. That will work.

I've created a number of executable Python scripts in the `supporting_files/scripts` folder in the GitHub repository for this report. I've also linked each of them to their page on GitHub for individual viewing and download:

---
*Run [**keys_list.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/keys_list.py) to show a list of unique keys (tag element 'k' attributes) from all of the elements in the osm file that have a child tag element with the 'NHD:ComID' key.*

---
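
For readers who want the idea without opening the script, here is a minimal sketch of how such a key listing could be built. It is not the contents of keys_list.py: the function name `unique_keys`, the `marker_key` parameter and the inline sample data are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; a real run would pass the path to the osm file instead.
SAMPLE = b"""<osm>
  <node id="1" lat="29.9" lon="-90.0"><tag k="amenity" v="school"/></node>
  <way id="10">
    <nd ref="1"/>
    <tag k="NHD:ComID" v="123"/>
    <tag k="NHD:FTYPE" v="StreamRiver"/>
    <tag k="waterway" v="stream"/>
  </way>
  <way id="11">
    <nd ref="1"/>
    <tag k="NHD:ComID" v="124"/>
    <tag k="NHD:FType" v="LakePond"/>
  </way>
</osm>"""


def unique_keys(osm_file, marker_key="NHD:ComID"):
    """Collect every tag key used by elements that carry `marker_key`."""
    keys = set()
    # iterparse streams the file, so the whole document is never held in memory
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            tag_keys = [tag.get("k") for tag in elem.iter("tag")]
            if marker_key in tag_keys:
                keys.update(tag_keys)
            elem.clear()  # drop children that have already been inspected
    return sorted(keys)


print(unique_keys(io.BytesIO(SAMPLE)))
```

The school node is ignored because it has no 'NHD:ComID' tag. The actual output from the full osm file follows.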

In [7]:
%%capture

"""
KEYS LIST:
NHD:ComID
NHD:Elevation
NHD:FCode
NHD:FDate
NHD:FTYPE
NHD:FType
NHD:GNIS_ID
NHD:GNIS_Name
NHD:Permanent_
NHD:RESOLUTION
NHD:ReachCode
NHD:Resolution
NHD:way_id
admin_level
attribution
boat
boundary
created_by
culvert
ele
gnis:county_id
gnis:created
gnis:feature_id
gnis:state_id
history
landuse
layer
leisure
lock
man_made
name
natural
note
place
source
tunnel
type
water
waterway
wetland
"""

Looking over the list, I already see some possible redundancies in the field naming convention for some of the NHD keys ('NHD:FTYPE' and 'NHD:FType', for example). Below I investigate the NHD keys and list the problematic or noteworthy ones:

##### NHD:ComID
The 'Model Changes' section of the NHD data model notes, "ComID field deleted from all feature classes/tables". It turns out that it has been replaced by a different field called 'Permanent\_Identifier' as its unique identifier. The NHD data dictionary clarifies - "features already assigned a ComID retain that value as the Permanent\_Identifier". While this is an important note for future data entry, it will not require cleaning here.

##### NHD:FTYPE and NHD:FType
These two keys appear to be referring to the same thing. To check, I manually searched through the osm file with VIM (keyboard shortcut: `/NHD:FTYPE`). Indeed, both 'FTYPE' and 'FType' refer to the same category of things - 'SwampMarsh', 'StreamRiver', etc. According to the NHD data model, 'FType' is the correct way to refer to this feature.

To fix (clean) this and anything else I may come across in this case study, I created a non-executable python file called [**fix_dict.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/fix_dict.py) (`supporting_files/scripts`) that will be invoked in a later script that will parse the data. The basic algorithm of this file is very simple:
1. Iterate over a dictionary (created from the osm file) to find a nested dictionary of node, way or relation tags values
2. Iterate over the nested dictionary to find the key or value that matches what I need to fix
3. Overwrite data with the fix
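
The three steps can be sketched as follows. This is an illustration rather than the contents of fix_dict.py: the shape of the element dictionary (a `"way_tags"`-style list of tag dictionaries with `key` and `value` fields) and the names `KEY_FIXES`, `VALUE_FIXES` and `fix_tags` are assumptions. The fixes themselves are the ones identified in this case study, two of which come up later in the audit.

```python
# Key renames and value corrections identified in this case study.
KEY_FIXES = {"FTYPE": "FType", "RESOLUTION": "Resolution"}
VALUE_FIXES = {
    ("source", "NHD & bing"): "NHD & Bing",
    ("water", "Bayou"): "bayou",
}


def fix_tags(element):
    """Apply the fixes in place to every '*_tags' list found in `element`."""
    for name, tags in element.items():
        if not name.endswith("_tags"):
            continue  # step 1: only the nested tag collections are of interest
        for tag in tags:
            # step 2 and 3: rename a bad key, then correct a bad value
            tag["key"] = KEY_FIXES.get(tag["key"], tag["key"])
            fixed_value = VALUE_FIXES.get((tag["key"], tag["value"]))
            if fixed_value is not None:
                tag["value"] = fixed_value
    return element


# Hypothetical parsed element
example = {
    "way": {"id": 1},
    "way_tags": [
        {"id": 1, "key": "FTYPE", "value": "StreamRiver", "type": "NHD"},
        {"id": 1, "key": "water", "value": "Bayou", "type": "regular"},
    ],
}
print(fix_tags(example)["way_tags"])
```

Keeping the fixes in lookup tables means a new correction is a one-line addition rather than another branch in the loop.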

##### NHD:GNIS_ID
'GNIS' stands for the '[Geographic Names Information System](https://nhd.usgs.gov/gnis.html)', which is another data set of natural feature identifiers from the U.S. Geological Survey:

>"The Geographic Names Information System (GNIS), developed by the U.S. Geological Survey in cooperation with the U.S. Board on Geographic Names, contains information about physical and cultural geographic features in the United States and associated areas..."

Searching the osm file in VIM (`/NHD:GNIS_ID`), I found that these tags are redundant with 'gnis:feature_id' key/value pairs. For example, these two tag elements are both children of a single way element:

`tag k="NHD:GNIS_ID" v="00559711"`

AND

`tag k="gnis:feature_id" v="00559711"`

Instead of deciding to delete one or the other, I will flag this and any other redundant tags using SQL queries towards the end of this case study, in the section where I consider how the data could be improved. Someone with the appropriate permissions with the OpenStreetMap project can take action based on that.

##### NHD:Permanent_
This field most certainly refers to the NHD data model field 'Permanent_Identifier'. It seems to always share the same value as 'NHD:ComID'. This is another example where I'm not able to decide which is most appropriate, but I can use an SQL query to flag all instances for the OpenStreetMap project.

##### NHD:RESOLUTION
This is something I can fix using my [**fix_dict.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/fix_dict.py) file. According to the NHD data model it should be 'Resolution' (capital R and lowercase remaining letters). While this seems minor, it still fulfills my goal to make the data more accurate and consistent.

##### NHD:way_id
This confused me at first as 'way_id' is not an official field in the NHD data model. I soon noticed that tags with this key have the same value as 'NHD:ComID'. This one seems not only redundant but misleading because its value has nothing to do with the id attribute of the parent way element. As with the other redundancies, I'll use SQL to flag them for the OpenStreetMap project.

OK, I'm focusing on NHD data but I don't want to omit the other tag elements while I have the opportunity to easily examine them. Yet, there isn't an official source I can use to verify they are valid, accurate or complete. My best bet is to manually review them and use my best judgement. This is also an opportunity to review the actual data contained in NHD data tags. 
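
One way to gather every distinct value per key for this kind of manual review is a `defaultdict` of sets. The sketch below shows the idea on a few key/value pairs; the helper name `audit_values` and the sample list are hypothetical, and the linked script may be organized differently.

```python
from collections import defaultdict
from pprint import pprint


def audit_values(tags):
    """Map each tag key to the set of distinct values seen for it."""
    seen = defaultdict(set)
    for key, value in tags:
        seen[key].add(value)  # a set silently ignores repeated values
    return seen


# Hypothetical (key, value) pairs as they might be read from tag elements
sample_tags = [
    ("source", "NHD & Bing"),
    ("source", "NHD & bing"),
    ("water", "Bayou"),
    ("water", "pond"),
    ("water", "pond"),
]
pprint(audit_values(sample_tags))
```

Because near-duplicates such as differently capitalized values end up side by side in the same set, inconsistencies are easy to spot by eye.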

---
*Run [**audit_keys.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/audit_keys.py) in the scripts folder to show each key and a list of all unique values contained from the osm file.*

Below I've included an example head and tail of the output:

In [6]:
%%capture 
"""
defaultdict(<class 'set'>,
            {'NHD:ComID': {'121369572',
                           '121376748',
                           '121376910',
                           '121376927',
                           '121379126',
                           '131339718',
                           '131339719',
                           '131339720',
                           '131339721',
                           '133385498',
                           '137031874',
                           '137032010',
                           '139142995;139144124',
...

                      'Widgeon Pond',
                      'Willow Bayou',
                      'Willow Outside Pond',
                      'Woods Bayou',
                      'Wooly Bayou',
                      'Wright Pass',
                      'Yankee Pond',
                      'Yellow Bayou',
                      'Yellow Lake Bayou'},
             'natural': {'wetland', 'water', 'coastline', 'bay', '_coastline'},
             'note': {'I am using this NHD data to replace old PGS coastline',
                      'I have altered the natural:coatline tag as this way '
                      'duplicates existing coastline ways',
                      'more temporary coastline fixups'},
             'place': {'islet', 'island'},
             'source': {'Bing',
                        'Landsat',
                        'NHD',
                        'NHD & Bing',
                        'NHD & bing',
                        'NHD_import',
                        'NHD_import_v0.2_20091028002820',
                        'NHD_import_v0.2_20091029112923',
                        'NHD_import_v0.2_20110106172549',
                        'NHD_import_v0.4_20091025102857',
                        'NHD_import_v0.4_20091025102857; NHD',
                        'NHD_import_v0.4_20091028000350',
                        'NHD_import_v0.4_20091028000350;NHD_import_v0.4_20091028002739',
                        'NHD_import_v0.4_20091028002739',
                        'NHD_import_v0.4_20091029113158',
                        'NHD_import_v0.4_20110106172508',
                        'PGS'},
             'tunnel': {'yes', 'culvert'},
             'type': {'water', 'multipolygon'},
             'water': {'reservoir', 'Bayou', 'pond'},
             'waterway': {'canal',
                          'dam',
                          'ditch',
                          'drain',
                          'river',
                          'riverbank',
                          'stream'},
             'wetland': {'swamp', 'saltmarsh'}})
"""

Starting from the top of the list, the standouts are:

##### NHD:ComID
There are instances of a value having more than one ComID - for example, '139142995;139144124'. This should be a unique identifier for a geographical feature, so I can't say for sure if multiple values here is valid. It may have a reason depending upon the element/geographical feature. I'll flag each instance of multiple ComID values using SQL in the section considering how to improve the data.

##### NHD:FCode
Again, there are instances with more than one entry, but this is ok - the NHD data model assigns different FCode to the various natural features, like spillways, streams, etc, which may be in multiples for a given feature. 

##### NHD:FDate
As with ComID, multiple FDate entries for a single feature could be problematic. I'll include FDate in the same SQL query that flags multiple ComID values.

##### NHD:ReachCode
As with ComID and FDate, a feature should have only one ReachCode. According to the NHD definition:

> "Unique identifier composed of two parts, first eight digits = subbasin code as defined by FIPS 103, and next six digits = random-assigned sequential number unique within a Cataloguing Unit."

Instead of speculating on what to do I'll add it to my SQL query to flag things like this.
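
As a small illustration, the quoted definition implies a ReachCode of 8 + 6 = 14 digits, which is easy to check programmatically. This sketch rests on my reading of the definition; the function name `check_reach_codes` is hypothetical, and the sample value is a multi-valued ReachCode found in this dataset.

```python
import re

# Assumption based on the quoted definition: 8-digit subbasin code + 6-digit sequence
REACH_CODE = re.compile(r"\d{14}")


def check_reach_codes(value):
    """Split a semicolon-separated value and test each part against the pattern."""
    parts = [part.strip() for part in value.split(";")]
    return [(part, REACH_CODE.fullmatch(part) is not None) for part in parts]


print(check_reach_codes("8090201004544; 08090201008402"))
```

In this value the first code has only 13 digits, which suggests a dropped leading zero, so a check like this could flag malformed codes in addition to multiple entries.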
           
##### natural
There appears to be a redundancy with '_coastline' and 'coastline'. A quick VIM search in the osm document for '_coastline' reveals a 'note' key in a child 'tag' element reading "I have altered the natural:coatline tag as this way duplicates existing coastline ways". I can't say what to do with this either way so I'll leave it as-is.

##### note
I reviewed all instances of a 'note' in a tag element by running the following script:

---
*Run [**audit_notes.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/audit_notes.py) to find all instances of a 'note'.*

---

It doesn't seem like enough information to decide what to do so this one will stay as-is as well.

##### source
This one is simple - I'm going to change 'NHD & bing' to be 'NHD & Bing' where Bing is a proper name of the Microsoft product.

##### water
Finally, this one is pretty minor, but any instance of 'Bayou' should be 'bayou'. This and 'source' are both fixed via [**fix_dict.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/fix_dict.py).

### Parse and clean data, export to CSV files:

Now that I fully understand what needs to be cleaned and flagged for the OpenStreetMap project, I can move on to actually executing on that work. To start with the parsing and cleaning, Python's [ElementTree XML API](https://docs.python.org/3/library/xml.etree.elementtree.html) has the ability to iteratively parse very large XML files. The `data.py` file contains this basic algorithm:

* Prepare CSV files for writing data
* Iterate over each child element of parent element in the file to create the appropriate dictionary
* Apply data cleaning methods to the dictionary ([**fix_dict.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/fix_dict.py))
* Validate dictionary's data format and types against a schema ([**schema_validation.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/schema_validation.py))
* Write to appropriate CSV file depending on dictionary type
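
The outline above can be sketched for way elements only. This is a simplified illustration, not data.py itself: the column names, the `'regular'` tag type for keys without a prefix, the helper names and the in-memory sample are all assumptions, and the cleaning and validation steps are only marked with a comment.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the osm file
SAMPLE = b"""<osm>
  <way id="10" user="alice">
    <nd ref="1"/>
    <tag k="NHD:FTYPE" v="StreamRiver"/>
    <tag k="name" v="Willow Bayou"/>
  </way>
</osm>"""


def shape_way(elem):
    """Turn a <way> element into one 'ways' row and a list of 'ways_tags' rows."""
    way = {"id": elem.get("id"), "user": elem.get("user")}
    tags = []
    for tag in elem.iter("tag"):
        # 'NHD:FTYPE' becomes type 'NHD' and key 'FTYPE'; plain keys get 'regular'
        prefix, sep, rest = tag.get("k").partition(":")
        key, tag_type = (rest, prefix) if sep else (prefix, "regular")
        tags.append({"id": way["id"], "key": key,
                     "value": tag.get("v"), "type": tag_type})
    return way, tags


def export(osm_file, ways_out, tags_out):
    """Stream the osm file and write way rows and tag rows to two CSV outputs."""
    ways_writer = csv.DictWriter(ways_out, fieldnames=["id", "user"])
    tags_writer = csv.DictWriter(tags_out, fieldnames=["id", "key", "value", "type"])
    ways_writer.writeheader()
    tags_writer.writeheader()
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag == "way":
            way, tags = shape_way(elem)
            # cleaning and schema validation would be applied here
            ways_writer.writerow(way)
            tags_writer.writerows(tags)
            elem.clear()  # release the children of the processed element


ways_csv, tags_csv = io.StringIO(), io.StringIO()
export(io.BytesIO(SAMPLE), ways_csv, tags_csv)
print(tags_csv.getvalue())
```

Splitting the key prefix into its own `type` column is what later lets the SQL queries match on short keys such as `ComID` or `FType`.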

---
*Run [**data.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/data.py) to parse and clean the data. Be sure to set the `OSM_PATH` variable to the path of the osm file you've downloaded, and update any paths where output files are saved.*

---

### Examine the data efficiently with database queries:
Before flagging things like redundancies, I'll pull out a few insights from the data.

Managing the SQL database from Python within this notebook is helpful because it gives me an easy way to build, query and display everything. First, I will create the database and schema, then add all the data from the CSV files:

In [1]:
import sqlite3
import csv
from pprint import pprint
from supporting_files.scripts import database_schema as ds
import os

sqlite_file = 'supporting_files/exports/databases/osm_db.db'

# connects to the database
conn = sqlite3.connect(sqlite_file)

# cursor object for database interaction
c = conn.cursor()

In [2]:
# load schema
sql_schema = ds.sql_schema

# Create the tables, specifying the schema
c.executescript(sql_schema)
# commit the changes
conn.commit()

from csv import DictReader

# Takes a CSV file and writes its rows to the table named after the file.
# The first row of each CSV MUST contain the field names.
def add_table_from_csv(filename):
    if not filename.endswith('.csv'):
        return

    # table name = file name without directory or extension
    tablename = os.path.splitext(os.path.basename(filename))[0]
    # newline='' is the csv module's recommended way to open files
    with open(filename, 'r', newline='') as fin:
        dr = DictReader(fin)
        fieldnames = dr.fieldnames
        to_db = [tuple(row[k] for k in fieldnames) for row in dr]
    # one "?" placeholder per column
    placeholders = ",".join("?" * len(fieldnames))
    c.executemany("INSERT INTO {}({}) VALUES({});".format(
        tablename, ",".join(fieldnames), placeholders), to_db)
    conn.commit()
    
# function loops through all files in a directory
def add_to_db(directory):
    for f_name in os.listdir(directory):
        full_path = os.path.join(directory, f_name)
        add_table_from_csv(full_path)

add_to_db('supporting_files/exports/csv')

##### Database file size:

In [3]:
database_size = os.path.getsize(sqlite_file) / 1000000
print("The size of the database is: %s mb" %round(database_size, 2))

The size of the database is: 77.13 mb


##### Create my own function for rendering a database query into a pandas dataframe:

In [4]:
import pandas as pd
pd.set_option('display.max_rows', 15) # sets pandas max rows to display
def df_query(query):
    df = pd.read_sql(query, conn)
    return df

##### The top 10 contributors to this section of OpenStreetMap:

In [5]:
# top 10 contributors
query = """
SELECT
    all_tables.user,
    SUM(Total) Total
FROM
    (
    SELECT
        nodes.user,
        COUNT(*) AS Total
    FROM nodes
    GROUP BY nodes.user

    UNION ALL

    SELECT
        relations.user,
        COUNT(*) AS Total
    FROM relations
    GROUP BY relations.user
    
    UNION ALL
    
    SELECT
        ways.user,
        COUNT(*) AS Total
    FROM ways
    GROUP BY ways.user
    ORDER BY Total DESC   
    
    ) AS all_tables

GROUP BY user
ORDER BY Total DESC
LIMIT 10
"""

df_query(query)

Unnamed: 0,user,Total
0,Matt Toups,25243
1,Maarten Deen,17221
2,Andre68,4370
3,OSMF Redaction Account,2203
4,ELadner,1349
5,woodpeck_repair,985
6,Hartmut Holzgraefe,456
7,wvdp,455
8,eric22,367
9,DKNOTT,201


##### Total number of nodes:

In [6]:
# nodes
query = """
SELECT COUNT(*) AS Total FROM nodes
"""

df_query(query)

Unnamed: 0,Total
0,2


##### Total number of ways:

In [7]:
# ways
query = """
SELECT COUNT(*) AS Total FROM ways
"""

df_query(query)

Unnamed: 0,Total
0,53562


##### Total number of relations:

In [8]:
# relations
query = """
SELECT COUNT(*) AS Total FROM relations
"""

df_query(query)

Unnamed: 0,Total
0,406


##### Total number of nodes, ways and relations:

In [9]:
# nodes, ways and relations
query = """
SELECT
    SUM(all_tables.Total) AS Total_All
FROM 
    (
        SELECT COUNT(*) AS Total FROM nodes

        UNION ALL

        SELECT COUNT(*) AS Total FROM ways

        UNION ALL

        SELECT COUNT(*) AS Total FROM relations
    ) AS all_tables
"""

df_query(query)

Unnamed: 0,Total_All
0,53970


##### Unnamed natural features from the New Orleans watershed:
As I mentioned earlier, we're set to lose quite a lot as sea levels rise due to global warming. The city of New Orleans is a huge loss of course, but what about all of the unnamed features in the New Orleans watershed?

For example, the unnamed natural feature with way ID 43829974 includes node 555906935, which has the coordinates lat="29.2658771" lon="-89.4146036". OpenStreetMap shows [those coordinates](http://www.openstreetmap.org/search?query=29.2658771%2C%20-89.4146036#map=18/29.26588/-89.41460) to indeed be one of many unnamed natural features. It's probably not a feature that justifies a name, but it can certainly be quantified along with the many others.

![unnamed natural feature](supporting_files/screenshots/unnamed_feature.png "unnamed natural feature")

To begin, here is a list of all the named natural features in the New Orleans watershed:

In [10]:
# list of named features
query = """

SELECT
    all_tables.value
FROM
    (
    SELECT
        nodes_tags.value
    FROM nodes_tags
    WHERE nodes_tags.key="name"
    GROUP BY nodes_tags.value
    
    UNION ALL
    
    SELECT
        ways_tags.value
    FROM ways_tags
    WHERE ways_tags.key="name"
    GROUP BY ways_tags.value
    
    UNION ALL
    
    SELECT
        relations_tags.value
    FROM relations_tags
    WHERE relations_tags.key="name"
    GROUP BY relations_tags.value
    ORDER BY relations_tags.value
    
    ) AS all_tables

GROUP BY all_tables.value
ORDER BY all_tables.value
"""

df_query(query)

Unnamed: 0,value
0,Abita Creek
1,Abita River
2,Adema Pond
3,Alligator Branch
4,Bark Landing River
5,Bay Jaque
6,Bay Lanaux
...,...
261,Willow Outside Pond
262,Woods Bayou


To find all the unnamed features, I find every element with the 'FType' or 'FTYPE' key and keep it only if its id does not also carry a 'name' tag. Here I aggregate by the value, which is the type of natural feature, such as a lake/pond or swamp/marsh.

In [11]:
# list of natural features that are NOT in the list of named features
query = """
SELECT *, COUNT(*) as Total
    FROM (
    SELECT
        all_tables.id,
        all_tables.value
    FROM
        (
        SELECT
            *
        FROM nodes_tags
        WHERE nodes_tags.key="FType" OR nodes_tags.key="FTYPE"
        
        UNION ALL
        
        SELECT
            *
        FROM ways_tags
        WHERE ways_tags.key="FType" OR ways_tags.key="FTYPE"

        UNION ALL

        SELECT
            *
        FROM relations_tags
        WHERE relations_tags.key="FType" OR relations_tags.key="FTYPE"

        ) AS all_tables
    ) AS ftype_table
WHERE ftype_table.id NOT IN (
    SELECT
        all_tables.id
    FROM
        (
        SELECT
            *
        FROM nodes_tags
        WHERE nodes_tags.key="name"
        
        UNION ALL
        
        SELECT
            *
        FROM ways_tags
        WHERE ways_tags.key="name"

        UNION ALL

        SELECT
            *
        FROM relations_tags
        WHERE relations_tags.key="name"

        ) AS all_tables
)
GROUP By ftype_table.value
ORDER BY Total DESC
"""

df_query(query)

Unnamed: 0,id,value,Total
0,7077413,LakePond,21686
1,7077411,SwampMarsh,16188
2,2313282,StreamRiver,11758
3,2090861,CanalDitch,2274
4,475760275,SeaOcean,483
5,107501995,Reservoir,177
6,43679482,DamWeir,85
7,190765058,SwampMarsh;LakePond,21
8,191631489,CanalDitch;SwampMarsh,7
9,97026929,Gate,6
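
To sanity-check the "has an FType but no name" logic, the same pattern can be run against a toy in-memory SQLite database. The rows below are hypothetical and only the `ways_tags` table is modeled; the connection is named `conn_demo` so it does not interfere with the notebook's `conn`.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE ways_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
INSERT INTO ways_tags VALUES
    (1, 'FType', 'LakePond', 'NHD'),
    (1, 'name', 'Adema Pond', 'regular'),
    (2, 'FTYPE', 'LakePond', 'NHD'),
    (3, 'FType', 'SwampMarsh', 'NHD');
""")

# Features that have a feature type but no 'name' tag, counted per type
unnamed = conn_demo.execute("""
    SELECT value, COUNT(*) AS total
    FROM ways_tags
    WHERE key IN ('FType', 'FTYPE')
      AND id NOT IN (SELECT id FROM ways_tags WHERE key = 'name')
    GROUP BY value
    ORDER BY total DESC, value
""").fetchall()
print(unnamed)
conn_demo.close()
```

Way 1 is excluded because it has a name, leaving one unnamed LakePond and one unnamed SwampMarsh. Selecting only the grouped column and the count also sidesteps the `id` column in the output above, which is just one arbitrary id per group and carries no meaning of its own.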


I find this pretty compelling - if the New Orleans watershed goes underwater in as little as 100 years, we're bound to lose 21,686 ponds/lakes, 16,188 swamps/marshes, etc. This will amount to a huge loss for a long list of reasons. According to the U.S. Geological Survey's article [Louisiana Coastal Wetlands: A Resource At Risk](https://pubs.usgs.gov/fs/la-wetlands/):

>"Barrier islands fronting the Mississippi River delta plain act as a buffer to reduce the effects of ocean waves and currents on associated estuaries and wetlands. Louisiana's barrier islands are eroding, however, at a rate of up to 20 meters per year; so fast that, according to recent USGS estimates, several will disappear by the end of the century. As the barrier islands disintegrate, the vast system of sheltered wetlands along Louisiana's delta plains are exposed to the full force and effects of open marine processes such as wave action, salinity intrusion, storm surge, tidal currents, and sediment transport that combine to accelerate wetlands deterioration."

### Consider how the data could be improved:

As mentioned in the auditing section of this case study, I found various instances of redundant data, and I decided to flag it all for the OpenStreetMap project. Someone from the OpenStreetMap project with appropriate permissions could decide what the right thing to do is in each of the following cases and take action. This could reduce the file size of the OpenStreetMap data set and improve data consistency/uniformity.

Here are the queries pointing out this problematic data:

##### All instances where NHD:GNIS_ID is redundant with gnis:feature_id:

In [12]:
# redundancy between NHD:GNIS_ID and gnis:feature_id
query = """
SELECT
    x.id,
    x.key,
    x.value,
    y.key,
    y.value
FROM (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="GNIS_ID"

    UNION ALL
    
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="GNIS_ID"

    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="GNIS_ID"
) AS x
    JOIN (
    SELECT
    *
    FROM nodes_tags
    WHERE nodes_tags.key="feature_id"
    
    UNION ALL
    
    SELECT
    *
    FROM ways_tags
    WHERE ways_tags.key="feature_id"
    
    UNION ALL
    
    SELECT
    *
    FROM relations_tags
    WHERE relations_tags.key="feature_id"
    ) AS y
    ON x.id=y.id
WHERE x.value=y.value
"""

df_query(query)

Unnamed: 0,id,key,value,key.1,value.1
0,558006771,GNIS_ID,00559711,feature_id,00559711
1,3301246063,GNIS_ID,00554877,feature_id,00554877
2,22516599,GNIS_ID,00538745,feature_id,00538745
3,22516707,GNIS_ID,00536584,feature_id,00536584
4,22517455,GNIS_ID,00538745,feature_id,00538745
5,22523995,GNIS_ID,00536584,feature_id,00536584
6,22524448,GNIS_ID,00536584,feature_id,00536584
...,...,...,...,...,...
920,4512013,GNIS_ID,00532601,feature_id,00532601
921,4512014,GNIS_ID,00555590,feature_id,00555590


##### All instances where NHD:Permanent_ has the same value as NHD:ComID:

In [13]:
# redundancy between NHD:Permanent_ and NHD:ComID
query = """
SELECT
    x.id,
    x.key,
    x.value,
    y.key,
    y.value
FROM (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="ComID"
) AS x
    JOIN (
    SELECT
    *
    FROM nodes_tags
    WHERE nodes_tags.key="Permanent_"
    
    UNION ALL
    
    SELECT
    *
    FROM ways_tags
    WHERE ways_tags.key="Permanent_"
    
    UNION ALL
    
    SELECT
    *
    FROM relations_tags
    WHERE relations_tags.key="Permanent_"
    ) AS y
    ON x.id=y.id
WHERE x.value=y.value
"""

df_query(query)

Unnamed: 0,id,key,value,key.1,value.1
0,43239879,ComID,148751000,Permanent_,148751000
1,43244825,ComID,148750992,Permanent_,148750992
2,43246802,ComID,151098269,Permanent_,151098269
3,43246809,ComID,151098249,Permanent_,151098249
4,43246862,ComID,151098318,Permanent_,151098318
5,43246875,ComID,151098331,Permanent_,151098331
6,43246920,ComID,151098259,Permanent_,151098259
...,...,...,...,...,...
1288,2313274,ComID,151099754,Permanent_,151099754
1289,2313276,ComID,151099763,Permanent_,151099763


##### All instances where NHD:way_id has the same value as NHD:ComID (also note that NHD:way_id may not even be a valid field):

In [14]:
# redundancy between NHD:way_id and NHD:ComID
query = """
SELECT
    x.id,
    x.key,
    x.value,
    y.key,
    y.value
FROM (
    SELECT
        *
    FROM nodes_tags
    WHERE nodes_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM ways_tags
    WHERE ways_tags.key="ComID"

    UNION ALL
    
    SELECT
        *
    FROM relations_tags
    WHERE relations_tags.key="ComID"
) AS x
    JOIN (
    SELECT
    *
    FROM nodes_tags
    WHERE nodes_tags.key="way_id"
    
    UNION ALL
    
    SELECT
    *
    FROM ways_tags
    WHERE ways_tags.key="way_id"
    
    UNION ALL
    
    SELECT
    *
    FROM relations_tags
    WHERE relations_tags.key="way_id"
    ) AS y
    ON x.id=y.id
WHERE x.value=y.value
"""

df_query(query)

Unnamed: 0,id,key,value,key.1,value.1
0,41225956,ComID,139142734,way_id,139142734
1,41226082,ComID,139142736,way_id,139142736
2,43166285,ComID,148743907,way_id,148743907
3,43166292,ComID,148742957,way_id,148742957
4,43166296,ComID,148743712,way_id,148743712
5,43166298,ComID,148744066,way_id,148744066
6,43166299,ComID,148739594,way_id,148739594
...,...,...,...,...,...
12976,446304952,ComID,139147561,way_id,139147561
12977,480254887,ComID,139142758,way_id,139142758
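
The three redundancy queries above differ only in the two keys being compared, so they could be generated by one helper. The sketch below is a possible refactoring, not something used in this notebook: `same_value_pairs` and `TAG_TABLES` are hypothetical names, and the demo rows are made up. Table names come from a fixed tuple, while the keys are passed as bound parameters.

```python
import sqlite3

TAG_TABLES = ("nodes_tags", "ways_tags", "relations_tags")


def same_value_pairs(connection, key_a, key_b):
    """Return (id, value_a, value_b) rows where both keys share a value on one id."""
    union = " UNION ALL ".join(
        "SELECT id, key, value FROM {}".format(table) for table in TAG_TABLES
    )
    query = """
        SELECT x.id, x.value, y.value
        FROM ({u}) AS x
        JOIN ({u}) AS y ON x.id = y.id AND x.value = y.value
        WHERE x.key = ? AND y.key = ?
    """.format(u=union)
    return connection.execute(query, (key_a, key_b)).fetchall()


# Toy database with hypothetical rows
demo = sqlite3.connect(":memory:")
for table in TAG_TABLES:
    demo.execute(
        "CREATE TABLE {} (id INTEGER, key TEXT, value TEXT, type TEXT)".format(table)
    )
demo.executemany(
    "INSERT INTO ways_tags VALUES (?, ?, ?, ?)",
    [
        (1, "ComID", "111", "NHD"),
        (1, "Permanent_", "111", "NHD"),
        (2, "ComID", "222", "NHD"),
    ],
)
pairs = same_value_pairs(demo, "ComID", "Permanent_")
print(pairs)
demo.close()
```

Calling the helper with `("GNIS_ID", "feature_id")`, `("ComID", "Permanent_")` or `("ComID", "way_id")` would cover the three cases with a single query definition.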


##### Finally, all instances where a node, way or relation has more than one ComID, FDate or ReachCode:

In [15]:
# all instances of multiple ComID, FDate or ReachCode
query = """
SELECT
    *
FROM nodes_tags
WHERE (key = "ComID" OR key = "FDate" OR key = "ReachCode") AND value LIKE "%;%"

UNION ALL

SELECT
    *
FROM ways_tags
WHERE (key = "ComID" OR key = "FDate" OR key = "ReachCode") AND value LIKE "%;%"

UNION ALL

SELECT
    *
FROM relations_tags
WHERE (key = "ComID" OR key = "FDate" OR key = "ReachCode") AND value LIKE "%;%"
"""

df_query(query)

Unnamed: 0,id,key,value,type
0,43216266,ComID,148743929; 148751048,NHD
1,43216266,ReachCode,8090201004544; 08090201008402,NHD
2,43261226,ComID,139142995;139144124,NHD
3,43287755,ComID,139153311;139153312,NHD
4,43393226,ComID,139182074;139182055,NHD
5,43393226,ReachCode,08090203004245;08090203031009,NHD
6,43429207,ComID,139184297;139184312,NHD
...,...,...,...,...
68,191631474,FDate,2008/07/04;2005/12/05,NHD
69,191631476,ComID,143863413;143853555;143863370;143853787;143853...,NHD


In [16]:
conn.close()

### Implementing the improvement:
Ultimately it's up to someone with appropriate permissions who is a part of the OpenStreetMap project to take on the task of correcting the redundant data problem. I suspect it would be well worth the effort considering the nature of the problem:

* The scope of NHD data covers at least the United States and territories.
* I've only considered an approximately 100x150 mile area of the map.
* If redundant fields are present across all NHD data that is included in the OpenStreetMap project, there may literally be gigabytes of redundancies with the potential to be eliminated.

How much data-savings are we talking about here? According to [GeoFabrik](http://download.geofabrik.de/north-america.html), the United States downloaded as a **compressed** osm file is approximately 12.4gb! As a rough comparison, the area I used in this case study is an 81.7mb compressed file that balloons to 1.28gb uncompressed. This expansion by a factor of approximately 15.7 means that if the redundant data constitutes just 2.5 percent of the whole dataset of the United States, there may be savings of close to 5gb of uncompressed data! See the math below:

In [15]:
uncompress_factor = 1280/81.7
uncompress_factor

15.667074663402692

In [16]:
us_uncompress_size = 12400 * 15.667
us_uncompress_size

194270.8

In [17]:
redundant_saving = (us_uncompress_size * .025) / 1000  # 2.5% of the total, converted from mb to gb
print("Approximate uncompressed redundant data removed in gb: " + str(round(redundant_saving, 2)))

Approximate uncompressed redundant data removed in gb: 4.86


Again, someone with the appropriate permissions would need to do this. It would not be overly complicated. Some steps to implementing this may be:

1. Modify the [**data.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/data.py) script to parse the map of the United States.
2. Confirm that the fixes in [**fix_dict.py**](https://github.com/rancherobeans/udacity/blob/master/P3/PROJECT/supporting_files/scripts/fix_dict.py) are applied during parsing (data.py invokes this file; it is not run on its own).
3. Execute the SQL queries in the "Consider how the data could be improved" section of this case study to generate lists of the redundancies. 
4. Audit the lists/conduct a reality check for the upcoming deletion process.  
5. Use the lists of the redundancies to remove the data from the OpenStreetMap database.

I believe the logic applied here to tag elements and their key/value pairs would hold up for the entire US map, as it did for this case study. A careful data analyst would, however, test that assumption while conducting the database cleaning. Additionally, contacting the US Geological Survey's support service would be advisable. There may be a valid reason for an element to have multiple ComID values, for example, but one cannot be certain without simple, practical steps like asking around. I actually contacted the support staff at nhd@usgs.gov while conducting this case study to confirm that `Permanent_Identifier` is indeed the new primary key in the NHD dataset (where ComID was the prior one).

### Conclusion to the case study:
This study moved from getting to know the data, auditing it, parsing/cleaning the data, learning from it using SQL queries and finally to suggesting some improvements. It was all conducted within the context of New Orleans and the quantified natural features we may lose within the next 100 years due to global warming. I encourage anyone to download this GitHub repository and run the project locally. Please also feel free to send me a pull request with any changes you think may be valuable.