# OpenStreetMap Data Audit
___

Matthew Flood

### Map Area
 New Orleans, LA, United States

- [https://www.openstreetmap.org/relation/131885](https://www.openstreetmap.org/relation/131885)
- [https://www.openstreetmap.org/export#map=11/30.0326/-89.8826](https://www.openstreetmap.org/export#map=11/30.0326/-89.8826)
- [https://overpass-api.de/api/map?bbox=-90.2170,29.8633,-89.5482,30.2015](https://overpass-api.de/api/map?bbox=-90.2170,29.8633,-89.5482,30.2015)

I'll be visiting New Orleans in the next month, so I want to get a feel for the amenities and sights.
   
   

In [6]:
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically.
import xml.etree.ElementTree as ET
import pprint
from collections import defaultdict
import re
MAP_PATH="maps/new_orleans_city.osm"


## Audit the top level tags

> First, we want to identify all the top level tags in the data set to make sure the document looks sane from a high level.

In [2]:
def identify_tags_fullpath(filepath):
    """
        Identifies tags along with their parent hierarchy
    """
    tag_types = {}
    tag_types_ordered = []
    stack = []
    with open(filepath, "r") as handle:
        for event, elem in ET.iterparse(handle, events=("start","end")):
            if event == "start":
                stack.append(elem.tag)
                tag_path = "/".join(stack)
                tag_types.setdefault(tag_path, 0)
                if tag_types[tag_path] == 0:
                    # print(tag_path)
                    tag_types_ordered.append(tag_path)
                tag_types[tag_path] += 1
            if event == "end":
                stack.pop()

    for item in tag_types_ordered:
        print("{0: <20} {1: >10}".format(item, tag_types[item]))

In [7]:
identify_tags_fullpath(MAP_PATH)

osm                           1
osm/note                      1
osm/meta                      1
osm/bounds                    1
osm/node                1786106
osm/node/tag             130282
osm/way                  189487
osm/way/nd              2050430
osm/way/tag              535711
osm/relation               1407
osm/relation/member       25395
osm/relation/tag           7114
osm/remark                    1


> This structure is what we expected and nothing looks out of place.  We see `<node>`, `<way>` and `<relation>` elements.  All three have nested `<tag>` elements; `<way>` elements also have nested `<nd>` elements, and `<relation>` elements have nested `<member>` elements.

## Audit the tag 'keys' 

> Now we want to audit the `k` attribute of each `<tag>` element and see if there are any potential problems.  The following utility function puts each key into one of four buckets.

In [8]:
def audit_tag_keys(filepath, parent_tag):
    """
        parent_tag: way, node, relation
        
        Audit all the "k" attributes and put them into 4 buckets
        1. all lowercase and valid
        2. all lowercase with a ':' in the middle, and valid
        3. ones with bad characters
        4. others (none of the above)
    """
    
    lowercase_tags = {}
    lowercase_colon_tags = {}
    bad_char_tags = {}
    other_tags = {}
   
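    # All three patterns are applied with match(), which anchors at the start of the key.
    regex_all_lower = re.compile(r'^([_a-z])*$')
    lower_colon = re.compile(r'^([_a-z])*:([_a-z])*$')
    # No trailing '$' here, so this only tests the first character: the "bad character"
    # bucket holds keys that START with something other than a-z or '_'.
    problemchars = re.compile(r'[^_a-z]')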
    
    with open(filepath, "r") as handle:
        # Use "end" events: on "start" the element's children may not be parsed yet.
        for event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == parent_tag:
                for elem in elem.iterfind("tag"):
                    if elem.tag == "tag":
                        k_val = elem.attrib['k']
                        v_val = elem.attrib['v']
                        # print(k_val)
                        if regex_all_lower.match(k_val):
                            lowercase_tags.setdefault(k_val, set())
                            lowercase_tags[k_val].add(v_val)
                        elif lower_colon.match(k_val):
                            lowercase_colon_tags.setdefault(k_val, set())
                            lowercase_colon_tags[k_val].add(v_val)
                        elif problemchars.match(k_val):
                            bad_char_tags.setdefault(k_val, set())
                            bad_char_tags[k_val].add(v_val)
                        else:
                            other_tags.setdefault(k_val, set())
                            other_tags[k_val].add(v_val)
    
    return (lowercase_tags, lowercase_colon_tags, bad_char_tags, other_tags)
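
To see how the three patterns in `audit_tag_keys` route individual keys, here is a minimal sketch that applies the same regular expressions to one key at a time. The helper name `classify_key` and the bucket labels it returns are illustrative assumptions, not part of the notebook; the sample keys are taken from this data set.

```python
import re

# Same three patterns as audit_tag_keys, compiled once at module level.
REGEX_ALL_LOWER = re.compile(r'^([_a-z])*$')
LOWER_COLON = re.compile(r'^([_a-z])*:([_a-z])*$')
PROBLEMCHARS = re.compile(r'[^_a-z]')


def classify_key(key):
    """Return the bucket that audit_tag_keys would choose for one key (hypothetical helper)."""
    if REGEX_ALL_LOWER.match(key):
        return "lowercase"
    if LOWER_COLON.match(key):
        return "lowercase_colon"
    # match() only tests the start of the string, so this branch catches
    # keys whose first character is not a lowercase letter or underscore.
    if PROBLEMCHARS.match(key):
        return "bad_char"
    return "other"


sample_keys = ("amenity", "addr:city", "FIXME", "NHD:FType",
               "gnis:Class", "name_1", "service:bicycle:pump")
for key in sample_keys:
    print("{0: <24} {1}".format(key, classify_key(key)))
```

Because `match()` anchors at the start, a key such as `gnis:Class` (uppercase after the colon) or `name_1` (digit at the end) falls through to the "other" bucket rather than the "bad character" bucket.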

> I want to look at Nodes and Ways separately, so I'm going to store the "key" info for each of those in separate variables.

In [9]:
node_lowercase_tags, node_lowercase_colon_tags, node_bad_char_tags, node_other_tags = audit_tag_keys(MAP_PATH, "node")
way_lowercase_tags, way_lowercase_colon_tags, way_bad_char_tags, way_other_tags = audit_tag_keys(MAP_PATH, "way")


### Tags with bad characters
First, let's look at the keys with bad characters in nodes, and then in ways.

In [10]:
sorted_keys = list(node_bad_char_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = node_bad_char_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

ADDRESS_LA                   85 9430 OLEANDER ST, 9028 FORSHEY ST, 9030 FORSHEY ST, 3225 MISTLETOE ST, 9311 FIG 
FIXME                         9 Where does operation change?, rough location, rough location; does 114 stop here
HIV                           1 Delgado STD Clinic
Payments                      1 MasterCard, Amex, Cash, Visa, Discover, Checks


In [11]:
sorted_keys = list(way_bad_char_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = way_bad_char_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

ADDRESS_LA                   13 3223 HAMILTON ST, 3321 HAMILTON ST, 3034 HAMILTON ST, 3212 HAMILTON ST, 9200 FOR
FIXME                         6 rough alignment in places, dual carriageway, verify, check lanes, route temporar
FIXME:ncn_ref                 1 verify
FIXME:ref                     1 verify north end
NHD:ComID                  6845 143839747, 139202274, 139181741, 139146052, 139186134, 139148356, 139188218, 139
NHD:Elevation                 5 -2.40000000000, 0.60000000000, 0.90000000000, -2.70000000000, -3.00000000000
NHD:FCode                    13 43612, 39004, 36900, 39800, 46003, 33600, 48300, 44500, 46600, 39001, 43609, 460
NHD:FDate                    12 2005/12/07, 2006/09/25, 2006/05/31, 2005/08/26, 2008/07/04, 2005/01/14, 2005/08/
NHD:FTYPE                     7 SwampMarsh, Reservoir, Lock Chamber, StreamRiver, CanalDitch, LakePond, SeaOcean
NHD:FType                     5 Gate, Wall, Lock Chamber, StreamRiver, CanalDitch
NHD:GNIS_ID                  12 01627901, 0

> So, these keys do not really have bad data, they just have capital letters.  And some, like `NHD:FTYPE` and `NHD:FType`, differ only in capitalization.
> We will address this by forcing all the `k` values to lowercase when we perform our import.
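
As a rough sketch of that lowercasing step, the snippet below merges entries whose keys differ only by case. The function names `normalize_key` and `merge_by_normalized_key` are hypothetical, and the sample values are a subset of those shown in the output above; the real import code may look different.

```python
def normalize_key(key):
    """Lowercase a tag key so that case variants collapse to a single key."""
    return key.lower()


def merge_by_normalized_key(tag_values):
    """Merge {key: set_of_values} entries whose keys differ only by case."""
    merged = {}
    for key, values in tag_values.items():
        merged.setdefault(normalize_key(key), set()).update(values)
    return merged


# Subset of the values seen for the two NHD feature-type keys.
sample = {
    "NHD:FTYPE": {"SwampMarsh", "Reservoir", "StreamRiver"},
    "NHD:FType": {"Gate", "Wall", "StreamRiver"},
}
merged = merge_by_normalized_key(sample)
print(sorted(merged))
print(sorted(merged["nhd:ftype"]))
```

After merging, both spellings are stored under the single key `nhd:ftype`, and duplicate values such as `StreamRiver` appear only once because the values are kept in a set.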


### Other keys

> Now we want to look at the "other" bucket to see what kind of issues there are.

In [20]:
sorted_keys = list(node_other_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = node_other_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

amenity_1                     1 cafe
currency:ETH                  1 yes
currency:LTC                  1 yes
currency:XBT                  1 yes
generator:output:electricity          1 560 MW
gnis:Class                    1 Populated Place
gnis:County                   5 St. Tammany, Jefferson, Orleans, Plaquemines, St. Bernard
gnis:County_num               5 075, 087, 103, 051, 071
gnis:ST_alpha                 1 LA
gnis:ST_num                   1 22
name:etymology:wikidata          1 Q8027
name_1                        1 Lakefront Airport
service:bicycle:diy           1 yes
service:bicycle:pump          1 yes
service:bicycle:rental          1 yes
service:bicycle:repair          1 yes
service:bicycle:retail          1 yes
service:bicycle:second_hand          1 yes
service:bicycle:tools          1 yes
socket:nema_5_20              1 2
socket:type1                  2 2, 1
source:name:oc                1 Lo Congrès
wikipedia_1                   1 pl:Pomnik Andrew Jacksona w Waszyngtonie


In [21]:
sorted_keys = list(way_other_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = way_other_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

building:roof:material          1 glass
cycleway:left:oneway          1 no
cycleway:left:width           1 9'
cycleway:right:oneway          1 no
cycleway:right:width          5 11', 12', 9', 8', 9
destination:ref:lanes          2 LA 45|LA 45, LA 45
gnis:Class                    1 Populated Place
gnis:County                   1 Jefferson
gnis:County_num               1 051
gnis:ST_alpha                 1 LA
gnis:ST_num                   1 22
name:etymology:wikidata          1 Q8027
name_1                       93 State Route 3137, Harrow Road, Highway 433, Beech Drive, Cameron Boulevard, 3rd 
name_2                        7 Sanctuary Drive, West Saint Bernard Highway, West Saint Jean Baptiste Street, Sa
plant:output:electricity          2 959 MW, 1646 MW
ref:AEFE                      1 272N15
service:vehicle:body_repair          1 yes
sidewalk:both:surface          1 concrete
tiger:name_base_1           119 Veterans Memorial, State Route 3137, Highway 433, I-510, State Route 45, Johnso

> Again, we see issues related to capitalization, along with keys that contain more than one colon, such as `service:bicycle:pump`.  We also see keys with numeric suffixes.  A key can be used only once per element, so an element with more than one name cannot carry three `name` tags; it needs `name`, `name_1` and `name_2` instead.

> To verify that this is the case, I want to look at a few tags that have `name_`.

#### Look into name_1 and amenity_1

> This function will find entities that have a tag with the given `tag_key` and print out the full set of tags for the node/way/relation.



In [22]:
def print_example_entities(filepath, tag_key, max_results):
    """
        finds and prints up to max_results sample Nodes / Ways / Relations
        that have the specified tag key
    """
    with open(filepath, "r") as handle:
        num_examples_found = 0
        # Use "end" events: on "start" the element's children may not be parsed yet.
        for event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag in ("node", "relation", "way"):
                found_an_example = False
                example_dict = {}
                
                for child_tag in elem.iterfind("tag"):
                    if child_tag.tag == "tag":
                        k_val = child_tag.attrib['k']
                        v_val = child_tag.attrib['v']
                        
                        example_dict.setdefault(k_val, set())
                        example_dict[k_val].add(v_val)
                        
                        if k_val == tag_key:
                            found_an_example = True
                
                if found_an_example:
                    num_examples_found += 1
                    print("Example {}".format(elem.tag))    
                    pprint.pprint(example_dict)
                          
                    if max_results and (num_examples_found >= max_results):
                        # we have printed max_results so we are done
                        return
 

In [23]:
print_example_entities(MAP_PATH, "name_1", max_results=3)

Example node
{'addr:state': {'LA'},
 'aeroway': {'aerodrome'},
 'closest_town': {'New Orleans, Louisiana'},
 'ele': {'1'},
 'faa': {'NEW'},
 'gnis:county_name': {'Orleans'},
 'gnis:created': {'08/01/1991'},
 'gnis:feature_id': {'1632892'},
 'gnis:feature_type': {'Airport'},
 'iata': {'NEW'},
 'icao': {'KNEW'},
 'name': {'New Orleans Lakefront Airport'},
 'name:en': {'New Orleans Lakefront Airport'},
 'name_1': {'Lakefront Airport'},
 'operator': {'Orleans Levee District'},
 'source': {'wikipedia'},
 'source_ref': {'geonames.usgs.gov'},
 'type': {'public'},
 'wikidata': {'Q10853543'},
 'wikipedia': {'en:New Orleans Lakefront Airport'}}
Example way
{'highway': {'secondary'},
 'maxspeed': {'45 mph'},
 'name': {'Peters Road'},
 'name_1': {'State Route 3017'},
 'ref': {'LA 3017'},
 'tiger:cfcc': {'A41;A35'},
 'tiger:county': {'Jefferson, LA'},
 'tiger:name_base': {'Peters'},
 'tiger:name_base_1': {'State Route 3017'},
 'tiger:name_type': {'Rd'},
 'tiger:reviewed': {'no'},
 'tiger:zip_left':

> Yup, these all have `name_1` as an alternate to the `name` key. I want to look at `amenity_1` as well.

In [24]:
print_example_entities(MAP_PATH, "amenity_1", max_results=3)

Example node
{'addr:city': {'New Orleans'},
 'addr:housenumber': {'2372 #130'},
 'addr:postcode': {'70117'},
 'addr:state': {'LA'},
 'addr:street': {'Saint Claude Avenue'},
 'amenity': {'restaurant'},
 'amenity_1': {'cafe'},
 'delivery': {'no'},
 'name': {'The Spotted Cat'},
 'takeaway': {'yes'}}
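
The numbered-suffix convention seen in `name_1` and `amenity_1` can be folded back into a single ordered list of values per base key. The following is a minimal sketch under that assumption; `collect_numbered_values` is a hypothetical helper, and the sample tags are copied from the node shown above.

```python
import re


def collect_numbered_values(tags, base_key):
    """Return the values of base_key, base_key_1, base_key_2, ... in suffix order."""
    # Matches exactly "base_key" or "base_key_<digits>"; keys such as "name:en" do not match.
    pattern = re.compile(r'^{}(?:_(\d+))?$'.format(re.escape(base_key)))
    found = []
    for key, value in tags.items():
        m = pattern.match(key)
        if m:
            # The unsuffixed key sorts first (index 0).
            index = int(m.group(1)) if m.group(1) else 0
            found.append((index, value))
    return [value for index, value in sorted(found)]


tags = {
    "amenity": "restaurant",
    "amenity_1": "cafe",
    "name": "The Spotted Cat",
    "takeaway": "yes",
}
print(collect_numbered_values(tags, "amenity"))
print(collect_numbered_values(tags, "name"))
```

Keeping the unsuffixed key first preserves the mapper's apparent intent that `amenity` is the primary value and `amenity_1` is a secondary one.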


### Lowercase keys without colons

> Now we want to look at the lowercase keys, the ones without colons.

In [29]:
sorted_keys = list(way_lowercase_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = way_lowercase_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

access                        7 no, customers, permissive, destination, private, delivery, yes
admin_level                   2 8, 6
aeroway                       6 helipad, terminal, taxiway, runway, hangar, apron
alt_name                     15 Walmart Supercenter New Orleans, Alcee Fortier Boulevard, Highway 433, Audubon B
amenity                      48 public_building, nursing_home, clubhouse, veterinary, college, car_wash, fuel, f
architect                     4 Edward Durell Stone, Emile Weil, Charles Moore, DMJM
area                          2 no, yes
artist_name                   1 Multiple
artwork_type                  2 sculpture, statue
atm                           1 yes
attribution                   1 NHD
backrest                      1 yes
barrier                       9 retaining_wall, fence, wall, guard_rail, wire_fence, dyke, gate, kerb, hedge
bicycle                       4 no, designated, permissive, yes
boat                          2 no, yes
border_type            

In [30]:
sorted_keys = list(node_lowercase_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = node_lowercase_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

access                        5 no, customers, permissive, private, yes
aeroway                       3 helipad, holding_position, aerodrome
alt_name                     19 Chipotle Mexican Grill, BCBG, Jesuit Church, Tastee Donuts, UPT, Sbisa's Cafe, F
amenity                      64 public_building, vending_machine, veterinary, parking_entrance, car_wash, ice_cr
artist                        1 Clark Mills
artist_name                  14 Rebecca Pons, Madeleine Faust, Terry Weldin, Paul Perret, Robert Schoen, Franco 
artwork_type                  4 sculpture, mural, statue, painting
atm                           2 no, yes
attribution                   1 USGS 2001 County Boundary
automated                     1 yes
backrest                      1 yes
barrier                       8 yes, guard_rail, block, bollard, swing_gate, gate, entrance, lift_gate
beauty                        3 hair_removal, spa, nails
bench                         2 no, yes
bicycle                       2 no, yes

> Nothing seems out of place there.  

### Lowercase keys with colons

Finally, we want to look at the ones with colons.

In [31]:
sorted_keys = list(way_lowercase_colon_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = way_lowercase_colon_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

abandoned:railway             1 rail
abandoned:tourism             1 theme_park
addr:city                    16 Metairie, Arabi, Jefferson, Avondale, Harahan, Belle Chasse, New Oreleans, New O
addr:country                  2 UA, US
addr:full                     2 6000 Bullard Ave, 99 Westbank Expy
addr:housename                2 Arabella Station, Bagneris Home
addr:housenumber           8908 4042, 8534, 7246, 7039, 1224, 6525, 5103, 6670, 947, 5401, 1052, 3740, 576, 7441
addr:postcode                30 70114, 70118, 70001, 70115, 70124, 70053, 70037, 70126, 70131, 70032, 70116, LA 
addr:state                    4 LA, la, lA, Louisiana
addr:street                1998 South Genois Street, Lemans Street, Cartier Avenue, West End Boulevard, Baudin S
addr:unit                     1 115
amenity:bar                   1 yes
amenity:historic              1 hospital
brand:wikidata               32 Q735942, Q483551, Q2645636, Q2735242, Q550258, Q5583655, Q1141226, Q7500392, Q20
brand:wikipedia   

In [28]:
sorted_keys = list(node_lowercase_colon_tags.keys())
sorted_keys.sort()
for item in sorted_keys:
    value_set = node_lowercase_colon_tags[item]
    print("{0: <20} {1: >10} {2:}".format(item, len(value_set), ", ".join(list(value_set))[:80] ) )

abandoned:man_made            1 lighthouse
addr:city                    17 Metairie, Terrytown, Marrero, Meraux, Harahan, Belle Chasse, New Oreleans, Chalm
addr:country                  1 US
addr:full                     6 4810 Lapalco Blvd, 2500 Archbishop Hannan Blvd, 4001 Behrman Pl, 4301 Chef Mente
addr:housename                9 Aframe, Chase Bank, Malbrough Notary, Subway, New Orleans City Park Volunteer Ce
addr:housenumber           6685 8534, 4042, 11971, 7039, 1224, 6525, 5103, 15534, 6670, 947, 5401, 1052, 3740, 7
addr:interpolation            2 200, 600
addr:place                    1 Suite 500
addr:postcode                37 70114, 70118, 70001, 70115, 70124, 70053, 70037, 70130-3890, 70126, 70116, 70075
addr:state                    3 La, LA, Louisiana
addr:street                1294 38th Street, Annette Street, North Lemans Street, Elgin Street, South Genois Str
addr:unit                    31 110, Suite B, #1825, 119, 107, Suite 1, #100, 109, 111, Suite A, Suite A-2, G, 

### Auditing Street Names

> Now I want to look at street names to make sure they are consistent
