# Parse and clean OpenStreetMap data

### Identify, archive, document data file:

Data File: New Orleans [MapZen Extract](https://mapzen.com/data/metro-extracts/metro/new-orleans_louisiana/)

### Inspect and document general structure:

The essential task is to parse the data file and clean it. So what am I parsing?

First, looking at the documentation, the gist that I get is:
* node is a lon/lat point, possibly with 'tag' child elements
* way has nodes as children (as nd elements) and defines things like roads, buildings, etc. It may also have 'tag' children
* relation has nested 'member' elements which refer to ways and nodes - defines relationships among elements
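
To make the three element types concrete, here is a minimal sketch that parses a tiny hand-written OSM fragment. The ids, coordinates, tag values, and the `describe` helper are all made up for illustration; none of them come from the New Orleans extract.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; ids, coordinates and tag values are illustrative only.
SAMPLE_OSM = """<osm>
  <node id="1" lat="29.95" lon="-90.07"/>
  <node id="2" lat="29.96" lon="-90.08">
    <tag k="natural" v="tree"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="waterway" v="stream"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>"""

root = ET.fromstring(SAMPLE_OSM)

def describe(elem):
    """Summarize one top-level element: its tags plus its nd/member references."""
    return {
        "type": elem.tag,
        "tags": {t.attrib["k"]: t.attrib["v"] for t in elem.findall("tag")},
        "node_refs": [nd.attrib["ref"] for nd in elem.findall("nd")],
        "members": [(m.attrib["type"], m.attrib["ref"]) for m in elem.findall("member")],
    }

summaries = [describe(e) for e in root]
for summary in summaries:
    print(summary)
```

Only the way carries `nd` references and only the relation carries `member` entries, while any of the three can hold `tag` children.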

Next, open the doc in vim to get acquainted. Why? Because I can and I want to. Visually I see (screenshots):
* standard opening:  <?xml version='1.0' encoding='UTF-8'?>
* root is an 'osm' tag
* doc is 16,082,009 lines long in total - wow!
* skipping down some pages, looks like a bunch of nodes with no internal element defined at top
* skipping down from middle of doc, looks like a bunch of 'way' tags of natural features (streams, other waterway)
* skipping up from the end of the doc, looks like a bunch of 'relations' of things like gardens and public transportation stations
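
The visual skim can be backed up with a quick count of the top-level elements. This is only a sketch: `count_top_level` is a hypothetical helper, and it is demonstrated on a tiny in-memory sample rather than the real extract, so the numbers it prints say nothing about the New Orleans file.

```python
import collections
import io
import xml.etree.ElementTree as ET

def count_top_level(osm_source):
    """Count node/way/relation elements by streaming, without loading the whole file."""
    counts = collections.Counter()
    context = ET.iterparse(osm_source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root <osm> element
    for event, elem in context:
        if event == "end" and elem.tag in ("node", "way", "relation"):
            counts[elem.tag] += 1
            root.clear()  # drop finished elements so memory use stays flat
    return counts

# Illustrative sample only; a file path can be passed in place of the BytesIO object.
SAMPLE = (b"<osm><node id='1'/><node id='2'><tag k='a' v='b'/></node>"
          b"<way id='3'><nd ref='1'/></way><relation id='4'/></osm>")
counts = count_top_level(io.BytesIO(SAMPLE))
print(counts)
```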

### Speculate on what to work with specifically:

So what should I clean here? 'Tag' elements seem to be an important part where details about map features are stored. This should also be where a lot of user-generated/added content is stored.

In the middle, what is NHD? Strikes me as interesting. Googling it: ah, the US Geological Survey's National Hydrography Dataset (Watershed Boundary Dataset). The data should be damn near perfect. Simply put, each feature should be a way with node child element references and tags holding all relevant info (so a poly boundary and some metadata). These are mostly 'natural' features - auditing them sounds interesting.

First, the NHD data shouldn't really have many errors, given that it was taken from the NHD and uploaded to OpenStreetMap in bulk (as opposed to the piecemeal, user-generated edits that make up most OpenStreetMap contributions). It's still worth checking a few things to be sure. Some inspection is in order:

### Inspect/audit/plan out cleaning methodology:

First, what keys exist in natural feature tag elements as a whole?

In [None]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def count_elem(osm_file):
    """Collect every tag key used on elements that carry a 'natural' tag."""
    tag_set = set()

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)  # grab the root so finished elements can be cleared

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            # read the keys once instead of re-looping under the same variable name
            keys = [tag_elem.attrib['k'] for tag_elem in elem.iter("tag")]
            if 'natural' in keys:
                tag_set.update(keys)
            root.clear()  # free memory once a top-level element is complete
    return tag_set

keys_list = list(count_elem(OSM_FILE))
print(keys_list)

Hard to tell visually - sort it:

In [None]:
sorted(keys_list)

Some of the k values like 'golf' seem pretty arbitrary, but they may have a purpose. Others, like NHD:FCode, seem redundant with FCODE. Maybe the US Geological Survey changed its naming convention at one point. There may be something to clean here... I also see a 'fixme' key which may be something. Worth noting for now, but what will really help is to create a dictionary of the features and their attribute contents.
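
One way to spot keys that look redundant, sketched under the assumption that duplicates differ only by letter case or by an `NHD:` prefix. `group_similar_keys` is a hypothetical helper, and the sample list is hand-picked from keys mentioned in these notes; it is not the full output of the cell above.

```python
import collections

def group_similar_keys(keys, prefix="NHD:"):
    """Group keys that match after dropping an optional prefix and ignoring case."""
    groups = collections.defaultdict(list)
    for key in keys:
        base = key[len(prefix):] if key.startswith(prefix) else key
        groups[base.lower()].append(key)
    # keep only the groups where more than one spelling exists
    return {base: sorted(names) for base, names in groups.items() if len(names) > 1}

# Hand-picked sample of keys mentioned in the notes, for illustration only.
sample_keys = ["FCODE", "NHD:FCode", "natural", "golf", "fixme", "FIXME", "NHD:ComID"]
similar = group_similar_keys(sample_keys)
print(similar)
```

A match here is only a hint that two keys might mean the same thing; the values still need to be compared by hand before merging anything.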

Some notes:
* FCODE is used just once - uses '46600' as its contents. Jumping down to NHD:FCode I see more contents and again, the '46600'. I'm betting FCODE and NHD:FCode are the same. Looking at the NHD poster I find that 46600 is for "Swamp/Marsh". The different keys here are definitely trying to achieve the same thing. They can be merged. Also, I'm not worried about a key having multiple FCodes - for example 33600;46600 is a Canal/Ditch, but it's also a Swamp/Marsh.
* The next redundant tag key is FDATE. 'FDATE' has 'Mon May 23 00:00:00 CEST 2011' as its contents. Looks like a timestamp, so I want to check that other elements with an FDATE tag carry different timestamps. I could do this in code, but there's a faster way, since all I need is to see a few others: in Vim, /search for 'FDATE'. Sure enough, found on line 13,356,160. Vim's n finds no other instances! What about NHD:Fdate? Yes, all kinds of others, but their dates are simpler - for example, "2005/12/05". Fair enough to say that the rogue element's date can be changed to 2011/05/23.
* Now thinking, there only seems to be a single rogue elem. Inspect it to make sure it's not wildly different. Nothing stands out - /search for the user Aleks-Berlin. He also seems to have edited a few others - some tags called "FIXME", which seem a little haphazard. I'm wondering who this person is and whether the data was entered haphazardly. Check the element on a map: /search for one of the nodes, then enter its lat/lon coordinates on the map: 29.3557244, -90.0538116. Looks like a valid natural feature, so the way elem should stay in place.
* Next, FTYPE. 466 corresponds to the Feature Type in the PDF.
* 'natural' all good
* 'PERMANENT_' is redundant. Has a strange value: {A0AFF249-A7D2-44F4-AD8A-0A4A68F99450}
    * Search for NHD:Permanent_. They all have much simpler values like "151098380". What is it? In the PDF there is a 'Permanent_Identifier' feature that is a 40-character string, but all the ones I find are 9 characters. Not enough info here, so I'm leaving this one alone. Could this elem be flagged somehow?
* RESOLUTION is '2'. From the NHD documentation, this should be set to "High" (Code of source resolution: 1=Local resolution, 2=High resolution, 3=Medium resolution.)
* SHAPE_AREA and SHAPE_LENG exist in PDF guide as 'Shape_Area' and 'Shape_Length' but have no guidelines and are not used in any other elements in the XML file. Again, wondering if best to just remove this element!
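
The notes above can be sketched as a small cleaning step that works on a plain dict of tag keys and values. This is an assumption-heavy sketch, not the final cleaning code: the helper names are hypothetical, the target spellings 'NHD:FCode' and 'NHD:Fdate' are taken from the notes and should be checked against the file, the `natural` value in the sample is invented, and the `%b` month parsing assumes an English locale.

```python
from datetime import datetime

# Rename map for the one-off keys; target spellings follow the notes above (unverified).
KEY_RENAMES = {"FCODE": "NHD:FCode", "FDATE": "NHD:Fdate"}
# Resolution codes as quoted from the NHD documentation.
RESOLUTION_CODES = {"1": "Local", "2": "High", "3": "Medium"}
# Only the two feature codes looked up so far.
FCODE_NAMES = {"33600": "Canal/Ditch", "46600": "Swamp/Marsh"}

def normalize_fdate(value):
    """Turn 'Mon May 23 00:00:00 CEST 2011' into '2011/05/23'; leave other formats alone."""
    parts = value.split()
    if len(parts) != 6:
        return value
    try:
        # skip the weekday, time and timezone; only month, day and year are needed
        parsed = datetime.strptime(" ".join((parts[1], parts[2], parts[5])), "%b %d %Y")
    except ValueError:
        return value
    return parsed.strftime("%Y/%m/%d")

def split_fcodes(value):
    """Split a multi-valued FCode such as '33600;46600' and look up each name."""
    return [FCODE_NAMES.get(code, code) for code in value.split(";")]

def clean_tags(tags):
    """Return a cleaned copy of one element's tags."""
    cleaned = {}
    for key, value in tags.items():
        if key == "FDATE":
            value = normalize_fdate(value)
        elif key == "RESOLUTION":
            value = RESOLUTION_CODES.get(value, value)
        cleaned[KEY_RENAMES.get(key, key)] = value
    return cleaned

# Tags modelled on the rogue element; the 'natural' value is made up for illustration.
rogue = {"FCODE": "46600", "FDATE": "Mon May 23 00:00:00 CEST 2011",
         "RESOLUTION": "2", "natural": "wetland"}
cleaned = clean_tags(rogue)
print(cleaned)
print(split_fcodes("33600;46600"))
```

PERMANENT_, SHAPE_AREA and SHAPE_LENG are passed through untouched here, since the notes leave open what to do with them.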

    
Now onto auditing some of the other keys. For this it's best to build the dictionary and see what kind of information they hold:

In [2]:
import xml.etree.ElementTree as ET
import collections as col
import pprint

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def make_elem_dict(osm_file):
    """Map each tag key to the set of values it takes on 'natural' elements."""
    elem_dict = col.defaultdict(set)

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)  # keep the root so processed elements can be cleared

    for event, elem in context:
        # only completed top-level elements are of interest
        if event != 'end' or elem.tag not in ('node', 'way', 'relation'):
            continue
        # read key/value pairs once so the loop variable is not overwritten
        pairs = [(t.attrib['k'], t.attrib['v']) for t in elem.iter("tag")]
        if any(k == 'natural' for k, v in pairs):
            for k, v in pairs:
                elem_dict[k].add(v)
        root.clear()  # free memory once the element has been processed
    return elem_dict

pprint.pprint(make_elem_dict(OSM_FILE))

defaultdict(<class 'set'>,
            {'FCODE': {'46600'},
             'FDATE': {'Mon May 23 00:00:00 CEST 2011'},
             'FTYPE': {'466'},
             'NHD:ComID': {'121376748',
                           '121376910',
                           '121379126',
                           '131339718',
                           '131339719',
                           '131339720',
                           '131339721',
                           '137031874',
                           '137032340',
                           '139181529',
                           '139181530',
                           '139181531',
                           '139181532',
                           '139181534',
                           '139181538',
                           '139181539',
                           '139181542',
                           '139181544',
                           '139181546',
                           '139181548',
                           '139181549',
            

                           '139183331',
                           '139183332',
                           '139183333',
                           '139183334',
                           '139183335',
                           '139183336',
                           '139183337',
                           '139183339',
                           '139183340',
                           '139183341',
                           '139183342',
                           '139183343',
                           '139183344',
                           '139183345',
                           '139183346',
                           '139183347',
                           '139183348',
                           '139183349',
                           '139183350',
                           '139183351',
                           '139183353',
                           '139183354',
                           '139183355',
                           '139183356',
                           '139183357',


                           '139185365',
                           '139185367',
                           '139185368',
                           '139185369',
                           '139185370',
                           '139185372',
                           '139185373',
                           '139185374',
                           '139185375',
                           '139185376',
                           '139185377',
                           '139185380',
                           '139185382',
                           '139185383',
                           '139185384',
                           '139185385',
                           '139185386',
                           '139185387',
                           '139185388',
                           '139185389',
                           '139185390',
                           '139185391',
                           '139185392',
                           '139185393',
                           '139185394',


                           '139188628',
                           '139188629',
                           '139188630',
                           '139188631',
                           '139188632',
                           '139188633',
                           '139188634',
                           '139188635',
                           '139188637',
                           '139188638',
                           '139188639',
                           '139188640',
                           '139188641',
                           '139188642',
                           '139188644',
                           '139188645',
                           '139188646',
                           '139188647',
                           '139188648',
                           '139188649',
                           '139188650',
                           '139188651',
                           '139188652',
                           '139188653',
                           '139188654',


                           '139191124',
                           '139191125',
                           '139191126',
                           '139191127',
                           '139191128',
                           '139191129',
                           '139191130',
                           '139191131',
                           '139191132',
                           '139191133',
                           '139191134',
                           '139191135',
                           '139191141',
                           '139191142',
                           '139191145',
                           '139191146',
                           '139191147',
                           '139191153',
                           '139191154',
                           '139191155',
                           '139191156',
                           '139191157',
                           '139191158',
                           '139191162',
                           '139191163',


                           '139193802',
                           '139193805',
                           '139193806',
                           '139193807',
                           '139193808',
                           '139193809',
                           '139193810',
                           '139193811',
                           '139193812',
                           '139193813',
                           '139193816',
                           '139193817',
                           '139193818',
                           '139193821',
                           '139193822',
                           '139193823',
                           '139193824',
                           '139193825',
                           '139193826',
                           '139193827',
                           '139193828',
                           '139193829',
                           '139193830',
                           '139193831',
                           '139193832',


                           '139196240',
                           '139196241',
                           '139196242',
                           '139196244',
                           '139196247',
                           '139196249',
                           '139196250',
                           '139196251',
                           '139196252',
                           '139196253',
                           '139196254',
                           '139196255',
                           '139196256',
                           '139196257',
                           '139196258',
                           '139196259',
                           '139196260',
                           '139196261',
                           '139196262',
                           '139196263',
                           '139196264',
                           '139196265',
                           '139196266',
                           '139196269',
                           '139196270',


                           '139199022',
                           '139199024',
                           '139199025',
                           '139199026',
                           '139199029',
                           '139199033',
                           '139199034',
                           '139199036',
                           '139199039',
                           '139199041',
                           '139199042',
                           '139199043',
                           '139199044',
                           '139199045',
                           '139199046',
                           '139199049',
                           '139199051',
                           '139199052',
                           '139199053',
                           '139199055',
                           '139199059',
                           '139199060',
                           '139199061',
                           '139199062',
                           '139199063',


                           '139202389',
                           '139202390',
                           '139202391',
                           '139202392',
                           '139202393',
                           '139202394',
                           '139202395',
                           '139202396',
                           '139202397',
                           '139202398',
                           '139202401',
                           '139202402',
                           '139202403',
                           '139202404',
                           '139202405',
                           '139202406',
                           '139202407',
                           '139202408',
                           '139202409',
                           '139202410',
                           '139202411',
                           '139202413',
                           '139202414',
                           '139202415',
                           '139202416',


                           '139204903',
                           '139204904',
                           '139204905',
                           '139204906',
                           '139204907',
                           '139204908',
                           '139204910',
                           '139204911',
                           '139204912',
                           '139204913',
                           '139204914',
                           '139204915',
                           '139204916',
                           '139204917',
                           '139204918',
                           '139204919',
                           '139204920',
                           '139204921',
                           '139204922',
                           '139204923',
                           '139204925',
                           '139204926',
                           '139204927',
                           '139204929',
                           '139204930',


                           '139207146',
                           '139207147',
                           '139207148',
                           '139207150',
                           '139207152',
                           '139207154',
                           '139207157',
                           '139207158',
                           '139207159',
                           '139207160',
                           '139207161',
                           '139207163',
                           '139207164',
                           '139207165',
                           '139207166',
                           '139207167',
                           '139207168',
                           '139207169',
                           '139207171',
                           '139207172',
                           '139207175',
                           '139207177',
                           '139207178',
                           '139207180',
                           '139207181',


                           '141719667',
                           '141719669',
                           '141719673',
                           '141719674',
                           '141719675',
                           '141719679',
                           '141719683',
                           '141719685',
                           '141719686',
                           '141719687',
                           '141719688',
                           '141719689',
                           '141719691',
                           '141719693',
                           '141719695',
                           '141719697',
                           '141719698',
                           '141719702',
                           '141719703',
                           '141719706',
                           '141719707',
                           '141719708',
                           '141719711',
                           '141719713',
                           '141719714',


                           '141723850',
                           '141723853',
                           '141723854',
                           '141723855',
                           '141723856',
                           '141723858',
                           '141723860',
                           '141723862',
                           '141723863',
                           '141723864',
                           '141723869',
                           '141723871',
                           '141723872',
                           '141723873',
                           '141723874',
                           '141723877',
                           '141723880',
                           '141723884',
                           '141723891',
                           '141723893',
                           '141723894',
                           '141723895',
                           '141723896',
                           '141723897',
                           '141723899',


                           '143842483',
                           '143842487',
                           '143842488',
                           '143842491',
                           '143842492',
                           '143842496',
                           '143842497',
                           '143842505',
                           '143842508',
                           '143842509',
                           '143842516',
                           '143842517',
                           '143842522',
                           '143842526',
                           '143842540',
                           '143842544',
                           '143842551',
                           '143842557',
                           '143842585',
                           '143842594',
                           '143842615',
                           '143842619',
                           '143842628',
                           '143842643',
                           '143842662',


                           '143856967',
                           '143856987',
                           '143856999',
                           '143857003',
                           '143857004',
                           '143857005',
                           '143857007',
                           '143857016',
                           '143857017',
                           '143857018',
                           '143857022',
                           '143857023',
                           '143857030',
                           '143857033',
                           '143857035',
                           '143857038',
                           '143857049',
                           '143857071',
                           '143857072',
                           '143857082',
                           '143857086',
                           '143857088',
                           '143857091',
                           '143857102',
                           '143857110',


                           '148749933',
                           '148749934',
                           '148749935',
                           '148749937',
                           '148749938',
                           '148749939',
                           '148749940',
                           '148749941',
                           '148749942',
                           '148749944',
                           '148749945',
                           '148749946',
                           '148749947',
                           '148749948',
                           '148749949',
                           '148749950',
                           '148749951',
                           '148749952',
                           '148749953',
                           '148749954',
                           '148749955',
                           '148749956',
                           '148749957',
                           '148749958',
                           '148749959',


                           '151097877',
                           '151097878',
                           '151097879',
                           '151097880',
                           '151097881',
                           '151097882',
                           '151097883',
                           '151097884',
                           '151097885',
                           '151097886',
                           '151097887',
                           '151097888',
                           '151097889',
                           '151097890',
                           '151097891',
                           '151097892',
                           '151097893',
                           '151097894',
                           '151097895',
                           '151097896',
                           '151097897',
                           '151097898',
                           '151097899',
                           '151097900',
                           '151097901',


                                '151098008',
                                '151098009',
                                '151098010',
                                '151098011',
                                '151098012',
                                '151098015',
                                '151098016',
                                '151098017',
                                '151098018',
                                '151098019',
                                '151098020',
                                '151098022',
                                '151098024',
                                '151098025',
                                '151098026',
                                '151098027',
                                '151098028',
                                '151098029',
                                '151098030',
                                '151098031',
                                '151098032',
                                '151098033',
          

                               '08090201007805',
                               '08090201007806',
                               '08090201007807',
                               '08090201007808',
                               '08090201007809',
                               '08090201007810',
                               '08090201007812',
                               '08090201007813',
                               '08090201007814',
                               '08090201007815',
                               '08090201007816',
                               '08090201007818',
                               '08090201007819',
                               '08090201007820',
                               '08090201007821',
                               '08090201007822',
                               '08090201007823',
                               '08090201007824',
                               '08090201007825',
                               '08090201007826',
                    

                               '08090203031721',
                               '08090203031722',
                               '08090203031724',
                               '08090203031726',
                               '08090203031727',
                               '08090203031728',
                               '08090203031729',
                               '08090203031730',
                               '08090203031731',
                               '08090203031732',
                               '08090203031733',
                               '08090203031734',
                               '08090203031735',
                               '08090203031736',
                               '08090203031737',
                               '08090203031738',
                               '08090203031739',
                               '08090203031740',
                               '08090203031741',
                               '08090203031742',
                    

                               '08090203034074',
                               '08090203034075',
                               '08090203034076',
                               '08090203034077',
                               '08090203034078',
                               '08090203034079',
                               '08090203034081',
                               '08090203034082',
                               '08090203034083',
                               '08090203034084',
                               '08090203034085',
                               '08090203034087',
                               '08090203034088',
                               '08090203034089',
                               '08090203034091',
                               '08090203034094',
                               '08090203034095',
                               '08090203034096',
                               '08090203034097',
                               '08090203034098',
                    

                               '08090203036688',
                               '08090203036689',
                               '08090203036690',
                               '08090203036691',
                               '08090203036692',
                               '08090203036693',
                               '08090203036694',
                               '08090203036699',
                               '08090203036700',
                               '08090203036701',
                               '08090203036702',
                               '08090203036703',
                               '08090203036705',
                               '08090203036707',
                               '08090203036708',
                               '08090203036709',
                               '08090203036711',
                               '08090203036712',
                               '08090203036714',
                               '08090203036716',
                    

                               '08090203040016',
                               '08090203040019',
                               '08090203040020',
                               '08090203040021',
                               '08090203040022',
                               '08090203040023',
                               '08090203040024',
                               '08090203040025',
                               '08090203040026',
                               '08090203040027',
                               '08090203040028',
                               '08090203040029',
                               '08090203040030',
                               '08090203040031',
                               '08090203040032',
                               '08090203040033',
                               '08090203040035',
                               '08090203040036',
                               '08090203040037',
                               '08090203040038',
                    

                               '08090203043508',
                               '08090203043509',
                               '08090203043511',
                               '08090203043512',
                               '08090203043513',
                               '08090203043514',
                               '08090203043515',
                               '08090203043516',
                               '08090203043517',
                               '08090203043518',
                               '08090203043519',
                               '08090203043520',
                               '08090203043521',
                               '08090203043522',
                               '08090203043523',
                               '08090203043524',
                               '08090203043525',
                               '08090203043526',
                               '08090203043528',
                               '08090203043530',
                    

                               '08090203046988',
                               '08090203046989',
                               '08090203046991',
                               '08090203046992',
                               '08090203046993',
                               '08090203046996',
                               '08090203046997',
                               '08090203046998',
                               '08090203047000',
                               '08090203047001',
                               '08090203047002',
                               '08090203047006',
                               '08090203047007',
                               '08090203047008',
                               '08090203047010',
                               '08090203047011',
                               '08090203047012',
                               '08090203047013',
                               '08090203047014',
                               '08090203047016',
                    

                               '08090301036165',
                               '08090301036171',
                               '08090301036173',
                               '08090301036175',
                               '08090301036177',
                               '08090301036178',
                               '08090301036179',
                               '08090301036180',
                               '08090301036181',
                               '08090301036183',
                               '08090301036184',
                               '08090301036185',
                               '08090301036186',
                               '08090301036187',
                               '08090301036188',
                               '08090301036189',
                               '08090301036190',
                               '08090301036192',
                               '08090301036193',
                               '08090301036194',
                    

                              '02/01/1995',
                              '03/01/1994',
                              '05/01/1994',
                              '06/01/1990',
                              '06/01/1992',
                              '06/01/1993',
                              '06/04/1980',
                              '09/24/1980',
                              '12/01/2003'},
             'gnis:feature_id': {'00532088',
                                 '00532383',
                                 '00532593',
                                 '00532601',
                                 '00532670',
                                 '00532743',
                                 '00532757',
                                 '00532782',
                                 '00532783',
                                 '00532800',
                                 '00532943',
                                 '00533262',
                                 '00533609',
                  

Some notes on checking the other fields starting at the end:
* 'wikipedia' - not going to touch; connects an elem to a Wikipedia article
* 'wikidata' - connects to Wikimedia Commons. Googling 'wikidata' and one of the values from the dict gives a link to an [article](https://commons.wikimedia.org/wiki/File:Bayou_St_John_by_Spanish_Fort_2009.jpg) - not touching this
* Scanning through the others doesn't throw any flags (trying to keep the scope of the investigation targeted). However:
* will fix capitalization in 'water' tag
* in 'source' - change any 'bing' to 'Bing', 'landsat' to 'Landsat'
* Look at all 'note' tags. There are many non-'natural' elems with notes as well, so I'll have to write a script to pull all 'natural' elems that also have a 'note' tag elem:

In [None]:
import xml.etree.ElementTree as ET

OSM_FILE = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"

def find_note_tag(osm_file):
    """Dump every node, way, or relation that has both a 'natural' and a 'note' tag"""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in ('node', 'way', 'relation'):
            # collect all tag keys once instead of nesting two loops over the same children
            keys = {tag_elem.attrib['k'] for tag_elem in elem.iter("tag")}
            if 'natural' in keys and 'note' in keys:
                # ET.dump() writes to stdout itself and returns None, so no print() around it
                ET.dump(elem)
        root.clear()

find_note_tag(OSM_FILE)

What I'm learning: (add PDF descriptions for each field when introducing them)
* The comment "I have altered the natural:coatline tag as this way duplicates existing coastline ways" tells me that I shouldn't mess with either the 'coastline' or '_coastline' keys, as the change is purposeful. The other notes don't indicate any other action needed.

Back to dict:
* 'name' - fix capitalization on lowercase words. There are enough that I can see they all need it. 'xxx' seems like an error but I'm not going to change it...
    * discard 'yes' in node 4506654389
* 'fixme' - not touching. Seems like a way for people to know what needs to be changed due to construction, etc. ("Needs survey", "tempoary way, whilst coastline is sorted out", etc.) Still, slightly problematic, as some just state "temporary fix". Not doing anything with this.
* NHD:ReachCode - "Unique identifier composed of two parts, first eight digits = subbasin code as defined by FIPS 103, and next six digits = random-assigned sequential number unique within a Cataloguing Unit."
    * issues that I see but can't fix:
        * It is supposed to be a unique identifier, but some elements have more than one. Manually inspecting one such element (way ID 43393226) doesn't reveal why, but I noticed it also has a duplicate NHD:ComID.
* NHD:FDate has one or more instances with two dates. Nothing I can do though.
* NHD:FCode - ok to have more than one.
* NHD:ComID - has more than one ID for some, but more importantly the PDF indicates "ComID field deleted from all feature classes/tables" in the Model Changes section. Should this data still be here? This is the most updated model documentation from August 2016. Waiting on email sent to NHD...
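
A minimal sketch of how the ReachCode description above could be turned into a check: 14 digits, split into an 8-digit subbasin code and a 6-digit sequence number. The function names are made up for illustration, and the `;` separator for multi-valued fields is an assumption (it is the usual OSM convention, but I haven't confirmed it is how the duplicates are stored here).

```python
import re

# 14 digits: 8-digit subbasin code followed by a 6-digit sequence number
REACH_CODE = re.compile(r'^(\d{8})(\d{6})$')


def split_reach_codes(value, sep=';'):
    """Split a tag value into (subbasin, sequence) pairs; None marks a malformed code.

    The ';' separator is an assumption, not something confirmed from the data.
    """
    parts = [p.strip() for p in value.split(sep) if p.strip()]
    result = []
    for part in parts:
        match = REACH_CODE.match(part)
        result.append(match.groups() if match else None)
    return result


def has_multiple_values(value, sep=';'):
    """True when a supposedly unique field carries more than one value."""
    return len([p for p in value.split(sep) if p.strip()]) > 1


print(split_reach_codes('08090203040037'))
print(has_multiple_values('08090203040037;08090203040038'))
```

This only flags suspicious values for manual review; it doesn't try to decide which of the duplicates is the right one.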

### Build basic parser/CSV compiler:

In [1]:
import csv
import codecs
import pprint
import re
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import inspect as ins

import cerberus

from supporting_files import test_schema
from supporting_files import fix_dict as fd

OSM_PATH = "/Users/mchana/GitHub/udacity/large_files/new-orleans_region.osm"
# OSM_PATH= "supporting_files/new-orleans_region_sample_k1000.osm"
# OSM_PATH= "supporting_files/new-orleans_region_sample_k100.osm"
# OSM_PATH= "supporting_files/new-orleans_region_sample_k10.osm"

NODES_PATH = "supporting_files/exports/nodes.csv"
NODES_TAGS_PATH = "supporting_files/exports/nodes_tags.csv"
WAYS_PATH = "supporting_files/exports/ways.csv"
WAYS_NODES_PATH = "supporting_files/exports/ways_nodes.csv"
WAYS_TAGS_PATH = "supporting_files/exports/ways_tags.csv"
RELATIONS_PATH = "supporting_files/exports/relations.csv"
RELATIONS_MEMBERS_PATH = "supporting_files/exports/relations_members.csv"
RELATIONS_TAGS_PATH = "supporting_files/exports/relations_tags.csv"


LOWER_UPPER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+', re.IGNORECASE)
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

SCHEMA = test_schema.schema

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODES_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODES_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAYS_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAYS_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAYS_NODES_FIELDS = ['id', 'node_id', 'position']
RELATIONS_FIELDS = ['id', 'user', 'uid', 'version', 'timestamp', 'changeset']
RELATIONS_TAGS_FIELDS = ['id', 'key', 'value', 'type']
RELATIONS_MEMBERS_FIELDS = ['id', 'mem_id','type', 'role', 'position']


# node_attr_fields and way_attr_fields are not used in the body below
def shape_element(element, node_attr_fields=NODES_FIELDS, way_attr_fields=WAYS_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node, way, or relation XML element to Python dict"""

    # holds dicts of 'tag' elements
    tags = []

    # creates dicts for 'tag' elements, skipping keys that contain problem characters
    for child in element:
        if child.tag != 'tag' or problem_chars.search(child.attrib['k']):
            continue
        tag_dict = {'id': element.attrib['id'],
                    'key': child.attrib['k'],
                    'value': child.attrib['v'],
                    'type': default_tag_type
                    }

        # a key like 'NHD:FCode' becomes type 'NHD' and key 'FCode'
        if LOWER_UPPER_COLON.search(tag_dict['key']):
            key_split = tag_dict['key'].split(':', 1)
            tag_dict['key'] = key_split[1]
            tag_dict['type'] = key_split[0]

        tags.append(tag_dict)

    if element.tag == 'node':
        node_attribs = {'id':int(element.attrib['id']),
                   'user':element.attrib['user'],
                   'uid':int(element.attrib['uid']),
                   'version':element.attrib['version'],
                   'lat':float(element.attrib['lat']),
                    'lon':float(element.attrib['lon']),
                    'timestamp':element.attrib['timestamp'],
                    'changeset':int(element.attrib['changeset'])
                   }
        return {'node': node_attribs, 'node_tags': tags}
    
    elif element.tag == 'way':
        way_attribs = {'id':int(element.attrib['id']),
                      'user':element.attrib['user'],
                      'uid':int(element.attrib['uid']),
                      'version':element.attrib['version'],
                      'timestamp':element.attrib['timestamp'],
                      'changeset':int(element.attrib['changeset'])
                      }
        
        # holds list of dicts for 'nd' elements
        way_nodes = []
        
        # counter to increment instances of 'nd' tags
        nd_counter = 0
        
        for child in element:
            if child.tag == 'nd':
                nd_dict = {'id':element.attrib['id'],
                          'node_id':int(child.attrib['ref']),
                          'position':nd_counter} 
                nd_counter += 1
                way_nodes.append(nd_dict)
        
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
    
    elif element.tag == 'relation':
        rel_attribs = {
            'id':int(element.attrib['id']),
            'user':element.attrib['user'],
            'uid':int(element.attrib['uid']),
            'version':element.attrib['version'],
            'timestamp':element.attrib['timestamp'],
            'changeset':int(element.attrib['changeset'])
        }
        
        rel_members = []
        
        mem_counter = 0
        
        for child in element:
            if child.tag == 'member':
                mem_dict = {
                    'id':element.attrib['id'],
                    'mem_id':int(child.attrib['ref']),
                    'type':child.attrib['type'],
                    'role':child.attrib['role'],
                    'position':mem_counter
                }
                mem_counter += 1
                rel_members.append(mem_dict)
        
        return {'relation': rel_attribs, 'relation_members': rel_members, 'relation_tags': tags}
    
# ================================================== #
#               Helper Functions                     #
# ================================================== #
def get_elements(osm_file, tags=('node', 'way', 'relation')):
    """Yield each element of the given types that has a 'natural' tag"""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            for tag_elem in elem.iter("tag"):
                if tag_elem.attrib['k'] == 'natural':
                    yield elem
                    break  # yield each element at most once
        root.clear()

# takes in shaped dict, validator object, schema
def validate_element(element, validator, schema=SCHEMA):
    """Raise Exception if element does not match schema"""
    if validator.validate(element, schema) is not True:
        # dict.iteritems() is Python 2 only; iter(...items()) works in both 2 and 3
        field, errors = next(iter(validator.errors.items()))
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)

        raise Exception(message_string.format(field, error_string))

# ================================================== #
#               Main Function                        #
# ================================================== #

# file_in=OSM file, validate=True or False
def process_map(file_in, validate):
    
    # with-open files in write mode
    with codecs.open(NODES_PATH, 'w') as nodes_file, \
        codecs.open(NODES_TAGS_PATH, 'w') as nodes_tags_file, \
        codecs.open(WAYS_PATH, 'w') as ways_file, \
        codecs.open(WAYS_NODES_PATH, 'w') as ways_nodes_file, \
        codecs.open(WAYS_TAGS_PATH, 'w') as ways_tags_file, \
        codecs.open(RELATIONS_PATH, 'w') as relations_file, \
        codecs.open(RELATIONS_MEMBERS_PATH, 'w') as relations_members_file, \
        codecs.open(RELATIONS_TAGS_PATH, 'w') as relations_tags_file:
        
        # create writer objects
        nodes_writer = csv.DictWriter(nodes_file, NODES_FIELDS)
        nodes_tags_writer = csv.DictWriter(nodes_tags_file, NODES_TAGS_FIELDS)
        ways_writer = csv.DictWriter(ways_file, WAYS_FIELDS)
        ways_nodes_writer = csv.DictWriter(ways_nodes_file, WAYS_NODES_FIELDS)
        ways_tags_writer = csv.DictWriter(ways_tags_file, WAYS_TAGS_FIELDS)
        relations_writer = csv.DictWriter(relations_file, RELATIONS_FIELDS)
        relations_members_writer = csv.DictWriter(relations_members_file, RELATIONS_MEMBERS_FIELDS)
        relations_tags_writer = csv.DictWriter(relations_tags_file, RELATIONS_TAGS_FIELDS)
        
        # write headers using field names specified in DictWriter constructor
        nodes_writer.writeheader()
        nodes_tags_writer.writeheader()
        ways_writer.writeheader()
        ways_nodes_writer.writeheader()
        ways_tags_writer.writeheader()
        relations_writer.writeheader()
        relations_members_writer.writeheader()
        relations_tags_writer.writeheader()

        # the Validator class object instantiated here is callable to normalize 
        # and/or validate any mapping against validation schema 
        validator = cerberus.Validator()

        # loop over generator obj from get_elements()
            # get_elements() takes OSM file and tags I'm interested in
        for element in get_elements(file_in, tags=('node', 'way', 'relation')):
            # shape_element() takes in the element from the iterator, outputs a dict
            el = shape_element(element)
#             pprint.pprint(el)

            # skip before cleaning if nothing came back
            if not el:
                continue

            # cleans data in dict
            el2 = fd.fix_dict(el)
            if validate is True:
                validate_element(el2, validator)
            # write each dict to appropriate writer obj
            if element.tag == 'node':
                nodes_writer.writerow(el2['node'])
                nodes_tags_writer.writerows(el2['node_tags'])
            elif element.tag == 'way':
                ways_writer.writerow(el2['way'])
                ways_nodes_writer.writerows(el2['way_nodes'])
                ways_tags_writer.writerows(el2['way_tags'])
            elif element.tag == 'relation':
                relations_writer.writerow(el2['relation'])
                relations_members_writer.writerows(el2['relation_members'])
                relations_tags_writer.writerows(el2['relation_tags'])

if __name__ == '__main__':
    process_map(OSM_PATH, validate=False)

# process_map called
    # with-open files to be written
    # create writer objects
    # write headers
    # create class validator object
    # iterate over each element from a generator created by get_elements()
        # create a dict from the element
        # clean the dict with fd.fix_dict()
        # validate dict against a schema using validate_element()
        # write it to csv using appropriate writer object
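
To make the key handling in `shape_element()` easier to follow, here is a small standalone sketch of just the colon-splitting step. It reuses the `LOWER_UPPER_COLON` pattern from the cell above; `split_key` is a hypothetical helper name that doesn't exist in the parser itself.

```python
import re

# same pattern as in the parser cell
LOWER_UPPER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+', re.IGNORECASE)


def split_key(key, default_type='regular'):
    """Return (type, key) the way shape_element() assigns them (hypothetical helper)."""
    if LOWER_UPPER_COLON.search(key):
        # split on the first colon only, so any later colons stay in the key
        tag_type, rest = key.split(':', 1)
        return tag_type, rest
    return default_type, key


print(split_key('NHD:FCode'))        # ('NHD', 'FCode')
print(split_key('natural'))          # ('regular', 'natural')
print(split_key('gnis:feature_id'))  # ('gnis', 'feature_id')
```

So a key without a colon keeps the default 'regular' type, and a prefixed key such as 'NHD:FCode' ends up with the prefix in the 'type' column.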

### Confirm basic parser/CSV compiler working properly:

In [None]:
# open each CSV and check manually - search in Vim in the large OSM file and compare
    # √nodes_tags.csv
    # √nodes.csv
    # √relations_members.csv
    # √relations_tags.csv
    # √relations.csv
    # √ways_nodes.csv
    # √ways_tags.csv
    # √ways.csv
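
Part of the manual check could probably be scripted. Below is a rough sketch, not what I actually ran: it compares a CSV header against the expected field list and flags rows with too many or too few cells. `check_csv` is a made-up helper, and the sample rows are invented so the snippet runs without the export files.

```python
import csv
import io

WAYS_NODES_FIELDS = ['id', 'node_id', 'position']


def check_csv(handle, expected_fields):
    """Return (row_count, problems) for an open CSV file handle."""
    reader = csv.DictReader(handle)
    problems = []
    if reader.fieldnames != expected_fields:
        problems.append('header mismatch: {0}'.format(reader.fieldnames))
    row_count = 0
    for row in reader:
        row_count += 1
        # DictReader stores surplus cells under the key None
        # and fills missing cells with the value None
        if None in row or None in row.values():
            problems.append('row {0} has the wrong number of cells'.format(row_count))
    return row_count, problems


# invented sample data standing in for supporting_files/exports/ways_nodes.csv
sample = io.StringIO("id,node_id,position\n1,100,0\n1,101,1\n")
print(check_csv(sample, WAYS_NODES_FIELDS))
```

For the real files, the handle would come from `open(WAYS_NODES_PATH)` and so on. This only checks the shape of the CSV; comparing values against the OSM file still needs the manual search.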

### Write cleaning scripts - insert into data.py:

In [None]:
#√ Fix wayID 321535489
    # No need to fix - it's inefficient to parse every line to clean a single element.
        # better approach is to do it manually with access to the actual database
#√ All NHD:FTYPE should be NHD:FType to conform to NHD data model - same with RESOLUTION, which should be Resolution
#√ Fix capitalization in tag elems with 'water' key
#√ Change 'bing' to 'Bing' and 'landsat' to 'Landsat' in tag elems with 'source' key
    # maybe regex compare based on capitalization - then fix to first letter capitalized?
#√ Fix tags with 'name' elem (capitalization of first letter of each word)
# discard node tag elem 4506654389
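
A minimal sketch of what two of these fixes might look like on a single tag dict. The real cleaning lives in `fd.fix_dict`, which isn't shown here, so `fix_source`, `fix_name`, and `fix_tag` are hypothetical names and the sample values are invented. The 'water' fix is left out.

```python
import re

# canonical spellings for imagery sources
SOURCE_NAMES = {'bing': 'Bing', 'landsat': 'Landsat'}
SOURCE_RE = re.compile(r'\b(bing|landsat)\b', re.IGNORECASE)


def fix_source(value):
    """Normalize 'bing'/'landsat' in any capitalization to 'Bing'/'Landsat'."""
    return SOURCE_RE.sub(lambda m: SOURCE_NAMES[m.group(1).lower()], value)


def fix_name(value):
    """Capitalize the first letter of each space-separated word, leaving the rest as-is."""
    # str.title() is avoided on purpose: it would turn "john's" into "John'S"
    return ' '.join(word[:1].upper() + word[1:] for word in value.split(' '))


def fix_tag(tag):
    """Return a cleaned copy of a tag dict shaped like the ones from shape_element()."""
    fixed = dict(tag)
    if tag['key'] == 'source':
        fixed['value'] = fix_source(tag['value'])
    elif tag['key'] == 'name':
        fixed['value'] = fix_name(tag['value'])
    return fixed


print(fix_tag({'id': '1', 'key': 'source', 'value': 'landsat;bing', 'type': 'regular'}))
print(fix_tag({'id': '2', 'key': 'name', 'value': 'bayou st. john', 'type': 'regular'}))
```

Returning a copy rather than mutating the input keeps the original dict available for comparison while testing the fixes.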

### Update schema - implement schema validation:

In [None]:
# decipher schema file and validation functionality
    # validator class object is created
    # dict and validator class object fed into validate_element()
        # validates and throws error (stops execution) if validation not True
        # else, execution continues and all items written to the csv

In [None]:
my_dict = {"one": 1, "two": 2}
my_dict

In [None]:
my_dict['one']