This notebook explores new ways to parse and process collections submitted to the National Digital Catalog of Geological and Geophysical Data (NDC). The end goal is to put all records into one or more ElasticSearch indexes to drive a new API and various interfaces that search across collections to find useful samples and other artifacts for research. We essentially have this now via the ScienceBase API, but the underlying data workflow is stagnant, difficult to manage, and very difficult to change to allow for more heterogeneity in the underlying metadata and workflows.

I'm pursuing a concept of operations that will run as a set of microservices to process various kinds of collections into a common format with variable properties. To split up the work, I am focusing on getting each collection type to a common but varying GeoJSON data structure. I will then cache those data files back on the ScienceBase Items at the collection level and then slurp them up and process into ElasticSearch. In some of our other project work, we are running these types of files into their own ES indexes with a common prefix to support wildcard searches across collections. This will probably be a reasonable approach here as well, and we can take advantage of an established load mechanism based on a message queue and set of microservices.
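
As a minimal sketch of the "common prefix" idea, the snippet below derives one index name per collection from its ScienceBase item URL and builds the matching wildcard pattern. The `ndc_` prefix and both function names are hypothetical, chosen only for illustration; the naming used in the other project work is not shown here.

```python
import re

INDEX_PREFIX = "ndc_"  # hypothetical prefix, not an established convention


def index_name_for_collection(collection_url, prefix=INDEX_PREFIX):
    """Derive a lowercase index name from the last path segment of an item URL."""
    item_id = collection_url.rstrip("/").rsplit("/", 1)[-1].lower()
    # Keep the sketch strict: only plain alphanumeric identifiers are accepted.
    if not re.fullmatch(r"[a-z0-9]+", item_id):
        raise ValueError(f"Unexpected item identifier: {item_id!r}")
    return f"{prefix}{item_id}"


def wildcard_pattern(prefix=INDEX_PREFIX):
    """Pattern that would match every per-collection index sharing the prefix."""
    return f"{prefix}*"


name = index_name_for_collection(
    "https://www.sciencebase.gov/catalog/item/586fc0c3e4b01a71ba0bc9bc"
)
print(name, wildcard_pattern())
```

A search across all collections would then target the wildcard pattern, while a single collection could still be queried, rebuilt, or dropped on its own.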

In [1]:
import requests
from IPython.display import display
import json
import folium
from folium.plugins import MarkerCluster

from pynggdpp import collection as ndcCollection
from pynggdpp import processing as ndcProcessing

# pynggdpp package
I've started work on a Python package to house all of the core logic for this process. All of the data processing pipeline stuff will be incorporated there, but I'm also building in utility functions that should be useful independently for interacting with the NDC.

The following is a primary function to retrieve collection items from the NDC. The default configuration gets all collections from the top level based on the ndc_collection vocabulary term I applied in previous work. It can also be configured to use a different organization item as its starting point to do something like retrieve all collections for a given State Geological Survey. This will also serve as the basis for REST API routes.

In [2]:
ndc_collections = ndcCollection.ndc_get_collections()

# Web Accessible Folders with harvestable XML metadata files
There are two working cases of WAFs with harvestable XML: ISO 19115 records from AZGS and CSDGM XML files from AK DGGS. Processing these should be relatively straightforward, with some degree of brittleness in the process depending on what happens with the source directories. A "WAF" is explicitly indicated as a type of webLink on items, so in the following I use a list comprehension to tease out collections that should represent a possible WAF route to follow.

In [3]:
# Keep collections that have at least one webLink explicitly typed as a WAF
data_route_waf = [
    i for i in ndc_collections
    if any(link.get('type') == 'WAF' for link in i.get('webLinks', []))
]

In working through these cases, the one that does not seem to be active at the moment is the ftp link provided by the Washington State Survey. That's actually not strictly a WAF anyway, so it would be something else if it were an active route to harvestable records. Judging by the name in the path (/natalie/), I'm actually guessing this was set up as a test that didn't go anywhere. The ScienceBase collection item also has an attached file, however, so that's probably the preferred route to the records. In any case, this is another item where we need to do a little curatorial work to unambiguously designate how records should be retrieved.

In [4]:
for collection in data_route_waf:
    print(collection['title'])
    print(collection['link']['url'])
    print(next((l['uri'] for l in collection['webLinks'] if l.get('type') == 'WAF'), None))

Arizona Department of Mines and Mineral Resources Photo Archive
https://www.sciencebase.gov/catalog/item/5009bd91e4b0612f70e97a96
http://repository.stategeothermaldata.org/resources/metadata/DataPres2012-13MineFileInventory/ADMMR_PhotosA-Z/
Grover Heinrichs mining collection
https://www.sciencebase.gov/catalog/item/502e85ece4b0ca196f38d852
http://repository.stategeothermaldata.org/resources/metadata/DataPres2014-2015MineFileInventory/GHeinrichsToUsgin/
Collection of industrial mineral sites from Washington State
https://www.sciencebase.gov/catalog/item/51cc42b6e4b052f2a45398e1
ftp://ww4.dnr.wa.gov/geology/berwick/natalie/nggdpp_wa_deliverables_2012/
Geoscience collections harvested from the Alaska Division of Geological and Geophysical Surveys
https://www.sciencebase.gov/catalog/item/5141e4c2e4b0eefcba208e52
http://www.dggs.alaska.gov/metadata/
A. F. Budge mining collection
https://www.sciencebase.gov/catalog/item/57520032e4b053f0edd03e54
http://repository.stategeothermaldata.org/resou

The following codeblocks run one example that I know works from the above. I ran into a couple of other issues that I'm going to need to work through in terms of error handling. The ndc_collection_from_waf() function I built in the pynggdpp package takes a supplied WAF URL, runs through all links from the WAF, uses a metadata parsing utility to parse the contents of the ISO XML, builds a simple set of properties, creates a point geometry from the bounding box, and builds out a GeoJSON feature collection. We will need some more work on fully accommodating all useful properties out of the XML, as there are some other things we should probably incorporate. This also relies on the convention of using the bounding box in the ISO standard to represent a simple point. That should probably get some validation in the code to make sure that's actually the case and to generate a polygon feature in cases where it's actually a bounding box.
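
The bounding box validation mentioned above could look something like the following sketch. It is not the logic in ndc_collection_from_waf(); the function name and the tolerance value are assumptions, and boxes that cross the antimeridian are not handled.

```python
def bbox_to_geometry(west, south, east, north, tolerance=1e-9):
    """Return a GeoJSON Point for a degenerate box, otherwise a Polygon."""
    lon_ok = -180 <= west <= 180 and -180 <= east <= 180
    lat_ok = -90 <= south <= 90 and -90 <= north <= 90
    if not (lon_ok and lat_ok):
        raise ValueError("Bounding box is outside valid longitude/latitude ranges")

    # All four corners collapse to one location: treat it as a point.
    if abs(east - west) <= tolerance and abs(north - south) <= tolerance:
        return {"type": "Point", "coordinates": [west, south]}

    # Exterior ring in counterclockwise order, closed on the first position.
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


point = bbox_to_geometry(-112.1122222, 34.75388889, -112.1122222, 34.75388889)
box = bbox_to_geometry(-113.0, 34.0, -112.0, 35.0)
```

Records that come back as polygons could then be reviewed separately, since they break the "bounding box as point" convention the current process relies on.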

I also need to build in a processor for CSDGM, or I might experiment with the gis_metadata_parser package ability to convert between metadata formats.

This process takes a while to run as there are many different HTTP requests that have to execute and be processed. We will probably want to leverage a caching strategy with this that either caches the original files from the source or builds a derivative (e.g., the simplified GeoJSON built here) and caches that on the ScienceBase Item.

In [5]:
example_waf_url = next(l['uri'] for l in data_route_waf[4]['webLinks'] if l.get('type') == 'WAF')
example_waf_dataset = ndcProcessing.ndc_collection_from_waf(example_waf_url)

In [6]:
print(example_waf_dataset['features'][0]['properties']['title'])
display(example_waf_dataset['features'][0])

UVX: Contracts to Supply Flux to Hidalgo and Chino Smelters


{"geometry": {"coordinates": [-112.1122222, 34.75388889], "type": "Point"}, "properties": {"abstract": "The 'UVX: Contracts to Supply Flux to Hidalgo and Chino Smelters' file is part of the A. F. Budge Mining Ltd. Mining collection. A. F. Budge Mining Ltd., a British company owned by Tony Budge, controlled properties across several western U. S. states and northern Mexico. The company was active in Arizona during the 1980s and into the early 1990s. The collection consists of economic geologic information including maps, logs, reports and records. A few properties make up most of the collection: Vulture, United Verde Extension and Korn Kob.", "place_keywords": ["United States", "Arizona", "Yavapai County", "Clarkdale - 7.5 Min", "U.V.X. Property", "Edith And Audrey Shafts", "Little Daisy", "Verde Exploration Ltd Prop.", "Daisy Shaft", "Audrey Shaft", "T16N R2E Sec 23 NW", "Black Hills (Ya) physiographic area", "Verde metallic mineral dist.", "Yavapai552B"], "temporal_keywords": ["1990s"

# Attached file processing
The major route to data at this stage is going to be through files attached to items. The NDC Dashboard now supports this basic process, which will give us some better control over what happens in the workflow. At the moment, there are a number of issues that we will need to resolve through curatorial work on the items. The following couple of codeblocks use some logic to tease out collections with just a single potentially processable file vs. multiple files.

Single files should be fairly straightforward, but multiple files present a problem where we don't really know which file or files should be processed. Looking through the items, there appear to be cases where Tamar Norkin, who was working to support the NGGDPP process for a time, or others went through and did some pre-processing work on original files to set them up for processing using a built-in file processing mechanism in ScienceBase that turned "sample" records in XML files or rows in CSV files into child items. There seem to be other cases where datasets were broken up across multiple files. And there appear to be cases where different versions of datasets (sometimes with multiple files in each version) were uploaded and processed over time.

Unfortunately, there's no real way to use any combination of attributes on the file objects to determine exactly what to process. We can make some guesses, and we can possibly write some code to examine the current child items to see if we can tease out what files were processed. But at the end of the day, we will need to put some type of flag on the file objects or use some type of external manifest approach to determine what files to pull records from.
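
To make the external manifest option a bit more concrete, here is a minimal sketch. The manifest structure (collection URL mapped to a list of file names) and the `select_files` helper are hypothetical, and the second file name in the example is made up to stand in for a superseded upload.

```python
def select_files(collection, manifest):
    """Return only the file objects that the manifest designates for processing."""
    wanted = set(manifest.get(collection["url"], []))
    return [f for f in collection.get("files", []) if f["name"] in wanted]


# Hypothetical manifest: collection URL -> names of files to pull records from
manifest = {
    "https://www.sciencebase.gov/catalog/item/586fc0c3e4b01a71ba0bc9bc": [
        "brine_analyses_1112017.xml"
    ]
}

collection = {
    "url": "https://www.sciencebase.gov/catalog/item/586fc0c3e4b01a71ba0bc9bc",
    "files": [
        {"name": "brine_analyses_1112017.xml"},
        {"name": "older_version.xml"},  # made-up name for illustration
    ],
}

to_process = select_files(collection, manifest)
```

A collection missing from the manifest yields an empty list, so nothing gets processed by accident until someone has made an explicit curatorial decision.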

In [7]:
data_route_files = [i for i in ndc_collections if 'files' in i.keys() and next((f for f in i['files'] if f['name'] != 'metadata.xml' and f['contentType'] != 'application/fgdc+xml'), None) is not None]

In [8]:
items_with_one_file = []
items_with_multiple_files = []

for collection in data_route_files:
    this_item = {'title':collection['title'],'url':collection['link']['url']}
    this_item['files'] = [f for f in collection['files'] if f['name'] != 'metadata.xml']
    if len(this_item['files']) == 1:
        items_with_one_file.append(this_item)
    else:
        items_with_multiple_files.append(this_item)
        
print('Single Processable File', len(items_with_one_file))
print('Multiple Processable Files', len(items_with_multiple_files))

Single Processable File 213
Multiple Processable Files 120


## NGGDPP XML Format
One mechanism of supplying data to the NDC is a rather archaic simple XML document with "sample" records following the original NDC schema. This set of code runs through one of those examples. In looking over the collection records in ScienceBase today, we're going to have to work through a mechanism of flagging the appropriate XML file for processing so that we can run one piece of code across the entire NDC, find these types of cases, and process all files.

In [9]:
example_xml_collection = next((c for c in items_with_one_file if next((f for f in c['files'] if f['contentType'] == 'application/xml' and f['name'] != 'metadata.xml'), None) is not None), None)

In [10]:
display(example_xml_collection)

{'files': [{'checksum': {'type': 'MD5',
    'value': '6bd54577b4a2734cadde288bc461e30b'},
   'contentType': 'application/xml',
   'dateUploaded': '2017-01-11T19:37:56.000Z',
   'downloadUri': 'https://www.sciencebase.gov/catalog/file/get/586fc0c3e4b01a71ba0bc9bc?f=__disk__c4%2Faa%2F9c%2Fc4aa9c5fcc29a6440f1610326bf8c207afa93f8d',
   'name': 'brine_analyses_1112017.xml',
   'pathOnDisk': '__disk__c4/aa/9c/c4aa9c5fcc29a6440f1610326bf8c207afa93f8d',
   'processToken': '1484163476934',
   'processed': True,
   'size': 586627,
   'url': 'https://www.sciencebase.gov/catalog/file/get/586fc0c3e4b01a71ba0bc9bc?f=__disk__c4%2Faa%2F9c%2Fc4aa9c5fcc29a6440f1610326bf8c207afa93f8d',
   'viewUri': 'https://www.sciencebase.gov/catalog/file/get/586fc0c3e4b01a71ba0bc9bc?f=__disk__c4%2Faa%2F9c%2Fc4aa9c5fcc29a6440f1610326bf8c207afa93f8d&allowOpen=true'}],
 'title': 'Collection of Brine Analyses from Kentucky Oil and Gas Wells',
 'url': 'https://www.sciencebase.gov/catalog/item/586fc0c3e4b01a71ba0bc9bc'}

Basing this process on only the list of single-file collections, we can tease out the one XML file to process and run it through the processing logic I built into a function in pynggdpp. I'm pursuing the basic architectural notion of converting everything to GeoJSON from whatever route I pick up records and loading those into ElasticSearch (via a MongoDB persistent layer) as separate indexes. In many ways, this is not much different than what we have now in ScienceBase (also backed by ES), but we'll be unshackling ourselves from the need to have only a single data model based on the limitations of the "ScienceBase Item." Rather, we can have massive variability across the collections with a handful of high-level mappings to core concepts we can integrate or synthesize over time.

The following runs the one example attached XML that matches the old NGGDPP schema we set up long ago. The ndc_xml_to_geojson() function relies on the xmltodict package as a nice shortcut for XML processing and the geojson package to build a compliant GeoJSON FeatureCollection. I still need to build a lot of error trapping and corner case handling into the function, like dealing with null geometry and other things that are going to come up. There's also a fundamental issue where most of these collections do not specify a coordinate reference system in any kind of consistent way. I'm assuming WGS84 for now, but I know there are exceptions we need to tease out and specify in the collection metadata.
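
The null geometry case might be handled along these lines. This sketch deliberately uses only the standard library (xml.etree.ElementTree and plain dictionaries) rather than xmltodict and geojson, and the `samples`/`sample`/`coordinates` element names are assumptions for illustration, not the actual NGGDPP schema.

```python
import xml.etree.ElementTree as ET


def parse_point(coordinate_text):
    """Parse 'lon,lat' text into a GeoJSON Point, or None if that is not possible."""
    try:
        lon, lat = (float(part) for part in coordinate_text.split(","))
    except (AttributeError, ValueError):
        return None
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        return None
    return {"type": "Point", "coordinates": [lon, lat]}


def samples_to_feature_collection(xml_text, collection_id):
    """Build a FeatureCollection, keeping unlocated records with null geometry."""
    features = []
    for sample in ET.fromstring(xml_text).iter("sample"):
        properties = {child.tag.lower(): (child.text or "").strip() for child in sample}
        properties["collection_id"] = collection_id
        features.append({
            "type": "Feature",
            "geometry": parse_point(properties.get("coordinates")),
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


sample_xml = """<samples>
  <sample><title>Record 1</title><coordinates>-85.416904,37.080098</coordinates></sample>
  <sample><title>Record 2</title><coordinates>unknown</coordinates></sample>
</samples>"""

result = samples_to_feature_collection(
    sample_xml, "https://www.sciencebase.gov/catalog/item/586fc0c3e4b01a71ba0bc9bc"
)
```

GeoJSON allows a feature's geometry to be null, so records with unusable coordinates stay in the collection and can still be found by text search.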

In [11]:
example_xml_file_url = [f['url'] for f in example_xml_collection['files'] if f['contentType'] == 'application/xml' and f['name'] != 'metadata.xml'][0]
example_xml_file_geojson = ndcProcessing.ndc_xml_to_geojson(example_xml_collection['url'], example_xml_file_url)

In [12]:
print(example_xml_file_geojson['features'][0]['properties']['title'])
display(example_xml_file_geojson['features'][0])

Scanned brine analyses from oil and gas well: Record Number 344


{"geometry": {"coordinates": [-85.416904, 37.080098], "type": "Point"}, "properties": {"abstract": "Scanned brine analyses records collected with Kentucky oil and gas well drilling activity. Oil or gas well record in the Kentucky Geological Survey data repository.<br/>Well Number: 1 <br/>Operator: M P OIL VENTURES <br/>Farm Name: CORBIN, PAUL <br/>Permit Date: 3/9/1971 <br/>Completion Date: 3/28/1971 <br/>Well Elevation (ft): 763 <br/>Total Depth Formation: Ordovician-Knox Gp <br/>Deepest Pay Formation: Ordovician-Knox Gp <br/>Deviated or Vertical: vertical <br/>", "alternategeometry": "Coordinates represented in Geographic Decimal Degree NAD83. Converted from NAD27 using KYGeoTools (http://ngs.ky.gov/Pages/kyGeoTools.html)<br/>Quadrangle: Gradyville <br/>County: Adair</br>State: Kentucky", "alternatetitle": "Permit Number 24571;API Number 16001030000000", "browsegraphic": {"resourceURL": null}, "build_from_source_date": "2018-11-24T17:48:40.689455", "collection_id": "https://www.scien

Things to do in processing version 1 NDC records from XML:
* Convert coordinates to valid GeoJSON point geometry
* Determine if CRS is other than WGS84 and record those details somewhere in the collection
* Verify collection ID is valid ScienceBase ID
* Determine uses of alternate title and handle appropriately
* Determine uses of browse graphic and online resource; handle appropriately with at least verification of availability
* Verify datasetReferenceDate and set to valid ISO8601 date
* Verify date range and set to valid ISO8601 date range
* Package GeoJSON feature collection
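
For the date items in the list above, a first pass might look like this sketch. The set of accepted input formats is an assumption (month/day/year strings like "3/9/1971" do show up in the sample record above); anything unrecognized returns None so it can be flagged rather than guessed at.

```python
from datetime import datetime

# Assumed input formats; the real files will likely need more of these.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y")


def to_iso8601_date(value):
    """Return an ISO 8601 date string, a bare year, or None if unparseable."""
    text = (value or "").strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # A bare year is kept at year precision instead of inventing a month and day.
        return text if fmt == "%Y" else parsed.date().isoformat()
    return None


def to_iso8601_range(start, end):
    """Return 'start/end' as an ISO 8601 interval, or None if either side fails."""
    start_iso, end_iso = to_iso8601_date(start), to_iso8601_date(end)
    if start_iso is None or end_iso is None:
        return None
    return f"{start_iso}/{end_iso}"


print(to_iso8601_date("3/9/1971"))
print(to_iso8601_range("3/9/1971", "3/28/1971"))
```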

## NGGDPP text file format
The NDC process used a weird pipe-delimited text file format, and as with all simple text files, we'll probably have a ton of corner cases to deal with. The following section starts to explore the same basic approach of reading and building out a GeoJSON FeatureCollection from these text files where we'll be able to make some adjustments to make the data more loadable to online infrastructure. In a quick run through of content types detected by the ScienceBase upload process, I've seen wide variability in what these files are identified as. I'll have to throw in some of my own processing to work out the best ingest methods.

I'm starting with Pandas because it seems to smooth over a lot of the rough edges in text file processing, and then I'll see where we end up.
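
Since the content types reported for these files vary so much, the delimiter probably has to be worked out from the file itself. The sketch below does that with only the standard library csv module (not Pandas) by counting candidate delimiters in the header line. The function names and the sample data are made up for illustration.

```python
import csv
import io


def guess_delimiter(header_line, candidates=("|", ",", "\t")):
    """Pick the candidate that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def read_delimited_records(text):
    """Read delimited text into dictionaries with lowercased, trimmed keys."""
    lines = text.splitlines()
    if not lines:
        return []
    reader = csv.DictReader(io.StringIO(text), delimiter=guess_delimiter(lines[0]))
    records = []
    for row in reader:
        records.append({
            key.strip().lower(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # skip overflow cells from rows with too many fields
        })
    return records


# Made-up sample in the pipe-delimited style
sample_text = "Title|Coordinates|DataType\nSample A|-107.25,32.87|Thin sections\n"
records = read_delimited_records(sample_text)
```

The delimiter found this way could just as well be handed to pandas.read_csv through its `sep` parameter.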

In [13]:
example_text_file_collection = next((c for c in items_with_one_file if next((f for f in c['files'] if f['contentType'] != 'application/xml'), None) is not None), None)

In [14]:
display(example_text_file_collection)

{'files': [{'contentType': 'text/csv',
   'dateUploaded': '2018-11-01T17:14:39.000Z',
   'downloadUri': 'https://www.sciencebase.gov/catalog/file/get/5bd9e3bae4b0b3fc5cebf246?f=__disk__5c%2F33%2Fb4%2F5c33b46a1d1af567a1d18efe7c61edc9888eb264',
   'name': 'SmithAdam.csv',
   'pathOnDisk': '__disk__5c/33/b4/5c33b46a1d1af567a1d18efe7c61edc9888eb264',
   'processToken': '1541092479667',
   'processed': True,
   'size': 20307,
   'url': 'https://www.sciencebase.gov/catalog/file/get/5bd9e3bae4b0b3fc5cebf246?f=__disk__5c%2F33%2Fb4%2F5c33b46a1d1af567a1d18efe7c61edc9888eb264',
   'viewUri': 'https://www.sciencebase.gov/catalog/file/get/5bd9e3bae4b0b3fc5cebf246?f=__disk__5c%2F33%2Fb4%2F5c33b46a1d1af567a1d18efe7c61edc9888eb264&allowOpen=true'}],
 'title': 'Thin Sections Scans donated by Adam Smith',
 'url': 'https://www.sciencebase.gov/catalog/item/5bd9e3bae4b0b3fc5cebf246'}

In [16]:
example_text_file_url = [f['url'] for f in example_text_file_collection['files'] if f['contentType'] != 'application/xml'][0]
example_text_file_geojson = ndcProcessing.ndc_text_to_geojson(example_text_file_collection['url'], example_text_file_url)

In [17]:
print(example_text_file_geojson['features'][0]['properties']['title'])
display(example_text_file_geojson['features'][0])

S. Red Hills episyenite with partial alteration Sample: REE-1002, Episyenite


{"geometry": {"coordinates": [-107.255901, 32.876984], "type": "Point"}, "properties": {"abstract": " Episyenite collected by Adam Smith for Thesis Sample Notes: partial alteration", "alt_title": "S. Red Hills episyenite with partial alteration", "build_from_source_date": "2018-11-24T17:49:34.594427", "collection_id": "https://www.sciencebase.gov/catalog/item/5bd9e3bae4b0b3fc5cebf246", "collectionname": "Adam Smith", "coordinates": "-107.255901,32.876984", "datatype": "Thin sections and polished Sections", "source_file": "https://www.sciencebase.gov/catalog/file/get/5bd9e3bae4b0b3fc5cebf246?f=__disk__5c%2F33%2Fb4%2F5c33b46a1d1af567a1d18efe7c61edc9888eb264", "supplementalinformation": "Adam Smith,  research funded partially by New Mexico Bureau of Geology and Mineral Resources, Contact: Adam Smith, email: smithae1223 @ gmail.com.", "title": "S. Red Hills episyenite with partial alteration Sample: REE-1002, Episyenite"}, "type": "Feature"}

# Some other online source
One of the things we want to explore is an opportunity to leverage existing infrastructure that some State Geological Surveys are managing that may a) provide more robust and complete information on collections than might have previously been supplied with the "lowest common denominator" NGGDPP metadata approach and b) put us closer to working with what data providers are already investing in for the long term (as opposed to layering on something different that they have to do). The above example of the AZGS WAF is one possibility. I also found an online data service from the Maine Geological Survey that is used to run some of their mapping apps that provides what look to be records for some of the same rock cores provided in the NDC. I'm still confused over the organization of these into "sub-collections" in ScienceBase and where those come from, but this example of a GeoJSON response from an ArcGIS MapServer query service may be a reasonable approach if the service turns out to provide an up to date and sustained source for these records.

In [18]:
maine_core_locations = requests.get('https://gis.maine.gov/arcgis/rest/services/mgs/Geology_Data/MapServer/1/query?where=1%3D1&text=&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&relationParam=&outFields=*&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&returnDistinctValues=false&resultOffset=&resultRecordCount=&f=geojson').json()
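
The long query string above is mostly empty parameters. As a sketch, the same request can be assembled from just the parameters that carry values. The `build_query_url` helper is hypothetical, and dropping the empty parameters is assumed (not verified against the live service) to leave the response unchanged.

```python
from urllib.parse import urlencode

MAINE_CORE_QUERY_URL = (
    "https://gis.maine.gov/arcgis/rest/services/mgs/Geology_Data/MapServer/1/query"
)


def build_query_url(base_url, where="1=1", out_fields="*"):
    """Build a query URL that asks the service for all matching features as GeoJSON."""
    params = {
        "where": where,
        "outFields": out_fields,
        "returnGeometry": "true",
        "f": "geojson",
    }
    return f"{base_url}?{urlencode(params)}"


query_url = build_query_url(MAINE_CORE_QUERY_URL)
print(query_url)
```

With requests, the same dictionary could instead be passed as the `params` argument to requests.get(), which takes care of the encoding.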

In [19]:
print(maine_core_locations['features'][0]['properties']['CoreID'])
display(maine_core_locations['features'][0])

910-14


{'geometry': {'coordinates': [-68.97902513852661, 46.40405746805513],
  'type': 'Point'},
 'id': 1,
 'properties': {'ConsolidatedLogURL': '<a href="http://www.maine.gov/dacf/mgs/explore/mining/core/t9r10wels/all_core.pdf" target="_blank">View PDF</a>',
  'ConsolidatedMapURL': '<a href="http://www.maine.gov/dacf/mgs/explore/mining/core/t9r10wels/all_core.pdf" target="_blank">View PDF</a>',
  'ConsolidatedXSectionURL': '<a href="http://www.maine.gov/dacf/mgs/explore/mining/core/t9r10wels/all_rag_xsects.pdf" target="_blank">View PDF</a>',
  'CoreID': '910-14',
  'Driller': 'The Joint Venture',
  'LogURL': '<a href="http://www.maine.gov/dacf/mgs/explore/mining/core/t9r10wels/910-14.pdf" target="_blank">View PDF</a>',
  'MapURL': '<a href="http://www.maine.gov/dacf/mgs/explore/mining/core/t9r10wels/all_910_map.pdf" target="_blank">View PDF</a>',
  'OBJECTID': 1,
  'Project': 'Munsungan Lake Area',
  'Township': 'T9 R10 WELS',
  'XSectionURL': None},
 'type': 'Feature'}

# Visualization
The main thing I'm pursuing is getting to a workable API, based on ElasticSearch and the same Flask-based REST API we are building for the Biogeographic Information System, that allows for exploration and discovery across all collections in the NDC. However, we do also need some ways of exploring all of that visually. The code below uses a simple Folium map to display the collections added above. I'll do some more work on this to include properties in the markers.

My plan at this point is to fork the Burwell app from the Macrostrat folks and add in a capability to search for and display NDC artifacts. Their system uses a multi-resolution interface to global geologic maps as a base with a find-by-click approach that uses the surface geology and geographic location of a dropped pin to search a number of different services and display potentially useful items. I'll add a capacity to the discovery panel that uses either a buffer on the dropped pin or geologic formation geometry to set a spatial constraint for the NDC search.
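
For the buffered-pin case, the spatial constraint could be expressed as an ElasticSearch geo_distance filter, sketched below as a plain dictionary. The `location` field name is hypothetical and would need to be mapped as a geo_point in the index; the default buffer distance is arbitrary.

```python
def ndc_pin_query(lat, lon, buffer_km=10, geo_field="location"):
    """Build a query body that keeps records within buffer_km of a dropped pin."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": f"{buffer_km}km",
                        geo_field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


query = ndc_pin_query(37.080098, -85.416904, buffer_km=25)
```

Constraining by geologic formation geometry would need a different query type against shape data, which this sketch does not cover.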

This simple visual already shows that I'm going to need to put some additional logic into the processing code that will find outliers and flag them in some way as suspect. I can then generate an additional API route that highlights suspect records for further action.
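
One simple form of outlier flagging is to compare each point against a bounding box the collection is expected to fall within. In this sketch the `suspect_location` property name is hypothetical, the Kentucky box is only a rough illustrative extent, and records with no usable point geometry are flagged as suspect as well.

```python
def flag_suspect_features(features, expected_bbox):
    """Set a 'suspect_location' property on each feature and return the list."""
    west, south, east, north = expected_bbox
    for feature in features:
        geometry = feature.get("geometry")
        suspect = True  # missing or non-point geometry is flagged for review
        if geometry and geometry.get("type") == "Point":
            lon, lat = geometry["coordinates"][:2]
            suspect = not (west <= lon <= east and south <= lat <= north)
        feature.setdefault("properties", {})["suspect_location"] = suspect
    return features


rough_kentucky_bbox = (-89.6, 36.4, -81.9, 39.2)  # approximate, for illustration

features = [
    {"type": "Feature", "geometry": {"type": "Point", "coordinates": [-85.416904, 37.080098]}, "properties": {}},
    # Same location with latitude and longitude swapped
    {"type": "Feature", "geometry": {"type": "Point", "coordinates": [37.080098, -85.416904]}, "properties": {}},
    {"type": "Feature", "geometry": None, "properties": {}},
]

flagged = flag_suspect_features(features, rough_kentucky_bbox)
```

Swapped coordinates are a typical way for a point to land far outside the expected extent, and a check like this catches them without any statistics.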

In [20]:
do_xml = True
do_text = True
do_waf = True
do_remote = True

m = folium.Map(location=[45, -110], zoom_start=2)

if do_xml:
    xml_example_marker_cluster = MarkerCluster().add_to(m)
    for feature in example_xml_file_geojson['features']:
        folium.Marker([feature['geometry']['coordinates'][1], feature['geometry']['coordinates'][0]],
                popup=feature['properties']['title']
        ).add_to(xml_example_marker_cluster)

if do_text:
    text_example_marker_cluster = MarkerCluster().add_to(m)
    for feature in example_text_file_geojson['features']:
        folium.Marker([feature['geometry']['coordinates'][1], feature['geometry']['coordinates'][0]],
                popup=feature['properties']['title']
        ).add_to(text_example_marker_cluster)

if do_waf:
    waf_example_marker_cluster = MarkerCluster().add_to(m)
    for feature in example_waf_dataset['features']:
        folium.Marker([feature['geometry']['coordinates'][1], feature['geometry']['coordinates'][0]],
                popup=feature['properties']['title'].replace("'","")
        ).add_to(waf_example_marker_cluster)

if do_remote:
    remote_example_marker_cluster = MarkerCluster().add_to(m)
    for feature in maine_core_locations['features']:
        folium.Marker([feature['geometry']['coordinates'][1], feature['geometry']['coordinates'][0]],
                popup=f"{feature['properties']['Project']} - {feature['properties']['CoreID']}"
        ).add_to(remote_example_marker_cluster)

m

### Notes about visualization

* I found pesky issues with quote characters in some of the text strings causing the Leaflet JavaScript behind Folium to fail. I caught these in the JavaScript console and had to put in the workaround for now. We may want to build a cleanup process into the indexing/integration routine to generate derivative properties that are cleaned up for this type of use.
* In my latest experimentation trying to simply throw all properties into a popup in the Folium map, I ran into other issues with content in some of the properties. I tried a couple of ways of cleaning or stripping HTML from things like the abstract but still ran into some issues I decided to delay work on for a while. In thinking about the implications of this for the future data processing workflow, cleaning up text string values as much as possible is probably one of the things I need to work into the final index so that the resulting data will fit into most circumstances without blowing anything up. I think a big part of the problem comes from the fact that a number of data providers elected to try and shove a bunch of extra attributes into things like the abstract or supplemental information fields because the low-bar standard didn't allow for any more elegant options. These providers may be good candidates for revisiting the problem and coming up with an alternate way of bringing data into the catalog that gives folks more options.
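
A cleanup step for derivative text properties might start out like the sketch below, which strips markup with the standard library HTML parser and escapes quote characters. The function names are hypothetical, and whether escaped text fully avoids the popup failures described above would still need to be tested in Folium.

```python
import html
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, turning <br> tags into spaces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag == "br":  # covers the stray "</br>" seen in some abstracts
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(value):
    """Strip HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(value or "")
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def popup_safe(value):
    """Clean the text, then escape quotes and angle brackets as HTML entities."""
    return html.escape(clean_text(value), quote=True)


print(clean_text("Well Number: 1 <br/>Operator: M P OIL VENTURES <br/>"))
print(popup_safe("UVX: 'Contracts'"))
```

Storing both the original value and a cleaned derivative would keep the source record intact while giving interfaces something safer to render.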