This notebook explores the dynamics of the collection items in the NDC. These are the essential items on which all other work will be based. I found a certain level of inconsistency in the collection items, and no single way to reliably retrieve all items that should be classified as "NDC Collections" from ScienceBase.

The original collection items all came from a survey conducted at the start of the NGGDPP, built as a web form that populated survey responses into a FileMaker Pro database built by Jerry McFaul. Rick Brown, who worked up some of the original thinking about the NDC, wrote a process to pull information from that database into what was then the Comprehensive Science Catalog (later rebuilt as ScienceBase). A number of items in the NDC still carry legacy artifacts from that process: the NGGDPP Collection extension/facet and unique identifiers (e.g., "P1323"). Most of the information in these legacy collections is no longer very relevant and can't be counted on for accuracy. The categorization concepts in the NGGDPP Collection facet seem to have been pulled up into tags from a set of vocabularies, but I don't believe those terms have been reviewed for some time, if at all, so they cannot really be used for semantic integration. The facet also contains a numeric value intended to indicate the relative magnitude of each collection, but I don't believe these values have been reviewed for accuracy in a long time either.

The identifiers assigned to collections back when the survey was done were created to help keep track of work from that point forward. Submitters placed the identifier on records in spreadsheets or XML files, and the process used it to identify the collection into which the submitted metadata should be organized. It may still be useful to keep those IDs on the collection items where relevant, but later collections seem to have dropped this concept.

We can mostly count on the items immediately below the organization collections being the actual NDC Collections; these are the items I'll be focusing most of my attention on. I created another type classification tag, ndc_collection, that I will likely apply to all of the items deemed to be the actual collections of the NDC. This will make it much more reliable to tease them out of ScienceBase for processing.
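
As a rough sketch of how that tag could be used once applied, the helper below builds a catalog query URL that filters on a type tag. The function name `build_tag_query` and its defaults are hypothetical, and it assumes that ndc_collection lives in the same vocabulary scheme as ndc_organization, which this notebook does not confirm.

```python
import json
from urllib.parse import urlencode

SB_ITEMS_URL = "https://www.sciencebase.gov/catalog/items"
# Assumption: ndc_collection is registered under the same scheme as ndc_organization.
NDC_TYPE_SCHEME = "https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types"


def build_tag_query(tag_name, max_items=1000, fields="title,hasChildren"):
    """Build an items query URL filtered on a scheme/name tag (hypothetical helper)."""
    tag = {"scheme": NDC_TYPE_SCHEME, "name": tag_name}
    params = {
        "format": "json",
        "max": max_items,
        "fields": fields,
        # Serialize the tag as compact JSON; urlencode percent-encodes it.
        "filter": "tags=" + json.dumps(tag, separators=(",", ":")),
    }
    return f"{SB_ITEMS_URL}?{urlencode(params)}"


url = build_tag_query("ndc_collection")
```

Passing the resulting URL to `requests.get(url).json()` would then return tagged items regardless of where they sit in the hierarchy, if the tag has been applied consistently.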

In [1]:
import requests
from IPython.display import display

In [2]:
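import json
from urllib.parse import quote

parentId = '4f4e4760e4b07f02db47dfb4'
tagScheme = {"scheme":"https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types","name":"ndc_organization"}
# Serialize the tag filter as JSON and URL-encode it; interpolating the dict directly embeds its Python repr
tagFilter = quote(json.dumps(tagScheme))
sbQueryPath = f'https://www.sciencebase.gov/catalog/items?format=json&max=50&fields=title,contacts,spatial&parentId={parentId}&filter=tags%3D{tagFilter}'

r_ndc_org = requests.get(sbQueryPath).json()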

Periodic slowdowns in ScienceBase API responses make it difficult to run everything completely live. The following code block builds a list of NDC organizations, each with the essential information about its collections that we need in order to work out the nature of the collections and what it will take to process them effectively.
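
One way to soften the slowdown problem is to cache each response on disk and only go back to the API when no cached copy exists. This is a minimal sketch, not something the notebook currently does; `cached_json`, its parameters, and the cache file name in the usage comment are all hypothetical.

```python
import json
from pathlib import Path


def cached_json(url, cache_path, fetch):
    """Return JSON for url, reading from cache_path when it already exists.

    fetch is any callable that takes a URL and returns parsed JSON,
    e.g. lambda u: requests.get(u, timeout=60).json()
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the earlier response instead of calling the API again.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch(url)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data


# Hypothetical usage with the query built earlier in this notebook:
# r_ndc_org = cached_json(sbQueryPath, "ndc_orgs.json",
#                         lambda u: requests.get(u, timeout=60).json())
```

Deleting the cache file forces a fresh request, which keeps the behavior easy to reason about while exploring.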

In [11]:
ndc_collections = []

for item in r_ndc_org['items']:
    # ndc_org refers to the same dict as item, so the organization record is extended in place
    ndc_org = item
    item_id = item['id']
    # Fetch the items directly below this organization
    sbCollectionsQuery = f'https://www.sciencebase.gov/catalog/items?format=json&max=1000&fields=title,files,webLinks,facets,hasChildren&parentId={item_id}'
    item_collections = requests.get(sbCollectionsQuery).json()
    ndc_org['Collections'] = item_collections['items']
    ndc_collections.append(ndc_org)
    # Report the organization title and its number of collections
    print(item['title'], len(item_collections['items']))

North Dakota Geological Survey 6
Tennessee Geological Survey 21
Maine Geological Survey 2
Kentucky Geological Survey 22
California Geological Survey 4
Alaska Division of Geological and Geophysical Surveys 18
Montana Bureau of Mines and Geology 11
New Hampshire Geological Survey 9
Delaware Geological Survey 24
Colorado Geological Survey 2
U.S. Geological Survey 5
Iowa Geological Survey 10
New Mexico Bureau of Geology and Mineral Resources 49
Texas Bureau of Economic Geology 11
New Jersey Geological Survey 18
North Carolina Geological Survey 0
Ohio Geological Survey 18
South Carolina Geological Survey 8
Nevada Bureau of Mines and Geology 13
Utah Geological Survey 10
Oregon Geological Survey 9
Arizona Geological Survey 58
Arkansas Geological Survey 8
Missouri Geological Survey 7
Minnesota Geological Survey 19
Pennsylvania Geological Survey 129
Vermont Geological Survey 14
New York Geological Survey 13
Nebraska Geological Survey 13
Geological Survey of Alabama 16
Oklahoma Geological Survey

This next loop is a crude first cut at what we need to deal with in processing collection items. It helps expose a number of issues we'll need to address and will serve as fodder for a basic characterization of the system as it stands. We should find the following cases in terms of basic collection characteristics:

* Collections identified at some point by an organization but so far containing no items. These are shown in the list with a preceding '-' symbol.
* Collections where an "NGGDPP CSV" file was supplied. I expect other issues to tease out with these once I get into processing the files, since a '|' delimiter convention was used.
* Collections where an "NGGDPP XML" file was supplied. These are simple and straightforward to deal with.
* Collections presenting a WAF (web accessible folder) from which other files can be harvested. Cases I've seen so far include WAF-harvestable ISO 19115 and CSDGM XML files.
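
The cases above can be sketched as a small function that labels a single collection record. The field names (`hasChildren`, `files`, `contentType`, `webLinks`, `type`) are the ones used in this notebook's queries; the function name and the label strings are hypothetical. Unlike a nested if/else, this sketch checks web links even when files are present, which is an assumption about what is wanted.

```python
def classify_collection(collection):
    """Return a list of labels for one collection dict (hypothetical labels)."""
    labels = []
    if not collection.get("hasChildren", False):
        labels.append("empty")  # identified, but no items yet

    # Attached files are distinguished by their content type.
    content_types = {f.get("contentType") for f in collection.get("files", [])}
    if "text/csv" in content_types:
        labels.append("csv")
    if "application/xml" in content_types:
        labels.append("xml")

    # A web link typed as WAF points at a folder of harvestable files.
    if any(link.get("type") == "WAF" for link in collection.get("webLinks", [])):
        labels.append("waf")
    return labels


example = {
    "hasChildren": True,
    "files": [
        {"name": "records.csv", "contentType": "text/csv"},
        {"name": "metadata.xml", "contentType": "application/xml"},
    ],
}
print(classify_collection(example))
```

A collection that has children but returns an empty list has nothing obvious to process and would need a closer look by hand.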

In [23]:
for org in ndc_collections:
    print(org['title'], len(org['Collections']))
    for collection in org['Collections']:
        # Collections with no child items are listed with a leading '-'
        if not collection['hasChildren']:
            print('-', collection['title'])
            continue

        print('+', collection['title'])

        if 'files' in collection:
            # Attached NGGDPP CSV and XML files, identified by content type
            csv_files = [f['name'] for f in collection['files'] if f['contentType'] == 'text/csv']
            if csv_files:
                print(csv_files)
            xml_files = [f['name'] for f in collection['files'] if f['contentType'] == 'application/xml']
            if xml_files:
                print(xml_files)
        elif 'webLinks' in collection:
            # Web links are only checked when no files are attached
            waf_links = [l['uri'] for l in collection['webLinks'] if l['type'] == 'WAF']
            if waf_links:
                print(waf_links)
        else:
            # Neither files nor web links: flag the item and print its URL for follow-up
            print('NO PROCESSABLE FILES OR WEBLINKS IN THIS COLLECTION')
            print(collection['link']['url'])
    print('    ')


North Dakota Geological Survey 6
- Collection of Field notes from North Dakota
- Collection of Rock cores from North Dakota
- Collection of Well logs from North Dakota
- Collection of Paleontological samples from North Dakota
- Collection of Photographs from North Dakota
- Collection of Hand samples from North Dakota
    
Tennessee Geological Survey 21
- Collection of Oil & Gas Well Data File from Tennessee
+ Collection of Coal Field Measured Sections from Tennessee
['Tennessee_P1341_Coal_Field_Measured_Sections_Collection.csv']
['metadata.xml']
- Collection of Geotechnical Engineering Reports from Tennessee
+ Tennessee Geological Survey Tennessee Valley Authority Documents Collection
['Tennessee_Collection_1322_TVA_Reports_2018_Deliverable.csv']
- Collection of DOE Oak Ridge Reports from Tennessee
+ Tennessee Geological Survey Mineral Resources Documents Collection
['Tennessee_P1328_Mineral_Resources_Collection_Knoxville.csv', 'Tennessee_P1328_Mineral_Resources_Collection_Nashville.cs