I ran through and did some utility work on the items in ScienceBase that we are likely to keep in that part of the infrastructure as we work to rearchitect how the actual items within collections are handled. I applied three different tags to items depending on their function. The tags come from a new vocabulary I set up in ScienceBase-Vocab, and its terms are shown below.

In [1]:
import requests
from IPython.display import display

parentId = '4f4e4760e4b07f02db47dfb4'
queryRoot = 'https://www.sciencebase.gov/catalog/items?format=json&max=1000&'

These are the terms I used to essentially classify the Items that will remain in ScienceBase as to their function. In the immediate term, this will let me easily tease out the items that represent actual collections where I should expect to find either some method of interfacing with the items in the collection (file, web link, etc.) or else nothing at the moment. The collection items are also where we will concentrate efforts on metadata improvement.

In [2]:
item_type_vocab = requests.get('https://www.sciencebase.gov/vocab/5bf3f7bce4b00ce5fb627d57/terms?nodeType=term&parentId=5bf3f7bce4b00ce5fb627d57&max=10&offset=0&format=json').json()

item_types = [{'name':t['name'],'label':t['label'],'description':t['description'],'scheme':t['scheme']} for t in item_type_vocab['list']]

display(item_types)


[{'description': 'Item that represents a logical collection of physical data items managed by an organization contributing to the National Digital Catalog. These are core metadata items for which we expect to find full metadata describing the collection and a method of accessing the contents in the collection.',
  'label': 'NDC Collection',
  'name': 'ndc_collection',
  'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'},
 {'description': 'Denotes a ScienceBase Item that functions as an organizational folder within the National Digital Catalog.',
  'label': 'NDC Folder',
  'name': 'ndc_folder',
  'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'},
 {'description': 'Item that represents a contributing organization. Used to set permissions in ScienceBase to allow management access by members of the organization.',
  'label': 'NDC Organization',
  'name': 'ndc_organization',
  'scheme': 'https://www.sciencebase.gov/voc

# Organizations

Organization records form the essential containers for managing collections in the NDC. The following sections pull organization items and start to set things up to run some health checks on the organization records.

In [3]:
tag_scheme_orgs = next(({'name':ts['name'],'scheme':ts['scheme']} for ts in item_types if ts['name'] == 'ndc_organization'), None)
fields_orgs = 'title,contacts,spatial'
sb_query_orgs = f'{queryRoot}fields={fields_orgs}&folderId={parentId}&filter=tags%3D{tag_scheme_orgs}'
print(sb_query_orgs)

https://www.sciencebase.gov/catalog/items?format=json&max=1000&fields=title,contacts,spatial&folderId=4f4e4760e4b07f02db47dfb4&filter=tags%3D{'name': 'ndc_organization', 'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'}
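
Note that the URL printed above carries the Python `repr` of the tag dictionary, single quotes and spaces included. The following is a minimal sketch of a more explicit encoding, assuming the `tags` filter accepts a JSON object (this notebook does not confirm that). `build_tag_query` is a hypothetical helper, not part of any ScienceBase client.

```python
import json
from urllib.parse import quote

def build_tag_query(query_root, fields, folder_id, tag):
    """Build a catalog query URL with the tag filter serialized as JSON and percent-encoded."""
    # Assumption: the service parses the filter value as JSON; compact separators avoid spaces
    tag_json = json.dumps(tag, separators=(',', ':'))
    tag_filter = quote(f'tags={tag_json}', safe='')
    return f'{query_root}fields={fields}&folderId={folder_id}&filter={tag_filter}'

example_tag = {
    'name': 'ndc_organization',
    'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types',
}
example_url = build_tag_query(
    'https://www.sciencebase.gov/catalog/items?format=json&max=1000&',
    'title,contacts,spatial',
    '4f4e4760e4b07f02db47dfb4',
    example_tag,
)
print(example_url)
```

The query as originally written did return results, so this is only worth adopting if the unencoded form starts causing trouble.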


In [4]:
# Retrieve the org items
r_ndc_org = requests.get(sb_query_orgs).json()

I pulled two essential properties for the org items that will likely be used to help improve metadata for the collection items in these containers:

* Contacts contains a "Data Owner" type contact that should be a reasonably populated responsible party entity for these records. These should be reviewed for current information as they look like contacts that I added a long time ago to the ScienceBase Directory. I did go through and do a little cleanup manually in ScienceBase in a few cases where I did not see Data Owner contacts listed.
* The spatial property here contains a reasonable bounding box for most of the items generated by tying a state ID to the items in ScienceBase. This might be used to generate a bounding box for collection items to build at least reasonable harvestable metadata for cases where there are not actual items presented in some way as yet.
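
A first health check on these two properties could be sketched as follows. It only tests for the presence of a "Data Owner" contact and of a `spatial` key, and assumes nothing about what `spatial` contains. `org_health_report` and the sample records are hypothetical, made up for illustration.

```python
def org_health_report(org_items):
    """Return titles of org items lacking a 'Data Owner' contact or a spatial property."""
    missing_owner, missing_spatial = [], []
    for item in org_items:
        label = item.get('title', item.get('id', '(untitled)'))
        contacts = item.get('contacts', [])
        if not any(c.get('type') == 'Data Owner' for c in contacts):
            missing_owner.append(label)
        if 'spatial' not in item:
            missing_spatial.append(label)
    return {'missing_data_owner': missing_owner, 'missing_spatial': missing_spatial}

# Made-up sample records for illustration only (not real query results)
sample_orgs = [
    {'title': 'Org A', 'contacts': [{'type': 'Data Owner', 'name': 'Survey A'}], 'spatial': {}},
    {'title': 'Org B', 'contacts': [{'type': 'Point of Contact'}]},
]
print(org_health_report(sample_orgs))
```

Against the live data, the call would be `org_health_report(r_ndc_org['items'])`.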

Note that in going through and classifying everything in the NDC, I came across a few other items buried a little ways down the hierarchy that I classed as "ndc_organization." So it is not just the top-level items in the NDC collection that are considered organizations.

In [5]:
print(len(r_ndc_org['items']))
display(r_ndc_org['items'])

48


[{'contacts': [{'active': True,
    'contactType': 'organization',
    'logoUrl': 'http://my.usgs.gov/static-cache/images/dataOwner/v1/logosMed/NDLogo.gif',
    'name': 'North Dakota Geological Survey',
    'oldPartyId': 18256,
    'onlineResource': 'https://www.dmr.nd.gov/ndgs/',
    'primaryLocation': {'mailAddress': {'city': 'Bismarck',
      'line1': '600 East Boulevard Avenue',
      'state': 'ND',
      'zip': '58505-0840'},
     'name': 'North Dakota Geological Survey',
     'officePhone': '7013288000',
     'streetAddress': {'city': 'Bismarck',
      'line1': '1016 E. Calgary Ave.',
      'state': 'ND',
      'zip': '58503'}},
    'smallLogoUrl': 'http://my.usgs.gov/static-cache/images/dataOwner/v1/logosSmall/NDLogo.gif',
    'type': 'Data Owner'}],
  'id': '4f4e4761e4b07f02db47dfe0',
  'link': {'rel': 'self',
   'url': 'https://www.sciencebase.gov/catalog/item/4f4e4761e4b07f02db47dfe0'},
  'relatedItems': {'link': {'rel': 'related',
    'url': 'https://www.sciencebase.gov/cata

# Collections

The real meat of this work will be in the collection items.

In [6]:
tag_scheme_collections = next(({'name':ts['name'],'scheme':ts['scheme']} for ts in item_types if ts['name'] == 'ndc_collection'), None)
fields_collections = 'title,files,webLinks,hasChildren,facets,tags'
sb_query_collections = f'{queryRoot}fields={fields_collections}&folderId={parentId}&filter=tags%3D{tag_scheme_collections}'
print(sb_query_collections)

https://www.sciencebase.gov/catalog/items?format=json&max=1000&fields=title,files,webLinks,hasChildren,facets,tags&folderId=4f4e4760e4b07f02db47dfb4&filter=tags%3D{'name': 'ndc_collection', 'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'}


In [7]:
r_ndc_collections = requests.get(sb_query_collections).json()

I'll be lazy for right now and leave off building a paginated method of working through collection items since there are just under the upper limit of 1,000 items that can be retrieved from ScienceBase in a single query.
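
If the count does climb past 1,000, a paginated fetch could look roughly like the sketch below. It assumes each response page carries a `nextlink` object with a `url` key when more results exist; that is an assumption about the response shape and is not verified in this notebook. `fetch_all_items` is a hypothetical helper, and the fake pages exist only to exercise the loop.

```python
def fetch_all_items(first_url, get_json):
    """Follow 'nextlink' URLs until exhausted, accumulating 'items' from each page."""
    items = []
    seen = set()
    url = first_url
    # Stop when there is no next link, or if a link repeats (guards against a loop)
    while url and url not in seen:
        seen.add(url)
        page = get_json(url)
        items.extend(page.get('items', []))
        url = page.get('nextlink', {}).get('url')
    return items

# Fake two-page response keyed by URL, for illustration only
fake_pages = {
    'page1': {'items': [{'id': 'a'}, {'id': 'b'}], 'nextlink': {'url': 'page2'}},
    'page2': {'items': [{'id': 'c'}]},
}
all_items = fetch_all_items('page1', fake_pages.__getitem__)
print(len(all_items))
```

With real requests, `get_json` would be something like `lambda u: requests.get(u).json()` and `first_url` would be `sb_query_collections`.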

In [8]:
print(len(r_ndc_collections['items']))

972


## Collections with Child Items

First, we can presume at the moment that all collection items that currently have children are the ones where we should find some type of processable file or WAF link. I know that there are some exceptions to this rule, and this exploration is designed to help delineate those issues for further work. I'll work through this somewhat backwards, identifying the collections that are special cases first.

### No files and no weblinks

How did these collections get populated? These are a little bit mysterious, but we'll probably need to either track down a new source for these records or else generate something like a GeoJSON structure with point geometry and a set of the essential properties from the ScienceBase Items, stash that file on the items, and then write a processor for those.
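
The GeoJSON idea might be sketched like this. It assumes each child item exposes a point as `spatial.representationalPoint` in `[longitude, latitude]` order, which is an assumption about the Item model rather than something shown in this notebook. `items_to_geojson` and the sample records are hypothetical.

```python
import json

def items_to_geojson(child_items, properties=('title',)):
    """Build a GeoJSON FeatureCollection of points from items that carry a point location."""
    features = []
    for item in child_items:
        # Assumption: the point is stored as [lon, lat] under spatial.representationalPoint
        point = item.get('spatial', {}).get('representationalPoint')
        if not point or len(point) != 2:
            continue  # skip items with no usable point
        props = {p: item.get(p) for p in properties}
        props['id'] = item.get('id')
        features.append({
            'type': 'Feature',
            'geometry': {'type': 'Point', 'coordinates': [point[0], point[1]]},
            'properties': props,
        })
    return {'type': 'FeatureCollection', 'features': features}

# Made-up child records for illustration only
sample_children = [
    {'id': 'x1', 'title': 'Core 1', 'spatial': {'representationalPoint': [-92.5, 38.3]}},
    {'id': 'x2', 'title': 'Core 2'},
]
geojson = items_to_geojson(sample_children)
print(json.dumps(geojson, indent=2))
```

The resulting structure could be written out with `json.dump` and attached to the collection item as the file for a processor to read.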

In [10]:
collections_populated_nofiles_noweblinks = [c for c in r_ndc_collections['items'] if c['hasChildren'] and 'files' not in c and 'webLinks' not in c]

for collection in collections_populated_nofiles_noweblinks:
    print(collection['title'])
    print(collection['link']['url'])
    print('--------')

print('Total:', len(collections_populated_nofiles_noweblinks))


Collection of Scanned Bedrock Geologic Paper Maps
https://www.sciencebase.gov/catalog/item/57a9f3fde4b05e859be05d7b
--------
Collection of Scanned Surficial Geologic Paper Maps
https://www.sciencebase.gov/catalog/item/57aa0a62e4b05e859be06621
--------
Collection of Rock cores from MO
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df330
--------
Rock Thin Sections Collection from VA
https://www.sciencebase.gov/catalog/item/4f4e4ad9e4b07f02db684e88
--------
Total: 4


### No files but do have web links

Next are collections that do not have files attached but do have web links. For the most part, these should be items with web accessible folder references that can be harvested.

But wait...

Only a couple of these actually have WAF URLs specified, so it looks like there are some other mystery items we'll need to clean up.
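
To see what the non-WAF links actually are, one option is to tally web link types across these collections. The sketch below does that; `weblink_type_counts` is a hypothetical helper, and the sample records (including the `webLink` type value) are made up for illustration.

```python
from collections import Counter

def weblink_type_counts(collections):
    """Count web links by their 'type' value across a list of collection items."""
    counts = Counter()
    for collection in collections:
        for link in collection.get('webLinks', []):
            # Links without a type are counted under a placeholder label
            counts[link.get('type', '(no type)')] += 1
    return counts

# Made-up sample records for illustration only
sample_collections = [
    {'title': 'A', 'webLinks': [{'type': 'WAF', 'uri': 'http://example.org/waf/'}]},
    {'title': 'B', 'webLinks': [{'type': 'webLink', 'uri': 'http://example.org/'},
                                {'uri': 'http://example.org/x'}]},
]
print(weblink_type_counts(sample_collections))
```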

In [11]:
collections_populated_nofiles_weblinks = [c for c in r_ndc_collections['items'] if c['hasChildren'] and 'files' not in c and 'webLinks' in c]

for collection in collections_populated_nofiles_weblinks:
    # Take the first web link typed as WAF, if any; .get() tolerates links with no type
    waf_url = next((link['uri'] for link in collection['webLinks'] if link.get('type') == 'WAF'), None)
    print(collection['title'])
    print(collection['link']['url'])
    print('WAF URL', waf_url)
    print('--------')

print('Total:', len(collections_populated_nofiles_weblinks))


Collection of well construction reports from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e4839e4b07f02db4efed3
WAF URL None
--------
Collection of thin sections and polished sections from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df131
WAF URL None
--------
Collection of rock cores from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e4b18e4b07f02db6a6ff8
WAF URL None
--------
Collection of field notes from Illinois
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df141
WAF URL None
--------
Collection of sediment samples from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df159
WAF URL None
--------
Arizona Department of Mines and Mineral Resources Photo Archive
https://www.sciencebase.gov/catalog/item/5009bd91e4b0612f70e97a96
WAF URL http://repository.stategeothermaldata.org/resources/metadata/DataPres2012-13MineFileInventory/ADMMR_PhotosA-Z/
--------
Collection of Borehole geophysical logs from MN
https://

### Now for the majority of items with files

For collections that have only a single processable file, the route is straightforward: the one onboard file should be the one we need to pull into the index for that collection. The challenge is that quite a number of these collections have more than one file onboard. In some cases, data owners are actively maintaining their collections, either in the ScienceBase interface or through the NDC Dashboard app, loading different versions of files over time and processing them to generate child items in the current architecture. In other cases, it appears that USGS staff helped data providers along by modifying original source files and setting those up to be loaded, leaving both the original file and a modified file on board the items. In still others, it appears that the number of collection items was high enough that data owners split them across multiple files.

From a data management perspective, it's great that data owners and USGS support staff keep all versions of files in the repository. However, there is not always a clear way to identify the files that represent the current inventory for a given collection. We can't really use the date/time stamp for when files were modified or the IDs of the users who uploaded the files to tease this out, because we don't really know the sequence of events. I took a look at using the "processed" flag, which is a ScienceBase-specific thing indicating that the file was run through the process to put its records into child items. However, that doesn't really do anything for us either, since there is still no indication of which process was run last.

In other use cases, we've used the title property in the file object part of the ScienceBase Item model to store a text string indicating the purpose of a given file. This is really the only property that can be manipulated in the system, other than some type of file name convention, which is a losing proposition given the varied ways people have of using file naming in their own processes. The other option would be to include some type of manifest as a separate file that provides essentially a little bit of additional metadata for every file attached to a collection. Either one of these approaches could work programmatically but comes with other issues in terms of sustaining and scaling the method. There are also cases like [this one](https://www.sciencebase.gov/catalog/item/5be068b7e4b0b3fc5cf33543) where data owners have already used the title property to provide other information that still doesn't unambiguously indicate file purpose but probably shouldn't be messed with. Some form of external manifest, entity/attribute, or lookup function is likely the way I'll need to go to determine what to process in collections across the board. 
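
To make the manifest option concrete, here is a minimal sketch of what such a file and its lookup might look like. Everything in it is hypothetical: the key names (`files`, `name`, `purpose`, `supersedes`), the purpose values, and the inventory file names are illustrations, not an existing convention.

```python
import json

# Hypothetical manifest content; key names and purpose values are illustrative only
manifest_text = '''
{
  "files": [
    {"name": "inventory_2017.csv", "purpose": "superseded", "supersedes": null},
    {"name": "inventory_2018.csv", "purpose": "current_inventory", "supersedes": "inventory_2017.csv"},
    {"name": "metadata.xml", "purpose": "collection_survey_metadata", "supersedes": null}
  ]
}
'''

def files_to_process(manifest, attached_file_names):
    """Return the attached files that the manifest marks as the current inventory."""
    wanted = {f['name'] for f in manifest['files'] if f['purpose'] == 'current_inventory'}
    return [name for name in attached_file_names if name in wanted]

manifest = json.loads(manifest_text)
print(files_to_process(manifest, ['inventory_2017.csv', 'inventory_2018.csv', 'metadata.xml']))
```

A lookup like this keeps the file title property untouched, at the cost of one more file per collection that someone has to keep current.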

The NDC Dashboard, if that were the only way of maintaining the NDC, would provide the necessary layer of abstraction to impose business rules on how ScienceBase operates in this case. The upload app built into the NDC Dashboard could be tailored to flag files for processing, giving users options such as uploading batches of files, designating that new files should replace old ones, or other necessary dynamics. These processes could result in either files being flagged in a particular way in the ScienceBase Item model or some type of manifest being maintained to be consulted for file purpose.

Another approach could be to move the file management aspect of the NDC into a different type of file management platform that would support actual versioning and more robust ways of providing file-specific metadata. There is, reportedly, a new sbFiles component being developed for ScienceBase that might provide some options, or collection items could reference a third party file management solution via a web link or some other means.

The following crude listing runs through all collection items that have current child items in ScienceBase and do have files on board, and lists the number of files by content type. My next step will be to start winnowing this down to determine as many processable files as I can and then get to a more tractable list of those cases where we will likely need to do some additional sleuthing to determine what to process. I think I can reasonably add some titles to onboard files when their purpose is clear to help narrow the field. I can at least "set aside" the "metadata.xml" files that were part of the original collection survey information, and I may be able to verify that the single-file cases correspond to the appropriate records for some collections.

I'll then come back and revisit this to provide a more robust report.
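
Separately from the crude listing below, a first pass at the winnowing step might look like the following sketch. It separates collections with exactly one candidate file from everything else, ignoring every file named `metadata.xml`. `split_by_candidate_count` and the sample records are hypothetical.

```python
def split_by_candidate_count(collections, ignore_names=('metadata.xml',)):
    """Partition collections into single-candidate-file cases and cases needing review."""
    single, needs_review = [], []
    for collection in collections:
        candidates = [f['name'] for f in collection.get('files', [])
                      if f['name'] not in ignore_names]
        if len(candidates) == 1:
            single.append((collection['title'], candidates[0]))
        else:
            # Zero candidates or several: someone needs to look at these
            needs_review.append((collection['title'], candidates))
    return single, needs_review

# Made-up sample records for illustration only
sample_items = [
    {'title': 'One file', 'files': [{'name': 'a.csv'}, {'name': 'metadata.xml'}]},
    {'title': 'Two files', 'files': [{'name': 'b1.csv'}, {'name': 'b2.csv'}]},
    {'title': 'Only metadata', 'files': [{'name': 'metadata.xml'}]},
]
single, needs_review = split_by_candidate_count(sample_items)
print(single)
print(needs_review)
```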

In [12]:
collections_populated_files = [c for c in r_ndc_collections['items'] if c['hasChildren'] and 'files' in c]

for collection in collections_populated_files:
    # Group file names by content type
    files_by_type = {}
    for f in collection['files']:
        files_by_type.setdefault(f['contentType'], []).append(f['name'])
    print(collection['title'])
    print(collection['link']['url'])

    # Set aside the metadata.xml files from the original collection survey; they are irrelevant in this exercise
    try:
        files_by_type['application/xml'].remove('metadata.xml')
    except (KeyError, ValueError):
        # No XML files on this item, or no metadata.xml among them
        pass
    
    for k,v in files_by_type.items():
        if len(v) > 0:
            print(k, " - ", len(v))
        if len(v) > 1:
            print(v)
    print('==================')

print('Total:', len(collections_populated_files))


Anderson Mine collection
https://www.sciencebase.gov/catalog/item/504f6057e4b03f3ccc029062
text/plain; charset=windows-1252  -  1
W. H. Crutchfield, Jr. mining collection
https://www.sciencebase.gov/catalog/item/502e9777e4b0ca196f3915cd
text/plain; charset=windows-1252  -  1
Kelsey Boltz mining collection inventory
https://www.sciencebase.gov/catalog/item/505b5f6ae4b08c986b30c279
text/plain; charset=windows-1252  -  1
Montana Mining Property File Collection
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df1b1
text/csv  -  1
Meredith Johnson, Former State Geologist, Field Notes
https://www.sciencebase.gov/catalog/item/5775405fe4b07dd077c70874
text/csv  -  1
Minerals commodities books
https://www.sciencebase.gov/catalog/item/57755b2ce4b07dd077c708dd
text/csv  -  1
Collection of Rock cores from Connecticut
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da90e
text/plain  -  4
['06_112941_CTmetadataCSV.csv', '07_114830_CTmetadataCSV.csv', '14_141856_CTmetadataCSV.c

text/plain  -  1
Collection of Geologic Maps from Iowa
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da9cf
text/csv  -  1
Collection of Rock Cuttings from Kentucky
https://www.sciencebase.gov/catalog/item/4f4e4b32e4b07f02db6b4a4e
application/xml  -  2
['oilgas_cuttings_1112017.xml', 'oilgas_cuttings_1112017.xml']
Collection of Rock Core Analyses from Michigan
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df1cf
text/plain; charset=windows-1252  -  1
Collection of Hardrock Mineral Core from Alaska
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da76d
text/csv  -  2
['bom_rebox_project_NDC_metadata_v1.csv', 'blm_pulps_ndc_metadata.csv']
Collection of Outcrop Locations from Kentucky
https://www.sciencebase.gov/catalog/item/4f4e4b32e4b07f02db6b4a47
application/xml  -  1
Collection of maps from Arizona
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df44d
application/xml  -  1
Collection of Petroleum Hydrocarbon Chromatographic Analyses of 