I did some utility work on the ScienceBase items that we are likely to keep in that part of the infrastructure as we rearchitect how the actual items within collections are handled. I applied one of three tags to each item depending on its function. The tags come from a new vocabulary I set up in ScienceBase-Vocab, and its terms are shown below.

In [19]:
import requests
from IPython.display import display

parentId = '4f4e4760e4b07f02db47dfb4'
queryRoot = 'https://www.sciencebase.gov/catalog/items?format=json&max=1000&'

These are the terms I used to classify the items that will remain in ScienceBase by function. In the immediate term, this lets me easily tease out the items that represent actual collections, where I should expect to find either some method of interfacing with the items in the collection (file, web link, etc.) or nothing at the moment. The collection items are also where we will concentrate efforts on metadata improvement.

In [2]:
item_type_vocab = requests.get('https://www.sciencebase.gov/vocab/5bf3f7bce4b00ce5fb627d57/terms?nodeType=term&parentId=5bf3f7bce4b00ce5fb627d57&max=10&offset=0&format=json').json()

item_types = [{'name':t['name'],'label':t['label'],'description':t['description'],'scheme':t['scheme']} for t in item_type_vocab['list']]

display(item_types)


[{'description': 'Item that represents a logical collection of physical data items managed by an organization contributing to the National Digital Catalog. These are core metadata items for which we expect to find full metadata describing the collection and a method of accessing the contents in the collection.',
  'label': 'NDC Collection',
  'name': 'ndc_collection',
  'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'},
 {'description': 'Denotes a ScienceBase Item that functions as an organizational folder within the National Digital Catalog.',
  'label': 'NDC Folder',
  'name': 'ndc_folder',
  'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'},
 {'description': 'Item that represents a contributing organization. Used to set permissions in ScienceBase to allow management access by members of the organization.',
  'label': 'NDC Organization',
  'name': 'ndc_organization',
  'scheme': 'https://www.sciencebase.gov/voc

# Organizations

Organization records form the essential containers for managing collections in the NDC. The following sections pull organization items and start to set things up to run some health checks on the organization records.

In [20]:
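# Look up the name/scheme pair for the organization tag in the vocabulary terms
tag_scheme_orgs = next(({'name':ts['name'],'scheme':ts['scheme']} for ts in item_types if ts['name'] == 'ndc_organization'), None)
fields_orgs = 'title,contacts,spatial'
# The tag dict is interpolated as-is into the filter value (see the printed URL)
sb_query_orgs = f'{queryRoot}fields={fields_orgs}&folderId={parentId}&filter=tags%3D{tag_scheme_orgs}'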
print(sb_query_orgs)

https://www.sciencebase.gov/catalog/items?format=json&max=1000&fields=title,contacts,spatial&folderId=4f4e4760e4b07f02db47dfb4&filter=tags%3D{'name': 'ndc_organization', 'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'}


In [21]:
# Retrieve the org items
r_ndc_org = requests.get(sb_query_orgs).json()

I pulled two essential properties for the org items that will likely be used to help improve metadata for the collection items in these containers:

* Contacts contains a "Data Owner" type contact that should be a reasonably populated responsible party entity for these records. These should be reviewed for current information as they look like contacts that I added a long time ago to the ScienceBase Directory. I did go through and do a little cleanup manually in ScienceBase in a few cases where I did not see Data Owner contacts listed.
* The spatial property here contains a reasonable bounding box for most of the items generated by tying a state ID to the items in ScienceBase. This might be used to generate a bounding box for collection items to build at least reasonable harvestable metadata for cases where there are not actual items presented in some way as yet.

Note that in going through and classifying everything in the NDC, I came across a few other items buried a little way down the hierarchy that I classed as "ndc_organization." So it is not just the top-level items in the NDC collection that are considered organizations.
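
As a starting point for those health checks, here is a minimal sketch that flags organization records missing either of the two properties described above. The function name `check_org_record` and the sample records are hypothetical, and the sketch assumes the bounding box sits under `spatial` in a `boundingBox` key, which the truncated output here does not confirm.

```python
def check_org_record(item):
    """Return a list of problems found on one organization item."""
    problems = []

    # The source records show contacts carrying 'type': 'Data Owner'
    contacts = item.get('contacts') or []
    if not any(c.get('type') == 'Data Owner' for c in contacts):
        problems.append('no Data Owner contact')

    # Assumption: the bounding box is stored at spatial -> boundingBox
    bbox = (item.get('spatial') or {}).get('boundingBox')
    if not bbox:
        problems.append('no bounding box')

    return problems


# Hand-made records standing in for r_ndc_org['items'] (hypothetical values)
sample_orgs = [
    {'title': 'Org A',
     'contacts': [{'type': 'Data Owner', 'name': 'Org A'}],
     'spatial': {'boundingBox': {'minX': -104.0, 'maxX': -96.5,
                                 'minY': 45.9, 'maxY': 49.0}}},
    {'title': 'Org B',
     'contacts': [{'type': 'Point of Contact', 'name': 'Someone'}]},
]

org_report = {o['title']: check_org_record(o) for o in sample_orgs}
print(org_report)
```

Running the same function over `r_ndc_org['items']` would give a quick list of organizations to review by hand.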

In [22]:
print(len(r_ndc_org['items']))
display(r_ndc_org['items'])

48


[{'contacts': [{'active': True,
    'contactType': 'organization',
    'logoUrl': 'http://my.usgs.gov/static-cache/images/dataOwner/v1/logosMed/NDLogo.gif',
    'name': 'North Dakota Geological Survey',
    'oldPartyId': 18256,
    'onlineResource': 'https://www.dmr.nd.gov/ndgs/',
    'primaryLocation': {'mailAddress': {'city': 'Bismarck',
      'line1': '600 East Boulevard Avenue',
      'state': 'ND',
      'zip': '58505-0840'},
     'name': 'North Dakota Geological Survey',
     'officePhone': '7013288000',
     'streetAddress': {'city': 'Bismarck',
      'line1': '1016 E. Calgary Ave.',
      'state': 'ND',
      'zip': '58503'}},
    'smallLogoUrl': 'http://my.usgs.gov/static-cache/images/dataOwner/v1/logosSmall/NDLogo.gif',
    'type': 'Data Owner'}],
  'id': '4f4e4761e4b07f02db47dfe0',
  'link': {'rel': 'self',
   'url': 'https://www.sciencebase.gov/catalog/item/4f4e4761e4b07f02db47dfe0'},
  'relatedItems': {'link': {'rel': 'related',
    'url': 'https://www.sciencebase.gov/cata

# Collections

The real meat of this work will be in the collection items.

In [26]:
tag_scheme_collections = next(({'name':ts['name'],'scheme':ts['scheme']} for ts in item_types if ts['name'] == 'ndc_collection'), None)
fields_collections = 'title,files,webLinks,hasChildren,facets,tags'
sb_query_collections = f'{queryRoot}fields={fields_collections}&folderId={parentId}&filter=tags%3D{tag_scheme_collections}'
print(sb_query_collections)

https://www.sciencebase.gov/catalog/items?format=json&max=1000&fields=title,files,webLinks,hasChildren,facets,tags&folderId=4f4e4760e4b07f02db47dfb4&filter=tags%3D{'name': 'ndc_collection', 'scheme': 'https://www.sciencebase.gov/vocab/category/NGGDPP/nggdpp_collection_types'}


In [27]:
r_ndc_collections = requests.get(sb_query_collections).json()

I'll be lazy for now and skip building a paginated method for working through collection items, since the total is just under the 1,000-item upper limit that ScienceBase returns in a single query.
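
If the collection count grows past that limit, a paginated approach could look something like the sketch below. It assumes each search response includes a `nextlink` entry whose `url` points at the next page; that key is my assumption about the response shape and should be checked against a live response. The `iter_items` name is hypothetical, and the fetch function is passed in so the paging logic can be tried without hitting the network.

```python
def iter_items(first_url, fetch, max_pages=50):
    """Yield items from every page of a search result.

    fetch is any callable that takes a URL and returns the decoded JSON.
    Assumption: each page has an 'items' list and, while more results
    remain, a 'nextlink' dict whose 'url' is the next page to request.
    """
    url = first_url
    for _ in range(max_pages):  # hard stop so a bad link cannot loop forever
        if not url:
            break
        page = fetch(url) or {}
        yield from page.get('items', [])
        url = (page.get('nextlink') or {}).get('url')


# Fake two-page result set keyed by URL, standing in for real responses
fake_pages = {
    'page1': {'items': [{'id': 'a'}, {'id': 'b'}], 'nextlink': {'url': 'page2'}},
    'page2': {'items': [{'id': 'c'}]},
}
paged_ids = [item['id'] for item in iter_items('page1', fake_pages.get)]
print(paged_ids)
```

Against the real service, the call would be something like `list(iter_items(sb_query_collections, lambda u: requests.get(u).json()))`.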

In [28]:
print(len(r_ndc_collections['items']))

972


## Collections with Child Items

First, we can presume for now that every collection item that currently has children is one where we should find some type of processable file or WAF link. I know that there are some exceptions to this rule, and this exploration is designed to help delineate those issues for further work. I'll work through this somewhat backwards, identifying the collections that are special cases first.

### No files and no weblinks

How did these collections get populated? These are a little bit mysterious, but we'll probably need to either track down a new source for these records or else generate something like a GeoJSON structure with point geometry and a set of the essential properties from the ScienceBase Items, stash that file on the items, and then write a processor for those.
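
To make the GeoJSON option concrete, here is a rough sketch of turning child items into a FeatureCollection of points. It assumes each child item exposes a `[longitude, latitude]` pair at `spatial` -> `representationalPoint`, which I have not verified for these collections. The `items_to_geojson` name, the choice of properties, and the sample records are all hypothetical.

```python
import json


def items_to_geojson(items):
    """Build a GeoJSON FeatureCollection of points from item dicts.

    Assumption: usable items carry spatial -> representationalPoint as
    [longitude, latitude]. Items without a point are skipped.
    """
    features = []
    for item in items:
        point = (item.get('spatial') or {}).get('representationalPoint')
        if not point:
            continue
        features.append({
            'type': 'Feature',
            'geometry': {'type': 'Point', 'coordinates': list(point)},
            # A small set of essential properties carried over from the item
            'properties': {
                'id': item.get('id'),
                'title': item.get('title'),
                'url': (item.get('link') or {}).get('url'),
            },
        })
    return {'type': 'FeatureCollection', 'features': features}


# Hand-made child items (hypothetical ids and coordinates)
sample_children = [
    {'id': 'x1', 'title': 'Core 1',
     'spatial': {'representationalPoint': [-92.1, 38.5]}},
    {'id': 'x2', 'title': 'Core 2 (no location)'},
]
collection_geojson = items_to_geojson(sample_children)
print(json.dumps(collection_geojson, indent=2))
```

The resulting file could then be attached to the collection item and picked up by whatever processor gets written for it.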

In [36]:
# Collections with children but neither attached files nor web links
for collection in r_ndc_collections['items']:
    if collection['hasChildren'] and 'files' not in collection and 'webLinks' not in collection:
        print(collection['title'])
        print(collection['link']['url'])
        print('--------')
    

Collection of Scanned Bedrock Geologic Paper Maps
https://www.sciencebase.gov/catalog/item/57a9f3fde4b05e859be05d7b
--------
Collection of Scanned Surficial Geologic Paper Maps
https://www.sciencebase.gov/catalog/item/57aa0a62e4b05e859be06621
--------
Collection of Rock cores from MO
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df330
--------
Rock Thin Sections Collection from VA
https://www.sciencebase.gov/catalog/item/4f4e4ad9e4b07f02db684e88
--------


### No files but do have web links

Next are collections that do not have files attached but do have web links. For the most part, these should be items with web accessible folder references that can be harvested.

But wait...

Only a couple of these actually have WAF URLs specified, so it looks like there are some other mystery items we'll need to clean up.

In [38]:
# Collections with children and web links but no attached files
for collection in r_ndc_collections['items']:
    if collection['hasChildren'] and 'files' not in collection and 'webLinks' in collection:
        # Take the URI of the first web link typed as a WAF, if there is one
        waf_url = next((link['uri'] for link in collection['webLinks'] if link['type'] == 'WAF'), None)
        print(collection['title'])
        print(collection['link']['url'])
        print('WAF URL', waf_url)
        print('--------')


Collection of well construction reports from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e4839e4b07f02db4efed3
WAF URL None
--------
Collection of thin sections and polished sections from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df131
WAF URL None
--------
Collection of rock cores from Wisconsin
https://www.sciencebase.gov/catalog/item/4f4e4b18e4b07f02db6a6ff8
WAF URL None
--------
Collection of field notes from Illinois
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df141
WAF URL None
--------
Arizona Department of Mines and Mineral Resources Photo Archive
https://www.sciencebase.gov/catalog/item/5009bd91e4b0612f70e97a96
WAF URL http://repository.stategeothermaldata.org/resources/metadata/DataPres2012-13MineFileInventory/ADMMR_PhotosA-Z/
--------
Collection of Borehole geophysical logs from MN
https://www.sciencebase.gov/catalog/item/5bc0f6e1e4b0fc368eb70156
WAF URL None
--------
Collection of sediment samples from Wisconsin
https://
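
Since most of these collections have web links that are not WAFs, a quick tally of link types might help sort out what the mystery items actually point to. This is only a sketch: the `weblink_type_counts` name and the sample records are hypothetical, and the only link type taken from the records above is `WAF`.

```python
from collections import Counter


def weblink_type_counts(collections):
    """Count the 'type' values of web links across collection items."""
    counts = Counter()
    for collection in collections:
        for link in collection.get('webLinks', []):
            # Links without a type are grouped under a placeholder label
            counts[link.get('type', '(no type)')] += 1
    return counts


# Hand-made records standing in for the filtered collection items
sample_collections = [
    {'title': 'A', 'webLinks': [{'type': 'WAF', 'uri': 'http://example.org/waf/'}]},
    {'title': 'B', 'webLinks': [{'type': 'webLink', 'uri': 'http://example.org/page'},
                                {'uri': 'http://example.org/untyped'}]},
]
link_counts = weblink_type_counts(sample_collections)
print(link_counts)
```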

### Now for the majority of items with files