This notebook is scratch space for some relatively simple tweaks I'm making to ScienceBase Items in the NDC so that the system is better positioned for building new data indexing code against it. Writing changes to ScienceBase requires authentication through the sciencebasepy package (available on PyPI).

In [1]:
import sciencebasepy
from IPython.display import display

In [2]:
sb = sciencebasepy.SbSession()

This little function is something I might spruce up and put in a pynggdpp package I'm considering. It uses the ScienceBase Vocab to retrieve a "fully qualified" term for use. Another approach would be to generalize it and contribute it to the sciencebasepy package, but of course, the ScienceBase Vocab kind of sucks in terms of its long-term potential. I could, instead, put some time into developing a more robust vocabulary, express it through the ESIP Community Ontology Repository, and then build code around terms resolvable to that source.

In [3]:
import requests

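def ndc_collection_type_tag(tag_name):
    """Return a fully qualified theme tag from the ScienceBase Vocab, or None without a unique match."""
    vocab_search_url = 'https://www.sciencebase.gov/vocab/5bf3f7bce4b00ce5fb627d57/terms'
    # Passing params lets requests URL-encode the tag name
    params = {'nodeType': 'term', 'format': 'json', 'name': tag_name}
    r_vocab_search = requests.get(vocab_search_url, params=params).json()
    # Only accept an unambiguous match
    if len(r_vocab_search['list']) == 1:
        term = r_vocab_search['list'][0]
        return {'type': 'theme', 'name': term['name'], 'scheme': term['scheme']}
    return None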

In [28]:
username = input("Username:  ")
sb.loginc(str(username))

Username:  sbristol@usgs.gov
········


<sciencebasepy.SbSession.SbSession at 0x1123fe2e8>

# Set item type tags
I opted to use a simple vocabulary that sets items as ndc_organization, ndc_folder, or ndc_collection to help classify the primary items in the catalog by function. I did this in batches, being careful to review the items from a given data owner to see whether or not they did anything "out of the ordinary" before applying tags. The parent ID supplied in the first line of this block determined the batch of items to run through. The main thing I did here was to flag certain items as "folders," basically extraneous organizational constructs that some data owners decided to employ directly in ScienceBase. We may revisit this as we get into IGSN work, as there may be a desire to set these up as actual collections with subcollections.

In [None]:
collection_items = sb.get_child_ids('5ad902ade4b0e2c2dd27a82c')

# Look the tag up once rather than on every pass through the loop
collection_tag = ndc_collection_type_tag('ndc_collection')

item_count = 0
for sbid in collection_items:
    this_item = sb.get_item(sbid, {'fields':'tags'})
    # Skip anything already flagged as a folder
    is_folder = next((t for t in this_item.get('tags', []) if t['name'] == 'ndc_folder'), None)
    if is_folder is None:
        item = {'id':sbid,'tags':[collection_tag]}
        print(item)
        sb.update_item(item)
        item_count = item_count + 1

print('===========', item_count)
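
One caveat with the cell above: each update sends a `tags` list containing only the type tag. If ScienceBase replaces the list wholesale on update, which I am assuming here rather than having verified, any tags already on the item would be dropped. A minimal sketch of a merge helper that keeps existing tags follows; `merge_tag` is a hypothetical name, not part of sciencebasepy, and the scheme strings in the example are placeholders.

```python
def merge_tag(existing_tags, new_tag):
    """Return a new tag list that contains new_tag exactly once.

    Hypothetical helper: tags are compared on 'name' and 'scheme' only.
    """
    merged = list(existing_tags or [])
    already_there = any(
        t.get('name') == new_tag['name'] and t.get('scheme') == new_tag.get('scheme')
        for t in merged
    )
    if not already_there:
        merged.append(new_tag)
    return merged


# Example with made-up tags (scheme values are placeholders, not real vocab URLs)
existing = [{'type': 'theme', 'name': 'geology', 'scheme': 'example-scheme'}]
type_tag = {'type': 'theme', 'name': 'ndc_collection', 'scheme': 'example-vocab'}
merged_tags = merge_tag(existing, type_tag)
print([t['name'] for t in merged_tags])
```

In the loop, the update payload could then be built as `{'id': sbid, 'tags': merge_tag(this_item.get('tags'), ndc_collection_type_tag('ndc_collection'))}`.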

# Identify and flag metadata.xml files
In the original setup of the NDC from its roots in the Comprehensive Science Catalog, the results of a survey for collections from the State Geological Surveys were pulled from a Filemaker database into "metadata.xml" files that were processed into the Item model. These files are still onboard the ScienceBase Items, which is a reasonable thing to do and keep around in case we want to reprocess them in a different way. It seems reasonable to go ahead and verify these files and flag them with a title so that they can be separated out from files to examine for possible collection item processing.

Just to make sure I don't inadvertently flag something wrong, I'll write this process to open up and look at the individual "metadata.xml" files to ensure they are what I think they are before setting a title property.

In [29]:
parentId = '4f4e4760e4b07f02db47dfb4'
queryRoot = 'https://www.sciencebase.gov/catalog/items?format=json&max=1000&'
tag_scheme_collections = ndc_collection_type_tag('ndc_collection')
fields_collections = 'title,files'
sb_query_collections = f'{queryRoot}fields={fields_collections}&folderId={parentId}&filter=tags%3D{tag_scheme_collections}'
r_ndc_collections = requests.get(sb_query_collections).json()

In [31]:
for collection in r_ndc_collections['items']:
    # Only items that carry a metadata.xml file are of interest
    f_metadata_xml = next((f for f in collection.get('files', []) if f['name'] == 'metadata.xml'), None)
    if f_metadata_xml is None:
        continue

    # Confirm the file is a survey export: the <NGGDPP> root tag is expected at characters 39-46
    if requests.get(f_metadata_xml['url']).text[39:47] == '<NGGDPP>':
        # Title the metadata.xml file; all other files go back unchanged
        new_files = []
        for f in collection['files']:
            if f['name'] == 'metadata.xml':
                f['title'] = 'Collection Metadata Source File'
            new_files.append(f)
        new_item = {'id':collection['id'], 'files':new_files}
        print(sb.update_item(new_item)['link']['url'])

https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df544
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da8ec
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da976
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df543
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4e0
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df363
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df226
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df294
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df12f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df160
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d917a
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d906f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df1ae
https://www.sciencebase.gov/catalog/item/4f4e4acae4b07f02db67d22b
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df2af
https://ww

https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df551
https://www.sciencebase.gov/catalog/item/4f4e4acae4b07f02db67d228
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df493
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df2ea
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d903b
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4e2
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df516
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df51b
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df51d
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df522
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4cc
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df407
https://www.sciencebase.gov/catalog/item/4f4e49cbe4b07f02db5d891f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df3fd
https://www.sciencebase.gov/catalog/item/4f4e479de4b07f02db491dcf
https://ww

https://www.sciencebase.gov/catalog/item/4f4e496fe4b07f02db5a3df0
https://www.sciencebase.gov/catalog/item/4f4e496fe4b07f02db5a3df2
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df29a
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df517
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da806
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df230
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4ad
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4ae
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4b7
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d90a8
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df221
https://www.sciencebase.gov/catalog/item/4f4e49cbe4b07f02db5d85d7
https://www.sciencebase.gov/catalog/item/4f4e49cbe4b07f02db5d8956
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df48f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df515
https://ww

https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df239
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df200
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da879
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da71e
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da78f
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da72f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4eb
https://www.sciencebase.gov/catalog/item/4f4e4acae4b07f02db67d224
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4a5
https://www.sciencebase.gov/catalog/item/4f4e48b1e4b07f02db5305cf
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df4aa
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df31b
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d908e
https://www.sciencebase.gov/catalog/item/4f4e4a94e4b07f02db658d4b
https://www.sciencebase.gov/catalog/item/4f4e4a94e4b07f02db658d86
https://ww

https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df506
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d90a6
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df1d5
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df234
https://www.sciencebase.gov/catalog/item/4f4e496fe4b07f02db5a3dea
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df539
https://www.sciencebase.gov/catalog/item/4f4e49cfe4b07f02db5da83f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df19a
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df235
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df48b
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df48a
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df43f
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df494
https://www.sciencebase.gov/catalog/item/4f4e49d8e4b07f02db5df49a
https://www.sciencebase.gov/catalog/item/4f4e49cce4b07f02db5d90d7
https://ww

In [34]:
# Just to make sure
different_purpose_metadata_xml = [c for c in r_ndc_collections['items'] if 'files' in c.keys() and next((f for f in c['files'] if f['name'] == 'metadata.xml' and 'title' in f.keys() and f['title'] != 'Collection Metadata Source File'), None) is not None]
display(different_purpose_metadata_xml)    

[]
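
The `text[39:47]` comparison used above depends on the root tag sitting at exactly that offset, so a file with a different XML declaration or extra leading whitespace would be skipped silently. Here is a sketch of a more tolerant check that parses the document and compares the root element name. `is_nggdpp_metadata` is a hypothetical helper, and the sample XML strings are invented for illustration, not taken from real files.

```python
import xml.etree.ElementTree as ET


def is_nggdpp_metadata(xml_text):
    """Return True if xml_text parses as XML with an <NGGDPP> root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be one of the survey exports
        return False
    return root.tag == 'NGGDPP'


# Invented sample documents for illustration only
survey_export = '<?xml version="1.0" encoding="UTF-8"?>\n<NGGDPP><collection/></NGGDPP>'
other_metadata = '<?xml version="1.0"?>\n<metadata><idinfo/></metadata>'
print(is_nggdpp_metadata(survey_export), is_nggdpp_metadata(other_metadata))
```

This does not handle a namespaced root element, which I am assuming the survey exports do not use.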