Match records archived at NCEI against records available through the IOOS Data Catalog. The goal is a list of IOOS datasets that are not yet at NCEI.

Outline of process:
1. Build a dataframe of non-federal buoy datasets and metadata from the IOOS Catalog.
2. Use that dataframe to search NCEI for matching datasets affiliated with IOOS.
3. Identify which datasets are not at NCEI that should be.
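Step 3 above amounts to an anti-join between catalog titles and the platform names NCEI already lists. A minimal sketch, assuming loose substring matching on normalized names is good enough; the helper names `normalize` and `missing_from_ncei` are hypothetical and not part of the borrowed notebooks:

```python
import pandas as pd


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so names compare loosely."""
    return " ".join(name.lower().split())


def missing_from_ncei(catalog_titles, ncei_platforms):
    """Return catalog titles that contain no known NCEI platform name.

    Assumption: an NCEI platform keyword appearing as a substring of the
    catalog title counts as a match. Real titles may need fuzzier rules.
    """
    ncei_norm = [normalize(p) for p in ncei_platforms]
    rows = []
    for title in catalog_titles:
        t = normalize(title)
        matched = next((p for p in ncei_norm if p in t), None)
        rows.append({"title": title, "ncei_match": matched})
    df = pd.DataFrame(rows, columns=["title", "ncei_match"])
    return df[df["ncei_match"].isna()].reset_index(drop=True)


# Illustrative inputs only: one title and one keyword taken from this notebook.
not_at_ncei = missing_from_ncei(
    ["(CMOP) Elliott Point", "Indian Island Station"],
    ["Indian Island station"],
)
print(not_at_ncei["title"].tolist())
```

Substring matching is deliberately crude; it keeps false "missing" reports low for exact station names but will not catch renamed or abbreviated platforms.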

Borrow code from:
* https://ioos.github.io/ioos_code_lab/content/code_gallery/data_access_notebooks/2017-06-12-NCEI_RA_archive_history.html
* https://ioos.github.io/ioos_code_lab/content/code_gallery/data_access_notebooks/2024-09-17-CKAN_API_Query.html

In [1]:
from ckanapi import RemoteCKAN

ioos_catalog = RemoteCKAN(
    address="https://data.ioos.us",
    user_agent="ckanapiioos/1.0 (+https://ioos.us/)",
)


ioos_catalog

<ckanapi.remoteckan.RemoteCKAN at 0x2159af2dbe0>

In [2]:
orgs = ioos_catalog.action.organization_list()
print(orgs)

['aoos', 'caricoos', 'cdip', 'cencoos', 'comt', 'gcoos', 'glider-dac', 'glos', 'hf-radar-dac', 'ioos', 'maracoos', 'nanoos', 'neracoos', 'noaa-co-ops', 'noaa-ndbc', 'oceansites', 'pacioos', 'sccoos', 'secoora', 'unidata', 'usgs', 'us-navy']


In [3]:
datasets = ioos_catalog.action.package_search()
datasets["count"]

44142
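The total above covers every organization in the catalog. To see how much of it belongs to each organization, one option is to repeat the search with an `fq` filter and `rows=0`, so only the count comes back. This is a sketch under the assumption that the catalog accepts `rows=0`; `count_by_org` is a hypothetical helper, and it works with any object exposing `action.package_search`:

```python
def count_by_org(catalog, org_names):
    """Return {organization name: dataset count}.

    Assumption: `catalog.action.package_search` accepts `fq` and `rows`
    keyword arguments and returns a dict with a "count" key, as the
    call in the cell above does.
    """
    counts = {}
    for org in org_names:
        result = catalog.action.package_search(fq=f"organization:{org}", rows=0)
        counts[org] = result["count"]
    return counts


# Usage with the objects defined above (requires network access):
# count_by_org(ioos_catalog, orgs)
```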

## Two options

1. Go accession by accession and extract the following info:

Let's do some testing with the following NCEI accession:
https://www.ncei.noaa.gov/data/oceans/ncei/archive/metadata/approved/granule/0171311.xml

```xml
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gco:CharacterString>Indian Island station</gco:CharacterString>
```

```xml
<gmd:citedResponsibleParty>
<gmd:CI_ResponsibleParty>
<gmd:organisationName>
<gmx:Anchor xlink:href="https://ror.org/028paz341" xlink:actuate="onRequest">Central and Northern California Ocean Observing System</gmx:Anchor>
```
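The two fragments above can be pulled out of an accession record with the standard library. A minimal sketch, assuming the record uses the usual ISO 19139 namespaces and that the organization name sits in a `gmx:Anchor` as shown; `parse_accession` and the inline `SAMPLE` are illustrative, and records that store the name in a `gco:CharacterString` instead would need an extra path:

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespace URIs.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
    "gmx": "http://www.isotc211.org/2005/gmx",
}


def parse_accession(xml_text):
    """Extract keyword strings and organisation names from ISO XML text."""
    root = ET.fromstring(xml_text)
    keywords = [
        el.text.strip()
        for el in root.iterfind(".//gmd:keyword/gco:CharacterString", NS)
        if el.text
    ]
    orgs = {
        el.text.strip()
        for el in root.iterfind(".//gmd:organisationName/gmx:Anchor", NS)
        if el.text
    }
    return {"keywords": keywords, "organisations": sorted(orgs)}


# Trimmed-down stand-in for the real record, built from the fragments above.
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco"
    xmlns:gmx="http://www.isotc211.org/2005/gmx"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <gmd:descriptiveKeywords><gmd:MD_Keywords><gmd:keyword>
    <gco:CharacterString>Indian Island station</gco:CharacterString>
  </gmd:keyword></gmd:MD_Keywords></gmd:descriptiveKeywords>
  <gmd:citedResponsibleParty><gmd:CI_ResponsibleParty><gmd:organisationName>
    <gmx:Anchor xlink:href="https://ror.org/028paz341">Central and Northern California Ocean Observing System</gmx:Anchor>
  </gmd:organisationName></gmd:CI_ResponsibleParty></gmd:citedResponsibleParty>
</gmd:MD_Metadata>"""

print(parse_accession(SAMPLE))
```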

2. Use the collection-level records to get what we need.

For example, CeNCOOS:
https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:IOOS-CeNCOOS;view=xml;responseType=text/xml

```xml
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gco:CharacterString>Bodega Marine Laboratory seawater intake, Horseshoe Cove station,</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Cal Poly Pier San Luis Obispo station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>California Maritime pier Carquinez shore station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Fort Point Pier station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Hog Island Oyster Company Burkolator, Tomales Bay,</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Humboldt Bay Pier station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Humboldt Dock B Shore Station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Indian Island station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Monterey Bay Commercial Wharf station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Morro Bay (BS1) station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Morro Bay station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Moss Landing Marine Laboratory Seawater Intake Monitoring Station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Romberg Tiburon Center Pier station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Santa Cruz municipal wharf station</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>Trinidad Head station</gco:CharacterString>
</gmd:keyword>
<gmd:type>
<gmd:MD_KeywordTypeCode codeList="https://data.noaa.gov/resources/iso19139/schema/resources/Codelist/gmxCodelists.xml#MD_KeywordTypeCode" codeListValue="platform">platform</gmd:MD_KeywordTypeCode>
</gmd:type>
<gmd:thesaurusName>
<gmd:CI_Citation>
<gmd:title>
<gco:CharacterString>Provider Platform Names</gco:CharacterString>
</gmd:title>
<gmd:date gco:nilReason="inapplicable"/>
</gmd:CI_Citation>
</gmd:thesaurusName>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
```
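In the collection-level record, the station names are the keywords whose `MD_KeywordTypeCode` is `platform`. A sketch of extracting only those, assuming the same ISO 19139 namespaces; `platform_keywords` and `SAMPLE` are illustrative, and the second keyword block in the sample is a made-up non-platform entry included to show the filter:

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def platform_keywords(xml_text):
    """Return keywords from MD_Keywords blocks typed as 'platform'."""
    root = ET.fromstring(xml_text)
    names = []
    for block in root.iterfind(".//gmd:MD_Keywords", NS):
        code = block.find("gmd:type/gmd:MD_KeywordTypeCode", NS)
        if code is None or code.get("codeListValue") != "platform":
            continue  # skip theme, place, and untyped keyword blocks
        for el in block.iterfind("gmd:keyword/gco:CharacterString", NS):
            if el.text:
                # Some names in the record carry a trailing comma; drop it.
                names.append(el.text.strip().rstrip(","))
    return names


SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:descriptiveKeywords><gmd:MD_Keywords>
    <gmd:keyword><gco:CharacterString>Hog Island Oyster Company Burkolator, Tomales Bay,</gco:CharacterString></gmd:keyword>
    <gmd:keyword><gco:CharacterString>Indian Island station</gco:CharacterString></gmd:keyword>
    <gmd:type><gmd:MD_KeywordTypeCode codeListValue="platform">platform</gmd:MD_KeywordTypeCode></gmd:type>
  </gmd:MD_Keywords></gmd:descriptiveKeywords>
  <gmd:descriptiveKeywords><gmd:MD_Keywords>
    <gmd:keyword><gco:CharacterString>hypothetical theme keyword</gco:CharacterString></gmd:keyword>
    <gmd:type><gmd:MD_KeywordTypeCode codeListValue="theme">theme</gmd:MD_KeywordTypeCode></gmd:type>
  </gmd:MD_Keywords></gmd:descriptiveKeywords>
</gmd:MD_Metadata>"""

print(platform_keywords(SAMPLE))
```

One collection-level request per RA yields the full platform list, which is why this option needs far fewer requests than walking accessions one by one.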

## Grab info from NCEI

## Do the searching


In [131]:
import time
import pandas as pd
import json

from ckanapi import RemoteCKAN
from ckanapi.errors import CKANAPIError
from requests.exceptions import ChunkedEncodingError
from urllib3.exceptions import IncompleteRead

ua = "ckanapiioos/1.0 (+https://ioos.us/)"

ioos_catalog = RemoteCKAN("https://data.ioos.us", user_agent=ua)

df_ioos_catalog = pd.DataFrame()
df_plat = pd.DataFrame()

result_count = 0

platforms = ["Elliott Point"]
orgs = ["NANOOS"]

for org in orgs:
    org_ncei = org.lower()

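    for platform in platforms:

        # Start each query from a clean slate so paging restarts at offset 0
        # and earlier results are not appended twice.
        df_plat = pd.DataFrame()
        result_count = 0

        platform_ncei = platform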

        filter_query = f"organization:{org_ncei.lower()}"

        free_text_query = f"{platform_ncei.lower()}"

        # ioos_catalog.action.package_search(
        #             fq=filter_query, 
        #             q=free_text_query, 
        #             rows=500, 
        #             start=result_count,
        #         )

        while True:
            try:
                datasets = ioos_catalog.action.package_search(
                    fq=filter_query, 
                    q=free_text_query, 
                    rows=500, 
                    start=result_count,
                )
            except (CKANAPIError, IncompleteRead, ChunkedEncodingError):
                # Transient API or network error: pause briefly, then retry the same page.
                time.sleep(1)
                continue

            #result_count = datasets.shape[0]

            num_results = datasets["count"]
            
            print(f"num_results: {num_results}, result_count: {result_count}")

            for dataset in datasets["results"]:
                
                # maybe just read all metadata into a DataFrame.
                df = pd.DataFrame.from_dict(dataset, orient='index').T

                # for entry in dataset['extras']:
                #     if entry['key'] == 'temporal-extent-begin':
                #         start_date = entry['value']
                #     elif entry['key'] == 'temporal-extent-end':
                #         end_date= entry['value']
                #     elif entry['key'] == 'aggregation-info':
                #         my_list = json.loads(entry['value'])
                #         my_dict = {i: my_list[i] for i in range(len(my_list))}
                #         for agg in my_dict.keys():
                #             if my_dict[agg]['aggregate-dataset-identifier'] != "":
                #                 dtype = my_dict[agg]['aggregate-dataset-identifier']
                      
                # df = pd.DataFrame(
                #     {
                #         "title": [dataset["title"]],
                #         "url": [dataset["resources"][0]["url"]],
                #         "org": [dataset["organization"]["title"]],
                #         "platform": platform_ncei,
                #         'start_date':start_date,
                #         'end_date':end_date,
                #         'datatype': dtype,

                #     }
                # )

                df_plat = pd.concat([df_plat, df], ignore_index=True)
                
                result_count = df_plat.shape[0]

            if result_count >= num_results:
                print(f"num_results: {num_results}, result_count: {result_count}")
                break
            
        df_ioos_catalog = pd.concat([df_ioos_catalog, df_plat], ignore_index=True)

        print(
                f"num_results: {num_results}, result_count: {result_count}, total_result_count: {df_ioos_catalog.shape[0]}"
            )

num_results: 1, result_count: 0
num_results: 1, result_count: 1
num_results: 1, result_count: 1, total_result_count: 1


In [132]:
df_ioos_catalog

Unnamed: 0,author,author_email,creator_user_id,id,isopen,license_id,license_title,maintainer,maintainer_email,metadata_created,...,title,type,url,version,extras,resources,tags,groups,relationships_as_subject,relationships_as_object
0,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,856b00a5-4c7d-4f8c-8244-e77fb85e793e,False,,,,,2025-01-09T13:27:21.796623,...,(CMOP) Elliott Point,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]


In [130]:
pd.DataFrame.from_dict(dataset, orient='index').T

Unnamed: 0,author,author_email,creator_user_id,id,isopen,license_id,license_title,maintainer,maintainer_email,metadata_created,...,title,type,url,version,extras,resources,tags,groups,relationships_as_subject,relationships_as_object
0,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,856b00a5-4c7d-4f8c-8244-e77fb85e793e,False,,,,,2025-01-09T13:27:21.796623,...,(CMOP) Elliott Point,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]


## Query IOOS Catalog for appropriate datasets

Gather all the datasets associated with an IOOS Regional Association (RA), then filter to just buoys and similar platforms.

In [133]:
def ioos_ckan_query(ioos_catalog, filter_query, free_text_query):
    '''
    Query the IOOS catalog with a filter query and a free-text query,
    paging through the results 500 rows at a time.

    Parameters
    ----------
    ioos_catalog : RemoteCKAN
        Client used to query the IOOS catalog.
    filter_query : str
        Filter query (``fq``), e.g. ``"organization:nanoos"``.
    free_text_query : str
        Free-text query (``q``).

    Returns
    -------
    df_plat : pandas.DataFrame
        One row per dataset returned by the query.
    '''

    df_plat = pd.DataFrame()

    result_count = 0
    while True:
            try:
                datasets = ioos_catalog.action.package_search(
                    fq=filter_query, 
                    q=free_text_query, 
                    rows=500, 
                    start=result_count,
                )
            except (CKANAPIError, IncompleteRead, ChunkedEncodingError):
                # Transient API or network error: pause briefly, then retry the same page.
                time.sleep(1)
                continue

            #result_count = datasets.shape[0]

            num_results = datasets["count"]
            
            print(f"num_results: {num_results}, result_count: {result_count}")

            for dataset in datasets["results"]:
                df = pd.DataFrame.from_dict(dataset, orient='index').T
                # dtype = None
                # for entry in dataset['extras']:
                #     if entry['key'] == 'temporal-extent-begin':
                #         start_date = entry['value']
                #     elif entry['key'] == 'temporal-extent-end':
                #         end_date= entry['value']
                #     elif entry['key'] == 'platform':
                #         platform = entry['value']
                #     elif entry['key'] == 'aggregation-info':
                #         my_list = json.loads(entry['value'])
                #         my_dict = {i: my_list[i] for i in range(len(my_list))}
                #         for agg in my_dict.keys():
                #             if my_dict[agg]['aggregate-dataset-identifier'] != "":
                #                 dtype = my_dict[agg]['aggregate-dataset-identifier']

                # df = pd.DataFrame(
                #     {
                #         "title": [dataset["title"]],
                #         #"url": [dataset["resources"][0]["url"]],
                #         "org": [dataset["organization"]["title"]],
                #         #"platform": platform,
                #         'start_date':start_date,
                #         'end_date':end_date,
                #         'datatype': dtype,

                #     }
                # )

                df_plat = pd.concat([df_plat, df], ignore_index=True)
                
                result_count = df_plat.shape[0]

            if result_count >= num_results:
                print(f"num_results: {num_results}, result_count: {result_count}")
                break
            
    #df_ioos_catalog = pd.concat([df_ioos_catalog, df_plat], ignore_index=True)

    print(
            f"num_results: {num_results}, result_count: {result_count}, total_result_count: {df_plat.shape[0]}"
        )
    
    return df_plat
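The commented-out block inside the function hints at a narrower table (title, org, dates, datatype) built from each dataset's `extras`. A sketch of that idea as a standalone helper, assuming `extras` is a list of `{"key", "value"}` dicts and that `aggregation-info` holds a JSON list of dicts, as the commented code implies; `summarize_dataset` is a hypothetical name:

```python
import json


def summarize_dataset(dataset):
    """Reduce one package_search result to a flat dict of selected fields."""
    extras = {e["key"]: e["value"] for e in dataset.get("extras", [])}

    # Keep the last non-empty identifier, mirroring the commented-out loop.
    datatype = None
    try:
        aggregates = json.loads(extras.get("aggregation-info", "[]"))
    except (TypeError, ValueError):
        aggregates = []
    for agg in aggregates:
        if isinstance(agg, dict) and agg.get("aggregate-dataset-identifier"):
            datatype = agg["aggregate-dataset-identifier"]

    return {
        "title": dataset.get("title"),
        "org": (dataset.get("organization") or {}).get("title"),
        "start_date": extras.get("temporal-extent-begin"),
        "end_date": extras.get("temporal-extent-end"),
        "datatype": datatype,
    }


# Illustrative record using values that appear in this notebook's outputs.
example = {
    "title": "(CMOP) Elliott Point",
    "organization": {"title": "NANOOS"},
    "extras": [
        {"key": "temporal-extent-begin", "value": "2018-01-02T08:33:43+00:00"},
        {"key": "temporal-extent-end", "value": "2019-02-10T21:27:42+00:00"},
    ],
}
print(summarize_dataset(example))
```

Because missing keys come back as `None` rather than raising, a dataset with no `aggregation-info` entry still produces a row, which `pd.DataFrame([summarize_dataset(d) for d in datasets["results"]])` can then collect.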

## Actually do the querying

In [134]:
import time
import pandas as pd

from ckanapi import RemoteCKAN
from ckanapi.errors import CKANAPIError
from requests.exceptions import ChunkedEncodingError
from urllib3.exceptions import IncompleteRead

ua = "ckanapiioos/1.0 (+https://ioos.us/)"

ioos_catalog = RemoteCKAN("https://data.ioos.us", user_agent=ua)
df_ioos_catalog = pd.DataFrame()


platforms = ["Elliott Point"]
orgs = ["NANOOS"]

for org in orgs:
    org_ncei = org.lower()

    for platform in platforms:

        filter_query = f"organization:{org_ncei.lower()}"

        free_text_query = ""  # no platform filter: return every dataset for the organization
    
        df_ioos_catalog = pd.concat([df_ioos_catalog, ioos_ckan_query(ioos_catalog, filter_query, free_text_query)], ignore_index=True)

df_ioos_catalog

num_results: 169, result_count: 0
num_results: 169, result_count: 169
num_results: 169, result_count: 169, total_result_count: 169


Unnamed: 0,author,author_email,creator_user_id,id,isopen,license_id,license_title,maintainer,maintainer_email,metadata_created,...,title,type,url,version,extras,resources,tags,groups,relationships_as_subject,relationships_as_object
0,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,68a4c18a-ec0f-4c2d-9479-b96af1661f9c,False,,,,,2025-05-09T16:04:20.047386,...,Glider - Trinidad Head Line: 2019 September - ...,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'AUVS > Autonomous Underwate...,[],[],[]
1,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,b369815e-03fc-4980-836f-d9b98b53ec0b,False,,,,,2025-05-09T16:03:23.528924,...,Glider - Trinidad Head Line: 2015 September - ...,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'AUVS > Autonomous Underwate...,[],[],[]
2,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,7da90e03-f8aa-483e-96d0-7a27051b90b4,False,,,,,2025-04-11T14:32:17.541505,...,Backyard Buoys - NANOOS - Washington: Quileute...,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Backyard Buoys', 'id': 'f80...",[],[],[]
3,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,703a2dae-6784-4317-9463-dfd2cdfa4d6c,False,,,,,2025-05-09T16:03:49.832982,...,Glider - La Push Line: 2025 March - Ongoing,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'AUVS > Autonomous Underwate...,[],[],[]
4,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,d0ef6a3a-4894-43f3-b4ea-2a882dccc478,False,,,,,2025-01-09T02:08:28.700159,...,NPBY1 - Point Wells: Meteorological Station Data,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'Earth Science > Atmosphere ...,[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,bff06122-cf40-4611-b5b3-c8c79a71cfac,False,,,,,2025-01-09T13:26:01.855237,...,(APL-UW) Ćháʔba· UW/NANOOS Moore...,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
165,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,2449dd5c-57c5-43dd-a3d6-f52de352a0e5,False,,,,,2025-01-09T13:25:59.040273,...,"(WADOH) Hood Canal 1 site, W shore of Hood Can...",dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
166,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,46917a4a-9e77-495b-a0d3-3c5cea2bc5e8,False,,,,,2025-01-09T13:25:56.552833,...,(CMOP) Grays Point (USCG day mark green 13),dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
167,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,3261508c-5b1d-42a8-95ae-fe142449a216,False,,,,,2025-01-09T13:25:53.600691,...,"(WADOH) Skookum Inlet site, N shore near Deer ...",dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]


## Start filtering down to buoys and similar platforms

In [120]:
df_ioos_catalog.loc[df_ioos_catalog['datatype']=='TimeSeries']

Unnamed: 0,title,org,start_date,end_date,datatype
2,Backyard Buoys - NANOOS - Washington: Quileute...,NANOOS,2024-05-02T04:40:00Z,2025-08-29T20:30:00Z,TimeSeries
4,NPBY1 - Point Wells: Meteorological Station Data,NANOOS,2014-09-30T23:42:13Z,2025-09-05T14:50:39Z,TimeSeries
6,NPBY2 - Carr Inlet: Meteorological Station Data,NANOOS,2014-12-17T04:26:26Z,2025-09-05T15:29:20Z,TimeSeries
23,NEMO - ChaBa Meteorlogical - Gill Metpak Pro,NANOOS,2017-05-01T06:10:38Z,2025-09-03T12:55:36Z,TimeSeries
32,"Se'lhaem, Bellingham Bay Meteorological Statio...",NANOOS,2016-02-14T20:14:02Z,2025-07-05T14:20:02Z,TimeSeries
34,ORCA3 - Hansville: Meteorological Station Data,NANOOS,2015-04-01T19:04:49Z,2024-12-16T10:09:46Z,TimeSeries
41,ORCA1 - Twanoh: Meteorological Station Data,NANOOS,2019-09-01T00:00:31Z,2025-09-05T15:25:47Z,TimeSeries
50,ORCA4 - Dabob Bay: Meteorological Station Data,NANOOS,2019-02-20T20:22:04Z,2025-06-04T22:09:56Z,TimeSeries
51,"Se'lhaem, Bellingham Bay Surface Hydrological ...",NANOOS,2016-02-14T20:20:02Z,2025-07-05T14:20:02Z,TimeSeries
54,Backyard Buoys - NANOOS - Washington: Quileute...,NANOOS,2023-10-19T20:20:00Z,2023-12-24T22:50:00Z,TimeSeries


Note that (CMOP) Elliott Point does not appear in the `TimeSeries`-filtered results: its `datatype` is empty, so filtering on `datatype` alone drops it.

In [117]:
df_ioos_catalog.loc[df_ioos_catalog['title'].str.contains('Elliott Point')]

Unnamed: 0,title,org,start_date,end_date,datatype
128,(CMOP) Elliott Point,NANOOS,2018-01-02T08:33:43+00:00,2019-02-10T21:27:42+00:00,
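Since an empty `datatype` silently removes fixed platforms such as Elliott Point, one option is to keep the untyped rows in a separate table for manual review instead of discarding them. A minimal sketch, assuming the five-column frame shown above; `split_by_datatype` is a hypothetical helper and the `"Other"` value in the example is made up:

```python
import pandas as pd


def split_by_datatype(df, datatype="TimeSeries"):
    """Return (rows matching `datatype`, rows with no datatype recorded)."""
    # Treat None, NaN, and blank strings as "no datatype recorded".
    has_type = df["datatype"].notna() & df["datatype"].astype(str).str.strip().ne("")
    matched = df[has_type & df["datatype"].eq(datatype)]
    untyped = df[~has_type]
    return matched, untyped


example = pd.DataFrame(
    {
        "title": [
            "NPBY1 - Point Wells: Meteorological Station Data",
            "(CMOP) Elliott Point",
            "Glider - La Push Line: 2025 March - Ongoing",
        ],
        # "Other" is a placeholder, not a value observed in the catalog.
        "datatype": ["TimeSeries", None, "Other"],
    }
)
matched, untyped = split_by_datatype(example)
print(matched["title"].tolist())
print(untyped["title"].tolist())
```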
