Match records archived at NCEI against records available through the IOOS data catalog. The goal is a list of IOOS catalog datasets that are not archived at NCEI.

Outline of process:
1. Build a dataframe of non-federal buoy datasets and metadata from the IOOS Catalog.
2. Use that dataframe to search NCEI for matching datasets affiliated with IOOS.
3. Identify which datasets are not at NCEI that should be.

Borrow code from:
* https://ioos.github.io/ioos_code_lab/content/code_gallery/data_access_notebooks/2017-06-12-NCEI_RA_archive_history.html
* https://ioos.github.io/ioos_code_lab/content/code_gallery/data_access_notebooks/2024-09-17-CKAN_API_Query.html

In [1]:
from ckanapi import RemoteCKAN

ioos_catalog = RemoteCKAN(
    address="https://data.ioos.us",
    user_agent="ckanapiioos/1.0 (+https://ioos.us/)",
)


ioos_catalog



<ckanapi.remoteckan.RemoteCKAN at 0x2226d06dd30>

In [2]:
orgs = ioos_catalog.action.organization_list()
print(orgs)

['aoos', 'caricoos', 'cdip', 'cencoos', 'comt', 'gcoos', 'glider-dac', 'glos', 'hf-radar-dac', 'ioos', 'maracoos', 'nanoos', 'neracoos', 'noaa-co-ops', 'noaa-ndbc', 'oceansites', 'pacioos', 'sccoos', 'secoora', 'unidata', 'usgs', 'us-navy']


In [3]:
datasets = ioos_catalog.action.package_search()
datasets["count"]

44147

## Query IOOS Catalog for appropriate datasets

Gather all the datasets associated with an RA and filter to just buoys and similar platforms.
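
One way to express that filter is as a CKAN free-text query that restricts results to a single organization and excludes unwanted platform types. The helper below is a minimal sketch: `build_free_text_query` and its `exclude_terms` parameter are hypothetical names introduced here, and the string it produces mirrors the query used later in this notebook.

```python
def build_free_text_query(org, exclude_terms=("glider", "model")):
    """Return a CKAN free-text query limited to one organization.

    `org` is lowercased because the names returned by
    organization_list() above are lowercase (e.g. 'nanoos').
    `exclude_terms` is a hypothetical parameter for this sketch.
    """
    query = f"organization:{org.lower()}"
    if exclude_terms:
        query += " NOT (" + " OR ".join(exclude_terms) + ")"
    return query


print(build_free_text_query("NANOOS"))
```

Excluding terms only narrows the results by keyword, so the list may still contain platforms other than buoys and would need a manual check.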

## Actually do the querying

In [42]:
import time
import pandas as pd

from ckanapi import RemoteCKAN
from ckanapi.errors import CKANAPIError
from requests.exceptions import ChunkedEncodingError
from urllib3.exceptions import IncompleteRead


def ioos_ckan_query(ioos_catalog, filter_query, free_text_query, max_retries=5):
    """
    Query the IOOS catalog with a filter query and a free text query.

    Parameters
    ----------
    ioos_catalog : RemoteCKAN object
        The RemoteCKAN object to use for querying the IOOS catalog.
    filter_query : str
        The filter query to use for querying the IOOS catalog.
    free_text_query : str
        The free text query to use for querying the IOOS catalog.
    max_retries : int
        Consecutive failed requests tolerated before the error is re-raised.

    Returns
    -------
    df_plat : pandas.DataFrame
        A DataFrame containing the results of the query.
    """
    records = []
    num_results = 0
    failures = 0

    while True:
        try:
            # Results are paged: ask for the next 500 starting after what we hold.
            datasets = ioos_catalog.action.package_search(
                fq=filter_query,
                q=free_text_query,
                rows=500,
                start=len(records),
            )
        except (CKANAPIError, IncompleteRead, ChunkedEncodingError):
            # Transient error: pause, then retry the same page a bounded number of times.
            failures += 1
            if failures > max_retries:
                raise
            time.sleep(2 * failures)
            continue

        failures = 0
        num_results = datasets["count"]

        print(f"num_results: {num_results}, result_count: {len(records)}")

        records.extend(datasets["results"])

        # Stop once everything is fetched, or if the server returns an empty page.
        if len(records) >= num_results or not datasets["results"]:
            print(f"num_results: {num_results}, result_count: {len(records)}")
            break

    # Build the DataFrame once instead of concatenating row by row.
    df_plat = pd.DataFrame(records)

    print(
        f"num_results: {num_results}, result_count: {len(records)}, "
        f"total_result_count: {df_plat.shape[0]}"
    )

    return df_plat

In [None]:

ua = "ckanapiioos/1.0 (+https://ioos.us/)"

ioos_catalog = RemoteCKAN("https://data.ioos.us", user_agent=ua)
df_ioos_catalog = pd.DataFrame()

orgs = ["NANOOS"]

for org in orgs:
    # Organization names in the IOOS catalog are lowercase (e.g. "nanoos").
    org_name = org.lower()

    filter_query = ""

    free_text_query = f"organization:{org_name} NOT (glider OR model)"

    df_search = ioos_ckan_query(ioos_catalog, filter_query, free_text_query)

    df_ioos_catalog = pd.concat([df_ioos_catalog, df_search], ignore_index=True)

df_ioos_catalog

num_results: 95, result_count: 0
num_results: 95, result_count: 95
num_results: 95, result_count: 95, total_result_count: 95


Unnamed: 0,author,author_email,creator_user_id,id,isopen,license_id,license_title,maintainer,maintainer_email,metadata_created,...,title,type,url,version,extras,resources,tags,groups,relationships_as_subject,relationships_as_object
0,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,7da90e03-f8aa-483e-96d0-7a27051b90b4,False,,,,,2025-04-11T14:32:17.541505,...,Backyard Buoys - NANOOS - Washington: Quileute...,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Backyard Buoys', 'id': 'f80...",[],[],[]
1,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,d0ef6a3a-4894-43f3-b4ea-2a882dccc478,False,,,,,2025-01-09T02:08:28.700159,...,NPBY1 - Point Wells: Meteorological Station Data,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'Earth Science > Atmosphere ...,[],[],[]
2,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,0bd3b7ac-cb00-4dd5-87ca-e55bd6fb8d16,False,,,,,2024-11-08T12:57:32.204016,...,NPBY2 - Carr Inlet: Meteorological Station Data,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'Earth Science > Atmosphere ...,[],[],[]
3,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,17f2f79e-bab0-4c2d-a0a1-2abf40acaa52,False,,,,,2025-01-09T02:08:48.402290,...,NANOOS Mooring ORCA Pt Wells,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
4,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,cb6612de-ae7b-4827-a1f9-0d943174ae15,False,,,,,2025-05-09T16:03:29.657747,...,NEMO - ChaBa Meteorlogical - Gill Metpak Pro,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...",[{'display_name': 'Earth Science > Atmosphere ...,[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,bff06122-cf40-4611-b5b3-c8c79a71cfac,False,,,,,2025-01-09T13:26:01.855237,...,(APL-UW) Ćháʔba· UW/NANOOS Moore...,dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
91,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,2449dd5c-57c5-43dd-a3d6-f52de352a0e5,False,,,,,2025-01-09T13:25:59.040273,...,"(WADOH) Hood Canal 1 site, W shore of Hood Can...",dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
92,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,46917a4a-9e77-495b-a0d3-3c5cea2bc5e8,False,,,,,2025-01-09T13:25:56.552833,...,(CMOP) Grays Point (USCG day mark green 13),dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
93,,,0ea3933c-4674-41dd-a17d-bfbc8c99bd75,3261508c-5b1d-42a8-95ae-fe142449a216,False,,,,,2025-01-09T13:25:53.600691,...,"(WADOH) Skookum Inlet site, N shore near Deer ...",dataset,,,"[{'key': 'access-constraints', 'value': '[]'},...","[{'cache_last_updated': None, 'cache_url': Non...","[{'display_name': 'Air Temperature', 'id': 'a6...",[],[],[]
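
The `extras` column holds a list of `{'key': ..., 'value': ...}` dictionaries, which is awkward to filter on directly. A minimal sketch of collapsing that list into a plain dictionary follows. `extras_to_dict` is a hypothetical helper name, and only the `access-constraints` entry is taken from the output above; the second sample entry is a made-up placeholder.

```python
def extras_to_dict(extras):
    """Collapse CKAN's [{'key': ..., 'value': ...}, ...] list into one dict."""
    return {item["key"]: item["value"] for item in extras or []}


# The first entry has the shape displayed in the table above; the second
# is an invented placeholder for illustration only.
sample_extras = [
    {"key": "access-constraints", "value": "[]"},
    {"key": "example-key", "value": "example value"},
]

flat = extras_to_dict(sample_extras)
print(flat["access-constraints"])
```

Applied to the catalog results, this could look like `df_ioos_catalog["extras"].apply(extras_to_dict)`, assuming every cell in that column is a list or missing.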


## Search NCEI

In [50]:
# fuzzy_xml_search.py
# This script performs a fuzzy search on the text content of an XML file.

import xml.etree.ElementTree as ET
from thefuzz import fuzz

def fuzzy_search_in_xml(tree, search_query, score_cutoff=70):
    """
    Performs a fuzzy search for a query string within the text of all elements in a parsed XML tree.

    Args:
        tree (xml.etree.ElementTree.ElementTree): The parsed XML document to search.
        search_query (str): The string to search for.
        score_cutoff (int): The minimum similarity score (0-100) to consider a match.
                            Defaults to 70.

    Returns:
        list: A list of dictionaries, where each dictionary represents a match
              and contains the element's tag, its text, and the similarity score.
              Returns an empty list if no matches are found.
    """
    matches = []
    root = tree.getroot()
    query = search_query.lower()

    # Iterate through every element in the XML tree
    for element in root.iter():
        # Check if the element has text content
        if element.text and element.text.strip():
            element_text = element.text.strip()
            # Calculate the fuzzy match score (partial_ratio is good for finding substrings)
            score = fuzz.partial_ratio(query, element_text.lower())

            # If the score is at or above the cutoff, we have a match
            if score >= score_cutoff:
                matches.append({
                    'tag': element.tag,
                    'text': element_text,
                    'score': score
                })

    # Sort matches by score in descending order
    matches.sort(key=lambda x: x['score'], reverse=True)

    return matches

In [None]:
# Example usage
from urllib.request import urlopen

organization = "NANOOS"
XML_FILE = f"https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:IOOS-{organization};view=xml;responseType=text/xml"
SEARCH_QUERY = "(CMOP) SATURN 1"
SCORE_CUTOFF = 80         # Adjust this value to make the search more or less strict

print(f"Searching for '{SEARCH_QUERY}' in '{XML_FILE}' (cutoff score: {SCORE_CUTOFF})...\n")

# Fetch and parse the ISO metadata record, then perform the search
tree = ET.parse(urlopen(XML_FILE))
results = fuzzy_search_in_xml(tree, SEARCH_QUERY, SCORE_CUTOFF)

# Display the results
if results:
    print(f"Found {len(results)} match(es):")
    for result in results:
        print("-" * 20)
        print(f"  Tag:   {result['tag']}")
        print(f"  Text:  '{result['text']}'")
        print(f"  Score: {result['score']}")
    print("-" * 20)
else:
    print("No matches found.")

Searching for '(CMOP) SATURN 1' in 'https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:IOOS-NANOOS;view=xml;responseType=text/xml' (cutoff score: 80)...

Found 3 match(es):
--------------------
  Tag:   {http://www.isotc211.org/2005/gco}CharacterString
  Text:  'saturn10'
  Score: 88
--------------------
  Tag:   {http://www.isotc211.org/2005/gco}CharacterString
  Text:  'saturn01'
  Score: 88
--------------------
  Tag:   {http://www.isotc211.org/2005/gco}CharacterString
  Text:  'SATURN-10'
  Score: 82
--------------------
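
The same element-by-element scan can be tried offline on a small in-memory document. The sketch below is not the notebook's method: it swaps `thefuzz` for the standard library's `difflib.SequenceMatcher` and compares whole normalized strings, so its scores will differ from `fuzz.partial_ratio`. The sample XML is made up (its strings are borrowed from this notebook's search output), and `normalize` and `whole_string_matches` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Made-up document; the strings are borrowed from this notebook's output.
SAMPLE_XML = """<root xmlns:gco="http://www.isotc211.org/2005/gco">
  <gco:CharacterString>saturn01</gco:CharacterString>
  <gco:CharacterString>saturn10</gco:CharacterString>
  <gco:CharacterString>BUOYS</gco:CharacterString>
</root>"""


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def whole_string_matches(root, query, score_cutoff=80):
    """Score each element's text against the whole query (0-100)."""
    target = normalize(query)
    matches = []
    for element in root.iter():
        text = (element.text or "").strip()
        if not text:
            continue
        ratio = SequenceMatcher(None, target, normalize(text)).ratio()
        score = round(100 * ratio)
        if score >= score_cutoff:
            matches.append({"tag": element.tag, "text": text, "score": score})
    matches.sort(key=lambda m: m["score"], reverse=True)
    return matches


root = ET.fromstring(SAMPLE_XML)
results = whole_string_matches(root, "SATURN-10")
for match in results:
    print(match["score"], match["text"])
```

Whole-string scoring is stricter than partial matching: a short keyword that merely occurs inside a longer query does not score highly, at the cost of missing titles that are phrased differently in the two systems.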


## Bring it all together

In [52]:
from urllib.request import urlopen
import urllib.error
import stamina


@stamina.retry(on=urllib.error.HTTPError, attempts=3)
def _openurl_with_retry(url):
    """Thin wrapper around urlopen adding stamina."""
    return urlopen(url)

ua = "ckanapiioos/1.0 (+https://ioos.us/)"

ioos_catalog = RemoteCKAN("https://data.ioos.us", user_agent=ua)
df_ioos_catalog = pd.DataFrame()

orgs = ["NANOOS"]

SCORE_CUTOFF = 80  # Adjust this value to make the search more or less strict

for org in orgs:
    # Organization names in the IOOS catalog are lowercase (e.g. "nanoos").
    org_name = org.lower()

    filter_query = ""

    free_text_query = f'organization:{org_name} NOT (glider OR model)'

    df_search = ioos_ckan_query(ioos_catalog, filter_query, free_text_query)

    df_ioos_catalog = pd.concat([df_ioos_catalog, df_search], ignore_index=True)

    # NCEI ISO metadata record for this organization; built from the loop
    # variable rather than a value left over from an earlier cell.
    XML_FILE = f"https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:IOOS-{org};view=xml;responseType=text/xml"

    # Fetch and parse the record once per organization
    iso = _openurl_with_retry(XML_FILE)
    tree = ET.parse(iso)

    for index, dataset in df_search.iterrows():
        SEARCH_QUERY = dataset['title']

        print(f"Searching for '{SEARCH_QUERY}' in '{XML_FILE}' (cutoff score: {SCORE_CUTOFF})...\n")

        # Perform the search
        results = fuzzy_search_in_xml(tree, SEARCH_QUERY, SCORE_CUTOFF)

        # Display the results
        if results:
            print(f"Found {len(results)} match(es):")
            for result in results:
                print("-" * 20)
                print(f"  Tag:   {result['tag']}")
                print(f"  Text:  '{result['text']}'")
                print(f"  Score: {result['score']}")
            print("-" * 20)
        else:
            print("No matches found.")

    

num_results: 95, result_count: 0
num_results: 95, result_count: 95
num_results: 95, result_count: 95, total_result_count: 95
Searching for 'Backyard Buoys - NANOOS - Washington: Quileute - North' in 'https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:IOOS-NANOOS;view=xml;responseType=text/xml' (cutoff score: 80)...

Found 2 match(es):
--------------------
  Tag:   {http://www.isotc211.org/2005/gmx}Anchor
  Text:  'BUOYS'
  Score: 100
--------------------
  Tag:   {http://www.isotc211.org/2005/gco}CharacterString
  Text:  'NANOOS'
  Score: 100
--------------------
Searching for 'NPBY1 - Point Wells: Meteorological Station Data' in 'https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:IOOS-NANOOS;view=xml;responseType=text/xml' (cutoff score: 80)...

Found 5 match(es):
--------------------
  Tag:   {http://www.isotc211.org/2005/gmx}Anchor
  Text:  'meteorological'
  Score: 100
--------------------
  Tag:   {http://www.isotc211.org/2

In [48]:
df_search['title']

0     Backyard Buoys - NANOOS - Washington: Quileute...
1      NPBY1 - Point Wells: Meteorological Station Data
2       NPBY2 - Carr Inlet: Meteorological Station Data
3                          NANOOS Mooring ORCA Pt Wells
4          NEMO - ChaBa Meteorlogical - Gill Metpak Pro
                            ...                        
90    (APL-UW) Ćháʔba· UW/NANOOS Moore...
91    (WADOH) Hood Canal 1 site, W shore of Hood Can...
92          (CMOP) Grays Point (USCG day mark green 13)
93    (WADOH) Skookum Inlet site, N shore near Deer ...
94    (WADOH) Eld Inlet site, W shore near Frye Cove...
Name: title, Length: 95, dtype: object
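
The loop above prints matches but does not yet produce the list named in step 3 of the outline. A minimal sketch of that last step is to partition titles by their best score. `split_by_match` is a hypothetical helper, `best_score` is any callable returning a 0-100 score for a title, and the titles and scores below are invented stand-ins, not real results.

```python
def split_by_match(titles, best_score, score_cutoff=80):
    """Partition titles into (matched, unmatched) using a scoring callable."""
    matched, unmatched = [], []
    for title in titles:
        if best_score(title) >= score_cutoff:
            matched.append(title)
        else:
            unmatched.append(title)
    return matched, unmatched


# Invented titles and scores standing in for the best fuzzy score per title.
fake_scores = {"Station A": 95, "Station B": 40}

matched, unmatched = split_by_match(fake_scores, fake_scores.get)
print(unmatched)
```

With the objects from this notebook, `best_score` could be something like `lambda t: max((m["score"] for m in fuzzy_search_in_xml(tree, t, 0)), default=0)`. Because the output above shows generic keywords such as 'BUOYS' and 'NANOOS' scoring 100 against full titles, a best-score rule of this kind would likely mark nearly every dataset as matched unless the scoring is restricted to more specific text.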