Create/update https://data.ioos.us/waf #72

Closed
mwengren opened this issue Nov 22, 2019 · 11 comments
@mwengren
Member

Presently, there's a static set of subdirectory WAFs from an old version (2016 and earlier) of the IOOS Catalog hosted at https://data.ioos.us/waf/. Because of some changes with NOAA's Data Catalog systems, we need to provide them a regularly-updated dump of all our dataset ISO XML files in a single WAF for harvest (until they are able to read our CS-W service).

Can we write a simple harvester script to either read our CS-W service or copy the source XML records from https://registry.ioos.us/waf and replace this entire directory of old stuff?

@benjwadams
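A minimal sketch of the "copy from the source WAF" option, assuming the WAF index is a plain HTML directory listing with one `<a href>` per record. The names `XmlLinkParser` and `list_waf_xml` are hypothetical, and the actual download step is left out:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XmlLinkParser(HTMLParser):
    """Collect absolute URLs of links ending in .xml from a directory listing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip parent-directory and subdirectory links; keep only XML records.
        if href and href.lower().endswith(".xml"):
            self.links.append(urljoin(self.base_url, href))


def list_waf_xml(index_html, base_url):
    """Return the XML record URLs found in one WAF index page (assumed layout)."""
    parser = XmlLinkParser(base_url)
    parser.feed(index_html)
    return parser.links
```

Each returned URL could then be fetched and written into the target directory. Subdirectory WAFs would need the same function applied to each subdirectory's index page.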

@benjwadams
Contributor

What's the issue with harvesting from the current CSW we have running?

@mwengren
Member Author

Unfortunately, they're doing some re-tooling and the new tools don't yet support CS-W harvest. They're actually moving away from CKAN and replacing the web interface with a new in-house system called 'OneStop'. You can see it here: https://data.noaa.gov/onestop/. It uses a number of different tools under the hood, such as ElasticSearch for indexing. From the UI perspective, it's a slightly different take than CKAN.

We have to work with what they're offering if we want to keep our harvesting process alive. They plan to support CS-W harvest, but it's still in the backlog.

We should replace this content anyway: https://data.ioos.us/waf/. I think there's also value for us in maintaining a single WAF at that URL. We can talk about level of effort, because as a workaround they can use https://registry.ioos.us/waf for now. If it's not worth the time, this may turn into just deleting all those old metadata records.

@benjwadams
Contributor

OK, no problem. I should be able to export any records from PyCSW to a folder using its export functionality: https://pycsw.org/faq/#how-do-i-export-my-repository

@mwengren
Member Author

Great, let's just do that since it's simple: wipe the old records and replace them with the export.

Let's put this at the top of the Catalog issue list for whenever you get back to working on it.

@mwengren
Member Author

mwengren commented Nov 26, 2019

For reference, this is what the new IOOS Catalog -> NOAA 'OneStop' Catalog -> Data.gov harvesting workflow that we need to support looks like:

NOAA Data Working Group Update

Here's an example of what a CARICOOS record looks like in OneStop once harvested. Hopefully they'll be parsing and displaying more fields soon.

XCUL_MET_Historic_Realtime_Agg-1

XCUL_MET_Historic_Realtime_Agg-2

Same record in IOOS Catalog:

Screenshot

@benjwadams
Contributor

I tried exporting through pycsw-admin. Unfortunately, it looks like the currently running version of the code attempts to load all the records into memory before exporting to XML. That would be OK for a small CSW deployment, but with the number of records in our current data inventory, exporting everything at once is causing problems. I'll continue looking for workarounds.
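One possible way around the memory problem, sketched here as an illustration rather than as what was deployed, is to read records from the database in fixed-size batches so that only one batch is held in memory at a time. The sketch uses `sqlite3` so that it is self-contained; the table name `records` and the columns `identifier` and `xml` are assumptions about the repository schema, and `iter_records` is a hypothetical helper:

```python
import sqlite3
from typing import Iterator, Tuple


def iter_records(conn: sqlite3.Connection,
                 batch_size: int = 500) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, xml) pairs, fetching at most batch_size rows at a time."""
    cur = conn.cursor()
    # Assumed schema: a 'records' table with 'identifier' and 'xml' columns.
    cur.execute("SELECT identifier, xml FROM records")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for identifier, xml in rows:
            yield identifier, xml
```

A caller can loop over `iter_records(conn)` and write each record to disk as it arrives, instead of building one large list first.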

@benjwadams
Contributor

Created a job to handle this. I will push up to one of the catalog repos momentarily and then close this out.

@mwengren
Member Author

mwengren commented Dec 4, 2019

@benjwadams Was looking at Catalog/Registry this morning and noticed the source metadata 'stations' WAF disappeared sometime since Monday: https://data.ioos.us/stations/waf/. Can you look at the nginx config again and restore it?

Also, I'm going to pass on the https://data.ioos.us/waf URL to the NOAA Catalog/OneStop team to harvest. Even if we're not ready to close this issue out, since it's there already it should work for their purposes for testing.

@benjwadams
Contributor

Stations WAF has been restored.

@mwengren mwengren moved this from Backlog to In progress in IOOS Catalog Dec 16, 2019
@benjwadams benjwadams moved this from In progress to Review in IOOS Catalog Jan 27, 2020
@benjwadams
Contributor

There is a script running now that loads the CSW contents from the database into a WAF. The PyCSW admin command tried to do this all at once, and the size of the requests overloaded the server. I have added a script that compares the md5sum of any existing metadata XML file in the WAF against the md5sum of the XML content of the corresponding record in the database. This is running successfully so far, and I will close out this issue shortly, once I have added the script under revision control in one of the repos.
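As a rough illustration of the md5sum comparison described above (not the script from the repo), the per-record step might look like the following. It assumes each record is stored as `<identifier>.xml` in the WAF directory and that a file is rewritten only when the checksums differ; `sync_record` and `safe_name` are hypothetical names:

```python
import hashlib
import re
from pathlib import Path


def safe_name(identifier: str) -> str:
    """Map a record identifier to a filesystem-safe file stem (assumed naming rule)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)


def md5_hex(data: bytes) -> str:
    """Return the hex md5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def sync_record(waf_dir: Path, identifier: str, xml_text: str) -> bool:
    """Write one record into the WAF only if it is new or changed.

    Returns True if the file was written, False if it was already up to date.
    """
    target = waf_dir / f"{safe_name(identifier)}.xml"
    new_bytes = xml_text.encode("utf-8")
    if target.exists() and md5_hex(target.read_bytes()) == md5_hex(new_bytes):
        return False
    target.write_bytes(new_bytes)
    return True
```

Skipping unchanged files keeps each cron run cheap and leaves file modification times untouched for records that did not change.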

@benjwadams
Contributor

Implemented by ioos/catalog-docker-base@d3660ce and added to cron job, closing issue.

IOOS Catalog automation moved this from Review to Done Jan 30, 2020