
3437 OAI records for datasets that still exist, but are temporarily un-indexed should not be marked deleted #10222

Merged: 12 commits into develop on Feb 13, 2024

Conversation

@landreev (Contributor) commented Jan 9, 2024

What this PR does / why we need it:

A rare case of an issue with a very low issue number that is still relevant.

Our OAI sets are defined by search (Solr) queries*; therefore, when a set is re-exported and the query no longer finds a certain dataset, its OAI record is marked as "deleted", to tell remote clients that may have harvested it that they need to purge it.
A side effect of this is that if an admin decides to clear the index and run a full reindex overnight (which can take a while on a large database), and it coincides with a set re-export, many existing datasets can end up marked as "deleted".

As Kevin explains in the opening comment, like many OAI problems this one is not fatal, but transient: once the holdings are reindexed, the next set re-export will undo this and mark all such datasets as active again. However, a remote client may be forced to delete and re-harvest a large number of our datasets unnecessarily in the process. So we may as well fix this.

The PR fixes this by adding an extra database check: if a dataset that has disappeared from the search results still exists, is published, and is exported properly, its record is kept in the set as active.

*) This is true for all sets except the main, default, "everything" set, which is defined by a database query. The code in this PR recognizes that case and avoids unnecessary secondary database checks when the main set is re-exported.

Which issue(s) this PR closes:

Special notes for your reviewer:

Note that I have added a simple Index API endpoint (-X DELETE .../api/admin/index/datasets/<id>) for clearing an individual dataset from Solr. I needed it to create a meaningful API test for this functionality; I could have used the existing API for clearing Solr completely, but this felt cleaner, and I think the endpoint could be useful on its own.
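For illustration, the new endpoint can be exercised roughly like this (a sketch, not part of the PR text; it assumes a local instance at http://localhost:8080, and the dataset database id 42 is a placeholder):

```shell
# Clear a single dataset from the Solr index via the endpoint added in
# this PR. DATASET_ID=42 is a placeholder; substitute a real database id.
DATASET_ID=42
URL="http://localhost:8080/api/admin/index/datasets/${DATASET_ID}"
echo "$URL"
# The actual call ("|| true" tolerates the server being unreachable
# in a dry run):
curl -s -X DELETE "$URL" || true
```

After this call, the dataset is gone from the search index but still present (and published) in the database, which is exactly the state the PR's extra database check is designed to handle.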

Suggestions on how to test this:

1. Create an OAI set in the harvesting server dashboard. It can be a set with a single published dataset in it, defined with a query like dsPersistentId:B9CEYI, for example.
2. Export the set. Run http://localhost:8080/oai?verb=ListIdentifiers&set=YOURSETNAME&metadataPrefix=oai_dc and make sure the record is there.
3. Clear your Solr index, then re-export the set from the dashboard. Run the above again and confirm that the record is NOT marked as deleted.
4. As a control, destroy the dataset and re-export the set again. This time, the record should be shown with
   <header status="deleted">
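The verification steps above can be sketched as a shell sequence (assumptions: a local instance at http://localhost:8080, a set named mySet, and a placeholder dataset id of 42; set creation and re-export still happen in the dashboard, so treat this as an outline rather than a complete script):

```shell
BASE="http://localhost:8080"
SET="mySet"   # placeholder set name, created in the harvesting dashboard

# After exporting the set from the dashboard, confirm the record is listed:
curl -s "${BASE}/oai?verb=ListIdentifiers&set=${SET}&metadataPrefix=oai_dc" || true

# Clear one dataset from Solr (42 is a placeholder database id), re-export
# the set from the dashboard, then list identifiers again and check that
# the record is NOT marked with <header status="deleted">:
curl -s -X DELETE "${BASE}/api/admin/index/datasets/42" || true
curl -s "${BASE}/oai?verb=ListIdentifiers&set=${SET}&metadataPrefix=oai_dc" || true
```

The "|| true" guards only exist so the sketch can be dry-run without a live server; drop them when testing against a real instance.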

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls commented Jan 9, 2024

Coverage Status: 20.163% (-0.002% from 20.165%) when pulling a833e16 on 3437-oai-set-export-vs-reindex into 17bfeb9 on develop.


@landreev landreev changed the title 3437 OAI records for datasets that still exists, but temporarily un-indexed should not be marked deleted 3437 OAI records for datasets that still exist, but are temporarily un-indexed should not be marked deleted Jan 9, 2024
@landreev (Contributor, Author) commented Jan 9, 2024

I know why the Jenkins run failed. The test I just added destroyed a dataset that another test in that class relied on. ☹️
Will fix.


@pdurbin (Member) left a comment:

Just a quick unofficial review of docs.

@landreev (Contributor, Author) commented:

The tests are passing now.


@cmbz added the "Size: 30 (a percentage of a sprint; 21 hours; formerly size:33)" label on Jan 16, 2024
@landreev
Copy link
Contributor Author

My other OAI-PMH PR got merged yesterday; the testing class HarvestingServerIT.java was modified in that PR as well, but in a different part of it, so there are no merge conflicts. I've synced the branch; Jenkins is running now.


@scolapasta scolapasta removed their assignment Feb 5, 2024
@jp-tosca jp-tosca self-assigned this Feb 12, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:3437-oai-set-export-vs-reindex
ghcr.io/gdcc/configbaker:3437-oai-set-export-vs-reindex

🚢 See them on GHCR. Use them by referencing the full image name as printed above; mind the registry name.

@jp-tosca jp-tosca merged commit 4b5cbe8 into develop Feb 13, 2024
20 checks passed
@pdurbin pdurbin added this to the 6.2 milestone Feb 14, 2024
Labels: Feature: Harvesting; Size: 30 (a percentage of a sprint; 21 hours; formerly size:33)
Projects: Status: Done 🧹
Development: successfully merging this pull request may close the issue "Harvest: Set export can temporarily be adversely affected by full (clean) reindex."
6 participants