IQSS/9544 - Migrate API fixes #9545

Merged: 11 commits from GlobalDataverseCommunityConsortium:migrateFixes into IQSS:develop on Jul 7, 2023

Conversation
Conversation

@qqmyers (Member) commented Apr 20, 2023

What this PR does / why we need it: This PR has 3 commits addressing three related issues uncovered while trying to use the Dataset Migration API via the DVUploader.

  1. A call to /api/dataverses/$DATAVERSE_ID/datasets/:startmigration fails in 5.13. Tracing it to the root issue: the recently added TermsOfUseAndAccessValidation has the first use of the TermsOfUseAndAccess.getDatasetVersion() method in the codebase. The Migration API creates a new TermsOfUseAndAccess object and adds it to the DatasetVersion, but the DatasetVersion was never added to it, causing a failure in the validation. The first commit fixes that (see the first sketch after this list).
  2. Continuing with testing, I noted exceptions in the log from the CitationServlet. With PermaLinks, code that checks whether a PID is registered does so by trying to access the PID in URL form, which is a call to /citation?persistentId= (versus a call to doi.org or a Handle endpoint). Looking at the code, it appears that GlobalIdServiceBean.findByGlobalId(GlobalId globalId) throws a NoResultException rather than returning null when the PID is not found, and that exception was not caught in CitationServlet, the only caller. I added a try/catch block, so the CitationServlet now reports a 404 in this case without a stack trace in the log. (I don't know whether the client received a 404 or a 500 previously.) This is not a PermaLinks-specific bug; I believe we've had reports of such exceptions in the log before, e.g. from spammers calling /citation?*. (See the second sketch after this list.)
  3. The trickiest problem, specific to the Fake and PermaLink providers, is that alreadyExists() was used in two ways in the code: to check whether a PID exists at the provider, in cases where we want to edit/delete it or where all we care about is whether a new PID conflicts with an existing one; and, in the Migration API and the underlying ImportDatasetCommand, to verify that a PID is already registered before allowing the import (because importing is for datasets that previously lived elsewhere and had a registered PID). For Fake and Perma there is no external provider, so the only relevant state is what is in the local database, but they still have to respond to alreadyExists(). As the code stood, they could not handle both cases correctly, since a check of the internal Dataverse db would give the wrong answer in one case or the other. Specifically, Perma was coded to report false if the PID is not in the local db. That works in the normal use case (a new perma PID can be created without conflict if there isn't one in the local db), but it makes migrate/import fail, because false implies the PID never existed / the dataset isn't coming from somewhere else where it was live. The FAKE case had been handled with work-arounds before: there was a complex check of which method was calling alreadyExists() at one point (removed in the PermaLink PR, I think), and an if(FAKE) in another part. The remaining commits rename alreadyExists() to alreadyRegistered() (making it clearer that, for DOI/Handle, these calls do not check the local db; that is done in other code) and add a noProviderDefault boolean parameter to that call to handle both cases for Fake and PermaLink (false for most callers, true for the import case). (See the third sketch after this list.)
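
For illustration, roughly what the first two fixes look like. These are minimal sketches based on the description above, not the exact diff; the variable names, the DvObject return type, and the surrounding servlet code are assumptions.

```java
// Sketch of fix #1: the new TermsOfUseAndAccess must point back at its
// DatasetVersion, since the recently added validation calls
// terms.getDatasetVersion().
TermsOfUseAndAccess terms = new TermsOfUseAndAccess();
datasetVersion.setTermsOfUseAndAccess(terms);
terms.setDatasetVersion(datasetVersion); // this back-reference was the missing piece
```

```java
// Sketch of fix #2: CitationServlet treats a PID that cannot be found as a
// 404 instead of letting the NoResultException escape into the log.
try {
    DvObject dvObject = globalIdServiceBean.findByGlobalId(globalId);
    // ... redirect to the dataset/file page as before ...
} catch (NoResultException nre) {
    response.sendError(HttpServletResponse.SC_NOT_FOUND);
}
```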
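
And a sketch of the renamed call in the remaining commits. The exact signatures in the diff may differ; the point is that providers without an external registry answer with the caller-supplied default instead of guessing from the local database.

```java
// Before: boolean alreadyExists(GlobalId globalId)
// After (sketch): callers say what the answer should be when there is no
// external registry to ask.
boolean alreadyRegistered(GlobalId globalId, boolean noProviderDefault) throws Exception;

// DOI/Handle providers can ignore noProviderDefault and query the external
// registry exactly as before. For Fake and PermaLink there is nothing to ask:
@Override
public boolean alreadyRegistered(GlobalId globalId, boolean noProviderDefault) {
    // false for normal PID creation (no conflict to report),
    // true for ImportDatasetCommand / the Migration API (treat the PID as
    // previously registered wherever the dataset used to live).
    return noProviderDefault;
}
```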

Which issue(s) this PR closes:

Closes #9544

Special notes for your reviewer:
Rather than adding a boolean param, there could be two methods; I just couldn't think of reasonable names, something like conflictingPIDExists() and previouslyRegisteredPIDFound()? The problem is that, for Handle/DOI, these would just return the same thing and the names would still reflect that. Another option is a rewrite that only calls alreadyRegistered() when it makes sense, e.g. if(hasRegistry()) then check alreadyRegistered() (sketched below). I think that's more work than adding a param, and potentially easier to mess up, but I'm game if that's what's requested. This is something that should be fixed for 5.14, as I think DANS will want to migrate with PermaLinks.
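
For concreteness, the if(hasRegistry()) alternative mentioned above would look roughly like the following in the import path. This is hypothetical and not implemented in this PR; hasRegistry(), pidProvider, and the exception type are illustrative names.

```java
// Hypothetical alternative: only ask the provider when it actually has an
// external registry to consult.
if (pidProvider.hasRegistry()) {
    if (!pidProvider.alreadyRegistered(globalId)) {
        throw new ImportException("PID " + globalId + " is not registered with its provider.");
    }
}
// Providers without a registry (Fake, PermaLink) would simply be trusted here.
```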

Suggestions on how to test this:
Any call to /api/dataverses/$DATAVERSE_ID/datasets/:startmigration would test the fix for #1. #2 can be tested by going to /citation?persistentId= with a non-existent PID. To test #3, configure the instance to use PermaLinks and call the Dataset Migration API as in the guides, or try the DVUploader with an archival bag (which would have to be created), e.g. java -jar DVUploader-v1.2.0.jar -key=5c0332b9-c546-4e9d-a497-2f9cf668e24e -did=perma:DANSPIDFK212 -server=http://ec2-3-215-16-135.compute-1.amazonaws.com -bag=file:/c:/NetBeansProjects/dataverse-uploader/target/perma-danspidfk212v2.0.zip -createIn=root. The latter would also test #1 and #2. For regression, the basic create/publish dataset tests are probably enough, perhaps just with DOIs. (Example calls for #1 and #2 are sketched below.)
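
Hedged examples of those first two checks. $SERVER_URL, $API_TOKEN, the JSON-LD file name, and the perma:NOSUCHPID placeholder are assumptions; the exact request body for :startmigration is described in the Dataset Migration API section of the guides.

```shell
# #1: a migration call should no longer fail validation (see the guides for
# the expected JSON-LD body).
curl -X POST -H "X-Dataverse-key:$API_TOKEN" -H 'Content-Type: application/ld+json' \
  --upload-file dataset-migrate.jsonld \
  "$SERVER_URL/api/dataverses/$DATAVERSE_ID/datasets/:startmigration"

# #2: a non-existent PID should now return a clean 404, with no stack trace in server.log.
curl -sI "$SERVER_URL/citation?persistentId=perma:NOSUCHPID" | head -n 1
```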

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: The fact that the Migration API is broken in 5.13 and now fixed could be put in a release note, but the problem hadn't been noticed until now. The bigger part of the PR is fixing PermaLinks, which is a 5.14 feature.

Additional documentation:

@qqmyers qqmyers added the Size: 10 (a percentage of a sprint; 7 hours) and GDCC: DANS (related to GDCC work for DANS) labels on Apr 20, 2023
@coveralls commented Apr 24, 2023

Coverage Status

coverage: 20.372% (-0.02%) from 20.396% when pulling b41364d on GlobalDataverseCommunityConsortium:migrateFixes into 6053fa3 on IQSS:develop.

@qqmyers qqmyers added this to the 5.14 milestone Apr 26, 2023
@mreekie mreekie added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation May 15, 2023
@mreekie mreekie moved this from Ready for Review ⏩ to ▶ SPRINT READY in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) May 15, 2023
@cmbz cmbz moved this from ▶ SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jun 21, 2023
@cmbz cmbz moved this from This Sprint 🏃‍♀️ 🏃 to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jun 22, 2023
@sekmiller sekmiller self-assigned this Jun 26, 2023
@sekmiller sekmiller moved this from Ready for Review ⏩ to In Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jun 26, 2023
@sekmiller (Contributor) left a comment

Looks reasonable, except that the logging still refers to alreadyExists instead of the new alreadyRegistered in most cases. Also, this branch needs to be updated from develop.

@sekmiller sekmiller added the Status: Waiting (issues in need of input from someone currently unavailable) label on Jun 29, 2023
@qqmyers qqmyers removed the Status: Waiting (issues in need of input from someone currently unavailable) label on Jul 5, 2023
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from In Review 🔎 to Ready for QA ⏩ Jul 5, 2023
@kcondon kcondon self-assigned this Jul 7, 2023
@kcondon kcondon merged commit f24c7de into IQSS:develop Jul 7, 2023
9 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Jul 7, 2023
Labels
GDCC: DANS (related to GDCC work for DANS), Size: 10 (a percentage of a sprint; 7 hours)

Successfully merging this pull request may close: Recent changes break the Dataset Migration API (#9544)

4 participants