Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switching to gdcc/xoai v5.2.0 (9910) #10012

Merged
merged 1 commit into from
Oct 16, 2023
Merged

Switching to gdcc/xoai v5.2.0 (9910) #10012

merged 1 commit into from
Oct 16, 2023

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Oct 16, 2023

What this PR does / why we need it:

This pr contains no code changes in the Dataverse source, but changes the dataverse-parent pom file switching to the newly released v. 5.2 of the gdcc/xoai library. This fixes the problem affecting (some) OAI records with utf8 characters. (see the original issue; the actual problem is described in detail in an issue in the gdcc/xoai repo).

Which issue(s) this PR closes:

Closes #9910

Special notes for your reviewer:

Suggestions on how to test this:

The condition is somewhat tricky to reproduce on purpose. But can be very easily tested using a known affected export and some cheating; I will provide the details.

The bug is triggered when a multi-byte utf8 character is split in 2 at the exact offset of 1024 bytes in the harvestable metadata export. Even if you create a test dataset with all the metadata encoded in some multi-byte utf8 (Chinese, etc.), there will still be a chance that this condition will not be met. It could then be triggered by padding the export with extra characters, but it's easier to cheat as follows.

You need an exported dataset that's part of an OAI set. It helps if the dataset is stored on a filesystem; s3 will work too, but fs is easier. doi:10.70122/FK2/9TZE9Y is used below since that was the dataset I used on my local instance.

Verify that the dataset is exported, and can be accessed via OAI:

curl "http://localhost:8080/api/datasets/export?exporter=oai_dc&persistentId=doi:10.70122/FK2/9TZE9Y"

curl "http://localhost:8080/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.70122/FK2/9TZE9Y"

Now replace the cached oai_dc export for the dataset with the attached known problematic export (export_oai_dc.cached.txt; adjust the path as needed as well):

cp export_oai_dc.cached.txt /usr/local/payara6/glassfish/domains/domain1/files/10.70122/FK2/9TZE9Y/export_oai_dc.cached

Verify that the error condition is now met, by looking at the 1024 byte offset:

curl "http://localhost:8080/api/datasets/export?exporter=oai_dc&persistentId=doi:10.70122/FK2/9TZE9Y"  | head -c1024

the end of the record should look like this: The experiments were carried out on farmer? - i.e., we get the question mark instead of the fancy utf8 apostrophe in the word "farmer’s".

To test the "before" state, in develop branch, look at the OAI record above - /oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.70122/FK2/9TZE9Y in firefox. It should refuse to render the response complaining about invalid xml.

The OAI record should be rendered properly once the build from this branch is deployed.

export_oai_dc.cached.txt

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Oct 16, 2023
@github-actions
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9910-xoai-5.2
ghcr.io/gdcc/configbaker:9910-xoai-5.2

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Copy link
Contributor

@poikilotherm poikilotherm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx!

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Ready for Review ⏩ to Ready for QA ⏩ Oct 16, 2023
@kcondon kcondon self-assigned this Oct 16, 2023
@landreev
Copy link
Contributor Author

@kcondon I have updated the "how to test" section in the description. Even with the cheating hack, it's still a bit of a pain in the butt. So there is also an option of just merging it... this thing has been tested and re-tested a lot in the last few weeks.

@kcondon kcondon merged commit b5c117e into develop Oct 16, 2023
17 of 18 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Oct 16, 2023
@kcondon kcondon deleted the 9910-xoai-5.2 branch October 16, 2023 18:51
@pdurbin pdurbin added this to the 6.1 milestone Oct 17, 2023
@poikilotherm
Copy link
Contributor

poikilotherm commented Oct 17, 2023

Technically, I also added a reproducer and a test in the XOIA codebase...
I am not a fan of testing a library in a downstream project, but that might just be me...

Thanks for merging @kcondon !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Broken OAI records for some datasets with UTF-8 characters
4 participants