Switching to gdcc/xoai v5.2.0 (9910) #10012

landreev · 2023-10-16T00:31:44Z

What this PR does / why we need it:

This pr contains no code changes in the Dataverse source, but changes the dataverse-parent pom file switching to the newly released v. 5.2 of the gdcc/xoai library. This fixes the problem affecting (some) OAI records with utf8 characters. (see the original issue; the actual problem is described in detail in an issue in the gdcc/xoai repo).

Which issue(s) this PR closes:

Closes #9910

Special notes for your reviewer:

Suggestions on how to test this:

The condition is somewhat tricky to reproduce on purpose. But can be very easily tested using a known affected export and some cheating; I will provide the details.

The bug is triggered when a multi-byte utf8 character is split in 2 at the exact offset of 1024 bytes in the harvestable metadata export. Even if you create a test dataset with all the metadata encoded in some multi-byte utf8 (Chinese, etc.), there will still be a chance that this condition will not be met. It could then be triggered by padding the export with extra characters, but it's easier to cheat as follows.

You need an exported dataset that's part of an OAI set. It helps if the dataset is stored on a filesystem; s3 will work too, but fs is easier. doi:10.70122/FK2/9TZE9Y is used below since that was the dataset I used on my local instance.

Verify that the dataset is exported, and can be accessed via OAI:

curl "http://localhost:8080/api/datasets/export?exporter=oai_dc&persistentId=doi:10.70122/FK2/9TZE9Y"

curl "http://localhost:8080/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.70122/FK2/9TZE9Y"

Now replace the cached oai_dc export for the dataset with the attached known problematic export (export_oai_dc.cached.txt; adjust the path as needed as well):

cp export_oai_dc.cached.txt /usr/local/payara6/glassfish/domains/domain1/files/10.70122/FK2/9TZE9Y/export_oai_dc.cached

Verify that the error condition is now met, by looking at the 1024 byte offset:

curl "http://localhost:8080/api/datasets/export?exporter=oai_dc&persistentId=doi:10.70122/FK2/9TZE9Y"  | head -c1024

the end of the record should look like this: The experiments were carried out on farmer? - i.e., we get the question mark instead of the fancy utf8 apostrophe in the word "farmer’s".

To test the "before" state, in develop branch, look at the OAI record above - /oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.70122/FK2/9TZE9Y in firefox. It should refuse to render the response complaining about invalid xml.

The OAI record should be rendered properly once the build from this branch is deployed.

export_oai_dc.cached.txt

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

github-actions · 2023-10-16T00:53:47Z

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9910-xoai-5.2

ghcr.io/gdcc/configbaker:9910-xoai-5.2

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

poikilotherm

LGTM, thx!

landreev · 2023-10-16T18:01:01Z

@kcondon I have updated the "how to test" section in the description. Even with the cheating hack, it's still a bit of a pain in the butt. So there is also an option of just merging it... this thing has been tested and re-tested a lot in the last few weeks.

poikilotherm · 2023-10-17T18:55:39Z

Technically, I also added a reproducer and a test in the XOIA codebase...
I am not a fan of testing a library in a downstream project, but that might just be me...

Thanks for merging @kcondon !

Switching to the new version of gdcc/xoai, v5.2.0 (#9910)

35f6951

landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Oct 16, 2023

poikilotherm approved these changes Oct 16, 2023

View reviewed changes

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Ready for Review ⏩ to Ready for QA ⏩ Oct 16, 2023

kcondon moved this from Ready for QA ⏩ to QA ✅ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Oct 16, 2023

kcondon self-assigned this Oct 16, 2023

kcondon merged commit b5c117e into develop Oct 16, 2023
17 of 18 checks passed

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Oct 16, 2023

kcondon deleted the 9910-xoai-5.2 branch October 16, 2023 18:51

pdurbin added this to the 6.1 milestone Oct 17, 2023

landreev mentioned this pull request Oct 18, 2023

Broken OAI records for some datasets with UTF-8 characters #9910

Closed

luddaniel pushed a commit to Recherche-Data-Gouv/dataverse that referenced this pull request Oct 19, 2023

v5.14_fixed - Switching to gdcc/xoai v5.2.0 (9910) IQSS#10012

98ba27f

landreev mentioned this pull request Mar 15, 2024

XML declaration missing from OAI when using oai_ddi #10329

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switching to gdcc/xoai v5.2.0 (9910) #10012

Switching to gdcc/xoai v5.2.0 (9910) #10012

landreev commented Oct 16, 2023 •

edited

github-actions bot commented Oct 16, 2023

poikilotherm left a comment

landreev commented Oct 16, 2023

poikilotherm commented Oct 17, 2023 •

edited

Switching to gdcc/xoai v5.2.0 (9910) #10012

Switching to gdcc/xoai v5.2.0 (9910) #10012

Conversation

landreev commented Oct 16, 2023 • edited

github-actions bot commented Oct 16, 2023

poikilotherm left a comment

Choose a reason for hiding this comment

landreev commented Oct 16, 2023

poikilotherm commented Oct 17, 2023 • edited

landreev commented Oct 16, 2023 •

edited

poikilotherm commented Oct 17, 2023 •

edited