IQSS/7349-4 creator updates in schema.org #9089

Merged

qqmyers
Member

@qqmyers qqmyers commented Oct 19, 2022

What this PR does / why we need it: Per discussion in #7349 and #5029, Google ignores creator entries from Dataverse because they don't include an @type of Person or Organization. Since we don't collect that directly, we have to use some mechanism to infer the @type. The OpenAire export format does this by leveraging an algorithm developed by DataCite. This PR extracts that algorithm from the code producing the OpenAire XML format into a new PersonOrOrgUtil class, which is then used to identify an @type for creators in Dataverse. The new functionality provides the @type in the schema.org export format and in the in-dataset-page metadata, along with the givenName and familyName for a person if/when the algorithm determines them.

The PR also updates how the 'affiliation' is handled. According to schema.org, only Persons have an 'affiliation', and that must be to an 'Organization'. The PR makes this change: the affiliation is sent as an object of @type Organization with the 'name' specified. Since Organizations don't have an 'affiliation' but do have a 'parentOrganization' key, the code now encodes the affiliation for Organizations as a 'parentOrganization' object of @type 'Organization' with the 'name' as specified.
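For illustration, here is a minimal sketch of the affiliation rule described above, using plain maps rather than a JSON library. The class and method names (CreatorSketch, creator) are hypothetical and are not taken from the PR, and the Person/Organization decision is assumed to be made elsewhere and passed in as a boolean.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the PR's code.
final class CreatorSketch {

    // isPerson stands in for the result of the type-inference step.
    static Map<String, Object> creator(String name, boolean isPerson, String affiliation) {
        Map<String, Object> creator = new LinkedHashMap<>();
        creator.put("@type", isPerson ? "Person" : "Organization");
        creator.put("name", name);
        if (affiliation != null && !affiliation.isBlank()) {
            // In both cases the affiliation becomes an Organization object with a name.
            Map<String, Object> org = new LinkedHashMap<>();
            org.put("@type", "Organization");
            org.put("name", affiliation);
            // Persons get 'affiliation'; Organizations get 'parentOrganization'.
            creator.put(isPerson ? "affiliation" : "parentOrganization", org);
        }
        return creator;
    }

    public static void main(String[] args) {
        System.out.println(creator("Smith, John", true, "Example University"));
        System.out.println(creator("Example Lab", false, "Example University"));
    }
}
```

The only difference between the two branches is the key under which the Organization object is stored, which is why a single helper can serve both creator types.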

Which issue(s) this PR closes:

Special notes for your reviewer:
FWIW: I think the new PersonOrOrgUtil can replace the code in the OpenAire export but I didn't do that in this PR. The algorithm is used in that code with two minor variations that I think can be handled by an appropriate entry for the organizationIfTied param.

Suggestions on how to test this: Unit tests cover most of this. As with others, visually inspecting the schema.org export and/or the embedded dataset page JSON-LD can be done to verify that Person or Organization is added and that affiliation is handled as described here. Note that miscategorizations (a person with @type Organization or vice versa) are not considered a bug and are considered better than not sending a type at all. If there are consistent issues, the algorithm can potentially be improved.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included.

Additional documentation:

@coveralls

coveralls commented Oct 19, 2022

Coverage Status

Coverage: 20.062% (+0.06%) from 20.0% when pulling 839376d on QualitativeDataRepository:IQSS/7349-4_creator_updates into ecc23c0 on IQSS:develop.

same examples as in OrganizationTest but using the extracted algorithm
and also checking given/family name in relevant cases
it does not appear to be useful given the tests in PersonOrOrgUtilTest
@qqmyers
Member Author

qqmyers commented Oct 20, 2022

FWIW: The 4 changes to the schema.org output are live on https://data.qdr.syr.edu now if anyone wants to take a look at the results.

@qqmyers
Member Author

qqmyers commented Oct 21, 2022

After some testing at QDR, it appears that organizations ending in Project get coded as a person. The PR now adds a JVM option that lets the algorithm assume all Person names are entered in the recommended 'Family Name, Given Name' format, which enables it to code 'John Smith Project' and any other org without a comma correctly. This is off by default and may not be useful for non-curated repositories.
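As a rough sketch of the effect of that option (not the PR's implementation): when the flag is on, a name without a comma is treated as an Organization, and otherwise the decision is left to the existing algorithm. The class name, the system property name example.assumeCommaInPersonName, and the looksLikePerson stand-in for the existing algorithm are all hypothetical.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the comma rule; the real algorithm lives in PersonOrOrgUtil.
final class CommaRuleSketch {

    // Hypothetical property name, used here for illustration only.
    static final String PROP = "example.assumeCommaInPersonName";

    static String inferType(String name, boolean assumeCommaInPersonName,
                            Predicate<String> looksLikePerson) {
        // With the flag on, a name without a comma cannot be a person.
        if (assumeCommaInPersonName && !name.contains(",")) {
            return "Organization";
        }
        // Otherwise defer to whatever the existing algorithm decides.
        return looksLikePerson.test(name) ? "Person" : "Organization";
    }

    public static void main(String[] args) {
        // Boolean.getBoolean is true only if the system property is set to "true".
        boolean flag = Boolean.getBoolean(PROP);
        // Naive stand-in for the existing algorithm: short names look like people.
        Predicate<String> naive = n -> n.trim().split("\\s+").length <= 3;
        System.out.println(inferType("John Smith Project", flag, naive));
        System.out.println(inferType("Smith, John", flag, naive));
    }
}
```

Running with -Dexample.assumeCommaInPersonName=true switches the first name from the stand-in's answer to Organization, while the comma-separated name is still handed to the fallback.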

An alternative/additional approach that could be used would be to add a configurable list of non-person names (like 'Project') that would allow finer grained control over specific cases.
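One way such a list could work, sketched under the assumption that it is supplied as a comma-separated setting value and matched against whole words, case-insensitively. Nothing here exists in the PR; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of a configurable non-person word list; not part of the PR.
final class NonPersonWordsSketch {

    // Parses a comma-separated value such as "Project, Lab" into lower-cased words.
    static Set<String> parse(String setting) {
        return Arrays.stream(setting.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // True if any whole word of the name (split on whitespace and commas) is in the list.
    static boolean containsNonPersonWord(String name, Set<String> words) {
        return Arrays.stream(name.split("[\\s,]+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .anyMatch(words::contains);
    }

    public static void main(String[] args) {
        Set<String> words = parse("Project, Lab");
        System.out.println(containsNonPersonWord("John Smith Project", words));
        System.out.println(containsNonPersonWord("Smith, John", words));
    }
}
```

Matching whole words rather than substrings avoids flagging a surname that merely contains a listed word.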

@qqmyers qqmyers added GDCC: DANS related to GDCC work for DANS GDCC: QDR of interest to QDR labels Oct 27, 2022
@mreekie mreekie added the bk2211 label Nov 1, 2022
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this pull request Nov 23, 2022
@qqmyers qqmyers added the Size: 10 A percentage of a sprint. 7 hours. label Dec 14, 2022
@mreekie mreekie removed the bk2211 label Jan 11, 2023
@pdurbin
Member

pdurbin commented Jan 24, 2023

@qqmyers I'm a little confused. I see you mention this issue above...

... but this PR doesn't close it. Can you please explain what else is needed? I'll leave a comment over there as well if it makes more sense to have the conversation there. Thanks.

@sekmiller sekmiller self-assigned this Jan 30, 2023
@sekmiller sekmiller moved this from Ready for Review ⏩ to Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 30, 2023
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Review 🔎 to Ready for QA ⏩ Jan 31, 2023
@sekmiller sekmiller removed their assignment Jan 31, 2023
@sekmiller
Contributor

sekmiller commented Jan 31, 2023

Looks good. Thanks for adding the ToDos for code consolidation. @qqmyers Looks like there are conflicts that need to be resolved.

I pulled it back into Review for the merge conflicts

@qqmyers qqmyers moved this from Review 🔎 to Ready for QA ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 31, 2023
@qqmyers qqmyers removed their assignment Jan 31, 2023
@kcondon kcondon self-assigned this Feb 1, 2023
@kcondon
Contributor

kcondon commented Feb 1, 2023

Issues found:

  1. fragment of url at the end of schema.org export
  2. 500 error in the server log on publish when the affiliation field is populated; cannot publish.
    creatorTypePrErr.txt
    export2vers.txt

Jim was not able to reproduce the above issue and, on retesting, I was not able to either. The theory is a temporary caching issue.

Labels
GDCC: DANS related to GDCC work for DANS GDCC: QDR of interest to QDR Size: 10 A percentage of a sprint. 7 hours.
Projects
No open projects
6 participants