Harvesting client fails : Zenodo "Couperin" community / error with "language" field mapping. #7638

Closed
tjouneau opened this issue Feb 25, 2021 · 6 comments · Fixed by #7690
@tjouneau

tjouneau commented Feb 25, 2021

Dear community
I'm taking the liberty of opening an issue here, based on some earlier discussions in the Dataverse-Users group. I'm trying to harvest some Zenodo communities. My client is set up as follows:

  • Alias: zenodo_test
  • Server URL: http://www.zenodo.org/oai2d
  • Local dataverse: zenodo_test_couperin
  • OAI set: user_couperin
  • Metadata format: oai_dc
  • Repository type: Generic OAI archive

Only 5 datasets are harvested out of the 20 present here: https://zenodo.org/communities/couperin/?page=1&size=20

Exception processing getRecord(), oaiUrl=https://zenodo.org/oai2d, identifier=oai:zenodo.org:3773762, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'eng' does not exist in type 'language')
(also: "Value 'fra' does not exist...")

Harvested for example: https://zenodo.org/record/4266132
Not harvested: https://zenodo.org/record/3948266

An exchange with @qqmyers (thanks!) linked this problem to a similar one involving a SWORD Atom file import:
"With a quick look, it appears that that element is getting mapped to the Language field in the citation metadata block, whose values are controlled by the list of languages available. Those entries are all complete words (e.g. “English”) rather than the 2 or 3 letter ISO language codes. I haven’t checked to see if there’s an open issue to add support for language codes – might be something that could be addressed via an external service as being discussed in the CVV MDWG."

So it seems the problem is almost trivial, but it still blocks the harvesting. Until further development solves it, is there a workaround I could try? Maybe a trick to ignore the language field altogether?

Best

Thomas

@djbrooke
Contributor

djbrooke commented Mar 2, 2021

Thanks @tjouneau - we'll take a look and see what we can do.

@landreev
Contributor

OK, so the good news is that we don't have to touch any code to address this (contrary to what I was expecting initially).
We already have a mechanism for providing alternative forms/spellings of controlled vocabulary values. They are stored in the controlledvocabalternate database table. At the moment we are only using it for the metadata field country in the Geospatial metadata block. For example, there are 4 alternative forms for the controlled vocabulary entry "United States":

$ grep -i USA ./scripts/api/data/metadatablocks/geospatial.tsv 
	country	United States		234	U.S.A	USA	United States of America	U.S.A.

which translates into

 $ psql <...> -c "SELECT alt.strvalue FROM controlledvocabalternate alt, datasetfieldtype type, controlledvocabularyvalue val WHERE alt.controlledvocabularyvalue_id=val.id AND type.name='country' AND val.strvalue='United States'"
         strvalue         
--------------------------
 U.S.A
 U.S.A.
 USA
 United States of America
(4 rows)

So I believe the solution should be to modify the Citation metadata block, and add the ISO 639 codes as valid alternatives, either for all the 186 supported language values, or some sensible subset thereof.
For example, replace the current citation.tsv entry

	language	English		40

with

	language	English		40	eng	en

etc.
Does anyone have any objections?
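
The proposed TSV change could be scripted rather than edited by hand. The following is only a minimal sketch, not the change that was actually committed: the helper name `add_language_alternates` and the two-entry `ISO_CODES` mapping are hypothetical, and the column layout (empty first column, field name, value, identifier, display order, then alternates) is assumed from the example rows quoted above.

```python
# Hypothetical helper: append ISO 639 codes as alternate values to
# "language" rows of a metadata block TSV file.

# Assumption: a hand-made mapping with only two entries, for illustration.
ISO_CODES = {
    "English": ["eng", "en"],
    "French": ["fra", "fr"],
}


def add_language_alternates(line: str) -> str:
    """Return the TSV line with ISO codes appended if it is a language row."""
    cols = line.rstrip("\n").split("\t")
    # Drop trailing empty columns so new alternates follow the last real one.
    while len(cols) > 5 and cols[-1] == "":
        cols.pop()
    # Assumed row layout: ['', 'language', 'English', '', '40', ...alternates]
    if len(cols) < 5 or cols[1] != "language":
        return line
    existing = set(cols[5:])
    extra = [code for code in ISO_CODES.get(cols[2], []) if code not in existing]
    if not extra:
        return line  # nothing to add, leave the row untouched
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols + extra) + newline


print(repr(add_language_alternates("\tlanguage\tEnglish\t\t40")))
```

Rows for other fields, and language rows that already carry the codes, are returned unchanged, so the function can be run over the whole file more than once.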

@tjouneau: if you are feeling adventurous, you can fix it for this particular set by creating 2 entries in the database:

INSERT INTO controlledvocabalternate (strvalue,controlledvocabularyvalue_id,datasetfieldtype_id) VALUES ('fra', N, K);
INSERT INTO controlledvocabalternate (strvalue,controlledvocabularyvalue_id,datasetfieldtype_id) VALUES ('eng', M, K);

where K is the id of the language entry in the datasetfieldtype table;
and N and M are the ids of the entries for French and English in the controlledvocabularyvalue database table respectively.
You will then need to clear all the entries in the clientharvestrun table for this harvesting client, and re-run the harvest from scratch.
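
For anyone unsure how to find K, N and M, here is a rough sketch of the lookup and the insert, run against an in-memory SQLite stand-in rather than a real Dataverse PostgreSQL database. The table and column names come from the statements above, except `datasetfieldtype_id` on `controlledvocabularyvalue`, which is an assumption; the id values (7, 101, 102) and the helpers `add_alternate` and `resolve` are made up for illustration.

```python
import sqlite3

# Simplified stand-in schema; a real Dataverse database has many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasetfieldtype (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE controlledvocabularyvalue (
    id INTEGER PRIMARY KEY, strvalue TEXT, datasetfieldtype_id INTEGER);
CREATE TABLE controlledvocabalternate (
    id INTEGER PRIMARY KEY, strvalue TEXT,
    controlledvocabularyvalue_id INTEGER, datasetfieldtype_id INTEGER);
INSERT INTO datasetfieldtype VALUES (7, 'language');
INSERT INTO controlledvocabularyvalue VALUES (101, 'English', 7);
INSERT INTO controlledvocabularyvalue VALUES (102, 'French', 7);
""")


def add_alternate(conn, field, value, alternate):
    """Look up the value id (N or M) and field type id (K), then insert."""
    row = conn.execute(
        "SELECT v.id, t.id FROM controlledvocabularyvalue v "
        "JOIN datasetfieldtype t ON v.datasetfieldtype_id = t.id "
        "WHERE t.name = ? AND v.strvalue = ?",
        (field, value),
    ).fetchone()
    if row is None:
        raise LookupError(f"no controlled value {value!r} for field {field!r}")
    value_id, type_id = row
    conn.execute(
        "INSERT INTO controlledvocabalternate "
        "(strvalue, controlledvocabularyvalue_id, datasetfieldtype_id) "
        "VALUES (?, ?, ?)",
        (alternate, value_id, type_id),
    )
    return value_id, type_id


def resolve(conn, field, raw):
    """Return the canonical value for an exact or alternate spelling, else None."""
    row = conn.execute(
        "SELECT v.strvalue FROM controlledvocabularyvalue v "
        "JOIN datasetfieldtype t ON v.datasetfieldtype_id = t.id "
        "LEFT JOIN controlledvocabalternate a "
        "  ON a.controlledvocabularyvalue_id = v.id "
        "WHERE t.name = ? AND (v.strvalue = ? OR a.strvalue = ?) LIMIT 1",
        (field, raw, raw),
    ).fetchone()
    return row[0] if row else None


add_alternate(conn, "language", "French", "fra")
add_alternate(conn, "language", "English", "eng")
print(resolve(conn, "language", "fra"), resolve(conn, "language", "eng"))
```

On a real installation the same SELECT can be run in psql to read off the ids before issuing the two INSERT statements by hand. The `resolve` function only illustrates the idea of matching alternates; it is not how Dataverse itself performs the lookup.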

@qqmyers
Member

qqmyers commented Mar 10, 2021

This makes sense to me as the best option (though I didn't do it for country codes in the metrics PR - that use case isn't directly related to the citation block/metadata fields, but we do list countries in the citation block and I thought we could store the codes there.)

@landreev
Contributor

@qqmyers I can see how that use case was a bit different. But if we ever come across a piece of metadata that we need to import, and it fails because they are using a 2-letter code in a field where we expect the name of a country, I would not hesitate to add those to the geospatial block as well.

landreev added a commit that referenced this issue Mar 11, 2021
…titute values in the citation metadata block, where an exact match was available (140 languages total; #7638)
landreev added a commit that referenced this issue Mar 11, 2021
…mote servers offering long lists of OAI sets. (#7638)
@tjouneau
Author

Hi
I'm happy to report that the INSERT commands above solved the problem for the Couperin community: I'm harvesting the whole set now. Thanks for your help. I'm currently running several tests with other communities (I was dismayed by some failures, but the log shows HTTP 429 statuses, so I'm being too active).
Another thing that bothers me is that importing in the many DataCite variations available (see https://developers.zenodo.org/#oai-pmh) doesn't seem to be supported by Dataverse. Can you confirm this?
Exception processing getRecord(), oaiUrl=https://www.zenodo.org/oai2d, identifier=oai:zenodo.org:204063, edu.harvard.iq.dataverse.api.imports.ImportException, Unsupported import metadata format: oai_datacite4
I'm putting this on the Dataverse users list before I create another ticket here if need be.
Best
thomas

@landreev
Contributor

Hello,
Glad to hear that the direct database update hack worked. Of course we'll address this in the next release with a proper metadata block update.

As for the other formats, I don't really know what "oai_datacite4" is. It is safe to say that of all the (10?) formats they are offering (https://www.zenodo.org/oai2d?verb=ListMetadataFormats) oai_dc is the only one Dataverse understands. I am surprised that we are allowing a user to select an unsupported metadata format in the Harvesting Client config. (I thought we were dropping any unsupported formats from the list).
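
Dropping unsupported formats from such a list could look roughly like the sketch below. It is not Dataverse's actual code: `split_formats` and `SUPPORTED` are hypothetical names, and `SAMPLE` is a hand-written, trimmed stand-in for a ListMetadataFormats response, not real output from Zenodo. Only the OAI-PMH namespace and element names follow the protocol.

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 namespace used by ListMetadataFormats responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Assumption taken from the comment above: only oai_dc is understood.
SUPPORTED = {"oai_dc"}

# Hand-written sample for illustration only, not an actual server response.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>oai_datacite4</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def split_formats(xml_text):
    """Return (supported, unsupported) metadata prefixes found in the response."""
    root = ET.fromstring(xml_text)
    prefixes = [
        element.text.strip()
        for element in root.iterfind(".//oai:metadataPrefix", OAI_NS)
        if element.text
    ]
    supported = [p for p in prefixes if p in SUPPORTED]
    unsupported = [p for p in prefixes if p not in SUPPORTED]
    return supported, unsupported


supported, unsupported = split_formats(SAMPLE)
print("offer in client config:", supported)
print("hide from client config:", unsupported)
```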

Looking at an example of oai_datacite4 (https://www.zenodo.org/oai2d?verb=GetRecord&identifier=oai:zenodo.org:204063&metadataPrefix=oai_datacite4), it appears to be simple enough. So it should be very doable to add support for it. But yes, that would definitely need to be handled in a separate issue.

(I'll be mostly offline until next week; but will be back on Monday)
