Enable harvesting from Dataverse without specifying the set. (aka bring back the default, "everything" set) #4659

Closed
shirleyah opened this issue May 9, 2018 · 15 comments

Comments

@shirleyah

Hello all,

Is there a way to allow harvesting without using OAI sets? A client (not a Dataverse installation) is trying to harvest our repository (a Dataverse installation), but it cannot harvest individual sets.

Would it be possible to not define a specific set and still allow this client to harvest our repository?

@shirleyah changed the title from "Harvesting no using OAI sets" to "Harvesting not using OAI sets" May 9, 2018
@djbrooke
Contributor

djbrooke commented May 9, 2018

Hi @shirleyah - at Harvard, we've created a set that provides all records in the repository. Are you trying to set up a set that contains all of the records in your repository? If so, I can provide the query.

If you're trying to avoid using sets at all, I'd need to do some more research to see if it's possible, but the OAI-PMH documentation may be a good source:

https://www.openarchives.org/OAI/openarchivesprotocol.html

@shirleyah
Author

Hi @djbrooke,

We defined a set using this query: dsPersistentId:"doi:10.5072/FK2/" and we use this URL for harvesting:
http://132.248.220.47/oai?verb=ListRecords&metadataPrefix=oai_dc&set=testSet

However, the OAI-PMH client that wants to harvest does not support individual sets; it only harvests complete repositories. This client was developed by the national repository in Mexico.
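
To make the difference between the two kinds of request concrete, here is a minimal Python sketch that builds the URL above with and without the set argument. The helper name `build_oai_url` is made up for illustration; only the base URL, verb, metadataPrefix and set name come from this thread.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, metadata_prefix, set_spec=None):
    """Build an OAI-PMH request URL; `set` is added only when a setSpec is given."""
    params = {"verb": verb, "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


BASE = "http://132.248.220.47/oai"

# What works today: a selective harvest of one named set.
with_set = build_oai_url(BASE, "ListRecords", "oai_dc", "testSet")

# What the national repository's client sends: no set argument at all.
without_set = build_oai_url(BASE, "ListRecords", "oai_dc")
```

The second URL is the one the server would need to answer for a whole-repository harvest.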

We don't know if there is a way to allow this request.

P.S. Could you provide your query?

Thanks,

@djbrooke
Contributor

Hey @shirleyah - the query we use is:

dsPersistentId:"hdl:1902.1" OR dsPersistentId:"doi:10.7910" OR dsPersistentId:"hdl:10904"

I'm not sure we support harvesting without the definition of a specific set. I'll check around tomorrow with some more knowledgeable members of the team (@landreev @kcondon).

@pdurbin
Member

pdurbin commented Jun 4, 2018

@juancorr recently left the following comment at https://groups.google.com/d/msg/dataverse-community/xrhAth__ZF8/6brEjszRAAAJ

"oai-pmh protocol request that ListRecords and ListIdentifiers with not parameter (set), should be return all records:

https://www.openarchives.org/OAI/openarchivesprotocol.html#ListRecords
https://www.openarchives.org/OAI/openarchivesprotocol.html#ListIdentifiers

To have a 100% oai-pmh response, there are some goods oai-pmh validator portals http://validator.oaipmh.com/ , http://re.cs.uct.ac.za/ "

From a quick look it does sound like "set" is supposed to be optional: "an optional argument with a setSpec value , which specifies set criteria for selective harvesting" (emphasis theirs).

@ghost

ghost commented Jun 4, 2018

Just a quick comment to say that we're having this problem at AUSSDA as well. Our umbrella organization (CESSDA) is building a consolidated data catalog but is unable to harvest our metadata because their harvest client is unable to use the set parameter in the query string.

@pbinkley

I support @pdurbin 's comment: this problem puts the Dataverse OAI server out of compliance with the spec. It shouldn't be necessary to include a set parameter in the request in order to get results. This really should be fixed.

@landreev changed the title from "Harvesting not using OAI sets" to "Enable harvesting from Dataverse without specifying the set. (aka bring back the default, "everything" set)" Aug 22, 2018
@landreev
Contributor

This is very doable. But a few technical decisions will need to be made as we go.
The spec does indeed say that the set parameter is optional. But note that it doesn't really say what the server should return when it receives such a request. I.e., what this default set should contain. (or that it should return anything, for that matter). Although in real life, most implementations assume that no set means "return everything you got".
But even that is somewhat a matter of interpretation: we'll need to decide if this will mean every published dataset in the Dataverse; or, say, every dataset in the sets currently defined?

Will this default, no name set be created and exported automatically, as soon as Harvesting service is enabled? - or does it need to be requested specifically?

Do we want this set to show up in the UI, as one of the currently available sets? (if not - then there's no UI work involved at all).

Once again, very doable - but a few decisions that will need to be made in the process.

@pbinkley

Note that the spec does say:

"An item need not be organized in any set; meaning that an exhaustive repetition of ListRecords requests with all possible setSpecs is not guaranteed to return all records in the repository. The only guaranteed methods of harvesting all records or headers are ListRecords or ListIdentifiers requests with no setSpec argument." (in the bullets at the end of 2.6).

So I think the assumption of "return everything you got" is the one to go with: all published datasets, regardless of set membership.
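
For reference, a harvester that relies on this part of the spec might look roughly like the Python sketch below: it issues ListIdentifiers with no set argument and follows resumptionToken values until the list is complete. The function names and the `fetch` callable (which takes the request parameters and returns the response XML as a string) are assumptions made for illustration, not part of any existing client.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token or None) from one response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = (token_el.text or "").strip() if token_el is not None else ""
    return ids, (token or None)


def harvest_all(fetch, metadata_prefix="oai_dc"):
    """Collect every identifier the server exposes, without naming a set."""
    # First request: no `set` argument, i.e. "the whole repository".
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    identifiers = []
    while True:
        ids, token = parse_list_identifiers(fetch(params))
        identifiers.extend(ids)
        if token is None:
            return identifiers
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library.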

@landreev
Contributor

We just discussed this as a group, for the purposes of putting the issue on the schedule of the next development sprint.
Yes, we are going to assume that the "no set"/default set means all the published datasets by default.

But I'm still wondering if this should be configurable somehow. Specifically, the sentence from the spec that you posted above - "The only guaranteed methods of harvesting all records ..." - appears to say unambiguously that omitting the set name must return all the available OAI records, yes. But would any Dataverse owner always want to make their entire public holdings available via OAI? In other words, does every published dataset automatically translate into an OAI record?

I would assume that most Dataverses would in fact want it to work like that. But then I can imagine a use case where somebody only wants a small subset of their holdings to be harvestable via OAI... So maybe there should be a checkbox in the Harvesting Service control panel - "include all the published datasets in the default set". It will be checked by default; but if unchecked, the default set will contain the unique sum of all the defined sets (will still be compliant with the spec too) - something like that? Let's review this option as we work on this...
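
To illustrate the two behaviours being weighed here, a small Python sketch follows. It is not Dataverse code: the function name, the `include_all_published` flag (standing in for the proposed checkbox) and the data shapes are all hypothetical.

```python
def default_set_contents(published, named_sets, include_all_published=True):
    """Return the identifiers that the default (no-name) set would contain.

    published: iterable of identifiers of all published datasets.
    named_sets: dict mapping a set name to an iterable of identifiers.
    include_all_published: hypothetical stand-in for the proposed checkbox.
    """
    if include_all_published:
        # Checked (the default): every published dataset is harvestable.
        return set(published)
    # Unchecked: the unique sum (union) of all explicitly defined sets.
    union = set()
    for members in named_sets.values():
        union |= set(members)
    return union
```

With the flag off, a dataset that belongs to no named set is simply never exposed as an OAI record, which is why a harvest with no set argument would still return every record the server has.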

@landreev
Contributor

Also, just to document the solution proposed by @pameyer:

Implement this by not doing anything at all, code-wise: define a set with all the published datasets (call it "everything", for example), and then add a rewrite rule, in Apache or Glassfish, that modifies incoming ListIdentifiers and ListRecords requests that don't specify a set by adding ...set=everything. This is a hack, but it would indeed achieve the same result w/out doing any work. (Except for documenting the above hack in the guide)
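
The logic such a rewrite rule would need can be sketched in Python as below. This is only an illustration of the decision the rule has to make, not an Apache or Glassfish configuration; the function name is made up, and the resumptionToken check is an addition based on the OAI-PMH rule that resumptionToken is an exclusive argument, so follow-up requests must not get a set appended.

```python
from urllib.parse import parse_qsl, urlencode

# The only verbs that accept a `set` argument.
SET_VERBS = {"ListIdentifiers", "ListRecords"}


def add_default_set(query_string, default_set="everything"):
    """Append set=<default_set> to a query string when the harvester sent none."""
    params = parse_qsl(query_string, keep_blank_values=True)
    keys = {key for key, _ in params}
    verb = dict(params).get("verb")
    if (
        verb in SET_VERBS
        and "set" not in keys
        and "resumptionToken" not in keys  # exclusive argument, leave as-is
    ):
        params.append(("set", default_set))
    return urlencode(params)
```

Requests that already name a set, or that use another verb such as Identify, pass through unchanged.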

@landreev landreev self-assigned this Aug 27, 2018
landreev added a commit that referenced this issue Sep 7, 2018
Refactored/renamed some methods;
Also fixed the unrelated bug introduced in a928025, when resolving a merge conflict.
(still need to check in some tests)
(ref #4659)
@landreev
Contributor

landreev commented Sep 7, 2018

Quick status update:
What's checked in this branch right now is ready to go, in that it implements it as requested (and as it was implemented in the pre-Dataverse 4* days): the default, no-name set contains all the local, published, successfully exported datasets.
I wanted to implement an option where the admin could limit the contents of the default set only to the unique sum of the datasets in the explicitly configured named sets. (in case some admin may have a reason NOT to want to expose all their published datasets via OAI?). But then I stopped myself, and figured we should release it as is, and only bother adding this feature if someone asks.

More importantly, I fixed an unrelated OAI server bug that apparently was introduced in a928025 back in June, when resolving a merge conflict.

But that last part made me think we need at least some simple IT tests for the basic OAI health (create a set with one dataset; retrieve the set listing and the OAI record; validate the output). Once this is done, will make a PR.
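
As a rough idea of what the "validate the output" step could check, here is a Python sketch that inspects ListSets and GetRecord responses. It is not the project's test code; the function names are hypothetical, and it works on response XML that has already been fetched as a string.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def set_specs(list_sets_xml):
    """Return every setSpec advertised in a ListSets response."""
    root = ET.fromstring(list_sets_xml)
    return [el.text for el in root.iter(OAI + "setSpec")]


def record_identifier(get_record_xml):
    """Return the identifier in a GetRecord response, or raise on an OAI error."""
    root = ET.fromstring(get_record_xml)
    error = root.find(OAI + "error")
    if error is not None:
        raise ValueError(f"OAI error: {error.get('code')}")
    header = root.find(f"{OAI}GetRecord/{OAI}record/{OAI}header")
    if header is None:
        raise ValueError("response contains no record header")
    return header.findtext(OAI + "identifier")
```

A health check would then assert that the newly created set appears in `set_specs(...)` and that `record_identifier(...)` matches the dataset that was added to it.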

@pdurbin
Member

pdurbin commented Sep 10, 2018

@landreev I just left a review on pull request #5037 and while I think the code looks good overall, if you could move the English out of the Java and into the bundle where I indicated it would be much appreciated.

@landreev
Contributor

Will add a short paragraph for the release notes;
and add a few lines to the "managing OAI sets" section in the guide.

@kcondon
Contributor

kcondon commented Sep 14, 2018

Harvesting with no set failed between test Dataverse instances.

@landreev
Contributor

Should it be in dev or QA at this point? Since it's actually working now, for the purposes of the issue?
We can discuss on Monday...
