Enable harvesting from Dataverse without specifying the set. (aka bring back the default, "everything" set) #4659
Comments
Hi @shirleyah - at Harvard, we've created a set that provides all records in the repository. Are you trying to set up a set that contains all of the records in your repository? If so, I can provide the query. If you're trying to avoid using sets at all, I'd need to do some more research to see if it's possible, but the OAI-PMH documentation may be a good source.
Hi @djbrooke, We defined a set using this query: dsPersistentId:"doi:10.5072/FK2/" and this URL to allow harvesting: However, the OAI-PMH client that wants to harvest our repository does not support individual sets; it only harvests complete repositories. This client was developed by the national repository in Mexico. We don't know if there is a way to allow this request. P.S. Could you provide your query? Thanks,
Hey @shirleyah - the query we use is: dsPersistentId:"hdl:1902.1" OR dsPersistentId:"doi:10.7910" OR dsPersistentId:"hdl:10904" I'm not sure we support harvesting without defining a specific set. I'll check around tomorrow with some more knowledgeable members of the team (@landreev @kcondon).
@juancorr recently left the following comment at https://groups.google.com/d/msg/dataverse-community/xrhAth__ZF8/6brEjszRAAAJ "oai-pmh protocol request that ListRecords and ListIdentifiers with not parameter (set), should be return all records: https://www.openarchives.org/OAI/openarchivesprotocol.html#ListRecords To have a 100% oai-pmh response, there are some goods oai-pmh validator portals http://validator.oaipmh.com/ , http://re.cs.uct.ac.za/ " From a quick look it does sound like "set" is supposed to be optional: "an optional argument with a setSpec value, which specifies set criteria for selective harvesting" (emphasis theirs).
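To make the two request shapes concrete, here is a minimal Python sketch. The base URL `https://demo.example.org/oai` is a placeholder and `build_list_records_url` is a hypothetical helper, not part of Dataverse; only the `verb`, `metadataPrefix`, and `set` argument names come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords URL (hypothetical helper).

    The `set` argument is optional in the protocol, so it is only
    added to the query string when the caller supplies one.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, used for illustration only.
BASE = "https://demo.example.org/oai"
whole_repository = build_list_records_url(BASE)
one_set_only = build_list_records_url(BASE, set_spec="everything")
```

A client that cannot send `set` would issue the first form, which is the request this issue asks the server to answer with all records.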
Just a quick comment to say that we're having this problem at AUSSDA as well. Our umbrella organization (CESSDA) is building a consolidated data catalog but cannot harvest our metadata because their harvest client is unable to use the set parameter in the query string.
I support @pdurbin 's comment: this problem puts the Dataverse OAI server out of compliance with the spec. It shouldn't be necessary to include a |
This is very doable. But a few technical decisions will need to be made as we go. Will this default, unnamed set be created and exported automatically as soon as the Harvesting service is enabled, or does it need to be requested specifically? Do we want this set to show up in the UI as one of the currently available sets? (If not, then there's no UI work involved at all.) Once again, very doable, but a few decisions will need to be made in the process.
Note that the spec does say:
So I think the assumption of "return everything you got" is the one to go with: all published datasets, regardless of set membership.
We just discussed this as a group, for the purposes of putting the issue on the schedule of the next development sprint. But I'm still wondering if this should be configurable somehow. Specifically, the sentence from the spec that you posted above - "The only guaranteed methods of harvesting all records ..." - appears to say unambiguously that omitting the set name must return all the available OAI records, yes. But would every Dataverse owner always want to make their entire public holdings available via OAI? In other words, does every published dataset automatically translate into an OAI record? I would assume that most Dataverses would in fact want it to work like that. But then I can imagine a use case where somebody only wants a small subset of their holdings to be harvestable via OAI... So maybe there should be a checkbox in the Harvesting Service control panel - "include all the published datasets in the default set". It will be checked by default; but if unchecked, the default set will contain the unique sum (that is, the union) of all the defined sets, which will still be compliant with the spec - something like that? Let's review this option as we work on this...
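A rough Python sketch of the checkbox behavior described above, assuming set membership is available as a mapping from set name to dataset identifiers. The function name, the `include_all` flag, and the identifiers are all hypothetical and only illustrate the proposed rule, not how Dataverse stores its sets.

```python
def default_set_members(all_published, defined_sets, include_all=True):
    """Return the dataset identifiers served when no `set` is requested.

    include_all=True models the checked checkbox: every published dataset.
    include_all=False models the unchecked case: the union of all defined
    sets, so a dataset that belongs to several sets is listed only once.
    """
    if include_all:
        return set(all_published)
    members = set()
    for identifiers in defined_sets.values():
        members.update(identifiers)
    return members


# Made-up identifiers, for illustration only.
published = ["ds1", "ds2", "ds3", "ds4"]
sets = {"social": ["ds1", "ds2"], "survey": ["ds2", "ds3"]}
everything = default_set_members(published, sets)
only_defined = default_set_members(published, sets, include_all=False)
```

With the checkbox unchecked, `ds4` is left out because it belongs to no defined set, while `ds2` appears once even though two sets contain it.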
Also, just to document the solution proposed by @pameyer: implement this by not doing anything at all, code-wise. Define a set with all the published datasets (call it "everything", for example), then add a rewrite rule, in Apache or Glassfish, that modifies incoming ListIdentifiers and ListRecords requests that don't specify a set by adding ...set=everything. This is a hack, but it would indeed achieve the same result without doing any work (except for documenting the hack in the guide).
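The logic such a rewrite rule would need can be sketched in Python. This is not an Apache or Glassfish configuration, just an illustration of the decision the rule has to make; `add_default_set` is a hypothetical name. One assumption goes beyond the comment above: requests carrying a `resumptionToken` are left untouched, because OAI-PMH treats that argument as exclusive and adding `set` next to it would make the request invalid.

```python
from urllib.parse import parse_qsl, urlencode

# The two verbs that accept the optional `set` argument.
SELECTIVE_VERBS = {"ListIdentifiers", "ListRecords"}


def add_default_set(query_string, default_set="everything"):
    """Append set=<default_set> to a query string that lacks a set.

    Only ListIdentifiers and ListRecords requests are rewritten.
    Requests that already name a set, or that continue a previous
    list via resumptionToken, are returned unchanged.
    """
    params = parse_qsl(query_string, keep_blank_values=True)
    names = {name for name, _ in params}
    verb = dict(params).get("verb")
    if verb not in SELECTIVE_VERBS:
        return query_string
    if "set" in names or "resumptionToken" in names:
        return query_string
    return urlencode(params + [("set", default_set)])
```

For example, `add_default_set("verb=ListRecords&metadataPrefix=oai_dc")` gains a trailing `set=everything`, while an `Identify` request passes through as is.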
Quick status update: More importantly, I fixed an unrelated OAI server bug that apparently was introduced in a928025 back in June, when resolving a merge conflict. But that last part made me think we need at least some simple IT tests for the basic OAI health (create a set with one dataset; retrieve the set listing and the OAI record; validate the output). Once this is done, I will make a PR.
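As an illustration of the "validate the output" step only, here is a small Python sketch that parses a ListRecords response and fails on an OAI error element. The sample XML and the `record_identifiers` helper are made up for this example and are much smaller than a real server response; the actual tests in the project may look quite different.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def record_identifiers(xml_text):
    """Return the header identifiers found in an OAI-PMH response.

    Raises ValueError if the response contains an <error> element,
    for example when the server reports that no records match.
    """
    root = ET.fromstring(xml_text)
    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise ValueError(f"OAI error {error.get('code')}: {error.text}")
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written sample response with a single made-up record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>doi:10.5072/FK2/EXAMPLE</identifier>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

identifiers = record_identifiers(SAMPLE)
```

A health check for a set containing one dataset could then assert that exactly one identifier comes back, both with and without the `set` argument.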
Will add a short paragraph for the release notes.
Harvesting with no set specified failed between test Dataverse instances.
Should it be in dev or QA at this point? Since it's actually working now, for the purposes of the issue?
Hello all,
Is there a way to allow harvesting without using OAI sets? A client (not a Dataverse) is trying to harvest our repository (a Dataverse), but it cannot harvest individual sets.
Would it be possible to avoid defining a specific set and still allow this client to harvest our repository?