Selective Harvesting and metha-cat #34

Open
tobiasschweizer opened this issue Sep 15, 2023 · 2 comments


tobiasschweizer commented Sep 15, 2023

Hi @miku,

We are adding more and more OAI-PMH endpoints and metha does a great job!

I have a question about selective harvesting and metha-cat. I have automated harvesting via crontab.
After an initial harvest that gets all records from the earliest day on, we do one selective harvest a week:

metha-sync -T 5m -r 20 -base-dir /mydir -format marcxml https://zenodo.org/oai2d

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume this also includes updates of a record (so the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Thanks and kind regards,

Tobias

miku (Owner) commented Nov 29, 2023

Sorry for my overly delayed reply.

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Yes.
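
For readers less familiar with the protocol: in OAI-PMH, selective harvesting by date is expressed through the `from` argument of a `ListRecords` request. The snippet below is only a sketch of what such a request URL looks like. It is not metha's code, and the function name `list_records_url` and the example date are made up for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords URL, optionally restricted by `from`."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # OAI-PMH datestamps are UTC, e.g. 2023-09-08 or 2023-09-08T00:00:00Z.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical date standing in for "the day of the last harvest".
print(list_records_url("https://zenodo.org/oai2d", "marcxml", "2023-09-08"))
```

A repository answering such a request returns every record created, changed or deleted since that date, which is why an updated record shows up again in a later harvest.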

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume this also includes updates of a record (so the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

Yes.

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Yes, I understand. metha does not do much beyond caching responses so that subsequent invocations are faster (something I haven't seen much in other tools). To be on the safe side with respect to updates, one can always delete the cache for a particular endpoint and start anew.

$ rm -r "$(metha-sync -dir http://my.server.org)"

That of course requires some tolerance for possibly stale records, depending on the requirements.
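
If wiping the cache is not an option, another possibility is to post-process the metha-cat output and keep only the newest copy of each record. The following is a minimal sketch, not something metha ships. It assumes the output is one XML document whose `record` elements each carry an OAI-PMH `header` with `identifier` and `datestamp`, and that all datestamps use the same granularity, so that comparing them as strings is sufficient. The function names and the sample data are made up.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, e.g. '{ns}record' -> 'record'."""
    return tag.rsplit("}", 1)[-1]


def child_text(parent, name):
    """Text of the first direct child with the given local name, or None."""
    for child in parent:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return None


def latest_records(xml_text):
    """Map each OAI identifier to (datestamp, record element) of its newest copy."""
    latest = {}
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "record":
            continue
        header = next((c for c in elem if local(c.tag) == "header"), None)
        if header is None:
            # Not an OAI record, e.g. a MARCXML <record> inside the metadata.
            continue
        ident = child_text(header, "identifier")
        if ident is None:
            continue
        stamp = child_text(header, "datestamp") or ""
        # Assumption: UTC datestamps of equal granularity sort as strings.
        # On a tie, the copy that appears later in the input wins.
        if ident not in latest or stamp >= latest[ident][0]:
            latest[ident] = (stamp, elem)
    return latest


# Made-up sample: record 1 was harvested twice, before and after an update.
sample = """<Records xmlns="http://www.openarchives.org/OAI/2.0/">
<record><header><identifier>oai:example.org:1</identifier><datestamp>2023-09-01T10:00:00Z</datestamp></header></record>
<record><header><identifier>oai:example.org:2</identifier><datestamp>2023-09-02T10:00:00Z</datestamp></header></record>
<record><header><identifier>oai:example.org:1</identifier><datestamp>2023-09-10T10:00:00Z</datestamp></header></record>
</Records>"""

for ident, (stamp, _record) in latest_records(sample).items():
    print(ident, stamp)
```

This reads the whole document into memory, which is fine for small endpoints. For a large one, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be the better fit. Records whose header is marked `status="deleted"` are kept as they are here and may need separate handling.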

tobiasschweizer (Author) commented

No problem and thanks for your response. I'll have a closer look at an endpoint's cache where I assume that a lot of updated records flow in.

Otherwise, metha works nicely and is stable :-) It has been part of our automated workflow for a couple of months.
