Metha-Cat: Support for Paging? #28

Open
tobiasschweizer opened this issue Jun 1, 2022 · 3 comments

Comments

@tobiasschweizer

Hi there,

We are using metha-sync to harvest quite a big set. Everything went smoothly and we could create an XML using metha-cat containing the whole set.
The XML is quite big (2.5 GB) and we have some difficulties processing it.

Is there a way to get the records in steps of a limited size (like paging with a defined size and offset) with metha-cat? Setting the from or until parameters wouldn't help us much, I think, since libraries might process big batches on a single day.

Thanks a lot!

@miku
Owner

miku commented Jun 1, 2022

I ran into similar issues (too big XML files) in the past and I remember that there are tools addressing this problem specifically; one I remember is xml_split.

On Debian/Ubuntu it seems to be available in the xml-twig-tools package.

$ metha-sync https://yareta.unige.ch/oai && metha-cat https://yareta.unige.ch/oai | xml_split -l 1 -s 50kb
$ ls -l
total 456K
drwxrwxr-x  2 tir tir 4.0K Jun  1 13:29 ./
drwxr-xr-x 39 tir tir  56K Jun  1 13:28 ../
-rw-rw-r--  1 tir tir  346 Jun  1 13:29 out-00.xml
-rw-rw-r--  1 tir tir  50K Jun  1 13:29 out-01.xml
-rw-rw-r--  1 tir tir  54K Jun  1 13:29 out-02.xml
-rw-rw-r--  1 tir tir  51K Jun  1 13:29 out-03.xml
-rw-rw-r--  1 tir tir  50K Jun  1 13:29 out-04.xml
-rw-rw-r--  1 tir tir  51K Jun  1 13:29 out-05.xml
-rw-rw-r--  1 tir tir  51K Jun  1 13:29 out-06.xml
-rw-rw-r--  1 tir tir  51K Jun  1 13:29 out-07.xml
-rw-rw-r--  1 tir tir  23K Jun  1 13:29 out-08.xml

The resulting XML is valid, but slightly modified, since each chunk is wrapped in an xml_split:root element:

$ xmllint --format out-01.xml 2> /dev/null | head -4
<?xml version="1.0"?>
<xml_split:root xmlns:xml_split="http://xmltwig.com/xml_split">
  <record xmlns="http://www.openarchives.org/OAI/2.0/">
    <header status="">

Does this help?
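
If an extra tool is not an option, paging can also be approximated on the consumer side by streaming the metha-cat output and handing records on in fixed-size groups. The following is only a minimal sketch, not a metha feature: it assumes each record arrives as a `<record>` element with the usual OAI-PMH `header/identifier` child, as in the xmllint output above. The names `eachPage`, `pageSizes`, and the page size of 1000 are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"strings"
)

// record keeps only the OAI-PMH identifier; add fields as needed.
type record struct {
	Identifier string `xml:"header>identifier"`
}

// eachPage streams XML from r and calls fn once per page of at most size
// records, so only one page is held in memory at a time.
func eachPage(r io.Reader, size int, fn func(page int, recs []record) error) error {
	if size < 1 {
		return fmt.Errorf("page size must be positive, got %d", size)
	}
	dec := xml.NewDecoder(r)
	batch := make([]record, 0, size)
	page := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		// Skip everything that is not the start of a <record> element
		// (assumption: the local element name is "record").
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "record" {
			continue
		}
		var rec record
		if err := dec.DecodeElement(&rec, &start); err != nil {
			return err
		}
		batch = append(batch, rec)
		if len(batch) == size {
			if err := fn(page, batch); err != nil {
				return err
			}
			page++
			batch = make([]record, 0, size) // fresh slice, fn may keep the old one
		}
	}
	if len(batch) > 0 {
		return fn(page, batch) // last, possibly shorter page
	}
	return nil
}

// pageSizes is a helper for trying eachPage on an in-memory document.
func pageSizes(doc string, size int) ([]int, error) {
	var sizes []int
	err := eachPage(strings.NewReader(doc), size, func(_ int, recs []record) error {
		sizes = append(sizes, len(recs))
		return nil
	})
	return sizes, err
}

func main() {
	// Intended use: metha-cat https://yareta.unige.ch/oai | go run main.go
	err := eachPage(os.Stdin, 1000, func(page int, recs []record) error {
		fmt.Printf("page %d: %d records, first id %s\n", page, len(recs), recs[0].Identifier)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

This gives a page size but no offset: skipping to page N still means reading past the earlier records, which is usually acceptable for a one-off batch job.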

PS: Thanks for using metha! I'm just curious (and collecting uses of metha) - if possible, can you share the project name in which metha is used for data acquisition?

@tobiasschweizer
Author

@miku Thanks a lot for your answer. We'll try this out!

We are using metha for our linked open research data project Connectome.

@miku
Owner

miku commented Apr 8, 2024

This is Go-specific, so I'll leave it here as a footnote: I had some success making XML processing faster by parallelizing it, with some ideas taken from "Faster XML processing in Go". Anecdata: 5GB of XML can be processed in a few seconds.
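
To give a rough idea of what parallelizing can look like, here is a minimal sketch. It is not necessarily the approach from the linked post. It assumes records are plain `<record>...</record>` elements without a namespace prefix, that the literal text `</record>` never appears inside CDATA or comments, and that the root element name does not itself start with `<record`. The names `splitRecords`, `parallelIdentifiers`, and `idsFromString` are made up for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"runtime"
	"strings"
	"sync"
)

type record struct {
	Identifier string `xml:"header>identifier"`
}

var closeTag = []byte("</record>")

// splitRecords is a bufio.SplitFunc that cuts the stream after each closing
// </record> tag. It is a plain byte search, not an XML parser.
func splitRecords(data []byte, atEOF bool) (int, []byte, error) {
	if i := bytes.Index(data, closeTag); i >= 0 {
		end := i + len(closeTag)
		return end, data[:end], nil
	}
	if atEOF {
		return len(data), nil, nil // trailing bytes hold no complete record
	}
	return 0, nil, nil // ask the scanner for more data
}

// parallelIdentifiers cuts the stream sequentially (cheap) and decodes the
// pieces in worker goroutines (expensive). Result order is not guaranteed.
func parallelIdentifiers(r io.Reader, workers int) ([]string, error) {
	if workers < 1 {
		workers = 1
	}
	chunks := make(chan []byte, workers)
	var (
		mu       sync.Mutex
		ids      []string
		firstErr error
		wg       sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range chunks {
				var rec record
				err := xml.Unmarshal(chunk, &rec) // decoding runs outside the lock
				mu.Lock()
				if err != nil {
					if firstErr == nil {
						firstErr = err
					}
				} else {
					ids = append(ids, rec.Identifier)
				}
				mu.Unlock()
			}
		}()
	}
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 1<<20), 64<<20) // allow records up to 64 MiB
	sc.Split(splitRecords)
	for sc.Scan() {
		chunk := sc.Bytes()
		// Drop anything before the record itself, such as the root start tag.
		start := bytes.Index(chunk, []byte("<record"))
		if start < 0 {
			continue
		}
		// Copy, because the scanner reuses its buffer on the next Scan.
		chunks <- append([]byte(nil), chunk[start:]...)
	}
	close(chunks)
	wg.Wait()
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return ids, firstErr
}

// idsFromString is a helper for trying the function on an in-memory document.
func idsFromString(doc string, workers int) ([]string, error) {
	return parallelIdentifiers(strings.NewReader(doc), workers)
}

func main() {
	// Intended use: metha-cat https://yareta.unige.ch/oai | go run main.go
	ids, err := parallelIdentifiers(os.Stdin, runtime.NumCPU())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(len(ids), "records decoded")
}
```

Sending one record per channel message keeps the sketch short. Grouping many records into each message would likely reduce synchronization overhead on large inputs.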
