
Adjustions to grobid-client.py #3

Open
darjusp opened this issue Aug 12, 2019 · 0 comments
Labels
enhancement New feature or request implemented At least you try


darjusp commented Aug 12, 2019

Dear Patrice,

As we discussed in the other repo, I adjusted the client so that it can parse citations from text.
The solution turned out a bit ugly, but this is what it does now:

  1. It reads a "txt" file as input, with one citation per line.
  2. It groups the citations into batches of one thousand (or the batch_size specified) and saves each batch to an XML file named after the input file plus the batch's thousand (or batch_size multiple).
  3. At the end, it opens each file and adds the appropriate XML beginning and end.
  4. TXT and PDF file handling are separated after the common function "process".
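
The batching and wrapping steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the linked file: the function names, the output naming scheme and the `<listBibl>` wrapper element are all hypothetical.

```python
from itertools import islice
from pathlib import Path


def batched_lines(path, batch_size=1000):
    """Yield lists of up to batch_size non-empty lines from a text file."""
    with open(path, encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        non_empty = (line for line in stripped if line)
        while True:
            batch = list(islice(non_empty, batch_size))
            if not batch:
                return
            yield batch


def batch_output_path(input_path, batch_index, batch_size=1000):
    """Name the output after the input file plus the upper bound of the batch.

    The "<stem>_<upper>.xml" pattern is an assumption made for this sketch.
    """
    source = Path(input_path)
    upper = (batch_index + 1) * batch_size
    return source.with_name(f"{source.stem}_{upper}.xml")


def wrap_batch(fragments, header="<listBibl>\n", footer="</listBibl>\n"):
    """Surround already-parsed XML fragments with an opening and closing element.

    The default wrapper element is a placeholder, not the one the client writes.
    """
    return header + "".join(f"{fragment}\n" for fragment in fragments) + footer
```

Reading the file lazily keeps memory use flat even for inputs with millions of lines, since only one batch is held at a time.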

Issues:

  1. I had to rename the 'input' variable to 'input2' because Python was complaining about the name.
  2. The input file must be a TXT file.
  3. If more than one worker is specified, the output no longer follows the order of the input file.
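
One possible way around the ordering issue, sketched here under assumptions rather than taken from the client: `Executor.map` from `concurrent.futures` returns results in the order of its inputs, even when workers finish out of order. `parse_citation` is a hypothetical stand-in for the request sent to the GROBID service.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_citation(raw_citation):
    """Hypothetical stand-in for the call that sends one citation to the service."""
    return f"<biblStruct>{raw_citation}</biblStruct>"


def parse_in_order(citations, workers=4):
    """Parse citations concurrently while returning results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(executor.map(parse_citation, citations))
```

By contrast, collecting results in completion order (for example with `as_completed`) returns them in whatever order the workers finish, which does not preserve the input order.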

Examples:
If the order matters (--n 1):
python grobid-client.py --input /path/to/refs/file.txt --n 1
If it does not (--n greater than 1, or the default):
python grobid-client.py --input /path/to/refs/file.txt
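
For the two flags used in these commands, a minimal argparse sketch could look like the following. It is an assumption about how the options might be declared, not the client's actual parser; the `dest` name and the default worker count are made up for illustration. Storing the value under a different attribute name also avoids needing a variable called `input`.

```python
import argparse


def build_parser():
    """Build a parser for the --input and --n options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the client's command line")
    # dest stores the value as args.input_path, so no variable named `input` is needed.
    parser.add_argument(
        "--input",
        dest="input_path",
        required=True,
        help="path to the TXT file with one citation per line",
    )
    # The default of 10 workers is an assumption for this sketch.
    parser.add_argument("--n", type=int, default=10, help="number of concurrent workers")
    return parser
```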

Parsing 2 million citations with a single worker on a 2015 MacBook Pro took around 6 hours (about 90 citations per second). Not so slow :)

Here is the file: https://github.com/darjusp/contribs/blob/master/grobid-client.py

@kermitt2 kermitt2 added the enhancement New feature or request label Jun 9, 2021
@kermitt2 kermitt2 added the implemented At least you try label Mar 5, 2023