GROBID disk I/O issues (temp directory config related) #871

Closed
bnewbold opened this issue Dec 7, 2021 · 3 comments
Labels
enhancement, implemented

Comments

@bnewbold
Contributor

bnewbold commented Dec 7, 2021

In our use of GROBID, we have machines with a reasonable number of cores and RAM (e.g., 30 cores, 40GB RAM) but poor disk I/O. This makes it important for GROBID either not to write to disk at all, or to use a ramdisk (a virtual RAM-backed partition) when it must (e.g., for interaction with pdfalto).

In the past it was possible to configure grobid.temp to point to, eg, /run/grobid/tmp, which we configured on Linux to be a ramdisk. In newer versions of GROBID, it looks like this doesn't work any more, due to this change: c8e11b8#diff-65f7e37a114e9b9339efbb8ec03c4b19aec2f6998f127d539b6a07b01aa9b303L360-R362

Eg, if we use YAML to configure:

grobid:
    temp: "/run/grobid/tmp"

then I can see GROBID writing PDF files to: /srv/grobid/grobid-service-0.7.0-131-gdd0251d9f/grobid-home/run/grobid/tmp/origin2651762335153943539.pdf (that is, the configured value is treated as a path relative to grobid-home, not as an absolute path).

I don't know the Java APIs well enough to recommend an alternative function to use, but it seems like it should be possible to use grobid-home as a prefix for relative paths, but fall back and allow absolute paths if the grobid.temp variable is an absolute path.
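
To illustrate the behavior suggested above, here is a minimal sketch using the standard `java.nio.file.Path.resolve` method, which returns its argument unchanged when that argument is absolute and joins it onto the base path otherwise. The class and method names (`TempDirResolver`, `resolveTempDir`) are hypothetical and are not taken from GROBID's code; the paths assume a Unix-like file system.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TempDirResolver {

    // Hypothetical helper, not GROBID's actual implementation.
    // Path.resolve keeps an absolute argument as is and appends a
    // relative argument to the base path (here: grobid-home).
    static Path resolveTempDir(Path grobidHome, String configuredTemp) {
        return grobidHome.resolve(Paths.get(configuredTemp)).normalize();
    }

    public static void main(String[] args) {
        Path home = Paths.get("/srv/grobid/grobid-home");

        // Absolute setting: used as configured.
        System.out.println(resolveTempDir(home, "/run/grobid/tmp"));

        // Relative setting: resolved under grobid-home, as before.
        System.out.println(resolveTempDir(home, "tmp"));
    }
}
```

On Linux this should print `/run/grobid/tmp` followed by `/srv/grobid/grobid-home/tmp`.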

Separately, I can also see files like /tmp/MIME2368838021331894851.tmp getting written, and it seems like the GROBID java process is writing these. I think this is due to Jersey? I vaguely remember being able to control the location these get written using the TMPDIR UNIX environment variable in the past, but that doesn't seem to be working. It would be great to be able to control this location, or just have it be the same as grobid.temp.
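
One possible explanation, offered as an assumption rather than a confirmed diagnosis: on Linux the JVM takes its default temp directory from the `java.io.tmpdir` system property, which defaults to `/tmp` and, as far as I know, is not derived from the `TMPDIR` environment variable. If the MIME files are created through the default `File.createTempFile` location, starting the JVM with `-Djava.io.tmpdir=/run/grobid/tmp` may redirect them. The small check below (the class name `JvmTempDirCheck` is hypothetical) prints both values and shows where a default temp file actually lands.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class JvmTempDirCheck {

    // The directory the JVM uses by default for temp files.
    static String jvmTempDir() {
        return System.getProperty("java.io.tmpdir");
    }

    // Creates a throwaway temp file in the default location and returns
    // its parent directory. The "MIME" prefix only mimics the file names
    // seen in the report; it is not Jersey's own code path.
    static File defaultTempParent() {
        try {
            File f = File.createTempFile("MIME", ".tmp");
            f.deleteOnExit();
            return f.getParentFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("TMPDIR env:      " + System.getenv("TMPDIR"));
        System.out.println("java.io.tmpdir:  " + jvmTempDir());
        System.out.println("temp files land: " + defaultTempParent());
    }
}
```

Running this once with `TMPDIR` set and once with `-Djava.io.tmpdir=...` on the command line should show which of the two the JVM honors on a given machine. The property generally needs to be passed at JVM startup, since changing it later with `System.setProperty` is not guaranteed to take effect.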

A workaround for the first issue (absolute paths not possible) is to create a symlink to the location I want. I can't think of a way to do that for the second problem without making the entire /tmp directory a ramdisk or a symlink, which could have other unintended consequences.

@iiLaurens

+1 as I am dealing with a similar problem. Setting the temporary directory using a config file would be helpful.

@lfoppiano
Collaborator

lfoppiano commented Jul 5, 2022

I've quickly made a PR (#932) with a change that uses the temporary directory as is if the path is absolute, and resolves it as before if the path is relative.
Maybe you can test it. 😅

lfoppiano self-assigned this Jul 5, 2022
@kermitt2
Owner

kermitt2 commented Jul 5, 2022

PR tested and merged!

lfoppiano added the "implemented" label Jul 14, 2022