
Could the java documentation and the process of embedding grobid into Java project be updated? #577

Closed
lucaspada894 opened this issue May 1, 2020 · 40 comments
Labels
need help Issues where the contributors are even more incompetent than usual Windows-specific Issue visible only on Windows environments

Comments

@lucaspada894

Following the instructions on the grobid site, I cannot embed grobid into my Java project because the instructions for Gradle and Maven are poor. Also, I do not know how to use the APIs because the methods take different parameters than the Java documentation shows. Specifically, fullTextToTei takes different arguments than what is shown in the Javadoc.

@lucaspada894 lucaspada894 changed the title from "Could the java documentation and the process of embedding grobig into Java project be updated?" to "Could the java documentation and the process of embedding grobid into Java project be updated?" May 1, 2020
@lucaspada894
Author

Honestly, the entire process of using the API is outdated on the site.

@kermitt2
Owner

kermitt2 commented May 1, 2020

Hello @lucaspada894 !

We've just made a new release, so the documentation and the grobid modules/demos in the other repo have not all been tested and updated yet. It will take a few days/weeks... sorry (this is a side project for us, so everything moves at a slow pace here :).

In the meantime, you can look at the class grobid/grobid-core/src/main/java/org/grobid/core/engines/Engine.java, which is basically the Java API to use.

@lucaspada894
Author

lucaspada894 commented May 1, 2020 via email

@lfoppiano
Collaborator

@lucaspada894 could you share the snippet / code you're using?

@lucaspada894
Author

lucaspada894 commented May 1, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 1, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 1, 2020 via email

@lfoppiano
Collaborator

No worries, anytime is good. I'll try to answer quickly, but allow some time ;-)

@lfoppiano
Collaborator

Meanwhile, have you looked at the grobid-example sample project? https://github.com/kermitt2/grobid-example

@kermitt2
Owner

kermitt2 commented May 1, 2020

@lucaspada894 maybe try this https://github.com/kermitt2/article-dataset-builder
You will get a very high rate of OA PDFs processed by Grobid without pain. For instance, for the same article list, you will get more structured full texts than the official CORD-19 dataset, with significantly better/richer structuring thanks to the latest version of Grobid (CORD-19 also relies on Grobid conversion, but not on the latest version of Grobid, from what I have seen).

From my experience, when it comes to pipelines for scientific articles, using web services is much more convenient.

@lucaspada894
Author

lucaspada894 commented May 1, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 1, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 2, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 2, 2020 via email

@kermitt2
Owner

kermitt2 commented May 3, 2020

The simplest and most efficient way to integrate grobid in an application is to use the service: it provides multithreading, robustness, good documentation, Docker support, etc. There's a Java client here.

Using the Java API would be justified, I think, only if you need some low-level data structures and functionality; for processing ordinary scientific articles this should not be necessary. You would then need to understand the API, implement your own parallelization, and so on, which is a very big effort for something that already exists elsewhere.

Having said that, if you stay with the Java API integration, everything is now updated, including the javadoc. Can you run the grobid-example sample project? If not, could you provide an error trace and some info about your environment and JDK version?
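
As a rough illustration of the service route recommended above, the following Java 11+ sketch builds a multipart request for a running GROBID server. The `/api/processFulltextDocument` path and the `input` form field are taken from the GROBID service documentation as commonly described; the base URL `http://localhost:8070`, the class name `GrobidRequestSketch`, and both helper methods are assumptions made for this example. This is not the official Java client mentioned in the comment.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of GROBID.
public final class GrobidRequestSketch {

    // Builds a single-part multipart/form-data body carrying the PDF in a field named "input".
    public static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
        out.write(head, 0, head.length);
        out.write(pdfBytes, 0, pdfBytes.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    // Builds (but does not send) a POST request for full-text processing.
    public static HttpRequest fullTextRequest(String baseUrl, String fileName, byte[] pdfBytes) {
        String boundary = "grobidSketchBoundary" + System.nanoTime();
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/processFulltextDocument"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(boundary, fileName, pdfBytes)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java GrobidRequestSketch.java <file.pdf>");
            return;
        }
        Path pdf = Path.of(args[0]);
        // Assumed server location; adjust to wherever the service is running.
        HttpRequest request = fullTextRequest(
                "http://localhost:8070", pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // TEI XML on success
    }
}
```

Keeping request construction separate from sending makes it easy to wrap the send step in retry or throttling logic later.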

@lfoppiano
Collaborator

@lucaspada894 did you solve your problem, or do you still need help?

@lucaspada894
Author

lucaspada894 commented May 21, 2020 via email

@kermitt2
Owner

@lucaspada894 In principle the server does not crash in any of these kinds of cases (it has run over 12M PDFs without any crashes, and 44K is not a lot at all), but the crash could come from a recent update, or from overloading the server by not waiting after receiving a 503 response. Are you using Windows? How much memory do you have, and how many concurrent queries are you sending? Which version of GROBID are you using? Can you provide the server logs for the crashes? Optionally, if you have a shareable problematic PDF, it can help.

@lucaspada894
Author

lucaspada894 commented May 21, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 21, 2020 via email

@lfoppiano
Collaborator

OK, so you're using Windows. As you can see from all the unresolved issues about Windows, our recommendation would be to use a virtual machine with Linux or to run grobid on Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/).

@lfoppiano lfoppiano added the Windows-specific Issue visible only on Windows environments label May 21, 2020
@lucaspada894
Author

lucaspada894 commented May 21, 2020 via email

@lfoppiano
Collaborator

macOS should not be a problem, right?

No, it should not; if you are developing, it's fine. It's what I'm using for development, anyway.

However, certain components behave slightly differently (though not to the same degree as on Windows), so for batch processing and production environments, the primary platform is Linux.

@lfoppiano lfoppiano added the need help Issues where the contributors are even more incompetent than usual label May 21, 2020
@lucaspada894
Author

lucaspada894 commented May 21, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 22, 2020 via email

@lfoppiano
Collaborator

@lucaspada894 I'm trying to reproduce your problem, did you use the grobid-client-python to process your files?

@lucaspada894
Author

lucaspada894 commented May 23, 2020 via email

@lfoppiano
Collaborator

OK, a "connection refused" error means the Docker container has been killed. This can happen when the allocated memory is not enough. However, I had it set to 4.5 GB and still got the problem.
Could you try to increase the memory allocated to your Docker server to, let's say, 6 or 8 GB?

Grobid should work fine with 4 GB but, well, let's see if we can make it run first ;-)

@lucaspada894
Author

lucaspada894 commented May 23, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 23, 2020 via email

@lfoppiano
Collaborator

Two things.

First of all, you should use the latest version, which is 0.6.0 (yes, I know the documentation said 0.5.6; I corrected it this morning). 😅 💦

Secondly, that error is fine: when the pool is empty, the system returns 503 (Service Unavailable), which makes the client wait and try again. What is not good is that the service stops after a while.

Try to increase the memory:

image

I will also investigate

WARN  [2020-05-23 06:56:30,587] org.grobid.core.utilities.LanguageUtilities: Cannot detect language because of: java.lang.IllegalStateException: Cannot read profiles for cybozu language detection from: /opt/grobid/grobid-home/language-detection/cybozu/profiles
WARN  [2020-05-23 06:56:30,601] org.grobid.core.utilities.LanguageUtilities: Cannot detect language because of: java.lang.IllegalStateException: Cannot read profiles for cybozu language detection from: /opt/grobid/grobid-home/language-detection/cybozu/profiles
WARN  [2020-05-23 06:56:30,602] org.grobid.core.utilities.LanguageUtilities: Cannot detect language because of: java.lang.IllegalStateException: Cannot read profiles for cybozu language detection from: /opt/grobid/grobid-home/language-detection/cybozu/profiles

but not today...
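
The wait-and-retry behaviour on 503 described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class `RetryOn503Sketch`, the method `callWithRetry`, and the doubling delay are hypothetical and do not reproduce the code of any GROBID client. The `attempt` supplier stands in for whatever sends one request and returns its HTTP status code.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of GROBID or its clients.
public final class RetryOn503Sketch {

    /**
     * Runs attempt until it returns a status other than 503 or maxAttempts is reached.
     * Sleeps between attempts, doubling the delay each time, and returns the last status seen.
     */
    public static int callWithRetry(IntSupplier attempt, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        int status = 503;
        for (int i = 1; i <= maxAttempts; i++) {
            status = attempt.getAsInt();
            if (status != 503) {
                return status; // success or a different error: stop retrying
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(delay); // give the server time to free a worker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return status;
                }
                delay *= 2;
            }
        }
        return status;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated server: busy twice, then OK.
        int status = callWithRetry(() -> ++calls[0] < 3 ? 503 : 200, 5, 10);
        System.out.println(status + " after " + calls[0] + " calls");
    }
}
```

A client that resends immediately after a 503, instead of waiting, keeps the server saturated, which is the overloading scenario mentioned earlier in the thread.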

@lucaspada894
Author

lucaspada894 commented May 23, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 23, 2020 via email

@lucaspada894
Author

lucaspada894 commented May 23, 2020 via email

@lfoppiano
Collaborator

It seems that the Docker container has been terminated. For large processing jobs, 8 GB is probably enough. Otherwise, reduce the number of parallel threads in the client.
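
For a custom Java client, one way to reduce the number of parallel threads is a fixed-size thread pool. The sketch below is a generic illustration: `BoundedClientSketch`, `processAll`, and the `worker` function (which would send one document to the service) are hypothetical names, not part of any GROBID client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of GROBID or its clients.
public final class BoundedClientSketch {

    /** Applies worker to every input with at most maxParallel calls in flight; keeps input order. */
    public static <T, R> List<R> processAll(List<T> inputs, Function<T, R> worker, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> task = () -> worker.apply(input);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a result", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("worker failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in worker: a real one would upload the file and return the TEI result.
        List<String> out = processAll(List.of("a.pdf", "b.pdf", "c.pdf"), name -> name + ".tei.xml", 2);
        System.out.println(out);
    }
}
```

Lowering `maxParallel` trades throughput for lower peak memory use on the server side, which may help when the container cannot be given more memory.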

@lfoppiano
Collaborator

@lucaspada894 did adding more memory fix the issue?

@lucaspada894
Author

lucaspada894 commented Jun 23, 2020 via email

@lfoppiano
Collaborator

lfoppiano commented Jun 23, 2020

If you are satisfied, please close this issue. 😉

FYI I'm working on adding more documentation (mostly already discussed or written somewhere in here), you can see the preview here: https://grobid.readthedocs.io/en/add-developers-guide/

@lucaspada894
Author

lucaspada894 commented Jun 23, 2020 via email

@lfoppiano
Collaborator

lfoppiano commented Jul 30, 2020

@lucaspada894 I'm closing this issue. If more work is needed, feel free to reopen it.
