Error when creating a new training data with windows 10 (wsl) #954

Closed
malee1382 opened this issue Oct 10, 2022 · 3 comments
Labels
Windows-specific Issue visible only on Windows environments

Comments

@malee1382

malee1382 commented Oct 10, 2022

Hi,

I am trying to create training data in batch mode, following https://grobid.readthedocs.io/en/latest/Grobid-batch/
Since my OS is Windows 10, I use WSL to be able to handle it in Linux. Java version 1.8.0_311.

To test, I have created two folders under my grobid directory: /grobid/GROBID-RETRA/input (containing 10 sample PDFs where the largest one is 3000 KB) and /grobid/GROBID-RETRA/output

Then I ran the following:

GHUM-L-E144KVB6:/mnt/host/c/grobid# java.exe -Xmx4G -jar grobid-core/build/libs/grobid-core-0.7.1-onejar.jar -gH grobid-home -dIn GROBID-RETRA/input -dOut GROBID-RETRA/output -exe createTraining

Here is the output I got:

okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathOrLoadFromClasspath
WARNING: No Grobid property was provided. Attempting to find Grobid home in the current directory...
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail
WARNING: ***************************************************************
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail
WARNING: *** USING GROBID HOME: C:\grobid\grobid-home
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail
WARNING: ***************************************************************
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathOrLoadFromClasspath
WARNING: No Grobid property was provided. Attempting to find Grobid home in the current directory...
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail
WARNING: ***************************************************************
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail
WARNING: *** USING GROBID HOME: C:\grobid\grobid-home
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail
WARNING: ***************************************************************
okt 10, 2022 9:29:06 PM org.grobid.core.main.GrobidHomeFinder findGrobidConfigOrFail
WARNING: Grobid config file location was not explicitly set via 'org.grobid.config' system variable, defaulting to: C:\grobid\grobid-home\config\grobid.yaml
okt 10, 2022 9:29:06 PM org.grobid.core.main.LibraryLoader load
INFO: Loading external native sequence labelling library
okt 10, 2022 9:29:06 PM org.grobid.core.main.LibraryLoader load
INFO: Loading Wapiti native library...
okt 10, 2022 9:29:06 PM org.grobid.core.main.LibraryLoader load
INFO: Native library for sequence labelling loaded
okt 10, 2022 9:29:06 PM org.grobid.core.lexicon.Lexicon initDictionary
INFO: Initiating dictionary
okt 10, 2022 9:29:06 PM org.grobid.core.lexicon.Lexicon initDictionary
INFO: End of Initialization of dictionary
okt 10, 2022 9:29:06 PM org.grobid.core.lexicon.Lexicon initNames
INFO: Initiating names
okt 10, 2022 9:29:06 PM org.grobid.core.lexicon.Lexicon initNames
INFO: End of initialization of names
okt 10, 2022 9:29:07 PM org.grobid.core.lexicon.Lexicon initCountryCodes
INFO: Initiating country codes
okt 10, 2022 9:29:07 PM org.grobid.core.lexicon.Lexicon initCountryCodes
INFO: End of initialization of country codes
Pdf_FilenamePeeters_3289944_Original.pdf
Pdf_FilenamePeeters_3289945_Original.pdf
Pdf_FilenamePeeters_3289948_Original.pdf
Pdf_FilenamePeeters_3289953_Original.pdf
Pdf_FilenamePeeters_3289954_Original.pdf
Pdf_FilenamePeeters_3289955__Original.pdf
Pdf_FilenamePeeters_3289963_Original.pdf
Pdf_FilenamePeeters_3289965_Original.pdf
Pdf_FilenamePeeters_3289969_Original.pdf
Pdf_FilenamePeeters_3290025_Original.pdf
10 files to be processed.
GROBID-RETRA\input\Pdf_FilenamePeeters_3289944_Original.pdf
okt 10, 2022 9:29:07 PM org.grobid.core.jni.WapitiModel init
INFO: Loading model: C:\grobid\grobid-home\models\fulltext\model.wapiti (size: 26707735)
[Wapiti] Loading model: "C:\grobid\grobid-home\models\fulltext\model.wapiti"
Model path: C:\grobid\grobid-home\models\fulltext\model.wapiti
okt 10, 2022 9:29:14 PM org.grobid.core.engines.Engine batchCreateTraining
SEVERE: An error occured while processing the following pdf: GROBID-RETRA\input\Pdf_FilenamePeeters_3289944_Original.pdf
org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid training data generation for full text.
        at org.grobid.core.engines.FullTextParser.createTraining(FullTextParser.java:1452)
        at org.grobid.core.engines.Engine.createTraining(Engine.java:462)
        at org.grobid.core.engines.Engine.batchCreateTraining(Engine.java:566)
        at org.grobid.core.engines.ProcessEngine.createTraining(ProcessEngine.java:379)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:339)
        at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:189)
Caused by: org.grobid.core.exceptions.GrobidException: [PDFALTO_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file GROBID-RETRA\input\Pdf_FilenamePeeters_3289944_Original.pdf
        at org.grobid.core.document.DocumentSource.processPdfaltoThreadMode(DocumentSource.java:209)
        at org.grobid.core.document.DocumentSource.pdfalto(DocumentSource.java:155)
        at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:64)
        at org.grobid.core.engines.FullTextParser.createTraining(FullTextParser.java:1001)
        ... 9 more

I got the same error for each PDF.

The error message does not tell me much, so what steps should I follow to dig into this? Or am I missing something?

Thanks in advance!
Mehmet

@lfoppiano lfoppiano added the Windows-specific Issue visible only on Windows environments label Oct 11, 2022
@lfoppiano
Collaborator

Hi @malee1382, I'm not sure that WSL works well with Grobid.
I wonder which OS type Java reports.

Could you try running this in WSL?

  1. Copy the following into a file called test.java:
public class test {
    public static void main(String[] argv) {
        System.out.println(System.getProperty("os.name"));
        System.out.println(System.getProperty("os.version"));
        System.out.println(System.getProperty("os.arch"));
    }
}
  2. Then compile it with javac test.java, run it with java test, and let me know the result.

If this program prints Linux, could you also try the following:

Enable the WSL mode and try to run pdfalto from the shell:

YOURPATH/pdfalto -fullFontName -noLineNumbers -noImage  -annotation  -outline  -filesLimit 2000  PATH_TO_PDF PATH_TO_OUTPUT
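The pdfalto command above could also be launched from a small Java program, so that the JVM under test is the one spawning the binary. The following is only a sketch: the class name PdfaltoProbe and its two helper methods are hypothetical and are not part of GROBID, and the flags are copied verbatim from the command above. It targets Java 8, the version mentioned in the report.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GROBID: runs pdfalto with the flags quoted above.
class PdfaltoProbe {

    // Builds the argument list: binary, the flags from the command above, input, output.
    static List<String> buildCommand(String pdfaltoPath, String pdfPath, String outputPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add(pdfaltoPath);
        cmd.addAll(Arrays.asList(
                "-fullFontName", "-noLineNumbers", "-noImage",
                "-annotation", "-outline", "-filesLimit", "2000"));
        cmd.add(pdfPath);
        cmd.add(outputPath);
        return cmd;
    }

    // Starts the process, forwards its stdout/stderr to the console, returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: java PdfaltoProbe PDFALTO_PATH PATH_TO_PDF PATH_TO_OUTPUT");
            System.exit(2);
        }
        int exitCode = run(buildCommand(args[0], args[1], args[2]));
        System.out.println("pdfalto exit code: " + exitCode);
    }
}
```

If the binary cannot be started at all from this JVM, ProcessBuilder throws an IOException, which is usually more informative than the wrapped PDFALTO_CONVERSION_FAILURE in the stack trace.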

@malee1382
Author

Dear Luca,

Many thanks for your prompt reply!

Indeed, when I ran test.java, it still reported Windows. I also tried it from IntelliJ, where I can configure WSL, and there it prints Linux. So apparently I am still missing something in my WSL setup.

Instead, I have now tried it with Docker, using the following compose file:
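One likely explanation, given that the original command calls java.exe and the log shows C:\grobid paths: java.exe is the Windows JVM started from the WSL shell, so os.name reports Windows even though the shell is Linux. The sketch below extends the test program to classify the result. The class name JvmPlatformCheck is hypothetical, and looking for "microsoft" in /proc/version is a commonly used heuristic for detecting WSL rather than a guaranteed interface.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper: reports which kind of JVM is actually running.
class JvmPlatformCheck {

    // Classifies the platform from os.name and the contents of /proc/version.
    static String describe(String osName, String procVersion) {
        String os = osName.toLowerCase(Locale.ROOT);
        String kernel = procVersion.toLowerCase(Locale.ROOT);
        if (os.contains("windows")) {
            return "windows-jvm";
        }
        if (os.contains("linux")) {
            // Assumption: WSL kernels mention "microsoft" in /proc/version.
            return kernel.contains("microsoft") ? "linux-jvm-on-wsl" : "linux-jvm";
        }
        return "other";
    }

    // Returns the contents of /proc/version, or an empty string if it cannot be read.
    static String readProcVersion() {
        Path path = Paths.get("/proc/version");
        try {
            if (Files.isReadable(path)) {
                return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            // Fall through: treat an unreadable file as "no information".
        }
        return "";
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println("os.name = " + osName);
        System.out.println("platform = " + describe(osName, readProcVersion()));
    }
}
```

Running it once with java.exe and once with a java binary installed inside the WSL distribution should make the difference visible.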

version: '3'
services:
  grobid:
    image: lfoppiano/grobid:0.7.1
    container_name: grobid
    restart: unless-stopped
    ports:
      - "8070:8070"
    volumes:
      - ./:/opt/grobid/retrain

Here, retrain contains all the necessary folders (e.g., grobid-home, core, input with the sample PDFs, and output).

It worked very smoothly!

Many thanks again!

@lfoppiano
Collaborator

lfoppiano commented Oct 12, 2022

That's great. 😄

I've added a reference to this issue in the documentation, in case other people try to use the WSL mode: https://grobid.readthedocs.io/en/latest/Troubleshooting/#windows-related-issues

I'm closing the issue. Feel free to reopen it if needed.
