MacOS tess4j ghostscript issue #195

weiw11 · 2020-08-21T00:46:50Z

I'm trying to extract text from PDFs using tess4j on macOS, but it outputs garbage such as "EE ee EE EE EE ee", whereas the results on Windows are significantly better.

I had the same problem on Windows, but moving gsdll64.dll to src/main/resources fixed it, and the output is now almost 100% accurate. I can't figure out how to fix this on macOS.

I added -Djna.library.path=/usr/local/lib to my args, but it still doesn't work, even though both Ghostscript and Tesseract are installed via Homebrew and should have symlinks in /usr/local/lib. I also tried moving libtesseract.4.dylib and libgs.9.52.dylib to src/main/resources, and then to a darwin folder under resources, but neither helped.

Am I doing something wrong, or does tess4j just not work as well on macOS as on Windows?

The PDFs I'm using are also very clear, without any noise or marks.

Tess4J: 4.5.2
Tesseract: 4.1.1
Leptonica: 1.80.0
Ghostscript: 9.52

This is using eng.traineddata.


nguyenq · 2020-08-21T15:35:42Z

Ghost4J is looking for a libgs.dylib file or symlink in the system path. You can place the symlink in the same location as libtesseract.dylib and then set jna.library.path variable to it.

Run your program with system property jna.debug_load=true to see the locations JNA is looking to find the native libraries.
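For anyone following along, here is a minimal sketch of setting both properties from code instead of on the command line. The class and method names are made up for illustration. It assumes the properties are set before any JNA or tess4j class is touched, since JNA reads them when it first loads a native library; passing -Djna.library.path=... and -Djna.debug_load=true as JVM arguments is the safer route.

```java
public class JnaDebugSetup {

    // Hypothetical helper: call this first thing in main(), before any
    // tess4j or JNA class has been loaded.
    static void configure(String nativeLibDir) {
        // Directory JNA searches for libgs.dylib and libtesseract.dylib.
        System.setProperty("jna.library.path", nativeLibDir);
        // Makes JNA log every location it tries while loading a library.
        System.setProperty("jna.debug_load", "true");
    }

    public static void main(String[] args) {
        configure("/usr/local/lib");
        System.out.println("jna.library.path = " + System.getProperty("jna.library.path"));
        System.out.println("jna.debug_load   = " + System.getProperty("jna.debug_load"));
    }
}
```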

weiw11 · 2020-08-21T16:11:09Z

Thanks for the reply, but the results are still the same as when I put the dylib files in my project's src/main/resources/ folder.

I set jna.library.path to /usr/local/lib, since /usr/lib/ is protected by System Integrity Protection on macOS. /usr/local/lib contains the symlinks that Homebrew creates for the installed libraries.
Output with jna.debug_load=true:

Aug 21, 2020 12:03:15 PM com.sun.jna.Native extractFromResourcePath
INFO: Looking in classpath from jdk.internal.loader.ClassLoaders$AppClassLoader@d8a6ab6b for /com/sun/jna/darwin/libjnidispatch.jnilib
Aug 21, 2020 12:03:15 PM com.sun.jna.Native extractFromResourcePath
INFO: Found library resource at jar:file:///Users/username/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.5.0/e0845217c4907822403912ad6828d8e0b256208/jna-5.5.0.jar!/com/sun/jna/darwin/libjnidispatch.jnilib
Aug 21, 2020 12:03:15 PM com.sun.jna.Native extractFromResourcePath
INFO: Extracting library to /Users/username/Library/Caches/JNA/temp/jna15103312618634115760.tmp
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Looking for library 'gs'
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Adding paths from jna.library.path: /usr/local/lib
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Trying /usr/local/lib/libgs.dylib
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Found library 'gs' at /usr/local/lib/libgs.dylib
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Looking for library 'tesseract'
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Adding paths from jna.library.path: /usr/local/lib:/var/folders/lf/xh6ywk792nn87_ss1vm29j2c0000gn/T/tess4j/win32-x86-64
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Trying /usr/local/lib/libtesseract.dylib
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Found library 'tesseract' at /usr/local/lib/libtesseract.dylib

nguyenq · 2020-08-21T17:50:11Z

Looks like it successfully found and loaded all the required native libraries. You need to look at the output images that Ghostscript produces via the CLI. Try to perform recognition on them using the tesseract CLI. If the output is the same, maybe you need to use another library to convert PDF to images -- try PDFBox.

weiw11 · 2020-08-21T18:26:28Z

Using terminal and the commands:

gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile="Pic-%d.png" "input.pdf"
tesseract Pic-1.png output -l eng

I was able to get the images and extract the text correctly, with results similar to those on Windows.

Does tess4j have a class which outputs the image it got from the pdf so I can check that?

nguyenq · 2020-08-21T20:31:00Z

If you fed Pic-1.png to tess4j, would you get the correct result?

You have used png16m for DEVICE while tess4j uses pnggray. Please retest with the value tess4j uses.

No, tess4j would delete intermediate image files to keep it clean.

You also can use PDFBox by calling System.setProperty(PDF_LIBRARY, PDFBOX);
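Since tess4j cleans up its intermediate files, one way to see what the OCR engine is actually given is to render the page yourself and write the image to disk before calling doOCR. Below is a minimal sketch using only the JDK; the class name, method name, and file naming are made up for illustration and are not part of tess4j.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class PageImageDump {

    // Hypothetical helper: writes one rendered page as a PNG so it can be
    // opened and compared with the Ghostscript CLI output.
    static File dump(BufferedImage page, File dir, int pageIndex) throws IOException {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Cannot create directory " + dir);
        }
        File out = new File(dir, "page-" + (pageIndex + 1) + ".png");
        // ImageIO.write returns false when no writer supports the format.
        if (!ImageIO.write(page, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out;
    }
}
```

Calling dump(bi, new File("debug-pages"), page) right before the OCR call keeps a copy of every page image for inspection.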

weiw11 · 2020-08-21T21:39:29Z

Feeding tess4j the Pic-1.png produced with either png16m or pnggray gives the expected results, similar to the Windows output and the same as before.

I got it working by using the following code and changing the image output to binary instead:

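import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

private String getPDFText(File file) {
    ITesseract tesseract = new Tesseract();
    tesseract.setDatapath("tessdata");
    tesseract.setLanguage("eng");
    tesseract.setTessVariable("user_defined_dpi", "300");

    // try-with-resources closes the PDF even if rendering or OCR fails
    try (PDDocument document = PDDocument.load(file)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        StringBuilder output = new StringBuilder();
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            // Render each page at 300 DPI as a 1-bit (black and white) image
            BufferedImage bi = pdfRenderer.renderImageWithDPI(page, 300, ImageType.BINARY);
            output.append(tesseract.doOCR(bi));
        }
        return output.toString();
    } catch (TesseractException | IOException e) {
        e.printStackTrace();
    }
    return null;
}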

For PDFBox, do you mean to call System.setProperty(PdfUtilities.PDF_LIBRARY, PdfUtilities.PDFBOX);? I tried calling that but it still doesn't work.
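If the renderer in use cannot produce a grayscale or binary image directly, a rendered page can also be converted with the JDK alone before it is handed to doOCR. This is only a sketch under that assumption; the class and method names are made up, and whether grayscale helps recognition depends on the document.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class GrayscaleSketch {

    // Hypothetical helper: redraws any BufferedImage into an 8-bit
    // grayscale image of the same size.
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a gray target performs the color conversion.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

Swapping BufferedImage.TYPE_BYTE_GRAY for BufferedImage.TYPE_BYTE_BINARY gives a 1-bit image instead.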

nguyenq · 2020-08-21T22:57:52Z

Please look in the unit tests for example. With the property set, PDFBox will be used instead of GS.

weiw11 · 2020-08-22T00:31:14Z

Calling System.setProperty(PdfUtilities.PDF_LIBRARY, PdfUtilities.PDFBOX); provides the same inaccurate results as ghostscript.

Is there any possibility of implementing variable args to ghostscript or pdfbox methods? This way we can use our own custom args to process the images.

Example:

public static synchronized File[] convertPdf2Png(File inputPdfFile, String... customArgs) throws IOException {
    Path path = Files.createTempDirectory("tessimages");
    File imageDir = path.toFile();

    // get Ghostscript instance
    Ghostscript gs = Ghostscript.getInstance();

    // prepare Ghostscript interpreter parameters
    // refer to Ghostscript documentation for parameter usage
    List<String> gsArgs = new ArrayList<String>();

    // add the caller-supplied arguments to gsArgs
    for (String s : customArgs) {
        gsArgs.add(s);
    }
    ....

nguyenq · 2020-08-22T01:53:19Z

You certainly can implement your custom conversion/processing method and then pass the result images to tess4j for recognition.
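One possible shape for such a custom conversion step is sketched below. It shells out to the gs executable rather than going through Ghost4J, and it assumes gs is on the PATH (on Windows the executable has a different name). The class name, method names, and the Pic-%d.png output pattern are illustrative; the default arguments are the ones quoted in this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GsCommandSketch {

    // Defaults taken from the gs command quoted in this thread.
    static final List<String> DEFAULT_ARGS = Arrays.asList(
            "-dNOPAUSE", "-dQUIET", "-dBATCH", "-dSAFER",
            "-sDEVICE=pnggray", "-r300",
            "-dGraphicsAlphaBits=4", "-dTextAlphaBits=4");

    // Builds the command line; custom arguments, when given, are used in
    // place of the defaults.
    static List<String> buildCommand(File pdf, File outDir, String... customArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("gs"); // assumption: Ghostscript is installed and on the PATH
        cmd.addAll(customArgs.length > 0 ? Arrays.asList(customArgs) : DEFAULT_ARGS);
        cmd.add("-sOutputFile=" + new File(outDir, "Pic-%d.png").getPath());
        cmd.add(pdf.getPath());
        return cmd;
    }

    // Runs the command and returns the process exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor();
    }
}
```

The PNG files written to outDir can then be passed one by one to tesseract.doOCR(File).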

weiw11 · 2020-08-22T03:42:00Z

That's true. Thanks so much for all the help provided!

weiw11 · 2020-08-22T15:56:32Z

@nguyenq So I did some testing and found out that tess4j on macOS didn't actually process the image (convert it to pnggray) before trying to get the text.

Below is a cropped image of the differences:

This is the png of tesseract's output of .tiff. The image came out colored and the text is blocky and not really clear. DPI is still 300.

This is the output png using the command: gs -dNOPAUSE -dQUIET -dBATCH -dSAFER -sDEVICE=pnggray -r300 -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -sOutputFile="Pic-%d.png" "input.pdf". I copied the args used by PdfGsUtilities.

I don't think it's calling ghostscript correctly as it's the exact image that I got in windows if I didn't move gsdll64.dll to my resources folder.

nguyenq · 2020-08-22T16:04:48Z

If Ghostscript is not available, PDFBox is used in PDF-to-image conversion.

How did you produce the first (blurry) image?

weiw11 · 2020-08-22T16:12:11Z

This is the code I used for the first image:

    public void tessTest() {
        Tesseract tesseract = new Tesseract();
        try {
            tesseract.setDatapath("data");
            tesseract.setLanguage("eng");
            tesseract.setTessVariable("user_defined_dpi", "300");

            System.out.println(tesseract.doOCR(new File("inputFile.pdf")));
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }

If PDFBox is used instead, that explains the output image when I call System.setProperty(PdfUtilities.PDF_LIBRARY, PdfUtilities.PDFBOX); on my windows machine to test. It creates the results exactly like on the mac.

Edit: That doesn't really make any sense on the mac though seeing as JNA states that it found the libraries?

nguyenq · 2020-08-22T20:45:18Z

The source code is available. You certainly can debug to get to the bottom of it.

weiw11 · 2020-08-23T16:50:45Z

~~The issue appears to be PdfBox's ImageType.RGB. I changed to ImageType.GRAY to match the output of PdfGsUtilities.~~

Updating PDFBox to 2.0.21 fixes the issue.

2.0.20 output:

2.0.21 output:

nguyenq · 2020-08-23T18:09:22Z

We constantly keep up to date with the dependencies. Will publish an update soon.

Thank you.

weiw11 mentioned this issue Aug 23, 2020

master #196

Closed

weiw11 closed this as completed Aug 24, 2020
