MacOS tess4j ghostscript issue #195

Closed
weiw11 opened this issue Aug 21, 2020 · 16 comments

Comments

weiw11 commented Aug 21, 2020

I'm trying to extract text from PDFs using tess4j on macOS, but it outputs EE ee EE EE EE ee and similar text. The same PDFs give significantly better results on Windows.

I had the same problem on Windows, but moving gsdll64.dll to src/main/resources fixed it, and the output is now almost 100% accurate. I can't figure out how to fix this on macOS.

I added -Djna.library.path=/usr/local/lib to my args and it still doesn't work, even though both Ghostscript and Tesseract are installed via Homebrew and should have symlinks in /usr/local/lib. I also tried moving libtesseract.4.dylib and libgs.9.52.dylib to src/main/resources, and then to a darwin folder under resources, but neither helped.

Am I doing something wrong, or does tess4j just not work as well on macOS as on Windows?

The PDFs I'm using are also very clear, without any noise or marks.

Tess4J: 4.5.2
Tesseract: 4.1.1
Leptonica: 1.80.0
Ghostscript: 9.52

This is using eng.traineddata.

Owner

nguyenq commented Aug 21, 2020

Ghost4J is looking for a libgs.dylib file or symlink in the system path. You can place the symlink in the same location as libtesseract.dylib and then set the jna.library.path property to that directory.

Run your program with the system property jna.debug_load=true to see which locations JNA searches for the native libraries.
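
As a rough illustration of the advice above, the two JNA properties can also be set in code, provided that happens before any JNA class is first loaded. This is only a sketch: NativeLibSetup and its methods are hypothetical helpers (not part of Tess4J, Ghost4J, or JNA), and the directory passed in would be whatever location holds the libgs.dylib and libtesseract.dylib symlinks, for example the Homebrew directory mentioned in this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Tess4J, Ghost4J, or JNA. */
final class NativeLibSetup {

    /** Returns the library path if a file or (resolvable) symlink with that name exists in dir. */
    public static Optional<Path> findLibrary(Path dir, String fileName) {
        Path candidate = dir.resolve(fileName);
        // Files.exists follows symlinks, so a dangling symlink counts as missing.
        return Files.exists(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    /** Sets the JNA properties; call this before any JNA class is loaded. */
    public static void configureJna(Path libDir) {
        System.setProperty("jna.library.path", libDir.toString());
        System.setProperty("jna.debug_load", "true");
    }
}
```

Checking findLibrary(dir, "libgs.dylib") first gives an early, explicit failure when the symlink is missing, instead of relying on the debug log alone.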

Author

weiw11 commented Aug 21, 2020

Thanks for the reply, but the results I'm getting are still the same as when I put the dylib files in my project's src/main/resources/ folder.

I set jna.library.path to /usr/local/lib, since /usr/lib/ is protected by System Integrity Protection on macOS. /usr/local/lib contains the symlinks that Homebrew creates for the installed libraries.
Output of jna.debug_load=true:

Aug 21, 2020 12:03:15 PM com.sun.jna.Native extractFromResourcePath
INFO: Looking in classpath from jdk.internal.loader.ClassLoaders$AppClassLoader@d8a6ab6b for /com/sun/jna/darwin/libjnidispatch.jnilib
Aug 21, 2020 12:03:15 PM com.sun.jna.Native extractFromResourcePath
INFO: Found library resource at jar:file:///Users/username/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.5.0/e0845217c4907822403912ad6828d8e0b256208/jna-5.5.0.jar!/com/sun/jna/darwin/libjnidispatch.jnilib
Aug 21, 2020 12:03:15 PM com.sun.jna.Native extractFromResourcePath
INFO: Extracting library to /Users/username/Library/Caches/JNA/temp/jna15103312618634115760.tmp
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Looking for library 'gs'
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Adding paths from jna.library.path: /usr/local/lib
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Trying /usr/local/lib/libgs.dylib
Aug 21, 2020 12:03:16 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Found library 'gs' at /usr/local/lib/libgs.dylib
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Looking for library 'tesseract'
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Adding paths from jna.library.path: /usr/local/lib:/var/folders/lf/xh6ywk792nn87_ss1vm29j2c0000gn/T/tess4j/win32-x86-64
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Trying /usr/local/lib/libtesseract.dylib
Aug 21, 2020 12:03:19 PM com.sun.jna.NativeLibrary loadLibrary
INFO: Found library 'tesseract' at /usr/local/lib/libtesseract.dylib

Owner

nguyenq commented Aug 21, 2020

Looks like it successfully found and loaded all the required native libraries. You need to look at the images Ghostscript produces when run from the CLI, then try to perform recognition on them using the tesseract CLI. If the output is the same, maybe you need to use another library to convert PDF to images; try PDFBox.

Author

weiw11 commented Aug 21, 2020

Using terminal and the commands:

gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile="Pic-%d.png" "input.pdf"
tesseract Pic-1.png output -l eng

I was able to get the images and extract the text correctly, with results similar to those on Windows.

Does tess4j have a class which outputs the image it got from the pdf so I can check that?

Owner

nguyenq commented Aug 21, 2020

If you fed Pic-1.png to tess4j, would you get the correct result?

You have used png16m for DEVICE while tess4j uses pnggray. Please retest with the value tess4j uses.

No, tess4j deletes the intermediate image files to keep things clean.

You can also use PDFBox by calling System.setProperty(PDF_LIBRARY, PDFBOX);
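
For the retest suggested above, one option is to build the gs command line in code so that only the device name changes between runs. A minimal sketch, assuming the gs executable is on the PATH: GsCommand and its methods are hypothetical helpers rather than anything in tess4j, and the flags are the ones quoted in this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that shells out to the gs CLI; not tess4j code. */
final class GsCommand {

    /** Builds the gs argument list; extraArgs are inserted before the output and input files. */
    public static List<String> build(String device, int dpi, String outputPattern,
                                     String inputPdf, String... extraArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "gs", "-dNOPAUSE", "-dBATCH",
                "-sDEVICE=" + device,   // e.g. png16m or pnggray
                "-r" + dpi));           // resolution in DPI
        cmd.addAll(Arrays.asList(extraArgs));
        cmd.add("-sOutputFile=" + outputPattern);
        cmd.add(inputPdf);
        return cmd;
    }

    /** Runs the command in workDir and returns the process exit code. */
    public static int run(List<String> cmd, File workDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .directory(workDir)
                .inheritIO()            // show gs messages in the console
                .start();
        return process.waitFor();
    }
}
```

Calling GsCommand.build("pnggray", 300, "Pic-%d.png", "input.pdf") and then the same with "png16m" produces two image sets that can be compared page by page.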

Author

weiw11 commented Aug 21, 2020

Feeding Pic-1.png to tess4j, rendered with either png16m or pnggray, gives the expected results: similar to the Windows output and the same as the previous CLI results.

I got it working by using the following code and changing the image output to binary instead:

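import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

private String getPDFText(File file) {
    ITesseract tesseract = new Tesseract();
    tesseract.setDatapath("tessdata");
    tesseract.setLanguage("eng");
    tesseract.setTessVariable("user_defined_dpi", "300");

    // try-with-resources closes the document even if rendering or OCR fails
    try (PDDocument document = PDDocument.load(file)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);

        StringBuilder output = new StringBuilder();
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            // Render each page at 300 DPI as a 1-bit (black and white) image
            BufferedImage bi = pdfRenderer.renderImageWithDPI(page, 300, ImageType.BINARY);
            output.append(tesseract.doOCR(bi));
        }
        return output.toString();
    } catch (TesseractException | IOException e) {
        e.printStackTrace();
    }
    return null;
}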

For PDFBox, do you mean to call System.setProperty(PdfUtilities.PDF_LIBRARY, PdfUtilities.PDFBOX);? I tried calling that but it still doesn't work.

Owner

nguyenq commented Aug 21, 2020

Please look in the unit tests for an example. With the property set, PDFBox will be used instead of GS.

Author

weiw11 commented Aug 22, 2020

Calling System.setProperty(PdfUtilities.PDF_LIBRARY, PdfUtilities.PDFBOX); provides the same inaccurate results as ghostscript.

Is there any possibility of adding variable arguments (varargs) to the Ghostscript or PDFBox conversion methods? That way we could pass our own custom args to process the images.

Example:

public static synchronized File[] convertPdf2Png(File inputPdfFile, String... customArgs) throws IOException {
    Path path = Files.createTempDirectory("tessimages");
    File imageDir = path.toFile();

    // get Ghostscript instance
    Ghostscript gs = Ghostscript.getInstance();

    // prepare Ghostscript interpreter parameters
    // refer to Ghostscript documentation for parameter usage
    List<String> gsArgs = new ArrayList<>();

    // add the caller-supplied arguments to gsArgs
    for (String s : customArgs) {
        gsArgs.add(s);
    }
    ....

Owner

nguyenq commented Aug 22, 2020

You certainly can implement your custom conversion/processing method and then pass the result images to tess4j for recognition.
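
One detail a custom conversion step has to handle is feeding the page images to tess4j in page order. Here is a small sketch under stated assumptions: the images were written with the Pic-%d.png pattern used earlier in this thread, and PageImages is a hypothetical helper, not a tess4j class. A plain alphabetical sort would place Pic-10.png before Pic-2.png, so the page number is parsed and compared numerically.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of tess4j. */
final class PageImages {

    // Matches the file names produced by -sOutputFile="Pic-%d.png"
    private static final Pattern PAGE_FILE = Pattern.compile("Pic-(\\d+)\\.png");

    /** Returns the page number encoded in the file name, or -1 if it does not match. */
    public static int pageNumber(Path file) {
        Matcher m = PAGE_FILE.matcher(file.getFileName().toString());
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Lists the page images in dir, sorted by page number rather than by name. */
    public static List<Path> listInPageOrder(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(p -> pageNumber(p) >= 0)
                    .sorted(Comparator.comparingInt(PageImages::pageNumber))
                    .collect(Collectors.toList());
        }
    }
}
```

Each returned path could then be passed to tesseract.doOCR(path.toFile()), in the same way the snippets in this thread pass a File.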

Author

weiw11 commented Aug 22, 2020

That's true. Thanks so much for all the help provided!

Author

weiw11 commented Aug 22, 2020

@nguyenq So I did some testing and found out that tess4j on macOS didn't actually process the image (convert it to pnggray) before trying to get the text.

Below is a cropped image of the differences:

This is a PNG of tesseract's .tiff output. The image came out in color, and the text is blocky and not very clear. The DPI is still 300.
[Screenshot: Screen Shot 2020-08-22 at 11 22 09 AM]

This is the output PNG using the command: gs -dNOPAUSE -dQUIET -dBATCH -dSAFER -sDEVICE=pnggray -r300 -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -sOutputFile="Pic-%d.png" "input.pdf". I copied the args used by PdfGsUtilities.
[Screenshot: Screen Shot 2020-08-22 at 11 22 26 AM]

I don't think it's calling Ghostscript correctly, as this is exactly the image I got on Windows before I moved gsdll64.dll to my resources folder.

Owner

nguyenq commented Aug 22, 2020

If Ghostscript is not available, PDFBox is used in PDF-to-image conversion.

How did you produce the first (blurry) image?
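
A silent fallback like the one described above can make it hard to tell which converter actually ran. The sketch below shows the general pattern with the choice made visible. It is not tess4j's actual code: FallbackConverter, Outcome, and the "primary"/"fallback" labels are all hypothetical names used only for illustration.

```java
import java.util.concurrent.Callable;

/** Hypothetical illustration of a try-primary-then-fallback pattern; not tess4j code. */
final class FallbackConverter {

    /** Holds the conversion result together with the name of the converter that produced it. */
    public static final class Outcome<T> {
        public final T value;
        public final String usedConverter;

        Outcome(T value, String usedConverter) {
            this.value = value;
            this.usedConverter = usedConverter;
        }
    }

    /** Tries the primary converter; on any failure, reports it and uses the fallback. */
    public static <T> Outcome<T> convert(Callable<T> primary, Callable<T> fallback)
            throws Exception {
        try {
            return new Outcome<>(primary.call(), "primary");
        } catch (Exception | LinkageError e) {
            // LinkageError covers UnsatisfiedLinkError from a missing native library.
            System.err.println("Primary converter failed, using fallback: " + e);
            return new Outcome<>(fallback.call(), "fallback");
        }
    }
}
```

Printing usedConverter next to the OCR result would show right away whether the images came from the intended converter or from the fallback.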

Author

weiw11 commented Aug 22, 2020

This is the code I used for the first image:

    public void tessTest() {
        Tesseract tesseract = new Tesseract();
        try {
            tesseract.setDatapath("data");
            tesseract.setLanguage("eng");
            tesseract.setTessVariable("user_defined_dpi", "300");

            System.out.println(tesseract.doOCR(new File("inputFile.pdf")));
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }

If PDFBox is used instead, that explains the output image I get when I call System.setProperty(PdfUtilities.PDF_LIBRARY, PdfUtilities.PDFBOX); on my Windows machine as a test. It produces exactly the same results as on the Mac.

Edit: That doesn't really make any sense on the mac though seeing as JNA states that it found the libraries?

Owner

nguyenq commented Aug 22, 2020

The source code is available. You certainly can debug to get to the bottom of it.

@weiw11 weiw11 mentioned this issue Aug 23, 2020
Author

weiw11 commented Aug 23, 2020

The issue appears to be PdfBox's ImageType.RGB. I changed to ImageType.GRAY to match the output of PdfGsUtilities.

Updating PDFBox to 2.0.21 fixes the issue.

2.0.20 output:
[Screenshot: ShareX_dllhost_2020-08-23_13-51-52]

2.0.21 output:
[Screenshot: ShareX_dllhost_2020-08-23_13-51-18]
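
For anyone who wants to compare color and grayscale input without changing the renderer, an already rendered BufferedImage can be converted to 8-bit gray with the standard java.awt classes. This is only a sketch for experimentation, with GrayConverter as a hypothetical helper name. It is not claimed to reproduce the fix above, which was the PDFBox upgrade.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration; uses only standard java.awt classes. */
final class GrayConverter {

    /** Returns an 8-bit grayscale copy of src, or src itself if it is already gray. */
    public static BufferedImage toGray(BufferedImage src) {
        if (src.getType() == BufferedImage.TYPE_BYTE_GRAY) {
            return src;
        }
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a gray image converts the colors as part of the draw.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

When rendering with PDFBox directly, passing ImageType.GRAY to renderImageWithDPI avoids this extra copy.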

Owner

nguyenq commented Aug 23, 2020

We constantly keep the dependencies up to date. Will publish an update soon.

Thank you.

@weiw11 weiw11 closed this as completed Aug 24, 2020