MacOS tess4j ghostscript issue #195
Comments
Ghost4J is looking for a Run your program with system property |
Thanks for the reply, but the results I'm getting are still the same as when I put the I tried setting
|
Looks like it successfully found and loaded all the required native libraries. You need to look at the output images from GhostScript via CLI. Try to perform recognition on them using tesseract CLI. If the output is the same, maybe you need to use another library to convert PDF to images -- try PDFBox. |
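The CLI check suggested above can also be driven from Java. The following is only a sketch: the class name GsCommand is hypothetical, the pnggray device and 300 DPI are assumptions based on values mentioned elsewhere in this thread, and the flags are standard Ghostscript options rather than whatever tess4j passes internally.

```java
import java.util.List;

public class GsCommand {
    // Builds the argument list for a Ghostscript run that renders each page
    // of a PDF to a grayscale PNG. Device and resolution are assumptions
    // (pnggray, 300 DPI), not the arguments tess4j itself uses.
    static List<String> build(String gsBinary, String pdfPath, String outputPattern) {
        return List.of(
                gsBinary,
                "-dNOPAUSE", "-dBATCH", "-dSAFER", // run non-interactively and exit when done
                "-sDEVICE=pnggray",                // 8-bit grayscale PNG output
                "-r300",                           // render at 300 DPI
                "-sOutputFile=" + outputPattern,   // %03d expands to the page number
                pdfPath);
    }

    public static void main(String[] args) {
        List<String> cmd = build("gs", "input.pdf", "page-%03d.png");
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires gs on the PATH):
        // int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The PNG files this produces can then be fed to the tesseract CLI one by one to compare against what tess4j returns.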
Using terminal and the commands:
I was able to get the images and extract the text correctly, with results similar to those on Windows. Does tess4j have a class that outputs the image it got from the PDF so I can check that? |
If you fed You have used No, tess4j would delete intermediate image files to keep it clean. You also can use PDFBox by calling |
Using Pic-1.png with I got it working by using the following code and changing the image output to binary instead:
For PDFBox, do you mean to call |
Please look in the unit tests for example. With the property set, PDFBox will be used instead of GS. |
Calling Is there any possibility of implementing variable args to ghostscript or pdfbox methods? This way we can use our own custom args to process the images. Example:
|
You certainly can implement your custom conversion/processing method and then pass the result images to tess4j for recognition. |
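As one possible shape for such a custom processing step, here is a minimal sketch that converts a page image to 1-bit black and white using only the JDK. The class name OcrPreprocess, the method toBinary, and the fixed threshold are illustrative assumptions, not part of tess4j.

```java
import java.awt.image.BufferedImage;

public class OcrPreprocess {
    // Converts an image to 1-bit black and white with a fixed luminance
    // threshold (0-255). Pixels at or above the threshold become white.
    static BufferedImage toBinary(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the Rec. 601 luma weights.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

The returned BufferedImage could then be handed to tess4j's doOCR(BufferedImage) for recognition. A fixed threshold of 128 is a reasonable starting point for clean scans; noisy input would likely need something adaptive.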
That's true. Thanks so much for all the help provided! |
@nguyenq So I did some testing and found out that tess4j on macOS didn't actually process the image (convert it to pnggray) before trying to get the text. Below is a cropped image of the differences: This is the png of tesseract's output of .tiff. The image came out colored and the text is blocky and not really clear. DPI is still 300. This is the output png using the command: I don't think it's calling |
If Ghostscript is not available, PDFBox is used in PDF-to-image conversion. How did you produce the first (blurry) image? |
This is the code I used for the first image:
If PDFBox is used instead, that explains the output image when I call Edit: That doesn't really make any sense on the mac though seeing as JNA states that it found the libraries? |
The source code is available. You certainly can debug to get to the bottom of it. |
We constantly keep up to date with the dependencies. Will publish an update soon. Thank you. |
I'm trying to extract text from PDFs using tess4j but it's outputting
EE ee EE EE EE ee
and similar text, even though the output on Windows is significantly better. I had the same problem on Windows, but moving gsdll64.dll to src/main/resources fixed the issue for me, and the output is now almost 100% accurate. I can't figure out how to fix this on macOS.
I added -Djna.library.path=/usr/local/lib to my args and it still doesn't work, even though both Ghostscript and Tesseract are installed via Homebrew and should have symlinks in /usr/local/lib. I also tried moving libtesseract.4.dylib and libgs.9.52.dylib to src/main/resources, and moving them to a darwin folder under resources doesn't help either.
Am I doing something wrong, or does tess4j just not work as well on macOS as on Windows?
The PDFs I'm using are also very clear, without any noise or marks.
Tess4J: 4.5.2
Tesseract: 4.1.1
Leptonica: 1.80.0
Ghostscript: 9.5.2
This is using eng.traineddata.
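One way to sanity-check the jna.library.path setting described above is to confirm that the expected files are actually reachable from the directories it lists. This is a rough sketch with a hypothetical LibraryLocator class. It only checks that files exist and does not reproduce JNA's own library name resolution.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

public class LibraryLocator {
    // Returns the first directory entry in pathList that contains fileName.
    // Files.isRegularFile follows symlinks, so a dangling Homebrew symlink
    // is reported as missing.
    static Optional<Path> find(String pathList, String fileName) {
        if (pathList == null || pathList.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathList.split(Pattern.quote(File.pathSeparator))) {
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String paths = System.getProperty("jna.library.path");
        // File names taken from this report; adjust to the installed versions.
        for (String name : new String[] {"libtesseract.4.dylib", "libgs.9.52.dylib"}) {
            System.out.println(name + " -> "
                    + find(paths, name).map(Path::toString).orElse("not found"));
        }
    }
}
```

Running it with the same -Djna.library.path argument as the application shows whether each library is visible under that setting.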