OCR not working correctly in OpenKM 6.3.11 #303
Comments
Some more info. I tried to get logging to work, recreating the whole container including the database, after which, to my surprise, OpenKM chose the
Moreover, the docs on logging are incorrect. For 6.3 CE, they state that we need to create a
After investigating the container logs, I found that it said it was loading a logback configuration, which, again according to the docs, is available from 6.4.28 onward, a version I don't even use. However, after editing /opt/tomcat/logback.xml, logging changed as expected, so 6.3 CE seems to use logback, not log4j. With DEBUG logging enabled for the whole com.openkm package, testing text extraction gave me the following lines:
So for some reason it seems to choose the
Another update, I am currently constantly recreating the setup to see what happens. Now, after being on
Again,
Okay, so after conducting some more tests, I am absolutely convinced this is a bug. OpenKM almost reliably cycles
In the Wiki it's stated that the
I've found that the
I've created a demo repository to be found here with which you can test this; the README contains info on how to quickly nuke the setup and create a new one, after which OpenKM is almost guaranteed to have chosen a different TextExtractor, with the
I'll check out the source later today and have a go at debugging the
public BarcodeTextExtractor() {
super(new String[]{"image/tiff", "image/gif", "image/jpeg", "image/png"});
}
This list tells RegisteredExtractors that the extractor is a candidate for extracting text from files with these MIME types. The problem arises when several extractors are enabled and more than one of them declares, for example, "image/tiff": it is then unpredictable which text extractor will be used to process the document (it depends on the order of the text extractor list loaded at start-up). This is a known issue, and a possible improvement would be to assign a priority to each text extractor. In most cases you should use only Tesseract3Extractor and disable the others.
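The ambiguity described above can be illustrated with a minimal sketch. This is not OpenKM's RegisteredExtractors code: the `Extractor` class, the `priority` field, and the MIME types assigned to the Tesseract entry are assumptions made up for the example. It only shows why a first-match lookup depends on list order, and how a priority, as suggested above, would make the choice deterministic.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration only; not OpenKM's RegisteredExtractors. */
public class ExtractorSelectionSketch {

    /** Toy extractor: a name, the MIME types it claims, and an assumed priority. */
    static final class Extractor {
        final String name;
        final int priority; // hypothetical field; a lower value wins
        final List<String> mimeTypes;

        Extractor(String name, int priority, String... mimeTypes) {
            this.name = name;
            this.priority = priority;
            this.mimeTypes = Arrays.asList(mimeTypes);
        }
    }

    /** First-match selection: the result depends purely on the list order. */
    static Optional<String> firstMatch(List<Extractor> registered, String mimeType) {
        return registered.stream()
                .filter(e -> e.mimeTypes.contains(mimeType))
                .map(e -> e.name)
                .findFirst();
    }

    /** Priority-based selection: the same result for any list order. */
    static Optional<String> byPriority(List<Extractor> registered, String mimeType) {
        return registered.stream()
                .filter(e -> e.mimeTypes.contains(mimeType))
                .min(Comparator.comparingInt((Extractor e) -> e.priority))
                .map(e -> e.name);
    }

    public static void main(String[] args) {
        Extractor barcode = new Extractor("BarcodeTextExtractor", 20,
                "image/tiff", "image/gif", "image/jpeg", "image/png");
        // MIME types for this entry are assumed for the sake of the example.
        Extractor tesseract = new Extractor("Tesseract3TextExtractor", 10,
                "image/tiff", "image/jpeg", "image/png");

        List<Extractor> orderA = Arrays.asList(barcode, tesseract);
        List<Extractor> orderB = Arrays.asList(tesseract, barcode);

        // First match flips with the load order.
        System.out.println(firstMatch(orderA, "image/png").orElse("none"));
        System.out.println(firstMatch(orderB, "image/png").orElse("none"));

        // Priority gives the same answer for both orders.
        System.out.println(byPriority(orderA, "image/png").orElse("none"));
        System.out.println(byPriority(orderB, "image/png").orElse("none"));
    }
}
```

Until something like a priority exists, disabling the competing extractors has the same effect as leaving a single candidate in the list.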
The code of the current text extractors is here: https://github.com/openkm/document-management-system/tree/master/src/main/java/com/openkm/extractor The main class that selects an extractor based on the document type is RegisteredExtractors, where you can see the warning: document-management-system/src/main/java/com/openkm/extractor/RegisteredExtractors.java, line 122 in eca35fc
Thanks for getting in touch. Don't worry, you don't need to explain to me what a WARN is :) I posted it there because I expected to see some ERROR, but since I didn't, I simply wanted to copy exactly what I was seeing so you knew what I was talking about.
Nevertheless, thank you for explaining the underlying problem with the ambiguity of extractors, and thank you even more for pointing me towards the plugin section, as that solved my problem and answered my initial question. Disabling Abby, Cuneiform, and Barcode in the TextExtractor plugin, of course, did the trick for
I feel this info should absolutely go into the docs about setting up OCR, don't you think so as well? I spent the better part of a day finding out what's going on there and wasn't sure if it was me or OpenKM who had lost their mind. So thanks for clearing that up for me. Again, it wasn't at all about the warning, but about the unpredictable behavior making no sense to me :)
The issue will be closed, but this really needs to be documented.
Hi @TheKvist, following your suggestion, I have updated the documentation description at https://docs.openkm.com/kcenter/view/okm-6.3-com/configuring-ocr-engine.html and added a new section at https://docs.openkm.com/kcenter/view/okm-6.3-com/plugins.html (I hope it is clearer now).
Hello there,
right away: I'm a complete noob when it comes to both OpenKM and OCR, and I am currently just experimenting, trying to get things to work based on the OpenKM Docker image. According to the docs, Tesseract seems to be the recommended OCR engine, so, having no idea about anything in this topic, I chose to stick with it. However, I have not been able to get it to work.
After more testing, I have confirmed the issue to occur only with openkm/opence:6.3.11; versions 6.3.9 and 6.3.8 work fine, so the issue info is:
OS: Windows 10 using Ubuntu 18.04 in WSL2
Docker Desktop: 3.5.2 (66501)
Docker Engine: 20.10.7
Docker Compose: 1.29.2
OpenKM: 6.3.11
and
OS: Debian 10 Buster
Docker Engine: 20.10.7, build f0df350
Docker Compose: 1.29.2, build 5becea4c
OpenKM: 6.3.11
The 6.3.11 Docker image has Tesseract version 4.0.0-beta.1 pre-installed, and running the binary from a bash inside the container actually works as intended. As dictated by the docs, I changed system.ocr to /usr/bin/tesseract ${fileIn} ${fileOut}.
After configuration was done I went to Administration > Utilities > Check text extraction and uploaded an image I verified Tesseract was able to process correctly (specifically, this one). The utility finishes almost immediately with an empty result, saying it used com.openkm.extractor.AbbyTextExtractor, even though this extractor is not even configured anywhere.
Here's a screenshot of the result, and of the registered.text.extractors with an AbbyTextExtractor nowhere to be found:
Test result:
registered.text.extractors:
Much to my dismay, the logs stay completely silent when testing the extraction, but when uploading the image in question, I get a single
Unfortunately, this is not enough detail to make out what exactly the problem is, but my strongest guess right now is that this is happening because OpenKM only has a Tesseract3TextExtractor but not a Tesseract4TextExtractor. As said, this is merely a guess, and I'm not even sure if this makes any difference, as the configured system.ocr command should still work with Tesseract 4.
So basically, I have three questions:
AbbyTextExtractor in the first place, where is this configured?
Thank you so much!
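For readers unfamiliar with the system.ocr setting mentioned above, here is a minimal sketch of how a command template with `${fileIn}` and `${fileOut}` placeholders could be expanded into an argument list. It is an assumption about the general technique, not OpenKM's actual implementation; the class name, the `buildCommand` helper, and the example paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not how OpenKM itself expands system.ocr. */
public class OcrCommandSketch {

    /**
     * Splits the template on whitespace first, then substitutes placeholders,
     * so a substituted path containing spaces stays a single argument.
     */
    static List<String> buildCommand(String template, Map<String, String> values) {
        List<String> argv = new ArrayList<>();
        for (String token : template.trim().split("\\s+")) {
            String arg = token;
            for (Map.Entry<String, String> e : values.entrySet()) {
                arg = arg.replace("${" + e.getKey() + "}", e.getValue());
            }
            argv.add(arg);
        }
        return argv;
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("fileIn", "/tmp/scan.png"); // example path, made up
        values.put("fileOut", "/tmp/scan");    // tesseract treats this as an output base name

        List<String> argv = buildCommand("/usr/bin/tesseract ${fileIn} ${fileOut}", values);
        System.out.println(argv);
        // The list could then be handed to new ProcessBuilder(argv) to run the binary.
    }
}
```

Since the tesseract command line takes an input image and an output base name in both major versions, the same template should in principle work with Tesseract 3 and 4, which matches the guess in the post.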