OCR not working correctly in OpenKM 6.3.11 #303

TheKvist · 2021-08-30T09:46:42Z

Hello there,

First off, I'm a complete noob when it comes to both OpenKM and OCR, and I am currently just experimenting, trying to get things to work based on the OpenKM Docker image. However, I am having trouble getting OCR to work. According to the docs, Tesseract seems to be the recommended OCR engine, so, having no background in this topic, I chose to stick with it. So far I have not been able to get it to work.

After more testing, I have confirmed that the issue occurs only with openkm/opence:6.3.11; versions 6.3.9 and 6.3.8 work fine. The affected environments are:

OS: Windows 10 using Ubuntu 18.04 in WSL2
Docker Desktop: 3.5.2 (66501)
Docker Engine: 20.10.7
Docker Compose: 1.29.2
OpenKM: 6.3.11

and

OS: Debian 10 Buster
Docker Engine: 20.10.7, build f0df350
Docker Compose: 1.29.2, build 5becea4c
OpenKM: 6.3.11

The 6.3.11 Docker image has Tesseract version 4.0.0-beta.1 pre-installed and running the binary from a bash inside the container actually works as intended. As dictated by the docs, I changed system.ocr to /usr/bin/tesseract ${fileIn} ${fileOut}.
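To make the role of the two placeholders concrete, here is a minimal sketch of how a command template like the system.ocr value could be expanded into an argument list. The class and method names (OcrCommandSketch, expand) are hypothetical and this is not OpenKM's actual code; it only assumes that ${fileIn} and ${fileOut} are substituted literally and that the result is split on whitespace.

```java
import java.util.Arrays;
import java.util.List;

public class OcrCommandSketch {
    // Hypothetical helper (not part of OpenKM): substitutes the ${fileIn} and
    // ${fileOut} placeholders literally, then splits the command on whitespace.
    static List<String> expand(String template, String fileIn, String fileOut) {
        String cmd = template
                .replace("${fileIn}", fileIn)
                .replace("${fileOut}", fileOut);
        return Arrays.asList(cmd.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The template is the system.ocr value from this post; the paths are made up.
        List<String> argv = expand("/usr/bin/tesseract ${fileIn} ${fileOut}",
                "/tmp/in.png", "/tmp/out");
        System.out.println(argv); // prints [/usr/bin/tesseract, /tmp/in.png, /tmp/out]
    }
}
```

Note that whitespace splitting like this would break on paths containing spaces; the sketch ignores that case.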

After configuration was done I went to Administration > Utilities > Check text extraction and uploaded an image I verified Tesseract was able to process correctly (specifically, this one). The Utility finishes almost immediately with an empty result, saying it used com.openkm.extractor.AbbyTextExtractor, even though this extractor is not even configured anywhere.

Here's a screenshot of the result, and of the registered.text.extractors with an AbbyTextExtractor nowhere to be found

[Screenshot: test result]

[Screenshot: registered.text.extractors]

Much to my dismay, the logs stay completely silent when testing the extraction, but when uploading the image in question, I get only these two lines:

openkm_1  | 2021-08-30 09:25:00,068 [Thread-387] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=104695f5-d697-48f6-9eb5-3d305f6491b4, docPath=/okm:root/test/text-recognized-eng.png, docVerUuid=2cc47a6a-8782-4663-b32c-77c1ea110f3e, date=Mon Aug 30 09:24:16 UTC 2021}
openkm_1  | 2021-08-30 09:25:00,751 [Thread-387] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/test/text-recognized-eng.png': Too few text extracted

Unfortunately, this is not enough detail to determine what exactly the problem is. My strongest guess right now is that OpenKM only has a Tesseract3TextExtractor and no Tesseract4TextExtractor, but this is merely a guess, and I'm not even sure it makes any difference, as the configured system.ocr command should still work with Tesseract 4.

So basically, I have three questions:

Why does OpenKM think it should use AbbyTextExtractor in the first place, where is this configured?
How can I tell OpenKM not to use it?
How can I get OCR using Tesseract to work?

Thank you so much!

TheKvist · 2021-08-30T10:42:27Z

Some more info. I tried to get logging to work, recreating the whole container including the database, after which - to my surprise - OpenKM chose the Tesseract3TextExtractor. I then recreated the whole thing again, just to see if I could reproduce it and suddenly, OpenKM selected the BarcodeTextExtractor. Recreating everything again had no effect, it always chooses this extractor now.

Moreover, the docs on logging are incorrect. For 6.3 CE, they state that we need to create /opt/tomcat/conf/log4j.properties, but creating such a file has no effect at all. Also, no "automatic reload" of the configuration takes place at any point after editing said file.

After investigating the container logs, I found that OpenKM reports loading a logback configuration, which, again according to the docs, is only available from 6.4.28 onward, a version I don't even use. However, after editing /opt/tomcat/logback.xml, logging changed as expected, so 6.3 CE seems to use logback, not log4j.

With DEBUG logging enabled for the whole com.openkm package, testing text extraction gave me the following lines:

openkm_1  | 2021-08-30 10:39:05,783 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.s.a.CheckTextExtractionServlet - doPost(SecurityContextHolderAwareRequestWrapper[ org.springframework.security.web.context.HttpSessionSecurityContextRepository$Servlet3SaveToSessionRequestWrapper@22ea75ca], org.springframework.security.web.context.HttpSessionSecurityContextRepository$SaveToSessionResponseWrapper@208875a3)
openkm_1  | 2021-08-30 10:39:05,790 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:39:05,791 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.core.Config - getManager(classpath://com.openkm.extractor.**)
openkm_1  | 2021-08-30 10:39:07,061 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.HTMLTextExtractor)
openkm_1  | 2021-08-30 10:39:07,066 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsWordTextExtractor)
openkm_1  | 2021-08-30 10:39:07,067 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.NativeMsExcelTextExtractor)
openkm_1  | 2021-08-30 10:39:07,069 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.ExifTextExtractor)
openkm_1  | 2021-08-30 10:39:07,070 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.OpenOfficeTextExtractor)
openkm_1  | 2021-08-30 10:39:07,072 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsExcelTextExtractor)
openkm_1  | 2021-08-30 10:39:07,073 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsPowerPointTextExtractor)
openkm_1  | 2021-08-30 10:39:07,074 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.SourceCodeTextExtractor)
openkm_1  | 2021-08-30 10:39:07,075 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.BarcodeTextExtractor)
openkm_1  | 2021-08-30 10:39:07,076 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.CuneiformTextExtractor)
openkm_1  | 2021-08-30 10:39:07,077 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.XMLTextExtractor)
openkm_1  | 2021-08-30 10:39:07,079 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.RTFTextExtractor)
openkm_1  | 2021-08-30 10:39:07,080 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.PlainTextExtractor)
openkm_1  | 2021-08-30 10:39:07,081 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.PdfTextExtractor)
openkm_1  | 2021-08-30 10:39:07,084 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.Tesseract2TextExtractor)
openkm_1  | 2021-08-30 10:39:07,084 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.Tesseract3TextExtractor)
openkm_1  | 2021-08-30 10:39:07,086 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsOffice2007TextExtractor)
openkm_1  | 2021-08-30 10:39:07,086 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.OOTextExtractor)
openkm_1  | 2021-08-30 10:39:07,087 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.AudioTextExtractor)
openkm_1  | 2021-08-30 10:39:07,088 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.AbbyTextExtractor)
openkm_1  | 2021-08-30 10:39:07,089 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsOutlookTextExtractor)
openkm_1  | 2021-08-30 10:39:07,090 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.BarcodeTextExtractor
openkm_1  | 2021-08-30 10:39:07,090 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:39:07,090 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.BarcodeTextExtractor
openkm_1  | 2021-08-30 10:39:07,325 [http-nio-0.0.0.0-8080-exec-12] WARN  c.o.extractor.BarcodeTextExtractor - Failed to extract barcode text
openkm_1  | com.google.zxing.NotFoundException: null

So for some reason it seems to choose the BarcodeTextExtractor now before it even gets to consider any of the Tesseract*TextExtractor classes.

TheKvist · 2021-08-30T11:03:11Z

Another update: I keep recreating the setup to see what happens. Now, after being on Tesseract3TextExtractor once more, it has changed to CuneiformTextExtractor with this log:

openkm_1  | 2021-08-30 10:59:52,617 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.s.a.CheckTextExtractionServlet - doPost(SecurityContextHolderAwareRequestWrapper[ org.springframework.security.web.context.HttpSessionSecurityContextRepository$Servlet3SaveToSessionRequestWrapper@4ac02e4f], org.springframework.security.web.context.HttpSessionSecurityContextRepository$SaveToSessionResponseWrapper@1c03a69b)
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.CuneiformTextExtractor
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.CuneiformTextExtractor
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.MimeTypeDAO - findByName(image/png)
openkm_1  | 2021-08-30 10:59:52,623 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.MimeTypeDAO - findByName: {id=19, name=image/png, description=PNG, search=true, imageMime=image/gif, imageContent=[BIG], extensions=[png]}
openkm_1  | 2021-08-30 10:59:52,624 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - runCmd(/usr/bin/tesseract /opt/tomcat/temp/okm3900154784953022994.png /opt/tomcat/temp/okm2471433750520939535.txt)
openkm_1  | 2021-08-30 10:59:52,625 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - runCmdImpl([/usr/bin/tesseract, /opt/tomcat/temp/okm3900154784953022994.png, /opt/tomcat/temp/okm2471433750520939535.txt], 300000)
openkm_1  | 2021-08-30 10:59:53,195 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - Normal program termination
openkm_1  | 2021-08-30 10:59:53,196 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - Elapse time: 00:00:00
openkm_1  | 2021-08-30 10:59:53,196 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.CuneiformTextExtractor - TEXT:

Again, CuneiformTextExtractor is not even visible in registered.text.extractors, and I have not the slightest idea where OpenKM gets these classes from. Unlike before with the BarcodeTextExtractor, where it listed a whole bunch of extractors, it now only seems to know this one class.

TheKvist · 2021-08-30T13:53:27Z

Okay, so after conducting some more tests, I am absolutely convinced this is a bug. OpenKM almost reliably cycles between Tesseract3, Abby, Cuneiform, and Barcode across new setups.

In the Wiki it's stated that the registered.text.extractors property needs to be modified. However, as demonstrated before, none of the extractors OpenKM selects, except for the Tesseract3TextExtractor which would be the correct one, is registered there. Just to make sure, I've deleted everything except Tesseract from that list, but OpenKM simply ignores this property and chooses whatever it wants.

I've found that the system.ocr property has nothing to do with it. Rather, it looks like OpenKM is "rolling a die" on startup to decide which extractor to use, and sticks with that choice until it's completely recreated from scratch.

I've created a demo repository, to be found here, with which you can test this. The README contains info on how to quickly nuke the setup and create a new one, after which OpenKM is almost guaranteed to have chosen a different TextExtractor, with the BarcodeTextExtractor being the most common.

I'll check out the source later today and have a go at debugging the RegisteredExtractors to see if I can find why that's happening.

darkman97i · 2021-08-31T07:37:09Z

You can only use Tesseract3TextExtractor (this class works for Tesseract 3 or later; there is no difference in the way the external tool is executed).
About the message WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/test/text-recognized-eng.png': Too few text extracted: I think it is quite clear. First of all, this is a WARN (a warning), not an ERROR, and the application indicates that zero or only a few words were extracted. This means the OCR engine is working (otherwise an error would be raised), but because the file contains little text, or for other reasons, the OCR engine extracted little text.
The registered.text.extractors parameter should be removed in the next release (it is really only used at start-up time to enable or disable extractors; Administration > Tools > Plugins should be used for that purpose instead, as that is the place to enable or disable plugins).
If you take a look at the implementation of any text extractor, you will see something like:

public BarcodeTextExtractor() {
    super(new String[]{"image/tiff", "image/gif", "image/jpeg", "image/png"});
}

This list tells RegisteredExtractors that the extractor is a candidate for extracting text from files with these MIME types. The problem arises when several extractors are enabled and more than one of them claims, for example, "image/tiff": it is then unpredictable which text extractor will be used to process the document (it depends on the order of the text extractor list loaded at start-up). This is a known issue, and a possible improvement would be to set a priority for each text extractor. In most cases you should use only Tesseract3TextExtractor and disable the others.
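The order dependence described above can be illustrated with a small, self-contained sketch. This is not the OpenKM implementation: the Extractor class, the priority field, and both selection methods are hypothetical, and the MIME type given to Tesseract3TextExtractor is an assumption for the sake of the example (only the BarcodeTextExtractor list is taken from the snippet above).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ExtractorSelectionSketch {
    // Hypothetical stand-in for a text extractor; not an OpenKM class.
    static final class Extractor {
        final String name;
        final int priority;           // hypothetical field: lower value wins
        final List<String> mimeTypes; // MIME types this extractor claims

        Extractor(String name, int priority, String... mimeTypes) {
            this.name = name;
            this.priority = priority;
            this.mimeTypes = Arrays.asList(mimeTypes);
        }
    }

    // First match wins, so the result depends on the order of the loaded list.
    static Optional<String> firstMatch(List<Extractor> loaded, String mime) {
        return loaded.stream()
                .filter(e -> e.mimeTypes.contains(mime))
                .map(e -> e.name)
                .findFirst();
    }

    // Possible improvement: the matching extractor with the lowest priority value wins.
    static Optional<String> byPriority(List<Extractor> loaded, String mime) {
        return loaded.stream()
                .filter(e -> e.mimeTypes.contains(mime))
                .min(Comparator.comparingInt((Extractor e) -> e.priority))
                .map(e -> e.name);
    }

    public static void main(String[] args) {
        Extractor barcode = new Extractor("BarcodeTextExtractor", 2,
                "image/tiff", "image/gif", "image/jpeg", "image/png");
        // Assumed MIME list, for illustration only.
        Extractor tesseract = new Extractor("Tesseract3TextExtractor", 1, "image/png");

        List<Extractor> orderA = Arrays.asList(barcode, tesseract);
        List<Extractor> orderB = Arrays.asList(tesseract, barcode);

        System.out.println(firstMatch(orderA, "image/png").orElse("none")); // BarcodeTextExtractor
        System.out.println(firstMatch(orderB, "image/png").orElse("none")); // Tesseract3TextExtractor
        System.out.println(byPriority(orderA, "image/png").orElse("none")); // Tesseract3TextExtractor
        System.out.println(byPriority(orderB, "image/png").orElse("none")); // Tesseract3TextExtractor
    }
}
```

With first-match selection the winner changes with the list order, while the priority variant returns the same extractor for both orders. Disabling the competing extractors has the same practical effect as giving one extractor sole ownership of the MIME type.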

Finally, consider wiki.openkm.com totally deprecated; you should use https://docs.openkm.com/kcenter/view/okm-6.3-com/ instead.

Here is the code of the current text extractors: https://github.com/openkm/document-management-system/tree/master/src/main/java/com/openkm/extractor

The main class that processes each extractor based on document type is RegisteredExtractors; there you can see where the warning comes from:

document-management-system/src/main/java/com/openkm/extractor/RegisteredExtractors.java

Line 122 in eca35fc

if (text.length() < MIN_EXTRACTION) {

(because the extracted text is shorter than 16 characters)
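As a rough illustration of that check, the following sketch applies the same comparison to two sample strings. The class and method names are hypothetical, and the value 16 for MIN_EXTRACTION is taken from the remark above rather than from the OpenKM source.

```java
public class MinExtractionSketch {
    // Assumed from the remark above: text shorter than 16 characters is "too few".
    static final int MIN_EXTRACTION = 16;

    // Hypothetical helper mirroring the quoted condition text.length() < MIN_EXTRACTION.
    static boolean tooFewText(String text) {
        return text.length() < MIN_EXTRACTION;
    }

    public static void main(String[] args) {
        // An empty result, as produced when the wrong extractor handles the image.
        System.out.println(tooFewText("")); // prints true
        // A made-up line of recognized text, longer than the threshold.
        System.out.println(tooFewText("Invoice 2021-08-30, total 42 EUR")); // prints false
    }
}
```

This is why an extractor that returns nothing for the image (for example a barcode reader on a page of plain text) produces only the warning and no error.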

TheKvist · 2021-08-31T08:13:43Z

Thanks for getting in touch,

don't worry, you don't need to explain to me what a WARN is :) I posted it there because I expected to see some ERROR, but since I didn't, I simply wanted to copy what exactly I was seeing so you knew what I was talking about.

Nevertheless, thank you for explaining the underlying problem with the ambiguity of extractors to me, and thank you even more for pointing towards the plugin section as that solved my problem/answered my initial question. Disabling Abby, Cuneiform, and Barcode in the TextExtractor plugin, of course, did the trick for 6.3.11 there.

I feel this info should absolutely go into the docs about setting up OCR, don't you think so as well? I spent the better part of a day to find out what's going on there and wasn't sure if it was me or OpenKM who lost their mind there.

So thanks for clearing that up for me, again, it wasn't at all about the warning, but about the unpredictable behavior making no sense to me :)

Issue will be closed, but this really needs to be documented

darkman97i · 2021-10-01T09:27:17Z

Hi @TheKvist

Following your suggestion, I have updated the documentation at https://docs.openkm.com/kcenter/view/okm-6.3-com/configuring-ocr-engine.html and added a new section at https://docs.openkm.com/kcenter/view/okm-6.3-com/plugins.html (I hope it is clearer now).

TheKvist changed the title ~~Can't get OCR to work~~ OCR not working correctly in OpenKM 6.3.11 Aug 30, 2021

TheKvist closed this as completed Aug 31, 2021
