
OCR issues


Due to intermittent issues with OCR, we need to decide how much effort we want to invest in fixing it, or whether OCR support in Iris is something that we will postpone until a later date.

In order to monitor our progress, it is important to leave the current unit test for OCR unchanged.

Open questions:

  • Why are results so different across machines, even on the same platform? If they are using the same libraries, shouldn't they be the same?

These kinds of differences across machines can have multiple causes:

  1. Small differences between versions 3.02 and 3.03. Despite the seemingly minor version bump, there are noticeable differences in the recognition algorithms. We should make sure that Homebrew and Scoop install the same version, so that we all run the same build.
  2. The traineddata files also matter. Training Tesseract yourself may help improve the results. We need the traineddata installed on all our platforms; we currently install them only on Linux. New traineddata for version 4.0.0 is also available: https://github.com/tesseract-ocr/tessdata/releases
  3. Different screen resolutions and screenshot DPI across machines can also influence text recognition.
  4. In my opinion, our ocr_search code needs to be simplified. As also noted here: https://groups.google.com/forum/#!topic/tesseract-ocr/zkk9Dl86aL8 — by default we do a lot of image pre-processing on the Regions/Images captured for analysis. Sometimes this helps, but sometimes it skews the results across platforms (see the sketch after this list). The code is in https://github.com/mozilla/iris/blob/master/iris/api/core/util/image_remove_noise.py and https://github.com/mozilla/iris/blob/master/iris/api/core/util/ocr_search.py
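
A minimal baseline sketch covering points 1–4, assuming pytesseract and Pillow are available; `recognize`, the width threshold, and the resize factor are illustrative assumptions, not Iris code:

```python
import pytesseract
from PIL import Image

# 1. Verify everyone runs the same Tesseract build (Homebrew/Scoop/apt).
print(pytesseract.get_tesseract_version())

def recognize(path, lang="eng", tessdata_dir=None):
    """OCR a screenshot with minimal pre-processing."""
    image = Image.open(path)
    # 3. Upscale small/low-DPI captures so glyphs have enough pixels.
    if image.width < 1000:
        image = image.resize((image.width * 2, image.height * 2), Image.LANCZOS)
    # 2. Point Tesseract at a shared traineddata directory when given one.
    config = '--tessdata-dir "%s"' % tessdata_dir if tessdata_dir else ""
    # 4. No extra noise filters here; compare this output against the
    #    pre-processed path in ocr_search.py.
    return pytesseract.image_to_string(image, lang=lang, config=config)
```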

Even though I have not thoroughly tested Tesseract against all our needs, I am optimistic about what Tesseract v4's new algorithm can bring.

  • What are the exact failures from the OCR unit tests? Which lines are asserting false? If you have debug output, include it here.

I have fixed and re-enabled the OCR unit tests locally. I have run them on all our platforms, and they perform well, passing without failures and without differences between platforms. I did not see any issues in the other tests either.

  • How close are the returned text results to what's on screen?

In all the tests done, with different languages (Arabic, Russian, Chinese) and different fonts (italic, bold, and others), the accuracy was 90%. It also depends on how much noise the image had; in some cases I also applied noise-reduction filters.
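
For reference, a hedged sketch of how that accuracy figure could be measured against known ground truth; the sample file names, language codes, and expected strings are assumptions:

```python
from difflib import SequenceMatcher

import pytesseract
from PIL import Image

# Hypothetical test images paired with their known ("ground truth") text.
samples = [
    ("arabic_sample.png", "ara", "expected Arabic text"),
    ("russian_sample.png", "rus", "expected Russian text"),
    ("chinese_sample.png", "chi_sim", "expected Chinese text"),
]

for path, lang, expected in samples:
    recognized = pytesseract.image_to_string(Image.open(path), lang=lang).strip()
    # Character-level similarity as a rough accuracy proxy.
    score = SequenceMatcher(None, recognized, expected).ratio()
    print("%s: %.0f%% match" % (path, score * 100))
```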

  • What are common errors in the OCR text results?

Unneeded image noise, complicated fonts, and poor-quality images/screenshots.

  • What influences the accuracy of OCR? Region size? Text color? Other?

Accuracy can suffer from region size and from applying more pre-processing to images than needed. Testing with regions is indeed better for accuracy; we can also zoom in on captured images to improve accuracy (see the sketch below).
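
As a rough illustration, zooming a small captured region before recognition could look like this; the file name and the 3x factor are assumptions, and OpenCV is assumed to be available:

```python
import cv2
import pytesseract

region = cv2.imread("region_capture.png")  # hypothetical region capture
h, w = region.shape[:2]
# Enlarge the region so each glyph spans enough pixels for recognition.
zoomed = cv2.resize(region, (w * 3, h * 3), interpolation=cv2.INTER_CUBIC)
zoomed = cv2.cvtColor(zoomed, cv2.COLOR_BGR2RGB)  # pytesseract expects RGB order
print(pytesseract.image_to_string(zoomed))
```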

  • Is OCR for Regions working as well as full-screen OCR?

It should be, and using regions is more accurate, as long as the captured images are clear and free of other noise.

  • How do our OCR results compare to Sikuli?

This needs more investigation, but from what I have read online, Sikuli's OCR search also uses Tesseract, with a very poor and restricted implementation, and it is not that accurate.

  • Is OCR something we can rely on or not? How much of the problem is in Tesseract vs Iris?

I am optimistic that we can rely on Tesseract 4, especially for driving the Firefox UI (menus, options).

  • If we know what the problems are, what do we need to do to fix them? How long will it take?

This question was asked before the Iris 1.0 release, and I think it is safe to say that we can now rewrite the code and integrate Tesseract 4 into Iris 2.0. Writing better, cleaner text-recognition code will require a deeper understanding of Tesseract OCR and how it works. Example of a recently found issue: https://github.com/mozilla/iris/issues/1522

  • What are limitations on text recognition? Where does it fail, with regard to text formatting, color, size, font, etc.? Conversely, what conditions must exist where we can reliably extract text from a region?

One limitation is complicated fonts: even in my tests with more than one italic and bold font, accuracy was around 90%. Another is the extra, unneeded noise in the Regions/Images that we submit for text conversion. Take, for example, the create_region.py test. We create the region between the Cat.PNG images, and it captures this: https://screenshot.net/ooxlnbn . The value returned for this simple text is exactly: ' 3 This 1s a cat s'. This is a good example of what extra noise and bad pre-processing can do to a simple image of text (see the comparison sketch below).
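
To make this concrete, a hedged sketch comparing a raw pass against a pre-processed pass on the same capture; the file name and the binarization threshold are assumptions, with the threshold standing in for the heavier filters in image_remove_noise.py:

```python
import pytesseract
from PIL import Image

# Hypothetical capture of the region between the Cat.PNG images.
image = Image.open("region_between_cats.png").convert("L")

# Raw pass: no filtering at all.
raw = pytesseract.image_to_string(image)

# Pre-processed pass: simple binarization standing in for our filters.
binarized = image.point(lambda px: 255 if px > 150 else 0)
processed = pytesseract.image_to_string(binarized)

# Comparing the two outputs shows whether pre-processing helps or hurts
# this particular capture.
print(repr(raw))
print(repr(processed))
```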

  • What does Tesseract 4 provide that is better than the previous version?

Tesseract 4 adds a new OCR engine based on LSTM neural networks. The new version is faster and more accurate than version 3. Everything I have seen online is positive feedback for version 4 over version 3. These slides explain the changes in more detail: https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/6ModernizationEfforts.pdf
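
Once Tesseract 4 is installed, the LSTM engine can be selected explicitly through the standard --oem flag; the screenshot name and the page-segmentation mode below are assumptions:

```python
import pytesseract
from PIL import Image

# --oem 1 selects the LSTM engine only (0 is the legacy engine);
# --psm 6 assumes a single uniform block of text, e.g. a menu capture.
text = pytesseract.image_to_string(
    Image.open("menu_capture.png"),  # hypothetical screenshot
    config="--oem 1 --psm 6",
)
print(text)
```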

  • Our current API is limited. We have Region.text(), which provides a dump of all strings found in the region. We also have find/find_all. We should probably also support searching for text in the other Pattern-related APIs, such as exists, wait/wait_vanish. Investigate what it would take for us to support this.

This would take some time and effort: we would need to understand and rewrite the code, and make sure the result is more accurate than what we have. But I definitely encourage us to test text search in all the other methods listed above (a rough sketch follows).
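
As a starting point, a hypothetical sketch of a text-aware exists(); `text_exists` and its polling parameters are assumptions rather than the current Iris API, and ImageGrab availability varies by platform:

```python
import time

import pytesseract
from PIL import ImageGrab

def text_exists(text, bbox=None, timeout=3.0, interval=0.5):
    """Poll OCR output from a screen region until `text` appears or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        capture = ImageGrab.grab(bbox=bbox)  # full screen when bbox is None
        if text in pytesseract.image_to_string(capture):
            return True
        time.sleep(interval)
    return False

# wait_vanish could reuse the same loop with the condition inverted.
```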