Tesseract can't read the full image #777

TricksterCode210 · 2023-06-05T07:59:41Z

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
tesseract.js: 4.1.0
react: 18.2.0
npm: 8.5.0

Describe the bug
I'm writing my thesis on OCR scanners and came across Tesseract.js. While building some examples for my professor I ran into this issue (or maybe I did not configure it correctly). I think Tesseract does not recognize the new paragraphs. I don't know whether this is a configuration problem or a bug, but I really need your help.
(Sorry if this is confusing.)

To Reproduce
Steps to reproduce the behavior:

Choose the image to work with
Wait for the results

Please attach any input image required to replicate this behavior.
Result:

Example image:

Expected behavior
It should output the full text contained in the image.

Device Version:

OS + Version: Windows 10
Browser: Edge
Node version: v16.14.2

Code:

import { useEffect, useState } from "react";
import { createWorker } from "tesseract.js";
import "./App.css";
function App() {
  const [ocr, setOcr] = useState("");
  const [imageData, setImageData] = useState(null);
  const worker = createWorker({
    logger: (m) => {
      console.log(m);
    },
  });
  const convertImageToText = async () => {
    if (!imageData) return;
    await (await worker).load();
    await (await worker).loadLanguage("hun");
    await (await worker).initialize("hun");
    const {
      data: { text },
    } = await (await worker).recognize(imageData, {rotateAuto: true});
    setOcr(text);
  };

  useEffect(() => {
    convertImageToText();
  }, [imageData]);

  function handleImageChange(e) {
    const file = e.target.files[0];
    if (!file) return;
    const reader = new FileReader();
    reader.onloadend = () => {
      const imageDataUri = reader.result;
      console.log({ imageDataUri });
      setImageData(imageDataUri);
    };
    reader.readAsDataURL(file);
  }
  return (
      <div className="App">
        <div>
          <p>Choose an Image</p>
          <input
              type="file"
              name=""
              id=""
              onChange={handleImageChange}
              accept="image/*"
          />
        </div>
        <div className="display-flex">
          <img src={imageData} alt="" srcset="" />
          <p>{ocr}</p>
        </div>
      </div>
  );
}
export default App;

Balearica · 2023-06-05T17:27:09Z

It looks like this is an issue with Tesseract rather than any code specific to Tesseract.js or your implementation.

Tesseract uses a binarization algorithm provided by the Leptonica library by default (there are some configuration options related to binarization that you can look up in the Tesseract repo/documentation). This example shows how to access the intermediate images used by Tesseract. For your image, the binarization algorithm performs extremely poorly, with the text being almost entirely erased. I've attached the binarized image.
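As a rough sketch of how the binarized intermediate image could be requested from Tesseract.js v4: the helper below asks `recognize` for the binary image alongside the text. The helper name `recognizeWithBinaryImage` is hypothetical, and the third `recognize` argument with `imageBinary: true` is an assumption based on the v4 image-processing example, so check it against the documentation for your version. The worker is passed in rather than created here, so the helper does not depend on how the worker was set up.

```javascript
// Hypothetical helper (not part of Tesseract.js): runs OCR and also requests
// the binarized intermediate image that Tesseract works from.
// `worker` is assumed to be an already initialized Tesseract.js v4 worker.
async function recognizeWithBinaryImage(worker, image) {
  const { data } = await worker.recognize(
    image,
    { rotateAuto: true }, // same option as in the original report
    { text: true, imageBinary: true } // assumed output flags, verify for your version
  );
  return { text: data.text, binarizedImage: data.imageBinary };
}

// Possible usage inside the component from the report (sketch only):
//   const { text, binarizedImage } =
//     await recognizeWithBinaryImage(await worker, imageData);
//   setOcr(text);
//   // binarizedImage is assumed to be usable as an <img> src
```

Displaying the binarized image next to the input makes it easy to see whether the text survives preprocessing before looking at recognition accuracy at all.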

I do not have a strong understanding of the binarization algorithm used by default, but the fact that the text is very light gray compared to some of the darker blacks found in the image appears to be what is tripping it up. To demonstrate, I removed the picture from the input image, and the text no longer disappears during the binarization step.
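To illustrate why light gray text can vanish when much darker regions are present, here is a minimal sketch of a global Otsu threshold in plain JavaScript. It is an assumption that this resembles the default behavior; this is not Leptonica's or Tesseract's actual code, and the function names and pixel counts are made up for the illustration.

```javascript
// Illustrative global Otsu threshold over 8-bit grayscale values (0..255).
// Returns t such that pixels <= t count as "dark" (foreground).
function otsuThreshold(pixels) {
  const hist = new Array(256).fill(0);
  for (const p of pixels) hist[p] += 1;

  const total = pixels.length;
  let sumAll = 0;
  for (let v = 0; v < 256; v++) sumAll += v * hist[v];

  let weightDark = 0;
  let sumDark = 0;
  let bestVariance = -1;
  let bestThreshold = 0;
  for (let t = 0; t < 256; t++) {
    weightDark += hist[t];
    sumDark += t * hist[t];
    if (weightDark === 0) continue; // no dark pixels yet
    const weightLight = total - weightDark;
    if (weightLight === 0) break; // everything is already in the dark class
    const meanDark = sumDark / weightDark;
    const meanLight = (sumAll - sumDark) / weightLight;
    // Between-class variance (up to a constant factor); Otsu maximizes it.
    const variance = weightDark * weightLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}

// A pixel is kept as foreground (ink) only if it is at or below the threshold.
function isForeground(value, threshold) {
  return value <= threshold;
}

// Made-up page: white background, light gray text, and a dark picture.
const page = [
  ...Array(1000).fill(255), // background
  ...Array(100).fill(190), // light gray text
  ...Array(400).fill(20), // dark picture region
];
const threshold = otsuThreshold(page);
// The gray text (190) ends up on the background side and is erased.
console.log(threshold, isForeground(190, threshold), isForeground(20, threshold));
```

With the dark picture region removed from `page`, the only remaining split is between the gray text and the white background, which matches the observation that the text survives once the picture is taken out.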

If you use the version with the picture removed, Tesseract will spit out some text. I do not expect it to be accurate. I can hardly make out the words in the input image myself, so I do not expect Tesseract to be able to recognize them accurately.

One final small thing: the worker.load call can be removed. It was necessary in old versions but does not do anything in v4 and above.
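A minimal sketch of the v4 setup without the `load` call might look like the following. The helper name `createInitializedWorker` is hypothetical, and `createWorker` is passed in as a parameter (in the app it would be the function imported from "tesseract.js") so the sketch stays independent of the package.

```javascript
// Hypothetical helper: creates a v4 worker and prepares it for one language.
// `createWorker` is the function imported from "tesseract.js" in a real app.
async function createInitializedWorker(createWorker, lang) {
  const worker = await createWorker({
    logger: (m) => console.log(m),
  });
  // No worker.load() here: it is a no-op in v4 and above.
  await worker.loadLanguage(lang);
  await worker.initialize(lang);
  return worker;
}

// Possible usage (sketch only):
//   import { createWorker } from "tesseract.js";
//   const worker = await createInitializedWorker(createWorker, "hun");
//   const { data: { text } } = await worker.recognize(imageData, { rotateAuto: true });
//   await worker.terminate();
```

Awaiting the worker once and reusing it also avoids the repeated `await (await worker)` pattern from the original code.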

TricksterCode210 · 2023-06-06T07:33:17Z

Thank you for your well-explained answer. That helped me a lot.

Balearica · 2023-06-06T22:43:34Z

No problem. Closing the issue given this is not a bug with Tesseract.js.

Balearica added the "dependency bug" (valid bug where fixing is outside the scope of this repo) and "recognition accuracy" labels Jun 5, 2023

Balearica closed this as completed Jun 6, 2023
