Tesseract can't read the full image #777

Closed
TricksterCode210 opened this issue Jun 5, 2023 · 3 comments
Labels
dependency bug Valid bug where fixing is outside the scope of this repo recognition accuracy

Comments

@TricksterCode210

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
tesseract.js: 4.1.0
react: 18.2.0
npm: 8.5.0

Describe the bug
I'm writing my thesis on OCR scanners and came across Tesseract.js. I wanted to create some examples for my professor and ran into this issue (or maybe I did not configure it correctly). I think Tesseract does not recognize the new paragraphs. I don't know whether I have misconfigured something or this is a bug, but I really need your help.
(Sorry if this is confusing.)

To Reproduce
Steps to reproduce the behavior:

  1. Choose the image to work with
  2. Wait for the results

Please attach any input image required to replicate this behavior.
Result:
image
Example image:
image

Expected behavior
It should return a large block of text covering the whole image.

Device Version:

  • OS + Version: Windows 10
  • Browser: Edge
  • Node version: v16.14.2

Code:

import { useEffect, useState } from "react";
import { createWorker } from "tesseract.js";
import "./App.css";
function App() {
  const [ocr, setOcr] = useState("");
  const [imageData, setImageData] = useState(null);
  const worker = createWorker({
    logger: (m) => {
      console.log(m);
    },
  });
  const convertImageToText = async () => {
    if (!imageData) return;
    await (await worker).load();
    await (await worker).loadLanguage("hun");
    await (await worker).initialize("hun");
    const {
      data: { text },
    } = await (await worker).recognize(imageData, {rotateAuto: true});
    setOcr(text);
  };

  useEffect(() => {
    convertImageToText();
  }, [imageData]);

  function handleImageChange(e) {
    const file = e.target.files[0];
    if(!file)return;
    const reader = new FileReader();
    reader.onloadend = () => {
      const imageDataUri = reader.result;
      console.log({ imageDataUri });
      setImageData(imageDataUri);
    };
    reader.readAsDataURL(file);
  }
  return (
      <div className="App">
        <div>
          <p>Choose an Image</p>
          <input
              type="file"
              name=""
              id=""
              onChange={handleImageChange}
              accept="image/*"
          />
        </div>
        <div className="display-flex">
          <img src={imageData} alt="" srcSet="" />
          <p>{ocr}</p>
        </div>
      </div>
  );
}
export default App;
@Balearica
Member

It looks like this is an issue with Tesseract rather than any code specific to Tesseract.js or your implementation.

Tesseract uses a binarization algorithm provided by the Leptonica library by default (there are some configuration options related to binarization that you can look up in the Tesseract repo/documentation). This example shows how to access the intermediate images used by Tesseract. For your image, the binarization algorithm performs extremely poorly with the text being almost entirely erased--I've attached the binarized image.

download (5)
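
For readers who want to inspect the binarized image themselves, here is a minimal sketch. It assumes that Tesseract.js v4's `worker.recognize` accepts an output-options object as its third argument with `imageColor`, `imageGrey`, and `imageBinary` flags, and that the results come back on `data` under the same names; check this against the documentation for your version. The helper name `getIntermediateImages` is hypothetical, and `worker` is expected to be already initialized.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Assumes `worker` is an initialized Tesseract.js v4 worker and that
// recognize(image, options, output) supports the image* output flags.
async function getIntermediateImages(worker, image) {
  const { data } = await worker.recognize(
    image,
    { rotateAuto: true },
    { imageColor: true, imageGrey: true, imageBinary: true }
  );
  // Each value is expected to be an image that can be shown in an <img> tag.
  return {
    color: data.imageColor,
    grey: data.imageGrey,
    binary: data.imageBinary,
  };
}
```

Displaying the `binary` result next to the original upload makes it easy to see whether the text survived binarization before looking at the recognized output.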

I do not have a strong understanding of the Leptonica binarization algorithm used by default, but the fact that the text is very light gray compared to some of the darker blacks found in the image appears to be what is tripping it up. To demonstrate, I removed the image, and the text no longer disappears during the binarization step.

user_example_1a
download (6)
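
The effect described above can be illustrated with a toy model. The sketch below uses a plain global Otsu threshold on made-up pixel counts; it is not Leptonica's actual implementation, and the gray levels (10 for the dark picture, 180 for the light text, 255 for the background) are invented for illustration. When many very dark pixels are present, the threshold settles between the dark pixels and everything else, so the light gray text is classified as background.

```javascript
// Simplified global Otsu threshold over 8-bit grayscale values.
// Illustration only: this is NOT the algorithm Tesseract/Leptonica runs.
function otsuThreshold(pixels) {
  const hist = new Array(256).fill(0);
  for (const p of pixels) hist[p]++;

  const total = pixels.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let sumLow = 0;   // sum of values at or below the candidate threshold
  let countLow = 0; // number of pixels at or below the candidate threshold
  let best = 0;
  let bestVariance = -1;

  for (let t = 0; t < 256; t++) {
    countLow += hist[t];
    sumLow += t * hist[t];
    if (countLow === 0) continue;
    const countHigh = total - countLow;
    if (countHigh === 0) break;
    const meanLow = sumLow / countLow;
    const meanHigh = (sumAll - sumLow) / countHigh;
    // Between-class variance (up to a constant factor).
    const variance = countLow * countHigh * (meanLow - meanHigh) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best; // pixels <= best become black, the rest white
}

const fill = (value, count) => new Array(count).fill(value);

// Made-up page: dark picture (10), light gray text (180), white paper (255).
const withPicture = [...fill(10, 300), ...fill(180, 50), ...fill(255, 650)];
// Same page with the dark picture removed.
const textOnly = [...fill(180, 50), ...fill(255, 650)];

const TEXT_LEVEL = 180;
console.log("text kept with picture:", TEXT_LEVEL <= otsuThreshold(withPicture));
console.log("text kept without picture:", TEXT_LEVEL <= otsuThreshold(textOnly));
```

In this toy model the text only survives once the dark region is gone, which matches the behaviour seen in the two binarized images, although the real algorithm may differ in its details.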

If you use the version with the image removed, Tesseract will spit out some text. I do not expect it to be accurate. I can hardly make out the words in the input image myself, so I would not expect Tesseract to recognize them accurately.

One final small thing--the worker.load function can be removed. That was necessary in old versions but does not do anything in v4 and above.
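
As a rough sketch of what the setup could look like without the `load` call, assuming the v4 API used in the original code (`createWorker`, `loadLanguage`, `initialize`, `recognize`): the factory name `makeRecognizer` is hypothetical, and `createWorkerFn` is passed in as a parameter so the sketch does not depend on how the library is imported. Reusing a single worker across calls is an extra design choice here, not something the advice above requires.

```javascript
// Hypothetical factory, not part of Tesseract.js.
// `createWorkerFn` is assumed to behave like tesseract.js v4's createWorker.
function makeRecognizer(createWorkerFn, lang = "hun") {
  let workerPromise = null; // created on first use, then reused

  function getWorker() {
    if (!workerPromise) {
      workerPromise = (async () => {
        const worker = await createWorkerFn({ logger: (m) => console.log(m) });
        // No worker.load() here: it is not needed in v4 and above.
        await worker.loadLanguage(lang);
        await worker.initialize(lang);
        return worker;
      })();
    }
    return workerPromise;
  }

  return async function recognize(image) {
    const worker = await getWorker();
    const {
      data: { text },
    } = await worker.recognize(image, { rotateAuto: true });
    return text;
  };
}
```

In the React component this could be used as `const recognize = makeRecognizer(createWorker);` outside the component body, with `setOcr(await recognize(imageData))` inside the effect, so a new worker is not created on every render.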

@Balearica Balearica added dependency bug Valid bug where fixing is outside the scope of this repo recognition accuracy labels Jun 5, 2023
@TricksterCode210
Author

Thank you for your well explained answer. That helped me a lot.

@Balearica
Member

No problem. Closing the issue given this is not a bug with Tesseract.js.
