update leptonica, tesseract, and libk2pdfopt #1800

benoit-pierre (Contributor) commented May 30, 2024

Depends on koreader/libk2pdfopt#49.

Thanks to some upstream cleanups in tesseract, the resulting code size (bss+data+text) is reduced by ~1.4-2.3 MB.

Note: tesseract is compiled with legacy engine support, which means the resulting mode after initialization will depend on the type of language data available:

  • tessdata (the one we use for the testsuite): 22.4 MB for English; contains data for both legacy & new engine: run the LSTM recognizer (new), but allow fallback to Tesseract (legacy) when things get difficult.
  • tessdata-fast: 3.92 MB for English; best “value for money” in speed vs accuracy (Integer models): run the LSTM line recognizer only.
  • tessdata-best: 14.7 MB for English; best results on Google’s eval data, slower (Float models): same mode as above.

AFAIK, there are no legacy-only language files.
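
The mode selection described in the list above can be sketched roughly as follows. This is an illustrative Python model, not Tesseract's actual code: `EngineMode`, `resolve_engine_mode`, and `VARIANTS` are hypothetical names made up for this example, and the per-variant availability flags are assumptions taken from the descriptions above. The numeric values are meant to mirror Tesseract's `--oem` values 0 to 2.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """Hypothetical stand-in for the engine mode chosen at initialization."""
    LEGACY_ONLY = 0                # legacy Tesseract recognizer only
    LSTM_ONLY = 1                  # LSTM line recognizer only
    LSTM_WITH_LEGACY_FALLBACK = 2  # LSTM first, legacy when things get difficult


def resolve_engine_mode(has_legacy: bool, has_lstm: bool) -> EngineMode:
    """Pick a mode from what the language data provides.

    Assumes a build with legacy engine support, as in this PR.
    """
    if not has_legacy and not has_lstm:
        raise ValueError("language data contains no usable model")
    if not has_lstm:
        return EngineMode.LEGACY_ONLY
    if not has_legacy:
        return EngineMode.LSTM_ONLY
    return EngineMode.LSTM_WITH_LEGACY_FALLBACK


# (has_legacy, has_lstm) per data set, as described in the list above (assumption).
VARIANTS = {
    "tessdata": (True, True),
    "tessdata-fast": (False, True),
    "tessdata-best": (False, True),
}

if __name__ == "__main__":
    for name, flags in VARIANTS.items():
        print(f"{name}: {resolve_engine_mode(*flags).name}")
```

With these assumptions, only the combined `tessdata` files resolve to the fallback mode; the other two variants end up in LSTM-only mode whether or not the legacy engine is compiled in.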

I tested both modes, English only with tessdata & tessdata-fast, and it did not seem to make a difference (which might just mean that the fallback on legacy was never triggered). Disabling the legacy engine would save an additional ~500 KB. But I don't know if the new engine is (always) better, especially for other languages.

The version used by Linux distributions seems to vary: for example, Arch Linux uses the English tessdata variant, while Ubuntu uses tessdata-fast. ¯\_(ツ)_/¯



Dependency for newer tesseract versions.
Frenzie (Member) commented May 31, 2024

> I tested both modes, English only with tessdata & tessdata-fast, and it did not seem to make a difference (which might just mean that the fallback on legacy was never triggered).

You mean on a device I presume?

@benoit-pierre benoit-pierre force-pushed the pr/update_leptonica_tesseract_libk2pdfopt branch from 4f81d97 to 588c233 Compare May 31, 2024 21:21
@benoit-pierre benoit-pierre marked this pull request as ready for review May 31, 2024 21:23
benoit-pierre (Contributor, Author) commented

> > I tested both modes, English only with tessdata & tessdata-fast, and it did not seem to make a difference (which might just mean that the fallback on legacy was never triggered).
>
> You mean on a device I presume?

Mainly with the emulator.

- bump leptonica to 1.84.1
- bump tesseract to 5.3.4
- bump libk2pdfopt to 2.55
@benoit-pierre benoit-pierre force-pushed the pr/update_leptonica_tesseract_libk2pdfopt branch from 588c233 to b7d28fa Compare May 31, 2024 21:25
@Frenzie Frenzie merged commit be04eb0 into koreader:master May 31, 2024
1 check passed
@benoit-pierre benoit-pierre deleted the pr/update_leptonica_tesseract_libk2pdfopt branch May 31, 2024 22:02