Add line size metrics (ascender, descender, size) to line objects in blocks output #906

Closed
Balearica opened this issue Mar 30, 2024 · 1 comment · Fixed by #909

Comments

@Balearica
Collaborator

There is currently no easy way to retrieve accurate line size metrics using Tesseract.js. Several sub-optimal ways of accomplishing this are listed below.

  • Use the font size reported on word objects in the blocks output format.
    • These metrics are rarely accurate, and unhelpful without additional information
      • There is no absolute mapping between font size and pixels
  • Use the ascender/descender/size metrics (x_ascenders, x_descenders, x_size) in the hOCR output
    • These are useful and accurate (at least using Tesseract Legacy)
    • Unfortunately, getting the values is a hassle
      • Extracting requires using an XML parser or regular expressions
  • Calculate font size manually using character-level bounding boxes
    • This works, but is even more of a hassle than parsing from hOCR
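
For reference, the hOCR workaround above can be sketched as follows. This is a minimal regex-based sketch, not a robust parser. It assumes the line elements look like Tesseract's usual hOCR (`<span class='ocr_line' ... title="...; x_size 30; x_descenders 6; x_ascenders 8">`), and `parseLineMetrics` is a hypothetical helper name, not part of Tesseract.js.

```javascript
// Hypothetical helper: pull per-line size metrics out of an hOCR string.
// Assumes each line is a <span> with class "ocr_line" whose title attribute
// carries "x_size", "x_descenders" and "x_ascenders" fields.
function parseLineMetrics(hocr) {
  const metrics = [];
  // Match only the opening tag of each line-level span.
  const tagRe = /<span\b[^>]*\bclass=["']ocr_line["'][^>]*>/g;
  for (const [tag] of hocr.matchAll(tagRe)) {
    // Read one numeric field from the tag; null if the field is absent.
    const num = (key) => {
      const m = tag.match(new RegExp(`\\b${key}\\s+(-?\\d+(?:\\.\\d+)?)`));
      return m ? Number(m[1]) : null;
    };
    metrics.push({
      size: num('x_size'),
      descenders: num('x_descenders'),
      ascenders: num('x_ascenders'),
    });
  }
  return metrics;
}

// Usage with a hand-written sample line (not real OCR output).
const sampleHocr =
  `<span class='ocr_line' id='line_1_1' ` +
  `title="bbox 36 92 580 122; baseline 0 -6; x_size 30; x_descenders 6; x_ascenders 8">` +
  `text</span>`;
console.log(parseLineMetrics(sampleHocr));
```

Even in this small form, the caller has to request hOCR output and re-associate each parsed entry with the corresponding line, which is the hassle described above.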

The ascender/descender/row height metrics reported in the hOCR output (as x_ascenders, x_descenders, and x_size) should be added to line objects in the blocks output format. This would make it easy to retrieve accurate line size data.

@Balearica
Collaborator Author

There is a RowAttributes getter in Tesseract; however, it is not accessible through Tesseract.js-core because of how recently it was added. Implementing this will therefore require a new minor version of Tesseract.js-core. The implementation must not break code for users with older versions of Tesseract.js-core.
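
One way to keep older Tesseract.js-core versions working is feature detection. The sketch below is only an illustration of that idea: `iterator` stands in for whatever object the core build exposes, and `getRowAttributes` is a hypothetical method name, not a confirmed Tesseract.js-core API.

```javascript
// Hypothetical guard: only read row attributes when the core build exposes
// the getter. `getRowAttributes` is an assumed name used for illustration.
function readRowAttributes(iterator) {
  if (!iterator || typeof iterator.getRowAttributes !== 'function') {
    // Older core build: the getter is missing, so report "no metrics"
    // instead of throwing.
    return null;
  }
  return iterator.getRowAttributes();
}

// Usage with stand-in objects (not real Tesseract.js-core iterators).
const oldCore = {};
const newCore = {
  getRowAttributes: () => ({ rowHeight: 30, descenders: -6, ascenders: 8 }),
};
console.log(readRowAttributes(oldCore));
console.log(readRowAttributes(newCore));
```

With a guard like this, line objects could carry the new metrics when available and a null value otherwise, so existing code that never reads them is unaffected.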
