BBOX x and y coordinates #792

Closed
Lex-talionis opened this issue Jun 27, 2023 · 2 comments


@Lex-talionis

Hi,

I am converting my PDFs to images and then reading them with no problem. But I'm not sure I understand what unit the x and y coordinates in the bbox section are in.

For example.

  • When I process a PDF with pdf.js, without Tesseract.js OCR, and use x and y coordinates in PDF points, I can extract the text in the area with no problem.

  • When I process the image derived from the PDF with Tesseract.js OCR and use the same x and y coordinates in PDF points, I cannot extract the text in the area.

So what I'm asking is: what unit of measurement do the bbox coordinates use, and do I need to do something to them to get them to correlate with the original x and y coordinates from the PDF?

I hope that makes sense.

I have tried the docs but I've not been able to find the answer.

Thanks in advance.

@Balearica
Member

All Tesseract coordinates are in pixels. Tesseract only supports images, so this is the only relevant unit.

The conversion between PDF units (points/inches) and pixels is determined by the DPI setting of the program you use to render PDFs to images. Therefore, this is a question you will likely answer by searching the pdf.js documentation rather than the Tesseract.js documentation.
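To make the points-to-pixels direction concrete, here is a minimal sketch that maps a region given in PDF points onto the rendered image. The function name `pointsRectToPixelRect` and its parameters are hypothetical, and it assumes the page was rendered at a single known DPI, that the region uses the usual PDF convention of a bottom-left origin, and that the image origin is top-left. If I recall correctly, the returned shape matches the `rectangle` option that Tesseract.js `recognize` accepts, but that is worth verifying against its docs.

```javascript
// One PDF point is 1/72 inch by definition.
const PDF_POINTS_PER_INCH = 72;

// Hypothetical helper: map a rectangle in PDF points (origin bottom-left,
// (x, y) = lower-left corner) to a pixel rectangle (origin top-left).
// Assumes the whole page was rendered at `dpi` with no cropping or rotation.
function pointsRectToPixelRect(rectPts, dpi, pageHeightPts) {
  const toPx = (v) => Math.round((v * dpi) / PDF_POINTS_PER_INCH);
  return {
    left: toPx(rectPts.x),
    // Flip the y axis: the top edge in image space is measured from the page top.
    top: toPx(pageHeightPts - rectPts.y - rectPts.height),
    width: toPx(rectPts.width),
    height: toPx(rectPts.height),
  };
}

// Example: a 2in x 1in region, 1in from the left and 1in from the top
// of a US Letter page (792 pt tall), rendered at 150 DPI.
const pixelRect = pointsRectToPixelRect(
  { x: 72, y: 648, width: 144, height: 72 },
  150,
  792
);
console.log(pixelRect);
```

With pdf.js specifically, the effective DPI depends on the `scale` passed when creating the viewport, so it is safer to compute it from the scale you actually used than to rely on a default.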

A Google search appears to indicate that PDF.js uses a default DPI of 150. If that is true, the conversion would be points = pixels * (72 / 150).
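The formula above can be sketched as a small helper. The name `bboxPixelsToPoints` is hypothetical. It assumes the bbox has the `x0`, `y0`, `x1`, `y1` fields that Tesseract.js reports, and it takes the DPI as a parameter rather than hard-coding 150, since that value should be confirmed for your own rendering setup. Because PDF coordinates normally have their origin at the bottom-left while image coordinates start at the top-left, the sketch also flips the y axis when a page height is supplied.

```javascript
// One PDF point is 1/72 inch by definition.
const POINTS_PER_INCH = 72;

// Hypothetical helper: convert a pixel bbox (origin top-left) to PDF points.
// `dpi` is the resolution the page image was rendered at (an assumption you
// must supply). If `pageHeightPts` is given, the result is flipped into
// bottom-left-origin PDF space; otherwise only the scaling is applied.
function bboxPixelsToPoints(bbox, dpi, pageHeightPts) {
  const toPts = (v) => (v * POINTS_PER_INCH) / dpi;
  const scaled = {
    x0: toPts(bbox.x0),
    y0: toPts(bbox.y0),
    x1: toPts(bbox.x1),
    y1: toPts(bbox.y1),
  };
  if (pageHeightPts === undefined) {
    return scaled;
  }
  // After the flip, y0 is still the smaller coordinate (the bottom edge).
  return {
    x0: scaled.x0,
    y0: pageHeightPts - scaled.y1,
    x1: scaled.x1,
    y1: pageHeightPts - scaled.y0,
  };
}

// Example: a bbox from a page rendered at 150 DPI, page height 792 pt.
const pixelBbox = { x0: 150, y0: 300, x1: 450, y1: 600 };
console.log(bboxPixelsToPoints(pixelBbox, 150));
console.log(bboxPixelsToPoints(pixelBbox, 150, 792));
```

If the converted boxes still look offset, the DPI assumption is the first thing to check: comparing the image width in pixels with the page width in points gives the actual ratio.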

@Lex-talionis
Author

Thank you so much for your reply. I should have thought of that. I think I had my blinkers on and was thinking bbox was a PDF thing. Much appreciated. Have a great day!
