BBOX x and y coordinates #792

Lex-talionis · 2023-06-27T12:13:12Z

Hi,

I am converting my PDFs to images and then reading them with no problem. But I'm not sure I understand what unit the X and Y coordinates in the bbox section are in.

For example:

When I process a PDF with pdf.js alone (no tesseract.js OCR) and use PDF point x and y coordinates, I can extract the text in the area with no problem.
When I process the same PDF with tesseract.js OCR on the image derived from it and use the same PDF point x and y coordinates, I cannot extract the text in the area.

So what I'm asking is: what unit of measurement do the bbox coordinates use, and do I need to do something to them to get them to correlate with the original x and y coordinates from the PDF?

I hope that makes sense.

I have checked the docs but I've not been able to find the answer.

Thanks in advance.

Balearica · 2023-06-27T23:03:52Z

All Tesseract coordinates are in pixels. Tesseract only supports images, so this is the only relevant unit.

The conversion between PDF units (points/inches) and pixels is determined by the DPI setting of the program you use to render PDFs to images. Therefore, this is a question you will more likely answer by searching the pdf.js documentation than the Tesseract.js documentation.

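A Google search appears to indicate that PDF.js uses a default DPI of 150. If that is true, the conversion would be points = pixels * (72 / 150), since one PDF point is 1/72 of an inch. For any other render resolution, replace 150 with the DPI you actually used.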
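To make the conversion concrete, here is a minimal sketch of turning a Tesseract.js bbox (`x0`, `y0`, `x1`, `y1` in pixels) into PDF points. The helper name `bboxPixelsToPoints` and the sample numbers are hypothetical, and the DPI of 150 is only the unverified figure mentioned above, so pass in whatever resolution your renderer really used. The optional third argument is an assumption about your use case: image coordinates start at the top-left, while default PDF user space starts at the bottom-left, so if your PDF coordinates are bottom-left based you would also need to flip y using the page height in points.

```javascript
// One PDF point is 1/72 of an inch.
const POINTS_PER_INCH = 72;

/**
 * Hypothetical helper: convert a pixel bbox from Tesseract.js into PDF points.
 * bbox          - { x0, y0, x1, y1 } in pixels, origin at the image's top-left
 * dpi           - resolution the PDF page was rendered at (an assumption you must supply)
 * pageHeightPts - optional page height in points; when given, y is flipped
 *                 so the result uses a bottom-left origin
 */
function bboxPixelsToPoints(bbox, dpi, pageHeightPts) {
  const toPts = (px) => (px * POINTS_PER_INCH) / dpi;

  const x0 = toPts(bbox.x0);
  const x1 = toPts(bbox.x1);
  const top = toPts(bbox.y0);    // distance from the top edge, in points
  const bottom = toPts(bbox.y1); // distance from the top edge, in points

  if (pageHeightPts === undefined) {
    // Keep the top-left origin; only the unit changes.
    return { x0, y0: top, x1, y1: bottom };
  }

  // Flip to a bottom-left origin: y0 becomes the lower edge of the box.
  return { x0, y0: pageHeightPts - bottom, x1, y1: pageHeightPts - top };
}

// Made-up bbox for illustration, rendered at an assumed 150 DPI.
const wordBox = { x0: 150, y0: 300, x1: 450, y1: 375 };
console.log(bboxPixelsToPoints(wordBox, 150));      // top-left origin
console.log(bboxPixelsToPoints(wordBox, 150, 792)); // bottom-left origin, 792 pt tall page
```

If you render with a scale factor rather than an explicit DPI, the effective DPI is likely 72 times that scale (assuming the page uses the default 1/72 inch user unit), but check this against your renderer's documentation.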

Lex-talionis · 2023-06-28T08:35:29Z

Thank you so much for your reply. I should have thought of that. I think I had my blinkers on and was thinking bbox was a PDF thing. Much appreciated. Have a great day!!!

Lex-talionis closed this as completed Jun 28, 2023
