Hi,
I am converting my PDFs to images and then reading them with no problem. But I'm not sure I understand what unit the X and Y coordinates in the bbox section are in.
For example:
When I process a PDF without tesseract.js OCR, using pdf.js, and use PDF-point x and y coordinates, I can extract the text in the area with no problem.
When I process a PDF with tesseract.js OCR, using the image derived from the PDF, and use the same PDF-point x and y coordinates, I cannot extract the text in the area.
So what I'm asking is: what unit of measurement do the bbox coordinates use, and do I need to convert them so that they correlate with the original x and y coordinates from the PDF?
I hope that makes sense.
I have looked through the docs but have not been able to find the answer.
Thanks in advance.
All Tesseract coordinates are in pixels. Tesseract only supports images, so this is the only relevant unit.
The conversion between PDF units (points, at 72 per inch) and pixels is determined by the DPI setting of the program you use to render PDFs to images. Therefore, this is a question you will more likely answer by searching the pdf.js documentation than the Tesseract.js documentation.
A Google search appears to indicate that PDF.js uses a default DPI of 150. If that is true, the conversion would be points = pixels * (72 / 150).
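As a rough sketch of that conversion (not taken from either library), the helper below scales a Tesseract.js bbox from pixels to points. The function name `bboxPixelsToPoints` and its `dpi` and `pageHeightPts` parameters are hypothetical, and the 150 DPI used in the example is only the unconfirmed figure mentioned above, so substitute whatever resolution your renderer actually uses. The optional Y flip is there because PDF user space normally has its origin at the bottom-left of the page, while image coordinates start at the top-left.

```javascript
// Hypothetical helper, not part of tesseract.js or pdf.js.
const POINTS_PER_INCH = 72; // one PDF point is 1/72 inch

// bbox: { x0, y0, x1, y1 } in image pixels (top-left origin), as in Tesseract.js output.
// dpi: resolution used when rendering the PDF page to an image (assumed known).
// pageHeightPts: optional page height in points; when given, the Y axis is flipped
// so the result uses a bottom-left origin.
function bboxPixelsToPoints(bbox, dpi, pageHeightPts) {
  const toPoints = (px) => (px * POINTS_PER_INCH) / dpi;
  const scaled = {
    x0: toPoints(bbox.x0),
    y0: toPoints(bbox.y0),
    x1: toPoints(bbox.x1),
    y1: toPoints(bbox.y1),
  };
  if (pageHeightPts === undefined) {
    return scaled; // still top-left origin, only the unit changed
  }
  // Flip vertically: the image's bottom edge (y1) becomes the lower PDF y.
  return {
    x0: scaled.x0,
    y0: pageHeightPts - scaled.y1,
    x1: scaled.x1,
    y1: pageHeightPts - scaled.y0,
  };
}

// Example: a box found on a page rendered at an assumed 150 DPI.
const box = { x0: 150, y0: 300, x1: 450, y1: 375 };
const inPoints = bboxPixelsToPoints(box, 150);
const inPdfSpace = bboxPixelsToPoints(box, 150, 792); // 792 pt = US Letter height
console.log(inPoints, inPdfSpace);
```

If the converted box still does not line up with the text positions from pdf.js, the assumed DPI is the first thing to check, followed by whether the page has a rotation or a non-zero crop-box offset, neither of which this sketch handles.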
Thank you so much for your reply. I should have thought of that. I think I had my blinkers on and was thinking bbox was a PDF thing. Much appreciated. Have a great day!