Not able to detect tables from ocr to text converted pdf #988
-
pdfplumber can detect tables in ordinary PDFs, but I am using ocrmypdf to convert scanned images into a PDF with text and tables. The output is also a PDF containing text and tables, and the tables look like tables visually, but pdfplumber cannot detect them. How can I detect tables in an OCR-converted scanned PDF?
Replies: 3 comments 4 replies
-
Hi @kathimohan, thanks for your interest in the library. Please share the PDF (with any sensitive text redacted) so that we can assist you better. When you OCR a PDF, the data for the horizontal and vertical line separators (curves and edges) is usually lost. If that is the case, you can use the "text" strategy for the horizontal and vertical lines. You can also try the "explicit" strategy and define the coordinates of the line separators yourself. You can find more about the various options here.
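To make the two options concrete, here is a minimal sketch of how the settings could be built and passed to pdfplumber. The helper names (`text_strategy_settings`, `explicit_strategy_settings`, `extract_tables`) are made up for illustration, and the tolerance values are only starting points that will likely need tuning per document.

```python
def text_strategy_settings(snap_tolerance=3, join_tolerance=3, text_x_tolerance=3):
    """Settings that infer cell borders from word alignment instead of drawn lines."""
    return {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": snap_tolerance,
        "join_tolerance": join_tolerance,
        "text_x_tolerance": text_x_tolerance,
    }


def explicit_strategy_settings(vertical_lines, horizontal_strategy="text"):
    """Settings that use caller-supplied x coordinates as the column separators.

    pdfplumber expects at least two explicit lines for this strategy.
    """
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": list(vertical_lines),
        "horizontal_strategy": horizontal_strategy,
    }


def extract_tables(pdf_path, settings, page_index=0):
    """Open the PDF and extract tables from one page with the given settings."""
    import pdfplumber  # imported here so the settings helpers work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=settings)
```

For example, `extract_tables("out-converted.pdf", text_strategy_settings())` would try the text strategy on the first page. The x coordinates passed to `explicit_strategy_settings` are in PDF points measured from the left edge of the page.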
-
out-converted.pdf
-
Thank you for your help, but I can't use explicit_vertical_lines because different PDFs have different measurements. "vertical_strategy": "text" helps somewhat, but it does not detect tables as reliably as pdfplumber does for non-scanned PDFs. I also can't detect bold text in the converted scanned PDFs. Could you please help me get the bold text and improve the reliability of the "vertical_strategy": "text" approach?
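One possible way around hard-coded measurements is to derive the vertical lines per page from the word positions themselves. The sketch below is not part of pdfplumber: `infer_vertical_lines` is a hypothetical helper, and `min_gap` is an assumed threshold. It takes word dictionaries with `x0` and `x1` keys, such as those returned by `page.extract_words()`, and places a separator in the middle of every horizontal gap that no word covers.

```python
def infer_vertical_lines(words, min_gap=5.0):
    """Guess column separators from horizontal gaps that no word overlaps.

    Returns the left edge, the midpoint of each empty gap at least
    `min_gap` wide, and the right edge, as a list of x coordinates.
    """
    spans = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    if not spans:
        return []

    lines = [spans[0][0]]      # left border of the text block
    covered_to = spans[0][1]   # rightmost x reached by the words seen so far
    for x0, x1 in spans[1:]:
        if x0 - covered_to >= min_gap:
            # Empty vertical strip between two columns: split it in the middle.
            lines.append((covered_to + x0) / 2)
        covered_to = max(covered_to, x1)
    lines.append(covered_to)   # right border of the text block
    return lines


# Toy input: three columns of words at x = 10-40, 60-90 and 120-150.
sample_words = [
    {"x0": 10, "x1": 40},
    {"x0": 12, "x1": 38},
    {"x0": 60, "x1": 90},
    {"x0": 120, "x1": 150},
]
print(infer_vertical_lines(sample_words))  # [10.0, 50.0, 105.0, 150.0]
```

The result can be supplied as `explicit_vertical_lines` with `"vertical_strategy": "explicit"`. Any word that spans several columns, such as a title or a paragraph outside the table, will bridge the gaps, so this probably works best on words taken from a page cropped to the table region.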
Thanks for the PDF. For this case, the best way will be to use the explicit vertical lines. The table settings that you can use are
The result will be