Not able to detect tables from ocr to text converted pdf #988

kathimohan · 2023-09-13T14:14:24Z

kathimohan
Sep 13, 2023

pdfplumber can detect tables in regular PDFs, but I am using ocrmypdf to convert scanned images into a PDF with text and tables. The output is also a PDF containing text and tables, and the tables look like tables visually, but pdfplumber is not able to detect them. How can I detect tables in an OCR-converted scanned PDF?


samkit-jain · 2023-09-14T07:07:24Z

samkit-jain
Sep 14, 2023
Collaborator

Hi @kathimohan, I appreciate your interest in the library. Could you please share the PDF (redacting any sensitive text) so that we can assist you better?

Usually, when you OCR a PDF, the data about the horizontal and vertical line separators (curves and edges) is lost. If that is the case, you can use the "text" strategy for the horizontal and vertical lines. You can also try the "explicit" strategy and explicitly define the coordinates of the line separators. You can find more about the various options here.
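
One rough way to check whether the separators survived OCR is to count the drawn objects pdfplumber finds on the page. The sketch below is only a heuristic, and the function names `choose_strategy` and `strategy_for_page` are made up for illustration; `page.lines`, `page.rects` and `page.curves` are the pdfplumber page properties it relies on.

```python
def choose_strategy(n_lines, n_rects, n_curves):
    """Pick "lines" when the page has any drawn objects, otherwise "text"."""
    return "lines" if (n_lines + n_rects + n_curves) > 0 else "text"


def strategy_for_page(page):
    """Apply the heuristic to a pdfplumber page (e.g. pdf.pages[0])."""
    # An OCR'd scan often has none of these, because the table borders
    # are only pixels in the background image, not vector graphics.
    return choose_strategy(len(page.lines), len(page.rects), len(page.curves))
```

The returned value can be used for both "vertical_strategy" and "horizontal_strategy". Drawn objects that are unrelated to the table (page borders, underlines) will fool this check, so treat it as a first guess only.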

0 replies

kathimohan · 2023-09-14T10:43:57Z

kathimohan
Sep 14, 2023
Author

out-converted.pdf
Can you please help me with this file?

1 reply

samkit-jain Sep 15, 2023
Collaborator

Thanks for the PDF. For this case, the best approach is to use explicit vertical lines. The table settings that you can use are

{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "snap_tolerance": 5,
    "explicit_vertical_lines": [87, 125, 335, 402, 525]
}

The result will be

['JA', 'WAHARLAL NEHRU TECHNOLOGICAL', 'UNIVERSIT', 'Y HYDERABAD']
['', '', '', '']
['', 'Academic Calendar 2', '021-22', '']
['', '', '', '']
['', 'B. TECH./B.PHARM. Il & IV YEAR', 'S I & I! SEM', 'ESTERS']
['', '', '', '']
['I SEM', '', '', '']
['', '', '', '']
['', 'sue', '', 'Duration']
['S. No', 'Description', 'From', '__| To']
['', '', '', '']
['1', 'Commencement of I Semester classwork', '', '06.092021']
['2', 'st . ‘ :\nx Spell of Insericsions (inetading', '06.09.2021 |', '06.11.2021 (9 Weeks)']
['', 'Dussehra Recess)', '', '']
['3', 'Dussehra Recess', '11.10.2021', '16.10.2021 (1 Week)']
['4', 'First Mid Term Examinations', '08.11.2021', '13.11.2021 (1 Week)']
['', 'Submission of First Mid Term Exam Marks', '', '']
['', 'to the University on or before', '', 'a0LLAET']
['6', '2"4 Spell of Instructions', '15.11.2021', '08.01.2022 (8 Weeks)']
['7', 'Second Mid Term Examinations', '10.01.2022', '18.01.2022 (1 Week)']
['8', 'Frepanition Holidays and. Practical', '19.01.2022 |', '25.01.2022 (1 Week)']
['', 'Examinations', '', '']
['', 'Submission of Second Mid Term Exam', '', '']
['g', 'Marks to the University on or before', '', 'asanioeks']
['10', 'End Semester Examinations', '27.01.2022', '| 09.02.2022']
['', '', '', '']
['II SEM', '', '', '']
['', '', '', '']
['', 'a', '', 'Duration']
['S. No', 'Description', 'From', 'To']
['', '', '', '']
['1', 'Commencement of II Semester classwork', '', '10.02.2022']
['2', '1 Spell of Instructions', '10.02.2022', '06.04.2022 (8 Weeks)']
['3', 'First Mid Term Examinations', '07.04.2022', '13.04.2022 (1 Week)']
['', 'Submission of First Mid Term Exam Marks', '', '']
['=', 'to the University on or before\n=a 2 : :', '', 'soit']
['5', '2 Spell of Instructions (including Summer', '16.04.2022 |', '24.06.2022 (10 Weeks)']
['', 'Vacation)', '', '']
['6', 'Summer Vacation', '09.05.2022', '21.05.2022 (2 Weeks)']
['7', 'Second Mid Term Examinations', '25.06.2022', '01.07.2022 (1 Week)']
['g _|', 'Erqpanation Holicigye mel Prectionl', '02.07.2022 |', '09.07.2022 (1 Week)']
['', 'Examinations', '', '']
['', 'Submission of Second Mid Term Exam', '', '']
['2', 'Marks to the University on or before', '', 'Onis']
['10', 'End Semester Examinations', '11.07.2022', '| 23.07.2022 (2 Weeks)']
['', '', '', '']
['', '', '', 'sa']
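
For reference, a minimal sketch of how settings like these can be passed to pdfplumber. The helper name `print_table_rows` and the default page index are assumptions for illustration; the thread does not say which page was used.

```python
# Settings copied from the answer above: five x-coordinates give four columns.
TABLE_SETTINGS = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "snap_tolerance": 5,
    "explicit_vertical_lines": [87, 125, 335, 402, 525],
}


def print_table_rows(pdf_path, page_index=0):
    """Print each row of the largest table found on the given page."""
    # Imported here so the settings can be inspected without pdfplumber.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        table = pdf.pages[page_index].extract_table(TABLE_SETTINGS)
    # extract_table returns None when no table is found.
    for row in table or []:
        print(row)
```

Calling `print_table_rows("out-converted.pdf")` should then print rows of four cells each, since every pair of neighbouring vertical lines bounds one column.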

Answer selected by samkit-jain

kathimohan · 2023-09-15T10:20:00Z

kathimohan
Sep 15, 2023
Author

Thank you for your help, but I can't use explicit_vertical_lines because different PDFs have different measurements. "vertical_strategy": "text" helps somewhat, but it does not detect tables as reliably as pdfplumber does for non-scanned PDFs. I am also not able to detect bold letters in the converted scanned PDFs. Can you please help me get the bold letters and improve the reliability of the "vertical_strategy": "text" approach?

3 replies

samkit-jain Sep 15, 2023
Collaborator

Yes, you are right that the coordinates may not work for other similar PDFs. Unfortunately, there's not much you can do when dealing with scanned statements. You'll either have to get the proper ePDF versions or write your own custom logic. For the "text" vertical approach, you will get the best results if you crop to the table area first, but that means you need to know where the table is. You can then tweak parameters like "min_words_vertical" to get better results.

Specifically for scanned statements, my recommendation is to use a service like Amazon Textract, or to use OpenCV to do contour detection: identify the table borders, then the vertical edges, and then extract the tabular data based on those.
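
A minimal sketch of the crop-then-tune idea, assuming you already know the table's bounding box. The helper names `text_settings` and `extract_cropped_table` are made up for illustration, and the default values 3 and 1 are, as far as I know, pdfplumber's own defaults for these two settings.

```python
def text_settings(min_words_vertical=3, min_words_horizontal=1):
    """Build table settings that use the "text" strategy on both axes."""
    return {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        # Minimum number of aligned words needed to imply a column edge.
        "min_words_vertical": min_words_vertical,
        # Minimum number of aligned words needed to imply a row edge.
        "min_words_horizontal": min_words_horizontal,
    }


def extract_cropped_table(pdf_path, bbox, page_index=0, **tuning):
    """Crop the page to bbox = (x0, top, x1, bottom), then extract a table."""
    # Imported here so text_settings can be used without pdfplumber.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        cropped = pdf.pages[page_index].crop(bbox)
        return cropped.extract_table(text_settings(**tuning))
```

For example, `extract_cropped_table("out-converted.pdf", (80, 150, 530, 700), min_words_vertical=5)` would require five aligned words before a column edge is accepted; the bounding box here is a placeholder, not a measured value.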

samkit-jain Sep 15, 2023
Collaborator

An example: https://stackoverflow.com/a/51756462/7760998

samkit-jain Sep 16, 2023
Collaborator

You can also try the approach in #974 (comment) as an alternative. The code in that discussion tries to find the number of columns in a page and can be applied to a table as well.
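
In the same spirit, here is a small sketch of inferring column separators from word positions, so that "explicit_vertical_lines" can be computed per PDF instead of hard-coded. This is not the code from #974; the function name `column_separators` and the `min_gap` threshold are assumptions for illustration. It expects word dicts with "x0" and "x1" keys, which is what `page.extract_words()` returns.

```python
def column_separators(words, min_gap=10.0):
    """Return x positions of vertical separators inferred from whitespace gaps."""
    # Horizontal extent of every word, sorted left to right.
    spans = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    if not spans:
        return []
    # Merge overlapping or nearly touching spans into column bands.
    bands = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 - bands[-1][1] < min_gap:
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            bands.append([x0, x1])
    # One line at each outer edge, plus one in the middle of every gap.
    lines = [bands[0][0]]
    for left, right in zip(bands, bands[1:]):
        lines.append((left[1] + right[0]) / 2)
    lines.append(bands[-1][1])
    return lines
```

The result can be passed as "explicit_vertical_lines" together with "vertical_strategy": "explicit". A single line of text that spans the full width (a title, for instance) merges all bands into one, so this works best on words taken from the cropped table body only.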
