Extracting Text returns empty/None for a similar table #1099
-
Hi @lcabrera07, and thanks for your kind words about |
-
I am attaching two PDFs: one I can extract text from and one I cannot. I'll also attach the debug images that pdfplumber outputs. To be precise, when I say it "cannot", I mean that extract_tables() returns an array containing only four empty strings. I am using the configuration {"intersection_tolerance": 7, "text_keep_blank_chars": True} on both PDFs.
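For reference, here is a minimal sketch of how the configuration above would be passed to extract_tables(). The function name `extract_tables_from`, its `path` argument, and the `page_number` default are placeholders for illustration, not names from the attached files.

```python
# Settings quoted in the post above.
TABLE_SETTINGS = {
    "intersection_tolerance": 7,
    "text_keep_blank_chars": True,
}


def extract_tables_from(path, page_number=0, table_settings=TABLE_SETTINGS):
    """Open a PDF and run extract_tables() on one page.

    `path` and `page_number` are hypothetical; substitute your own file.
    """
    # Imported lazily so the settings dict can be inspected without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        # Returns a list of tables; each table is a list of rows of cell text.
        return page.extract_tables(table_settings)
```

Running the same function on both PDFs with identical settings keeps the comparison fair, so any difference in output comes from the files themselves.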
-
Here is another question. A PDF I have (attached below) is returning arrays of individual characters instead of words (see the image below). I have tried adjusting the config with different values of {x_tolerance=3, y_tolerance=3, x_tolerance_ratio}, but nothing changes when I call extract_tables(). Am I using the wrong configuration values for words?
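One thing that may be worth checking, offered as an assumption about the cause rather than a diagnosis: in pdfplumber's table settings, the word-grouping tolerances are written with a `text_` prefix (`text_x_tolerance`, `text_y_tolerance`), while extract_words() takes the unprefixed names. The sketch below builds both forms from the values in the post; the helper `with_text_prefix`, the function `compare_words_and_tables`, and its `path` argument are hypothetical.

```python
# Word-extraction values quoted in the post above.
WORD_SETTINGS = {"x_tolerance": 3, "y_tolerance": 3}


def with_text_prefix(word_settings):
    """Return a copy of the settings with each key prefixed by 'text_'."""
    return {f"text_{key}": value for key, value in word_settings.items()}


# The same values, spelled the way table settings expect them.
TABLE_SETTINGS = with_text_prefix(WORD_SETTINGS)


def compare_words_and_tables(path, page_number=0):
    """Run word extraction and table extraction with matching tolerances.

    `path` is a placeholder for the PDF in question.
    """
    # Imported lazily so the settings can be inspected without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        words = page.extract_words(**WORD_SETTINGS)
        tables = page.extract_tables(TABLE_SETTINGS)
    return words, tables
```

If extract_words() already groups the characters correctly with these tolerances, the prefixed settings are a reasonable next thing to try in extract_tables().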
-
Hi, thank you for providing a Python PDF extraction library!
I am working on PDFs that have an index I would like to extract. Extraction works on an index whose table lines are fully defined, but it doesn't seem to work on an index whose table lines don't intersect. I have tried several variations of the config, including the lines_strict strategy, but it still returns text only from the index with defined lines. I had to use lines_strict because I think pdfplumber is mistakenly treating the thicker vertical line as a table.
The image below shows the two PDFs with the tool's debug overlay: the left one without intersecting lines and the right one with intersecting lines.
![Screenshot 2024-02-22 at 10 34 04 PM](https://private-user-images.githubusercontent.com/6684876/307240919-edc894f2-c11b-4d6c-8fa5-1170f782d7dd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIzNDAxMzEsIm5iZiI6MTcyMjMzOTgzMSwicGF0aCI6Ii82Njg0ODc2LzMwNzI0MDkxOS1lZGM4OTRmMi1jMTFiLTRkNmMtOGZhNS0xMTcwZjc4MmQ3ZGQucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDczMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MzBUMTE0MzUxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Y2IzM2Y4MzEwMzZjZDgyYjI4ODg5MzhhY2RjZmY0ZWJhMzJhOGMwMzc4ZjhlMTZjMjMxNTM5MGU4M2VhZWJiMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.xWn41eNzWNvzLKqKlgYMuhNSk1Cw_bEX7FAeFE3iRXM)
I got the left result by passing lines_strict and intersection_tolerance = 10 to extract_tables(). The right is what I get with no config passed to extract_tables().
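A minimal sketch of the two calls described above. It assumes lines_strict was applied to both the vertical and the horizontal strategy, which the post does not specify; the function name `extract_strict_and_default` and its `path` argument are placeholders.

```python
# Assumption: lines_strict on both axes; the post only names the strategy.
STRICT_SETTINGS = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
    "intersection_tolerance": 10,
}


def extract_strict_and_default(path, page_number=0):
    """Extract tables once with the strict settings and once with defaults.

    `path` is a hypothetical file path; substitute the PDF being tested.
    """
    # Imported lazily so the settings dict can be inspected without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        strict_tables = page.extract_tables(STRICT_SETTINGS)
        default_tables = page.extract_tables()  # no config passed
    return strict_tables, default_tables
```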
I'm not sure why the left PDF ignores the horizontal lines, but I think that would not matter if the entire table were treated as 3 columns, meaning 3 cells. My next guess is that the text may be too far apart to extract. I continue to get [['', '', '']]
The PDF on the right works fine, and extract_tables() returns what I need. I get the following: [['&', '15th 14:3,5,6\n16 116:21\n200:14\n16th 187:15\n17011...
Is this a config issue? I also tried the text strategy for the horizontal/vertical settings, but the text is not aligned consistently, so I think that approach is even harder.
![Screenshot 2024-02-22 at 10 55 19 PM](https://private-user-images.githubusercontent.com/6684876/307242329-7fd7698f-ef10-4e36-85df-0e16fd9af4f0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIzNDAxMzEsIm5iZiI6MTcyMjMzOTgzMSwicGF0aCI6Ii82Njg0ODc2LzMwNzI0MjMyOS03ZmQ3Njk4Zi1lZjEwLTRlMzYtODVkZi0wZTE2ZmQ5YWY0ZjAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDczMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MzBUMTE0MzUxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YjVhYzU0OTk4YjQ2MDRkNzg1ZDE4Y2IxMjNiMjk1NDVhOTJhM2NmMDYzMDBmZjRmMjdlMjYxYTY3ZDliM2UyNyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.4-Zwazo_qwBDg3i6Q3-uf4KscBOLRHW6Z_QyDAdrAMA)
Thanks in advance.