Opaque/ curious behavior with extract_tables and extract_text parameters #1006
-
Hey Justin. Can you perhaps show how/where you are seeing the duplication? I can't seem to see it using the example PDF.
import pdfplumber
import pandas as pd
pdf = pdfplumber.open("Downloads/Tech.Writing.Schedule.pdf")
table_settings = {
"snap_x_tolerance": 10,
}
# using pandas just to "pretty-print" the table
pd.DataFrame(pdf.pages[0].extract_tables(table_settings)[0])
-
Ah okay. I now see the duplicates. The reason I wasn't seeing them is that I was extracting the table differently.
Take the cell with bbox (440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001). If we save it as an image to inspect it, we see no text.
bbox = 440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001
page.crop(bbox).to_image(300).save('crop.png')
I was a bit confused by this behaviour and posted a similar question recently: #930
>>> page.crop(bbox).extract_text()
'to the Class'
>>> page.within_bbox(bbox).extract_text()
''
crop() returns the text because some of its "pixels" fall within the bbox, whereas within_bbox() only keeps objects that lie entirely inside it. (Sort of like an intersection vs. subset comparison.)
That should at least explain why this is happening. It seems in this particular case, you need within_bbox() instead of crop().
[CELL]: (440.1681355932205, 64.55999999999995, 573.8242622950826, 89.75999999999999)
[CELL]: (440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001)
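To make the intersection vs. subset distinction concrete, here is a minimal pure-Python sketch. It does not call pdfplumber: `overlaps` and `is_inside` are hypothetical helpers that only mimic the documented behaviour (crop() retains objects at least partly within the box, within_bbox() retains only objects entirely within it), and the character box is made up for illustration.

```python
# Bounding boxes are (x0, top, x1, bottom) tuples, as in pdfplumber.
# NOTE: these helpers are illustrative assumptions, not pdfplumber internals.

def overlaps(obj, bbox):
    """True if obj shares any area with bbox (the crop()-style test)."""
    ox0, otop, ox1, obottom = obj
    bx0, btop, bx1, bbottom = bbox
    return ox0 < bx1 and ox1 > bx0 and otop < bbottom and obottom > btop


def is_inside(obj, bbox):
    """True if obj lies entirely within bbox (the within_bbox()-style test)."""
    ox0, otop, ox1, obottom = obj
    bx0, btop, bx1, bbottom = bbox
    return ox0 >= bx0 and otop >= btop and ox1 <= bx1 and obottom <= bbottom


# The thin "cell" from this thread: less than half a point tall.
sliver_cell = (440.1681355932205, 89.75999999999999,
               573.8242622950826, 90.24000000000001)

# A made-up character box whose top edge pokes into the sliver.
char_box = (445.0, 90.0, 451.0, 101.0)

print(overlaps(char_box, sliver_cell))   # True: part of the glyph is in the box
print(is_inside(char_box, sliver_cell))  # False: the glyph extends below the box
```

Under these assumptions, a line of text that merely touches a thin cell is "seen" by the overlap test but not by the containment test, which matches the crop() and within_bbox() outputs shown above.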
-
That's really helpful, thanks. I used within_bbox() instead of crop() and it has helped with the duplicates. I also took your idea of saving each cell 'snapshot' as an image, along with the extracted cell text in a text file. It was an extremely efficient debugging method (I paired the two files with a timestamp and put in a slight time delay to allow the files to be written). I found that it was reading the cells properly.
The reason I'm using find_tables and not extract_table/s is that a) I need the table cell coordinates so that I can map the cells to the corresponding weekdays and time periods for the weekly calendar type formats, and b) I want to use the same approach for both formats (weekly calendar vs. list of date events). The extract_table/s methods don't provide the cell coordinates, correct?
My goal is to 1) code the table recognition as generally as possible to allow the capture of a variety of formats and 2) avoid file-specific post-processing.
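For anyone following along, a minimal sketch of the approach described here: find_tables() supplies the cell coordinates and within_bbox() does the per-cell text extraction. `cells_with_text` is a hypothetical helper name, not part of pdfplumber. It assumes `page` is a pdfplumber page and that each table's `cells` attribute holds (x0, top, x1, bottom) tuples; the settings in the usage comment are simply the ones posted earlier in this thread.

```python
def cells_with_text(page, table_settings=None):
    """Return (bbox, text) pairs for every cell of every table on the page.

    Hypothetical helper: `page` is expected to behave like a pdfplumber page.
    """
    results = []
    for table in page.find_tables(table_settings or {}):
        for bbox in table.cells:
            # within_bbox() keeps only objects entirely inside the cell,
            # so text that merely touches the cell edge is not picked up.
            text = page.within_bbox(bbox).extract_text() or ""
            results.append((bbox, text))
    return results


# Usage (requires pdfplumber and the PDF from this thread):
# import pdfplumber
# with pdfplumber.open("Downloads/Tech.Writing.Schedule.pdf") as pdf:
#     for bbox, text in cells_with_text(pdf.pages[0], {"snap_x_tolerance": 10}):
#         print(bbox, repr(text))
```

Because each result keeps its bbox, the coordinates can still be mapped to weekdays and time periods afterwards.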
-
Thanks very much for a great library. I'm just beginning and am struggling to understand and manage the parameters that govern the extract_tables and extract_text methods, and I can't figure out what to change in order to fix some puzzling behaviour.
In the table I'm using as a test (first page of attached pdf), I'm getting all the bounding box coordinates and the cell contents with no problem, using
I was unable to get any sensible output from extract_text for each cell, iterating through the given bounding boxes, so I gave up on that. Part of that might have been that none of the parameters I tried changing seemed to do much (to be fair, it was pretty random as I couldn't understand the documentation for it).
I now apply extract_tables to single-cell 'tables' that are iteratively cropped from the original table, using the same parameters as above.
This works a bit better, in that I get all the text and can associate it with a cell/row/column. However, I get occasional duplicates, seemingly from cells that contain multi-line text: the second line of the text is identified as a cell-within-a-cell. For example, the top right-most table cell contains the text:
Blog 1 - Introduce Yourself
to the Class
This text is successfully retrieved, but the extract_tables method gives me another cell directly below containing
to the Class
and all the other fields/ columns in that row are blank - clearly an erroneous repeat of the second line.
I am currently post-processing the data and I can remove the rows with nothing in the first column, but I feel like I'm missing something here! Should I be using another technique? What should I do to dial down the sensitivity? All I want to do is extract the text cleanly and uniquely from the table cells that are already accurately identified and I need to minimise layout specific post-processing as this is the first of several table formats I plan to be processing.
Thanks in advance for any comments or suggestions.
Justin
Tech Writing Schedule.pdf
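As a stopgap, the post-processing step mentioned in this post (removing rows with nothing in the first column) could look roughly like the sketch below. `drop_rows_with_empty_first_cell` is a hypothetical helper, and the sample rows (including the "Week 1" label) are invented for illustration rather than taken from the PDF; rows are assumed to be lists of strings or None, as returned by extract_tables.

```python
def drop_rows_with_empty_first_cell(rows):
    """Keep only rows whose first cell contains non-whitespace text."""
    return [row for row in rows if row and (row[0] or "").strip()]


# Made-up sample data in the shape of an extracted table.
rows = [
    ["Week 1", "Blog 1 - Introduce Yourself\nto the Class"],
    [None, "to the Class"],  # spurious repeat of the second line
    ["", "to the Class"],    # same, with an empty string instead of None
]

for row in drop_rows_with_empty_first_cell(rows):
    print(row)
```

This only hides the symptom, and it would also drop any legitimate row whose first column happens to be blank, so it is layout-specific in exactly the way the post hopes to avoid.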