Problems extracting two tables from same page #1085
Replies: 2 comments
-
Thanks for sharing this PDF, @gam32bit. It's a fascinating example. There are a couple of related, obscure things going on.

The overarching issue is that, for some reason, the top tables on those pages are very, very, very slightly (imperceptibly) skewed. We're talking infinitesimal differences in positioning. Take the first line on the 4th page (pdf.pages[3].lines[0]):

    {'x0': 72.30546188719117,
     'y0': 652.4448951290829,
     'x1': 72.3054624052992,
     'y1': 677.5428190549828,
     'width': 5.181080240390656e-07,
     'height': 25.097923925899977,
     [...]}

This is a vertical line that is supposed to have the same x0 and x1 values, but they're off by 5.181080240390656e-07. This throws off pdfplumber, which considers objects with any skewness to be non-rectilinear.

Seemingly related to this, many of the graphical elements that typically would be recognized as rects aren't, and instead are parsed (by pdfminer.six, the parser pdfplumber uses) as curves. For these reasons, most of the elements in the top table are not used in the table detection:

    page.to_image().debug_tablefinder()

The good news is that repairing the PDF seems to fix this:

    pdf = pdfplumber.open(
        "Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf",
        repair=True
    )
    page = pdf.pages[3]
    im = page.to_image()
    im.debug_tablefinder()

Now you have a slightly different problem, which is that the tables are merged. Likely solvable via post-processing, but another way you could fix it is to filter out the gray rects:

    filtered = page.filter(lambda obj: obj.get("non_stroking_color") != (0.851563,))
    filtered.to_image().debug_tablefinder()

That looks better! Unfortunately, there's another issue, which is that the ever-so-slight skew of the text leads pdfminer.six to conclude that the text is not "upright", which leads pdfplumber to process it differently (and suboptimally). One way to fix this is to rewrite the upright property:

    assert page.chars  # touching page.chars ensures page._objects is populated
    for obj in page._objects["char"]:
        obj.update({"upright": True})

Putting this all together:

    import pdfplumber

    pdf = pdfplumber.open(
        "Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf",
        repair=True
    )

    def parse_tables(page):
        # Force every char to be treated as upright, despite the slight skew
        assert page.chars
        for obj in page._objects["char"]:
            obj.update({"upright": True})
        # Drop the gray rects that cause the two tables to be merged
        filtered = page.filter(lambda obj: obj.get("non_stroking_color") != (0.851563,))
        return filtered.extract_tables()

So parse_tables(pdf.pages[3]) should get you two tables.
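The gray-rect filter above compares non_stroking_color with exact tuple equality, which works for this file but could miss a fill whose value is rounded slightly differently. As a hedged sketch (not part of the original answer), the predicate below compares with a tolerance instead. The names GRAY, has_fill and keep_object and the tol value are illustrative assumptions, and the functions operate on plain dicts shaped like pdfplumber objects.

```python
import math

# Gray fill value quoted in the thread; the tolerance below is an assumed choice.
GRAY = (0.851563,)


def has_fill(obj, target=GRAY, tol=1e-4):
    """Return True if obj's non_stroking_color matches target within tol."""
    color = obj.get("non_stroking_color")
    if isinstance(color, (int, float)):
        color = (color,)  # treat a bare number as a one-component color
    if not isinstance(color, (tuple, list)) or len(color) != len(target):
        return False
    return all(
        isinstance(a, (int, float)) and math.isclose(a, b, abs_tol=tol)
        for a, b in zip(color, target)
    )


def keep_object(obj):
    """Predicate for filtering: keep everything except the gray-filled objects."""
    return not has_fill(obj)
```

Under those assumptions, page.filter(keep_object) would stand in for the lambda used in the answer.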
-
Wow! This is an amazing response. Really appreciate you taking the time to
work through this and explain. I will follow up if I have any other issues.
Thanks again!
Joe
…On Sat, Feb 10, 2024 at 6:05 PM Jeremy Singer-Vine ***@***.***> wrote:
Thanks for sharing this PDF, @gam32bit <https://github.com/gam32bit>.
It's a fascinating example. There are a couple of related, obscure things
going on.
The overarching issue is that — for *some* reason — the top tables on
those pages are very, very, very slightly (imperceptibly) skewed. We're
talking infinitesimal differences in positioning. Take the first line on
the 4th page (pdf.pages[3].lines[0]):
{'x0': 72.30546188719117,
'y0': 652.4448951290829,
'x1': 72.3054624052992,
'y1': 677.5428190549828,
'width': 5.181080240390656e-07,
'height': 25.097923925899977,
[...]}
This is a vertical line that is supposed to have the same x0 and x1
values, but they're off by 5.181080240390656e-07. This throws off
pdfplumber, which considers objects with any skewness to be
non-rectilinear.
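
To make the size of that skew concrete, here is a small diagnostic sketch that classifies a line dict by how far its endpoints drift apart. It is not something pdfplumber does internally; the function name classify_line and the tol threshold are assumptions chosen for illustration, and the sample values are the ones quoted above.

```python
def classify_line(line, tol=1e-3):
    """Label a line dict by comparing its horizontal and vertical extents."""
    dx = abs(line["x1"] - line["x0"])
    dy = abs(line["y1"] - line["y0"])
    if dx == 0:
        return "vertical"
    if dy == 0:
        return "horizontal"
    if dx <= tol:
        return "nearly vertical"  # skewed by less than tol
    if dy <= tol:
        return "nearly horizontal"
    return "diagonal"


# Coordinates of pdf.pages[3].lines[0] as quoted in this thread
first_line = {
    "x0": 72.30546188719117,
    "y0": 652.4448951290829,
    "x1": 72.3054624052992,
    "y1": 677.5428190549828,
}
print(classify_line(first_line))  # nearly vertical
```

Running this over page.lines before and after opening the file with repair=True would be one way to check whether the repair removed the skew.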
Seemingly related to this, many of the graphical elements that typically
would be recognized as rects aren't, and instead are parsed (by
pdfminer.six, the parser pdfplumber uses) as curves. For these reasons,
most of the elements in the top table are not used in the table detection:
page.to_image().debug_tablefinder()
image.png (view on web)
<https://github.com/jsvine/pdfplumber/assets/534702/82c94660-9c39-412c-ad5f-18b0b3ce92bf>
The good news is that repairing the PDF seems to fix this:
pdf = pdfplumber.open(
"Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf",
repair=True
)
page = pdf.pages[3]
im = page.to_image()
im.debug_tablefinder()
image.png (view on web)
<https://github.com/jsvine/pdfplumber/assets/534702/d4b3bb5c-0f0c-4708-a891-c849141726fa>
Now you have a slightly different problem, which is that the tables are
merged. Likely solvable via post-processing, but another way you could fix
is to filter out the gray rects:
filtered = page.filter(lambda obj: obj.get("non_stroking_color") != (0.851563,))
filtered.to_image().debug_tablefinder()
image.png (view on web)
<https://github.com/jsvine/pdfplumber/assets/534702/6823f9ba-dd7c-4b9f-b3ee-7f576d753833>
That looks better! Unfortunately, there's another issue, which is that the
ever-so-slight skew of the *text* leads pdfminer.six to conclude that the
text is not "upright", which leads pdfplumber to process it differently
(and suboptimally). One way to fix this is to rewrite the upright
property:
assert page.chars
for obj in page._objects["char"]:
    obj.update({"upright": True})
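
Before overwriting the flag, it may be worth confirming that non-upright text really is the problem on a given page. The sketch below is a hypothetical helper (non_upright_ratio is not a pdfplumber function) that takes a list of char dicts, such as page.chars, and reports the share flagged as not upright. Treating a missing "upright" key as upright is an assumption.

```python
def non_upright_ratio(chars):
    """Return the fraction of char dicts whose "upright" flag is falsy."""
    if not chars:
        return 0.0
    # Assumption: a char without an "upright" key counts as upright
    flagged = sum(1 for char in chars if not char.get("upright", True))
    return flagged / len(chars)
```

A ratio close to 1.0 on the affected pages and close to 0.0 elsewhere would be consistent with the explanation given here.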
Putting this all together:
pdf = pdfplumber.open(
"Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf",
repair=True
)
def parse_tables(page):
    assert page.chars
    for obj in page._objects["char"]:
        obj.update({"upright": True})
    filtered = page.filter(lambda obj: obj.get("non_stroking_color") != (0.851563,))
    return filtered.extract_tables()
So parse_tables(pdf.pages[3]) should get you two tables, the first of
which should look like this:
Screenshot.2024-02-10.at.6.04.03.PM.png (view on web)
<https://github.com/jsvine/pdfplumber/assets/534702/dbec80a8-c737-43df-a819-8ac2cef1c892>
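
Since the question mentions that most pages with data hold two tables, the same function could be applied across the whole document. The sketch below is one possible way to do that. The helper name tables_by_page is an assumption, and it takes the parse function as an argument so that it stays independent of pdfplumber.

```python
def tables_by_page(pages, parse):
    """Map 1-based page numbers to the tables that parse(page) returns.

    Pages for which parse returns nothing are left out of the result.
    """
    result = {}
    for number, page in enumerate(pages, start=1):
        tables = parse(page)
        if tables:
            result[number] = tables
    return result
```

With the objects from this thread, the call would be tables_by_page(pdf.pages, parse_tables). Note that parse_tables asserts page.chars, so a page without any text would raise an AssertionError and may need to be skipped first.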
--
Joseph Caterine
(757) 876-9443
jwcaterine.com
-
I am trying to extract data from this PDF, where most of the pages with data have two separate tables. Using extract_tables() only extracts the first table. Looking more into the pdfplumber documentation, it seems like I would have to manually set parameters to draw the specific lines of both tables on the pages, but I'm wondering if there's a less time-consuming solution. I found a similar discussion about problems extracting multiple tables from the same page, but in that case the two tables were different, whereas in this case the tables appear the same.
Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf