Problems extracting two tables from same page #1085

gam32bit · 2024-01-28T17:20:16Z

I am trying to extract data from this PDF, where most of the pages with data have two separate tables. Using extract_tables() only extracts the first table. Looking more into the pdfplumber documentation, it seems like I would have to manually set parameters to draw the specific lines of both tables on the pages, but I'm wondering if there's a less time-consuming solution. I found a similar discussion about problems extracting multiple tables from the same page, but in that case the two tables were different, whereas in this case the tables appear the same.

Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf

jsvine (Maintainer) · 2024-02-10T23:05:21Z

Thanks for sharing this PDF, @gam32bit. It's a fascinating example. There are a couple of related, obscure things going on.

The overarching issue is that — for some reason — the top tables on those pages are very, very, very slightly (imperceptibly) skewed. We're talking infinitesimal differences in positioning. Take the first line on the 4th page (pdf.pages[3].lines[0]):

{'x0': 72.30546188719117,
 'y0': 652.4448951290829,
 'x1': 72.3054624052992,
 'y1': 677.5428190549828,
 'width': 5.181080240390656e-07,
 'height': 25.097923925899977,
[...]}

This is a vertical line that is supposed to have the same x0 and x1 values, but they're off by 5.181080240390656e-07. This throws off pdfplumber, which considers objects with any skewness to be non-rectilinear.
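To check how widespread the skew is, one option is to scan a page's line objects for ones that are almost, but not exactly, vertical or horizontal. This is a minimal sketch, assuming each line is a dict with x0, y0, x1 and y1 keys as in the output above. The helper name find_near_rectilinear and the tol threshold are hypothetical, chosen for illustration, and are not part of pdfplumber:

```python
def find_near_rectilinear(lines, tol=1e-3):
    """Return lines that miss being vertical/horizontal by a tiny, nonzero amount."""
    skewed = []
    for line in lines:
        dx = abs(line["x1"] - line["x0"])
        dy = abs(line["y1"] - line["y0"])
        # Exactly vertical (dx == 0) or horizontal (dy == 0) lines are fine;
        # flag only those whose smaller extent is nonzero but below `tol`.
        if 0 < min(dx, dy) < tol:
            skewed.append(line)
    return skewed


# The line reported above: x0 and x1 differ by roughly 5e-07.
sample = {
    "x0": 72.30546188719117,
    "y0": 652.4448951290829,
    "x1": 72.3054624052992,
    "y1": 677.5428190549828,
}
# An invented, perfectly vertical line for comparison.
exact = {"x0": 10.0, "y0": 0.0, "x1": 10.0, "y1": 50.0}

print(len(find_near_rectilinear([sample, exact])))
```

On a real page, the input would be page.lines; a large count would suggest the skew affects the whole table rather than a stray line or two.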

Seemingly related to this, many of the graphical elements that typically would be recognized as rects aren't, and are instead parsed (by pdfminer.six, the parser pdfplumber uses) as curves. For these reasons, most of the elements in the top table are not used in the table detection:

page.to_image().debug_tablefinder()

The good news is that repairing the PDF seems to fix this:

import pdfplumber

# repair=True rewrites the PDF before parsing (requires Ghostscript)
pdf = pdfplumber.open(
    "Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf",
    repair=True
)

page = pdf.pages[3]
im = page.to_image()
im.debug_tablefinder()

Now you have a slightly different problem, which is that the two tables are merged into one. That is likely solvable via post-processing, but another way to fix it is to filter out the gray rects:

filtered = page.filter(lambda obj: obj.get("non_stroking_color") != (0.851563,))
filtered.to_image().debug_tablefinder()

That looks better! Unfortunately, there's another issue, which is that the ever-so-slight skew of the text leads pdfminer.six to conclude that the text is not "upright", which leads pdfplumber to process it differently (and suboptimally). One way to fix this is to rewrite the upright property:

# Accessing page.chars first ensures the page's objects have been parsed
assert page.chars
for obj in page._objects["char"]:
    obj.update({"upright": True})
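Before overwriting the flag, it may help to confirm how many characters are actually affected. This is a small sketch, assuming char objects are dicts with an "upright" key as in the loop above. The helpers count_non_upright and force_upright are hypothetical names, and the sample chars are invented for illustration:

```python
def count_non_upright(chars):
    """Return how many char dicts are flagged as not upright."""
    return sum(1 for ch in chars if not ch.get("upright", True))


def force_upright(chars):
    """Set upright=True on every char dict in place; return how many changed."""
    changed = 0
    for ch in chars:
        if not ch.get("upright", True):
            ch["upright"] = True
            changed += 1
    return changed


# Invented sample standing in for a page's char objects.
sample_chars = [
    {"text": "A", "upright": False},
    {"text": "B", "upright": True},
    {"text": "C", "upright": False},
]

before = count_non_upright(sample_chars)
changed = force_upright(sample_chars)
after = count_non_upright(sample_chars)
print(before, changed, after)
```

If the count is zero for a given page, the rewrite is unnecessary there and the text can be left alone.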

Putting this all together:

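import pdfplumber

pdf = pdfplumber.open(
    "Summary-of-Opioid-Funds-to-Virginia-Localities-as-of-Jan-2023.pdf",
    repair=True
)

def parse_tables(page):
    # Accessing page.chars ensures the page's objects have been parsed
    assert page.chars
    # Mark every character as upright so the slight text skew is ignored
    for obj in page._objects["char"]:
        obj.update({"upright": True})
    # Drop the gray rects that cause the two tables to merge
    filtered = page.filter(lambda obj: obj.get("non_stroking_color") != (0.851563,))
    return filtered.extract_tables()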

So parse_tables(pdf.pages[3]) should get you two tables, the first of which should look like this:

[Screenshot of the first extracted table]
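The fill color (0.851563,) is specific to this PDF. For a different document, one way to find candidate colors to filter on is to tally the non_stroking_color values of the rect objects. This is a rough sketch, assuming rects are dicts carrying a "non_stroking_color" key as in the filter above. The helper tally_fill_colors is a hypothetical name, and the sample data is made up:

```python
from collections import Counter


def tally_fill_colors(rects):
    """Count how often each fill color appears among rect-like dicts."""
    counts = Counter()
    for rect in rects:
        color = rect.get("non_stroking_color")
        # Colors may arrive as lists or tuples; normalize so they are hashable.
        if isinstance(color, list):
            color = tuple(color)
        counts[color] += 1
    return counts


# Made-up sample standing in for a page's rect objects.
sample_rects = [
    {"non_stroking_color": (0.851563,)},
    {"non_stroking_color": [0.851563]},
    {"non_stroking_color": (0,)},
    {},
]

counts = tally_fill_colors(sample_rects)
for color, n in counts.most_common():
    print(color, n)
```

With pdfplumber, the input would be page.rects. A color that shows up mostly on shaded header or separator rows is a reasonable first candidate to pass to page.filter.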

gam32bit (Author) · 2024-02-12T13:20:46Z

Wow! This is an amazing response. Really appreciate you taking the time to work through this and explain. I will follow up if I have any other issues. Thanks again! Joe

