One thing I noticed was that the outer rectangle was missing connection dots at the bottom, but I ignored it since the table was captured correctly.
If I crop the page from the top with (0, 120, page.width, page.height) and then run .debug_tablefinder(), the output is:
This time the output is a bit different because the content outside the table is also captured.
How did the left column get selected in the table extraction after cropping the page, even though it is well outside the intersection tolerance?
While trying to find the cause, I noticed that this time the outer rectangle had connection dots at both the top and the bottom. No matter how much of the top or bottom I cropped, the behaviour persisted. The top and bottom red lines in the output imply the presence of a rect object at those edges, but the top red line shouldn't be there, because the crop cut off the top portion of the outer rectangle. Also, why does a red line appear at the bottom edge after cropping? I did some debugging and found that certain objects had negative coordinate values. Here's a screengrab of the PyCharm debugger (notice the negative value in "y0"):
The negative "y0" value meant that the bottom red line should appear even with a zero crop ((0, 0, page.width, page.height)), which is verified by the following .debug_tablefinder() output:
Result of drawing those negative value rect objects:
The negative values are not due to a bug in pdfplumber or pdfminer.six; they come from the PDF itself. To verify, I ran pdftk input.pdf output uncompressed.pdf uncompress, opened uncompressed.pdf in a text editor, and found 8 negative coordinate values in it. Not sure what purpose they serve ¯\_(ツ)_/¯
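To make the effect of a negative "y0" concrete, here is a minimal sketch of the usual bottom-left to top-left coordinate conversion. It assumes top = page_height - y1 and bottom = page_height - y0, which is my understanding of how these values are derived; the helper name to_top_bottom and the sample numbers are made up for illustration and are not taken from the PDF in question.

```python
def to_top_bottom(y0, y1, page_height):
    """Convert PDF-space y0/y1 (origin at the bottom-left) into top/bottom
    measured from the top of the page. Hypothetical helper, not pdfplumber API."""
    return page_height - y1, page_height - y0


page_height = 792  # assumed page height in points (US Letter)

# A rect whose y0 is slightly negative, i.e. it starts below the page.
top, bottom = to_top_bottom(y0=-5, y1=700, page_height=page_height)

# With a negative y0, the bottom edge ends up past the page boundary.
rect_overflows_page = bottom > page_height
print(top, bottom, rect_overflows_page)
```

Under these assumptions the rect's bottom lies beyond the page height, so any crop that is bounded by the page can cut through it.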
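Should we treat negative coordinate values differently? Adding a check to clip_obj() in utils.py that returns None for any object with a negative coordinate (see the comment before step 6 in the code below) gives this .debug_tablefinder() output: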
Or, if the rects are to be kept because they hold information that the library ideally should not remove, changes to get_bbox_overlap() or clip_obj() might be required. I also created a dummy PDF with a similar layout; if I crop that page and run the table finder, the left column is not picked up (as expected).
Uncropped:
Cropped:
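For context on the get_bbox_overlap() / clip_obj() suggestion, here is a simplified sketch of what clipping a rect to a crop box does. The functions bbox_overlap and clip_rect are my own stand-ins written for illustration, not the library's actual implementation, and the coordinates are invented.

```python
def bbox_overlap(a, b):
    """Intersection of two (x0, top, x1, bottom) boxes, or None if disjoint.
    Simplified stand-in: touching boxes count as overlapping."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x0 <= x1 and top <= bottom:
        return (x0, top, x1, bottom)
    return None


def clip_rect(rect, crop_box):
    """Clip a rect dict to crop_box; return None when nothing remains."""
    box = (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
    overlap = bbox_overlap(box, crop_box)
    if overlap is None:
        return None
    x0, top, x1, bottom = overlap
    return {"x0": x0, "top": top, "x1": x1, "bottom": bottom}


page_width, page_height = 612, 792  # assumed page size in points

# An outer rect that starts above the crop line and ends below the page.
outer = {"x0": 50, "top": 100, "x1": 560, "bottom": 797}
crop_box = (0, 120, page_width, page_height)

clipped = clip_rect(outer, crop_box)
print(clipped)
```

In this sketch the clipped rect is still a rect: its top now sits exactly on the crop line and its bottom on the page boundary, which would be consistent with red lines showing up at both edges after a crop.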
Code
import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

# 1. Get the first screenshot
im = page.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 2. Get the second screenshot
cropped = page.crop((0, 120, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 3. Third screenshot from PyCharm

# 4. Get the fourth screenshot
cropped = page.crop((0, 0, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 5. Get the fifth screenshot
orig = page.objects["rect"]
cropped = page.crop((0, 0, page.width, page.height))
new = []
for obj in orig:
    if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"]):
        new.append(obj)
im = page.to_image(resolution=150)
im.draw_rects(new)
im.save("out.png", format="PNG")

# Now add:
#     if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"]):
#         return None
# under ``clip_obj()`` in ``utils.py``

# 6. Get the sixth screenshot
cropped = page.crop((0, 0, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")
Just realised that the dummy PDF I shared above is not a good replication example, because it is built from line objects while the original PDF uses rect objects.
After working more on this issue and trying out a bunch of things, I have come to understand that the results are the expected behaviour; I was confusing the behaviour of rect objects with that of line objects.
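One way to read that distinction, sketched under my own assumptions: a rect that is clipped is still a rect and so still contributes four edges, including a horizontal one on the crop line, whereas a clipped vertical line just gets shorter. The helpers rect_to_edges and clip_top below are simplified stand-ins written for this example, and the coordinates are invented.

```python
def rect_to_edges(rect):
    """Split a rect dict into four edge dicts (two horizontal, two vertical)."""
    x0, top, x1, bottom = rect["x0"], rect["top"], rect["x1"], rect["bottom"]
    return [
        {"orientation": "h", "x0": x0, "x1": x1, "top": top, "bottom": top},
        {"orientation": "h", "x0": x0, "x1": x1, "top": bottom, "bottom": bottom},
        {"orientation": "v", "x0": x0, "x1": x0, "top": top, "bottom": bottom},
        {"orientation": "v", "x0": x1, "x1": x1, "top": top, "bottom": bottom},
    ]


def clip_top(obj, crop_top):
    """Clip an object from above by moving its top down to crop_top if needed."""
    return {**obj, "top": max(obj["top"], crop_top)}


crop_top = 120
rect = {"x0": 50, "x1": 560, "top": 100, "bottom": 700}
vline = {"orientation": "v", "x0": 50, "x1": 50, "top": 100, "bottom": 700}

# Clipped rect: still four edges, with a horizontal edge on the crop line.
rect_edges = rect_to_edges(clip_top(rect, crop_top))
h_tops = sorted(e["top"] for e in rect_edges if e["orientation"] == "h")

# Clipped vertical line: one shorter vertical edge, no horizontal edge.
line_edges = [clip_top(vline, crop_top)]
h_from_line = [e for e in line_edges if e["orientation"] == "h"]

print(h_tops, h_from_line)
```

If this reading is right, it would explain why the dummy PDF built from line objects did not reproduce the extra horizontal lines at the crop boundary.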
Prologue: This may read like a story and has a lot of open-ended, possibly discussion-worthy questions.
v0.25.3
Table settings in use:
PDF
Output of .debug_tablefinder() on full page:
Location of the proposed check: pdfplumber/pdfplumber/utils.py, line 381 in 3c5041a.