[DISCUSSION] Handling out-of-page rect objects #267

Closed
samkit-jain opened this issue Sep 3, 2020 · 2 comments

Comments

@samkit-jain
Collaborator

samkit-jain commented Sep 3, 2020

Prologue: This may read like a story and has a lot of open-ended, possibly discussion-worthy questions.


v0.25.3
Table settings in use:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

PDF


Output of .debug_tablefinder() on full page:
image

One thing I noticed was that the outer rectangle was missing connection dots at the bottom, but I ignored it since the table was captured correctly.

If I crop the page from the top with page.crop((0, 120, page.width, page.height)) and then run .debug_tablefinder(), the output is:
image

This time the output is a bit different because the content outside the table is also captured.

How did the left column get selected in the table extraction after cropping the page, even though it is well outside the intersection tolerance?

While trying to find the cause, I noticed that this time the outer rectangle had connection dots at both the top and the bottom. The behaviour persisted no matter how much of the top or bottom I cropped: the top and bottom red lines in the output implied the presence of a rect object along each edge. The top red line shouldn't be there, because the crop cut off the top portion of the outer rectangle. And how come the red line at the bottom edge appears only after cropping? I did some debugging and found that certain objects had negative coordinate values. Here's a screengrab of the PyCharm debugger (notice the negative value in "y0"):
image

The negative "y0" value meant that the bottom red line should appear even with a zero crop ((0, 0, page.width, page.height)), which the following .debug_tablefinder() output confirms:
image
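
The link between a negative "y0" and the bottom edge can be sketched with a little arithmetic. This assumes the convention that "top" and "bottom" are derived from the PDF's bottom-left-origin coordinates as top = page_height - y1 and bottom = page_height - y0. That formula, the helper name to_top_bottom, and the numbers are assumptions for illustration, not values read from the PDF in this issue.

```python
def to_top_bottom(y0, y1, page_height):
    """Convert PDF-space y0/y1 (origin at the bottom-left) into top/bottom
    distances measured from the top of the page (assumed convention)."""
    top = page_height - y1
    bottom = page_height - y0
    return top, bottom


# Hypothetical numbers, not taken from the PDF in this issue.
page_height = 792
top, bottom = to_top_bottom(y0=-5, y1=700, page_height=page_height)

print(top, bottom)           # 92 797
print(bottom > page_height)  # True: the rect extends past the bottom edge
```

Under that assumption, a rect with a negative "y0" reaches below the page, so clipping it to the page box would leave an edge sitting exactly on the bottom border.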

Result of drawing the rect objects that have negative values:
image

The negative values are not caused by a bug in pdfplumber or pdfminer.six; they come from the PDF itself. To verify, I ran pdftk input.pdf output uncompressed.pdf uncompress, opened uncompressed.pdf in a text editor, and found 8 negative coordinate values in it. Not sure what purpose they serve ¯\_(ツ)_/¯
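
A quick way to list the affected objects, as a rough sketch: the helper names below are made up for illustration, and the sample dicts are hypothetical stand-ins for the entries of page.objects["rect"].

```python
COORD_KEYS = ("x0", "y0", "x1", "y1")


def has_negative_coord(obj, keys=COORD_KEYS):
    """Return True if any of the object's PDF-space coordinates is below zero."""
    return any(obj[key] < 0 for key in keys)


def split_by_negative_coord(objs):
    """Split objects into (in_range, negative) lists, preserving order."""
    in_range = [obj for obj in objs if not has_negative_coord(obj)]
    negative = [obj for obj in objs if has_negative_coord(obj)]
    return in_range, negative


# Hypothetical rects; real ones would come from page.objects["rect"].
rects = [
    {"x0": 50, "y0": 100, "x1": 550, "y1": 700},
    {"x0": 50, "y0": -5, "x1": 550, "y1": 700},
    {"x0": -3, "y0": 10, "x1": 20, "y1": 30},
]

in_range, negative = split_by_negative_coord(rects)
print(len(in_range), len(negative))  # 1 2
```

With a real page, passing page.objects["rect"] instead of the sample list should give the same kind of split.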

Should we treat negative coordinate values differently? Adding

if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"]):
    return None

at the top of

def clip_obj(obj, bbox):

drops them, but the check only takes effect when the page is cropped:

print(len(page.objects["rect"]))
# 230
orig = page.objects["rect"]
page = page.crop((0, 0, page.width, page.height))
print(len(page.objects["rect"]))
# 222

.debug_tablefinder() output:
image

Or, if the rects are to be kept because they hold information that the library should not remove, changes to get_bbox_overlap() or clip_obj() might be required. I also created a dummy PDF with a similar layout; in that one, if I crop the page and run the table finder, the left column is not picked up (as expected).
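
To make the two options concrete, here is a simplified stand-in for a clipping helper. It is not pdfplumber's actual clip_obj(): the function name clip_to_bbox, the drop_negative flag, and the sample rect are all hypothetical, and the sketch only adjusts x0/top/x1/bottom while leaving y0/y1 untouched.

```python
NEGATIVE_KEYS = ("x0", "y0", "x1", "y1")


def clip_to_bbox(obj, bbox, drop_negative=False):
    """Clip an object dict to bbox = (x0, top, x1, bottom).

    Returns a clipped copy, or None when the object is dropped.
    Hypothetical sketch; not the library's implementation.
    """
    if drop_negative and any(obj[key] < 0 for key in NEGATIVE_KEYS):
        return None  # option 1: discard objects with negative coordinates

    x0, top, x1, bottom = bbox
    new_x0 = max(obj["x0"], x0)
    new_top = max(obj["top"], top)
    new_x1 = min(obj["x1"], x1)
    new_bottom = min(obj["bottom"], bottom)
    if new_x0 >= new_x1 or new_top >= new_bottom:
        return None  # no overlap with the crop box

    # Option 2: keep the object, trimmed to the crop box.
    clipped = dict(obj)
    clipped.update(x0=new_x0, top=new_top, x1=new_x1, bottom=new_bottom)
    return clipped


# Hypothetical rect that extends 5 units below a 612 x 792 page.
rect = {"x0": 50, "x1": 550, "y0": -5, "y1": 700, "top": 92, "bottom": 797}
page_bbox = (0, 0, 612, 792)

kept = clip_to_bbox(rect, page_bbox)
dropped = clip_to_bbox(rect, page_bbox, drop_negative=True)

print(kept["bottom"])  # 792: the clipped edge lands on the page border
print(dropped)         # None
```

In this sketch, keeping the rect moves its bottom edge onto the page border, which would be consistent with a red line appearing along the bottom edge after a crop.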

Uncropped:
image

Cropped:
image


Code

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

# 1. Get the first screenshot
im = page.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 2. Get the second screenshot
cropped = page.crop((0, 120, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 3. Third screenshot from PyCharm

# 4. Get the fourth screenshot
cropped = page.crop((0, 0, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 5. Get the fifth screenshot: draw only the rects with a negative coordinate
orig = page.objects["rect"]
new = [
    obj for obj in orig
    if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"])
]

im = page.to_image(resolution=150)
im.draw_rects(new)
im.save("out.png", format="PNG")

# Now add:
# if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"]):
#     return None
# under ``clip_obj()`` in ``utils.py``

# 6. Get the sixth screenshot
cropped = page.crop((0, 0, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")
@samkit-jain
Collaborator Author

Just realised that the dummy PDF I shared above is not a good reproduction example, because it is built from line objects while the original PDF uses rect objects.

@samkit-jain
Collaborator Author

After working more on this issue and trying out a bunch of things, I have come to understand that the results are the expected behaviour; I was confusing the behaviour of rect objects with that of line objects.
