Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results when cropping an already cropped page #245

Closed
samkit-jain opened this issue Aug 7, 2020 · 5 comments
Closed

Inconsistent results when cropping an already cropped page #245

samkit-jain opened this issue Aug 7, 2020 · 5 comments
Assignees
Labels

Comments

@samkit-jain
Copy link
Collaborator

Describe the bug

When cropping an already cropped page, the objects are not preserved.

Code to reproduce the problem

import pdfplumber

# Make sure the file is downloaded at file.pdf
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]

# Crop and save the top page and keep only the bottom 20%.
bottom = page.crop((0, 0.8 * float(page.height), page.width, page.height))
im = bottom.to_image(resolution=150)
im.save("bottom.png", format="PNG")

# Now, crop and save the left half of the cropped page.
bottom_left = bottom.crop((0, 0, 0.5 * float(bottom.width), bottom.height))
im = bottom_left.to_image(resolution=150)
im.save("bottom_left.png", format="PNG")

# Now, crop and save the right half of the cropped page.
bottom_right = bottom.crop((0.5 * float(bottom.width), 0, bottom.width, bottom.height))
im = bottom_right.to_image(resolution=150)
im.save("bottom_right.png", format="PNG")

PDF file

examples/pdfs/ag-energy-round-up-2017-02-24.pdf

Expected behavior

  • bottom.png - The bottom portion of the page is saved.
  • bottom_left.png - The left half of bottom portion of the page is saved.
  • bottom_right.png - The right half of the bottom portion of the page is saved.

Actual behavior

  • bottom.png - The bottom portion of the page is saved.
  • bottom_left.png - The left half of the top portion of the page is saved.
  • bottom_right.png - The right half of the top portion of the page is saved.

Screenshots

bottom.png
bottom

bottom_left.png
bottom_left

bottom_right.png
bottom_right

Environment

  • pdfplumber version: 0.5.22
  • Python version: 3.8.2
  • OS: Ubuntu 18.04 LTS

Additional context

The issue was found when working on #244

@jsvine
Copy link
Owner

jsvine commented Aug 7, 2020

Thanks for flagging this, @samkit-jain! I think this is what's happening:

  • pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

  • The second and third crop commands in the example assume (understandably) that the coordinates have been adjusted after the initial crop.

I can see a handful of potential solutions:

  • (a) Do nothing, but communicate better to users that .crop's bbox should be in terms of the original PDF, not the crop.

  • (b) Add a parameter to .crop, such as relative = True, that would let users indicate that they're providing relative coordinates, not absolute ones.

  • (c) Change the default, so that .crop assumes a relative-position bbox, but provide a parameter (e.g., relative = False) that reverts to the original approach.

  • (d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

Thoughts?

@samkit-jain
Copy link
Collaborator Author

samkit-jain commented Aug 8, 2020

pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

What is the reasoning behind this?

(d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

If by this you mean that the cropped page would be treated as a "real" page and all the operations like extract_text, extract_words, etc can be run on it as if it was the parent page, then yes, I'd prefer option D.

@jsvine
Copy link
Owner

jsvine commented Aug 9, 2020

pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

What is the reasoning behind this?

Great question; I should have elaborated earlier. The answer: Mostly for simplicity. A page's bounding box should, I think, exist in the same coordinate system as each object on the page. So changing the coordinates of the bounding box would mean changing the coordinates of all objects in the resulting cropped page. To me, ensuring that all coordinates relevant to all objects were moved post-crop seemed a tricky task.

To take an oversimplified example: Let's say a page's original bounding box was 0, 0, 20, 20, and we have a single point 10, 10. If we crop the page to 10, 10, 20, 20 ...

  • Currently: the cropped page's bounding box would be 10, 10, 20, 20 and the point would remain at 10, 10

  • If we readjusted all coordinates post-crop: the cropped page's new bounding box would be 0, 0, 10, 10 and the point would need to be moved to 0, 0.

The more I think about this, I think I prefer the following combination of improvements:

  • Keeping the current cropping system as-is — i.e., it does not alter the coordinate system

  • Throwing an error (or warning) if a user tries to crop part of a cropped page (or any page) that is not fully within the page's bounding box

  • Adding a relative parameter to both .crop and .within_bbox that allows users to pass a relative bounding box instead of using the absolute-coordinate system. E.g., to crop the example I mentioned above, to get the bottom half of the cropped page, they could call either cropped_page.crop((10, 15, 20, 20)) or cropped_page.crop((0, 5, 10, 10), relative = True) — the two would be equivalent.

@samkit-jain
Copy link
Collaborator Author

Thanks for the explanation. +1 for the warning. Also, adding a new relative parameter would also be the most backwards compatible.

jsvine added a commit that referenced this issue Aug 14, 2020
Addresses #245

Also adds `relative` param to .within_bbox, and adds a new
utils.calculate_area(bbox) method.
jsvine added a commit that referenced this issue Aug 15, 2020
jsvine added a commit that referenced this issue Aug 15, 2020
@jsvine
Copy link
Owner

jsvine commented Aug 15, 2020

Closing now, as resolved in two commits referenced above and available in v0.5.23. Thanks again for raising the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants