Page cropbox is not used for bbox if present #1054

Closed
stefanw opened this issue Dec 7, 2023 · 3 comments

stefanw commented Dec 7, 2023

Describe the bug

When a PDF page contains a cropbox that differs from the mediabox, the positions of the extracted text are incorrect, and drawing them via page.to_image().draw_rects(...) does not work as expected.

Code to reproduce the problem

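import pdfplumber

# Open the sample PDF; repair=True asks pdfplumber to attempt a repair before parsing
pdf = pdfplumber.open("pdfplumber-cropbox.pdf", repair=True)
page = pdf.pages[0]

# Render the page and outline every extracted word
im = page.to_image()
im.draw_rects(page.extract_words())
im.save("pdfplumber-bad-cropbox.jpg")
im  # display the image inline (e.g. in a notebook)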

PDF file

pdfplumber-cropbox.pdf

Expected behavior

I would expect pdfplumber to take the cropbox as the bbox.

Actual behavior

Even when the cropbox is present it is not taken as the page's bbox.

Screenshots

[Image: pdfplumber-bad-cropbox]

Environment

  • pdfplumber version: 0.10.2
  • Python version: 3.11
  • OS: Mac

Additional context

The problem seems to be this line in the Page class:

self.mediabox = resolve_all(mediabox) or self.cropbox

The mediabox is a required page attribute, so self.cropbox will never be assigned as the mediabox.
The cropbox is optional, so my guess is that the code should be:

self.mediabox = self.cropbox or resolve_all(mediabox)

Indeed, this change improves the positioning significantly, although it's still not perfectly aligned.

[Image: pdfplumber-better-cropbox]
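
The short-circuit behaviour described above can be checked in isolation. This is a minimal sketch, not pdfplumber's code: `pick_current` and `pick_proposed` are hypothetical helpers that mirror the two expressions, and the box values are made up for illustration.

```python
# Hypothetical stand-ins for the two expressions discussed above.
# Boxes are plain [x0, y0, x1, y1] lists; None means "not present".

def pick_current(mediabox, cropbox):
    # Mirrors `resolve_all(mediabox) or self.cropbox`:
    # a present (truthy) mediabox always wins.
    return mediabox or cropbox


def pick_proposed(mediabox, cropbox):
    # Mirrors `self.cropbox or resolve_all(mediabox)`:
    # the cropbox wins when present, otherwise fall back to the mediabox.
    return cropbox or mediabox


mediabox = [0, 0, 612, 792]   # illustrative values only
cropbox = [50, 50, 500, 700]  # illustrative values only

print(pick_current(mediabox, cropbox))
print(pick_proposed(mediabox, cropbox))
print(pick_proposed(mediabox, None))
```

Because the mediabox is always present, the first expression never reaches its fallback, which is the point made in the report.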

stefanw added the bug label Dec 7, 2023

jsvine commented Dec 22, 2023

Hi @stefanw, and thank you for flagging this! I've now spent a couple of hours looking into it and wanted to send you this update. Although I think .mediabox is correctly assigned as the main .bbox (since it's the mediabox which most closely corresponds to the possible user space), there definitely is a bug here, and I think it may have multiple nuanced sources — more to do with the PageImage conversion, but also some related calculations. In any case, will post again here when I'm closer to a solid fix. Thanks again.

stefanw commented Dec 23, 2023

Hey @jsvine, thanks for looking into this. I think the cropbox is meant to represent the visible user space and is used in other implementations as the transformation base (e.g. pdfbox). The PDF standard says:

The crop box defines the region to which the contents of the page are to be clipped (cropped) when displayed or printed. Unlike the other boxes, the crop box has no defined meaning in terms of physical page geometry or intended use; it merely imposes clipping on the page contents. However, in the absence of additional information (such as imposition instructions specified in a JDF or PJTF job ticket), the crop box will determine how the page’s contents are to be positioned on the output medium. The default value is the page’s media box.
(emphasis mine)
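
The defaulting rule in the quoted passage can be sketched as a small function. `effective_cropbox` is a hypothetical helper, not part of pdfplumber or pdfbox. The fallback to the media box comes from the quote; the clipping of an oversized crop box to the media box is my reading of the same part of the specification and should be treated as an assumption. Boxes are assumed to be normalized (x0 < x1, y0 < y1).

```python
def effective_cropbox(mediabox, cropbox=None):
    """Return the box a viewer would display, as (x0, y0, x1, y1).

    Hypothetical helper for illustration only.
    """
    # No crop box present: the default value is the media box.
    if cropbox is None:
        return tuple(mediabox)

    mx0, my0, mx1, my1 = mediabox
    cx0, cy0, cx1, cy1 = cropbox

    # Assumption: a crop box extending past the media box is reduced
    # to its intersection with the media box.
    return (max(mx0, cx0), max(my0, cy0), min(mx1, cx1), min(my1, cy1))


print(effective_cropbox((0, 0, 612, 792)))
print(effective_cropbox((0, 0, 612, 792), (50, 50, 700, 700)))
```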

jsvine added a commit that referenced this issue Jan 6, 2024
... fixing various issues with PageImage. Also adds
force_mediabox parameter to Page.to_image(...).

Thanks to @stefanw for flagging:
    #1054

jsvine commented Jan 6, 2024

Hi @stefanw, this should now be fixed via 07d9997 (now available on develop), which standardizes the handling of .cropbox and .mediabox, particularly with regard to PageImage.

As a demonstration of the fixes, now the character and line bounding boxes are rendered as expected:

[Image: Screenshot]
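
One way to picture the kind of misalignment reported here: if an image is rendered from one box while object coordinates are measured from another, each rectangle has to be shifted by the rendered box's origin before it is scaled to pixels. The sketch below illustrates that idea only; it is not the code in 07d9997. `to_pixel_box` is a hypothetical helper, the word dictionary uses the `x0`/`top`/`x1`/`bottom` keys that pdfplumber objects carry, and the origin and word values are invented.

```python
def to_pixel_box(obj, origin_x0, origin_top, resolution=72):
    """Map an object's box into pixel coordinates of a rendered image.

    Hypothetical helper: `origin_x0` and `origin_top` are the left and
    top edges of the box the image was rendered from, in the same
    coordinate system as `obj`.
    """
    # PDF user space defaults to 72 units per inch.
    scale = resolution / 72
    return (
        (obj["x0"] - origin_x0) * scale,
        (obj["top"] - origin_top) * scale,
        (obj["x1"] - origin_x0) * scale,
        (obj["bottom"] - origin_top) * scale,
    )


word = {"x0": 100, "top": 120, "x1": 160, "bottom": 132}  # invented values

# Rendered from a box whose top-left corner sits at (50, 50):
print(to_pixel_box(word, origin_x0=50, origin_top=50))
# Ignoring the offset draws the rectangle 50 units too far right and down:
print(to_pixel_box(word, origin_x0=0, origin_top=0))
```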

Your interpretation of .cropbox is correct, but Page.bbox represents something slightly different in pdfplumber: The box in which the page's objects (visible or not) are expected to exist. This becomes relevant for Page.crop(...), which needs to track the area to which the user has filtered the page objects. For that, I believe we want to start with .bbox = .mediabox because users will (I believe) initially want programmatic access to all graphical objects regardless of whether they are placed within .cropbox.
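
The distinction above (all objects are available first, then the user narrows them down) can be sketched with plain dictionaries. `objects_within` is a hypothetical helper, not pdfplumber's implementation of `Page.crop(...)`; it assumes the box and the objects share one coordinate convention, `(x0, top, x1, bottom)`, and it keeps only objects that lie fully inside the box.

```python
def objects_within(objects, bbox):
    """Keep objects whose boxes lie entirely inside `bbox`.

    Hypothetical helper for illustration only.
    """
    x0, top, x1, bottom = bbox
    return [
        obj
        for obj in objects
        if obj["x0"] >= x0
        and obj["top"] >= top
        and obj["x1"] <= x1
        and obj["bottom"] <= bottom
    ]


# Invented objects: one inside the visible area, one outside it.
objects = [
    {"text": "visible", "x0": 100, "top": 120, "x1": 160, "bottom": 132},
    {"text": "hidden", "x0": 10, "top": 10, "x1": 40, "bottom": 22},
]

page_box = (0, 0, 612, 792)      # stands in for a mediabox-sized bbox
visible_box = (50, 50, 500, 700)  # stands in for a smaller cropbox

print([o["text"] for o in objects_within(objects, page_box)])
print([o["text"] for o in objects_within(objects, visible_box)])
```

Starting from the larger box keeps the "hidden" object reachable; narrowing to the smaller box afterwards is then an explicit choice by the user.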

Thanks again for opening this issue, which helped me to identify inconsistencies in how the various boxes were handled. Closing it for now, but feel free to continue the conversation, point out edge-cases I might have missed, et cetera.

jsvine closed this as completed Jan 6, 2024