to_image of grey text results in a fully white image #443

Closed
linuxsoftware opened this issue Jun 7, 2021 · 3 comments

@linuxsoftware

Thank you for this extremely useful library.

I had a problem with visual debugging of a PDF that was mostly grey. All the text turned white so it could not be seen.

Here is an example PDF.

The problem is that ImageMagick creates the image of the page as a 16-bit greyscale PNG, and Pillow has a documented issue with converting that to RGB. (See https://stackoverflow.com/questions/19892919/pil-converting-an-image-with-mode-i-to-rgb-results-in-a-fully-white-image and python-pillow/Pillow#3011)
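To illustrate why the page can come out white, here is a minimal sketch of the arithmetic, assuming (as the linked Stack Overflow question describes) that the conversion clips values above 255 rather than rescaling them. It is plain Python for illustration only, not Pillow's actual code, and the function names and sample values are made up for the example.

```python
# Illustrative arithmetic only; this is not Pillow's actual implementation.
MAX_8BIT = 255


def clip_to_8bit(sample: int) -> int:
    """Clamp a 16-bit sample into 0..255 without rescaling it."""
    return max(0, min(sample, MAX_8BIT))


def scale_to_8bit(sample: int) -> int:
    """Rescale a 16-bit sample (0..65535) to 0..255 by dropping the low byte."""
    return sample >> 8


# Hypothetical 16-bit grey levels: black, dark grey, mid grey, white.
samples_16bit = [0, 0x2020, 0x8080, 0xFFFF]

clipped = [clip_to_8bit(s) for s in samples_16bit]
scaled = [scale_to_8bit(s) for s in samples_16bit]

print(clipped)  # [0, 255, 255, 255]
print(scaled)   # [0, 32, 128, 255]
```

Under that assumption, every grey level other than pure black lands on 255, which matches the all-white page. Rescaling first, or having ImageMagick emit 8-bit data, keeps the greys distinct.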

My hack has been to change display.py so that ImageMagick creates the image as an 8-bit PNG using convert("png8"), which Pillow can then cope with. This "works for me".

--- a/pdfplumber/display.py
+++ b/pdfplumber/display.py
@@ -41,7 +41,7 @@ def get_page_image(stream, page_no, resolution):
         if img.alpha_channel:
             img.background_color = wand.image.Color("white")
             img.alpha_channel = "remove"
-        with img.convert("png") as png:
+        with img.convert("png8") as png:
             im = PIL.Image.open(BytesIO(png.make_blob()))
             return im.convert("RGB")

Environment

  • pdfplumber version: 0.5.28
  • ImageMagick version: 6.9.11.27
  • Wand version: 0.6.6
  • Pillow version: 8.2.0
  • Python version: 3.7.9
  • OS: Linux
@jsvine
Owner

jsvine commented Jun 8, 2021

Hi @linuxsoftware, and thanks for flagging this! Since the default seems to work well for most PDFs, I'd lean toward an approach that allows the user to specify the conversion mode via an argument passed to get_page_image(...) and Page.to_image(...). I'll put this on my todo list, though you're also welcome to submit a PR.

@linuxsoftware
Author

I was thinking about this and realized it is already possible to pass a user-created image to to_image via its original argument, so perhaps the code does not need to change at all.

e.g.

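from io import BytesIO

import PIL.Image
import wand.image

def my_page_image(page):
    # Assumes the PDF was opened from a file, so the stream has a .name
    stream = page.pdf.stream
    page_no = page.page_number - 1  # page index in the filename is 0-based
    with wand.image.Image(resolution=150,
                          filename=f"{stream.name}[{page_no}]") as img:
        # "png8" makes ImageMagick emit an 8-bit PNG, which Pillow can cope with
        with img.convert("png8") as png:
            im = PIL.Image.open(BytesIO(png.make_blob()))
            return im.convert("RGB")

pi = page.to_image(original=my_page_image(page))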

The main thing is for the user to be aware of Pillow's 8-bit limitation when converting images. Perhaps it is enough that this conversation will now show up in searches, or perhaps it's worth a note in the Visual Debugging documentation?

@jsvine
Owner

jsvine commented Jul 20, 2022

I believe the latest versions of pdfplumber, which include some more general changes, now convert your PDF to an acceptable image:

[Image: tmp-grey]

jsvine closed this as completed on Jul 20, 2022