to_image of grey text results in a fully white image #443

linuxsoftware · 2021-06-07T22:34:23Z

Thank you for this extremely useful library.

I had a problem with visual debugging of a PDF that was mostly grey. All the text turned white so it could not be seen.

The problem is that ImageMagick creates the image of the page as a 16-bit greyscale PNG, but Pillow has a documented issue with converting that to RGB. (See https://stackoverflow.com/questions/19892919/pil-converting-an-image-with-mode-i-to-rgb-results-in-a-fully-white-image and python-pillow/Pillow#3011.)

My hack has been to change display.py so that ImageMagick creates the image as an 8-bit PNG using convert("png8"), which Pillow can then cope with. This "works for me".

--- a/pdfplumber/display.py
+++ b/pdfplumber/display.py
@@ -41,7 +41,7 @@ def get_page_image(stream, page_no, resolution):
         if img.alpha_channel:
             img.background_color = wand.image.Color("white")
             img.alpha_channel = "remove"
-        with img.convert("png") as png:
+        with img.convert("png8") as png:
             im = PIL.Image.open(BytesIO(png.make_blob()))
             return im.convert("RGB")
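
For readers who want to see the clipping without ImageMagick or a PDF, here is a minimal sketch that uses Pillow alone. It assumes the 16-bit PNG ends up in Pillow's integer mode "I", as in the linked Stack Overflow question; other Pillow versions may load such files in a different mode. The helper names naive_rgb and rescaled_rgb are hypothetical and not part of pdfplumber or Pillow.

```python
from PIL import Image


def naive_rgb(im):
    """Convert straight to RGB; mode "I" values above 255 are clipped."""
    return im.convert("RGB")


def rescaled_rgb(im):
    """Scale 16-bit sample values into the 8-bit range before converting."""
    if im.mode == "I":
        # point() on mode "I" accepts only a linear scale/offset function.
        im = im.point(lambda v: v * (1 / 256)).convert("L")
    return im.convert("RGB")


# Mid-grey as a 16-bit sample (32768 of 65535), held in integer mode "I".
grey16 = Image.new("I", (2, 2), 32768)

white = naive_rgb(grey16).getpixel((0, 0))    # clipped result
grey = rescaled_rgb(grey16).getpixel((0, 0))  # rescaled result
print(white, grey)
```

Rescaling on the Pillow side is an alternative to the png8 change in the diff above. Which of the two is preferable probably depends on whether the ImageMagick call can be changed.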

Environment

pdfplumber version: 0.5.28
ImageMagick version: 6.9.11.27
Wand version: 0.6.6
Pillow version: 8.2.0
Python version: 3.7.9
OS: Linux


jsvine · 2021-06-08T21:58:32Z

Hi @linuxsoftware, and thanks for flagging this! Since the default seems to work well for most PDFs, I'd lean toward an approach that allows the user to specify the conversion mode via an argument passed to get_page_image(...) and Page.to_image(...). I'll put this on my todo list, though you're also welcome to submit a PR.

linuxsoftware · 2021-06-08T23:10:14Z

I was thinking about this and realized it is already possible to pass a user-created original image in to to_image so perhaps the code does not need to change at all.

e.g.

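from io import BytesIO

import PIL.Image
import wand.image

def my_page_image(page):
    # Assumes the PDF was opened from a file on disk, so stream.name is a path.
    stream = page.pdf.stream
    page_no = page.page_number - 1  # ImageMagick page indices are 0-based
    with wand.image.Image(resolution=150,
                          filename=f"{stream.name}[{page_no}]") as img:
        # Render as an 8-bit PNG so that Pillow's RGB conversion keeps the greys.
        with img.convert("png8") as png:
            im = PIL.Image.open(BytesIO(png.make_blob()))
            return im.convert("RGB")

pi = page.to_image(original=my_page_image(page))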

The main thing is for the user to be aware of Pillow's 8-bit limitation when converting images. Perhaps it is enough that this conversation will now show up in searches, or perhaps it's worth a note in the Visual Debugging documentation?

jsvine · 2022-07-20T22:43:07Z

I believe that the latest version(s) of pdfplumber, which make some more generalized improvements/changes, now convert your PDF to an acceptable image.

linuxsoftware added the bug label Jun 7, 2021

jsvine added the enhancement label Jun 8, 2021

jsvine closed this as completed Jul 20, 2022
