Acrobat cannot display transformed PDFs with a decimal precision > 19 #1376

mrknwk · 2022-09-28T19:29:00Z

Explanation

Since PyPDF2 version 2.10.9, floats are represented using their intrinsic precision instead of reducing the precision to 5 decimal places.

Acrobat Reader seems to have a limitation in displaying PDFs with a decimal precision > 19. When you apply page.scale_by() to a PDF page using a non-integer value, Acrobat (22.002.20212) displays the transformed page as an empty square.

@programmarchy has already proposed a solution in #1267.

Environment

$ python -m platform
Windows-10-10.0.22621-SP0

$ python --version
Python 3.9.1

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

from PyPDF2 import PdfReader, PdfWriter

with open("input.pdf", "rb") as input:
    reader = PdfReader(input)
    writer = PdfWriter()
    page = reader.pages[0]
    page.scale_by(10/7)
    writer.add_page(page)
    with open("output.pdf", "wb") as output:
        writer.write(output)

input.pdf
output.pdf [intrinsic precision]

If you change the precision in pypdf.generic._base.FloatObject from

f"{self:f}".rstrip("0")

to

f"{self:.19f}".rstrip("0")

Acrobat displays the resulting PDF correctly, while .20f cannot be displayed anymore.
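To make the difference between the two format strings concrete, here is a minimal sketch using a plain Decimal rather than PyPDF2's FloatObject. It assumes the scale factor is computed as a Decimal with 28 significant digits (Python's default context precision); decimals_after_point is a hypothetical helper introduced only for this illustration.

```python
from decimal import Decimal, localcontext


def decimals_after_point(text: str) -> int:
    """Count the digits that follow the decimal point in a formatted number."""
    return len(text.partition(".")[2])


# Compute 10/7 with 28 significant digits (the default Decimal precision).
with localcontext() as ctx:
    ctx.prec = 28
    scale = Decimal(10) / Decimal(7)

# "Intrinsic" formatting keeps every stored digit.
intrinsic = f"{scale:f}".rstrip("0")
# Capped formatting rounds to at most 19 digits after the decimal point.
capped = f"{scale:.19f}".rstrip("0")

print(intrinsic, decimals_after_point(intrinsic))
print(capped, decimals_after_point(capped))
```

The first string carries 27 digits after the decimal point, the second at most 19, which is the limit Acrobat appears to tolerate.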

lutts · 2022-10-15T17:09:26Z

You can use decimal.getcontext().prec to change the default precision:

import decimal

decimal.getcontext().prec = 19

So I think not hardcoding the precision in PyPDF2 is the correct behavior. This is not a bug, but it should be noted in PyPDF2's documentation.

programmarchy · 2022-10-15T17:50:05Z

Although that changes the precision for operations like rounding numbers, it unfortunately has no effect on string formatting. So I propose a separate context to manage formatting settings for PyPDF.
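A small sketch of that distinction, using only the standard decimal module (the value is an arbitrary example, not taken from PyPDF2): the context precision is applied when an arithmetic operation produces a new number, but formatting an already stored value with "f" and no explicit precision leaves its digits untouched.

```python
from decimal import Decimal, localcontext

# An already stored value with 25 digits after the decimal point.
value = Decimal("1.4285714285714285714285714")

with localcontext() as ctx:
    ctx.prec = 19
    # Formatting does not round: all stored digits are written out.
    formatted = f"{value:f}"
    # Unary plus is an arithmetic operation, so the context rounds the result.
    recomputed = +value

print(formatted)
print(recomputed)
```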

mrknwk · 2022-10-15T20:00:12Z

@lutts sure, it's definitely not a bug, but Acrobat is the de facto standard viewer for Windows, so I guess it would be good to somehow make it transparent that transformations could cause Acrobat display problems.

ztravis · 2022-11-15T00:30:26Z

This definitely seems like a bug to me, in that I think everyone expects the output to be viewable in Acrobat (the de facto standard, as mentioned above). Would it be possible to default to a precision > 5 and < 20 (say, 19) and then provide for a configurable higher precision if desired?

chrysn · 2022-12-27T11:22:57Z

Note that this affects decimals written in various places. I've encountered this when setting boxes (there, I worked around it using page[NameObject('/%s'%b)] = ArrayObject(FloatObject(x.quantize(Decimal(10)**(-10))) for x in boxdimensions[b])), and in page sizes, which might serve as an even more practical (because standalone) example:

import PyPDF2
from decimal import Decimal
pdf = PyPDF2.PdfFileWriter()
pdf.addBlankPage(Decimal("10.0000000000000000000000000000001"), 5)
with open("test.pdf", "wb") as of:
    pdf.write(of)

It may also be worth mentioning (mainly for googlability) that when trying to do anything with that document in Acrobat, it shows "There was a problem reading this document (14)".

As to the characterization of the precision tolerated, I've conducted some experiments: It seems that Acrobat tolerates 19 digits after the decimal point. This is not what Decimal's prec does -- that sets the number of digits in the mantissa, and not the number of digits after the decimal point. In particular, merely setting the Decimal output precision to 19 would still produce Acrobat-broken PDF documents from operations like Decimal("0.001") / Decimal("7") (which at prec=19 has 22 digits after the decimal point).
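The division mentioned above can be checked with the standard decimal module alone. This sketch sets prec=19 in a local context and counts the digits after the decimal point in the fixed-point output:

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 19  # 19 significant digits, not 19 decimal places
    quotient = Decimal("0.001") / Decimal("7")

text = f"{quotient:f}"
digits_after_point = len(text.partition(".")[2])

print(text, digits_after_point)
```

The three leading zeros after the decimal point are not counted as significant digits, so the 19-digit mantissa ends up 22 places after the point.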

chrysn · 2022-12-27T11:49:05Z

To make the original tests I've done verifiable:

pdf.addBlankPage(Decimal("1000.0000000000000000001"), 5) works (4 digits before, 19 after the dot) -- this is a number which a Decimal context with prec=19 would not output.
pdf.addBlankPage(Decimal("1000.00000000000000000012"), 5) does not work (4 digits before, 20 after the dot)
pdf.addBlankPage(Decimal("1.0000000000000000001"), 5) works (1 digit before, 19 after the dot)
pdf.addBlankPage(Decimal("1.00000000000000000012"), 5) does not work (1 digit before, 20 after the dot)

(This is all consistent with my previous statement of "it's the digits after the dot, not the mantissa length").

However, I've done one more test:

pdf.addBlankPage(Decimal("0.01000000000000000012"), 5) also works (20 digits after the dot, thereof 1 leading zero).

This indicates some mixed scheme in which 19 places after the dot are tolerated, but small numbers do use something more float-like. It should be noted though that there appears to be a minimal page size (1.1mm), so maybe that test is not ideal.

Therefore, I'm going with a test closer to my original use case -- also containing a more comprehensive set of what works and what doesn't:

import PyPDF2
from PyPDF2.generic import NameObject as N, ArrayObject as A, FloatObject as F
from decimal import Decimal
pdf = PyPDF2.PdfFileWriter()
p = pdf.addBlankPage(30, 30)
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("10.0000000000000000001")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("1.0000000000000000001")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("1.00000000000000000001")))) # broken
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.10000000000000000001")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.100000000000000000012")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.10000000000000000001234")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.100000000000000000012347890123456")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.100000000000000000012347890123456789999999999999999999999999999999999999")))) # works
with open("test.pdf", "wb") as of:
    pdf.write(of)

I'd summarize this as "19 digits after the dot always work; if it's zero before the dot, there is no practical limit".
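That rule of thumb could be written as a predicate. This is only a sketch of the empirical observations in this comment, not documented Acrobat behavior; fraction_digits and looks_acrobat_safe are hypothetical names, and the check counts digits as stored in the Decimal (including trailing zeros) and assumes finite values.

```python
from decimal import Decimal

# Empirical limit observed in the experiments above (an assumption, not a spec).
MAX_FRACTION_DIGITS = 19


def fraction_digits(value: Decimal) -> int:
    """Number of digits a finite Decimal stores after the decimal point."""
    return max(0, -value.as_tuple().exponent)


def looks_acrobat_safe(value: Decimal) -> bool:
    """True if the value matches the pattern that worked in the tests above."""
    # Numbers with a zero integral part worked regardless of length;
    # everything else worked with at most 19 digits after the dot.
    return abs(value) < 1 or fraction_digits(value) <= MAX_FRACTION_DIGITS


print(looks_acrobat_safe(Decimal("1." + "0" * 18 + "1")))   # 19 digits after the dot
print(looks_acrobat_safe(Decimal("1." + "0" * 19 + "1")))   # 20 digits after the dot
print(looks_acrobat_safe(Decimal("0.1" + "0" * 18 + "1")))  # 20 digits, zero before the dot
```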

chrysn · 2022-12-27T12:05:41Z

I've done some digging in the specs (Adobe® Portable Document Format Version 1.7 is what I read):

Acrobat is documented to use single-precision floats (at least in "Type 4 functions"; Appendix H item 44). Makes one wonder why such a note is in a standard. It's relevant here because 19 digits are more like a conservative estimate of the decimal digits for a double precision number. (Which also means that at any rate, unless numerically highly unstable things are done, 19 digits are enough practically).
The standard places no limits on the precision; it just notes that implementations may have limited range or precision, and gives "5 digits" as the typical precision for reals (Table C.1). I found no limit on the number of digits that may be written in a document.
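To put the single-precision remark in perspective, the following sketch round-trips a Python float through a 32-bit IEEE 754 value using the standard struct module. It says nothing about how Acrobat stores numbers internally; it only shows how few decimal digits a single-precision float can hold. to_single_precision is a helper name made up for this example.

```python
import struct


def to_single_precision(value: float) -> float:
    """Round-trip a Python float (double) through a 32-bit IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


scale = 10 / 7
as_single = to_single_precision(scale)

# Print both with 19 decimals to see where the digits start to differ.
print(f"{scale:.19f}")
print(f"{as_single:.19f}")
```

The two values agree only in roughly the first seven significant digits, far fewer than the 19 decimal places discussed here.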

pubpub-zz · 2023-01-09T22:46:04Z

All,
we do not need lots of digits unless the number is very small, as the PDF standard only accepts real numbers in plain decimal notation, not in engineering (exponent) format.

I would propose to replace __repr__ in FloatObject with this code

    def __repr__(self) -> str:
        if self == self.to_integral():
            # If this is an integer, format it with no decimal place.
            return str(self.quantize(decimal.Decimal(1)))
        else:
            # Otherwise, format it with a decimal place, taking care to
            # remove any extraneous trailing zeros.
            if self>=0:
                nb=7
            elif self>=7:
                nb=10
            else: #<1e-7
                nb=8-int(decimal.getcontext().log10(self))
            #print("nb",self,nb)
            return f"{self:.{nb}f}".rstrip("0")

Your opinions?

chrysn · 2023-01-10T11:14:08Z

LGTM except for the `elif self>=7:` part being inconsistent (do we need that middle case at all? log10 isn't that costly). As this is not a PDF but an Acrobat detail, I'd recommend leaving a note to that effect (for the benefit of later devs who might be tempted to refactor here).

pubpub-zz · 2023-01-11T20:01:11Z

@chrysn,

Oops, my mistake: it should read elif self>=7: this should prevent too many calls to log10.

chrysn · 2023-01-12T14:12:09Z

It'd probably be if self >= 1e-7: nb = 15 (which is the longest the else branch would produce for that range of numbers) -- but log10 is really cheap: with d = Decimal("0.1"), %timeit d.log10() is 6 times faster (!) than %timeit d > 0.0001. So it's not even an optimization.

I've spotted two more small bugs (it should be self >= 1, and doesn't consider negatives), so maybe go for something simpler altogether?

    def __repr__(self):
        """Represent the number in decimal format with up to 19 decimal digits,
        or with the available precision when the number's integral part is
        zero.

        Reducing precision accommodates Adobe Acrobat (which fails to load files
        containing more precise numbers).

        >>> D("10")
        10
        >>> D("10.0000000000000000000001")
        10
        >>> D("10.0000000000100000000001")
        10.00000000001
        >>> D("0.0000000000100000000001")
        0.0000000000100000000001
        >>> D("100000000000000000000.0000000000100000000001")
        100000000000000000000.00000000001
        """
        if abs(self) >= 1:
            return f"{self:.19f}".rstrip("0").rstrip(".")
        else:
            return f"{self:f}"
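For anyone who wants to try the examples from the docstring without patching PyPDF2, here is a standalone adaptation of the same logic as a free function on decimal.Decimal. The name format_real is made up for this sketch; it is not part of PyPDF2.

```python
from decimal import Decimal


def format_real(value: Decimal) -> str:
    """Format with at most 19 decimals, or all digits if the integral part is zero."""
    if abs(value) >= 1:
        # Round to 19 decimals, then drop trailing zeros and a dangling dot.
        return f"{value:.19f}".rstrip("0").rstrip(".")
    # Zero before the dot: write all stored digits unchanged.
    return f"{value:f}"


for text in (
    "10",
    "10.0000000000000000000001",
    "10.0000000000100000000001",
    "0.0000000000100000000001",
    "-10.5",
):
    print(text, "->", format_real(Decimal(text)))
```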

pubpub-zz · 2023-01-13T21:05:35Z

> It'd probably be if self >= 1e-7: nb = 15 (which is the longest the else branch would produce for that range of numbers) -- but log10 is really cheap: with d = Decimal("0.1"), %timeit d.log10() is 6 times faster (!) than %timeit d > 0.0001. So it's not even an optimization.

This could be interesting.

About D("10.0000000000000000000001"): for me, getting 10 is not abnormal, as the number will be stored in most (maybe all) implementations as a double, where we have a maximum of 16 digits.
For me the issue is only for very small numbers (such as 1e-10) and very big ones, where we have to allow more digits.
I would propose this:

    def __repr__(self):
        nb = int(decimal.getcontext().log10(abs(self)))
        return f"{self:.{max(1,16-nb)}f}".rstrip("0").rstrip(".")
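A standalone adaptation of this idea, for experimenting outside PyPDF2. The function name format_significant is hypothetical, and the guard for zero is an addition of this sketch (log10 of zero is not a finite number, so int() would fail on it).

```python
from decimal import Decimal


def format_significant(value: Decimal) -> str:
    """Keep roughly 16 significant digits, written in plain decimal notation."""
    if value == 0:
        return "0"  # added guard: log10(0) is -Infinity and cannot be converted
    # int() truncates toward zero, so small numbers get extra decimals.
    magnitude = int(abs(value).log10())
    decimals = max(1, 16 - magnitude)
    return f"{value:.{decimals}f}".rstrip("0").rstrip(".")


for text in ("1234.5", "10.0000000000000000000001", "0.0000000000100000000001", "0"):
    print(text, "->", format_significant(Decimal(text)))
```

With this scheme the number of decimals grows as the value shrinks, so very small numbers keep their leading digits while values of ordinary size are cut to about what a double can represent.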

programmarchy · 2023-01-13T21:25:29Z

Is sprinkling in more magic numbers really an ideal solution to this problem? This feels too clever. My sense is that this will lead to a whack-a-mole situation that will never quite cover every edge case. It also makes the code more difficult to understand.

I propose merging #1499 but making the AcrobatContext the default context so it's the default behavior. Seems like that would address the case here, and in general. Then, if someone has a case like I had in #1267 they can opt to use a context that uses intrinsic precision. Contexts aren't a perfect solution, but at least they isolate the magic numbers somewhere they can be documented in detail.

pubpub-zz · 2023-01-14T08:41:27Z

16 is not a magic number: it corresponds to the number of decimal digits in the 52-bit mantissa of a double, which is the standard implementation nowadays.
Changing the decimal context is not such a good idea, as it may have an impact on the rest of the program.

Also, this should be compatible with a float implementation: it should allow us to move from class FloatObject(decimal.Decimal, PdfObject) to class FloatObject(float, PdfObject) and improve performance (cf. #68).
@MartinThoma: your opinion about it?
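The figure of 16 digits can be checked against the float parameters that Python exposes. This is a quick sanity check, assuming the platform uses IEEE 754 doubles (as CPython does on common platforms):

```python
import math
import sys

# Mantissa width in bits, counting the implicit leading bit (52 stored + 1).
mantissa_bits = sys.float_info.mant_dig
# Equivalent number of decimal digits.
decimal_digits = mantissa_bits * math.log10(2)

print(mantissa_bits, round(decimal_digits, 2), sys.float_info.dig)
```

sys.float_info.dig is the number of decimal digits guaranteed to survive a round trip through a double, which is slightly lower than the mantissa's nominal capacity.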

programmarchy · 2023-01-14T14:44:58Z

I was referring to the overall thread not specifically your previous comment @pubpub-zz

Also, #1499 does not alter the DecimalContext. It defines a new context that is specific to PyPDF, which would not impact other programs.

joshhendo mentioned this issue Dec 13, 2022

BUG: 1376 Acrobat Scale #1499

Closed

mrknwk mentioned this issue Jan 4, 2023

Cannot open pdf file with Adobe Acrobat Reader after transformation #1527

Closed

MartinThoma mentioned this issue Jan 29, 2023

BUG: Replace decimal by float #1563

Merged

MartinThoma closed this as completed in #1563 Feb 4, 2023

MartinThoma pushed a commit that referenced this issue Feb 4, 2023

BUG: Replace decimal by float (#1563)

6ec88ad

Decimal was replaced by float in order to fix bugs. It might also improve speed in some cases. It is a preparation for #1567 Fixes #1527 Fixes #1376

mrknwk mentioned this issue Feb 5, 2023

Output PDF has loss of data #1607

Closed
