Acrobat cannot display transformed PDFs with a decimal precision > 19 #1376

Closed
mrknwk opened this issue Sep 28, 2022 · 15 comments · Fixed by #1563

Comments

@mrknwk

mrknwk commented Sep 28, 2022

Explanation

Since PyPDF2 version 2.10.9, floats are represented using their intrinsic precision instead of reducing the precision to 5 decimal places.

Acrobat Reader seems to have a limitation in displaying PDFs with a decimal precision > 19. When you apply page.scale_by() to a PDF page using a non-integer value, Acrobat (22.002.20212) displays the transformed page as an empty square.

@programmarchy has already proposed a solution in #1267.

Environment

$ python -m platform
Windows-10-10.0.22621-SP0

$ python --version
Python 3.9.1

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

from PyPDF2 import PdfReader, PdfWriter

with open("input.pdf", "rb") as input:
    reader = PdfReader(input)
    writer = PdfWriter()
    page = reader.pages[0]
    page.scale_by(10/7)
    writer.add_page(page)
    with open("output.pdf", "wb") as output:
        writer.write(output)

input.pdf
output.pdf [intrinsic precision]

If you change the precision in PyPDF2.generic._base.FloatObject from

f"{self:f}".rstrip("0")

to

f"{self:.19f}".rstrip("0")

Acrobat displays the resulting PDF correctly, while .20f cannot be displayed anymore.
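
The effect of the two format strings can be checked with the standard library alone. The following is a minimal sketch, assuming a plain decimal.Decimal stands in for FloatObject; the helper fraction_digits is a hypothetical name introduced only for illustration.

```python
from decimal import Decimal


def fraction_digits(text: str) -> int:
    """Count the digits after the decimal point in a formatted number."""
    _, _, fraction = text.partition(".")
    return len(fraction)


# 10/7 computed under the default Decimal context (28 significant digits).
scale = Decimal(10) / Decimal(7)

intrinsic = f"{scale:f}".rstrip("0")   # intrinsic precision, as quoted above
capped = f"{scale:.19f}".rstrip("0")   # capped at 19 decimal places

print(intrinsic, fraction_digits(intrinsic))
print(capped, fraction_digits(capped))
```

With the default context the first string carries more than 19 digits after the decimal point, while the second never can.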

@lutts

lutts commented Oct 15, 2022

You can set decimal.getcontext().prec to change the default precision:

import decimal

decimal.getcontext().prec = 19

So I think not hardcoding a precision in PyPDF2 is the correct behavior. This is not a bug, but it should be noted in PyPDF2's documentation.

@programmarchy
Contributor

Although that changes the precision for operations like rounding numbers, it unfortunately has no effect on string formatting. So I propose a separate context to manage formatting settings for PyPDF.
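
This distinction can be illustrated with a short sketch that uses only the standard decimal module. The value below is an arbitrary example, not one taken from a real PDF.

```python
from decimal import Decimal, localcontext

# 26 digits after the decimal point.
value = Decimal("1.00000000000000000000000001")

with localcontext() as ctx:
    ctx.prec = 19
    # Arithmetic (here: unary plus) rounds to the context precision ...
    rounded = +value
    # ... but formatting an existing Decimal does not consult prec.
    formatted = f"{value:f}"

print(rounded)
print(formatted)
```

The formatted string keeps all 26 decimal places even though the context precision is 19.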

@mrknwk
Author

mrknwk commented Oct 15, 2022

@lutts sure, it's definitely not a bug, but Acrobat is the de facto standard viewer for Windows, so I guess it would be good to somehow make it transparent that transformations could cause Acrobat display problems.

@ztravis
Contributor

ztravis commented Nov 15, 2022

This definitely seems like a bug to me in that I think everyone expects the output to be viewable in acrobat (the de facto standard as mentioned above). Would it be possible to default to a precision > 5 and < 20 (say, 19) and then provide for a configurable higher precision if desired?

@chrysn

chrysn commented Dec 27, 2022

Note that this affects decimals displayed in various places. I've encountered this when setting boxes (there, I worked around it using page[NameObject('/%s'%b)] = ArrayObject(FloatObject(x.quantize(Decimal(10)**(-10))) for x in boxdimensions[b])), and in page sizes, which might serve as an even more practical (because standalone) example:

import PyPDF2
from decimal import Decimal
pdf = PyPDF2.PdfFileWriter()
pdf.addBlankPage(Decimal("10.0000000000000000000000000000001"), 5)
with open("test.pdf", "wb") as of:
    pdf.write(of)

It may also be worth mentioning (mainly for googlability) that when trying to do anything with that document in Acrobat, it shows "There was a problem reading this document (14)".


As to the characterization of the precision tolerated, I've conducted some experiments: It seems that Acrobat tolerates 19 digits after the decimal point. This is not what Decimal's prec does -- that sets the number of digits in the mantissa, and not the number of digits after the decimal point. In particular, merely setting the Decimal output precision to 19 would still produce Acrobat-broken PDF documents from operations like Decimal("0.001") / Decimal("7") (which at prec=19 has 22 digits after the decimal point).
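
The difference between significant digits and digits after the decimal point can be reproduced as follows. This is a sketch using only the standard library; quantize is shown as one possible way to bound the decimal places, along the lines of the workaround mentioned above.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 19  # 19 significant digits, not 19 decimal places
    quotient = Decimal("0.001") / Decimal("7")

# The leading zeros do not count towards prec, so the exponent is -22.
places = -quotient.as_tuple().exponent
print(quotient, places)

# quantize() fixes the number of digits after the decimal point instead.
bounded = quotient.quantize(Decimal("1e-19"))
bounded_places = -bounded.as_tuple().exponent
print(bounded, bounded_places)
```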

@chrysn

chrysn commented Dec 27, 2022

To make the original tests I've done verifiable:

  • pdf.addBlankPage(Decimal("1000.0000000000000000001"), 5) works (4 digits before, 19 after the dot) -- this is a number which a Decimal context with prec=19 would not output.
  • pdf.addBlankPage(Decimal("1000.00000000000000000012"), 5) does not work (4 digits before, 20 after the dot)
  • pdf.addBlankPage(Decimal("1.0000000000000000001"), 5) works (1 digit before, 19 after the dot)
  • pdf.addBlankPage(Decimal("1.00000000000000000012"), 5) does not work (1 digit before, 20 after the dot)

(This is all consistent with my previous statement of "it's the digits after the dot, not the mantissa length").

However, I've done one more test:

  • pdf.addBlankPage(Decimal("0.01000000000000000012"), 5) also works (20 digits after the dot, thereof 1 leading zero).

This indicates some mixed scheme in which 19 places after the dot are tolerated, but small numbers do use something more float-like. It should be noted though that there appears to be a minimal page size (1.1mm), so maybe that test is not ideal.

Therefore, I'm going with a test closer to my original use case -- also containing a more comprehensive set of what works and what doesn't:

import PyPDF2
from PyPDF2.generic import NameObject as N, ArrayObject as A, FloatObject as F
from decimal import Decimal
pdf = PyPDF2.PdfFileWriter()
p = pdf.addBlankPage(30, 30)
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("10.0000000000000000001")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("1.0000000000000000001")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("1.00000000000000000001")))) # broken
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.10000000000000000001")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.100000000000000000012")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.10000000000000000001234")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.100000000000000000012347890123456")))) # works
p[N('/ArtBox')] = A((F(0), F(0), F(1), F(Decimal("0.100000000000000000012347890123456789999999999999999999999999999999999999")))) # works
with open("test.pdf", "wb") as of:
    pdf.write(of)

I'd summarize this as "19 digits after the dot always work; if it's zero before the dot, there is no practical limit".
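
That rule of thumb could be written down as a small predicate. This is only a sketch of the empirical observation in this thread, not documented Acrobat behavior: the names MAX_FRACTION_DIGITS, fraction_digits and probably_loads_in_acrobat are hypothetical, and treating negative numbers by their absolute value is an assumption that was not tested above.

```python
from decimal import Decimal

# Empirical limit reported in this thread; not a documented Acrobat constant.
MAX_FRACTION_DIGITS = 19


def fraction_digits(value: Decimal) -> int:
    """Number of digits after the decimal point of a finite Decimal."""
    return max(0, -value.as_tuple().exponent)


def probably_loads_in_acrobat(value: Decimal) -> bool:
    """Hypothetical check encoding the rule of thumb above."""
    if abs(value) < 1:
        # Zero before the dot: no practical limit was observed.
        return True
    return fraction_digits(value) <= MAX_FRACTION_DIGITS


nineteen = Decimal("1." + "0" * 18 + "1")   # 19 digits after the dot
twenty = Decimal("1." + "0" * 19 + "1")     # 20 digits after the dot
small = Decimal("0.1" + "0" * 18 + "1234")  # below 1, many digits

for candidate in (nineteen, twenty, small):
    print(candidate, probably_loads_in_acrobat(candidate))
```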

@chrysn

chrysn commented Dec 27, 2022

I've done some digging in the specs (Adobe® Portable Document Format Version 1.7 is what I read):

  • Acrobat is documented to use single-precision floats (at least in "Type 4 functions"; Appendix H item 44). Makes one wonder why such a note is in a standard. It's relevant here because 19 digits are more like a conservative estimate of the decimal digits for a double precision number. (Which also means that at any rate, unless numerically highly unstable things are done, 19 digits are enough practically).
  • The standard places no limits on the precision; it just notes that implementations may have limited range or precision, and gives "5 digits" as the typical precision for reals (Table C.1). I found no limit on the number of digits in a document.
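
To give a sense of scale for the single-precision note, the sketch below round-trips a Python float (double precision) through a 4-byte IEEE 754 value using the standard struct module. It only illustrates how few digits survive in single precision; it makes no claim about how Acrobat actually parses numbers.

```python
import struct

value = 10 / 7  # Python floats are double precision

# Round-trip through a 4-byte IEEE 754 single-precision float.
single = struct.unpack("<f", struct.pack("<f", value))[0]

print(f"{value:.19f}")
print(f"{single:.19f}")
print(abs(value - single))
```

The two printed values agree only in their first few decimal places, far fewer than 19.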

@pubpub-zz
Collaborator

pubpub-zz commented Jan 9, 2023

All,
we do not need lots of digits unless the number is very small, as the PDF standard only accepts plain decimal numbers and no engineering (exponent) notation.

I would propose to replace __repr__ in FloatObject with this code

    def __repr__(self) -> str:
        if self == self.to_integral():
            # If this is an integer, format it with no decimal place.
            return str(self.quantize(decimal.Decimal(1)))
        else:
            # Otherwise, format it with a decimal place, taking care to
            # remove any extraneous trailing zeros.
            if self>=0:
                nb=7
            elif self>=7:
                nb=10
            else: #<1e-7
                nb=8-int(decimal.getcontext().log10(self))
            #print("nb",self,nb)
            return f"{self:.{nb}f}".rstrip("0")

Your opinions?

@chrysn
Copy link

chrysn commented Jan 10, 2023 via email

@pubpub-zz
Collaborator

@chrysn,

Oops, mistake: it should read elif self>=1e-7: and this should prevent too many calls to log10.

@chrysn

chrysn commented Jan 12, 2023

It'd probably be if self >= 1e-7: nb = 15 (which is the longest the else branch would produce for that range of numbers) -- but log10 is really cheap: with d = Decimal("0.1"), %timeit d.log10() is 6 times faster (!) than %timeit d > 0.0001. So it's not even an optimization.

I've spotted two more small bugs (it should be self >= 1, and doesn't consider negatives), so maybe go for something simpler altogether?

    def __repr__(self):
        """Represent the number in decimal format with up to 19 decimal digits,
        or with the available precision when the number's integral part is
        zero.

        Reducing precision accommodates Adobe Acrobat (which fails to load files
        containing more precise numbers).

        >>> D("10")
        10
        >>> D("10.0000000000000000000001")
        10
        >>> D("10.0000000000100000000001")
        10.00000000001
        >>> D("0.0000000000100000000001")
        0.0000000000100000000001
        >>> D("100000000000000000000.0000000000100000000001")
        100000000000000000000.00000000001
        """
        if abs(self) >= 1:
            return f"{self:.19f}".rstrip("0").rstrip(".")
        else:
            return f"{self:f}"
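
The proposal above can be tried without PyPDF2 by lifting its body into a free function. This is a sketch only: format_pdf_number is a hypothetical name, and the negative input is an extra case that does not appear in the docstring.

```python
from decimal import Decimal


def format_pdf_number(value: Decimal) -> str:
    """Format value like the proposed __repr__ (hypothetical free function)."""
    if abs(value) >= 1:
        # Cap at 19 decimal places, then drop trailing zeros and a bare dot.
        return f"{value:.19f}".rstrip("0").rstrip(".")
    # Below 1 in magnitude, keep the full intrinsic precision.
    return f"{value:f}"


for text in ("10", "10.0000000000000000000001", "10.0000000000100000000001",
             "0.0000000000100000000001", "-2.50"):
    print(text, "->", format_pdf_number(Decimal(text)))
```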

@pubpub-zz
Collaborator

> It'd probably be if self >= 1e-7: nb = 15 (which is the longest the else branch would produce for that range of numbers) -- but log10 is really cheap: with d = Decimal("0.1"), %timeit d.log10() is 6 times faster (!) than %timeit d > 0.0001. So it's not even an optimization.

This could be interesting.

About D("10.0000000000000000000001"): for me, getting 10 is not abnormal, as the number will be stored in most (maybe all) cases as a double, which gives a maximum of 16 digits.
For me the issue only concerns very small numbers (such as 1e-10) and very big ones, where we have to allow more digits.
I would propose this:

    def __repr__(self):
        nb = int(decimal.getcontext().log10(abs(self)))
        return f"{self:.{max(1,16-nb)}f}".rstrip("0").rstrip(".")
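
One caveat with the snippet above: the log10 of zero is -Infinity, which int() cannot convert, so a zero value seems to need its own branch. A minimal standalone sketch of the same significant-digits idea with that guard follows; format_significant and its significant parameter are hypothetical names, and the default of 16 is taken from the proposal.

```python
from decimal import Decimal


def format_significant(value: Decimal, significant: int = 16) -> str:
    """Keep roughly `significant` digits, written without an exponent."""
    if value == 0:
        # log10(0) is -Infinity, so zero is handled separately.
        return "0"
    magnitude = int(abs(value).log10())
    decimals = max(1, significant - magnitude)
    return f"{value:.{decimals}f}".rstrip("0").rstrip(".")


for text in ("0", "1.5", "10.0000000000000000000001", "1e-10"):
    print(text, "->", format_significant(Decimal(text)))
```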

@programmarchy
Contributor

programmarchy commented Jan 13, 2023

Is sprinkling in more magic numbers really an ideal solution to this problem? This feels too clever. My sense is that this will lead to a whack-a-mole situation that will never quite cover every edge case. It also makes the code more difficult to understand.

I propose merging #1499 but making the AcrobatContext the default context so it's the default behavior. Seems like that would address the case here, and in general. Then, if someone has a case like I had in #1267 they can opt to use a context that uses intrinsic precision. Contexts aren't a perfect solution, but at least they isolate the magic numbers somewhere they can be documented in detail.

@pubpub-zz
Collaborator

16 is not a magic number: it corresponds to the number of decimal digits of the 52-bit mantissa of a double, which is the standard implementation nowadays.
Changing the decimal context is not such a good idea, as it may have an impact on the rest of the program.

Also, this should be compatible with a float implementation: it should allow us to move from class FloatObject(decimal.Decimal, PdfObject) to class FloatObject(float, PdfObject) and improve performance (cf. #68).
@MartinThoma: your opinion about it?

@programmarchy
Contributor

I was referring to the overall thread not specifically your previous comment @pubpub-zz

Also, #1499 does not alter the DecimalContext. It defines a new context that is specific to PyPDF, which would not impact other programs.

MartinThoma pushed a commit that referenced this issue Feb 4, 2023
Decimal was replaced by float in order to fix bugs.

It might also improve speed in some cases.
It is a preparation for #1567

Fixes #1527
Fixes #1376