Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Format floats using their intrinsic decimal precision #1267

Merged
merged 3 commits into from Sep 18, 2022

Conversation

programmarchy
Copy link
Contributor

Since FloatObject is represented as a decimal, format numbers using their intrinsic precision, instead of reducing the precision to 5 decimal places.

This fixes rendering issues for PDFs that contain coordinates, transformations, etc. with real numbers containing more than 5 decimal places of precision. For example, PDFs exported from Microsoft PowerPoint contain numbers with up to 11 decimal places.

Fixes: #1266

@programmarchy programmarchy force-pushed the decimal-precision branch 2 times, most recently from e73cb59 to d7d447c Compare August 25, 2022 03:02
@codecov
Copy link

codecov bot commented Aug 25, 2022

Codecov Report

Base: 94.63% // Head: 94.63% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (6b7f7ef) compared to base (3cf80bf).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1267   +/-   ##
=======================================
  Coverage   94.63%   94.63%           
=======================================
  Files          30       30           
  Lines        5140     5141    +1     
  Branches     1058     1058           
=======================================
+ Hits         4864     4865    +1     
  Misses        164      164           
  Partials      112      112           
Impacted Files Coverage Δ
PyPDF2/generic/_base.py 100.00% <ø> (ø)
PyPDF2/_writer.py 91.06% <100.00%> (+0.01%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@programmarchy programmarchy changed the title Format floats using the intrinsic decimal precision Format floats using their intrinsic decimal precision Aug 25, 2022
tests/test_writer.py Outdated Show resolved Hide resolved
PyPDF2/generic/_base.py Outdated Show resolved Hide resolved
@programmarchy programmarchy marked this pull request as draft August 29, 2022 14:00
@programmarchy programmarchy marked this pull request as ready for review August 29, 2022 15:46
@programmarchy programmarchy force-pushed the decimal-precision branch 6 times, most recently from 5611919 to a71d15b Compare August 31, 2022 13:56
…ng to 5 decimal places

Explicitly format floats in outline color test so they can be compared
rather than adding a precision property to FloatObject
@programmarchy
Copy link
Contributor Author

Rebased this PR so tests are passing, and believe all the changes requested in the last review have been addressed.

Could you take another look, @MasterOdin?

Copy link
Member

@MartinThoma MartinThoma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good from my side - thank you for writing a unit test 🤗

@MartinThoma
Copy link
Member

@MasterOdin You were a lot more involved in this PR than I was. What do you think?

@MasterOdin
Copy link
Member

Looks good. Thanks for the work here @programmarchy! 👍

@MartinThoma MartinThoma changed the title Format floats using their intrinsic decimal precision BUG: Format floats using their intrinsic decimal precision Sep 18, 2022
@MartinThoma MartinThoma merged commit 5aeb926 into py-pdf:main Sep 18, 2022
@MartinThoma
Copy link
Member

Thank you both for all the work you put in it 🙏

I've just merged the PR and I will release it today to PyPI :-)

@MartinThoma
Copy link
Member

@programmarchy If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Sep 18, 2022
MartinThoma added a commit that referenced this pull request Sep 18, 2022
New Features (ENH):
-  Add rotation property and transfer_rotate_to_content (#1348)

Performance Improvements (PI):
-  Avoid string concatenation with large embedded base64-encoded images (#1350)

Bug Fixes (BUG):
-  Format floats using their intrinsic decimal precision (#1267)

Robustness (ROB):
-  Fix merge_page for pages without resources (#1349)

Full Changelog: 2.10.8...2.10.9
@programmarchy
Copy link
Contributor Author

@programmarchy If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

That would be very cool, thank you @MartinThoma!

@mrknwk
Copy link

mrknwk commented Sep 23, 2022

When you use page.scale_by() with a non-integer value in 2.10.9, Acrobat (22.002.20212) displays the transformed PDF pages as empty square drawing areas. Maybe one could leave the default precision at less than 20 digits (this is the Acrobat tipping point, I guess) for compatibility reasons and make it configurable?

@programmarchy
Copy link
Contributor Author

programmarchy commented Sep 24, 2022

Interesting, would you mind sharing how you came to find out 20 digits is the tipping point for Acrobat, @mrknwk?

One way I was thinking of to make this configurable would be to adopt context vars as implemented in decimal.Context for example. The context provides sane defaults with a central point for changing behavior.

It would allow us to write something like:

import PyPDF2
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.context import Context, StripExtraTrailingZeros, QuantizeInteger

ctx = StreamContext()
ctx.max_prec = 5  # specify maximum precision
ctx.flags = [
    StripExtraTrailingZeros,
    QuantizeInteger,
]  # could also specify additional format flags
PyPDF2.setcontext(ctx)

reader = PdfReader("./path/to/file.pdf")
reader.pages[0].scale_by(0.5)
writer = PdfWriter()
writer.add_page(reader.pages[0])
...

Or like this:

with PyPDF2.localcontext() as ctx:
    ctx.max_prec = 5  # specify maximum precision
    ctx.flags = [
        StripExtraTrailingZeros,
        QuantizeInteger,
    ]  # could also specify additional format flags
    ...

Or maybe this:

PyPDF2.setcontext(AdobeAcrobactContext)

Might make sense to open a separate issue to discuss further.

@mrknwk
Copy link

mrknwk commented Sep 24, 2022

@programmarchy It really was just trial and error. 😊

But 20 digits is also the limit that one of the maintainers of PDF Arranger found in a test. He contacted Adobe about it and apparently it is an Acrobat "implementation level limitation". So maybe the third option would be a nice way to go then.

@programmarchy
Copy link
Contributor Author

@mrknwk I'd be happy to take a stab at implementing the above. Could you please create a GitHub issue with a corresponding sample PDF, and tag me?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Path objects get copied with incorrect position and scale
4 participants