Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF Read/Merge generates broken documents since update 2.10.6 #1344

Closed
Merinorus opened this issue Sep 12, 2022 · 7 comments · Fixed by #1345
Closed

PDF Read/Merge generates broken documents since update 2.10.6 #1344

Merinorus opened this issue Sep 12, 2022 · 7 comments · Fixed by #1345

Comments

@Merinorus
Copy link

Merinorus commented Sep 12, 2022

Hello,

Since update 2.10.6, some PDF documents are not merged correctly. Same with version 2.10.7.
Previous versions (2.10.5 and below) behave correctly.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-Ubuntu-20.04-focal
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-debian-11.2

$ python3 -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.6

Code + PDF

This is a minimal, complete example that shows the issue:

# requirements.txt
diff-pdf-visually==1.7.0
PyPDF2==2.10.6
pytest==7.1.3
# test_same_pdf.py
import io
import shutil
import tempfile
from PyPDF2 import PdfMerger, PdfReader
from diff_pdf_visually import pdf_similar
import os
import pytest
import logging


logger = logging.getLogger(__name__)

FILE_INPUT_URI = "arret_maladie.pdf"
FILE_OUTPUT_URI = "output.pdf"


def file_as_bytesio(filepath: str):
    """Open a file as BytesIO, read only."""
    with open(filepath, "rb") as f:
        return io.BytesIO(f.read())


def test_pdf_merger():

    # Open document and merge it into a temporary file
    merger = PdfMerger()

    merger.append(PdfReader(file_as_bytesio(FILE_INPUT_URI)))

    # Write the final merged document
    temp_file = tempfile.NamedTemporaryFile()
    merger.write(temp_file.name)

    temp_file_path = temp_file.name

    # Compare VISUALLY the content of the newly generated file with the expected content
    if not pdf_similar(temp_file_path, FILE_INPUT_URI):
        # If files don't match visually, the test fails.
        # Copy the newly generated file to the current directory to manually check what is wrong
        current_dir = os.path.dirname(__file__)
        new_file_path = os.path.join(current_dir, FILE_OUTPUT_URI)
        shutil.copy2(temp_file.name, new_file_path)
        logger.error(f"The newly merged file does not match with the intput file.")
        # Fail the test on purpose
        assert False

Here is the PDF that caused the issue:
input.pdf

Here is the output (simple PdfReader -> PdfMerger):
output.pdf

Traceback

This is the complete Traceback I see:

Converting each page of the PDFs to an image...
  PDFs have same number of pages. Checking each pair of converted images...
Min sig = 13.4442, significant?=True. The PDFs are different. The most different pages are: page 1 (sgf. 13.4442), page 2 (sgf. 13.6684), page 3 (sgf. 13.9239).
ERROR    test_same_pdf:test_same_pdf.py:44 The newly merged file does not match with the intput file.

Thank you for taking the time to investigate.
The document is a French official form, I guess it's fine for using it in automated tests, but not sure.

@Merinorus Merinorus changed the title PDF Merging is broken since update 2.10.6 PDF Read/Merge generates broken documents since update 2.10.6 Sep 12, 2022
@pubpub-zz
Copy link
Collaborator

Thanks to locating the issue,
Under analysis

@pubpub-zz
Copy link
Collaborator

The PDF has some fonts where the name includes some spaces (encoded with #20). The decoding takes that properly but not the writing leaving some invalides spaces leading to invalid fields.
Some cases with UTF-8 has been identified (typically for some chinese name fonts) : To anyone some PDF would be welcomed

else fix in progress

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 13, 2022
@pubpub-zz
Copy link
Collaborator

@Merinorus
Have a look at the PR, If you want to test the mod on more changes

@Merinorus
Copy link
Author

Hi @pubpub-zz,
I just tested your PR with more files, working perfectly.
That was fast! Thank you very much!

MartinThoma pushed a commit that referenced this issue Sep 14, 2022
Three kinds of changes were made in this PR

1) _cmap.py : the str is coming from `/Encoding` which stores a NameObject : The conversion is already performed; no need to force it.
2) _page.py : Replaced obsolete call in _debug_for_extract()
3) _base.py :
3.1) unnumber : all `#xx` should be performed prior to conversion to str (using utf-8) to allow multi language text
3.2) read_from_stream : if utf-8 (normally the only one required) or gbk (kept to prevent regression) we will use charmap to get some sequence of chars
3.3) renumber : added to recode in #xx sequence. renumber will also be compatible with utf-8 chars

Closes #1344
@MartinThoma
Copy link
Member

The fixing PR was just merged. I will make a release to PyPI in a few minutes.

@MartinThoma
Copy link
Member

@Merinorus Thank you for reporting the issue! We value issue reports and feedback if the fixes worked. If you want, I would add you as a contributor: https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@Merinorus
Copy link
Author

@MartinThoma That's very kind! I accept your offer. ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants