Text Documents are altered when merged #1058

Closed
Phiwatec opened this issue Jul 4, 2022 · 5 comments
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
PdfMerger: The PdfMerger component is affected

Comments

@Phiwatec

Phiwatec commented Jul 4, 2022

When PdfMerger is used to merge (append) two simple PDF files, the result is incorrect.
The first file is appended without a problem, but the second one comes out as a mixture of both. This does not happen if the PDFs are dissimilar.
It happens with PDFs created by LibreOffice Writer as well as with PDFs from an online conversion service.

$ python -m platform
Linux-5.10.0-13-amd64-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.1

$ python --version
Python 3.9.2

This is a minimal, complete example that shows the issue:

# Basic example from docs
from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
# Append both input files in order
merger.append("LO_first.pdf")
merger.append("LO_second.pdf")
# Write the merged result and release the file handles
merger.write("LO_out.pdf")
merger.close()

There are six attached files:
LO_first.pdf and LO_second.pdf are files created with LibreOffice Writer.
LO_out.pdf is the resulting wrong file.
online_first.pdf and online_second.pdf were created from a plaintext file using an online conversion service.
online_out.pdf is the resulting wrong file.

When using Firefox to view the resulting document, it does not show a hidden character:
[Screenshot: firefox]
When using Chrome or Okular, it shows a non-printable character:
[Screenshot: chrome]

In both cases the text should be "bcde" and not "abc".

If the files contain very similar content ("test1", "test2", "test3", etc.), the first file is used for all pages.

Thanks in advance for looking at this behavior.

@MartinThoma MartinThoma added the is-bug and PdfMerger labels Jul 4, 2022
@MartinThoma
Member

Thank you for reporting the issue. I'll have a look this week (and I hope @MasterOdin / @pubpub-zz have some time as well 😅)

@Phiwatec
Author

Phiwatec commented Jul 4, 2022

Thank you :)

@pubpub-zz
Collaborator

I think the issue is that both input PDF files define a font named /F1, but with different definitions. In the first file the font covers only 3 characters, "a", "b" and "c", associated with the codes 01 02 03, whereas in the second file the codes 01 02 03 04 are associated with "b", "c", "d" and "e".
After merging, page 2 refers to the same font object as page 1, which produces the incorrect text.
Bug confirmed.
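
To make the diagnosis above concrete, here is a toy model of the collision in plain Python. It is only a sketch and not PyPDF2 code: the dictionaries, the `render` helper and the `merge_shared` helper are hypothetical stand-ins for PDF font resources and page content, assumed purely for illustration.

```python
# Toy model of the /F1 name collision. This is NOT how PyPDF2 is implemented;
# all names below are hypothetical and exist only for this illustration.

# Each input "file" maps a font name to a code -> character table.
first_fonts = {"/F1": {1: "a", 2: "b", 3: "c"}}
second_fonts = {"/F1": {1: "b", 2: "c", 3: "d", 4: "e"}}

# Each page is (font name, list of character codes in the content stream).
first_page = ("/F1", [1, 2, 3])
second_page = ("/F1", [1, 2, 3, 4])


def render(page, fonts):
    """Decode a page's codes using the font table it refers to by name."""
    font_name, codes = page
    table = fonts[font_name]
    # Codes missing from the table become U+FFFD, standing in for the
    # non-printable character some viewers display.
    return "".join(table.get(code, "\ufffd") for code in codes)


def merge_shared(fonts_a, fonts_b):
    """Faulty merge: on a name clash, the first file's definition wins."""
    merged = dict(fonts_b)
    merged.update(fonts_a)
    return merged


shared = merge_shared(first_fonts, second_fonts)

# Page 2 decoded with page 1's font: wrong text plus one unmapped code.
buggy_page2 = render(second_page, shared)
# Page 2 decoded with its own font: the expected text.
correct_page2 = render(second_page, second_fonts)

print(repr(buggy_page2))    # 'abc\ufffd'
print(repr(correct_page2))  # 'bcde'
```

In this model the second page keeps its own codes but is decoded with the first file's table, which yields "abc" followed by an unmapped code, consistent with the output described in the report.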

@MartinThoma
Member

MartinThoma commented Jul 5, 2022

@Phiwatec This issue was likely the same as #1062. It was introduced in 2.4.1 via #207. It was fixed in 2.4.2 (released moments ago) via #1063. A test was added to prevent this issue from happening again.

Could you please check if things work with PyPDF2==2.4.2 for you again?
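
For anyone who wants to check whether an installed release falls in the range named in this comment, a small version check could look like the sketch below. It assumes plain numeric version strings such as "2.4.1"; the helper names `parse_version` and `is_affected` are hypothetical and not part of PyPDF2.

```python
# Sketch only: relies on the statement in this thread that the regression
# was introduced in 2.4.1 and fixed in 2.4.2.
AFFECTED = (2, 4, 1)


def parse_version(text):
    """Turn a plain numeric version string like '2.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_affected(version_text):
    """Return True only for the release named as affected in this thread."""
    return parse_version(version_text) == AFFECTED


# Typical use (requires PyPDF2 to be installed):
#   import PyPDF2
#   print(is_affected(PyPDF2.__version__))
```

If the check returns True, upgrading with `pip install "PyPDF2==2.4.2"` should pick up the fix mentioned above.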

@Phiwatec
Author

Phiwatec commented Jul 5, 2022

Thanks for the quick reply. It now works perfectly fine :)
Thank you for your time.

@Phiwatec Phiwatec closed this as completed Jul 5, 2022