UnicodeDecodeError 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte #1758

Closed
nalin-udhaar opened this issue Mar 31, 2023 · 10 comments · Fixed by #1768

Comments

@nalin-udhaar

nalin-udhaar commented Mar 31, 2023

While trying to merge two PDFs, I get the following error.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-144-generic-x86_64-with-glibc2.29

$ python -c "import pypdf;print(pypdf.__version__)"
3.7.0

Code + PDF

This is a minimal, complete example that shows the issue:

import io

from pypdf import PdfMerger, PdfReader


def merge_pdf(source_pdf, pdf_to_merge: list):
    """
    :param source_pdf: raw bytes of the first PDF
    :param pdf_to_merge: list of PDFs (raw bytes) to append to source_pdf
    :return:
    """
    merger = PdfMerger(strict=False)
    items = [source_pdf]
    items.extend(pdf_to_merge)
    pdf_merged_buffer = io.BytesIO()
    for _pdf_file in items:
        # Wrap the raw bytes in a reader and append it to the merger
        pdf_buffer = PdfReader(io.BytesIO(_pdf_file), strict=False)
        merger.append(pdf_buffer, import_outline=False)
    # The UnicodeDecodeError is raised from this call
    merger.write(pdf_merged_buffer)
    merger.close()

I can share the PDF over a private channel.

Traceback

This is the traceback I see. I am only including the part that involves pypdf:
  File "/home/nalin/workspace/project/utils/pdf_utils.py", line 185, in merge_pdf
    merger.write(pdf_merged_buffer)
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_merger.py", line 333, in write
    self.output.add_page(page.pagedata)
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_writer.py", line 359, in add_page
    return self._add_page(page, list.append, excluded_keys)
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_writer.py", line 311, in _add_page
    other = page_org.pdf.pdf_header
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_reader.py", line 363, in pdf_header
    pdf_file_version = self.stream.read(8).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte

The PDF can be opened with a PDF viewer.
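
The last frame of the traceback decodes the first 8 bytes of the stream as UTF-8. A minimal sketch of why that step fails is shown here. Only the leading 0xac byte is taken from the error message; the remaining bytes and the `try_decode` helper are made up for illustration.

```python
# Only the leading 0xac byte comes from the error message;
# the remaining seven bytes are made-up placeholders.
first_bytes = b"\xac\x00\x00\x00\x00\x00\x00\x00"


def try_decode(data: bytes) -> str:
    """Return the decoded text, or a short description of the decode failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"decode failed: {exc.reason} at position {exc.start}"


# A regular header decodes cleanly; the damaged one does not,
# because 0xac cannot begin a UTF-8 sequence.
print(try_decode(b"%PDF-1.7"))
print(try_decode(first_bytes))
```

The 0xac byte has the bit pattern of a UTF-8 continuation byte, so it is rejected when it appears in position 0.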

@pubpub-zz
Collaborator

You can send the file to @MartinThoma at info@martin-thoma.de

@MartinThoma
Member

I think the fix might be rather easy, at least for the first issue: #1759

@pubpub-zz
Collaborator

@nalin-udhaar, we are still waiting for your file.

@nalin-udhaar
Author

I have shared the PDF via the email address mentioned above.

@pubpub-zz
Collaborator

I've reviewed your file, and it is damaged (you can check that with a binary editor). I've proposed an alternative robustness improvement.
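
For readers without a binary editor at hand, the same check can be sketched in a few lines of Python. The `describe_header` helper and its 16-byte default are hypothetical choices for illustration, not part of pypdf.

```python
import io


def describe_header(stream, length: int = 16) -> str:
    """Return the first `length` bytes as hex, plus whether they start with %PDF-."""
    start = stream.tell()
    head = stream.read(length)
    stream.seek(start)  # restore the original position for later readers
    marker = "present" if head.startswith(b"%PDF-") else "missing"
    return f"{head.hex(' ')} | %PDF- marker {marker}"


# Example with in-memory data; a real file would be opened with open(path, "rb").
print(describe_header(io.BytesIO(b"%PDF-1.4\n")))
print(describe_header(io.BytesIO(b"\xac\x00\x00\x00")))
```

A healthy file begins with the bytes `25 50 44 46 2d`, which is `%PDF-` in ASCII.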

@nalin-udhaar
Author

Do you want me to share that file again?

@pubpub-zz
Collaborator

@nalin-udhaar
Your mail was fine, but if you look at the file with a binary editor, you will not find the standard "%PDF-" marker that is expected at the start.
I agree that at least Acrobat Reader copes with this situation (even if some elements may not be displayed). That's why I've proposed a fix to make pypdf more tolerant too.
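
One way to tolerate a damaged header is to search for the marker instead of decoding the first bytes blindly. The sketch below illustrates that idea only and is not the change made in pypdf. The `find_pdf_version` helper and the 1024-byte search window are assumptions chosen for the example.

```python
import io
from typing import Optional


def find_pdf_version(stream, window: int = 1024) -> Optional[str]:
    """Return the '%PDF-x.y' header found in the first `window` bytes, or None."""
    start = stream.tell()
    head = stream.read(window)
    stream.seek(start)  # leave the stream position unchanged
    idx = head.find(b"%PDF-")
    if idx == -1:
        return None  # no marker at all: the caller decides how to proceed
    # "%PDF-" plus a three-character version such as "1.7" is 8 bytes long.
    # errors="replace" keeps stray non-ASCII bytes from raising an exception.
    return head[idx:idx + 8].decode("ascii", errors="replace")


print(find_pdf_version(io.BytesIO(b"\xac\xed junk %PDF-1.4\n")))
print(find_pdf_version(io.BytesIO(b"\xac\x00\x00\x00")))
```

Returning `None` rather than raising lets the caller fall back to a default or report the file as damaged.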

@MartinThoma
Member

I just merged #1768 to main. It will be part of pypdf > 3.7.0, which will be released to PyPI on Sunday (9th of April 2023).

@MartinThoma
Member

@nalin-udhaar Thank you for letting us know! If you want, I can put you on the list of contributors: https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@nalin-udhaar
Author

Yeah, that would be great.
