UnicodeDecodeError 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte #1758

nalin-udhaar · 2023-03-31T12:17:27Z

While trying to merge two PDFs, I get the following error.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-144-generic-x86_64-with-glibc2.29

$ python -c "import pypdf;print(pypdf.__version__)"
3.7.0

Code + PDF

This is a minimal, complete example that shows the issue:

import io

from pypdf import PdfMerger, PdfReader


def merge_pdf(source_pdf, pdf_to_merge: list):
    """
    :param source_pdf: first PDF, as bytes
    :param pdf_to_merge: list of PDFs (bytes) to append to source_pdf
    :return:
    """
    merger = PdfMerger(strict=False)
    items = [source_pdf]
    items.extend(pdf_to_merge)
    pdf_merged_buffer = io.BytesIO()
    for _pdf_file in items:
        # Wrap the raw bytes in a stream and append every page
        pdf_buffer = PdfReader(io.BytesIO(_pdf_file), strict=False)
        merger.append(pdf_buffer, import_outline=False)
    # The UnicodeDecodeError is raised here, while the pages are written out
    merger.write(pdf_merged_buffer)
    merger.close()

I can share the pdf over a private channel.

Traceback

This is the Traceback I see, only adding the part with pypdf:
  File "/home/nalin/workspace/project/utils/pdf_utils.py", line 185, in merge_pdf
    merger.write(pdf_merged_buffer)
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_merger.py", line 333, in write
    self.output.add_page(page.pagedata)
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_writer.py", line 359, in add_page
    return self._add_page(page, list.append, excluded_keys)
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_writer.py", line 311, in _add_page
    other = page_org.pdf.pdf_header
  File "/home/nalin/.virtualenv/project/lib/python3.8/site-packages/pypdf/_reader.py", line 363, in pdf_header
    pdf_file_version = self.stream.read(8).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte

PDF can be opened with a PDF viewer.
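
For readers following along, the last frame of the traceback can be reproduced without pypdf. This is a minimal sketch: the actual file was not shared publicly, so only the first byte (0xac) comes from the traceback; the remaining seven bytes are placeholders and purely an assumption.

```python
# Hypothetical stand-in for the first 8 bytes of the damaged file.
# Only the leading 0xac is known from the traceback; the rest is made up.
bad_header = b"\xac" + b"\x00" * 7

try:
    # Same operation as the failing line: read 8 bytes, decode as UTF-8.
    bad_header.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    # 0xac is a UTF-8 continuation byte, so it cannot start a character.
    message = str(exc)

print(message)
```

A healthy file would begin with bytes such as `%PDF-1.7`, which decode without error.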

pubpub-zz · 2023-03-31T12:55:17Z

You can send the file to @MartinThoma at info@martin-thoma.de

MartinThoma · 2023-03-31T14:14:42Z

I think the fix might be rather easy, at least for the first issue: #1759

pubpub-zz · 2023-04-02T07:54:36Z

@nalin-udhaar, we are still waiting for your file

nalin-udhaar · 2023-04-04T05:07:44Z

I have sent the PDF to the email address mentioned above.

pubpub-zz · 2023-04-04T21:07:19Z

I've reviewed your file, and it is damaged (you can check that with a binary editor). I've proposed an alternative robustness improvement.

nalin-udhaar · 2023-04-06T04:39:10Z

Do you want me to share that file again?

pubpub-zz · 2023-04-06T05:45:53Z

@nalin-udhaar
Your email arrived fine, but if you look at the file with a binary editor, you will not find the standard "%PDF-" marker that is expected.
I agree that at least Acrobat Reader copes with this situation (even if some elements may not be displayed); that's why I've proposed a fix to make pypdf more tolerant too.
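
The binary-editor check and the idea of a more tolerant header read can be sketched in a few lines of standard-library Python. Both helpers below (`has_pdf_marker`, `read_header_tolerantly`) are hypothetical names for illustration and are not pypdf's API; the 1024-byte search window and the use of `errors="backslashreplace"` are assumptions and not necessarily what the proposed fix does.

```python
import io

PDF_MARKER = b"%PDF-"


def has_pdf_marker(data: bytes, window: int = 1024) -> bool:
    """Return True if the "%PDF-" marker appears near the start of the data.

    Hypothetical helper; the window size is an assumption.
    """
    return PDF_MARKER in data[:window]


def read_header_tolerantly(stream, size: int = 8) -> str:
    """Read the first `size` bytes as text without raising on bad bytes.

    Hypothetical helper: undecodable bytes are kept as escape sequences
    such as \\xac, and the stream position is restored afterwards.
    """
    position = stream.tell()
    stream.seek(0)
    raw = stream.read(size)
    stream.seek(position)
    return raw.decode("utf-8", errors="backslashreplace")


good = b"%PDF-1.7\n% rest of file"
bad = b"\xac1234567 rest of file"  # made-up bytes, only 0xac is from the report

print(has_pdf_marker(good), has_pdf_marker(bad))
print(read_header_tolerantly(io.BytesIO(bad)))
```

Running `has_pdf_marker` on the raw bytes before handing them to `PdfReader` is one way to detect such a damaged file early in application code.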

MartinThoma · 2023-04-06T12:12:38Z

I just merged #1768 to main. It will be part of pypdf > 3.7.0 which will be released to PyPI on Sunday (9th of April 2023)

MartinThoma · 2023-04-06T12:13:16Z

@nalin-udhaar Thank you for letting us know! If you want, I can put you on the list of contributors: https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

nalin-udhaar · 2023-04-07T07:07:46Z

Yeah, that would be great.

MartinThoma mentioned this issue Mar 31, 2023

ROB: Capture UnicodeDecodeError at PdfReader.pdf_header #1759

Closed

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Apr 4, 2023

ENH : cope with corrupted pdf header

4245d02

fixes py-pdf#1758

pubpub-zz mentioned this issue Apr 4, 2023

ROB: Capture UnicodeDecodeError at PdfReader.pdf_header #1768

Merged

MartinThoma closed this as completed in #1768 Apr 6, 2023

MartinThoma pushed a commit that referenced this issue Apr 6, 2023

ROB: Capture UnicodeDecodeError at PdfReader.pdf_header (#1768)

8146729

Fixes #1758

MartinThoma added a commit that referenced this issue Apr 9, 2023

DOC: Add nalin-udhaar for #1758 as a contributor

5ada43b
