Document solution to "character out of range during base 85 decode" #228

Closed
metaperl opened this issue Jul 26, 2021 · 10 comments

Comments

@metaperl

It would be nice if a FAQ documented that, when merging PDFs, the default call to .save() can lead to this exception.

The solution is to call .save() with stream_decode_level=StreamDecodeLevel.none, compress_streams=False.
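For example (a minimal sketch, assuming the merged document is a pikepdf.Pdf named pdf and the output goes to an in-memory buffer; the input filename is just a placeholder):

import io
from pikepdf import Pdf, StreamDecodeLevel

pdf = Pdf.open('merged.pdf')  # hypothetical merged input
out = io.BytesIO()
# Copy streams through without decoding or recompressing, so the bad ASCII85 data is never touched
pdf.save(out, stream_decode_level=StreamDecodeLevel.none, compress_streams=False)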

@jbarlow83
Member

It's not a FAQ because no one's ever asked the question before :P

That most likely means that the PDF contains an error in some base 85-encoded stream. The workaround you found is forcing decompression of streams and will likely lead to a large file. If you've seen it multiple times, chances are you are dealing with PDF software that consistently produces this error.

You should probably do .save(recompress_flate=True) instead to force recompression. The settings you use will decompress everything and make very large files.
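Roughly (a sketch, using the same pdf and out names as elsewhere in this thread):

out = io.BytesIO()
# Recompress streams (including already-Flate-compressed ones) when saving, rather than decompressing everything
pdf.save(out, recompress_flate=True)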

If you're able to share the file I may be able to see if there's a way to improve the standard behavior.

@metaperl
Author

metaperl commented Jul 29, 2021

> You should probably do .save(recompress_flate=True) instead to force recompression. The settings you use will decompress everything and make very large files.

Using that save approach yielded AttributeError: 'pikepdf._qpdf.Pdf' object has no attribute '_original_filename' at line 801, which is: if not filename_or_stream and self._original_filename:

I think I know why this is happening, but I don't want to say anything that might reveal the nature of what we are doing in public.

> If you're able to share the file I may be able to see if there's a way to improve the standard behavior.

I have made a request about this, but I don't think it will happen soon, if at all.

One other thing: .save() yields a plain Python RuntimeError instead of a custom pikepdf exception. Do you think that is the desired behavior?
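For reference, catching it looks roughly like this (a sketch of the fallback, with pdf and out as above):

try:
    pdf.save(out)
except RuntimeError:
    # Discard any partial output, then retry without decoding or recompressing the streams
    out.seek(0)
    out.truncate()
    pdf.save(out, stream_decode_level=StreamDecodeLevel.none, compress_streams=False)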

@jbarlow83
Member

The line 801 error occurs because you called .save() without a destination filename, not because of any other issue.
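In other words (a sketch; the second call is the failing pattern):

pdf = Pdf.new()
out = io.BytesIO()
pdf.save(out, recompress_flate=True)  # fine: a destination is given
pdf.save(recompress_flate=True)       # no destination: falls back to the original filename, which a Pdf.new() doesn't have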

@metaperl
Author

> The line 801 error occurs because you called .save() without a destination filename, not because of any other issue.

Yes, but when you are streaming out to an io.BytesIO buffer, there is no need for a destination filename, just the output stream. This error only occurs with the .save(recompress_flate=True) option, not with .save(pdf_out) or .save(pdf_out, stream_decode_level=StreamDecodeLevel.none, compress_streams=False):

pdf_out = io.BytesIO()
pdf_merger = Pdf.new()
pdf_merger.save(pdf_out) # works 99% of the time
pdf_merger.save(pdf_out, stream_decode_level=StreamDecodeLevel.none, compress_streams=False) # works the other 1% of the time
pdf_merger.save(pdf_out, recompress_flate=True) # fails with the error mentioned above

@jbarlow83
Member

Please try testing pdf_merger.save(pdf_out, recompress_flate=True). I don't believe you actually executed the code above, since 1) it works for me verbatim, and 2) the attribute lookup of _original_filename occurs on a conditional path that is only taken if the save destination is omitted.

I did write .save(recompress_flate=True) earlier, which was misleading and sent you in the wrong direction.

@jbarlow83
Member

jbarlow83 commented Jul 29, 2021

The character out of range error is reproducible with code such as:

import io
import pikepdf
from pikepdf import Pdf

pdf_merger = Pdf.new()
pdf_merger.add_blank_page()
# Attach a content stream whose data is not valid ASCII85
pdf_merger.pages[0].Contents = pdf_merger.make_stream(
    b'\xba\xad',
    Filter=pikepdf.Name.ASCII85Decode
)
out = io.BytesIO()
pdf_merger.save(out)  # raises RuntimeError: character out of range during base 85 decode

i.e. putting some garbage in an ASCII85 stream and saving it.

Now that we have a reproducer, testing the methods explored so far gives:

- recompress_flate=True: also raises the RuntimeError
- stream_decode_level=StreamDecodeLevel.none, compress_streams=False: saves, but the PDF is technically invalid (garbage in, garbage out)

My suggestion would be to iterate over all objects, look for ASCII85 streams, and try to figure out what is producing them and whether you can remove or repair the invalid streams. That would look something like this:

for obj in pdf.objects:
    if isinstance(obj, pikepdf.Stream) and pikepdf.Name.Filter in obj:
        if pikepdf.Name.ASCII85Decode in obj.Filter.wrap_in_array().as_list():
            try:
                obj.read_bytes()  # Try decoding
            except RuntimeError:
                print(f"{obj.objgen} has an invalid ascii85 stream")

At this point I don't see a viable general solution to this issue - the application (the user of pikepdf) needs to decide what to do with invalid input data. If, for example, this particular ASCII85 error is easily detectable and repairable, then we could certainly implement some kind of solution.
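If you did want to neutralize such a stream rather than just report it, one option would be something like this (a sketch only; whether blanking the stream is acceptable is an application-level decision):

for obj in pdf.objects:
    if isinstance(obj, pikepdf.Stream) and pikepdf.Name.Filter in obj:
        if pikepdf.Name.ASCII85Decode in obj.Filter.wrap_in_array().as_list():
            try:
                obj.read_bytes()
            except RuntimeError:
                # Replace the undecodable data with an empty, unfiltered stream so saving no longer trips the decoder
                obj.write(b'')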

@metaperl
Author

> Please try testing pdf_merger.save(pdf_out, recompress_flate=True)

I did, and it fails just as pdf_merger.save(pdf_out) does.

I'm happy with using pdf_merger.save(pdf_out, stream_decode_level=StreamDecodeLevel.none, compress_streams=False) on the rare occasions this does occur. The output file does open and can be read, even though there might be something wrong from a technical standpoint.

I do appreciate your suggestions on how to debug this, and one day I might use them. But for the time being I think we can close this ticket.

@metaperl
Author

metaperl commented Aug 9, 2021

> The settings you use will decompress everything and make very large files.

In our experience, the files come out a bit smaller about half the time and a bit larger about half the time; they are not significantly larger.

@metaperl reopened this Aug 9, 2021
@metaperl
Author

metaperl commented Aug 9, 2021

> The solution is to call .save() with stream_decode_level=StreamDecodeLevel.none, compress_streams=False.

There is no need for stream_decode_level=StreamDecodeLevel.none because the default value of this parameter is already None.

@jbarlow83
Copy link
Member

Closing since this is superseded by #240.
