Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leftovers and concatenated compressed streams don't work #29

Open
k0001 opened this issue Jun 3, 2017 · 8 comments
Open

Leftovers and concatenated compressed streams don't work #29

k0001 opened this issue Jun 3, 2017 · 8 comments

Comments

@k0001
Copy link
Owner

k0001 commented Jun 3, 2017

Both decompress and decompress' are expected to automatically process concatenated compressed input. However, this is currently not true for decompress', and moreover, the leftover behavior in these cases is incorrect, as demonstrated by the (currently failing) test case added in c866d0f

cc @dimbleby

@k0001
Copy link
Owner Author

k0001 commented Jun 3, 2017

See #30 as well

@dimbleby
Copy link
Contributor

dimbleby commented Jun 3, 2017

The behaviour of decompress' has always been to unzip exactly one archive, and then return leftovers, which is what the testcase shows it doing.

In #27 I had decompress use this as a helper function: the behaviour of decompress'is not changed by that pull. (And, as best I understand, this is still true after your rework).

(I'm not convinced that decompress' is useful enough to be exposed, but perhaps I'm just failing to see cases where it's what's wanted.)

A function with the behaviour that you're testing for would have to be quite a bit more complicated. You'd need to try decompressing leftovers in the hope that they were another archive; and then back up on yourself when that turned out to be wrong. I suppose you could build this too out of the existing decompress'.

(I'm even less convinced that the function that you're testing for would be useful enough to expose - it's kinda weird, and less useful even than the existing function as a building block).

Suggest either:

  • just document that decompress' does what it does, continue to expose it, perhaps it's a useful helper function for people
  • or else retire it from the public interface (and in that case maybe also retire it altogether and implement decompress a little more directly).

@k0001
Copy link
Owner Author

k0001 commented Jun 3, 2017

@dimbleby a decompress' that returns leftovers needs to be exposed. This is required for parsing streams that contain both compressed and uncompressed chunks, like in the case of git pack files.

Yes, I agree that #27 didn't change the behavior of decompress', but it did change the behavior of decompress and now these two behave differently, and that's undesirable. Since concatenated gzip streams are valid, then decompress' must deal with that too. As you say, the implementation will be tricky, but I'd rather implement it correctly inside pipes-zlib so that people don't need to worry about this.

@k0001
Copy link
Owner Author

k0001 commented Jun 3, 2017

@dimbleby something that is not yet clear to me is whether this "concatenated stream" is a GZip only feature, or a Zlib feature. Do you know more about this?

@k0001
Copy link
Owner Author

k0001 commented Jun 3, 2017

This concatenation support is explicitly mentioned in the gzip format http://www.gzip.org/zlib/rfc-gzip.html#file-format

@k0001
Copy link
Owner Author

k0001 commented Jun 3, 2017

In https://www.ietf.org/rfc/rfc1950.txt it says:

      A zlib stream has the following structure:

           0   1
         +---+---+
         |CMF|FLG|   (more-->)
         +---+---+

      (if FLG.FDICT set)

           0   1   2   3
         +---+---+---+---+
         |     DICTID    |   (more-->)
         +---+---+---+---+

         +=====================+---+---+---+---+
         |...compressed data...|    ADLER32    |
         +=====================+---+---+---+---+

      Any data which may appear after ADLER32 are not part of the zlib
      stream.

So, it looks like concatenation is not supposed to work with Zlib.

@dimbleby
Copy link
Contributor

dimbleby commented Jun 3, 2017

It's news to me, but it looks as though you're right that concatenated archives is specifically a gzip thing. What a minefield! So maybe what we want is:

  • Zlib.decompress just calls Zlib.decompress', and throws an exception if there are any leftovers
  • GZip.decompress loops on GZip.decompress', as per the current code

(Above is written on the basis of the version of decompress' that handles only a single archive).

===

If you want to expose building blocks, I'd still claim that the version of decompress' that we currently have is the fundamental one. Eg I could imagine that some users might care about being able to distinguish the concatenated archives so that they could separately recover file1 and file2, rather than only the concatenation. So I think it would be a mistake to remove that function in favour of a less fundamental version.

However, if you know of use cases where the version of decompress' that you were expecting is desirable, then perhaps implementing that as well would be a useful enhancement?

@k-bx
Copy link

k-bx commented Mar 23, 2020

For anyone who's looking for a quick solution, here's a full working code snippet:

decompressAll ::
     MonadIO m => P.Producer ByteString m r -> P.Producer ByteString m r
decompressAll p = do
  er <- PGZip.decompress' p
  case er of
    Left leftover -> decompressAll leftover
    Right r -> return r

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants