This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Add support for concatenated gzip files. #6442

Closed · wants to merge 7 commits

Conversation

eendeego

No description provided.

@Nodejs-Jenkins

Thank you for contributing this pull request! Here are a few pointers to make sure your submission will be considered for inclusion.

Commit eendeego/node@d69b312 has the following error(s):

  • First line of commit message must be no longer than 50 characters
  • Commit message must indicate the subsystem this commit changes

Commit eendeego/node@7934cc7 has the following error(s):

  • First line of commit message must be no longer than 50 characters
  • Commit message must indicate the subsystem this commit changes

The following committers were not found in the CLA:

  • Luis Reis

You can fix all these things without opening another issue.

Please see CONTRIBUTING.md for more information

@eendeego
Author

Already Signed the CLA as "Luis Miguel Fonseca dos Reis"

@indutny
Member

indutny commented Oct 31, 2013

Thanks for contributing this!

Perhaps it'd be better if zlib were able to do "partial" decompression and return the offset it stopped at? /cc @isaacs

@eendeego
Author

That can be done, but there must be a way to know the Z_STREAM_END status and issue the necessary inflateReset (which I think is already visible in JS space).

@rlidwka

rlidwka commented Oct 31, 2013

What is the behaviour when there is garbage at the end, something like gzip(a)+gzip(b)+plaintext? There is no test for that.

gzip, for example, decompresses it properly but prints a warning to stderr:

gzip: 1: decompression OK, trailing garbage ignored

@eendeego
Author

eendeego commented Nov 4, 2013

Hi @indutny, I've been looking for a way to emit data events at each [zlib] stream termination. I have these issues:

  1. The code for the inflateReset/inflate should actually be a loop.
    If the loop were implemented in the Process function (like the if/inflate is now), that thread would block until the entire last buffer received by Write was consumed. I think that's the opposite of the behaviour your comment asks for. It would also make the After function awkward, because it would have to buffer multiple chunks and eventually an error (as noted by @rlidwka).

  2. I've tried to move the inflateReset/inflate code from Process to After, just after the call to JS land (see below), but I'm having trouble because the 'end' event is being emitted/received before processing the last [zlib] stream on the test.

Z_STREAM_END handling in node_zlib.cc's After function:

// [...]
// Deliver the decompressed chunk to JS land.
MakeCallback(env, handle, env->callback_string(), ARRAY_SIZE(args), args);

// A gzip member ended but there is still input left: reset the
// inflater and queue another pass over the remaining bytes.
if (ctx->err_ == Z_STREAM_END && ctx->strm_.avail_in > 0) {
  ctx->err_ = inflateReset(&ctx->strm_);
  uv_work_t* work_req = &(ctx->work_req_);
  uv_queue_work(ctx->env()->event_loop(),
                work_req,
                ZCtx::Process,
                ZCtx::After);
}

ctx->Unref();
// [...]

Here's the stack trace for the failed test:

assert.js:98
  throw new assert.AssertionError({
        ^
AssertionError: "123\n456\n" == "123\n456\n789\n"
    at Gunzip.<anonymous> (/Users/lmfr/work/node-build/node/test/simple/test-zlib-from-multiple-gzip.js:70:16)
    at Gunzip.EventEmitter.emit (events.js:125:20)
    at _stream_readable.js:896:16
    at process._tickCallback (node.js:316:11)

Note that the test fails with exactly the same error even if Zlib.prototype.close's code is replaced by a return statement.

  3. It seems that even if this worked well, I would be spawning too many threads just to be able to send chunks ending with Z_STREAM_END return codes.

  4. It seems that JS can only be called from the After function.

What am I doing/assuming wrong?

@indutny
Member

indutny commented Nov 4, 2013

Wow, thanks for trying it out!

I think it should be OK if it just returned the offset it stopped at; after that, zlib.js can invoke inflate again with buf.slice(offset) and continue doing its filthy stuff.

So please don't call it in a loop; let JS handle it. This should be much simpler than writing all this logic in C++.

Please, let me know if you need any other hints and/or help!

@eendeego
Author

eendeego commented Nov 7, 2013

This last commit moves stream-end handling to JS land. I'm not sure the test in the _transform() callback is the proper one; being able to check for Z_STREAM_END would be the best choice, but ctx->err_ is not exposed to JS code.

To handle the garbage example given by @rlidwka, I had to create a new send_error variable in the worker's After callback due to race conditions.

Waiting for feedback...

@indutny
Member

indutny commented Dec 5, 2013

Sorry, I know it took a while for me to get here, but I'm still not ready to give you any feedback. I don't really like that it does this automatically, but I'm not sure how it should look API-wise. I'll figure it out a bit later.

self._buffer,
self._offset,
self._chunkSize);
newReq.callback = callback; // this same function


will lint complain about not having two spaces between the code and the comment, or is that just for cpp?

Author


It's the same as every other comment in this file; actually, that line is copied from a few lines above.


Cool. Wasn't worried, but also couldn't remember which lint would check for that.

@trevnorris

This'll need to be rebased.

@eendeego
Author

eendeego commented Dec 6, 2013

@trevnorris Apart from the // this same function comment, the spacing issues were due to my sloppiness. Those are fixed in the last commit.

Sorry.

@indutny
Member

indutny commented Dec 6, 2013

@trevnorris I don't really like how it automatically continues decompressing without letting the user know. I think it should be explicit.

In other words, ideally, I think there should be a stream.completed flag set to true, or something like that. And if the user calls stream.read() again, it'll continue reading from the place where it stopped.

The main problem is that this mode is not really supported by streams. @isaacs, if you have time, may I ask your advice on this topic?

@eendeego
Author

eendeego commented Dec 6, 2013

@indutny Implementations in other languages also do this transparently. How about some (ignorable) event sent to the client code?

@indutny
Member

indutny commented Jan 8, 2014

@luismreis: I had a conversation with @tjfontaine; I think we are going to land it. @tjfontaine was going to review it one more time. Assigning the issue to him.

@ghost ghost assigned tjfontaine Jan 8, 2014
@indutny
Member

indutny commented Jan 23, 2014

@tjfontaine reminder

@@ -510,6 +510,25 @@ Zlib.prototype._transform = function(chunk, encoding, cb) {
return;
}

if (availInAfter > 0) {


Is there a sane way to refactor this so that these two clauses can be combined instead of duplicated?

@tjfontaine

I'd like to land this, but I'm hoping you're able to refactor it to reduce some of the duplication.

@indutny
Member

indutny commented Feb 18, 2014

@luismreis heya, any news?

@seishun

seishun commented Mar 22, 2014

@indutny if it doesn't continue to decompress automatically, then how should it work with the synchronous zlib functions? Or have you agreed with the automatic approach?

@indutny
Member

indutny commented Mar 22, 2014

Yes, I agreed with it after all.

@seishun

seishun commented Mar 22, 2014

Alright, I'll see if I can rebase and refactor this and make a new PR.

@eendeego
Author

Hi @indutny, I haven't had much time recently. I'd like to pick it up, but I'm short on time.

@seishun

seishun commented Mar 23, 2014

It seems this PR makes it an error to have trailing garbage. According to gzip.org, it's not unusual for tar.gz files to be padded with zeroes. Also, I don't see a way to retrieve the decompressed data when using the sync functions if it throws on garbage.

On the other hand, Python's implementation raises an exception when there's trailing garbage in gzip (but not zlib) and of course doesn't return anything. But then again, Python's API is terrible anyway in my opinion.

How do you think we should approach this? How do other implementations do it?

@eendeego
Author

Yes, it throws an error. I know I was reproducing behavior from something else, but I can't recall what.

// OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
// USE OR OTHER DEALINGS IN THE SOFTWARE.

// test unzipping a file that was created by contactenating multiple gzip
Member


Typo: s/contactenating/concatenating/

@bnoordhuis
Member

Some nits but otherwise LGTM.

@eendeego
Author

Hey, @bnoordhuis, @trevnorris, thank you very much for the comments. I've made all the suggested changes except for adding the documentation.

Why? Because per the RFC this isn't an extra feature; it's something that should always have been there. Even so, it's amazing to google this kind of issue and find that every language's zlib implementation had the same problem at some point.

So... if you still think this documentation is necessary, I suggest adding the following paragraph to doc/api/zlib.markdown just before the Examples title:

The decompression methods can transparently handle concatenated gzip streams,
as specified in section 2.2 of the gzip RFC (RFC 1952).

@eendeego
Author

@trevnorris, @tjfontaine, @indutny ... ping...

@eugeneware

I agree with @luismreis. This is not really a new feature; it's a bug fix. Concatenated gzip data sets are a valid gzip file and stream.

I got caught by this bug when trying to decompress the web data from http://commoncrawl.org, which uses this aspect of the gzip file format to enable indexing into a gzip file.

The Python code that Common Crawl supplies supports this form of gzip streaming; I was a little surprised to find that Node.js didn't.

But a big thanks to @luismreis for this PR!

@trevnorris

@indutny What's your take? I'm alright with the change.

@eendeego
Author

eendeego commented Nov 7, 2014

Sorry for the spam, but, ... @trevnorris, @tjfontaine, @indutny ... ping...

@eugeneware

Yes please! It's now been over a year since this PR was submitted.

@bnoordhuis
Member

It LGTM'd last time I looked at it, but I can't land it here. I can land it in node-forward, but that repo is private for one or two more days.

@indutny
Member

indutny commented Nov 7, 2014

LGTM too.

@trevnorris

Overall LGTM.

Will this land cleanly on v0.12?

@eendeego
Author

eendeego commented Nov 8, 2014

It should, it was recently rebased.

@indutny
Member

indutny commented Nov 9, 2014

Landed in node-forward/node@1183ba4, thank you!

@eendeego
Author

eendeego commented Nov 9, 2014

NO! Thank you!

@eugeneware

Thank you! Will this bug fix be applied to v0.10 as well?

@misterdjules misterdjules modified the milestones: 0.11.15, v0.12 Nov 21, 2014
chrisdickinson pushed a commit that referenced this pull request Dec 16, 2014
Reviewed-By: Fedor Indutny <fedor@indutny.com>
PR-URL: #6442
@chrisdickinson

Landed in 6f6a979 for v0.12. Thanks!

@eugeneware

Thanks @chrisdickinson. Will this bug fix be rebased on to the 0.10 branch too?

@misterdjules

@luismreis @chrisdickinson Should we re-open this PR so that we can fix the regression it caused (see #8962) and land it again in a later milestone?

In the meantime, I'm removing it from the 0.11.15 milestone.

@misterdjules misterdjules removed this from the 0.11.15 milestone Jan 7, 2015
@eendeego
Author

eendeego commented Jan 8, 2015

@misterdjules I don't really know what to say... According to the gzip spec we should restart the gunzip process whenever there's more data, but, apparently, npm depends on the old behavior of stopping at the end of the first gzip stream.
