New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I110/gunzip #1140

Merged
merged 8 commits into from Feb 3, 2017

Conversation

Projects
None yet
2 participants
@i110
Contributor

i110 commented Dec 13, 2016

add gunzip feature like this nginx module.

If the following conditions are satisfied, the server dynamically decompress .gz file and returns it as plain text.

  • file.send-compressed: gunzip config is enabled
  • There is a gzipped version of the request file
  • The client doesn't accept gzip content encoding

i110 added some commits Dec 13, 2016

@kazuho

Thank you for the PR.

The approach looks minimal, and I like it.

I have left some ideas that IMO would make the PR even better. Please let me know what you think.

Show outdated Hide outdated t/50file-config.t
@@ -2,6 +2,7 @@ use strict;
use warnings;
use Test::More;
use t::Util;
use Gzip::Faster qw/gunzip_file/;

This comment has been minimized.

@kazuho

kazuho Dec 13, 2016

Member

Please use gzip -cd < file | wc -c or something similar instead of adding a new dependency.

@kazuho

kazuho Dec 13, 2016

Member

Please use gzip -cd < file | wc -c or something similar instead of adding a new dependency.

Show outdated Hide outdated include/h2o.h
* decompress
*/
void (*decompress)(struct st_h2o_compress_context_t *self, h2o_iovec_t *inbufs, size_t inbufcnt, h2o_send_state_t state,
h2o_iovec_t **outbufs, size_t *outbufcnt);

This comment has been minimized.

@kazuho

kazuho Dec 13, 2016

Member

Compression context is a context for transforming one representation (e.g. a gzip-ed file) to another (e.g. an decompressed file). A single context can only be used for one purpose; it cannot be used for compressing and decompressing.

Therefore, you should not add a decompress callback.

You should either:

  • create h2o_compress_context_t that decompresses the input passed into the compress callback
  • you could possibly rename the word compress used in various names, but I am not sure if that is worth the effort, considering the fact that we would decompress gzip only
  • create a separate ostream dedicated for decompressing a gzipped content
  • implement decompression in lib/handler/file.c

The above list reflects my preference of addressing the issue.

@kazuho

kazuho Dec 13, 2016

Member

Compression context is a context for transforming one representation (e.g. a gzip-ed file) to another (e.g. an decompressed file). A single context can only be used for one purpose; it cannot be used for compressing and decompressing.

Therefore, you should not add a decompress callback.

You should either:

  • create h2o_compress_context_t that decompresses the input passed into the compress callback
  • you could possibly rename the word compress used in various names, but I am not sure if that is worth the effort, considering the fact that we would decompress gzip only
  • create a separate ostream dedicated for decompressing a gzipped content
  • implement decompression in lib/handler/file.c

The above list reflects my preference of addressing the issue.

This comment has been minimized.

@i110

i110 Dec 13, 2016

Contributor

I preffer your first option too. The name may be misleading, but I believe it's most reasonable for this issue.

@i110

i110 Dec 13, 2016

Contributor

I preffer your first option too. The name may be misleading, but I believe it's most reasonable for this issue.

@@ -303,7 +331,7 @@ static void do_send_file(struct st_h2o_sendfile_generator_t *self, h2o_req_t *re
/* setup response */
req->res.status = status;
req->res.reason = reason;
req->res.content_length = self->bytesleft;
req->res.content_length = self->gunzip ? SIZE_MAX : self->bytesleft;

This comment has been minimized.

@kazuho

kazuho Dec 13, 2016

Member

I believe there is an optional attribute in a gzip file that contains the size of the original content, and that the attribute is accessible from libz. It would be beneficial to use the information, since the clients that lack gzip support would likely overlap with those that do not support chunked encoding (i.e. HTTP/1.0).

@kazuho

kazuho Dec 13, 2016

Member

I believe there is an optional attribute in a gzip file that contains the size of the original content, and that the attribute is accessible from libz. It would be beneficial to use the information, since the clients that lack gzip support would likely overlap with those that do not support chunked encoding (i.e. HTTP/1.0).

This comment has been minimized.

@i110

i110 Dec 13, 2016

Contributor

You mean ISIZE field of gzip file?
https://tools.ietf.org/html/rfc1952#section-2.3.1

I read this spec and searched the web, then I found a few difficulties to use it:

  1. ISIZE field has only 4 bytes, so it's difficult to determine the uncompressed size of the file over 4GB
  2. If the gzipped file has multiple streams (members), the last 4 bytes of that file indicates only the size of last stream, not whole file

and I couldn't found zlib functions which support that information.

BTW, above nginx module also seems to use chunked encoding:
https://github.com/nginx/nginx/blob/f8a9d528df92c7634088e575e5c3d63a1d4ab8ea/src/http/modules/ngx_http_gunzip_filter_module.c#L166

How do you think?

@i110

i110 Dec 13, 2016

Contributor

You mean ISIZE field of gzip file?
https://tools.ietf.org/html/rfc1952#section-2.3.1

I read this spec and searched the web, then I found a few difficulties to use it:

  1. ISIZE field has only 4 bytes, so it's difficult to determine the uncompressed size of the file over 4GB
  2. If the gzipped file has multiple streams (members), the last 4 bytes of that file indicates only the size of last stream, not whole file

and I couldn't found zlib functions which support that information.

BTW, above nginx module also seems to use chunked encoding:
https://github.com/nginx/nginx/blob/f8a9d528df92c7634088e575e5c3d63a1d4ab8ea/src/http/modules/ngx_http_gunzip_filter_module.c#L166

How do you think?

i110 added some commits Dec 13, 2016

@i110

This comment has been minimized.

Show comment
Hide comment
@i110

i110 Dec 13, 2016

Contributor

@kazuho Thank you for the comments, I fixed some issues and replied by inline to others.
Also, I disabled gunzip for HTTP/1.0 requests by this commit, because current implementation uses chunked encoding for sending uncompressed payload.

Contributor

i110 commented Dec 13, 2016

@kazuho Thank you for the comments, I fixed some issues and replied by inline to others.
Also, I disabled gunzip for HTTP/1.0 requests by this commit, because current implementation uses chunked encoding for sending uncompressed payload.

Show outdated Hide outdated lib/handler/compress/gzip.c
ret = deflate(&self->zs, flush);
assert(ret == Z_OK || ret == Z_STREAM_END);
ret = proc(&self->zs, flush);
assert(ret == Z_OK || ret == Z_STREAM_END || ret == Z_BUF_ERROR);

This comment has been minimized.

@kazuho

kazuho Jan 18, 2017

Member

I think we had the discussion before, but could you please once more explain why we need to take care of Z_BUF_ERROR?

The intention of the original code was to always supply enough space to zlib so that it would never return Z_BUF_ERROR. I would like to understand why you needed to accept Z_BUF_ERROR, without growing the amount of free space supplied to zlib.

Note that current design secures at least 32 bytes of room in the out buffer (see the if block above). If there is more than that, then current code does not try to make the room larger. In other words, our design does not follow the requirement of deflate or inflate, that states, quote: if deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error (source: http://www.zlib.net/manual.html).

@kazuho

kazuho Jan 18, 2017

Member

I think we had the discussion before, but could you please once more explain why we need to take care of Z_BUF_ERROR?

The intention of the original code was to always supply enough space to zlib so that it would never return Z_BUF_ERROR. I would like to understand why you needed to accept Z_BUF_ERROR, without growing the amount of free space supplied to zlib.

Note that current design secures at least 32 bytes of room in the out buffer (see the if block above). If there is more than that, then current code does not try to make the room larger. In other words, our design does not follow the requirement of deflate or inflate, that states, quote: if deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error (source: http://www.zlib.net/manual.html).

This comment has been minimized.

@i110

i110 Jan 31, 2017

Contributor

@kazuho The current implementation of zlib's inflate always returns Z_BUF_ERROR when Z_FINISH is provided and inflating succeed.
The comment in zlib's inflate.c describes it as the following:

   In this implementation, the flush parameter of inflate() only affects the
   return code (per zlib.h).  inflate() always writes as much as possible to
   strm->next_out, given the space available and the provided input--the effect
   documented in zlib.h of Z_SYNC_FLUSH.  Furthermore, inflate() always defers
   the allocation of and copying into a sliding window until necessary, which
   provides the effect documented in zlib.h for Z_FINISH when the entire input
   stream available.  So the only thing the flush parameter actually does is:
   when flush is set to Z_FINISH, inflate() cannot return Z_OK.  Instead it
   will return Z_BUF_ERROR if it has not reached the end of the stream.

At the last of inflate function..

    if (((in == 0 && out == 0) || flush == Z_FINISH) && ret == Z_OK)
        ret = Z_BUF_ERROR;

So, I believe we should handle Z_BUF_ERROR as non-fatal even if there're enough buffer.

@i110

i110 Jan 31, 2017

Contributor

@kazuho The current implementation of zlib's inflate always returns Z_BUF_ERROR when Z_FINISH is provided and inflating succeed.
The comment in zlib's inflate.c describes it as the following:

   In this implementation, the flush parameter of inflate() only affects the
   return code (per zlib.h).  inflate() always writes as much as possible to
   strm->next_out, given the space available and the provided input--the effect
   documented in zlib.h of Z_SYNC_FLUSH.  Furthermore, inflate() always defers
   the allocation of and copying into a sliding window until necessary, which
   provides the effect documented in zlib.h for Z_FINISH when the entire input
   stream available.  So the only thing the flush parameter actually does is:
   when flush is set to Z_FINISH, inflate() cannot return Z_OK.  Instead it
   will return Z_BUF_ERROR if it has not reached the end of the stream.

At the last of inflate function..

    if (((in == 0 && out == 0) || flush == Z_FINISH) && ret == Z_OK)
        ret = Z_BUF_ERROR;

So, I believe we should handle Z_BUF_ERROR as non-fatal even if there're enough buffer.

This comment has been minimized.

@kazuho

kazuho Jan 31, 2017

Member

Thank you for looking into the issue.

So to clarify, it seems that inflate() returns Z_BUF_ERROR if flush is set to Z_FINISH at the middle of the compressed data (quote: will return Z_BUF_ERROR if it has not reached the end of the stream).

If that is the case, I have no objections to the change. OTOH, I would appreciate if you could add a in-line comment to the source code explaining why such change is necessary.

Thank you in advance.

@kazuho

kazuho Jan 31, 2017

Member

Thank you for looking into the issue.

So to clarify, it seems that inflate() returns Z_BUF_ERROR if flush is set to Z_FINISH at the middle of the compressed data (quote: will return Z_BUF_ERROR if it has not reached the end of the stream).

If that is the case, I have no objections to the change. OTOH, I would appreciate if you could add a in-line comment to the source code explaining why such change is necessary.

Thank you in advance.

@kazuho kazuho added this to the v2.2 milestone Jan 18, 2017

@kazuho

This comment has been minimized.

Show comment
Hide comment
@kazuho

kazuho Feb 1, 2017

Member

Thank you for the changes.

For the record, the document states:

If the parameter flush is set to Z_FINISH, pending input is processed, pending output is flushed and deflate returns with Z_STREAM_END if there was enough output space. If deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error.
http://www.zlib.net/manual.html

Member

kazuho commented Feb 1, 2017

Thank you for the changes.

For the record, the document states:

If the parameter flush is set to Z_FINISH, pending input is processed, pending output is flushed and deflate returns with Z_STREAM_END if there was enough output space. If deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error.
http://www.zlib.net/manual.html

@kazuho kazuho merged commit 4b7b17a into master Feb 3, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

kazuho added a commit that referenced this pull request Feb 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment