Skip to content
This repository has been archived by the owner on Nov 6, 2022. It is now read-only.

Forcing a pause at the end of HTTP headers #97

Open
AndreLouisCaron opened this issue Feb 25, 2012 · 18 comments
Open

Forcing a pause at the end of HTTP headers #97

AndreLouisCaron opened this issue Feb 25, 2012 · 18 comments

Comments

@AndreLouisCaron
Copy link
Contributor

I've implemented a simple C++ wrapper for the http-parser library and ran into a problem when using the parser pause feature.

Basically, I have a Request object that wraps a http_parser instance. This request object has a feed() method which invokes http_parser_execute(). The request object also has a headers_complete() which reports the end of HTTP headers. To make this test reliable (mainly for proxying purposes), I force http_parser_execute() to return by pausing the parser calling http_parser_pause() from the on_headers_complete() callback. All this seemingly works, except that it doesn't consume the last byte of a request without a body. To finish processing the request, I have to call http_parser_execute() again with the last byte of data.

Now, calling feed()/http_parser_execute() one extra time is not that much of a problem, except that it makes it prevents clients from accurately finding the exact position of the end of HTTP headers. In a proxying scenario (HTTP proxy, CGI/SCGI/FastCGI or even WebSockets), where you want to forward the body byte-for-byte, clients end up prefixing the HTTP request body with an extra byte.

This was reported to me on the httpxx issue tracker, but I believe it's an issue in http-parser. Basically, the on_headers_complete() callback is called upon examining the last byte of the header data, and pausing in that callback prevents the library from marking the last byte (the one the triggered the callback) as consumed.

Visit the httpx issue #5 for a detailed discussion. If it's unclear, I might be able to concoct an SSCCE that illustrates the problem.

Note: this issue seems related to pull request 89.

@bnoordhuis
Copy link
Member

Confirmed.

It's annoying: while it's obviously a bug, fixing it breaks a number of assumptions that http-parser users may have made about the order of events or when it's safe to call the parser (evidenced by the number of tests the fix breaks).

Rock, meet hard place.

EDIT: I've added a failing test in bnoordhuis/http-parser@e3028fe.

@AndreLouisCaron
Copy link
Contributor Author

I've hacked a workaround in AndreLouisCaron/httpxx@608af01, but it's really ugly. I have to manually fix off-by-one errors because of this bug, and the count is unreliable until the on_message_complete() callback is called so that I can fix it.

If someone has a clear idea on how to fix this issue without breaking everything, I might be able to put in some time to fix it.

@pgriess
Copy link
Contributor

pgriess commented Feb 27, 2012

Yeah, this is tricky to fix because a single byte can trigger multiple callbacks, and if a callback other than the last results in the parser being paused, we need a way to track that these addition callbacks need to be executed. We're doing this now by not consuming the byte until we're done with all of the callbacks that it should trigger. For callbacks where we do not re-execute the same byte, this isn't an issue, of course.

We could fix this to return that all bytes were consumed (e.g. by stashing the last byte in http_parser and using it when continuing execution), but we'd be breaking the assumption of users of the http_parser_execute() API that if all bytes are consumed, there is no reason to call http_parser_execute() again until there is more data to parse (or an EOF has been received). We'd also need a way to poke the parser to consume this stashed byte -- calling http_parser_execute(NULL, 0) isn't going to cut it, as that would look eerily like an EOF to the un-trained eye. Some options here include adding a new http_parser_execute_buffered() API, or having http_parser_pause(0) kick off execution.

Mercifully, these changes in execution behavior might okay-ish because these assumptions about return value/byte consumption only break if you're using http_parser_pause(), and AFAIK there are a limited number of users of this function (possibly only us and AndreLouisCaron/httpxx).

Any other ideas @bnoordhuis, @ry or @clifffrey?

@AndreLouisCaron
Copy link
Contributor Author

I'll go out on a limb here, but it seems to me that we're approaching the problem from the wrong angle. Rather than trying to fix the thorny issue of marking the last byte, wouldn't it be easier to play on the pausing semantics? I don't really care if more callbacks are invoked after I pause the parser (e.g. on_message_complete() for an empty message), so long as it pauses "soon".

Pausing is currently implemented by using a special error code, which happens to be temporary (the client can reset it by calling http_parser_pause(&parser, 0);). The real issue here it that the parser does not distinguish between HPE_PAUSED and real error codes. A more flexible approach to pausing would prevent the parser from continuing after a change of state (e.g. from "parsing headers" to "parsing body").

Any ideas on this?

@pezcode
Copy link

pezcode commented Feb 27, 2012

Is node (or any other library) relying on http-parser not executing on_message_complete after pausing from inside on_headers_complete?
If not, I think the easiest 'fix' would be to implement pause as "don't consume any more bytes (but call whatever callbacks are necessary)" instead of "stop doing anything". That was what I thought it did when I read the prototype of http_parser_pause anyway.

@pgriess
Copy link
Contributor

pgriess commented Feb 27, 2012

I don't really care if more callbacks are invoked after I pause the parser (e.g. on_message_complete() for an empty message), so long as it pauses "soon".

I think having a immediate, non-advisory pause is pretty useful, particularly as evidenced by the Node community's continued struggles to deal with its absence: the fact that ServerRequest.pause() doesn't actually pause bites people frequently, and is a work-around that looks likely to land in the latest release (by buffering in the application) is both highly desired and looked at with skepticism because of what it has to do to work around the fact that http_parser couldn't natively pause (they don't know about http_parser_pause() yet, it seems).

That said, there's certainly a trade-off between implementation complexity and value, but I don't think we've crossed that threshold.

The real issue here it that the parser does not distinguish between HPE_PAUSED and real error codes.

Is your expectation that calling http_parser_execute() should implicitly un-pause the parser? If so, I'm not sure I follow how that's related to reporting the correct number of bytes consumed.

@pgriess
Copy link
Contributor

pgriess commented Feb 27, 2012

Is node (or any other library) relying on http-parser not executing on_message_complete after pausing from inside on_headers_complete?

Not that I know of. AFAIK there are very few users of http_parser_pause().

If not, I think the easiest 'fix' would be to implement pause as "don't consume any more bytes (but call whatever callbacks are necessary)" instead of "stop doing anything".

Yeah, that's certainly an option.

I do think it's worth spending some amount of effort to get the API to work with the more rigid semantics, though, as this provides a more powerful API. I'll spend some on that this weekend and see how far I get.

@clifffrey
Copy link
Contributor

Couldn't we store the pause state somewhere explicitly? It would be more
work to code up, but if we had different parser states for these things,
even though they were at the same byte offset, then the parser could know
which callbacks have been run already.

I think that pause needs to be absolute, it would be very surprising to a
new user of the API to see a callback called after the stream was paused.

On Mon, Feb 27, 2012 at 11:50 AM, Peter Griess <
reply@reply.github.com

wrote:

Is node (or any other library) relying on http-parser not executing
on_message_complete after pausing from inside on_headers_complete?

Not that I know of. AFAIK there are very few users of
http_parser_pause().

If not, I think the easiest 'fix' would be to implement pause as "don't
consume any more bytes (but call whatever callbacks are necessary)" instead
of "stop doing anything".

Yeah, that's certainly an option.

I do think it's worth spending some amount of effort to get the API to
work with the more rigid semantics, though, as this provides a more
powerful API. I'll spend some on that this weekend and see how far I get.


Reply to this email directly or view it on GitHub:
#97 (comment)

@AndreLouisCaron
Copy link
Contributor Author

If not, I think the easiest 'fix' would be to implement pause as "don't consume any more bytes (but call whatever callbacks are necessary)" instead of "stop doing anything".

Yeah, that's certainly an option.

Sorry if I wasn't clear when I said I wanted the parser to stop "soon". This is what I had in mind.

@AndreLouisCaron
Copy link
Contributor Author

I think the easiest 'fix' would be to implement pause as "don't consume any more bytes (but call whatever callbacks are necessary)" instead of "stop doing anything".

I've worked around the issue to implement exactly this in AndreLouisCaron/httpxx@fbd0422. It only works for processing the last byte after the headers, but it fixes the only instance of this problem I've encountered so far (somehow, pausing in on_message_complete() correctly reports all bytes as consumed).

I would still prefer this was fixed in http-parser for all (potential) cases, but if there is a serious design problem in implementing a real fix, I can live with this bug as long as the exact behavior of http_parser_pause() is documented so that no one gets "bitten" by this problem.

@bnoordhuis
Copy link
Member

I would still prefer this was fixed in http-parser for all (potential) cases, but if there is a serious design problem in implementing a real fix, I can live with this bug

I think everyone agrees that it should be fixed on our side, the question is how?

@AndreLouisCaron
Copy link
Contributor Author

I looked at bnoordhuis's new test case and it prevents pausing in on_headers_complete from invoking on_message_complete. I don' t think this can ever be satisfied if we fix this issue since empty requests absolutely need to call both callbacks before marking the last byte as consumed.

I pushed a sample fix that just addresses the specific case of letting on_message_complete be called when the parser is paused from on_headers_complete. It breaks current tests for pausing, which validate that absolutely no more callbacks should be called after a pause. All other tests still pass, but don't use this since I think this is somewhat of a hack, and I'm not sure what the behavior is for pausing in other callbacks (e.g. on_url()).

I think everyone agrees that it should be fixed on our side, the question is how?

First, we need to define the semantics of pausing. Should pausing be allowed all callbacks, or should it be limited to specific cases? AFAICT, the only interesting places to call this are in on_headers_complete for proxying purposes and in on_message_complete for HTTP/1.1 pipelined requests.

Also, in what (other) cases is this a problem? So far, the only problematic instance is for on_headers_complete(), which must potentially call on_message_complete when the request has no body. Are there any other situations that can call more than one callback for a single byte?

Since it already works for the on_message_complete callback and it's a question of a single byte in on_headers_complete and other cases don't seem so interesting, can we just hack in a solution for the on_headers_complete case and document http_parser_pause as only supported in those two callbacks?

@clifffrey
Copy link
Contributor

I feel like I'm missing something.. why is it that "... empty requests
absolutely need to call both callbacks before marking the last byte as
consumed"? Couldn't they consume the last byte, but leave the parser in a
state of s_paused_before_message_complete, and have that state just call
the callback and then do "goto reexecute_byte;"? I'm probably missing
something though, so please forgive me if this idea can't work.

On Tue, Feb 28, 2012 at 9:21 PM, Andr Caron <
reply@reply.github.com

wrote:

I looked at bnoordhuis's new test case and
it prevents pausing in on_headers_complete from invoking
on_message_complete. I don' t think this can ever be satisfied if we fix
this issue since empty requests absolutely need to call both callbacks
before marking the last byte as consumed.

I pushed a sample fix that just addresses the specific case of letting
on_message_complete be called when the parser is paused from
on_headers_complete. It breaks current tests for pausing, which validate
that absolutely no more callbacks should be called after a pause. All
other tests still pass, but don't use this since I think this is somewhat
of a hack, and I'm not sure what the behavior is for pausing in other
callbacks (e.g. on_url()).

I think everyone agrees that it should be fixed on our side, the
question is how?

First, we need to define the semantics of pausing. Should pausing be
allowed all callbacks, or should it be limited to specific cases? AFAICT,
the only interesting places to call this are in on_headers_complete for
proxying purposes and in on_message_complete for HTTP/1.1 pipelined
requests.

Also, in what (other) cases is this a problem? So far, the only
problematic instance is for on_headers_complete(), which must potentially
call on_message_complete when the request has no body.


Reply to this email directly or view it on GitHub:
#97 (comment)

@AndreLouisCaron
Copy link
Contributor Author

After further reading of the code, it seems to me that the only risky bits are parser states that use goto reexecute_byte; and execute callbacks. I only found only two states that do this:

  1. s_body_identity; and
  2. s_headers_almost_done.

FWIW, the latter is our problematic case and the former already has somewhat of a hack:

/* Mimic CALLBACK_DATA_NOADVANCE() but with one extra byte.
 *
 * The alternative to doing this is to wait for the next byte to
 * trigger the data callback, just as in every other case. The
 * problem with this is that this makes it difficult for the test
 * harness to distinguish between complete-on-EOF and
 * complete-on-length. It's not clear that this distinction is
 * important for applications, but let's keep it for now.
 */
 CALLBACK_DATA_(body, p - body_mark + 1, p - data);
 goto reexecute_byte;

@AndreLouisCaron
Copy link
Contributor Author

Couldn't they consume the last byte, but leave the parser in a state of s_paused_before_message_complete, and have that state just call the callback and then do "goto reexecute_byte;"?

Hadn't thought of introducing anoher state! However, I don't think this will work, since the byte will no longer available on the next call to http_parser_execute (the client will have considered it already consumed by the parser). Also, that would mean that on an empty request, the client would have to call http_parser_execute with 0 bytes to "complete" the message even though all the message data has been processed, and I believe this breaks one of the key invariants of http-parser. See pgriess's earlier comment.

@pgriess
Copy link
Contributor

pgriess commented Mar 1, 2012

Yeah, @AndreLouisCaron is right that the particular devil here is the extra byte that you need to stash: where to stash it, and how to kick off parsing with it. I don't think we need to add any more parser states to accomplish this, as there is already a state for every place where we might have paused (i.e. after every callback). We just need to tweak these states to look at the stashed byte.

I'm starting to think that renaming http_parser_pause() to http_parser_unpause_and_execute() (or something) is the right way to go. I can't think of a meaningful situation where one would un-pause the parser without wanting to then immediately process any data that remained. The major benefit of this API name change would be that it's obvious that the parser going to start executing without being given any additional data.

@jasnell
Copy link
Member

jasnell commented Oct 26, 2015

@indutny @bnoordhuis ... any reason to keep this open?

@vinniefalco
Copy link
Contributor

I'm about to open a new issue. I'd like my C++ wrapper to operate the parser in two stages, first to process the message header as a complete unit, and then to process the body. Returning 1 from on_headers_complete is not the answer because you can't resume parsing of the body. So I'm looking through the outstanding issues to see if anyone else seeks similar functionality and I came across this issue.

I'm going to try using the pause feature to implement two stage parsing. Was the bug described fixed? Am I going to encounter a problem with "that last byte?" Is there another way to implement the functionality I desire?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants