
Expose HTTP header boundary #2612

Open
indexzero opened this Issue January 25, 2012 · 24 comments

7 participants

Charlie Robbins James Halliday Maciej Małecki Mikeal Rogers Mark Nottingham Peter Griess Einar Otto Stangvik
Charlie Robbins

@ry reminder about the small feature in the HTTP parser @substack and I asked you about. To support faster proxying we need to expose the total byte length of the HTTP headers parsed.

I'll suggest exposing 'parser.boundary' as an integer. Thoughts?
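
Such a boundary offset can be sketched in a few lines of JavaScript (a hypothetical `headerBoundary` helper, not the proposed C-level API): it is simply the byte offset one past the blank line that terminates the header block.

```javascript
// Hypothetical sketch of what `parser.boundary` would report: the byte
// offset just past the CRLFCRLF that terminates the header block.
function headerBoundary(buf) {
  const terminator = Buffer.from('\r\n\r\n');
  const idx = buf.indexOf(terminator);
  // -1 means the headers are not yet complete in this buffer.
  return idx === -1 ? -1 : idx + terminator.length;
}

const req = Buffer.from('GET / HTTP/1.1\r\nHost: example.com\r\n\r\nbody');
const boundary = headerBoundary(req); // 37
const body = req.slice(boundary);     // everything after the headers
```

A proxy could slice the incoming buffer at that offset and forward the remainder untouched.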

James Halliday

The proposed boundary integer should also include the framing information for incoming chunked requests (including the final 0\r\n chunk), so that proxies know when to unpipe themselves and multiple keep-alive requests don't bleed into each other.
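
The final `0\r\n` chunk referred to here is part of chunked transfer-encoding framing. A minimal sketch of that computation (hypothetical helper, ignoring trailers), assuming the start of the body is already known:

```javascript
// Walk the chunk-size lines of a chunked body starting at `start`,
// returning the byte offset where the message ends (just past the
// terminating zero-length chunk), or -1 if the body is incomplete.
function chunkedBodyEnd(buf, start) {
  let pos = start;
  for (;;) {
    const lineEnd = buf.indexOf('\r\n', pos);
    if (lineEnd === -1) return -1;
    const size = parseInt(buf.slice(pos, lineEnd).toString(), 16);
    if (Number.isNaN(size)) return -1;
    // Skip the size line, the chunk data, and the chunk's trailing CRLF.
    pos = lineEnd + 2 + size + 2;
    if (size === 0) return pos; // final "0\r\n\r\n" chunk
    if (pos > buf.length) return -1;
  }
}

const chunked = Buffer.from('4\r\nWiki\r\n0\r\n\r\n');
const end = chunkedBodyEnd(chunked, 0); // 14 === chunked.length
```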

Charlie Robbins

@substack Agree with you that framing is important, but shouldn't it be exposed through a different var? e.g. parser.frames?

Maciej Małecki mmalecki referenced this issue from a commit in mmalecki/http-parser February 13, 2012
Expose HTTP header boundary b7136a7
Maciej Małecki mmalecki referenced this issue from a commit in mmalecki/node February 13, 2012
http_parser: expose `parser.boundary`
Requires new `http_parser` (with mmalecki/http-parser@b7136a7 included).

Fixes #2612.
82f3eeb
Maciej Małecki mmalecki referenced this issue from a commit in mmalecki/node February 13, 2012
http_parser test: test if parser boundary is exposed
Tests #2612.
bd06e42
Maciej Małecki

Added some commits. @indexzero @substack can you guys review and make sure this is what you mean? Take a look at the test commit (mmalecki/node@bd06e42) to see how it works.

James Halliday

This is a good start at what needs to happen. The framing boundaries for chunked requests also need to be exposed.

Maciej Małecki mmalecki referenced this issue in joyent/http-parser February 12, 2012
Closed

Expose HTTP header boundary #89

Charlie Robbins

@substack I agree, but we should open a separate issue for that. .boundary should enable most reverse-proxy scenarios.

Mikeal Rogers

we need this :)

Charlie Robbins

I'm still +1 on this, but it's worth mentioning my new approach after speaking with @pgriess. The reason I'm still +1 is that the parser.boundary approach appears to be more performant because it only invokes a single function on every chunk: ondata, instead of both onbody and ondata.

joyent/http-parser#89 (comment)

Mark Nottingham

Just a +1 on the general requirement -- I need this for htracr IIRC.

Peter Griess

So I'd like to expand a bit on my comment on the joyent/http_parser pull request.

I think we could provide an additional set of APIs for HTTPParser that is both more performant and more expressive, and re-implement http.js in terms of these APIs (without changing the external API that http.js provides to its clients). Instead of making upcalls from C++ into JavaScript for each parse event (which is expensive), we could accumulate these events in C++ and pass them back to JavaScript in one shot. Each event would be something like the http_parser_data struct from @ry's event_stream branch: providing the event type (message begin, header name, header value, body, etc.), offset and length into the original buffer, plus major/minor version and method for request lines, etc.

One thing that we'd have to be careful about is that the existing HTTPParser object tries hard to do all string concatenation and allocation in C++, vs. using Buffer.toString() from JavaScript (which incurs an extra cross-boundary call). I think this means that we would probably want to include V8 String objects in the header name/value event objects so that callers don't have to do this. For header names/values that are not complete, we can just punt the concatenation and toString() calls to JavaScript, I think, since a single partial header name/value should be relatively rare compared to fully-received names/values.

FWIW, I think this is a natural extension of the work that node_http_parser.cc is already doing to avoid extra upcalls by batching headers.
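
The batched-event idea can be simulated in JavaScript (all names here are hypothetical, loosely modeled on the `http_parser_data` struct mentioned above): the native side would fill an array of `{type, offset, length}` records referencing the raw buffer, and JavaScript would consume the whole batch in a single upcall.

```javascript
const raw = Buffer.from('GET / HTTP/1.1\r\nHost: a\r\n\r\n');

// What the C++ side might hand back for the buffer above, in one shot:
const events = [
  { type: 'message_begin', offset: 0, length: 0 },
  { type: 'header_field', offset: 16, length: 4 }, // "Host"
  { type: 'header_value', offset: 22, length: 1 }, // "a"
  { type: 'headers_complete', offset: 27, length: 0 },
];

// A single pass on the JavaScript side, slicing names/values lazily.
const headers = {};
let field = null;
for (const ev of events) {
  if (ev.type === 'header_field') {
    field = raw.slice(ev.offset, ev.offset + ev.length).toString();
  } else if (ev.type === 'header_value') {
    headers[field] = raw.slice(ev.offset, ev.offset + ev.length).toString();
  }
}
```

Because each record only carries offsets, the strings are materialized only when (and if) the consumer asks for them.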

Mikeal Rogers

the last big change to the parser was very similar. we used to get an event for every damn header and put them all together in javascript until all the headers were done, then trigger an event in javascript. refactoring it down to one event didn't change the http.js API at all and gave us something like a 10% performance bump.

i think everyone is positive about a change like this but we can't hold off on taking important (read: blocking current use cases) functionality on a large refactor. if this larger change can land in like a week then nobody will complain, but if we keep having to use @substack's javascript parser (a fine effort but fairly untested and slower than node's C parser) for some indefinite amount of time waiting on this re-write then people will start to get pissed and wonder why this smaller change didn't land before the re-write.

Charlie Robbins

@mikeal @substack This is not blocking any behavior. I thought that it was, but I was wrong:

joyent/http-parser#89 (comment)
joyent/http-parser#89 (comment)

We should simply buffer data in parser.onBody instead of doing any framing ourselves. I can't speak to the 0\r\n problem that @substack has encountered, but exposing the header boundary would not fix that.

Mikeal Rogers

we need to know this:

for a given onBody call, how many actual bytes were parsed?

with just that information we can figure out how many bytes came over the socket from the offset we have for the end of the headers and when onEnd is called we know how much data, if any, the socket has emitted that is part of a new request. i believe that would resolve @substack's issue.
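
The arithmetic is simple once those numbers are exposed. A sketch (all names hypothetical), assuming an identity-encoded body where parsed bytes equal body bytes; for chunked bodies the framing overhead is exactly the difference between "bytes parsed" and what onBody reports, which is why the parsed count is the number being asked for:

```javascript
// Given the total bytes the socket emitted, the header boundary offset,
// and the byte count of each onBody call, compute how many trailing
// bytes belong to the *next* pipelined request.
function leftoverBytes(totalSocketBytes, headerBoundary, bodyByteCounts) {
  const bodyBytes = bodyByteCounts.reduce((a, b) => a + b, 0);
  return totalSocketBytes - (headerBoundary + bodyBytes);
}

// 120 bytes arrived; headers ended at offset 40 and two onBody calls
// consumed 50 and 20 bytes, so 10 bytes belong to the next request.
const extra = leftoverBytes(120, 40, [50, 20]); // 10
```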

Mark Nottingham

Related issue is mranney/node_pcap#8

Peter Griess

I don't think you need to know how many bytes were parsed.

The parser is designed such that it owns all framing decisions -- you should never be inspecting raw data other than when you know the connection has been upgraded (e.g. via a CONNECT method or an Upgrade: header) and thus you're no longer dealing with HTTP.

In the scenario you're describing (a single packet contains both the end of the current message and the beginning of a new one), passing the entire buffer into the parser would result in on_body, on_message_complete, on_message_begin, on_header_field, etc. You can observe the transition from old to new message in the on_message_complete/on_message_begin sequence of callbacks. If that doesn't work for you, could you give some more detail on what you're trying to do such that you need to know exactly where, at the byte level, we transition between messages?

James Halliday

If you want to write a transparent proxy without reframing everything needlessly, just using .pipe() to pump data to a new place, then this boundary information is necessary to compute where to shut down the pipe to prevent keep-alive connections from bleeding into each other. You really can't do this with the current parser without writing an ad-hoc parser on top of it.
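
The unpipe decision reduces to splitting a chunk at a known byte offset. A minimal sketch (hypothetical helper), assuming the number of bytes remaining in the current message has been computed from the boundary information:

```javascript
// Forward exactly the bytes that belong to the current message and hold
// back the remainder, so a pipelined keep-alive request on the same
// socket doesn't bleed into the piped destination.
function splitAtMessageEnd(chunk, bytesRemaining) {
  if (chunk.length <= bytesRemaining) {
    return { forward: chunk, held: null };
  }
  return {
    forward: chunk.slice(0, bytesRemaining), // tail of this message
    held: chunk.slice(bytesRemaining),       // start of the next request
  };
}

const chunk = Buffer.from('...end of msgGET /next HTTP/1.1');
const { forward, held } = splitAtMessageEnd(chunk, 13);
```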

Mark Nottingham

Fair enough; that would require changes in node_pcap (or some really nasty hacks around it), but point taken.

Peter Griess

Even if you have header boundary information, you're still faced with figuring out where the end of the message is. This logic is more convoluted than anyone would like: is there a content-length field? is there a transfer-encoding in effect? is this a response? is this eligible for EOF termination? is this the response to a HEAD request? is the status 1xx/204/304? etc. Anyway, all of this is mercifully already implemented in http_parser, but if you don't want to round-trip body data through there, I think you're stuck implementing all this yourself, which it sounds like you don't want to do. Is there another option that I'm missing?

Anyway, if we ignore the boundaries between messages and just focus on the header/body boundary for the time being, you can get at this information by invoking http_parser_pause() in the on_headers_complete callback such that http_parser_execute() stops at the boundary. You can then look at how many bytes it consumed to know where the message body starts.
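
The decision tree enumerated in the first paragraph can be sketched as a function (simplified; the real http_parser logic covers more cases, and the names here are hypothetical):

```javascript
// Decide how the end of a message body is determined, per the rules
// above: no-body responses, chunked encoding, explicit content-length,
// and EOF-terminated responses.
function bodyFraming(msg) {
  // Responses to HEAD, and 1xx/204/304 statuses, never carry a body.
  if (msg.isResponse &&
      (msg.requestWasHead ||
       msg.status === 204 || msg.status === 304 ||
       (msg.status >= 100 && msg.status < 200))) {
    return { mode: 'none' };
  }
  if (/chunked/i.test(msg.headers['transfer-encoding'] || '')) {
    return { mode: 'chunked' };
  }
  if ('content-length' in msg.headers) {
    return { mode: 'length', length: Number(msg.headers['content-length']) };
  }
  // Responses may be delimited by connection close; requests may not.
  return msg.isResponse ? { mode: 'eof' } : { mode: 'none' };
}

const framing = bodyFraming({ isResponse: false,
                              headers: { 'content-length': '42' } });
```

Anyone skipping http_parser for the body ends up re-implementing exactly this.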

Einar Otto Stangvik

How is this implemented in other highly performant proxies? Do they deep-inspect, or just pipe data through (as requested by most in this discussion)?

Peter Griess

Handling keep-alive in proxies is a bit tricky because, while the proxy may support keep-alive, the upstream server may not.

For both HTTP/1.0 and 1.1, this is a per-hop decision (the Connection: header is per-hop, and is used to indicate keep-alive in HTTP/1.0, which does not implicitly re-use connections like HTTP/1.1 does). As a result, the proxy must be prepared to interrogate the message stream to ensure that it can send a single non-pipelined request to upstream hosts that don't support keep-alive. Even if you knew your upstream host supported keep-alive, it would be bad practice to just pipe the two TCP connections together, again because of the (per-hop) Connection header. The upstream server could send a "Connection: close" header, indicating that it does not wish to re-use the TCP connection with the proxy, but that should not imply that the proxy needs to close its connection with the client.
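
The per-hop decision can be sketched as follows (simplified; Connection token lists like "keep-alive, upgrade" are ignored here), applied independently to the client side and the upstream side:

```javascript
// Whether a hop's connection may be reused, per the version defaults:
// HTTP/1.0 closes unless "keep-alive" is explicit; HTTP/1.1 keeps the
// connection open unless "close" is sent.
function mayReuse(httpVersion, connectionHeader) {
  const token = (connectionHeader || '').trim().toLowerCase();
  if (httpVersion === '1.0') return token === 'keep-alive';
  return token !== 'close';
}

// Upstream said "Connection: close", but the proxy can still keep its
// own, separate connection to the client open.
const upstreamReusable = mayReuse('1.1', 'close');  // false
const clientReusable = mayReuse('1.1', undefined);  // true
```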

That said, I'm not sure what the majority of proxies do in practice; @mnot is the real expert there.

Einar Otto Stangvik

I can't help but feel that following the HAProxy implementation pattern would be more beneficial than trying to re-solve this problem.

Mikeal Rogers

defaulting connections to close is not an acceptable solution for performance reasons.

Einar Otto Stangvik

You'll obviously have to implement the proxy in such a manner that upstream limitations are upheld. What I don't understand is why this issue isn't based on the extensive amount of thought put into e.g. HAProxy, rather than being re-explored entirely.

If the rest of you are all intimately familiar with such proxies, please excuse me. Otherwise, that seems like the right thing to spend time digging into at this point.

Mikeal Rogers

our goal is not to copy or match the performance of HAProxy.

our goal is to be faster than HAProxy, which will require a fair amount of new thought.

Einar Otto Stangvik

@mikeal; heh, that's a sane goal, but you don't get there by getting stuck on how to parse responses or properly deal with differing upstream servers. HAProxy (and similar) have long since moved beyond that, and on to focusing on optimizing allocators, ebtree-based state storage and so on.

This just has the ring of an age-old problem being re-solved.
