New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: disallow two-byte characters in URL path #16237

Closed
wants to merge 4 commits into
base: master
from

Conversation

@bennofs
Contributor

bennofs commented Oct 16, 2017

This commit changes node's handling of two-byte characters in the path component
of an http URL. Previously, node would just strip the higher byte when
generating the request. So this code:

http.request({host: "example.com", port: "80", "/\uFF2e"})

would request http://example.com/. (. is the character for the byte 0x2e).

This is not useful and can in some cases allow filter evasion. With this
change, the code generates ERR_UNESCAPED_CHARACTERS, just like space and
control characters already did.

Checklist
  • make -j4 test (UNIX) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

http

@@ -0,0 +1,33 @@
// Copyright Joyent, Inc. and other Node contributors.

This comment has been minimized.

@lpinca

lpinca Oct 16, 2017

Member

Nit: copyright is not needed for new tests.

@@ -50,20 +50,20 @@ const errors = require('internal/errors');
// checks can greatly outperform the equivalent regexp (tested in V8 5.4).
function isInvalidPath(s) {

This comment has been minimized.

@joyeecheung

joyeecheung Oct 16, 2017

Member

The comment above this function needs an update as well

@apapirovski

This comment has been minimized.

Member

apapirovski commented Oct 16, 2017

Thanks for the PR @bennofs! I'll abstain from commenting on the correctness of this but there are some performance considerations here.

  1. With this change, isInvalidPath is strictly slower (even for 1 or 2 chars) in v8 versions 6.1 & 6.2 than the RegExp version.

  2. Even if it doesn't or the isInvalidPath perf can be improved, the threshold for which the RegExp is faster has moved down to about 10 characters from the previous 39.

@bennofs

This comment has been minimized.

Contributor

bennofs commented Oct 16, 2017

@apapirovski hmm ok. what did you use to test the performance of isInvalidPath? So I can play around with it and see if there's perhaps some way to optimize it (although I am not very familar with V8 performance).

It is probably impossible not to pay at least a small performance penalty, because we need to check more things than we currently do.

@bennofs

This comment has been minimized.

Contributor

bennofs commented Oct 16, 2017

For the motivation why this is important, see https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf (NodeJS unicode failure).

tl;dr:

var base = "http://orange.tw/sandbox/";
var path = req.query.path;
if(path.indexOf("..") ==-1) {
  http.get(base + path, callback);
}

Such a filter can easily be bypassed with \uFF2e\uFF2e to get SSRF if this is not fixed.

@jasnell jasnell requested review from mscdex and bnoordhuis Oct 16, 2017

//
// This function is used in the case of small paths, where manual character code
// checks can greatly outperform the equivalent regexp (tested in V8 5.4).
function isInvalidPath(s) {
var i = 0;
if (s.charCodeAt(0) <= 32) return true;
if (s.charCodeAt(0) <= 32 || s.charCodeAt(0) > 0xFF) return true;

This comment has been minimized.

@mscdex

mscdex Oct 16, 2017

Contributor

Shouldn't this be > 127 instead of > 255 ? UTF-8 starts using multiple bytes after the last ASCII character (127). Or am I misunderstanding the PR description?

This comment has been minimized.

@bennofs

bennofs Oct 16, 2017

Contributor

V8 strings are not UTF-8, but UTF-16 (UCS2). If it was UTF-8, this would not be necessary, as each byte in an UTF-8 multibyte sequence is > 127 so it would be impossible to get ascii characters that way, even if some bytes were stripped.

What happens is that the string ' \uff2e' is represented in node as (uint16_t*){0xff2e} (where {} is used here to denote an array). With the latin1 encoding (that's what gets used for headers), node internally calls WriteOneByte on that string to write it to the socket and WriteOneByte will strip the higher byte of the two-byte character, so we end up with 0x2e. It works fine for all characters that can be represented with a single byte (and in UCS2, that means all characters below 0x100) (whether any non-ascii characters make sense in this context is questionable, but it also doesn't break anything so I opted not to filter that).

This comment has been minimized.

@bnoordhuis

bnoordhuis Oct 16, 2017

Member

With the latin1 encoding (that's what gets used for headers)

Unfortunately, it's not that simple. See #13296 (comment) for an explanation. The two word summary is "it depends" (on the encoding of the body, but that doesn't fit in two words.)

Aside: calling .charCodeAt() twice for the same index is somewhat wasteful, it would be better to cache the result in a variable. (V8's TurboFan compiler can inline the calls but I don't believe it will fold them.)

This comment has been minimized.

@bennofs

bennofs Oct 16, 2017

Contributor

Oh, thanks for pointing me to that issue. I thought a bit about something like that while writing the patch, but I couldn't find out where the encoding is actually set.

This comment has been minimized.

@bennofs

bennofs Oct 16, 2017

Contributor

At least as long as we aren't using chunked encoding (no idea what that is :))

This comment has been minimized.

@bnoordhuis

bnoordhuis Oct 18, 2017

Member

Sorry for the delay, missed your replies. I don't know if your exact proposal would work but take a look at the _storeHeader() and _send() methods in lib/_http_outgoing.js, that's where the headers are put on the wire and their encoding is determined.

You may also want to look at the _storeHeader() calls in lib/_http_client.js, they compose the status line.

This comment has been minimized.

@bennofs

bennofs Oct 19, 2017

Contributor

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts? I am not sure.

This comment has been minimized.

@apapirovski

apapirovski Oct 19, 2017

Member

I spent some time browsing through open issues yesterday and there's already a bug logged against this behaviour. (Would need to go back to find it again.) IMO what mscdex said above makes sense, just cap it at 127 instead. The current behaviour is completely unpredictable and broken.

This comment has been minimized.

@apapirovski

apapirovski Oct 19, 2017

Member

Here's the issue I was mentioning #13296 — there's some good discussion there too.

I would prefer what I said above, at least for 9.x (maybe different solution for LTS), but I'll defer to @bnoordhuis and others with more experience with the http module.

This comment has been minimized.

@bnoordhuis

bnoordhuis Oct 19, 2017

Member

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts?

It's in fairly widespread use. Rejecting it outright would almost certainly result in quite a bit of fallout and lots of frustrated users. Rejecting malformed UTF-8 should be acceptable and appropriate, though.

(But as discussed, what to accept and what to reject depends on the encoding.)

@apapirovski

This comment has been minimized.

Member

apapirovski commented Oct 16, 2017

@bennofs Totally understand re: the need for it, just don't want to comment since it's not my area of expertise.

Re: performance, I think the simplest solution here is to just completely remove isInvalidPath if this change is going to stick and then just inline the regExp.test() statement into the if condition and you could also pull out the construction of the actual RegExp outside of the ClientRequest so it doesn't run each time.

In terms of running benchmarks, here's the guide for it: https://github.com/nodejs/node/blob/master/doc/guides/writing-and-running-benchmarks.md — the relevant test here would be http/create-clientrequest.js but if you want to run it locally, I would tweak the parameters of the benchmark to something like:

len: [1, 8, 16],
n: [1e7]

(You don't need the large len tests and n = 1e6 is quite noisy.)

Let me know if any of that didn't make sense.

@mcollina

This comment has been minimized.

Member

mcollina commented Oct 17, 2017

Can you please check if this degrades our "simple" http benchmark?

@apapirovski

This comment has been minimized.

Member

apapirovski commented Oct 17, 2017

@mcollina This just affects ClientRequest so I don't think it would, would it? The best benchmark for this should be create-clientrequest which does get degraded as per above.

@addaleax

The PR seems fine to me, but: semver-major?

@apapirovski

This comment has been minimized.

Member

apapirovski commented Oct 19, 2017

@addaleax (or anyone else that knows) what's the policy around semver-major things that have a security impact (which this does)?

@addaleax

This comment has been minimized.

Member

addaleax commented Oct 19, 2017

@apapirovski What we can do is landing a semver-major patch in an existing release line if we think it’s justified for security reasons, and provide users a way to opt out of the patch via a runtime flag.

I think this can be seen as a semver-patch, I just want to make sure we’re explicit about that :)

(Also, I’d take it as a good sign that #8923 was landed in LTS and nobody yelled at us about it.)

@jasnell

This comment has been minimized.

Member

jasnell commented Oct 20, 2017

It makes me quite sad to see yet another example of where our loose conformance to the spec is causing yet another issue. Running the URL through the WHATWG URL parser produces a reasonable result... e.g.

> new url.URL('http://example.org/\uFF2e')
URL {
  href: 'http://example.org/%EF%BC%AE',
  origin: 'http://example.org',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.org',
  hostname: 'example.org',
  port: '',
  pathname: '/%EF%BC%AE',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

I'm +1 on making this change as semver-patch in all release lines, but I would argue even stronger for deprecating the use of the legacy and buggy url.parse() based approach in favor of using the new WHATWG URL parser for everything http related in core.

bennofs added some commits Oct 16, 2017

http: disallow two-byte characters in URL path
This commit changes node's handling of two-byte characters in the path component
of an http URL. Previously, node would just strip the higher byte when
generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/\uFF2e"})
```

would request `http://example.com/.` (`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion. With this
change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and
control characters already did.
http: always use regex for invalid path check
Using the "optimized" version was not significantly faster and even
slower for larger n.
@bennofs

This comment has been minimized.

Contributor

bennofs commented Nov 22, 2017

Is there anything I would need to do to move this PR forward?

@jasnell

This comment has been minimized.

Member

jasnell commented Nov 22, 2017

CI: https://ci.nodejs.org/job/node-test-pull-request/11638/

Just needs CI and signoff. Ping @nodejs/tsc ... PTAL

@addaleax

This comment has been minimized.

Member

addaleax commented Nov 22, 2017

I think this is ready? semver-major?

@apapirovski

This comment has been minimized.

Member

apapirovski commented Nov 22, 2017

This has performance implications in a hot-path code and I did request changes, albeit not with the red checkmark. I would really like to see this benchmarked and feedback addressed.

@apapirovski

This comment has been minimized.

Member

apapirovski commented Nov 22, 2017

Also, the upper threshold is currently somewhat arbitrary given the way we handle encoding of headers.

@bennofs

This comment has been minimized.

Contributor

bennofs commented Nov 22, 2017

@apapirovski here is the benchmark run (with the pure regex based implementation):

Rscript benchmark/compare.R  < old-new.csv
                                                improvement confidence      p.value
 http/create-clientrequest.js n=10000000 len=1      -3.36 %         ** 4.738245e-03
 http/create-clientrequest.js n=10000000 len=16      4.91 %        *** 7.801409e-05
 http/create-clientrequest.js n=10000000 len=8      -2.03 %            1.028356e-01
@apapirovski

This comment has been minimized.

Member

apapirovski commented Nov 22, 2017

IMO we should just switch to the pure RegExp implementation given those numbers (and confidence levels). It removes some pretty ugly code and really simplifies things.

@bennofs

This comment has been minimized.

Contributor

bennofs commented Nov 22, 2017

@apapirovski yes I already pushed the pure regex implementation

@jasnell

This comment has been minimized.

Member

jasnell commented Nov 22, 2017

@mscdex ... PTAL

@apapirovski

This comment has been minimized.

Member

apapirovski commented Nov 22, 2017

@bennofs Sorry, that's my bad. For some reason github kept showing me the outdated review comments as unaddressed instead of hiding as it does normally. 😞

@jasnell

This comment has been minimized.

Member

jasnell commented Nov 22, 2017

I'd like @mscdex to take one final look at this before it lands :-)

@BridgeAR

This comment has been minimized.

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Nov 23, 2017

People who signed off on this: you do all realize this PR rejects everything that isn't Latin-1? Are you prepared to deal with the fallout?

Let's at least run citgm: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1103/ https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1104/

@jasnell

This comment has been minimized.

Member

jasnell commented Nov 23, 2017

Tbh, yes and no. Yes because strict checking has proven time and again to be the better route from a security POV. No because of the breakage that may occur. The code change looks good, but I'm still on the fence about changing this particular bit of code.

@bennofs

This comment has been minimized.

Contributor

bennofs commented Nov 23, 2017

The reason I suggested this change is that it never worked correctly as far as I can tell. And the incorrect behaviour is so broken IMO that it's not useful (except maybe if you want to transmit raw 2-bytes over URLs? seems likely...)

@addaleax

This comment has been minimized.

Member

addaleax commented Dec 1, 2017

I would like to land this in the near future if there are no objections.

@apapirovski

This comment has been minimized.

Member

apapirovski commented Dec 2, 2017

ping @bnoordhuis — do you want to make your concerns explicit via the red X?

@bnoordhuis

This comment has been minimized.

Member

bnoordhuis commented Dec 3, 2017

No, I'll just be an interested onlooker in this one.

@apapirovski apapirovski referenced this pull request Dec 7, 2017

Closed

http: simplify `isInvalidPath` #17525

3 of 3 tasks complete
if (s.charCodeAt(i) <= 32) return true;
return false;
}
const INVALID_PATH_REGEX = /[\u0000-\u0020\u0100-\uffff]/;

This comment has been minimized.

@TimothyGu

TimothyGu Dec 7, 2017

Member

I'd simplify this with /[^\x21-\xff]/ or /[^\u0021-\u00ff]/

@BridgeAR

This comment has been minimized.

Member

BridgeAR commented Dec 12, 2017

Landed in b961d9f.
I changed the RegExp to the suggested one by @TimothyGu.

@BridgeAR BridgeAR closed this Dec 12, 2017

BridgeAR added a commit that referenced this pull request Dec 12, 2017

http: disallow two-byte characters in URL path
This commit changes node's handling of two-byte characters in
the path component of an http URL. Previously, node would just
strip the higher byte when generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/N"})
```

would request `http://example.com/.`
(`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion.
With this change, the code generates `ERR_UNESCAPED_CHARACTERS`,
just like space and control characters already did.

PR-URL: #16237
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>

@addaleax addaleax removed the author ready label Dec 13, 2017

@ChALkeR ChALkeR added the security label Jan 28, 2018

msoechting added a commit to hpicgs/node that referenced this pull request Feb 5, 2018

http: disallow two-byte characters in URL path
This commit changes node's handling of two-byte characters in
the path component of an http URL. Previously, node would just
strip the higher byte when generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/N"})
```

would request `http://example.com/.`
(`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion.
With this change, the code generates `ERR_UNESCAPED_CHARACTERS`,
just like space and control characters already did.

PR-URL: nodejs#16237
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>

msoechting added a commit to hpicgs/node that referenced this pull request Feb 7, 2018

http: disallow two-byte characters in URL path
This commit changes node's handling of two-byte characters in
the path component of an http URL. Previously, node would just
strip the higher byte when generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/N"})
```

would request `http://example.com/.`
(`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion.
With this change, the code generates `ERR_UNESCAPED_CHARACTERS`,
just like space and control characters already did.

PR-URL: nodejs#16237
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment