New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

`http` silently corrupts the request URL when it contains non-Latin-1 codepoints #13296

Closed
Flimm opened this Issue May 30, 2017 · 11 comments

Comments

Projects
None yet
7 participants
@Flimm
Contributor

Flimm commented May 30, 2017

I experience the bug on Node commit 399cb25

Darwin Tephs-Mac-Pro.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64 .

Subsystem: http

Issue description

When making a request using http.get with the path set to /café🐶, the server receives /café=6. This is not the URL that was sent, it's not even the precent-encoded version of the URL (with encodeURI), which would be /caf%C3%A9%F0%9F%90%B6. I expected the URL to be passed along without data corruption.

I've created a pull request which contains a test case that currently fails, illustrating this bug. #13297

Flimm added a commit to Flimm/node that referenced this issue May 30, 2017

test: test requests with Unicode in the URL
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.
@jasnell

This comment has been minimized.

Show comment
Hide comment
@jasnell

jasnell May 30, 2017

Member

For sake of performance the impl does not transcode or escape the input into valid ASCII. It assumes the user has done the necessary escaping. That is unlikely to change but can be documented.

Member

jasnell commented May 30, 2017

For sake of performance the impl does not transcode or escape the input into valid ASCII. It assumes the user has done the necessary escaping. That is unlikely to change but can be documented.

@Flimm

This comment has been minimized.

Show comment
Hide comment
@Flimm

Flimm May 30, 2017

Contributor

@jasnell That's great, but that's not what this issue is about. I'm not asking in this issue to transcode or escape the input into valid ASCII. All I'm asking is that silent data corruption not happen. There are solutions for that, one of them is to pass the data without corrupting it. Another would be to throw an exception when Unicode strings are passed in of course. I'll leave it up to core contributors to decide on the best solution to fix the bug that data is getting corrupted silently.

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8, I assume) to the server. No data loss or data corruption happens, even though that URL is not ASCII. It could be that the server can't handle it, but at least no silent data loss or corruption has happened.

So when is the silent data corruption happening?

When /café🐶 is passed as a URL, it gets sent as /café=6, instead of /café🐶. Why is it corrupting the data? This data corruption is helping nobody. No server is going to read =6 and understand that that is meant to be the dog emoji.

(It's worth noting that http.get will check the passed URL for whitespace and throw an exception in the case that it finds a space, despite performance costs.)

Contributor

Flimm commented May 30, 2017

@jasnell That's great, but that's not what this issue is about. I'm not asking in this issue to transcode or escape the input into valid ASCII. All I'm asking is that silent data corruption not happen. There are solutions for that, one of them is to pass the data without corrupting it. Another would be to throw an exception when Unicode strings are passed in of course. I'll leave it up to core contributors to decide on the best solution to fix the bug that data is getting corrupted silently.

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8, I assume) to the server. No data loss or data corruption happens, even though that URL is not ASCII. It could be that the server can't handle it, but at least no silent data loss or corruption has happened.

So when is the silent data corruption happening?

When /café🐶 is passed as a URL, it gets sent as /café=6, instead of /café🐶. Why is it corrupting the data? This data corruption is helping nobody. No server is going to read =6 and understand that that is meant to be the dog emoji.

(It's worth noting that http.get will check the passed URL for whitespace and throw an exception in the case that it finds a space, despite performance costs.)

@mscdex mscdex added the http label May 30, 2017

@seishun

This comment has been minimized.

Show comment
Hide comment
@seishun

seishun May 31, 2017

Member

I agree, this behavior is unnecessarily unintuitive. It should either:

  1. silently percent-encode non-ASCII characters, or
  2. encode as UTF-8, or
  3. throw on non-Latin-1 characters instead of silently breaking them

Options 1 or 2 could be breaking if someone is deliberately using "binary" encoding to send UTF-8 encoded paths, e.g. path: Buffer.from('/wiki/サクラ').toString('binary'). But if anyone is actually doing that, we should just let them specify the path as a Buffer instead of resorting to such hacks.

They could also be breaking if someone actually wants Latin-1 encoded paths (or the server they are talking to is smart enough to recognize it's not UTF-8) and they just happen to use only the Latin-1 range of characters. So path: '/wiki/café' just works by luck. But that scenario seems even less likely than the above.

Here's my proposal to fix this mess:

  1. Allow specifying paths as Buffers.
  2. Runtime-deprecate non-ASCII Latin-1 characters in string paths (a warning notifying that the characters are encoded as Latin-1 and suggesting using Buffer instead).
  3. Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

In a later major version of Node.js, we could consider one of the following:

  1. Encode paths as UTF-8 instead of Latin-1.
  2. Percent-encode non-ASCII characters in paths.
  3. Disallow non-ASCII completely in paths completely.
  4. Leave it as-is.

cc @nodejs/collaborators feel free to criticize.

Member

seishun commented May 31, 2017

I agree, this behavior is unnecessarily unintuitive. It should either:

  1. silently percent-encode non-ASCII characters, or
  2. encode as UTF-8, or
  3. throw on non-Latin-1 characters instead of silently breaking them

Options 1 or 2 could be breaking if someone is deliberately using "binary" encoding to send UTF-8 encoded paths, e.g. path: Buffer.from('/wiki/サクラ').toString('binary'). But if anyone is actually doing that, we should just let them specify the path as a Buffer instead of resorting to such hacks.

They could also be breaking if someone actually wants Latin-1 encoded paths (or the server they are talking to is smart enough to recognize it's not UTF-8) and they just happen to use only the Latin-1 range of characters. So path: '/wiki/café' just works by luck. But that scenario seems even less likely than the above.

Here's my proposal to fix this mess:

  1. Allow specifying paths as Buffers.
  2. Runtime-deprecate non-ASCII Latin-1 characters in string paths (a warning notifying that the characters are encoded as Latin-1 and suggesting using Buffer instead).
  3. Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

In a later major version of Node.js, we could consider one of the following:

  1. Encode paths as UTF-8 instead of Latin-1.
  2. Percent-encode non-ASCII characters in paths.
  3. Disallow non-ASCII completely in paths completely.
  4. Leave it as-is.

cc @nodejs/collaborators feel free to criticize.

@bnoordhuis

This comment has been minimized.

Show comment
Hide comment
@bnoordhuis

bnoordhuis May 31, 2017

Member

@seishun See #3062 (comment). Allow me to paraphrase myself:

The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it decodes to bytes C4 8A. The same codepoint should be rejected with encoding="binary" (or "latin1") because it decodes to byte 0A, a newline.

IOW, the following statement:

Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

Is not true (or only conditionally true.) As well, UTF-8 URLs are used widely enough that rejecting them outright is probably not going to fly.

I think the first order of business is to untangle the conflation of headers and body somehow. Unfortunately, the naive approach is riddled with performance pitfalls and some backwards compatibility concerns.

Member

bnoordhuis commented May 31, 2017

@seishun See #3062 (comment). Allow me to paraphrase myself:

The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it decodes to bytes C4 8A. The same codepoint should be rejected with encoding="binary" (or "latin1") because it decodes to byte 0A, a newline.

IOW, the following statement:

Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

Is not true (or only conditionally true.) As well, UTF-8 URLs are used widely enough that rejecting them outright is probably not going to fly.

I think the first order of business is to untangle the conflation of headers and body somehow. Unfortunately, the naive approach is riddled with performance pitfalls and some backwards compatibility concerns.

@seishun

This comment has been minimized.

Show comment
Hide comment
@seishun

seishun May 31, 2017

Member

@bnoordhuis
I see, so there are two issues here:

  1. Headers are encoded using the same encoding as the first data chunk.
  2. In the absence of any data chunks, headers are encoded using latin-1.

I think both of these could be fixed without touching the conflation of headers and body.

The set of characters to reject depends on the encoding used for the request headers.

So it boils down to two questions:

  1. Do we want to allow setting the encoding for the headers?
  2. Which encoding do we use/default to for the headers?

I think most would agree that defaulting to/using the encoding of the first data chunk or 'latin1' is broken.

Member

seishun commented May 31, 2017

@bnoordhuis
I see, so there are two issues here:

  1. Headers are encoded using the same encoding as the first data chunk.
  2. In the absence of any data chunks, headers are encoded using latin-1.

I think both of these could be fixed without touching the conflation of headers and body.

The set of characters to reject depends on the encoding used for the request headers.

So it boils down to two questions:

  1. Do we want to allow setting the encoding for the headers?
  2. Which encoding do we use/default to for the headers?

I think most would agree that defaulting to/using the encoding of the first data chunk or 'latin1' is broken.

@jasnell

This comment has been minimized.

Show comment
Hide comment
@jasnell

jasnell May 31, 2017

Member

Indeed. Encoding of the headers should have absolutely nothing to do with the encoding of the payload.

Member

jasnell commented May 31, 2017

Indeed. Encoding of the headers should have absolutely nothing to do with the encoding of the payload.

@seishun

This comment has been minimized.

Show comment
Hide comment
@seishun

seishun Jun 1, 2017

Member

Great, if we agree on that, then the next question is which assumptions we can make.

  • If we assume that (almost) no one relies on the headers being encoded as Latin-1 (i.e. when there is no payload), then we can just always encode the headers as UTF-8 in the next major version.

  • If we assume that (almost) no one relies on the headers being encoded as UTF-8 (i.e. when the first payload chunk is UTF-8), then see my proposal in #13296 (comment).

  • If we can't assume either of the above, then, as @bnoordhuis pointed out, we can't throw on non-Latin-1 codepoints (or we can do that only if it's going to be encoded as Latin-1, which seems pretty complicated). But if we agree that the header encoding should be independent of the payload, then we should introduce a warning for non-ASCII codepoints about the planned change in behavior, and allow specifying the path as a Buffer as an unambiguous alternative.

Member

seishun commented Jun 1, 2017

Great, if we agree on that, then the next question is which assumptions we can make.

  • If we assume that (almost) no one relies on the headers being encoded as Latin-1 (i.e. when there is no payload), then we can just always encode the headers as UTF-8 in the next major version.

  • If we assume that (almost) no one relies on the headers being encoded as UTF-8 (i.e. when the first payload chunk is UTF-8), then see my proposal in #13296 (comment).

  • If we can't assume either of the above, then, as @bnoordhuis pointed out, we can't throw on non-Latin-1 codepoints (or we can do that only if it's going to be encoded as Latin-1, which seems pretty complicated). But if we agree that the header encoding should be independent of the payload, then we should introduce a warning for non-ASCII codepoints about the planned change in behavior, and allow specifying the path as a Buffer as an unambiguous alternative.

@seishun seishun changed the title from `http` silently corrupts the request URL when it contains Unicode to `http` silently corrupts the request URL when it contains non-Latin-1 codepoints Jun 1, 2017

jasnell added a commit that referenced this issue Jun 2, 2017

test: add known_test request with Unicode in the URL
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.

PR-URL: #13297
Reviewed-By: James M Snell <jasnell@gmail.com>
@seishun

This comment has been minimized.

Show comment
Hide comment
@seishun

seishun Jun 2, 2017

Member

@Flimm

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8) to the server.

If you aren't sending any payload, then it should be encoded in Latin-1. Could you re-check?

Member

seishun commented Jun 2, 2017

@Flimm

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8) to the server.

If you aren't sending any payload, then it should be encoded in Latin-1. Could you re-check?

seishun added a commit to seishun/node that referenced this issue Jun 2, 2017

jasnell added a commit that referenced this issue Jun 5, 2017

test: add known_test request with Unicode in the URL
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.

PR-URL: #13297
Reviewed-By: James M Snell <jasnell@gmail.com>
@Flimm

This comment has been minimized.

Show comment
Hide comment
@Flimm

Flimm Jun 13, 2017

Contributor

@seishun Sorry, that was just an assumption on my part. I've edited my previous comment to add "I assume".

Contributor

Flimm commented Jun 13, 2017

@seishun Sorry, that was just an assumption on my part. I've edited my previous comment to add "I assume".

MylesBorins added a commit that referenced this issue Jul 17, 2017

test: add known_test request with Unicode in the URL
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.

PR-URL: #13297
Reviewed-By: James M Snell <jasnell@gmail.com>

ryanmahan added a commit to ryanmahan/node that referenced this issue Jan 19, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

Refs: nodejs#13296

@ryanmahan ryanmahan referenced this issue Jan 19, 2018

Closed

test: change assert message to default #18259

4 of 4 tasks complete

Trott added a commit to Trott/io.js that referenced this issue Jan 23, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: nodejs#18259
Refs: nodejs#13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

maclover7 added a commit that referenced this issue Jan 26, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

evanlucas added a commit that referenced this issue Jan 30, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

evanlucas added a commit that referenced this issue Jan 30, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

msoechting added a commit to hpicgs/node that referenced this issue Feb 5, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: nodejs#18259
Refs: nodejs#13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

msoechting added a commit to hpicgs/node that referenced this issue Feb 7, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: nodejs#18259
Refs: nodejs#13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

MylesBorins added a commit that referenced this issue Feb 27, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>

MylesBorins added a commit that referenced this issue Feb 27, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@Trott

This comment has been minimized.

Show comment
Hide comment
@Trott

Trott Apr 30, 2018

Member

@nodejs/http Thoughts on what to do with this? Doc update? Code change? Neither? Something else?

Member

Trott commented Apr 30, 2018

@nodejs/http Thoughts on what to do with this? Doc update? Code change? Neither? Something else?

@TimothyGu

This comment has been minimized.

Show comment
Hide comment
@TimothyGu

TimothyGu Apr 30, 2018

Member

This has more or less been fixed with #20270, which will be out in v11.x.

Member

TimothyGu commented Apr 30, 2018

This has more or less been fixed with #20270, which will be out in v11.x.

@TimothyGu TimothyGu closed this Apr 30, 2018

MayaLekova added a commit to MayaLekova/node that referenced this issue May 8, 2018

test: change assert message to default
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: nodejs#18259
Refs: nodejs#13296
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Shingo Inoue <leko.noor@gmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment