`http` silently corrupts the request URL when it contains non-Latin-1 codepoints #13296
I experience the bug on Node commit 399cb25
Darwin Tephs-Mac-Pro.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64
When making a request whose path contains non-Latin-1 codepoints, the URL is silently corrupted on the wire.
I've created a pull request which contains a test case that currently fails, illustrating this bug. #13297
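The mechanism behind the corruption can be sketched in isolation (this is an illustrative example, not the test case from the PR): writing a string with the `'latin1'` encoding keeps only the low byte of each code unit, so any codepoint above U+00FF is silently mangled.

```javascript
// Illustrative path; any string with a codepoint > U+00FF shows the problem.
const path = '/snowman/\u2603'; // U+2603 SNOWMAN

// What the application means to send (UTF-8 bytes):
const intended = Buffer.from(path, 'utf8');

// What a latin1 write actually puts on the wire -- U+2603 collapses
// to its low byte, 0x03:
const onWire = Buffer.from(path, 'latin1');

console.log(intended.toString('hex')); // 2f736e6f776d616e2fe29883
console.log(onWire.toString('hex'));   // 2f736e6f776d616e2f03
```

Nothing throws; the server simply receives a different path than the caller intended.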
@jasnell That's great, but that's not what this issue is about. I'm not asking in this issue to transcode or escape the input into valid ASCII. All I'm asking is that silent data corruption not happen. There are solutions for that; one of them is to pass the data through uncorrupted. Another would, of course, be to throw an exception when non-Latin-1 strings are passed in. I'll leave it up to core contributors to decide on the best way to fix the bug that data is getting corrupted silently.
So when is the silent data corruption happening?
(It's worth noting that
I agree, this behavior is unnecessarily unintuitive. It should either:
Options 1 or 2 could be breaking if someone is deliberately using "binary" encoding to send UTF-8 encoded paths, e.g.
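The pre-encoding pattern that options 1 and 2 would break can be shown concretely (a sketch, not code from the thread): `'binary'` is an alias for `'latin1'`, and latin1 is byte-transparent for code units ≤ 0xFF, so users who pre-encode their path as UTF-8 get their exact bytes onto the wire today.

```javascript
// A user who pre-encodes the path as UTF-8 and relies on
// 'binary'/'latin1' passing the bytes through unchanged:
const utf8Path = Buffer.from('/caf\u00e9', 'utf8');  // 2f 63 61 66 c3 a9
const asBinaryString = utf8Path.toString('latin1');  // one code unit per byte

// Because every code unit is now <= 0xFF, the latin1 round trip is
// lossless and the original UTF-8 bytes survive:
const backOnWire = Buffer.from(asBinaryString, 'latin1');
console.log(backOnWire.equals(utf8Path)); // true
```

Re-encoding such a string as UTF-8 (option 2) or rejecting it (option 1) would change what this code sends.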
They could also be breaking if someone actually wants Latin-1 encoded paths (or the server they are talking to is smart enough to recognize it's not UTF-8) and they just happen to use only the Latin-1 range of characters. So
Here's my proposal to fix this mess:
In a later major version of Node.js, we could consider one of the following:
cc @nodejs/collaborators feel free to criticize.
In other words, the following statement:
is not true (or only conditionally true). Moreover, UTF-8 URLs are used widely enough that rejecting them outright is probably not going to fly.
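In the meantime, the loss-free way to send such URLs is to percent-encode them to plain ASCII before handing them to `http` (shown here with the standard `encodeURI`, as a workaround rather than a fix):

```javascript
// Percent-encoding turns the UTF-8 bytes of non-ASCII characters into
// %XX escapes, so no information can be lost in a latin1 write.
const raw = '/search/caf\u00e9';
const safe = encodeURI(raw);
console.log(safe); // /search/caf%C3%A9
```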
I think the first order of business is to untangle the conflation of headers and body somehow. Unfortunately, the naive approach is riddled with performance pitfalls and some backwards compatibility concerns.
I think both of these could be fixed without touching the conflation of headers and body.
So it boils down to two questions:
I think most would agree that defaulting to 'latin1', or inferring the encoding from the first data chunk, is broken.
Great, if we agree on that, then the next question is which assumptions we can make.