url.format() should escape/encode characters that change semantics? #4082
Comments
It's not really a bug but it may be somewhat unintuitive. The relevant RFCs (1738 and 2396) don't impose many restrictions on what characters are valid in a path and neither does node, the |
I disagree. The RFC(s) are pretty clear on both the composition of 'path', and on the encoding of individual uri-components during the construction of a URI string. (see below for support of this assertion). If we intend for the

I would be glad to submit a pull request, if there is interest among core committers in seeing this fixed.

Support

The relevant RFC is 3986. Section 3.3 provides the grammar of the path component. It indicates (and I'm paraphrasing here) that paths are composed of sequences of
Note that
Section 2.2 adds further support:
In a uri, Section 2.4 explains when percent encoding should be performed...
This tells us that when

RFC 1738 is Tim Berners-Lee's original URL specification from 1994. It has been much updated/obsoleted at this point, but even still, it's pretty clear on this topic. From section 2.2:
|
I agree. url.format should escape characters not allowed in the section in question. It's rare in my experience, but sure it makes sense. Patch welcome, please include a zillion tests. url.parse should be left as-is. |
Hm. never mind. While I do stand by what I said above, the situation is actually more complicated than I originally realized:
> url.parse("/abc%3fabc");
{ pathname: '/abc%3fabc',
path: '/abc%3fabc',
href: '/abc%3fabc' }

This means that, in order for

> url.format(url.parse("/abc%3fabc"))
'/abc%3fabc'

It can't automatically encode special chars, as I proposed in my previous comment, because doing so would produce double-encoding in cases like the above example. Intuitively, it would be nice if

Instead, I think that the best we can do is to scan the various fields during |
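The double-encoding hazard can be illustrated with a minimal sketch. It uses only the built-in `encodeURIComponent`. The helper name `encodeBlindly` is hypothetical and is not part of Node's `url` module.

```javascript
// Hypothetical helper: percent-encode every path segment unconditionally,
// without checking whether the segment is already encoded.
function encodeBlindly(pathname) {
  return pathname.split('/').map(encodeURIComponent).join('/');
}

const alreadyEncoded = '/abc%3fabc';   // the pathname from the example above
const once = encodeBlindly(alreadyEncoded);

// The '%' of the existing escape is itself escaped, so '%3f' becomes '%253f'.
console.log(once);                     // '/abc%253fabc'

// Decoding once gives back the literal text '%3f', not a '?' character.
console.log(decodeURIComponent(once)); // '/abc%3fabc'
```

Because the encoder cannot know whether `%3f` was meant as an escape or as literal text, applying it to parsed output changes the meaning of the path.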
@isaacs I didn't see your comment until after I posted above. Given something like this:

url.format({ pathname: '/hello%3f hello? 100% more' });

Would you prefer that it throws an error? Note that catching the un-encoded '%' char in '100%' will be a little more involved than catching delimiters like '?' and '#'. It'll also be possible (but fairly unlikely) to encounter a string with a '%' char that looks encoded, but isn't:

Edit: I was focused on encoding the delimiter chars, but of course, if we're encoding delimiters, then we'd go ahead and encode any other illegal chars we find in the process. That would include space. So if we go with the "just works" approach, then the above would actually yield something like:

On the other hand, if we go with the "throw an error" approach, we'd let most illegal chars (like space) through. We'd only throw an Error for contextually significant delimiters or un-encoded percent chars, which --if allowed through-- will always produce broken/confusing URIs that re-parse in unexpected ways. |
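One way to catch a bare '%' like the one in '100%' is a negative lookahead that flags any percent sign not followed by two hex digits. This is only a sketch: `hasBarePercent` and `BARE_PERCENT` are hypothetical names, and the check cannot distinguish a literal '%3f' that the caller meant verbatim from a genuine escape.

```javascript
// A '%' that is NOT followed by two hexadecimal digits cannot be a valid
// percent-escape, so it must be a literal percent sign.
const BARE_PERCENT = /%(?![0-9A-Fa-f]{2})/;

function hasBarePercent(value) {
  return BARE_PERCENT.test(value);
}

console.log(hasBarePercent('/hello%3f hello? 100% more')); // true  ('% ' is not an escape)
console.log(hasBarePercent('/abc%3fabc'));                 // false (only a well-formed escape)
console.log(hasBarePercent('/50%25off'));                  // false
```

The last case shows the ambiguity: `%25` is well formed, so the check passes whether or not the caller intended it as an escape.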
These changes are now in progress here. The code in that branch seems to be working correctly, but needs more tests before I can submit a pull request. It will be a couple of days before I can get back around to finish this out. Feel free to review/comment/whatever in the meantime. (edit: fixed the link above to use SHAs instead of branch names, so it won't change when I commit more on the same branch. It's the link I should have posted in the first place.) |
Hey @coltrane, thanks for all the great comments and work so far. Just wanted to reply to your question:
That's a great question and a tough one, but I think that trying to "mix and match" in the solution is overly clever. I think that if there are any unencoded characters/strings in the path, all unencoded characters should be encoded, to be simple and predictable. This also matches likely scenarios I think: either the input you're dealing with is already encoded fine (e.g. from

There is indeed the edge case where you want to double-encode it anyway but there are no unencoded characters, but that's hopefully rare. And in that case, you can simply encode it yourself using one of the built-in functions since there are no special characters. Let me know if I didn't make sense. Thanks again! |
@coltrane I'd say don't escape % chars, or anything else. Only escape ? and # chars in pathname, and # chars in search, because they change the semantics of the operation otherwise. |
Also: This should be about a 2-10 line code change at the most, just some .replace() calls in the appropriate places, and a bunch of new tests to exercise the code path. Please do not add new methods or anything complicated. https://github.com/coltrane/node/compare/issue4082 is doing waaaaayyyy too much. |
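A minimal sketch of that narrower rule, written as plain `.replace()` calls. The helper names `escapePathname` and `escapeSearch` are hypothetical. This illustrates the idea and is not the code from the actual commit.

```javascript
// Escape only the characters that would change how the string re-parses:
// '?' and '#' in the pathname, and '#' in the search (query) part.
// '%' and everything else are deliberately left untouched.
function escapePathname(pathname) {
  return pathname.replace(/[?#]/g, encodeURIComponent);
}

function escapeSearch(search) {
  return search.replace(/#/g, '%23');
}

console.log(escapePathname('/hello%3f hello? 100% more')); // '/hello%3f hello%3F 100% more'
console.log(escapePathname('/a#b'));                       // '/a%23b'
console.log(escapeSearch('?tag=#1'));                      // '?tag=%231'
```

Since '%' is never touched, an already-encoded pathname such as '/abc%3fabc' passes through unchanged, which avoids the double-encoding problem raised earlier in the thread.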
Ah, nice suggestion @isaacs! |
I think you guys are wrong but I don't feel strongly enough to make a big fuss about it. However, I want to point out that the RFC that @coltrane mentions, RFC 3986, pertains to URIs, not URLs. I don't think the url module ever made any claim to full URI support and indeed the current behavior is consistent with what RFCs 1738 and 2396 require. 2396 defines the path as follows: |
@bnoordhuis I don't want to make a big (bigger? ;) fuss either, but I do want to clear up a couple of points:
That provision wasn't reflected clearly in the grammar though, until it was updated by RFC-3986. In any case, given a full reading, each of the RFCs indicates the need to escape chars that would conflict with delimiters in the given context. (RFC1738 section 2.2 also includes similar language).
That would, indeed, be much simpler than what I've started out to do --and I'll be glad to do it your way instead-- but here's why I started in the direction I did:
The code in the branch I referenced addresses each of those points. But given your comments, we're obviously going for something less extreme, and I'm guessing that RFC-compliance isn't at the top of the list. That's fine, of course. It just means my changes are out of scope, and I'll roll back to a simpler approach, as you described. |
`url.format` should escape ? and # chars in pathname, and # chars in search, because they change the semantics of the operation otherwise. Don't escape % chars, or anything else. (see: #4082)
Fixed by 54d293d |
Great work @coltrane and thanks guys! |
I could be wrong, but I think `url.format()` should be escaping characters that affect the semantics of the returned URL string. I'm testing that by re-parsing via `url.parse()`.

Here's a simple example:
In this example, the question mark in the path doesn't get escaped/encoded in the returned string, so that string has different semantics, which get reflected in the parse.
The same applies if you change it to a hash symbol. I'm not sure how much the other characters matter, if at all, in the path specifically.
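The effect can be reproduced without Node at all. The sketch below is a toy model, not the `url` module: `naiveFormat` and `splitUrl` are hypothetical helpers that concatenate the parts and then split on the first '#' and '?', the way the generic URI syntax delimits fragment and query. The sample pathname is also made up for illustration.

```javascript
// Toy formatter: joins the parts without escaping anything.
function naiveFormat({ pathname = '', search = '', hash = '' }) {
  return pathname + search + hash;
}

// Toy parser: the fragment starts at the first '#', the query at the first '?'
// before it; whatever is left is the path.
function splitUrl(str) {
  const hashAt = str.indexOf('#');
  const hash = hashAt === -1 ? '' : str.slice(hashAt);
  const rest = hashAt === -1 ? str : str.slice(0, hashAt);
  const queryAt = rest.indexOf('?');
  const search = queryAt === -1 ? '' : rest.slice(queryAt);
  const pathname = queryAt === -1 ? rest : rest.slice(0, queryAt);
  return { pathname, search, hash };
}

const original = { pathname: '/path/to/what?.txt' };
const reparsed = splitUrl(naiveFormat(original));

console.log(reparsed.pathname); // '/path/to/what'  (the path was truncated)
console.log(reparsed.search);   // '?.txt'          (the rest became a query)
```

The formatted string no longer round-trips to the object it was built from, which is the contract violation being reported.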
This came up because that's a valid filename/path on Mac OS X (so I would guess Unix in general), and Express gives you the decoded path just like that, e.g. for a request like `GET /path/to/%25%3F%3D%2B%26.txt?foo=bar`.

I looked into manually escaping it myself, but `encodeURI()` encodes too little (e.g. no `?` or `#`), `encodeURIComponent()` encodes too much (e.g. `/`), and `escape()` seems to be bad/deprecated (besides missing `+`).

Ultimately, it seems to me that this breaks the contract of what you'd expect from `url.format()` -- that it'd maintain the semantics of the passed-in URL object. Thanks for the consideration!
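The trade-off between the built-ins is easy to see on the decoded filename from the request above. The per-segment workaround at the end, `encodePath`, is a hypothetical helper offered as one possible approach. It is not something the `url` module provides.

```javascript
const decodedPath = '/path/to/%?=+&.txt';

// encodeURI leaves URL delimiters such as '?', '=', '+' and '&' alone.
console.log(encodeURI(decodedPath));          // '/path/to/%25?=+&.txt'

// encodeURIComponent also escapes the '/' separators.
console.log(encodeURIComponent(decodedPath)); // '%2Fpath%2Fto%2F%25%3F%3D%2B%26.txt'

// Hypothetical workaround: encode each segment, keep the separators.
function encodePath(path) {
  return path.split('/').map(encodeURIComponent).join('/');
}

console.log(encodePath(decodedPath));         // '/path/to/%25%3F%3D%2B%26.txt'
```

Encoding per segment assumes the input is fully decoded. Applied to an already-encoded path, it would double-encode the existing escapes.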