Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

url.format() should escape/encode characters that change semantics? #4082

Closed
aseemk opened this issue Oct 5, 2012 · 14 comments
Closed

url.format() should escape/encode characters that change semantics? #4082

aseemk opened this issue Oct 5, 2012 · 14 comments

Comments

@aseemk
Copy link

aseemk commented Oct 5, 2012

I could be wrong, but I think url.format() should be escaping characters that affect the semantics of the returned URL string. I'm testing that by re-parsing via url.parse().

Here's a simple example:

var url = require('url');

function test(props) {
    var string = url.format(props);
    var parsed = url.parse(string, true);

    console.log('Original:', props);
    console.log('Formatted:', string);
    console.log('Parsed:', parsed);
    console.log();
}

var PATHNAME = '/path/to/%?+=&.txt';
var QUERY = { foo: 'bar' };

// the pathname by itself fails:
test({
    pathname: PATHNAME
});

// and with the query it also fails:
test({
    pathname: PATHNAME,
    query: QUERY
});
Original: { pathname: '/path/to/%?+=&.txt' }
Formatted: /path/to/%?+=&.txt
Parsed: { search: '?+=&.txt',
  query: { ' ': '', '.txt': '' },
  pathname: '/path/to/%',
  path: '/path/to/%?+=&.txt',
  href: '/path/to/%?+=&.txt' }

Original: { pathname: '/path/to/%?+=&.txt', query: { foo: 'bar' } }
Formatted: /path/to/%?+=&.txt?foo=bar
Parsed: { search: '?+=&.txt?foo=bar',
  query: { ' ': '', '.txt?foo': 'bar' },
  pathname: '/path/to/%',
  path: '/path/to/%?+=&.txt?foo=bar',
  href: '/path/to/%?+=&.txt?foo=bar' }

In this example, the question mark in the path doesn't get escaped/encoded in the returned string, so that string has different semantics, which get reflected in the parse.

The same applies if you change it to a hash symbol. I'm not sure how much the other characters matter, if at all, in the path specifically.

This came up because that's a valid filename/path on Mac OS X (so I would guess Unix in general), and Express gives you the decoded path just like that, e.g. for a request like GET /path/to/%25%3F%3D%2B%26.txt?foo=bar.

I looked into manually escaping it myself, but encodeURI() encodes too little (e.g. no ? or #), encodeURIComponent() encodes too much (e.g. /), and escape() seems to be bad/deprecated (besides missing +).

Ultimately, it seems to me that this breaks the contract of what you'd expect from url.format() -- that it'd maintain the semantics of the passed-in URL object. Thanks for the consideration!

@bnoordhuis
Copy link
Member

It's not really a bug but it may be somewhat unintuitive. The relevant RFCs (1738 and 2396) don't impose many restrictions on what characters are valid in a path and neither does node, the pathname you pass to util.format() is copied verbatim.

@coltrane
Copy link

coltrane commented Oct 6, 2012

The relevant RFCs (1738 and 2396) don't impose many restrictions on what characters are valid in a path

I disagree. The RFC(s) are pretty clear on both the composition of 'path', and on the encoding of individual uri-components during the construction of a URI string. (see below for support of this assertion).

If we intend for the url module to be RFC-compliant, then this behavior should be considered a bug, and (IMO) it should be fixed.

I would be glad to submit a pull request, if there is interest among core committers in seeing this fixed.

Support

The relevant RFC is 3986. Section 3.3 provides the grammar of the path component. It indicates (and I'm paraphrasing here) that paths are composed of sequences of pchar separated by '/'. Here's how pchar breaks down:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Note that ? and # aren't in there. A number of other chars are absent too. Any character not in that grammar, should be encoded, as explained in section 2.1

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

Section 2.2 adds further support:

   ... If data for a URI component would conflict with a reserved
   character's purpose as a delimiter, then the conflicting data must be
   percent-encoded before the URI is formed.

      reserved    = gen-delims / sub-delims
      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

In a uri, ?, and #, are the two chars that can delimit the end of the path component. So if they occur within the path value itself, then they have to be encoded, in order to avoid being mis-interpreted as their corresponding delimiter char.

Section 2.4 explains when percent encoding should be performed...

   Under normal circumstances, the only time when octets within a URI
   are percent-encoded is during the process of producing the URI from
   its component parts.  [...]  Once produced, a URI is always
   in its percent-encoded form.

This tells us that when url.format() turns a url from an object (a hash of constituent pieces), into a string (a URI), it needs to perform encoding on the individual pieces, before assembling them into a proper URI string.


RFC 1738 is Tim Berners-Lee's original URL specification from 1994. It has been much updated/obsoleted at this point, but even still, it's pretty clear on this topic. From section 2.2:

   ... Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
   reserved characters used for their reserved purposes may be used
   unencoded within a URL. ...

@isaacs
Copy link

isaacs commented Oct 7, 2012

I agree. url.format should escape characters not allowed in the section in question. It's rare in my experience, but sure it makes sense. Patch welcome, please include a zillion tests.

url.parse should be left as-is.

@coltrane
Copy link

coltrane commented Oct 7, 2012

Hm. never mind. While I do stand by what I said above, the situation is actually more complicated than I originally realized:

url.parse does not decode the percent-encoded chars in pathname.

> url.parse("/abc%3fabc");
{ pathname: '/abc%3fabc',
  path: '/abc%3fabc',
  href: '/abc%3fabc' }

This means that, in order for url.format to be a proper inverse operation of url.parse, format must expect that the pathname field is pre-encoded.

> url.format(url.parse("/abc%3fabc"))
'/abc%3fabc'

It can't automatically encode special chars, as I proposed in my previous comment, because doing so would produce double-encoding in cases like the above example.

Intuitively, it would be nice if parse decoded each field of the parsed URI; then format could re-encode each field appropriately. (I suspect that many folks assume that this actually is how it works). This approach would ensure that the generated URI would never contain invalid chars. Alas, I'm sure that this change can't happen, as it would break basically everything.

Instead, I think that the best we can do is to scan the various fields during format, and throw an Error if they contain delimiter chars that would produce a bogus uri string if format were allowed to proceed. I also think we should add a note in the docs, explaining that parse and format do no encoding/decoding. (the exception is the "auth" field, which is actually encoded/decoded automatically if it is present).

@coltrane
Copy link

coltrane commented Oct 7, 2012

@isaacs I didn't see your comment until after I posted above.

Given something like this:

url.format({ pathname: '/hello%3f hello? 100% more' });

Would you prefer that it throws an error?
Or that it "just works"? (yielding: '/hello%3f hello%3f 100%25 more')

Note that catching the un-encoded '%' char in '100%' will be a little more involved than catching delimiters like '?' and '#'. It'll also be possible (but fairly unlikely) to encounter a string with a '%' char that looks encoded, but isn't: "100%absolute100%" --> "100%absolute100%25". This won't cause an invalid URI, but it may be confusing for the user.


Edit: I was focused on encoding the delimiter chars, but of course, if we're encoding delimiters, then we'd go ahead and encode any other illegal chars we find in the process. That would include space. So if we go with the "just works" approach, then the above would actually yield something like: '/hello%3f%20hello%3f%20100%25%20more'.

On the other hand, if we go with the "throw an error" approach, we'd let most illegal chars (like space) through. We'd only throw an Error for contextually significant delimiters or un-encoded percent chars, which --if allowed through-- will always produce broken/confusing URIs that re-parse in unexpected ways.

@coltrane
Copy link

coltrane commented Oct 8, 2012

These changes are now in progress here. The code in that branch seems to be working correctly, but needs more tests before I can submit a pull request. It will be a couple of days before I can get back around to finish this out. Feel free to review/comment/whatever in the meantime.

(edit: fixed the link above to use SHAs instead of branch names, so it won't change when I commit more on the same branch. Its the link I should have posted in the first place.)

@aseemk
Copy link
Author

aseemk commented Oct 8, 2012

Hey @coltrane, thanks for all the great comments and work so far. Just wanted to reply to your question:

Given something like this:

url.format({ pathname: '/hello%3f hello? 100% more' });

Would you prefer that it throws an error?
Or that it "just works"? (yielding: '/hello%3f hello%3f 100%25 more')

That's a great question and a tough one, but I think that trying to "mix and match" in the solution is overly clever. I think that if there are any unencoded characters/strings in the path, all unencoded characters should be encoded, to be simple and predictable.

This also matches likely scenarios I think: either the input you're dealing with is already encoded fine (e.g. from url.parse()), or it's either not encoded at all or you want to double-encode it anyway.

There is indeed the edge case where you want to double-encode it anyway but there are no unencoded characters, but that's hopefully an edge case. And in that case, you can simply encode it yourself using one of the built-in functions since there are no special characters.

Let me know if I didn't make sense. Thanks again!

@isaacs
Copy link

isaacs commented Oct 8, 2012

@coltrane I'd say don't escape % chars, or anything else. Only escape ? and # chars in pathname, and # chars in search, because they change the semantics of the operation otherwise.

@isaacs
Copy link

isaacs commented Oct 8, 2012

Also: This should be about a 2-10 line code change at the most, just some .replace() calls in the appropriate places, and a bunch of new tests to exercise the code path.

Please do not add new methods or anything complicated. https://github.com/coltrane/node/compare/issue4082 is doing waaaaayyyy too much.

@aseemk
Copy link
Author

aseemk commented Oct 9, 2012

Ah, nice suggestion @isaacs!

@bnoordhuis
Copy link
Member

I think you guys are wrong but I don't feel strongly enough to make a big fuzz about it. However, I want to point out that the RFC that @coltrane mentions, RFC 3986, pertains to URIs, not URLs. I don't think the url module ever made any claim to full URI support and indeed the current behavior is consistent with what RFCs 1738 and 2396 require.

2396 defines the path as follows: path = [ abs_path | opaque_part ] where opaque_part = uric_no_slash *uric and uric = reserved | unreserved | escaped - IOW pretty much everything.

@coltrane
Copy link

@bnoordhuis I don't want to make a big (bigger? ;) fuzz either, but I do want to clear up a couple of points:

... If the data for a URI component would conflict with the reserved purpose, 
then the conflicting data must be escaped before forming the URI.

That provision wasn't reflected clearly in the grammar though, until it was updated by RFC-3986. In any case, given a full reading, each the RFCs indicates the need to escape chars that would conflict with delimiters in the given context. (RFC1738 section 2.2 also includes similar language).

@isaacs

don't escape % chars, or anything else. Only escape ? and # chars in pathname, and # chars in search

That would, indeed, be much simpler than what I've started out to do --and I'll be glad to do it your way instead-- but here's why I started in the direction I did:

  1. The problem is not limited to pathname and search. It's possible to produce the same issue by cramming delimiters into other fields of the URL too. A complete solution should address all fields.
  2. Presently, url.format makes no attempt to ensure that its output fits within the charset allowed by the RFC. Mainly, this refers to ASCII "non-printables", and non-ASCII unicode chars. It is possible to make format encode these chars correctly, and this seems like a great opportunity to do that.
  3. There's the issue of percent chars, which I mentioned previously. "Bare" percent chars aren't allowed in RFC-compliant output. Again, we can fix this here, so it seems like we should do it.

The code in the branch I referenced addresses each of those points. But given your comments, we're obviously going for something less extreme, and I'm guessing that RFC-compliance isn't at the top list. That's fine, of course. It just means my changes are out of scope, and I'll roll back to a simpler approach, as you described.

isaacs pushed a commit that referenced this issue Oct 30, 2012
`url.format` should escape ? and # chars in pathname, and # chars in
search, because they change the semantics of the operation otherwise.
Don't escape % chars, or anything else. (see: #4082)
@isaacs
Copy link

isaacs commented Oct 30, 2012

Fixed by 54d293d

@isaacs isaacs closed this as completed Oct 30, 2012
@aseemk
Copy link
Author

aseemk commented Oct 30, 2012

Great work @coltrane and thanks guys!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants