This repository has been archived by the owner. It is now read-only.

url delims. worthwhile? #3270

Closed
isaacs opened this Issue May 15, 2012 · 20 comments

Comments

Projects
None yet
8 participants
@isaacs

isaacs commented May 15, 2012

It would be good to maybe discuss this, as it's a common complaint.

On the one hand, the various RFCs are quite clear that whitespace, <>, and quotes are delimiters, and thus cannot ever appear in URLs. In url.js, there's this:

// RFC 2396: characters reserved for delimiting URLs.
delims = ['<', '>', '"', '`', ' ', '\r', '\n', '\t'],

Any delims that are found are thus assumed to be not part of the URL.

However, in practice, this is a bit of a deviation from the way that URLs are handled by browsers in the address bar, the window.location object, and the anchor tag, which serve as models for node's URL parser.

Would anyone object if delims were just auto-escaped?

The change in behavior would be:

url.parse("<http://foo.com>")

Currently this is parsed as

{ protocol: 'http:',
  slashes: true,
  host: 'foo.com',
  hostname: 'foo.com',
  href: 'http://foo.com/',
  pathname: '/',
  path: '/' }

With the proposed change, it'd be:

{ pathname: '%3Chttp://foo.com%3E',
  path: '%3Chttp://foo.com%3E',
  href: '%3Chttp://foo.com%3E' }

Which, incidentally, is exactly what you get if you type <http://foo.com> into a web browser address bar, or click on an <a href="<http://foo.com>">foo</a>. Yes, it is arguably not the user intent, but it's how it's interpreted everywhere else.

While this could potentially reduce the usefulness of the url module as a validator, it'd reduce a common gotcha for newcomers, and would make it a closer model to the browser, which was always the original intent of the module.

If we're going to do it, we should do it completely. We've had these little debates about escaping or not escaping spaces, and quotes, and various things one by one. Maybe we ought to just get rid of the delims altogether.

Thoughts?

/cc @bnoordhuis @3rdeden

@polotek

This comment has been minimized.

Show comment
Hide comment
@polotek

polotek May 15, 2012

What is this common gotcha? How does it go?

polotek commented May 15, 2012

What is this common gotcha? How does it go?

@indutny

This comment has been minimized.

Show comment
Hide comment
@indutny

indutny May 15, 2012

Member

-1 for this, though I don't mind if it'll be hidden under some flag or option.

Member

indutny commented May 15, 2012

-1 for this, though I don't mind if it'll be hidden under some flag or option.

@n2liquid

This comment has been minimized.

Show comment
Hide comment
@n2liquid

n2liquid May 15, 2012

Could both behaviors be triggered by a flag or something? Then the question boils down to what should be the default...

Could both behaviors be triggered by a flag or something? Then the question boils down to what should be the default...

@isaacs

This comment has been minimized.

Show comment
Hide comment
@isaacs

isaacs May 15, 2012

@polotek The common gotcha is "Why doesn't url.parse("http://foo.com/bar baz") work?"

@indutny @n2liquid No, let's just pick one and go with it. url.parse already takes 2 more flags than I'd prefer.

isaacs commented May 15, 2012

@polotek The common gotcha is "Why doesn't url.parse("http://foo.com/bar baz") work?"

@indutny @n2liquid No, let's just pick one and go with it. url.parse already takes 2 more flags than I'd prefer.

@indutny

This comment has been minimized.

Show comment
Hide comment
@indutny

indutny May 15, 2012

Member

@isaacs -1 anyway.

Member

indutny commented May 15, 2012

@isaacs -1 anyway.

@s3u

This comment has been minimized.

Show comment
Hide comment
@s3u

s3u May 15, 2012

Actually, the string passed in the example is not a valid URL per http://tools.ietf.org/html/rfc3986 and treatment of delims should ideally be done in context as in "parse URIs from this blob of text". But AFAIU the semantics of url.parse, no such context is expected to be assumed.

s3u commented May 15, 2012

Actually, the string passed in the example is not a valid URL per http://tools.ietf.org/html/rfc3986 and treatment of delims should ideally be done in context as in "parse URIs from this blob of text". But AFAIU the semantics of url.parse, no such context is expected to be assumed.

@isaacs

This comment has been minimized.

Show comment
Hide comment
@isaacs

isaacs May 15, 2012

@indutny Do you ever actually have delims in your urls that you need to truncate the url rather than be escaped? Without looking, what do you think this should do? url.resolve("http://foo.com/bar", "baz<baf")?

isaacs commented May 15, 2012

@indutny Do you ever actually have delims in your urls that you need to truncate the url rather than be escaped? Without looking, what do you think this should do? url.resolve("http://foo.com/bar", "baz<baf")?

@n2liquid

This comment has been minimized.

Show comment
Hide comment
@n2liquid

n2liquid May 15, 2012

@s3u I couldn't have put better; I thought about that too but couldn't find the words. However I do believe url.parse() should assume a single-URL string context... which means this stuff should be escaped, otherwise it's not a single URL, and what does it even mean to put two URL's in there?

@s3u I couldn't have put better; I thought about that too but couldn't find the words. However I do believe url.parse() should assume a single-URL string context... which means this stuff should be escaped, otherwise it's not a single URL, and what does it even mean to put two URL's in there?

@isaacs

This comment has been minimized.

Show comment
Hide comment
@isaacs

isaacs May 15, 2012

@s3u RFC 3986 is about URIs, and well outside the scope of node's url parser, which is only concerned with Uniform Resource Locators, of the sort used to address resources on the web.

Also, literally every single place where you can feed a URL to a thing that takes URLs, which is anywhere related to HTTP or JavaScript (ie, web browser address bar, <a> tags, curl, and the location objects in the dom) auto-escape delims rather than truncating.

Delims are only relevant when parsing urls in a blob of text. If you know that the entire string represents a URL, then it seems rather common to assume that they'll be escaped.

The question is really: will this break any existing programs? What gotchas will it cause?

If the answer to both questions is "none that exist, but some that I can imagine, if I try really hard", then that's a +1 for the feature.

isaacs commented May 15, 2012

@s3u RFC 3986 is about URIs, and well outside the scope of node's url parser, which is only concerned with Uniform Resource Locators, of the sort used to address resources on the web.

Also, literally every single place where you can feed a URL to a thing that takes URLs, which is anywhere related to HTTP or JavaScript (ie, web browser address bar, <a> tags, curl, and the location objects in the dom) auto-escape delims rather than truncating.

Delims are only relevant when parsing urls in a blob of text. If you know that the entire string represents a URL, then it seems rather common to assume that they'll be escaped.

The question is really: will this break any existing programs? What gotchas will it cause?

If the answer to both questions is "none that exist, but some that I can imagine, if I try really hard", then that's a +1 for the feature.

@s3u

This comment has been minimized.

Show comment
Hide comment
@s3u

s3u May 15, 2012

@isaacs minor nitpick - RFC 3986 obsoleted RFC1738 - the latter specified URLs.

s3u commented May 15, 2012

@isaacs minor nitpick - RFC 3986 obsoleted RFC1738 - the latter specified URLs.

@indutny

This comment has been minimized.

Show comment
Hide comment
@indutny

indutny May 15, 2012

Member

@isaacs you convinced me

Member

indutny commented May 15, 2012

@isaacs you convinced me

@s3u

This comment has been minimized.

Show comment
Hide comment
@s3u

s3u May 15, 2012

Since the API already assumes a context, escaping makes sense. But docs need to be clear about the context and how delims are escaped.

s3u commented May 15, 2012

Since the API already assumes a context, escaping makes sense. But docs need to be clear about the context and how delims are escaped.

@kevinohara80

This comment has been minimized.

Show comment
Hide comment
@kevinohara80

kevinohara80 May 15, 2012

@isaacs none that exist, but some that I can imagine, if I try really hard +1

@isaacs none that exist, but some that I can imagine, if I try really hard +1

@bnoordhuis

This comment has been minimized.

Show comment
Hide comment
@bnoordhuis

bnoordhuis May 15, 2012

Member

A downside is that whitespace now becomes significant. The strings 'http://foo.com' and ' http://foo.com ' currently parse to the same URL object. That will no longer be the case.

Member

bnoordhuis commented May 15, 2012

A downside is that whitespace now becomes significant. The strings 'http://foo.com' and ' http://foo.com ' currently parse to the same URL object. That will no longer be the case.

@n2liquid

This comment has been minimized.

Show comment
Hide comment
@n2liquid

n2liquid May 15, 2012

@bnoordhuis & all: in the particular case of empty characters, I believe the browsers just trim the string instead of escaping.

@bnoordhuis & all: in the particular case of empty characters, I believe the browsers just trim the string instead of escaping.

@n2liquid

This comment has been minimized.

Show comment
Hide comment
@n2liquid

n2liquid May 15, 2012

Non-blank characters are still escaped, obviously.

Non-blank characters are still escaped, obviously.

@kevinohara80

This comment has been minimized.

Show comment
Hide comment
@kevinohara80

kevinohara80 May 15, 2012

I do see the issue with the "trimmable" whitespace as @bnoordhuis mentions. This might be where most (if any) see issues.

I do see the issue with the "trimmable" whitespace as @bnoordhuis mentions. This might be where most (if any) see issues.

@isaacs

This comment has been minimized.

Show comment
Hide comment
@isaacs

isaacs May 15, 2012

Alright, seems like we have a rough consensus:

  1. Trim before parse.
  2. Escape delims rather than breaking on them.

isaacs commented May 15, 2012

Alright, seems like we have a rough consensus:

  1. Trim before parse.
  2. Escape delims rather than breaking on them.
@polotek

This comment has been minimized.

Show comment
Hide comment
@polotek

polotek May 15, 2012

I'm fine with this. Good talk.

polotek commented May 15, 2012

I'm fine with this. Good talk.

@0x80

This comment has been minimized.

Show comment
Hide comment

0x80 commented May 15, 2012

+1!

isaacs added a commit to isaacs/node-v0.x-archive that referenced this issue May 16, 2012

Fix #3270 Escape url.parse delims
Rather than omitting them.

@isaacs isaacs closed this in 9fc7283 May 16, 2012

AndreasMadsen pushed a commit to AndreasMadsen/node-v0.x-archive that referenced this issue May 30, 2012

2012.05.28, Version 0.7.9 (unstable)
* Upgrade V8 to 3.11.1

* Upgrade npm to 1.1.23

* uv: rework reference counting scheme (Ben Noordhuis)

* uv: add interface for joining external event loops (Bert Belder)

* repl, readline: Handle Ctrl+Z and SIGCONT better (Nathan Rajlich)

* fs: 64bit offsets for fs calls (Igor Zinkovsky)

* fs: add sync open flags 'rs' and 'rs+' (Kevin Bowman)

* windows: enable creating directory junctions with fs.symlink (Igor Zinkovsky, Bert Belder)

* windows: Fix fs.lstat to properly detect symlinks. (Igor Zinkovsky)

* Fix #3270 Escape url.parse delims (isaacs)

* http: make http.get() accept a URL (Adam Malcontenti-Wilson)

* Cleanup vm module memory leakage (Marcel Laverdet)

* Optimize writing strings with Socket.write (Bert Belder)

* add support for CESU-8 and UTF-16LE encodings (koichik)

* path: add path.sep to get the path separator. (Yi, EungJun)

* net, http: add backlog parameter to .listen() (Erik Dubbelboer)

* debugger: support mirroring Date objects (Fedor Indutny)

* addon: add AtExit() function (Ben Noordhuis)

* net: signal localAddress bind failure in connect (Brian Schroeder)

* util: handle non-string return value in .inspect() (Alex Kocharin)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.