Regression in unicode handling in url.parse #1149

Closed
jeremys opened this Issue Jun 3, 2011 · 6 comments

Projects

None yet

3 participants

@jeremys
jeremys commented Jun 3, 2011

Hi there,

I've detected a regression since Node 0.4.6 regarding the handling of unicode characters when parsing url. A test case is worth a thousand words so here we go:

Here's the gist: https://gist.github.com/1006680
The code:

var url = require('url'),
    assert = require('assert');
var test = 'http://➡.ws/pageloads';
var parsed = url.parse(test);
assert.equal(parsed.host, '➡.ws');
console.log('URL parsed correctly.');

With Node 0.4.6:

j:Code jeremy$ node -v
v0.4.6
j:Code jeremy$ node hostunicode.js 
URL parsed correctly.

With Node 0.4.8:

j:Code jeremy$ node -v
v0.4.8
j:Code jeremy$ node hostunicode.js 

node.js:134
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
AssertionError: "➡.ws" == ""
    at Object.<anonymous> (/Users/jeremy/Code/hostunicode.js:5:8)
    at Module._compile (module.js:407:26)
    at Object..js (module.js:413:10)
    at Module.load (module.js:339:31)
    at Function._load (module.js:298:12)
    at Array.<anonymous> (module.js:426:10)
    at EventEmitter._tickCallback (node.js:126:26)

And I also think node does not deal very well with IDNA (see http://tools.ietf.org/html/rfc3490.html) but I will fill that seperatly with a test case.

@isaacs
isaacs commented Jun 3, 2011

Yeah, this happened because of a fix to prevent delimiters and other characters from getting in there.

This is one area where JavaScript's lack of proper unicode support on RegExps is really painful. It seems like we can maybe just blacklist known-disallowed characters in the 0x00-0x80 range, since even punctuation outside of ASCII seems to be allowed in a domain name.

@jeremys
jeremys commented Jun 6, 2011

Or maybe a solution is to implement IDNA encoding, and process the encoding first, then apply the current blacklist on the result?

Because we have another issue regarding unicode handling in http. Even with node 0.4.6 when I request http://➡.ws, it end up requesting http://â�¡.ws which of course fail because we should encode the domain.

What would be nice is that url.parse idna-encode the hostname before applying validation. But idna encoding is a big deal. Not sure if Node can leverage this from somewhere else (Python or something).

@isaacs
isaacs commented Jun 6, 2011

Yeah, I think if we're going to do IDNA (which seems necessary for http to properly request utf hostnames), we'll need to do it at the url.parse layer. I'm thinking that url.parse should automatically convert to punycode, and thus, url.format will never output a non-ascii hostname. Since browsers treat punycode and utf hostnames the same, it should be the safest approach, if not the prettiest output.

@jeremys
jeremys commented Jun 6, 2011

I don't know how do you plan to do IDNA but there is this interesting lib by Google around URL parsing: http://code.google.com/p/google-url/ — It's used in Chromium. But, I'm no expert, I don't know if it can be used within Node.

@isaacs
isaacs commented Jun 6, 2011

Yeah, I don't know if we want to pull the url parsing/formatting logic into the C layer. That's a pretty big library, and does a lot of stuff that node doesn't really need.

@ry
ry commented Jun 13, 2011

See #1174

@isaacs isaacs added a commit that closed this issue Jul 6, 2011
@jeremys @isaacs jeremys + isaacs Close #1149 IDNA and Punycode support in url.parse
Using @bnoordhuis's punycode lib.

Close #1174 also
2a848fa
@isaacs isaacs closed this in 2a848fa Jul 6, 2011
@isaacs isaacs added a commit to isaacs/node-v0.x-archive that referenced this issue Jul 19, 2011
@jeremys @isaacs jeremys + isaacs Close #1149 IDNA and Punycode support in url.parse
Using @bnoordhuis's punycode lib.

Close #1174 also
786c714
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment