Regression in unicode handling in url.parse #1149

jeremys opened this Issue Jun 3, 2011 · 6 comments


None yet

3 participants

jeremys commented Jun 3, 2011

Hi there,

I've detected a regression since Node 0.4.6 regarding the handling of unicode characters when parsing url. A test case is worth a thousand words so here we go:

Here's the gist:
The code:

var url = require('url'),
    assert = require('assert');
var test = 'http://➡.ws/pageloads';
var parsed = url.parse(test);
assert.equal(, '➡.ws');
console.log('URL parsed correctly.');

With Node 0.4.6:

j:Code jeremy$ node -v
j:Code jeremy$ node hostunicode.js 
URL parsed correctly.

With Node 0.4.8:

j:Code jeremy$ node -v
j:Code jeremy$ node hostunicode.js 

        throw e; // process.nextTick error, or 'error' event on first tick
AssertionError: "➡.ws" == ""
    at Object.<anonymous> (/Users/jeremy/Code/hostunicode.js:5:8)
    at Module._compile (module.js:407:26)
    at Object..js (module.js:413:10)
    at Module.load (module.js:339:31)
    at Function._load (module.js:298:12)
    at Array.<anonymous> (module.js:426:10)
    at EventEmitter._tickCallback (node.js:126:26)

And I also think node does not deal very well with IDNA (see but I will fill that seperatly with a test case.

isaacs commented Jun 3, 2011

Yeah, this happened because of a fix to prevent delimiters and other characters from getting in there.

This is one area where JavaScript's lack of proper unicode support on RegExps is really painful. It seems like we can maybe just blacklist known-disallowed characters in the 0x00-0x80 range, since even punctuation outside of ASCII seems to be allowed in a domain name.

jeremys commented Jun 6, 2011

Or maybe a solution is to implement IDNA encoding, and process the encoding first, then apply the current blacklist on the result?

Because we have another issue regarding unicode handling in http. Even with node 0.4.6 when I request http://➡.ws, it end up requesting http://â�¡.ws which of course fail because we should encode the domain.

What would be nice is that url.parse idna-encode the hostname before applying validation. But idna encoding is a big deal. Not sure if Node can leverage this from somewhere else (Python or something).

isaacs commented Jun 6, 2011

Yeah, I think if we're going to do IDNA (which seems necessary for http to properly request utf hostnames), we'll need to do it at the url.parse layer. I'm thinking that url.parse should automatically convert to punycode, and thus, url.format will never output a non-ascii hostname. Since browsers treat punycode and utf hostnames the same, it should be the safest approach, if not the prettiest output.

jeremys commented Jun 6, 2011

I don't know how do you plan to do IDNA but there is this interesting lib by Google around URL parsing: — It's used in Chromium. But, I'm no expert, I don't know if it can be used within Node.

isaacs commented Jun 6, 2011

Yeah, I don't know if we want to pull the url parsing/formatting logic into the C layer. That's a pretty big library, and does a lot of stuff that node doesn't really need.

ry commented Jun 13, 2011

See #1174

@isaacs isaacs added a commit that closed this issue Jul 6, 2011
@jeremys @isaacs jeremys + isaacs Close #1149 IDNA and Punycode support in url.parse
Using @bnoordhuis's punycode lib.

Close #1174 also
@isaacs isaacs closed this in 2a848fa Jul 6, 2011
@isaacs isaacs added a commit to isaacs/node-v0.x-archive that referenced this issue Jul 19, 2011
@jeremys @isaacs jeremys + isaacs Close #1149 IDNA and Punycode support in url.parse
Using @bnoordhuis's punycode lib.

Close #1174 also
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment