Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: ignore IDN errors when domainname have two hyphens #12966

Closed

Conversation

@zimbabao
Copy link
Contributor

commented May 11, 2017

There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.

Fixes: #12965

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)
@watilde
Copy link
Member

left a comment

LGTM. This seems to be how the other URL parsers work.

test/parallel/test-url-parse-format.js Outdated
@@ -487,6 +487,16 @@ const parseTests = {
path: '/bar'
},

'https://r4---sn-a5mlrn7s.gevideo.com/bar': {

This comment has been minimized.

Copy link
@TimothyGu

TimothyGu May 11, 2017

Member

Instead of adding a test here, can you add an entry to test/fixtures/url-idna.js instead? Those cases are tested through test-whatwg-url-domainto.js. Setting mode to 'ascii' should be enough in this case.

Also, please add test cases with leading and trailing hyphens (and both) as well.

src/node_i18n.cc Outdated
@@ -516,6 +516,7 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf,
info.errors &= ~UIDNA_ERROR_EMPTY_LABEL;
info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG;
info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG;
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;

This comment has been minimized.

Copy link
@TimothyGu

TimothyGu May 11, 2017

Member

I suggest the following code, which filters out several more conditions mandated by the UTS#46 draft, and adds an explicit comment explaining why this is:

// These error conditions are mandated unconditionally by UTS #46 version
// 9.0.0 (rev. 17), but were found to be incompatible with many actual domain
// names in the wild. As such, in the current UTS #46 draft (rev. 18) these
// checks are made optional depending on the CheckHyphens flag, which will be
// disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46
// draft becomes standard.
// Refs:
// - https://github.com/whatwg/url/issues/53
// - http://www.unicode.org/review/pri317/
// - http://www.unicode.org/reports/tr46/tr46-18.html
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN;
info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;

I also suspect that these lines of code need to be duplicated for the ToUnicode method in the file, since there is a set of identical changes to ToUnicode in the UTS#46 draft.

This comment has been minimized.

Copy link
@zimbabao

zimbabao May 11, 2017

Author Contributor

Yes. Will do.
I checked those too. But was not able to find any real world example of those.

This comment has been minimized.

Copy link
@zimbabao

zimbabao May 11, 2017

Author Contributor

@TimothyGu : Made changes as suggested. Filtering errors for trailing and leading hyphens.

@domenic

This comment has been minimized.

Copy link
Member

commented May 11, 2017

What is the policy for landing changes that aren't spec-compliant? Or is the intention to move the spec first?

@domenic domenic referenced this pull request May 11, 2017
Closed

@zimbabao zimbabao force-pushed the zimbabao:fix-urls-with-hyphens-at-3-4-pos branch May 11, 2017

src/node_i18n.cc Outdated
@@ -516,6 +531,22 @@ int32_t ToASCII(MaybeStackBuffer<char>* buf,
info.errors &= ~UIDNA_ERROR_EMPTY_LABEL;
info.errors &= ~UIDNA_ERROR_LABEL_TOO_LONG;
info.errors &= ~UIDNA_ERROR_DOMAIN_NAME_TOO_LONG;
info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;

This comment has been minimized.

Copy link
@TimothyGu

TimothyGu May 11, 2017

Member

You can remove this line now that you've added my suggestion :)

// 9.0.0 (rev. 17), but were found to be incompatible with many actual domain
// names in the wild. As such, in the current UTS #46 draft (rev. 18) these
// checks are made optional depending on the CheckHyphens flag, which will be
// disabled in WHATWG URL's "domain to ASCII" algorithm as soon as the UTS #46

This comment has been minimized.

Copy link
@TimothyGu

TimothyGu May 11, 2017

Member

domain to Unicode

This comment has been minimized.

Copy link
@zimbabao

zimbabao May 11, 2017

Author Contributor

Thanks. done all changes.

@TimothyGu

This comment has been minimized.

Copy link
Member

commented May 11, 2017

@domenic I don't think we have encountered such situations before, but generally yes I think the intention is to change spec first. In this case, however, there is a disparity between the spec and what browsers do, so I think it is justified to make this change, given that upstream is aware of the problem.

@domenic

This comment has been minimized.

Copy link
Member

commented May 11, 2017

Yeah, I guess that makes sense, although with luck we can use this to prod @annevk into action on a spec fix to at least match what Node.js intends to do here.

Note that you're not matching browsers with this patch, as per whatwg/url#267 they seem to do something quite different.

@zimbabao zimbabao force-pushed the zimbabao:fix-urls-with-hyphens-at-3-4-pos branch May 11, 2017

@jasnell

This comment has been minimized.

Copy link
Member

commented May 11, 2017

Hmm... I would definitely like to see this fixed but we need to make sure that (a) the behavior matches the browsers and (b) the spec issues gets resolved. I think I'd rather hold off on landing until we get confirmation on (a) and acknowledgement that (b) will happen.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2017

@jasnell : I just created a jsbin with all above style urls and https://jsbin.com/negusoweyu/edit?html,js,console,output . Browser (I tested on chrome, Safari, Firefox). They all parse these domains and resolves them and fires request. I added entries in /etc/hosts to get them resolved.

Do you thing thats enough or any other test would be needed?. Let me know.

@zimbabao zimbabao force-pushed the zimbabao:fix-urls-with-hyphens-at-3-4-pos branch May 12, 2017

@domenic

This comment has been minimized.

Copy link
Member

commented May 12, 2017

To be clear, from reading browser source code and running more exhaustive tests than those simple ones, you'll find that browsers do not apply ToASCII at all for domains with only-ASCII characters. That is not what this PR does. (Nor is it what the spec does.)

See more detail at the issue whatwg/url#267 I linked to above.

@annevk

This comment has been minimized.

Copy link

commented May 12, 2017

It's unfortunate that what browsers do doesn't make any sense. It's the kind Latin-alphabet-favoritism we try to get rid of, although maybe it's minor enough not to matter...

But yeah, if you want to match what browsers do, only run ToASCII if you found a code point > U+007F. (I think that's a per domain determination, not per domain label.)

I don't really want to change the URL standard until I heard from implementers what we can actually do here, although I should probably add a warning.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2017

@domenic , @annevk , @TimothyGu : Will look at the browser code, tests and submit a change.
From @annevk's blog https://annevankesteren.nl/2017/02/idna-toascii-differences , all tested browsers seems to agree when domain label is ascii, so not running toASCII on ASCII only string is justified.

Shall I file tickets to browser vendors for other cases?, or you are already having that discussion?.

@domenic

This comment has been minimized.

Copy link
Member

commented May 12, 2017

We're already having that discussion... again, please look at the issue I've pointed to twice now :)

@annevk

This comment has been minimized.

Copy link

commented May 12, 2017

whatwg/url#309 has some additional information.

@joyeecheung

This comment has been minimized.

Copy link
Member

commented May 12, 2017

Looks like we will need to wait for the spec for the time being. We have a "Waiting on Spec" column in the Github project for WHATWG URL...for now I'll move this one there.

@joyeecheung joyeecheung added this to Waiting on Spec in WHATWG URL implementation May 12, 2017

@annevk

This comment has been minimized.

Copy link

commented May 13, 2017

FWIW, if it's possible for you to iterate quickly I would recommend implementing what browsers do now ("fast path" for ASCII) and later align with any standard changes if they end up being different. It doesn't seem worth it to inconvenience end users over this.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 13, 2017

@annevk : Thats what I was planning to do. Will do it over weekend.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 15, 2017

@annevk : I tried to get the fast toASCII for all ascii domain. But corresponding ToUnicode will still fail for domain names with hyphens at 3&4 places or with trailing or leading hyphens.

We can special case this by not running ToUnicode on domain labels not having "xn--" prefix.

Will send a different PR

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
url: fast path ascii domains, do not run ToASCII
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@annevk

This comment has been minimized.

Copy link

commented May 15, 2017

ToUnicode can never fail. It can only signal errors, but it never fails. See http://www.unicode.org/reports/tr46/#ToUnicode for details.

@TimothyGu

This comment has been minimized.

Copy link
Member

commented May 15, 2017

@annevk, hmm I missed that line when reading (step 2). Will note in the future.

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
url: fast path ascii domains, do not run ToASCII
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
zimbabao added a commit to zimbabao/node that referenced this pull request May 16, 2017
url: fast path ascii domains, do not run ToASCII
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309

@zimbabao zimbabao reopened this May 18, 2017

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 18, 2017

Reopening this based on this comment.

info.errors &= ~UIDNA_ERROR_HYPHEN_3_4;
info.errors &= ~UIDNA_ERROR_LEADING_HYPHEN;
info.errors &= ~UIDNA_ERROR_TRAILING_HYPHEN;

This comment has been minimized.

Copy link
@TimothyGu

TimothyGu May 18, 2017

Member

Unfortunately I was mistaken in my understanding of ToUnicode, which means that the entire block (ll. 464-485) shouldn't be necessary. While at it, can you delete this block, the erroneous || (!lenient && info.errors != 0) condition below, as well as the now-unnecessary lenient parameter?

This comment has been minimized.

Copy link
@zimbabao

zimbabao May 19, 2017

Author Contributor

Done.

@TimothyGu
Copy link
Member

left a comment

This PR LGTM except the ToUnicode issue above, but we can fix that at a later date if this PR should get in ASAP.

@zimbabao zimbabao force-pushed the zimbabao:fix-urls-with-hyphens-at-3-4-pos branch May 19, 2017

zimbabao added 4 commits May 11, 2017
url: ignore IDN errors when domainname have two hyphens
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.

Fixes: #12965
url: ignore IDN errors when domainname have hyphens
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Fixes: #12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en
url: ignore IDN errors when domainname have hyphens
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Also fix error in ToUnicode

Fixes: #12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en
Refs: whatwg/url#309 (comment)
url: ignore IDN errors when domainname have hyphens
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Fixes: #12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en

@zimbabao zimbabao force-pushed the zimbabao:fix-urls-with-hyphens-at-3-4-pos branch to 5e0f58c May 19, 2017

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 19, 2017

We had to do hacks to get around this problems with try/catch and using older url module.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 19, 2017

@TimothyGu : I made the above changes.

Also based on commit to #309 shall we add checkHyphen param, will do that as separate PR.

@TimothyGu
Copy link
Member

left a comment

Even better! LGTM.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 20, 2017

@TimothyGu, @watilde : Can one you start a CI run for this please?.

@TimothyGu

This comment has been minimized.

@zimbabao

This comment has been minimized.

Copy link
Contributor Author

commented May 20, 2017

@TimothyGu : Anything more to get this landed?.

@TimothyGu

This comment has been minimized.

Copy link
Member

commented May 20, 2017

Nope, landed in 06a617a.

Note that for future contributions, the first line of the commit message should be under 50 characters per our commit message guidelines. I've amended the message this time, however.

Thanks again for staying through on this PR!

@TimothyGu TimothyGu closed this May 20, 2017

TimothyGu added a commit that referenced this pull request May 20, 2017
url: update IDNA error conditions
This commit contains three separate changes:

- Always return a string from ToUnicode no matter if an error occurred.
- Disable CheckHyphens boolean flag. This flag will soon be enabled in
  the URL Standard, but is implemented manually by selectively ignoring
  certain errors.
- Enable CheckBidi boolean flag per URL Standard update.

This allows domain names with hyphens at 3 and 4th position, as well as
those with leading and trailing hyphens. They are technically invalid,
but seen in the wild.

Tests are updated and simplified accordingly.

PR-URL: #12966
Fixes: #12965
Refs: whatwg/url#309
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>

@TimothyGu TimothyGu moved this from Waiting on Spec to Done in WHATWG URL implementation May 20, 2017

@jasnell jasnell referenced this pull request May 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
8 participants
You can’t perform that action at this time.