Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

url.parse misinterprets hostname #8520

Closed
pierre-elie opened this issue Oct 6, 2014 · 32 comments
Closed

url.parse misinterprets hostname #8520

pierre-elie opened this issue Oct 6, 2014 · 32 comments

Comments

@pierre-elie
Copy link

Node v0.10.32

url.parse('https://good.com+.evil.org/?accessToken=xxx');

returns

{ 
  protocol: 'https:',
  slashes: true,
  auth: null,
  host: 'good.com',
  port: null,
  hostname: 'good.com',
  hash: null,
  search: '?accessToken=xxx',
  query: 'accessToken=xxx',
  pathname: '/+.evil.org/',
  path: '/+.evil.org/?accessToken=xxx',
  href: 'https://good.com/+.evil.org/?accessToken=xxx'
}

url.parse misinterpreted https://good.com+.evil.org/ as https://good.com/+.evil.org/
If we use url.parse to check the validity of the hostname, the test passes, but in the browser the user is redirected to the evil.org website.

Other characters than + might do the trick too.

@chrisdickinson
Copy link

I've confirmed this on both v0.10.32 as well as 0.11.15-pre. Curl will hit "evil.org", not "good.com".

@trevnorris
Copy link

Would that simply be an invalid hostname? In which case, ideas how that should be handled?

@cdlewis
Copy link

cdlewis commented Oct 18, 2014

I agree, the hostname is invalid. At the moment, during validation, the parsing function stops when it first encounters an invalid hostname part, adding everything else to the 'rest' of the string. This issue could be resolved by simply giving up when we encounter and invalid part, returning an empty object. This would also be consistent with the documentation, which promises only to return fields that existed in the query string.

@indutny
Copy link
Member

indutny commented Oct 20, 2014

@trevnorris let's throw.

@trevnorris
Copy link

I'm fine with throwing. Anyone want to open a PR for this?

@jondavidjohn
Copy link

@jondavidjohn
Copy link

Was going to grab this (to get my feet wet), but I'm sure someone will do it much faster than I can get up and running with the tooling. Regarding the test above I reference and after looking at the parse code, it would seem that it's terminating the host, based on nonHostChars when I think it might be better served to use hostEndingChars instead.

Also throwing here seems awkward as @cdlewis points out. It doesn't look like anywhere else in this function throws (with the exception being the type check of the argument).

What about simply changing the test I reference above from this...

  // an unexpected invalid char in the hostname.
  'HtTp://x.y.cOm*a/b/c?d=e#f g<h>i' : {
    'href': 'http://x.y.com/*a/b/c?d=e#f%20g%3Ch%3Ei',
    'protocol': 'http:',
    'slashes': true,
    'host': 'x.y.com',   // <---------------
    'hostname': 'x.y.com',
    'pathname': '/*a/b/c',
    'search': '?d=e',
    'query': 'd=e',
    'hash': '#f%20g%3Ch%3Ei',
    'path': '/*a/b/c?d=e'
  },

To this.

  // an unexpected invalid char in the hostname.
  'HtTp://x.y.cOm*a/b/c?d=e#f g<h>i' : {
    'href': 'http://x.y.com*a/b/c?d=e#f%20g%3Ch%3Ei',
    'protocol': 'http:',
    'slashes': true,
    'host': 'x.y.com*a',   // <----------------
    'hostname': 'x.y.com*a',
    'pathname': '/b/c',
    'search': '?d=e',
    'query': 'd=e',
    'hash': '#f%20g%3Ch%3Ei',
    'path': '/b/c?d=e'
  },

It seems outside the scope to check the validity of the host, but to try to accurately extract it from what's provided. This would also solve the OP's problem because then the host would be good.com+.evil.org which would be an obvious mismatch.

@indutny
Copy link
Member

indutny commented Oct 24, 2014

No, this is invalid and should not happen in url.

@jondavidjohn
Copy link

A little confused by the response, maybe add a little detail?

I just think throwing is a bad idea.

This would make the common usage of url.parse() to need a try/catch, since it's often you need this function to specifically parsing unknown input. A main line usage doesn't seem like a good candidate for throwing in my opinion.

@jondavidjohn
Copy link

Another option would be to make hostname empty (similar to how hostnames over the max length are handled).

@pierre-elie
Copy link
Author

I tend to agree with @jondavidjohn

jondavidjohn added a commit to jondavidjohn/node that referenced this issue Oct 24, 2014
Regarding nodejs#8520

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
@trevnorris
Copy link

I'd be fine if we added url.valid(string) to the API. That way the URL could go without throwing.

@jondavidjohn
Copy link

I don't know why we'd expand this fix into API changing territory.

Throwing here would cause existing code to behave differently.

Having a way to pre-check validity of the input is good (JSON.parse anyone? 😠), but I think adding a method, and changing behavior of an existing one is a bit over the top when my pull above addresses the issue while maintaining current expected behavior for programs written today.

I don't think requiring

if (url.isValid(str)) {
  var parsed = url.parse(str);
}

everywhere, is a lot different than requiring

try {
  var parsed = url.parse(str);
} catch (e) {}

If we decided to throw in this particular spot, for the sake of consistency we'll also want to throw when the hostname is too long also.

So, have you reviewed my pull above? Do I need to change it to the throwing approach instead? And if so, would it need to be targeted at master?

@trevnorris
Copy link

As for changes to the current url.parse(), I think we should make sure it fully conforms to the whatwg spec (https://url.spec.whatwg.org/). Then determine what should happen with anything designated a "parse error".

@jondavidjohn
Copy link

I feel like my questions / feedback are not being read.

I understand this is a little insignificant issue in the scheme of things and likely has a fraction of your attention, I'm just trying to fix the reporter's obvious issue consistently within the constraints of the v0.10.x API expectations.

Is this the wrong approach?

@aredridel
Copy link

-1 for throwing. That's a huge and backward-incompatible change to the calling convention.

I like the idea of blanking the hostname for an invalid hostname; perhaps having an "invalid" sub-object with the bad parts assigned to it; or perhaps just drop them.

@Qard
Copy link
Member

Qard commented Nov 5, 2014

For reference, this is how new URL(...) behaves in Chrome 38.

screenshot from 2014-11-04 16 59 43

@pierre-elie
Copy link
Author

Following @Quard comment and the fact that Curl hits evil.com: is the hostname really invalid or do Curl and Chrome have an issue?

@trevnorris
Copy link

Definitely not going to break backwards compatibility in v0.10. We can discuss what we want to do going forward though.

I'm not convinced throwing is the right approach in general, since it could possibly break a lot of user's existing code.

Now, as for the results that @Qard showed. It looks like Chrome simply eats the bad URL. Not sure how I feel about that either.

@jondavidjohn
Copy link

@trevnorris do you have any specific issue with the approach of simply returning an empty host/hostname field when an invalid hostname part is detected? This seems to be the safest "path of least surprise" option in my mind since this is how we currently handle hostnames that are too long.

@Qard
Copy link
Member

Qard commented Nov 5, 2014

@trevnorris For what it's worth, Firefox works the same. The browsers are at least consistent in their wrongness.

@chrisdickinson
Copy link

I'm come around to the opinion that the Chrome 38/Firefox output that @Qard showed would be the desired output of url.parse(https://good.com+.evil.org/?accessToken=xxx) -- while it does put the onus on the developer to double check that the host is appropriate, at least by returning good.com+.evil.org the user can make that determination (vs. our current "return good.com" behavior). Returning nothing would be very surprising, as would throwing an exception.

@jondavidjohn
Copy link

@chrisdickinson That would would be my first choice. I can implement it this way instead, just need to wait and see which way yall want to fix it.

@Qard
Copy link
Member

Qard commented Nov 5, 2014

In reading the host parsing section of the spec, I see no mention of + being an invalid symbol in the host portion of the string. https://url.spec.whatwg.org/#host-parsing

Is there somewhere it is documented as being invalid?

@trevnorris
Copy link

@tjfontaine @indutny Thoughts on having parse() work the same as it does in the browser? (e.g. output matches that of new URL())

@trevnorris
Copy link

@Qard ah, you're right. + isn't an invalid character. In that case I would consider it a bug on the part of parse().

@jondavidjohn
Copy link

@trevnorris So you guys would like a pull to align it with the browser behavior noted above by @Qard then?

Tests passing looking like this?

  // an unexpected invalid char in the hostname.
  'http://x.y.com+a/b/c' : {
    'href': 'http://x.y.com+a/b/c',
    'protocol': 'http:',
    'slashes': true,
    'host': 'x.y.com+a',   // <---------------
    'hostname': 'x.y.com+a',
    'pathname': '/b/c',
    'search': '',
    'query': '',
    'hash': '',
    'path': '/b/c'
  },

@trevnorris
Copy link

@jondavidjohn I don't care much about matching browser behavior, but following the spec. In this specific case '+' isn't an invalid character, thus should be left alone. But yes on the PR. Make sure to mention what @Qard pointed out in https://url.spec.whatwg.org/#host-parsing.

jondavidjohn added a commit to jondavidjohn/node that referenced this issue Nov 26, 2014
Regarding nodejs#8520

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
@jondavidjohn
Copy link

@trevnorris PR is up, let me know if it's not what you had in mind.

jondavidjohn added a commit to jondavidjohn/node that referenced this issue Nov 26, 2014
Regarding nodejs#8520

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
jondavidjohn added a commit to jondavidjohn/node that referenced this issue Nov 26, 2014
Regarding nodejs#8520

This approach changes hostname validation from a whitelist approach
to a blacklist approach as described in https://url.spec.whatwg.org/#host-parsing.

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
jondavidjohn added a commit to jondavidjohn/node that referenced this issue Nov 26, 2014
Regarding nodejs#8520

This approach changes hostname validation from a whitelist approach
to a blacklist approach as described in https://url.spec.whatwg.org/#host-parsing.

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
jondavidjohn added a commit to jondavidjohn/node that referenced this issue Nov 27, 2014
Regarding nodejs#8520

This changes hostname validation from a whitelist regex approach
to a blacklist regex approach as described in https://url.spec.whatwg.org/#host-parsing.

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
trevnorris pushed a commit that referenced this issue Dec 3, 2014
Regarding #8520

This changes hostname validation from a whitelist regex approach
to a blacklist regex approach as described in https://url.spec.whatwg.org/#host-parsing.

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
piscisaureus pushed a commit to piscisaureus/node2 that referenced this issue Dec 9, 2014
Regarding nodejs/node-v0.x-archive#8520

This changes hostname validation from a whitelist regex approach
to a blacklist regex approach as described in https://url.spec.whatwg.org/#host-parsing.

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
piscisaureus pushed a commit to nodejs/node that referenced this issue Dec 9, 2014
Regarding nodejs/node-v0.x-archive#8520

This changes hostname validation from a whitelist regex approach
to a blacklist regex approach as described in https://url.spec.whatwg.org/#host-parsing.

url.parse misinterpreted `https://good.com+.evil.org/`
as `https://good.com/+.evil.org/`.  If we use url.parse to check the
validity of the hostname, the test passes, but in the browser the
user is redirected to the evil.org website.
petkaantonov added a commit to petkaantonov/urlparser that referenced this issue Jan 8, 2015
'+' is considered now considered a valid host character
@mathiasbynens
Copy link

This seems fixed in Node.js v0.12.

$ nvm use 0.10
Now using node v0.10.38

$ node -e "var parsed = require('url').parse('https://good.com+.evil.com/'); console.log(parsed.host);"
good.com

$ nvm use 0.12
Now using node v0.12.0

$ node -e "var parsed = require('url').parse('https://good.com+.evil.com/'); console.log(parsed.host);"
good.com+.evil.com

@jasnell
Copy link
Member

jasnell commented Jun 3, 2015

@joyent/node-coreteam ... this is fixed in v0.12, but not in v0.10. Do we want to backport?

@jasnell
Copy link
Member

jasnell commented Jun 3, 2015

See: nodejs/node#49

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests