url.parse security with special characters #711

Closed
4poc opened this Issue Feb 21, 2011 · 5 comments

Projects

None yet

4 participants

@4poc
4poc commented Feb 21, 2011

I just noticed that url.parse does not really verify the validity of host/hostname:

> require('url').parse('http://google.com" onload="alert(42)/')
{ href: 'http://google.com" onload="alert(42)/',
  protocol: 'http:',
  slashes: true,
  host: 'google.com" onload="alert(42)',
  hostname: 'google.com" onload="alert(42)',
  pathname: '/' }

This can lead to potential security issues, if the developer just assumes that parse will correctly detect invalid hostnames and arguably this should be handled correctly by node. Here are excerpts of the relevant RFCs (at least ", ', <, >, and ` should be considered dangerous):

RFC 1738 Uniform Resource Locators (URL):

Unsafe:

Characters can be unsafe for a number of reasons.  The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems.  The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it.  The character "%" is unsafe because it is used for
encodings of other characters.  Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are "{", "}", "|", "\", "^", "~",
"[", "]", and "`".

All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.

RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax:

The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields.  The character "#" is excluded
because it is used to delimit a URI from a fragment identifier in URI
references (Section 4). The percent character "%" is excluded because
it is used for the encoding of escaped characters.

delims      = "<" | ">" | "#" | "%" | <">

Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.

unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Data corresponding to excluded characters must be escaped in order to
be properly represented within a URI.
@isaacs
isaacs commented Feb 28, 2011

Closed by d664bf3 URL parse more safely

This does 3 things:

  1. Delimiters and "unwise" characters are never included in the
    hostname or path.
  2. url.format will sanitize string URLs that are passed to it.
  3. The parsed url's 'href' member will be the sanitized url, which may
    not match the argument to url.parse.
@coolaj86 coolaj86 pushed a commit that referenced this issue Apr 15, 2011
@isaacs @ry isaacs + ry Closes GH-711 URL parse more safely
This does 3 things:

1. Delimiters and "unwise" characters are never included in the
   hostname or path.
2. url.format will sanitize string URLs that are passed to it.
3. The parsed url's 'href' member will be the sanitized url, which may
   not match the argument to url.parse.
d664bf3
@hueniverse

This fix caused valid URI to fail. See: issue 954.

@trentm
trentm commented May 31, 2011

AFAICT this patch implicitly adds pathname: "/" to the parsed result for a URL like "http://example.com" ... which doesn't have a path part. I realize that for an HTTP request a path component is necessary, but AFAIK it is valid URL. This causes a problem when parsing a given URL and later using the parsed components to put a URL back together. The extra '/' can result in an extra leading '/'.

Thoughts on moving the implicit addition of the pathname=='/' out to urlFormat where the sanitized "href" field is being created?

@isaacs
isaacs commented Jun 1, 2011

I think for node's purposes, this makes sense. If you put http://example.com in a web browser or the href attribute of an <a> tag, it'll be interpreted as http://example.com/. You shouldn't be using plain old string concatenation to put URLs together; it's better to use url.resolve or modify the parsed object to achieve this purpose. There are a ton of edge cases.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment