url.parse security with special characters #711

4poc · 2011-02-21T18:12:04Z

I just noticed that url.parse does not really verify the validity of host/hostname:

> require('url').parse('http://google.com" onload="alert(42)/')
{ href: 'http://google.com" onload="alert(42)/',
  protocol: 'http:',
  slashes: true,
  host: 'google.com" onload="alert(42)',
  hostname: 'google.com" onload="alert(42)',
  pathname: '/' }

This can lead to potential security issues, if the developer just assumes that parse will correctly detect invalid hostnames and arguably this should be handled correctly by node. Here are excerpts of the relevant RFCs (at least ", ', <, >, and ` should be considered dangerous):

RFC 1738 Uniform Resource Locators (URL):

Unsafe:

Characters can be unsafe for a number of reasons.  The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems.  The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it.  The character "%" is unsafe because it is used for
encodings of other characters.  Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are "{", "}", "|", "\", "^", "~",
"[", "]", and "`".

All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.

RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax:

The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields.  The character "#" is excluded
because it is used to delimit a URI from a fragment identifier in URI
references (Section 4). The percent character "%" is excluded because
it is used for the encoding of escaped characters.

delims      = "<" | ">" | "#" | "%" | <">

Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.

unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Data corresponding to excluded characters must be escaped in order to
be properly represented within a URI.

The text was updated successfully, but these errors were encountered:

isaacs · 2011-02-27T18:32:14Z

Fix: https://gist.github.com/846399

isaacs · 2011-02-28T01:19:04Z

Closed by d664bf3 URL parse more safely

This does 3 things:

Delimiters and "unwise" characters are never included in the
hostname or path.
url.format will sanitize string URLs that are passed to it.
The parsed url's 'href' member will be the sanitized url, which may
not match the argument to url.parse.

This does 3 things: 1. Delimiters and "unwise" characters are never included in the hostname or path. 2. url.format will sanitize string URLs that are passed to it. 3. The parsed url's 'href' member will be the sanitized url, which may not match the argument to url.parse.

hueniverse · 2011-04-19T19:07:53Z

This fix caused valid URI to fail. See: issue 954.

trentm · 2011-05-31T23:19:07Z

AFAICT this patch implicitly adds pathname: "/" to the parsed result for a URL like "http://example.com" ... which doesn't have a path part. I realize that for an HTTP request a path component is necessary, but AFAIK it is valid URL. This causes a problem when parsing a given URL and later using the parsed components to put a URL back together. The extra '/' can result in an extra leading '/'.

Thoughts on moving the implicit addition of the pathname=='/' out to urlFormat where the sanitized "href" field is being created?

isaacs · 2011-06-01T00:42:58Z

I think for node's purposes, this makes sense. If you put http://example.com in a web browser or the href attribute of an <a> tag, it'll be interpreted as http://example.com/. You shouldn't be using plain old string concatenation to put URLs together; it's better to use url.resolve or modify the parsed object to achieve this purpose. There are a ton of edge cases.

Giagnus64 mentioned this issue Jan 25, 2020

Attached files with # at the beginning of filename are renamed into *.png BoostIO/BoostNote-Legacy#3389

Open

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

url.parse security with special characters #711

url.parse security with special characters #711

4poc commented Feb 21, 2011

isaacs commented Feb 27, 2011

isaacs commented Feb 28, 2011

hueniverse commented Apr 19, 2011

trentm commented May 31, 2011

isaacs commented Jun 1, 2011

url.parse security with special characters #711

url.parse security with special characters #711

Comments

4poc commented Feb 21, 2011

isaacs commented Feb 27, 2011

isaacs commented Feb 28, 2011

hueniverse commented Apr 19, 2011

trentm commented May 31, 2011

isaacs commented Jun 1, 2011