
Who must normalise URIs? #788

Closed
mnot opened this issue Mar 1, 2021 · 15 comments · Fixed by #812

Comments

@mnot
Member

mnot commented Mar 1, 2021

The text on http(s) Normalization and Comparison doesn't assign any behaviours to roles; it just passively states that http(s) URIs 'are' normalised and compared.

This leads to some confusion about what roles are responsible for normalising. E.g., can a server consider https://example.com/foo/ and https://example.com/bar/../foo/ as two different target URIs? I think we all agree on the answer to that, but it isn't clearly stated in the specs AFAICT.
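For reference, the two example targets do collapse to the same path once dot segments are removed. Below is a minimal Python sketch of the remove_dot_segments algorithm from RFC 3986, Section 5.2.4 (the function name and structure are illustrative, not from the spec or this thread):

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 sec. 5.2.4, applied to the path component only."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = "/" + path[3:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = "/" + path[4:]
            if output:
                output.pop()          # drop the preceding segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (up to, but not including, the
            # next "/") from the input to the output buffer.
            i = path.find("/", 1)
            if i == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:i])
                path = path[i:]
    return "".join(output)

# Both target URIs in the question reduce to the same path:
assert remove_dot_segments("/bar/../foo/") == "/foo/"
assert remove_dot_segments("/foo/") == "/foo/"
```

Note that this operates on an already-absolute path; whether any given role (client, intermediary, server) is obliged to run it is exactly the open question here.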

@mnot mnot added the semantics label Mar 1, 2021
@annevk
Contributor

annevk commented Mar 1, 2021

I'm not sure about that example (as a browser would never emit the latter), but they can certainly rely on some of the other variants and have been known to do so.

@reschke
Contributor

reschke commented Mar 2, 2021

We currently just point to https://www.rfc-editor.org/rfc/rfc3986.html#section-6 - so "path segment normalization" is something that may or may not happen, depending on who's doing the comparison.

> This leads to some confusion about what roles are responsible for normalising. E.g., can a server consider https://example.com/foo/ and https://example.com/bar/../foo/ as two different target URIs? I think we all agree on the answer to that, but it isn't clearly stated in the specs AFAICT.

My answer would be "a server could do that, but it would be stupid". What's your take?

@mnot
Member Author

mnot commented Mar 2, 2021

I agree with your answer, but think we should say something about it -- perhaps going as far as saying a server should/must not do that (as it's not interoperable).

@martinthomson
Contributor

Security is probably a stronger reason not to allow /../ in request targets, which is another reason that specific example (as opposed to, say, '+' vs '%20') might be a bad one to use.

@reschke
Contributor

reschke commented Mar 3, 2021

see also #228

@royfielding
Member

I think the comment is more confusing than the spec. The request target is whatever is provided by the client. If the server responds to the second example target with a 301/302 to the first, then it does consider them to be two different target URIs for the same resource. In practice, that is the best way to respond because it forces the client to change its own copy of the target URI before it makes the "right" request. IIRC, that's what we do in Apache httpd.

This type of response is not canonicalization -- it's reasonable handling of an unsafe target without breaking the intended request. We might call that responsible handling after the client fails to canonicalize a reference, but keep in mind that a ".." path segment only has meaning for relative references (not references that are already in absolute form). In any case, a client is not required to canonicalize beyond resolving of relative form to absolute form.

Some other servers might want to respond with 403 (e.g., an authoring server responding to a link checker). That's okay too.

@mnot
Member Author

mnot commented Mar 4, 2021

> but keep in mind that a ".." path segment only has meaning for relative references (not references that are already in absolute form).

On the face of it, that conflicts with this commentary in 3986:

> However, some deployed implementations incorrectly assume that reference resolution is not necessary when the reference is already a URI and thus fail to remove dot-segments when they occur in non-relative paths. URI normalizers should remove dot-segments by applying the remove_dot_segments algorithm to the path, as described in Section 5.2.4.

Regardless, the question is whether we give any advice about servers considering these to be genuinely separate resources (putting aside corrective actions, as you illustrate), as it's not practically interoperable with most HTTP implementations. Even something in the normalisation section to the effect of:

> Note that implementations - client, server and intermediary - can and often do normalise URLs when handling messages, although they are not required to. This means that when two URIs normalise to the same string, it is likely that it will be difficult to distinguish them in an interoperable fashion.

@ByteEater-pl

ByteEater-pl commented Mar 5, 2021

There's clearly no consensus on whether the current relevant specs (RFC 3986 and RFC 7230) allow it or not. This is indeed an obscure case: modern browsers don't request such URIs (not even programmatically, from content scripts), and neither does cURL with default options. But it should still be settled, for the sake of implementors of web servers and of other backend tools, frameworks, and authoring tools that help with IA by emitting appropriate server configuration (the question being whether they should allow creating different resources at such URIs, possibly with a discouraging message, or even assume, for a site not created in them, that the URIs must be the same). It also matters for defensive coders who believe, as Sir Tim does, that cool URIs don't change, and who want to tightly control the set of URIs they respond to with representations of resources, thereby admitting that each such URI is valid and identifies some resource. (I do. Therefore I forbid trailing slashes, empty path segments, query strings containing only a question mark, and so on, defaulting to 404 when they're encountered, except when they are added to a whitelist after explicit consideration.)

For all the reasons above I ask you to either declare requests with dot segments in targets invalid (some handling may, and probably should, be defined; but if a client fails to conform to standards by sending malformed requests and gets inconsistent responses, so be it, some bets may be off, and the semantics are preserved on condition of conformance), or specify, at MUST level, that the URIs are equivalent (in which case it's the servers that are non-conforming if they serve different resources at them).

There are also purely practical reasons to do either of the two (as opposed to a SHOULD contemplated in some comments above). Being an uncommon case, it presents a problem for implementations, causing numerous bugs, and confusion (around bugs or otherwise) even among knowledgeable people. Cf. a sample of links resulting from my research of the subject:

@ByteEater-pl

As for percent-encoding, Section 6 of RFC 3986 seems to have the answer: servers aren't allowed to interpret URIs differing only in percent-encoding as identifying different resources. However, I have three more remarks:

  • I wish the MUST requirement were stated explicitly to avoid confusion among implementors, content authors, server administrators and other audiences.
  • The following excerpt seems to erroneously limit the whole of Section 6 to relative URI references (erroneously, because absolute examples are present in the following section), thus excluding e.g. the absolute-form of request-targets (RFC 7230, Section 5.3.2):

> In testing for equivalence, applications should not directly compare
> relative references; the references should be converted to their
> respective target URIs before comparison.

  • The distinction, introduced in RFC 3986, between URIs (which are always absolute and don't undergo resolution) and URI references (which may be absolute or relative; Section 5 defines how to resolve them to obtain a URI) is subtle. Extra care should be exercised to pick the right term each time, particularly with normalization. Currently it's not easy to find, and definitively interpret, text in the standards clarifying whether both are normalized, when, and how the normalization rules for them differ.
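To make the percent-encoding and case rules under discussion (RFC 3986, Section 6.2.2) concrete, here is a rough Python sketch of syntax-based normalization. The function and helper names are mine, and it deliberately ignores userinfo and IPv6 literals for brevity, so it is an illustration rather than a complete implementation:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 sec. 2.3; percent-encodings of
# these should be decoded during normalization.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def _fix_pct(match: re.Match) -> str:
    # Decode percent-encoded unreserved characters; uppercase the
    # hex digits of everything else (case normalization, sec. 6.2.2.1).
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()

def syntax_normalize(uri: str) -> str:
    """Sketch of syntax-based normalization; assumes no userinfo."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme.lower(),                      # scheme to lowercase
        _PCT.sub(_fix_pct, parts.netloc).lower(),  # host to lowercase
        _PCT.sub(_fix_pct, parts.path),
        _PCT.sub(_fix_pct, parts.query),
        _PCT.sub(_fix_pct, parts.fragment),
    ))

assert syntax_normalize("HTTP://Example.COM/%7euser") == "http://example.com/~user"
```

A percent-encoded reserved character stays encoded (only its hex case changes), which is why `%2F` and a literal `/` remain distinguishable after this step.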

@timbray

timbray commented Mar 5, 2021

https://tools.ietf.org/html/rfc3986#section-6 is useful to the extent it provides a way to talk about this. Would it be sane to say that servers MUST do the things described in 6.2.2 of that document? When I'm a client implementor I would be shocked, and my caching software might well break, in dealing with a server that didn't.

Section 6.2.3? I think so probably, but it's not nearly as clear and concise as 6.2.2.
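For comparison, the scheme-based rules of 6.2.3 amount to things like eliding the default port and normalizing an empty path to "/". A hedged sketch for http(s) only, assuming a URI without userinfo (function name mine):

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def scheme_normalize(uri: str) -> str:
    """Sketch of RFC 3986 sec. 6.2.3 for http(s); ignores userinfo."""
    parts = urlsplit(uri)
    netloc = parts.hostname or ""
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        netloc += f":{parts.port}"
    path = parts.path or "/"   # an empty path normalizes to "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))

assert scheme_normalize("http://example.com:80") == "http://example.com/"
assert scheme_normalize("https://example.com:8443/x") == "https://example.com:8443/x"
```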

@ByteEater-pl

@timbray, I concur: make the language of 6.2.2 and 6.2.3 more precise, with uppercase MUSTs instead of the present "should"s. On the other hand, 6.2.4 is bogus in this context; it could be moved elsewhere and rephrased to be clearly just informative advice. A trailing slash does make a URI semantically different, at least with the http: and https: schemes, unless it's immediately preceded by an authority.

@mnot
Member Author

mnot commented Mar 6, 2021

Saying servers MUST do anything has the effect of requiring intermediaries to modify URIs on the way through. That's clearly not going to happen, especially given how many different places URIs occur (request target, Location, Content-Location, Link, etc.).

OTOH there are some security implications if policy is applied before normalisation. Caches are less efficient as well (not an interop problem, just an efficiency one).

Roy, you say 'The request target is whatever is provided by the client.' Do you mean the bytes on the wire (if so, what about h2/h3)? Or do you mean after some amount of processing? I think we're just talking about the level of processing that's necessary before the protocol element is consumed (but not necessarily forwarded).

@ByteEater-pl

> Saying servers MUST do anything has the effect of requiring intermediaries to modify URIs on the way through.

Is there a notion of a server which isn't an intermediary for a given request? If so, how about requiring it only of them?

If not, can creators and administrators of resources (people minting URIs for them, either by hand or directing some tool to do it, possibly dynamically) be distinguished as an audience subject to its own set of requirements and have them MUST conform? The advantage of this approach would lie in abstracting from authorities, origins, websites, servers and intermediaries – just the assumptions about the semantics embodied in returned representations and metadata (responses in the case of HTTP) would count as the measure by which to judge conformance. (And they could achieve it with a single server, a server cluster, a CDN, proxies, including non-transparent ones, or whatever else.)

@mnot mnot self-assigned this Mar 22, 2021
mnot added a commit that referenced this issue Mar 26, 2021
@mnot
Member Author

mnot commented Mar 26, 2021

See PR.

@ByteEater-pl

Thank you, LGTM mostly.

It still allows minting URIs for different resources which are equal after normalization; it's a SHOULD NOT. I assume the reluctance to go with MUST NOT results from the concerns raised above, but it's such a corner case that support for it should, IMO, be trumped by clarity of semantics and ease of use and implementation. Moreover, users and implementors are, by a MAY, explicitly allowed to conflate such URIs, making some resources unavailable via URI from their perspective. So it's a very strong SHOULD NOT, and groups of people wishing to violate it for a good cause (whatever that may be; can you give a use case?) have to ensure there's nothing in their systems or in any intermediaries that would thwart their endeavour by performing normalization (which is always legal per the proposed spec).

Would it be feasible to spec that equivalent URIs simply do identify the same resource, and that if different representations are returned for different forms, it's a strange but unambiguously identifiable resource? That's already exactly the semantics for URIs resolved by servers using their file systems, which is the most popular method. If there's an RSS feed at /info, then the file gets deleted, and after some time a new webmaster (possibly not even knowing that there used to be an RSS feed) places an icon (like ℹ️) there and configures the appropriate Content-Type, the resource doesn't change. It's just a weird resource whose representation (independent of other factors, thus no Vary) is an RSS feed for some period, then it has no representation (404 Not Found or 410 Gone), and still later it's an icon.

This doesn't break backward compatibility with implementations doing crazy things; those things remain technically spec compliant. It just pinpoints the semantics, saying that, despite possibly different representations, some things are the same resource (which the current specs leave unclear, so some might have assumed otherwise; but, as I wrote above, it's such a corner case that I'd be surprised to learn somebody actually not only held this interpretation but also relied on it). The added burden of compliance with the stricter semantics falls on people minting URIs for resources. And if they're unaware of normalization, we'd actually do them a favour by defaulting to the likely desirable thing: having all the equivalent forms of the URI identify the same resource (and yield the same representation, unless they put in enough effort to circumvent normalization and serve different representations of the same resource depending on which of the equivalent URIs was used; but doing that indicates they do know of this stuff).

One more nit. Is normalization all or nothing? Or are HTTP components allowed to e.g. only omit port 80 and keep dot segments? I think it should be specified and prefer the former (which also follows Postel's law), though I realize there may be popular implementations doing otherwise; if so, a SHOULD is probably the most that could be counted on. (For completeness: are they allowed to denormalize, e.g. introduce case variation in the authority part?)

7 participants