Who must normalise URIs? #788
Comments
I'm not sure about that example (as a browser would never emit the latter), but they can certainly rely on some of the other variants and have been known to do so.
We currently just point to https://www.rfc-editor.org/rfc/rfc3986.html#section-6 - so "path segment normalization" is something that may or may not happen, depending on who's doing the comparison.
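For concreteness, the "path segment normalization" that section points to comes down to the remove_dot_segments algorithm of RFC 3986, Section 5.2.4. A minimal Python sketch, transcribed directly from the spec's pseudocode (illustrative, not production code):

```python
def remove_dot_segments(path: str) -> str:
    """remove_dot_segments from RFC 3986, Section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):        # rule A
            path = path[3:]
        elif path.startswith("./"):       # rule A
            path = path[2:]
        elif path.startswith("/./"):      # rule B
            path = "/" + path[3:]
        elif path == "/.":                # rule B
            path = "/"
        elif path.startswith("/../"):     # rule C: pop last output segment
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path == "/..":               # rule C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # rule D
            path = ""
        else:                             # rule E: move first segment to output
            i = path.find("/", 1)
            if i == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:i])
                path = path[i:]
    return "".join(output)
```

With this, the two example targets from this issue collapse to the same path: `remove_dot_segments("/bar/../foo/")` yields `"/foo/"`.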
My answer would be "a server could do that, but it would be stupid". What's your take?
I agree with your answer, but I think we should say something about it -- perhaps going as far as saying a server should/must not do that (as it's not interoperable).
Security is probably a stronger reason not to allow it.
See also #228.
I think the comment is more confusing than the spec. The request target is whatever is provided by the client. If the server responds to the second example target with a 301/302 to the first, then it does consider them to be two different target URIs. This type of response is not canonicalization -- it's reasonable handling of an unsafe target without breaking the intended request. We might call that responsible handling after the client fails to canonicalize a reference, but keep in mind that a ".." path segment only has meaning for relative references (not references that are already in absolute form). In any case, a client is not required to canonicalize beyond resolving a relative reference to absolute form. Some other servers might want to respond with 403 (e.g., an authoring server responding to a link checker). That's okay too.
On the face of it, that conflicts with this commentary in 3986:
Regardless, the question is whether we give any advice about servers considering these to be genuinely separate resources (putting aside corrective actions, as you illustrate), as it's not practically interoperable with most HTTP implementations. Even something in the normalisation section to the effect of:
There's clearly no consensus on whether the current relevant specs (RFC 3986 and RFC 7230) allow it or not. This is indeed an obscure case: modern browsers don't request such URIs (not even programmatically, from content scripts), and neither does cURL with default options. But it should still be settled, for the sake of implementors of web servers and of other backend tools, frameworks and authoring tools which help with information architecture by emitting appropriate server configuration (the question being whether they should allow creating different resources at such URIs, possibly with a discouraging message, or even assume, for a site not created with them, that such URIs must identify the same resource). It also matters for defensive coders who believe, as Sir Tim does, that cool URIs don't change, and who want to tightly control the set of URIs they respond to with representations of resources, thereby admitting that the URI is valid -- that it identifies some resource. (I do. Therefore I forbid -- default to 404 when they're encountered -- trailing slashes, empty path segments, query strings containing only a question mark, and so on, except when they are, after explicit consideration, added to a whitelist.)

For all the reasons above, I ask you to either declare requests with dot segments in targets invalid (some handling may, and probably should, be defined; but if a client fails to conform to the standards by sending malformed requests and gets inconsistent responses, so be it -- some bets may be off, since the semantics is preserved only on the condition of conformance), or to specify, at MUST level, that such URIs are equivalent (in which case it's the servers that are non-conforming if they serve different resources at them).

There are also purely practical reasons to do either of the two (as opposed to a SHOULD, as contemplated in some comments above). Being an uncommon case, it presents a problem for implementations, causing numerous bugs, and confusion (around bugs or otherwise) even among knowledgeable people. Cf. a sample of links resulting from my research of the subject:
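The whitelist-style defensive policy described above (defaulting to 404 for non-canonical request targets) could be sketched roughly as follows; `is_canonical_target` is a hypothetical helper illustrating one possible policy, not anything from the specs:

```python
def is_canonical_target(target: str) -> bool:
    """Hypothetical defensive check: accept only origin-form targets
    with no dot segments, no empty mid-path segments, and no bare '?'
    query. A server applying this policy would answer 404 otherwise."""
    path, sep, query = target.partition("?")
    if sep and query == "":
        return False                    # query string that is just "?"
    if not path.startswith("/"):
        return False                    # only origin-form targets here
    segments = path.split("/")[1:]
    for seg in segments[:-1]:
        if seg in ("", ".", ".."):      # empty or dot segment mid-path
            return False
    return segments[-1] not in (".", "..")
```

Under this policy, `/foo/` passes while `/bar/../foo/`, `/a//b` and `/foo?` are all rejected; a real deployment would presumably add the explicit whitelist mentioned above.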
As for percent-encoding, section 6 of RFC 3986 seems to have the answer: servers aren't allowed to interpret URIs differing only in percent-encoding as identifying different resources. However, I have 3 more remarks:
https://tools.ietf.org/html/rfc3986#section-6 is useful to the extent that it provides a way to talk about this. Would it be sane to say that servers MUST do the things described in 6.2.2 of that document? As a client implementor, I would be shocked, and my caching software might well break, when dealing with a server that didn't. Section 6.2.3? I think so, probably, but it's not nearly as clear and concise as 6.2.2.
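As a rough illustration of what mandating 6.2.2 would entail, here is a sketch of its percent-encoding rules in Python: 6.2.2.1 (uppercase the hex digits of percent-encodings) and 6.2.2.2 (decode percent-encoded octets that correspond to unreserved characters). The function name is mine, not the RFC's:

```python
import re

# Unreserved characters per RFC 3986, Section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def normalize_percent_encoding(s: str) -> str:
    """Apply RFC 3986 6.2.2.1/6.2.2.2 to a URI component."""
    def fix(match: re.Match) -> str:
        octet = chr(int(match.group(1), 16))
        if octet in UNRESERVED:
            return octet                        # 6.2.2.2: decode unreserved
        return "%" + match.group(1).upper()     # 6.2.2.1: uppercase hex digits
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, s)
```

For example, `normalize_percent_encoding("/%7efoo%2fbar")` gives `"/~foo%2Fbar"`: the `%7e` decodes to the unreserved `~`, while `%2f` stays encoded (it's a reserved delimiter) with its hex digits uppercased.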
@timbray, I concur: make the language of 6.2.2 and 6.2.3 more precise, with uppercase MUSTs instead of the present SHOULDs. On the other hand, 6.2.4 doesn't belong in this context; it could be moved elsewhere and rephrased to be clearly just informative advice. A trailing slash does make a URI semantically different, at least with the
Saying servers MUST do anything has the effect of requiring intermediaries to modify URIs on the way through. That's clearly not going to happen, especially given how many different places URIs occur (request target, among others). OTOH, there are some security implications if policy is applied before normalisation. Caches are less efficient as well (not an interop problem, just an efficiency one). Roy, you say 'The request target is whatever is provided by the client.' Do you mean the bytes on the wire (and if so, what about h2/h3)? Or do you mean after some amount of processing? I think we're just talking about the level of processing that's necessary before the protocol element is consumed (but not necessarily forwarded).
Is there a notion of a server which isn't an intermediary for a given request? If so, how about requiring it only of such servers? If not, can creators and administrators of resources (the people minting URIs for them, either by hand or by directing some tool to do it, possibly dynamically) be distinguished as an audience subject to its own set of requirements, and be required, at MUST level, to conform? The advantage of this approach would lie in abstracting away from authorities, origins, websites, servers and intermediaries: just the assumptions about the semantics embodied in the returned representations and metadata (responses, in the case of HTTP) would count as the measure by which to judge conformance. (And they could achieve it with a single server, a server cluster, a CDN, proxies, including non-transparent ones, or whatever else.)
See PR. |
Thank you, LGTM mostly. It still does allow minting URIs for different resources which are equal after normalization; it's a SHOULD NOT. I assume that the reluctance to go with MUST NOT results from the concerns raised above, but it's still such a corner case that support for it should, IMO, be trumped by clarity of semantics and ease of use and implementation. Moreover, users and implementors are, by a MAY, explicitly allowed to conflate such URIs, making some resources unavailable via URI from their perspective; so it's a very strong SHOULD NOT, and groups of people wishing to violate it for a good cause (whatever that may be -- can you give a use case?) have to ensure there's nothing in either their systems or any intermediaries that would thwart their endeavour by performing normalization (which is always legal per the proposed spec).

Would it be feasible to spec that equivalent URIs simply do identify the same resource, and that if different representations are returned for the different forms, it's a strange but unambiguously identifiable resource? (That's exactly the semantics already in place for URIs resolved by servers using their file systems, which is the most popular method. If there's an RSS feed at

One more nit. Is normalization all or nothing? Or are HTTP components allowed to, e.g., only omit port 80 and keep dot segments? I think it should be specified, and I prefer the former (which also follows Postel's law), though I realize there may be popular implementations doing otherwise; if so, a SHOULD is probably the most that could be counted on. (For completeness: are they allowed to denormalize, e.g. introduce case variation in the authority part?)
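To make the "all or nothing" question concrete, here is a sketch of just the scheme-based part of normalization for http(s) URIs (RFC 3986, Section 6.2.3: lowercase scheme and host, omit the default port, normalize an empty path to "/"). A component could implement exactly this while skipping dot-segment and percent-encoding normalization, which is the partial-normalization scenario asked about; the function name and structure are mine:

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def scheme_based_normalize(uri: str) -> str:
    """RFC 3986 6.2.3 normalization only, for http(s) URIs.
    Deliberately leaves dot segments and percent-encoding untouched.
    Note: any userinfo component is dropped by this sketch."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Omit the port when it is absent or equal to the scheme's default.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path or "/"        # empty path is equivalent to "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

So `scheme_based_normalize("HTTP://Example.COM:80/bar/../foo/")` yields `http://example.com/bar/../foo/`: the authority is normalized while the dot segments survive, showing how the parts of normalization can come apart in practice.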
The text on http(s) Normalization and Comparison doesn't assign any behaviours to roles; it just passively states that http(s) URIs 'are' normalised and compared.
This leads to some confusion about what roles are responsible for normalising. E.g., can a server consider
https://example.com/foo/
and https://example.com/bar/../foo/
as two different target URIs? I think we all agree on the answer to that, but it isn't clearly stated in the specs, AFAICT.