Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add errata for expected behaviour of UriInterface::withUserInfo to PSR-7 #1298

Merged
merged 3 commits into from
Jul 7, 2023

Conversation

dbu
Copy link
Member

@dbu dbu commented Mar 31, 2023

This suggestion comes out of a discussion we had in php-http/discovery#222

The "urlencode when necessary but don't double encode" behaviour is already implemented in laminas, guzzle and slim, and with Nyholm/psr7#213 was adopted in nyholm. It seems to be expected behviour, but the specification did not define it. Can we add this precision?

If i understand correctly, we are are not supposed to change the main document but only the meta document and the actual interface phpdoc in php-fig/http-message. If this suggestion seems acceptable, i can create the pull requests for http-message

@dbu dbu requested a review from a team as a code owner March 31, 2023 11:41
and password if they are already urlencoded.

In release 1.1, `UriInterface::withUserInfo` SHOULD urlencode username or password when
they contain characters that are invalid in an URL and are not already urlencoded.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to sum up the above discussion: would it be correct to change this to something like: SHOULD escape characters in in username and password that are defined as reserved in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#appendix-A)

and accordingly below, and also in the rule about double-escaping?

Copy link
Contributor

@nicolas-grekas nicolas-grekas Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also say SHOULD NOT encode other characters. That would cover double encoding and a bit more, eg \ which is encoded by guzzle (but not by nyholm's)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With encoding only reserved characters, if you pass foo%3Abar:baz as username, it will be encoded to foo%3Abar%3Abaz. How would you revert it back to foo%3Abar:baz?

Copy link
Member Author

@dbu dbu Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rob006 then the correct way to formulate would be: if it contains any reserved character, the whole value should be encoded? if you pass foo%eAbar:baz, the implementation has to assume that the string is not yet encoded, and should therefore also encode the % i think.

damn, i don't think there really is a good solution... because if we do that, the behaviour is different when the caller never encodes the value. if they pass foo%3Abar, no encoding happens, if they pass foo%3Abar:baz the % would suddenly be encoded (the receiving server will urldecode and translate the %3A in the first case, and if it was meant literal, we should have encoded.)

nonetheless, i think we should figure out the rule that makes for the least confusion and add an errata that clarifies how to do this. and in the interest of internal consistence and low surprises, i would go with the rules outlined on getQuery. (i am not sure what those rules mean for the foo%3Abar:baz example. but the current implementations i think will do a compromise that will not work, as they only encode the : but not the %)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is simple and reliable solution: always encode. You build user info by encoding username and password, and implode them using :. If you want to retrieve username and password, you do the same backwards: explode results of getUserInfo() by :, and decode each part. Yes, it is inconvenient if you want to pass getUserInfo() result to withUserInfo(), but this is a result of asymmetry between these two methods, and can't be really fixed without serious BC break. But this use case could be simplified by adding getUserName() and getPassword() to interface: $uri->withUserInfo($previous->getUserName(), $previous->getPassword())).

Copy link
Member Author

@dbu dbu Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by you, do you mean the caller or the psr7 implementation?

given that all the largely used implementations discussed above do encode conditionally, i think clarifying that the arguments always must be encoded by the caller is not a viable option as it conflicts with existing implementations. if implementations were to change that behaviour, this would be a nasty hidden BC break for users. (even if the implementation starts throwing an exception on unescaped reserved characters, this might only happen occasionally in production and not discovered when upgrading the psr7 library, if the tests don't use a reserved character in credentials)

my aim with this pull request is to try to clarify what implementations should do in a way that creates as little friction as possible.

i do think there would be merit for a new PSR with stricter and more robust rules, but that is out of scope for this pull request and discussion. (please feel free to cite this discussion as one motivation why there should be a new PSR for Uri, should you propose such a PSR)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by you, do you mean the caller or the psr7 implementation?

I mean implementation. So withUserInfo() should always encode and getUserInfo() should always return this encoded value. getUserName() and getPassword() would return decoded values (in the same form as passed to withUserInfo()).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be consistent and deterministic. we can't add new methods afaik, though if it is allowed for a version 2.0 it would make sense.

but i am unsure if that is what the current implementations do. the phpdoc on UriInterface::getQuery says The value returned MUST be percent-encoded, but MUST NOT double-encode any characters. i am not 100% sure i understand correctly, but i think when this was written, the foo%3Abar:baz case was not considered.

@dbu
Copy link
Member Author

dbu commented Apr 7, 2023

@weierophinney do you have any input on this? how should withUserInfo (and also with query) handle a string like h%65llo:bar? if i understand correctly, current implementations would urlencode the : but not the % which is unlikely to be correct.

on getQuery there is a statement that double encoding must be avoided, which afaik made the implementors skip encoding of %.

RE the discussion above: i think we should not discuss how PSR-7 1.0 could have been done better, but find a pagmatic solution for moving forward with the current "state of the art" to avoid friction for the current users of PSR-7.

@weierophinney
Copy link
Contributor

@dbu — Essentially, we should have consistency of behavior between the various methods that modify different parts of the URI. We were explicit in the way the path, query parameters, and fragment were to handle encoding; we should mimic that for the user info as well, which means that (a) the values returned from get*() operations MUST be properly encoded, and (b) passing those values to a with*() method MUST be allowed, and MUST NOT double encode. (While I appreciate what @shadowhand notes about how it can be unintuitive to call a get*() method and not get back the raw value, the point of these methods, when we created the spec, was to return something valid for inclusion in a URL or URL fragment. We recognized that this particular interface was not ideal, but it was suited for the domain — HTTP messages — and we didn't want to block on having a full-blown spec for covering the entierty of RFC-3986. I always anticipated that we'd have a URI PSR at some point, and a replacement for PSR-7 that consumed it; it's just never happened.)

In this particular case, however, the characters that MUST be urlencoded differ from other parts of the URI, according to the ABNF. The relevant line of the ABNF reads as follows:

userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )

Note that : is explicitly allowed here, and should NOT be urlencoded. (Looking at Diactoros, we're doing this portion wrong!)

I would also argue that this would qualify for a new minor, but not a new major release of the spec, because it corrects an issue in the specification. Adopting the new minor might indicate that an existing implementation is broken currently, but fixing the implementation to conform would mean it still conforms to the previous versions of the spec as well. This means that since we just released 1.1 and 2.0, we'd need to do 1.2 and 2.1 versions.

@shadowhand
Copy link
Contributor

I always anticipated that we'd have a URI PSR at some point, and a replacement for PSR-7 that consumed it; it's just never happened.

As an aside, I think it is time for this process to begin. UriInterface is (probably) the weakest part of PSR-7, at least from a developer-user standpoint. (I don't know if it could be replaced by itself in a PER, or if we need a whole new PSR to replace PSR-7.)

@Crell
Copy link
Contributor

Crell commented Apr 7, 2023

This sort of clarification belongs in Errata, not an update to the spec itself.

@weierophinney
Copy link
Contributor

This sort of clarification belongs in Errata, not an update to the spec itself.

That's... what's being proposed, @Crell.

@Crell
Copy link
Contributor

Crell commented Apr 7, 2023

I was referring to this:

This means that since we just released 1.1 and 2.0, we'd need to do 1.2 and 2.1 versions.

Which I don't think is necessary. Errata don't have tagged versions.

@weierophinney
Copy link
Contributor

@Crell — ah, good point. :)

This behaviour is already implemented in laminas, guzzle and slim, and with Nyholm/psr7#213 was adopted in nyholm. It seems to be expected behviour, but the specification did not define it.
@dbu
Copy link
Member Author

dbu commented Apr 10, 2023

i updated the wording to an errata without new versions and to reflect what we discussed. wdyt?

i do not explicitly cover the "mixed case" fo%65bar@baz . an errata for these should also cover the other places like getQuery that have the same problem.

Note that : is explicitly allowed here, and should NOT be urlencoded.

@weierophinney : is used to separate username and password. in the username it must be encoded, otherwise the second part of the username would be considered part of the password.

@weierophinney
Copy link
Contributor

this looks good! I will bring it up for discussion this week, with an expected voting period starting two weeks from today.

@weierophinney
Copy link
Contributor

I've opened the discussion period:

Copy link
Contributor

@heiglandreas heiglandreas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is URLencode the right tool? Or should this not rather use rawurlencode instead? The difference being whether spaces are encoded as + (urlencode) or %20 (rawurlencode). The former can lead to disambiguity as a + can be a valid part of a user-info which would then be mistaken for a space

@dbu
Copy link
Member Author

dbu commented May 31, 2023

@heiglandreas the errata has an explicit list of characters that should be encoded. looking at guzzle/psr7, they indeed use rawurlencode, but only for the list of reserved characters.

i guess the PSR should not detail the implementation, but agree it would be good to be more explicit to avoid somebody mistakenly using urlencode on the whole username / password... do you want to suggest a better wording, andreas?

@heiglandreas
Copy link
Contributor

Perhaps just avoiding urlencode as word might be sufficient as that triggered me to think one could use the function of that name...

Perhaps something along the lines of "Encoding the URL according to RFC3986" might also be an option

Copy link
Member Author

@dbu dbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewording suggestions to address @heiglandreas feedback.

accepted/PSR-7-http-message-meta.md Outdated Show resolved Hide resolved
accepted/PSR-7-http-message-meta.md Outdated Show resolved Hide resolved
accepted/PSR-7-http-message-meta.md Outdated Show resolved Hide resolved
accepted/PSR-7-http-message-meta.md Show resolved Hide resolved
@dbu
Copy link
Member Author

dbu commented Jun 2, 2023

changed to "encode" / "encoding"

Copy link
Member Author

@dbu dbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zonuexe thanks for the input, that seems consistent with the spec. i adjust the method names accordingly

accepted/PSR-7-http-message-meta.md Outdated Show resolved Hide resolved
accepted/PSR-7-http-message-meta.md Outdated Show resolved Hide resolved
Copy link
Contributor

@mbniebergall mbniebergall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Voting passed.

@mbniebergall mbniebergall merged commit b93f632 into master Jul 7, 2023
@dbu dbu deleted the dbu-patch-1 branch July 10, 2023 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet