Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat invalid percent encoding consistently #5461

Open
openandclose opened this issue May 17, 2020 · 10 comments
Open

Treat invalid percent encoding consistently #5461

openandclose opened this issue May 17, 2020 · 10 comments

Comments

@openandclose
Copy link

openandclose commented May 17, 2020

The present form of requote_uri unnecessarily divides invalid cases.

When alpha-numeric, '%' is quoted to '%25'.
E.g. %pp to %25pp.

When not alpha-numeric, '%' is kept as is.
E.g. %-- to %--.
(Note these two examples are in the tests,
and the latter is for unquote_unreserved,
but the result is the same.)
Or %<< to %%3C%3C.

but I think '%' should be quoted in all invalid cases.
(Or to put it differently, isolated '%' must be quoted.)

@openandclose
Copy link
Author

I can not find the rational in the code history.

(0) The original implementation
(1) Added 'isalnum'
(2) Added try clause and error message.
(3) Added more try clause and one of the present test.
(4) Added another of the present test.

(0)
URI cleanup of non-path components
https://github.com/psf/requests/commits?author=mgiuca
https://github.com/psf/requests/compare/12f9aa36..690426ac

(1)
fix for #630 - unquote_unreserved raising ValueError on invalid URI-escape sequences
#641
cb15310

(2)
Better percent-escape exception
#1514

(3)
Attempt to quote anyway if unquoting fails
#2393

(4)
Added unit tests for utils module
#3024

@openandclose
Copy link
Author

openandclose commented May 19, 2020

@mgiuca himself recently changed his mind.
whatwg/url#369 (comment)

(the point of that particular comment is to make '%s -> %s' a special case,
and it may be better to take this into consideration here (later),
but he agrees on the baseline ('%' -> '%25'))

@mgiuca
Copy link
Contributor

mgiuca commented May 20, 2020

Hello,

I haven't dug deeply into the issue (but please note that my previous interaction with this project was in 2012, so it's entirely possible I made a mistake or changed my mind since then!) Also note that the entire URL standard has been rewritten since then, the URL standard that I'm commenting on in the above quote didn't exist in 2012.

Based on your above description, I agree that it doesn't make sense to treat %xx differently depending on whether x is alphanumeric. The best thing to do is probably say "if xx are both hex digits, keep it as-is, otherwise, expand "%" to "%25". But as you noted, then you will transform "%s" into "%25s", which breaks URLs that are designed to have things substituted into them later.

@sigmavirus24
Copy link
Contributor

But as you noted, then you will transform "%s" into "%25s", which breaks URLs that are designed to have things substituted into them later.

Except that once passed to an HTTP client, nothing will be substituting anything into them later. And if this is being returned as part of a redirect in the Location header then that means that the user will want to handle redirects themselves, otherwise the re-quoting is most likely valid.

@openandclose
Copy link
Author

openandclose commented May 20, 2020

if this is being returned as part of a redirect in the Location header

Is it possible at all, in case of registerProtocolHandler's '%s' url?

@sigmavirus24
Copy link
Contributor

What are you talking about registerProtocolHandler? Are you looking at the right project?

@openandclose
Copy link
Author

It is the only case url parser has to specially handle '%' + <single character>.

That's why in the above comment link, mgiuca was considering:

"%s" -> "%s" (with validation error)
"%6%3D" -> "%6%3D" (with validation error)

In any other cases, my proposal in the first comment should be OK.

@openandclose
Copy link
Author

And it is the concern for general url parsers.
I'm not sure, but A HTTP client may not need wholly embrace this.

@mgiuca
Copy link
Contributor

mgiuca commented May 21, 2020

And it is the concern for general url parsers.

Quick note that the strings accepted by registerProtocolHandler are not technically URLs (after all, they unambiguously are invalid, since you can't follow a % with an s in URL syntax). They are template strings that, when substituted, form a URL.

A generic URL parser does not need to be able to handle %s properly. It's a question for the authors of a given URL parser whether it's useful to be able to handle %s without double-quoting it.

@openandclose
Copy link
Author

@mgiuca thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants