Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should percent encode the query key and value when being encoded as str #100

Closed
xiren7 opened this issue Jun 2, 2018 · 28 comments
Closed

Comments

@xiren7
Copy link

xiren7 commented Jun 2, 2018

I use furl(v1.1) in http proxy app, and receive url in server side(the http proxy received from client):

>>> url_received = 'http://example.com/?redirect=http%3A%2F%2Fwww.example.com%2F'
>>> parsed = furl.furl(url_received)

After do something with the parsed url, and generate url in client side(to be sent to the real server):

>>> url_to_be_sent = parsed.url 
>>> print(url_to_be_sent)
'http://example.com/?redirect=http://www.example.com/'
>>> print(url_to_be_sent.query)
'redirect=http://www.example.com/'

Should the "redirect=http://www.example.com/" part be percent encoded as original query "redirect=http%3A%2F%2Fwww.example.com%2F"?

@gruns
Copy link
Owner

gruns commented Jun 6, 2018

Should the "redirect=http://www.example.com/" part be percent encoded as
original query "redirect=http%3A%2F%2Fwww.example.com%2F"?

"redirect=http://www.example.com/" doesn't need to be percent encoded; .,
:, and / are all valid query characters.

Is the URL http://example.com/?redirect=http://www.example.com/ rejected by
the receiving software? That's a perfectly valid URL.

@xiren7
Copy link
Author

xiren7 commented Jun 6, 2018

Sorry, I was wrong, the : and / does not need be encoded in query.

But should not be decoded in query neither?

import furl

url = 'http://host/?q=://'
assert furl.furl(url).url == url

# not consistent
url_percent_encoded = 'http://host/?q=%3A%2F%2F' 
assert furl.furl(url_percent_encoded).url != url_percent_encoded 

@gruns
Copy link
Owner

gruns commented Jun 6, 2018

It's ambiguous. Both

  • http://host/?q=://

and

  • http://host/?q=%3A%2F%2F

are perfectly valid URLs, so furl has a choice of what to do here.

But should not be decoded in query neither?

The equality

url == furl(url).url

is not a goal of furl. There are many URLs that furl happily accepts where that
equality fails. Like

>>> from furl import furl
>>> url = 'https://www.google.com/a b/?a=a b'
>>> f = furl(url)
>>> f.url
'https://www.google.com/a%20b/?a=a+b'
>>> url == f.url
False

Zooming out, is there a reason you'd prefer furl to encode .:/ (and their ilk)
in query strings? Is http://example.com/?redirect=http://www.example.com/
rejected by software you're using?

@xiren7
Copy link
Author

xiren7 commented Jun 7, 2018

Is http://example.com/?redirect=http://www.example.com/
rejected by software you're using?

Yes. I used furl in a proxy server.
The client sent me the url with query be percent encoded.
I parsed the url with furl, and forwarded the furl(url).url to the server, the server complained about the url.

Zooming out, is there a reason you'd prefer furl to encode .:/ (and their ilk)
in query strings?

I did not know why the client encode the query as percent encoding.

I suggest furl only decode or encode the reserved characters. But not the : or /.
(updated: the behavior just like decodeURI or encodeURI in JavaScript)

decodeURI('%2f') == "%2f"
encodeURI('/') == '/'

encodeURI(' ') == '%20'
decodeURI('%20') == ' '

@gruns
Copy link
Owner

gruns commented Jun 7, 2018

I parsed the url with furl, and forwarded the furl(url).url to the server, the
server complained about the url.

Interesting.

Which software complained about the URL

  • http://example.com/?redirect=http://www.example.com/

? And what was the warning or error message, if available?

As aforementioned, http://example.com/?redirect=http://www.example.com/ is a
perfectly valid URL.

If URLs like http://host/?q=:// prove problematic for existing software in the
wild, even though they shouldn't, furl should switch from

  • http://host/?q=://

to

  • http://host/?q=%3A%2F%2F

to be safe.

@xiren7
Copy link
Author

xiren7 commented Jun 7, 2018

I think the client encode the query component with encodeURIComponent, it encode all the unsafe characters, including : and /.

Which software complained about the URL http://example.com/?redirect=http://www.example.com/
?

I really did not know what the http parser the server(*.backend.semrush.com) used in my case.

UPDATE:
I found why the server complained about the url, because the server hashed the url, and sent the hash value with the query, the url changed(because of %3A%2F%2F -> :// ), so the server complained.

http://host/?q=:// to http://host/?q=%3A%2F%2F to be safe.

I agree, maybe the best way is to encode all the unsafe characters.

@gruns
Copy link
Owner

gruns commented Jun 7, 2018

I'm afraid I don't quite follow. Please elaborate on what you mean by

the server hashed the url, and sent the hash value with the query

So if http://example.com/?redirect=http://www.example.com/ is the input URL,
what is the final URL that backend.semrush.com receives? That is, what is
http://example.com/?redirect=http://www.example.com/ changed into after

the server hashed the url, and sent the hash value with the query

?

@xiren7
Copy link
Author

xiren7 commented Jun 7, 2018

I'm afraid I don't quite follow. Please elaborate on what you mean by

# client side
# the pseudo code for simulating what client do
target = encodeURIComponent("http://target.com/") # js function encodeURIComponent
print(target) # -> "http%3A%2F%2Ftarget.com%2F"
hash = hash_with_md5(target)
print(hash) # -> "2063c1608d6e0baf80249c42e2be5804"
original_url_sent_by_client = f"http://server/?q={target}&h={hash}"

# proxy server side
parsed_url = furl(original_url_sent_by_client)
# do log some values in the parsed_url
...
# proxy client side
url_to_be_sent = parsed_url.url
print(url_to_be_sent) # -> "http://server/?q=http://target.com/&h=2063c1608d6e0baf80249c42e2be5804"
httpclient.send(url_to_be_sent)
# because the query value changes from "http%3A%2F%2Ftarget.com%2F" to "http://target.com/"
# so the hash value will not match, and the real server complains about that.

From https://en.wikipedia.org/wiki/Query_string:
HTML 5 specifies the following transformation for submitting HTML forms with the "get" method to a web server.[1] The following is a brief summary of the algorithm:

  • Characters that cannot be converted to the correct charset are replaced with HTML numeric character references[11]
  • SPACE is encoded as '+' or '%20'
  • Letters (A–Z and a–z), numbers (0–9) and the characters '*','-','.' and '_' are left as-is
  • '+' is encoded by %2B
  • All other characters are encoded as %HH hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)

According to the algorithm, only Letters (A–Z and a–z), numbers (0–9) and the characters '*','-','.' and '_' are left as-is, all other characters are transformed.
At least for the query part, I think should encode what decoded. For example:

decode(encode(' ')) == ' '
encode(decode('%20'))) == '%20'

decode(encode('=')) == '='
encode(decode('%3D'))) == '%3D'

decode(encode(':')) == ':'
encode(decode('%3A')) == '%3A' # the current furl will return encoded value as ':' not '%3A'

decode(encode('/')) == '/'
encode(decode('%2F')) == '%2F' # the current furl will return encoded value as '/' not '%2F'

@gruns
Copy link
Owner

gruns commented Jun 7, 2018

Aha!

Thank you for the example with pseudo code. Your situation makes perfect sense
now.

Fundamentally, furl.url will not always exactly reproduce the URL it was fed,
like my prior example:

>>> furl('https://www.google.com/a b/?a=a b').url
'https://www.google.com/a%20b/?a=a+b'

In all these scenarios, the hash of furl(url).url will differ from the hash of
the original URL, url.

While as per RFC 3986 / and : are perfectly valid query characters

query         = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

furl's decision not to encode them clearly leads to problems.

I'll tweak furl's encoding scheme to percent encode sub-delim characters, too.

@gruns
Copy link
Owner

gruns commented Jun 21, 2018

Fixed in 68b0cb9.

This fix will ship in the next version of furl, v1.2.

@xiren7
Copy link
Author

xiren7 commented Jun 21, 2018

Thanks for the library.

@gruns
Copy link
Owner

gruns commented Jun 30, 2018

The aforementioned query encoding overhaul shipped in Furl v1.2.

>>> f = furl('http://example.com/')
>>> f.args['redirect'] = 'http://www.example.com/'
>>> f.url
'http://example.com/?redirect=http%3A%2F%2Fwww.example.com%2F'

Upgrade with

pip install furl --upgrade

Thank you for bringing this issue to my attention, @xiren7. Don't hesitate to
let me know if there's anything else I can do for you.

@gruns gruns closed this as completed Jun 30, 2018
@youtux
Copy link

youtux commented Aug 29, 2018

I cannot upgrade to furl 1.2 because of the change in 68b0cb9. I am using furl to return URLs containing query params that are also URLs to applications developed by third parties. Some of them may not use proper decoding, and if I upgrade to version 1.2 I will be breaking compatibility with those apps.
Is there a way to have the old behavior back, maybe by adding a parameter to the tostr() function?

@gruns
Copy link
Owner

gruns commented Sep 12, 2018

@youtux Thank you for weighing in!

Before I add a parameter to tostr() to optionally disable query encoding (i.e.
optionally revert 68b0cb9), which sites and/or libraries "may not use proper
decoding"? Are you sure compatiblity would be broken?

If so, is it realistic to reach out and ask them to switch to proper URL
handling libraries and/or tools?

@gruns gruns reopened this Sep 12, 2018
@youtux
Copy link

youtux commented Sep 14, 2018

It is not easy for me to reach them, and to be honest it would be quite time-consuming. But I saw many application just taking "parsing" query params by splitting the URL with the & character and then splitting again each item with = to fetch key and value, without doing any unquoting.

@gruns
Copy link
Owner

gruns commented Oct 6, 2018

I added a query_dont_quote parameter to furl.tostr() that exempts valid
query characters from percent-encoding, either in their entirety with
query_dont_quote=True or selectively with, for example,
query_dont_quote='/?'.

So, to leave URLs in the query unencoded, use query_dont_quote=True:

>>> f = furl('https://www.google.com/')
>>> f.args['url'] = 'https://lolsup.ru/pepp'
>>> f.tostr()
'https://www.google.com/?url=https%3A%2F%2Flolsup.ru%2Fpepp'
>>> f.tostr(query_dont_quote=True)
'https://www.google.com/?url=https://lolsup.ru/pepp'

query_dont_quote will ship in the next version of furl, v2.1.0.

Does this suffice for your needs, @youtux?

@youtux
Copy link

youtux commented Oct 8, 2018 via email

@youtux
Copy link

youtux commented Oct 10, 2018

I am trying it using commit c30edb2, but I think I found a bug:

import furl

embedded_url = furl.furl('https://example.com/foos/21312312321312312?a=1&b=2')
assert set(embedded_url.args) == {'a', 'b'}

wrapping_url = furl.furl('https://mydomain.com/').set(args={'foo': embedded_url}).tostr(query_dont_quote=True)
assert set(furl.furl(wrapping_url).args) == {'foo'}  # This fails, 'b' is parsed as an query param of wrapping_url.

The problem is that we don't escape ? and & chars when using query_dont_quote=True. If I am correct, tostr should always try to restrict the choice of characters the user is telling to not quote, in case it is not allowed.

@gruns
Copy link
Owner

gruns commented Oct 17, 2018

Thank you for testing, @youtux!

The problem is that we don't escape ? and & chars when using
query_dont_quote=True.

The problem is that it's impossible to satisfy both the above and:

But I saw many application just taking "parsing" query params by splitting the
URL with the & character and then splitting again each item with = to fetch
key and value, without doing any unquoting.

If the receiving application extracts query parameters by splitting on & and not
unquoting percent-encoded characters, then the encoded ? and & characters in

https://example.com/foos/21312312321312312%3Fa=1%26b=2

will be incorrectly parsed as the raw strings %3F and %26, respectively -- not
unquoted to ? and &, as you intend.

This is the original problem I highlighted: furl can only do so much to insulate
lamentable software that ignorantly splits on & and doesn't unquote
percent-encoded characters.

Nonetheless, if you want encode ? and & in your example, you can do so like:

>>> from furl import furl, Query
>>> url = 'https://example.com/foos/21312312321312312?a=1&b=2'
>>> safe = ''.join(set(Query.SAFE_VALUE_CHARS) - set('?&'))
>>> furl('https://mydomain.com/').set(args={'foo': url}).tostr(query_dont_quote=safe)
'https://mydomain.com/?foo=https://example.com/foos/21312312321312312%3Fa=1%26b=2'

Does this suffice for your needs?

If not, I will remove the dont_quote= and query_dont_quote= parameters to
keep furl's API lean. The fault here lies not with furl, but the lackluster
software that doesn't properly decode percent-encoded characters, as per RFC
3986.

@youtux
Copy link

youtux commented Oct 19, 2018

The failing test case that I reported has little to do with my specific problems of application doing a naive split on &.
What I reported is that with the change you suggested, tostr is going to create wrong urls when used with query_dont_quote=True. Instead, it should try to not quote as much as possible, but still being escaping character that must be escaped in the query params.
I am suggesting to remove & from the list of safe characters, just like it used to be in version 1.1.

@youtux
Copy link

youtux commented Nov 9, 2018

@gruns any update on this?

@gruns
Copy link
Owner

gruns commented Nov 16, 2018

@youtux Thank you for the ping.

I'll look into this again soon. At first glance, your suggestion

I am suggesting to remove & from the list of safe characters, just like it
used to be in version 1.1.

seems solid.

@gruns
Copy link
Owner

gruns commented Nov 29, 2018

@youtux I removed & from Query's SAFE_KEY_CHARS list. With such,
your prior example

import furl

embedded_url = furl.furl('https://example.com/foos/21312312321312312?a=1&b=2')
assert set(embedded_url.args) == {'a', 'b'}

wrapping_url = furl.furl('https://mydomain.com/').set(args={'foo': embedded_url}).tostr(query_dont_quote=True)
assert set(furl.furl(wrapping_url).args) == {'foo'}  # This fails, 'b' is parsed as an query param of wrapping_url.

now runs without AssertionErrors. URLs can now be encoded as query
parameters without issue, even with query_dont_quote=True.

Does this solve your original problem?

@youtux
Copy link

youtux commented Dec 16, 2018

Yes, it does. Would you mind adding this scenario to the test suite?

@gruns
Copy link
Owner

gruns commented Feb 6, 2019

I added more tests in e9855e8.

The aforementioned query_dont_quote= parameter of furl.tostr() will ship in
the next version of furl, v2.1.0. Then your prior example will work beautifully:

>>> from furl import furl
>>> url = 'https://example.com/?a=1&b=2'
>>> set(furl(url).args)
{'a', 'b'}
>>> f = furl(url).set(args={'foo': url})
>>> f.tostr(query_dont_quote=True)
'https://example.com/?foo=https://example.com/?a=1%26b=2'
>>> set(f.args)
{'foo'}
>>> f.args['foo']
'https://example.com/?a=1&b=2'

Does this resolve your problem, @youtux?

If so, I'll close this Issue once v2.1.0 ships.

@youtux
Copy link

youtux commented Feb 9, 2019

It looks like it does. Thanks for taking care of it, @gruns, much appreciated.

@youtux
Copy link

youtux commented Jul 22, 2019

@gruns Any plan to release 2.1.0 anytime soon?

@gruns
Copy link
Owner

gruns commented Nov 26, 2019

Furl v2.1.0 released!

$ pip install furl --upgrade

Closing this Issue, now resolved with the addition of query_dont_quote=.

Thank you for your help and input @xiren7 and @youtux. Don't hesitate to let me know if there's anything else I can do for you.

@gruns gruns closed this as completed Nov 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants