we can't choose encoding of final url #89

Closed
tuntapovski opened this issue Sep 22, 2017 · 10 comments

@tuntapovski

this module is great and makes manipulating urls easy

adding an "encoding" parameter to the ".tostr" function would be nice,
rather than encoding everything in "utf8",
since "utf8" is not the only option in real life

thanks

@gruns
Owner

gruns commented Sep 27, 2017

Adding an encoding parameter to tostr() is a wonderful suggestion.

To better understand the problem, please share the context (or code) in
which you want the URL string encoded in a non-UTF8 encoding. In other words, in
which scenario(s) do you want URL component(s) encoded in a non-UTF8 encoding?

@tuntapovski
Author

most basic example is scraping a website using non-UTF8 encoding

given a unicode url -> manipulate it -> encode and escape it back correctly using that encoding

@gruns
Owner

gruns commented Sep 29, 2017

> most basic example is scraping a website using non-UTF8 encoding

Please provide a code example, if possible. I'm still not sure how the encoding
of the page contents at a URL relates to the encoding of the URL string itself,
as returned by furl.tostr(). Thus, I'm uncertain how furl should best be
improved to solve your problem.

Also, are you using Python 2 or Python 3?

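In Python 2, furl.tostr() returns a byte string (str) encoded with your system's
default encoding, sys.getdefaultencoding(). In Python 3, furl.tostr()
returns a Unicode string (str). Its in-memory representation is an interpreter
detail: UTF-16 for narrow builds and UTF-32 for wide builds before Python 3.3,
and a flexible per-string representation from Python 3.3 onward.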

@tuntapovski
Author

tuntapovski commented Sep 30, 2017

py2

if you are using py3, we are not talking about the same str here then :D
maybe adding tobyte as an alternative is a better idea?

# coding: utf8
import urllib
import furl
import sys

print sys.getdefaultencoding() # ascii

encoding = 'gb2312' # chinese charset
url_unicode = u'/你好吗'

url_byte = url_unicode.encode(encoding)
url_byte_quoted = urllib.quote(url_byte)

furl_byte = furl.furl(url_unicode).tostr()

# expected result
print url_byte_quoted # /%C4%E3%BA%C3%C2%F0

# what furl actually returns (UTF-8 bytes, percent-encoded)
print furl_byte # /%E4%BD%A0%E5%A5%BD%E5%90%97

# what i want (hypothetical: tostr() has no encoding parameter)
print furl.furl(url_unicode).tostr(encoding='gb2312') # /%C4%E3%BA%C3%C2%F0
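
The output requested above can already be produced with the standard library alone. A minimal Python 3 sketch (the thread itself uses Python 2); `quote_path` is a hypothetical helper name used for illustration and is not part of furl:

```python
from urllib.parse import quote


def quote_path(path, encoding="utf-8"):
    """Percent-encode a Unicode path using the given character encoding.

    Hypothetical helper for illustration only; furl has no such function.
    """
    # quote() encodes non-ASCII characters with `encoding` before escaping.
    # "/" stays unescaped because it is listed in `safe`.
    return quote(path, safe="/", encoding=encoding)


print(quote_path("/你好吗", "gb2312"))  # /%C4%E3%BA%C3%C2%F0
print(quote_path("/你好吗"))            # /%E4%BD%A0%E5%A5%BD%E5%90%97
```

The result could then be joined with the rest of the URL by hand, on the assumption that the receiving site really does expect gb2312 bytes.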

@gruns
Owner

gruns commented Oct 10, 2017

Your example is extremely helpful. Thank you.

As per RFC 3986, textual URL data should be UTF-8 encoded before being
percent-encoded. That is, percent-encoded URL components should represent
the octets of the UTF-8 encoding of the original characters.

From Section 2.5, Identifying Data of RFC 3986 (https://www.ietf.org/rfc/rfc3986.txt):

When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the data
should first be encoded as octets according to the UTF-8 character encoding
[STD63]; then only those octets that do not correspond to characters in the
unreserved set should be percent-encoded.

As such, there's not a strong case for furl to deviate from this behavior, nor
allow an option to deviate from this behavior and potentially mislead users.
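
The two steps in the quoted passage can be sketched directly. This is a minimal Python 3 illustration, not furl's implementation; `percent_encode` and `UNRESERVED` are names made up for the example, and the unreserved set is taken from RFC 3986, Section 2.3:

```python
import string

# Unreserved characters per RFC 3986, Section 2.3, stored as octet values.
UNRESERVED = set((string.ascii_letters + string.digits + "-._~").encode("ascii"))


def percent_encode(text):
    """Encode text as UTF-8, then escape every octet outside the unreserved set."""
    octets = text.encode("utf-8")  # step 1: characters -> UTF-8 octets
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)  # step 2: escape
        for b in octets
    )


print(percent_encode("你好吗"))  # %E4%BD%A0%E5%A5%BD%E5%90%97
print(percent_encode("a b~c"))   # a%20b~c
```

Note that this sketch escapes reserved characters such as "/" as well, so it is meant for a single component's data rather than a whole URL.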

That said, thanks to your example, I now understand that you want to
percent-encode non-UTF8 encoded strings, a la u'/你好吗'.encode('gb2312'). But
I'm still a tad confused why you'd like to do that. The resultant percent-encoded URL
data, e.g. /%C4%E3%BA%C3%C2%F0 from quote(u'/你好吗'.encode('gb2312')),
will be decoded incorrectly by software that consumes that URL and expects
valid UTF-8 byte sequences but instead finds gb2312 bytes. Like:

>>> from urllib import quote, unquote
>>> quote(u'你好吗'.encode('gb2312'))
'%C4%E3%BA%C3%C2%F0'
>>> unquote('%C4%E3%BA%C3%C2%F0').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

So, if possible, please help me understand why you want to percent-encode
non-UTF8 encoded strings (like gb2312 encoded strings) into the final URL,
non-conformant to RFC 3986? That will help determine how furl can help solve
your problem.
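
Put differently, the escaped bytes only round-trip when the consumer knows which encoding was used. A minimal Python 3 sketch of that point, using only the standard library (the variable names are illustrative):

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%C4%E3%BA%C3%C2%F0"  # gb2312 bytes of 你好吗, percent-encoded

# Undo the percent-encoding only; the result is raw bytes, not text.
raw = unquote_to_bytes(escaped)
print(raw.decode("gb2312"))  # 你好吗

# A consumer that assumes UTF-8 cannot decode the same bytes.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)

# unquote() assumes UTF-8 by default and substitutes U+FFFD for
# undecodable bytes instead of raising; passing the encoding fixes it.
print(unquote(escaped))
print(unquote(escaped, encoding="gb2312"))  # 你好吗
```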

@tuntapovski
Author

tuntapovski commented Oct 14, 2017

understood, i didn't know that.
but clearly not every site i deal with cares about that rfc either :D

we can close the issue since it's no longer an issue

@gruns
Owner

gruns commented Oct 14, 2017

If you feel comfortable sharing, which website(s) don't heed RFC 3986 and
percent-encode gb2312 bytes in their URLs instead of UTF-8 bytes?

I'm curious to see if, perhaps, something else is going on.

@gruns
Owner

gruns commented Oct 27, 2017

@tuntapovski Little bump: if you feel comfortable sharing, which website(s)
don't heed RFC 3986 and percent-encode gb2312 bytes instead of UTF-8 bytes
in their URLs?

@tuntapovski
Author

i just gave it as an example; i'm neither chinese nor interested in a website using gb2312.
also, i don't remember at all which sites use different encodings.

but there is a world out there that doesn't fit all RFCs and shapes them as it desires.
for example, have you seen any web mail service that completely follows the RFC?
! $ & * - = ^ ` | ~ # % ' + / ? _ { } can be used in the local part

:D

@gruns
Owner

gruns commented Oct 30, 2017

Ah. Thank you for elucidating.

Closing this issue; it's not worth breaking RFC 3986 for a hypothetical
problem. Please re-open this issue if you find this problem in the wild.

Thank you again, tuntapovski. Don't hesitate to let me know if there's anything
else I can do for you.

gruns closed this as completed Oct 30, 2017