we can't choose encoding of final url #89

tuntapovski · 2017-09-22T19:00:44Z

this module is great and easy when manipulating urls

adding "encoding" parameter to ".tostr" function would be nice
rather than encoding everything in "utf8"
since "utf8" is not only option in real life

thanks

gruns · 2017-09-27T05:58:14Z

Adding an encoding parameter to tostr() is a wonderful suggestion.

To better understand the problem, please share the context (or code) in
which you want the URL string encoded in a non-UTF8 encoding. In other words, in
which scenario(s) do you want URL component(s) encoded in a non-UTF8 encoding?

tuntapovski · 2017-09-27T17:40:44Z

most basic example is scraping a website using non-UTF8 encoding

give unicode url -> manipulate -> encode and escape it back correctly using encoding

gruns · 2017-09-29T08:10:33Z

> most basic example is scraping a website using non-UTF8 encoding

Please provide a code example, if possible. I'm still not sure how the encoding
of the page contents at a URL relates to the encoding of the URL string itself,
as returned by furl.tostr(). Thus, I'm uncertain how furl should best be
improved to solve your problem.

Also, are you using Python 2 or Python 3?

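In Python 2, furl.tostr() returns a byte string encoded with your system's
default encoding, sys.getdefaultencoding(). In Python 3, furl.tostr()
returns a Unicode string (str), which has no byte encoding of its own until
you encode it. Its in-memory representation is an implementation detail:
UTF-16 or UTF-32 on narrow or wide builds before Python 3.3, and a flexible
representation since then.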

tuntapovski · 2017-09-30T16:25:52Z

py2

if you are using py3, we are not talking about same str here then :D
maybe adding tobyte as alternative is better idea?

# coding: utf8
import urllib
import furl
import sys

print sys.getdefaultencoding() # ascii

encoding = 'gb2312' # chinese charset
url_unicode = u'/你好吗'

url_byte = url_unicode.encode(encoding)
url_byte_quoted = urllib.quote(url_byte)

furl_byte = furl.furl(url_unicode).tostr()

# expected result
print url_byte_quoted # /%C4%E3%BA%C3%C2%F0

# using sys.getdefaultencoding()
print furl_byte # /%E4%BD%A0%E5%A5%BD%E5%90%97

# what i want
print furl.furl(url_unicode).tostr(encoding='gb2312') # /%C4%E3%BA%C3%C2%F0
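
For readers on Python 3, the same comparison can be sketched with the standard library alone. This is not furl's API: it only relies on urllib.parse.quote(), whose encoding parameter selects the charset applied before percent-encoding. The variable names are illustrative.

```python
# Python 3 sketch: percent-encode a path with an explicit character encoding.
from urllib.parse import quote

path = '/你好吗'

# quote() first encodes the str to bytes using `encoding` (UTF-8 by default),
# then percent-encodes every byte outside its safe set ('/' is safe by default).
utf8_quoted = quote(path)
gb2312_quoted = quote(path, encoding='gb2312')

print(utf8_quoted)    # /%E4%BD%A0%E5%A5%BD%E5%90%97
print(gb2312_quoted)  # /%C4%E3%BA%C3%C2%F0
```

A string quoted this way is plain ASCII, so it could in principle be passed along as an already-escaped path. Whether furl preserves such escapes unchanged is not something this thread confirms.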

gruns · 2017-10-10T18:20:08Z

Your example is extremely helpful. Thank you.

As per RFC 3986, textual URL data should be UTF-8 encoded before being
percent-encoded. That is, each percent-encoded sequence in a URL component
should represent an octet of the UTF-8 encoding of the original text.

From Section 2.5, Identifying Data of RFC 3986 (https://www.ietf.org/rfc/rfc3986.txt):

When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the data
should first be encoded as octets according to the UTF-8 character encoding
[STD63]; then only those octets that do not correspond to characters in the
unreserved set should be percent-encoded.
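
The two steps in that passage can be sketched in Python 3 as follows. This is a minimal illustration, not furl's implementation; percent_encode_utf8 is a hypothetical helper name, and the unreserved set is taken from Section 2.3 of the same RFC.

```python
# Unreserved characters per RFC 3986, Section 2.3 (stored as byte values).
UNRESERVED = frozenset(
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'abcdefghijklmnopqrstuvwxyz'
    b'0123456789-._~'
)


def percent_encode_utf8(text):
    """Hypothetical helper: UTF-8 encode, then escape non-unreserved octets."""
    out = []
    for octet in text.encode('utf-8'):  # step 1: text -> UTF-8 octets
        if octet in UNRESERVED:
            out.append(chr(octet))      # unreserved octets pass through
        else:
            out.append('%{:02X}'.format(octet))  # step 2: percent-encode
    return ''.join(out)


encoded = percent_encode_utf8('你好吗')
print(encoded)  # %E4%BD%A0%E5%A5%BD%E5%90%97
```

Note that this strict sketch also escapes reserved characters such as '/', so it suits a single path segment rather than a whole path.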

As such, there's not a strong case for furl to deviate from this behavior, nor
allow an option to deviate from this behavior and potentially mislead users.

That said, thanks to your example, I now understand that you want to
percent-encode non-UTF8 encoded strings, a la u'/你好吗'.encode('gb2312'). But
I'm still a tad confused why you'd like to do that. The resultant percent-encoded URL
data, e.g. /%C4%E3%BA%C3%C2%F0 from quote(u'/你好吗'.encode('gb2312')),
will be decoded incorrectly by software that consumes that URL and expects
valid UTF-8 byte sequences but instead finds gb2312-encoded bytes. Like:

>>> from urllib import quote, unquote
>>> quote(u'你好吗'.encode('gb2312'))
'%C4%E3%BA%C3%C2%F0'
>>> unquote('%C4%E3%BA%C3%C2%F0').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
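
A rough Python 3 equivalent of the session above, using only the standard library, is sketched below. It assumes the consumer knows (or guesses) the charset out of band; nothing in the URL itself records it.

```python
from urllib.parse import unquote_to_bytes

# Undo the percent-encoding only; the result is raw bytes with no charset attached.
raw = unquote_to_bytes('%C4%E3%BA%C3%C2%F0')

# Decoding with the RFC-recommended UTF-8 fails for these bytes...
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ...while decoding with the charset actually used recovers the text.
text = raw.decode('gb2312')

print(utf8_ok)  # False
print(text)     # 你好吗
```

In Python 3, urllib.parse.unquote() defaults to errors='replace', so the same mismatch produces replacement characters there instead of an exception.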

So, if possible, please help me understand why you want to percent-encode
non-UTF8 encoded strings (like gb2312 encoded strings) into the final URL,
in a way that doesn't conform to RFC 3986. That will help determine how furl
can help solve your problem.

tuntapovski · 2017-10-14T09:13:15Z

i understood, didn't know that.
but clearly not every site i deal with care about that rfc either :D

we can close the issue since it's not an issue anymore

gruns · 2017-10-14T20:57:02Z

If you feel comfortable sharing, which website(s) don't heed RFC 3986 and
percent-encode gb2312-encoded bytes in their URLs rather than UTF-8 encoded
bytes?

I'm curious to see if, perhaps, something else is going on.

gruns · 2017-10-27T19:12:08Z

@tuntapovski Little bump -- if you feel comfortable sharing, which website(s)
don't heed RFC 3986 and percent-encode gb2312-encoded bytes instead of UTF-8
encoded bytes in their URLs?

tuntapovski · 2017-10-28T16:35:06Z

i just gave it as an example; i'm neither chinese nor interested in a website using gb2312.
also, i don't remember at all which sites use different encodings.

but there is a world out there that doesn't fit all RFCs and shapes them as it desires.
for example, have you seen any webmail service that completely follows the RFC?
! $ & * - = ^ | ~ # % ' + / ? _ { } can be used in local part

:D

gruns · 2017-10-30T21:43:41Z

Ah. Thank you for elucidating.

Closing this issue; it's not worth breaking RFC 3986 conformance for a
hypothetical problem. Please re-open this issue if you find this problem in the wild.

Thank you again, tuntapovski. Don't hesitate to let me know if there's anything
else I can do for you.

gruns closed this as completed Oct 30, 2017
