we can't choose encoding of final url #89
Comments
Adding an To better understand the problem, please share with the context (or code) in |
most basic example is scraping a website using non-UTF8 encoding give |
Please provide a code example, if possible. I'm still not sure how the encoding Also, are you using Python 2 or Python 3? In Python 2, |
py2 if you are using py3, we are not talking about same
# coding: utf8
import urllib
import furl
import sys
print sys.getdefaultencoding() # ascii
encoding = 'gb2312' # chinese charset
url_unicode = u'/你好吗'
url_byte = url_unicode.encode(encoding)
url_byte_quoted = urllib.quote(url_byte)
furl_byte = furl.furl(url_unicode).tostr()
# expected result
print url_byte_quoted # /%C4%E3%BA%C3%C2%F0
# what furl produces (utf8, not sys.getdefaultencoding())
print furl_byte # /%E4%BD%A0%E5%A5%BD%E5%90%97
# what i want (proposed; tostr() has no encoding parameter today)
print furl.furl(url_unicode).tostr(encoding='gb2312') # /%C4%E3%BA%C3%C2%F0
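For readers on Python 3, the expected output above can probably be reproduced without furl, since the standard library's `urllib.parse.quote` accepts an `encoding` argument. This is only a sketch of a workaround: the helper name `quote_path` is hypothetical and is not part of furl.

```python
from urllib.parse import quote


def quote_path(path, encoding="utf-8"):
    """Percent-encode a text path in the given charset.

    Hypothetical helper for illustration; not a furl API.
    """
    # quote() encodes the str with `encoding` before percent-escaping.
    # "/" is kept unescaped so path separators survive.
    return quote(path, safe="/", encoding=encoding)


print(quote_path("/你好吗"))                     # UTF-8 bytes, percent-encoded
print(quote_path("/你好吗", encoding="gb2312"))  # GB2312 bytes, percent-encoded
```

The second call should give the `/%C4%E3%BA%C3%C2%F0` form asked for in the example, and the first the UTF-8 form that furl emits.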
Your example is extremely helpful. Thank you.
As per RFC 3986, all URL data should be UTF-8 encoded before being
From Section 2.5, Identifying Data of RFC 3986 (https://www.ietf.org/rfc/rfc3986.txt):
As such, there's not a strong case for furl to deviate from this behavior, nor
That said, thanks to your example, I now understand that you want to
>>> from urllib import quote, unquote
>>> quote(u'你好吗'.encode('gb2312'))
'%C4%E3%BA%C3%C2%F0'
>>> unquote('%C4%E3%BA%C3%C2%F0').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
So, if possible, please help me understand why you want to percent-encode |
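The traceback above comes from Python 2. A rough Python 3 equivalent, using only `urllib.parse`, shows the same asymmetry: the GB2312 percent-encoded bytes are not valid UTF-8, so a strict UTF-8 decode fails, while decoding with the original charset succeeds. The variable names here are illustrative only.

```python
from urllib.parse import unquote, unquote_to_bytes

# u'你好吗' encoded as gb2312, then percent-encoded (value taken from the thread)
quoted = "%C4%E3%BA%C3%C2%F0"

# Undo the percent-encoding only, leaving raw bytes.
raw = unquote_to_bytes(quoted)

# A strict UTF-8 decode of GB2312 bytes is expected to fail.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(decoded_as_utf8)
# Decoding with the charset that produced the bytes recovers the text.
print(unquote(quoted, encoding="gb2312"))
```

Note that `unquote` defaults to `errors='replace'`, so calling it with the default UTF-8 encoding would silently yield replacement characters instead of raising. A receiver that assumes UTF-8 therefore has no reliable way to tell which charset the sender used.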
I understand now; I didn't know that. We can close the issue, since it's not really an issue after all.
If you feel comfortable sharing, which website(s) don't heed RFC 3986 and I'm curious to see if, perhaps, something else is going on. |
@tuntapovski Little bump -- if you feel comfortable sharing, which website(s) |
I just gave it as an example; I'm neither Chinese nor interested in a website using gb2312. But there is a world out there that doesn't follow every RFC and bends them as it likes. :D
Ah. Thank you for elucidating. Closing this issue; it's not worth breaking RFC for a hypothetical issue Thank you again, tuntapovski. Don't hesitate to let me know if there's anything |
This module is great and makes manipulating URLs easy.
Adding an "encoding" parameter to the ".tostr" function would be nice,
rather than encoding everything in "utf8",
since "utf8" is not the only option in real life.
Thanks.