In python3, urlopen expects a byte stream for the POST data. this patch encodes the data in utf-8 before transmission. Essentially a hack, since a proper fix involves allowing the user to specify an encoding scheme of choice.
I'd accept a patch that properly encodes the data to a user defined encoding and sets the content type header appropriately.
In that case, can you suggest a preferred way? would adding an optional parameter (default: ISO-8859-1) to submit_form be okay?
Just found out that ISO-8859-1 is the default for a HTTP POST request when nothing is explicitly specified.
I admit that this is really tricky because of the signature of the user defined open_http() function, which we can't easily change without breaking code.
OTOH, if users provide their own urlopen function, I think we can assume that they know how to deal with encoding the form data, so it should be enough to handle only the default case. So, yes, add an optional parameter that only gets passed through into open_http_urllib(), not into a user provided open_http().
While taking a look, I also noticed that the docstring is outdated. It would be nice if you could clean that up while you're at it anyway.
Thanks for giving it a try.
@scoder I have added an optional parameter to submit_form, which gets passed only to open_http_urllib and not user-defined open_http, just as you suggested. I defaulted to ISO-8859-1 since it is the default for POST.
Most browsers do not set any headers on a direct POST request, and I cave decided to do the same.
I am unsure of changes to the docstring, and have left it untouched as of now.
Please find the new commit in the same branch, phanimahesh:patch-2.
@scoder Have you reviewed the diff?
If there are any objections/comments/suggestions, please let me know.
gentle ping @scoder
The fix is simpler: data = urlencode(values).encode('ascii', 'strict'). Urlencoded value may contain only subset of ascii characters.
data = urlencode(values).encode('ascii', 'strict')
Note: it is unrelated to the character encoding that urlencode() itself uses (encoding parameter). That encoding may be calculated using description for html5 "Selecting a form submission encoding". btw, the default is UTF-8, not ISO-8859-1.
The character encoding was selected to match the default expected value for a POST request. But from the specification on how to select the form submission encoding in general, I agree that it must be UTF-8. And @scoder specifically requested that the user should be able to specify encoding here, hence I had to introduce an optional parameter.
In the code, GET request is being handled separately. Should the url encoding be specified there, at L915: url += urlencode(values)? Also, should we follow the complete detection algorithm as specified in the w3 spec and drop the optional parameter?
L915: url += urlencode(values)
There are two character encodings here.
One encoding is used to convert "urlencoded" Unicode string (that never ever contains non-ascii characters) into bytes to be accepted as data argument for urlopen() function as said in the urlencode() docs to avoid TypeError to fix the LP bug. It is an artifact of Python implementation that application/x-www-form-urlencoded data is returned as Unicode. The result of urlencoding is inherently a byte stream. ascii is always a safe choice here (though other ascii-based encoding also would work and produce the same byte sequence).
Another encoding is unrelated to the issue. It is the character encoding used before "percent encoding" is applied. It is the form submission encoding.
As I understand it, @scoder talks about the second character encoding. Changing it, configuring it doesn't fix TypeError.
GET request is unrelated to the issue. urlopen() should accept url as a Unicode string here.