Query string parameter encoding #43

lovasoa · 2015-05-04T17:35:56Z

When using mod_python with python3, I get invalid parameters encoding.

Whereas the URL specification only talks about UTF-8, I see references to "latin1" everywhere in the code. Is that a bug?

lovasoa · 2015-05-04T17:46:52Z

I even see references to "Unicode Latin1".
What is that? Unicode has several character encodings. The one to use in urls is UTF-8. Latin1 is used to designate the ISO-8859-1, that existed before unicode, and is not part of it.

lovasoa · 2015-05-04T17:56:19Z

All the strings that I get from mod_python in my scripts are corrupted. If I send 0xc3 0xa9 ("é" in UTF8) to my script, I get an string in python containing "Ã©" (represented by the bytes 0xc3 0x83 0xc2 0xa9 in UTF8).

lovasoa · 2015-05-04T18:04:13Z

Still in the domain of query strings: when I send a parameter that contains a null byte, the end of the string is not provided to my application.

grisha · 2015-05-04T18:13:47Z

I may be wrong, but I think a valid URL can only contain US-ASCII strings, everything else must be encoded, so a valid URL cannot contain a é, it must be encoded as %C3%A9. See http://tools.ietf.org/html/rfc3986#section-2.

IIRC mod_python uses ISO-8859-1 (aka Latin-1) because it is a single byte encoding in which all combinations of bits are valid character (unlike ASCII or UTF8 where the first bit denotes a continuation).

lovasoa · 2015-05-04T18:16:39Z

Yes, a valid URL has to have its caracters percent-encoded. What I'm talking about is haw mod_python treats these percent-encoded URLs.

lovasoa · 2015-05-04T18:21:33Z

As you stated, é has to be encoded by %C3%A9. It's encoded as UTF8, and each byte is written in hex with a percent sign in front of it. What I say is that mod_python mistakenly decodes the 0xc3 0xa9 that apache gives it as latin1. So it decodes it as "Ã©", and gives that to my script.

grisha · 2015-05-04T18:37:27Z

Got it. I'd compare the values of req.uri, req.parsed_uri and req.unparsed_uri - there may be a clue there.

lovasoa · 2015-05-04T18:45:48Z

What do you mean? These values are all still percent-encoded.

lovasoa · 2015-05-04T18:47:19Z

def sayhi(req, name):
        return "%s\n%s\n%s" % (req.uri,req.parsed_uri,req.unparsed_uri)

when called with test.py/sayhi?name=Гриша, gives me

/test.py/sayhi
(None, None, None, None, None, None, '/test.py/sayhi', 'name=%D0%93%D1%80%D0%B8%D1%88%D0%B0', None)
/test.py/sayhi?name=%D0%93%D1%80%D0%B8%D1%88%D0%B0

lovasoa · 2015-05-04T19:02:12Z

And according to its documentation, PyUnicode_DecodeLatin1 can raise errors, which are not handled in mod_python.

grisha · 2015-05-04T19:08:01Z

What about req.args? It looks like query parameters are derived from req.args: https://github.com/grisha/mod_python/blob/master/lib/python/mod_python/util.py#L248

lovasoa · 2015-05-04T19:12:15Z

req.args is also percent-encoded.

lovasoa · 2015-05-04T19:14:40Z

And that confuses me. Where is the parameters string percent-decoded?

grisha · 2015-05-04T19:23:44Z

Good question :) I don't remember... Ultimately it all ends up in req.form. Is it still not decoded in there? My guess is that decoding happens somewhere in this class: https://github.com/grisha/mod_python/blob/master/lib/python/mod_python/util.py#L235 (but from just staring at the code in github I can't tell where).

lovasoa · 2015-05-04T20:27:23Z

I think it's in _apachemodule.c, lines 364-365

grisha · 2015-05-04T20:44:35Z

I think you're right... https://github.com/grisha/mod_python/blob/master/src/_apachemodule.c#L364

So somehow the output from Apache does not agree with what mod_python is expecting... hrm...

lovasoa · 2015-05-04T20:47:24Z

The output of apache is just a char *. According to the spec, mod_python should either not decode it, or decode it as UTF8. Currently, it decodes it as iso9959-1.

lovasoa · 2015-05-04T20:49:12Z

And it should handle invalid strings.

(partially) fixes grisha#43

lovasoa · 2015-05-05T13:38:40Z

My PR fixes decoding of query strings. But there are references to Latin1 in other places, that we should check.

And while reading the code, I found a few strange things. Where is parse_qs used? (the code of parse_qsl seems to be copy-pasted from it, but I don't see other references to parse_qs).

Why don't you use apache's functions (in particular ap_getword) for parsing the URIS?

grisha · 2015-05-05T15:05:24Z

Thank you! If you want to "go the extra mile" - it would be good to have a test to go along with it :)

I would also dig up and reference the specific Apache HTTPd documentation that describes that the return of ap_unescape_url is UTF8 for documentation. And just to make sure that this is not something specific to httpd versions.

Certainly checking for other places where Latin1 is used would be a good thing, though I remember at the time considering it very carefully. It looks like it's only mentioned in the code a total of 4 times (if github search is correct).

Regarding parse_qs() - it is a public API documented in the http://modpython.org/live/current/doc-html/pythonapi.html#other-functions

Regarding ap_getword - no idea, it's possible the parse_qs* code predates ap_getword or may be there was some other reason.

Also, you mentioned that PyUnicode_Decode* can raise errors - AFAICT nothing needs to be done about that - the raised error will happen in Python code.

lovasoa · 2015-05-07T09:29:57Z

ap_unescape_url can return anything. It's the client, that, if it respects the spec, will send a percent-encoded string that wil be a valid UTF8 string once it has been unescaped.

grisha · 2015-05-07T16:46:13Z

I understand now. It looks like https://www.ietf.org/rfc/rfc3986.txt states it quite plainly too:

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.

lovasoa added a commit to lovasoa/mod_python that referenced this issue May 5, 2015

Decode query string as UTF8

732533a

(partially) fixes grisha#43

lovasoa mentioned this issue May 5, 2015

Decode query string as UTF8 #44

Merged

grisha closed this as completed in #44 May 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query string parameter encoding #43

Query string parameter encoding #43

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 5, 2015

grisha commented May 5, 2015

lovasoa commented May 7, 2015

grisha commented May 7, 2015

Query string parameter encoding #43

Query string parameter encoding #43

Comments

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

grisha commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 4, 2015

lovasoa commented May 5, 2015

grisha commented May 5, 2015

lovasoa commented May 7, 2015

grisha commented May 7, 2015