Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query string parameter encoding #43

Closed
lovasoa opened this issue May 4, 2015 · 22 comments · Fixed by #44
Closed

Query string parameter encoding #43

lovasoa opened this issue May 4, 2015 · 22 comments · Fixed by #44

Comments

@lovasoa
Copy link
Contributor

lovasoa commented May 4, 2015

When using mod_python with python3, I get invalid parameters encoding.

Whereas the URL specification only talks about UTF-8, I see references to "latin1" everywhere in the code. Is that a bug?

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

I even see references to "Unicode Latin1".
What is that? Unicode has several character encodings. The one to use in urls is UTF-8. Latin1 is used to designate the ISO-8859-1, that existed before unicode, and is not part of it.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

All the strings that I get from mod_python in my scripts are corrupted. If I send 0xc3 0xa9 ("é" in UTF8) to my script, I get an string in python containing "é" (represented by the bytes 0xc3 0x83 0xc2 0xa9 in UTF8).

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

Still in the domain of query strings: when I send a parameter that contains a null byte, the end of the string is not provided to my application.

@grisha
Copy link
Owner

grisha commented May 4, 2015

I may be wrong, but I think a valid URL can only contain US-ASCII strings, everything else must be encoded, so a valid URL cannot contain a é, it must be encoded as %C3%A9. See http://tools.ietf.org/html/rfc3986#section-2.

IIRC mod_python uses ISO-8859-1 (aka Latin-1) because it is a single byte encoding in which all combinations of bits are valid character (unlike ASCII or UTF8 where the first bit denotes a continuation).

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

Yes, a valid URL has to have its caracters percent-encoded. What I'm talking about is haw mod_python treats these percent-encoded URLs.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

As you stated, é has to be encoded by %C3%A9. It's encoded as UTF8, and each byte is written in hex with a percent sign in front of it. What I say is that mod_python mistakenly decodes the 0xc3 0xa9 that apache gives it as latin1. So it decodes it as "é", and gives that to my script.

@grisha
Copy link
Owner

grisha commented May 4, 2015

Got it. I'd compare the values of req.uri, req.parsed_uri and req.unparsed_uri - there may be a clue there.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

What do you mean? These values are all still percent-encoded.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

def sayhi(req, name):
        return "%s\n%s\n%s" % (req.uri,req.parsed_uri,req.unparsed_uri)

when called with test.py/sayhi?name=Гриша, gives me

/test.py/sayhi
(None, None, None, None, None, None, '/test.py/sayhi', 'name=%D0%93%D1%80%D0%B8%D1%88%D0%B0', None)
/test.py/sayhi?name=%D0%93%D1%80%D0%B8%D1%88%D0%B0

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

And according to its documentation, PyUnicode_DecodeLatin1 can raise errors, which are not handled in mod_python.

@grisha
Copy link
Owner

grisha commented May 4, 2015

What about req.args? It looks like query parameters are derived from req.args: https://github.com/grisha/mod_python/blob/master/lib/python/mod_python/util.py#L248

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

req.args is also percent-encoded.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

And that confuses me. Where is the parameters string percent-decoded?

@grisha
Copy link
Owner

grisha commented May 4, 2015

Good question :) I don't remember... Ultimately it all ends up in req.form. Is it still not decoded in there? My guess is that decoding happens somewhere in this class: https://github.com/grisha/mod_python/blob/master/lib/python/mod_python/util.py#L235 (but from just staring at the code in github I can't tell where).

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

I think it's in _apachemodule.c, lines 364-365

@grisha
Copy link
Owner

grisha commented May 4, 2015

I think you're right... https://github.com/grisha/mod_python/blob/master/src/_apachemodule.c#L364

So somehow the output from Apache does not agree with what mod_python is expecting... hrm...

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

The output of apache is just a char *. According to the spec, mod_python should either not decode it, or decode it as UTF8. Currently, it decodes it as iso9959-1.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 4, 2015

And it should handle invalid strings.

lovasoa added a commit to lovasoa/mod_python that referenced this issue May 5, 2015
@lovasoa
Copy link
Contributor Author

lovasoa commented May 5, 2015

My PR fixes decoding of query strings. But there are references to Latin1 in other places, that we should check.

And while reading the code, I found a few strange things. Where is parse_qs used? (the code of parse_qsl seems to be copy-pasted from it, but I don't see other references to parse_qs).

Why don't you use apache's functions (in particular ap_getword) for parsing the URIS?

@grisha
Copy link
Owner

grisha commented May 5, 2015

Thank you! If you want to "go the extra mile" - it would be good to have a test to go along with it :)

I would also dig up and reference the specific Apache HTTPd documentation that describes that the return of ap_unescape_url is UTF8 for documentation. And just to make sure that this is not something specific to httpd versions.

Certainly checking for other places where Latin1 is used would be a good thing, though I remember at the time considering it very carefully. It looks like it's only mentioned in the code a total of 4 times (if github search is correct).

Regarding parse_qs() - it is a public API documented in the http://modpython.org/live/current/doc-html/pythonapi.html#other-functions

Regarding ap_getword - no idea, it's possible the parse_qs* code predates ap_getword or may be there was some other reason.

Also, you mentioned that PyUnicode_Decode* can raise errors - AFAICT nothing needs to be done about that - the raised error will happen in Python code.

@lovasoa
Copy link
Contributor Author

lovasoa commented May 7, 2015

ap_unescape_url can return anything. It's the client, that, if it respects the spec, will send a percent-encoded string that wil be a valid UTF8 string once it has been unescaped.

@grisha
Copy link
Owner

grisha commented May 7, 2015

I understand now. It looks like https://www.ietf.org/rfc/rfc3986.txt states it quite plainly too:

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.

@grisha grisha closed this as completed in #44 May 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants