Query string parameter encoding #43
I even see references to "Unicode Latin1". |
All the strings that I get from mod_python in my scripts are corrupted. If I send |
Still in the domain of query strings: when I send a parameter that contains a null byte, the end of the string is not provided to my application. |
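The truncation described above is what NUL-terminated C string handling typically produces. Whether that is the actual cause inside mod_python is an assumption, not something confirmed in this thread. A minimal sketch using only the Python 3 standard library shows the difference between a Python-level parse, which keeps the null byte, and a C-style read of the same bytes, which stops at it:

```python
import ctypes
from urllib.parse import parse_qs

# Pure-Python parsing keeps the percent-decoded null byte in the value.
params = parse_qs("name=ab%00cd")
print(repr(params["name"][0]))   # 'ab\x00cd'

# A C-style (NUL-terminated) view of the same bytes stops at the first NUL.
buf = ctypes.create_string_buffer(b"ab\x00cd")
print(buf.value)                 # b'ab'
print(buf.raw)                   # b'ab\x00cd\x00'
```

If the application needs the bytes after the null, the value has to be carried with an explicit length rather than as a NUL-terminated string.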
I may be wrong, but I think a valid URL can only contain US-ASCII characters; everything else must be percent-encoded, so a valid URL cannot contain an é, it must be encoded as %C3%A9. See http://tools.ietf.org/html/rfc3986#section-2. IIRC mod_python uses ISO-8859-1 (aka Latin-1) because it is a single-byte encoding in which every byte value is a valid character (unlike ASCII, which only defines 7-bit values, or UTF-8, where the high bits mark multi-byte sequences). |
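To make the encoding point concrete, here is a minimal sketch using only Python 3's standard `urllib.parse` (not mod_python's own parser). It percent-encodes é as UTF-8 and then shows what the same bytes look like when decoded as Latin-1 instead. The variable names are illustrative.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

encoded = quote("é")                # quote() encodes as UTF-8 by default
raw = unquote_to_bytes(encoded)     # the raw bytes behind the percent-escapes

as_utf8 = unquote(encoded)                        # decoded as UTF-8 (the default)
as_latin1 = unquote(encoded, encoding="latin-1")  # each byte decoded as Latin-1

print(encoded)    # %C3%A9
print(raw)        # b'\xc3\xa9'
print(as_utf8)    # é
print(as_latin1)  # Ã©
```

The Latin-1 result never raises an error, but it turns one character into two, which matches the kind of corruption reported in this issue.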
Yes, a valid URL has to have its characters percent-encoded. What I'm talking about is how mod_python treats these percent-encoded URLs. |
As you stated, é has to be encoded by |
Got it. I'd compare the values of |
What do you mean? These values are all still percent-encoded. |
def sayhi(req, name):
    return "%s\n%s\n%s" % (req.uri, req.parsed_uri, req.unparsed_uri)
when called with
|
And according to its documentation, PyUnicode_DecodeLatin1 can raise errors, which are not handled in mod_python. |
What about |
|
And that confuses me. Where is the parameters string percent-decoded? |
Good question :) I don't remember... Ultimately it all ends up in |
I think it's in _apachemodule.c, lines 364-365 |
I think you're right... https://github.com/grisha/mod_python/blob/master/src/_apachemodule.c#L364 So somehow the output from Apache does not agree with what mod_python is expecting... hrm... |
The output of apache is just a |
And it should handle invalid strings. |
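One way to handle invalid strings is sketched below, in plain Python rather than in mod_python's C code: try strict UTF-8 first and fall back to Latin-1 when the bytes are not valid UTF-8. The helper name `decode_query_bytes` and the fallback policy are assumptions for illustration, not what the PR in this thread does.

```python
def decode_query_bytes(raw: bytes) -> str:
    """Decode percent-decoded query bytes without raising (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xFF to a code point, so this cannot fail.
        return raw.decode("latin-1")


print(decode_query_bytes(b"\xc3\xa9"))  # valid UTF-8 for é
print(decode_query_bytes(b"\xe9"))      # lone 0xE9 is invalid UTF-8, falls back to Latin-1
```

Passing `errors="replace"` or `errors="surrogateescape"` to `bytes.decode` is another option when a fallback encoding is not wanted.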
(partially) fixes grisha#43
My PR fixes the decoding of query strings, but there are references to Latin1 in other places that we should check. While reading the code, I also found a few strange things. Where is Why don't you use Apache's functions (in particular |
Thank you! If you want to "go the extra mile" - it would be good to have a test to go along with it :) I would also dig up and reference the specific Apache HTTPd documentation that describes that the return of Certainly checking for other places where Latin1 is used would be a good thing, though I remember at the time considering it very carefully. It looks like it's only mentioned in the code a total of 4 times (if GitHub search is correct). Regarding parse_qs() - it is a public API documented at http://modpython.org/live/current/doc-html/pythonapi.html#other-functions Regarding Also, you mentioned that PyUnicode_Decode* can raise errors - AFAICT nothing needs to be done about that - the raised error will happen in Python code. |
|
I understand now. It looks like https://www.ietf.org/rfc/rfc3986.txt states it quite plainly too:
|
When using mod_python with python3, I get invalid parameters encoding.
Whereas the URL specification only talks about UTF-8, I see references to "latin1" everywhere in the code. Is that a bug?