utf8 in parse_qs, application/x-www-form-urlencoded #82

Closed
defunerik opened this Issue Aug 17, 2012 · 3 comments

Hi,

In the code that decodes a POSTed form with content type "application/x-www-form-urlencoded", the urlencoded (percent-hex) strings for the keys and values are decoded. The problem is that each hex-encoded byte is placed literally back into the resulting string, with no accommodation for the character encoding. The resulting string therefore contains the individual bytes of the utf8 sequence as integers, rather than the unicode codepoints that the utf8 sequence encodes, which is what a string is expected to contain.

There are some methods available in HTTP to communicate the encoding, but they seem to be in total disarray. Besides, a form may be configured to send as either POST or GET, and then things get really fun.

As far as I can gather, the issue has been punted to "everything is utf8", meaning that we should assume that all modern browsers encode user input as utf-8 before urlencoding and finally encapsulating into the query string format.

Now, with mochiweb at the moment, I can deal with this by assuming the results of mochiweb_util:parse_qs are integer lists that represent utf-8 sequences, and this generally works simply by using "list_to_binary" on a value returned by proplists:get_value(). For instance, the value can be used as part of a json document or in other contexts which expect a utf8 binary. However, this is a hack, because the original string is an invalid representation.
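To make the workaround concrete, here is a minimal sketch. It assumes that a form value such as "caf%C3%A9" comes back from mochiweb_util:parse_qs as the byte list [99,97,102,195,169], which is the behavior described above. The module and function names are hypothetical and only for illustration. Nothing here calls mochiweb, so it runs on a plain Erlang install.

```erlang
-module(qs_utf8_sketch).
-export([to_utf8_binary/1, to_codepoints/1, demo/0]).

%% Treat a decoded value as a list of bytes and pack it into a binary.
%% This is the list_to_binary workaround described in the issue.
to_utf8_binary(Bytes) when is_list(Bytes) ->
    list_to_binary(Bytes).

%% Reinterpret the same bytes as UTF-8 and return Unicode codepoints,
%% i.e. what a "real" Erlang unicode string would contain.
to_codepoints(Bytes) when is_list(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        Chars when is_list(Chars) -> {ok, Chars};
        {error, _, _} -> {error, invalid_utf8};
        {incomplete, _, _} -> {error, invalid_utf8}
    end.

demo() ->
    %% Assumed decoding of "caf%C3%A9": 195,169 is the UTF-8 pair for U+00E9.
    Bytes = [99, 97, 102, 195, 169],
    <<"caf", 195, 169>> = to_utf8_binary(Bytes),
    %% As codepoints the two bytes collapse into the single integer 233.
    {ok, [99, 97, 102, 233]} = to_codepoints(Bytes),
    ok.
```

The byte list has five elements while the codepoint list has four, which is the mismatch in question.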

I see two ways out of this sticky wicket. One is to fix the strings created by parse_qs. They would be converted from utf8 to unicode strings. This stays with the spirit of mochiweb where things are mostly strings. The downside of this is that if other people are using list_to_binary to convert them to utf8 there will be breakage.
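A rough sketch of the breakage I mean, assuming parse_qs were changed to return codepoint lists (it does not do this today). The module name is hypothetical. Callers that keep using list_to_binary would either get Latin-1 bytes instead of utf8 or crash outright, and would need to move to unicode:characters_to_binary.

```erlang
-module(qs_breakage_sketch).
-export([demo/0]).

demo() ->
    %% "café" as Unicode codepoints, as a hypothetically fixed parse_qs
    %% might return it.
    Cafe = [99, 97, 102, 233],
    %% list_to_binary still succeeds, but emits the single byte 233,
    %% which is Latin-1 and not valid UTF-8.
    <<99, 97, 102, 233>> = list_to_binary(Cafe),
    %% The conversion callers would need instead.
    <<99, 97, 102, 195, 169>> = unicode:characters_to_binary(Cafe),
    %% For a codepoint above 255 (euro sign, U+20AC) list_to_binary fails.
    Euro = [8364],
    {'EXIT', {badarg, _}} = (catch list_to_binary(Euro)),
    <<226, 130, 172>> = unicode:characters_to_binary(Euro),
    ok.
```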

The second is to keep everything in binaries, thus preserving the original byte sequences as well as utilizing Erlang's default utf8 mode for binary "strings".

I've read the past discussions and attempts to "binaryize" mochiweb (esp. benbro), which I believe ended in the conclusion that it would be not too much fun to do (lots of testing, managing merges, etc.), even though it looked like the effort was nearly complete, and that those looking for Erlang web servers operating in binary mode maybe should look elsewhere. However, at the moment my project is wound up with Webmachine, which itself is bound to Mochiweb.

So, I guess to wind this up I'm looking for advice - am I right that parse_qs is broken in this respect? Is it fixable in the mother code? Should it be handled by a fork? Should that fork bite the bullet to binary, perhaps building on the work of benbro?

Thanks.

Owner

etrepum commented Aug 17, 2012

mochiweb does not use "unicode strings" anywhere. Even though it uses lists of integers, they are treated as lists of bytes (more like iolist() than string()). At the time mochiweb was written, Erlang binaries were a lot slower, and they were not supported in HTTP header decoding.

You probably shouldn't fork the code and it can't be fixed without drastically changing the APIs, which isn't something that makes sense at this point. If you want a cleaner all-binary API you'll want to switch web servers. A "fixed" mochiweb (regardless of whether unicode strings or binaries are used) wouldn't be compatible with webmachine either.

@etrepum etrepum closed this Aug 17, 2012

I found a google group posting from a couple of years back where it was explained.

I realize that there is a history there, and, now, that this behavior is intentional.

However, there is also confusion. Take webmachine, for instance. Webmachine uses parse_qs directly to populate the URL query field in the request object it creates. It then specifies in several places that the keys and values of the proplist that results from parse_qs are "strings". Well, if you decide not to use your own knowledge of what an Erlang string is, since that would be too sensible or you have hit a bug, you might instead look up the webmachine definition (nonstandard by their own admission):

"a list() with all elements in the ASCII range".

Really? Is that 7 bit standard ASCII or US-ASCII? Or 8 bit "extended" ASCII?

Of course, nobody is telling me that I need to treat these as strings, so I can use them as lists of bytes.

Owner

etrepum commented Aug 17, 2012

At the time mochiweb was written, Erlang didn't even support Unicode except for a few functions hidden away in xmerl. Strings were either latin1 or bytes of some other kind, and since these are largely compatible no real distinction was made. The documentation you've found that says ASCII is simply wrong.
