Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

mochiweb_html unicode breakage #54

mattsta opened this Issue Aug 10, 2011 · 5 comments


None yet
3 participants

mattsta commented Aug 10, 2011


 20> mochiweb_html:parse("<done>Matt’s Breaking Line</done>").    
 ** exception error: bad argument
      in function  iolist_to_binary/1
         called as iolist_to_binary([60,100,111,110,101,62,77,97,116,116,8217,
      in call from mochiweb_html:tokens/1
      in call from mochiweb_html:parse/1

 17> mochiweb_html:parse(<<"<done>Matts Breaking Line</done>">>). 
 {<<"done">>,[],[<<"Matts Breaking Line">>]}

Going through mochiweb_html and replacing all iolist_to_binary/1 with unicode:characters_to_binary/1 almost works (along with replacing all binary_to_list/1 with unicode:characters_to_list/1). It causes the functions to return, but the output is garbled:

 14> io:format("~s~n", [unicode:characters_to_binary(mochiweb_html:to_html(mochiweb_html:parse("<done>Matt’s Breaking Line</done>")))]).
 <done>Matt�s Breaking Line</done>

Any ideas on how to help multi-byte characters survive a mochiweb_html:parse/1 to mochiweb_html:to_html/1 cycle?


etrepum commented Aug 10, 2011

The intention is that the input is already UTF-8, whether it is binary or iolist. This is consistent with what you would get by doing something like reading the request body from a mochiweb request

1> io:fwrite([mochiweb_html:to_html(mochiweb_html:parse(<<"<done>Matt\xe2\x80\x99s Breaking Line</done>">>)), "\n"]).
<done>Matt’s Breaking Line</done>

Note the usage of io:fwrite/1 to send the output "directly" to stdout instead of io:format/2 which does the wrong thing. io:format/2 does some really broken shit by default if you are using non-ascii characters.

2> io:format("~s~n", [[8217]]).                                              
** exception exit: {badarg,[{io,format,[<0.25.0>,"~s~n",[[8217]]]},
     in function  io:o_request/3

@etrepum etrepum closed this Aug 10, 2011


dreid commented Aug 10, 2011

@mattsta it's also worth noting here that you can use mochiutf8:codepoints_to_bytes/1 to convert the list you have into a utf8 binary.

1> S = "<done>Matt’s Breaking Line</done>".
2> mochiutf8:codepoints_to_bytes(S).       
3> mochiweb_html:parse(mochiutf8:codepoints_to_bytes(S)).

etrepum commented Aug 10, 2011

I believe that unicode:characters_to_binary(S, utf8) is faster than mochiutf8:codepoints_to_bytes/1 (some functions in the unicode module are BIFs). Different error behavior though.

mattsta commented Aug 10, 2011

Thanks for the clarification. I was passing in a fetched HTML body from httpc:request without first running through unicode/utf8 conversion.

Everything works fine as long as I pre-convert:

 27> io:fwrite(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

 28> io:format(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

What can we do to help people not bug you about this in the future? The type system isn't elaborate enough to reject non-utf8 outright.

There's always the cheap:

 parse(Input) ->
       error:badarg -> not_utf8

Or maybe even try to fix it ourselves, though it could hide performance issues by double-evaluating on every run:

 parse(Input) ->
       error:badarg -> parse_tokens(tokens(unicode:characters_to_binary(Input)))

resolved: dumbmatt


etrepum commented Aug 10, 2011

I'd accept a documentation patch that says the input must be UTF-8. I guess string() is a bit misleading, but I wrote all this stuff before Erlang had any attempt to support unicode (except for some functions hiding in xmerl).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment