mochiweb_html unicode breakage #54

mattsta · 2011-08-10T04:47:24Z

Example:

 20> mochiweb_html:parse("<done>Matt’s Breaking Line</done>").    
 ** exception error: bad argument
      in function  iolist_to_binary/1
         called as iolist_to_binary([60,100,111,110,101,62,77,97,116,116,8217,
                                          115,32,66,114,101,97,107,105,110,103,32,
                                          76,105,110,101,60,47|...])
      in call from mochiweb_html:tokens/1
      in call from mochiweb_html:parse/1

 17> mochiweb_html:parse(<<"<done>Matts Breaking Line</done>">>). 
 {<<"done">>,[],[<<"Matts Breaking Line">>]}

Going through mochiweb_html and replacing all iolist_to_binary/1 with unicode:characters_to_binary/1 almost works (along with replacing all binary_to_list/1 with unicode:characters_to_list/1). It causes the functions to return, but the output is garbled:

 14> io:format("~s~n", [unicode:characters_to_binary(mochiweb_html:to_html(mochiweb_html:parse("<done>Matt’s Breaking Line</done>")))]).
 <done>Mattâ��s Breaking Line</done>

Any ideas on how to help multi-byte characters survive a mochiweb_html:parse/1 to mochiweb_html:to_html/1 cycle?

The text was updated successfully, but these errors were encountered:

etrepum · 2011-08-10T16:48:22Z

The intention is that the input is already UTF-8, whether it is binary or iolist. This is consistent with what you would get by doing something like reading the request body from a mochiweb request

1> io:fwrite([mochiweb_html:to_html(mochiweb_html:parse(<<"<done>Matt\xe2\x80\x99s Breaking Line</done>">>)), "\n"]).
<done>Matt’s Breaking Line</done>
ok

Note the usage of io:fwrite/1 to send the output "directly" to stdout instead of io:format/2 which does the wrong thing. io:format/2 does some really broken shit by default if you are using non-ascii characters.

2> io:format("~s~n", [[8217]]).                                              
** exception exit: {badarg,[{io,format,[<0.25.0>,"~s~n",[[8217]]]},
                            {erl_eval,do_apply,5},
                            {shell,exprs,7},
                            {shell,eval_exprs,7},
                            {shell,eval_loop,3}]}
     in function  io:o_request/3

dreid · 2011-08-10T16:55:09Z

@mattsta it's also worth noting here that you can use mochiutf8:codepoints_to_bytes/1 to convert the list you have into a utf8 binary.

1> S = "<done>Matt’s Breaking Line</done>".
[60,100,111,110,101,62,77,97,116,116,8217,115,32,66,114,101,
 97,107,105,110,103,32,76,105,110,101,60,47,100|...]
2> mochiutf8:codepoints_to_bytes(S).       
<<60,100,111,110,101,62,77,97,116,116,226,128,153,115,32,
  66,114,101,97,107,105,110,103,32,76,105,110,101,60,...>>
3> mochiweb_html:parse(mochiutf8:codepoints_to_bytes(S)).
{<<"done">>,[],
 [<<77,97,116,116,226,128,153,115,32,66,114,101,97,107,
    105,110,103,32,76,105,110,101>>]}

etrepum · 2011-08-10T17:03:45Z

I believe that unicode:characters_to_binary(S, utf8) is faster than mochiutf8:codepoints_to_bytes/1 (some functions in the unicode module are BIFs). Different error behavior though.

mattsta · 2011-08-10T18:22:13Z

Thanks for the clarification. I was passing in a fetched HTML body from httpc:request without first running through unicode/utf8 conversion.

Everything works fine as long as I pre-convert:

 27> io:fwrite(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

 28> io:format(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

What can we do to help people not bug you about this in the future? The type system isn't elaborate enough to reject non-utf8 outright.

There's always the cheap:

 parse(Input) ->
     try
        parse_tokens(tokens(Input))
     catch
       error:badarg -> not_utf8
     end.

Or maybe even try to fix it ourselves, though it could hide performance issues by double-evaluating on every run:

 parse(Input) ->
     try
        parse_tokens(tokens(Input))
     catch
       error:badarg -> parse_tokens(tokens(unicode:characters_to_binary(Input)))
     end.

resolved: dumbmatt

etrepum · 2011-08-10T18:26:27Z

I'd accept a documentation patch that says the input must be UTF-8. I guess string() is a bit misleading, but I wrote all this stuff before Erlang had any attempt to support unicode (except for some functions hiding in xmerl).

etrepum closed this as completed Aug 10, 2011

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mochiweb_html unicode breakage #54

mochiweb_html unicode breakage #54

mattsta commented Aug 10, 2011

etrepum commented Aug 10, 2011

dreid commented Aug 10, 2011

etrepum commented Aug 10, 2011

mattsta commented Aug 10, 2011

etrepum commented Aug 10, 2011

mochiweb_html unicode breakage #54

mochiweb_html unicode breakage #54

Comments

mattsta commented Aug 10, 2011

etrepum commented Aug 10, 2011

dreid commented Aug 10, 2011

etrepum commented Aug 10, 2011

mattsta commented Aug 10, 2011

etrepum commented Aug 10, 2011