Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

mochiweb_html unicode breakage #54

Closed
mattsta opened this Issue Aug 10, 2011 · 5 comments

Comments

Projects
None yet
3 participants

mattsta commented Aug 10, 2011

Example:

 20> mochiweb_html:parse("<done>Matt’s Breaking Line</done>").    
 ** exception error: bad argument
      in function  iolist_to_binary/1
         called as iolist_to_binary([60,100,111,110,101,62,77,97,116,116,8217,
                                          115,32,66,114,101,97,107,105,110,103,32,
                                          76,105,110,101,60,47|...])
      in call from mochiweb_html:tokens/1
      in call from mochiweb_html:parse/1

 17> mochiweb_html:parse(<<"<done>Matts Breaking Line</done>">>). 
 {<<"done">>,[],[<<"Matts Breaking Line">>]}

Going through mochiweb_html and replacing all iolist_to_binary/1 with unicode:characters_to_binary/1 almost works (along with replacing all binary_to_list/1 with unicode:characters_to_list/1). It causes the functions to return, but the output is garbled:

 14> io:format("~s~n", [unicode:characters_to_binary(mochiweb_html:to_html(mochiweb_html:parse("<done>Matt’s Breaking Line</done>")))]).
 <done>Matt�s Breaking Line</done>

Any ideas on how to help multi-byte characters survive a mochiweb_html:parse/1 to mochiweb_html:to_html/1 cycle?

Owner

etrepum commented Aug 10, 2011

The intention is that the input is already UTF-8, whether it is binary or iolist. This is consistent with what you would get by doing something like reading the request body from a mochiweb request

1> io:fwrite([mochiweb_html:to_html(mochiweb_html:parse(<<"<done>Matt\xe2\x80\x99s Breaking Line</done>">>)), "\n"]).
<done>Matt’s Breaking Line</done>
ok

Note the usage of io:fwrite/1 to send the output "directly" to stdout instead of io:format/2 which does the wrong thing. io:format/2 does some really broken shit by default if you are using non-ascii characters.

2> io:format("~s~n", [[8217]]).                                              
** exception exit: {badarg,[{io,format,[<0.25.0>,"~s~n",[[8217]]]},
                            {erl_eval,do_apply,5},
                            {shell,exprs,7},
                            {shell,eval_exprs,7},
                            {shell,eval_loop,3}]}
     in function  io:o_request/3

@etrepum etrepum closed this Aug 10, 2011

Member

dreid commented Aug 10, 2011

@mattsta it's also worth noting here that you can use mochiutf8:codepoints_to_bytes/1 to convert the list you have into a utf8 binary.

1> S = "<done>Matt’s Breaking Line</done>".
[60,100,111,110,101,62,77,97,116,116,8217,115,32,66,114,101,
 97,107,105,110,103,32,76,105,110,101,60,47,100|...]
2> mochiutf8:codepoints_to_bytes(S).       
<<60,100,111,110,101,62,77,97,116,116,226,128,153,115,32,
  66,114,101,97,107,105,110,103,32,76,105,110,101,60,...>>
3> mochiweb_html:parse(mochiutf8:codepoints_to_bytes(S)).
{<<"done">>,[],
 [<<77,97,116,116,226,128,153,115,32,66,114,101,97,107,
    105,110,103,32,76,105,110,101>>]}
Owner

etrepum commented Aug 10, 2011

I believe that unicode:characters_to_binary(S, utf8) is faster than mochiutf8:codepoints_to_bytes/1 (some functions in the unicode module are BIFs). Different error behavior though.

mattsta commented Aug 10, 2011

Thanks for the clarification. I was passing in a fetched HTML body from httpc:request without first running through unicode/utf8 conversion.

Everything works fine as long as I pre-convert:

 27> io:fwrite(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

 28> io:format(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

What can we do to help people not bug you about this in the future? The type system isn't elaborate enough to reject non-utf8 outright.

There's always the cheap:

 parse(Input) ->
     try
        parse_tokens(tokens(Input))
     catch
       error:badarg -> not_utf8
     end.

Or maybe even try to fix it ourselves, though it could hide performance issues by double-evaluating on every run:

 parse(Input) ->
     try
        parse_tokens(tokens(Input))
     catch
       error:badarg -> parse_tokens(tokens(unicode:characters_to_binary(Input)))
     end.

resolved: dumbmatt

Owner

etrepum commented Aug 10, 2011

I'd accept a documentation patch that says the input must be UTF-8. I guess string() is a bit misleading, but I wrote all this stuff before Erlang had any attempt to support unicode (except for some functions hiding in xmerl).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment