Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mochiweb_html unicode breakage #54

Closed
mattsta opened this issue Aug 10, 2011 · 5 comments
Closed

mochiweb_html unicode breakage #54

mattsta opened this issue Aug 10, 2011 · 5 comments

Comments

@mattsta
Copy link

mattsta commented Aug 10, 2011

Example:

 20> mochiweb_html:parse("<done>Matt’s Breaking Line</done>").    
 ** exception error: bad argument
      in function  iolist_to_binary/1
         called as iolist_to_binary([60,100,111,110,101,62,77,97,116,116,8217,
                                          115,32,66,114,101,97,107,105,110,103,32,
                                          76,105,110,101,60,47|...])
      in call from mochiweb_html:tokens/1
      in call from mochiweb_html:parse/1

 17> mochiweb_html:parse(<<"<done>Matts Breaking Line</done>">>). 
 {<<"done">>,[],[<<"Matts Breaking Line">>]}

Going through mochiweb_html and replacing all iolist_to_binary/1 with unicode:characters_to_binary/1 almost works (along with replacing all binary_to_list/1 with unicode:characters_to_list/1). It causes the functions to return, but the output is garbled:

 14> io:format("~s~n", [unicode:characters_to_binary(mochiweb_html:to_html(mochiweb_html:parse("<done>Matt’s Breaking Line</done>")))]).
 <done>Matt�s Breaking Line</done>

Any ideas on how to help multi-byte characters survive a mochiweb_html:parse/1 to mochiweb_html:to_html/1 cycle?

@etrepum
Copy link
Member

etrepum commented Aug 10, 2011

The intention is that the input is already UTF-8, whether it is binary or iolist. This is consistent with what you would get by doing something like reading the request body from a mochiweb request

1> io:fwrite([mochiweb_html:to_html(mochiweb_html:parse(<<"<done>Matt\xe2\x80\x99s Breaking Line</done>">>)), "\n"]).
<done>Matt’s Breaking Line</done>
ok

Note the usage of io:fwrite/1 to send the output "directly" to stdout instead of io:format/2 which does the wrong thing. io:format/2 does some really broken shit by default if you are using non-ascii characters.

2> io:format("~s~n", [[8217]]).                                              
** exception exit: {badarg,[{io,format,[<0.25.0>,"~s~n",[[8217]]]},
                            {erl_eval,do_apply,5},
                            {shell,exprs,7},
                            {shell,eval_exprs,7},
                            {shell,eval_loop,3}]}
     in function  io:o_request/3

@etrepum etrepum closed this as completed Aug 10, 2011
@dreid
Copy link
Member

dreid commented Aug 10, 2011

@mattsta it's also worth noting here that you can use mochiutf8:codepoints_to_bytes/1 to convert the list you have into a utf8 binary.

1> S = "<done>Matt’s Breaking Line</done>".
[60,100,111,110,101,62,77,97,116,116,8217,115,32,66,114,101,
 97,107,105,110,103,32,76,105,110,101,60,47,100|...]
2> mochiutf8:codepoints_to_bytes(S).       
<<60,100,111,110,101,62,77,97,116,116,226,128,153,115,32,
  66,114,101,97,107,105,110,103,32,76,105,110,101,60,...>>
3> mochiweb_html:parse(mochiutf8:codepoints_to_bytes(S)).
{<<"done">>,[],
 [<<77,97,116,116,226,128,153,115,32,66,114,101,97,107,
    105,110,103,32,76,105,110,101>>]}

@etrepum
Copy link
Member

etrepum commented Aug 10, 2011

I believe that unicode:characters_to_binary(S, utf8) is faster than mochiutf8:codepoints_to_bytes/1 (some functions in the unicode module are BIFs). Different error behavior though.

@mattsta
Copy link
Author

mattsta commented Aug 10, 2011

Thanks for the clarification. I was passing in a fetched HTML body from httpc:request without first running through unicode/utf8 conversion.

Everything works fine as long as I pre-convert:

 27> io:fwrite(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

 28> io:format(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

What can we do to help people not bug you about this in the future? The type system isn't elaborate enough to reject non-utf8 outright.

There's always the cheap:

 parse(Input) ->
     try
        parse_tokens(tokens(Input))
     catch
       error:badarg -> not_utf8
     end.

Or maybe even try to fix it ourselves, though it could hide performance issues by double-evaluating on every run:

 parse(Input) ->
     try
        parse_tokens(tokens(Input))
     catch
       error:badarg -> parse_tokens(tokens(unicode:characters_to_binary(Input)))
     end.

resolved: dumbmatt

@etrepum
Copy link
Member

etrepum commented Aug 10, 2011

I'd accept a documentation patch that says the input must be UTF-8. I guess string() is a bit misleading, but I wrote all this stuff before Erlang had any attempt to support unicode (except for some functions hiding in xmerl).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants