New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify that control characters (VT100 escapes, UTF8 and the like) are stripped/handled #8

Open
cfcs opened this Issue Dec 4, 2014 · 12 comments

Comments

Projects
None yet
3 participants
@cfcs

cfcs commented Dec 4, 2014

Verify that control characters (VT100 escapes, UTF8 and the like) are stripped/handled to prevent contacts from messing up your UI.

@hannesm

This comment has been minimized.

Owner

hannesm commented Dec 20, 2014

also invalid utf8 encoded messages should be rejected (see src/xmpp_callbacks:36 and https://github.com/diml/zed/blob/master/src/zed_utf8.mli#L37 for a possible solution)

@hannesm

This comment has been minimized.

Owner

hannesm commented Dec 21, 2014

done in 5cf47e6 6aa60d8 and presence filtering in 63db57c

@hannesm hannesm closed this Dec 21, 2014

@dbuenzli

This comment has been minimized.

dbuenzli commented Dec 21, 2014

Seems quite messy, it seems to me that it should be the task of the underlying xml library to check for encoding validity of all the data you get so that once you have passed that boundary you are sure that all the data is trusted to be UTF-8 valid.

@hannesm

This comment has been minimized.

Owner

hannesm commented Dec 21, 2014

@dbuenzli

  • yes, I'd love to find the time to use xmlm and your unicode libraries instead (in zed/lambda-term/XMPP)
  • the problem arises now: we received an xml message from the server, which contains base64 encoded encrypted text --- thus we need after decryption (via Otr.Handshake.handle) to validate the resulting data (this is done in the first two commits mentioned above)
  • in general, OTR data might include binary data, thus doing utf8 validation within OTR is not a viable solution
@dbuenzli

This comment has been minimized.

dbuenzli commented Dec 21, 2014

Ah ok, makes sense didn't think about the b64 bits.

@hannesm

This comment has been minimized.

Owner

hannesm commented Dec 21, 2014

and I'm actually not entirely sure whether 63db57c is needed or guaranteed...

@dbuenzli

This comment has been minimized.

dbuenzli commented Dec 21, 2014

Well if that concerns the output of the xml decoder you use; I'd keep it since it seems to me that this will input invalid UTF-8: a quick look seems to indicate that it doesn't check subsequent bytes for correct range; see e.g. the case for '\240'..'\247', which treats all 4 bytes encoded coded points and compare to table 3-7 in this document. This is what is used by Xmlm which should be, once I get the time, be rewritten on top of the similar code in Uutf. The latter decoder has a proof of correctness by exhaustiveness and computer heat (that's the reason why the test is commented) on a representation of the specification.

@hannesm

This comment has been minimized.

Owner

hannesm commented Dec 21, 2014

thx @dbuenzli for investigation... and yes, I intend to switch over to your xmlm (whenever I find some more spare time for this)

@hannesm

This comment has been minimized.

Owner

hannesm commented Dec 21, 2014

reopening: escape characters are not handled, and to remind myself that most of the xml from the server is used unchecked (everything apart from the parts a user can set - presence and messages)

@hannesm

This comment has been minimized.

Owner

hannesm commented Jan 16, 2015

so the decrypted messages are being checked here - the xml parser pushes 0..127 through

@hannesm

This comment has been minimized.

Owner

hannesm commented Jan 17, 2015

to wrap up from #42 - we should use 0xfffd (unicode replacement character) for control sequences instead of ? (to distinguish from real ? received over the wire). or maybe just error out (as done with invalid utf8 sequences). this replacement should be done at a single place and not spread over multiple places.

plan is to replace erm_xmpp at some point - and use decent xml parser and unicode libraries... I'm currently busy with other things, but have this on my plan (most likely before any public release)

@cfcs

This comment has been minimized.

cfcs commented Jan 17, 2015

I changed validate_utf8 to work on UChars instead, and made use of the 0xFFfd character for unknown characters. As mentioned in the pull request, I'd love to see this check for non-printable characters in general at some point -- perhaps after we change xml lib.
Pull request: #43

hannesm added a commit that referenced this issue Feb 7, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment