Added tests for UTF-8 validity of option values. #150
@@ -83,19 +83,21 @@ func (s *clientSuite) TestCompatibleSettingsParsing(c *gc.C) {
c.Assert(err, gc.ErrorMatches, `unknown option "yummy"`)
}

var setTestValue = "a value with spaces\nand newline\nand UTF-8 characters: äöüéàô 😄👍"
s/var/const/
Question one: what ultimately ends up decoding the UTF-8 strings? Does Juju just pass these straight through and it's up to a charm to deal with it? Question two: should Juju be rejecting invalid UTF-8 at the client or API level to prevent bad strings from propagating?
@@ -37,6 +37,8 @@ func (s *SetSuite) SetUpTest(c *gc.C) {
setupConfigFile(c, s.dir)
}

var setTestValue = "a value with spaces\nand newline\nand UTF-8 characters: äöüéàô 😄👍"

I personally prefer ASCII-only source text (I realize Go does specify that source is UTF-8), with Unicode escapes for non-ASCII characters.
I believe the last time this came up we went with "let's use ASCII for our source" when someone wanted a variable named Façade. I'd probably be stricter about variables/functions than about the contents of strings, though.
So while I'd still recommend Unicode escapes, I wouldn't require them.
Adding stuff like the umlauts felt fine to me, because I type them almost every day. But the emojis did feel somewhat strange. So yes, I'll change it to Unicode escapes.
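For reference, here is a minimal sketch of what the escaped form could look like. The constant names `literalValue` and `escapedValue` are made up for this illustration and are not taken from the patch; the escapes use the code points U+00E4, U+00F6, U+00FC, U+00E9, U+00E0, U+00F4, U+1F604 and U+1F44D.

```go
package main

import "fmt"

// literalValue is the test value as it appears in the diff, with the
// non-ASCII characters typed directly into the source.
const literalValue = "a value with spaces\nand newline\nand UTF-8 characters: äöüéàô 😄👍"

// escapedValue (hypothetical name) spells the same characters with \u and \U
// escapes, so the source file itself stays ASCII-only.
const escapedValue = "a value with spaces\nand newline\nand UTF-8 characters: " +
	"\u00e4\u00f6\u00fc\u00e9\u00e0\u00f4 \U0001F604\U0001F44D"

// sameValue reports whether both spellings compile to the same string.
func sameValue() bool {
	return literalValue == escapedValue
}

func main() {
	fmt.Println(sameValue())
}
```

Both spellings produce the same bytes at compile time, so the choice only affects how the source reads, not what the test sends.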
Q1: when reading the files or the command line we just use a Go "string" object, which doesn't really have an encoding. However, we pass them to json.Marshal(string) when we want to send them to the API, and that assumes UTF-8 encoded strings. The unicode/utf8 package (http://golang.org/pkg/unicode/utf8/) provides utf8.Valid([]byte) and utf8.ValidString(string), which we can use to assert that we are getting proper UTF-8 strings when we want to, rather than assuming it as a side effect. So yeah, I'd like to see a test that non-UTF-8 gets rejected, but otherwise this is better than what we have.
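As a rough illustration of the two standard-library calls mentioned above: `utf8.ValidString` takes a string and `utf8.Valid` takes a byte slice. The helper names `isValidOptionValue` and `isValidOptionFile` are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidOptionValue (hypothetical helper) checks a value that is already
// held as a Go string, e.g. one taken from the command line.
func isValidOptionValue(value string) bool {
	return utf8.ValidString(value)
}

// isValidOptionFile (hypothetical helper) checks raw bytes, e.g. the
// contents of a file, before they are converted to a string.
func isValidOptionFile(data []byte) bool {
	return utf8.Valid(data)
}

func main() {
	fmt.Println(isValidOptionValue("\u00e4\u00f6\u00fc \U0001F604")) // well-formed UTF-8
	fmt.Println(isValidOptionValue("bad\xffbyte"))                   // 0xff never occurs in UTF-8
	fmt.Println(isValidOptionFile([]byte{0xc3, 0x28}))               // lead byte without a continuation byte
}
```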
Just tested a bit more with the JSON encoding. Sadly, InvalidUTF8Error is no longer returned when the data contains invalid UTF-8 sequences. Instead, each invalid sequence is marshalled as the escape "\ufffd". When unmarshalling, the string containing that escape is converted into the valid Unicode character U+FFFD (as bytes: [239 191 189]). So to ensure valid transfer of data via the API, a change would have to be made inside the jsonrpc package. Instead of marshalling directly to writers, as the websocket.JSON codec does, the data could be marshalled first and checked for the string "\ufffd" without a preceding escaping backslash, returning an error in that case. The alternative would be to check all possible string arguments, including those in maps or slices, in the client functions of the API with utf8.ValidString(). But the risk of missing this later when adding new functions is high.
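The round trip described above can be reproduced with a small sketch like the following. The `roundTrip` helper is hypothetical and uses only `encoding/json`; the behaviour shown is what the standard library does in the Go versions I am aware of, and may differ in older releases that still returned InvalidUTF8Error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip (hypothetical helper) marshals s to JSON and unmarshals the
// result again, returning both the encoded text and the decoded string.
func roundTrip(s string) (encoded string, decoded string, err error) {
	data, err := json.Marshal(s)
	if err != nil {
		return "", "", err
	}
	if err := json.Unmarshal(data, &decoded); err != nil {
		return "", "", err
	}
	return string(data), decoded, nil
}

func main() {
	// "\xff" is not a valid UTF-8 sequence.
	encoded, decoded, err := roundTrip("bad\xffbyte")
	fmt.Println(encoded)         // the JSON text, with the invalid byte escaped
	fmt.Println([]byte(decoded)) // the decoded bytes, containing 239 191 189
	fmt.Println(err)
}
```

No error is reported at either step, which is why the invalid input goes unnoticed unless it is checked explicitly.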
Could we at least validate at the edges? "juju set" should error immediately if given invalid UTF-8. Then it doesn't matter what the API does with invalid UTF-8, as invalid UTF-8 won't be introduced in the first place (not by juju set at least). Aside: U+FFFD is the Unicode "replacement character", which represents unknown/invalid characters. This seems like a reasonable way for the API to handle such data.
Yes, we discovered a possible bug that has never occurred because nobody sends invalid UTF-8 strings. Only the idea of reading option values out of files made us think about it. So it seems good enough to do the check at a higher level, like the command. Alternatively it could be done in charm.Config.ParseSettingsStrings(), so that non-command settings are validated as well.
I think we could potentially put the check in the schema package, and have schema say "string objects must be valid UTF-8 strings". I would be happy enough with just a utf8.ValidString() check in 'cmd/juju/set*', without bloating this fix any further, though.
+1 to the simple check in cmd/juju/set... That's all I was suggesting.
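A minimal sketch of such an edge check, assuming the command receives its settings as "key=value" arguments. The function `parseSettings` and its error messages are invented for illustration and are not the actual code in cmd/juju.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// parseSettings (hypothetical stand-in for the command's argument handling)
// splits "key=value" arguments and rejects values that are not valid UTF-8
// before anything is sent to the API.
func parseSettings(args []string) (map[string]string, error) {
	settings := make(map[string]string, len(args))
	for _, arg := range args {
		parts := strings.SplitN(arg, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return nil, fmt.Errorf("expected \"key=value\", got %q", arg)
		}
		if !utf8.ValidString(parts[1]) {
			return nil, fmt.Errorf("value for option %q contains non-UTF-8 sequences", parts[0])
		}
		settings[parts[0]] = parts[1]
	}
	return settings, nil
}

func main() {
	_, err := parseSettings([]string{"title=ok", "skill-level=\xff\xfe"})
	fmt.Println(err)
}
```

Failing in the command keeps the fix small, at the cost of leaving other callers of the API unchecked, which is the trade-off discussed in this thread.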
…havior. After a discussion this morning we accept this. It never has been an issue in real-life and only occurred during a test trying to transport binary data as string.
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju
Added tests for UTF-8 validity of option values. So far, tests of the API and the set command checked only ASCII values. Using UTF-8 should cause no problems; the non-ASCII runes in the test value verify this.