Added tests for UTF-8 validity of option values. #150

Merged
merged 7 commits into juju:master on Jun 25, 2014

Conversation

@themue (Contributor) commented Jun 23, 2014

So far the tests of the API and the set command checked only ASCII values. Usage of UTF-8 should lead to no problems. The use of UTF-8 runes in the test value verifies this.

@@ -83,19 +83,21 @@ func (s *clientSuite) TestCompatibleSettingsParsing(c *gc.C) {
c.Assert(err, gc.ErrorMatches, `unknown option "yummy"`)
}

var setTestValue = "a value with spaces\nand newline\nand UTF-8 characters: äöüéàô 😄👍"
Inline review comment:
s/var/const/

@mjs commented Jun 24, 2014

Question one: what ultimately ends up decoding the UTF-8 strings? Does Juju just pass these straight through and it's up to a charm to deal with it?

Question two: should Juju be rejecting invalid UTF-8 at the client or API level to prevent bad strings from propagating?

@@ -37,6 +37,8 @@ func (s *SetSuite) SetUpTest(c *gc.C) {
setupConfigFile(c, s.dir)
}

var setTestValue = "a value with spaces\nand newline\nand UTF-8 characters: äöüéàô 😄👍"

Inline review comment (Member):

I personally prefer ASCII-only source text (I realize Go does specify that source is UTF-8), and using Unicode escapes for non-ASCII characters.
I believe the last time this came up we went with "let's use ASCII for our source", when someone wanted a variable named Façade. I'd probably be stricter about variables/functions than about the contents of strings, though.
So while I'd still recommend Unicode escapes, I wouldn't require them.

Inline review comment (Contributor, Author):

Adding stuff like the umlauts was fine for me, because I type them almost every day, but I felt somewhat odd about the emojis too. So yes, I'll change them into Unicode escapes.

@jameinel (Member) commented:

Q1: when reading the files or command line we just use a Go "string" value, which doesn't really have an encoding. However, we pass them to json.Marshal when we want to send them to the API, and that assumes UTF-8 encoded strings.
Q2: I do think we want a test that uses non-UTF-8 (\xff\xff or some such) and asserts that it fails.

There is http://golang.org/pkg/unicode/utf8/ with utf8.Valid([]byte) and utf8.ValidString(string) that we can use to assert that we are getting proper UTF-8 strings when we want to, rather than assuming it as a side effect.
I don't really want this change to bloat up into "let's go around and validate all of our input"; this was a bit more about "we should be clear that we currently support UTF-8 and have tests for it so that we don't regress that support".
I believe back when clients accessed the DB directly, we actually supported plain 8-bit strings with no encoding. But since we now go via a JSON layer, I think we would have to play base64 encoding tricks if we wanted to support that.
Since it was never clearly stated or tested, we can't say for sure what people were doing in the past. So now we can at least improve things incrementally and say "these things are UTF-8 encoded strings; if you need more than that, come talk to us and we can work out how to support binary data properly".

So yeah, I'd like to see a test that non-UTF-8 gets rejected, but otherwise this is better than what we have.
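
As an illustration of the kind of test being discussed (not code from this PR), a minimal standalone sketch using the unicode/utf8 package could look like this; the test name and the sample value are assumptions:

package config_test

import (
	"testing"
	"unicode/utf8"
)

// TestRejectInvalidUTF8 is a hypothetical sketch of the suggested test:
// feed a deliberately broken byte sequence and assert that it is not
// reported as valid UTF-8.
func TestRejectInvalidUTF8(t *testing.T) {
	invalid := []byte("broken value: \xff\xff")
	if utf8.Valid(invalid) {
		t.Fatal("expected \\xff\\xff to be reported as invalid UTF-8")
	}
	if utf8.ValidString(string(invalid)) {
		t.Fatal("expected the string form to be reported as invalid as well")
	}
}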

@themue (Contributor, Author) commented Jun 24, 2014

Just tested a bit more with the JSON encoding. Sadly, InvalidUTF8Error isn't returned anymore for data containing invalid UTF-8 sequences. Instead, all invalid sequences are marshalled to the escape "\ufffd". When unmarshalling a string containing that Unicode escape, it is converted into the valid code point U+FFFD (as bytes: [239 191 189]).

So to ensure valid transfer of data via the API, a change would be needed inside the jsonrpc package. Instead of marshalling directly to writers, as with the websocket.JSON codec, the data could be marshalled first, checked for the string "\ufffd" (without a second escaping backslash), and an error returned in that case.

The alternative would be to check possible string arguments, including those in maps or slices, in the client functions of the API with utf8.ValidString(). But the risk of missing this later when adding new functions is high.
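
For reference, a small standalone program (not part of the PR) demonstrating the behaviour described above, assuming only the standard encoding/json package:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Invalid UTF-8 bytes are silently replaced rather than rejected:
	// Marshal emits the escape \ufffd instead of returning InvalidUTF8Error.
	out, err := json.Marshal("bad bytes: \xff\xff")
	fmt.Println(string(out), err) // "bad bytes: \ufffd\ufffd" <nil>

	// Unmarshalling turns the escape into the real code point U+FFFD,
	// whose UTF-8 encoding is the byte sequence 239 191 189.
	var back string
	_ = json.Unmarshal(out, &back)
	fmt.Println([]byte(back))
}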

@mjs commented Jun 24, 2014

Could we at least validate at the edges? "juju set" should error immediately if given invalid UTF-8. Then it doesn't matter what the API does with invalid UTF-8 as invalid UTF-8 won't be introduced in the first place (not by juju set at least).

Aside: U+FFFD is the Unicode "replacement character", which represents unknown or invalid characters. This seems like a reasonable way for the API to handle such data.

@themue (Contributor, Author) commented Jun 25, 2014

Yes, we discovered a possible bug that has never occurred because nobody sends invalid UTF-8 strings. Only the idea of reading option values out of files made us think about it. So it seems good enough to do the check at a higher level, like the command. Alternatively it could be done in charm.Config.ParseSettingsStrings(), so that non-command settings are validated as well.

@jameinel (Member) commented:

I think we could potentially put the check in the schema package, and have schema say "string objects must be valid UTF-8 strings".
It might also show up where we thought we were handling strings as byte arrays but weren't actually treating the data correctly.

I would be happy enough just having a utf8.ValidString() check in 'cmd/juju/set*' and not bloating this fix further, though.
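
A rough sketch of what such a command-level check could look like; the helper name, the settings map, and the error text are illustrative assumptions, not the actual cmd/juju/set code:

package main

import (
	"fmt"
	"unicode/utf8"
)

// validateSettings is a hypothetical helper: reject invalid UTF-8 in option
// values at the command level, before they are ever sent to the API.
func validateSettings(settings map[string]string) error {
	for key, value := range settings {
		if !utf8.ValidString(value) {
			return fmt.Errorf("value of option %q contains invalid UTF-8", key)
		}
	}
	return nil
}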

@mjs commented Jun 25, 2014

+1 to the simple check in cmd/juju/set... That's all I was suggesting.

Frank Mueller added 2 commits June 25, 2014 11:56
…havior. After a discussion this morning we accept this. It has never been an issue in real life and only occurred during a test trying to transport binary data as a string.
@themue (Contributor, Author) commented Jun 25, 2014

$$merge$$

@jujubot (Collaborator) commented Jun 25, 2014

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

jujubot added a commit that referenced this pull request Jun 25, 2014
Added tests for UTF-8 validity of option values.

So far the tests of the API and the set command checked only ASCII values. Usage of UTF-8 should lead to no problems. The use of UTF-8 runes in the test value verifies this.
@jujubot jujubot merged commit 35d38ca into juju:master Jun 25, 2014
@themue themue deleted the set-test-utf-encoding branch June 25, 2014 14:18