Forbid C1 control chars in any_char #182

Closed
stasm opened this issue Oct 16, 2018 · 7 comments

@stasm
Contributor

stasm commented Oct 16, 2018

Follow-up to #174. Right now, any_char is defined as:

/* Any Unicode character excluding C0 control characters (but including tab),
 * surrogate blocks and non-characters (U+FFFE, U+FFFF).
 * Cf. https://www.w3.org/TR/REC-xml/#NT-Char */
any_char            ::= [\\u{9}\\u{20}-\\u{D7FF}\\u{E000}-\\u{FFFD}]
                      | [\\u{10000}-\\u{10FFFF}]

The U+20 to U+D7FF range also includes control characters which make little sense in translations:

The Unicode control characters cover U+0000—U+001F (C0 controls), U+007F (delete), and U+0080—U+009F (C1 controls). Unicode only specifies semantics for U+001C—U+001F, U+0009—U+000D, and U+0085. The rest of the control characters are transparent to Unicode and their meanings are left to higher-level protocols. source

I'd like to suggest that we forbid characters from U+7F to U+9F, inclusive. This would be in line with the recommendation included in the definition of NT-Char in the XML spec, except that it would also forbid the U+85 (NEL, Next Line) character. NEL is equivalent to CRLF and appears to be mostly used on IBM mainframes running z/OS. I think it's OK to forbid it in Fluent (we can always revisit later if need be).
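
To make the problem concrete, here is a minimal Python sketch that hand-translates the any_char production above into a regular expression and lists which code points in the U+7F to U+9F range it accepts. The names `ANY_CHAR`, `is_any_char` and `accepted_controls` are hypothetical, and the translation into Python's `re` syntax is an assumption, not something generated from the EBNF.

```python
import re

# Hand-written transcription of the any_char production quoted above.
# Assumption: the EBNF character class maps one-to-one onto this pattern.
ANY_CHAR = re.compile(
    r"[\u0009\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_any_char(ch: str) -> bool:
    """Return True if the single character `ch` matches any_char."""
    return ANY_CHAR.fullmatch(ch) is not None


# Code points from U+7F (DEL) through U+9F (end of the C1 controls)
# that the current definition lets through.
accepted_controls = [cp for cp in range(0x7F, 0xA0) if is_any_char(chr(cp))]

print(f"{len(accepted_controls)} control code points accepted")
print("U+0085 (NEL) accepted:", is_any_char("\u0085"))
print("U+0000 (NUL) accepted:", is_any_char("\u0000"))
```

Under these assumptions, every code point in the proposed forbidden range currently matches any_char, while the C0 controls other than tab do not.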

@stasm stasm changed the title Forbid C0 and C1 control chars in regular_char Forbid C1 control chars in regular_char Oct 16, 2018
@stasm stasm added the syntax label Oct 16, 2018
@stasm stasm added this to To do in Syntax 0.8 via automation Oct 19, 2018
@stasm stasm changed the title Forbid C1 control chars in regular_char Forbid C1 control chars in any_char Oct 30, 2018
@stasm
Contributor Author

stasm commented Oct 30, 2018

The current definition of any_char is inspired directly by the NT-Char production from the XML spec:

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

OTOH, some languages choose to be more liberal. From the ECMAScript spec:

ECMAScript code is expressed using Unicode. ECMAScript source text is a sequence of code points. All Unicode code point values from U+0000 to U+10FFFF, including surrogate code points, may occur in source text where permitted by the ECMAScript grammars.

@jfkthame, may I summon you for this one? Can you advise if forbidding more ASCII control characters would be a good move for Fluent?

@jfkthame

Yes, I think excluding control characters (except for <tab>, <cr> and <lf>) makes sense in a context like this. We probably also want to exclude both surrogate and noncharacter codepoints. None of these have any reason to occur in translated texts. That leads to something like

any_char ::= [\\u{9}\\u{A}\\u{D}] /* ASCII control chars <tab>, <cr>, <lf> */
         | [\\u{20}-\\u{7E}] /* exclude <del> */
         /* skip C1 controls */
         | [\\u{A0}-\\u{D7FF}]
         /* skip surrogate codepoints */
         | [\\u{E000}-\\u{FDCF}]
         /* skip noncharacter block U+FDD0..FDEF */
         | [\\u{FDF0}-\\u{FFFD}] /* last two codepoints of each block are also noncharacters */
         | [\\u{10000}-\\u{1FFFD}]
         | [\\u{20000}-\\u{2FFFD}]
...
         | [\\u{100000}-\\u{10FFFD}]

(Please double-check that I got the ranges right!)
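
One way to double-check the ranges is to compare them against a rule-based definition of the same set. The Python sketch below does that under two assumptions: the lines elided by `...` follow the same per-plane pattern as the ones shown (planes 3 through 15), and the last range is meant to open with a `[`. The function names `proposed_ranges` and `is_expected` are hypothetical.

```python
def proposed_ranges():
    """Ranges from the grammar sketch above, as (low, high) inclusive pairs."""
    ranges = [
        (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # <tab>, <lf>, <cr>
        (0x20, 0x7E),                        # printable ASCII, no <del>
        (0xA0, 0xD7FF),                      # after C1, before surrogates
        (0xE000, 0xFDCF),                    # up to the noncharacter block
        (0xFDF0, 0xFFFD),                    # rest of the BMP
    ]
    # Assumption: the elided lines repeat this pattern for every
    # supplementary plane, dropping the last two code points of each.
    for plane in range(1, 17):
        base = plane << 16
        ranges.append((base, base | 0xFFFD))
    return ranges


def is_expected(cp: int) -> bool:
    """Rule-based version: no controls (except tab, LF, CR),
    no surrogates, no noncharacters."""
    if cp in (0x9, 0xA, 0xD):
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return False                      # C0, DEL, C1
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogates
    if 0xFDD0 <= cp <= 0xFDEF:
        return False                      # noncharacter block
    if (cp & 0xFFFE) == 0xFFFE:
        return False                      # U+xxFFFE and U+xxFFFF
    return True


# Mark every code point covered by the proposed ranges.
accepted = bytearray(0x110000)
for lo, hi in proposed_ranges():
    accepted[lo:hi + 1] = b"\x01" * (hi - lo + 1)

mismatches = [
    cp for cp in range(0x110000) if bool(accepted[cp]) != is_expected(cp)
]
print("mismatching code points:", [f"U+{cp:04X}" for cp in mismatches])
```

An empty mismatch list means the ranges and the rules describe the same set of code points, given the assumptions stated above.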

@stasm stasm moved this from To do to In progress in Syntax 0.8 Oct 30, 2018
@stasm
Contributor Author

stasm commented Oct 30, 2018

Thanks, @jfkthame, this is very helpful. I opened #199 based on your suggestion.

@aphillips

I think this might be a mistake. So far this thread has focused on reasonable use and what "ought" to appear in translations (or source material). However, I note that most programming languages, including JavaScript and Java, are much more permissive about what they allow in strings. Since resource formats can be used for perverse reasons (or for non-perverse but unusual reasons--such as writing I18N demos!) I'm concerned that foreclosing the ability to serialize "any string" at the format level will come back to bite users in annoying or unexpected ways--not to mention adding overhead to validation and serialization steps.

This might be summarized as "if you don't want non-characters or C1 controls in your translations, don't use them" or perhaps "it's no business of yours what I want to store in my resources" :-)

The counter-argument, of course, is that this is to help protect users from doing bad things (such as double-conversion to UTF-8 or storing things they really shouldn't).

@stasm
Contributor Author

stasm commented Nov 6, 2018

Thanks for your comment, @aphillips. I can see the general appeal of being more lenient with regard to the input, even if people really shouldn't do some of the things that are currently possible in Fluent's syntax, or that would become possible if we allowed all Unicode characters. One of the principles that we listed for Fluent is "Be liberal in what you require but conservative in what you do", after all.

I wonder if checking for weird characters would be better handled as an extra validation step which can issue warnings but not reject translations completely. We don't currently have such tools, but we'll get there eventually.
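
As a rough illustration of what such a non-blocking check could look like, here is a minimal Python sketch. It is not an existing Fluent tool: `lint_text`, `is_noncharacter` and the warning tuple shape are all hypothetical, and the set of characters it flags (controls other than tab, LF and CR, plus surrogates and noncharacters) is an assumption taken from the discussion above.

```python
import unicodedata

# Controls that are considered fine in translations (assumption).
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def lint_text(text: str):
    """Return a list of (index, code point label, reason) warnings.

    The text is never rejected; callers decide what to do with warnings.
    """
    warnings = []
    for index, ch in enumerate(text):
        if ch in ALLOWED_CONTROLS:
            continue
        cp = ord(ch)
        category = unicodedata.category(ch)
        if category == "Cc":
            warnings.append((index, f"U+{cp:04X}", "control character"))
        elif category == "Cs":
            warnings.append((index, f"U+{cp:04X}", "surrogate code point"))
        elif is_noncharacter(cp):
            warnings.append((index, f"U+{cp:04X}", "noncharacter"))
    return warnings


for warning in lint_text("Hello\u0085 world\uFFFE"):
    print(warning)
```

The point of the design is that the parser stays permissive, while authors still get feedback about characters they probably did not intend to use.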

That said, I'm not sure what kinds of use cases would call for supporting e.g. non-characters. When it comes to parsing syntax which I consider "ugly", the major benefit of being lenient is that people can be sloppy when they author translations and prettify them later. But would anyone try to sloppily use a non-character? :) What would be good examples of such perverse and non-perverse reasons?

Another reason to forbid these weird characters in Fluent 1.0 is that it's always easier to relax the grammar later on. We could do it in a 1.x milestone.

I'm really torn on this one. I created two PRs: #199 and #207, to try out both approaches. I like the robustness of the former and the leniency of the latter. @jfkthame, @aphillips, @Pike: Do you have more thoughts on this?

@Pike
Contributor

Pike commented Nov 6, 2018

I'm for 207, for no better reason than that it's easier. Like, I just don't want to try to figure out why 199 would be correct.

@stasm stasm moved this from Under Consideration to Accepted in Syntax 0.8 Nov 8, 2018
@stasm
Contributor Author

stasm commented Nov 9, 2018

Let's go ahead with the more permissive approach, but let's also define a recommendation in the spec to discourage translation authors from using control characters and non-characters. #207 is the PR implementing this approach. Thanks for the comments, everyone!

@stasm stasm moved this from Accepted to In review in Syntax 0.8 Nov 9, 2018
@stasm stasm closed this as completed in #207 Nov 9, 2018
Syntax 0.8 automation moved this from In review to Done Nov 9, 2018