Forbid C1 control chars in any_char #182

stasm · 2018-10-16T07:57:04Z

Follow-up to #174. Right now, any_char is defined as:

/* Any Unicode character excluding C0 control characters (but including tab),
 * surrogate blocks and non-characters (U+FFFE, U+FFFF).
 * Cf. https://www.w3.org/TR/REC-xml/#NT-Char */
any_char            ::= [\\u{9}\\u{20}-\\u{D7FF}\\u{E000}-\\u{FFFD}]
                      | [\\u{10000}-\\u{10FFFF}]

The U+20 to U+D7FF range includes more control characters that make little sense in translations:

The Unicode control characters cover U+0000—U+001F (C0 controls), U+007F (delete), and U+0080—U+009F (C1 controls). Unicode only specifies semantics for U+001C—U+001F, U+0009—U+000D, and U+0085. The rest of the control characters are transparent to Unicode and their meanings are left to higher-level protocols. ^source

I'd like to suggest that we forbid characters from U+7F to U+9F, inclusive. This would be in line with the recommendation included in the definition of NT-Char in the XML spec, with the exception of the U+85 (NEL, Next Line) character. NEL is equivalent to CRLF and appears to be mostly used on IBM mainframes running z/OS. I think it's OK to forbid it in Fluent (we can always revisit later if need be).
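To make the effect of the proposal concrete, here is a minimal Python sketch that compares the current any_char definition with the proposed one. The function names are hypothetical and exist only for illustration; they are not part of any Fluent implementation.

```python
def in_current_any_char(cp: int) -> bool:
    """Current grammar: tab, U+20-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF."""
    return (
        cp == 0x9
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def in_proposed_any_char(cp: int) -> bool:
    """Proposal: the current set minus U+7F (delete) and the C1 controls U+80-U+9F."""
    return in_current_any_char(cp) and not (0x7F <= cp <= 0x9F)


# Code points that are accepted today but would be rejected under the proposal.
newly_forbidden = [
    cp
    for cp in range(0x110000)
    if in_current_any_char(cp) and not in_proposed_any_char(cp)
]

if __name__ == "__main__":
    print(f"U+{newly_forbidden[0]:04X}..U+{newly_forbidden[-1]:04X}")
    print(len(newly_forbidden), "code points would be newly forbidden")
```

Under these assumptions the only change is the contiguous block U+007F..U+009F, which includes NEL (U+0085).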


stasm · 2018-10-30T15:01:10Z

The current definition of any_char is inspired directly by the NT-Char production from the XML spec:

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

OTOH, some languages choose to be more liberal. From the ECMAScript spec:

ECMAScript code is expressed using Unicode. ECMAScript source text is a sequence of code points. All Unicode code point values from U+0000 to U+10FFFF, including surrogate code points, may occur in source text where permitted by the ECMAScript grammars.

@jfkthame, may I summon you for this one? Can you advise if forbidding more ASCII control characters would be a good move for Fluent?

jfkthame · 2018-10-30T15:34:18Z

Yes, I think excluding control characters (except for <tab>, <cr> and <lf>) makes sense in a context like this. We probably also want to exclude both surrogate and noncharacter codepoints. None of these have any reason to occur in translated texts. That leads to something like

any_char ::= [\\u{9}\\u{A}\\u{D}] /* ASCII control chars <tab>, <cr>, <lf> */
         | [\\u{20}-\\u{7E}] /* exclude <del> */
         /* skip C1 controls */
         | [\\u{A0}-\\u{D7FF}]
         /* skip surrogate codepoints */
         | [\\u{E000}-\\u{FDCF}]
         /* skip noncharacter block U+FDD0..FDEF */
         | [\\u{FDF0}-\\u{FFFD}] /* last two codepoints of each block are also noncharacters */
         | [\\u{10000}-\\u{1FFFD}]
         | [\\u{20000}-\\u{2FFFD}]
...
         | [\\u{100000}-\\u{10FFFD}]

(Please double-check that I got the ranges right!)
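One way to double-check ranges like these is to derive them from the exclusion rules rather than typing them by hand. The Python sketch below is an illustration only: the helper names are hypothetical, and the exclusion rules are my reading of the comment above (C0 controls other than tab, LF and CR; delete; C1 controls; surrogates; U+FDD0..U+FDEF; the last two code points of every plane).

```python
def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_excluded(cp: int) -> bool:
    is_control = (cp < 0x20 and cp not in (0x9, 0xA, 0xD)) or 0x7F <= cp <= 0x9F
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return is_control or is_surrogate or is_noncharacter(cp)


def allowed_ranges() -> list[tuple[int, int]]:
    """Collapse all non-excluded code points into inclusive (first, last) ranges."""
    ranges = []
    start = None
    # Iterate one past U+10FFFF so that an open range is always closed.
    for cp in range(0x110000 + 1):
        ok = cp < 0x110000 and not is_excluded(cp)
        if ok and start is None:
            start = cp
        elif not ok and start is not None:
            ranges.append((start, cp - 1))
            start = None
    return ranges


if __name__ == "__main__":
    for first, last in allowed_ranges():
        print(f"U+{first:04X}..U+{last:04X}")
```

Because tab and LF are adjacent, the script reports them as a single range U+0009..U+000A; apart from that, each derived range can be compared line by line against the hand-written production.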

stasm · 2018-10-30T20:49:35Z

Thanks, @jfkthame, this is very helpful. I opened #199 based on your suggestion.

aphillips · 2018-11-01T15:06:36Z

I think this might be a mistake. So far this thread has focused on reasonable use and what "ought" to appear in translations (or source material). However, I note that most programming languages, including JavaScript and Java, are much more permissive about what they allow in strings. Since resource formats can be used for perverse reasons (or for non-perverse but unusual reasons--such as writing I18N demos!) I'm concerned that foreclosing the ability to serialize "any string" at the format level will come back to bite users in annoying or unexpected ways--not to mention adding overhead to validation and serialization steps.

This might be summarized as "if you don't want non-characters or C1 controls in your translations, don't use them" or perhaps "it's no business of yours what I want to store in my resources" :-)

The counter-argument, of course, is that this is to help protect users from doing bad things (such as double-conversion to UTF-8 or storing things they really shouldn't).

stasm · 2018-11-06T19:03:01Z

Thanks for your comment, @aphillips. I can see the general appeal of being more lenient with regard to the input, even if people really shouldn't do some of the things that are currently possible in Fluent's syntax, or that would become possible if we allowed all Unicode characters. After all, one of the principles that we listed for Fluent is "Be liberal in what you require but conservative in what you do".

I wonder if checking for weird characters would be better handled as an extra validation step which can issue warnings but not reject translations completely. We don't currently have such tools, but we'll get there eventually.
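As a rough illustration of such a validation step, the Python sketch below scans a translation string and collects warnings instead of rejecting the input. No such tool exists in Fluent according to this thread; `CharWarning`, `classify` and `lint_text` are hypothetical names, and the set of "weird" characters is an assumption based on the categories discussed above.

```python
from typing import NamedTuple, Optional


class CharWarning(NamedTuple):
    offset: int      # index of the character within the text
    codepoint: str   # formatted as U+XXXX
    reason: str


def classify(cp: int) -> Optional[str]:
    """Return why a code point is discouraged, or None if it is fine."""
    if cp in (0x9, 0xA, 0xD):  # tab, LF and CR are ordinary whitespace
        return None
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return "control character"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate code point"
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return None


def lint_text(text: str) -> list[CharWarning]:
    """Collect warnings for discouraged characters; never raises or rejects."""
    warnings = []
    for offset, ch in enumerate(text):
        reason = classify(ord(ch))
        if reason is not None:
            warnings.append(CharWarning(offset, f"U+{ord(ch):04X}", reason))
    return warnings


if __name__ == "__main__":
    for w in lint_text("Hello\u0085world\uFFFE"):
        print(f"warning: {w.reason} {w.codepoint} at offset {w.offset}")
```

The point of the design is that the parser stays permissive while tooling surfaces questionable characters to the author, who can then decide whether they are intentional.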

That said, I'm not sure what kind of use cases there are which would call for supporting e.g. non-characters. When it comes to parsing syntax which I consider "ugly", the major benefit of being lenient is that people can be sloppy when they author translations, and prettify them later. But would anyone try to sloppily use a non-character? :) What kind of perverse and non-perverse reasons are there that would be good examples?

Another reason to forbid these weird characters in Fluent 1.0 is that it's always easier to relax the grammar later on. We could do it in a 1.x milestone.

I'm really torn on this one. I created two PRs: #199 and #207, to try out both approaches. I like the robustness of the former and the leniency of the latter. @jfkthame, @aphillips, @Pike: Do you have more thoughts on this?

Pike · 2018-11-06T20:22:43Z

I'm for 207, for no better reason than that it's easier. Like, I just don't want to try to figure out why 199 would be correct.

stasm · 2018-11-09T12:12:13Z

Let's go ahead with the more permissive approach, but let's also define a recommendation in the spec to discourage translation authors from using control characters and non-characters. #207 is the PR implementing this approach. Thanks for comments, everyone!

stasm changed the title ~~Forbid C0 and C1 control chars in regular_char~~ Forbid C1 control chars in regular_char Oct 16, 2018

stasm added the syntax label Oct 16, 2018

stasm added this to To do in Syntax 0.8 via automation Oct 19, 2018

stasm changed the title ~~Forbid C1 control chars in regular_char~~ Forbid C1 control chars in any_char Oct 30, 2018

stasm moved this from To do to In progress in Syntax 0.8 Oct 30, 2018

stasm mentioned this issue Oct 30, 2018

Forbid control characters and all non-characters #199

Closed

stasm mentioned this issue Nov 6, 2018

Allow all Unicode characters #207

Merged

stasm moved this from Under Consideration to Accepted in Syntax 0.8 Nov 8, 2018

stasm moved this from Accepted to In review in Syntax 0.8 Nov 9, 2018

stasm closed this as completed in #207 Nov 9, 2018

Syntax 0.8 automation moved this from In review to Done Nov 9, 2018
