Forbid C1 control chars in any_char #182
Comments
The current definition of
OTOH, some languages choose to be more liberal. From the ECMAScript spec:
@jfkthame, may I summon you for this one? Can you advise if forbidding more ASCII control characters would be a good move for Fluent?
Yes, I think excluding control characters (except for
(Please double-check that I got the ranges right!)
I think this might be a mistake. So far this thread has focused on reasonable use and what "ought" to appear in translations (or source material). However, I note that most programming languages, including JavaScript and Java, are much more permissive about what they allow in strings. Since resource formats can be used for perverse reasons (or for non-perverse but unusual reasons--such as writing I18N demos!) I'm concerned that foreclosing the ability to serialize "any string" at the format level will come back to bite users in annoying or unexpected ways--not to mention adding overhead to validation and serialization steps. This might be summarized as "if you don't want non-characters or C1 controls in your translations, don't use them" or perhaps "it's no business of yours what I want to store in my resources" :-) The counter-argument, of course, is that this is to help protect users from doing bad things (such as double-conversion to UTF-8 or storing things they really shouldn't).
Thanks for your comment, @aphillips. I can see the general appeal of being more lenient with regard to the input, even if people really shouldn't do some of the things that are currently possible in Fluent's syntax, or that would become possible if all Unicode characters were allowed. One of the principles that we listed for Fluent is Be liberal in what you require but conservative in what you do, after all. I wonder if checking for weird characters would be better handled as an extra validation step which can issue warnings but not reject translations completely. We don't currently have such tools, but we'll get there eventually. That said, I'm not sure what kind of use cases there are which would call for supporting e.g. non-characters. When it comes to parsing syntax which I consider "ugly", the major benefit of being lenient is that people can be sloppy when they author translations, and prettify them later. But would anyone try to sloppily use a non-character? :) What kind of perverse and non-perverse reasons are there that would be good examples? Another reason to forbid these weird characters in Fluent 1.0 is that it's always easier to relax the grammar later on. We could do it in a 1.x milestone. I'm really torn on this one. I created two PRs: #199 and #207, to try out both approaches. I like the robustness of the former and the leniency of the latter. @jfkthame, @aphillips, @Pike: Do you have more thoughts on this?
I'm for 207, for no better reason than that it's easier. Like, I just don't want to try to figure out why 199 would be correct.
Let's go ahead with the more permissive approach, but let's also define a recommendation in the spec to discourage translation authors from using control characters and non-characters. #207 is the PR implementing this approach. Thanks for comments, everyone!
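The "warn but do not reject" validation step discussed above could look roughly like the following Python sketch. It is not part of any existing Fluent tooling: the function names, the warning format, and the choice to exempt tab, LF, and CR are all assumptions made for illustration. The noncharacter test follows the Unicode definition (U+FDD0..U+FDEF plus every code point ending in FFFE or FFFF).

```python
from typing import List


def is_noncharacter(cp: int) -> bool:
    """True for the Unicode noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_discouraged_control(cp: int) -> bool:
    """True for C0 controls, DEL, and C1 controls.

    Tab, LF, and CR are exempted here; that exemption is an assumption
    of this sketch, not something taken from the Fluent spec.
    """
    if cp in (0x09, 0x0A, 0x0D):
        return False
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def lint_text(text: str) -> List[str]:
    """Return human-readable warnings; never raises and never rejects input."""
    warnings = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_discouraged_control(cp):
            warnings.append(f"offset {index}: control character U+{cp:04X}")
        elif is_noncharacter(cp):
            warnings.append(f"offset {index}: noncharacter U+{cp:04X}")
    return warnings
```

Because `lint_text` only returns a list of warnings, a caller can print them and still accept the translation, which matches the permissive approach chosen here.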
Follow-up to #174. Right now, regular_char is defined as:
The U+21-U+D7FF range includes more control characters which make little sense in translations:
I'd like to suggest that we forbid characters from U+7F to U+9F, inclusive. This would be in line with the recommendation included in the definition of NT-Char in the XML spec, although with the exception of the U+85 (NEL, Next Line) character. NEL is equivalent to CRLF and appears to be mostly used on IBM mainframes running z/OS. I think it's OK to forbid it in Fluent (we can always revisit later if need be).
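As a rough illustration of the proposed exclusion, the range U+7F to U+9F (inclusive) can be expressed as a simple membership test. This is a minimal Python sketch; `PROPOSED_FORBIDDEN` and `is_proposed_forbidden` are hypothetical names, not identifiers from the Fluent grammar or any Fluent implementation.

```python
# Code points U+007F through U+009F inclusive: DEL plus the C1 controls.
# range() excludes its upper bound, hence 0xA0.
PROPOSED_FORBIDDEN = range(0x7F, 0xA0)


def is_proposed_forbidden(ch: str) -> bool:
    """True if the single character `ch` falls in the proposed forbidden range."""
    return ord(ch) in PROPOSED_FORBIDDEN


# NEL (U+0085) sits inside the range, so it would be forbidden as well.
nel_is_forbidden = is_proposed_forbidden("\u0085")
```

Note that U+00A0 (no-break space), the first code point after the range, remains allowed under this check.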