fix(developer): restrict invalid characters in identifiers in kmcmplib #14746
The compiler has always been very ambiguous on which characters were accepted in group and store names, even to the point of accepting things like comma in a store name, which would then make it impossible to reference in an `index` statement!

This commit clarifies the allowable characters in an identifier. While it would have been possible to use UAX#31 for this, that would have extended the requirements for this change substantially, and may have caused us more trouble with legacy keyboards. Given kmcmplib is end-of-life (see epic/ng-compiler), I have chosen a lower friction approach. There are certainly other characters that could be excluded, but in general I have chosen to exclude only those that will definitely be problematic.

The set of allowable characters for deadkeys has actually been expanded in this release to match the store and group name rules.

It is expected that there may be some impacted keyboards, but addressing this change will be relatively straightforward, so I consider this to be an acceptable back-compatibility trade-off, see
https://github.com/keymanapp/keyman/wiki/Principles-of-Keyman-Code-Changes#4-source-backward-compatibility-keyboard-model-and-package-source-file-formats-should-be-backward-compatible

Fixes: #14604
Test-bot: skip
Build-bot: skip build:developer
User Test Results: User tests are not required
Document the tightened validity of all identifiers in .kmn, which will make it easier to build the new parser and allow for extension of the language in the future. Relates-to: keymanapp/keyman#14604 Relates-to: keymanapp/keyman#14746 Test-bot: skip
@markcsinclair I am aware that you are not currently available to review. Tagging you for your future reference.
  namespace KmnCompilerMessages {
-   enum {
+   enum KmnCompilerMessages {
naming the enum so we can use it elsewhere for cleaner types
KMX_WCHAR const * DeadKeyChars =
  u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_";
Actually expanding deadkey name validity
class CompilerMessage {
public:
  virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg, const std::vector<std::string>& parameters) = 0;
  virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg) = 0;
};

class DefaultCompilerMessage : public CompilerMessage {
  virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg, const std::vector<std::string>& parameters) ;
  virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg);
};
This pattern makes CompilerMessage easily mockable
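As a rough illustration of why the abstract interface helps with testing, here is a minimal sketch of a recording test double. The `RecordingCompilerMessage` class, the `checkGroupName` function, the virtual destructor, and the enum values are all assumptions made for this example; only the two `report` signatures come from the diff above.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace KmnCompilerMessages {
  // Stand-in values for illustration; the real enum combines a severity with each code.
  enum KmnCompilerMessages {
    ERROR_NameMustNotContainSpaces = 0x0B7,
    ERROR_NameMustNotContainComma  = 0x0B8
  };
}

// Interface with the two report() overloads quoted in the diff.
class CompilerMessage {
public:
  virtual ~CompilerMessage() = default;  // assumption: added so test doubles are destroyed safely
  virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg, const std::vector<std::string>& parameters) = 0;
  virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg) = 0;
};

// Hypothetical test double: records every reported message instead of emitting it.
class RecordingCompilerMessage : public CompilerMessage {
public:
  struct Entry {
    KmnCompilerMessages::KmnCompilerMessages msg;
    std::vector<std::string> parameters;
  };
  std::vector<Entry> entries;

  void report(enum KmnCompilerMessages::KmnCompilerMessages msg, const std::vector<std::string>& parameters) override {
    entries.push_back({msg, parameters});
  }
  void report(enum KmnCompilerMessages::KmnCompilerMessages msg) override {
    entries.push_back({msg, {}});
  }
};

// Hypothetical code under test: reports an error if a group name contains an ASCII space.
void checkGroupName(const std::string& name, CompilerMessage& messages) {
  if (name.find(' ') != std::string::npos) {
    messages.report(KmnCompilerMessages::ERROR_NameMustNotContainSpaces, {name});
  }
}

// Example of how a unit test might use the double.
void exampleMockUsage() {
  RecordingCompilerMessage mock;
  checkGroupName("Unicode Group", mock);
  assert(mock.entries.size() == 1);
  assert(mock.entries[0].msg == KmnCompilerMessages::ERROR_NameMustNotContainSpaces);
}
```

Because the code under test only sees the `CompilerMessage` reference, a test can inspect `entries` directly rather than parsing compiler output.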
v19 of kmc will disallow spaces in group names. Pre-emptively tweaking this. As this change has no functional impact, am not bumping the version number or history. Relates-to: keymanapp/keyman#14746 Relates-to: keymanapp/keyman#14604
I have tested this against the keyboards repository, and only sil_yi is impacted. sil_yi was the keyboard that triggered this investigation. I have opened PR keymanapp/keyboards#3720 to fix this.
b79cc94 to 08c36db
sil_yi was impacted by the changes in #14746, as it had the group name 'Unicode Group', which is now illegal, so the compiler fails to build the keyboard at the referenced commit. Easiest workaround currently is to remove it from the set of compared keyboards.
jahorton left a comment:
Are there any corresponding changes that need to be made within the Web TS compiler?
No, because that starts with the output from kmcmplib, so all identifiers have already been checked.
.../src/kmc-kmn/test/fixtures/invalid-keyboards/error_name_contains_invalid_character-group.kmn
.../src/kmc-kmn/test/fixtures/invalid-keyboards/error_name_contains_invalid_character-store.kmn
...rc/kmc-kmn/test/fixtures/invalid-keyboards/error_name_contains_invalid_character-deadkey.kmn
KMX_BOOL Uni_IsControlCharacter(KMX_WCHAR ch);

/*
  Unicode version 16.0; GC=Zs.
I guess these are all the spaces that Unicode defines? A sentence explaining that might help...
Yeah, that was the cryptic GC=Zs 😆 I will elucidate!
We've got Unicode 17.0 now 😄
It's got the same list
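For readers unfamiliar with the notation, GC=Zs is the Unicode General_Category value Space_Separator. A minimal sketch of a check against that category follows; the function name `isSpaceSeparator` and the use of `char16_t` are assumptions for this example and do not reproduce kmcmplib's actual helper.

```cpp
#include <cassert>

// Hypothetical helper: returns true if ch has General_Category Zs (Space_Separator).
// The list is the same in Unicode 16.0 and 17.0, as noted in the thread above.
bool isSpaceSeparator(char16_t ch) {
  switch (ch) {
    case 0x0020:  // SPACE
    case 0x00A0:  // NO-BREAK SPACE
    case 0x1680:  // OGHAM SPACE MARK
    case 0x202F:  // NARROW NO-BREAK SPACE
    case 0x205F:  // MEDIUM MATHEMATICAL SPACE
    case 0x3000:  // IDEOGRAPHIC SPACE
      return true;
    default:
      // U+2000..U+200A: EN QUAD through HAIR SPACE
      return ch >= 0x2000 && ch <= 0x200A;
  }
}

// Example checks: tab (a control character) and U+200B ZERO WIDTH SPACE are not in Zs.
void exampleSpaceChecks() {
  assert(isSpaceSeparator(u' '));
  assert(!isSpaceSeparator(u'\t'));
  assert(!isSpaceSeparator(0x200B));
}
```

Note that tab and other control characters fall outside Zs, which is presumably why the header quoted above also declares a separate `Uni_IsControlCharacter`.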
static ERROR_NameMustNotContainSquareBrackets = SevError | 0x0BA;
static Error_NameMustNotContainSquareBrackets = (o:{name:string}) => m(
  this.ERROR_NameMustNotContainSquareBrackets,
  `The referenced name '${def(o.name)}' must not contain opening or closing square brackets`,
I get that the error message has enough detail for the user to clean up the invalid parts of the name.
Since the error codes are split up, I also wonder whether it would be helpful to have a specific message for each error. Granted, it could also be frustrating to have to keep recompiling to find each error...
What about
"The referenced name must (not) contain X...
Names (or identifiers) must not contain spaces, commas, ...."
I could simplify to a single error message... still need all the tests though. But it would make the error message very long.
ERROR_NameMustNotContainSpaces = SevError | 0x0B7,
ERROR_NameMustNotContainComma = SevError | 0x0B8,
ERROR_NameMustNotContainParentheses = SevError | 0x0B9,
ERROR_NameMustNotContainSquareBrackets = SevError | 0x0BA,
curly braces allowed though, huh?
Curly braces are not recognized characters in Kmn, whereas square brackets are.
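To make the four checks concrete, here is a simplified sketch of how a name could be classified against them. The `NameCheck` enum and `checkName` function are hypothetical and are not the kmcmplib implementation; in particular, this sketch tests only the ASCII space, whereas the header quoted earlier suggests the real check covers all GC=Zs characters.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type mirroring the four error codes quoted above.
enum class NameCheck {
  Ok,
  ContainsSpaces,
  ContainsComma,
  ContainsParentheses,
  ContainsSquareBrackets
};

// Hypothetical validator: returns the first problem found, scanning left to right.
NameCheck checkName(const std::u16string& name) {
  for (char16_t ch : name) {
    switch (ch) {
      case u' ':  // assumption: ASCII space only in this sketch
        return NameCheck::ContainsSpaces;
      case u',':
        return NameCheck::ContainsComma;
      case u'(':
      case u')':
        return NameCheck::ContainsParentheses;
      case u'[':
      case u']':
        return NameCheck::ContainsSquareBrackets;
      default:
        break;  // any other character, including curly braces, is accepted here
    }
  }
  return NameCheck::Ok;
}

// Example: the sil_yi group name from this PR fails, while a name with braces passes.
void exampleNameChecks() {
  assert(checkName(u"Unicode Group") == NameCheck::ContainsSpaces);
  assert(checkName(u"a{b}") == NameCheck::Ok);
}
```

Returning on the first offending character corresponds to the one-error-per-compile behaviour discussed in the review thread; collecting every problem into a list would be an alternative design.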
Changes in this pull request will be available for download in Keyman version 19.0.119-alpha