Skip to content

fix(developer): restrict invalid characters in identifiers in kmcmplib#14746

Merged
mcdurdin merged 5 commits intomasterfrom
fix/developer/14604-disallow-space-in-group-name
Sep 15, 2025
Merged

fix(developer): restrict invalid characters in identifiers in kmcmplib#14746
mcdurdin merged 5 commits intomasterfrom
fix/developer/14604-disallow-space-in-group-name

Conversation

@mcdurdin
Copy link
Copy Markdown
Member

The compiler has always been very ambiguous on which characters were accepted in group and store names, even to the point of accepting things like comma in a store name, which would then make it impossible to reference in an index statement!

This commit clarifies the allowable characters in an identifier. While it would have been possible to use UAX#31 for this, that would have extended the requirements for this change substantially, and may have caused us more trouble with legacy keyboards. Given kmcmplib is end-of-life (see epic/ng-compiler), I have chosen a lower friction approach. There are certainly other characters that could be excluded, but in general I have chosen to exclude only those that will definitely be problematic.

The set of allowable characters for deadkeys has actually been expanded in this release to match the store and group name rules.

It is expected that there may be some impacted keyboards, but addressing this change will be relatively straightforward, so I consider this to be an acceptable back-compatibility trade-off, see
https://github.com/keymanapp/keyman/wiki/Principles-of-Keyman-Code-Changes#4-source-backward-compatibility-keyboard-model-and-package-source-file-formats-should-be-backward-compatible

Fixes: #14604
Test-bot: skip
Build-bot: skip build:developer

The compiler has always been very ambiguous on which characters were
accepted in group and store names, even to the point of accepting
things like comma in a store name, which would then make it impossible
to reference in an `index` statement!

This commit clarifies the allowable characters in an identifier. While
it would have been possible to use UAX#31 for this, that would have
extended the requirements for this change substantially, and may have
caused us more trouble with legacy keyboards. Given kmcmplib is
end-of-life (see epic/ng-compiler), I have chosen a lower friction
approach. There are certainly other characters that could be excluded,
but in general I have chosen to exclude only those that will definitely
be problematic.

The set of allowable characters for deadkeys has actually been expanded
in this release to match the store and group name rules.

It is expected that there may be some impacted keyboards, but addressing
this change will be relatively straightforward, so I consider this to be
an acceptable back-compatibility trade-off, see
https://github.com/keymanapp/keyman/wiki/Principles-of-Keyman-Code-Changes#4-source-backward-compatibility-keyboard-model-and-package-source-file-formats-should-be-backward-compatible

Fixes: #14604
Test-bot: skip
Build-bot: skip build:developer
@keymanapp-test-bot
Copy link
Copy Markdown

keymanapp-test-bot bot commented Sep 12, 2025

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

  • Developer
    • Keyman Developer - build : all tests passed (no artifacts on BuildLevel "build")
    • Compiler Regression Tests - build : all tests passed (no artifacts on BuildLevel "build")
    • Keyman Developer (old PRs) - build : all tests passed (no artifacts on BuildLevel "build")
    • kmcomp.zip - build : all tests passed (no artifacts on BuildLevel "build")
    • kmcomp.zip (old PRs) - build : all tests passed (no artifacts on BuildLevel "build")
  • Keyboards
    • Test Keyboards - build : all tests passed (no artifacts on BuildLevel "build")

@keymanapp-test-bot keymanapp-test-bot bot added this to the A19S11 milestone Sep 12, 2025
mcdurdin added a commit to keymanapp/help.keyman.com that referenced this pull request Sep 12, 2025
Document the tightened validity of all identifiers in .kmn, which will
make it easier to build the new parser and allow for extension of the
language in the future.

Relates-to: keymanapp/keyman#14604
Relates-to: keymanapp/keyman#14746
Test-bot: skip
@mcdurdin
Copy link
Copy Markdown
Member Author

@markcsinclair I am aware that you are not currently available to review. Tagging you for your future reference.


namespace KmnCompilerMessages {
enum {
enum KmnCompilerMessages {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming the enum so we can use it elsewhere for cleaner types

Comment on lines -2085 to -2086
KMX_WCHAR const * DeadKeyChars =
u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_";
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually expanding deadkey name validity

Comment on lines +8 to +17
class CompilerMessage {
public:
virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg, const std::vector<std::string>& parameters) = 0;
virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg) = 0;
};

class DefaultCompilerMessage : public CompilerMessage {
virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg, const std::vector<std::string>& parameters) ;
virtual void report(enum KmnCompilerMessages::KmnCompilerMessages msg);
};
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern makes CompilerMessage easily mockable

mcdurdin added a commit to keymanapp/keyboards that referenced this pull request Sep 12, 2025
v19 of kmc will disallow spaces in group names. Pre-emptively tweaking
this. As this change has no functional impact, am not bumping the
version number or history.

Relates-to: keymanapp/keyman#14746
Relates-to: keymanapp/keyman#14604
@mcdurdin
Copy link
Copy Markdown
Member Author

I have tested this against the keyboards repository, and only sil_yi is impacted. sil_yi was the keyboard that triggered this investigation. I have opened PR keymanapp/keyboards#3720 to fix this.

@mcdurdin mcdurdin force-pushed the fix/developer/14604-disallow-space-in-group-name branch from b79cc94 to 08c36db Compare September 12, 2025 12:14
sil_yi was impacted by the changes in #14746, as it had the group name
'Unicode Group', which is now illegal, so the compiler fails to build
the keyboard at the referenced commit. Easiest workaround currently is
to remove it from the set of compared keyboards.
Copy link
Copy Markdown
Contributor

@jahorton jahorton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any corresponding changes that need to be made within the Web TS compiler?

@mcdurdin
Copy link
Copy Markdown
Member Author

Are there any corresponding changes that need to be made within the Web TS compiler?

No, because that starts with the output from kmcmplib, so all identifiers have already been checked.

KMX_BOOL Uni_IsControlCharacter(KMX_WCHAR ch);

/*
Unicode version 16.0; GC=Zs.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these are all the spaces that Unicode defines? A sentence explaining that might help...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was the cryptic GC=Zs 😆 I will elucidate!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got Unicode 17.0 now 😄
It's got the same list

@darcywong00 darcywong00 modified the milestones: A19S11, A19S12 Sep 13, 2025
static ERROR_NameMustNotContainSquareBrackets = SevError | 0x0BA;
static Error_NameMustNotContainSquareBrackets = (o:{name:string}) => m(
this.ERROR_NameMustNotContainSquareBrackets,
`The referenced name '${def(o.name)}' must not contain opening or closing square brackets`,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that the error message has enough detail for the user to clean up on the invalid parts of the name.

I also wonder since the error codes are split up, would it be helpful to have a specific message for each error? Granted it could also be frustrating to have to keep recompiling to find each error...

What about
"The referenced name must (not) contain X...
Names (or identifiers) must not contain spaces, commas, ...."

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could simplify to a single error message... still need all the tests though. But it would make the error message very long.

KMX_BOOL Uni_IsControlCharacter(KMX_WCHAR ch);

/*
Unicode version 16.0; GC=Zs.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got Unicode 17.0 now 😄
It's got the same list

ERROR_NameMustNotContainSpaces = SevError | 0x0B7,
ERROR_NameMustNotContainComma = SevError | 0x0B8,
ERROR_NameMustNotContainParentheses = SevError | 0x0B9,
ERROR_NameMustNotContainSquareBrackets = SevError | 0x0BA,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curly braces allowed though, huh?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curly braces are not recognized characters in Kmn, whereas square brackets are.

Copy link
Copy Markdown
Contributor

@ermshiperete ermshiperete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now 😄

@mcdurdin mcdurdin merged commit 121fb80 into master Sep 15, 2025
8 checks passed
@mcdurdin mcdurdin deleted the fix/developer/14604-disallow-space-in-group-name branch September 15, 2025 12:49
@github-project-automation github-project-automation bot moved this from Todo to Done in Keyman Sep 15, 2025
@keyman-server
Copy link
Copy Markdown
Collaborator

Changes in this pull request will be available for download in Keyman version 19.0.119-alpha

cvosoft pushed a commit to cvosoft/keyman that referenced this pull request Oct 23, 2025
sil_yi was impacted by the changes in keymanapp#14746, as it had the group name
'Unicode Group', which is now illegal, so the compiler fails to build
the keyboard at the referenced commit. Easiest workaround currently is
to remove it from the set of compared keyboards.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

bug(developer): TIKE's basic .kmn parser fails if space is found in group name

5 participants