v2.0 additional restricted literal characters #250

Closed
tabatkins opened this issue Nov 4, 2021 · 4 comments · Fixed by #353
Labels: breaking (This can only be done for the next major version of KDL), enhancement (New feature or request)

Comments

@tabatkins
Contributor

Currently, idents disallow a few characters from being expressed literally, requiring they be escaped if authors want to include them:

  • codepoints < 0x20 (control characters)
  • codepoints > 0x10FFFF (invalid codepoints)
  • some ASCII characters reserved for syntax reasons

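I think there are a few more we can reasonably restrict to make KDL documents more readable/understandable:

  • 0x7F (the DEL control character)
  • the Unicode direction-control characters
  • the BOM (0xFEFF) anywhere other than the start of a document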

Removing 0x7F just seems like fixing an omission; it's easy to forget that the ASCII control characters aren't contiguous.

Removing the direction-control characters helps keep KDL source readable. The direction-override characters in particular are fraught in plain-text documents, as they can cause the text that follows to display in the wrong direction (as demonstrated in the recent, somewhat hyperbolic complaints about them showing up in Rust and other source languages as a possible review attack). If these characters are desired in text values, such as strings, they can still be escaped; their literal use in what is otherwise an ASCII-based language is virtually always either accidental or malicious, since they're intended for text formatting and have no semantic meaning.
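As a rough illustration of the restriction described above, here is a minimal Python sketch that flags 0x7F and the direction-control characters in a candidate identifier. The helper name `disallowed_in_ident` is hypothetical, and the exact code point set (U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069) is an assumption based on Unicode's bidirectional formatting characters, not something taken from the KDL spec.

```python
# Assumed set of direction-control code points (Unicode bidi formatting characters).
BIDI_CONTROLS = (
    {0x200E, 0x200F}                # LRM, RLM
    | set(range(0x202A, 0x202F))    # LRE, RLE, PDF, LRO, RLO
    | set(range(0x2066, 0x206A))    # LRI, RLI, FSI, PDI
)


def disallowed_in_ident(text):
    """Return (index, 'U+XXXX') pairs for DEL and bidi controls found in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) == 0x7F or ord(ch) in BIDI_CONTROLS
    ]


print(disallowed_in_ident("node"))        # []
print(disallowed_in_ident("no\u202Ede"))  # [(2, 'U+202E')]
```

A parser could reject any identifier for which this list is non-empty, while still accepting the same code points when they are written as escapes inside a quoted string.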

The BOM is allowed at the start of a KDL document
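Assuming the intent is that U+FEFF is acceptable only as the very first code point of a document, a check for stray BOMs could look roughly like this. The function name `misplaced_bom_positions` is hypothetical.

```python
BOM = "\ufeff"


def misplaced_bom_positions(document):
    """Return the indexes of every U+FEFF that is not at index 0."""
    return [i for i, ch in enumerate(document) if ch == BOM and i != 0]


print(misplaced_bom_positions("\ufeffnode 1"))  # []
print(misplaced_bom_positions("no\ufeffde 1"))  # [2]
```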

(A previous issue suggested restricting the surrogate-pair characters as well (0xD800-DFFF); these are already restricted implicitly by the requirement that KDL documents be encoded in UTF-8, where such codepoints can't be validly encoded. As such I'm continuing to omit them from these suggestions.)
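The point about surrogates can be checked with a strict UTF-8 codec. The sketch below uses Python's built-in encoder, which refuses lone surrogates; the helper name `utf8_encodable` is made up for this example.

```python
def utf8_encodable(code_point):
    """True if the code point can be encoded by a strict UTF-8 encoder."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(utf8_encodable(0xD7FF))  # True  (just below the surrogate range)
print(utf8_encodable(0xD800))  # False (first surrogate)
print(utf8_encodable(0xDFFF))  # False (last surrogate)
print(utf8_encodable(0xE000))  # True  (just above the surrogate range)
```

Since a conforming UTF-8 document cannot contain these code points at all, listing them separately in the restricted set would be redundant.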

While there are still a number of "invisible" characters in Unicode that could potentially be confusing or accidental, they also have semantic uses, so I don't currently recommend restricting them.

@tabatkins added the breaking and enhancement labels Nov 4, 2021
@marrus-sh

I think you should not remove direction characters as they make it impossible to literally encode bidirectional strings, which is important for internationalization. It is true that BIDI control characters can create review‐attacks, but KDL is not a programming language and the probability of someone needing it to encode lengthy strings (which may include bidirectional text) is pretty high. Linters and formatters can be used by individual projects to detect and warn about the use of these characters if needed; there is no reason to forbid them at a language level.

I would suggest disallowing exactly the same characters as RestrictedChar in XML 1.1, plus U+0000, U+FFFE, and U+FFFF (which are not allowed to be escaped in XML either).
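For reference, the set suggested here can be written as a small predicate. The ranges below follow the `RestrictedChar` production in XML 1.1 ([#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F]); the function name `is_disallowed` is hypothetical, and this is a sketch of the suggestion, not of what KDL ended up adopting.

```python
# Inclusive ranges from the XML 1.1 RestrictedChar production.
XML11_RESTRICTED = [
    (0x01, 0x08),
    (0x0B, 0x0C),
    (0x0E, 0x1F),
    (0x7F, 0x84),
    (0x86, 0x9F),
]

# The three extra code points named in the suggestion.
EXTRA = {0x0000, 0xFFFE, 0xFFFF}


def is_disallowed(code_point):
    """True if the code point falls in RestrictedChar or the extra set."""
    return code_point in EXTRA or any(
        low <= code_point <= high for low, high in XML11_RESTRICTED
    )


print(is_disallowed(0x7F))    # True  (DEL)
print(is_disallowed(0x09))    # False (tab)
print(is_disallowed(0x85))    # False (NEL is excluded from RestrictedChar)
print(is_disallowed(0x202E))  # False (bidi controls stay legal under this suggestion)
```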

@zkat mentioned this issue Sep 5, 2022
@Lucretiel
Contributor

> I think you should not remove direction characters as they make it impossible to literally encode bidirectional strings, which is important for internationalization.

To be clear, the proposal is only to remove them from identifiers like node. There's nothing stopping someone from using them (in either literal or escaped form) in a quoted string.

@tabatkins
Contributor Author

Well, my post was unclear; I talked about the ident restrictions at first, but then later mentioned being able to include them in strings via escapes.

But yeah, I think just talking about idents is fine. (Notably, you can't escape anything in raw strings, which would be somewhat limiting.)

zkat added a commit that referenced this issue Dec 11, 2023
zkat added a commit that referenced this issue Dec 11, 2023
@zkat self-assigned this Dec 11, 2023
zkat added a commit that referenced this issue Dec 11, 2023
zkat added a commit that referenced this issue Dec 13, 2023
zkat added a commit that referenced this issue Dec 13, 2023
#352)

* fix some confusion in grammar syntax, and actually specify the syntax itself

Fixes: #345

* allow ,<> as identifier characters since they no longer need to be reserved

* fix typo

* disallow more code points and outright ban certain ones from KDL documents altogether (#353)

Fixes: #250

* `r` prefix is no longer required for raw strings (#354)

Fixes: #337
@zkat
Member

zkat commented Dec 13, 2023

These changes have been merged into the kdl-v2 branch

@zkat closed this as completed Dec 13, 2023