Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunk unicode similar to spread syntax and less like regex #462

Open
frostburn opened this issue Jan 29, 2024 · 3 comments
Open

Chunk unicode similar to spread syntax and less like regex #462

frostburn opened this issue Jan 29, 2024 · 3 comments

Comments

@frostburn
Copy link
Contributor

The way unicode codepoints are chunked in Peggy patterns doesn't follow visual chunking.

Example grammar:

sharp.peggy

SharpAccidental = [#x♯𝄪]

Unexpected result:

$ npx peggy sharp.peggy -t 'x'
'x'
$ npx peggy sharp.peggy -t '𝄪'
Error running test
Error: Expected end of input but "" found.
 --> command line:1:2
  |
1 | 𝄪
  |  ^

Reasonable behavior in node using spread syntax:

> [...'#x♯𝄪']
[ '#', 'x', '♯', '𝄪' ]
@hildjj
Copy link
Contributor

hildjj commented Jan 29, 2024

I think the actionable part here is allowing non-BMP characters in a character class to work as expected. The current behavior is useless, so backward-compatibility with older Peggy/peg.js versions is not needed. The rule given above gets generated with:

var peg$r0 = /^[#x\u266F\uD834\uDD2A]/;
var peg$e0 = peg$classExpectation(["#", "x", "\u266F", "\uD834", "\uDD2A"], false, false);

Both of which are wrong. This should generate:

var peg$r0 = /^[#x\u266F\uD834\uDD2A]/u; // JS backward-compat issue!
var peg$e0 = peg$classExpectation(["#", "x", "\u266F", "\uD834\uDD2A"], false, false);

or:

var peg$r0 = /^[#x\u266F\u{1d12a}]/u; // JS backward-compat issue!
var peg$e0 = peg$classExpectation(["#", "x", "\u266F", "\u{1d12a}"], false, false);

(see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode)

Alternately, this could be automatically turned into the following rule:

SharpAccidental = [#x♯] / "𝄪"

If we are dropping IE11 support in 4.0, the first solution is probably better, but we should only put a /u at the end of regeular expressions that need it; there may be performance changes at the least.

@reverofevil
Copy link

I'm unsure if conditional /u won't lead to unexpected behavior. IMO it should be a generator option, and as such can be already supported. I'd rather change to /u by default in 4.0.

@hildjj
Copy link
Contributor

hildjj commented Jan 29, 2024

Let's talk about browser support in #463.
Let's do some benchmarking before making a final decision on implementation approach?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants