Add support for super-BMP unicode range; closes #586 #616

blake-regalia · 2019-06-03T07:30:27Z

PR type

Bug fix (non-breaking change which fixes an issue): no
New feature (non-breaking change which adds functionality): yes
Breaking change (fix or feature that would cause existing functionality to change): no
Documentation change: no

Prerequisites

I have read the CONTRIBUTING.md document: yes
I have updated the documentation accordingly: no documentation currently exists for unicode support
I have added tests to cover my changes: yes

Description

Closes #586. Adds support for the super-BMP unicode range via the commonly used capital 'U' escape character for 8-digit unicode escape sequences.

Example:

control = [\\U000DD038-\\U0010FFFF]
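
For readers unfamiliar with the notation, here is a minimal JavaScript sketch of what an 8-digit escape denotes. `decodeLongEscape` is a hypothetical helper for illustration only, not the code in this patch, and rejecting values above 10FFFF is an assumption of the sketch rather than documented behaviour.

```javascript
// Hypothetical helper (not from the patch): convert the eight hex digits
// that follow "\U" into the JavaScript string for that code point.
function decodeLongEscape(hexDigits) {
  if (!/^[0-9a-fA-F]{8}$/.test(hexDigits)) {
    throw new SyntaxError("expected exactly 8 hex digits, got: " + hexDigits);
  }
  const codePoint = parseInt(hexDigits, 16);
  // Assumption of this sketch: only defined Unicode code points are accepted.
  if (codePoint > 0x10FFFF) {
    throw new RangeError("code point above U+10FFFF: " + hexDigits);
  }
  // String.fromCodePoint emits a surrogate pair for astral code points.
  return String.fromCodePoint(codePoint);
}

const pileOfPoo = decodeLongEscape("0001F4A9");
console.log(pileOfPoo.length); // 2, because the string holds two UTF-16 code units
```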

vercel · 2019-06-03T07:30:36Z

This pull request is automatically deployed with Now.
To access deployments, click Details below or on the icon next to each push.

Latest deployment for this branch: https://pegjs-next-web-git-fork-blake-regalia-add-super-bmp-unic-b82360.futagoza.now.sh

codeclimate · 2019-06-03T07:32:45Z

Code Climate has analyzed commit d10662b and detected 0 issues on this pull request.

View more on Code Climate.

deostroll · 2019-06-22T05:22:38Z

Hi,

I like your idea of introducing \\U to make PEG.js support super-BMP characters. I haven't yet gone through the source enough to understand how the parser is actually generated. However, I have to ask: have you considered ES6 support for super-BMP characters in string/regex literals? I come from a Node.js background, so I can't quite picture how this would work in browsers, but I'd like your thoughts on the matter.

Seb35 · 2019-07-28T12:34:22Z

I’ve tested this patch. Two remarks: one about syntax and one about an issue.

Why not use the ES6-like syntax \u{XXXXX} for these characters? And why extend beyond the existing Unicode code points (only code points up to 10FFFF are defined)? I would find it more intuitive to use the ES6 syntax with only existing code points.
A grammar like a = "\U0001F4A9" works correctly on the input string \uD83D\uDCA9, but a = [\\U0001F4A9] does not, because it is replaced by the UTF-16 surrogates, var peg$r0 = /^[\uD83D\uDCA9]/;, and only the first surrogate is captured. I have no (good) idea how to improve this for now.
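
The character-class behaviour described in the second remark can be reproduced in plain JavaScript, independent of PEG.js. A small sketch, assuming an engine that supports the ES2015 `u` flag; the variable names are illustrative only:

```javascript
// U+1F4A9 written as its two UTF-16 code units.
const input = "\uD83D\uDCA9";

// Without the `u` flag, the class contains two independent code units,
// so it matches only the first surrogate of the input.
const legacyClass = /^[\uD83D\uDCA9]/.exec(input);
console.log(legacyClass[0].length); // 1

// Outside a class, the same two escapes form a sequence and match the pair.
const legacySequence = /^\uD83D\uDCA9/.exec(input);
console.log(legacySequence[0].length); // 2

// With the `u` flag, the escaped pair inside the class is one code point.
const unicodeClass = /^[\uD83D\uDCA9]/u.exec(input);
console.log(unicodeClass[0].length); // 2
```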

The only idea I have to fix the second point is decomposing astral (=super-BMP) code points into two surrogates. E.g. [A\\U0001F4A9B] would become the regex /^([A]|\uD83D\uDCA9|[B])/ and [A\\U0001F4A9-\\U0001F4AAB] would become the regex /^([A]|\uD83D[\uDCA9-\uDCAA]|[B])/. But that would imply some computations to create specific regexes and compute the ranges.
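
The decomposition suggested above could look roughly like this. It is only a sketch: `toSurrogates`, `esc`, and `astralRangeToSource` are hypothetical names, both endpoints are assumed to be astral, and none of it is taken from the patch or from the regenerate/regexpu libraries.

```javascript
// Split an astral code point (U+10000..U+10FFFF) into UTF-16 surrogates.
function toSurrogates(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not an astral code point: " + codePoint.toString(16));
  }
  const offset = codePoint - 0x10000;
  return {
    high: 0xD800 + (offset >> 10),   // top 10 bits
    low: 0xDC00 + (offset & 0x3FF),  // bottom 10 bits
  };
}

// Format one UTF-16 code unit as a \uXXXX escape for regex source text.
function esc(unit) {
  return "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
}

// Regex source for a run of code units: a single escape or a class range.
function unitRange(from, to) {
  return from === to ? esc(from) : "[" + esc(from) + "-" + esc(to) + "]";
}

// Build regex source matching every code point in [start, end].
// Assumes both endpoints are astral and start <= end.
function astralRangeToSource(start, end) {
  if (start > end) {
    throw new RangeError("start must not exceed end");
  }
  const a = toSurrogates(start);
  const b = toSurrogates(end);
  if (a.high === b.high) {
    // Same high surrogate: only the low surrogate varies.
    return esc(a.high) + unitRange(a.low, b.low);
  }
  // Different high surrogates: first block, full middle blocks, last block.
  const parts = [esc(a.high) + unitRange(a.low, 0xDFFF)];
  if (b.high - a.high > 1) {
    parts.push(unitRange(a.high + 1, b.high - 1) + unitRange(0xDC00, 0xDFFF));
  }
  parts.push(esc(b.high) + unitRange(0xDC00, b.low));
  return parts.join("|");
}

// Rebuild the second example from the comment: [A\\U0001F4A9-\\U0001F4AAB]
const source = "^(?:[A]|" + astralRangeToSource(0x1F4A9, 0x1F4AA) + "|[B])";
const pattern = new RegExp(source);
console.log(source); // ^(?:[A]|\uD83D[\uDCA9-\uDCAA]|[B])
```

Mixed classes that combine BMP and astral members would still need to be split into alternatives as in the comment; this sketch covers only the astral range part.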

Seb35 · 2019-07-28T13:03:26Z

The issue about regexes is quite complicated, see this blog post and the two associated libraries regenerate and regexpu. Using one of the libraries could solve the issue.

StoneCypher · 2020-02-02T21:52:35Z

I will happily write the tests to ensure that this is safe, probably using fast-check, and then merge this.

I need this very badly, and didn't know it had been waiting in the wings for almost a year

@dmajda Can I get commit rights so that I can release 0.10.1? This would probably go in.

dmajda · 2020-02-03T19:55:42Z

> @dmajda Can I get commit rights so that I can release 0.10.1? This would probably go in.

I no longer have any rights to the project. Resolving this is up to @futagoza. Thanks for understanding.

StoneCypher · 2020-02-03T22:00:36Z

Understood.

Seb35 · 2020-06-04T19:28:13Z

I propose an alternative syntax using the ES6 syntax for astral code points like \u{1F4AF}, integrating this commit. See #651.

feat(unicode): support super-BMP range; fixes pegjs#586

d10662b

vercel bot requested a deployment to staging June 3, 2019 07:30 Pending

vercel bot deployed to staging June 3, 2019 07:32 View deployment

StoneCypher mentioned this pull request Feb 2, 2020

Full Unicode support, namely for codepoints outside the BMP #586

Open

StoneCypher mentioned this pull request Feb 7, 2020

See if we can get high unicode working StoneCypher/fsl#102

Closed

StoneCypher mentioned this pull request Apr 18, 2021

Higher unicode peggyjs/peggy#67

Closed
