Add support for super-BMP unicode range; closes #586 #616

Open
wants to merge 1 commit into
base: master

blake-regalia

PR type

  • Bug fix (non-breaking change which fixes an issue): no
  • New feature (non-breaking change which adds functionality): yes
  • Breaking change (fix or feature that would cause existing functionality to change): no
  • Documentation change: no

Prerequisites

  • I have read the CONTRIBUTING.md document: yes
  • I have updated the documentation accordingly: no documentation currently exists for unicode support
  • I have added tests to cover my changes: yes

Description

Closes #586. Adds support for the super-BMP Unicode range via the commonly used capital 'U' escape character for 8-digit Unicode escape sequences.

Example:

control = [\\U000DD038-\\U0010FFFF]
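For readers unfamiliar with the 8-digit form, here is a minimal sketch of how such an escape maps to a JavaScript string. The helper name `decodeLongEscape` is hypothetical and is not taken from this patch; it only illustrates the arithmetic, assuming the escape is always `\U` followed by exactly eight hex digits.

```javascript
// Hypothetical helper (not from the patch): decode one \UXXXXXXXX escape.
function decodeLongEscape(escape) {
  const match = /^\\U([0-9A-Fa-f]{8})$/.exec(escape);
  if (match === null) {
    throw new SyntaxError("Expected \\U followed by 8 hex digits: " + escape);
  }
  const codePoint = parseInt(match[1], 16);
  // Unicode defines no code points above U+10FFFF.
  if (codePoint > 0x10FFFF) {
    throw new RangeError("Code point beyond U+10FFFF: " + escape);
  }
  // For astral code points the result is a UTF-16 surrogate pair.
  return String.fromCodePoint(codePoint);
}

const decoded = decodeLongEscape("\\U0001F4A9");
console.log(decoded === "\uD83D\uDCA9"); // true
```

A BMP value such as `\U00000041` yields a one-unit string, while an astral value yields two code units.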

@vercel

vercel bot commented Jun 3, 2019

This pull request is automatically deployed with Now.
To access deployments, click Details below or on the icon next to each push.

Latest deployment for this branch: https://pegjs-next-web-git-fork-blake-regalia-add-super-bmp-unic-b82360.futagoza.now.sh

@codeclimate

codeclimate bot commented Jun 3, 2019

Code Climate has analyzed commit d10662b and detected 0 issues on this pull request.

View more on Code Climate.

@deostroll

Hi,

I like your idea of introducing \\U to make PEG.js support super-BMP characters. I haven't yet gone through the source enough to understand how the parser is actually generated. However, I have to ask: have you considered ES6 support for super-BMP characters in string/regex literals? I come from a Node.js background, so I can't quite picture how that would work in browsers, but I'd like your thoughts on the matter.

@Seb35

Seb35 commented Jul 28, 2019

I’ve tested this patch. Two remarks: one about syntax and one about an issue.

  1. Why not use the ES6-like syntax \u{XXXXX} for these characters? And why extend beyond the existing Unicode code points (only code points up to 10FFFF are defined)? I think it would be more intuitive to use the ES6 syntax restricted to existing code points.
  2. A grammar like a = "\U0001F4A9" works correctly on the input string \uD83D\uDCA9, but a = [\\U0001F4A9] does not, because it is replaced by the UTF-16 surrogates, var peg$r0 = /^[\uD83D\uDCA9]/;, and only the first surrogate is captured. I have no (good) idea how to improve this for now.
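The behaviour described in the second remark can be reproduced in plain JavaScript, independent of PEG.js. This is only an illustrative sketch; the variable names are made up here. It also shows what the ES6 `u` flag changes, which is standard language behaviour rather than anything this patch emits.

```javascript
const input = "\uD83D\uDCA9"; // U+1F4A9 as a UTF-16 surrogate pair

// Without the u flag, the class contains two independent code units,
// so it matches only the first surrogate of the input.
const firstUnitOnly = /^[\uD83D\uDCA9]/.exec(input)[0];

// With the ES6 u flag, the escaped pair is read as one code point,
// so the class matches the whole character.
const wholePair = /^[\uD83D\uDCA9]/u.exec(input)[0];

console.log(firstUnitOnly.length, wholePair.length); // 1 2
```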

The only idea I have to fix the second point is decomposing astral (=super-BMP) code points into two surrogates. E.g. [A\\U0001F4A9B] would become the regex /^([A]|\uD83D\uDCA9|[B])/ and [A\\U0001F4A9-\\U0001F4AAB] would become the regex /^([A]|\uD83D[\uDCA9-\uDCAA]|[B])/. But that would imply some computations to create specific regexes and compute the ranges.
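The decomposition idea can be sketched roughly as follows. The function names (`toSurrogates`, `unitEscape`, `astralRangeSource`) are hypothetical and not part of PEG.js; the sketch assumes both ends of the range are astral code points (U+10000 to U+10FFFF) with `from <= to`, and it emits one alternative per high surrogate without merging adjacent ones, so wide ranges produce long output.

```javascript
// Split an astral code point into its UTF-16 high and low surrogates.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Format one UTF-16 code unit as a \uXXXX regex escape.
function unitEscape(unit) {
  return "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: regex source matching the astral range [from, to].
// Assumes 0x10000 <= from <= to <= 0x10FFFF.
function astralRangeSource(from, to) {
  const [highFrom, lowFrom] = toSurrogates(from);
  const [highTo, lowTo] = toSurrogates(to);
  const parts = [];
  for (let high = highFrom; high <= highTo; high++) {
    // Only the first and last high surrogates have a partial low range.
    const lowStart = high === highFrom ? lowFrom : 0xDC00;
    const lowEnd = high === highTo ? lowTo : 0xDFFF;
    const lows = lowStart === lowEnd
      ? unitEscape(lowStart)
      : "[" + unitEscape(lowStart) + "-" + unitEscape(lowEnd) + "]";
    parts.push(unitEscape(high) + lows);
  }
  return parts.join("|");
}

const source = astralRangeSource(0x1F4A9, 0x1F4AA);
console.log(source); // \uD83D[\uDCA9-\uDCAA]

// Combine with the BMP members, as in the second example above.
const combined = new RegExp("^(?:[A]|" + source + "|[B])");
console.log(combined.exec("\uD83D\uDCAA")[0].length); // 2
```

A production version would also need to handle ranges that start in the BMP and end in the astral planes, which this sketch deliberately leaves out.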

@Seb35

Seb35 commented Jul 28, 2019

The issue about regexes is quite complicated, see this blog post and the two associated libraries regenerate and regexpu. Using one of the libraries could solve the issue.

@StoneCypher

I will happily write the tests to ensure that this is safe, probably using fast-check, then merge this.

I need this very badly, and didn't know it had been waiting in the wings for almost a year.

@dmajda Can I get commit rights so that I can release 0.10.1? This would probably go in.

@dmajda
Contributor

dmajda commented Feb 3, 2020

> @dmajda Can I get commit rights so that I can release 0.10.1? This would probably go in.

I no longer have any rights to the project. Resolving this is up to @futagoza. Thanks for understanding.

@StoneCypher

Understood.

@Seb35

Seb35 commented Jun 4, 2020

I propose an alternative syntax using the ES6 syntax for astral code points like \u{1F4AF}, integrating this commit. See #651.
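For reference, this is how the ES6 code point escape behaves in JavaScript itself. The snippet shows only the language feature that the proposed grammar syntax is modelled on; it makes no claim about how #651 implements it.

```javascript
const hundred = "\u{1F4AF}"; // ES6 code point escape in a string literal

console.log(hundred.length);                       // 2 (UTF-16 code units)
console.log(hundred.codePointAt(0).toString(16));  // 1f4af
console.log(hundred === "\uD83D\uDCAF");           // true

// In a regex, \u{...} is only recognised when the u flag is set.
console.log(/^\u{1F4AF}$/u.test(hundred));         // true
```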

Development

Successfully merging this pull request may close these issues.

Full Unicode support, namely for codepoints outside the BMP
5 participants