Question about unicode categories #67

Open
nalply opened this issue Oct 15, 2023 · 3 comments

Comments

@nalply

nalply commented Oct 15, 2023

I need to match a token which contains unicode scalar values in the categories L, M, N, P, S and Cf.

I see three different ways to solve this:

  1. I wrote a small tool to generate the character set (about 15 kB of text) and include it
  2. I use a simpler character set and verify with a regex in the semantic action
  3. I hack lexgen

@osa1, what do you think?

@osa1
Owner

osa1 commented Oct 15, 2023

> I wrote a small tool to generate the character set (about 15 kB of text) and include it

Is this text full of unique characters, or do you extract the characters from the text?

15 KB of characters is huge; compile times will be terrible.

How many characters are there in each of these categories?

> I use a simpler character set and verify with a regex in the semantic action

This may be a bit error-prone, because you can't fall through from one semantic action to the next: once a semantic action is running, the only option is to go back to the beginning of a state. Example:

// A two character sequence.
_ _ => |lexer| {
    // If you don't like the matched characters here there's no way to run the next
    // semantic action below.
    ...
},

// There's no way to run this semantic action because of the rule above.
"ab" => |lexer| {
    ...
},

If you're OK with this limitation, then I think this should work.

> I hack lexgen

This works too, and we may even consider including these Unicode categories as built-ins, similar to the $$XID_Start and $$XID_Continue built-in character sets. See here for the implementation of built-in character sets. The character ranges are generated by this program.
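For illustration only, here is a rough sketch of what a range generator of this kind might look like. It is not the program linked above: `char_ranges` is a hypothetical helper, and the predicate stands in for a real general-category lookup, which the Rust standard library does not provide. The point is that a large character set collapses into far fewer contiguous ranges.

```rust
/// Collect maximal runs of consecutive code points for which `pred` holds.
/// Hypothetical helper for illustration; not part of lexgen.
fn char_ranges(pred: impl Fn(char) -> bool) -> Vec<(char, char)> {
    let mut ranges: Vec<(char, char)> = Vec::new();
    // Iterating a char range skips the surrogate gap automatically.
    for c in ('\0'..=char::MAX).filter(|&c| pred(c)) {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range only if `c` is the very next code point.
            if last.1 as u32 + 1 == c as u32 {
                last.1 = c;
                continue;
            }
        }
        // Otherwise start a new single-character range.
        ranges.push((c, c));
    }
    ranges
}
```

Calling `char_ranges(|c| c.is_ascii_alphabetic())` yields the two ranges `('A', 'Z')` and `('a', 'z')`. Swapping in a predicate backed by actual Unicode category data would give the ranges for L, M, N, P, S and Cf.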

I've never heard of these categories before and I don't know how useful they are to an average user though. I don't know if it's worth including them in the library.

@nalply
Author

nalply commented Oct 15, 2023

> Is this text full of unique characters, or do you extract the characters from the text?

15 kB is just the source code of the character set.

> I've never heard of these categories before and I don't know how useful they are to an average user though. I don't know if it's worth including them in the library.

I understand.

Perhaps you'll find the following explanation interesting.

L is the category of all letters and letter-like scalar values. M is the category of all marks, for example combining accents; some languages even have surrounding marks. N is the category of all numbers: there are the Arabic digits, but some scripts used in India have their own digits, and there are superscript and subscript digits. P is the category of all punctuation and S is the category of all symbols (like the dollar sign). Cf is the formatting subcategory of C (control and other codes). I include it because some emoji sequences contain the zero-width joiner, which is in category Cf. In other words, this is a token which can contain emoji, even complicated ones! The soft hyphen is also in Cf.

Another way to look at this: any Unicode scalar value is allowed except whitespace and separators (like space, carriage return, line feed, form feed) and control characters (like NUL, Ctrl-G aka the bell, and CSI (0x9b), which is used for ANSI colors as an equivalent of ESC [). Formatting control characters are allowed for the reason given above.
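This "everything except separators and controls" view can be approximated with the Rust standard library alone, which might be enough for the verify-in-the-semantic-action approach. The sketch below is an assumption-laden approximation, not an exact category test: `is_token_char` and `is_valid_token` are hypothetical names, and because the standard library has no general-category lookup, the check also accepts unassigned and private-use code points, which are not in L, M, N, P, S or Cf.

```rust
/// Approximation of "any scalar value except separators and control characters".
/// `char::is_whitespace` covers the separator categories plus a few control
/// characters such as tab and line feed; `char::is_control` covers category Cc.
/// Caveat: unassigned and private-use code points also pass this check.
fn is_token_char(c: char) -> bool {
    !c.is_whitespace() && !c.is_control()
}

/// Check a whole candidate token, for example the text matched by a
/// deliberately permissive lexer rule.
fn is_valid_token(s: &str) -> bool {
    !s.is_empty() && s.chars().all(is_token_char)
}
```

With this check, the zero-width joiner (U+200D) and the soft hyphen (U+00AD) are accepted because they are format characters, while space, line feed, NUL and CSI (U+009B) are rejected.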

That said, I am very grateful for your lexer, and I don't expect you to do anything.

@nalply
Author

nalply commented Oct 15, 2023

It looks like there's a fourth possibility: just use the built-ins, because some built-ins are the same as the mentioned Unicode categories! For example, $$alphabetic might correspond to the category L.

Yay!!
