Proposal - venturing outside the Basic Multilingual Plane #100

Open
GregRos opened this issue Feb 5, 2024 · 0 comments
Labels
discussion Talk talk talk enhancement v2 Features to be added in version 2

GregRos commented Feb 5, 2024

This is a solution to #98 to increase support for parsing more Unicode characters correctly.

Background

JavaScript strings are encoded in UTF-16. The encoding says that:

  • A Unicode codepoint (character) that is part of the Basic Multilingual Plane (BMP) is encoded as a single two-byte (16-bit) code unit.
  • A codepoint outside the BMP is encoded as two of these code units, called a surrogate pair.

The BMP includes almost all the standard symbols of scripts in active use today. However, there are some things it doesn’t include:
  1. Variant CJK characters
  2. Unusual CJK characters
  3. Most emoji, such as 🥔🍠
  4. Styled mathematical characters
  5. Characters from some historical scripts, such as Egyptian Hieroglyphs, Linear B, and Ugaritic.
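To make the encoding rule concrete, here is a minimal sketch of how a codepoint maps to UTF-16 code units. The helper name `toUtf16CodeUnits` is made up for this illustration and is not part of the library; it also skips validation of the input.

```javascript
// Hypothetical helper: encode one Unicode codepoint as UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
    if (codePoint <= 0xffff) {
        // Inside the BMP: a single 16-bit code unit.
        return [codePoint];
    }
    // Outside the BMP: subtract 0x10000 and split the remaining 20 bits
    // into a high (leading) and a low (trailing) surrogate.
    const offset = codePoint - 0x10000;
    const high = 0xd800 + (offset >> 10);
    const low = 0xdc00 + (offset & 0x3ff);
    return [high, low];
}

const potato = "\u{1F954}"; // 🥔
const units = toUtf16CodeUnits(potato.codePointAt(0));
// What the JS engine actually stores for the same string:
const actual = [potato.charCodeAt(0), potato.charCodeAt(1)];

console.log(units.map((u) => u.toString(16)));  // [ 'd83e', 'dd54' ]
console.log(actual.map((u) => u.toString(16))); // [ 'd83e', 'dd54' ]
```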

JavaScript limitations

For historical reasons, JavaScript treats every two-byte code unit as a character: length, indexing, and most string methods count code units, not codepoints. This goes against how UTF-16 defines a character, but makes writing code easier.

It does mean that a surrogate pair is treated as two characters. To see this, evaluate "🥔".length in your browser console. It will output 2.
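A short sketch of the same observation, using only built-in string operations. The variable names are just for this example.

```javascript
const potato = "🥔"; // U+1F954, outside the BMP

const codeUnitCount = potato.length;       // counts UTF-16 code units
const codePointCount = [...potato].length; // string iteration goes by codepoint

// charCodeAt sees only the first half of the surrogate pair,
// while codePointAt decodes the whole pair.
const firstUnit = potato.charCodeAt(0).toString(16);
const wholeCodePoint = potato.codePointAt(0).toString(16);

console.log(codeUnitCount, codePointCount, firstUnit, wholeCodePoint);
// prints: 2 1 d83e 1f954
```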

It gets more complicated than that

Text is hard. There are several kinds of sequences that look like one symbol but are actually several separate codepoints. In these cases even the Unicode standard says they are multiple codepoints, but typesetting software displays them as a single unit.

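  1. Combining diacritics, such as an o followed by one or more combining marks.
  2. Flag characters, such as 🇨🇭.
  3. Skin tone-modified emoji, such as 👍🏾.
  4. Emoji joined with zero-width joiners, such as the family emoji 👨‍👩‍👧.

| Type | Example | length (in two-byte code units) |
|---|---|---|
| Combining diacritics | o followed by combining marks | 1 + number of diacritics |
| Flag characters | 🇨🇭 | $2 + 2 = 4$ |
| Skin tone-modified emoji | 👍🏾 | $2 + 2 = 4$ |
| Family emoji | 👨‍👩‍👧 | $2 + 1 + 2 + 1 + 2 = 8$ |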
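The lengths in the table can be checked with a small sketch. It compares three ways of counting: code units, codepoints, and grapheme clusters as reported by the built-in `Intl.Segmenter`. The `samples` object and the `measure` helper are made up for this example, and the two diacritics chosen for the first sample are arbitrary.

```javascript
// Escapes are used so the exact codepoints are visible.
const samples = {
    diacritics: "o\u0308\u0323",                       // o + 2 combining marks
    flag: "\u{1F1E8}\u{1F1ED}",                        // 🇨🇭 = regional indicators C + H
    skinTone: "\u{1F44D}\u{1F3FE}",                    // 👍🏾 = thumbs up + skin tone modifier
    family: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}", // 👨‍👩‍👧 = man + ZWJ + woman + ZWJ + girl
};

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: count the same text in three different ways.
function measure(text) {
    return {
        codeUnits: text.length,
        codePoints: [...text].length,
        graphemes: [...segmenter.segment(text)].length,
    };
}

for (const [name, text] of Object.entries(samples)) {
    console.log(name, measure(text));
}
```

Only the grapheme count matches what a reader would call "one character" in all four cases, which is why the choice of definition matters.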

Further reading

Solution

The ability to parse Unicode is important to the library and its users, so we should support parsing as much Unicode as possible. This means we have to settle on a definition of “character”, since every piece of software seems to define it differently.

This definition will have to be used at every point in the code where reading a specific number of characters is involved. This affects a large number of building blocks, such as anyCharOf, anyChar, exactly, and so on.

We should avoid duplicating the code that handles this. Instead, we should encapsulate the definition of “character” in an object, and parsers will read from the input through this character object rather than working directly on the input string.
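One possible shape for such an object, as a rough sketch only. Everything here is hypothetical: the `nextLength` method, the two definitions, and `anyCharOfSketch`, which stands in for a building block like anyCharOf and does not reflect the library's real signature.

```javascript
// A "character definition" answers one question: starting at `position`
// (an index in code units), how many code units does the next character span?
// It returns 0 at the end of the input.
const codeUnitChars = {
    nextLength(input, position) {
        return position < input.length ? 1 : 0;
    },
};

const codePointChars = {
    nextLength(input, position) {
        if (position >= input.length) return 0;
        // A codepoint above 0xFFFF is stored as a surrogate pair.
        return input.codePointAt(position) > 0xffff ? 2 : 1;
    },
};

// Hypothetical stand-in for a building block: it never slices the input
// itself, it asks the character definition how far to read.
function anyCharOfSketch(options, chars) {
    return (input, position) => {
        const length = chars.nextLength(input, position);
        if (length === 0) return { ok: false, position };
        const char = input.slice(position, position + length);
        return options.includes(char)
            ? { ok: true, value: char, position: position + length }
            : { ok: false, position };
    };
}

const tubers = ["🥔", "🍠"];
console.log(anyCharOfSketch(tubers, codePointChars)("🍠!", 0).ok); // true
console.log(anyCharOfSketch(tubers, codeUnitChars)("🍠!", 0).ok);  // false
```

With the code-unit definition the parser only sees half of the surrogate pair, so the match fails. Swapping the definition fixes it without touching the building block.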

Everything is a combinator

This character object is effectively a building block parser. This means every parser becomes a combinator given a root parser that defines “character.” However, unlike most combinators, this would be implemented on the state object itself and support two operations:

  1. Read one character
  2. Read N characters
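The two operations could look roughly like this on a state object. This is a sketch under assumptions: the class name `ParsingStateSketch` and the methods `readChar` and `readChars` are invented here and are not the library's actual state API.

```javascript
class ParsingStateSketch {
    // `nextLength(input, position)` is the pluggable character definition:
    // it returns the length in code units of the next character, or 0 at the end.
    constructor(input, nextLength) {
        this.input = input;
        this.position = 0;
        this.nextLength = nextLength;
    }

    // Operation 1: read one character, or undefined at the end of the input.
    readChar() {
        const length = this.nextLength(this.input, this.position);
        if (length === 0) return undefined;
        const char = this.input.slice(this.position, this.position + length);
        this.position += length;
        return char;
    }

    // Operation 2: read exactly n characters. If fewer remain,
    // return undefined and consume nothing.
    readChars(n) {
        const start = this.position;
        for (let i = 0; i < n; i++) {
            if (this.readChar() === undefined) {
                this.position = start;
                return undefined;
            }
        }
        return this.input.slice(start, this.position);
    }
}

// A codepoint-aware character definition.
const codePointLength = (input, position) => {
    if (position >= input.length) return 0;
    return input.codePointAt(position) > 0xffff ? 2 : 1;
};

const state = new ParsingStateSketch("a🥔b", codePointLength);
const firstTwo = state.readChars(2); // "a🥔"
const third = state.readChar();      // "b"
const atEnd = state.readChar();      // undefined
console.log(firstTwo, third, atEnd);
```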

There would be a default character parser. At first it will be the standard JS definition, where every character is one two-byte code unit, for backwards compatibility. In the future it will be replaced with the full Unicode-aware parser.

However, alternative parsers might define characters in broader terms, such as treating a letter together with its combining diacritics, or an emoji made of several codepoints, as a single character.

Comments?

  • Can you see any caveats or issues with this proposal?
  • Did I miss something?
  • Suggestions about the interface or design?
  • Do you feel like it would be a good feature?
  • Use-cases?
@GregRos GregRos added the v2 Features to be added in version 2 label Feb 5, 2024
@GregRos GregRos added enhancement discussion Talk talk talk labels Feb 5, 2024
@GregRos GregRos changed the title Proposal - venturing outside the BMP Proposal - venturing outside the Basic Multilingual Plane Feb 5, 2024