This is a solution to #98 to increase support for parsing more Unicode characters correctly.
Background
In JavaScript, strings use the UTF-16 encoding. That encoding says:
- A Unicode code point (character) is encoded as a single two-byte code unit if it's part of the Basic Multilingual Plane (BMP).
- Code points outside the BMP are encoded as a pair of two-byte code units, called a surrogate pair.
The BMP includes all standard symbols of all scripts in active use today. However, there are some things it doesn’t include:
Characters from some historical scripts, such as Egyptian Hieroglyphs, Linear B, and Ugaritic.
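To make the surrogate-pair mechanics concrete, here is a small sketch using an Egyptian Hieroglyph (U+13000), which lies outside the BMP and therefore takes two UTF-16 code units:

```typescript
// U+13000 (EGYPTIAN HIEROGLYPH A001) is outside the BMP, so UTF-16
// encodes it as a surrogate pair: a high and a low surrogate code unit.
const hieroglyph = "\u{13000}";

console.log(hieroglyph.length);                       // 2 — two UTF-16 code units
console.log(hieroglyph.charCodeAt(0).toString(16));   // "d80c" — high surrogate
console.log(hieroglyph.charCodeAt(1).toString(16));   // "dc00" — low surrogate
console.log(hieroglyph.codePointAt(0)!.toString(16)); // "13000" — the actual code point
```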
JavaScript Limitations
Due to technical limitations, JavaScript treats every character as exactly two bytes (one UTF-16 code unit) long. This goes against the UTF-16 specification, but it makes writing code easier.
It does mean, however, that a surrogate pair is treated as two separate characters. To see this, write `"🥔".length` into your browser console: it will output `2`.
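The same experiment also shows that JavaScript can be code-point aware when asked, since string iteration walks code points rather than code units:

```typescript
const potato = "🥔"; // U+1F954, outside the BMP

console.log(potato.length);        // 2 — .length counts UTF-16 code units
console.log([...potato].length);   // 1 — string iteration is code-point aware
console.log(potato.codePointAt(0)!.toString(16)); // "1f954"
```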
It gets more complicated than that
Text is hard nowadays. There are several kinds of combining characters that can look like one symbol but are actually several separate code points. In these cases even the Unicode standard says they are multiple characters, yet typesetting software displays them as something like a single unit.
Solution
The ability to parse Unicode correctly is important to the library and its users. Therefore we should support parsing as much Unicode as possible. This does mean we have to come up with a definition of "character", since it's something every piece of software seems to define separately.
This definition will have to be used at every point in the code where reading a specific number of characters is involved. This affects a large number of building blocks, such as `anyCharOf`, `anyChar`, `exactly`, and so on.
We should avoid duplicating the code that handles this, so we should encapsulate the definition of "character" in an object; instead of working directly on the input string, parsers will go through this character object.
Everything is a combinator
This character object is effectively a building block parser. This means every parser becomes a combinator given a root parser that defines “character.” However, unlike most combinators, this would be implemented on the state object itself and support two operations:
- Read one character
- Read N characters
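A minimal sketch of what this could look like. The names `CharacterParser`, `ParserState`, `readChar`, and `readChars` are hypothetical, chosen for illustration, and not the library's actual API:

```typescript
// Hypothetical interface: the pluggable definition of "character".
interface CharacterParser {
  // Reads one "character" starting at `position`; returns the substring
  // consumed, or null if the input is exhausted.
  readOne(input: string, position: number): string | null;
}

// Hypothetical state object supporting the two operations above.
class ParserState {
  position = 0;

  constructor(
    private readonly input: string,
    private readonly charParser: CharacterParser,
  ) {}

  // Read one character, as defined by the character parser.
  readChar(): string | null {
    const ch = this.charParser.readOne(this.input, this.position);
    if (ch !== null) this.position += ch.length;
    return ch;
  }

  // Read N characters by repeatedly reading one; null on end of input.
  readChars(n: number): string | null {
    let result = "";
    for (let i = 0; i < n; i++) {
      const ch = this.readChar();
      if (ch === null) return null;
      result += ch;
    }
    return result;
  }
}
```

The point of the design is that every building block calls `readChar`/`readChars` instead of indexing the string, so swapping the character definition never touches the combinators themselves.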
There would be a default character parser. At first it will be the standard 2-byte JS character parser, for backwards compatibility, but in the future it will be replaced with the full Unicode-aware parser.
However, alternative parsers might define characters in broader terms, such as treating combining diacritics or emoji chars as single characters.
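As a sketch, here is how the backwards-compatible default and a code-point-aware alternative might differ (the `ReadOne` type and parser names are illustrative assumptions):

```typescript
// Hypothetical shape of a character-reading function.
type ReadOne = (input: string, pos: number) => string | null;

// Backwards-compatible default: one UTF-16 code unit per "character".
const codeUnitParser: ReadOne = (input, pos) =>
  pos < input.length ? input[pos] : null;

// Unicode-aware alternative: one code point per "character",
// so surrogate pairs are kept together.
const codePointParser: ReadOne = (input, pos) => {
  if (pos >= input.length) return null;
  return String.fromCodePoint(input.codePointAt(pos)!);
};

console.log(codeUnitParser("🥔", 0)!.length);  // 1 — half a surrogate pair
console.log(codePointParser("🥔", 0)!.length); // 2 — the whole emoji
```

A grapheme-cluster parser (e.g. built on `Intl.Segmenter`) would slot into the same shape, which is what makes diacritics and emoji sequences expressible as "one character".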
Comments?
- Can you see any caveats or issues with this proposal?
- Did I miss something?
- Suggestions about the interface or design?
- Do you feel like it would be a good feature?
- Use-cases?