Support for unicode characters #265

Closed
josdejong opened this Issue Jan 10, 2015 · 11 comments

@josdejong
Owner

josdejong commented Jan 10, 2015

Would be nice to have support for unicode characters in the expression parser, so you can define variables containing special characters.

@josdejong josdejong added the feature label Jan 10, 2015

@balagge

balagge commented Aug 7, 2015

:) I saw this comment just now. Man, this would be VERY useful for me, as I use rendered formulas (MathML) as input, which may contain many different characters. So now I have to create a 1-to-1 mapping between names containing (for example) Greek letters and Latin variable names. This leads to having two names for the same variable, which is always a headache.

However, this feature is only useful (for me at least) if it also supports the Unicode "Mathematical Alphanumeric Symbols" block. Unfortunately this block is not in the BMP but in the SMP part of Unicode, so it cannot be represented in UTF-16 without surrogate pairs. JavaScript is not UTF-32 enabled: you end up manipulating pairs of 16-bit characters instead of a single 32-bit character.

Maybe that is not a big issue in mathjs, as you don't have to inspect variable names too much. However, it is a problem in the parser: it has to recognize such a pair when it encounters one, advance the parser position by 2 instead of 1, and treat the pair as a single character.
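
The stepping logic described here can be sketched roughly as follows. This is only an illustration, not the mathjs parser: `readSymbolChar` and `splitChars` are hypothetical helper names, and the only built-ins used are `charCodeAt`, `charAt` and `slice`.

```javascript
// Hypothetical helper (not part of mathjs): read one full character at `index`,
// treating a high surrogate followed by a low surrogate as a single character.
function readSymbolChar(expr, index) {
  var first = expr.charCodeAt(index);
  var isHigh = first >= 0xD800 && first <= 0xDBFF;
  if (isHigh && index + 1 < expr.length) {
    var second = expr.charCodeAt(index + 1);
    var isLow = second >= 0xDC00 && second <= 0xDFFF;
    if (isLow) {
      // Surrogate pair: consume both 16-bit code units.
      return { char: expr.slice(index, index + 2), step: 2 };
    }
  }
  // BMP character (or an unpaired surrogate): consume one code unit.
  return { char: expr.charAt(index), step: 1 };
}

// Hypothetical driver: walk over an expression one full character at a time.
function splitChars(expr) {
  var chars = [];
  var i = 0;
  while (i < expr.length) {
    var next = readSymbolChar(expr, i);
    chars.push(next.char);
    i += next.step;
  }
  return chars;
}
```

With this, `splitChars('\uD835\uDC00+x')` yields three entries (the pair, `+`, and `x`) even though the string's `length` is 4.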

@josdejong

Owner

josdejong commented Aug 7, 2015

Thanks for your feedback. I haven't yet looked at what would be needed to support Unicode, but I know that there are tricky cases where a single Unicode character is represented by two characters.

@pedroteixeira

pedroteixeira commented Aug 7, 2015

👍 I too would really find it useful to allow Unicode in symbol names.

josdejong added a commit that referenced this issue Aug 10, 2015

Fixed #265: added support for unicode characters in the expression parser: greek letters and latin letters with accents
@josdejong

Owner

josdejong commented Aug 10, 2015

I've added support for Unicode characters. I've been a bit conservative here, allowing only Latin letters with accents and Greek letters for now. What do you think, would that be enough for practical usage?
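
For illustration, a conservative check of this kind could look something like the sketch below. The ranges are an assumption made for this example and may differ from those in the actual commit; `SYMBOL_START` and `isSymbolStart` are hypothetical names.

```javascript
// Assumed ranges, for illustration only (the real commit may use others):
//   \u00C0-\u02AF  Latin letters with accents, through IPA Extensions
//   \u0370-\u03FF  Greek and Coptic
var SYMBOL_START = /^[a-zA-Z_\u00C0-\u02AF\u0370-\u03FF]$/;

// Hypothetical helper: may the single character `c` start a symbol name?
function isSymbolStart(c) {
  return SYMBOL_START.test(c);
}
```

Note that ranges this broad also admit a few non-letters, for example the multiplication sign U+00D7 and the division sign U+00F7, so a stricter version would exclude those.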

@pedroteixeira

pedroteixeira commented Aug 10, 2015

That's great! My case was exactly latin and greek letters :)

@balagge

balagge commented Aug 13, 2015

Great!
But as I commented above, I would need the Mathematical Alphanumeric Symbols block as well, if possible. These are special characters designed specifically for use in math identifier names. The rationale is that in everyday math text the typesetting of a character is semantically important (e.g. a bold letter may denote a vector). Whatever the intended purpose, a bold variable name is considered a different variable than a normal one (which is, by the way, italic by default). Encoding this information in the actual name string saves a lot of work that would otherwise have to be done externally. It also ensures that no matter where you "take" that name, it still carries this additional information.

I don't think there is a big problem there. Surrogate pairs (which encode non-BMP characters like the ones I'd like to have) seem to be supported in variable names, property names, strings, etc. in JavaScript, so you won't even notice the difference.

The only place where care is needed is when you manipulate a string by character position or take the length of a string. There, these pairs show up as "two characters". In every other way the two characters behave like ordinary characters. They just must not be separated, because the result of separating them is unpredictable.
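
A small sketch of both points, using U+1D400 (MATHEMATICAL BOLD CAPITAL A) as the example character. The variable names are made up for this illustration.

```javascript
var bold = '\uD835\uDC00';           // U+1D400 MATHEMATICAL BOLD CAPITAL A

// Length and position work on 16-bit code units, so the pair shows up as two.
var units = bold.length;             // 2
var firstUnit = bold.charCodeAt(0);  // 0xD835, the high surrogate on its own
var secondUnit = bold.charCodeAt(1); // 0xDC00, the low surrogate on its own

// As a property name the pair behaves like any other string.
var scope = {};
scope[bold] = 42;
var value = scope['\uD835\uDC00'];   // 42
```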

The surrogate ranges are D800-DBFF (high surrogate block) and DC00-DFFF (low surrogate block).
Enabling these blocks completely would mean that any Unicode character in the supplementary planes (U+10000 and above, a.k.a. 'astral') can be encoded and is allowed. So maybe you want to limit that to the actual mathematical characters, U+1D400 ... U+1D7FF. As surrogate pairs, these run from [xD835, xDC00] to [xD835, xDFFF]. So this would mean that a single high surrogate (xD835) and the complete low surrogate block should be enabled.
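
The pair values quoted here follow from the standard UTF-16 encoding rule, which can be sketched as below. `toSurrogatePair` and `isMathAlphanumericPair` are hypothetical names used only for this example, and the check covers the whole U+1D400 ... U+1D7FF range without excluding any unassigned code points inside it.

```javascript
// Convert a supplementary-plane code point (U+10000..U+10FFFF) to its UTF-16 pair.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Hypothetical check: does (high, low) encode a code point in U+1D400..U+1D7FF?
function isMathAlphanumericPair(high, low) {
  return high === 0xD835 && low >= 0xDC00 && low <= 0xDFFF;
}
```

`toSurrogatePair(0x1D400)` gives `[0xD835, 0xDC00]` and `toSurrogatePair(0x1D7FF)` gives `[0xD835, 0xDFFF]`, which is why one high surrogate plus the full low surrogate block covers the range.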

This would help me a lot :)

@josdejong

Owner

josdejong commented Aug 14, 2015

@balagge sure, we will add these blocks of unicode too. Thanks for looking them up.

@balagge

balagge commented Aug 17, 2015

... I forgot to mention that some of the math characters are NOT in the range given above (because they existed previously and were left in their original position instead of being duplicated in the new math range). In addition, 4 code points are not used (reserved). These are:

| Valid code point | Character name | Invalid code point | Comment |
|---|---|---|---|
| U+210E | planck constant | U+1D455 | despite the name ("planck constant"), it is also used as "mathematical italic small h" |
| U+212C | script capital B | U+1D49D | "script ..." = "mathematical script ..." |
| U+2130 | script capital E | U+1D4A0 | |
| U+2131 | script capital F | U+1D4A1 | |
| U+210B | script capital H | U+1D4A3 | |
| U+2110 | script capital I | U+1D4A4 | |
| U+2112 | script capital L | U+1D4A7 | |
| U+2133 | script capital M | U+1D4A8 | |
| U+211B | script capital R | U+1D4AD | |
| U+212F | script small e | U+1D4BA | |
| U+210A | script small g | U+1D4BC | |
| U+2134 | script small o | U+1D4C4 | |
| U+212D | black-letter capital C | U+1D506 | "black-letter ..." = "mathematical fraktur ..." |
| U+210C | black-letter capital H | U+1D50B | |
| U+2111 | black-letter capital I | U+1D50C | |
| U+211C | black-letter capital R | U+1D515 | |
| U+2128 | black-letter capital Z | U+1D51D | |
| U+2102 | double-struck capital C | U+1D53A | "double-struck ..." = "mathematical double-struck ..." |
| U+210D | double-struck capital H | U+1D53F | |
| U+2115 | double-struck capital N | U+1D545 | |
| U+2119 | double-struck capital P | U+1D547 | |
| U+211A | double-struck capital Q | U+1D548 | |
| U+211D | double-struck capital R | U+1D549 | |
| U+2124 | double-struck capital Z | U+1D551 | |
| | | U+1D6A6 | reserved |
| | | U+1D6A7 | reserved |
| | | U+1D7CC | reserved |
| | | U+1D7CD | reserved |

Note: the "Valid code point" characters should be allowed; these are older BMP characters. The "Invalid code point" column lists where the new range contains a "hole" of unused / non-existing characters. These should NOT be accepted.
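
One way to turn the table above into a check is sketched below. It works on code points rather than on UTF-16 code units, and the names `INVALID`, `VALID_BMP` and `isMathLetter` are hypothetical; this is not how mathjs implements it.

```javascript
// Holes in U+1D400..U+1D7FF listed in the table: these must be rejected.
var INVALID = [
  0x1D455, 0x1D49D, 0x1D4A0, 0x1D4A1, 0x1D4A3, 0x1D4A4, 0x1D4A7, 0x1D4A8,
  0x1D4AD, 0x1D4BA, 0x1D4BC, 0x1D4C4, 0x1D506, 0x1D50B, 0x1D50C, 0x1D515,
  0x1D51D, 0x1D53A, 0x1D53F, 0x1D545, 0x1D547, 0x1D548, 0x1D549, 0x1D551,
  0x1D6A6, 0x1D6A7, 0x1D7CC, 0x1D7CD
];

// Older BMP characters that stand in for those holes: these must be accepted.
var VALID_BMP = [
  0x210E, 0x212C, 0x2130, 0x2131, 0x210B, 0x2110, 0x2112, 0x2133,
  0x211B, 0x212F, 0x210A, 0x2134, 0x212D, 0x210C, 0x2111, 0x211C,
  0x2128, 0x2102, 0x210D, 0x2115, 0x2119, 0x211A, 0x211D, 0x2124
];

// Hypothetical check for a single code point given as a number.
function isMathLetter(codePoint) {
  if (VALID_BMP.indexOf(codePoint) !== -1) return true;
  var inBlock = codePoint >= 0x1D400 && codePoint <= 0x1D7FF;
  return inBlock && INVALID.indexOf(codePoint) === -1;
}
```

For example, `isMathLetter(0x210E)` is accepted as the italic small h, while its hole `isMathLetter(0x1D455)` is rejected.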

@josdejong

Owner

josdejong commented Aug 19, 2015

Thanks @balagge. I don't expect to be able to implement this within the next week. Feel free to create a pull request adding all the additional Unicode characters (in commit 33370bf you can see where to add the characters and how to unit test them).

@josdejong

Owner

josdejong commented Sep 25, 2015

@balagge I've added support for mathematical symbols. It's in the develop branch, and you can have a look at the implementation:

https://github.com/josdejong/mathjs/blob/develop/lib/expression/parse.js#L368-L403

@josdejong

Owner

josdejong commented Oct 9, 2015

The mathematical symbols are now supported in the just-released v2.4.0. It would be great if you could give this a try, @balagge.
