Multiple-char operators in the Operator Dictionary #143

Closed
fred-wang opened this issue Sep 21, 2019 · 18 comments
Labels: MathML Core (issues affecting the MathML Core specification), MathML 4 (issues affecting the MathML 4 specification)

Comments

@fred-wang

cc @rwlbuis

I'm not aware when this was decided, but the operator dictionary contains the following entries with multiple characters:

U00021-00021 !!  {'lspace': 1, 'rspace': 0}
U00021-0003D !=  {'lspace': 4, 'rspace': 4}
U00026-00026 &&  {'lspace': 4, 'rspace': 4}
U0002A-0002A **  {'lspace': 1, 'rspace': 1}
U0002A-0003D *=  {'lspace': 4, 'rspace': 4}
U0002B-0002B ++  {'lspace': 0, 'rspace': 0}
U0002B-0003D +=  {'lspace': 4, 'rspace': 4}
U0002D-0002D --  {'lspace': 0, 'rspace': 0}
U0002D-0003D -=  {'lspace': 4, 'rspace': 4}
U0002D-0003E ->  {'lspace': 5, 'rspace': 5}
U0002E-0002E ..  {'lspace': 0, 'rspace': 0}
U0002E-0002E-0002E ...  {'lspace': 0, 'rspace': 0}
U0002F-0002F //  {'lspace': 1, 'rspace': 1}
U0002F-0003D /=  {'lspace': 4, 'rspace': 4}
U0003A-0003D :=  {'lspace': 4, 'rspace': 4}
U0003C-0003D <=  {'lspace': 5, 'rspace': 5}
U0003C-0003E <>  {'lspace': 1, 'rspace': 1}
U0003D-0003D ==  {'lspace': 4, 'rspace': 4}
U0003E-0003D >=  {'lspace': 5, 'rspace': 5}
U0007C-0007C ||  {'lspace': 2, 'symmetric': True, 'stretchy': True, 'rspace': 2}
U0007C-0007C ||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0007C-0007C ||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0007C-0007C-0007C |||  {'lspace': 2, 'symmetric': True, 'stretchy': True, 'rspace': 2}
U0007C-0007C-0007C |||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0007C-0007C-0007C |||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0223D-00331 ∽̱  {'lspace': 3, 'rspace': 3}
U02242-00338 ≂̸  {'lspace': 5, 'rspace': 5}
U0224E-00338 ≎̸  {'lspace': 5, 'rspace': 5}
U0224F-00338 ≏̸  {'lspace': 5, 'rspace': 5}
U02266-00338 ≦̸  {'lspace': 5, 'rspace': 5}
U0226A-00338 ≪̸  {'lspace': 5, 'rspace': 5}
U0226B-00338 ≫̸  {'lspace': 5, 'rspace': 5}
U0227F-00338 ≿̸  {'lspace': 5, 'rspace': 5}
U02282-020D2 ⊂⃒  {'lspace': 5, 'rspace': 5}
U02283-020D2 ⊃⃒  {'lspace': 5, 'rspace': 5}
U0228F-00338 ⊏̸  {'lspace': 5, 'rspace': 5}
U02290-00338 ⊐̸  {'lspace': 5, 'rspace': 5}
U029CF-00338 ⧏̸  {'lspace': 5, 'rspace': 5}
U029D0-00338 ⧐̸  {'lspace': 5, 'rspace': 5}
U02A7D-00338 ⩽̸  {'lspace': 5, 'rspace': 5}
U02A7E-00338 ⩾̸  {'lspace': 5, 'rspace': 5}
U02AA1-00338 ⪡̸  {'lspace': 5, 'rspace': 5}
U02AA2-00338 ⪢̸  {'lspace': 5, 'rspace': 5}
U02AAF-00338 ⪯̸  {'lspace': 5, 'rspace': 5}
U02AB0-00338 ⪰̸  {'lspace': 5, 'rspace': 5}
U02ADD-00338 ⫝̸  {'lspace': 5, 'rspace': 5}

Currently this is not supported in browsers ( https://bugs.webkit.org/show_bug.cgi?id=124828 ).
We have some tests to check that operators render the same with and without the operator properties from the dictionary specified explicitly.

Probably they will need to be handled in a separate table, which will make the code a bit more complex/larger. Multiple vertical bars are even stretchy but OpenType only provides per-glyph stretching, which means we will have to add more spec description/test/implementation if we really want to support multi-char stretching.

So I wonder how important all of these are. Many of them seem to be equivalent to a single Unicode character (or are waiting for such a character to be introduced in Unicode). Can't people just use explicit lspace/rspace or the equivalent Unicode code point? At least there are already code points for double/triple vertical bars, which are stretchy and supported by OpenType fonts and browsers.
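
For illustration, here is a minimal Python sketch of the "separate table" idea: it decodes keys in the `Uxxxxx-xxxxx` form used in the listing above and splits them into single-character and multi-character operators. The function names are hypothetical, and the single code point key (`U0002B`) is an assumed format, since the listing only shows multi-character keys.

```python
def parse_dictionary_key(key: str) -> str:
    """Decode a key such as 'U0002B-0003D' into the operator string '+='."""
    if not key.startswith("U"):
        raise ValueError(f"unexpected key format: {key!r}")
    # Each dash-separated field is one code point written in hexadecimal.
    return "".join(chr(int(part, 16)) for part in key[1:].split("-"))


def split_tables(keys):
    """Split decoded operators into (single-char, multi-char) lists."""
    single, multi = [], []
    for key in keys:
        text = parse_dictionary_key(key)
        (single if len(text) == 1 else multi).append(text)
    return single, multi


# 'U0002B' is an assumed single-character key; the others come from the listing.
SAMPLE_KEYS = ["U0002B", "U0002B-0003D", "U0007C-0007C-0007C", "U0226B-00338"]
single_char, multi_char = split_tables(SAMPLE_KEYS)
print(single_char)
print(multi_char)
```

A lookup would then consult the single-character table first and fall back to the (much smaller) multi-character table only when the operator text has more than one code point.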

@fred-wang added the labels MathML 4, MathML Core, need polyfill, need resolution, need specification update, and need tests on Sep 21, 2019
@fred-wang
Author

My proposal would be

  • remove them from MathML Core (this has never been implemented in browsers so that does not break anything + it was not normative)
  • make MathML Full refer to the MathML Core dictionary (to avoid duplicating content) and extend it with multiple characters for backward compatibility.

@fred-wang
Author

I stand corrected, multi-char seems to be implemented in Gecko:

https://searchfox.org/mozilla-central/source/layout/mathml/mathfont.properties#72

Spacing seems to work but not stretching. Currently, Gecko uses a hash table of strings (see https://bugzilla.mozilla.org/show_bug.cgi?id=1336437) while WebKit uses a sorted table of Unicode code points. cc @emilio
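
To make the two lookup strategies concrete, here is a rough Python sketch, not the actual Gecko or WebKit code. It stores a few (lspace, rspace) pairs taken from the listing above in a hash table keyed by string, and in a sorted table of code point tuples searched with binary search. All names are hypothetical.

```python
from bisect import bisect_left

# (lspace, rspace) values copied from the entries listed in the first comment.
ENTRIES = {"!=": (4, 4), "++": (0, 0), "->": (5, 5), "\u226b\u0338": (5, 5)}


def lookup_hashed(op):
    """Variant A: hash table keyed by the operator string."""
    return ENTRIES.get(op)


# Variant B: parallel arrays sorted by code point sequence.
SORTED_KEYS = sorted(tuple(map(ord, s)) for s in ENTRIES)
SORTED_VALUES = [ENTRIES["".join(map(chr, k))] for k in SORTED_KEYS]


def lookup_sorted(op):
    """Binary search in the sorted table of code point tuples."""
    key = tuple(map(ord, op))
    i = bisect_left(SORTED_KEYS, key)
    if i < len(SORTED_KEYS) and SORTED_KEYS[i] == key:
        return SORTED_VALUES[i]
    return None


print(lookup_hashed("->"), lookup_sorted("->"))
```

The sorted table extends naturally from single code points to tuples, at the cost of variable-length keys, which is roughly the extra complexity the thread is worried about.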

@davidcarlisle
Collaborator

davidcarlisle commented Sep 21, 2019

There are really two flavours of these. Some, with duplicated ASCII such as ++ and +=, might be thought of as legacy approximations to characters like ⧺, but there are not really enough symbols, and in any case people often prefer the look of the repeated operators when laying out += for C-style assignments.

Other duplicated operators with a combining character such as the combining negation slash or the variant selector are harder to get rid of as Unicode as a rule would be reluctant to add new pre-composed characters that are equivalent to a combination with a negation slash. The ones that do have pre-composed negations are some arbitrary list based on legacy font encodings (mostly).

Stretching of multiple-character operators is likely to be difficult (pretty much impossible in TeX as well), so you could probably say explicitly that it isn't supported in core (and we could make all multiple-character entries have stretchy set to false?).

If supporting multiple-character entries for spacing is likely to be problematic in core, then it would be easy enough (I think) to extract a table for the core spec without them and extract something for the full spec that adds them back. But if spacing is OK and only the stretchy property is difficult, then as I say I think we could just make them all stretchy=false even in full.

@davidcarlisle
Collaborator

davidcarlisle commented Sep 21, 2019

note that if you use the entities the multiple character nature is hidden.

if you look at greater than, not greater than, much greater than, not much greater than then

&gt;, &ngt;, &GreaterGreater;, &NotGreaterGreater;

>, ≯, ⪢, ≫̸

look like four similar inputs, but the fact that one negation is pre-composed and one made up of a base and combining character is the sort of low level Unicode details that in an ideal world authors would not need to know about.
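
This can be checked quickly in Python, assuming the standard library's HTML5 entity table matches the entity definitions being discussed. The helper name `code_points` is made up for this sketch.

```python
import html

ENTITIES = ["&gt;", "&ngt;", "&GreaterGreater;", "&NotGreaterGreater;"]


def code_points(entity: str):
    """Expand a named character reference and list its code points."""
    return [f"U+{ord(c):04X}" for c in html.unescape(entity)]


for entity in ENTITIES:
    print(entity, code_points(entity))
```

Only the last entity expands to two code points: a base character followed by U+0338.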

@rwlbuis

rwlbuis commented Sep 23, 2019

This seems reasonable and people will probably expect the multi char to be supported given the past. I'll soon try to implement this since at least for Full it will be in the specification, so WebKit would need to support it anyway.

@NSoiffer
Contributor

My 2 cents:

  1. A number of these, such as ++ and += were added for compatibility with programming languages. People sometimes write pseudo code that has subscripts and other math notations but also use programming language symbols. I don't have a clue how often that is done with MathML though.

  2. I believe some multi-char operators were added as approximations prior to Unicode adding the symbol. I could be wrong about that. If anyone cared, you can look at when symbols were added. This class of characters could easily be dropped. || and ||| as prefix/postfix operators fall into that category. As infix operators, at least || falls into the programming language category.

  3. I agree that any remaining multiple-character symbols should not be stretchy by default. Note that the MathML spec says "In practice, typical renderers will only be able to stretch a small set of characters, and quite possibly will only be able to generate a discrete set of character sizes." Hence, stretchiness is dependent on the renderer and the font. For core, we want all renderers to do the same thing, but what they do depends on the font. I'm not sure how we specify that...

  4. In most(?) cases, Unicode will say that a combining character involving a slash (to create a "not" form) is equivalent to a pre-combined character. That has to be supported in core as this comes from Unicode, not MathML.
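
As a small illustration of point 4, Python's `unicodedata` module can show where Unicode defines a canonical equivalence and where it does not. This is only a sketch, and the two sample strings are my own choice: ">" followed by U+0338 composes to the pre-composed U+226F under NFC, while U+226B followed by U+0338 (one of the dictionary entries) has no pre-composed form and stays as two code points.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of the text."""
    return unicodedata.normalize("NFC", text)


# '>' + COMBINING LONG SOLIDUS OVERLAY has a pre-composed equivalent (U+226F).
composed = nfc(">\u0338")
# U+226B + U+0338 has none, so NFC leaves the sequence unchanged.
uncomposed = nfc("\u226b\u0338")

print(len(composed), len(uncomposed))
```

So normalizing operator text before lookup would fold some negated forms into single characters, but entries like the second one would still need a multi-character entry.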

@rwlbuis

rwlbuis commented Sep 26, 2019

We now support enough of multi-char in chromium that the test passes:
https://w3c-test.org/mathml/presentation-markup/operators/operator-dictionary-001.html

@khaledhosny

Unicode as a rule would be reluctant to add new pre-composed characters that are equivalent to a combination with a negation slash

AFAIK while Unicode has a policy against encoding new pre-composed characters, combining marks that over strike their bases are exempted from this (but they will not be made canonically equivalent to the decomposed form).

@fred-wang
Author

I didn't comment here, but two weeks ago we agreed to keep multi-char support. Rob already fixed our Chromium branch, and https://bugs.webkit.org/show_bug.cgi?id=124828 tracks this in WebKit.

@fred-wang
Author

Consensus from previous meetings:

  • keep multi-char operators
  • remove "stretchy" property for them (since stretching is not supported in that case anyway)

@fred-wang
Author

@NSoiffer @davidcarlisle I still see a lot of multiple-char entries with symmetric/stretchy (and fence). Can we remove these properties?

@davidcarlisle
Collaborator

Yes, I agree we shouldn't imply these stretch. Neil, do you have pending changes, or should I do that?

@NSoiffer
Contributor

NSoiffer commented Apr 11, 2020 via email

@fred-wang
Author

Yes, I have some changes pending. I'll remove any stretchy properties from them. Since symmetric only applies to stretchy chars, I'll make sure those go too.

Thanks.

Removing "fence" though doesn't make sense unless you are saying you want to remove that property from MathML ("separator" would then go also). That would be something to raise in its own issue and something to discuss on a call.

Yes, that's why I put that one in parentheses. fence/separator don't have any use for layout, so implementers can just ignore them for now anyway, which is probably what we will do in Chromium. The question of whether they will be used for browsers' accessibility tree is still open, but I'm not aware of any use or plan to use them (they are exposed by WebKit on iOS/macOS, but I'm not sure whether VoiceOver handles them). It seems there are not many operators with these properties, so there is also the option of handling them separately if they turn out to be necessary.

@fred-wang
Author

I opened #209 for the separate fence/separator discussion.

The following entries seem still weird to me, can the spacing be tweaked so that they can be moved to another pre-existing category?

  • ** infix: {'form': 'infix', 'lspace': 1, 'rspace': 1}
  • // infix: {'form': 'infix', 'lspace': 1, 'rspace': 1}
  • <> infix: {'form': 'infix', 'lspace': 1, 'rspace': 1}

I still see a lot of repeated ASCII characters and I'm not sure how relevant these entries are. I would rather see them in preformatted text, not math layout...

@fred-wang
Author

Multichar entries are now handled by https://mathml-refresh.github.io/mathml-core/#operator-dictionary-compact

For the record, current estimated size is 770*2 = 1540 bytes. The cost of supporting multi char entries is quite significant, (154+49)*2 = 406 bytes so 26% of the dictionary size.
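
The arithmetic can be reproduced directly. The figures 770, 154, 49 and the factor 2 are taken verbatim from the comment; what each count refers to is not spelled out there, so the variable names below are only descriptive guesses.

```python
# Figures quoted in the comment; the meaning of each count is not stated there.
total_bytes = 770 * 2
multi_char_bytes = (154 + 49) * 2
share = multi_char_bytes / total_bytes

print(total_bytes, multi_char_bytes, f"{share:.1%}")
```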

If some entries are not essential, it would be very good to try and simplify things. For example, restricting to 2-char strings would avoid the extra character necessary for null-terminated strings. And it looks like the ASCII forms are not important at all; they should be replaced with the proper Unicode code point (or people should use preformatted text rather than math formulas). I wonder whether we could just restrict to negated XXXX-00338 entries?

fred-wang added a commit to mathml-refresh/xml-entities that referenced this issue May 7, 2020
* "|||" does not seem to be used as a programming language operator.
* For (stretchy) fences, U+2980 is more appropriate than "|||"

w3c/mathml#143
w3c/mathml#176
fred-wang added a commit to mathml-refresh/xml-entities that referenced this issue May 7, 2020
It seems to be used as a punctuation sign rather than an operator.
The ellipsis character … U+2026 seems more appropriate for that
purpose.

w3c/mathml#143
w3c/mathml#176
@fred-wang
Author

If some entries are not essential, it would be very good to try and simplify things. For example, restricting to 2-char strings would avoid the extra character necessary for null-terminated strings.

For this point, I opened
mathml-refresh/xml-entities#25
mathml-refresh/xml-entities#26

@fred-wang
Author

This is now a table of 2-char ASCII operators (38 bytes): Operators_2_ascii_chars
https://mathml-refresh.github.io/mathml-core/#operator-dictionary-compact

The text has been changed to handle the case of a 2-char operator whose second character is either U+0338 COMBINING LONG SOLIDUS OVERLAY or U+20D2 COMBINING LONG VERTICAL LINE OVERLAY. I'm not sure if there is an easy way in browsers to check for combining characters, and only these two seemed important per yesterday's discussion. But we can change that later if more single char + combining forms are needed.
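
A minimal sketch of that classification in Python, assuming operators are compared as sequences of code points (so surrogate pairs do not arise here). The function name and category labels are hypothetical, and this is not the algorithm text from the specification.

```python
# The two combining overlays singled out in the comment above.
COMBINING_OVERLAYS = {0x0338, 0x20D2}


def classify_operator(text: str) -> str:
    """Classify operator text by the kind of dictionary lookup it would need."""
    if len(text) == 1:
        return "single"
    if len(text) == 2:
        if ord(text[1]) in COMBINING_OVERLAYS:
            return "base+combining"
        if all(ord(c) < 0x80 for c in text):
            return "two-ascii"
    return "unsupported"


for sample in ["+", "+=", "\u226b\u0338", "\u2282\u20d2", "|||"]:
    print(repr(sample), classify_operator(sample))
```

Checking for two fixed code points avoids needing a general "is this a combining character" query, which matches the concern raised in the comment.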

The two surrogate pairs for Arabic operators are also handled specially.

I'm closing this as the tests are already written, they just need to be regenerated.

@fred-wang removed the labels need resolution, need specification update, need tests, and need polyfill on May 12, 2020