TeX translation could be improved for numbers #2772

Open
NSoiffer opened this issue Sep 15, 2021 · 3 comments
Labels
Expected Behavior This is how MathJax works v3

Comments

@NSoiffer

Issue Summary

The translation of numbers containing spaces or commas could be improved.

Issue details:

  1. With the input 16\,807, MathJax produces the following MathML when using "copy to clipboard:mathml":
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block" data-semantic-type="infixop" data-semantic-role="implicit" data-semantic-annotation="clearspeak:unit" data-semantic-id="4" data-semantic-children="0,2" data-semantic-content="3" data-semantic-speech="16 807">
  <mn data-semantic-type="number" data-semantic-role="integer" data-semantic-font="normal" data-semantic-annotation="clearspeak:simple" data-semantic-id="0" data-semantic-parent="4">16</mn>
  <mstyle scriptlevel="0">
    <mspace width="0.167em" data-semantic-type="operator" data-semantic-role="space" data-semantic-id="3" data-semantic-parent="4" data-semantic-operator="infixop,&#x2062;"></mspace>
  </mstyle>
  <mn data-semantic-type="number" data-semantic-role="integer" data-semantic-font="normal" data-semantic-annotation="clearspeak:simple" data-semantic-id="2" data-semantic-parent="4">807</mn>
</math>

Notice that the number is broken up into two mn's. Also notice that SRE is interpreting the space as multiplication. Although it is possible that this is what is meant, I think it is far more likely that this is meant to be a single number. Context and digit block counting could be used to favor one interpretation over the other.
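
As a rough illustration of what digit block counting could look like (a sketch only, not something MathJax or SRE is known to do), the check below treats a TeX string as a single number when it is a leading block of one to three digits followed by blocks of exactly three digits, each separated by \,. The function name and the blocks-of-three rule are assumptions made for this example; other locales group digits differently.

```python
import re

# Assumption: "Western" grouping, i.e. a leading block of 1-3 digits
# followed by one or more blocks of exactly 3 digits, separated by \, .
THIN_SPACE_NUMBER = re.compile(r"\d{1,3}(?:\\,\d{3})+")


def looks_like_single_number(tex: str) -> bool:
    """Return True if the whole string is digit blocks of three joined by thin spaces."""
    return THIN_SPACE_NUMBER.fullmatch(tex) is not None


print(looks_like_single_number(r"16\,807"))   # True
print(looks_like_single_number(r"1\,2\,3"))   # False
```

A purely syntactic check like this also accepts 100\,100, so it can only suggest a reading; it cannot confirm one without context.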

A similar issue arises when using a comma, e.g., 7^5=16,807. In this case, context clearly points to 16,807 being a single number.

This poor translation will affect speech. Potentially it also affects braille generation.

Technical details:

@zorkow
Member

zorkow commented Sep 23, 2021

  1. The TeX parser interprets an expression for the purpose of visual rendering. Making a decision on an author's intention is not really the job of a syntax parser. Interpreting spaces between numbers is highly context sensitive and locale dependent. You suggest counting digit blocks, so how would you interpret expressions like 10\,10\,100, 10\,10\,10, or 1\,2\,3 without any context information? Or, for that matter, 10^4 = 100\,100?

  2. 7^5=16,807 in LaTeX renders with an explicit space after the comma, hence it has to be interpreted as comma-separated values. In order to get a single number in LaTeX you have to explicitly make the comma a non-punct: 16{,}807. This results in no space after the comma and a single number in LaTeX, in MathJax, and consequently in its MathML output.
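
To make the role of context concrete, here is a small arithmetic sketch. It is purely illustrative and is not how the TeX parser or SRE works: the helper name and the idea of evaluating the left-hand side are assumptions for this example. It checks which reading of the digit groups on the right-hand side agrees with the value on the left.

```python
from math import prod
from typing import List


def consistent_readings(lhs_value: int, groups: List[str]) -> List[str]:
    """Return the readings of `groups` that equal `lhs_value`.

    "single number": the groups concatenated into one integer.
    "product": the groups multiplied (space read as invisible times).
    """
    readings = []
    if int("".join(groups)) == lhs_value:
        readings.append("single number")
    if prod(int(g) for g in groups) == lhs_value:
        readings.append("product")
    return readings


print(consistent_readings(7**5, ["16", "807"]))     # ['single number']
print(consistent_readings(10**4, ["100", "100"]))   # ['product']
```

Both right-hand sides have the same block shape, yet the consistent reading differs, which is why block counting without context cannot settle the question.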

Please note that any syntax conversions in MathJax are just that: conversions between purely syntactic representations, i.e., LaTeX, MathML, or AsciiMath. The goal is to retain visual equivalence. Making a leap-of-faith interpretation could destroy that.

I would argue that a clear separation of syntax and semantics is important. Semantic interpretation should be left to a higher-level recognition process; in our case it is done by SRE, which effectively only uses spatial layout and pattern recognition to build its own representation. (For an outdated version of what that tree looks like, have a look here: https://zorkow.github.io/semantic-tree-visualiser/visualise.html) This is embedded without interfering with the visual rendering. (Past issues show that this is not always successful! So I should say minimal interference.)

SRE has a number of heuristics for numbers, some locale dependent. But only a few are currently exposed in MathJax. I might expose a few more for the next release. But I would never claim that any of these will be perfect or indeed will reflect the intentions of authors; so I am sure misinterpretations can be found in the future.

@dpvc added Expected Behavior This is how MathJax works v3 labels Sep 23, 2021
@zorkow
Member

zorkow commented Oct 23, 2021

@NSoiffer
Btw, I was serious about this point. In fact, I'd be very happy to work on a comprehensive parsing algorithm for numbers, or possibly multiple restricted ones depending on the locale or on a specific domain (e.g., US K-12). But one would need to do this based on a set of clearly defined criteria against which the correctness and restrictions of an algorithm could be demonstrated, not just on some intuitive notion of good and bad parsing or on second-guessing the intention of authors.

@NSoiffer
Author

Sorry for not getting back to this sooner -- it got buried.

I appreciate the requirement not to break the display, but wrapping the digits inside of an mn won't change the display (use the thin space char U+2009). So I believe the issue is not the clear-cut "semantics" vs "display" issue you bring up. Why is splitting it into several mns more correct typographically than putting it into a single mn? Maybe easier, but not more correct.

If you agree with the above (which I think might be a big "if"), then you have two choices:

  1. do what is easiest
  2. do what yields the more likely correct markup (as defined by the MathML spec).

There is no way to know what is in the author's mind, but some simple rules will yield likely >>99% correctness. MathPlayer had some repair and MathCAT has stronger repair, but most systems won't and so JAWS and VoiceOver will likely speak the expression poorly, something that MathJax could prevent by merging these cases into a single mn.
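
To show what merging into a single mn might involve, here is a minimal sketch. It is not MathJax, MathPlayer, or MathCAT code; the function names are made up, the input is assumed to be MathML without a namespace prefix on the tags, and a thin space is assumed to appear as an mspace of width 0.167em (optionally wrapped in an mstyle, as in the output above). A real repair would first apply a grouping check before merging.

```python
import xml.etree.ElementTree as ET

THIN_SPACE = "\u2009"  # U+2009 THIN SPACE


def is_thin_space(el):
    """True for <mspace width="0.167em"/>, optionally wrapped in one <mstyle>."""
    if el.tag == "mstyle" and len(el) == 1:
        el = el[0]
    return el.tag == "mspace" and el.get("width") == "0.167em"


def merge_spaced_numbers(parent):
    """Merge <mn>, thin space, <mn> runs among parent's children into one <mn>."""
    children = list(parent)
    i = 0
    while i + 2 < len(children):
        a, sep, b = children[i:i + 3]
        if a.tag == "mn" and b.tag == "mn" and is_thin_space(sep):
            a.text = (a.text or "") + THIN_SPACE + (b.text or "")
            a.tail = b.tail          # keep whatever followed the last <mn>
            parent.remove(sep)
            parent.remove(b)
            children = list(parent)  # stay at i so longer runs keep merging
        else:
            i += 1
    return parent


root = ET.fromstring(
    '<math><mn>16</mn><mstyle scriptlevel="0">'
    '<mspace width="0.167em"/></mstyle><mn>807</mn></math>'
)
merge_spaced_numbers(root)
print(ET.tostring(root, encoding="unicode"))
```

After the call, the math element has a single mn child whose text is 16, U+2009, 807.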

I was aware of two common digit block strategies: Western languages use blocks of three, and many Asian countries use blocks of four. This Wikipedia article mentions India as using a somewhat different style. In all these cases, locale helps resolve what to do.
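
A locale-aware block check could be sketched roughly as follows. The style names, the table layout, and the function are assumptions for illustration. The Indian rule encoded here (a final block of three, preceded by blocks of two) comes from general knowledge of that convention rather than from this thread.

```python
# (max length of leading block, length of middle blocks, length of last block)
# Assumed styles; the keys are made-up labels, not real locale identifiers.
STYLES = {
    "western": (3, 3, 3),
    "east_asian": (4, 4, 4),
    "indian": (2, 2, 3),
}


def is_valid_grouping(groups, style="western"):
    """Check digit blocks, given from most to least significant."""
    if len(groups) < 2:
        return False
    if not all(g.isascii() and g.isdigit() for g in groups):
        return False
    max_first, mid_len, last_len = STYLES[style]
    first, middle, last = groups[0], groups[1:-1], groups[-1]
    return (
        1 <= len(first) <= max_first
        and all(len(g) == mid_len for g in middle)
        and len(last) == last_len
    )


print(is_valid_grouping(["16", "807"]))                  # True
print(is_valid_grouping(["1", "0000"], "east_asian"))    # True
print(is_valid_grouping(["12", "34", "567"], "indian"))  # True
print(is_valid_grouping(["10", "10", "100"]))            # False
```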

In looking up the Wikipedia page, I also discovered that ISO has a standard out that says blocks of three are preferred, with whitespace as the preferred separator. It also mentions using the same grouping after the decimal point.
