Complex text layout, in particular with TeX input [was: MathJax does not support Complex text layout.] #474

Open
opened this issue May 19, 2013 · 23 comments

hartman commented May 19, 2013

 Because MathJax looks at individual code points, it has trouble dealing with scripts that require bidirectionality, contextual shaping, etc. This is visible whenever you try to use Hebrew or Arabic, for instance. It would be good if MathJax were able to identify these ranges and keep them as blocks instead of dividing them into individual characters, at the very least in \text mode. http://en.wikipedia.org/wiki/Complex_text_layout
Member

dpvc commented May 19, 2013

 Note that if you set mtextFontInherit to true in the HTML-CSS and SVG sections of your configuration, then MathJax will process \text{} as a single element, and so that should do as you request. You are right that MathJax could do better when mtextFontInherit is false. It should group "unknown" characters into a single collection, rather than putting each into a separate element.
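For reference, a minimal sketch of that setting, assuming the MathJax v2 `MathJax.Hub.Config` call. The variable name `mathJaxConfig` is only illustrative, and the guard is there so the snippet does nothing when MathJax is not loaded.

```javascript
// Sketch of a MathJax v2 configuration that enables mtextFontInherit for
// the two output processors mentioned above.
// `mathJaxConfig` is an illustrative name, not something MathJax requires.
const mathJaxConfig = {
  "HTML-CSS": { mtextFontInherit: true },
  SVG: { mtextFontInherit: true }
};

// Apply the configuration only when MathJax v2 is actually present.
if (typeof MathJax !== "undefined" && MathJax.Hub) {
  MathJax.Hub.Config(mathJaxConfig);
}
```

With this in place, the contents of \text{} should be handed to the browser as one run of text, so the browser's own shaping applies.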
Member

dpvc commented May 19, 2013

 PS, I saw the report on the Wikimedia bugzilla and was planning to add it to the list of things to fix. Thanks for starting the issue here to track that.
Author

hartman commented May 22, 2013

 Thanks for the mtextFontInherit tip. I was going to enable that anyways, but this is one more reason to do that.
Member

dpvc commented Mar 21, 2014

 Some support for RTL was added in v2.3, but the issue of multiple-character sequences being treated as a unit remains. For \text{}, these characters should already be grouped into a single element, so that would be one way to handle it, though not very convenient. Ideally, MathJax would put each sequence that forms one group into a single mi or mo element, just as it does for single Latin letters now. I've looked into this to some degree, and there are some difficulties handling it. It is possible to have combining characters grouped with their preceding characters, but it is not clear to me how some characters work. For example, it seems that the virama (U+0D4D) combines not just with the character on its left, but also with the one on its right, though I might be misunderstanding it. It also seems that some of these groupings are handled by ligatures within the fonts, not by combining characters. Unfortunately, MathJax does not have access to ligature information from the fonts. While it would be possible to add ligature data to MathJax's font tables, this could be a significant amount of data, very little of which would be used by any one page. I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated. One approach might be to put the data needed for each language's script into an individual extension that gets loaded for those pages that need it (either explicitly in the MathJax configuration, or via \require{} within the math on the page). Do you think that would be acceptable?
Author

hartman commented Mar 22, 2014

 Perhaps @amire80 of our WMF language engineering is able to help out a bit here...
modified the milestones: A future release, Bugfix Version Apr 10, 2014
modified the milestones: A future release, The next release Feb 26, 2015
Member

pkra commented Feb 26, 2015

 @hartman do you think you could poke @amire80 some time? We'd love to improve this, especially if Wikipedia wants to roll out the SVG output more widely.

amire80 commented Feb 26, 2015

 I'm right here :) How can I help? Testing? - Gladly, just tell me what to test exactly. Examples of how non-Latin scripts work in formulas? - It's not used in Hebrew textbooks, but it is used in textbooks in Arabic and Persian. Maybe @ebraminio can chime in here. Anything else?
Member

pkra commented Feb 26, 2015

 Thanks for stopping by @amire80 :-) "How can I help?" I'm hoping we can improve handling of combined characters in non-Latin scripts. This has come up on WMF bugzilla/phabricator repeatedly. To quote Davide from #474 (comment): "Ideally, MathJax would put each sequence that forms one group into a single mi or mo element, just as it does for single Latin letters now. I've looked into this to some degree, and there are some difficulties handling it. It is possible to have combining characters grouped with their preceding characters, but it is not clear to me how some characters work. For example, it seems that the virama (U+0D4D) combines not just the character on its left, but also on the right, though I might be misunderstanding it. It also seems that some of these grouping are handled by ligatures within the fonts, not by combining characters. Unfortunately, MathJax does not have access to ligature information from the fonts. While it would be possible to add ligature data to MathJax's font tables, this could be a significant amount of data very little of which would be used by any one page. I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated." So our question would be: does anyone have expertise they can share with us? @hartman was kind enough to point to you ;-) (Perhaps we should split this out into a separate issue.)

amire80 commented Feb 26, 2015

 The (very) basic idea of virama is that the sequence of consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph (but it can get far more complicated). More generally, I'd love to understand MathJax's current situation. What should I do to test the current rendering? Install my own instance? Or is there an online instance where a current version can be tested?
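To make the "three Unicode characters" point concrete, here is a small sketch that lists the code points of the Malayalam conjunct used as an example elsewhere in this thread. The helper name `listCodePoints` is made up for illustration.

```javascript
// List the code points of a string as U+XXXX labels.
// Array.from iterates by code point, not by UTF-16 code unit.
// `listCodePoints` is a hypothetical helper, not part of MathJax.
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// consonant + virama + consonant, displayed as one glyph by a shaping engine
const conjunct = "\u0D24\u0D4D\u0D30";
console.log(listCodePoints(conjunct).join(" ")); // prints "U+0D24 U+0D4D U+0D30"
```

Any code that walks the string one code point at a time will see three items here, even though a reader sees one.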
Member

pkra commented Feb 26, 2015

 "consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph" Right. Combined characters are common enough in mathematical layout, so we understand the situation in general. "(but it can get far more complicated)." That's our problem. We lack the specifics for most natural-language, non-Latin scripts. "Or is there an online instance where a current version can be tested?" You can do this on MediaWiki (using the MathML/SVG mode of the math extension), in the browser (this sample or this codepen), or use a local copy of MathJax, whichever you like. A basic example: ത്ര will be read as its three separate code points, and since we don't have any routines to identify these kinds of combined characters, the TeX input converts this internally to MathML as three separate elements, which the MathJax output will in turn split across three spans (in the HTML outputs) or three g elements (in the SVG output), and of course this breaks the rendering of the combined character. (I just noticed that Firefox sometimes combines the spans in the HTML outputs, e.g., ത്ര but not the subscript in കു_ശ. Chrome is more "consistent" in that nothing is combined.) So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML? Once we have that, the rendering will work as well.

davidcarlisle commented Mar 2, 2015

 "So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML?" Sorry for the long comment, bringing a bit of off-site discussion back to the issue tracker. How feasible/expensive would it be to make the Unicode UCD database combining class available to MathJax for each character? Basically (or at least as a good first approximation) any character with non-zero combining class (field 4 in UnicodeData.txt) needs to stay with the preceding one, and in addition, if it's class 9 (virama), the following character needs to be kept together as well. It's probably also worth noting that TeX, even Unicode TeX like xetex or luatex, is almost certainly not going to get this right without markup; that is, you will need \text{abc} or \mathit{abc} or some other such command to force a string of characters to be typeset as text with a single font, rather than TeX's normal habit of splitting things up character by character, even if the construct looks like a single character to the author. In classic TeX it is not an issue, as fonts can only have 256 characters, and while composed characters can be supported with various macro remapping tricks, composing characters following the base are basically not supportable, even for simple composing accents like acute. Support in Unicode TeX variants such as xetex and luatex seems a bit variable. In text, xetex hands things over to the HarfBuzz library, so it does pretty well. luatex handles it internally and currently does less well with the virama. In math, both require a font with an OpenType MATH table to do anything very useful, and I couldn't find such a font that had a virama.
The following LaTeX document uses Kartika in text and Latin Modern Math in math. You will note that even European accents typically fail in math, but even the virama example works if you add some markup: \mbox here, or mi or mtext equivalently in MathML. The image shows xetex at the top and luatex at the bottom. So while not requiring something like \text{..} or \mbox{...} around such character strings would be desirable, it would put your Unicode support a long way ahead of what TeX can currently achieve, so it depends a bit on what the specification of the "tex-like syntax" is: how far beyond what TeX can do is it reasonable to push it?
\documentclass{article}
\usepackage{fontspec}
\usepackage{unicode-math}
\setmainfont{kartika.ttf}
\begin{document}
U+0d24 U+0d4d U+0d30 outputs e.g., ത്ര but
abc $abc \mbox{ത്ര}$ U+0063
abç $abç \mbox{ത്ര}$ U+00e7
abç $abç \mbox{ത്ര}$ U+0063 U+0327
\end{document}
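A rough sketch of the combining-class heuristic described in this comment, written in JavaScript since that is what MathJax would need. This is only an illustration of the "first approximation" and not how MathJax does or would implement it: the function name `groupByCombiningClass` and the three-entry table are assumptions made for the example, and a real version would need the full combining-class data from UnicodeData.txt.

```javascript
// Tiny illustrative excerpt of canonical combining classes (field 4 of
// UnicodeData.txt). A real implementation would need the complete table.
const SAMPLE_COMBINING_CLASS = new Map([
  [0x0301, 230], // COMBINING ACUTE ACCENT
  [0x0327, 202], // COMBINING CEDILLA
  [0x0D4D, 9]    // MALAYALAM SIGN VIRAMA
]);

const VIRAMA_CLASS = 9;

// Hypothetical helper: split `text` into groups where
//  - a character with non-zero combining class stays with the preceding one,
//  - the character right after a virama (class 9) stays in the group too.
function groupByCombiningClass(text, classes = SAMPLE_COMBINING_CLASS) {
  const groups = [];
  let joinNext = false; // true while a virama is waiting for its next base
  for (const ch of text) {
    const ccc = classes.get(ch.codePointAt(0)) || 0;
    if (groups.length > 0 && (ccc !== 0 || joinNext)) {
      groups[groups.length - 1] += ch;
    } else {
      groups.push(ch);
    }
    // A virama pulls in the following character; a base character ends that pull.
    if (ccc === VIRAMA_CLASS) joinNext = true;
    else if (ccc === 0) joinNext = false;
  }
  return groups;
}

console.log(groupByCombiningClass("\u0D24\u0D4D\u0D30").length); // prints 1
console.log(groupByCombiningClass("abc\u0327").length);          // prints 3
```

Each resulting group could then be placed into one mi/mo element. Note that marks whose combining class is 0 are not caught by this rule, which is one reason it can only be a first approximation.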

khaledhosny commented Mar 3, 2015

 I'm not really sure if I understand what the discussion is about, but if the idea is to identify what sequence of characters constitute a single unit, then Unicode grapheme clustering should provide the needed information..

amire80 commented Mar 3, 2015

 Yes - what @khaledhosny says sounds like the right thing to me, although I'm not very experienced with it. Maybe @santhoshtr can contribute more details. Santhosh, I think that what @pkra wrote three comments above explains the problem best.

davidcarlisle commented Mar 3, 2015

 On 3 March 2015 at 12:05, Khaled Hosny notifications@github.com wrote: "I'm not really sure if I understand what the discussion is about, but if the idea is to identify what sequence of characters constitute a single unit, then Unicode Grapheme clustering http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries should provide the needed information.." Yes, but I suppose the question is how far it makes sense for a JavaScript library to do that by hand if the underlying platform doesn't make the Unicode properties available, and, if it's emulating TeX syntax, how far would TeX go? You know as much about the TeX support as anyone. How far would it be reasonable in xetex to have such a cluster do anything sensible in math without escaping to text with \text{..} or some such command, given that you can't assign a \mathclass to such a cluster?
Author

hartman commented Mar 4, 2015

 I found a CoffeeScript implementation for graphemes. https://github.com/devongovett/grapheme-breaker Might be useful.
changed the title MathJax does not support Complex text layout. TeX input and complex text layout [was: MathJax does not support Complex text layout.] Mar 4, 2015
Member

pkra commented Mar 4, 2015

 Thanks for all the useful comments. To summarize: xetex/luatex do not handle input the way requested in this issue, i.e., without extra markup such as \text. It's not clear (to me at least) if there are plans to handle it this way. A solution could start with the simple approach David C outlined or potentially build on grapheme-breaker (thanks @hartman!). To add to that: a quick test with LaTeXML and pandoc indicates that they, on the other hand, do handle such characters as requested here, i.e., not like xetex/luatex. So it seems to me that a solution can't be in the core TeX input but needs to be an extension. That's not a problem, of course, since it probably would have ended up an extension anyway. It would be good to hear from MediaWiki/WMF communities if they actually want to deviate from the TeX engines here.
Member

pkra commented Mar 10, 2015

 Again it would be good to get more feedback. At TeX folks, is handling characters in math mode without extra markup the future direction of xetex/luatex/etc? At MediaWiki / WMF folks: is non-standard TeX behavior actually desired by the relevant communities? Without more feedback, I think we should punt on this / move it out of the 2.6 milestone.

khaledhosny commented Mar 11, 2015

 Let me understand the issue here, people want to do things like $x+y=$ where  is possibly a multi-code point grapheme, and have  treated as a math identifier, right? If so, then I think that is a reasonable expectation and if current Unicode TeX engines do not handle it correctly (they probably don’t) it is likely a bug or a missing feature, not something by design. Or is it that people want to do things like , where  is a multi-character text string that possibly needs complex text layout, and get proper text layout (bidi, shaping etc.)? I don't think that is a reasonable expectation and some kind of markup is needed here to indicate that this is a regular text string that needs to be treated as such.
Member

pkra commented Mar 11, 2015

 Thanks, @khaledhosny! "[...] people want to do things like $x+y=$ where is possibly a multi-code point grapheme, and have treated as a math identifier, right?" Yes, that's how I understand it as well. (It's a bit difficult to say since this is originally a request from the Wikipedia end.) "I think that is a reasonable expectation" Thanks! "if current Unicode TeX engines do not handle it correctly (they probably don’t) it is likely a bug or a missing feature, not something by design." Thanks for that, too. The "they probably don’t" part worries me slightly, but if you and @davidcarlisle agree that it's the desired behavior in Unicode TeX engines, then that's enough for us, I think. Still hoping the MediaWiki/WMF/Wikipedia side will chime in.
modified the milestones: A future release, MathJax v2.6 Aug 4, 2015
Member

pkra commented Aug 4, 2015

 As per F2F, we're removing this from the v2.6 Milestone (i.e., the upcoming release). It's not clear what the right approach is, in particular, in terms of compatibility with TeX/LaTeX (or rather XeTeX/LuaTeX). It's also not clear what the WMF and the Wikipedia community really want here. To be clear, we're not closing this issue and we are still interested in figuring out how complex layout might work in the TeX input.
changed the title TeX input and complex text layout [was: MathJax does not support Complex text layout.] Complex text layout, in particular with TeX input [was: MathJax does not support Complex text layout.] Nov 23, 2015
Member

pkra commented Oct 25, 2018

 Blast from the future: there's a TC39 proposal, "Unicode segmentation", that would allow (among other things) splitting strings by grapheme: https://github.com/tc39/proposal-intl-segmenter. The repository includes a link to a polyfill (and there's also a non-standard Chrome feature, apparently).
Member

dpvc commented Oct 25, 2018

 Cool. Thanks, @pkra.
Member

pkra commented Oct 25, 2018

 No problem. The polyfill is unfortunately useless -- it only covers English. But for those who want to try it out, the Chrome built-in might be useful.
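For anyone who wants to experiment, a minimal sketch of grapheme splitting, assuming a runtime that exposes `Intl.Segmenter` in the form the proposal was later standardized (constructor with a `granularity` option, and `segment()` returning an iterable of objects with a `segment` property). The helper name `splitGraphemes` is made up for this example.

```javascript
// Hypothetical helper: split a string into grapheme clusters using the
// built-in segmenter. Throws if the runtime does not provide Intl.Segmenter.
function splitGraphemes(text, locale = "en") {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    throw new Error("Intl.Segmenter is not available in this runtime");
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // Each item yielded by segment() carries the cluster text in `.segment`.
  return Array.from(segmenter.segment(text), (item) => item.segment);
}

// "c" followed by COMBINING CEDILLA stays together as one cluster.
console.log(splitGraphemes("abc\u0327").length); // prints 3
```

Whether a consonant + virama + consonant sequence comes back as one segment or two depends on the Unicode version the runtime implements, so that case is worth checking in the target browsers before relying on it.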