# issue with &Tscr; conversion #361

Closed
opened this issue Dec 15, 2012 · 18 comments

Member

### pkra commented Dec 15, 2012

 I was building a demo around this MathML torture test and ran into an issue. The Schwinger-Dyson equation uses &Tscr; (𝒯), and MathJax renders everything fine, but if I copy the Show MathML source (the processed MathML, not the original MathML), the result fails to render with MathJax. In particular, &Tscr; gets converted to &#xD835;&#xDCAF;. Googling gave me &#x1D4AF;, which seems to work. Addendum: inspecting the result, MathJax uses the MathJax_script font and a T in the HTML output.
Member Author

### pkra commented Dec 15, 2012

 Similarly (from Cichon's diagram), 𝒦 is &#xD835;&#xDCA6; instead of &#x1D4A6; (but ℒ is ok), 𝔟 is &#xD835;&#xDD1F; instead of &#x1D51F;, and 𝔡 is &#xD835;&#xDD21; instead of &#x1D521;.
Member Author

### pkra commented Dec 15, 2012

 Alright, somebody kindly pointed me to utf16 surrogate pairs -- I guess it's a missing conversion from the internal representation.
Member Author

### pkra commented Dec 15, 2012

 Also: 𝔈 𝔇 𝔉 𝔄𝔅.
Contributor

### fred-wang commented Dec 15, 2012

 I don't see 𝒯 in jax/input/MathML/entities/t.js and in general I don't see surrogate pairs (at best I see multiple chars like ThickSpace). Is it on purpose that we don't include these characters? I remember that when we discussed support for more math fonts, Asana Math was the only option but I think Davide told me that Asana's stretchy characters were outside the unicode range accessible from Javascript and so we would need to change their code points. Is it the same limitation, here?
Member

### dpvc commented Dec 15, 2012

 Fred, the math alphabets are in separate files (src.js, fr.js, opf.js), so they aren't in the files by letter.

 Peter, my understanding is that the use of named entities in MathML is discouraged, and so I haven't done a lot to accommodate them. How well they are handled also depends on whether you are using the mml2jax preprocessor, and on the format of the file (XHTML versus HTML). The browser frequently does the entity translation before MathJax sees it. MathJax does use UTF-16 internally (since that is what the browsers seem to use), but I guess you are right that the Show Math As MathML should convert the UTF-16 back to a single entity.

 Fred, the Asana font is one of the few fonts where the stretchy characters are accessible to MathJax. Most of the fonts (like Cambria) put them above U+10FFFF, which is outside what UTF-16 can address.
Member Author

### pkra commented Dec 16, 2012

 I'm not concerned with the named entities. The key issue is that the MathML produced by "show MathML" will not be rendered by MathJax -- we shouldn't have that, methinks ;) Converting to utf8 makes it work. Since some characters get converted to utf8, I'm wondering why others don't (in particular Kscr vs Lscr seems odd).
Member

### dpvc commented Dec 16, 2012

 I agree that the Show Math output needs to be corrected. As for the difference between Kscr and Lscr, there are a number of script characters that are in the "Letterlike Symbols" unicode block, and these are not duplicated in the "Mathematical Alphabets" block. Lscr is one such character that is at U+2112, and so that doesn't need a surrogate pair, whereas Kscr is in Plane 1, which requires the surrogate pairs.
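The difference described above can be seen directly in a JavaScript string. This is a minimal sketch for illustration only; the helper name `codeUnits` is made up here and is not part of MathJax.

```javascript
// U+2112 (script L) is in the Letterlike Symbols block, inside the BMP,
// so it occupies a single UTF-16 code unit.
const scriptL = "\u2112";

// U+1D4A6 (script K) is in Plane 1, so JavaScript stores it as the
// surrogate pair D835 DCA6 (two UTF-16 code units).
const scriptK = "\uD835\uDCA6";

// Hypothetical helper: list the UTF-16 code units of a string as hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase());
  }
  return units;
}

console.log(scriptL.length, codeUnits(scriptL).join(" ")); // 1 2112
console.log(scriptK.length, codeUnits(scriptK).join(" ")); // 2 D835 DCA6
```

Any code that walks a string one `charCodeAt` at a time will therefore see script K as two separate values, which matches the pair of numeric references reported at the top of this issue.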
Member Author

### pkra commented Dec 17, 2012

 Ah, thanks for the explanation!
Contributor

### fred-wang commented Dec 17, 2012

 Thanks Davide. I just had a quick look at the code. If I understand correctly, in toMathML.js MathJax.ElementJax.mml.entity.toMathML is used to convert the named entity and MathJax.ElementJax.mml.mbase.toMathMLquote to convert the chars. toMathMLquote does not seem to consider surrogate pairs, so I guess this must be fixed. I'm not sure what data[0] is when MathJax.ElementJax.mml.entity.toMathML is called. jax/element/mml/jax.js seems to do something with surrogate pairs in MathJax.ElementJax.mml.toString (I haven't tried to think too hard about that). Should MathJax.ElementJax.mml.entity.toMathML just use data[0].toString() instead, or has data[0] already been converted before? I guess I can take this bug if that's not too difficult, and that will be a good opportunity to add some documentation...
Member

### dpvc commented Dec 17, 2012

 Fred, thanks for looking into this. I'm on the road today, but will give some advice when I get home either tonight or tomorrow morning. I suspect it would be a good one for you to handle. More information when I respond later.
Member

### dpvc commented Dec 18, 2012

 OK, I've looked into it, and you are right: toMathMLquote is probably the best place to solve this. The mml.entity object doesn't actually come into play here, as either the browser or the MathML input jax will have converted the entities to unicode characters already. (Originally, I was going to retain the entities and have them translated internally in the mml object, but because the browsers already convert entities to unicode characters prior to MathJax seeing them, that ended up not being practical to do. The only time MathJax sees entities is if the math is originally entered in the page as MathJax scripts.) So the character references are really being handled via the mml.chars class, not mml.entity.

 The data[0] property for mml.chars is the actual character string, so in the case of 𝒦, this will hold the pair of characters \uD835\uDCA6, which is the UTF-16 encoding for this character. The mml.chars.toString() function should just return the data[0] string, and when that is passed to toMathMLquote(), the individual characters are (incorrectly) converted to separate numeric character references, as you point out. They should be recombined to produce the single numeric entity &#x1D4A6;.

 That should not be hard to do, and I will leave that to you to work out. See the Wikipedia entry for UTF-16 for details of the encoding scheme if you need them. Thanks for taking this one on.
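The recombination asked for above follows the standard UTF-16 decoding formula. Here is a minimal sketch of the arithmetic; the function name `surrogatePairToReference` is hypothetical and this is not the code that was committed to MathJax.

```javascript
// Hypothetical helper (not MathJax's API): combine a lead surrogate
// (D800..DBFF) and a trail surrogate (DC00..DFFF) into one hexadecimal
// numeric character reference.
function surrogatePairToReference(lead, trail) {
  // The lead carries the top 10 bits and the trail the low 10 bits of
  // (codePoint - 0x10000).
  const codePoint = (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
  return "&#x" + codePoint.toString(16).toUpperCase() + ";";
}

// Script K is stored as the two code units D835 DCA6.
const storedK = "\uD835\uDCA6";
console.log(
  surrogatePairToReference(storedK.charCodeAt(0), storedK.charCodeAt(1))
); // &#x1D4A6;
```

Worked by hand for this character: 0xD835 - 0xD800 = 0x35, 0x35 * 0x400 = 0xD400, 0xDCA6 - 0xDC00 = 0xA6, and 0xD400 + 0xA6 + 0x10000 = 0x1D4A6.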
Contributor

### fred-wang commented Dec 18, 2012

 Thanks for the confirmation. So if data[0] just contains the surrogate pair(s) and only toMathMLquote is relevant, I guess I can do something to fix that issue. => assigned to myself

Contributor

### fred-wang commented Dec 20, 2012

 Alright, I've submitted a fix to my issue361 branch: 96dc4b2 (only the unpacked/ version). I'm wondering what the rationale is for these split and join operations, but I have kept them anyway. I've also added an emacs/vim header to set the indentation style (so that in particular I'll keep Davide's indentation with 2 spaces when I auto-indent with emacs). I've also tried to mimic Davide's "dense" code style. I hope that's ok. I haven't done much testing, but Peter's test cases $𝒯 𝒦 𝔟 𝔡 𝔈 𝔇 𝔉 𝔄 𝔅$ serialize as $&#x1D4AF; &#x1D4A6; &#x1D51F; &#x1D521; &#x1D508; &#x1D507; &#x1D509; &#x1D504; &#x1D505;$ in the Show Math menu.
Member

### dpvc commented Dec 20, 2012

 The code looks good. I would probably have done (n-0xD800) << 10 because in the old days shifting was faster than multiplication, but I suppose that isn't the case any more. There is also the issue of what to do with a lead surrogate that doesn't have a trail surrogate (which is invalid, but possible). The current code leaves the surrogate as a literal character; I might be tempted to remove it (string[i] = "") instead. Nice job.
Member

### dpvc commented Dec 20, 2012

 PS, the split and join are because we are replacing individual characters by multiple characters. This makes it easier to handle the loop counter and avoids the more complicated splicing of the string. It certainly could have been done in other ways.
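The split/join approach and the two suggestions from the review (shifting instead of multiplying, and dropping a lead surrogate that has no trail) could be sketched as below. This is an illustration under assumptions, not the committed patch: the function name `quoteNonAscii`, the choice to escape only characters above 0x7E, and the treatment of a stray trail surrogate are all made up for the example.

```javascript
// Hypothetical sketch (not the MathJax source): replace non-ASCII
// characters with numeric character references, merging surrogate pairs.
function quoteNonAscii(text) {
  // Split into one-code-unit strings so a single entry can be replaced
  // by a longer string without shifting the loop index.
  const chars = text.split("");
  for (let i = 0; i < chars.length; i++) {
    const n = chars[i].charCodeAt(0);
    if (n >= 0xD800 && n <= 0xDBFF) {
      // Lead surrogate: look at the next code unit, if there is one.
      const trail = i + 1 < chars.length ? chars[i + 1].charCodeAt(0) : 0;
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        // (n - 0xD800) << 10 is the same as (n - 0xD800) * 0x400.
        const codePoint = ((n - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
        chars[i] = "&#x" + codePoint.toString(16).toUpperCase() + ";";
        chars[i + 1] = ""; // the trail is consumed by the merged reference
        i++;
      } else {
        chars[i] = ""; // unpaired lead surrogate: remove it
      }
    } else if (n > 0x7E) {
      // Any other non-ASCII code unit (a stray trail surrogate also lands
      // here in this sketch) gets its own reference.
      chars[i] = "&#x" + n.toString(16).toUpperCase() + ";";
    }
  }
  return chars.join("");
}

console.log(quoteNonAscii("\uD835\uDCAF x \u2112")); // &#x1D4AF; x &#x2112;
console.log(quoteNonAscii("a\uD835b")); // ab
```

Because every entry of the array starts out as exactly one code unit, the index `i` always lines up with the original string, which is the convenience the split and join buy over splicing the string in place.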
Contributor

### fred-wang commented Dec 20, 2012

 OK, thanks for the explanation & suggestions. I'll make the changes you suggest in my next commit.
Contributor

### fred-wang commented Jan 5, 2013

 UI/show-source-2.html => In testsuite
Member

### dpvc commented Feb 12, 2013

 OK, this looks good. I'm marking "Ready for Release" and will merge into develop.

### dpvc pushed a commit to dpvc/MathJax that referenced this issue Feb 12, 2013

 Merge remote-tracking branch 'fred/issue361' into develop 
Resolves issue mathjax#361.
 fc7aa04