issue with 𝒯 conversion #361

Closed
pkra opened this issue Dec 15, 2012 · 18 comments

@pkra
Member

commented Dec 15, 2012

I was building a demo around this MathML torture test and ran into an issue.

The Schwinger-Dyson equation uses &Tscr; and MathJax renders everything fine, but if I copy the show-MathML source (the processed MathML, not the original MathML), then the result fails to render with MathJax.

In particular, &Tscr; gets converted to &#xD835;&#xDCAF;. Googling gave me &#x1D4AF;, which seems to work.

Addendum: inspecting the result, MathJax uses MathJax_script-font and a T in the HTML-output.

@pkra

Member Author

commented Dec 15, 2012

Similarly (from Cichon's diagram)

  • &Kscr; is &#xD835;&#xDCA6; instead of &#x1D4A6; (but &Lscr; is ok)
  • &bfr; is &#xD835;&#xDD1F; instead of &#x1D51F;
  • &dfr; is &#xD835;&#xDD21; instead of &#x1D521;
@pkra

Member Author

commented Dec 15, 2012

Alright, somebody kindly pointed me to utf16 surrogate pairs -- I guess it's a missing conversion from the internal representation.

@pkra

Member Author

commented Dec 15, 2012

Also: 𝔈 𝔇 𝔉 𝔄𝔅.

@fred-wang

Contributor

commented Dec 15, 2012

I don't see &Tscr; in jax/input/MathML/entities/t.js, and in general I don't see surrogate pairs (at best I see multiple chars like ThickSpace). Is it on purpose that we don't include these characters? I remember that when we discussed support for more math fonts, Asana Math was the only option, but I think Davide told me that Asana's stretchy characters were outside the Unicode range accessible from JavaScript and so we would need to change their code points. Is it the same limitation here?

@dpvc

Member

commented Dec 15, 2012

Fred, the math alphabets are in separate files (scr.js, fr.js, opf.js), so they aren't in the files by letter.

Peter, my understanding is that the use of named entities in MathML is discouraged, and so I haven't done a lot to accommodate them. How well they are handled also depends on whether you are using the mml2jax preprocessor, and on the format of the file (XHTML versus HTML). The browser frequently does the entity translation before MathJax sees it.

MathJax does use UTF-16 internally (since that is what the browsers seem to use), but I guess you are right that the Show Math As MathML should convert the UTF-16 back to a single entity.

Fred, the Asana font is one of the few fonts where the stretchy characters are accessible to MathJax. Most of the fonts (like Cambria) put them at positions beyond what UTF-16 can address.

@pkra

Member Author

commented Dec 16, 2012

I'm not concerned with the named entities.

The key issue is that the MathML produced by "show MathML" will not be rendered by MathJax -- we shouldn't have that, methinks ;)

Converting to utf8 makes it work. Since some characters get converted to utf8, I'm wondering why others don't (in particular Kscr vs Lscr seems odd).

@dpvc

Member

commented Dec 16, 2012

I agree that the Show Math output needs to be corrected.

As for the difference between Kscr and Lscr, there are a number of script characters that are in the "Letterlike Symbols" Unicode block, and these are not duplicated in the "Mathematical Alphanumeric Symbols" block. Lscr is one such character: it is at U+2112, so it doesn't need a surrogate pair, whereas Kscr is in Plane 1, which requires a surrogate pair.
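To make the Lscr/Kscr difference concrete, here is a minimal sketch of the UTF-16 encoding rule. The helper names `toUTF16` and `hex` are made up for illustration and are not part of MathJax.

```javascript
// Hypothetical helpers, for illustration only (not MathJax code).
// Returns the UTF-16 code units for a Unicode code point.
function toUTF16(codePoint) {
  if (codePoint < 0x10000) return [codePoint];   // BMP: a single code unit
  var offset = codePoint - 0x10000;              // 20-bit offset into the supplementary planes
  return [0xD800 + (offset >> 10),               // lead surrogate: top 10 bits
          0xDC00 + (offset & 0x3FF)];            // trail surrogate: low 10 bits
}

// Formats code units as upper-case hex, separated by spaces.
function hex(units) {
  return units.map(function (u) { return u.toString(16).toUpperCase(); }).join(" ");
}

console.log(hex(toUTF16(0x2112)));   // Lscr, in the BMP: 2112
console.log(hex(toUTF16(0x1D4A6)));  // Kscr, in Plane 1: D835 DCA6
```

The second result is the same pair of code units that showed up as two separate character references in the Show Math output.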

@pkra

Member Author

commented Dec 17, 2012

Ah, thanks for the explanation!

@fred-wang

Contributor

commented Dec 17, 2012

Thanks Davide. I just had a quick look at the code. If I understand correctly, in toMathML.js MathJax.ElementJax.mml.entity.toMathML is used to convert the named entity and MathJax.ElementJax.mml.mbase.toMathMLquote to convert the chars. toMathMLquote does not seem to consider surrogate pairs, so I guess this must be fixed. I'm not sure what data[0] is when MathJax.ElementJax.mml.entity.toMathML is called. jax/element/mml/jax.js seems to do something with surrogate pairs in MathJax.ElementJax.mml.toString (I haven't tried to think too hard about that). Should MathJax.ElementJax.mml.entity.toMathML just use data[0].toString() instead, or has data[0] already been converted before?

I guess I can take this bug if that's not too difficult and that will be a good opportunity to add some documentation...

@dpvc

Member

commented Dec 17, 2012

Fred, thanks for looking into this. I'm on the road today, but will give some advice when I get home either tonight or tomorrow morning. I suspect it would be a good one for you to handle. More information when I respond later.

@dpvc

Member

commented Dec 18, 2012

OK, I've looked into it, and you are right: toMathMLquote is probably the best place to solve this. The mml.entity object doesn't actually come into play here, as either the browser or the MathML input jax will have converted the entities to Unicode characters already. (Originally, I was going to retain the entities and have them translated internally in the mml object, but because the browsers already convert entities to Unicode characters prior to MathJax seeing them, that ended up not being practical to do. The only time MathJax sees entities is if the math is originally entered in the page as MathJax <script type="math/mml">...</script> scripts.)

So the character references are really being handled via the mml.chars class, not mml.entity. The data[0] property for mml.chars is the actual character string, so in the case of &Kscr;, this will hold the pair of characters \uD835\uDCA6, which is the UTF-16 encoding for this character. The mml.chars.toString() function should just return the data[0] string, and when that is passed to toMathMLquote(), the individual characters are (incorrectly) converted to separate numeric character references, as you point out. They should be recombined to produce the single numeric entity &#x1D4A6;.

That should not be hard to do, and I will leave that to you to work out. See the Wikipedia entry for UTF-16 for details of the encoding scheme if you need them. Thanks for taking this one on.
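As a rough illustration of the recombination described here, the following sketch walks a string and emits one numeric character reference per code point. `quoteNonASCII` is a hypothetical stand-in, not MathJax's actual toMathMLquote; it also skips the escaping of &, < and > that a real serializer would need.

```javascript
// Sketch only: `quoteNonASCII` is a hypothetical name, not MathJax's toMathMLquote.
function quoteNonASCII(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var n = text.charCodeAt(i);
    if (n < 0x80) { out += text.charAt(i); continue; }   // plain ASCII is copied through
    if (n >= 0xD800 && n <= 0xDBFF && i + 1 < text.length) {
      var trail = text.charCodeAt(i + 1);
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        // Recombine lead + trail into a single code point and skip the trail.
        n = (n - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
        i++;
      }
    }
    out += "&#x" + n.toString(16).toUpperCase() + ";";
  }
  return out;
}

console.log(quoteNonASCII("\uD835\uDCA6"));  // &#x1D4A6;
console.log(quoteNonASCII("\u2112"));        // &#x2112;
```

In this sketch an unpaired surrogate is still emitted as its own reference; how to treat that case is a separate design choice.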

@fred-wang

Contributor

commented Dec 18, 2012

Thanks for the confirmation. So if data[0] just contains the surrogate pair(s) and only toMathMLquote is relevant, I guess I can do something to fix that issue.

=> assigned to myself

@ghost ghost assigned fred-wang Dec 18, 2012

@fred-wang

Contributor

commented Dec 20, 2012

Alright, I've submitted a fix to my issue361 branch:

96dc4b2

(only the unpacked/ version). I'm wondering what the rationale is for these split and join operations, but I have kept them anyway. I've also added an Emacs/Vim header to set the indentation style (so that, in particular, I'll keep Davide's 2-space indentation when I auto-indent with Emacs). I've also tried to mimic Davide's "dense" code style. I hope that's ok. I haven't done much testing, but Peter's test cases

<math>
  <mo>&Tscr;</mo>
  <mo>&Kscr;</mo>
  <mo>&bfr;</mo>
  <mo>&dfr;</mo>
  <mo>&Efr;</mo>
  <mo>&Dfr;</mo>
  <mo>&Ffr;</mo>
  <mo>&Afr;</mo>
  <mo>&Bfr;</mo>
</math>

serialize as

<math>
  <mo>&#x1D4AF;</mo>
  <mo>&#x1D4A6;</mo>
  <mo>&#x1D51F;</mo>
  <mo>&#x1D521;</mo>
  <mo>&#x1D508;</mo>
  <mo>&#x1D507;</mo>
  <mo>&#x1D509;</mo>
  <mo>&#x1D504;</mo>
  <mo>&#x1D505;</mo>
</math>

in the show Math menu.

@dpvc

Member

commented Dec 20, 2012

The code looks good. I would probably have done (n-0xD800) << 10 because in the old days shifting was faster than multiplication, but I suppose that isn't the case any more. There is also the issue of what to do with a lead surrogate that doesn't have a trail surrogate (which is invalid, but possible). The current code leaves the surrogate as a literal character; I might be tempted to remove it (string[i] = "") instead.
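For illustration, here is a small sketch combining both suggestions: the shift form of the computation, and dropping a lead surrogate that has no trail surrogate. `codePoints` is a hypothetical helper and not the code from the commit.

```javascript
// Hypothetical sketch; `codePoints` is illustrative and not MathJax code.
// Returns the code points of a UTF-16 string, dropping unpaired lead surrogates.
function codePoints(text) {
  var result = [];
  for (var i = 0; i < text.length; i++) {
    var n = text.charCodeAt(i);
    if (n >= 0xD800 && n <= 0xDBFF) {            // lead surrogate
      var trail = text.charCodeAt(i + 1);        // NaN when past the end of the string
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        // Shift form: (n - 0xD800) << 10 equals (n - 0xD800) * 0x400.
        result.push(((n - 0xD800) << 10) + (trail - 0xDC00) + 0x10000);
        i++;                                     // consume the trail surrogate
      }
      continue;                                  // an unpaired lead surrogate is dropped
    }
    result.push(n);
  }
  return result;
}

console.log(codePoints("\uD835\uDCAF")[0].toString(16));  // 1d4af
console.log(codePoints("a\uD835b").length);               // 2 (the lone lead is removed)
```

A trail surrogate with no preceding lead is passed through unchanged in this sketch; the comment above only addresses the lead case.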

Nice job.

@dpvc

Member

commented Dec 20, 2012

PS, the split and join are because we are replacing individual characters by multiple characters. This makes it easier to handle the loop counter and avoids the more complicated splicing of the string. It certainly could have been done in other ways.
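The split/join pattern can be sketched as follows. `replaceInPlace` is a made-up name, and the sketch handles only BMP characters to keep the focus on the loop counter.

```javascript
// Illustrative only: `replaceInPlace` is a hypothetical name, not MathJax code.
function replaceInPlace(text) {
  var chars = text.split("");                    // one array slot per UTF-16 code unit
  for (var i = 0; i < chars.length; i++) {
    var n = chars[i].charCodeAt(0);
    // A slot can grow from one character to several without shifting the
    // indices of the slots after it, so `i` needs no adjustment.
    if (n > 0x7F) chars[i] = "&#x" + n.toString(16).toUpperCase() + ";";
  }
  return chars.join("");
}

console.log(replaceInPlace("a\u2112b"));  // a&#x2112;b
```

Doing the same directly on the string would mean rebuilding it and advancing the index by the length of each inserted reference.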

@fred-wang

Contributor

commented Dec 20, 2012

OK, thanks for the explanation & suggestions. I'll make the changes you suggest in my next commit.

@fred-wang

Contributor

commented Jan 5, 2013

UI/show-source-2.html

=> In testsuite

@dpvc

Member

commented Feb 12, 2013

OK, this looks good. I'm marking "Ready for Release" and will merge into develop.

dpvc pushed a commit to dpvc/MathJax that referenced this issue Feb 12, 2013

@dpvc dpvc closed this May 17, 2013
