Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move mj-page out of mathjax-node #206

Closed
pkra opened this issue Apr 11, 2016 · 18 comments
Closed

move mj-page out of mathjax-node #206

pkra opened this issue Apr 11, 2016 · 18 comments

Comments

@pkra
Copy link
Contributor

pkra commented Apr 11, 2016

As per F2F (cf #191), we will move mj-page out of mathjax-node; there will be alternative modules to use.

@pkra
Copy link
Contributor Author

pkra commented Apr 15, 2016

[EDIT: disregard this comment -- it was supposed to be on #205]

Here's the first draft for a separate module based on svg2png

var mjAPI = require("mathjax-node/lib/mj-single.js");

exports.math2png = function(mjoptions, pngoptions, callback){
  mjAPI.start();
  var svg2png = function(result, options, callback){
    if (result.errors) return result.errors;
    var svgpng = require('svg2png');
    var sourceBuffer = new Buffer(result.svg, "utf-8");
    var returnBuffer = svgpng.sync(sourceBuffer, {
      width: result.svg.width,
      height: result.svg.height
    });
    result.png = {};
    result.png.data = "data:image/png;base64," + returnBuffer.toString('base64');
    // maybe have something for scaling / dpi?
    return callback(result);
  };
  // make sure SVG output will be generated
  mjoptions.svg = true;
  return mjAPI.typeset(mjoptions, function(result){
    console.log();
    svg2png(result, pngoptions, callback);
  });
}

@dpvc
Copy link
Member

dpvc commented Apr 17, 2016

[EDIT: you can disregard this comment -- it is a response to a misplaced comment regarding #205]

A couple of comments:

  1. This doesn't use mjAPI.config(), and that means you can't control the font, for example. You don't necessarily have to allow full access to mjAPI.config(), but allowing configuration of fonts is probably a good idea. I suppose the application could itself require mj-single and call mjAPI.config() itself, but perhaps that needs to be encompassed in another (configuration) module? I haven't really thought much about how these modules should be configured.
  2. That brings up the question of the mjoptions and pngoptions arguments. As it stands the mjoptions could specify creation of HTML or MathML output, and those formats would be generated and discarded (inefficient). Perhaps you want to make your own option object for the typeset() call and copy the options that make sense from the mjoptions object rather than passing them all in unfiltered.

Also, the pngoptions are not used anywhere (I assume these are for future use). I'm wondering if there need to be two sets of options. If you create your own options for mathjax from the ones passed in, you could dal so create your own options for the png conversion from one option object that the user passes to math2png.
3. Your current approach will require that a new DOM be creates and a new copy of MathJax be loaded, initialize, and run each time math2png is called. That is very expensive, both in terms of time, and in memory usage (creating, freeing, and garbage collecting lots of stuff over and over again if you process multiple equations). One of the reasons for the current organization of the library was to allow you to start one copy of MathJax and have it process multiple expressions. It would be good to be able to take advantage of that, here.

The main culprit is the mjAPI.start() call, since that forces the new DOM and new copy of MathJax. Fortunately, however, the mjAPI.typeset() will perform the start() call if one hasn't already been done already. So if you remove the start() call, then you should be using the same copy of MathJax for all the typesetting.
4. It's a little odd to store the result of require('svg2png') as svgpng rather than svg2png and to have some other variable called svg2png. This can lead to confusion in the future (it already confused me). Perhaps renaming svg2png as createPNG or something similar, and renaming svgpng as svg2png would help.
5. Your svg2png function is the same for every invocation, and does not rely on the arguments to math2png (they are passed to it later), so there is no need for the closure, here. Why not just move it outside and reuse it rather than creating a new function each time math2png is called?

Similarly, the svg2png package could be loaded just once (not each time svg2png is called), just like mjAPI is. So you could structure it more like the following:

var mjAPI = require("mathjax-node/lib/mj-single.js");
var svg2png = require('svg2png');

function createPNG (result, options, callback) {
  if (result.errors) return result.errors;
  var sourceBuffer = new Buffer(result.svg, "utf-8");
  var returnBuffer = svg2png.sync(sourceBuffer, {
    width: result.svg.width,
    height: result.svg.height
  });
  result.png = {};
  result.png.data = "data:image/png;base64," + returnBuffer.toString('base64');
  // maybe have something for scaling / dpi?
  return callback(result);
};

exports.math2png = function (mjoptions, pngoptions, callback) {
  // make sure SVG output will be generated
  mjoptions.svg = true;
  return mjAPI.typeset(mjoptions, function (result) {
    createPNG(result, pngoptions, callback);
  });
}
  1. I don't like the use of svgpng.sync, since that means you can't use this very easily within a web service (that wants to operate asynchronously so as to still be able to handle other requests while this one is being serviced) without having to spawn child processes to handle the requests. While the latter is possible, it is a more sophisticated technique that not everyone will want to use. It is also much more resource intensive (in particular, it would require a new child process to be created, with a new copy of MathJax to be run for every expression). Does svg2png have an asynchronous method? I'd recommend using that, especially since you already have to use a callback for the MathJax typeset() call.
  2. I would not put the test for result.errors in the createPNG() (or svg2png()) function. It seems more natural to put that test in the callback for the typeset() action, since if there were errors, you don't want to try to create the PNG. Something like:
mjAPI.typeset(mjoptions, function (result) {
    if (results.errors) callback(results);
                   else createPNG(result, pngoptions, callback);
  });

should do it. Here, if there are errors, the callback is called immediately with the results so fat (including the error indications), otherwise createPNG() is called. Note that in your code, returning result.errors does nothing, since the value of createPNG() is not used, and the user's callback is never called (so their code likely hangs forever). Also, note that the result of the myAPI.typeset() call is always null, so there is no reason to return it (so I have removed the return from before it). Similarly, there is no need for the return in the return callback(results) in the createPNG() function, since (again) the return value of createPNG() is never used.
8. Finally, I'm not sure what other data you have in mind for the results.png object, but it would be more efficient to do

          result.png = {
            data: "data:image/png;base64," + returnBuffer.toString('base64')
          };

rather than creating an empty object and then adding a property to it.

That is what I see at the moment.

@pkra
Copy link
Contributor Author

pkra commented Apr 20, 2016

[EDIT: you can disregard this comment -- it is a response to a misplaced comment regarding #205]

Thanks for these. As mentioned, it was a first draft just to accompany the PR. It did not require such a detailed response.

@fptoth
Copy link

fptoth commented Apr 28, 2016

Hi,

I don't understand the MathJax internals well enough to follow all of the above, but it might be useful to document how and why we currently use mj-page.js:

We have a production tool that does error checking on large numbers of full text XML files which include LaTeX markup. It's not uncommon for an XML file to have thousands of equations nor is it uncommon for us to process thousands of XML files daily.

In this tool we don't care about MathJax rendering at all. We're only interested in finding TeX errors.

For each XML file being processed, our tool extracts all of the individual pieces of TeX and creates a single HTML file containing only the equations and no other markup.

We then run this HTML through mathjax-node, using your components that are customized for our use. These components are derived as follows:

  • Our main "driver" is based on bin/page2mml, but has been modified to ignore HTML output. Instead it gathers any MathJax errors and returns them to the caller in JSON format.
  • The "driver" uses a customized lib/mj-page.js, just as does bin/page2mml. The customizations include additions of our internal macros. We have also experimented with turning off various extensions to get better speed. But this is optional (and very tricky to get right).
  • Also, as documented in MathJax-node fails to report \c error #224, we think we've found a bug that is worked-around by turning off AsciiMath pre-processing.

Obviously we can continue to use our customized mj-page.js in the future, but this thread suggests that there is a better way to accomplish what we're doing.

How would we approach the above without mj-page.js? What's the better way?

Thanks,
Fred

@pkra
Copy link
Contributor Author

pkra commented Apr 29, 2016

Thanks for sharing your setup, Fred!

First off, I just realized that the preceding discussion is on the wrong issue -- which is totally my fault. The comments 2,3, and 4 are about #205, moving PNG support out of mathjax-node. Sorry for the confusion! 😞

Back to your setup. Your setup is, I think, a good example why I initially proposed to drop mj-page.

The question is: why are you using mj-page in the first place? You get individual TeX fragments out of your XML, so mj-single seems more natural to use here.

I'm guessing the motivation for going to mj-page might be performance, thinking the overhead of jsdom and MathJax would be large enough to make it faster to do mj-page than loop through the fragments and call mj-single for each one. However, we noticed that that's not actually the case; performance is pretty much identical (both methods can be slightly faster than the other, depending on the context but they're virtually the same). So a key motivation for mj-page from our end is no longer present.

Additionally, mj-page can lead to subtle problems. Your bug report is a nice example of this and also for how too many options seem to prevent people from finding them. The fact that we have mj-page leads to other questions/requests, e.g., having an equivalent tool for XML (where jsdom sometimes causes issues), or for markdown files, or having something using alternative DOM-like libraries (e.g., cheerio).

Adding solutions for these would increase mathjax-node's maintenance burden considerably. So we think it's better to leave those tools in the hands of the community, while providing ideas and of course help with any questions.

As part of this deprecation and our work on MathJax v3.0, we will be providing a modernized pre-processor library that will allow developers to incorporate pre-processing more easily into their workflows. Though in your case, you don't seem to need this.

How would we approach the above without mj-page.js? What's the better way?

To recap. I'd expect a) you could keep using something like mj-page or b) you could just loop through your TeX fragments with mj-single and collect the errors.

@kovidgoyal
Copy link

I am in the process of writing an in browser ebook reader application for the calibre content server (this allows reading ebooks directly in the browser, loaded from an embedded HTML server running in calibre). I am investigating adding support for MathML by using MathJax (similar to how the calibre desktop ebook viewer renders maths using MathJax). The server already does some pre-processing on individual HTML files, so for me the ability to server side process a HTML page contain math is invaluable -- as this is both more performant, and it means I dont have to patch MathJax itself to change how it loads its resources (I need the in browser reader to work offline).

Bottom line, the ability to render math in HTML files server side is valuable to me, so I would like to vote against removing this facility.

@fptoth
Copy link

fptoth commented May 1, 2016

Peter, you're correct that we process a page at a time for performance. But to give you more detail: Our QC tool is written in Java. It creates the page with an article's worth of equations and then runs a single invocation of node to process that file. In other words, node is not running continuously, but is called via command line, once for each article.

So, if I have 1000 XML files, that's 1000 calls to node. But if I use single equation processing, and have 1000 XML files, each with 1000 equations, that's 1,000,000 calls to node.

But, I take your point and I appreciate your advice. We'll keep this issue on our radar as we move forward.

Thanks,
Fred

@pkra
Copy link
Contributor Author

pkra commented May 2, 2016

@kovidgoyal thanks for sharing your use case!

Since I've worked on ebook pre-processing quite a bit, I can say that mj-page is somewhat problematic for xhtml content given jsdom's problems with XML. Building a small script like mj-page but using a real XML tool has been much more reliable (e.g., libxmljs).

@fptoth to clarify: I'm not suggesting to call mj-single repetitively but to write a small wrapper that re-uses the same mj-single instance, passing it one equation after another and collecting the errors. That's really just a few lines of code for your setting I think, e.g., wouldn't something like https://gist.github.com/pkra/d2894330e784f27dbc6e8b12763ca972 be enough for your use case?

@pkra
Copy link
Contributor Author

pkra commented May 2, 2016

[Edited the above: "not suggesting", @fptoth.]

@kovidgoyal
Copy link

@pkra calibre already has the tools to serialize arbitrary html4/5/xhtml into canonical representations -- parsing is not a problem for me. However, after exploring this some more, I decided to just use mathjax client side, storing it in indexeddb for offline mode. Turns out that the patching mathjax requires for this is minimal.

The problem with mathjax-node is the dependency chain is too large to bundle with calibre (since calibre is cross platform, i cannot simply rely on using system libraries). Someday I might decide to revisit this, but most likely in that case I'll end up just porting the parts of mathjax that I need to run in python directly, server side.

@pkra
Copy link
Contributor Author

pkra commented May 2, 2016

I decided to just use mathjax client side, storing it in indexeddb for offline mode. Turns out that the patching mathjax requires for this is minimal.

Oh, would be interested in hearing more. Just don't use the HTML-CSS output for that -- it will break easily across browser&OS versions.

The problem with mathjax-node is the dependency chain is too large to bundle with calibre

Copy that. MathJax v3.0 should make that significantly lighter (ideally: zero).

@kovidgoyal
Copy link

This is client side, so why would HTML+CSS output break? What I am doing is exactly the same as adding MathJax to a normal HTML page, the only difference is that MathJax is loaded from indexeddb instead of from CDN.

@pkra
Copy link
Contributor Author

pkra commented May 2, 2016

This is client side, so why would HTML+CSS output break?

Sorry! I hand't read your posting carefully enough. It probably won't break anything if you're re-using in the same browser session (well, I guess changes in the surrounding CSS (text size, font etc) might cause minor issues). But if you're using indexddb, then you can use the CommonHTML output anyway, which is more robust and faster so I'd still use that (and you could then even use that to generate something permanently to send to the server and use across sessions).

@kovidgoyal
Copy link

Yeah, probably a good idea to use CommonHTML -- I am only using HTML+CSS to start with as I am more familiar with how it works. Once I have that working, I can always switch it to CommonHTML fairly easily.

@pkra
Copy link
Contributor Author

pkra commented May 24, 2016

Here's a random question that popped into my mind: what shall we call mj-single after we remove mj-page? A list of things seen in the wild

  • main.js
  • lib.js
  • index.js
  • app.js
  • mathjax-node.js

@pkra pkra modified the milestones: v1.0, What comes next May 26, 2016
@pkra
Copy link
Contributor Author

pkra commented Jun 24, 2016

Here's a simple example (for bin) just doing mathml in an html fragment https://gist.github.com/pkra/1642fd00bc9430f7a6ffb84f8cfd2ddd.

@pkra
Copy link
Contributor Author

pkra commented Oct 26, 2016

[removed; wrong issue]

@pkra
Copy link
Contributor Author

pkra commented Oct 27, 2016

I've started to work on a replacement for mj-page. For now, the code is at https://github.com/pkra/mathjax-node-page.

Feedback would be very welcome.

@pkra pkra closed this as completed Nov 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants