Proposal to refactor tex2jax extension

Davide P. Cervone edited this page Jul 14, 2016 · 4 revisions

The current tex2jax extension combines several functions into one process: scanning the DOM for strings where math could be found, scanning those strings for TeX delimiters, and removing the math strings from the DOM and inserting MathJax math script tags that will be typeset later. These operations should be decoupled into separate functions in version 3. There should be separate actions for locating the strings where math could be found, for scanning those strings for math delimiters, and for replacing those strings with typeset math (or other markup). This will make it possible to process not just HTML but other formats like Markdown or even just plain text strings outside of a DOM setting.

We want version 3 not to be so tied to the DOM as version 2 is, so the basic functionality should work on a string, not a DOM tree. That is, we need a function to locate TeX and AsciiMath within a string (the middle function from tex2jax). The strings might come from an HTML document (DOM), or could be from another source format, like Markdown. Of course, we can provide wrapper functions that do work on a DOM and call the string-based function internally.

For something like an HTML document, since we don't want to have HTML tags within our math (other than <br>), we will need to process multiple strings that come from the text nodes of the document. There can be multiple text nodes in a row, so we should be able to search a sequence of separate strings as though they were one string (the concatenation of the individual strings). While it would be possible to have the find-math function operate only on a single string and concatenate the text from multiple DOM text nodes into one string that is passed to it, that would complicate the process of mapping the located math strings back into the DOM nodes (so they can be removed and replaced by the typeset math). The find-math function will need to be called multiple times over the contents of the file, with each call being passed the text-node strings that should be considered one run of text (the text between HTML tags where math could be found).
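To make the bookkeeping concrete, here is a minimal sketch of how a position in the concatenated text relates to a position in the individual strings. The helper name `locate` and the `{index, char}` return shape are assumptions for illustration (the shape mirrors the start and end markers described later in this proposal); it is not part of any existing MathJax API.

```javascript
// Hypothetical helper (not part of the proposal): convert an offset in the
// concatenation of `strings` into {index, char} coordinates.
function locate(strings, offset) {
  let index = 0;
  // Step past whole strings, but never past the last one.
  while (index < strings.length - 1 && offset >= strings[index].length) {
    offset -= strings[index].length;
    index++;
  }
  return { index, char: offset };
}

// Three consecutive text-node strings that form one run of text.
const strings = ["The value ", "$x", "+1$ is positive."];
const joined = strings.join("");
const open = joined.indexOf("$");              // opening delimiter
const close = joined.indexOf("$", open + 1);   // closing delimiter
const start = locate(strings, open);
const end = locate(strings, close + 1);        // just past the closing "$"
console.log(start, end);
```

Here the math starts at the beginning of the second string and ends inside the third, which is the kind of information needed to remove the math from the corresponding DOM text nodes.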

For Markdown, we don't want to search sections in verbatim mode, for example, so we may need to break the Markdown file into separate sections where math could be found, and call the find-math function on each.

One of the key ideas, here, is that the same function can be used to locate math delimiters no matter what kind of text is used. That is, developers don't need to come up with their own regular expressions for locating math within their text (and indeed, regular expressions are not powerful enough to do it properly); instead, they can use exactly the same mechanism that MathJax does for finding the math on the page, and can then process that math as they see fit.

Locating Math within Strings

There will be a FindMath() function that locates mathematics within text strings. FindMath() should accept an array of strings, together with options that control the search. The return value is an array of objects that identify the math within these strings. A math expression can span multiple strings in the string array, so each returned object will include start and end markers that identify the index of the string in the array and the character within that string where the math starts and ends.

In order to process an HTML file, for example, an outer routine would walk the DOM tree looking for text nodes. It would collect together consecutive text nodes, possibly separated by <br> nodes, and build the string array from them, then call FindMath() on each collection of strings. The results can be combined into a larger array of math identifiers. The start and end markers could be augmented to include pointers to the text nodes that correspond to the strings in the array, so that the math could be removed from the DOM later when the typeset math is available.

Similarly, to process a Markdown file, the file could be broken into lines, and those lines scanned for verbatim markers (often `...`, or indentation by four blanks, though there are other reasons for indentation, so some sophistication needs to be used to identify verbatim blocks). The lines (or portions of lines) between these blocks can be passed to FindMath(), and the results can have their start and end indices adjusted to correspond to the original line numbers and positions in the file.

Routines to handle HTML and Markdown could be provided either as example code, or as core routines for actual use.

The FindMath() function

result = FindMath([string, ...], {options})

where options is an object like the following

{
  TeX: {
    disabled: false,
    delimiters: {
      inline: [ [open,close], ... ],
      display: [ [open,close], ...]
    },
    processEnvironments: true,
    processRefs: true,
    processEscapes: true
  },
  AsciiMath: {
    disabled: false,
    delimiters: [ [open,close], ...]
  }
}

which controls the processing of TeX and AsciiMath input within the strings. The disabled option allows you to prevent FindMath() from looking for that input format. The delimiters option allows you to specify open and close delimiters for the various input formats (multiple pairs can be given). The other options control what features to look for (as in the current tex2jax extension). Note that the array of strings is unchanged by this function.
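As an illustration, a concrete options object might look like the following. The delimiter values mirror the defaults of the version 2 tex2jax and asciimath2jax preprocessors; whether version 3 would keep those defaults is an assumption.

```javascript
// Example options for FindMath(); the values follow the version 2
// preprocessor defaults and are illustrative only.
const options = {
  TeX: {
    disabled: false,
    delimiters: {
      inline: [["\\(", "\\)"]],
      display: [["$$", "$$"], ["\\[", "\\]"]],
    },
    processEnvironments: true,
    processRefs: true,
    processEscapes: false,
  },
  AsciiMath: {
    disabled: false,
    delimiters: [["`", "`"]],
  },
};
```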

The result is an array of objects of the form:

{
  math: string,
  format: "TeX" or "AsciiMath",
  display: true or false,
  start: {index: i, char: n},
  end: {index: i, char: n}
}

where math is the text string containing the math expression, format indicates which of the input formats was found, display is true when the TeX display delimiters or AsciiMath delimiters were found, and start and end give the locations of the beginning and end of the math within the array of strings. Here index is the index of the string in the array, and char is the character position within that string.
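The following is a deliberately simplified sketch of what FindMath() could look like, not a proposed implementation. It handles only TeX open/close delimiters and ignores environments, \ref, escapes, brace matching, and AsciiMath. It also assumes that math holds the expression without its delimiters while start and end span the delimiters as well; the proposal does not settle either point.

```javascript
// Simplified sketch of FindMath(): TeX delimiters only.
function FindMath(strings, options) {
  const tex = (options && options.TeX) || {};
  if (tex.disabled) return [];
  const delimiters = tex.delimiters || {};
  // Collect all pairs; longer opening delimiters first so "$$" beats "$"
  // when both match at the same position (Array.prototype.sort is stable).
  const pairs = [
    ...(delimiters.display || []).map(([open, close]) => ({ open, close, display: true })),
    ...(delimiters.inline || []).map(([open, close]) => ({ open, close, display: false })),
  ].filter((pair) => pair.open && pair.close)
   .sort((a, b) => b.open.length - a.open.length);

  // Search the strings as though they were one string.
  const text = strings.join("");
  const locate = (offset) => {
    let index = 0;
    while (index < strings.length - 1 && offset >= strings[index].length) {
      offset -= strings[index].length;
      index++;
    }
    return { index, char: offset };
  };

  const result = [];
  let pos = 0;
  while (pos < text.length) {
    // Find the earliest opening delimiter at or after pos.
    let best = null;
    for (const pair of pairs) {
      const at = text.indexOf(pair.open, pos);
      if (at >= 0 && (best === null || at < best.at)) best = { at, pair };
    }
    if (best === null) break;
    const mathStart = best.at + best.pair.open.length;
    const closeAt = text.indexOf(best.pair.close, mathStart);
    if (closeAt < 0) { pos = mathStart; continue; }  // unmatched open: skip it
    const endOffset = closeAt + best.pair.close.length;
    result.push({
      math: text.slice(mathStart, closeAt),
      format: "TeX",
      display: best.pair.display,
      start: locate(best.at),
      end: locate(endOffset),
    });
    pos = endOffset;
  }
  return result;
}

// Usage: the first expression spans two strings.
const found = FindMath(
  ["Let $x", "^2$ and $$y$$ be given."],
  { TeX: { delimiters: { inline: [["$", "$"]], display: [["$$", "$$"]] } } }
);
console.log(found);
```

Note that the input array is not modified; the function only reports where the math is.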

Note: This object may need to be augmented later to include values like the line-break width (if line breaking is in effect), and the em and ex sizes of the surrounding font.

Question: this proposal uses the current TeX and AsciiMath input formats. Do we want a general means of adding new input formats that works through this same function, or would new formats require a new function? That is, do input formats get "registered" (like they do now), and get incorporated into this search automatically (provided they have open and close delimiters of some kind)? Is there an appropriate abstraction for the other parameters for the TeX input (like \begin...\end, and \ref)? What if a format doesn't have specific delimiters, like AsciiMath's automatic expression detection (not currently supported in MathJax)?

One alternative would be to have separate functions for finding each of the formats (e.g. FindTeX(), FindAsciiMath(), etc.), and have FindMath() call a sequence of these, one for each input format that is registered. This is similar to how MathJax currently works, and allows formats to be added easily. One issue would be that you would not want a later format to look through math that was already identified by an earlier step. So this would require making new string arrays with the previously located math removed, and calling the later input formats on those modified arrays (multiple times, e.g., once on the strings before a found math expression, and once on the strings after it). This would make for a more complicated internal pipeline, with more temporary string arrays and the associated garbage collection, but would be more flexible in terms of adding new input formats.
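A rough sketch of that alternative follows. For brevity it works on a single string with flat character offsets rather than on string arrays, and the names makeFinder and findWithFinders are hypothetical. Each finder is only shown the gaps left by the finders that ran before it, so a later format never looks inside math that was already identified.

```javascript
// Hypothetical per-format finder built from one open/close delimiter pair.
function makeFinder(format, open, close) {
  return function (text) {
    const found = [];
    let pos = 0;
    while (true) {
      const start = text.indexOf(open, pos);
      if (start < 0) break;
      const closeAt = text.indexOf(close, start + open.length);
      if (closeAt < 0) break;
      const end = closeAt + close.length;
      found.push({ math: text.slice(start + open.length, closeAt), format, start, end });
      pos = end;
    }
    return found;
  };
}

// Run the finders in order; each one sees only text not yet claimed.
function findWithFinders(text, finders) {
  let gaps = [{ from: 0, to: text.length }];
  const results = [];
  for (const finder of finders) {
    const nextGaps = [];
    for (const gap of gaps) {
      let from = gap.from;
      for (const item of finder(text.slice(gap.from, gap.to))) {
        // Shift the finder's local offsets back into the full text.
        const start = item.start + gap.from;
        const end = item.end + gap.from;
        results.push({ ...item, start, end });
        if (start > from) nextGaps.push({ from, to: start });  // text before the math
        from = end;
      }
      if (from < gap.to) nextGaps.push({ from, to: gap.to });  // text after the math
    }
    gaps = nextGaps;
  }
  return results.sort((a, b) => a.start - b.start);
}

// The backticks inside the TeX expression are not reported as AsciiMath.
const sample = "TeX $a `b` c$ and ascii `x/y` here";
const mixed = findWithFinders(sample, [
  makeFinder("TeX", "$", "$"),
  makeFinder("AsciiMath", "`", "`"),
]);
console.log(mixed);
```

The temporary slices created for each gap are the extra allocations that the paragraph above refers to.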

Finding Math in HTML

For finding math within an HTML document, there should be a wrapper function, FindMathInHTML(), that locates the text strings in the document and calls FindMath() to identify the mathematics within those strings.

result = FindMathInHTML(document,{options})

where document is an HTML document body (or perhaps just an HTML fragment), and options is an object like the following:

{
  skipTags: [ node-name, ... ],
  ignoreClass: regex,
  processClass: regex,
  TeX: {
    ...
  },
  AsciiMath: {
    ...
  },
  MathML: {
    disabled: false,
    xmlns: "m"
  }
}

where skipTags is an array of tag names that are not searched for math (e.g., ["script","noscript","style","object","embed","iframe","pre","code"]), ignoreClass and processClass give regular expressions for class names that indicate when an element should be ignored or scanned for math, TeX and AsciiMath give objects like those described in FindMath() above to control searching for TeX and AsciiMath delimiters, and MathML is a similar object that controls the search for MathML nodes in the DOM.

The result is an array of the following:

{
  math: string or MathML DOM tree,
  format: "TeX" or "AsciiMath" or "MathML",
  display: true or false,
  start: {node: DOM-text-node, char: n},
  end: {node: DOM-text-node, char: n}
}

These properties are the same as the ones for FindMath() above, except that start and end now contain pointers to DOM nodes in the document rather than indices in a string array. Note that the document is not changed by this function.
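The conversion from string-array indices to DOM nodes could be as small as the following sketch. The function name toNodePositions is hypothetical, and it assumes the caller kept an array of text nodes that is parallel to the string array passed to FindMath().

```javascript
// Hypothetical fix-up step: nodes[i] is the text node that supplied strings[i].
function toNodePositions(results, nodes) {
  return results.map((item) => ({
    ...item,
    start: { node: nodes[item.start.index], char: item.start.char },
    end: { node: nodes[item.end.index], char: item.end.char },
  }));
}

// Plain objects stand in for DOM text nodes so the sketch runs anywhere.
const nodes = [{ nodeValue: "Let $x" }, { nodeValue: "=1$." }];
const fromFindMath = [{
  math: "x=1", format: "TeX", display: false,
  start: { index: 0, char: 4 },
  end: { index: 1, char: 3 },
}];
const located = toNodePositions(fromFindMath, nodes);
console.log(located[0].start.node.nodeValue, located[0].end.node.nodeValue);
```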

Internally, FindMathInHTML() will locate text strings within the DOM and call FindMath() to identify the math expressions within those strings.

Question: should the options to FindMath() (namely the TeX and AsciiMath blocks) be separated out into a special property that holds the object to pass to FindMath()? The alternatives are to pass the complete options object (where FindMath() will simply ignore the options it doesn't recognize), or to create a new structure from the TeX and AsciiMath fields. If we allow input formats to be registered, that would complicate creating a new substructure.

Locating the strings could be done by a function FindTextInHTML() that produces the string arrays that are needed by FindMath(); FindMathInHTML() would call this, and then call FindMath() on them, fixing up the resulting objects to point to the proper DOM nodes, and concatenating the results together into one large array.
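One possible shape for FindTextInHTML() is sketched below. It is an assumption-laden illustration: it honors skipTags but not ignoreClass or processClass, it treats <br> as not ending a run, it treats every other element boundary as ending one, and it returns each run as a pair of parallel arrays (strings and nodes), which is a format the proposal does not specify. It relies only on nodeType, nodeName, nodeValue, and childNodes, so it can be tried on plain objects as well as on a real DOM.

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Sketch: collect runs of consecutive text nodes below root.
function FindTextInHTML(root, options) {
  const skip = new Set(((options && options.skipTags) || []).map((tag) => tag.toLowerCase()));
  const runs = [];
  let current = null;  // the run being built, or null if none is open
  function walk(node) {
    for (const child of node.childNodes || []) {
      if (child.nodeType === TEXT_NODE) {
        if (current === null) {
          current = { strings: [], nodes: [] };
          runs.push(current);
        }
        current.strings.push(child.nodeValue);
        current.nodes.push(child);
      } else if (child.nodeType === ELEMENT_NODE) {
        const name = child.nodeName.toLowerCase();
        if (name === "br") continue;      // <br> does not end a run
        current = null;                   // any other tag ends the run
        if (!skip.has(name)) walk(child);
        current = null;                   // text after the element starts a new run
      }
    }
  }
  walk(root);
  return runs;
}

// Mock nodes for demonstration; in a browser, pass document.body instead.
const text = (value) => ({ nodeType: TEXT_NODE, nodeName: "#text", nodeValue: value });
const el = (name, ...kids) => ({ nodeType: ELEMENT_NODE, nodeName: name.toUpperCase(), childNodes: kids });
const body = el("body",
  el("p", text("Let $x"), el("br"), text("=1$.")),
  el("pre", text("$not math$")),
  el("p", text("Done")));
const runs = FindTextInHTML(body, { skipTags: ["pre", "code"] });
console.log(runs.map((run) => run.strings));
```

Each run's strings array can be passed to FindMath(), and its nodes array gives what is needed to point the results back at the DOM.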

Question: Should there be an option to FindMathInHTML() that specifies the function to call for finding the math? This would default to FindMath(), but could be overridden to call an author-supplied function (say one that searches for un-delimited expressions, like AsciiMath's automatic mode). Or would the existence of FindTextInHTML() and FindMath() be sufficient (perhaps with separating out the function that modifies the start and end values so that that can be reused as well) for an author to code up their own DOM search that has a different means of finding the math?

Finding Math in Markdown

For finding math within a Markdown document, there should be a wrapper function, FindMathInMarkdown(), that locates the text strings in the document and calls FindMath() to identify the mathematics within those strings.

result = FindMathInMarkdown(string,{options})

where string is a Markdown string to search for math, and options is an object like the one for FindMath() above.

The result is an array of objects of the form

{
  math: string,
  format: "TeX" or "AsciiMath",
  display: true or false,
  start: n,
  end: n
}

where the math, format, and display properties are as described above, and start and end are now indices into the original string. Note that the string is left unchanged by this function; it simply identifies the math within the document.

Internally, FindMathInMarkdown() will locate text strings within the document and call FindMath() to identify the math expressions within those strings. Locating the strings could be done by a function FindTextInMarkdown() that produces the string arrays that are needed by FindMath(); FindMathInMarkdown() would call this, and then call FindMath() on them, fixing up the resulting objects to point to the proper character positions, and concatenating the results together into one large array.
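A minimal sketch of this arrangement is shown below, under several assumptions: only fenced code blocks that open with three backticks are treated as verbatim (indented blocks and inline code spans are ignored), FindTextInMarkdown() returns segments that carry their offset in the original string rather than bare string arrays, and the math finder is passed in as a third argument so the example is self-contained. The findDollarMath function is only a stand-in for FindMath().

```javascript
// Sketch: return the parts of the Markdown string outside ``` fences.
function FindTextInMarkdown(markdown) {
  const segments = [];
  const fence = /^ {0,3}```/;
  let inFence = false;
  let offset = 0;    // offset of the current line in the original string
  let segStart = 0;  // where the current non-verbatim segment began
  for (const line of markdown.split("\n")) {
    if (fence.test(line)) {
      if (!inFence && offset > segStart) {
        segments.push({ text: markdown.slice(segStart, offset), offset: segStart });
      }
      inFence = !inFence;
      if (!inFence) segStart = offset + line.length + 1;  // resume after the closing fence
    }
    offset += line.length + 1;
  }
  if (!inFence && segStart < markdown.length) {
    segments.push({ text: markdown.slice(segStart), offset: segStart });
  }
  return segments;
}

// Sketch: call the finder on each segment and convert {index, char}
// positions into offsets in the original string.
function FindMathInMarkdown(markdown, options, findMath) {
  const results = [];
  for (const segment of FindTextInMarkdown(markdown)) {
    for (const item of findMath([segment.text], options)) {
      results.push({
        ...item,
        start: segment.offset + item.start.char,
        end: segment.offset + item.end.char,
      });
    }
  }
  return results;
}

// Stand-in for FindMath(): single-dollar inline math in one string only.
function findDollarMath(strings) {
  const found = [];
  const text = strings[0];
  let open = text.indexOf("$");
  while (open >= 0) {
    const close = text.indexOf("$", open + 1);
    if (close < 0) break;
    found.push({
      math: text.slice(open + 1, close), format: "TeX", display: false,
      start: { index: 0, char: open }, end: { index: 0, char: close + 1 },
    });
    open = text.indexOf("$", close + 1);
  }
  return found;
}

const md = "Area $\\pi r^2$.\n```\ncost: $5 or $6\n```\nThen $x$.";
const mdMath = FindMathInMarkdown(md, {}, findDollarMath);
console.log(mdMath);
```

The dollar signs inside the fenced block are never passed to the finder, and the reported start and end values index directly into the original string.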

Processing the Math Objects

Once you have an array of math objects like those described above, you would go on to perform the input processing and output generation that would be needed to produce the final typeset mathematics. That will be a separate proposal, but could involve:

  • Passing the array of math to a function like CompileMath() that would run the proper input handler on each object and add a mml field that holds the MathJax internal MathML object tree.
  • Passing the array to a function like GetMetrics() that would determine the container width (for line breaking), and the ex and em sizes of all the math (while only performing two reflows), and add those values into the math objects.
  • Passing the array to a function like TypesetMath() that would produce the typeset output in the specified format and add that to the math objects.
  • Passing the array to a function like AddMenu() that would add the MathJax menu to the typeset math.
  • Passing the array to a function InsertIntoHTML() that would insert the math into the HTML document at the correct places.

Of course, the array of math objects could itself be an object (e.g., MathCollection), and these functions could be methods of that object class. So you might be able to do something like

var math = FindMathInHTML(document,{...});
math.CompileMath()
    .GetMetrics()
    .TypesetMath("CommonHTML")
    .AddMenu()
    .InsertIntoHTML();

or

var math = FindMathInMarkdown(markdownString,{...});
var md = math.RemoveFromMarkdown(markdownString);  // replaces math with special markers
    md = ProcessMarkdown(md);                      // runs the markdown engine
    md = math.ReplaceInMarkdown(md);               // puts the math back
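As a sketch of how the remove and replace steps might work, the following uses standalone functions rather than methods, and a marker of the form @@MATHn@@. Both the marker format and the render callback are assumptions; the one-line emphasis replacement merely stands in for a real Markdown engine. The math array is assumed to be sorted by start and to contain no overlapping entries.

```javascript
// Sketch: swap each located math span for a marker that the Markdown
// engine should pass through untouched.
function RemoveFromMarkdown(markdown, math) {
  let out = "";
  let pos = 0;
  math.forEach((item, i) => {
    out += markdown.slice(pos, item.start) + "@@MATH" + i + "@@";
    pos = item.end;
  });
  return out + markdown.slice(pos);
}

// Sketch: put rendered math back in place of the markers.
function ReplaceInMarkdown(text, math, render) {
  // A function replacement is inserted literally, so "$" in the output is safe.
  return text.replace(/@@MATH(\d+)@@/g, (match, i) => render(math[Number(i)]));
}

const source = "Euler: $e^{i\\pi}+1=0$ is *famous*.";
const start = source.indexOf("$");
const end = source.lastIndexOf("$") + 1;
const items = [{ math: source.slice(start + 1, end - 1), format: "TeX", display: false, start, end }];

const stripped = RemoveFromMarkdown(source, items);
const html = stripped.replace(/\*([^*]+)\*/g, "<em>$1</em>");  // stand-in Markdown engine
const finished = ReplaceInMarkdown(html, items, (item) => '<span class="math">' + item.math + "</span>");
console.log(finished);
```

Protecting the math this way keeps the Markdown engine from interpreting characters such as underscores or asterisks inside the expressions.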

Those details have yet to be specified, in particular how to handle the potential need to load auxiliary files and how to synchronize that process.