# Object generation with proctex.py

proctex.py is a script that generates document and equation objects and then automatically exports them to .pkl (document) and .json (document & equation) files.

Here's a quick example of how the latexmlmath command works on something simple, like the fraction $\frac{3}{4}$:

In [32]:
import subprocess
from subprocess import PIPE
proc = subprocess.Popen(["latexmlmath", "--quiet", "-"], stderr = PIPE, stdout = PIPE, stdin = PIPE)
intext = "\\frac{3}{4}"
stdout, stderr = proc.communicate(intext)
print(stdout)

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\frac{3}{4}" display="block">
  <mfrac>
    <mn>3</mn>
    <mn>4</mn>
  </mfrac>
</math>



latexmlmath can output both content MathML and presentation MathML. The LaTeXML documentation currently describes content MathML output as being under development.

### Classes

The equation, document, and archive classes are all located in core/texclasses.py, and provide a navigable data structure that can be exported to both pickle (.pkl) and JSON (.json) file formats.

The commented-out code corresponds to the generation of equation metadata (e.g. text/word tokens occurring before and after the equation).

The numerous try statements address one of the issues with a non-standardized corpus of LaTeX documents: the files use multiple encoding schemes, whereas the JSON library can only handle Unicode/UTF-8 text. Python 2 by default tries to preserve the original text's encoding, which can then lead to UnicodeDecodeErrors.
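
One way to reduce the number of try statements is to normalize every piece of text to UTF-8 once, before it reaches the JSON library or a subprocess. The following is only a minimal sketch: `to_utf8_bytes` is a hypothetical helper that is not part of proctex.py, and the `latin-1` fallback is an assumption about the corpus rather than a detected encoding.

```python
# Hypothetical helper (not part of proctex.py): normalize LaTeX source of
# unknown encoding to UTF-8 bytes before handing it to JSON or a subprocess.

def to_utf8_bytes(raw, fallback="latin-1"):
    """Return `raw` as UTF-8 encoded bytes.

    `raw` may be text or bytes. Bytes are decoded as UTF-8 first; if that
    fails they are decoded with `fallback`, which is an assumption about the
    corpus and not the result of any encoding detection.
    """
    if isinstance(raw, bytes):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # latin-1 maps every byte to a character, so this cannot fail,
            # but it may mis-render text that used some other encoding.
            text = raw.decode(fallback)
    else:
        text = raw
    return text.encode("utf-8")


print(to_utf8_bytes(u"\\frac{3}{4}"))
print(to_utf8_bytes(b"caf\xe9"))
```

Because the fallback never raises, a helper like this trades the UnicodeDecodeError for the possibility of silently wrong characters, so it would still be worth logging which files needed the fallback.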

Because entire archive class objects are excessively large and cannot be processed in parallel, the archive document type will likely be deprecated in favor of generating objects purely on a per-document/per-equation basis.

In [43]:
import subprocess
from subprocess import PIPE
from nltk.tokenize import word_tokenize
import os
import pickle
class equation:
    def __init__(self,eqtext,fname, desig = 'latex'):
        self.text = eqtext
        self.type = desig
        self.itemtype = "equation"
        # self.prevtext = ""
        # self.nexttext = ""
        # self.prevtexttoks = []
        # self.nexttexttoks = []
        self.file = fname
        self.mathml = ""
        proc = subprocess.Popen(["latexmlmath", "--quiet", "-"], stderr = PIPE, stdout = PIPE, stdin = PIPE)
        try:
            stdout, stderr = proc.communicate(self.text)
        except:
            print("{}: Text encoding error occurred. Encoding to utf-8...".format(fname))
            try:
                stdout, stderr = proc.communicate(self.text.encode('utf-8'))
                print("{}: Alternate encoding successful".format(fname))
            except:
                print("{}: Encoding failed - MathML invalid".format(fname))
                self.mathml = ""
                return
        if proc.returncode !=0:
            self.mathml = ""
            print("{}: Encountered MathML equation error".format(fname))
        else:
            stdout = stdout.strip()
            self.mathml = stdout

    def __str__(self):
        return self.text

    def __repr__(self):
        return self.text

    # def gentokens():
    #     self.prevsenttoks = word_tokenize(self.prevsent)
    #     self.nextsenttoks = word_tokenize(self.nextsent)

    def tojson(self):
        return self.__dict__

class document:
    def __init__(self, fname,textarray):
        self.name = fname
        self.array = textarray
        self.itemtype = "document"
        # arraylen = len(self.array)
        # for i in range(1,arraylen-1):
        #     if isinstance(self.array[i],equation):
        #         print("Found an equation :D")
        #         for x in range(i-1,-1,-1):
        #             if isinstance(self.array[x],str):
        #                 self.array[i].prevtext = self.array[x]
        #                 break
        #         for x in range(i+1,arraylen,1):
        #             if isinstance(self.array[x],str):
        #                 self.array[i].nexttext = self.array[x]
        #                 break
        # self.array = self.get_equations()

    def get_equations(self):
        ret = []
        for item in self.array:
            if isinstance(item,equation):
                ret.append(item)
        return(ret)

    def tojson(self):
        return self.__dict__

class archive:
    def __init__(self,directory_name,dictionary):
        self.dir = directory_name
        self.docdict = dictionary
    def save(self):
        # Pickle the whole archive to "<directory_name>.pkl",
        # overwriting any existing file of that name.
        print(self.dir)
        outfilepath = self.dir + ".pkl"
        # Pickle data is binary, so always open the file for binary writing.
        with open(outfilepath, 'wb') as outfile:
            pickle.dump(self, outfile)

def JSONHandler(Obj):
    if hasattr(Obj, 'tojson'):
        return Obj.tojson()
    else:
        raise TypeError('Object is not JSON serializable')
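
As a rough illustration of how a handler like `JSONHandler` plugs into the standard `json` module, the sketch below passes it as the `default` argument of `json.dumps`. The `Equation` and `Document` classes here are simplified stand-ins written for this example (they skip the latexmlmath call), and `example.tex` is a made-up file name.

```python
import json


class Equation(object):
    """Simplified stand-in for the equation class (no latexmlmath call)."""
    def __init__(self, eqtext, fname):
        self.text = eqtext
        self.file = fname
        self.itemtype = "equation"

    def tojson(self):
        return self.__dict__


class Document(object):
    """Simplified stand-in for the document class."""
    def __init__(self, fname, textarray):
        self.name = fname
        self.array = textarray
        self.itemtype = "document"

    def tojson(self):
        return self.__dict__


def json_handler(obj):
    # Same logic as JSONHandler: json.dumps calls this for every object it
    # cannot serialize on its own, including equations nested in a document.
    if hasattr(obj, "tojson"):
        return obj.tojson()
    raise TypeError("Object is not JSON serializable")


doc = Document("example.tex",
               ["Some text", Equation("\\frac{3}{4}", "example.tex")])
serialized = json.dumps(doc, default=json_handler, sort_keys=True)
restored = json.loads(serialized)
print(restored["array"][1]["text"])
```

Note that the round trip yields plain dictionaries and lists, not `Document` or `Equation` instances; the `itemtype` field is what lets a reader tell the two apart afterwards.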


### Object Generation

In [62]:
import re
from core.texclasses import *

def strip(param):
    return param.strip()


filename = 'meetingexample.tex'
f1 = open(filename, 'rt')
text = f1.read()
f1.close()
newtext = text.decode('utf-8', 'ignore')
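# Remove comments.
# First, blank out lines that consist entirely of a comment.
newtext = re.sub(r'(?m)^%+.*$', '', newtext)
# Then remove trailing comments, keeping the character before an unescaped %.
newtext = re.sub(r"(?m)([^\\])\%+.*?$", r'\1', newtext)
# Marker that replaces display-math delimiters so the text can be split on it.
cdelim = " CUSTOMDELIMITERHERE "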
# flags must be passed by keyword: the fourth positional argument of re.sub is count.
newtext = re.sub(r'\\begin\{comment\}.*?\\end\{comment\}','',newtext,flags=re.DOTALL)
newtext = re.sub(r'(?s)\\begin\{equation\}(.*?)\\end\{equation\}',cdelim + r'\1' + cdelim,newtext)
newtext = re.sub(r'(?s)\\begin\{multline\}(.*?)\\end\{multline\}',cdelim + r'\1' + cdelim,newtext)
newtext = re.sub(r'(?s)\\begin\{gather\}(.*?)\\end\{gather\}',cdelim + r'\1' + cdelim,newtext)
newtext = re.sub(r'(?s)\\begin\{align\}(.*?)\\end\{align\}',cdelim + r'\1' + cdelim,newtext)
newtext = re.sub(r'(?s)\\begin\{flalign\*\}(.*?)\\end\{flalign\*\}',cdelim + r'\1' + cdelim,newtext)
newtext = re.sub(r'(?s)\\begin\{math\}(.*?)\\end\{math\}',cdelim + r'\1' + cdelim,newtext)
# Use a lookbehind so the character before \[ is checked but not consumed (and lost).
newtext = re.sub(r'(?s)(?<!\\)\\\[(.*?)\\\]',cdelim + r'\1' + cdelim,newtext)
newtext = re.sub(r'(?s)\$\$([^\^].*?)\$\$',cdelim + r'\1' + cdelim,newtext)
dispeqs = re.findall(r'(?s)' + cdelim + r'(.*?)' + cdelim,newtext)
dispeqs = map(strip,dispeqs)
textlist = newtext.split(cdelim)
textlist = map(strip,textlist)
for i in range(len(textlist)):
    if textlist[i] in dispeqs:
        textlist[i] = equation(eqtext = textlist[i], fname = filename)
newdoc = document(filename,textlist)
eqlist = newdoc.get_equations()
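
The delimiter-and-split technique used in the cell above can be tried in isolation on a short string. This is a reduced sketch, not the script itself: `split_display_math` is a hypothetical function, it handles only the `equation` environment and `\[...\]`, and it labels chunks by position (every odd-indexed chunk sat between a pair of markers) instead of checking membership in `dispeqs`.

```python
import re

DELIM = " CUSTOMDELIMITERHERE "


def split_display_math(tex):
    """Return a list of (kind, chunk) pairs; kind is 'text' or 'equation'."""
    # Replace each display-math environment with DELIM + contents + DELIM.
    marked = re.sub(r'(?s)\\begin\{equation\}(.*?)\\end\{equation\}',
                    DELIM + r'\1' + DELIM, tex)
    marked = re.sub(r'(?s)(?<!\\)\\\[(.*?)\\\]',
                    DELIM + r'\1' + DELIM, marked)
    chunks = marked.split(DELIM)
    # Markers always come in pairs, so after splitting the chunks alternate:
    # text, equation, text, equation, ..., text.
    return [("equation" if i % 2 else "text", chunk.strip())
            for i, chunk in enumerate(chunks)]


sample = "Intro \\begin{equation}a+b\\end{equation} middle \\[c=d\\] end"
for kind, chunk in split_display_math(sample):
    print(kind, repr(chunk))
```

Labelling by position avoids one edge case of the membership test, where a text chunk that happens to be identical to some equation's source would also be turned into an equation object. It does assume the marker string never occurs in the original document.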

Some example equations, their LaTeX source, and their corresponding MathML output:
<img src="img_0.png">

In [66]:
print eqlist[0].mathml

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\int ax^{2}+bx+c=\frac{3}{4}" display="block">
  <mrow>
    <mrow>
      <mrow>
        <mo largeop="true" symmetric="true">∫</mo>
        <mrow>
          <mi>a</mi>
          <mo>⁢</mo>
          <msup>
            <mi>x</mi>
            <mn>2</mn>
          </msup>
        </mrow>
      </mrow>
      <mo>+</mo>
      <mrow>
        <mi>b</mi>
        <mo>⁢</mo>
        <mi>x</mi>
      </mrow>
      <mo>+</mo>
      <mi>c</mi>
    </mrow>
    <mo>=</mo>
    <mfrac>
      <mn>3</mn>
      <mn>4</mn>
    </mfrac>
  </mrow>
</math>


And here's an example of the JSON for just this equation:

In [75]:
print eqlist[0].tojson()

{'text': u'\\int ax^{2}+bx+c= \\frac{3}{4}', 'mathml': '<?xml version="1.0" encoding="UTF-8"?>\n<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\\int ax^{2}+bx+c=\\frac{3}{4}" display="block">\n  <mrow>\n    <mrow>\n      <mrow>\n        <mo largeop="true" symmetric="true">\xe2\x88\xab</mo>\n        <mrow>\n          <mi>a</mi>\n          <mo>\xe2\x81\xa2</mo>\n          <msup>\n            <mi>x</mi>\n            <mn>2</mn>\n          </msup>\n        </mrow>\n      </mrow>\n      <mo>+</mo>\n      <mrow>\n        <mi>b</mi>\n        <mo>\xe2\x81\xa2</mo>\n        <mi>x</mi>\n      </mrow>\n      <mo>+</mo>\n      <mi>c</mi>\n    </mrow>\n    <mo>=</mo>\n    <mfrac>\n      <mn>3</mn>\n      <mn>4</mn>\n    </mfrac>\n  </mrow>\n</math>', 'type': 'latex', 'itemtype': 'equation', 'file': 'meetingexample.tex'}


<img src="img_1.png">

In [67]:
print eqlist[1].mathml

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="P_{f}(f)=\frac{(N-1)}{2}\frac{\chi_{\circ}^{2}-\chi_{m}^{2}(f)}{\chi_{\circ}^{%&#10;2}}" display="block">
  <mrow>
    <mrow>
      <msub>
        <mi>P</mi>
        <mi>f</mi>
      </msub>
      <mo>⁢</mo>
      <mrow>
        <mo stretchy="false">(</mo>
        <mi>f</mi>
        <mo stretchy="false">)</mo>
      </mrow>
    </mrow>
    <mo>=</mo>
    <mrow>
      <mfrac>
        <mrow>
          <mo stretchy="false">(</mo>
          <mrow>
            <mi>N</mi>
            <mo>-</mo>
            <mn>1</mn>
          </mrow>
          <mo stretchy="false">)</mo>
        </mrow>
        <mn>2</mn>
      </mfrac>
      <mo>⁢</mo>
      <mfrac>
        <mrow>
          <msubsup>
            <mi>χ</mi>
            <mo>∘</mo>
            <mn>2</mn>
          </msubsup>
          <mo>-</mo>
          <mrow>
            <msubsup>
              <mi>χ</mi>
              <mi>m</mi>
              

<img src="img_2.png">

In [68]:
print eqlist[2].mathml

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\hbox{MC}(F,S):=\frac{1}{N}\sum_{i=1}^{N}F(s_{i})." display="block">
  <mrow>
    <mrow>
      <mrow>
        <mtext>MC</mtext>
        <mo>⁢</mo>
        <mrow>
          <mo stretchy="false">(</mo>
          <mi>F</mi>
          <mo>,</mo>
          <mi>S</mi>
          <mo stretchy="false">)</mo>
        </mrow>
      </mrow>
      <mo>:=</mo>
      <mrow>
        <mfrac>
          <mn>1</mn>
          <mi>N</mi>
        </mfrac>
        <mo>⁢</mo>
        <mrow>
          <munderover>
            <mo largeop="true" movablelimits="false" symmetric="true">∑</mo>
            <mrow>
              <mi>i</mi>
              <mo>=</mo>
              <mn>1</mn>
            </mrow>
            <mi>N</mi>
          </munderover>
          <mrow>
            <mi>F</mi>
            <mo>⁢</mo>
            <mrow>
              <mo stretchy="false">(</mo>
              <msub>
                <mi>s</mi>

<img src="img_3.png">

In [69]:
print eqlist[3].mathml

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="S(x)\equiv\frac{1}{N}\sum_{i=1}^{N}\delta_{s_{i}}(x)" display="block">
  <mrow>
    <mrow>
      <mi>S</mi>
      <mo>⁢</mo>
      <mrow>
        <mo stretchy="false">(</mo>
        <mi>x</mi>
        <mo stretchy="false">)</mo>
      </mrow>
    </mrow>
    <mo>≡</mo>
    <mrow>
      <mfrac>
        <mn>1</mn>
        <mi>N</mi>
      </mfrac>
      <mo>⁢</mo>
      <mrow>
        <munderover>
          <mo largeop="true" movablelimits="false" symmetric="true">∑</mo>
          <mrow>
            <mi>i</mi>
            <mo>=</mo>
            <mn>1</mn>
          </mrow>
          <mi>N</mi>
        </munderover>
        <mrow>
          <msub>
            <mi>δ</mi>
            <msub>
              <mi>s</mi>
              <mi>i</mi>
            </msub>
          </msub>
          <mo>⁢</mo>
          <mrow>
            <mo stretchy="false">(</mo>
            <mi>x</mi>
            <mo str

<img src="img_4.png">

In [71]:
print eqlist[4].mathml

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\rho({\gamma}\circ\tilde{{\gamma}})=\rho({\gamma})\circ\rho(\tilde{{\gamma}}),%&#10;\qquad\forall\ {\gamma},\tilde{{\gamma}}\in{\Gamma}" display="block">
  <mrow>
    <mrow>
      <mrow>
        <mi>ρ</mi>
        <mo>⁢</mo>
        <mrow>
          <mo stretchy="false">(</mo>
          <mrow>
            <mi>γ</mi>
            <mo>∘</mo>
            <mover accent="true">
              <mi>γ</mi>
              <mo stretchy="false">~</mo>
            </mover>
          </mrow>
          <mo stretchy="false">)</mo>
        </mrow>
      </mrow>
      <mo>=</mo>
      <mrow>
        <mrow>
          <mrow>
            <mrow>
              <mi>ρ</mi>
              <mo>⁢</mo>
              <mrow>
                <mo stretchy="false">(</mo>
                <mi>γ</mi>
                <mo stretchy="false">)</mo>
              </mrow>
            </mrow>
            <mo>∘</mo>
            <mi>ρ</mi>

<img src="img_5.png">

In [72]:
print eqlist[5].mathml

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\int_{\Gamma}\widehat{{\gamma}(v)}^{i}\cdot\overline{\widehat{{\gamma}(v)}^{j}%&#10;}=\frac{|{\Gamma}|}{\hbox{dim}(V)}\cdot\|v\|^{2}\cdot\delta_{ij}." display="block">
  <mrow>
    <mrow>
      <mrow>
        <msub>
          <mo largeop="true" symmetric="true">∫</mo>
          <mi mathvariant="normal">Γ</mi>
        </msub>
        <mrow>
          <msup>
            <mover accent="true">
              <mrow>
                <mi>γ</mi>
                <mo>⁢</mo>
                <mrow>
                  <mo stretchy="false">(</mo>
                  <mi>v</mi>
                  <mo stretchy="false">)</mo>
                </mrow>
              </mrow>
              <mo>^</mo>
            </mover>
            <mi>i</mi>
          </msup>
          <mo>⋅</mo>
          <mover accent="true">
            <msup>
              <mover accent="true">
                <mrow>
                  <mi>γ</mi

## Performance / alternatives

With full JSON and cPickle serialization, proctex.py took 201m47.248s (3h21m47.248s) to complete processing of 8,312 documents.

Without generating latexmlmath for each equation, however, the script took 17 seconds to run over all 8,312 documents (generating only JSON files).

latexmlmath is, by far, the slowest part of this process. Indeed, on the [documentation page](http://dlmf.nist.gov/LaTeXML/manual/commands/latexmlmath.html), Bruce Miller notes:

*"This program runs much slower than would seem justified. This is a result of the relatively slow initialization including loading TeX and LaTeX macros and the schema. Normally, this cost would be ammortized over large documents, whereas, in this case, we’re processing a single math expression."*

Timing latexmlmath on one of our earlier examples ($\frac{3}{4}$):

```
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\frac{3}{4}" display="block">
  <mfrac>
    <mn>3</mn>
    <mn>4</mn>
  </mfrac>
</math>
real	0m0.632s
user	0m0.572s
sys     0m0.048s
```

It takes a whole 0.63 seconds for latexmlmath to parse a single equation. As Miller notes, this is partially due to the overhead of the command initializing and loading its libraries each time it is called.
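
To put the figures above side by side, the following back-of-the-envelope calculation converts the two run times into per-document costs. It is only an estimate under two assumptions that the measurements do not confirm: that every latexmlmath call costs about as much as the $\frac{3}{4}$ timing, and that all of the extra run time is spent in latexmlmath (the 17-second run also skipped cPickle serialization, so some of the difference belongs there).

```python
# Figures quoted in the text above.
total_with_mathml = 201 * 60 + 47.248   # seconds for the full run (201m47.248s)
total_without_mathml = 17.0             # seconds for the JSON-only run
n_documents = 8312
per_call = 0.632                        # seconds for one latexmlmath call

# Average wall-clock time per document in the full run.
per_document = total_with_mathml / n_documents

# Extra time per document attributable (by assumption) to latexmlmath.
extra_per_document = (total_with_mathml - total_without_mathml) / n_documents

# Number of latexmlmath calls per document that this extra time would imply.
implied_calls_per_document = extra_per_document / per_call

print("seconds per document: {:.2f}".format(per_document))
print("implied latexmlmath calls per document: {:.1f}".format(
    implied_calls_per_document))
```

Under those assumptions the full run averages roughly 1.46 seconds per document, which corresponds to a little over two latexmlmath calls per document at the measured per-call cost.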

In [74]:
intext = "\\begin{document} $\\frac{3}{4}$ $\\frac{5}{6}$ $\\frac{7}{8}$ \\end{document}"

proc = subprocess.Popen(["latexml", "--quiet", "-"], stderr = PIPE, stdout = PIPE, stdin = PIPE)
stdout, stderr = proc.communicate(intext)
proc = subprocess.Popen(["latexmlpost", "--pmml", "--format=xhtml" , "-"], stderr = PIPE, stdout = PIPE, stdin = PIPE)
stdout2, stderr = proc.communicate(stdout)
print(stdout2)

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Untitled Document</title>
<!--Generated on Thu Jul 28 14:40:42 2016 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8"/>
<link rel="stylesheet" href="/home/jay/hopper/hoptex/LaTeXML.css" type="text/css"/>
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<div class="ltx_document">
<div id="p1" class="ltx_para">
<p class="ltx_p"><math xmlns="http://www.w3.org/1998/Math/MathML" id="p1.m1" class="ltx_Math" alttext="\frac{3}{4}" display="inline"><mfrac><mn>3</mn><mn>4</mn></mfrac></math> <math xmlns="http://www.w3.org/1998/Math/MathML" id="p1.m2" class="ltx_Math" alttext="\frac{5}{6}" display="inline"><mfrac><mn>5</mn><mn>6</mn></mfrac></math> <math xmlns="http:

Note that the results of *latexmlpost* contain presentation MathML built from the same elements as the results we received from latexmlmath (with added `id` and `class` attributes and `display="inline"` for the inline math used here). One potential fix for the performance problems would be to:
* Per document, construct a LaTeX body containing only the display-mode equations from that document
* Run this temporary document through the latexml/latexmlpost pipeline
* If the pipeline encounters errors, fall back to handling the equations with latexmlmath
* Otherwise, parse the equations out of the XHTML output (the equations should appear in the same order in the temporary file as they did in the main file).
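
The steps above could be sketched roughly as follows. None of these functions exist in proctex.py: `build_batch_document`, `run_latexml_pipeline`, and `extract_mathml` are hypothetical names, the `\documentclass{article}` preamble is an assumption, and the command lines are simply the ones used in the cell above. Real latexmlpost output also carries a DOCTYPE, which this sketch has not been tested against.

```python
import subprocess
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Serialize MathML with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", MATHML_NS)


def build_batch_document(equations):
    """Wrap each equation in \\[ ... \\] inside one minimal LaTeX document."""
    body = "\n".join("\\[" + eq + "\\]" for eq in equations)
    return ("\\documentclass{article}\n\\begin{document}\n"
            + body + "\n\\end{document}\n")


def run_latexml_pipeline(tex):
    """Run latexml, then latexmlpost; return XHTML bytes or None on failure."""
    stages = (["latexml", "--quiet", "-"],
              ["latexmlpost", "--pmml", "--format=xhtml", "-"])
    data = tex.encode("utf-8")
    for cmd in stages:
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        data, _ = proc.communicate(data)
        if proc.returncode != 0:
            return None  # caller can fall back to per-equation latexmlmath
    return data


def extract_mathml(xhtml):
    """Return (alttext, MathML string) pairs in document order."""
    root = ET.fromstring(xhtml)
    results = []
    for node in root.iter("{%s}math" % MATHML_NS):
        node.tail = None  # tostring would otherwise append the trailing text
        mathml = ET.tostring(node, encoding="utf-8").decode("utf-8")
        results.append((node.get("alttext"), mathml))
    return results
```

A caller would compare the number of extracted `<math>` elements with the number of equations it put into the batch document, and fall back to latexmlmath for the whole document if the counts differ, since the mapping back to equation objects relies purely on order.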