In [1]:
from IPython.core.display import HTML
with open("../../style.css", "r") as file:
    css = file.read()
HTML(css)

# Converting $\LaTeX$ to <span style="font-variant:small-caps;">Html</span>

The purpose of the following exercise is to implement a translator from [$\LaTeX$](http://www.latex-project.org) to 
[MathML](https://www.tutorialspoint.com/mathml/index.htm).  $\LaTeX$ is a document markup language
that is especially well suited to presenting text that contains mathematical formulas.  MathML is the part of <span style="font-variant:small-caps;">Html</span> that deals with the representation of mathematical formulas.  As $\LaTeX$ is a very rich
document markup language and we can only afford to spend a few hours on this exercise, we confine
ourselves to a small subset of $\LaTeX$.  The file `example.tex` contains some $\LaTeX$ text that uses only this subset.  The goal of this exercise is to implement a translator that is able to transform this file into MathML.

We start with reading the file. 

In [2]:
with open('example.tex') as f:
    data = f.read()

Now the variable `data` contains the text that is stored in this file.

In [3]:
print(data)

\documentclass{article}

\begin{document}
The sum of the squares of the first $n$ natural numbers is given as:
$$ \sum\limits_{i=1}^{n} i^{2} = \frac{1}{6} \cdot n \cdot (n+1) \cdot (2\cdot n + 1). $$
According to Pythagoras, the length of the hypotenuse of a right-angled triangle is
the square root of the squares of the length of the two catheti:
$$ c = \sqrt{a^{2} + b^{2}}.  $$
The area of a circle is given as 
$$  A = \pi \cdot r^{2},   $$ 
while its circumference satisfies
$$ C = 2 \cdot \pi \cdot r.  $$
\end{document}



$$ c = \sqrt{a^{2}+b^{2}} $$

Let us look at the output file `example.pdf` that is produced if we run $\LaTeX$ on this file. 
Depending on your operating system, you might have to exchange the command `start` for another command
that is able to open the file `example.pdf`.

In [4]:
!start example.pdf
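
Which command opens a file depends on the operating system.  The following sketch picks one based on `platform.system()`.  The function name `opener_command` is hypothetical, and the mapping only covers the common cases: `start` on Windows, `open` on macOS, and `xdg-open` on most Linux desktops.

```python
import platform

def opener_command(system=None):
    """Return the shell command that opens a file with its default application.

    `system` defaults to the value reported by platform.system().
    This helper is hypothetical and not part of the exercise.
    """
    if system is None:
        system = platform.system()
    if system == 'Windows':
        return 'start'
    if system == 'Darwin':       # macOS
        return 'open'
    return 'xdg-open'            # assumption: a Linux desktop with xdg-utils installed

print(opener_command())
```

Inside a notebook cell, the result can then be spliced into a shell command, for example `!{opener_command()} example.pdf`.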

Next, we open the file `example.html`.  The scanner we are going to implement has to write its output into this file.

In [None]:
outfile = open('example.html', 'w')

<hr style="height:4px;background-color:blue">
Below are some predefined functions that you can use to create the <span style="font-variant:small-caps;">Html</span> file.
<hr style="height:4px;background-color:blue">

The function `start_html` writes the header of the <span style="font-variant:small-caps;">Html</span> file
and the opening `<body>` tag to the file opened above.

In [5]:
def start_html():
    outfile.write('<!doctype html>\n')
    outfile.write('<html>\n')
    outfile.write('<head>\n')
    outfile.write('<script type="text/javascript" ')
    outfile.write('src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">\n')
    outfile.write('</script>\n')
    outfile.write('</head>\n\n')
    outfile.write('<body>\n\n')

The function `end_html` writes the closing `</body>` and `</html>` tags.

In [6]:
def end_html():
    outfile.write('</body>\n')
    outfile.write('</html>\n')

The function `start_math_block` starts a *math block*.  This is useful for formulas enclosed in `$$`.  Formulas of this type are displayed on a line by themselves.

In [7]:
def start_math_block():
    outfile.write('<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">\n')

The function `start_math_inline` starts an <em style="color:blue">inline formula</em>, i.e. a formula enclosed in `$`.  Formulas of this type are part of the surrounding text.

In [8]:
def start_math_inline():
    outfile.write('<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline">\n')

The function `end_math` ends a formula, regardless of whether it was started as a math block or as an inline formula.

In [9]:
def end_math():
    outfile.write('</math>\n')

The functions `start_sum` and `end_sum` write code to display formulas involving sums.  For example, to display  the expression
$$ \sum\limits_{i=1}^n i^2 $$
we can use the following MathML:
```
<munderover>
<mo>&sum;</mo>
<mrow>
<mi>i</mi>
<mo>=</mo>
<mn>1</mn>
</mrow>
<mrow>
<mi>n</mi>
</mrow>
</munderover>
<msup>
<mi>i</mi>
<mrow>
<mn>2</mn>
</mrow>
</msup>
```

In [10]:
def start_sum():
    outfile.write('<munderover>\n')
    outfile.write('<mo>&sum;</mo>\n')

def end_sum():
    outfile.write('</munderover>\n')

The functions `start_sqrt` and `end_sqrt` write code to display formulas involving square roots.  For example, to display  the expression
$$ \sqrt{a^2 + b^2} $$
we can use the following MathML:
```
<msqrt>
<mrow>
<msup>
<mi>a</mi>
<mrow>
<mn>2</mn>
</mrow>
</msup>
<mo>+</mo>
<msup>
<mi>b</mi>
<mrow>
<mn>2</mn>
</mrow>
</msup>
</mrow>
</msqrt>
```

In [11]:
def start_sqrt():
    outfile.write('<msqrt>\n')

def end_sqrt():
    outfile.write('</msqrt>\n')

In order to write exponents we have to use the tag `<msup>`.  For example, the expression $a^2$ 
is equivalent to the following markup:
```
<msup>
<mi>a</mi>
<mrow>
<mn>2</mn>
</mrow>
</msup>
```
Note that the exponent is enclosed in `<mrow>` `</mrow>` tags.

<b>Note</b> that **everything**, i.e. both the variable and the exponent, is enclosed in `<msup>` `</msup>` tags.

In [12]:
def start_super():
    outfile.write('<msup>\n')

def end_super():
    outfile.write('</msup>\n')

In order to write fractions we have to use the tag `<mfrac>`.  For example, the expression $\frac{1}{6}$ 
is equivalent to the following markup:
```
<mfrac>
<mrow>
<mn>1</mn>
</mrow>
<mrow>
<mn>6</mn>
</mrow>
</mfrac>
```
Note that both numerator and denominator are enclosed in `<mrow>` `</mrow>` tags.

In [13]:
def start_fraction():
    outfile.write('<mfrac>\n')

def end_fraction():
    outfile.write('</mfrac>\n')

Arguments of functions like the square root or exponents have to be enclosed in pairs of `<mrow>` and `</mrow>` tags.  

In [14]:
def start_row():
    outfile.write('<mrow>\n')

def end_row():
    outfile.write('</mrow>\n')

Variable names should be enclosed in pairs of `<mi>` and `</mi>` tags.  For example, the variable $x$ is displayed by the following MathML:
```
<mi>x</mi>
```
The tag name `mi` is short for *math identifier*.  Single-letter identifiers are rendered in italics by default.

In [15]:
def write_var(v):
    outfile.write('<mi>' + v + '</mi>\n')

Numbers should be enclosed in pairs of `<mn>` and `</mn>` tags.  For example, the number $6$ is displayed by the following MathML:
```
<mn>6</mn>
```

In [16]:
def write_number(n):
    outfile.write('<mn>' + n + '</mn>\n')

The symbol $\cdot$ is created by the following MathML:
```
<mo>&sdot;</mo>
```

In [17]:
def write_times():
    outfile.write('<mo>&sdot;</mo>\n')

Mathematical operators should be enclosed in pairs of `<mo>` and `</mo>` tags.  For example, the operator $+$ is displayed by the following MathML:
```
<mo>+</mo>
```

In [None]:
def write_operator(op):
    outfile.write('<mo>' + op + '</mo>\n')

The symbol $\pi$ is created by the following MathML:
```
<mn>&pi;</mn>
```

In [18]:
def write_pi():
    outfile.write('<mn>&pi;</mn>\n')

The symbol $\leq$ is created by the following MathML:
```
<mo>&le;</mo>
```

In [19]:
def write_leq():
    outfile.write('<mo>&le;</mo>\n')

The symbol $\geq$ is created by the following MathML:
```
<mo>&ge;</mo>
```

In [20]:
def write_geq():
    outfile.write('<mo>&ge;</mo>\n')

The function `write_any` writes a single character unadorned to the output file.

In [21]:
def write_any(char):
    outfile.write(char)
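
To see how these helper functions fit together, the following sketch reproduces the markup for $\sqrt{a^2 + b^2}$ shown earlier.  It is only an illustration: `outfile` is rebound to an in-memory `io.StringIO` buffer, the helpers are repeated so that the code runs on its own, and `write_square` is a hypothetical convenience function that is not part of the exercise.

```python
import io

# In-memory stand-in for the real output file.  If you run this inside the
# notebook, re-run the cell that opens 'example.html' afterwards.
outfile = io.StringIO()

# Copies of the helper functions defined above.
def start_sqrt():
    outfile.write('<msqrt>\n')

def end_sqrt():
    outfile.write('</msqrt>\n')

def start_super():
    outfile.write('<msup>\n')

def end_super():
    outfile.write('</msup>\n')

def start_row():
    outfile.write('<mrow>\n')

def end_row():
    outfile.write('</mrow>\n')

def write_var(v):
    outfile.write('<mi>' + v + '</mi>\n')

def write_number(n):
    outfile.write('<mn>' + n + '</mn>\n')

def write_operator(op):
    outfile.write('<mo>' + op + '</mo>\n')

def write_square(v):
    """Emit the markup for v^2 (hypothetical helper)."""
    start_super()
    write_var(v)
    start_row()          # the exponent is wrapped in <mrow> ... </mrow>
    write_number('2')
    end_row()
    end_super()

# sqrt(a^2 + b^2): the argument of the root is wrapped in a row.
start_sqrt()
start_row()
write_square('a')
write_operator('+')
write_square('b')
end_row()
end_sqrt()

markup = outfile.getvalue()
print(markup)
```

Every `start_...` call is matched by the corresponding `end_...` call in reverse order, which is exactly the bookkeeping the scanner has to perform when it encounters opening and closing braces.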

We will use the library [ply](https://ply.readthedocs.io/en/latest/ply.html) to translate $\LaTeX$ into 
<span style="font-variant:small-caps;">MathML</span>.
We only need the scanner generator, which is provided by the module `ply.lex`. 
Hence we import just this module.

In [22]:
import ply.lex as lex

We have to declare all tokens below.  We will need tokens for the following parts of the $\LaTeX$ file:
 - The $\LaTeX$ file starts with the string `\documentclass{article}`.
 - Next, there is the string `\begin{document}` that starts the content.
 - The string `\end{document}` ends the content.
 - The string `$$` starts and ends a formula that is displayed on a line by itself.
 - The string `$` starts and ends a formula that is displayed as part of the text.
 - The string `\sum\limits_{` starts the definition of a sum.
 - The string `\sqrt{` starts the definition of a square root.
 - The string `\frac{` starts the definition of a fraction.
 - A variable raised to a power starts with something like `a^{`.
 - $\vdots$

In [23]:
tokens = [ 'HEAD',            # r'\\documentclass\{article\}'
           'BEGIN_DOCUMENT',  # r'\\begin\{document\}'
           'END_DOCUMENT',    # r'\\end\{document\}'
           'DOLLAR_DOLLAR',   # r'\$\$'
           'DOLLAR',          # r'\$'
           # ... many more token declarations here
           'ANY',             # r'.|\n'
           'WS'               # r'[ \t]'
         ]

When we see a closing brace `}`, things get difficult.  The reason is that we need to know what type of construct is being closed:
is it a square root, the subscript of a sum, the superscript of a sum, or some part of a fraction?  My idea is to use a stack that is attached to the lexer, i.e. we have a variable `lexer.stack` that stores this information.
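
The stack idea can be sketched without `ply`, as shown below.  This is only one possible design and not the reference solution: the labels `'SQRT'`, `'SUPER'`, `'FRAC_NUM'` and `'FRAC_DEN'`, as well as the functions `close_brace` and `translate`, are hypothetical.  The input is assumed to be already split into tokens, and everything that is not a brace or a command is treated as a number to keep the example short.

```python
def close_brace(stack):
    """Pop the innermost open construct and return the tags closed by a '}'."""
    kind = stack.pop()
    if kind == 'SQRT':
        return ['</mrow>', '</msqrt>']
    if kind == 'SUPER':
        return ['</mrow>', '</msup>']
    if kind == 'FRAC_NUM':
        # The numerator is complete, the denominator group follows.
        stack.append('FRAC_DEN')
        return ['</mrow>']
    if kind == 'FRAC_DEN':
        return ['</mrow>', '</mfrac>']
    raise ValueError(f'unexpected closing brace inside {kind!r}')

def translate(tokens):
    """Translate a pre-tokenized formula into a list of MathML tags."""
    stack = ['INITIAL']          # bottom marker, as in the token functions below
    out = []
    for tok in tokens:
        if tok == r'\sqrt{':
            stack.append('SQRT')
            out += ['<msqrt>', '<mrow>']
        elif tok == r'\frac{':
            stack.append('FRAC_NUM')
            out += ['<mfrac>', '<mrow>']
        elif tok == '{':
            # In this sketch, a bare '{' only occurs as the start of a denominator.
            out.append('<mrow>')
        elif tok == '}':
            out += close_brace(stack)
        else:
            out.append('<mn>' + tok + '</mn>')
    return out, stack

# Tokens of the formula \sqrt{\frac{1}{6}}
out, stack = translate([r'\sqrt{', r'\frac{', '1', '}', '{', '6', '}', '}'])
print('\n'.join(out))
```

After the last closing brace only the bottom marker `'INITIAL'` remains on the stack, which shows that every opened construct has been closed.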

Furthermore, the scanner has two different states.  Either we are inside a formula, i.e. inside something that is enclosed in dollar symbols, or we are inside text that needs to be echoed unchanged to the output file.

In [24]:
states = [ ('formula', 'exclusive') ]
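
The effect of an exclusive state can be illustrated with a small hand-written scanning loop.  The sketch below uses only the module `re` and is a stand-in for what `ply.lex` does internally, not its actual implementation.  The rule table `RULES` and the function `two_state_scan` are hypothetical, and the formula state knows only single-letter variables and whitespace.

```python
import re

# One list of (pattern, token name) pairs per state.  Since the state
# 'formula' is exclusive, the rules of 'INITIAL' do not apply inside it.
RULES = {
    'INITIAL': [(r'\$', 'DOLLAR'), (r'.|\n', 'ANY')],
    'formula': [(r'\$', 'DOLLAR'), (r'[ \t]', 'WS'), (r'[a-zA-Z]', 'VAR')],
}

def two_state_scan(text):
    state, pos, out = 'INITIAL', 0, []
    while pos < len(text):
        # Try the rules of the current state in order; the first match wins.
        for pattern, name in RULES[state]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f'illegal character {text[pos]!r} in state {state!r}')
        pos = m.end()
        if name == 'DOLLAR':
            # A dollar symbol toggles between the two states.
            if state == 'INITIAL':
                out.append('<math>')
                state = 'formula'
            else:
                out.append('</math>')
                state = 'INITIAL'
        elif name == 'ANY':
            out.append(m.group())        # echo text unchanged
        elif name == 'VAR':
            out.append('<mi>' + m.group() + '</mi>')
        # Tokens of type WS are dropped.
    return ''.join(out)

result = two_state_scan('the first $n$ numbers')
print(result)
```

Outside of formulas every character is echoed, while inside a formula whitespace is discarded and only the formula rules are tried.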

In [None]:
def t_HEAD(t):
    r'\\documentclass\{article\}'
    pass

In [None]:
def t_BEGIN_DOCUMENT(t):
    r'\\begin\{document\}'
    start_html()

In [None]:
def t_END_DOCUMENT(t):
    r'\\end\{document\}'
    end_html()

In [None]:
def t_DOLLAR_DOLLAR(t):
    r'\$\$'
    t.lexer.begin('formula')
    t.lexer.stack = []
    t.lexer.stack.append('INITIAL')
    start_math_block()

In [None]:
def t_DOLLAR(t):
    r'\$'
    t.lexer.begin('formula')
    t.lexer.stack = []
    t.lexer.stack.append('INITIAL')
    start_math_inline()

In [None]:
def t_ANY(t):
    r'.|\n'
    write_any(t.value)

In [None]:
def t_formula_DOLLAR_DOLLAR(t):
    r'\$\$'
    t.lexer.begin('INITIAL')
    end_math()

In [None]:
def t_formula_DOLLAR(t):
    r'\$'
    t.lexer.begin('INITIAL')
    end_math()

$\cdots$ lots of token definitions $\cdots$

In [None]:
def t_formula_LEQ(t):
    r'\\leq'
    write_leq()

In [None]:
def t_formula_GEQ(t):
    r'\\geq'
    write_geq()    

In [None]:
def t_formula_PI(t):
    r'\\pi'
    write_pi()

In [None]:
def t_formula_OPERATOR(t):
    r'[.,()+<>=-]'
    write_operator(t.value)

In [None]:
def t_formula_WS(t):
    r'[ \t]'
    pass

In [None]:
def t_formula_error(t):
    print(f"Illegal character in state 'formula': '{t.value[0]}'")
    t.lexer.skip(1)

The line below is necessary to trick `ply.lex` into assuming this program is written in an ordinary Python file instead of a *Jupyter notebook*.

In [None]:
__file__ = 'main'

The line below generates the scanner.

In [None]:
lexer = lex.lex(debug=True)

Next, we feed our input string into the generated scanner.

In [None]:
lexer.input(data)

In order to scan the data that we provided in the last line, we iterate over all tokens generated by our scanner.

In [None]:
def scan(lexer):
    for t in lexer:
        pass

In [None]:
scan(lexer)

In [None]:
outfile.close()

Now you should be able to see a file with the name `example.html` in your current directory.

In [None]:
!start example.html