TexParser

Taichen Rose edited this page May 24, 2021 · 13 revisions

The TexParser module is the heart of this project. Its function is similar to that of a LaTeX-to-PDF compiler, except that its goal is to parse LaTeX into a stream of text, marked up with SSML, for use in text-to-speech generation.

Storing Supported LaTeX Commands

We designed a system that is robust and flexible enough to cover many possible conversion scenarios while still being feasible to implement. An external XML file specifies how every command, environment, and command within an environment should be spoken via SSML. To use this XML file, we chose an MVC design in which an intermediate conversion database handles all parsing of the XML file. This not only separates the code and leaves room to change the database implementation; the conversion database also creates all the SSMLElement objects that make up the SSMLElement tree.

Creating an intermediate SSMLElement tree from our own proprietary objects was a decision made for the following reasons:

  1. Both the LaTeX that the parser will reference and the eventual desired output will be in the form of trees (accomplished with TexSoup for the LaTeX and Python's ElementTree package for the output XML).
  2. More advanced resolution of nested elements requires us to create our own tree structure so that we can manually decide how to resolve each nested element.
  3. It allows us to decide which common SSML elements we will have to support.
  4. Creating our own classes gives us the flexibility to design classes that solve implementation-specific problems as they arise.

SSMLElements were designed with their eventual final form as a Python ElementTree in mind. The most significant consequence of this is the headText and tailText attributes. These correspond one-to-one with how the ElementTree package stores text within and after an XML node, which makes it easier to convert each node to an XML node afterward. The structure of the classes used in the SSMLElement tree is shown in the following UML diagram: UML Diagram for SSMLElement Tree
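The correspondence between headText/tailText and ElementTree's `text`/`tail` fields can be illustrated with a minimal sketch. The `SSMLElement` dataclass below is a hypothetical stand-in, not the project's actual class hierarchy; only the headText and tailText attribute names come from the description above, and `to_etree` is an assumed helper name.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SSMLElement:
    # Hypothetical stand-in for the project's SSMLElement classes.
    tag: str
    attrib: dict = field(default_factory=dict)
    headText: Optional[str] = None   # text directly inside the node
    tailText: Optional[str] = None   # text directly after the node's closing tag
    children: List["SSMLElement"] = field(default_factory=list)


def to_etree(node):
    """Convert an SSMLElement subtree into an ElementTree element."""
    elem = ET.Element(node.tag, node.attrib)
    elem.text = node.headText   # headText maps to Element.text
    elem.tail = node.tailText   # tailText maps to Element.tail
    for child in node.children:
        elem.append(to_etree(child))
    return elem


tree = SSMLElement("speak", headText="asdf ", children=[
    SSMLElement("prosody", {"strength": "strong"}, headText="b", tailText=" a"),
])
ssml = ET.tostring(to_etree(tree), encoding="unicode")
```

Because the two attributes map directly onto ElementTree's fields, the conversion is a plain recursive copy with no text reshuffling.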

Pronunciation XML Structure

Under the root <latex> node there can be either <cmd> or <env> tags. The corresponding command or environment name must be specified as an attribute. Inside <cmd> nodes any SSML elements can be used, along with the tags <text> and <arg>. Inside <env> nodes there are two separator tags, <says> and <defines>: <says> defines how the environment will be read out, and <defines> defines (or redefines) commands within that specific environment. Within the <says> tag, a <content> tag must be used to denote the relative position of the environment's contents.

Example XML Definition

Sample
<latex>
<cmd name="foo" type="none" family="">
    asdf
    <prosody strength="strong">
        more text
        <arg num="2"/>
    </prosody>
    <arg num="1"/>
</cmd>
<env name="bar" type="none" family="">
    <says>
        <break time="3ms"/>
        qwerty
        <content/>
        <arg num="3"/>
    </says>
    <defines>
        <cmd name="baz" type="none" family="">
            buz
        </cmd>
    </defines>
</env>
</latex>

From the following LaTeX…

\foo{a}{b}  
\begin{bar}{c}{d}{e}  
    I’m just some text  
    \baz  
\end{bar}

… the parser should generate…

<speak>
asdf
<prosody strength="strong">
more text
b
</prosody>
a
<break time="3ms"/>
qwerty
I’m just some text
buz
e
</speak>
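The argument substitution in this example can be sketched with Python's ElementTree. This is a minimal illustration for <cmd> definitions only, not the project's conversion database; `substitute_args` and `render_cmd` are hypothetical helper names, and the definition string is a condensed version of the foo command above.

```python
import copy
import xml.etree.ElementTree as ET

DEFINITION = """
<latex>
  <cmd name="foo"><prosody strength="strong">more text <arg num="2"/></prosody><arg num="1"/></cmd>
</latex>
"""


def substitute_args(node, args):
    """Replace every <arg num="n"/> below `node` with args[n - 1] as plain text."""
    previous = None  # last child kept so far; text after it lives in previous.tail
    for child in list(node):
        if child.tag == "arg":
            text = args[int(child.get("num")) - 1] + (child.tail or "")
            if previous is None:
                node.text = (node.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            node.remove(child)
        else:
            substitute_args(child, args)
            previous = child


def render_cmd(root, name, args):
    """Render one command definition into a <speak> element, serialized as text."""
    cmd = root.find(f"cmd[@name='{name}']")
    if cmd is None:
        raise KeyError(name)
    body = copy.deepcopy(cmd)  # keep the stored definition untouched
    substitute_args(body, args)
    speak = ET.Element("speak")
    speak.text = body.text
    speak.extend(list(body))
    return ET.tostring(speak, encoding="unicode")


root = ET.fromstring(DEFINITION)
ssml = render_cmd(root, "foo", ["a", "b"])
```

Since ElementTree has no standalone text nodes, the replacement text has to be appended either to the parent's `text` or to the previous sibling's `tail`, which is the same head/tail split the SSMLElement classes mirror.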

Tables

While the parser is traversing the nodes, when it finds a \begin{table} or \begin{tabular} command it sets a boolean flag to true. The flag is caught at the next lookup that is not an environment or command. When the flag is found to be true, we call several functions that look inside the given input string and parse the table elements into listenable, SSML-formatted text.

The input is parsed by splitting the contents of the table on the & delimiter, because in every LaTeX table & signals the next column. The input is handled one row at a time. The start of the table is read as "Begin Table" and the end as "End Table". Each new row is announced as "New Row", and each value is announced together with its column. Extra information such as a title or caption is read aloud according to the pronunciation.xml file.
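The reading order described above can be sketched as follows. This is an illustration under stated assumptions, not the project's table code: `table_to_text` is a hypothetical name, the exact "Column N" wording is assumed, rows are assumed to end with a double backslash, and rules such as \hline or escaped \& are not handled.

```python
import re


def table_to_text(body):
    """Turn the body of a tabular environment into a list of spoken phrases."""
    phrases = ["Begin Table"]
    # Rows are assumed to end with a double backslash; cells are separated by &.
    for row in re.split(r"\\\\", body):
        row = row.strip()
        if not row:
            continue  # skip the empty piece after the final row terminator
        phrases.append("New Row")
        for index, cell in enumerate(row.split("&"), start=1):
            phrases.append(f"Column {index}: {cell.strip()}")
    phrases.append("End Table")
    return phrases


spoken = table_to_text(r"Name & Age \\ Ada & 36 \\")
```

Each phrase could then be wrapped in whatever SSML elements the pronunciation file prescribes.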

Issues/Refactoring

There may be a more efficient way to use the XML file to read out the contents of a table. However, because the contents of a table are not commands, environments, or specific SSML tags, they are not checked against it. Taking this approach would require additional work on the XML, including a flag on one of the tag elements for finding tables.

Complex tables that include images, or whose rows have differing numbers of columns, sometimes do not render properly. Table support will need to be expanded to cover these cases.

Citations

There are two different ways bibliographies can be used within LaTeX documents. Embedded bibliographies are written in the document itself with commands such as \bibitem and \textit. External bibliographies are separate .bib files, which can be uploaded through the web application. Whether a bibliography is internal or external, it must be read aloud so that our users can tell which documents are being referenced while listening to their audio.

For internal bibliographies, we add commands to the pronunciation.xml file so that the bibliography is read aloud. For external bibliographies, we concatenate the bibliography file with the main file it corresponds to. We do this by looking for the \bibliography command in the main LaTeX file and checking whether its argument is the name of the .bib file. If the match succeeds, we call a function that uses pybtex (a Python package) to parse the .bib file. We concatenate the contents of this file to the end of the main .tex file, then feed the combined file to the TexParser. If no bibliography file is found, we print a message saying so, so that the user knows why there is no reference section in their audio output.
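The matching step for external bibliographies might look like the sketch below. It covers only finding the \bibliography argument and comparing it with uploaded file names; parsing the .bib contents (done with pybtex in the project) is left out. The function names are hypothetical, and the sketch assumes the argument is written without the .bib extension.

```python
import re
from pathlib import PurePath

# Requiring "{" right after the name keeps \bibliographystyle{...} from matching.
BIB_COMMAND = re.compile(r"\\bibliography\{([^}]*)\}")


def referenced_bib_names(tex_source):
    """Return every name listed in \\bibliography{a,b,...} commands."""
    names = []
    for match in BIB_COMMAND.finditer(tex_source):
        for name in match.group(1).split(","):
            name = name.strip()
            if name:
                names.append(name)
    return names


def find_matching_bib(tex_source, uploaded_files):
    """Return the first uploaded .bib file named in the document, or None."""
    wanted = set(referenced_bib_names(tex_source))
    for filename in uploaded_files:
        path = PurePath(filename)
        if path.suffix == ".bib" and path.stem in wanted:
            return filename
    return None


tex = r"\bibliographystyle{plain} \bibliography{refs}"
match = find_matching_bib(tex, ["main.tex", "refs.bib"])
```

A `None` result corresponds to the case where the user is told that no bibliography file was found.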

Macro Definitions

Description

Macros here are definitions made with \newcommand, \renewcommand, \newenvironment, or \renewenvironment. According to our survey of many real papers, \def is a popular alternative as well, and so should be supported. Since LaTeX only expands these macros within the compiler itself, we need to emulate this behavior in Tex2Speech.

The design needs to take several peculiarities of the new* and renew* commands into account. First, since one command can be redefined multiple times, we have to keep track of where each definition is made and reference the correct one when expanding an instance of a macro. Next, if a macro expands into a previously defined macro, that macro should be expanded in turn, as this is the behavior LaTeX employs. Finally, a macro can take a variable number of arguments, some of which may have default values, so behavior must be encoded for parsing these and inserting the arguments in the proper places.

Design

To take these considerations into account, we first need to traverse the document and find all the macro definitions before any macro expansion can take place. This separation makes the expansion easier to write, but necessitates some kind of data structure to hold information about the macros seen so far. This naturally leads to a Macro class with subclasses for command macros and environment macros. There is no need to differentiate between new* and renew*, since we don't know whether a command already exists from some other package and there is no need to enforce unnecessary semantics.

The constructor of each macro class takes the macro's definition, in the form of a TexSoup node, and all the macros previously created. Using this information, we can expand the contents of the macro using the previous macros, and then store everything the macro needs to provide expansions itself. This includes the definition's relative position in the document, which we have to reference when expanding a given instance of the macro.

From here, all we have to do is go through the document once more looking for every invocation of the current set of macros and expand it according to the correct definition. If the macros are stored in a dictionary with their name mapping to a list of definitions in their document defined order, it is fairly straightforward to find the correct definition and apply the expansion. That expansion will be inserted into the document, and we'll proceed forward.
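The dictionary of ordered definitions can be sketched as follows. This is an illustration of the lookup idea only, assuming definitions are registered in document order and identified by a character offset; `MacroTable` and `expand` are hypothetical names, and default arguments and environment macros are not covered.

```python
import bisect
import re


class MacroTable:
    """Maps a macro name to its definitions, ordered by document position."""

    def __init__(self):
        self._positions = {}  # name -> sorted list of definition offsets
        self._bodies = {}     # name -> bodies, parallel to _positions

    def define(self, name, position, body):
        # Assumes definitions are registered in document order.
        self._positions.setdefault(name, []).append(position)
        self._bodies.setdefault(name, []).append(body)

    def lookup(self, name, use_position):
        """Return the body of the latest definition at or before use_position."""
        positions = self._positions.get(name, [])
        index = bisect.bisect_right(positions, use_position) - 1
        return self._bodies[name][index] if index >= 0 else None


def expand(body, args):
    """Replace #1, #2, ... in a macro body with the given arguments."""
    return re.sub(r"#(\d)", lambda m: args[int(m.group(1)) - 1], body)


table = MacroTable()
table.define(r"\greet", 0, "Hello #1")
table.define(r"\greet", 50, "Goodbye #1")
early = expand(table.lookup(r"\greet", 20), ["world"])
late = expand(table.lookup(r"\greet", 80), ["world"])
```

The binary search picks the most recent definition preceding each use, so a redefinition later in the document does not affect earlier invocations.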

Multi-File LaTeX Documents

Description

In some LaTeX projects, users use \input or \include commands to embed extra files. These extra files are separate .tex files that need to be incorporated into the main .tex file. Users upload a single main.tex file, and that file may contain \input or \include commands, each of which opens another .tex file and reads it in at that place before continuing. An included file may not include a main .tex file. When another \input or \include is found inside an included file, the next file is opened in turn, and processing returns through each enclosing file until it is back in main.tex.

Design

To solve this issue, we currently provide three separate upload buttons: one for the main LaTeX file, one for .bib files, and one for input files. This gives us the list of main files up front. We open each main file and read it until an \input or \include command is found. When one is found, we open the referenced file (if it exists) and add its contents to the main file. Once this is finished, we feed the full file to our conversion parser, which produces a single audio file containing the full TeX document along with its input and include files.
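The inlining step can be sketched as a recursive substitution. This is a minimal illustration, assuming the uploaded files are available as a dictionary from file name to text; `inline_includes` is a hypothetical name, and leaving missing or circular references untouched is a choice made for this sketch.

```python
import re

# "{" must follow directly, so \includegraphics{...} is not matched.
INCLUDE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def inline_includes(source, files, _active=()):
    """Recursively replace input/include commands with the referenced file text.

    `files` maps uploaded file names to their contents.
    """
    def replace(match):
        name = match.group(1).strip()
        if not name.endswith(".tex"):
            name += ".tex"
        if name not in files or name in _active:
            return match.group(0)  # missing or circular: leave the command as is
        return inline_includes(files[name], files, _active + (name,))

    return INCLUDE.sub(replace, source)


files = {
    "main.tex": r"Intro \input{chapter1} End",
    "chapter1.tex": r"One \include{sub.tex}",
    "sub.tex": "Two",
}
combined = inline_includes(files["main.tex"], files)
```

The `_active` tuple tracks the files currently being expanded, which prevents two files that include each other from recursing forever.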

Math Equations

When the main LaTeX parser flags an argument as a LaTeX math mode object, it is handed over to another parser that converts the LaTeX math mode to a SymPy object, and then the SymPy object to SSML. See the Mathmode to SSML wiki page for documentation on this process.

Known TexParser Bugs

Commands That Don't Use { } Bug

Certain commands, such as \item and {\emph stuff}, are not in the traditional \command{} format and do not work well with our parser. \item is currently handled temporarily with an if statement that finds \item and adds all the contents that follow. This is not a good solution, since those contents could contain nested commands, environments, or math mode that need to be rendered. {\emph stuff} is a problem because our parser does not put emphasis tags around the "stuff" component when it processes it.

Expected "x", reached end of file Error - Issue

This accounts for 25.96% of the errors that have occurred so far. Users often write macro definitions such as \newc, \def, or \newcommand at the top of their files, which let them abbreviate a regular environment. For example, \newcommand{\beq}{\begin{equation}} means that the document can use \beq instead of typing out \begin{equation}. However, when TexSoup sees \begin{equation} inside such a \newcommand or \def, it gets confused: it expects an associated closing tag, cannot find it in the correct position, and errors out.

Command \item invalid in math mode Error

Occurs 16.45% of the time. In file h2625.tex, this occurs when \item is nested within \begin{list}{} \end{list}. No math mode should be involved; the parser could be erroring at the extra {}.

Expected "x", instead got "y" Error

This occurs 4.79% of the time in the files that currently error. So far, we have been able to reduce how often it occurs; part of the cause was backslash commands that were not separated from the following command or environment by whitespace.

Malformed Argument Error

Occurs 11.60% of the time in current tests. Multiple problems cause this error; one example is insufficient whitespace between commands. For example, \def\t{\theta}\def\T{\Theta} errors out for this reason. Instead of being side by side, the definitions need to be in the following format:

\def\t{\theta}

\def\T{\Theta}
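One possible workaround is to separate adjacent definitions before the source reaches the parser. The sketch below is an assumption, not something the project currently does: `separate_definitions` is a hypothetical name, and whether a single newline is enough to avoid the error in TexSoup has not been verified here.

```python
import re


def separate_definitions(source):
    """Start a new line before each \\def that directly follows a closing brace."""
    # The lookahead leaves the \def itself in place; (?![A-Za-z]) avoids
    # touching longer command names that merely begin with "def".
    return re.sub(r"\}(?=\\def(?![A-Za-z]))", "}\n", source)


fixed = separate_definitions(r"\def\t{\theta}\def\T{\Theta}")
```

Definitions that are already on separate lines are left unchanged, because the pattern only matches a `}` immediately followed by `\def`.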

Uncaught Errors

12.19% of errors are currently uncaught.

Other Errors

The remaining 6.8% of bugs are other errors from our error_log that are not mentioned in this list. You can view them in \Documentation\GenerateErrorLogs\parse_data\error_log.txt. They do not occur often, which could mean they are edge cases.