LatexWalker does not correctly parse "\newcolumntype{C}{>{$}c<{$}}" #48

gamboz · 2020-11-18T09:27:33Z

First, let me thank you for your work, it helps me a lot.

I want to report the issue in the title.
For instance:

echo '\newcolumntype{C}{>{$}c<{$}}'  |  latex2text

will issue a parse error (I think because the parser sees a closing group right after an opening math).
I don't know if it is possible to fix, but, if it is not, I wanted to ask if it would be possible to entirely skip the line(s) with parsing errors (maybe by supplying some optional argument) in the hope of not cluttering the rest of the parsing.

Something similar: echo '\newcommand{\be}{\begin{equation}}' |latex2text

My use case is this: I need to scan a tex file, looking for some specific macro (\title, \author,...) and transform their arguments into text. I do not want to use something like latex_text.find("\\title") (and then parse from there) because of comments and "look-alike" macros (e.g. \titleBar, \authorFoo). I could use regexps to find the starting point, but I prefer to navigate the tree of nodes built by LatexWalker.
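To make the concern concrete, here is a small standard-library sketch (the sample text and variable names are made up for illustration) showing how a plain substring search, and even a stricter regexp, pick up a commented-out `\title` and a look-alike macro:

```python
import re

# Hypothetical sample: a commented-out title, a look-alike macro, and the real title.
SAMPLE = (
    "% \\title{Old draft title}\n"
    "\\titleBar{decoration}\n"
    "\\title{Real Title}\n"
)

# Plain substring search: matches the comment and \titleBar as well.
naive_hits = [m.start() for m in re.finditer(re.escape("\\title"), SAMPLE)]

# A negative lookahead rejects \titleBar, but the commented-out line still matches.
stricter_hits = [m.start() for m in re.finditer(r"\\title(?![A-Za-z])", SAMPLE)]

print(len(naive_hits), len(stricter_hits))
```

Only one of those matches is the title I actually want, which is why I would rather walk the parsed node tree, where comments and macro names are already separated.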


phfaist · 2020-11-18T11:48:49Z

Hi, thanks for reporting this issue.

The trouble here is that some macros, especially ones that define new macros or new behavior (such as \newcommand, \def, \newcolumntype, etc.) cannot be parsed like "normal" macros (such as \textbf{...}) because their argument structure isn't simply LaTeX content and has to be treated specially (e.g., a sequence of symbols that have a macro-specific meaning, or for \newcommand, LaTeX tokens that one shouldn't try to expand or parse further). In TeX/LaTeX, these macros work with TeX tricks such as changing catcodes, etc. The latexwalker module is not designed to be a TeX engine, but instead, it is meant to parse LaTeX with a simplified set of rules (e.g., there are no catcodes).

That said, there is quite a bit of flexibility to parse complicated LaTeX macros and constructs, but those have to be defined manually in python. There currently is minimal support for some of these "special" macros (e.g. \verb+xxx+). It would be great to add support for more such "special" macros; for instance I hope to be able to add support for \newcommand macros in the future.

In your use case however, you could see if latexwalker's "tolerant parsing mode" (e.g. latex2text --tolerant-parsing or latexwalker --tolerant-parsing) would be able to recover from the errors you mention. (The option is supposed to be on by default. Besides the parse error warnings, did you get any useful output in your attempts?). The resulting node tree might be a bit off, or might have missing nodes, but this would be currently the closest option to the feature you are referring to (ignore the line on which the error is).

I definitely agree that document preambles often have a bunch of advanced LaTeX definitions (e.g., "\makeatletter" tricks etc.) that can throw off the parser. The parser was indeed designed more with a view to parsing a document's contents, or content snippets of a document, rather than the entire document in one go. One option is to keep all preamble definitions in a separate file and use "\input{macros.tex}" (which is ignored by default by the command-line tool, and is customizable if you use the python API directly). I agree that using regexp searches is very ugly and that it would be easy to miss some situations. Here are some further suggestions:

- parse everything into a big node list in tolerant parsing mode, and explore it to find the information you'd like
- if definitions are well-organized in lines, you could try to load the file and parse it line-by-line or node-by-node (based on your suggestion). E.g.: parse a single node, and if there is a parse error, move to the next line. Then continue parsing the next node etc. Something along the lines of (UNTESTED):

from pylatexenc import latexwalker
from pylatexenc.latexwalker import LatexWalkerParseError

with open("my_tex_file.tex") as f:
    doc_contents = f.read()
lw = latexwalker.LatexWalker(
    doc_contents,
    tolerant_parsing=False,  # set to False, we'll catch parse errors ourselves
)
parsing_state = lw.make_parsing_state()
pos = 0
while True:
    # Try to read one node at a time, starting at position pos. If there is
    # a parse error, move to the next line and try again.
    try:
        (nodelist, npos, nlen) = lw.get_latex_nodes(pos, read_max_nodes=1, parsing_state=parsing_state)
        pos = npos + nlen  # continue parsing after this node on the next iteration
    except LatexWalkerParseError as e:
        print(f"Ignoring parse error at line {e.lineno}, col {e.colno}: {e}")
        # find the next newline and continue parsing from there
        newline_pos = doc_contents.find('\n', pos + 1)
        if newline_pos == -1:
            break  # no further lines: stop instead of looping forever
        pos = newline_pos
        continue
    if len(nodelist) == 0:
        break  # end of document reached
    assert len(nodelist) == 1
    node = nodelist[0]

    ...  # do something with `node`, e.g. call nodelist_to_text([node]) on some latex2text.LatexNodes2Text instance

gamboz · 2020-11-18T12:25:16Z

Thank you for the suggestions, I'll try them and let you know how it goes.
To answer your question, yes, when there is no pesky \newcommands & co., I definitely get useful results.
I use the function below (still work-in-progress) to navigate the tree looking for the macros that I'm interested in (I customize the context) and, since I know how many arguments they have, I can pass them to LatexNodes2Text to get what I need:

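import logging

from pylatexenc.latexwalker import (
    LatexCharsNode,
    LatexCommentNode,
    LatexEnvironmentNode,
    LatexGroupNode,
    LatexMacroNode,
    LatexMathNode,
    LatexSpecialsNode,
)

logger = logging.getLogger(__name__)


def find_macro(node, macroname):
    """Walk the node (consider the given node as the root of a tree) and
    find the macro I'm looking for"""
    found = []
    if node is None:
        return found
    if isinstance(node, str):
        # logger.debug(f'string: {node}')
        return found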
    # == CHARS ==
    if node.isNodeType(LatexCharsNode):
        # logger.debug(f'chars: {node.chars}')
        return found
    # == GROUP ==
    elif node.isNodeType(LatexGroupNode):
        # logger.debug(f'group: {len(node.nodelist)}')
        for child in node.nodelist:
            found.extend(find_macro(child, macroname))
        return found
    # == MACRO ==
    elif node.isNodeType(LatexMacroNode):
        # if node.macroname == "newcommand":
        #     import pdb
        #     pdb.set_trace()
        if node.macroname == macroname:
            logger.debug(f"FOUND {macroname} in {node}")
            found.append(node)
            return found
        else:
            # "\newcommand\mgg[1]{#1}" has empty nodeargd (None)
            if node.nodeargd is None:
                # logger.debug(f'macro: \\{node.macroname}@None')
                return found
            argument_nodes = node.nodeargd.argnlist
            # logger.debug(f'macro: \\{node.macroname}@{len(argument_nodes)}')
            for child in argument_nodes:
                found.extend(find_macro(child, macroname))
        return found
    # == ENVIRONMENT ==
    elif node.isNodeType(LatexEnvironmentNode):
        # logger.debug(f'environment: {len(node.nodelist)}')
        for child in node.nodelist:
            found.extend(find_macro(child, macroname))
        return found
    # == COMMENT ==
    elif node.isNodeType(LatexCommentNode):
        # logger.debug(f'comment: {node.comment}')
        return found
    # == MATH ==
    elif node.isNodeType(LatexMathNode):
        return found
    # == SPECIALS ==
    elif node.isNodeType(LatexSpecialsNode):
        return found
    # == ??? ==
    else:
        logger.debug(f"UNKNOWN TYPE: {node.nodeType()}")
        return found
    assert False

phfaist added the enhancement label Nov 18, 2020

gamboz mentioned this issue Nov 19, 2020

LatexWalker.get_latex_nodes returns wrong "len" (sometimes) #49

Closed

phfaist mentioned this issue Jan 5, 2021

Support for \newenvironment wrapping another environment #50

Open

phfaist mentioned this issue Apr 1, 2021

Unknown latex macros do not include arguments? #60

Closed
