LatexWalker does not correctly parse "\newcolumntype{C}{>{$}c<{$}}" #48
Comments
Hi, thanks for reporting this issue. The trouble here is that some macros, especially ones that define new macros or new behavior (such as […]

That said, there is quite a bit of flexibility to parse complicated LaTeX macros and constructs, but those have to be defined manually in Python. There currently is minimal support for some of these "special" macros (e.g. […]

In your use case however, you could see if […]

I definitely agree that document preambles often have a bunch of advanced LaTeX definitions (e.g., "\makeatletter" tricks etc.) that can throw off the parser. The parser was indeed designed more with a view to parsing the contents of a document, or content snippets of a document, rather than the entire document in one go. An option is to keep all preamble definitions in a separate file and use "\input{macros.tex}" (which by default on the command-line tool is ignored, and is customizable if you use the Python API directly).

I agree that using regexp searches is very ugly and it will be easy to miss some situations. Here are some further suggestions:
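To illustrate the point about parsing the document contents rather than the whole file, here is a minimal sketch that cuts the preamble off before handing the text to the parser. The helper name `split_preamble` and the sample text are made up for illustration and are not part of pylatexenc. It assumes the first literal occurrence of \begin{document} is the real one, so a commented-out copy in the preamble would fool it.

```python
def split_preamble(tex):
    """Return (preamble, body); body starts at the \\begin{document} line."""
    marker = r"\begin{document}"
    idx = tex.find(marker)
    if idx == -1:
        # No document environment: treat the whole input as a content snippet.
        return "", tex
    return tex[:idx], tex[idx:]


# Hypothetical input, only for demonstration.
sample = (
    "\\documentclass{article}\n"
    "\\newcolumntype{C}{>{$}c<{$}}\n"
    "\\begin{document}\n"
    "Hello \\emph{world}.\n"
    "\\end{document}\n"
)
preamble, body = split_preamble(sample)
print(body)
```

Only `body` would then be passed to the parser. This does not help when the macros of interest live in the preamble, as \title and \author often do; in that case the preamble itself still needs error-tolerant handling.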
from pylatexenc import latexwalker
from pylatexenc.latexwalker import LatexWalkerParseError

with open("my_tex_file.tex") as f:
    doc_contents = f.read()

lw = latexwalker.LatexWalker(
    doc_contents,
    tolerant_parsing=False,  # set to False, we'll catch parse errors ourselves
)
parsing_state = lw.make_parsing_state()
pos = 0
while True:
    # Try to read one node at a time at position pos. If there is a parse
    # error, move to the next line and try again.
    try:
        (nodelist, npos, nlen) = lw.get_latex_nodes(pos, read_max_nodes=1, parsing_state=parsing_state)
        pos = npos + nlen  # continue parsing after this node on the next loop iteration
    except LatexWalkerParseError as e:
        print(f"Ignoring parse error at line {e.lineno}, col {e.colno}: {e}")
        # find the next newline to continue parsing from there
        pos = doc_contents.find('\n', pos + 1)  # position of next newline
        if pos == -1:
            break  # no further newline, so there is nothing left to parse
        continue
    if len(nodelist) == 0:
        break  # end of document reached
    assert len(nodelist) == 1
    node = nodelist[0]
    ...  # do something with `node`, e.g. call nodelist_to_text([node]) on some latex2text.LatexNodes2Text instance
Thank you for the suggestions, I'll try them and let you know how it goes.

import logging

from pylatexenc.latexwalker import (
    LatexCharsNode,
    LatexCommentNode,
    LatexEnvironmentNode,
    LatexGroupNode,
    LatexMacroNode,
    LatexMathNode,
    LatexSpecialsNode,
)

logger = logging.getLogger(__name__)


def find_macro(node, macroname):
    """Walk the node (consider the given node as the root of a tree) and
    find the macro I'm looking for"""
    found = []
    if node is None:
        return found
    if isinstance(node, str):
        # logger.debug(f'string: {node}')
        return found
    # == CHARS ==
    if node.isNodeType(LatexCharsNode):
        # logger.debug(f'chars: {node.chars}')
        return found
    # == GROUP ==
    elif node.isNodeType(LatexGroupNode):
        # logger.debug(f'group: {len(node.nodelist)}')
        for child in node.nodelist:
            found.extend(find_macro(child, macroname))
        return found
    # == MACRO ==
    elif node.isNodeType(LatexMacroNode):
        # if node.macroname == "newcommand":
        #     import pdb
        #     pdb.set_trace()
        if node.macroname == macroname:
            logger.debug(f"FOUND {macroname} in {node}")
            found.append(node)
            return found
        else:
            # "\newcommand\mgg[1]{#1}" has empty nodeargd (None)
            if node.nodeargd is None:
                # logger.debug(f'macro: \\{node.macroname}@None')
                return found
            argument_nodes = node.nodeargd.argnlist
            # logger.debug(f'macro: \\{node.macroname}@{len(argument_nodes)}')
            for child in argument_nodes:
                found.extend(find_macro(child, macroname))
            return found
    # == ENVIRONMENT ==
    elif node.isNodeType(LatexEnvironmentNode):
        # logger.debug(f'environment: {len(node.nodelist)}')
        for child in node.nodelist:
            found.extend(find_macro(child, macroname))
        return found
    # == COMMENT ==
    elif node.isNodeType(LatexCommentNode):
        # logger.debug(f'comment: {node.comment}')
        return found
    # == MATH ==
    elif node.isNodeType(LatexMathNode):
        return found
    # == SPECIALS ==
    elif node.isNodeType(LatexSpecialsNode):
        return found
    # == ??? ==
    else:
        logger.debug(f"UNKNOWN TYPE: {node.nodeType()}")
        return found
    assert False  # unreachable: every branch above returns
First, let me thank you for your work; it helps me a lot.
I want to report the issue in the title.
For instance:
will issue a parse error (I think because the parser sees a closing group right after an opening math).
I don't know if it is possible to fix, but, if it is not, I wanted to ask if it would be possible to entirely skip the line(s) with parsing errors (maybe by supplying some optional argument) in the hope of not cluttering the rest of the parsing.
Something similar:
echo '\newcommand{\be}{\begin{equation}}' |latex2text
My use case is this: I need to scan a tex file, looking for some specific macros (\title, \author, ...) and transform their arguments into text. I do not want to use something like latex_text.find("\\title") (and then parse from there) because of comments and "look-alike" macros (e.g. \titleBar, \authorFoo). I could use regexps to find the starting point, but I prefer to navigate the tree of nodes built by LatexWalker.
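The pitfalls of plain text search mentioned above can be seen in a small standard-library sketch. The sample text and the helper `line_of` are invented for illustration. The substring search stops at the first occurrence, which here sits inside a comment. A regexp with a negative lookahead rejects look-alikes such as \titleBar, but still matches the commented-out line.

```python
import re

# Hypothetical input, only for demonstration.
latex_text = (
    "% \\title{Old draft title}\n"
    "\\titleBar{decoration}\n"
    "\\title{Real Title}\n"
)


def line_of(pos):
    """1-based line number of a character offset in latex_text."""
    return latex_text.count("\n", 0, pos) + 1


# Plain substring search: the first hit is the commented-out \title.
first_hit = latex_text.find("\\title")

# Regexp that refuses a following letter (or @), so \titleBar is skipped.
pattern = re.compile(r"\\title(?![A-Za-z@])")
regex_hits = [m.start() for m in pattern.finditer(latex_text)]

print(line_of(first_hit))                 # line of the substring hit
print([line_of(p) for p in regex_hits])   # lines of the regexp hits
```

Both approaches report a hit on line 1, the comment, which is why walking the parsed node tree is the more robust option: comments become their own nodes and macro names are matched whole.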