This is the beginning of a project to write a LaTeX to Asciidoctor-latex converter. For the moment I am roughing out a recursive descent parser for a coarse-grained version of LaTeX. For tests (concept and progress), see item three in the section Testing.
- Parser is now recursive in view of the productions
  expr = { item | text_sequence | macro | environment | inline_math | display_math }
  environment = BEGIN_ENV {label | expr } END_ENV
  display_math = \[ { inner_expr } \]
  inner_expr = { text_sequence | macro | environment }
  Look at the tests for the files matrix.tex (nested environments) and itemize.tex (\item commands inside an itemize environment).
- There are two backends, one for the production of tex, the other for the production of asciidoc. The first is mainly for verifying the operation of the parser.
- Parser now passes the "identity test" for a small number of files, e.g., empty.tex, paragraphs.tex, comment.tex, macro.tex, inline_math.tex, and display_math.tex.
- Parser now uses class Node < Tree::TreeNode to populate the (still very primitive) AST. The idea is to mimic to some extent the Nodes in @mojavelinux’s Asciidoctor project.
- Continue adding to the grammar (and hence the parser and backends).
We should mimic the structure of Asciidoctor: in particular, have abstract and concrete blocks, where the head and body are blocks, and environments are blocks nested within them that can in turn contain other blocks. The grammar and parser will have to be reworked for this.
Latex2Asciidoctor, which is at version 0.0001, has the following parts:
- Reader takes a string, e.g., a file’s contents, as input, then provides text in chunks to a caller via the methods get_word and get_line. The instance variables @current_word and @current_line point to the last-gotten word and line. The relationship "@current_word is in @current_line" is preserved regardless of which of the get methods is called.
- Parser also takes a string as input. It initializes a Reader object and calls it indirectly via get_token. Set up a Parser with parser = Parser.new(text). The call parser.parse is intended to push an AST onto parser.stack. At the moment what is pushed is a bunch of nodes for elements of the grammar listed below. I’ve tested real AST production with the same technology and a simple grammar of arithmetic expressions, one that recognizes and evaluates things like ( 1 + 2 ) * ( 3 + 4 ). The correct AST is built by a recursive descent parser constructed by hand from the grammar. The nodes of the AST are instances of Tree::TreeNode (see https://github.com/evolve75/RubyTree ). We use Tree::TreeNode in this project as well, though the grammar at the moment is very simple.
- Other files in lib/ are mostly junk and/or irrelevant and will be removed.
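To make the Reader contract concrete, here is a minimal sketch of a class that hands out words and lines while keeping the current word inside the current line. The class name SketchReader and all of its internals are assumptions made for illustration; only the method names get_word and get_line and the two instance variables come from the description above, and the choice that get_line makes the first word of the new line current is a guess.

```ruby
# Hypothetical sketch of a Reader; not the project's actual implementation.
class SketchReader
  attr_reader :current_line, :current_word

  def initialize(text)
    @lines = text.lines.map(&:chomp) # lines not yet read
    @words = []                      # words of @current_line not yet handed out
    @current_line = nil
    @current_word = nil
  end

  # Advance to the next line. Assumption: its first word becomes the
  # current word (nil for a blank line), so the word stays inside the line.
  def get_line
    return nil if @lines.empty?

    @current_line = @lines.shift
    @words = @current_line.split
    @current_word = @words.shift
    @current_line
  end

  # Return the next word, moving on to a new line when the current one
  # is used up. Returns nil at the end of the input.
  def get_word
    while @words.empty?
      return nil if @lines.empty?

      @current_line = @lines.shift
      @words = @current_line.split
    end
    @current_word = @words.shift
  end
end
```

With this sketch, calling get_word after get_line continues from the second word of the line, because get_line has already consumed the first one.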
Below is the grammar currently used. It "works" but is not satisfactory. For example, we should have something like
environment = BEGIN_ENV expr2 END_ENV
where expr2 is an expr that doesn’t contain END_ENV. For this reason we can’t just make a call to expr.
Currently we have
environment = BEGIN_ENV env_text END_ENV
env_text = { words != END_ENV }
The effect is to recognize the tokens
BEGIN_ENV env_text END_ENV
and push them onto the stack. What we need to push instead is the sequence
BEGIN_ENV parse(env_text) END_ENV
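One way to get from "push the raw tokens" to "push the parsed body" is to collect env_text first and then hand it back to the parser. The sketch below shows that idea on a plain array of string tokens. The method names parse_tokens and parse_environment and the [:env, name, children] representation are assumptions for illustration, not the project's Parser API.

```ruby
# Hypothetical sketch: tokens are plain strings, and an environment is
# represented as [:env, name, children]. Both methods consume `tokens`.
def parse_tokens(tokens)
  nodes = []
  until tokens.empty?
    token = tokens.shift
    if (m = /\A\\begin\{([^}]+)\}\z/.match(token))
      nodes << parse_environment(m[1], tokens)
    else
      nodes << token
    end
  end
  nodes
end

# Collect env_text up to the matching \end{name}, then parse it recursively.
# `depth` tracks nested environments that have the same name.
def parse_environment(name, tokens)
  env_text = []
  depth = 1
  until tokens.empty?
    token = tokens.shift
    depth += 1 if token == "\\begin{#{name}}"
    depth -= 1 if token == "\\end{#{name}}"
    break if depth.zero?

    env_text << token
  end
  raise ArgumentError, "missing \\end{#{name}}" unless depth.zero?

  [:env, name, parse_tokens(env_text)]
end
```

Because the body is parsed by the same routine as the top level, nested environments come out as nested nodes rather than as a flat run of words.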
To be continued …
# Grammar
# -------
# Productions
# -----------
# document = header BEGIN_DOC body END_DOC
# header = { text_sequence | comment | macro_defs }
# macro_defs = '%%begin_macro_defs' { defs } '%%end_macro_defs'
# body = { expr }
# expr = { item | text_sequence | macro | environment | inline_math | display_math }
# inner_expr = { text_sequence | macro | environment }
# text_sequence = ordinary prose: a sequence of words with no in-line math, display
#                 math, or macros (control sequences)
# macro = \command \{ {args} \}
# environment = BEGIN_ENV {label | expr } END_ENV
# inline_math = $ math_text $
# display_math = \[ { inner_expr } \]
# item = '\item' text_sequence
#
# Terminals:
# ----------
# BEGIN_DOC = '\begin{document}'
# END_DOC = '\end{document}'
#
# Pseudoterminals
# --------------
# BEGIN_ENV = '\begin{' env_name '}'
# END_ENV = '\end{' env_name '}'
#
# NOTES. word_rx = [a-zA-Z]* -- what about punctuation, numerals, etc.?
# Maybe a word should just be: no '\', or no leading '\'
#
A pseudoterminal such as \begin{theorem} matches \end{theorem}, \begin{definition} matches \end{definition}, etc.
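The matching rule for pseudoterminals can be sketched with two regular expressions that capture env_name. The constant and method names below are hypothetical, and the pattern for env_name (anything up to the closing brace) is an assumption, since the grammar does not pin it down.

```ruby
# Hypothetical patterns for the BEGIN_ENV and END_ENV pseudoterminals.
BEGIN_ENV_RX = /\A\\begin\{([^}]+)\}\z/
END_ENV_RX   = /\A\\end\{([^}]+)\}\z/

# Name of the environment that a token opens, or nil if it opens none.
def begin_env_name(token)
  m = BEGIN_ENV_RX.match(token)
  m && m[1]
end

# Name of the environment that a token closes, or nil if it closes none.
def end_env_name(token)
  m = END_ENV_RX.match(token)
  m && m[1]
end

# True when the two tokens open and close the same environment.
def matching_pair?(begin_token, end_token)
  name = begin_env_name(begin_token)
  !name.nil? && name == end_env_name(end_token)
end
```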
Warning: the production for environment is not yet implemented per the grammar.
The table with items 1-5 below describes the parsing process for environment. Each line in the table has the form
label, stack | input
Items of the form (x) are nodes, items of the form [y] are tokens, and other items are words of input.
1. ?, (node) ... (node) | be 1 ... 2 ee ...
2. ee, [2] ... [1] [be] | ...
3. [ee] (node) ... (node) | [1] ... [2] [ee] ...
4. (NODE) ... (NODE) [ee] (node) ... (node) | ...
5. (ENV_NODE) (node) ... (node) | ...
- Phase 1: The parser sees be, which has the form \begin{foo}. The method environment is called. In phase (1) it gets tokens repeatedly until it sees ee = \end{foo}. The state is now (2).
- Phase 2: Push [ee] onto the input stream and save it as mark; push [2] … [1] in reverse order onto the input stream; pop [be], push mark. The state is now as in (3).
- Phase 3: Call expr.
- Phase 4: expr has returned. Set n = depth([ee]). Create ENV_NODE, make (NODE) … (NODE) its descendants, pop [ee], push ENV_NODE onto stack. The state is as in (5).
- Phase 5: Return.
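The folding step of Phase 4 can be sketched in isolation. The code below covers only that step, not the token pushback of Phases 1 and 2. EnvNode is a hypothetical stand-in for a Tree::TreeNode subclass, fold_above_mark is an invented name, and the stack is a Ruby array whose top is its last element, so it reads right to left compared with the table.

```ruby
# Hypothetical stand-in for a tree node; the project itself uses Tree::TreeNode.
EnvNode = Struct.new(:name, :children)

# Fold every node pushed after `mark` into a single environment node.
# The top of the stack is the end of the array.
def fold_above_mark(stack, mark, name)
  n = stack.rindex { |item| item.equal?(mark) } # position (depth) of the mark
  raise ArgumentError, 'mark not found on stack' if n.nil?

  children = stack.pop(stack.length - n - 1) # (NODE) ... (NODE), in push order
  stack.pop                                  # pop the mark, i.e. [ee]
  stack.push(EnvNode.new(name, children))    # push ENV_NODE
  stack
end
```

Comparing the mark by identity rather than by value keeps the fold from stopping at an inner \end{foo} token that merely looks the same.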
Still very primitive, but I am using rspec and want to use @jirutka’s doctest as well. The Reader is well tested.
To test the parser, run
$ rspec spec/parser_spec.rb
At the moment I am working on
- making the parser produce a tree instead of a list of nodes.
- the method render_tree in module RenderNode
- passing the "identity test" in parser_spec.rb: the yield of the parser should be the same as the input text modulo white space. This is tested using test(file) or test(file, verbose: true), which is defined in parser_spec.rb. Here is a typical test:
it 'can parse and render environments' do
test('environment.tex')
end
This approach is inspired by @jirutka’s doctest.
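The comparison behind the identity test might look like the sketch below. It takes "modulo white space" to mean that runs of white space are collapsed to a single space and leading and trailing white space is dropped; that reading is an assumption, and both method names are hypothetical rather than taken from parser_spec.rb.

```ruby
# Collapse every run of white space to one space and trim both ends.
def normalize_whitespace(text)
  text.split.join(' ')
end

# Hypothetical check: does the rendered text equal the input modulo white space?
def identity_test_passes?(input, rendered)
  normalize_whitespace(input) == normalize_whitespace(rendered)
end
```

A stricter or looser reading, for example ignoring white space entirely, would only change normalize_whitespace.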
Once the parser implements whatever the final grammar is and passes this test, another render method can be developed to produce the proper Asciidoctor text.