GitHub

Latex2Asciidoctor

This is the beginning of a project to write a LaTeX to Asciidoctor-latex converter. For the moment I am roughing out a recursive descent parser for a coarse-grained version of LaTeX. For tests, (concept and progress), see item three in the section Testing.

Notes

Parser is now recursive in view of the productions

  expr = { item | text_sequence | macro | environment | inline_math | display_math }
  environment = BEGIN_ENV {label | expr } END_ENV
  display_math = \[ { inner_expr } \]
  inner_expr = { text_sequence | macro | environment }

Look at the tests for the files matrix.tex (nested environments) and itemize.tex (\items inside itemize environment.

There are two backends, one for the production of tex, the other for the production of asciidoc. The first is mainly for verigying the operation of the parser.
Parser now passes the "identity test" for a small number of files, e.g., empty.tex, 'paragraphs.tex', comment.tex, macro.tex, inline_math.tex, and display_math.tex
Parser now uses class Node << Tree::TreeNode to populate the (still very primitive) AST. The idea is to mimic to some extent the Nodes in @movalinux’s Asciidoctor project.

Major Todos

Continue adding to the grammar (and hence the parser and backends}

Design notes

We should mimic the structure of Asciidoctor — in particular, have abstract and concrete blocks, where the head and body are blocks, environments are blocks nested within them, and which can in turn contain other blocks. The grammar and parser will have to be reworked for this.

The Basic Idea

Latex2Asciidoctor, which is in version the 0.0001, has the following parts:

Reader takes a string, e.g., a file’s contents as input, then provide text in chunks to a caller via the methods get_word and get_line. The instance varables @current_word and @current_line point to the last-gotten line and word. The relationship @current_word is in @curent_line is preserved regardless of which the get methods is called.
Parser also takes a string as input. It initializes a Reader object and calls it indirectly via get_token. Set up a Parser with parser = Parser.new(text). The call parser.parse is intended to push an AST onto parser.stack. At the moment what is pushed is a bunch of nodes for elements of the grammar listed below. I’ve tested real AST production with the same technology and a simple grammar of arithmetic expressions, one that recognizes and evaluates things like ( 1 + 2 ) * ( 3 + 4 ). The correct AST is built by a recursive descent parser constructed by hand from the grammar. The nodes of the AST are instances of Tree::TreeNode (see https://github.com/evolve75/RubyTree ). We use Tree::TreeNode this projecct as well, though the grammar at he moment is very simple.
Other files in lib/ are mosty junk and/or irrelevant will be removed.

Grammar

Below is the grammar currently used. It "works" but is not satisfactory. For example, we should have something like

  environment = BEGIN_ENV expr2 END_ENV

where expr2 is an expr that doesn’t conatin END_ENV. For this reason we can’t just make a call to expr. Currently we have

   environment = BEGIN_ENV env_text END_ENV
   env_text = { words != END_ENV }

The effect is to recognize the tokens

   BEGIN_ENV env_text END_ENV

and push them onto the stack. What we need to push instead is the sequence

   BEGIN_ENV parse(env_text) END_ENV

To be continued …

  # Grammar
  # -------

  # Productions
  # -----------
  # document = header BEGIN_DOC body END_DOC
  # header = { text_sequence | comment | macro_defs }
  # macro_defs = '%%begin_macro_defs' { defs } '%%end_macro_defs'
  # body = { expr }
  # expr = { item | text_sequence | macro | environment | inline_math | display_math }
  # inner_expr = { text_sequence | macro | environment }
  # text_sequence = ordinary prose: a sequence words with no in-line math, display
  #        math. or macros (control sequences)
  # macro = \command \{ {args} \}
  # environment = BEGIN_ENV {label | expr } END_ENV
  # inline_math = $ math_text $
  # display_math = \[ { inner_expr } \]
  # item = '\item` text_sequence

  # Terminals:
  # ----------
  # BEGIN_DOC = '\begin{document}'
  # END_DOC = '\end{document}'

  # Pseudoterminals
  # --------------
  # BEGIN_ENV = '\begin{' env_name '}'
  # END_ENV = '\end{' env_name '}'

  # NOTES. word_rx = [a-zA-Z]* -- what about punctiation, numerals, etc.?
  #        Maybe a word should just be: no '\', or no leading '\'
  #

A pseudoterminal such as \begin{theorem} matches \end{theorem}, \begin{definition} matches \end{definition}, etc.

Warning

the production for environment is not yet implemented per the grammar.

The Parsing Process for `environment`

The table with items 1-5 below describes the parsing process for environment. Each line in the table has the form

  label, stack | input

Items of the form (x) are nodes, items of the form [y] are tokens, and other items are words of input.

  1. ?, (node) ... (node) | be 1 ... 2 ee ...
  2. ee, [2] ... [1] [be] | ...
  3. [ee] (node) ... (node) | [1] ... [2] [ee] ...
  4. (NODE) ... (NODE) [ee] (node) ... (node) |  ...
  5. (ENV_NODE) (node) ... (node) | ...

Phase 1: The parsers sees be, which has the form \begin{foo}. The method environment is called. In phase (1) it gets tokens repeatedly until it sees ee = \end{foo}. The state is now (2).
Phase 2: Push [ee] onto the input stream and save it as mark; push [2] … [1] in reverse order onto the input stream; pop [be], push mark. The state is now as in (3).
Phase 3: call expr.
Phase 4: expr has returned. Set n = depth([ee]). Create ENV_NODE, make (NODE) … (NODE) its descendants, pop [ee], push ENV_NODE onto stack. The state is as in (5).
Phase 5: return

Testing

Still very primitive, but I am using rspec and want to use @jirutka’s doctest as well. The Reader is well tested.

To test the parser, run

  $ rspec spec/parser_spec.rb

At the moment I am working on

making the parser produce a tree instead of a list of nodes.
the method render_tree in module RenderNode
passing the "identity test" in parser_spec.rb: the yield of the parser should be the same as the input text modulo white space. This is tested using test(file) or test(file, verbose: true) which is defined in parser_spec.rb. Here is a typical test:

 it 'can parse and render environments' do
    test('environment.tex')
  end

This approach is inspired by @jirutka’s doctest. Once the parser implements whatever the final grammar is and passes this text, another render method can be developed to produce the proper Asciidoctor text.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
lib		lib
spec		spec
text		text
.editorconfig		.editorconfig
README.adoc		README.adoc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Latex2Asciidoctor

Design notes

The Basic Idea

Grammar

The Parsing Process for `environment`

Testing

About

Releases

Packages

Languages

jxxcarlson/latex2asciidoctor

Folders and files

Latest commit

History

Repository files navigation

Latex2Asciidoctor

Design notes

The Basic Idea

Grammar

The Parsing Process for environment

Testing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

The Parsing Process for `environment`

Packages