Skip to content
Ian Kjos edited this page May 18, 2022 · 13 revisions

Welcome to the booze-tools wiki!

Documentation is Moving:

The new location is https://boozetools.readthedocs.io/en/latest/ but not everything has necessarily moved over just yet.

What have we got here?

As mentioned on the readme.md file, booze-tools is a set of language-processing tools.

At the moment, these are:

  • macroparse for both scanning and parsing with a separate language-definition document.
  • miniparse for bootstrapping context-free language recognition.
  • miniscan for bootstrapping lexical analysis.
  • The beginnings of GLR support.

macroparse

MacroParse takes language definition documents and produces corresponding automatons for both scanner and parser. As JSON objects, these are independent of host-language (e.g. Python, C, or Java) but sample implementations for Python are provided.

MacroParse is so named because it supports "macro-productions" which enable you to encapsulate patterns (e.g. "comma-separated-list-of-FOO") with standardized semantics rather than having to repeat yourself all over the grammar as normal BNF would require. As such, it's a bit nicer than typical EBNF.

It also supports embedded actions and a few other abbreviation devices. Full examples are provided explaining how to access (almost?) all the features.

miniscan

MiniScan makes scanners: these analyze text into the basic lexical units like words, numbers, and punctuation. MiniScan definitions associate a set of patterns (regular expressions) with corresponding actions to perform.

The mini-scanner also supports:

  • Rule ranks, which can override the general leftmost-longest heuristic and yield a shorter match that ranks higher.
    • For example, foo at rank 1 preempts [a-z]+ at (default) rank 0 for input foobar.
  • Intersection of character classes:
    • For example, [A-Z&&^AEIOU] matches an upper-case consonant.
  • Perl-style counted repetitions.
    • For example, {xdigit}{4,8} matches anywhere between a 16 and 32-bit hexadecimal numeral.
  • Named subexpressions.
    • If the subexpression is a character class, it may be referred to inside another character class.
  • Unicode input.

The MiniScan how-to page explains in detail how to get set up and working with the mini-scanner. The MiniScan usage notes page covers the regular expression dialect and limitations.

miniparse

This is your basic LALR(1) or LR(1) parser accepting BNF rules, with one key extra-fancy feature:

  • You can have more than one "start" symbol within a single grammar, and select among them at parse time.

In your application, you might want to be able to parse specific sub-phrases of a larger language without having to duplicate (parts of) the grammar. There is real application of this feature in the way the miniscan module grapples with named sub-expressions, which may not contain anchors but otherwise run the gamut of regular expressions.

The miniparse how-to page contains further detail.

GLR bits:

The algorithms are in place for generating non-deterministic parse tables for all supported strengths. These are a straightforward extension of the deterministic case.

Simple routines are provided for both a test-parse (unconcerned with semantics) and normal semantic parsing with a tree-structured stack.

A tree-structured stack works OK for mild non-determinism over small-ish inputs, but it's not really ready for prime time. Preferable is a true graph-structured stack, with merging as in Elk-Hound or Bison. The code is yet to come. With it will probably arrive various disambiguation strategies, including support for computed semantic predicates.

Examples:

For a complete, functional, working, tested example, please see the example folder, and in particular the file JSON.py therein defined. To see it in action, look in tests/test_examples.py.