
Introduction to Extended Affix Grammars


The parser generator schema

Parser generators are quite well known: they process a context-free grammar specification to generate a parser.

The generated parser processes arbitrary input text, deciding whether that text is a sentence of the language described by the context-free grammar the parser was generated for. If so, it builds a parse tree that represents the input structurally.

(Figure: Parser Generation Schema)

The compiler generator schema

A compiler generator, however, consists of a parser generator accompanied by an evaluator generator. The parse tree built by the generated parser is passed to the generated evaluator, which performs semantic checks and generates output in the specified target language.

(Figure: Compiler Generation Schema)

However, what is the input to a compiler generator?

Partly, it must be a context-free grammar specification. But that is not sufficient! All real compilers are written for languages (like all general-purpose programming languages) which are not context-free but context-sensitive.

The specification for a compiler generator must belong to the class of context-sensitive grammars to be powerful enough to define a context-sensitive language.

A context-sensitive language example

A quite famous example of a context-sensitive language that is not context-free is

{ aⁿbⁿcⁿ | n ≥ 1 } = { abc, aabbcc, aaabbbccc, … }

For example, words of this language are "abc", "aabbcc", "aaabbbccc" and so on. The number of "a", "b" and "c" characters is always equal in a word that belongs to that language.
This cannot be specified by a context-free grammar!
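
To make the counting requirement concrete, here is a minimal hand-written D sketch (our own illustration, not anything generated by Gamma) that checks whether a string belongs to { aⁿbⁿcⁿ | n ≥ 1 }:

import std.stdio;

// Returns true if s consists of n "a"s, then n "b"s, then n "c"s, with n >= 1.
bool isInLanguage(string s)
{
    size_t i = 0;
    size_t run(char ch)                        // count a maximal run of ch
    {
        size_t n = 0;
        while (i < s.length && s[i] == ch) { ++i; ++n; }
        return n;
    }
    immutable na = run('a');
    immutable nb = run('b');
    immutable nc = run('c');
    return i == s.length && na >= 1 && na == nb && nb == nc;
}

void main()
{
    writeln(isInLanguage("aabbcc"));   // true
    writeln(isInLanguage("aaabbc"));   // false, the counts differ
}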

First attempt using a context-free grammar

Let’s try to describe this language using a context-free grammar. We will use a formalism similar to the BNF grammar formalism.

S: A B C. 
A: | 'a' A. 
B: | 'b' B. 
C: | 'c' C. 

Two possible word derivations using that grammar are

S →* aaabbbccc
S →* aaabbc

Both words are part of the language defined by our context-free grammar. However, the second word is not part of our context-sensitive language { aⁿbⁿcⁿ | n ≥ 1 }: the number of "a", "b" and "c" characters is not equal! The language defined by this context-free grammar has "more" words than the context-sensitive one we are looking for.

So, the context-free grammar from above is not sufficient to describe our language. We somehow need to "count" the "a", "b" and "c" characters and compare the counts.
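
For illustration, here is a small hand-written recognizer in D that mirrors this context-free grammar (a sketch of ours, not the parser the generator would produce). It happily accepts "aaabbc", demonstrating the over-generation:

import std.stdio;

// Recognizes the language of:  S: A B C.  A: | 'a' A.  B: | 'b' B.  C: | 'c' C.
bool parse(string s)
{
    size_t i = 0;
    void skipRun(char ch) { while (i < s.length && s[i] == ch) ++i; }
    skipRun('a');           // A
    skipRun('b');           // B
    skipRun('c');           // C
    return i == s.length;   // S: A B C.
}

void main()
{
    writeln(parse("aaabbbccc"));  // true, in the intended language
    writeln(parse("aaabbc"));     // also true: the grammar over-generates
}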

Short excursion: Unary Numeral System

The Unary Numeral System is probably the oldest numeral system of all. Every child starts counting on its fingers, its first unary numeral system. The earliest humans probably used it, too. And in Germany, beers in the pub are still counted that way.

Here is an example using tally marks and a second using "1" for counting:

0:             0: 
1: |           1: 1
4: ||||        4: 1111

This kind of counting can be expressed by a context-free grammar:

N → '1' N.
N → .

The non-terminal N derives to either a leading '1' followed by a recursive application of the same non-terminal N, or to the empty alternative. The final dot just marks the end of a rule, which is especially helpful for spotting empty alternatives.

This grammar defines the infinite language of all sequences of '1's, i.e. all unary numerals. Formally:

L(G_N) = { ɛ, 1, 11, 111, … } = { 1ⁿ | n ≥ 0 }

We will come back to that later on.

Note
ɛ describes the empty word. As N derives to the empty word too, it must be part of the resulting language.
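
As a small aside, the two meta rules translate almost literally into two recursive D functions, one deriving and one recognizing unary numerals (again just a sketch of ours to make the grammar tangible):

import std.stdio;

string toUnary(size_t n)       // derive N →* 1ⁿ
{
    return n == 0 ? "" : "1" ~ toUnary(n - 1);    // N → '1' N.  /  N → .
}

bool isUnary(string s)         // recognize a sentence of L(G_N)
{
    return s.length == 0 || (s[0] == '1' && isUnary(s[1 .. $]));
}

void main()
{
    writeln(toUnary(4));       // 1111
    writeln(isUnary("111"));   // true
}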

Counting as part of a context free grammar

But back to our context-sensitive example aⁿbⁿcⁿ. As we have figured out, we need to count the "a", "b" and "c" characters and compare their counts for equality.

Let’s first do it naively by setting up a context-free grammar for our language in which the number of "a", "b" and "c" characters still to be recognized is encoded in the non-terminal names. That number is expressed as a corresponding sequence of "1"s.

We would start with S, the start symbol of the grammar.

S → A1   B1   C1.
S → A11  B11  C11.
S → A111 B111 C111.
...

We would need to add one rule for each natural number, where the number of "1"s behind the A, B and C non-terminal references corresponds to that natural number.

The same for the A rule:

A    → .
A1   → 'a' A.
A11  → 'a' A1. 
A111 → 'a' A11.
...

Informally, the number of "1"s on the left-hand side of a grammar rule corresponds to the number of "a" characters recognized once all non-terminals are resolved. The same is done for the B and C rules.

Admittedly, this leads to an infinite number of rules, so it cannot be written down! But this infinite grammar exactly defines the language in question.
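
Although the grammar cannot be written down completely, its rules can be enumerated mechanically. The following D snippet (purely illustrative) prints the first members of the infinite rule family for S:

import std.stdio;
import std.array : replicate;

void main()
{
    // print the first few rules of the infinite family for the start symbol S
    foreach (size_t n; 1 .. 4)
    {
        auto ones = replicate("1", n);
        writefln("S → A%s B%s C%s.", ones, ones, ones);
    }
}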

Nevertheless, let’s verify this on an example by writing down the grammar derivation steps for a particular sentence of our language. The example sentence is

aabbcc

We start at the grammar’s start symbol and try to derive the example sentence. The grammar rule applied in each derivation step is given in brackets behind it:

S   → A11 B11 C11
    → aA1 B11 C11   [A11 → 'a' A1.]
    → aaA B11 C11   [A1 → 'a' A.]
    → aa B11 C11    [A → .]
    → aa bB1 C11    [B11 → 'b' B1.]
    → aa bbB C11    [B1 → 'b' B.]
    → aa bb C11     [B → .]
    → aa bb cC1     [C11 → 'c' C1.]
    → aa bb ccC     [C1 → 'c' C.]
    → aa bb cc      [C → .]

This shows that the sentence aabbcc can be derived from the start symbol S, and therefore it is part of the language defined by the infinite grammar above.

Get rid of the infinity

Still, the grammar is infinite. It cannot be written down. How can we overcome this?
To get rid of the infinity we will apply two tricks.

First, let’s add <> parentheses to the rule names to separate the context-free grammar’s original non-terminals from the "1" counting name parts:

S → A <1>   B <1>   C <1>.
S → A <11>  B <11>  C <11>.
S → A <111> B <111> C <111>.
...

The added <> parentheses can just be taken as part of the rule names, and nothing changes from the grammar and sentence derivation point of view. The non-terminal names just got longer. The grammar is still infinite.

The second trick is to introduce a second grammar, called the Meta Grammar, and to embed it into the first one. That’s where the term "two-level grammar" comes from.

You probably remember the grammar for the Unary Numeral System from our excursion:

N → '1' N.
N → .

Any of the sequences of "1"s inside the <> parentheses in the grammar is a sentence of the language built by the Unary Numeral System grammar.

N →* 1
N →* 11
N →* 111
N →* 1111
...

So each sequence of "1"s inside the <> parentheses could be replaced by the non-terminal N, if N is treated as the start symbol of a sentence of the second grammar, the meta grammar, while the <> parentheses are treated as real parentheses separating the first-grammar parts from the second-grammar parts.

If we do this for all of the (infinitely many) grammar rules, they collapse into a finite number of rules.

The rules for start symbol S

S → A <1>   B  <1>   C <1>.
S → A <11>  B <11>  C <11>.
S → A <111> B <111> C <111>.
...

shrink down to the following single rule if the sequences of "1"s are substituted by N, the start symbol of the Unary Numeral System grammar:

S → A <N> B <N> C <N>.

We call a grammar part inside the <> parentheses an Affix and, more specifically, a non-terminal of the meta grammar inside it an Affix Variable. It’s a variable as it stands for an arbitrary sentence of the meta grammar language; in our case, an arbitrary number of "1"s.

And here a major constraint of our formalism comes into play: if an affix variable occurs several times in a rule, it must derive to the same sentence of the meta grammar language. Here, it means that all occurrences of N behind A, B and C must derive to the same number of "1"s. This constraint is later ensured by the implementation of the compiler generator. We call it the Principle of Consistent Substitution.
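
A tiny D sketch (the function name and string representation are ours) may help to picture the Principle of Consistent Substitution: every occurrence of the affix variable N in a hyper rule is replaced by one and the same meta sentence.

import std.stdio;
import std.array : replace;

string substituteConsistently(string hyperRule, string metaSentence)
{
    // every <N> gets the same meta sentence, hence "consistent" substitution
    return hyperRule.replace("<N>", "<" ~ metaSentence ~ ">");
}

void main()
{
    writeln(substituteConsistently("S → A <N> B <N> C <N>.", "11"));
    // prints: S → A <11> B <11> C <11>.
    // Replacing the occurrences with different sentences, say <11>, <1> and <111>,
    // would violate the Principle of Consistent Substitution.
}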

Let us do the same for the A rules as an example. It works analogously for B and C; they are structurally equivalent.

A <>    → .
A <1>   → 'a' A <>.
A <11>  → 'a' A <1>. 
A <111> → 'a' A <11>.
...

becomes

A <>    → .
A <'1' N>   → 'a' A <N>.

That way, we get only two rules for A. The same constraint applies here. But as the number of "1"s on the left-hand and right-hand sides of the rule differs, only part of the "1"s can be substituted by an N, because N must derive to the same number of "1"s within one rule according to the Principle of Consistent Substitution.

So one "1" on the left-hand side of the rule remains. But this just represents the counting of one "a" from the right-hand side of the rule.

This becomes clearer when we try to derive a sentence of the language using our two-level grammar. Let’s again derive the sentence "aabbcc", as we did before.

Note: the rule in brackets below is the one applied to produce the derivation step result shown on the left. The previous derivation state the rule was applied to is shown one line above. The character "ɛ" is epsilon, the empty word.

S   → A <N> B <N> C <N>        [S → A <N> B <N> C <N>]
    → A <1N>  B <N>   C <N>    [meta: N → '1' N.]
    → A <11N> B <N>   C <N>    [meta: N → '1' N.]
    → A <11>  B <N>   C <N>    [meta: N → .]
    → A <11>  B <1N>  C <N>    [meta: N → '1' N.]
    → A <11>  B <11N> C <N>    [meta: N → '1' N.]
    → A <11>  B <11>  C <N>    [meta: N → .]
    → A <11>  B <11>  C <1N>   [meta: N → '1' N.]
    → A <11>  B <11>  C <11N>  [meta: N → '1' N.]
    → A <11>  B <11>  C <11>   [meta: N → .]
    → aA <1>  B <11>  C <11>   [A <'1' N> → 'a' A <N> . | with N →* 1 ]
    → aaA <>  B <11>  C <11>   [A <'1' N> → 'a' A <N> . | with N →* ɛ ]
    → aa     B <11>  C <11>    [A <> → .]
    → aa     bB <1>  C <11>    [B <'1' N> → 'b' B <N> . | with N →* 1 ]
    → aa     bbB <>  C <11>    [B <'1' N> → 'b' B <N> . | with N →* ɛ ]
    → aa     bb     C <11>     [B <> → .]
    → aa     bb     cC <1>     [C <'1' N> → 'c' C <N> . | with N →* 1 ]
    → aa     bb     ccC <>     [C <'1' N> → 'c' C <N> . | with N →* ɛ ]
    → aa     bb     cc         [C <> → .]

So, for a two-level grammar the derivation works the same way as for a one-level grammar: derivation rules simply get applied, just a few more of them.
This shows that this approach is not just a trick but a formal method!

The grammar with affixes in parentheses is called Hyper Grammar.

Running the compiler generator

Time to get the compiler generator to run on our example.

The BNF-like grammar of our two-level grammar, given in the Gamma grammar notation, may be looked up at abc-bnf.eag. It has the following content:

N = | '1' N.

S <+ N: N>: A <N> B <N> C <N>.

A <+ '1' N: N>: 'a' A <N>.
A <+ : N> : .

B <- '1' N: N>: 'b' B <N>.
B <- : N>:.

C <- '1' N: N>: 'c' C <N>.
C <- : N>: .

The rules of the meta grammar are given in a simple BNF syntax, with the equals sign separating the left-hand side of a rule from the right-hand side. Rule alternatives may be separated by the pipe operator. A rule is always finished with a dot. Terminals are surrounded by single quotes.

N = | '1' N.

The syntax for the hyper rule of the start symbol S is very similar to the notation we used above. However, the left-hand side of the hyper rule is separated from the right-hand side by a colon. The rule is finished by a dot.

S <+ N: N>: A <N> B <N> C <N>.

The first hyper rule non-terminal in the grammar specification is always treated as the start symbol of the grammar. It also becomes the name of the executable compiled by the D compiler.

The start symbol is also required to have an output affix that becomes the result of the compilation, a formal constraint of the Gamma compiler generator formalism. It will be printed to the console by the generated compiler. An output affix is always marked with a "+" sign. Each affix is also annotated with its affix type, the type of the meta non-terminal it represents, separated by a colon from the affix form. This annotation is also called the affix signature.

S <+ N: N>

The rules for the non-terminal A are similar. They are separated into two rules for the two needed alternatives: the rule which "counts" the "a" characters, and the rule which derives to the empty word.

A <+ '1' N: N>: 'a' A <N>.
A <+ : N> : .

The string '1' N is a non-empty affix of the affix type N representing the first rule alternative of the N rule (see above). In contrast, the second rule of A specifies an empty affix form, the second rule alternative of the N rule. Both have the plus sign in front, which means that they represent an output affix.

Hyper rule alternatives can also be specified more concisely using the pipe operator, as for the meta rules, and by using an EBNF formalism. But more about that later, in another article.

In the following we have prepared an online development environment based on Gitpod.

Click on the following link and a development environment will be opened with the Gamma compiler generator pre-compiled and the Gamma grammar specification for our abc grammar opened in the editor:

DISCLAIMER
A free registration at gitpod.io is required when following the link!

Executing the following command will generate and compile a compiler for our language, as shown in the following console screenshot:

gitpod /workspace/gamma (master) $ ./gamma -s example/bnf/abc.eag
info: Epsilon 1.02   JoDe/SteWe  22.11.96
info: SOAG-Evaluatorgenerator 1.06 dk 14.03.98
info: Analysing S
info: S grammar is valid
info: Analyser: no warnings on S's hyper-nonterminals
info: predicates in S: 0
info: ELL(1) testing S
info: S grammar is ELL(1)
info: SLAG testing S
info: S grammar is SLAG
info: LexGen writing S
info: ELL(1) writing S
info: +rc +ct
dmd S.d SLex.d -g include/runtime.d src/io.d src/log.d src/epsilon/soag/listacks.d
gitpod /workspace/gamma (master) $

The compiler code is generated into the build/ directory as some D source code files and associated tables. An executable S is also compiled, which we will invoke on an example to see our compiler run.

NOTE
Epsilon is the predecessor of Gamma and was originally implemented in Modula-2. Its name is still in the code…

Let’s start with a good case. The compiler executable is a real Unix tool: it accepts piped input. So we can pipe input on the command line directly into the compiler invocation:

gitpod /workspace/gamma (master) $ echo aaabbbccc | ./S
info: S compiler (generated with epsilon)
1 1 1
gitpod /workspace/gamma (master) $

The invocation does not fail with an error; the input "aaabbbccc" is part of our language. As a result, the compiler prints the compilation result: the number of counted "a", "b" and "c" characters, expressed as an equally long sequence of "1"s.

As an example of a sentence not part of the language, we just remove one "c" at the end.

gitpod /workspace/gamma (master) $ echo aaabbbcc | ./S
info: S compiler (generated with epsilon)
error: analysis in C failed
stdin:2:1
          ^
info: errors detected: 1
gitpod /workspace/gamma (master) $

The compiler run fails as expected with an error complaining that the (semantic) analysis fails for the non-terminal C. This is correct, as the number of "c" characters in the input is not equal to the number of "a" and "b" characters.

NOTE
The "^" character is intended to mark the error position. This works better on real files but not very well on piped input. Ignore it for now, please.

If we remove another "b" character, we will get 2 errors:

gitpod /workspace/gamma (master) $ echo aaabbcc | ./S
info: S compiler (generated with epsilon)
error: analysis in B failed
stdin:1:6 aaabbcc
               ^
error: analysis in C failed
stdin:2:1
          ^
info: errors detected: 2
gitpod /workspace/gamma (master) $

Semantic checks as affix specifications

This makes it obvious: the count of "a" characters is always treated as the "right" count. We will never get an error for a wrong number of "a" characters. This is determined by how the analysis is formally controlled. To understand why this is the case, we need to examine the rules for B and C. The rules for B are defined as follows:

B <- '1' N: N>: 'b' B <N>.
B <- : N>: .

When we look closer, the only difference from the A hyper rules, apart from the "b" characters in upper and lower case, is the minus sign in front of the affix form. Here are the A hyper rules again for comparison:

A <+ '1' N: N>: 'a' A <N>.
A <+ : N> : .

The minus sign defines an affix as an input affix. The direction of an affix, input or output, specifies the data flow of the affix values inside the compiler and implicitly defines where the comparisons take place that ensure the Principle of Consistent Substitution holds.

When specifying a compiler with Gamma, we mainly think in terms of affix value flows which are attached to the hyper non-terminals of our grammar.

As a first approach, it is sometimes easier to think of the alternatives of a hyper non-terminal as one method declaration: the left-hand side of a rule is the method signature, and output affixes are output parameters. The right-hand side of the rule is the method body, and terminals are used to decide which alternative to choose.

Let’s examine this with the A rules first. Here, we could imagine them as the following method declaration:

void A(out N affix) {
    if (is("a")) {                        // next input character is an "a"
        A(affixOfRecursion)
        affix = "1" affixOfRecursion      // prepend a "1" to the recursion's result
    }
    else
        affix = ɛ // return the empty word
}

So, the A method implements a recursion which, in its termination case, returns the empty word. In its recursive case, it constructs the result of the current step by putting a "1" in front of the result of the recursion.

However, when representing the B rules as a method, the affix parameter becomes an input parameter which needs to be deconstructed:

void B(in N affix) {
    if (is("b")) {                           // next input character is a "b"
        assert(affix == "1" N)               // affix must have the form '1' N
        let ("1" affixOfRecursion) = affix   // decomposition
        B(affixOfRecursion)
    }
    else
        assert(affix == ɛ)  // the empty word
}

In the recursive case, the incoming affix is checked to be structurally equal to the sentence form corresponding to the first rule alternative of the meta non-terminal N. It is then implicitly deconstructed into a leading "1" and the representation of an N value, which gets passed down into the recursive call.

In the termination case, it is checked whether the empty word has been passed in.

That way, the whole passed-in sentence of "1"s gets deconstructed, checking that for each "1" a corresponding "b" can be counted.
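
Putting the pieces together, here is a complete, runnable D sketch of this method analogy. It is entirely hypothetical hand-written code, not what Gamma actually generates; the helpers input, pos and lookingAt are our own. A synthesizes the unary affix as an output, while B and C consume it as an input and check it:

import std.stdio;

alias N = string;      // an affix value of type N, represented as a string of '1's

string input;          // remaining input text
size_t pos;            // current read position

bool lookingAt(char ch) { return pos < input.length && input[pos] == ch; }

// A <+ '1' N: N>: 'a' A <N>.   /   A <+ : N>: .
void A(out N affix)
{
    if (lookingAt('a')) {
        ++pos;
        N rest;
        A(rest);
        affix = "1" ~ rest;   // output affix: one '1' per recognized 'a'
    } else {
        affix = "";           // the empty word
    }
}

// B <- '1' N: N>: 'b' B <N>.   /   B <- : N>: .
void B(in N affix)
{
    if (lookingAt('b')) {
        ++pos;
        assert(affix.length > 0 && affix[0] == '1', "analysis in B failed");
        B(affix[1 .. $]);     // deconstruct '1' N and recurse
    } else {
        assert(affix == "", "analysis in B failed");
    }
}

// C is structurally identical to B, just for 'c'.
void C(in N affix)
{
    if (lookingAt('c')) {
        ++pos;
        assert(affix.length > 0 && affix[0] == '1', "analysis in C failed");
        C(affix[1 .. $]);
    } else {
        assert(affix == "", "analysis in C failed");
    }
}

void main()
{
    input = "aabbcc";
    N n;
    A(n);                     // S <+ N: N>: A <N> B <N> C <N>.
    B(n);
    C(n);
    assert(pos == input.length, "unexpected trailing input");
    writeln(n);               // prints the output affix: 11
}

Running it on "aabbcc" prints 11; removing the last "c" makes the assertion in C fail, mirroring the "analysis in C failed" error of the generated compiler.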

NOTE
This representation of the affix flow semantics and of the hyper rules as methods is a useful approach for reasoning about a grammar specification. However, not all affix dependency graphs created by affix flows can be represented that way and result in a method-like implementation. Where possible, the Gamma compiler generator generates methods in that way; for more sophisticated affix dependency graphs, it is able to generate more elaborate affix evaluation mechanisms.