Introduction to Extended Affix Grammars

Table of Contents

The parser generator schema
The compiler generator schema
A context-sensitive language example

The parser generator schema

Parser generators are quite well known: They process a context free grammar specification to generate a parser.

The generated parser processes an arbitrary input text deciding whether this text is a sentence of the language described by the context free grammar the parser generator has generated the parser for. And if so, it almost generate a parse tree representing the input structurally.

parser generator schema

The compiler generator schema

A compiler generator, however, consist of an parser generator accompanied with an evaluator generator. The constructed parstree built by the generated parser is passed to the generated evaluator which performs semantic checks and generates the output in the specified target language.

compiler generator schema

However, what is the input to a compiler generator?

It must be partly a context free grammar specification. But this is not sufficient! Almost all real compilers are written for languages (like all general purpose programming languages) which are not context free, they are context-sensitive.

The specification for an compiler generator must belong to the class of context-sensitive grammars to be powerfull enough to define a context-sensitive language.

A context-sensitive language example

A quite famous context-sensitive language example which is not context-free is

{ aⁿbⁿcⁿ | n ≥ 1 } = { abc, aabbcc, aaabbbccc, … }

E.g., words of this language are "abc", "aabbcc", "aaabbbccc" and so on. The number of "a", "b" and "c" characters is always equal in a word that belongs to that language.
This cannot be specified by a context-free grammar!

First attempt using a context-free grammar

Let’s try to describe this language using a context-free grammar. We will use a formalism similar to the BNF grammar formalism.

S: A B C. 
A: "a" A | . 
B: "b" B | . 
C: "c" C | .

Possible word derivation using that grammar are

S →* aaabbbccc
S →* aaabbc

Both words are part of the language defined by our context-free grammar. However, the second word is not part of our context-sensitive language { aⁿbⁿcⁿ | n ≥ 1 } — the number of "a", "b" and "c" characters are not equal! The language defined by this context free grammar has "more" words than the context-sensitive one we are looking for.

So, the the context-free grammar from above is not sufficient to describe our language. We somehow need to "count" the "a", "b" and "c" characters and compare the counts.

Short excursion: Unary Numeral System

The Unary Numeral System is probably the first numeral system at all. Any child is starting to count using its fingers — its first unary numeral system. The first humans have probably used it. And in Germany they are counting the beers in the pub that way.

Here is an example using tally marks and a second using "i" for counting:

0:             0: 
1: |           1: i
4: ||||        4: iiii

This kind of counting can be expressed by a context free grammar:

N → 'i' N .
N → .

The nonterminal N derives to either a leading 'i' followed by a recursive application of the same nonterminal N, or it derives to the empty alternative. The final dot just marks the end of rule, especially helful for seeing empty alternatives.

This grammar defines the infinite language of all countable 'i’s. Formally:

L(G_N) = { i, ii, iii, … } = { iⁿ | n ≥ 0}

We will come back to that later on.

Counting as part of a context free grammar

But back to our context sensitive example aⁿbⁿcⁿ. As we have figured out, we need to count the "a", "b" and "c" characters to compare the equality of their counts.

Let’s first do it naively by setting up a context free grammar for our language which represents the number of to be recognized characters of "a", "b" and "c" as part of the nonterminal name. The number of the characters is expressed as the corresponding number of "i"s.

We would start with the S, the start symbol of the grammar.

S → Ai   Bi   Ci .
S → Aii  Bii  Cii .
S → Aiii Biii Ciii .
...

We would need to add one rule for each natural number where the number of "i" behind the A, B and C nonterminal references corresponds to a natural number.

The same for the A rule:

A    → .
Ai   → "a" A .
Aii  → "a" Ai . 
Aiii → "a" Aii .
...

Informally, here the number of "i" on the left side of the grammar rule corrsponds to the number of recognized "a" characters when all nonterminal gets resolved. This has been done similar for rules B and C.

Yes, this leads to an infinite number of rules. So it cannot be written down! But this infinite grammar extactly defines our language in questions.

Nevertheless, let’s verify this on a example by writing down the grammar derivation steps for a particular sentence of our language. The example sentence is

aabbcc

We start at the grammars start symbol trying to derive the example sentence. The grammar rules applied for the current derivation step is given in paranthesis behind:

S   → Aii Bii Cii
    → aAi Bii Cii   [Aii → ”a” Ai .]
    → aaA Bii Cii   [Ai → ”a” A .]
    → aa Bii Cii    [A → .]
    → aa bBi Cii    [Bii → ”b” Bi .]
    → aa bbB Cii    [Bi → ”b” B .]
    → aa bb Cii     [B → .]
    → aa bb cCi     [Cii → ”c” Ci .]
    → aa bb ccC     [Ci → ”c” C .]
    → aa bb cc      [C → .]

This shows that the sentence aa bb cc can be derived from the start symbol S and therefore this sentence is part of the language defined by the infinte grammar from above.

Get rid of the infinity

Still, the grammar is infinite. It cannot be written down. How to overcome this?
To get rid of it we will apply 2 tricks.

First, lets add <> parenthesis to the rule names to separate the original nonterminals from the context free grammar from the "i" counting name parts:

S → A<i>   B<i>   C<i> .
S → A<ii>  B<ii>  C<ii> .
S → A<iii> B<iii> C<iii> .
...

The added <> paranethesis can just be taken as part of the rule names and nothing will change from grammar and sentence derivation point of view. The nonterminal names just got longer. The grammar is still inifinite.

The second trick is to introduce a second grammar, called the Meta Grammar, and embedd it into the first one. … That’s where the term "two level grammar" is coming from.

You probably remember the grammar for the Unary Numeral System from our excursion:

N → 'i' N .
N → .

Anyone of the sequences of "i" in the grammar with <> parenthesis is a sentence of the language build by the Unary Numeral Sytem grammar.

N →* i
N →* ii
N →* iii
N →* iiii
...

So each occurence of "i" in the grammar with <> parenthesis could be replaced by the nonterminal N if N is treated as the start symbol for a sentence of the second grammar, the meta grammar. While the <> parenthesis are treated as real paranthesis separating the first grammar from the second grammar parts.

If we do this for all the infinite number of grammar rules they become a finite number.

The start symbol rules of S

S → A<i>   B<i>   C<i> .
S → A<ii>  B<ii>  C<ii> .
S → A<iii> B<iii> C<iii> .
...

shrinks together to the one following rule if sequences of "i" are substituted by the start symbol of the Unary Numeral System grammar N:

S → A<N> B<N> C<N> .

We call the grammar parts in the parenthesis an Affix and more specifc, the nonterminal of the meta grammar inside an Affix Variable. It’s a variable as it stands for an arbitrary sentence of the meta grammar language. In our case it can be an arbitrary number of "i".

And here a major constraint of our formalism comes into play: If an affix variable occurs several times in a rule, it must be derived to the same sentence of the meta grammar language. Here it means that all N behind A, B and C must derive to the same number of "i". This coinstraint is later on ensure by the implementation of the compiler generator. We call it, the Principle of Consistent Substitution.

We do it similar for the A rule as an example. It may be done similar for B and C. They are structurally equivalent.

A<>    → .
A<i>   → "a" A .
A<ii>  → "a" A<i> . 
A<iii> → "a" A<ii> .
...

becomes

A<>    → .
A<"i" N>   → "a" A<N> .

That way, we got only 2 rules for A. Here the same constraint is applied. But as the number if "i" on the left and the right hand side of the rules are different, only part of the "i" can be substituted by an N because N must be derived to the same number of "i" inside one rule according to the Principle of Consistent Substitution.

So one "i" on the left hand side of the rule remains. But this just represents the counting of an "a" from the right hand side of the rule.

This becomes more clear when we try to derive a sentence of the langue using our 2-level grammar. Let’s try to derive again the sentence "aabbcc" as we have done already before.

Note, the rule in the brackets below are the one which got applied for the derivation step result shown on the left. So, the previous derivation state the rule got applied to is shown one line above. And, the character "ɛ" is Epsilon, the empty word.

S   → A<N> B<N> C<N>        [S → A<N> B<N> C<N>]
    → A<iN>  B<N>   C<N>    [meta: N → "i" N.]
    → A<iiN> B<N>   C<N>    [meta: N → "i" N.]
    → A<ii>  B<N>   C<N>    [meta: N → .]
    → A<ii>  B<iN>  C<N>    [meta: N → "i" N.]
    → A<ii>  B<iiN> C<N>    [meta: N → "i" N.]
    → A<ii>  B<ii>  C<N>    [meta: N → .]
    → A<ii>  B<ii>  C<iN>   [meta: N → "i" N.]
    → A<ii>  B<ii>  C<iiN>  [meta: N → "i" N.]
    → A<ii>  B<ii>  C<ii>   [meta: N → .]
    → aA<i>  B<ii>  C<ii>   [A<"i" N> → ”a” A<N> . | with N →* i ]
    → aaA<>  B<ii>  C<ii>   [A<"i" N> → ”a” A<N> . | with N →* ɛ ]
    → aa     B<ii>  C<ii>   [A<> → .]
    → aa     bB<i>  C<ii>   [B<"i" N> → ”b” B<N> . | with N →* i ]
    → aa     bbB<>  C<ii>   [B<"i" N> → ”b” B<N> . | with N →* ɛ ]
    → aa     bb     C<ii>   [B<> → .]
    → aa     bb     cC<i>   [C<"i" N> → ”c” C<N> . | with N →* i ]
    → aa     bb     ccC<>   [C<"i" N> → ”c” C<N> . | with N →* ɛ ]
    → aa     bb     cc      [C<> → .]

So, the derivation is done the same way for a 2 level grammar as for a one level grammar, just derivation rules get applied. Some more rule steps have to be applied, however.
This shows that this approach is not just a trick but a formal method!

The grammar with affix in paranethesis is called Hyper Grammar.