This repository has been archived by the owner on Mar 25, 2022. It is now read-only.

Commit

Rewrite
Jeffrey Kegler authored and Jeffrey Kegler committed Feb 20, 2014
1 parent 69a27aa commit 6c9614d
Showing 1 changed file with 108 additions and 50 deletions.
158 changes: 108 additions & 50 deletions recce.ltx
@@ -186,7 +186,7 @@ Marpa::XS\cite{Marpa-XS},
the first stable version of a tool from the Marpa project,
was uploaded to the CPAN Perl archive
on Solstice Day in 2011.
This paper describes the algorithm of Marpa::R2\cite{Marpa-R2},
This \doc{} describes the algorithm of Marpa::R2\cite{Marpa-R2},
the current version.

As implemented,
@@ -353,7 +353,7 @@ $\LHS{\Vrule{r}}$ and $\RHS{\Vrule{r}}$, respectively.
This definition follows \cite{AH2002},
which departs from tradition by disallowing an empty RHS.

Note that this paper, departing from tradition, does not define
Note that this \doc{}, departing from tradition, does not define
\Cg{} using a set of non-terminals that is disjoint from
\Vsymset{terminals}.
As implemented, Marpa allows terminals to serve as LHS symbols.
@@ -487,61 +487,119 @@ It is therefore important that this rewrite be of a kind
that can be done and undone efficiently,
while preserving the semantics.

The original grammar is called the {\bf external grammar},
because in the Marpa implementation, that is the one which
the user sees.
The grammar at the completion of the rewrite is called
the {\bf internal grammar}.
The Marpa parse engine actually runs on the internal grammar.

Conceptually,
the rewrite takes place as if the following steps were executed.
the rewrite takes place as if the following steps were executed:

The actual implementation of the rewrite differs somewhat from
the above, for reasons of efficiency.
\begin{enumerate}
\item Where \Vsym{old-start} is the current start symbol,
create a new non-terminal, \Vsym{new-start}, and a new rule
\begin{equation*}
\Vdr{initial} = [\Vsym{new-start} \de \mydot \Vsym{old-start} ].
\end{equation*}
This rewrite step is very common, and is called ``augmenting''
the grammar.

\item Eliminate inaccessible and unproductive rules and symbols
from the grammar. If no rules are left, report an error.

\item If the grammar derives only the null string, it is called
a trivial grammar. The trivial grammar is treated as a special case.

\item Find those rules whose RHS contains more than two properly
nullable symbols.
Rewrite them by dividing them into multiple rules
until none of the rules contains more than two properly nullable
symbols.
In a worst case, this would be a rewrite into Chomsky
Normal Form.

\section{Properties of the rewritten grammar}
\label{s:rewrite-props}
\item Determine which symbols are properly nullable.
A symbol is properly nullable if and only if it is nullable,
but not nulling.
For each properly nullable symbol, create two aliases --
a nulling alias, and a non-nulling alias.

We have already noted
that no rules of \Cg{}
\item ``Factor'' any rules which contain properly nullable symbols
into rules which contain only nulling and non-nulling symbols.
This divides each rule into at most four ``factors''.

\item Discard all nulling rules.

\item Remove all nulling symbols from the rules, recording
their location.
Their locations should be recorded in two maps.
Create a map from each location in the rewritten rules to the sequence
of nulling symbols. This map can be used, for example, to restore
the nulling symbols when this rewrite is reversed during
evaluation.

\end{enumerate}
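
The distinction between nulling and properly nullable symbols, on which
the steps above depend, can be illustrated with a minimal sketch.
It is not the Marpa implementation.
It assumes that inaccessible and unproductive symbols have already been removed,
and, unlike Marpa as implemented, it assumes that terminals never appear on a LHS.
The function name \texttt{classify\_symbols} and the rule representation are
hypothetical, chosen only for this illustration.

```python
def classify_symbols(rules):
    """Return (nulling, properly_nullable) for a cleaned-up grammar.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbol names.
    Assumption of this sketch: a terminal is any symbol that never
    appears as a LHS, and every symbol is productive.
    """
    lhs_symbols = {lhs for lhs, _ in rules}
    symbols = lhs_symbols | {s for _, rhs in rules for s in rhs}
    terminals = symbols - lhs_symbols

    # Nullable: derives the null string (fixed-point iteration).
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True

    # Symbols that can derive at least one non-empty string.
    nonempty = set(terminals)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nonempty and any(s in nonempty for s in rhs):
                nonempty.add(lhs)
                changed = True

    nulling = nullable - nonempty            # derive only the null string
    properly_nullable = nullable & nonempty  # nullable, but not nulling
    return nulling, properly_nullable


# Example grammar: S -> A B, A -> a, A -> (empty), B -> (empty)
example_rules = [("S", ("A", "B")), ("A", ("a",)), ("A", ()), ("B", ())]
example_nulling, example_properly_nullable = classify_symbols(example_rules)
```

In the example grammar, \texttt{B} derives only the null string, while
\texttt{S} and \texttt{A} derive both the null string and a non-empty string.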

The actual implementation of the rewrite differs somewhat from
the above, for reasons of efficiency.
At all points in the rewrite process, a map from location in
the pre-rewrite rules to locations in the rewritten rules is
kept.
This map is a total function from external dotted rules to
internal dotted rules.
In reverse, this map is a partial function,
from internal dotted rules to external dotted rules.

This rewrite is based on that in Aycock and Horspool\cite{AH2002},
and we call it Chomsky-Horspool-Aycock Form (CHAF).
A major difference from \cite{AH2002}
is the division of rules into rules with at
most two properly nullable symbols before ``factoring''.
Without this step, factoring is exponential in the length of the rule,
and rules with many optional symbols on their RHS could have very large numbers
of factors.
SQL's select statement is one example of a rule
whose rewrite would be pathological if not divided up first.
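
The exponential growth can be made concrete with a small sketch.
It is an illustration only, not Marpa's code:
the function name \texttt{factor\_rhs} and the alias labels are hypothetical.
Each properly nullable symbol is replaced either by its non-nulling alias or by
its nulling alias, so a RHS with $n$ properly nullable symbols yields $2^n$ factors.

```python
from itertools import product


def factor_rhs(rhs, properly_nullable):
    """Enumerate the factors of one RHS.

    Each properly nullable symbol becomes either its non-nulling alias
    or its nulling alias; every other symbol is kept unchanged.
    The alias labels used here are hypothetical.
    """
    choices = [
        ((sym, "non-nulling"), (sym, "nulling"))
        if sym in properly_nullable
        else ((sym, "unchanged"),)
        for sym in rhs
    ]
    return [tuple(combination) for combination in product(*choices)]


# Two properly nullable symbols: four factors.
small = factor_rhs(("A", "b", "C"), {"A", "C"})

# Ten properly nullable symbols in one undivided rule: 2**10 factors.
wide_rhs = tuple("X%d" % i for i in range(10))
wide = factor_rhs(wide_rhs, set(wide_rhs))
```

With at most two properly nullable symbols per rule, no rule produces more than
four factors, which is why the rules are divided before factoring.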

From this point on, in this \doc{}
the grammar \Cg{} will refer to the internal grammar,
unless otherwise noted.
After the rewrite
no rules of \Cg{}
have a zero-length RHS,
and that all symbols must be either nulling or non-nullable.
These restrictions follow Aycock and Horspool\cite{AH2002}.
The elimination of empty rules and proper nullables
is done by rewriting the grammar.
\cite{AH2002} shows how to do this
without loss of generality.
and no symbols are nulling.

Because Marpa claims to be a practical parser,
it is important to emphasize
that all grammar rewrites in this \doc{}
are done in such a way that the semantics
of the original grammar can be reconstructed
simply and efficiently at evaluation time.
As one example,
when a rewrite involves the introduction of a new rule,
semantics for the new rule can be defined to pass its operands
up to a parent rule as a list.
Where needed, the original semantics
of a pre-existing parent rule can
be ``wrapped'' to reassemble these lists
into operands that are properly formed
for that original semantics.

As implemented,
the Marpa parser allows users to associate
semantics with an original grammar
that has none of the restrictions imposed
on grammars in this \doc{}.
The user of a Marpa parser
may specify any context-free grammar,
including one with properly nullable symbols,
empty rules, etc.
The user specifies his semantics in terms
of this original, ``free-form'', grammar.
Marpa implements the rewrites,
and performs evaluation,
in such a way as to keep them invisible to
the user.
From the user's point of view,
the ``free-form'' of his grammar is the
one being used for the parse,
and the one to which
his semantics are applied.
it is important that this rewrite can be performed efficiently
and, for evaluation purposes, reversed efficiently,
while preserving the semantics attached to the external grammar.
Further, the ability to pause, examine and even alter a grammar
during recognition is an important one in Marpa.
Pauses are defined in terms of the external grammar,
the application will want to examine the grammar in terms of
the external grammar,
and changes to the grammar will be expressed in terms of the
rules, symbols and dotted rules of the external grammar.

When a parse tree (or forest) is produced from the Earley sets,
each instance of an external rule corresponds to
a subtree of the parse tree.
The LHS of the external rule will correspond to the root of this subtree,
and its RHS symbols to the leaves of the subtree.
When it is necessary to scan the RHS of an external rule,
this can be implemented simply and efficiently as
a traversal of the subtree.

The location mapping, referred to above, allows the application
to request that recognition
be paused in terms of external locations.
The rewrite of this section
can be performed in both directions simply and quickly.
The only problematic case arises, not from Marpa's grammar rewriting,
but from its use of Leo memoization during recognition.
Leo memoization is discussed below.

\section{Earley's algorithm}
\label{s:earley}