[GGE] TreeSpider refactor
There's a way to compensate for Perl 6's lack of continuations, by
building closures out of regex parts.  However, I'm not yet ready to
take such a route.  Instead, this refactor causes the crawling through
the tree to be handled explicitly.

The crawling consists of two things.  First, there's the usual tree
traversal.  Commonly, this is carried out using the call stack: calling
things to descend and returning to ascend back up.  Since the result of
parent nodes depends on the result of child nodes, the match results
can even be passed back up via the return mechanism.
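
As a minimal sketch of that conventional approach (written in Python
for illustration; the Lit and Seq classes and the match function are
hypothetical, not GGE's Exp classes), descending is a call and
ascending is a return:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Lit:
    """Hypothetical node: matches a literal string."""
    text: str


@dataclass
class Seq:
    """Hypothetical node: matches its children one after another."""
    children: Tuple


def match(node, target: str, pos: int) -> Optional[int]:
    """Return the position after a successful match, or None on failure."""
    if isinstance(node, Lit):
        end = pos + len(node.text)
        return end if target[pos:end] == node.text else None
    if isinstance(node, Seq):
        for child in node.children:
            pos = match(child, target, pos)   # descend by calling
            if pos is None:
                return None                   # ascend by returning
        return pos
    raise TypeError(f"unknown node: {node!r}")
```

Note that nothing here can re-enter a child that has already returned,
which is the limitation the next paragraph is about.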

Unfortunately, due to the need for backtracking, straightforward tree
traversal isn't enough.  For certain 'backtracking-enabled' nodes which
leave savepoints upon matching, the same nodes must be prepared to
re-activate in the -same- state as they were when the savepoint was
registered.  By 'state' is meant the smallest possible set of values
involved in the regex match such that when the savepoint is activated,
it is -as- -if- the whole history after the registering of the
savepoint hadn't happened.  (This parallels the way saving/restoring
works in most computer games.)

When possible, savepoints are implemented using continuations.  These
store away the lexical pad of the currently executing routine, the
stack of routines waiting to resume along with their lexical pads,
and possibly things stored outside the tree-traversal routines,
such as the current matching position in the target string.  Invoking
the continuation restores all those things automatically, making
continuations a desirable tool for implementing backtracking.

STD.pm's gimme5 uses closures to emulate the continuation-like
behavior.  Closures have the same ability to capture state as
continuations, and can be made to implement backtracking.  But not
while also doing tree traversal using the conventional call stack.
(Closures don't close over the call stack.)  Which means one has to
do the traversal in some other way, for example by using more
closures.  Replacing the call stack with closures is called
'continuation-passing style'.  Instead of popping the stack when
returning, one invokes a continuation pointing back to the calling
routine.
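
To make the idea concrete, here is a small sketch of backtracking in
continuation-passing style, in Python.  It is not gimme5's code; the
names lit, seq, star and matches are made up for this example.  Each
matcher takes a closure k standing for "the rest of the match" and
calls it instead of returning a result to its caller:

```python
def lit(text):
    """Match a literal, then hand the new position to the continuation."""
    def m(target, pos, k):
        end = pos + len(text)
        return target[pos:end] == text and k(end)
    return m


def seq(a, b):
    """Match a, and make 'match b, then k' the continuation of a."""
    def m(target, pos, k):
        return a(target, pos, lambda mid: b(target, mid, k))
    return m


def star(inner):
    """Greedy repetition: try one more round first, else move on."""
    def m(target, pos, k):
        def again(mid):
            # Refuse zero-width rounds so the recursion terminates.
            return mid > pos and m(target, mid, k)
        return inner(target, pos, again) or k(pos)
    return m


def matches(regex, target):
    """True if regex matches all of target."""
    return bool(regex(target, 0, lambda end: end == len(target)))
```

With regex = seq(star(lit("a")), lit("ab")), the target "aaab" only
matches after star gives one "a" back.  The giving back is simply the
"or k(pos)" branch being taken once the greedier attempt has returned
a false value.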

The refactor does not take this route.  Rather, it handles all tree
traversal and registering/invoking of savepoints explicitly.  In a way,
this implementation is as unsugared as it gets, since the control flow
is all explicit, abstracted away neither by continuations nor by
closures.

The tree spider keeps track of the currently executing regex node, the
target string and the current match position, a lexical pad for the
current node as well as a stack of pads for the path of nodes up to the
root, and all currently active savepoints, registered at regex nodes in
the tree, each containing its own stack of lexical pads from which
execution at a given node can be restored.

Under this model, regex nodes do their thing and return control to
the tree spider as soon as they can.  In doing this, each node returns
a partial match result, one of three possible values to guide the tree
spider on:

DESCEND    Node needs result from its children; call downwards
MATCH      Node has completed successfully; return upwards
FAIL       Node has completed unsuccessfully; return upwards

In addition to these three return values, there's also a fourth value
which is never returned by a node, but generated by the tree spider
when it activates a registered savepoint:

BACKTRACK  Savepoint activated; restore state and continue

Corresponding to these four states are four methods for each regex node
class: .start, .succeeded, .failed and .backtracked, respectively.
Since these methods are called on different nodes than the ones that
send the return values, the tree spider can be seen as a mediator of
signals between nodes, and the regex match as a whole can be seen as an
intricate negotiation between many nodes, with the tree spider making
sure that the negotiation messages are delivered.
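
The signal passing can be sketched roughly as follows, in Python.
This is an illustration under simplifying assumptions, not the code
in TreeSpider.pm: the classes Lit, Seq and Alt and the function spider
are hypothetical, every handler returns a (signal, child, position)
triple, composite nodes are assumed to have at least one child, and
savepoints and BACKTRACK are left out entirely.

```python
DESCEND, MATCH, FAIL = "DESCEND", "MATCH", "FAIL"


class Lit:
    def __init__(self, text):
        self.text = text

    def start(self, pad, target, pos):
        end = pos + len(self.text)
        if target[pos:end] == self.text:
            return MATCH, None, end
        return FAIL, None, pos


class Seq:
    def __init__(self, *children):
        self.children = children

    def start(self, pad, target, pos):
        pad["i"] = 0
        return DESCEND, self.children[0], pos

    def succeeded(self, pad, target, pos):
        pad["i"] += 1
        if pad["i"] < len(self.children):
            return DESCEND, self.children[pad["i"]], pos
        return MATCH, None, pos

    def failed(self, pad, target, pos):
        return FAIL, None, pos


class Alt:
    def __init__(self, *children):
        self.children = children

    def start(self, pad, target, pos):
        pad["i"], pad["from"] = 0, pos
        return DESCEND, self.children[0], pos

    def succeeded(self, pad, target, pos):
        return MATCH, None, pos

    def failed(self, pad, target, pos):
        # A child failed: retry the next alternative from where we began.
        pad["i"] += 1
        if pad["i"] < len(self.children):
            return DESCEND, self.children[pad["i"]], pad["from"]
        return FAIL, None, pos


def spider(root, target, pos=0):
    """Return the end position of a match, or None."""
    stack = []                    # (node, pad) for each waiting ancestor
    node, pad = root, {}
    signal, child, pos = node.start(pad, target, pos)
    while True:
        if signal == DESCEND:
            stack.append((node, pad))
            node, pad = child, {}
            signal, child, pos = node.start(pad, target, pos)
        elif not stack:           # the root itself has finished
            return pos if signal == MATCH else None
        else:                     # deliver MATCH or FAIL to the parent
            node, pad = stack.pop()
            handler = node.succeeded if signal == MATCH else node.failed
            signal, child, pos = handler(pad, target, pos)
```

Without savepoints this sketch cannot re-enter an Alt that has already
reported MATCH, so Seq(Alt(Lit("a"), Lit("ab")), Lit("c")) fails on
"abc".  That gap is what the BACKTRACK signal exists to fill.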

A savepoint is registered each time a regex node doing the Backtracking
role returns MATCH.  The tree spider takes the savepoint and places it
not on the node itself, but on the closest ancestor that also does
Backtracking.  This is required to make savepoints trigger when they
should; when failures propagate up the tree, backtracking savepoints
need to be activated as late as possible, to allow for all nodes along
the way to turn a FAIL into a MATCH (as, for example, greedy
quantifiers may do).  The requirement of such an ancestor doing
Backtracking creates the need for a special-cased regex node, called
Regex, which functions as a kind of last-resort savepoint vessel for
those Backtracking nodes which would otherwise not have an ancestor
doing Backtracking.
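
The placement rule alone can be sketched like this, in Python.  The
Frame class and the register_savepoint function are hypothetical
names for this example, and a savepoint is represented by an opaque
value; how the refactor itself stores savepoints may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One ancestor of the node that just matched (hypothetical)."""
    name: str
    backtracking: bool            # does this node do Backtracking?
    savepoints: list = field(default_factory=list)


def register_savepoint(stack, savepoint):
    """Attach savepoint to the closest ancestor doing Backtracking.

    stack lists the ancestors of the matching node, root first.
    Returns the frame that received the savepoint.
    """
    for frame in reversed(stack):
        if frame.backtracking:
            frame.savepoints.append(savepoint)
            return frame
    # Cannot happen if the root is a Regex node doing Backtracking.
    raise ValueError("no ancestor does Backtracking")
```

For a stack of Regex, Alt, Seq (root first) the savepoint lands on the
Alt frame.  For a stack of just Regex, Seq it lands on Regex, which is
the last-resort case described above.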

There's an easy set of low-hanging fruit to be picked in making the
crawling through the tree less explicit.  If a node deep in the tree
succeeds (or fails), it doesn't have to send success (or failure)
signals up several levels to the first node that might do something
interesting with this information.  Instead, the first node which
will do something interesting with a MATCH (or FAIL) signal can be
computed before the traversal begins.
Similarly, the registering, activation and de-registering of
backtracking savepoints can all be replaced by once-only generated
static information in the form of pointers in the tree.  This all
corresponds to making the tree spider prescient, so that it won't
have to check all the boring intermediate nodes to know where to
go.  Again, this type of optimization has not yet been attempted.
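
One way such a precomputation might look, sketched in Python.  The
Node class, its interesting flag and the link function are all
hypothetical, and a single flag stands in for what would really be
separate answers for MATCH and for FAIL:

```python
class Node:
    """Hypothetical tree node with a precomputed signal target."""

    def __init__(self, name, interesting=False, children=()):
        self.name = name
        self.interesting = interesting   # reacts to a child's result?
        self.children = list(children)
        self.on_result = None            # filled in by link()


def link(node, target=None):
    """Store on every node its nearest interesting ancestor.

    Run once before matching; afterwards a result can jump straight
    to node.on_result instead of visiting each level in between.
    """
    node.on_result = target
    below = node if node.interesting else target
    for child in node.children:
        link(child, below)
```

After link(root), a leaf several levels down points directly at the
ancestor that cares about its result, and on_result is None for nodes
whose result goes straight to the top.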

There is a water-tight division between Exp objects and the tree
spider: regex nodes *can't*, and shouldn't, cause backtracking to be
initiated.  Should you find that a certain class of regex node
really can't be implemented without doing explicit backtracking, my
best advice to you is to take a deep breath.  That helped me.
Carl Masak committed Dec 25, 2009
1 parent 51cab0f commit d71f92d
Showing 8 changed files with 308 additions and 216 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
 *.pir
+Makefile
19 changes: 0 additions & 19 deletions Makefile

This file was deleted.

4 changes: 2 additions & 2 deletions Makefile.in
@@ -2,8 +2,8 @@ PERL6=<PERL6>
 RAKUDO_DIR=<RAKUDO_DIR>
 PERL6LIB='<PERL6LIB>:$(RAKUDO_DIR)'
 
-SOURCES=lib/GGE/Match.pm lib/GGE/Exp.pm lib/GGE/Traversal.pm \
-lib/GGE/Cursor.pm lib/GGE/OPTable.pm lib/GGE/Perl6Regex.pm lib/GGE.pm
+SOURCES=lib/GGE/Match.pm lib/GGE/Exp.pm lib/GGE/TreeSpider.pm \
+lib/GGE/OPTable.pm lib/GGE/Perl6Regex.pm lib/GGE.pm
 
 PIRS=$(SOURCES:.pm=.pir)
 
109 changes: 0 additions & 109 deletions lib/GGE/Cursor.pm

This file was deleted.
