There's a way to compensate for Perl 6's lack of continuations, by building closures out of regex parts. However, I'm not yet ready to take such a route. Instead, this refactor causes the crawling through the tree to be handled explicitly.

The crawling consists of two things. First, there's the usual tree traversal. Commonly, this is carried out using the call stack: calling things to descend and returning to ascend back up. Since the result of parent nodes depends on the result of child nodes, the match results can even be passed back up via the return mechanism.

Unfortunately, due to the need for backtracking, straightforward tree traversal isn't enough. Certain 'backtracking-enabled' nodes leave savepoints upon matching, and these nodes must be prepared to re-activate in the -same- state as they were in when the savepoint was registered. By 'state' is meant the smallest possible set of values involved in the regex match such that when the savepoint is activated, it is -as- -if- the whole history after the registering of the savepoint hadn't happened. (This parallels the way saving/restoring works in most computer games.)

When possible, savepoints are implemented using continuations. These store away the lexical pad of the currently executing routine and the stack of routines waiting to resume with their lexical pads, along with things possibly stored outside the tree-traversal routines, such as the current matching position in the target string. Invoking the continuation restores all those things automatically, making continuations a desirable tool for implementing backtracking.

STD.pm's gimme5 uses closures to emulate the continuation-like behavior. Closures have the same ability to capture state as continuations, and can be made to implement backtracking, but not while also doing tree traversal using the conventional call stack. (Closures don't close over the call stack.)
That means the traversal has to be done in some other way, for example by using more closures. Replacing the call stack with closures is called 'continuation-passing style'. Instead of popping the stack when returning, one invokes a continuation pointing back to the calling routine.

The refactor does not take this route. Rather, it handles all tree traversal and registering/invoking of savepoints explicitly. In a way, this implementation is as unsugared as it gets, since the control flow is all explicit, abstracted away neither by continuations nor by closures.

The tree spider keeps track of:

    - the currently executing regex node,
    - the target string and the current match position,
    - a lexical pad for the current node, as well as a stack of pads for the path of nodes up to the root, and
    - all currently active savepoints, registered at regex nodes in the tree, each containing its own stack of lexical pads from which execution at a given node can be restored.

Under this model, regex nodes do their thing and return control to the tree spider as soon as they can. In doing so, a node returns a partial match result, one of three possible values that tell the tree spider how to proceed:

    DESCEND    Node needs result from its children; call downwards
    MATCH      Node has completed successfully; return upwards
    FAIL       Node has completed unsuccessfully; return upwards

In addition to these three return values, there's also a fourth value which is never returned by a node, but generated by the tree spider when it activates a registered savepoint:

    BACKTRACK  Savepoint activated; restore state and continue

Corresponding to these four values are four methods for each regex node class: .start, .succeeded, .failed and .backtracked, respectively.
Since these methods are called on different nodes than the ones that send the return values, the tree spider can be seen as a mediator of signals between nodes, and the regex match as a whole can be seen as an intricate negotiation between many nodes, with the tree spider making sure that the negotiation messages are delivered.

A savepoint is registered each time a regex node doing the Backtracking role returns MATCH. The tree spider takes the savepoint and places it not on the node itself, but on the closest ancestor that also does Backtracking. This is required to make savepoints trigger when they should: when failures propagate up the tree, backtracking savepoints need to be activated as late as possible, to allow all nodes along the way a chance to turn a FAIL into a MATCH (as, for example, greedy quantifiers may do).

The requirement of such an ancestor doing Backtracking creates the need for a special-cased regex node, called Regex, which functions as a kind of last-resort savepoint vessel for those Backtracking nodes which would otherwise not have an ancestor doing Backtracking.

There's an easy set of low-hanging fruit to be picked in making the crawling through the tree less explicit. If a node deep in the tree succeeds (or fails), it doesn't have to send success (or failure) signals up several levels to the first node that might do something interesting with this information. Instead, the first node which will do something interesting with a MATCH (or FAIL) signal can be computed before the traversal begins. Similarly, the registering, activation and de-registering of backtracking savepoints can all be replaced by once-only generated static information in the form of pointers in the tree. This all corresponds to making the tree spider prescient, so that it won't have to check all the boring intermediate nodes to know where to go. Again, this type of optimization has not yet been attempted.

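To make the signal protocol and the savepoint placement described above more concrete, here is a minimal sketch in Python (not the Perl 6 code of this commit). The four signal names, the four method names, the Backtracking role and the Regex vessel node come from the text above. Everything else is an assumption made for illustration: the node classes Lit, Concat and Alt, the Frame class, the pad keys, and the 'held' list. In particular, the rule that a FAIL or BACKTRACK arriving at a frame first activates the newest savepoint held there is a guess that happens to work for these three node types. It does not cover quantifiers and may well differ from what the real tree spider does.

```python
from enum import Enum, auto

class Signal(Enum):
    DESCEND = auto()    # node needs a result from a child; call downwards
    MATCH = auto()      # node completed successfully; return upwards
    FAIL = auto()       # node completed unsuccessfully; return upwards
    BACKTRACK = auto()  # generated by the spider only, never by a node

# Which method the spider calls on the receiving node for each signal.
HANDLER = {Signal.DESCEND: "start", Signal.MATCH: "succeeded",
           Signal.FAIL: "failed", Signal.BACKTRACK: "backtracked"}

class Lit:
    """Hypothetical leaf node: matches a fixed string at the current position."""
    backtracking = False
    def __init__(self, text):
        self.text = text
    def start(self, pad, spider):
        if spider.target.startswith(self.text, spider.pos):
            spider.pos += len(self.text)
            return Signal.MATCH
        return Signal.FAIL

class Concat:
    """Hypothetical sequence node: children must match one after another."""
    backtracking = False
    def __init__(self, *kids):
        self.kids = kids
    def start(self, pad, spider):
        pad["i"] = 0
        return self.step(pad)
    def succeeded(self, pad, spider):
        pad["i"] += 1
        return self.step(pad)
    def failed(self, pad, spider):
        return Signal.FAIL
    def step(self, pad):
        if pad["i"] == len(self.kids):
            return Signal.MATCH
        pad["child"] = self.kids[pad["i"]]
        return Signal.DESCEND

class Regex(Concat):
    """Root node; last-resort vessel for savepoints, as in the text above."""
    backtracking = True

class Alt:
    """Hypothetical alternation node; stands in for the Backtracking role."""
    backtracking = True
    def __init__(self, *kids):
        self.kids = kids
    def start(self, pad, spider):
        pad["i"], pad["origin"] = 0, spider.pos
        return self.attempt(pad, spider)
    def succeeded(self, pad, spider):
        return Signal.MATCH
    def failed(self, pad, spider):
        pad["i"] += 1
        return self.attempt(pad, spider)
    backtracked = failed  # on reactivation, move on to the next alternative
    def attempt(self, pad, spider):
        if pad["i"] == len(self.kids):
            return Signal.FAIL
        spider.pos = pad["origin"]
        pad["child"] = self.kids[pad["i"]]
        return Signal.DESCEND

class Frame:
    """One node on the path from the root: the node, its pad, held savepoints."""
    def __init__(self, node, pad=None, held=None):
        self.node, self.pad, self.held = node, dict(pad or {}), list(held or [])
    def copy(self):
        return Frame(self.node, self.pad, self.held)

class Spider:
    def __init__(self, target):
        self.target, self.pos = target, 0

    def run(self, regex):
        stack, signal = [Frame(regex)], Signal.DESCEND
        while True:
            top = stack[-1]
            if signal in (Signal.FAIL, Signal.BACKTRACK) and top.held:
                # Assumed rule: the newest savepoint held here is activated first.
                frames, self.pos = top.held.pop()
                stack, signal = [f.copy() for f in frames], Signal.BACKTRACK
                continue
            result = getattr(top.node, HANDLER[signal])(top.pad, self)
            if result is Signal.DESCEND:
                stack.append(Frame(top.pad["child"]))
                signal = Signal.DESCEND
                continue
            if len(stack) == 1:  # the root has finished, one way or the other
                return self.pos if result is Signal.MATCH else None
            if result is Signal.MATCH and top.node.backtracking:
                # Snapshot the pad stack and position, then place the savepoint
                # on the closest ancestor that is also backtracking-enabled.
                saved = ([f.copy() for f in stack], self.pos)
                holder = next(f for f in reversed(stack[:-1]) if f.node.backtracking)
                holder.held.append(saved)
            stack.pop()
            signal = result  # MATCH or FAIL is delivered to the parent
```

With these definitions, Spider("abc").run(Regex(Concat(Alt(Lit("a"), Lit("ab")), Lit("c")))) first matches 'a', fails on 'c', lets the FAIL travel up to the Regex node, and only there reactivates the Alt to try 'ab'. The run method returns the end position of the match, or None when no savepoints are left to try.
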
There is a water-tight division between Exp objects and the tree spider: regex nodes *can't*, and shouldn't, cause backtracking to be initiated. Should you find that a certain class of regex node really can't be implemented without doing explicit backtracking, my best advice to you is to take a deep breath. That helped me.