scala based text pattern scanner
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.

Scala based text pattern scanner

This implements a text pattern scanner.

The basis is taken from "Regular Expression Derivatives Reexamined" -Scott Owens, John Reppy,Aaron Turon, In Journal of Functional Programming, March 2009, vol 19, issue 02, pp. 173-190. By definition, the derivative of a set of strings S with respect to a symbol a is the set of strings generated by stripping the leading a from the strings in S that start with a. For regular sets of strings, i.e., sets defined by regular expressions (REs), the derivative is also a regular set. In a 1964 paper, Janusz Brzozowski presented an elegant method for directly constructing a recognizer from a regular expression based on regular-expression derivatives (Brzozowski,1964).

The notion of a derivative applies to any language. Intuitively, the derivative of a language L � with respect to a symbol a contained in L � is the language that includes only those suffixes of strings with a leading symbol a in L.

This package computes the entire Derivativation map for a regular expression, and then for the vector of regular expressions. The maps are reduced by applying the notion of weak equivalence as defined for specific expressions. A state machine is then constructed over the vector of derivation maps. This allows for minimal runtime overhead and O(1) complexity with regard to the number of patterns

When the scanner is run, it looks for strings that match any of the input patterns. If it finds more than one match, it takes the one matching the longest string text. If it finds two or more matches of the same length, the rule listed first in the input is chosen.

The matching is greedy, meaning the longest match will always be returned. While scanning, all states representing a match are placed in the match buffer. When a non-matching state is reached, the scanner will backtrack through the match buffer, from longest match to shortest. The longest match is then returned.

A scanner instance is constructed with a ScannerCtxt object. A ScannerCtxtis built with the ScannerCtxt compiler, which can compile a context from a pattern definition file defined in XML or JSON. The compilation step takes a long time and isn't needed unless there is a change to the pattern list. In normal operation, a context should be compiled and new Scanner instances created with the existing context until the context needs to be changed.

The scanner itself is also "pausable" This means that partial files can be scanned progressively until complete. The scanner state is updated with tokens serially and stores state specific to the input data being scanned.