-
Notifications
You must be signed in to change notification settings - Fork 4
Description
Currently, the prefilter in OptimizationInfo
only works when there is a single character that starts all possible matches.
For example, it can optimize cases like AROA|AIDA
now by finding A
in the input string before executing the regex engines, but it misses cases like (foo|bar)baz
which would allow us to find f
or b
first.
Proposed implementation
1. Extend OptimizationInfo
structure
The current OptimizationInfo
structure only stores a single firstChar
. We need to extend it to handle multiple possible starting characters:
structure OptimizationInfo where
/--
When `firstChars` is `.some chars`, all matching strings must start with one of the characters in `chars`.
The regex engine will optimize the search to skip positions that do not start with any character in `chars`.
-/
firstChars : Option (Array Char)
deriving Repr, Inhabited, DecidableEq, Lean.ToExpr
2. Extend Expr.firstChar
to Expr.firstChars
Modify the firstChar
function in regex/Regex/Data/Expr/Optimization.lean
to return an array of possible starting characters:
def firstChars (expr : Expr) : Option (Array Char) := sorry
-- TODO: Implement this!
-- We don't want to make this return a large array (which would slow down the prefilter),
-- so let's limit the array to 8 characters.
-- The structure will be similar to the current `firstChar` function, but there will be some differences:
-- * For `.alternate e₁ e₂`, we can take the union of the first characters of `e₁` and `e₂` instead of comparing the two.
-- * For `.classes cs`, let's limit the extraction to `Class.single` to avoid large arrays.
3. Update findStart
method
The current findStart
method uses Iterator.find
with a single character predicate. We need to extend it to handle multiple characters.
4. Update correctness proofs
The curr_of_captures_of_firstChar_some
theorem in correctness/RegexCorrectness/Data/Expr/Optimization.lean
proves that the starting character (it.curr
) equals to expr.firstChar
, when returns .some
. You need to update the theorem to check if it.curr
is in expr.firstChars
.
The remaining fix in correctness/RegexCorrectness/Regex/OptimizationInfo.lean
should be easy.