Skip to content

Extend prefilter to support multiple starting characters #118

@pandaman64

Description

@pandaman64

Currently, the prefilter in OptimizationInfo only works when there is a single character that starts all possible matches.
For example, it can optimize cases like AROA|AIDA now by finding A in the input string before executing the regex engines, but it misses cases like (foo|bar)baz which would allow us to find f or b first.

Proposed implementation

1. Extend OptimizationInfo structure

The current OptimizationInfo structure only stores a single firstChar. We need to extend it to handle multiple possible starting characters:

structure OptimizationInfo where
  /--
  When `firstChars` is `.some chars`, all matching strings must start with one of the characters in `chars`.
  
  The regex engine will optimize the search to skip positions that do not start with any character in `chars`.
  -/
  firstChars : Option (Array Char)
deriving Repr, Inhabited, DecidableEq, Lean.ToExpr

2. Extend Expr.firstChar to Expr.firstChars

Modify the firstChar function in regex/Regex/Data/Expr/Optimization.lean to return an array of possible starting characters:

def firstChars (expr : Expr) : Option (Array Char) := sorry
  -- TODO: Implement this!
  -- We don't want to make this return a large array (which would slow down the prefilter),
  -- so let's limit the array to 8 characters.
  -- The structure will be similar to the current `firstChar` function, but there will be some differences:
  -- * For `.alternate e₁ e₂`, we can take the union of the first characters of `e₁` and `e₂` instead of comparing the two.
  -- * For `.classes cs`, let's limit the extraction to `Class.single` to avoid large arrays.

3. Update findStart method

The current findStart method uses Iterator.find with a single character predicate. We need to extend it to handle multiple characters.

4. Update correctness proofs

The curr_of_captures_of_firstChar_some theorem in correctness/RegexCorrectness/Data/Expr/Optimization.lean proves that the starting character (it.curr) equals to expr.firstChar, when returns .some. You need to update the theorem to check if it.curr is in expr.firstChars.

The remaining fix in correctness/RegexCorrectness/Regex/OptimizationInfo.lean should be easy.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions