RuntimeHelpers.matches() recompiles regex Pattern on every evaluation #1038

@devin-ai-integration

Summary

RuntimeHelpers.matches() calls Pattern.compile(regexp) on every CEL evaluation, even when the regex pattern is a constant string literal. This is a significant performance bottleneck for libraries like protovalidate-java that evaluate the same regex patterns millions of times.

Problem

In RuntimeHelpers.java, the matches method recompiles the regex on every call:

public static boolean matches(String string, String regexp, CelOptions celOptions) {
    Pattern pattern = Pattern.compile(regexp);  // called every evaluation
    ...
}

In protovalidate, CEL expressions like input.matches('^0x[0-9a-f]{64}$') are compiled into programs that are cached and reused. But the regex pattern itself is recompiled from the string on every eval() call. This applies to both runtime paths (DefaultInterpreter and ProgramPlanner), since both funnel through RuntimeHelpers.matches().

Impact

Benchmarks against real-world proto validation patterns show 38-82% end-to-end improvement when patterns are cached:

| Pattern | Unpatched | Patched | Improvement |
|---|---|---|---|
| Hex hash ^0x[0-9a-f]{64}$ | 4,683 ns | 2,897 ns | 38% |
| Blockchain address (alternation pattern) | 13,868 ns | 2,455 ns | 82% |
| UTXO pattern ^[0-9a-f]{64}:[0-9]+$ | 16,038 ns | 8,403 ns | 48% |

The cost of Pattern.compile() scales with pattern complexity — alternation patterns like ^(0x[0-9a-fA-F]{40}|[1-9A-HJ-NP-Za-km-z]{26,64}|bc1[0-9a-zA-Z]{25,87})$ cost ~11μs per compile.
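
To get a rough feel for the compile cost described above, a small timing harness along these lines can help. It is a sketch, not the benchmark behind the table: the class and method names are made up for illustration, it assumes java.util.regex.Pattern (the regex engine cel-java actually uses may differ), and a naive System.nanoTime() loop is far less reliable than a JMH benchmark.

```java
import java.util.regex.Pattern;

// Hypothetical harness; names are illustrative and not part of cel-java.
public class CompileCostSketch {
    // Alternation pattern quoted in the issue text.
    static final String ALTERNATION =
        "^(0x[0-9a-fA-F]{40}|[1-9A-HJ-NP-Za-km-z]{26,64}|bc1[0-9a-zA-Z]{25,87})$";

    // Sample input that matches the first alternative (0x followed by 40 hex digits).
    static final String SAMPLE = "0x52908400098527886E0F7030069857D2E4169EE7";

    // Written after each loop so the JIT cannot discard the measured work.
    static volatile int sink;

    /** Average nanoseconds per Pattern.compile call. */
    static double averageCompileNanos(String regexp, int iterations) {
        int acc = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            acc += Pattern.compile(regexp).pattern().length();
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / iterations;
    }

    /** Average nanoseconds per match against an already compiled pattern. */
    static double averageMatchNanos(Pattern compiled, String input, int iterations) {
        int acc = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (compiled.matcher(input).find()) {
                acc++;
            }
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        Pattern compiled = Pattern.compile(ALTERNATION);
        // Warm-up passes so the measured passes run mostly JIT-compiled code.
        averageCompileNanos(ALTERNATION, 20_000);
        averageMatchNanos(compiled, SAMPLE, 20_000);
        System.out.printf("compile: %.0f ns/op%n", averageCompileNanos(ALTERNATION, 100_000));
        System.out.printf("match:   %.0f ns/op%n", averageMatchNanos(compiled, SAMPLE, 100_000));
    }
}
```

The absolute numbers depend on the JVM, hardware, and regex engine, so they are only useful for comparing compile time against match time on the same machine.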

Context: cel-go already solves this

cel-go has MatchesRegexOptimization in interpreter/optimizations.go which precompiles regex patterns at program creation time. protovalidate-go enables it via cel.OptOptimize. cel-java has no equivalent mechanism.

Suggested fix

A minimal fix — add a ConcurrentHashMap cache to RuntimeHelpers.matches():

@SuppressWarnings("Immutable")
private static final ConcurrentHashMap<String, Pattern> COMPILED_PATTERNS =
    new ConcurrentHashMap<>();

public static boolean matches(String string, String regexp, CelOptions celOptions) {
    Pattern pattern = COMPILED_PATTERNS.computeIfAbsent(regexp, Pattern::compile);
    // ... rest unchanged
}

This is a 3-line change. The cache is unbounded, which should be acceptable for typical usage:

  • In practice, regex patterns in CEL come from compiled proto definitions, so the set of distinct patterns is finite and small
  • If bounded eviction is desired (for example, when the pattern argument is not a constant), a Caffeine or LRU cache could replace ConcurrentHashMap
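
If a bounded cache is preferred, one dependency-free option is an access-ordered LinkedHashMap wrapped in Collections.synchronizedMap. The sketch below is only an illustration: the class name and the maxEntries parameter are hypothetical, it uses java.util.regex.Pattern as a stand-in for whatever Pattern type RuntimeHelpers uses, and a single lock is coarser than ConcurrentHashMap or Caffeine under heavy contention.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical bounded LRU cache for compiled patterns; not part of cel-java.
public class BoundedPatternCacheSketch {
    private final Map<String, Pattern> cache;

    BoundedPatternCacheSketch(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once the cap is exceeded.
        this.cache = Collections.synchronizedMap(
            new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    /** Returns the cached Pattern, compiling it on first use. */
    Pattern get(String regexp) {
        // The synchronized wrapper holds its lock for the whole computeIfAbsent call.
        return cache.computeIfAbsent(regexp, Pattern::compile);
    }

    /** Does not count as an access, so it leaves the LRU order unchanged. */
    boolean contains(String regexp) {
        return cache.containsKey(regexp);
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        BoundedPatternCacheSketch patterns = new BoundedPatternCacheSketch(2);
        patterns.get("a+");
        patterns.get("b+");
        patterns.get("a+"); // touch "a+" so "b+" becomes least recently used
        patterns.get("c+"); // exceeds the cap and evicts "b+"
        System.out.println(patterns.contains("b+"));
    }
}
```

An invalid pattern makes Pattern.compile throw from inside computeIfAbsent; the exception propagates to the caller and no entry is stored, so failure behaviour stays the same as in the uncached version.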

A more sophisticated approach (matching cel-go's MatchesRegexOptimization) would precompile at program creation time in ProgramPlanner.planCall(), but that only helps the planner runtime path and doesn't fix DefaultInterpreter.
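
The idea behind precompiling at program creation time can be sketched independently of cel-java's internals. In the example below, planConstantMatches is a hypothetical stand-in for whatever a planner would emit when the pattern argument is a string constant. It is not ProgramPlanner's real API, and it assumes java.util.regex.Pattern with partial-match (find) semantics.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical illustration of plan-time precompilation; not cel-java code.
public class PrecompileSketch {
    /**
     * "Plan" step: runs once per program. The constant pattern is compiled here
     * and captured by the returned predicate, so evaluation never recompiles it.
     */
    static Predicate<String> planConstantMatches(String constantRegexp) {
        final Pattern compiled = Pattern.compile(constantRegexp);
        // Pattern is immutable and thread-safe; a fresh Matcher is created per call.
        return input -> compiled.matcher(input).find();
    }

    public static void main(String[] args) {
        Predicate<String> isHexHash = planConstantMatches("^0x[0-9a-f]{4}$");
        System.out.println(isHexHash.test("0xbeef"));
        System.out.println(isHexHash.test("0xBEEF"));
    }
}
```

A side effect of this design is that an invalid constant pattern fails at plan time rather than on first evaluation. Non-constant patterns would still need a runtime path such as the cache above.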
