<regex>: Optimize searches for patterns with initial branching #6191
Merged
StephanTLavavej merged 9 commits into microsoft:main on Mar 31, 2026
StephanTLavavej
approved these changes
Mar 30, 2026
Member
I'm mirroring this to the MSVC-internal repo. Please notify me if any further changes are pushed, otherwise no action is required.
StephanTLavavej
approved these changes
Mar 31, 2026
Member
Resolved trivial adjacent-add conflicts with #6189 in
Member
🌳 🕵️ 🚀
Towards #5468.
This PR extends the skip heuristic to regexes with initial branching, i.e., regexes that start with a disjunction or with a loop that may be repeated zero times.
The skip heuristic included an optimization attempt for disjunctions in the past. But this attempt was based on a depth-first search of the NFA. #5452 showed that this could result in quadratic worst-case complexity for some regexes and inputs, which is unacceptable if the input would have been processed in linear time without this optimization attempt. It was therefore removed in #5457.
This quadratic slowdown can be avoided by searching the NFA breadth-first instead. But this search strategy is typically slower and hostile to vectorization.
This PR tries to strike a balance between depth-first and breadth-first search: The input string is split into search windows of a maximum length. We search for a potential skip position within a window and only move on to the next window if we couldn't find one. This means that the heuristic performs depth-first search within each search window (essentially resurrecting the code removed in #5457), but behaves like breadth-first search between search windows. This ensures that it reads only a constant number of characters beyond the skip position it finally determines, thus achieving linear worst-case time complexity, while still taking advantage of vectorization and the faster processing of depth-first search within each search window.
The current choice of the window size is a bit arbitrary: It shouldn't be too low so that vectorization can still give some benefit, but it also shouldn't be too large to limit the amount of unnecessary work. I tried to strike a bit of a balance when I chose the value, but it isn't based on any measurements on real-world data.
The boundaries of the search windows require some special attention: The skip heuristic for some NFA nodes can search for whole character sequences, not just single characters. If a matching string were to sit exactly on a boundary between search windows, a naive implementation would not find it because the match is not contained in the first search window and the search in the following window would only start in the middle of the match. For this reason, the heuristic for `_N_str` nodes had to be adapted such that it can read beyond the search window by as much as necessary. (This doesn't break the runtime analysis because the number of characters read beyond the window is determined by the regex and the analysis assumes that the regex is fixed.) No adjustment had to be made to the heuristic for `_N_class` nodes because it has never considered the value of the `_Last` iterator in the `_Skip` function.

I considered implementing all this using `std::distance()` and `std::advance()`, but instead opted for a new `_Advance_at_most()` function because the former implementation would have resulted in cubic worst-case complexity for bidirectional iterators.

Benchmark
The adjusted benchmark code is included in #6189. Relevant changes are highlighted.