<regex>: Optimize searches for patterns with initial branching#6191

Merged
StephanTLavavej merged 9 commits into microsoft:main from
muellerj2:regex-optimize-searches-with-initial-branches
Mar 31, 2026
@muellerj2 muellerj2 commented Mar 29, 2026

Towards #5468.

This PR extends the skip heuristic to regexes with initial branching, i.e., regexes that start with a disjunction or with a loop that may be repeated zero times.

The skip heuristic included an optimization attempt for disjunctions in the past. But this attempt was based on a depth-first search of the NFA. #5452 showed that this could result in quadratic worst-case complexity for some regexes and inputs, which is unacceptable if the input would have been processed in linear time without this optimization attempt. It was therefore removed in #5457.

This quadratic slowdown can be avoided by searching the NFA breadth-first instead. But this search strategy is typically slower and hostile to vectorization.
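To make the trade-off concrete, here is a toy model of the two search orders. It is only a sketch and not the code in `<regex>`: branches are reduced to their first characters, and `SkipResult`, `depth_first_skip`, `breadth_first_skip`, and `total_reads` are hypothetical names chosen for this illustration. `total_reads` assumes that the match attempt fails at every candidate position, so the skip search is restarted one character later each time.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct SkipResult {
    std::size_t position;   // first candidate position, or text.size() if none
    std::size_t chars_read; // number of character comparisons performed
};

// Depth-first: scan the remaining input for one branch before trying the next.
SkipResult depth_first_skip(std::string_view text, std::size_t start, std::string_view first_chars) {
    assert(start <= text.size());
    SkipResult result{text.size(), 0};
    for (const char c : first_chars) {
        // Later branches only need to look in front of the best candidate so far.
        for (std::size_t pos = start; pos < result.position; ++pos) {
            ++result.chars_read;
            if (text[pos] == c) {
                result.position = pos;
                break;
            }
        }
    }
    return result;
}

// Breadth-first: test every branch at one position before moving to the next.
SkipResult breadth_first_skip(std::string_view text, std::size_t start, std::string_view first_chars) {
    assert(start <= text.size());
    SkipResult result{text.size(), 0};
    for (std::size_t pos = start; pos < text.size(); ++pos) {
        for (const char c : first_chars) {
            ++result.chars_read;
            if (text[pos] == c) {
                result.position = pos;
                return result;
            }
        }
    }
    return result;
}

// Total comparisons if the match attempt fails at every candidate position.
std::size_t total_reads(std::string_view text, std::string_view first_chars, bool depth_first) {
    std::size_t reads = 0;
    std::size_t start = 0;
    while (start < text.size()) {
        const SkipResult r = depth_first ? depth_first_skip(text, start, first_chars)
                                         : breadth_first_skip(text, start, first_chars);
        reads += r.chars_read;
        if (r.position == text.size()) {
            break; // no further candidate
        }
        start = r.position + 1; // pretend the match attempt at r.position failed
    }
    return reads;
}
```

For an input of eight `a` characters and the branch characters `"ba"`, the depth-first variant performs 44 comparisons in this model because it scans to the end for `b` on every restart, while the breadth-first variant performs 16. In general that is n(n+1)/2 + n versus 2n comparisons for n characters.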

This PR tries to strike a balance between depth-first and breadth-first search: The input string is split into search windows of bounded length. We search for a potential skip position within a window and only move on to the next window if the current one contains no potential match. The heuristic thus performs depth-first search within each search window (essentially resurrecting the code removed in #5457), but behaves like breadth-first search across search windows. As a result, it reads only a constant number of characters beyond the skip position it actually determines, which yields linear worst-case time complexity, while still benefiting from vectorization and the faster processing of depth-first search within each window.
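The windowing idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in `<regex>`: each branch is represented only by its first character, and `find_skip_position` and its `window_size` parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the first position at which any character of `first_chars` occurs,
// or text.size() if there is none. The input is processed window by window.
std::size_t find_skip_position(std::string_view text, std::string_view first_chars,
                               std::size_t window_size) {
    assert(window_size > 0);
    const char* const base = text.data();
    for (std::size_t begin = 0; begin < text.size(); begin += window_size) {
        const std::size_t end = begin + std::min(window_size, text.size() - begin);
        std::size_t best = end;
        // Depth-first within the window: each branch scans the window on its own,
        // which is the part a vectorized std::find can accelerate.
        for (const char c : first_chars) {
            const char* const it = std::find(base + begin, base + best, c);
            best = static_cast<std::size_t>(it - base);
        }
        // Breadth-first across windows: stop as soon as one window has a candidate.
        if (best != end) {
            return best;
        }
    }
    return text.size();
}
```

With a window size of 8, `find_skip_position("lorem ipsum dolor sit amet bibe", "bs", 8)` returns 8, the position of the `s` in `ipsum`. No branch reads further than the end of the window that contains the result, so the work beyond the returned position is bounded by the window size.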

The current choice of the window size is somewhat arbitrary: It shouldn't be too small so that vectorization can still provide some benefit, but it also shouldn't be too large so that the amount of unnecessary work stays limited. I tried to strike a balance when choosing the value, but it isn't based on any measurements on real-world data.

The boundaries of the search windows require some special attention: The skip heuristic for some NFA nodes can search for whole character sequences, not just single characters. If a matching string were to straddle a boundary between search windows, a naive implementation would not find it because the match is not contained in the first search window and the search in the following window would only start in the middle of the match. For this reason, the heuristic for _N_str nodes had to be adapted such that it can read beyond the search window by as much as necessary. (This doesn't break the runtime analysis because the number of characters read beyond the window is determined by the regex and the analysis assumes that the regex is fixed.) No adjustment had to be made to the heuristic for _N_class nodes because it has never considered the value of the _Last iterator in the _Skip function.
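The boundary problem can be reproduced with a small model. The sketch below is an assumption-laden illustration rather than the `_N_str` code: `find_in_window` and `windowed_find` are hypothetical helpers, and the `read_beyond` flag switches between the naive variant and one that may compare up to `needle.size() - 1` characters past the end of the window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Looks for `needle` at start positions in [begin, end). Returns `end` if none is found.
std::size_t find_in_window(std::string_view text, std::size_t begin, std::size_t end,
                           std::string_view needle, bool read_beyond) {
    assert(begin <= end && end <= text.size() && !needle.empty());
    // A match must start inside the window, but with read_beyond it may finish after it.
    const std::size_t limit = read_beyond ? std::min(text.size(), end + needle.size() - 1) : end;
    const std::size_t pos = text.substr(0, limit).find(needle, begin);
    return pos == std::string_view::npos ? end : pos;
}

// Searches window by window. Returns text.size() if `needle` is not found.
std::size_t windowed_find(std::string_view text, std::string_view needle,
                          std::size_t window_size, bool read_beyond) {
    assert(window_size > 0);
    for (std::size_t begin = 0; begin < text.size(); begin += window_size) {
        const std::size_t end = begin + std::min(window_size, text.size() - begin);
        const std::size_t pos = find_in_window(text, begin, end, needle, read_beyond);
        if (pos != end) {
            return pos;
        }
    }
    return text.size();
}
```

For the text `"xxxxxxbibexx"` and a window size of 8, `bibe` starts at position 6 and ends in the second window. The naive variant (`read_beyond == false`) returns 12, i.e. "not found", while the variant that reads beyond the window returns 6.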

I considered implementing all this using std::distance() and std::advance(), but opted for a new _Advance_at_most() function instead because the former approach would have resulted in quadratic worst-case complexity for bidirectional iterators.
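A bounded advance of this kind might look roughly like the sketch below. The name `advance_at_most`, its signature, and the `count_windows` demo are assumptions made for illustration; they are not taken from the PR's `_Advance_at_most()`. The point is that a bidirectional iterator is moved by at most `count` steps, whereas calling `std::distance(first, last)` for every window would walk the whole remaining input each time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <type_traits>

// Advances `first` by `count` steps, but never past `last`.
template <class It>
It advance_at_most(It first, const It last, typename std::iterator_traits<It>::difference_type count) {
    assert(count >= 0);
    using Category = typename std::iterator_traits<It>::iterator_category;
    if constexpr (std::is_base_of_v<std::random_access_iterator_tag, Category>) {
        // Random access: the remaining length is available in constant time.
        return first + (std::min)(count, last - first);
    } else {
        // Weaker iterators: step one by one, at most `count` times.
        for (; count > 0 && first != last; --count) {
            ++first;
        }
        return first;
    }
}

// Demo: number of windows of at most `window` elements in a bidirectional sequence.
std::size_t count_windows(const std::list<char>& input, std::ptrdiff_t window) {
    assert(window > 0);
    std::size_t windows = 0;
    for (auto it = input.begin(); it != input.end(); it = advance_at_most(it, input.end(), window)) {
        ++windows;
    }
    return windows;
}
```

In `count_windows`, every element is stepped over exactly once, so the traversal stays linear even for `std::list`. For a list of 10 elements and a window of 4, the function reports 3 windows.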

Benchmark

The adjusted benchmark code is included in #6189. Relevant changes are highlighted.

| benchmark | before [ns] | after [ns] | speedup |
|---|---:|---:|---:|
| `bm_lorem_search/"^bibe"/2` | 50.2232 | 51.5625 | 0.97 |
| `bm_lorem_search/"^bibe"/3` | 50.2232 | 49.6689 | 1.01 |
| `bm_lorem_search/"^bibe"/4` | 50 | 50.2232 | 1.00 |
| `bm_lorem_search/"bibe"/2` | 2887.83 | 2964.57 | 0.97 |
| `bm_lorem_search/"bibe"/3` | 5625 | 5580.36 | 1.01 |
| `bm_lorem_search/"bibe"/4` | 11718.8 | 11997.8 | 0.98 |
| `bm_lorem_search/"bibe".collate/2` | 3013.39 | 2915.74 | 1.03 |
| `bm_lorem_search/"bibe".collate/3` | 5580.36 | 5719.87 | 0.98 |
| `bm_lorem_search/"bibe".collate/4` | 10742.2 | 10986.3 | 0.98 |
| `bm_lorem_search/"(bibe)"/2` | 3529.57 | 3930.66 | 0.90 |
| `bm_lorem_search/"(bibe)"/3` | 7114.96 | 7149.83 | 1.00 |
| `bm_lorem_search/"(bibe)"/4` | 13811.3 | 13811.5 | 1.00 |
| `bm_lorem_search/"(bibe)+"/2` | 4603.8 | 5156.25 | 0.89 |
| `bm_lorem_search/"(bibe)+"/3` | 8998.29 | 9835.34 | 0.91 |
| `bm_lorem_search/"(bibe)+"/4` | 17578.3 | 17996.8 | 0.98 |
| `bm_lorem_search/"(?:bibe)+"/2` | 4185.27 | 4289.91 | 0.98 |
| `bm_lorem_search/"(?:bibe)+"/3` | 7847.38 | 8370.5 | 0.94 |
| `bm_lorem_search/"(?:bibe)+"/4` | 15694.7 | 15485.6 | 1.01 |
| `bm_lorem_search/R"(\bbibe)"/2` | 64174.1 | 69754.5 | 0.92 |
| `bm_lorem_search/R"(\bbibe)"/3` | 131138 | 136021 | 0.96 |
| `bm_lorem_search/R"(\bbibe)"/4` | 256696 | 256696 | 1.00 |
| `bm_lorem_search/R"(\Bibe)"/2` | 144385 | 142299 | 1.01 |
| `bm_lorem_search/R"(\Bibe)"/3` | 288771 | 341797 | 0.84 |
| `bm_lorem_search/R"(\Bibe)"/4` | 610352 | 625000 | 0.98 |
| `bm_lorem_search/R"((?=....)bibe)"/2` | 3989.95 | 5022.33 | 0.79 |
| `bm_lorem_search/R"((?=....)bibe)"/3` | 8021.76 | 8196.15 | 0.98 |
| `bm_lorem_search/R"((?=....)bibe)"/4` | 15346 | 16043.5 | 0.96 |
| `bm_lorem_search/R"((?=bibe)....)"/2` | 3759.77 | 3934.14 | 0.96 |
| `bm_lorem_search/R"((?=bibe)....)"/3` | 7324.22 | 7324.22 | 1.00 |
| `bm_lorem_search/R"((?=bibe)....)"/4` | 13950.9 | 14125.2 | 0.99 |
| `bm_lorem_search/R"((?!lorem)bibe)"/2` | 3449.35 | 3683.04 | 0.94 |
| `bm_lorem_search/R"((?!lorem)bibe)"/3` | 6835.94 | 7149.83 | 0.96 |
| `bm_lorem_search/R"((?!lorem)bibe)"/4` | 13253.3 | 14439.1 | 0.92 |
| `bm_lorem_search/"bibe\|soda"/2` | 429688 | 8893.69 | 48.31 |
| `bm_lorem_search/"bibe\|soda"/3` | 836680 | 18833.7 | 44.42 |
| `bm_lorem_search/"bibe\|soda"/4` | 1727580 | 34527.8 | 50.03 |
| `bm_lorem_search/"(id )?bibe"/2` | 486592 | 8370.54 | 58.13 |
| `bm_lorem_search/"(id )?bibe"/3` | 1004020 | 16741.2 | 59.97 |
| `bm_lorem_search/"(id )?bibe"/4` | 1968830 | 32889.7 | 59.86 |
| `bm_lorem_search/".bibe"/2` | 190438 | 199507 | 0.95 |
| `bm_lorem_search/".bibe"/3` | 374930 | 359869 | 1.04 |
| `bm_lorem_search/".bibe"/4` | 784738 | 784738 | 1.00 |

@muellerj2 muellerj2 requested a review from a team as a code owner March 29, 2026 18:09
@github-project-automation github-project-automation Bot moved this to Initial Review in STL Code Reviews Mar 29, 2026
@StephanTLavavej StephanTLavavej added the performance (Must go faster) and regex (meow is a substring of homeowner) labels Mar 29, 2026
@StephanTLavavej StephanTLavavej self-assigned this Mar 29, 2026
@StephanTLavavej StephanTLavavej removed their assignment Mar 30, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Mar 30, 2026
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo. Please notify me if any further changes are pushed, otherwise no action is required.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Mar 31, 2026
@StephanTLavavej
Member

Resolved trivial adjacent-add conflicts with #6189 in VSO_0000000_regex_use.

@StephanTLavavej StephanTLavavej merged commit 4a9a7db into microsoft:main Mar 31, 2026
49 checks passed
@github-project-automation github-project-automation Bot moved this from Merging to Done in STL Code Reviews Mar 31, 2026
@StephanTLavavej
Member

🌳 🕵️ 🚀
