`<regex>`: Optimize searches for patterns with initial branching by muellerj2 · Pull Request #6191 · microsoft/STL

muellerj2 · 2026-03-29T18:09:10Z

Towards #5468.

This PR extends the skip heuristic to regexes with initial branching, i.e., patterns that begin with a disjunction or with a loop that may repeat zero times.
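For illustration, the two benchmark patterns below that show a large speedup are of this kind: `bibe|soda` starts with a disjunction and `(id )?bibe` starts with an optional group. The following is a minimal sketch using only the public `std::regex_search` interface; the helper name `first_match_offset` is made up for this example and is not part of the PR.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the offset of the first match of `pattern` in `text`,
// or -1 if there is no match.
inline std::ptrdiff_t first_match_offset(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern); // ECMAScript grammar by default
    std::smatch match;
    if (!std::regex_search(text, match, re)) {
        return -1;
    }
    return match.position(0);
}
```

In both patterns the match can begin with more than one character (`b` or `s`, respectively `i` or `b`), so a skip heuristic cannot simply look for a single fixed prefix.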

The skip heuristic included an optimization attempt for disjunctions in the past. But this attempt was based on a depth-first search of the NFA. #5452 showed that this could result in quadratic worst-case complexity for some regexes and inputs, which is unacceptable if the input would have been processed in linear time without this optimization attempt. It was therefore removed in #5457.

This quadratic slowdown can be avoided by searching the NFA breadth-first instead. But this search strategy is typically slower and hostile to vectorization.

This PR tries to strike a balance between depth-first and breadth-first search: the input string is split into search windows of a maximum length. We search for a potential skip position within a window and only move on to the next window if we couldn't find a potential match. This means that the heuristic performs depth-first search within each search window (essentially resurrecting the code removed in #5457), but behaves like breadth-first search between search windows. As a result, it only reads a constant number of input characters (for a fixed regex and window size) beyond the skip position it actually determines, which keeps the worst-case time complexity linear, while each window can still benefit from vectorization and the faster depth-first processing.
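A minimal sketch of the windowing idea, not the actual `<regex>` implementation: it assumes that every initial branch can be reduced to a single starting character, and the names `windowed_skip` and `window_size` as well as the value 64 are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical window length, chosen only for this sketch.
constexpr std::size_t window_size = 64;

// Returns the offset of the first character at or after `start` that equals any character
// in `branch_starts`, or text.size() if there is none.
inline std::size_t windowed_skip(
    const std::string& text, const std::string& branch_starts, std::size_t start = 0) {
    assert(start <= text.size());
    const char* const base = text.data();
    std::size_t window_begin = start;
    while (window_begin < text.size()) {
        const std::size_t window_end = window_begin + std::min(window_size, text.size() - window_begin);
        std::size_t best = window_end; // earliest candidate found in this window so far
        for (const char c : branch_starts) {
            // "Depth-first" part: scan the window for one branch at a time,
            // but never look past the best candidate found by an earlier branch.
            const char* const hit = std::find(base + window_begin, base + best, c);
            best = static_cast<std::size_t>(hit - base);
        }
        if (best != window_end) {
            return best; // candidate found: later windows are never touched
        }
        window_begin = window_end; // "breadth-first" part: all branches advance together
    }
    return text.size();
}
```

In this sketch, each call reads at most `branch_starts.size() * window_size` characters past the position it returns, regardless of how long `text` is. Scanning the whole remaining input for each branch in turn would not have this bound.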

The current choice of the window size is a bit arbitrary: it shouldn't be too small, so that vectorization can still give some benefit, but it also shouldn't be too large, to limit the amount of unnecessary work. I tried to strike a bit of a balance when I chose the value, but it isn't based on any measurements on real-world data.

The boundaries of the search windows require some special attention: the skip heuristic for some NFA nodes can search for whole character sequences, not just single characters. If a matching string straddled the boundary between two search windows, a naive implementation would not find it, because the match is not fully contained in the first search window and the search in the following window would only start in the middle of the match. For this reason, the heuristic for `_N_str` nodes had to be adapted so that it can read beyond the search window by as much as necessary. (This doesn't break the runtime analysis because the number of characters read beyond the window is determined by the regex, and the analysis assumes that the regex is fixed.) No adjustment had to be made to the heuristic for `_N_class` nodes because it has never considered the value of the `_Last` iterator in the `_Skip` function.
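The boundary problem can be sketched as follows. This is an illustration under simplifying assumptions (a `std::string` input and a plain literal), not the code for `_N_str` nodes; the function name `find_starting_in_window` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns the offset of the first occurrence of `needle` that *starts* in
// [window_begin, window_end), or window_end if there is none. The comparison may read up to
// needle.size() - 1 characters past window_end, but never past the end of `text`.
inline std::size_t find_starting_in_window(const std::string& text, const std::string& needle,
    std::size_t window_begin, std::size_t window_end) {
    assert(!needle.empty());
    assert(window_begin <= window_end && window_end <= text.size());
    // Extend the searched range so that a match straddling the boundary stays visible.
    const std::size_t search_end = std::min(text.size(), window_end + (needle.size() - 1));
    const char* const base = text.data();
    const char* const hit =
        std::search(base + window_begin, base + search_end, needle.begin(), needle.end());
    const std::size_t pos = static_cast<std::size_t>(hit - base);
    // On failure std::search returns the end of the searched range, which is >= window_end.
    return pos < window_end ? pos : window_end;
}
```

With `text = "xxbibexx"` and windows of length 4, the occurrence of `bibe` at offset 2 crosses the boundary at offset 4. A search restricted to `[0, 4)` would miss it, and the second window begins with `bexx`, so neither window alone contains the match. The extended range lets the first window report offset 2.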

I considered implementing all this using `std::distance()` and `std::advance()`, but instead opted for a new `_Advance_at_most()` function because the former implementation would have resulted in quadratic worst-case complexity for bidirectional iterators.
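A bounded advance of this kind could look roughly like the sketch below. It is an assumption about the general shape, not the actual `_Advance_at_most()`; the names `advance_at_most` and `count_windows` are hypothetical. The point is that the window end is found by stepping at most `count` times, whereas calling `std::distance()` on the remaining input for every window takes time linear in the remaining length for bidirectional iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical helper: moves `it` forward by up to `count` steps without passing `last`
// and returns the number of steps actually taken.
template <class It>
std::size_t advance_at_most(It& it, const It last, std::size_t count) {
    std::size_t steps = 0;
    for (; steps < count && it != last; ++steps) {
        ++it;
    }
    return steps;
}

// Usage sketch: walks a bidirectional-only container window by window in a single pass,
// so the total work is linear in the input length.
inline std::size_t count_windows(const std::list<char>& input, std::size_t window) {
    assert(window > 0);
    std::size_t windows = 0;
    for (auto it = input.begin(); it != input.end(); ++windows) {
        advance_at_most(it, input.end(), window);
    }
    return windows;
}
```

A production version would presumably take a constant-time path for random-access iterators; the loop above is kept deliberately simple.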

Benchmark

The adjusted benchmark code is included in #6189. Relevant changes are highlighted.

| benchmark | before [ns] | after [ns] | speedup |
|---|---:|---:|---:|
| bm_lorem_search/"^bibe"/2 | 50.2232 | 51.5625 | 0.97 |
| bm_lorem_search/"^bibe"/3 | 50.2232 | 49.6689 | 1.01 |
| bm_lorem_search/"^bibe"/4 | 50 | 50.2232 | 1.00 |
| bm_lorem_search/"bibe"/2 | 2887.83 | 2964.57 | 0.97 |
| bm_lorem_search/"bibe"/3 | 5625 | 5580.36 | 1.01 |
| bm_lorem_search/"bibe"/4 | 11718.8 | 11997.8 | 0.98 |
| bm_lorem_search/"bibe".collate/2 | 3013.39 | 2915.74 | 1.03 |
| bm_lorem_search/"bibe".collate/3 | 5580.36 | 5719.87 | 0.98 |
| bm_lorem_search/"bibe".collate/4 | 10742.2 | 10986.3 | 0.98 |
| bm_lorem_search/"(bibe)"/2 | 3529.57 | 3930.66 | 0.90 |
| bm_lorem_search/"(bibe)"/3 | 7114.96 | 7149.83 | 1.00 |
| bm_lorem_search/"(bibe)"/4 | 13811.3 | 13811.5 | 1.00 |
| bm_lorem_search/"(bibe)+"/2 | 4603.8 | 5156.25 | 0.89 |
| bm_lorem_search/"(bibe)+"/3 | 8998.29 | 9835.34 | 0.91 |
| bm_lorem_search/"(bibe)+"/4 | 17578.3 | 17996.8 | 0.98 |
| bm_lorem_search/"(?:bibe)+"/2 | 4185.27 | 4289.91 | 0.98 |
| bm_lorem_search/"(?:bibe)+"/3 | 7847.38 | 8370.5 | 0.94 |
| bm_lorem_search/"(?:bibe)+"/4 | 15694.7 | 15485.6 | 1.01 |
| bm_lorem_search/R"(\bbibe)"/2 | 64174.1 | 69754.5 | 0.92 |
| bm_lorem_search/R"(\bbibe)"/3 | 131138 | 136021 | 0.96 |
| bm_lorem_search/R"(\bbibe)"/4 | 256696 | 256696 | 1.00 |
| bm_lorem_search/R"(\Bibe)"/2 | 144385 | 142299 | 1.01 |
| bm_lorem_search/R"(\Bibe)"/3 | 288771 | 341797 | 0.84 |
| bm_lorem_search/R"(\Bibe)"/4 | 610352 | 625000 | 0.98 |
| bm_lorem_search/R"((?=....)bibe)"/2 | 3989.95 | 5022.33 | 0.79 |
| bm_lorem_search/R"((?=....)bibe)"/3 | 8021.76 | 8196.15 | 0.98 |
| bm_lorem_search/R"((?=....)bibe)"/4 | 15346 | 16043.5 | 0.96 |
| bm_lorem_search/R"((?=bibe)....)"/2 | 3759.77 | 3934.14 | 0.96 |
| bm_lorem_search/R"((?=bibe)....)"/3 | 7324.22 | 7324.22 | 1.00 |
| bm_lorem_search/R"((?=bibe)....)"/4 | 13950.9 | 14125.2 | 0.99 |
| bm_lorem_search/R"((?!lorem)bibe)"/2 | 3449.35 | 3683.04 | 0.94 |
| bm_lorem_search/R"((?!lorem)bibe)"/3 | 6835.94 | 7149.83 | 0.96 |
| bm_lorem_search/R"((?!lorem)bibe)"/4 | 13253.3 | 14439.1 | 0.92 |
| bm_lorem_search/"bibe\|soda"/2 | 429688 | 8893.69 | 48.31 |
| bm_lorem_search/"bibe\|soda"/3 | 836680 | 18833.7 | 44.42 |
| bm_lorem_search/"bibe\|soda"/4 | 1727580 | 34527.8 | 50.03 |
| bm_lorem_search/"(id )?bibe"/2 | 486592 | 8370.54 | 58.13 |
| bm_lorem_search/"(id )?bibe"/3 | 1004020 | 16741.2 | 59.97 |
| bm_lorem_search/"(id )?bibe"/4 | 1968830 | 32889.7 | 59.86 |
| bm_lorem_search/".bibe"/2 | 190438 | 199507 | 0.95 |
| bm_lorem_search/".bibe"/3 | 374930 | 359869 | 1.04 |
| bm_lorem_search/".bibe"/4 | 784738 | 784738 | 1.00 |

StephanTLavavej · 2026-03-31T15:04:06Z

I'm mirroring this to the MSVC-internal repo. Please notify me if any further changes are pushed, otherwise no action is required.

StephanTLavavej · 2026-03-31T20:47:41Z

Resolved trivial adjacent-add conflicts with #6189 in VSO_0000000_regex_use.

StephanTLavavej · 2026-03-31T22:06:16Z

🌳 🕵️ 🚀

`<regex>`: Optimize searches for patterns with initial branching

23a822e

muellerj2 requested a review from a team as a code owner March 29, 2026 18:09

github-project-automation Bot added this to STL Code Reviews Mar 29, 2026

github-project-automation Bot moved this to Initial Review in STL Code Reviews Mar 29, 2026

StephanTLavavej added the performance (Must go faster) and regex (meow is a substring of homeowner) labels Mar 29, 2026

StephanTLavavej self-assigned this Mar 29, 2026

StephanTLavavej added 7 commits March 30, 2026 09:37

Rewrap comments.

39e30a9

Scalars should be plain constexpr.

9303954

Extract nested _Skip to an _Intermediate result.

4f41a21

Include <cstddef> for size_t.

baa31e5

count => prefix_size, input_size for clarity and to avoid shadowing.

7648c4e

Use list::const_iterator.

5870830

Rework list loop to avoid allocating a zillion nodes.

834b18a

StephanTLavavej reviewed Mar 30, 2026

StephanTLavavej approved these changes Mar 30, 2026

StephanTLavavej removed their assignment Mar 30, 2026

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Mar 30, 2026

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Mar 31, 2026

Merge branch 'main' into regex-optimize-searches-with-initial-branches

0138fef

StephanTLavavej approved these changes Mar 31, 2026

StephanTLavavej merged commit 4a9a7db into microsoft:main Mar 31, 2026
49 checks passed

github-project-automation Bot moved this from Merging to Done in STL Code Reviews Mar 31, 2026

StephanTLavavej merged 9 commits into microsoft:main from muellerj2:regex-optimize-searches-with-initial-branches
