
<regex>: Fix reentrant loops containing backreferences #6055

Merged
StephanTLavavej merged 1 commit into microsoft:main from
muellerj2:regex-fix-backreferences-in-reentrant-loops
Feb 2, 2026

Conversation

@muellerj2
Contributor

I introduced a subtle bug in #6022: I missed that some loops are marked as branchless but can still match strings of different lengths, so the value of _Loop_length in the loop state can change between entries. This applies to loops containing backreferences, because the length of the captured string may have changed by the time the loop is reentered.

The problem is that _Loop_length is used during unwinding of branchless/simple loops, but its value is not restored during backtracking.
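To make the situation concrete, here is a minimal sketch of a pattern whose inner loop contains a backreference and is reentered with a capture of a different length. The pattern and the helper name `matches_reentrant_backref_loop` are illustrative assumptions and are not taken from this PR or its tests; the sketch only shows the shape of the input, not a reproduction of the bug.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not part of the PR): full-match `input` against a
// pattern whose inner loop (?:\1)* is entered once per outer iteration.
bool matches_reentrant_backref_loop(const std::string& input) {
    // Outer loop:  (?: ... )*
    // Group 1:     (a+)      captured afresh in every outer iteration
    // Inner loop:  (?:\1)*   repeats whatever group 1 currently holds,
    //                        so one repetition may consume 1 char on the
    //                        first entry and 2 chars on the next entry
    static const std::regex re(R"((?:(a+)x(?:\1)*y)*)");
    return std::regex_match(input, re);
}
```

For the input `axayaaxaaaay`, group 1 is `a` in the first outer iteration and `aa` in the second, so a single repetition of the inner loop has a different length each time the loop is entered.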

I see two ways to fix this:

  • The matcher restores _Loop_length like _Loop_frame_idx and _Loop_idx during backtracking. But there is no unused memory to store this value on the stack, so we would have to add a new frame to the stack or increase the size of each frame.
  • The parser does not mark loops containing backreferences as branchless if they are reentrant (even though they are branchless, strictly speaking).

Given that this only affects relatively rare regexes containing subpatterns like (prefix(\1)*suffix)*, the first option seems like a waste of memory that would pessimize more common regexes. For this reason, I went for the second option.

_N_rep, _N_if, _N_assert and _N_class (if collating elements are contained) are other node types that can cause a loop to match strings or capturing groups of different lengths, but they are already handled. The remaining node types always match strings of the same length.
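The rule chosen here can be sketched as a toy decision function. Everything below is hypothetical: `NodeKind` and `can_mark_branchless` are invented names, and the sketch assumes that "already handled" simply means such nodes keep the loop from being flagged, which may not match how the real parser deals with them.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified node kinds. They only loosely mirror the node
// types named above (_N_rep, _N_if, _N_assert, _N_class, backreferences).
enum class NodeKind { Char, Class, ClassWithCollating, Rep, If, Assert, Backref };

// Assumption: a loop may keep the branchless flag only if every node in its
// body matches a string of the same length each time the loop is entered.
bool can_mark_branchless(const std::vector<NodeKind>& body, bool loop_is_reentrant) {
    for (NodeKind kind : body) {
        switch (kind) {
        case NodeKind::Rep:
        case NodeKind::If:
        case NodeKind::Assert:
        case NodeKind::ClassWithCollating:
            return false; // length may vary; treated here as blocking the flag
        case NodeKind::Backref:
            // The capture can differ between entries of a reentrant loop.
            if (loop_is_reentrant) {
                return false;
            }
            break;
        case NodeKind::Char:
        case NodeKind::Class:
            break; // always matches a string of the same length
        }
    }
    return true;
}
```

In this toy model, a loop over a backreference keeps the flag when it is entered only once and loses it when it sits inside another loop.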

While debugging this, I was a bit annoyed that the debugger showed the _Fl_rep_branchless flag under the name _Fl_class_cl_all_bits, so I moved the flag to another free bit for the _N_rep node type in this PR. (We could also reuse mask 0x0004, since the old parser never set this bit on nodes of type _N_rep, but this is probably unnecessarily subtle while we still have plenty of bits available.)
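As a rough illustration of why a shared bit is confusing to read in a debugger, the sketch below counts how many flag names map to one mask before and after moving a flag. The names and mask values are made up for the example and are not the real values used in <regex>.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical flag table entry: a display name and its bit mask.
struct NamedFlag {
    const char* name;
    unsigned mask;
};

// Before: two flags for different node types share one bit (values invented).
constexpr NamedFlag flags_before[] = {
    {"class_all_bits", 0x100},
    {"rep_branchless", 0x100},
};

// After: the loop flag has been moved to a bit that no other name uses.
constexpr NamedFlag flags_after[] = {
    {"class_all_bits", 0x100},
    {"rep_branchless", 0x200},
};

// Counts how many names a tool could show for a given mask value.
template <std::size_t N>
int names_for_mask(const NamedFlag (&table)[N], unsigned mask) {
    int count = 0;
    for (const NamedFlag& flag : table) {
        if (flag.mask == mask) {
            ++count;
        }
    }
    return count;
}
```

With the shared bit, a tool that sees only the numeric value has two candidate names and cannot tell which one applies without knowing the node type.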

@muellerj2 muellerj2 requested a review from a team as a code owner January 28, 2026 21:14
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Jan 28, 2026
@StephanTLavavej StephanTLavavej self-assigned this Jan 28, 2026
@StephanTLavavej StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels Jan 28, 2026
@StephanTLavavej StephanTLavavej removed their assignment Jan 29, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Jan 29, 2026
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Jan 30, 2026
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit b379eb3 into microsoft:main Feb 2, 2026
45 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Feb 2, 2026
@StephanTLavavej
Member

Thanks for breaking and then fixing this! 🏚️ 🛠️ 🏡

Labels

bug (Something isn't working), regex (meow is a substring of homeowner)
