@muellerj2
Contributor

To recap from #5889, simple loops have the following properties:

  1. They are non-reentrant.
  2. They are branchless.
  3. Each repetition matches strings of the same length.
  4. Each capturing group matched in a repetition has the same relative position to the beginning and end of the strings matched by the repetition.

The matcher has always used properties 1 and 2 of such loops. It also took slight advantage of property 3: As a corollary, all repetitions match the empty string iff the first repetition matches the empty string, so the matcher only checked whether the first repetition is empty.

But properties 3 and 4 can be exploited further: If we know that each successful repetition shifts the matched string and the capturing groups by the same distance, we do not have to explicitly store these positions on the stack for unwinding greedy matching but can restore the positions while backtracking by shifting the positions in the other direction by the same amount. As for non-greedy matching, failing to match the next repetition will immediately result in backtracking beyond the first repetition, so we actually do not even have to know the length of the strings matched by each repetition, but only have to allow backtracking to proceed for the first stack frames that were pushed while matching the first repetition.
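The greedy half of this idea can be sketched with a toy matcher. This is only an illustration under simplifying assumptions, not the STL's code: `match_fixed_loop`, `unit` and `tail` are hypothetical names, the "loop body" is a fixed string, and the match is anchored at both ends. Because every repetition has the same length, backtracking needs only the current position and a stride rather than a stack of saved positions.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical toy matcher: `unit` repeated greedily (zero or more times),
// followed by `tail`, must cover the whole input.
inline bool match_fixed_loop(std::string_view input, std::string_view unit, std::string_view tail) {
    if (unit.empty()) {
        return input == tail; // empty repetitions never advance the position
    }

    // Greedy phase: consume as many repetitions as possible.
    std::size_t pos = 0;
    while (input.substr(pos, unit.size()) == unit) {
        pos += unit.size();
    }

    // Backtracking phase: undo one repetition at a time by shifting the
    // position back by the fixed repetition length.
    for (;;) {
        if (input.substr(pos) == tail) {
            return true;
        }
        if (pos == 0) {
            return false; // backtracked beyond the first repetition
        }
        pos -= unit.size();
    }
}
```

For example, `match_fixed_loop("ababab", "ab", "ab")` first consumes all three repetitions, fails to match the tail, and then succeeds after shifting back by one repetition.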

At least for greedy matching, though, we can't easily avoid pushing these stack frames while matching the next repetition, because they are still needed to restore the match state when backtracking from the last attempted match.

This PR makes the matcher pop the stack frames pushed while matching a repetition, without any further processing, from the second repetition on. Thus, the stack stops growing while the matcher processes the simple loop.
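The effect on stack height can be illustrated with a small simulation. It is a sketch under assumptions, not the matcher itself: `Frame`, `peak_stack_height` and `marker_idx` are made-up names, and each repetition is assumed to push a fixed number of frames. The frames of the first repetition and one marker frame stay on the stack; with `discard` set, the frames of every later repetition are dropped once that repetition has matched.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical frame type; the real matcher's frames carry more state.
struct Frame {
    int code;
    std::size_t rep;
};

// Returns the largest number of frames alive at any time while `reps`
// repetitions are matched, each pushing `frames_per_rep` frames.
inline std::size_t peak_stack_height(std::size_t reps, std::size_t frames_per_rep, bool discard) {
    std::vector<Frame> frames;
    std::size_t marker_idx = 0; // index of the frame that is kept around
    std::size_t peak = 0;

    for (std::size_t rep = 0; rep < reps; ++rep) {
        for (std::size_t i = 0; i < frames_per_rep; ++i) {
            frames.push_back(Frame{0, rep});
        }
        if (rep == 0) {
            // After the first repetition, push the marker frame.
            frames.push_back(Frame{1, rep});
            marker_idx = frames.size() - 1;
        }
        peak = std::max(peak, frames.size());

        if (rep > 0 && discard) {
            // Drop everything pushed by this repetition, keeping the marker.
            frames.resize(marker_idx + 1);
        }
    }
    return peak;
}
```

With three frames per repetition and 100 repetitions, the simulated peak is 7 frames with discarding and 301 without.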

This is probably the most intricate PR since the start of the non-recursive matcher PR, because it keeps tampering with this stack and does not just pop from the stack, but even repeatedly modifies two special stack frames that were pushed earlier while processing the loop's _N_rep and _N_end_rep nodes. However, I believe the performance benefit for simple loops is worth this complication (especially because I hope to extend this optimization to even more loops that are currently not marked simple).

The two special stack frames (that can be recognized in the code by the assignment of opcode _Do_nothing to them in some cases) are used as follows:

  • When the loop is entered, the first stack frame (at this point identified by index _Loop_vals[_Node->_Loop_number]._Loop_frame_idx) stores the initial position in the searched string at the start of the loop. If matching is greedy and the minimum number of repetitions is zero, the opcode is _Loop_simple_greedy_firstrep (to set up tail matching if even matching the first repetition fails), otherwise it is _Do_nothing. Backtracking during non-greedy matching on the other hand is handled by pushing an additional stack frame with opcode _Loop_simple_nongreedy.
  • After the first repetition matched successfully, a second stack frame is pushed. The first stack frame's index is now stored in the second stack frame's _Loop_frame_idx_sav member, while _Loop_vals[_Nr->_Loop_number]._Loop_frame_idx points to the second stack frame. The second stack frame's iterator is generally changed to point to the current position in the input string (except when its opcode is _Do_nothing). The frame's code is assigned as follows:
    • If the loop is matched greedily and the minimum number of repetitions is zero (i.e., if the first frame has opcode _Loop_simple_greedy_firstrep assigned), the code of the second frame is initialized to _Loop_simple_greedy_lastrep.
    • If the minimum number of repetitions has not been reached, the second frame has opcode _Do_nothing.
    • If the minimum number of repetitions is reached and the loop is matched greedily, the opcode is changed to _Loop_simple_greedy_lastrep.
    • If the minimum number of repetitions is reached and the loop is matched non-greedily, the opcode is changed to _Loop_simple_nongreedy (each time, because the contents might have been overwritten in-between).
      Meanwhile, the position in the first stack frame is used during backtracking from greedy matching to indicate when we have backtracked beyond the second repetition or the minimum number of repetitions. It is set to the start position of the first repetition or the repetition whose match resulted in reaching the minimum. If these positions in the input string are reached while backtracking successive repetitions of the loop, the backtracking logic for non-initial repetitions is stopped and the normal stack unwinding logic is allowed to proceed again.
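The opcode rules in the list above can be condensed into two small decision functions. This is only a restatement of the bullets under assumptions: the `Opcode` enumerators stand in for the matcher's internal codes, the function names are hypothetical, and the exact point at which the real handler decides that the minimum "has been reached" is passed in as a flag rather than modelled. The later switch to `_Loop_simple_greedy_intermediaterep` is not covered.

```cpp
// Stand-ins for the internal opcodes named in the description.
enum class Opcode { do_nothing, greedy_firstrep, greedy_lastrep, nongreedy };

// Code of the first special frame when the loop is entered.
constexpr Opcode first_frame_code(bool greedy, int min_reps) {
    return (greedy && min_reps == 0) ? Opcode::greedy_firstrep : Opcode::do_nothing;
}

// Code of the second special frame, given the first frame's code and
// whether the minimum number of repetitions has been reached.
constexpr Opcode second_frame_code(Opcode first_code, bool greedy, bool min_reached) {
    if (first_code == Opcode::greedy_firstrep) {
        return Opcode::greedy_lastrep; // greedy loop with a minimum of zero
    }
    if (!min_reached) {
        return Opcode::do_nothing;
    }
    return greedy ? Opcode::greedy_lastrep : Opcode::nongreedy;
}
```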

Because positions must be shifted back during greedy matching, the iterators of the input string must be decremented during backtracking (to either calculate the position where the previous repetition stopped or to move the start and end positions of capturing groups accordingly). The standard requires that provided iterators must be bidirectional, so the matcher must always be able to perform such decrements. But I think the matcher has only required forward iterators in practice before this PR, and that it will enter an endless loop after this PR if assertions are disabled (because std::advance() will just not shift the iterator by a negative distance if the iterator isn't bidirectional). For this reason, this PR also adds static assertions checking the bidirectional iterator requirement.
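A check of this kind could look roughly as follows. This is a sketch, not the assertion added by the PR: `is_bidirectional_v` and `step_back` are hypothetical names, and the check is based on the iterator category tag from `std::iterator_traits`.

```cpp
#include <forward_list>
#include <iterator>
#include <list>
#include <type_traits>

// Hypothetical trait: true if the iterator category is at least bidirectional.
template <class It>
inline constexpr bool is_bidirectional_v =
    std::is_base_of_v<std::bidirectional_iterator_tag, typename std::iterator_traits<It>::iterator_category>;

// Hypothetical helper: moves `it` back by `n` positions. The static
// assertion rejects forward-only iterators at compile time instead of
// letting a negative advance misbehave at run time.
template <class BidIt>
BidIt step_back(BidIt it, typename std::iterator_traits<BidIt>::difference_type n) {
    static_assert(is_bidirectional_v<BidIt>, "step_back requires a bidirectional iterator");
    std::advance(it, -n);
    return it;
}

static_assert(is_bidirectional_v<std::list<int>::iterator>);
static_assert(!is_bidirectional_v<std::forward_list<int>::iterator>);
```

Calling `step_back` with a `std::forward_list` iterator would fail to compile because of the static assertion.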

Individual changes

  • Add assertions checking the bidi iterator requirement to regex_match(), regex_search() and regex_replace().
  • Add a new member _Rep_length to (renamed) _Loop_vals_v3_t, which will hold the length of the first (and thus every) repetition for simple loops after matching the first repetition. The storage is now templated on the difference type of the input string iterator.
  • Split the opcode for greedy simple loops into three: One for backtracking from the first repetition, one for backtracking from the last attempted repetition (which is not the first one) and one for backtracking for any intermediate repetition. These cases have to be handled differently now:
    • Backtracking from the initial repetition keeps the same logic as before.
    • Backtracking from the last attempted repetition now has to additionally push another stack frame again for backtracking the prior repetition, if the prior repetition is not the first or minimum repetition.
    • Backtracking from the intermediate repetition additionally has to shift the start and end positions of capturing groups by the length of the loop.
  • In the handler of _N_rep, merge the _Loop_simple_greedy stack frame into the previously pushed stack frame with code _Do_nothing by changing the code of the former to _Loop_simple_greedy_firstrep.
  • In the handler of _N_end_rep for simple loops:
    • After the first repetition, determine the length of this (and thus every) repetition. Perform the original logic and exit the handler if this length is zero. If not, push a new special stack frame and store its position in _Sav._Loop_frame_idx (while storing the position of the first special one to _Frame._Loop_frame_idx in this second frame). The code of the second stack frame is initialized to _Loop_simple_greedy_lastrep if the first one's code is _Loop_simple_greedy_firstrep (i.e., backtracking from greedy matching might happen until the very first repetition), else it's set to _Do_nothing for now.
    • After any following repetition, pop all stack frames pushed while matching this repetition by setting _Frames_count to _Sav._Loop_frame_idx + 1, keeping only the second special frame around.
    • If greedy matching is performed:
      • Check if this repetition reached the minimum number of repetitions (which is the case if the second stack frame's code hasn't been changed from _Do_nothing yet). If so, set the iterator of the first special stack frame (with code _Do_nothing as well) to the start of the prior repetition and change the second stack frame's code to _Loop_simple_greedy_lastrep.
      • Update the second stack frame's iterator to the current position in the input string.
      • If the maximum hasn't been reached yet, increment the loop counter and set the next node pointer to the start node of the loop.
      • If the maximum has been reached, set up tail matching. If this is backtracked from, we are essentially handling the repetition before the last one (and thus have to shift the capturing groups), so the code in the second special stack frame has to be changed to _Loop_simple_greedy_intermediaterep.
    • If non-greedy matching is performed:
      • Set up the second special stack frame for non-greedy unwinding. (We know this stack frame must exist in the _Frames vector, so we can avoid calling _Push_frame(). However, we have to reset its members as necessary because its contents might have been overwritten.)
  • In the stack unwinding loop:
    • Make the handler of _Loop_simple_greedy the one for _Loop_simple_greedy_firstrep.
    • Copy the logic of _Loop_simple_greedy_firstrep for the handler of _Loop_simple_greedy_lastrep and add the logic to set up the stack frame for unwinding to the prior repetition in case of match failure.
    • Put the handler for _Loop_simple_greedy_intermediaterep before _Loop_simple_greedy_lastrep and add code to shift the start and end iterators of the capturing groups. The capturing groups matched by each repetition are identified by walking the stack frames between the first and second special stack frame. After adjusting the capturing groups, fall through to the _Loop_simple_greedy_lastrep handler.
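The capture-group shift performed when backtracking from an intermediate repetition can be sketched as below. The names are hypothetical (`Capture`, `shift_captures_back`, `in_loop`), and the sketch assumes the indices of the groups inside the loop are already known, whereas the PR identifies them by walking the stack frames between the two special frames.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical capture-group record: a half-open range plus a matched flag.
template <class BidIt>
struct Capture {
    BidIt first;
    BidIt second;
    bool matched;
};

// Moves every matched group listed in `in_loop` back by one repetition.
// This relies on each group having the same relative position in every
// repetition, so the previous repetition's group is exactly `rep_length`
// positions earlier.
template <class BidIt>
void shift_captures_back(std::vector<Capture<BidIt>>& caps, const std::vector<std::size_t>& in_loop,
    typename std::iterator_traits<BidIt>::difference_type rep_length) {
    for (const std::size_t idx : in_loop) {
        Capture<BidIt>& cap = caps[idx];
        if (cap.matched) {
            std::advance(cap.first, -rep_length);
            std::advance(cap.second, -rep_length);
        }
    }
}
```

For a loop body like `(a)b` matched three times against `ababab`, the group from the last repetition covers offsets [4, 5); one shift by the repetition length 2 yields [2, 3), which is the group of the second repetition.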

Tests

The tests verify that backtracking from loops still works and capturing groups are set correctly despite these intricate stack manipulations. Backreferences are used to verify the contents of the capturing groups.
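To illustrate how a backreference can pin down the contents of a capturing group after backtracking, here is a small check written against the public `std::regex` interface. It is not taken from the PR's test suite; `group1_position` and the pattern are chosen only for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the offset of capture group 1 in a full
// match of `pattern` against `input`, or -1 if there is no match or the
// group did not participate.
inline std::ptrdiff_t group1_position(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_match(input, m, re) || !m[1].matched) {
        return -1;
    }
    return m.position(1);
}
```

With the pattern `(ab)*ab\1` and the input `abababab`, the loop first consumes all four repetitions and then has to backtrack to two repetitions before the trailing `ab` and the backreference fit, so `group1_position("abababab", "(ab)*ab\\1")` returns 2. The backreference only matches if group 1 holds the text of the repetition the loop was unwound to.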

In the non-greedy case, failing to match a single repetition means that the loop is backtracked from completely. This is why a single test case verifying that the capturing group is unmatched is sufficient here.

In the greedy case, we have to verify that three different opcodes are handled correctly during unwinding, and backtracking after failing the last attempted repetition might stop at any repetition in-between. Moreover, special handling is necessary when the maximum number of repetitions is reached or when backtracking beyond the second or the minimum repetition. The tests are chosen to provide coverage for all these cases.

Benchmark

benchmark                                   before [ns]   after [ns]  speedup
bm_match_sequence_of_as/"a*"/100                2148.44      1286.97     1.67
bm_match_sequence_of_as/"a*"/200                3379.61      2343.75     1.44
bm_match_sequence_of_as/"a*"/400                5580.36      4425.92     1.26
bm_match_sequence_of_as/"a*?"/100               1967.08      2040.32     0.96
bm_match_sequence_of_as/"a*?"/200               3717.91      3683.04     1.01
bm_match_sequence_of_as/"a*?"/400               6835.94      6975.45     0.98
bm_match_sequence_of_as/"(?:a)*"/100            2622.77      1757.81     1.49
bm_match_sequence_of_as/"(?:a)*"/200            4237.58      3247.08     1.31
bm_match_sequence_of_as/"(?:a)*"/400            7952.01      5998.88     1.33
bm_match_sequence_of_as/"(a)*"/100              3989.95       2786.7     1.43
bm_match_sequence_of_as/"(a)*"/200              6835.94       5312.5     1.29
bm_match_sequence_of_as/"(a)*"/400              32994.1      9416.81     3.50
bm_match_sequence_of_as/"(a)*?"/100             4541.02      3288.92     1.38
bm_match_sequence_of_as/"(a)*?"/200             7847.38      6417.41     1.22
bm_match_sequence_of_as/"(a)*?"/400             20402.9      11474.6     1.78
bm_match_sequence_of_as/"(?:b|a)*"/100          3923.69      4589.84     0.85
bm_match_sequence_of_as/"(?:b|a)*"/200          7149.83      8021.76     0.89
bm_match_sequence_of_as/"(?:b|a)*"/400          13183.5      15066.9     0.87
bm_match_sequence_of_as/"(b|a)*"/100            6417.41      6835.94     0.94
bm_match_sequence_of_as/"(b|a)*"/200            16043.5      20996.1     0.76
bm_match_sequence_of_as/"(b|a)*"/400            53013.4      52550.4     1.01
bm_match_sequence_of_as/"(a)(?:b|a)*"/100       4464.29      4499.17     0.99
bm_match_sequence_of_as/"(a)(?:b|a)*"/200       7672.99      8196.15     0.94
bm_match_sequence_of_as/"(a)(?:b|a)*"/400       14125.2      14997.2     0.94
bm_match_sequence_of_as/"(a)(b|a)*"/100         6406.25      6406.25     1.00
bm_match_sequence_of_as/"(a)(b|a)*"/200         14648.4      14125.2     1.04
bm_match_sequence_of_as/"(a)(b|a)*"/400         53013.4        56250     0.94
bm_match_sequence_of_as/"(a)(?:b|a)*c"/100      5161.83      5859.38     0.88
bm_match_sequence_of_as/"(a)(?:b|a)*c"/200      10253.9      9835.34     1.04
bm_match_sequence_of_as/"(a)(?:b|a)*c"/400      18415.3      18415.3     1.00

@muellerj2 muellerj2 requested a review from a team as a code owner December 5, 2025 22:43
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Dec 5, 2025
@StephanTLavavej StephanTLavavej added the "performance" (Must go faster) and "regex" (meow is a substring of homeowner) labels Dec 5, 2025
@StephanTLavavej StephanTLavavej self-assigned this Dec 5, 2025
@StephanTLavavej StephanTLavavej removed their assignment Dec 6, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Dec 6, 2025
@StephanTLavavej
Member

As always, thank you for the incredible insight behind these changes, and the exceptionally detailed writeup - absolutely the gold standard for PRs. 😻

I pushed a minor style nitpick because I like to feel useful 😹

@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Dec 7, 2025
@StephanTLavavej StephanTLavavej merged commit 3635601 into microsoft:main Dec 8, 2025
45 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Dec 8, 2025
@StephanTLavavej
Member

My thanks for this series of regex PRs is endlessly growing! 😻 📈 💚