Fix boyer_moore_searcher with the Rytter correction #724

StephanTLavavej · 2020-04-16T22:23:07Z

This fixes #713, silent bad codegen in an important C++17 feature.

The Boyer-Moore algorithm (published 1977) relies on a "delta2" table, but that paper didn't explain how to generate the table. Another paper by Knuth, Morris, and Pratt (also 1977) provided the algorithm for the delta2 table. Actually, it provided two algorithms producing tables compatible with the usage of delta2: a basic algorithm dd and an improved algorithm dd'. We implemented the latter, and properly translated it from 1-based indexing to 0-based indexing.

However, the published algorithm for dd' was incorrect! This was discovered and fixed by Rytter in 1980, which we weren't aware of until we received a bug report. While the "Rytter correction" was known in the computer science literature, I find it very curious that it isn't constantly mentioned when explaining Boyer-Moore (e.g. Wikipedia's page currently makes no mention of this).

This PR applies this 40-year-old bugfix and significantly expands our test coverage.
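As a cross-check on the table discussed above, here is a minimal brute-force sketch that computes dd' directly from its definition. It assumes the KMP77 definition dd'[j] = min{ s + m - j : s >= 1, (s >= j or pattern[j - s] != pattern[j]), and (s >= i or pattern[i - s] == pattern[i]) for all j < i <= m } with 1-based indexing. The name `brute_force_delta2` is hypothetical. This is not the STL's implementation, which is linear-time. The loop here is cubic in the pattern length and is only meant for verifying small cases.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical reference helper: dd' straight from the (assumed) KMP77 definition.
// The published table is 1-based; entry j is stored at index j - 1.
std::vector<std::size_t> brute_force_delta2(std::string_view pat) {
    const std::size_t m = pat.size();
    std::vector<std::size_t> dd(m);
    for (std::size_t j = 1; j <= m; ++j) { // 1-based mismatch position
        for (std::size_t s = 1;; ++s) {    // candidate shift; s == m always qualifies
            // The shifted pattern must differ at position j (or fall off the left end).
            bool ok = s >= j || pat[j - s - 1] != pat[j - 1];
            // The shifted pattern must agree with the already-matched suffix.
            for (std::size_t i = j + 1; ok && i <= m; ++i) {
                ok = s >= i || pat[i - s - 1] == pat[i - 1];
            }
            if (ok) {
                dd[j - 1] = s + m - j;
                break;
            }
        }
    }
    return dd;
}
```

Under that definition, the last entry comes out as 10 for "aaaaaaaaaa", 2 for "abaabaabaa", and 1 for "ababababab" and "abcabcabc".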

<functional>
- Add top-level const to _Shifts. (This is the dd' output array; I kept the name.)
- Rename _Pat_size to _Mx to follow the naming in the published algorithms.
- Fix comment typo: _RanIt doesn't exist, it's _RanItPat.
- Add comments about the history for future programmer-archaeologists.
- Rename _Suffix_fn to _Fx, again following the papers. Note that in the usage below, there is a semantic change: _Suffix_fn stored 0-based values, while _Fx stores 1-based values. This undoes a micro-optimization (_Suffix_fn avoided unnecessarily translating between the 0-based and 1-based domains), but with the increased usage of f in the Rytter correction, I wanted greater correspondence with the published algorithms in order to verify the implementation.
- _Fx can be top-level const (we don't reassign/reset the unique_ptr).
- Obscure bugfix: unique_ptr<T[]> uses default_delete<T[]> which uses delete[], so we should use new[] to match, not ::new[]. (This makes a difference only for pathologically fancy _Diff types.)
- Change 0-based _Idx to 1-based _Kx.
- Change 0-based _Suffix to 1-based _Tx.
- Rename 1-based _Idx to 1-based _Jx. (While the code was correct, I found it confusing that _Idx was 0-based in other loops but 1-based in this loop.)
- Note that after these changes, the code closely corresponds to the published algorithms, except that subscripting needs to adjust from 1-based to 0-based indexing.
- Implement the Rytter correction, which replaces the final loop of the KMP77 algorithm.
- For clarity and debug codegen, I extracted a repeated array access into _Temp after verifying that this doesn't disturb the algorithm.
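The new[] versus ::new[] point above can be illustrated with a small sketch. `FancyDiff` is a hypothetical stand-in for a "pathologically fancy" difference type that supplies its own array allocation functions; it is not a type from the STL. Because default_delete<T[]> runs an unqualified delete[], which picks up the class-specific operator delete[], the allocation should go through the class-specific operator new[] as well.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical difference-like type with class-specific array allocation.
struct FancyDiff {
    long long value = 0;
    static inline int allocs = 0;
    static inline int frees  = 0;

    static void* operator new[](std::size_t bytes) {
        ++allocs;
        if (void* const ptr = std::malloc(bytes)) {
            return ptr;
        }
        throw std::bad_alloc{};
    }

    static void operator delete[](void* const ptr) noexcept {
        ++frees;
        std::free(ptr);
    }
};

// Returns allocs - frees after a matched new[] / delete[] round trip.
int balanced_after_matched_new() {
    {
        // Unqualified new[] selects FancyDiff::operator new[], matching the
        // FancyDiff::operator delete[] that default_delete<FancyDiff[]> will call.
        // Writing ::new FancyDiff[4] here would allocate with the global operator
        // new[] but still free with FancyDiff::operator delete[], a mismatch.
        const std::unique_ptr<FancyDiff[]> table{new FancyDiff[4]};
        table[0].value = 1;
    }
    return FancyDiff::allocs - FancyDiff::frees;
}
```

For ordinary integer difference types there are no class-specific allocation functions, so both spellings behave identically, which is why the bug was obscure.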
P0220R1_searchers/test.cpp
- Add test cases to test_boyer_moore_table2_construction() from Rytter's paper, and other repetitive patterns where the Rytter correction produces different results. Those patterns were "made from scratch", but for the expected results I just used the output of the implementation and manually verified selected answers for the AB and ABC categories against my understanding of delta2's meaning. This includes the unused last entry (see #714, "<functional>: Boyer-Moore's delta_2 table contains an unnecessary last entry"): I was able to understand why it's 10 for "aaaaaaaaaa", 2 for "abaabaabaa", and why it should be 1 for "ababababab" and "abcabcabc".
- Add the test case from #713 ("<functional>: boyer_moore_searcher produces incorrect results"), plus hand-selected test cases from the randomized testing below (selected for interesting-looking patterns).
- Add randomized test coverage for both Boyer-Moore and Boyer-Moore-Horspool (the latter is not known to have any bugs). This uses a fully-seeded mt19937 (for speed, instead of directly using random_device). It prints out the needle/haystack for any failures, so we don't need the seed printing/reload machinery from P0067R5_charconv/test.cpp. (I'm using mt19937 for 32-bit performance, since we don't need 64-bit values.) The randomized coverage uses alphabets from [a-b] to [a-f]; the former finds more examples of the bug being fixed here (as it's more likely to create highly repetitive patterns). Expanding the alphabet makes repetition unlikely, which is why I stop at [a-f]. The test does a few things to improve non-optimized debug performance (reusing needle and haystack to avoid repeated allocations, using const char * to avoid iterator overhead), although I didn't pursue this to the ultimate limit (e.g. uniform_int_distribution is somewhat costly and could be avoided at the expense of some bias). Finally, as with the other randomized coverage, I print out timing statistics if it takes a long time; on my i7-8700 it's tuned to take ~400 ms which seems reasonable (given the number of configurations, the performance of VMs, and the time needed for compilation).
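The randomized coverage described above amounts to a differential test: generate a random needle and haystack over a small alphabet, then compare boyer_moore_searcher against plain std::search. The following is a minimal sketch of that idea, not the test from this PR. The function name `count_mismatches`, the length ranges, and the trial count are assumptions chosen for illustration, and the caller supplies the mt19937 however it likes.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <string>

// Hypothetical helper: counts trials where boyer_moore_searcher disagrees with
// the default std::search. The alphabet is ['a', last_letter].
std::size_t count_mismatches(std::mt19937& gen, const char last_letter, const std::size_t trials) {
    std::uniform_int_distribution<int> letter{'a', last_letter};
    std::uniform_int_distribution<std::size_t> needle_len{1, 12};   // assumed range
    std::uniform_int_distribution<std::size_t> haystack_len{0, 60}; // assumed range

    // Reuse the strings across trials to avoid repeated allocations.
    std::string needle;
    std::string haystack;
    std::size_t mismatches = 0;

    for (std::size_t trial = 0; trial < trials; ++trial) {
        needle.resize(needle_len(gen));
        haystack.resize(haystack_len(gen));
        for (char& ch : needle) {
            ch = static_cast<char>(letter(gen));
        }
        for (char& ch : haystack) {
            ch = static_cast<char>(letter(gen));
        }

        const auto expected = std::search(haystack.begin(), haystack.end(), needle.begin(), needle.end());
        const auto actual   = std::search(
            haystack.begin(), haystack.end(), std::boyer_moore_searcher(needle.begin(), needle.end()));

        if (expected != actual) {
            ++mismatches; // a real test would print the needle and haystack here
        }
    }
    return mismatches;
}
```

A two-letter alphabet such as ['a', 'b'] produces highly repetitive needles, which is where the delta2 bug showed up; with a correct implementation the function should return 0 for any seed.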

While this changes the behavior of a function, it is ABI-safe. This doesn't require coordinated changes across functions: the rest of the Boyer-Moore machinery is unchanged, and the layout of the delta2 table is unchanged. We're simply filling it with different values. In the event of a mismatch, the linker will pick either the correct or the incorrect algorithm, so this can't make things any worse.


CaseyCarter

Nitpicky style-level comments since I feel like I let people down if I don't complain about something in a PR.

tests/std/tests/P0220R1_searchers/test.cpp

stl/inc/functional

tests/std/tests/P0220R1_searchers/test.cpp


shreevatsa · 2020-04-17T05:16:15Z

For what it's worth: note that although Knuth mentions Rytter's correction in his list of papers (see P71 in his vita), the addendum to his reprint of this paper (as Chapter 9 of Selected Papers on Design of Algorithms) seems to suggest that there may be simpler corrections than Rytter's:

(skipping a page and a bit, where the actual correction by Mehlhorn and the simplification by Dahl are described…)

— sorry if this is irrelevant or already well-known; just posting here in case it's helpful.

miscco · 2020-04-17T06:33:39Z

This makes me really appreciate the current times, where naming is actually considered important.

That said, the algorithm is a clear implementation of the paper.

stl/inc/functional

mozjag · 2020-04-17T13:33:47Z

> For what it's worth: note that although Knuth mentions Rytter's correction in his list of papers (see P71 in his vita), the addendum to his reprint of this paper (as Chapter 9 of Selected Papers on Design of Algorithms) seems to suggest that there may be simpler corrections than Rytter's:

This might also be of interest: "On the shift-table in Boyer-Moore’s String Matching Algorithm" by Yang Wang, which discusses a different way to compute the shift-table. It also includes an improved version (by A. V. Aho) of the original algorithm.

StephanTLavavej · 2020-04-17T14:00:10Z

Thanks, @shreevatsa and @mozjag - I wasn't aware of that additional history. I've filed #727 to track investigating those algorithms later. For now, because this is silent bad codegen and VS 2019 16.7 is locking down for release, I'm going to go ahead with the Rytter correction, but the enhanced test coverage in this PR should make it easier to replace this algorithm in the future.

(As long as the performance characteristics are similar, my primary desire here is for less source code, so Mehlhorn's and possibly Aho's versions could be interesting; my quick glance at Wang's algorithm makes me think that it requires two phases and more source code. However, if we find that there are significant performance differences with modern hardware/compilers, then that's worth paying significantly more source code here.)

Fix microsoft#713 <functional>: boyer_moore_searcher produces incorrect results

a7da5cc

StephanTLavavej added the "bug" and "high priority" labels Apr 16, 2020

StephanTLavavej requested a review from BillyONeal April 16, 2020 22:23

StephanTLavavej requested a review from a team as a code owner April 16, 2020 22:23

BillyONeal reviewed Apr 16, 2020


Avoid comment redundancy.

c17d986

BillyONeal approved these changes Apr 17, 2020


Work in size_t, casting to _Diff only when necessary.

2087e11

Add an overflow check for pathologically small _Diff types with large patterns.

StephanTLavavej requested a review from BillyONeal April 17, 2020 00:28

BillyONeal approved these changes Apr 17, 2020


CaseyCarter approved these changes Apr 17, 2020


Code review feedback.

dd90341

* Use auto after static_cast.
* Simplify mt19937 seeding because it's always 32-bit.
* Stay within the chrono::duration domain.

sachinjoseph approved these changes Apr 17, 2020


mattspring reviewed Apr 17, 2020


vityok mentioned this pull request Apr 17, 2020

see if BM and BMH need correction vityok/cl-string-match#3

Open

StephanTLavavej mentioned this pull request Apr 17, 2020

<functional>: Investigate other algorithms for Boyer-Moore delta2 #727

Open

CaseyCarter assigned CaseyCarter and StephanTLavavej and unassigned CaseyCarter Apr 17, 2020

StephanTLavavej merged commit 5144647 into microsoft:master Apr 20, 2020

StephanTLavavej deleted the rytter branch April 20, 2020 21:37

StephanTLavavej mentioned this pull request Jun 18, 2020

tests: Avoid dialog boxes, prevent stdout from being lost #906

Merged
