Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1. #440

Open
wants to merge 54 commits into
base: sv/integration-target--combinable-DFA-capture-resolution
Choose a base branch
from

Commits on May 1, 2023

  1. Configuration menu
    Copy the full SHA
    db702ff View commit details
    Browse the repository at this point in the history
  2. Switch adt code to uint64_t for hashes, not unsigned long.

    Remove include/adt/mappingset.h, it's no longer used.
    silentbicycle committed May 1, 2023
    Configuration menu
    Copy the full SHA
    41a92d7 View commit details
    Browse the repository at this point in the history

Commits on May 2, 2023

  1. stateset: Add EXPENSIVE_CHECKS guard around expensive asserts.

    This and src/libfsm/internal.h's EXPENSIVE_CHECKS should move to a
    common place later.
    silentbicycle committed May 2, 2023
    Configuration menu
    Copy the full SHA
    ec30958 View commit details
    Browse the repository at this point in the history
  2. Namespace epsilon_closure and closure_free with "fsm_".

    These symbols are exported in the library.
    silentbicycle committed May 2, 2023
    Configuration menu
    Copy the full SHA
    d1ce686 View commit details
    Browse the repository at this point in the history

Commits on May 3, 2023

  1. tests/capture/: Delete outdated capture tests.

    These rely on either direct FSM construction or setting captures via a
    capture path, but captures are now implemented by a completely different
    mechanism.
    
    New tests will be added with the capture code in a future commit.
    silentbicycle committed May 3, 2023
    Configuration menu
    Copy the full SHA
    deb14a8 View commit details
    Browse the repository at this point in the history

Commits on Jun 2, 2023

  1. Configuration menu
    Copy the full SHA
    ffea22e View commit details
    Browse the repository at this point in the history
  2. Complemely rework capture resolution.

    This is a big commit, unfortunately difficult to break apart
    further due to interface changes, metadata being passed through
    whole-FSM transformations, and so on. Sorry about that.
    
    - Delete code related to capture action metadata on edges.
      That approach made FSM transformations (determinisation,
      minimisation, etc.) considerably more expensive, and there
      were some corner cases that I wasn't able to get working
      correctly.
    
    - Switch to a somewhat simpler method, adapted from Russ Cox's
      "Regular Expression Matching: the Virtual Machine Approach".
      Since the capture resolution metadata (an opcode program
      for a virtual machine) is associated with individual end
      states, this combines cleanly when multiple regexes are
      unioned into a single large DFA that matches them all at once.
    
    - Add lots of capture regression tests, mostly from using libfsm's
      `fsm_generate_matches` and a fuzzer to compare behavior against
      PCRE. This brought many, many obscure cases to light. The updated
      fuzzer harness will appear in a later commit.
    silentbicycle committed Jun 2, 2023
    Configuration menu
    Copy the full SHA
    2beeff4 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    31ff234 View commit details
    Browse the repository at this point in the history
  4. fuzz/target.c: Add multi-regex and single regex cmp against PCRE.

    Add a couple more execution modes to the fuzzer test harness.
    
    In particular, add support for comparing match and capture behavior
    against PCRE's -- because this depends on linking with libpcre, it's
    disabled by defalut. Set `PCRE_CMP` in the Makefile to enable it, or
    pass it as a build argument.
    silentbicycle committed Jun 2, 2023
    Configuration menu
    Copy the full SHA
    77f9260 View commit details
    Browse the repository at this point in the history
  5. re: Add -FC flag to disable captures.

    I'm adding this because many of the existing tests under
    `tests/pcre-anchor/` and so on contain regexes that would now be
    rejected as unsupported in combination with captures, but they are
    testing cases unrelated to capturing.
    silentbicycle committed Jun 2, 2023
    Configuration menu
    Copy the full SHA
    318c9ad View commit details
    Browse the repository at this point in the history
  6. tests/*/Makefile: Add -FC (no captures) for some calls to RE.

    There are several tests that have nothing to do with captures,
    capture behavior is tested directly with `tests/captures/`.
    silentbicycle committed Jun 2, 2023
    Configuration menu
    Copy the full SHA
    5e49713 View commit details
    Browse the repository at this point in the history
  7. ast_analysis: Reject '\z' as unsupported.

    We currently get the wrong capture result for it.
    silentbicycle committed Jun 2, 2023
    Configuration menu
    Copy the full SHA
    6e410b3 View commit details
    Browse the repository at this point in the history

Commits on Jun 15, 2023

  1. Merge branch 'main' into sv/integrate-combinable-DFA-capture-resolution

    There were modifications to metadata associated with end states on both
    sides. A few other changes will appear in separate commits.
    
    Conflicts:
    	include/fsm/fsm.h
    	src/libfsm/consolidate.c
    	src/libfsm/endids.c
    	src/libfsm/merge.c
    silentbicycle committed Jun 15, 2023
    Configuration menu
    Copy the full SHA
    5ddb4f0 View commit details
    Browse the repository at this point in the history
  2. union.c: Remove EXPENSIVE_CHECKS based on removed interface.

    Now instead of exposing exactly how many captures the fsm has, we keep
    track of the ceiling of the count, to track how large the capture buffer
    needs to be. We could add this back if fsm_capture_count gets re-added.
    silentbicycle committed Jun 15, 2023
    Configuration menu
    Copy the full SHA
    86ec9e4 View commit details
    Browse the repository at this point in the history
  3. bugfix: resize endid buffer for carry_end_metadata properly.

    Previously we doubled the buffer if it wasn't large enough, but
    the next endid array may be > twice the size of the old buffer,
    so we need to keep expanding until it's large enough.
    
    We weren't saving the updated array size, so this could potentially lead
    to repeated doubling and eventually allocation failures.
    
    Also, change the assert for fsm_getendids's result -- if we get to
    that point, it should always be found, not just not an insufficient
    space error.
    silentbicycle committed Jun 15, 2023
    Configuration menu
    Copy the full SHA
    d54ba0d View commit details
    Browse the repository at this point in the history
  4. Address a couple warnings from scan-build.

    determinise: It's not possible to find a cached result in the hash
    table without allocating a to-set buffer first, so assert that it
    will be non-NULL.
    
    fsm_findmode: This should never be used on a state without edges.
    
    vm/v1.c and vm/v2.c: Free allocated return value on error.
    silentbicycle committed Jun 15, 2023
    Configuration menu
    Copy the full SHA
    e2ac4d7 View commit details
    Browse the repository at this point in the history
  5. fuzz/target.c: In MULTI mode, check endid behavior.

    For each pattern 0..n that will be combined, set an endid on them.
    Then, generate inputs that match, and check that the endid result
    on the single and combined FSMs are consistent.
    silentbicycle committed Jun 15, 2023
    Configuration menu
    Copy the full SHA
    c11f372 View commit details
    Browse the repository at this point in the history

Commits on Jul 10, 2023

  1. minimisation: Distinct sets of end IDs should not split ECs.

    In minimise.c, `split_ecs_by_end_metadata` does a first pass splitting
    ECs based on their end data (like the name says). This sets which end
    metadata should prevent states from starting out in the same EC,
    effectively which states can/cannot be combined once minimisation's
    analysis is done.
    
    Previously, distinct sets of end IDs would keep states from merging, but
    if epsilon removal and determinisation have led to end states with
    distinct sets of end IDs, that alone shouldn't prevent them from merging
    later -- the same end state just becomes associated with all those end
    IDs. We do prevent states with distinct capture ID sets from merging,
    but that's because of a few special cases like "^a(b)c$" and "^a(b*)c$",
    where combining partially overlapping regexes' end states could lead to
    false positives in the capture result.
    
    Note: I added checking the program IDs (which was missing) to
    `same_end_metadata`. This seems correct to me, but at the moment I can't
    think of any test inputs that would lead to the same sets of capture IDs
    but distinct sets of program IDs. I will see if fuzzing can find any.
    
    This is tested by tests/endids/endids2_union_many_endids.c
    and the new multi test in tests/capture/capture_test_case_list.c.
    silentbicycle committed Jul 10, 2023
    Configuration menu
    Copy the full SHA
    8b49a1a View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    a3a79ca View commit details
    Browse the repository at this point in the history

Commits on Jul 11, 2023

  1. capture_vm_exec: Address scan-build warnings.

    There were several unused store warnings about values that were only
    set for logging, either `(void)x` them out or restructure the code
    so that their scope is more limited.
    
    Remove `next_offset` from `populate_solution`, since that isn't
    being used at all. IIRC it was added before some of the path sharing
    details made it unnecessary.
    silentbicycle committed Jul 11, 2023
    Configuration menu
    Copy the full SHA
    bfc9a80 View commit details
    Browse the repository at this point in the history

Commits on Jul 17, 2023

  1. fuzz/target.c: Expand build_and_check_multi.

    Also compare in the other direction, generating matching inputs from
    the combined DFA and then check that individual regexes's captures
    still match as expected.
    silentbicycle committed Jul 17, 2023
    Configuration menu
    Copy the full SHA
    f43585a View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    3a93f6d View commit details
    Browse the repository at this point in the history
  3. Remove nullable ALT backpatching, it's now unreachable.

    This was set up to handle edge cases like `^(($)|x)+$` where there is an
    alt with some nullable cases with anchors, surrounded by + repetition,
    which turned up during fuzzing. This is an obscure edge case, it is
    proving very difficult to handle correctly, and there is probably little
    value in actually doing so.
    
    Now we are flagging it as an unsupported PCRE construct in
    ast_analysis.c's analysis_iter_repetition, so the code using
    repeated_alt_backpatches is unreachable. Just remove it.
    silentbicycle committed Jul 17, 2023
    Configuration menu
    Copy the full SHA
    da51d9e View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    9f3ae29 View commit details
    Browse the repository at this point in the history

Commits on Jul 18, 2023

  1. re_capvm_compile: Change "active_node" categorization.

    Strictly speaking this shouldn't include nodes that have been flagged
    with `AST_FLAG_UNSATISFIABLE`. I have re-fuzzed with this changed and it
    does not seem to have introduced any new issues. `active_node` is only
    used in one place.
    silentbicycle committed Jul 18, 2023
    Configuration menu
    Copy the full SHA
    26483cf View commit details
    Browse the repository at this point in the history

Commits on Jul 21, 2023

  1. capture_vm: rename split's .cont and .new to .greedy and .nongreedy.

    Rather than commenting throughout to note that .cont is the greedy
    branch and .new is non-greedy, just rename them.
    silentbicycle committed Jul 21, 2023
    Configuration menu
    Copy the full SHA
    e9353af View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    1e55220 View commit details
    Browse the repository at this point in the history

Commits on Jul 24, 2023

  1. Merge branch 'main' into sv/integrate-combinable-DFA-capture-resolution

    Addressed some merge conflicts to fuzz/target.c.
    silentbicycle committed Jul 24, 2023
    Configuration menu
    Copy the full SHA
    36dfddf View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    937c9f0 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    5e54a8b View commit details
    Browse the repository at this point in the history

Commits on Jul 25, 2023

  1. Configuration menu
    Copy the full SHA
    94d6e7f View commit details
    Browse the repository at this point in the history
  2. re_capvm_compile.c: Avoid potential sign-extension bug.

    `expr->u.literal.c` is signed, so when negative casting it to `uint64_t`
    can lead to a 64-bit value > 255. Clamp it to 8 bits.
    silentbicycle committed Jul 25, 2023
    Configuration menu
    Copy the full SHA
    dced7c7 View commit details
    Browse the repository at this point in the history
  3. Do non-zero allocations to silence EFENCE.

    EFENCE breaks CI builds that (correctly) do a zero-size allocation when
    calling `calloc` with a count of zero, reporting that it's a likely bug.
    Use a size of 1 instead of 0 to silence it.
    silentbicycle committed Jul 25, 2023
    Configuration menu
    Copy the full SHA
    8f0a683 View commit details
    Browse the repository at this point in the history
  4. re_capvm_compile: subtree_represents_character_class: handle empty b.

    The comment says this handles a .b that is EMPTY, but the recursive
    call rejects it. Something in PCRE suite in CI is triggering this.
    silentbicycle committed Jul 25, 2023
    Configuration menu
    Copy the full SHA
    722260a View commit details
    Browse the repository at this point in the history

Commits on Sep 11, 2023

  1. state_set: Improve state_set_search performance, correct result.

    The description says "Return where an item would be, if it were
    inserted", but it was returning the last element <= rather than
    the first element >=, then the call to `state_set_cmpval` later
    was shifting i by 1 for that specific case. Handle it correctly
    inside the search function instead. Two other all call sites
    need to check whether the result refers to the append position
    (one past the end of the array) before checking `set->a[i] == state`,
    update them.
    
    Add a fast path upfront: It's VERY common to append states in order
    to the state array, so before we binary search each first compare
    against the last entry (unless empty).
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    75bb88d View commit details
    Browse the repository at this point in the history
  2. stateset: Avoid memmove of size 0.

    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    709b8cc View commit details
    Browse the repository at this point in the history
  3. stateset: Add note about potentially expensive assertion.

    In -O0 this can become pretty expensive (~25% of overall runtime for
    `time ./re -rpcre -C '^[ab]{0,2000}$'`), but when built with -O3 very
    little overhead remains. I'm adding this comment because every time I
    see this it seems to me like it should have `EXPENSIVE_CHECKS` around
    it, but profiling is telling me it really doesn't matter.
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    cead0d9 View commit details
    Browse the repository at this point in the history
  4. stateset: Comment struct fields.

    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    cbfeddd View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    c3dab77 View commit details
    Browse the repository at this point in the history
  6. edgeset: Switch from linear to binary searching in edge_set_add_bulk.

    This is a major hotspot when doing epsilon removal over large runs of
    potentially skipped states (as might appear from `^[ab]{0,2000}$`).
    
    Add a fast path for appending, which is also very common.
    
    Extract the edge set destination search into its own function,
    `find_state_position`, and add a `#define` to switch between linear
    search, binary search, or calling both and comparing the result.
    I will remove linear search in the next commit, but am checking this
    in as an intermediate step for checking & benchmarking.
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    1ada07c View commit details
    Browse the repository at this point in the history
  7. edgeset: Commit to using binary search.

    When I run `time ./re -rpcre -C '^[ab]{0,2000}$'` locally for -O3:
    - linear search: 2.991s
    - binary search: 1.521s
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    7122d2f View commit details
    Browse the repository at this point in the history
  8. determinise: Drastically reduce calls to qsort.

    After the other changes in this PR, calls to qsort from
    `sort_and_dedup_dst_buf` are one of the largest remaining hotspots in
    the profile. We can often avoid calling qsort, though:
    
    - If there is <= 1 entry, just return, it's sorted.
    
    - Otherwise, first do a sweep through the array noting the min and max
      values. Unless there is a huge range between them, it's much faster to
      build a bitset from them in a small (max 10KB) stack-allocated array
      and then unpack the bitset (now sorted and unique). Only the needed
      portion of the array is initialized.
    
    I have not done a lot of experimentation to find a cutoff point where
    the bitset becomes slower than qsort (it may be much larger), I picked
    10KB because it's likely to be safe to stack-allocate.
    
    I tried changing the bitset unpacking to use an 8 or 16 bit mask and
    jump forward faster through large sub-word ranges of 0 bits, but any
    improvement was lost among random variation, so I decided it wasn't
    worth the extra complexity. We already skip whole words that are 0.
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    937585a View commit details
    Browse the repository at this point in the history
  9. edgeset: Remove stale comment.

    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    30e34ef View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    cf6051f View commit details
    Browse the repository at this point in the history
  11. Configuration menu
    Copy the full SHA
    c1e1282 View commit details
    Browse the repository at this point in the history
  12. bugfix: The range is min..max inclusive, so add 1.

    If min and max are exactly 64 states apart the upper value was getting
    silently dropped due to an incorrect `words` value here.
    
    One of the patterns in the PCRE suite triggers this:
    
        ./re -rpcre '(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))' "caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
    
    This should match, but did not.
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    6eff0f9 View commit details
    Browse the repository at this point in the history
  13. Address a couple warnings from scan-build.

    determinise: It's not possible to find a cached result in the hash
    table without allocating a to-set buffer first, so assert that it
    will be non-NULL.
    
    fsm_findmode: This should never be used on a state without edges.
    
    vm/v1.c and vm/v2.c: Free allocated return value on error.
    silentbicycle authored and katef committed Sep 11, 2023
    Configuration menu
    Copy the full SHA
    98ee906 View commit details
    Browse the repository at this point in the history
  14. Configuration menu
    Copy the full SHA
    51892e3 View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    7c6644f View commit details
    Browse the repository at this point in the history
  16. Configuration menu
    Copy the full SHA
    c646868 View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    0789d61 View commit details
    Browse the repository at this point in the history

Commits on Sep 12, 2023

  1. Configuration menu
    Copy the full SHA
    1ca3726 View commit details
    Browse the repository at this point in the history

Commits on Sep 27, 2023

  1. Merge branch 'main' into sv/integrate-combinable-DFA-capture-resolution

    There were merge conflicts, which had to do with the return type of
    idmap_iter's callback -- on `main` it was `void`, on the branch it was
    `int` and indicated whether iterations should continue or halt. I picked
    the `int` form, and updated `idmap_iter` so that it was actually
    checked, because it probably doesn't make sense to continue iterating
    when earlier steps have errored out.
    silentbicycle committed Sep 27, 2023
    Configuration menu
    Copy the full SHA
    5ece637 View commit details
    Browse the repository at this point in the history

Commits on Oct 2, 2023

  1. ci.yml: Update fuzzer modes for CI.

    - 'd' (default) no longer exists, it's now 'r'
    - add 's', 'i', 'M'
    silentbicycle committed Oct 2, 2023
    Configuration menu
    Copy the full SHA
    239927c View commit details
    Browse the repository at this point in the history