Integrate misc changes from long running capture branch #414

silentbicycle · 2023-04-21T20:31:20Z

This is a bunch of changes I can integrate independently from my long-running branch to add capture support, to help keep the review size down:

Address all current warnings for UB (add casts to lx output, fix signed/unsigned arithmetic in vm/v1.c)
Move some common code to include/adt/.
Add #ifdef'd out code that will be necessary for fuzzing.
Add fsm_generate_matches, which will be used by the fuzzer later to compare capture behavior against PCRE.
Make ast_rewrite's ALT case deduplication preserve order, since ALT branch order impacts capture results.

I would like to fsm_generate_matches to have random walk support later on -- this is depth-first with a bounded result buffer triggering backtracking -- but that has already been very useful for testing capture behavior. Any other suggestions for making that more generally useful? The current implementation likely has some other functionality gaps because work so far was mainly motivated by testing.

This is mainly used for fuzz testing -- we can use gen to walk a DFA to generate matching input strings up to a certain length, so then we can compare capture behavior against PCRE for those particular inputs.

See #386 on katef/libfsm. This is a workaround for a bug in the parser -- once the fuzzer finds it, it tends to get in the way of finding deeper issues.

Previously it sorted the ALT case subtrees to find and discard unique ones, but capture results are affected by ALT case ordering, so we need to preserve ordering while eliminating duplicates.

Add the explicit cast to `lx->push(lx->buf_opaque, (char)c)`.

silentbicycle · 2023-04-24T14:15:13Z

Rebased onto main.

katef · 2023-04-24T17:10:09Z

include/fsm/walk.h

+    fsm_state_t end_state, void *opaque);
+int
+fsm_generate_matches(struct fsm *fsm, size_t max_length,
+    fsm_generate_matches_cb *cb, void *opaque);


this is great, it makes sense alongside the printing to ABNF and then through kgt to blab(1) pipeline I've been using for generating matches, because here you have exact coverage for the entire graph. I like that.

Can you add an interface to fsm(1) to do this please?

In re as well? For testing purposes I only did it in re, but that's because on my working branch re's output isn't yet serializing capture info in a way that fsm can read back.

I think adding a random walk for generation will be good, the existing graphs can obviously get pretty cyclic, but that seems like it belongs in its own PR -- this one is just about clearing out some misc. stuff that would make the PR for captures even bigger.

There's a -W <arg> interface for gen_words that is not yet implemented; I could use that for "walk", since -g and -G are already taken.

On the other hand, -G is for glushkovise, which is now an alias for -E / remove_epsilons, and -G is available in re, so if I could use that for generating output in both, but that's a breaking change. It doesn't look like anything in tests/ is using -G though. Thoughts?

Agreed, let's merge this PR as-is, and introduce CLI interfaces separately

katef · 2023-04-24T17:12:04Z

src/libfsm/gen.c

+grow_buf(struct gen_ctx *ctx);
+
+static bool
+grow_stack(struct gen_ctx *ctx);


Just a style thing, there's no need for prototypes in .c files unless you have mutually-recursive functions. Declaring a function does the same thing as writing a prototype. At some point I'll go through and re-order a lot of these files and remove the prototypes, but I'm holding off on that until the group capture work lands.

katef · 2023-04-24T17:13:32Z

src/libfsm/print/c.c

@@ -97,7 +97,7 @@ print_ranges(FILE *f, const struct ir *ir, const struct fsm_options *opt,
 			}
 		} else for (c = ranges[k].start; c <= ranges[k].end; c++) {
 			fprintf(f, "\t\t\tcase ");
-			c_escputcharlit(f, opt, c);
+			c_escputcharlit(f, opt, (char)c);


sfstewman · 2023-05-04T02:02:04Z

src/libfsm/vm/v1.c

+				if (rel >= 0) {
+					st->pc += (uint32_t)rel;
+				} else {
+					st->pc -= (uint32_t)-rel;
+				}


Sorry to be late to comment here, but I think there's a simpler change that avoids adding a branch. rel is an int32_t and st->pc is a uint32_t. We should be able to cast rel to uint32_t and just add them; the unsigned wrap-around will take care of the rest:

st->pc += (uint32_t)rel;

silentbicycle requested a review from katef April 21, 2023 20:31

silentbicycle added 10 commits April 24, 2023 10:14

include/adt/pv.h: Permutation vector building.

e41b850

minimise.c: Use include/adt/pv.h.

7027369

Add fsm_generate_matches (src/libfsm/gen.c).

5727957

This is mainly used for fuzz testing -- we can use gen to walk a DFA to generate matching input strings up to a certain length, so then we can compare capture behavior against PCRE for those particular inputs.

UBSAN: Add char signedness casts.

fa960bc

parser.act: Avoid crash in parser from '(*:'.

692ce7b

See #386 on katef/libfsm. This is a workaround for a bug in the parser -- once the fuzzer finds it, it tends to get in the way of finding deeper issues.

print/fsm.c: Use u64bitset ADT.

f2c4075

ast_rewrite: Make ast_rewrite's ALT case deduplication preserve order.

b5fb213

Previously it sorted the ALT case subtrees to find and discard unique ones, but capture results are affected by ALT case ordering, so we need to preserve ordering while eliminating duplicates.

internedstateset: Use hash funcitons in adt/hash.h.

5263ebe

UB: Address warning for signed/unsigned arithmetic.

9e34f98

UB: Re-gen lexer/parser files to include explicit (char) cast.

0915cdf

Add the explicit cast to `lx->push(lx->buf_opaque, (char)c)`.

silentbicycle force-pushed the sv/integrate-misc-changes-from-long-running-capture-branch branch from 3ba5753 to 0915cdf Compare April 24, 2023 14:14

katef reviewed Apr 24, 2023

View reviewed changes

katef merged commit a63886e into main Apr 24, 2023
426 checks passed

katef deleted the sv/integrate-misc-changes-from-long-running-capture-branch branch April 24, 2023 18:15

sfstewman reviewed May 4, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate misc changes from long running capture branch #414

Integrate misc changes from long running capture branch #414

silentbicycle commented Apr 21, 2023

silentbicycle commented Apr 24, 2023

katef Apr 24, 2023

silentbicycle Apr 24, 2023 •

edited

silentbicycle Apr 24, 2023

katef Apr 24, 2023

katef Apr 24, 2023

katef Apr 24, 2023

sfstewman May 4, 2023

Integrate misc changes from long running capture branch #414

Integrate misc changes from long running capture branch #414

Conversation

silentbicycle commented Apr 21, 2023

silentbicycle commented Apr 24, 2023

katef Apr 24, 2023

Choose a reason for hiding this comment

silentbicycle Apr 24, 2023 • edited

Choose a reason for hiding this comment

silentbicycle Apr 24, 2023

Choose a reason for hiding this comment

katef Apr 24, 2023

Choose a reason for hiding this comment

katef Apr 24, 2023

Choose a reason for hiding this comment

katef Apr 24, 2023

Choose a reason for hiding this comment

sfstewman May 4, 2023

Choose a reason for hiding this comment

silentbicycle Apr 24, 2023 •

edited