Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate misc changes from long running capture branch #414

Merged
merged 10 commits into from
Apr 24, 2023

Conversation

silentbicycle
Copy link
Collaborator

This is a bunch of changes I can integrate independently from my long-running branch to add capture support, to help keep the review size down:

  • Address all current warnings for UB (add casts to lx output, fix signed/unsigned arithmetic in vm/v1.c)
  • Move some common code to include/adt/.
  • Add #ifdef'd out code that will be necessary for fuzzing.
  • Add fsm_generate_matches, which will be used by the fuzzer later to compare capture behavior against PCRE.
  • Make ast_rewrite's ALT case deduplication preserve order, since ALT branch order impacts capture results.

I would like to fsm_generate_matches to have random walk support later on -- this is depth-first with a bounded result buffer triggering backtracking -- but that has already been very useful for testing capture behavior. Any other suggestions for making that more generally useful? The current implementation likely has some other functionality gaps because work so far was mainly motivated by testing.

This is mainly used for fuzz testing -- we can use gen to
walk a DFA to generate matching input strings up to a certain
length, so then we can compare capture behavior against PCRE
for those particular inputs.
See #386 on katef/libfsm.

This is a workaround for a bug in the parser -- once the fuzzer
finds it, it tends to get in the way of finding deeper issues.
Previously it sorted the ALT case subtrees to find and discard unique
ones, but capture results are affected by ALT case ordering, so we
need to preserve ordering while eliminating duplicates.
Add the explicit cast to `lx->push(lx->buf_opaque, (char)c)`.
@silentbicycle silentbicycle force-pushed the sv/integrate-misc-changes-from-long-running-capture-branch branch from 3ba5753 to 0915cdf Compare April 24, 2023 14:14
@silentbicycle
Copy link
Collaborator Author

Rebased onto main.

fsm_state_t end_state, void *opaque);
int
fsm_generate_matches(struct fsm *fsm, size_t max_length,
fsm_generate_matches_cb *cb, void *opaque);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great, it makes sense alongside the printing to ABNF and then through kgt to blab(1) pipeline I've been using for generating matches, because here you have exact coverage for the entire graph. I like that.

Can you add an interface to fsm(1) to do this please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In re as well? For testing purposes I only did it in re, but that's because on my working branch re's output isn't yet serializing capture info in a way that fsm can read back.

I think adding a random walk for generation will be good, the existing graphs can obviously get pretty cyclic, but that seems like it belongs in its own PR -- this one is just about clearing out some misc. stuff that would make the PR for captures even bigger.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a -W <arg> interface for gen_words that is not yet implemented; I could use that for "walk", since -g and -G are already taken.

On the other hand, -G is for glushkovise, which is now an alias for -E / remove_epsilons, and -G is available in re, so if I could use that for generating output in both, but that's a breaking change. It doesn't look like anything in tests/ is using -G though. Thoughts?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's merge this PR as-is, and introduce CLI interfaces separately

grow_buf(struct gen_ctx *ctx);

static bool
grow_stack(struct gen_ctx *ctx);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a style thing, there's no need for prototypes in .c files unless you have mutually-recursive functions. Declaring a function does the same thing as writing a prototype. At some point I'll go through and re-order a lot of these files and remove the prototypes, but I'm holding off on that until the group capture work lands.

@@ -97,7 +97,7 @@ print_ranges(FILE *f, const struct ir *ir, const struct fsm_options *opt,
}
} else for (c = ranges[k].start; c <= ranges[k].end; c++) {
fprintf(f, "\t\t\tcase ");
c_escputcharlit(f, opt, c);
c_escputcharlit(f, opt, (char)c);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@katef katef merged commit a63886e into main Apr 24, 2023
426 checks passed
@katef katef deleted the sv/integrate-misc-changes-from-long-running-capture-branch branch April 24, 2023 18:15
Comment on lines +414 to +418
if (rel >= 0) {
st->pc += (uint32_t)rel;
} else {
st->pc -= (uint32_t)-rel;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be late to comment here, but I think there's a simpler change that avoids adding a branch. rel is an int32_t and st->pc is a uint32_t. We should be able to cast rel to uint32_t and just add them; the unsigned wrap-around will take care of the rest:

st->pc += (uint32_t)rel;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants