-
-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate misc changes from long running capture branch #414
Integrate misc changes from long running capture branch #414
Conversation
This is mainly used for fuzz testing -- we can use gen to walk a DFA to generate matching input strings up to a certain length, so then we can compare capture behavior against PCRE for those particular inputs.
See #386 on katef/libfsm. This is a workaround for a bug in the parser -- once the fuzzer finds it, it tends to get in the way of finding deeper issues.
Previously it sorted the ALT case subtrees to find and discard unique ones, but capture results are affected by ALT case ordering, so we need to preserve ordering while eliminating duplicates.
Add the explicit cast to `lx->push(lx->buf_opaque, (char)c)`.
3ba5753
to
0915cdf
Compare
Rebased onto |
fsm_state_t end_state, void *opaque); | ||
int | ||
fsm_generate_matches(struct fsm *fsm, size_t max_length, | ||
fsm_generate_matches_cb *cb, void *opaque); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great, it makes sense alongside the printing to ABNF and then through kgt to blab(1) pipeline I've been using for generating matches, because here you have exact coverage for the entire graph. I like that.
Can you add an interface to fsm(1) to do this please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In re
as well? For testing purposes I only did it in re
, but that's because on my working branch re
's output isn't yet serializing capture info in a way that fsm
can read back.
I think adding a random walk for generation will be good, the existing graphs can obviously get pretty cyclic, but that seems like it belongs in its own PR -- this one is just about clearing out some misc. stuff that would make the PR for captures even bigger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a -W <arg>
interface for gen_words
that is not yet implemented; I could use that for "walk", since -g
and -G
are already taken.
On the other hand, -G
is for glushkovise
, which is now an alias for -E
/ remove_epsilons
, and -G
is available in re
, so if I could use that for generating output in both, but that's a breaking change. It doesn't look like anything in tests/
is using -G
though. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's merge this PR as-is, and introduce CLI interfaces separately
grow_buf(struct gen_ctx *ctx); | ||
|
||
static bool | ||
grow_stack(struct gen_ctx *ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a style thing, there's no need for prototypes in .c files unless you have mutually-recursive functions. Declaring a function does the same thing as writing a prototype. At some point I'll go through and re-order a lot of these files and remove the prototypes, but I'm holding off on that until the group capture work lands.
@@ -97,7 +97,7 @@ print_ranges(FILE *f, const struct ir *ir, const struct fsm_options *opt, | |||
} | |||
} else for (c = ranges[k].start; c <= ranges[k].end; c++) { | |||
fprintf(f, "\t\t\tcase "); | |||
c_escputcharlit(f, opt, c); | |||
c_escputcharlit(f, opt, (char)c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
if (rel >= 0) { | ||
st->pc += (uint32_t)rel; | ||
} else { | ||
st->pc -= (uint32_t)-rel; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be late to comment here, but I think there's a simpler change that avoids adding a branch. rel
is an int32_t
and st->pc
is a uint32_t
. We should be able to cast rel
to uint32_t
and just add them; the unsigned wrap-around will take care of the rest:
st->pc += (uint32_t)rel;
This is a bunch of changes I can integrate independently from my long-running branch to add capture support, to help keep the review size down:
vm/v1.c
)include/adt/
.#ifdef
'd out code that will be necessary for fuzzing.fsm_generate_matches
, which will be used by the fuzzer later to compare capture behavior against PCRE.ast_rewrite
's ALT case deduplication preserve order, since ALT branch order impacts capture results.I would like to
fsm_generate_matches
to have random walk support later on -- this is depth-first with a bounded result buffer triggering backtracking -- but that has already been very useful for testing capture behavior. Any other suggestions for making that more generally useful? The current implementation likely has some other functionality gaps because work so far was mainly motivated by testing.