Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a new (split) PEG special #1346

Merged
merged 1 commit into from
Jan 15, 2024
Merged

Conversation

ianthehenry
Copy link
Contributor

@ianthehenry ianthehenry commented Jan 1, 2024

This works similarly to string/split, but the separator is a PEG.

I also added a peg/split helper function, and a new opcode to optimize the (to -1) pattern so that it will only do two passes over the input instead of three.

@sogaiu
Copy link
Contributor

sogaiu commented Jan 1, 2024

With respect to peg/split, perhaps there will be requests for it later and if so, addition can be considered at that time? (Although I have wanted something along those lines before and imagine others have too.)

@sogaiu
Copy link
Contributor

sogaiu commented Jan 1, 2024

For any other folks who might be checking this PR out, highly recommended are the tests for gaining some familiarity.

As the tests demonstrate, it seems worth noting that captures get dropped from the separator pattern and that the separator pattern doesn't have to match anything.

That the "separator can be an arbitrary PEG" is especially nice (possibly one of the main motivators?) and what inclines me toward a peg/split function (^^;

@ianthehenry ianthehenry force-pushed the peg-split branch 4 times, most recently from ee5eee9 to 8dddb0c Compare January 2, 2024 00:47
@ianthehenry
Copy link
Contributor Author

ianthehenry commented Jan 2, 2024

I went ahead and added a peg/split helper function too; it seems like it's generally useful.

@sogaiu
Copy link
Contributor

sogaiu commented Jan 2, 2024

Working well here so far.

Nice that the start and args equivalents of peg/match seem to work too.

@sogaiu
Copy link
Contributor

sogaiu commented Jan 5, 2024

So I've been studying RULE_SPLIT using gdb (actually rr [1]) a bit and for the following Janet code (one of the tests I think):

(peg/match ~(split "," (capture (to -1)))
           "a,,bar,c")

when RULE_SPLIT is done and returning, text seems to point to what seems like garbage "\031\373\207\253\325\376\377A" -- it's one byte beyond the terminating zero byte I think (see below for more details).

What I'm seeing in the locals view is this:

 const uint8_t *            text 0x55ab899fb241 "\031\373\207\253\325\376\377A"

The rest of the locals view at the time looks like this:

 const uint8_t *       saved_end 0x55ab899fb240 ""
const uint32_t *  rule_separator 0x55ab899fbdec
const uint32_t * rule_subpattern 0x55ab899fbdf8
 const uint8_t *   separator_end 0x0
      PegState *               s 0x7ffceeb16f98
const uint32_t *            rule 0x55ab899fbde0

saved_end is at 0x55ab899fb240 -- this is one byte before text 0x55ab899fb241.

rr says:

(rr) p s->text_start
$1 = (const uint8_t *) 0x55ab899fb238 "a,,bar,c"

(rr) p saved_end
$2 = (const uint8_t *) 0x55ab899fb240 ""

(rr) p s->text_start + 8
$3 = (const uint8_t *) 0x55ab899fb240 ""

(rr) p *(s->text_start + 8)
$4 = 0 '\000'

(rr) p s->text_start + 9
$5 = (const uint8_t *) 0x55ab899fb241 "\031\373\207\253\325\376\377A"

IIUC, s->text_start + 8 is the same as saved_end and it points to the terminating zero byte.

The code works fine here but this feels odd. May be it's nothing to worry about?


[1] See this gist for using rr with janet if interested.

@ianthehenry
Copy link
Contributor Author

Hey thanks! That's interesting. It does seem wrong to advance past the input; I don't know if anything bad would happen but it seems plausible. I don't think any other rule does it, so split probably shouldn't either.

I just pushed a fix although I'm not really sure how to write an automated test for this; I'll try harder when I get off work.

@sogaiu
Copy link
Contributor

sogaiu commented Jan 5, 2024

I've no clue how one would write a test for this.

Perhaps it isn't necessary to do so.

This works similarly to string/split, but the separator is a PEG.
@ianthehenry
Copy link
Contributor Author

I found a test that failed with the old behavior but not the new behavior: the rule 0 should be a no-op that always succeeds, but (* (split ...) 0) would fail because it had advanced past the end of the input. This works as expected after the fix this morning. Great catch!

@sogaiu
Copy link
Contributor

sogaiu commented Jan 6, 2024

Nice that you found that test and thanks for the fix 🎉

@bakpakin
Copy link
Member

bakpakin commented Jan 6, 2024

So I don't like how peg/split doesn't accept precompiled PEGs unlike all of the other peg functions. I think we should remove that particular fuction for that reason, or implement it so that that works.

@bakpakin
Copy link
Member

bakpakin commented Jan 6, 2024

the TO_END rule also looks good and seems to be a good way to possibly speed up some matches. That said, I think for more optimizations a benchmark suite is needed. I'm reluctant to add performance optimizations without measurement unless they are trivial.

@ianthehenry
Copy link
Contributor Author

ianthehenry commented Jan 6, 2024

Oh yeah, I didn't think of that! It would be possible but maybe crazy to add a RULE_RUN_THIS_OTHER_PEG and allow interpolating pre-compiled PEGs inside arbitrary PEGs. Or to copy in their bytecode/constants with adjusted offsets. But probably the simplest thing is to implement peg/split as a separate cfunction.

I just removed that commit for now; I'm not really convinced the function is pulling its weight if it requires re-implementing the split logic.

@bakpakin
Copy link
Member

Ok, currently LGTM. Thanks @ianthehenry !

@bakpakin bakpakin merged commit 6d7e852 into janet-lang:master Jan 15, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants