Add a new (sub) PEG special #1344

ianthehenry · 2023-12-27T17:30:42Z

The first commit in this branch condenses the two of the existing opcodes into a single opcode, so that there are still exactly 32 PEG opcodes. I asked about this on Zulip, and as I said there I'm still not very confident that this code is necessary. It does remove a conditional branch on the main bytecode loop, but I worry there might be some 32-bitness thing that I don't understand that might make it even more costly to remove.

It then proposes ~~two~~ one new specials:

`sub`

(sub window patt) executes the window pattern, remembers how many bytes it matches, then executes patt over exactly the bytes matched by the window. This allows you to limit what the subpattern can match, e.g. (sub (to "\n") :foo) ensures that :foo does not have a chance to match beyond the current line. If patt also matches, then the full window is consumed by the (sub) rule.

Another good name for this might be limit, as in (limit (to "\n") patt), but this doesn't really communicate that the full matched text is equal to the window pattern.

On the surface this is similar to (cmt (% ...) ,|(peg/match ... $)), but differs in several ways: the subpattern in sub can still refer to :other patterns inside struct rules, the position and other helpers are still relative to the whole input, the pattern can leave multiple captures...

`sep`

I'll move this into its own PR, but I'm preserving the original description so the discussion still makes sense.

(sep separator patt) is a combination of sub, to, and thru. The canonical example would be something like ~(sep "," :w*) to match comma-separated words.

It's similar to ~(some (* (sub (to (+ ,separator -1)) ,patt) (? ,separator))), except that it will not consume terminating separators: the final (? ,separator) rule is conditional on the next instance of the subpattern matching. A more accurate translation would be:

(defn sep [separator patt]
  ~(? (sub (to (+ ,separator -1)) ,patt) (any (* ,separator (sub (to (+ ,separator -1)) ,patt)))))

Which is pretty unwieldy. I think this is a common enough pattern to deserve an optimized helper -- it compiles to a lot less bytecode and doesn't match separators twice.

Since sep repeats, it might be useful to have separte forms like sep-any and sep-some.

Neither of these combinators are really necessary, as you can do a one-pass thing that ensures that your inner patterns never advance past a separator (which is necessary in the case of a real csv parser that has to handle escapes). But they're convenient for lots of simple ad-hoc parsing tasks (hello from advent of code).

bakpakin · 2023-12-27T22:50:36Z

The fact that there are 32 opcodes is entirely coincidental and I was not even aware of this. I suppose this could result in tighter code generation but I doubt it does.

bakpakin · 2023-12-27T22:52:50Z

The sub pattern looks interesting; it's similar to the lookahead combinator but requiring that the pattern is inside the window is useful.

ianthehenry · 2023-12-28T03:47:57Z

Well that simplifies things! Yeah I couldn't work out why the bitmask was there so I didn't want to remove it. But I pulled that commit out.

Potential idea for distinguishing sep-some vs sep-any: an optional "miminum matches" argument. So (sep "," :w+ 0) for any, (sep "," :w+ 1) for some. I doubt you'd really use anything else, but (sep "," :w+ 10) would theoretically work... basically works like at-least.

Another idea is that sep should always require at least one match, and if you want sep-any you can just make it optional. Although that's... hmm. Don't love it.

My instinct is sep and sep+, using the regex sigils, but that's sort of inconsistent with PEGs...

pepe · 2023-12-28T08:48:57Z

Unfortunately, I cannot add anything meaningful to the implementation debate. Yet both specials proposed here make sense to me. In my current project, I am using PEG extensively for parsing small user inputs, and at least sub will be very handy.

sogaiu · 2023-12-28T09:58:05Z

Interested in sub (and am in the midst of implementing it in margaret).

I've not happened to have noticed a case of wanting a sep-like thing (perhaps just not recognizing opportunities), though it seemed a number of folks were discussing an AOC thing this year that might have benefited?

I wonder if there's some value in adding just one special at first (e.g. sub) and seeing how it works out in combination with the existing system and then after a bit of exercising, revisiting the other one. May be that's unnecessary.

For sub, perhaps a test involving error might be relevant:

  (try
    (peg/match ~(sequence "a"
                          (sub "bcd" (error "bc")))
               "abcdef")
    ([e] e))
  # =>
  "match error at line 1, column 2"

as it makes use of the line and column functionality and the PR touches get_linecol_from_position which is used from the error special.

pepe · 2023-12-28T10:45:10Z

@sogaiu I have the same feeling about sep, but the arguments in the description of this PR make it a little bit easier to believe for me :-). Anyway, I like the idea of starting with sub; if everything is correct, consider sep.

sogaiu · 2023-12-28T14:12:48Z

FWIW, in addition to the tests for sub contained in this PR, I've added some additional tests for the sub implementation in margaret. I generated the expected values via a janet compiled from this commit of this PR.

For the interested, they live here.

ianthehenry · 2023-12-28T23:36:58Z

@sogaiu there is a test showing the behavior of the line/col/position helpers, which I think is sufficient to cover the error case as well (if I understand correctly).

Regarding sep, I think it does make sense to open a separate PR for that. I'm not sure if the behavior here is exactly right -- really I think of sep as a way to run string/split inside a PEG. Except that the implementation here was not exactly that -- it's sort of "string split for as long as you can but stop as soon as something fails to match." Which is maybe less useful than something that requires that all split substrings match exactly.

I just rebased this branch and updated the description. There is another high-level sub-inspired combinator that I think would be useful, I might open a separate PR for that and sep/split if this gets merged.

sogaiu · 2023-12-29T02:25:02Z

@ianthehenry The latest commit looks good to me 👍

Probably it wasn't necessary but I also verified against the additional tests I mentioned earlier.

So far my impression is that sub is working well with other bits -- have my fingers crossed that remains the case and that we can see additional goodies you have in mind ;)

(sub) will first match one pattern, then match another pattern against the text that the first pattern advanced over.

ianthehenry · 2023-12-29T06:17:12Z

I found an issue with PEG unmarshaling and I added a previously-failing test for it -- the bytecode verifier was also doing the & 0x1f bitmasking, and I failed to remove that when I rebased out the 32-opcodes-only commit.

bakpakin · 2024-01-01T00:40:44Z

LGTM.

ianthehenry force-pushed the peg-sub-special branch from 4a0798e to c4629ab Compare December 27, 2023 17:31

ianthehenry force-pushed the peg-sub-special branch 2 times, most recently from 909b701 to dfb527b Compare December 28, 2023 03:35

ianthehenry changed the title ~~Add two new PEG specials, (sub) and (sep)~~ Add a new PEG specials (sub) Dec 28, 2023

ianthehenry changed the title ~~Add a new PEG specials (sub)~~ Add a new (sub) PEG special Dec 28, 2023

ianthehenry force-pushed the peg-sub-special branch from dfb527b to 4503ca2 Compare December 28, 2023 23:39

add a new (sub) PEG special

ea75086

(sub) will first match one pattern, then match another pattern against the text that the first pattern advanced over.

ianthehenry force-pushed the peg-sub-special branch from 4503ca2 to ea75086 Compare December 29, 2023 06:15

bakpakin merged commit e8ed961 into janet-lang:master Jan 1, 2024
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a new (sub) PEG special #1344

Add a new (sub) PEG special #1344

ianthehenry commented Dec 27, 2023 •

edited

bakpakin commented Dec 27, 2023

bakpakin commented Dec 27, 2023

ianthehenry commented Dec 28, 2023

pepe commented Dec 28, 2023

sogaiu commented Dec 28, 2023 •

edited

pepe commented Dec 28, 2023

sogaiu commented Dec 28, 2023

ianthehenry commented Dec 28, 2023 •

edited

sogaiu commented Dec 29, 2023 •

edited

ianthehenry commented Dec 29, 2023 •

edited

bakpakin commented Jan 1, 2024

Add a new (sub) PEG special #1344

Add a new (sub) PEG special #1344

Conversation

ianthehenry commented Dec 27, 2023 • edited

sub

sep

bakpakin commented Dec 27, 2023

bakpakin commented Dec 27, 2023

ianthehenry commented Dec 28, 2023

pepe commented Dec 28, 2023

sogaiu commented Dec 28, 2023 • edited

pepe commented Dec 28, 2023

sogaiu commented Dec 28, 2023

ianthehenry commented Dec 28, 2023 • edited

sogaiu commented Dec 29, 2023 • edited

ianthehenry commented Dec 29, 2023 • edited

bakpakin commented Jan 1, 2024

ianthehenry commented Dec 27, 2023 •

edited

`sub`

`sep`

sogaiu commented Dec 28, 2023 •

edited

ianthehenry commented Dec 28, 2023 •

edited

sogaiu commented Dec 29, 2023 •

edited

ianthehenry commented Dec 29, 2023 •

edited