Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: negative/positive look-ahead/behind, non-consuming surrounding context parsers, negation parsers #50

Open
ladihzey opened this issue Dec 29, 2022 · 7 comments
Assignees
Labels
A-combinators Area: Issues related to combinators A-parsers Area: Issues related to parsers C-feature-accepted Category: A feature request that has been accepted and awaiting implementation P-high Priority: High

Comments

@ladihzey
Copy link

ladihzey commented Dec 29, 2022

There are multiple things described in this issue, but I think it's better to keep them close.

I'm doing some free text parsing migrating from the regular expressions based solution. I faced an issue that I couldn't fully specify the surrounding context. when is the only parser that could work with context but documentation doesn't say anything whether it consumes input or not (I assume it does which raises a question how it is different from sequence). Also when allows to specify preceding context only and I'm not sure what should I use to specify the context after the target parser.

It would be nice to have non-consuming parser (context) and negation parser (not) so it would be possible to specify surrounding context e.g.

// context(before, target, optional after)

const helloWorld = string('hello world');
const strictHelloWorld = context(
    not(letter()),
    helloWorld,
    not(whole())
);

const parser1 = sequence(any(), strictHelloWorld, any());
const parser2 = sequence(any(), helloWorld, any());

run(parser1).with('1hello worldA'); // result.isOk = true, value = ['1', 'hello world', 'A'];
run(parser1).with('Ahello world1'); // result.isOk = false

run(parser2).with('1hello worldA'); // result.isOk = true, value = ['1', 'hello world', 'A'];
run(parser2).with('Ahello world1'); // result.isOk = true, value = ['A', 'hello world', '1'];

Also it would be nice to have some default non-consuming parsers such as wordBoundary. What is more this will require polyfilling JS \b since it's not unicode friendly.

More examples can be found in the codesandbox.

@norskeld norskeld self-assigned this Dec 29, 2022
@norskeld norskeld added the feat New feature or request label Dec 29, 2022
@norskeld
Copy link
Owner

Hey, thanks for the feedback and suggestions! I've already started working on lookahead/backtracking combinators and playing around with non-consuming parsers in general, so your use cases and thoughts will be a great help for sure.

I have to take care of some personal stuff though, so I'll get back to this and other issues on weekends-holidays.

@ladihzey
Copy link
Author

ladihzey commented Dec 30, 2022

Thank you for your work, I really appreciate it!
I've just realized that I forgot to save changes in the codesandbox and provided you incorrect samples. Now it should work: codesandbox.

@norskeld
Copy link
Owner

norskeld commented Dec 31, 2022

I took a glance at your examples and can tell you right away what the problem is: the missing g flag in regular expressions. This "caveat" is mentioned in the docs, as well as the other one.

Under the hood regexp parser uses RegExp.exec (for speed) and resets lastIndex to parsing position, so yeah, use the g flag.

As for the word boundaries... Well, it's sad it doesn't work, but I believe this should be handled on the library consumer's side, as this seems to be a rather specific issue with regular expressions in JS. Besides, nothing stops you from either wrapping regexp parser and providing it with a fixed regular expression, or even writing a custom parser (this actually should be documented, hopefully I'll be able to spare some time for that).

Ultimately, I would like to keep the core of the library as small as possible.


As for the other stuff in the issue, I'll reply later and link a PR.

@ladihzey
Copy link
Author

Oh, adding g flag actually resolved the issue. I was thinking all this time that I must not use g flag, seems like I missread it.

What concerns word boundaries it was my desire to completely migrate from regexps and I'm just fine to implement it myself! I do not expect that this library will cover all the features of regular expressions. And if there will be such a need I guess it should be done through an extension package e.g. sigma/extra or something like that.

@norskeld
Copy link
Owner

norskeld commented Jan 5, 2023

Returning to the first part of the issue...

  1. I believe the context combinator you suggested will be essentially attempt(takeMid(before, target, after)) or lookahead(takeMid(before, target, after)). Both will be rip-offs of try and lookAhead from Parsec respectively.

  2. The not combinator is kinda tricky, not only because it can't be typed correctly without ugly casts, but also because I have no clear understanding on how to properly integrate it into the existing wiring. I'll leave it for now along with context, will be added as separate issues to backlog.

  3. The when combinator was originally conceived to allow creating parsers dynamically, not statically (like sequence). See the example below.

    when(context, parser)

    It is consuming if context is consuming and it will call parser callback only if context succeeds. Why? Because allowing to produce something on failure would be virtually a way for performing recovery, which I want to be a separate combinator's responsibility. Also I kinda regret of not giving it a better name, like chain, but I already had chainl at the moment which provides completely orthogonal functionality, so it would be confusing. Naming is damn hard...

    Also I have to admit that the example in the docs is hilariously bad, will be fixed. A more elaborate example would look like this:

    const Parser = when(takeLeft(letters(), whitespace()), ({ value }) => {
      switch (value) {
        case 'integer': return integer()
        case 'string': return letters()
        case 'bracketed': return takeMid(string('('), letters(), string(')'))
        // TODO: Add `success` and `failure` helpers.
        default: return failure('Expected integer, string or bracketed string.')
      }
    })
    
    console.log(run(Parser).with('integer 42'))
    
    // {
    //   isOk: true,
    //   pos: 10,
    //   value: 42
    // }
    
    console.log(run(Parser).with('string Something'))
    
    // {
    //   isOk: true,
    //   pos: 16,
    //   value: 'Something'
    // }
    
    console.log(run(Parser).with('bracketed (Something)'))
    
    // {
    //   isOk: true,
    //   pos: 21,
    //   value: 'Something'
    // }
    
    console.log(run(Parser).with('unknown input'))
    
    // {
    //   isOk: false,
    //   pos: 8,
    //   expected: 'Expected integer, string or bracketed string.'
    // }

@ladihzey
Copy link
Author

ladihzey commented Jan 6, 2023

The not combinator is kinda tricky, not only because it can't be typed correctly without ugly casts, but also because I have no clear understanding on how to properly integrate it into the existing wiring. I'll leave it for now along with context, will be added as separate issues to backlog.

The idea must be silly but what if we restrict the use of not parser only for work with contexts (lookahead/behind) and make it return boolean? So its only responsibility would be to answer the question whether the input was matched or not which is the only thing that lookahead/behind requires in regexp world.

I know that it conflicts with the overall API design but maybe it's not needed to make it work the way other parsers do. It may be just fine to make it work as a support function for lookahead/behind functionality only.

@norskeld
Copy link
Owner

norskeld commented Jan 6, 2023

Hm-m, that could work, thanks for the idea! I'll have to add names or some other identification means for parsers and combinators, but I'll need to add it anyway for future tracing/debugging features.

@norskeld norskeld added A-parsers Area: Issues related to parsers A-combinators Area: Issues related to combinators C-feature-accepted Category: A feature request that has been accepted and awaiting implementation P-high Priority: High and removed feat New feature or request labels Jan 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-combinators Area: Issues related to combinators A-parsers Area: Issues related to parsers C-feature-accepted Category: A feature request that has been accepted and awaiting implementation P-high Priority: High
Projects
None yet
Development

No branches or pull requests

2 participants