Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement regexes #174

Open
masak opened this issue Aug 14, 2016 · 33 comments
Open

Implement regexes #174

masak opened this issue Aug 14, 2016 · 33 comments

Comments

@masak
Copy link
Owner

masak commented Aug 14, 2016

We're going to need them soon if we want to do is parsed stuff.

Let's start very simple. What are the things needed to get #163 to work?

  • Literal strings. / "abc" /
  • Calling into 007 grammar rules. / <EXPR> / (Update: but see also Regex rule call wishlist #348.)
  • Whitespace handling. / <.ws> / (or something)

There are various other things I foresee we need pretty soon after that (quantifiers, code blocks, maybe assertions, lookaheads/lookbehinds?), but the above is a modest start. I'm not sure we'll ever need backtracking, since people will mostly be implementing parser fragments with these regexes. At the very least, backtracking should be off by default.

But some care would also need to go into the APIs around a new Val::Regex type.

  • How does one match a regex against a string?
    • It's tempting to use infix:<~~> and infix:<!~~> for this... but that'd take it outside of its current role, which is to type-match. It'd turn into a more generic smartmatch operator. Maybe not such a bad thing.
    • Java has a method .matcher which creates an iterator-like Matcher object. We're not doing that.
    • Maybe we should just copy Python's regex API.
  • What does a regex match return?
    • A boolean? That's what infix:<~~> currently returns for type matches. (Modulo Introduce a Val::Bool type #157.) But that won't allow us to get capture group information out of the match...
    • A Match object? But if we want to use infix:<~~> in if statements (and who doesn't?), then some Match objects would need to be falsy, and there is currently no boolification protocol to allow that. Or should we return None when things don't match? (Python does.)
@masak
Copy link
Owner Author

masak commented Aug 14, 2016

Oh, and we'll want to do this behind a feature flag before it's nice and stable.

@vendethiel
Copy link
Collaborator

Do you have a list of links I could browse to have any idea how to parse/execute regexps?

@eritain
Copy link

eritain commented Aug 17, 2016

The links here are relevant to the "execute" side of your question: Implementing Regular Expressions. But the expressions they discuss are traditional Unix-style regular expressions, not the Perl 6 style used here. For the "parse" side then, we need a grammar of Perl 6 grammars (modulo whatever differences there end up being between those and 007 grammars).

The potentially big issue in applying the "Implementing" link I gave is that Perl 6 grammars are strictly more expressive than traditional regular expressions -- that is, there are patterns you can specify in a Perl 6 grammar that you cannot specify in a true regular expression. How to implement them for 007 is going to depend where 007 regexps fall on that scale of expressiveness.

@vendethiel
Copy link
Collaborator

The issue is not parsing regexps. That's "easy" to do. Dealing with the matching itself sounds a bit harder (backtracking etc. Though not too hard standing alone)

@masak
Copy link
Owner Author

masak commented Aug 17, 2016

I don't see us needing backtracking, really.

On Aug 17, 2016 20:16, "ven" notifications@github.com wrote:

The issue is not parsing regexps. That's "easy" to do. Dealing with the
matching itself sounds a bit harder (backtracking etc. Though not too hard
standing alone)


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#174 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AABQk_IV1n5kkeZpQRIiWXz3758V13Zcks5qg1AUgaJpZM4Jj9M6
.

@masak
Copy link
Owner Author

masak commented Aug 18, 2016

Do you have a list of links I could browse to have any idea how to parse/execute regexps?

Allow me to add to @eritain++'s tips by wholeheartedly recommending Regular Expression Matching Can Be Simple And Fast. It makes excellent points about performance (which are relevant to us) and backtracking (or the lack thereof), and it has actual implementation which is clear and helps understanding.

An apposite addendum to that text is that I imagine 007 regexes as favoring simplicity of implementation. That is, we might not want to head down the NFA route with regexes themselves. We can take some amount of performance hit if it gives us a simpler regex implementation. "Oh, a * quantifier, well then, let's while loop until this atom doesn't match anymore, and then succeed." If the code expresses that, then we're fine.

We do want to go the whole NFA hog with grammars, though. It's a later concern, but it's worth pointing out already now. In the end, this will also gain us simplicity; thinking properly in terms of language braids essentially pushes one towards an NFA-based point of view. The grammar is a directed graph on which things match; sometimes nodes come and go due is parsed, slangs, language modules, operators, scoping, etc.

@masak
Copy link
Owner Author

masak commented Aug 18, 2016

Do you have a list of links I could browse to have any idea how to parse/execute regexps?

By the way, nice volunteering! Looking forward to collaborating on this.

In the interests of encouraging @vendethiel and like-minded people, here's what I'm looking for in a first PR:

  • A definition of token term:regex in Syntax.pm.
    • It should recognize a term:str inside /.../ slashes, and nothing else for now. Let's iterate and extend on it bit by bit in later PRs. (I learned this from the object literal PR, which could've been broken up into smaller morsels.)
    • It should be hidden behind a feature flag. See 2759902 for prior art.
  • A corresponding term:regex method in Actions.pm.
  • A new Q type Q::Term::Regex (not Regexp) in Q.pm.
    • Haven't given the internal structure any thought. It's up for demo/discussion.
  • A new value type Val::Regex in Val.pm.
  • Two methods on Val::Regex (in Runtime.pm):
    • fullmatch(str) that returns True if the entire str matches the regex
    • search(str) that returns True if str.index(regex_str) != -1
  • Tests for both methods (positive and negative). So, four tests, I guess. 2759902 also has prior art for how to test behind a feature flag.

@vendethiel
Copy link
Collaborator

vendethiel commented Aug 18, 2016

Right. In this case, it does not look too different from what I did in faux-combinator... It "simply" works on text rather than tokens. (it doesn't really have backtracking. Though it has a simple system of storing the parser state before trying an optional match to restore them if the match fails somewhere along).

Exemple of faux-combinator/perl5:

sub subrule($parser) {
  $parser->one_of({ $parser->expect('a'); }, { $parser->any_of('b'); });
  $parser->try({ $parser->expect(','); });
}

my $parser = FauxCombinator::Parser::new([@tokens]);
$parser->expect('*');
$parser->many_of(&subrule);

which would parse /'*' [['a' | 'b'*] ','?]+/.

By the way, nice volunteering! Looking forward to collaborating on this.

I have so much on my plate ($work, $school, $memoir, and my little rust vm) I'm certainly not going to specify an ETA. But it does look like an interesting project!

@masak
Copy link
Owner Author

masak commented Aug 21, 2016

I'm tempted to close this one now, as we technically have regexes in 007 (under a feature flag). The rest is just a Simple Matter of Programming. 😄

Alternatively, I can keep it open until something like /<EXPR>/ works, too.

@masak
Copy link
Owner Author

masak commented Aug 22, 2016

Ok, let's to <EXPR> before we close it. I'll phrase this one as a PR request, but I don't mind trying my own hand at it too. Anyone wants to grab this one, let me know.

  • Allow <EXPR> to be parsed in a regex. This in itself will lead to various extensions:
    • In the term:regex rule itself
    • In the term:regex action method
    • Most crucially, in Q::Term::Regex — will end up containing a Val::Array of something
      • I'm available for discussing what that "something" would be. It's not clear to me either right now without more thought/discussion.
    • Also in Val::Regex which will also end up holding a Val::Array of values of some kind
      • What should the value corresponding to an <EXPR> call be? That's a really good question.
      • Actually, should the Val::Regex link to or copy from the Q::Term::Regex? Can the latter exist without the former?
  • Check (with tests) that /"James" <EXPR> "Bond"/ matches something like "James1 + 2 + 3Bond"
  • But not something like "James1 + 2 + 3" or something like "Jame1 + 2 + 3Bond".
    • This will necessitate fleshing out the matching algorithm a little bit; it will at least have to start scanning the string, succeeding atom by atom, marking success if all succeed.
    • No quantifiers yet, so we don't have to bother about looping.
    • It'd probably be better if the matching algorithm ended up as a method in Val::Regex, not in Runtime.

@vendethiel
Copy link
Collaborator

I'd say

data RegexFragment = StringFragment str | VariableFragment identifier | SubruleFragment Regex
data Regex = Regex [RegexFragment]

@vendethiel
Copy link
Collaborator

vendethiel commented Aug 22, 2016

Then later, we can add AtLeastNFragment RegexFragment Int and others, rather easily :).

and it means to match a regex you "just" need an eval for your ADT.

@masak
Copy link
Owner Author

masak commented Aug 22, 2016

Maybe 007-ified names for the above would be:

  • Q::Regex::Str (Q::Literal::Str)
  • Q::Regex::Identifier (Q::Identifier)
  • Q::Regex::Call (Q::Identifier) (I think it should reference, not contain, another regex)

And

  • Q::Regex::ZeroOrOne
  • Q::Regex::ZeroOrMore
  • Q::Regex::OneOrMore

Just suggestions, but I think these fit the implicit naming scheme we have.

@vendethiel
Copy link
Collaborator

Yes, that was more about structure than actual names :). Seems good to me.

@vendethiel
Copy link
Collaborator

Seeing as we're reaching some complexity level, should this be its own 007::Regex file?

@masak
Copy link
Owner Author

masak commented Oct 17, 2016

Hm, not as part of this issue, I'd say.

Usually I'd jump at suggestions to bring out more structure and separating into files, but right now the force keeping all the Q types in the same file is pretty strong.

@vendethiel
Copy link
Collaborator

Duely noted :-). I'll just be adding some fences (... not).

@vendethiel
Copy link
Collaborator

I got some stuff done: https://gist.github.com/vendethiel/4ce172f656bdec64c1d2b09857d382c6

Two questions:

  • Do I keep $runtime around to look up functions/variables?
  • How do I get Q:: types in that file?

@masak
Copy link
Owner Author

masak commented Jul 17, 2017

  • Do I keep $runtime around to look up functions/variables?

I presume so. A Val::Regex can end up holding references to variables that'll need to be looked up in the right lexical scope. (In this sense, regex values "close over their environment" just as function values do.)

<EXPR> is an uncomfortable exception, too. Where was EXPR defined? In 007's parser. I don't see why it should be defined among the normal builtins. Which seems to imply something unnatural is going on with the OUTER scoping of regex values; at least those in an is parsed trait. Ugh.

  • How do I get Q:: types in that file?

Sorry, which file are we talking about? 😄

@vendethiel
Copy link
Collaborator

Sorry, Val.pm

@masak
Copy link
Owner Author

masak commented Jul 18, 2017

Right now in Val.pm, there are two strategies for using Q:: types:

  • Just use them implicitly, not referring to them by type. (Val::Sub has a $.parameterlist, for example, but it isn't typed as Q::ParameterList.)
  • Refer to them using strings, like here.

If we need something more than that, I'd consider it a signal that something is off in the design and needs us to go back to the drawing-board a little.

@vendethiel
Copy link
Collaborator

Well, I have a for @.fragments { when Q::Regex::Group (e.g.). Maybe I really should convert these to Val::s...

@masak
Copy link
Owner Author

masak commented Jul 18, 2017

My design brain is missing, but — try it and see? My biggest objection for now is I don't see a bit use of such Val:: objects outside a regex. But not everything needs to be used a lot, I guess.

I seem to vaguely recall that nqp solves this by having a single regex node type that can take on all possible regex-fragment roles, like a chameleon. But don't quote me on that.

@vendethiel
Copy link
Collaborator

Made a lot of progress today.

I didn't decide yet on a dispatch mechanism. For now the code uses when .^name eq "Q::X::Y".

This regex parses and works: /[["a"+ "b"*] "1"?]+/.

I'll have lots of cleanup to do, and I need to writeup some kind of helper to test those (or maybe just test many regexes in a single 007 test).

@vendethiel
Copy link
Collaborator

I didn't yet implement identifier or call, I need to take a look at how .eval on an identifier works.

@masak
Copy link
Owner Author

masak commented Aug 6, 2017

Wohoo! Just want to say how nice it is to see progress happen on this! 🎉

Getting to where we can do is parsed will be a bit of a watershed moment for 007. We'll get to play around with macros that are a lot more syntactically interesting.

@vendethiel
Copy link
Collaborator

For reference: #239.

Will push alternations shortly.

@masak
Copy link
Owner Author

masak commented Nov 17, 2017

#239 has merged. We are almost at a point (being able to implement #163) where we can close this one.

What's missing concretely:

  • Being able to call <EXPR>. (Again, we can massively hard-code and fake this one.)
  • is parsed for macros.

I started experimenting with is parsed in b03ff6a and immediately realized that is parsed AST fragments don't have an unexpanded form — the closest thing they have is that part of the source text itself, which maybe should be wrapped in a dedicated Qtype.

@vendethiel
Copy link
Collaborator

It's also probably missing captures.

@masak
Copy link
Owner Author

masak commented Nov 17, 2017

I have a feeling the best way to proceed from here is "pull-based", that is, start from a use case such as #163, run it, find the most proximal reason it doesn't work, implement that, lather, rinse, repeat.

@masak
Copy link
Owner Author

masak commented Nov 17, 2017

Or we set the sights a little higher and just make a list of the features missing, and pick a sensible dependency order to tick them off.

To wit:

  • <EXPR>
  • captures
  • is parsed accepting a regex and feeding the (:thinking:) Qtrees from the operands and the captures into the parameters of the macro

I guess that's it.

@masak
Copy link
Owner Author

masak commented Nov 3, 2018

I hereby commit to championing this issue through to completion. I have this idea that I can prototype things in pure 007, and then bring them back in a nice way in the implementation.

@masak
Copy link
Owner Author

masak commented Apr 17, 2019

I'd like to write a small note about three "modes" of regex matching that I've noticed:

  1. "Does it match?" A boolean question, used in an if statement somewhere. Maybe the match produces a Match object, but a Sufficiently Smart compiler could optimize it away and just keep the code that answers the yes/no question.

  2. "Make the following replacements." (AKA Perl 6's .subst method.) Here the result is a string, not a boolean. Probably best built using a StringBuilder-like mechanism. Again, the naïve way to go about this would be to first generate the Match object and then use the information therein to generate the string. But a dedicated optimizer might be able to do better, basically specializing into something that produced the string on-the-fly, without bothering with the Match object.

  3. "Give me the structure corresponding to the parse of this string." (AKA grammars.) This is the general case. Here we actually expect a Match back... Or rather... We want the resulting AST/DOM structure. In Perl 6, you get the Match object, and then you call .ast/.made on it, but... I can't really see right now why calling .parse on a grammar wouldn't hand back the .ast immediately.

So, unexpectedly, invoking a regex on a bit of text seems to return a Match, but the three main use cases for regex matching all circumvent the Match and boolify/stringify/AST-ify it to get to what they really care about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants