proposal: regexp: add "parse until delimiter" operation #44254

Open
aclements opened this issue Feb 13, 2021 · 3 comments

@aclements aclements commented Feb 13, 2021

Regular expressions are often embedded in other languages, and the current regexp package makes it difficult to correctly parse such regexps. Common examples of such embedding include awk, Perl, and JavaScript, all of which have a /regexp/ expression syntax. In Go, this appears in the testing package's "-test.run" flag, which is a sequence of /-separated regexps; in benchstat v2's filter syntax; and in at least one other place @rsc mentioned that's now slipping my mind.

In general, this is difficult to implement outside regexp itself because the delimiter may appear nested in the regexp. For example, in the testing package, the run expression a[/]b/c matches subtest c of top-level tests matching a[/]b. The first slash is not a separator because it does not appear at the top level of the regexp. The testing package implements a simple, ad hoc parser for this (splitRegexp) but it doesn't get every corner case.
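To illustrate the problem, here is a minimal sketch of what happens when the run expression is split on every slash without regard to regexp structure. The helper name naiveSplit is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSplit is a hypothetical helper that splits a run expression on
// every "/" without looking at the structure of the regexp.
func naiveSplit(expr string) []string {
	return strings.Split(expr, "/")
}

func main() {
	// The "/" inside the character class is wrongly treated as a
	// separator, and the first piece, "a[", is not a valid regexp.
	fmt.Printf("%q\n", naiveSplit("a[/]b/c")) // prints ["a[" "]b" "c"]
}
```

The intended split is a[/]b followed by c, which requires knowing that the first slash sits inside a character class.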

Since this is now a pattern, the regexp package (or perhaps regexp/syntax) should itself implement a "parse until delimiter" function, which would make it easy to parse regular expressions embedded in a larger syntax.

To make a concrete proposal, I propose we add the following function to regexp/syntax:

// ParseUntil parses a regular expression from the beginning of str
// until the string delim appears at the top level of the expression.
// It returns the regular expression prefix of str and the remainder of str.
// If successful, rest will always begin with delim.
// If delim does not appear at the top level of str, it returns str, "", ErrNoDelim.
func ParseUntil(str, delim string) (expr, rest string, err error)

I propose this should return the split input string, rather than the parsed regexp, so it can be composed with any other regexp parsing entry point (e.g., regexp/syntax.Parse or regexp.Compile).
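As a rough sketch of how the proposed function could behave and compose with regexp.Compile, the following uses a deliberately simplified scanner. The names parseUntil and errNoDelim are stand-ins, not part of any package, and the scanner only tracks backslash escapes, [...] classes, and (...) groups. It assumes a non-empty delim and does not handle \Q...\E quoting, a literal ] at the start of a class, or nested forms such as [[:alpha:]], which a real implementation inside regexp/syntax would need to get right.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// errNoDelim is a hypothetical stand-in for the proposed ErrNoDelim.
var errNoDelim = errors.New("delimiter not found at top level")

// parseUntil is a simplified sketch of the proposed ParseUntil.
// It tracks only backslash escapes, [...] classes, and (...) groups.
// Assumption: delim is non-empty.
func parseUntil(str, delim string) (expr, rest string, err error) {
	inClass := false // inside a [...] character class
	depth := 0       // nesting depth of (...) groups
	for i := 0; i < len(str); i++ {
		// Only a delimiter at the top level ends the expression.
		if !inClass && depth == 0 && strings.HasPrefix(str[i:], delim) {
			return str[:i], str[i:], nil
		}
		switch c := str[i]; {
		case c == '\\':
			i++ // skip the escaped byte
		case inClass:
			if c == ']' {
				inClass = false
			}
		case c == '[':
			inClass = true
		case c == '(':
			depth++
		case c == ')' && depth > 0:
			depth--
		}
	}
	return str, "", errNoDelim
}

func main() {
	expr, rest, err := parseUntil("a[/]b/c", "/")
	if err != nil {
		fmt.Println(err)
		return
	}
	// The returned prefix is a plain string, so any entry point works.
	re, err := regexp.Compile(expr)
	if err != nil {
		fmt.Println(err)
		return
	}
	// prints expr="a[/]b" rest="/c" match=true
	fmt.Printf("expr=%q rest=%q match=%v\n", expr, rest, re.MatchString("a/b"))
}
```

Returning strings keeps the splitter independent of how the caller compiles the result, which is the composability argument made above.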

I don't think this operation needs to take Flags, but I'm not positive.

/cc @rsc

@gopherbot gopherbot added this to the Proposal milestone Feb 13, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Feb 14, 2021

@jfesler jfesler commented Feb 14, 2021

I would use this feature for config files taking user input. It would be great for my end users. Currently I force them to cope with single/double quoting rules, which is not good when they need to match on quotes. By allowing / as a delimiter, with a regexp-aware parser looking for it intelligently, my users would be able to use more familiar regexes directly without figuring out what to escape or how many backslashes to escape with.

Instead of “remainder of string” I would prefer returning bytes consumed.

@aclements aclements commented Feb 15, 2021

> Instead of “remainder of string” I would prefer returning bytes consumed.

I'm fine either way, but I'm curious what your rationale for preferring bytes consumed is. It seems like one would always take that count and just slice the string to get the remainder.

@mpx mpx commented Feb 16, 2021

I've wanted something like this in the past to support handling s/REGEXP/REPLACEMENT/. As pointed out, that's impractical without parsing the regexp itself.

regexp.Compile and regexp.CompilePOSIX parse slightly differently, which might affect composability if parse flags aren't supported. I haven't thought about this too hard yet; perhaps it isn't necessary.
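As a small illustration of that difference, the sketch below checks one expression against both existing entry points. The helper name compiles is made up for this example. Perl-style escapes such as \d are accepted by regexp.Compile but rejected by regexp.CompilePOSIX, so the two parsers do not always agree on what a given byte sequence means.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper that reports whether expr is
// accepted by regexp.Compile and by regexp.CompilePOSIX.
func compiles(expr string) (perl, posix bool) {
	_, errPerl := regexp.Compile(expr)
	_, errPOSIX := regexp.CompilePOSIX(expr)
	return errPerl == nil, errPOSIX == nil
}

func main() {
	perl, posix := compiles(`\d+`)
	fmt.Println(perl, posix) // prints true false
}
```

Whether such differences can ever move the position of a top-level delimiter is the open question here; this only shows that the accepted syntax differs.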

Given how regexp.Compile and regexp.CompilePOSIX are implemented, it might be better to make the parse/compile steps available to developers:

package syntax
func ParseUntil(s string, flags Flags, delim string) (_ *Regexp, rest string, err error)

package regexp
func CompileSyntax(re *syntax.Regexp, longest bool) (*Regexp, error)

This has some advantages:

  • Avoids duplicating parsing work
  • Should support updating flags after parsing (e.g., supporting s/regexp/replacement/flags).