Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: regexp: add (*Regexp).SubexpIndex #32420

Open
ajwerner opened this issue Jun 4, 2019 · 8 comments
Open

proposal: regexp: add (*Regexp).SubexpIndex #32420

ajwerner opened this issue Jun 4, 2019 · 8 comments

Comments

@ajwerner
Copy link

@ajwerner ajwerner commented Jun 4, 2019

Regular expressions are handy in a variety of simple string parsing situations. Using named capture groups is a good way to document the structure of a regular expression and to eliminate bugs due to the introduction of additional capture groups. Mapping the name to the capture group index is currently quite heavy-weight. I regularly find myself writing code like namedSubexp in the below toy example when using regular expressions to parse strings. What's worse is that I often don't write this code and instead just rely on a brittle hard-coded index.

The rejected proposal in #24208 argued in favour of a much more heavyweight interface change that also does not appeal to me. github.com/ghemawat/re.Scan uses slices and reflection and thus is too inefficient for anything performance critical. I understand that the bar for changes here is high Furthermore, I do hear an argument in favour of using an external library as the regexp package is already quite large but I think it's exactly in those cases where I'd avoid writing this helper function that I'd also avoid pulling in a new dependency. The proposal here is compact, useful and encourages more maintainable code so I figured I'd float it and see if the it resonates.

This issue proposes a new method on *regexp.Regexp called NamedSubexp which takes a string and returns an integer. The open-ended portion of this proposal is how to deal with the case where no such capture group exists. I'm quite open to that integer being -1 if no such named capture group exists (as in the strings package) or to augmenting the method signature to additionally return a boolean. My inclination for the panic comes from a tendency to use this pattern to define global vars and it's probably bad practice to search the list of strings at runtime.

package main

import (
	"fmt"
	"regexp"
)

var (
	re     = regexp.MustCompile("foo (?P<bar>[0-9]+) (?P<baz>0x[0-9A-Fa-f]+)")
	barIdx = namedSubexp(re, "bar")
	bazIdx = namedSubexp(re, "baz")
)

func namedSubexp(re *regexp.Regexp, name string) int {
	for i, exp := range re.SubexpNames() {
		if exp == name {
			return i
		}
	}
        panic(fmt.Errorf("%v does not have a capture group named %s", re, name))
}

func main() {
	matches := re.FindStringSubmatch("foo 12 0xA123")
        // if matches == nil { ... }
	bar, baz := matches[barIdx], matches[bazIdx]
	fmt.Println(baz, bar)
}

https://play.golang.org/p/WT8dFyp1TCE

@ianlancetaylor ianlancetaylor changed the title regexp: add method to look up named capture group index by name proposal: regexp: add method to look up named capture group index by name Jun 4, 2019
@gopherbot gopherbot added this to the Proposal milestone Jun 4, 2019
@gopherbot gopherbot added the Proposal label Jun 4, 2019
@ajwerner

This comment has been minimized.

Copy link
Author

@ajwerner ajwerner commented Jun 4, 2019

Perhaps a more palatable solution would be to augment the strings library instead of adding another method to Regexp. The following function or something like it in the strings package would make me similarly happy:

package strings

// SliceIndex returns the index of the first instance of needle in haystack, or -1 if needle is not present in haystack.
// Slice index performs a linear search of haystack for needle.
func SliceIndex(haystack []string, needle string) int {
   for i, s := range haystack {
      if s == needle {
         return i
      }
   }
   return -1
}

Of course this then begs the question about the need for at least (Last)?SliceIndex(Func)? functions for symmetry with the regular string functions. It's also not obvious that this slice searching belongs in the strings package though I can't think of a more suitable package.

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 25, 2019

What do people think of this?

// SubexpIndex returns the index of the first subexpression with the given name,
// or else -1 if there is no subexpression with that name.
//
// Note that multiple subexpressions can be written using the same name, as in
// (?P<bob>a+)(?P<bob>b+), which declares two subexpressions named "bob".
// In this case SubexpIndex returns the index of the leftmost such subexpression
// in the regular expression.
func (*Regexp) SubexpIndex(name string) int
@robpike

This comment has been minimized.

Copy link
Contributor

@robpike robpike commented Jul 1, 2019

Leftmost match or leftmost in the expression?

(?Pa*)|(?Pb+) matching b

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jul 2, 2019

Leftmost in the expression, because the method is on Regexp. There's no input text involved when you ask the question. (Hopefully people will just not name the same subexpression twice but there has to be a clear rule.)

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jul 16, 2019

Based on mild happiness and no negativity, accepting for API in #32420 (comment).

@rsc rsc changed the title proposal: regexp: add method to look up named capture group index by name proposal: regexp: add (*Regexp).SubexpIndex Jul 16, 2019
@rsc rsc modified the milestones: Proposal, Go1.14 Jul 16, 2019
@ajwerner

This comment has been minimized.

Copy link
Author

@ajwerner ajwerner commented Jul 17, 2019

Sounds good to me. I'm hopeful this will encourage people to name subexpressions. I'm especially on board if the lookup will be O(1). I worry a bit that a linear scan might be something of a footgun for somebody if they call it in a hot loop. The thought occurred to me that a detailed example on Regexp.SubexpNames which demonstrates looking up indexes at init time with a small function one could just copy out of the godoc might be good enough. Having this method is more likely to change behavior and thus is better. Thanks for taking this up!

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jul 29, 2019

Change https://golang.org/cl/187919 mentions this issue: regexp: add (*Regexp).SubexpIndex

@sylvinus

This comment has been minimized.

Copy link
Contributor

@sylvinus sylvinus commented Jul 29, 2019

I just pushed a simple implementation of this, with tests and an example.

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.