Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: index/suffixarray: added functionality via longest common prefix array #36450

Open
petar-dambovaliev opened this issue Jan 8, 2020 · 7 comments

Comments

@petar-dambovaliev
Copy link
Contributor

@petar-dambovaliev petar-dambovaliev commented Jan 8, 2020

What version of Go are you using (go version)?

$ go version
go version go1.13.1 linux/amd64

I have added an LCP array to the suffix array to provide the following functionality

func (x *Index) DistinctSubsCount() int
func (x *Index) DistinctSubs() [][]byte
func (x *Index) LongestRepeatedSubs() [][]byte

The LCP array is being initialized by once.Do() inside these methods, so existing users that don't need the new functionality won't have to deal with the overhead.

If this is something of interest, i will create a PR.
Cheers.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 8, 2020

CC @griesemer

I suspect that we're going to consider the index/suffixarray package as complete. It might be better to provide your changes as a library available for anyone who wants it. https://golang.org/doc/faq#x_in_std

@toothrot toothrot added this to the Backlog milestone Jan 8, 2020
@griesemer
Copy link
Contributor

@griesemer griesemer commented Jan 8, 2020

Agreeing with @ianlancetaylor : It is not that we're not interested in having more functionality, the problem is that by adding more functionality to the std library we're adding maintenance burden (to the Go team) for a library that is not super-widely used but of significant complexity.

Marking NeedsDecision for input from the proposal committee.

@odeke-em
Copy link
Member

@odeke-em odeke-em commented Mar 29, 2020

@griesemer should we perhaps mark this as a proposal so that we can gain visibility from the proposal review committee? I ask because proposal have been reviewed periodically but no response on this one. Thank you.

@griesemer griesemer changed the title index/suffixarray: added functionality via longest common prefix array Proposal: index/suffixarray: added functionality via longest common prefix array Mar 31, 2020
@griesemer griesemer added the Proposal label Mar 31, 2020
@griesemer
Copy link
Contributor

@griesemer griesemer commented Mar 31, 2020

Marked as Proposal for visibility, per @odeke-em 's suggestion.

@rsc
Copy link
Contributor

@rsc rsc commented Apr 1, 2020

@petar-dambovaliev, I've been doing some index/suffixarray uses recently (including the recent optimizations) and it's true that lots of algorithms start with "compute the suffix array and the LCP array". If the code were not much more to maintain, it might be worth having in the package proper instead of delegating to third-party packages. I'm less certain about the API, but we can figure that out.

You said you could make a PR. How much code is it? I'm just curious what level of complexity we're talking about taking on. Thanks.

@rsc rsc changed the title Proposal: index/suffixarray: added functionality via longest common prefix array proposal: index/suffixarray: added functionality via longest common prefix array Apr 1, 2020
@rsc rsc added this to Active in Proposals Apr 1, 2020
@odeke-em
Copy link
Member

@odeke-em odeke-em commented Apr 1, 2020

@petar-dambovaliev
Copy link
Contributor Author

@petar-dambovaliev petar-dambovaliev commented Apr 2, 2020

@petar-dambovaliev, I've been doing some index/suffixarray uses recently (including the recent optimizations) and it's true that lots of algorithms start with "compute the suffix array and the LCP array". If the code were not much more to maintain, it might be worth having in the package proper instead of delegating to third-party packages. I'm less certain about the API, but we can figure that out.

You said you could make a PR. How much code is it? I'm just curious what level of complexity we're talking about taking on. Thanks.

I am sure that when reviewed, you will have good ideas to improve on it.
At the moment it is 264 lines, including some tests and the generated code for 32/64 int slices.
The actual implementation is less than a 100 lines.

@rsc rsc modified the milestones: Backlog, Proposal Apr 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants
You can’t perform that action at this time.