New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: index/suffixarray: expose qsufsort #29080

Closed
st326 opened this Issue Dec 3, 2018 · 7 comments

Comments

Projects
None yet
6 participants
@st326
Copy link

st326 commented Dec 3, 2018

What version of Go are you using (go version)?

$ go version
go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/usr/local/google/home/sjt/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/usr/local/google/home/sjt/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/google-golang"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/google-golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="clang"
CXX="third_party/crosstool/v18/stable/toolchain/bin/x86_64-grtev4-linux-gnu-llvm_driver_is_not_gcc"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build503570168=/tmp/go-build -gno-record-gcc-switches"

What did you do?

This is not an error, rather a missing feature in the src/index/suffixarray library.

What did you expect to see?

The library is written assuming that the qsufsort embedded within it will only ever be used as a suffix array for lookup purposes. Since the direct output of qsufsort is extremely useful, for example in performing a BWT, I'd expect qsufsort() in qsufsort.go to be exported, e.g., as QSufSort().

What did you see instead?

qsufsort() is not exported, making it difficult to get at the suffix array data structure directly. Serializing and then deserializing it via the (supported) mechanism would be unacceptably slow for a large suffix array, so is a nonstarter as an alternative to modifying qsufsort.go.

Note: I will submit a patch shortly.

@agnivade

This comment has been minimized.

Copy link
Member

agnivade commented Dec 3, 2018

Maybe this is better suited for an external package rather than in standard library ? Please see https://golang.org/doc/faq#x_in_std

@st326

This comment has been minimized.

Copy link

st326 commented Dec 3, 2018

Possibly, but qsufsort.go is already in the library, and forking it just to export one symbol doesn't feel right somehow.

Patch here: https://go-review.googlesource.com/c/go/+/151982

@agnivade agnivade changed the title Allowing direct access to qsufsort proposal: index/suffixarray: expose qsufsort Dec 3, 2018

@gopherbot gopherbot added this to the Proposal milestone Dec 3, 2018

@gopherbot gopherbot added the Proposal label Dec 3, 2018

@dsnet

This comment has been minimized.

Copy link
Member

dsnet commented Dec 3, 2018

Suffix arrays are a relatively simple data structure with complex algorithms to construct them (see SACA). Qsufsort is a relatively simply algorithm, but is not state of the art. Exposing QSufSort is an implementation detail of index/suffixarray that I do not believe should be user visible. I believe this should be outside the standard library.

@griesemer

This comment has been minimized.

Copy link
Contributor

griesemer commented Dec 3, 2018

I don't have an objection exposing it; the signature and semantics are very well-defined and clean: Given a byte array, return an array of indices of the sorted substrings. It seems to me independent of algorithm for this specific function.

But I'd use a different name when exporting. Perhaps just Sort as in suffixarray.Sort.

@dsnet

This comment has been minimized.

Copy link
Member

dsnet commented Dec 3, 2018

I'm still not a fan of exposing this, but if we were I'd suggest a signature like:

func Sort(b []byte, sa []int)

rather than the current internal signature. This gives us flexibility to reuse memory provided by the user.

Also, state-of-the-art SACAs are several times faster than the current suffixarray package. IIRC, I think my own implementation of SAIS was 30 times faster than QSufSort. Do we want for the Go authors to be on the hook for optimizing the suffix array algorithm more? SAIS was not trivial for me to learn.

@griesemer

This comment has been minimized.

Copy link
Contributor

griesemer commented Dec 3, 2018

@dsnet +1 on the signature suggestion.

Regarding the Go authors being on the hook: We currently are to some extend as this is already in the std library. It doesn't mean we have to provide the best implementation (it's already not the case).

What we should decide is whether this package should be in the std library in the first place, or whether it should be deprecated in favor of external high-performance implementations. Hence deference to the proposal committee for now.

@andybons

This comment has been minimized.

Copy link
Member

andybons commented Dec 5, 2018

Per discussion with @golang/proposal-review, our plan is to move this outside of the standard library eventually, so we’re not going to make any changes.

In the meantime, we encourage you to fork the code into its own repo and experiment as you see fit.

@andybons andybons closed this Dec 5, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment