-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Description
What version of Go are you using (go version)?
$ go version
go version go1.9.1 darwin/amd64Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/akutz/Projects/go"
GORACE=""
GOROOT="/Users/akutz/.go/1.9.1"
GOTOOLDIR="/Users/akutz/.go/1.9.1/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/48/52fjq9r10zx3gnm7j57th0500000gn/T/go-build196061406=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"CompareFold
I recently wrote a blog post about the overwhelming performance increase that comes with sorting a slice of strings in a case-insensitive manner using an implementation of CompareFold. For example:
Ultimately the implementation of CompareFold I used is really just a modification of the existing strings.EqualFold:
// CompareFold returns an integer comparing two strings lexicographically
// using Unicode case-folding.
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func CompareFold(s, t string) int {
for s != "" && t != "" {
// Extract first rune from each string.
var sr, tr rune
if s[0] < utf8.RuneSelf {
sr, s = rune(s[0]), s[1:]
} else {
r, size := utf8.DecodeRuneInString(s)
sr, s = r, s[size:]
}
if t[0] < utf8.RuneSelf {
tr, t = rune(t[0]), t[1:]
} else {
r, size := utf8.DecodeRuneInString(t)
tr, t = r, t[size:]
}
// If they match, keep going; if not, return false.
// Easy case.
if tr == sr {
continue
}
// Make sr < tr to simplify what follows.
result := 1
if tr < sr {
result = -result
tr, sr = sr, tr
}
// Fast check for ASCII.
if tr < utf8.RuneSelf && 'A' <= sr && sr <= 'Z' {
// ASCII, and sr is upper case. tr must be lower case.
srr := sr + 'a' - 'A'
if tr == srr {
continue
}
if tr < srr {
return result
}
if tr > srr {
return -result
}
}
// General case. SimpleFold(x) returns the next equivalent rune > x
// or wraps around to smaller values.
r := unicode.SimpleFold(sr)
for r != sr && r < tr {
r = unicode.SimpleFold(r)
}
if r == tr {
continue
}
if tr < r {
return result
}
if tr > r {
return -result
}
}
// One string is empty. Are both?
if s == "" && t != "" {
return -1
}
if s != "" && t == "" {
return 1
}
return 0
}However, because writing CompareFold from scratch (even basing it off of an existing function) isn't necessarily a simple task, I'd like to propose its inclusion in the Go standard library strings package. To that end, the implementation of strings.EqualFold could then be replaced with the following:
// EqualFold reports whether s and t, interpreted as UTF-8 strings,
// are equal under Unicode case-folding.
func EqualFold(s, t string) bool {
return CompareFold(s, t) == 0
}Sorting slices of strings in a case-insensitive manner is not an obscure use case. Therefore I'd posit there is great value in providing an easy-to-use implementation of a comparison function that both uses Unicode case-folding and returns more than a binary result relating to equality.
Please note that akutz/sortfold has the full implementation of CompareFold as well as Go unit tests, examples, and benchmarks.
