Skip to content

proposal: strings: add CompareFold function #22425

@akutz

Description

@akutz

What version of Go are you using (go version)?

$ go version
go version go1.9.1 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/akutz/Projects/go"
GORACE=""
GOROOT="/Users/akutz/.go/1.9.1"
GOTOOLDIR="/Users/akutz/.go/1.9.1/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/48/52fjq9r10zx3gnm7j57th0500000gn/T/go-build196061406=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

CompareFold

I recently wrote a blog post about the overwhelming performance increase that comes with sorting a slice of strings in a case-insensitive manner using an implementation of CompareFold. For example:

542000 Word Benchmark

Ultimately the implementation of CompareFold I used is really just a modification of the existing strings.EqualFold:

// CompareFold returns an integer comparing two strings lexicographically
// using Unicode case-folding.
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func CompareFold(s, t string) int {
	for s != "" && t != "" {

		// Extract first rune from each string.
		var sr, tr rune
		if s[0] < utf8.RuneSelf {
			sr, s = rune(s[0]), s[1:]
		} else {
			r, size := utf8.DecodeRuneInString(s)
			sr, s = r, s[size:]
		}
		if t[0] < utf8.RuneSelf {
			tr, t = rune(t[0]), t[1:]
		} else {
			r, size := utf8.DecodeRuneInString(t)
			tr, t = r, t[size:]
		}

		// If they match, keep going; if not, return false.

		// Easy case.
		if tr == sr {
			continue
		}

		// Make sr < tr to simplify what follows.
		result := 1
		if tr < sr {
			result = -result
			tr, sr = sr, tr
		}
		// Fast check for ASCII.
		if tr < utf8.RuneSelf && 'A' <= sr && sr <= 'Z' {
			// ASCII, and sr is upper case.  tr must be lower case.
			srr := sr + 'a' - 'A'
			if tr == srr {
				continue
			}
			if tr < srr {
				return result
			}
			if tr > srr {
				return -result
			}
		}

		// General case. SimpleFold(x) returns the next equivalent rune > x
		// or wraps around to smaller values.
		r := unicode.SimpleFold(sr)
		for r != sr && r < tr {
			r = unicode.SimpleFold(r)
		}
		if r == tr {
			continue
		}
		if tr < r {
			return result
		}
		if tr > r {
			return -result
		}
	}

	// One string is empty. Are both?
	if s == "" && t != "" {
		return -1
	}
	if s != "" && t == "" {
		return 1
	}
	return 0
}

However, because writing CompareFold from scratch (even basing it off of an existing function) isn't necessarily a simple task, I'd like to propose its inclusion in the Go standard library strings package. To that end, the implementation of strings.EqualFold could then be replaced with the following:

// EqualFold reports whether s and t, interpreted as UTF-8 strings,
// are equal under Unicode case-folding.
func EqualFold(s, t string) bool {
	return CompareFold(s, t) == 0
}

Sorting slices of strings in a case-insensitive manner is not an obscure use case. Therefore I'd posit there is great value in providing an easy-to-use implementation of a comparison function that both uses Unicode case-folding and returns more than a binary result relating to equality.

Please note that akutz/sortfold has the full implementation of CompareFold as well as Go unit tests, examples, and benchmarks.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions