x/text: support API for Unicode word breaking and word extraction (Annex #29) #17256

Open
nightlyone opened this issue Sep 27, 2016 · 0 comments
@nightlyone nightlyone commented Sep 27, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.7 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/ioe/sources/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/user/1000/go-build353744209=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

I am trying to split text at word boundaries as described at http://unicode.org/reports/tr29/#Word_Boundaries, and to extract words from strings as described in the same document.

What did you expect to see?

Given the sentence "The quick (“brown”) fox can’t jump 32.3 feet, right?"

  • detecting word boundaries at all places marked with "|"
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
  • support word extraction to a string slice with the following content
words := []string{"The", "quick", "brown", "fox", "can’t", "jump", "32.3", "feet", "right"}

What did you see instead?

That depends on the API used. For example, strings.Fields (see https://play.golang.org/p/dhJtlR-b3w) returns:

[]string{"The", "quick", "(“brown”)", "fox", "can’t", "jump", "32.3", "feet,", "right?"}

Note: Proper test vectors are here: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.txt

An existing implementation, which uses some Ruby magic and a state machine generated by Ragel, can be found here: github.com/blevesearch/segment
