x/text/collate: incorrect sorting of Slovak words #48061
Comments
Go string ordering is not intended to be the order things would sort in any language. It's just ordering the raw bytes of the UTF-8 encoding.
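To illustrate the byte-ordering point, here is a minimal sketch using only the standard library. The helper name `sortedBytewise` and the word "zebra" are illustrative and do not come from the issue.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedBytewise returns a copy of words sorted with Go's built-in string
// comparison, which compares the raw bytes of the UTF-8 encoding.
// (Hypothetical helper, for illustration only.)
func sortedBytewise(words []string) []string {
	out := append([]string(nil), words...)
	sort.Strings(out)
	return out
}

func main() {
	// 'ď' (U+010F) is encoded as the bytes 0xC4 0x8F, which compare greater
	// than every ASCII letter, so "ďateľ" lands after "zebra".
	fmt.Println(sortedBytewise([]string{"ďateľ", "zebra", "drevo"}))
	// prints: [drevo zebra ďateľ]
}
```

This is why plain string comparison cannot produce a dictionary order for Slovak, and why a collation-aware package is needed.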
Sorry, you are using that. I guess this is a bug in
Yes, it looks like it's
cc @mpvl
I tried to look into this, but didn't make it down the rabbit hole deep enough to make a fix. Here is what I discovered (buckle up, it's a long ride):
First I want to start with the fact that the alphabet seems to be defined correctly here. So that is not the problem.
The collate pkg uses this Unicode algorithm to collate words:
The letter
The letter
A more in-depth example
The sorting string for
The sorting string for
So the reason that Other letters
Result:
Next Steps
@ameowlia, thanks for the analysis! Unfortunately I have very little context on
I do have one question, though: is the element-splitting problem present in the Unicode TR10 algorithm itself, or does the Go implementation deviate from the algorithm? (If the problem is in the TR10 algorithm itself, the Go project can't unilaterally fix it...)
Hi @bcmills,
Above I said that the issue is because
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes.

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
This is not a 100% correct solution. (Still, it is quite good compared to other languages such as Java, C#, or Python.)
Of the languages I have tested, Raku and Go come closest. (Raku gets the accents right but fails with
digraphs.)
The following is the correct ordering of characters in the Slovak language:
a, á, ä, b, c, č, d, ď, dz, dž, e, é, f, g, h, ch, i, í, j, k, l, ĺ, ľ, m, n, ň, o, ó, ô, p, q, r, ŕ, s, š, t, ť, u, ú, v, w, x, y, ý, z, ž
Note that plain characters always precede the accented ones. So the word ďateľ should go after drevo, the sequence
kľak klam kĺb should be klam kĺb kľak, márny mat mäta should be mat márny mäta, pól pot pôst should be
pot pól pôst, and tŕň troska should be troska tŕň.
Words with the digraph ch are sorted correctly. (The dž digraphs are probably also fine; I am not 100% sure because
of the issue with ď.)
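The expected order described above can be sketched as a small standard-library comparator. This is not how x/text/collate or the Unicode Collation Algorithm works; it is a simplified stand-in that treats every letter in the alphabet listed above (including the digraphs dz, dž and ch) as a distinct letter and compares words position by position. The names `slovakAlphabet`, `sortKey`, `slovakLess` and `sortSlovak` are hypothetical, and the sketch assumes precomposed (NFC) input.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// slovakAlphabet is the letter order quoted in this issue; digraphs count as
// single letters.
var slovakAlphabet = strings.Split(
	"a á ä b c č d ď dz dž e é f g h ch i í j k l ĺ ľ m n ň o ó ô p q r ŕ s š t ť u ú v w x y ý z ž", " ")

// rank maps each letter (or digraph) to its position in slovakAlphabet.
var rank = func() map[string]int {
	m := make(map[string]int, len(slovakAlphabet))
	for i, letter := range slovakAlphabet {
		m[letter] = i
	}
	return m
}()

// sortKey converts a word into a sequence of alphabet positions, greedily
// matching two-rune digraphs before single letters.
func sortKey(word string) []int {
	runes := []rune(strings.ToLower(word))
	key := make([]int, 0, len(runes))
	for i := 0; i < len(runes); i++ {
		if i+1 < len(runes) {
			if r, ok := rank[string(runes[i:i+2])]; ok {
				key = append(key, r)
				i++ // skip the second rune of the digraph
				continue
			}
		}
		if r, ok := rank[string(runes[i])]; ok {
			key = append(key, r)
		} else {
			// Assumption: unknown characters sort after all known letters.
			key = append(key, len(slovakAlphabet)+int(runes[i]))
		}
	}
	return key
}

// slovakLess reports whether a sorts before b under the alphabet above.
func slovakLess(a, b string) bool {
	ka, kb := sortKey(a), sortKey(b)
	for i := 0; i < len(ka) && i < len(kb); i++ {
		if ka[i] != kb[i] {
			return ka[i] < kb[i]
		}
	}
	return len(ka) < len(kb)
}

// sortSlovak returns a sorted copy of words.
func sortSlovak(words []string) []string {
	out := append([]string(nil), words...)
	sort.SliceStable(out, func(i, j int) bool { return slovakLess(out[i], out[j]) })
	return out
}

func main() {
	words := []string{"ďateľ", "drevo", "kľak", "klam", "kĺb", "márny", "mat",
		"mäta", "pól", "pot", "pôst", "tŕň", "troska"}
	fmt.Println(sortSlovak(words))
	// prints: [drevo ďateľ klam kĺb kľak mat márny mäta pot pól pôst troska tŕň]
}
```

The sketch reproduces the orderings requested above, but it is deliberately naive: greedy digraph matching also treats a d followed by z across a morpheme boundary as the letter dz, and it has no notion of secondary or tertiary collation levels.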
So the correct ordering should be: