basic latin characters sort order is rather strange #1

0xadri · 2014-09-29T09:32:23Z

See how Basic Latin characters are sorted when using your natural-compare-lite JavaScript library: http://jsbin.com/vigegoguvixe/1/edit?js,console

Special characters appear in four separate places:

before the numbers
between the '9' and the uppercase 'A'
between the uppercase 'Z' and the lowercase 'a'
after lowercase 'z'

I would have expected all special characters to be grouped together in one place, except perhaps for the space character (which would always come first). That is, either all before numbers, or all between numbers and letters (with lowercase and uppercase kept together, one after another), or all after letters.
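The four places listed above match where punctuation sits in plain ASCII code-point order. Here is a minimal sketch that makes this visible. It does not call natural-compare-lite; it only walks the printable ASCII range, and the helper name `classify` is made up for this illustration.

```javascript
// Collect the printable ASCII characters (code points 32..126) in code-point order.
const printable = [];
for (let code = 32; code <= 126; code++) {
  printable.push(String.fromCharCode(code));
}

// Classify a single character as digit, upper, lower, or special.
function classify(ch) {
  if (ch >= "0" && ch <= "9") return "digit";
  if (ch >= "A" && ch <= "Z") return "upper";
  if (ch >= "a" && ch <= "z") return "lower";
  return "special";
}

// Collapse consecutive characters of the same class into runs.
const runs = [];
for (const ch of printable) {
  const kind = classify(ch);
  const last = runs[runs.length - 1];
  if (last && last.kind === kind) last.chars += ch;
  else runs.push({ kind, chars: ch });
}

// Keep only the runs of special characters: there are four of them,
// before the digits, between '9' and 'A', between 'Z' and 'a', and after 'z'.
const specialRuns = runs.filter((r) => r.kind === "special").map((r) => r.chars);
console.log(specialRuns.length); // 4
console.log(specialRuns);
```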

Read more: basic Latin characters http://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin

Note that the built-in localeCompare() method seems to do a much better job of sorting Basic Latin characters. See the live demo at http://jsbin.com/beboroyifomu/1/edit?js,console . The only problem with using localeCompare() instead of your implementation would be its inconsistency across browsers. See https://code.google.com/p/v8/issues/detail?id=459 . There is more on this topic at http://stackoverflow.com/questions/51165/how-do-you-do-string-comparison-in-javascript
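For anyone who wants to try this without opening the demo, here is a small sketch that sorts the printable ASCII range with localeCompare(). The resulting order depends on the JavaScript engine and its locale data, so no expected output is shown; the variable names are only for this illustration.

```javascript
// Build the printable ASCII characters (code points 32..126).
const asciiChars = [];
for (let code = 32; code <= 126; code++) {
  asciiChars.push(String.fromCharCode(code));
}

// Sort a copy using the built-in locale-aware comparison.
// The order can differ between engines and versions.
const byLocale = asciiChars.slice().sort((a, b) => a.localeCompare(b));

console.log(byLocale.join(""));
```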


lauriro · 2014-09-30T22:27:33Z

Thank you for reporting, I had never thought about special characters.

I experimented a little with rearranging the ASCII char codes, until a better idea comes along.

0xadri · 2014-10-02T08:37:26Z

Sorting strings naturally is a very challenging task. Its difficulty is typically underestimated, so you are not the first to have trouble solving it ;)

"I had never thought about special characters." Yes, in the end there are so many possible characters in Unicode that you may want to sort only the ones you know, and then put all the rest after the known (expected) characters.

I believe English mostly uses the Unicode "Basic Latin" characters, and maybe also some of the "Latin-1 Supplement". I doubt it would use any other characters, but I could not find anything stating which characters are used for English. It seems like a rather complex topic.
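A rough sketch of that idea, assuming "known" means printable ASCII: space first, then the remaining punctuation as one group, then digits, then letters, and everything else afterwards in code-point order. The names `charRank` and `compareGrouped` and the chosen group order are assumptions for illustration, not what natural-compare-lite does, and numeric chunks are not handled here.

```javascript
// Return [group, key] for one character.
// Groups: 0 space, 1 ASCII punctuation, 2 digits, 3 letters, 4 unknown.
function charRank(ch) {
  const code = ch.codePointAt(0);
  if (code < 32 || code > 126) return [4, code]; // unknown: after all known characters
  if (ch === " ") return [0, 0];
  if (ch >= "0" && ch <= "9") return [2, code];
  const lower = ch.toLowerCase();
  if (lower >= "a" && lower <= "z") {
    // Letters are ordered by letter, uppercase just before lowercase (A a B b ...).
    return [3, lower.charCodeAt(0) * 2 + (ch === lower ? 1 : 0)];
  }
  return [1, code]; // remaining ASCII punctuation, grouped together
}

// Compare two strings character by character using charRank.
function compareGrouped(a, b) {
  const as = Array.from(a);
  const bs = Array.from(b);
  const shared = Math.min(as.length, bs.length);
  for (let i = 0; i < shared; i++) {
    const [groupA, keyA] = charRank(as[i]);
    const [groupB, keyB] = charRank(bs[i]);
    if (groupA !== groupB) return groupA - groupB;
    if (keyA !== keyB) return keyA - keyB;
  }
  return as.length - bs.length; // a prefix sorts before the longer string
}

const sample = ["b", "{", "\u00e9", "A", "9", " ", "a", "!", "Z", "["];
console.log(sample.slice().sort(compareGrouped));
// → [" ", "!", "[", "{", "9", "A", "a", "b", "Z", "é"]
```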

For more details about the character sets, here is a list of the "Unicode 7.0 Character Code Charts" : http://www.unicode.org/charts/

Also, note that there is a "Unicode Collation Algorithm specification", "which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard". So if you want to get your implementation right, you might want to look into this: http://www.unicode.org/reports/tr10/

lauriro · 2014-10-02T10:26:00Z

For a proper implementation it would be wiser to keep a transformed sort_key within each object rather than doing complex processing on every comparison. That would be another library. Then, of course, you could also pad numbers with lots of zeros and just use the native sort.

The main goal of this project is to be as lightweight as possible and provide a good enough result. Seeing 10 before 2 is the main issue with sort.
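A minimal sketch of that approach, assuming numbers never exceed a fixed digit count. `PAD_WIDTH`, `toSortKey` and `sortWithKeys` are hypothetical names used only here; nothing like this is part of this library.

```javascript
// Assumed upper bound on the number of digits in any number within a string.
const PAD_WIDTH = 10;

// Build a sort key by left-padding every run of digits with zeros.
function toSortKey(text) {
  return text.replace(/\d+/g, (digits) => digits.padStart(PAD_WIDTH, "0"));
}

// Compute each key once, sort by key with plain string comparison, then unwrap.
function sortWithKeys(items) {
  return items
    .map((value) => ({ value, key: toSortKey(value) }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.value);
}

console.log(sortWithKeys(["file10", "file2", "file1"]));
// → ["file1", "file2", "file10"]
```

The key is computed once per item instead of on every comparison, at the cost of extra memory and the fixed-width assumption.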

lauriro · 2014-10-02T10:33:04Z

You got me thinking that a configurable alphabet could be useful :)

For example, in Estonian the desired order would be ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜXYabdefghijklmnoprsšzžtuvõäöüxy

0xadri · 2014-10-02T14:35:38Z

Or maybe: AaBbDdEeFf....
Or maybe: aAbBdDeEfF...
Or maybe: AaÄäBbDdEeFf....
Or maybe: aAäÄbBdDeEfF...

Yes, a "configurable alphabet could be useful" ;)
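To make the idea concrete, here is a sketch of a comparator built from an alphabet string. `makeAlphabetComparator` is a hypothetical helper, not an existing option of natural-compare-lite. Characters missing from the alphabet are assumed to sort after all listed ones, in code-point order, and numeric chunks are ignored to keep the example short.

```javascript
// Build a comparator that orders characters by their position in `alphabet`.
function makeAlphabetComparator(alphabet) {
  const chars = Array.from(alphabet.normalize("NFC"));
  const rank = new Map();
  chars.forEach((ch, index) => {
    if (!rank.has(ch)) rank.set(ch, index);
  });
  // Unlisted characters get a rank above every listed one.
  const rankOf = (ch) =>
    rank.has(ch) ? rank.get(ch) : chars.length + ch.codePointAt(0);

  return function compare(a, b) {
    const as = Array.from(a.normalize("NFC"));
    const bs = Array.from(b.normalize("NFC"));
    const shared = Math.min(as.length, bs.length);
    for (let i = 0; i < shared; i++) {
      const diff = rankOf(as[i]) - rankOf(bs[i]);
      if (diff !== 0) return diff;
    }
    return as.length - bs.length;
  };
}

// The Estonian order quoted earlier in this thread.
const estonian = makeAlphabetComparator(
  "ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜXYabdefghijklmnoprsšzžtuvõäöüxy"
);
console.log(["Tallinn", "Zoo", "Šokolaad", "Saar"].sort(estonian));
// → ["Saar", "Šokolaad", "Zoo", "Tallinn"]

// A shortened interleaved alphabet, only the letters spelled out above.
const interleaved = makeAlphabetComparator("AaBbDdEeFf");
console.log(["B", "a"].sort(interleaved)); // → ["a", "B"]
console.log(["B", "a"].sort(estonian)); // → ["B", "a"]
```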

lauriro closed this as completed in ce6f646 Sep 30, 2014

0xadri mentioned this issue Oct 10, 2014

Sorting incorrect when there is a space javve/natural-sort#7

Open
