basic latin characters sort order is rather strange #1
Comments
Thank you for reporting; I had never thought about special characters. I played around a bit with rearranging ASCII char codes until a better idea comes along.
Sorting strings naturally is a very challenging task. Its difficulty is typically underestimated, so you're not the first one to have trouble solving this ;) "I had never thought about special characters." Yes, in the end there are so many possible characters in Unicode that you may want to sort "only" the ones you "know" and then put all the rest after the "known/expected" sorted characters. I believe English mostly uses the Unicode "Basic Latin" characters, but maybe also some of the "Latin-1 Supplement". I doubt it would use any other characters, but I could not find anything stating which characters are used for English. It seems like a rather complex topic. For more details about the character sets, here is a list of the "Unicode 7.0 Character Code Charts": http://www.unicode.org/charts/ Also, note that there is a "Unicode Collation Algorithm" specification, "which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard". So if you want to get your implementation right, you might want to look into it: http://www.unicode.org/reports/tr10/
For a proper implementation it would be wiser to keep the transformed sort_key within the object rather than do complex processing on each comparison. That would be another library. Then, of course, you could also pad numbers with lots of zeros and just use the native sort. The main goal of this project is to be as lightweight as possible and provide a good enough result; seeing 10 before 2 is the main issue with sorting.
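The zero-padding idea mentioned above could look roughly like the sketch below. The names makeSortKey and sortWithKeys and the fixed width of 10 digits are assumptions made for illustration; they are not part of natural-compare-lite.

```javascript
// Hypothetical helper: build a sort key once per string by left-padding
// every run of digits to a fixed width (assumption: no number in the data
// has more than `width` digits).
function makeSortKey(str, width = 10) {
  return str.replace(/\d+/g, (digits) => digits.padStart(width, "0"));
}

// Compute each key once, sort with a plain string comparison on the keys,
// then return the original values in sorted order.
function sortWithKeys(items) {
  return items
    .map((value) => ({ value, key: makeSortKey(value) }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.value);
}

const sortedFiles = sortWithKeys(["file10", "file2", "file1"]);
console.log(sortedFiles); // [ 'file1', 'file2', 'file10' ]
```

The trade-off is the one described in the comment: the key is computed once per item instead of on every comparison, at the cost of extra memory and a hard limit on number length.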
You got me thinking that a configurable alphabet could be useful :) In Estonian, for example, the desired order would be ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜXYabdefghijklmnoprsšzžtuvõäöüxy
Or maybe: AaBbDdEeFf.... Yes, a "configurable alphabet could be useful" ;)
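A minimal sketch of what a configurable alphabet might look like, assuming a comparator factory named makeAlphabetComparator (a hypothetical name, not an existing API of the library). Characters missing from the alphabet are placed after all listed ones, which is also an assumption; number handling is left out to keep the sketch short.

```javascript
// Hypothetical factory: returns a comparator that ranks characters by their
// position in `alphabet`. Characters not listed sort after all listed ones,
// ordered by code point.
function makeAlphabetComparator(alphabet) {
  const letters = Array.from(alphabet.normalize("NFC"));
  const rank = new Map();
  letters.forEach((ch, i) => rank.set(ch, i));
  const rankOf = (ch) =>
    rank.has(ch) ? rank.get(ch) : letters.length + ch.codePointAt(0);

  return (a, b) => {
    // Normalize so that precomposed and decomposed accents compare equally.
    const as = Array.from(a.normalize("NFC"));
    const bs = Array.from(b.normalize("NFC"));
    const n = Math.min(as.length, bs.length);
    for (let i = 0; i < n; i++) {
      const diff = rankOf(as[i]) - rankOf(bs[i]);
      if (diff !== 0) return diff;
    }
    return as.length - bs.length; // shorter string first on a common prefix
  };
}

// The Estonian order quoted in the comment above.
const estonian = "ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜXYabdefghijklmnoprsšzžtuvõäöüxy";
const compareEstonian = makeAlphabetComparator(estonian);
const sortedWords = ["Tamm", "Zoo", "Šokolaad", "Saar"].sort(compareEstonian);
console.log(sortedWords); // [ 'Saar', 'Šokolaad', 'Zoo', 'Tamm' ]
```

Passing "AaBbDdEeFf..." as the alphabet instead would give the interleaved upper/lowercase order suggested in the reply.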
See how basic Latin characters are sorted when using your natural-compare-lite JavaScript library: http://jsbin.com/vigegoguvixe/1/edit?js,console
A bunch of special characters are appearing:
I would have expected special characters to all be "grouped" together in one place, except for the space special character maybe (which would always be the first character). That is, either all before numbers, or all between numbers and letters (lowercase & uppercase being "together" one after another), or all after letters.
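One of the groupings described above (space first, then all other special characters, then digits, then letters with upper- and lowercase adjacent) can be sketched as follows. The function names charClass and compareGrouped are hypothetical, only ASCII letters and digits are recognized, and everything else is treated as a "special" character, which is a simplifying assumption.

```javascript
// Assumed grouping: 0 = space, 1 = other special characters,
// 2 = digits, 3 = ASCII letters.
function charClass(ch) {
  if (ch === " ") return 0;
  if (/[0-9]/.test(ch)) return 2;
  if (/[A-Za-z]/.test(ch)) return 3;
  return 1;
}

function compareGrouped(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const ca = a[i];
    const cb = b[i];
    if (ca === cb) continue;
    const byClass = charClass(ca) - charClass(cb);
    if (byClass !== 0) return byClass;
    if (charClass(ca) === 3) {
      // Letters: compare case-insensitively so that "A" and "a" stay adjacent.
      const la = ca.toLowerCase();
      const lb = cb.toLowerCase();
      if (la !== lb) return la < lb ? -1 : 1;
    }
    // Same group (or same letter in a different case): fall back to the
    // char code, which puts uppercase before lowercase.
    return ca < cb ? -1 : 1;
  }
  return a.length - b.length;
}

const grouped = ["a", "_", "1", " ", "B", "~", "A", "b"].sort(compareGrouped);
console.log(grouped); // [ ' ', '_', '~', '1', 'A', 'a', 'B', 'b' ]
```

Moving the special characters to a different position (between digits and letters, or after letters) only requires changing the numbers returned by charClass.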
Read more: basic Latin characters http://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin
Note that the built-in localeCompare() method seems to do a much better job at sorting basic Latin characters. See the live demo at http://jsbin.com/beboroyifomu/1/edit?js,console . The only problem with using the localeCompare() method instead of your implementation would be the inconsistency across browsers. See https://code.google.com/p/v8/issues/detail?id=459 . There is more about this topic at http://stackoverflow.com/questions/51165/how-do-you-do-string-comparison-in-javascript
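For comparison, a small sketch of sorting with localeCompare(). It assumes an engine that supports the optional locales and options arguments of String.prototype.localeCompare (engines without that support ignore them, which is part of the cross-browser inconsistency mentioned above). The "en" locale and the sample data are chosen only for illustration.

```javascript
// Comparator built on the built-in localeCompare(). The `numeric` option
// asks the engine to compare digit runs by value, so "2" sorts before "10".
const compareEn = (a, b) => a.localeCompare(b, "en", { numeric: true });

const sortedItems = ["item10", "item2", "item1"].sort(compareEn);
console.log(sortedItems);
```

In an engine with full internationalization support this is expected to print the items in the order item1, item2, item10.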