basic latin characters sort order is rather strange #1

Closed
0xadri opened this issue Sep 29, 2014 · 5 comments

0xadri commented Sep 29, 2014

See how basic Latin characters are sorted when using your natural-compare-lite JavaScript library: http://jsbin.com/vigegoguvixe/1/edit?js,console

Special characters end up scattered across four places:

  • before the numbers
  • between the '9' and the uppercase 'A'
  • between the uppercase 'Z' and the lowercase 'a'
  • after lowercase 'z'
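The four positions listed above match the gaps between digits, uppercase letters and lowercase letters in plain code-point order. The sketch below is only an illustration of that ordering, assuming the characters are compared by their char codes; it is not taken from the library's source, and `classify` is a hypothetical helper.

```javascript
// Classify each printable ASCII character (codes 32..126) and record the
// runs in which the classes appear when walking in code-point order.
function classify(ch) {
  if (ch >= '0' && ch <= '9') return 'digit';
  if (ch >= 'A' && ch <= 'Z') return 'upper';
  if (ch >= 'a' && ch <= 'z') return 'lower';
  return 'special';
}

const runs = [];
for (let code = 32; code <= 126; code++) {
  const ch = String.fromCharCode(code);
  const kind = classify(ch);
  const last = runs[runs.length - 1];
  if (last && last.kind === kind) last.chars += ch; // extend the current run
  else runs.push({ kind, chars: ch });              // start a new run
}

// Prints seven runs: special, digit, special, upper, special, lower, special.
for (const run of runs) {
  console.log(run.kind.padEnd(8), JSON.stringify(run.chars));
}
```

The four "special" runs are the four groups described in the report.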

I would have expected all special characters to be grouped together in one place, except perhaps the space character (which would always come first). That is, either all before the numbers, or all between the numbers and the letters (with lowercase and uppercase kept together, one after the other), or all after the letters.

Read more: basic Latin characters http://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin

Note that the built-in localeCompare() method seems to do a much better job of sorting basic Latin characters. See the live demo at http://jsbin.com/beboroyifomu/1/edit?js,console. The only problem with using localeCompare() instead of your implementation is that its results are not consistent across browsers; see https://code.google.com/p/v8/issues/detail?id=459. There is more on this topic at http://stackoverflow.com/questions/51165/how-do-you-do-string-comparison-in-javascript
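For comparison, here is a minimal sketch of locale-aware sorting with the standard `Intl.Collator` API, which is closely related to `localeCompare()`. It assumes a runtime that ships ICU data for the `en` locale; the exact placement of punctuation may still differ between engines, which is the inconsistency mentioned above. The sample strings are made up for illustration.

```javascript
// A collator with numeric: true compares digit runs by value,
// so "item2" sorts before "item10".
const collator = new Intl.Collator('en', { numeric: true });

const sorted = ['item10', 'item2', 'item1'].sort(collator.compare);
console.log(sorted);

// Mixed punctuation, digits and letters: the result depends on the
// collation data of the engine, so no expected output is given here.
console.log(['b', '_', 'A', '9', '~', 'a'].sort(collator.compare));
```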

Member

lauriro commented Sep 30, 2014

Thank you for reporting, I had never thought about special characters.

For now I have played a little with rearranging the ASCII char codes, until a better idea comes along.

Author

0xadri commented Oct 2, 2014

Sorting strings naturally is a very challenging task. Its difficulty is typically underestimated, so you're not the first one to have trouble solving this ;)

"I had never thought about special characters." Yes, in the end there are so many possible characters in Unicode that you may want to sort "only" the ones you "know" and then put all the rest after the "known/expected" sorted characters.

I believe English mostly uses the Unicode "Basic Latin" characters, and maybe also some of the "Latin-1 Supplement". I doubt it uses any other characters, but I could not find anything stating which characters are used for English. It seems like a rather complex topic.

For more details about the character sets, here is a list of the "Unicode 7.0 Character Code Charts" : http://www.unicode.org/charts/

Also, note that there is a "Unicode Collation Algorithm specification", "which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard". So if you want to get your implementation right, you might want to look into this: http://www.unicode.org/reports/tr10/

Member

lauriro commented Oct 2, 2014

For a proper implementation it would be wiser to keep a transformed sort_key within the object and not do complex processing on each comparison. That would be another library. Then, of course, you could also pad the numbers with a lot of zeros and just use the native sort.
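The sort-key idea can be sketched roughly as follows. This is only an illustration, not code from this project: `sortKey`, `sortWithKeys` and `PAD_WIDTH` are hypothetical names, and the sketch assumes that no number in the input is longer than `PAD_WIDTH` digits.

```javascript
// Assumption: no run of digits is longer than 10 characters.
const PAD_WIDTH = 10;

// Build a key by left-padding every run of digits with zeros, so that
// plain string comparison orders the numbers by value.
function sortKey(text) {
  return text.replace(/\d+/g, (digits) => digits.padStart(PAD_WIDTH, '0'));
}

// Compute each key once, sort by key, then unwrap the original values.
function sortWithKeys(items) {
  return items
    .map((value) => ({ value, key: sortKey(value) }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.value);
}

const files = ['img10.png', 'img2.png', 'img1.png'];
console.log(files.slice().sort());  // [ 'img1.png', 'img10.png', 'img2.png' ]
console.log(sortWithKeys(files));   // [ 'img1.png', 'img2.png', 'img10.png' ]
```

The key is computed once per item rather than on every comparison, at the cost of extra memory for the padded strings.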

The main goal of this project is to be as lightweight as possible and to provide a good enough result. Seeing 10 before 2 is the main issue with the native sort.

Member

lauriro commented Oct 2, 2014

You got me thinking that a configurable alphabet could be useful :)

In Estonian, for example, the desired order would be ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜXYabdefghijklmnoprsšzžtuvõäöüxy

Author

0xadri commented Oct 2, 2014

Or maybe: AaBbDdEeFf....
Or maybe: aAbBdDeEfF...
Or maybe: AaÄäBbDdEeFf....
Or maybe: aAäÄbBdDeEfF...

Yes, a "configurable alphabet could be useful" ;)
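A configurable alphabet could look roughly like the sketch below. It is not the library's API: `makeAlphabetCompare` is a hypothetical helper that ranks characters by their position in an alphabet string and places every unlisted character after the listed ones, in code-point order. It assumes the alphabet and the input use the same Unicode normalization form and does not handle numbers.

```javascript
// Build a comparator from an alphabet string. Characters missing from
// the alphabet sort after all listed ones, by code point.
function makeAlphabetCompare(alphabet) {
  const rank = new Map();
  Array.from(alphabet).forEach((ch, index) => rank.set(ch, index));
  const weight = (ch) =>
    rank.has(ch) ? rank.get(ch) : alphabet.length + ch.codePointAt(0);

  return function compare(a, b) {
    const left = Array.from(a);
    const right = Array.from(b);
    const shared = Math.min(left.length, right.length);
    for (let i = 0; i < shared; i++) {
      const diff = weight(left[i]) - weight(right[i]);
      if (diff !== 0) return diff;
    }
    return left.length - right.length; // a prefix sorts first
  };
}

// The Estonian order quoted in the thread.
const estonian =
  'ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜXYabdefghijklmnoprsšzžtuvõäöüxy';
const words = ['tee', 'zoo', 'öö', 'saal', 'šokk', 'vesi'];
const byEstonian = words.slice().sort(makeAlphabetCompare(estonian));
console.log(byEstonian); // [ 'saal', 'šokk', 'zoo', 'tee', 'vesi', 'öö' ]

// An interleaved order only needs a different alphabet string.
const interleaved = ['b', 'A', 'a', 'B'].sort(makeAlphabetCompare('AaBb'));
console.log(interleaved); // [ 'A', 'a', 'B', 'b' ]
```

Passing the order as a plain string keeps the comparator small, and each of the alternative orders suggested above would be a one-line change to that string.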
