request to implement LATIN sorting algorithm in pandas #24403

Closed
jopasi opened this issue Dec 23, 2018 · 1 comment
Labels
Duplicate Report Duplicate issue or pull request

jopasi commented Dec 23, 2018

Code Sample, a copy-pastable example if possible

import locale
locale.setlocale(locale.LC_ALL, 'hu_HU.UTF-8')
a = ["A", "E", "Z", "a", "e", "é", "z", "5", "4", "1", "AA", "AÁ", "ÁA", "ÁÁ", "aa", "aá", "áa", "áá"]
sorted(a, key=locale.strxfrm)
# ['1', '4', '5', 'a', 'A', 'aa', 'AA', 'aá', 'AÁ', 'áa', 'ÁA', 'áá', 'ÁÁ', 'e', 'E', 'é', 'z', 'Z']

Problem description

Dear developers,
I would like to request a feature that would help the many users who work with non-English characters and words in pandas DataFrames.

pandas currently offers no way to sort string data using Latin or other locale-specific collation rules, although Python itself already provides a solution:
import locale
locale.setlocale(locale.LC_ALL, 'hu_HU.UTF-8')  # Hungarian here, but any installed locale works, e.g. 'fr_FR.UTF-8' for French
a = ["A", "E", "Z", "a", "e", "é", "z", "5", "4", "1", "AA", "AÁ", "ÁA", "ÁÁ", "aa", "aá", "áa", "áá"]
sorted(a)
# ['1', '4', '5', 'A', 'AA', 'AÁ', 'E', 'Z', 'a', 'aa', 'aá', 'e', 'z', 'ÁA', 'ÁÁ', 'áa', 'áá', 'é']
# This is the default order, which compares Unicode code points: uppercase before lowercase, accented letters last. It is wrong for Hungarian and most other languages.
sorted(a, key=locale.strxfrm)  # locale-aware order, correct for Hungarian
# ['1', '4', '5', 'a', 'A', 'aa', 'AA', 'aá', 'AÁ', 'áa', 'ÁA', 'áá', 'ÁÁ', 'e', 'E', 'é', 'z', 'Z']
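One practical caveat with the snippet above: `locale.setlocale` changes process-wide state and raises `locale.Error` if the requested locale is not installed. A minimal sketch of a wrapper that handles both follows; `locale_sorted` is a hypothetical helper name, not part of Python or pandas, and falling back to code-point order is just one possible choice.

```python
import locale


def locale_sorted(values, locale_name="hu_HU.UTF-8"):
    """Return values sorted by the collation rules of locale_name.

    Hypothetical helper: falls back to plain code-point order when the
    requested locale is not installed on this machine.
    """
    # Calling setlocale without a locale argument returns the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Locale not available: use the default ordering instead.
        return sorted(values)
    try:
        return sorted(values, key=locale.strxfrm)
    finally:
        # Restore the collation setting that was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)


print(locale_sorted(["z", "é", "a", "e"]))
```

The printed order depends on whether `hu_HU.UTF-8` is installed, so no expected output is shown here.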

In pandas there is no way to specify the sort order:
df.sort_values()  # gives the wrong order for accented Latin and other non-ASCII characters
It would be good to have something like this:
df.sort_values(key=locale.strxfrm)
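For readers landing here later, a sketch of two ways to get this behaviour. The first is a workaround that needs no special pandas support: sort the index labels in plain Python and reorder with `df.loc`. The second uses the `key` argument that `sort_values` gained in pandas 1.1; there the callable receives the whole Series rather than single values, so `locale.strxfrm` has to be applied with `Series.map`. The column name `name` and the sample data are made up for illustration, and the result depends on the locale being installed.

```python
import locale

import pandas as pd

try:
    locale.setlocale(locale.LC_COLLATE, "hu_HU.UTF-8")
except locale.Error:
    # Locale not installed: strxfrm then follows whatever locale is active.
    pass

# Illustrative data; the column name "name" is an assumption.
df = pd.DataFrame({"name": ["z", "é", "a", "Z", "e"]})

# Workaround for any pandas version: sort the index labels by the
# collation key of each row's value, then reorder the frame.
order = sorted(df.index, key=lambda i: locale.strxfrm(df.at[i, "name"]))
by_locale = df.loc[order]

# pandas >= 1.1: `key` must be vectorized (Series in, Series out).
by_key = df.sort_values("name", key=lambda s: s.map(locale.strxfrm))

print(by_locale["name"].tolist())
print(by_key["name"].tolist())
```

Passing `locale.strxfrm` directly as `key` would not work, because it expects a single string rather than a Series.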

I would appreciate it if this feature, which already exists in Python, were implemented in pandas as well.
Thank you

Expected Output

df.sort_values(key=locale.strxfrm)
['1', '4', '5', 'a', 'A', 'aa', 'AA', 'aá', 'AÁ', 'áa', 'ÁA', 'áá', 'ÁÁ', 'e', 'E', 'é', 'z', 'Z']

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

Member

WillAyd commented Dec 25, 2018

This is a duplicate of #3942 - investigation and PRs are always welcome

WillAyd closed this as completed Dec 25, 2018
WillAyd added the Duplicate Report label Dec 25, 2018