When querying a contact in non-English language (UTF-8) it always treats it as case-sensitive #3

Closed
aldanor opened this Issue · 6 comments

2 participants

@aldanor

So, "abc" will trigger case-insensitive search, but "абв" will not - looks like there's a bug with checking capitalization properly.

@kopischke kopischke was assigned
@kopischke
Owner

That’s a locale-specific collating issue, I guess. Could you provide me with two bits of information?

Your locale settings, as returned by

env | grep -E 'LC_|LANG'

and a few example strings to test against (as I don’t speak any language using a non-Latin alphabet).

@aldanor

I have every locale variable, including LC_ALL, set to en_US.UTF-8:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

However, the case-insensitive matching still doesn't seem to work (although it works out of the box with, e.g., Python's re module and the flags re.IGNORECASE | re.UNICODE).

Several examples: Путин Вор should be matched by both путин and вор, АБВГД should be matched by абвгд.
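To make the expected behaviour concrete, here is a minimal Python 3 sketch of case-insensitive substring matching on these examples. The `matches` helper is hypothetical and is not taken from the workflow or from the gist; it only illustrates what "should be matched" means here, assuming plain substring matching after Unicode case folding.

```python
def matches(query: str, name: str) -> bool:
    """Return True if query occurs in name, ignoring case.

    Hypothetical helper for illustration; str.casefold() applies
    Unicode case folding, so it is not limited to ASCII letters.
    """
    return query.casefold() in name.casefold()


# The example strings from this comment: (query, contact name).
examples = [("путин", "Путин Вор"), ("вор", "Путин Вор"), ("абвгд", "АБВГД")]
for query, name in examples:
    print(query, "->", name, matches(query, name))
```

All three pairs should report a match once case is handled in a Unicode-aware way.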

@aldanor

Hmmm...

$ echo Путин Вор | awk '{print tolower($0)}'
Путин Вор

$ echo Путин Вор | tr '[A-Z]' '[a-z]'
Путин Вор

$ echo Путин Вор | perl -e 'print lc <>;'
Путин Вор

$ python -c "import sys; print sys.argv[1].lower()" "Путин Вор"
Путин Вор

However:

$ python -c "import sys; print sys.argv[1].decode('UTF-8').lower()" "Путин Вор"
путин вор
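The difference between the two Python commands comes down to lowercasing raw bytes versus lowercasing decoded text. A minimal Python 3 sketch of the same effect follows; the one-liners above are Python 2, so using `bytes` to stand in for the undecoded command-line argument is an assumption made for illustration.

```python
# UTF-8 encoded bytes, standing in for the undecoded argument in Python 2.
raw = "Путин Вор".encode("utf-8")

# bytes.lower() only maps ASCII A-Z, so the Cyrillic bytes pass through unchanged.
lowered_bytes = raw.lower()

# Decoding first yields text, and str.lower() applies Unicode case mappings.
lowered_text = raw.decode("utf-8").lower()

print(lowered_bytes == raw)  # True: nothing was lowercased
print(lowered_text)          # путин вор
```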

Related question on SO:
http://stackoverflow.com/questions/13381746/tr-upper-lower-with-cyrillic-text

P.S. just in case, the Russian UTF-8 locale is ru_RU.UTF-8.

@aldanor

@kopischke Hi again!

I don't know if this would be helpful, but I tried rewriting the whole thing in Python, and it works perfectly with correct Unicode handling: https://gist.github.com/Aldanor/8c24ba71e8de85a8b669

Usage: put the script in the workflow folder and replace ./numbers "{query}" with ./get_contacts.py "{query}".

It would be simple to add advanced regex matching (mixing and matching first/last names and company/nickname in any order) and/or to sort results by the quality of the match (e.g. full first/last names get matched first), and you could also easily implement fuzzy matching, kind of like Alfred does.

@kopischke kopischke closed this issue from a commit
@kopischke Set case comparison to use en_US.UTF-8 locale
Brings UTF-8 case awareness to string operations. Fixes #3
c89fa42
@kopischke kopischke closed this in c89fa42
@kopischke
Owner

The issue is solvable in bash: all that was needed was setting the locale inside the Alfred script, as it defaults to “C” otherwise, where case comparison only works on the 7-bit ASCII range. Also: thanks for the suggested Python code, but it (like code in my preferred language, Ruby) is far too slow to be useful in a script filter. The current bash script beats it by a wide margin: on my elderly iMac, average lookup time is below 1 s on first run and below 0.5 s after that, while the Python version takes over 4 s on first run and over 2.5 s after that.

@kopischke
Owner

Fixed in release 1.1.0
