This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Figure out what we're doing with ICU tokenisation and locales #15124

Open
reivilibre opened this issue Feb 21, 2023 · 4 comments
Labels
A-I18n A-Message-Search Searching messages A-User-Directory T-Other Questions, user support, anything else.

Comments

@reivilibre
Contributor

The ICU tokenisation rules seem to vary on different platforms.

Is it the ICU version? The locale? (How does ICU even get a default locale? I had a quick spelunk in the source code and couldn't find it!)

We need to figure out:

  • what we actually want from the ICU library
  • how we get that
  • how we get consistent results.

It feels like we want a 'universal' locale independent of the host's settings, so that Synapse works well with all languages. (This may be a pie in the sky goal!)
What's the best we can do?

This issue was originally dug up in #15079, but e.g. Patrick's machine generates yet another tokenisation. I'm not satisfied with the current solution.
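
One way to take the host's settings out of the picture would be to pass an explicit locale when creating the word break iterator, rather than relying on ICU's default. The following is a minimal sketch, assuming PyICU is installed. The name `split_words` and its default of `"en"` are hypothetical and not Synapse's actual code, and nothing in this thread establishes that the locale is what causes the differences.

```python
from typing import List

try:
    import icu  # PyICU; optional so the sketch still imports without it
except ImportError:
    icu = None


def split_words(text: str, locale_name: str = "en") -> List[str]:
    """Split text into words using an explicitly chosen ICU locale."""
    if icu is None:
        raise RuntimeError("PyICU is not installed")

    # Pin the locale instead of using icu.Locale.getDefault().
    breaker = icu.BreakIterator.createWordInstance(icu.Locale(locale_name))
    breaker.setText(text)

    words = []
    start = 0
    while True:
        end = breaker.nextBoundary()
        if end < 0:  # no more boundaries
            break
        # ICU reports offsets in UTF-16 code units; these match Python
        # string indices only for text within the Basic Multilingual Plane.
        segment = text[start:end]
        # ICU also returns whitespace and punctuation runs; keep only
        # segments that contain at least one letter or digit.
        if any(ch.isalnum() for ch in segment):
            words.append(segment)
        start = end
    return words


if icu is not None:
    print(split_words("lazy'dog jumped:over the.fox"))
```

Which locale, if any, gives acceptable results across languages is exactly the open question here, so the default above is only a placeholder.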

@clokep
Member

clokep commented Feb 21, 2023

For reference I'm getting the following error on UserDirectoryICUTestCase.test_icu_word_boundary_punctuation:

Traceback (most recent call last):
  File "synapse/tests/storage/test_user_directory.py", line 559, in test_icu_word_boundary_punctuation
    self.assertIn(
  File ".env/twisted/trial/_synctest.py", line 506, in assertIn
    raise self.failureException(msg or f"{containee!r} not in {container!r}")
twisted.trial.unittest.FailTest: ["lazy'fox", 'jumped', 'over', 'the.dog'] not in (["lazy'fox", 'jumped', 'over', 'the', 'dog'], ["lazy'fox", 'jumped:over', 'the.dog'])
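
The failure above adds a third tokenisation to the two the test already accepts. As a rough sketch of one possible way to get consistent results regardless of how ICU treats `:` and `.`, the tokens could be split again on those characters after ICU has run. The helper `normalise_tokens` and the choice of characters are assumptions for illustration, not something Synapse does.

```python
import re
from typing import Iterable, List

# Tokenisations reported in this thread for "lazy'fox jumped:over the.dog".
OBSERVED = [
    ["lazy'fox", "jumped", "over", "the", "dog"],
    ["lazy'fox", "jumped:over", "the.dog"],
    ["lazy'fox", "jumped", "over", "the.dog"],
]

# Hypothetical policy: always split on ':' and '.', but keep apostrophes.
_EXTRA_SPLIT = re.compile(r"[:.]")


def normalise_tokens(tokens: Iterable[str]) -> List[str]:
    """Split each token further on ':' and '.', dropping empty parts."""
    result: List[str] = []
    for token in tokens:
        result.extend(part for part in _EXTRA_SPLIT.split(token) if part)
    return result


for variant in OBSERVED:
    # Every observed variant collapses to the same list of five words.
    print(normalise_tokens(variant))
```

The trade-off is that this also splits things like hostnames and decimal numbers, which may or may not be what search wants.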

@deepbluev7
Contributor

I think the change in colon behaviour was actually added by Apple for ICU 72.1: https://unicode-org.atlassian.net/browse/ICU-22112

So possibly that will break in future Ubuntu versions too :)

@reivilibre
Contributor Author

reivilibre commented Mar 2, 2023

I'm somewhat confused here; I tried segmenting this snippet on 3 versions of Ubuntu (3 versions of ICU) and they all gave the same result. From this, we might hope to conclude that ICU version is not what affects it.

But I tried playing around with locales and could not induce a change that way either.

print(_parse_words_with_icu("lazy'dog jumped:over the.fox 授業は八時三十分から始まるから。"))

-----
Ubuntu 20.04
ICU  66.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 22.04
ICU  70.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 23.04
ICU  72.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']

The above were containerised (docker) installs.

My laptop has Kubuntu 22.10 and says

ICU 71.1
["lazy'dog", 'jumped:over', 'the.fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']

>>> [f"{k}={v}" for k, v in os.environ.items() if k.startswith("LC_")]
['LC_MONETARY=en_GB.UTF-8', 'LC_MEASUREMENT=fr_FR.UTF-8', 'LC_CTYPE=en_GB.UTF-8', 'LC_TIME=en_GB.UTF-8', 'LC_COLLATE=en_GB.UTF-8', 'LC_NUMERIC=en_GB.UTF-8']

In an Ubuntu docker container, I get

>>> [f"{k}={v}" for k, v in os.environ.items() if k.startswith("LC_")]
['LC_CTYPE=C.UTF-8']

Even if I change the LC_ env vars on my laptop to just that, I get the same tokenisation. (I also tried using different methods on Locale to see if I could get a different Locale from within PyICU.) Unless it's just this one version of ICU that has the difference? I guess I should try that one in Docker as well.
EDIT: ICU 71.1 in a container acted the same as every other ICU version in a container...
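
To make comparisons like the ones above easier to repeat across machines, a small diagnostic could gather everything in one place. This is a sketch only: `collect_locale_diagnostics` is a hypothetical helper, and it assumes PyICU exposes `ICU_VERSION`, `UNICODE_VERSION` and `Locale.getDefault()`. It also records `LANG` and `LANGUAGE`, since the listings above only cover `LC_` variables and those two can influence locale selection on POSIX systems as well.

```python
import os
from typing import Any, Dict, Mapping, Optional

try:
    import icu  # PyICU; optional so the sketch still runs without it
except ImportError:
    icu = None


def collect_locale_diagnostics(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Gather locale-related environment variables and ICU details."""
    if environ is None:
        environ = os.environ

    # LC_* (including LC_ALL) plus LANG and LANGUAGE.
    env = {
        key: value
        for key, value in sorted(environ.items())
        if key.startswith("LC_") or key in ("LANG", "LANGUAGE")
    }

    report: Dict[str, Any] = {
        "env": env,
        "icu_version": None,
        "unicode_version": None,
        "icu_default_locale": None,
    }
    if icu is not None:
        report["icu_version"] = icu.ICU_VERSION
        report["unicode_version"] = icu.UNICODE_VERSION
        report["icu_default_locale"] = icu.Locale.getDefault().getName()
    return report


for key, value in collect_locale_diagnostics().items():
    print(f"{key}: {value}")
```

Running this both on the laptop and in each container would show whether ICU's default locale actually differs between them, independently of what the environment variables say.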

@deepbluev7
Contributor

Huh, interesting, because I did see that change on their github, but possibly that is a different library? I probably mixed something up then, sorry!
