Do you plan to update dictionary0*.txt ? #521

Closed
utuhiro78 opened this issue Sep 12, 2021 · 5 comments

@utuhiro78

The dictionary0*.txt files have not been updated for 13 months. They have no entries related to "新型コロナ" (COVID-19).
Do you plan to update dictionary0*.txt ?

Thank you for your regular updates. I can't use Linux without Mozc.

@hiroyuki-komatsu
Collaborator

Hi utuhiro78,

We have not stopped updating the dictionary files, although we cannot promise when the next update will be.

By the way, we have introduced aux_dictionary.tsv, which is an additional way to add new words. I will add some words related to "新型コロナ" (COVID-19) to aux_dictionary.tsv in the near future.

We accept pull requests for this file.

If you are interested in making pull requests, please also consider the following.

Here's the description of aux_dictionary.tsv, quoted from 7500b9b.

* aux_dictionary.tsv is a list of additional words to be added to Mozc's word dictionary.

Format:
| key      | value    | base_key | base_value |
| -------- | -------- | -------- | ---------- |
| あるぱか | アルパカ | かぴばら | カピバラ |

* key and value (i.e. アルパカ) are fields for the new word.
* base_key and base_value (i.e. カピバラ) are fields of the reference word already in the dictionary.
* other fields of the new word (e.g. lid, rid, cost) are copied from the reference word.
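
To make the copy rule above concrete, here is a minimal sketch in Python. It is not the actual gen_aux_dictionary.py logic: the function name `expand_aux_entries` is hypothetical, the row layout `(key, lid, rid, cost, value)` for dictionary0*.txt is an assumption, and the numeric lid/rid/cost values in the usage example are made up for illustration.

```python
def expand_aux_entries(aux_rows, dictionary_rows):
    """Build full dictionary rows for the new words in aux_rows.

    aux_rows:        iterable of (key, value, base_key, base_value)
    dictionary_rows: iterable of (key, lid, rid, cost, value)
                     (assumed layout, not confirmed by the thread)
    """
    # Index the existing dictionary by (key, value); keep the first hit.
    index = {}
    for key, lid, rid, cost, value in dictionary_rows:
        index.setdefault((key, value), (lid, rid, cost))

    expanded = []
    for key, value, base_key, base_value in aux_rows:
        # Raises KeyError if the reference word is not in the dictionary.
        lid, rid, cost = index[(base_key, base_value)]
        # The new word inherits lid, rid and cost from the reference word.
        expanded.append((key, lid, rid, cost, value))
    return expanded


# Usage with made-up ids and cost for the reference word.
dictionary = [("かぴばら", 1851, 1851, 6000, "カピバラ")]
aux = [("あるぱか", "アルパカ", "かぴばら", "カピバラ")]
rows = expand_aux_entries(aux, dictionary)
```

In this sketch a missing reference word fails loudly with a KeyError rather than being skipped, so a typo in base_key or base_value is noticed at build time.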

We hope this helps.
Thank you,

@utuhiro78
Author

utuhiro78 commented Sep 13, 2021

Thanks hiroyuki-komatsu.

I added some words to aux_dictionary.tsv and oss.tsv.

# aux_dictionary.tsv
しゅうだんせっしゅ	集団接種	よぼうせっしゅ	予防接種
こべつせっしゅ	個別接種	よぼうせっしゅ	予防接種
ひゃっきん	100均	こんびにえんすすとあ	コンビニエンスストア
ちゅうはくしょく	昼白色	ちゅうこうしょく	昼光色

# oss.tsv
utuhiro78	しゅうだんせっしゅ	集団接種	Conversion Match
utuhiro78	こべつせっしゅ	個別接種	Conversion Match
utuhiro78	ひゃっきん	100均	Conversion Match
utuhiro78	ちゅうはくしょく	昼白色	Conversion Match
utuhiro78	もでるな	モデルナ	Conversion Match
utuhiro78	りょうなのか	量なのか	Conversion Match
utuhiro78	しつなのか	質なのか	Conversion Match

I found the document.

Differences between Bazel build and GYP build

GYP build is under maintenance mode. While the existing targets are supported by both GYP and Bazel as much as possible, new targets will be supported by Bazel only.

Targets only for Bazel:

AUX dictionary (//data/dictionary_oss:aux_dictionary)

I think major Linux distributions don't support Bazel build in their package build systems. Debian, openSUSE and Arch Linux packagers use GYP for their mozc packages.
I'm using Arch Linux, so I added the following lines to fcitx5-mozc.PKGBUILD.

  # Generate aux_dictionary.txt
  cd src/data/oss/
  PYTHONPATH="${PYTHONPATH}:../../" \
  python ../../dictionary/gen_aux_dictionary.py \
  --output aux_dictionary.txt \
  --aux_tsv aux_dictionary.tsv \
  --dictionary_txts ../../data/dictionary_oss/dictionary0*.txt
  cat aux_dictionary.txt >> ../../data/dictionary_oss/dictionary09.txt
  cd -

aux_dictionary.txt was enabled, but I can't enable oss.tsv.

label key value command
oss_issue12 ほしゅてきなめんもあるね Conversion Not Match

"ほしゅてきなめんもあるね" is converted to "保守的な麺もあるね".
"Conversion Not Match" (= ignore the entry?) is not enabled.

I don't understand the "label" and "command".

  • "utuhiro78" is OK for the label?
  • What is the difference between "Conversion Match" and "Conversion Expected"?
  • Is it possible to enable oss.tsv in GYP build?

@hiroyuki-komatsu
Collaborator

oss.tsv is a file for test cases.

The evaluation result (evaluation.tsv) is generated from oss.tsv.

You can generate this evaluation result as follows:

bazel build data/dictionary_oss:evaluation --config oss_linux

"utuhiro78" is OK for the label?

Could you use "oss" instead?

What is the difference between "Conversion Match" and "Conversion Expected"?

  • "Conversion Match": value should be a part of the result (partial match).
  • "Conversion Not Match": value should not be a part of the result (partial match).
  • "Conversion Expect": value should be the same as the result (exact match).
  • "Conversion Expect 3": value should appear at or before the 4th candidate. The default rank is 0 (1st candidate).
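
The four commands above can be sketched as a small checker. This is an illustrative sketch only, not Mozc's evaluation code: the function name `evaluate` is hypothetical, and it assumes that "the result" means the first candidate and that the optional number after "Expect" is a zero-based rank.

```python
def evaluate(command, value, candidates):
    """Return True if `value` satisfies `command` for a ranked candidate list."""
    words = command.split()
    if len(words) < 2 or words[0] != "Conversion":
        raise ValueError(f"unsupported command: {command!r}")

    # Assumption: "the result" is the top-ranked candidate.
    top = candidates[0] if candidates else ""
    name = " ".join(words[1:])

    if name == "Match":
        # Partial match: value must be a substring of the result.
        return value in top
    if name == "Not Match":
        # Partial match, negated: value must not appear in the result.
        return value not in top
    if words[1] == "Expect" and len(words) <= 3:
        # Exact match within ranks 0..N (N defaults to 0).
        rank = int(words[2]) if len(words) == 3 else 0
        return value in candidates[: rank + 1]
    raise ValueError(f"unsupported command: {command!r}")


# Usage: the candidate lists here are made up for illustration.
ok_match = evaluate("Conversion Match", "モデルナ", ["モデルナ製"])
ok_expect = evaluate("Conversion Expect 3", "d", ["a", "b", "c", "d"])
```

Under these assumptions, "Conversion Expect" with no number accepts only an exact match with the first candidate, while "Conversion Match" accepts any result that contains the value.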

Is it possible to enable oss.tsv in GYP build?

In theory, it is possible to copy the build logic from Bazel to GYP.
However, we have no plans to support aux_dictionary.txt or the evaluation in the GYP build.

Indeed, the GYP build is one of the major issues blocking us from accepting pull requests.
If we were to support the GYP build for those files, we would have to do one of the following:

  • Large refactoring of our GYP build in our code repository to keep accepting pull requests.
  • Stop accepting pull requests.

Neither is a good option for us.

We think accepting pull requests is more beneficial and constructive in the long run.
Thanks,

@utuhiro78
Author

Hello,

I created a pull request and signed Google's Contributor License Agreement (CLA),
but I think you have no time to check pull requests as part of your job.
You update Mozc every Sunday and that's enough for me.

I will withdraw the pull request and close this issue.

Thank you very much.

@hiroyuki-komatsu
Collaborator

Hi,
Although I mostly respond on weekends and am not quick, I'm still happy to check pull requests.
Thank you,
