Latest release breaks spell checking for Korean #903

Closed
mike-fabian opened this issue Jan 6, 2023 · 6 comments
@mike-fabian

mike-fabian commented Jan 6, 2023

See also: https://bugzilla.redhat.com/show_bug.cgi?id=2158548

Test file:

korean.txt

[mfabian@fedora ~]$ cat /etc/fedora-release 
Fedora release 38 (Rawhide)
[mfabian@fedora ~]$ rpm -q hunspell
hunspell-1.7.2-1.fc38.x86_64
[mfabian@fedora ~]$ hunspell -a -d ko_KR korean.txt
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.2)
# 안녕하세이 0

[mfabian@fedora ~]$

That is wrong: the misspelled word is reported with `#`, i.e. without any suggestions.

Downgrading to hunspell-1.7.1-1.fc38.x86_64 fixes the problem.

[mfabian@fedora ~]$ cat /etc/fedora-release 
Fedora release 38 (Rawhide)
[mfabian@fedora ~]$ rpm -q hunspell
hunspell-1.7.1-1.fc38.x86_64
[mfabian@fedora ~]$ hunspell -a -d ko_KR korean.txt
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.1)
& 안녕하세이 2 0: 안녕하세요, 안녕하여있다

[mfabian@fedora ~]$
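
For readers comparing the two sessions, here is a minimal sketch of how the `-a` (ispell pipe) result lines can be told apart: `#` reports a misspelling with no suggestions, `&` reports a misspelling followed by a count, an offset, and the suggestions. `PipeResult` and `parse_pipe_line` are hypothetical names for illustration, not part of hunspell.

```python
from dataclasses import dataclass, field


@dataclass
class PipeResult:
    kind: str                 # "ok", "miss", "none" or "other"
    word: str = ""
    offset: int = 0
    suggestions: list = field(default_factory=list)


def parse_pipe_line(line: str) -> PipeResult:
    """Parse one result line printed by `hunspell -a` (hypothetical helper)."""
    line = line.rstrip("\n")
    if line.startswith("& "):
        # "& word count offset: sugg1, sugg2, ..."
        head, _, tail = line.partition(": ")
        _, word, _count, offset = head.split(" ")
        return PipeResult("miss", word, int(offset), tail.split(", "))
    if line.startswith("# "):
        # "# word offset": misspelled, and no suggestions were found
        _, word, offset = line.split(" ")
        return PipeResult("none", word, int(offset))
    if line[:1] in ("*", "+", "-"):
        # "*", "+ root" and "-" all mean the word was accepted
        return PipeResult("ok")
    return PipeResult("other")  # banner line, blank separator, etc.


broken = parse_pipe_line("# 안녕하세이 0")
working = parse_pipe_line("& 안녕하세이 2 0: 안녕하세요, 안녕하여있다")
print(broken.kind, broken.suggestions)
print(working.kind, working.suggestions)
```

With the 1.7.2 output the result has kind "none" and an empty suggestion list; with the 1.7.1 output it has kind "miss" and two suggestions.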
@caolanm
Contributor

caolanm commented Jan 6, 2023

bisected to:

commit 05e44e0 (HEAD)
Author: Caolán McNamara <caolanm@redhat.com>
Date: Thu Sep 1 13:46:40 2022 +0100

Check word limit (#813)

* check against hentry blen max

* don't leak in the case of an aff parse error

and the issue is a word of byte len 519 김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는바둑이바둑이는돌돌이들
at line 101398 of the .dic
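
A quick way to see where a byte length of 519 can come from is sketched below. It rests on an assumption the thread does not state: that the dictionary stores the entry as decomposed Hangul jamo (Unicode NFD) rather than as precomposed syllables. Under that assumption the numbers line up; treat it as an illustration only.

```python
import unicodedata

WORD = (
    "김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리"
    "세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는"
    "바둑이바둑이는돌돌이들"
)

# Precomposed syllables (NFC): one code point per syllable, 3 bytes each.
nfc = unicodedata.normalize("NFC", WORD)
# Decomposed jamo (NFD): 2 or 3 code points per syllable, 3 bytes each.
# ASSUMPTION: this is the form stored in the .dic file.
nfd = unicodedata.normalize("NFD", WORD)

nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))

print(len(nfc), nfc_bytes)  # 73 code points, 219 bytes
print(len(nfd), nfd_bytes)  # 173 code points, 519 bytes
```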

@caolanm
Contributor

caolanm commented Jan 6, 2023

blen is an unsigned char and the word is longer than that can hold (in UTF-8), so it is now correctly detected as not insertable. That raises an error, and the entire dictionary is discarded.

The options are: leave it as is, in which case hunspell-ko has to remove the long entries to work; silently drop the word instead of flagging an error; or make blen a bigger type.
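
The three options can be sketched with a toy loader. This is not hunspell's code: `load_words`, `DictError`, `max_blen` and `skip_too_long` are hypothetical names, and the 519-byte entry is a stand-in built from a repeated 3-byte character.

```python
UCHAR_MAX = 255     # largest value an unsigned char can hold
USHRT_MAX = 65535   # largest value a 16-bit unsigned short can hold


class DictError(Exception):
    """Raised when an entry cannot be stored (hypothetical)."""


def load_words(words, max_blen=UCHAR_MAX, skip_too_long=False):
    """Toy loader: return the accepted words, or raise DictError."""
    table = []
    for word in words:
        blen = len(word.encode("utf-8"))
        if blen > max_blen:
            if skip_too_long:
                continue  # option 2: silently drop the entry
            # option 1: flag an error, which loses the whole dictionary
            raise DictError(f"word of {blen} bytes exceeds limit {max_blen}")
        table.append(word)
    return table


long_word = "가" * 173          # 173 * 3 = 519 bytes in UTF-8
words = ["안녕하세요", long_word]

try:
    strict = load_words(words)
except DictError:
    strict = None                # nothing usable is left

dropped = load_words(words, skip_too_long=True)
widened = load_words(words, max_blen=USHRT_MAX)  # option 3: bigger type
print(strict, len(dropped), len(widened))
```

With the default limit the load fails outright, with `skip_too_long=True` only the short word survives, and with the wider limit both entries load.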

caolanm pushed a commit to caolanm/hunspell that referenced this issue Jan 6, 2023
hunspell#903

A problem since the sanity check added in:

commit 05e44e0
Author: Caolán McNamara <caolanm@redhat.com>
Date:   Thu Sep 1 13:46:40 2022 +0100

    Check word limit (hunspell#813)

    * check against hentry blen max
caolanm pushed a commit that referenced this issue Jan 6, 2023
#903

A problem since the sanity check added in:

commit 05e44e0
Author: Caolán McNamara <caolanm@redhat.com>
Date:   Thu Sep 1 13:46:40 2022 +0100

    Check word limit (#813)

    * check against hentry blen max
@caolanm
Contributor

caolanm commented Jan 6, 2023

Let's try making blen (and clen) unsigned short as the first port of call.

@caolanm caolanm closed this as completed Jan 6, 2023
tdf-gerrit pushed a commit to LibreOffice/core that referenced this issue Jan 7, 2023
hunspell/hunspell#903

A problem since the sanity check added in:

commit 05e44e069e4cfaa9ce1264bf13f23fc9abd7ed05
Author: Caolán McNamara <caolanm@redhat.com>
Date:   Thu Sep 1 13:46:40 2022 +0100

    Check word limit (#813)

    * check against hentry blen max

Change-Id: Iab2c062584da076260c3262537690435eae7f396
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/145154
Tested-by: Jenkins
Reviewed-by: Caolán McNamara <caolanm@redhat.com>
tdf-gerrit pushed a commit to LibreOffice/core that referenced this issue Jan 7, 2023
hunspell/hunspell#903

A problem since the sanity check added in:

commit 05e44e069e4cfaa9ce1264bf13f23fc9abd7ed05
Author: Caolán McNamara <caolanm@redhat.com>
Date:   Thu Sep 1 13:46:40 2022 +0100

    Check word limit (#813)

    * check against hentry blen max

Change-Id: Iab2c062584da076260c3262537690435eae7f396
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/145121
Tested-by: Jenkins
Reviewed-by: Adolfo Jayme Barrientos <fitojb@ubuntu.com>
@reneengelhard

I think this warrants a 1.7.3 :)

@reneengelhard

reneengelhard commented Jan 8, 2023

Even with this patch applied, some of hunspell-ko's tests fail. See https://ci.debian.net/data/autopkgtest/unstable/amd64/h/hunspell-dict-ko/30142014/log.gz

make -C tests test DICT=/usr/share/hunspell/ko
make[1]: Entering directory '/tmp/autopkgtest-lxc.1yt2gr__/downtmp/build.BXh/src/tests'
echo | hunspell -d /usr/share/hunspell/ko | head -1
Hunspell 1.7.2 - hunspell-dict-ko 0.7.92 (requires Hunspell 1.3.1) https://spellcheck-ko.github.io/
python3 checkhunspellversion.py
Testing 001-pos-dependent-inflection.test...
Testing 002-irregular-inflection.test...
Testing 003-abbreviated-inflection.test...
Testing 004-compound-removing-rieul.test...
Testing 005-vowel-harmony.test...
Testing 006-descriptive-josa.test...
Testing 007-noun-suffix-and-josa.test...
Testing 008-dependent-josa.test...
Testing 009-auxiliary-verb.test...
009-auxiliary-verb.test:17: Y 사과인듯하네: & 사과인듯하네 15 0: 사과인 듯하네, 사과인듯하네, 사관인듯하네, 사과인듯하나, 사과인듯하니, 사기인듯하네, 사고인듯하네, 사구인듯하네, 사교인듯하네, 사과인듯하냐, 사계인듯하네, 사과인듯하게, 사과인듯하데, 다과인듯하네, 사과일듯하네
009-auxiliary-verb.test:18: Y 선생님이신듯하고: & 선생님이신듯하고 1 0: 선생님이신 듯하고
Testing 010-jamo-swap.test...
Testing 011-abbreviated-verb-suggestion.test...
Testing 012-wrong-inflection.test...
Testing 013-yeoncheol-buncheol.test...
Testing 014-sai-sios.test...
Testing 015-beginning-sound-rule.test...
Testing 016-numbers.test...
make[1]: *** [Makefile:9: test] Error 1
make: *** [Makefile:53: hosttest] Error 2
make[1]: Leaving directory '/tmp/autopkgtest-lxc.1yt2gr__/downtmp/build.BXh/src/tests'
/tmp/autopkgtest-lxc.1yt2gr__/downtmp/wrapper.sh: Killing leaked background processes: 1525 
    PID TTY      STAT   TIME COMMAND
   1525 ?        R      0:01 hunspell -i UTF-8 -d /usr/share/hunspell/ko
autopkgtest [14:40:59]: test command1: -----------------------]
command1             FAIL non-zero exit status 2
autopkgtest [14:40:59]: test command1:  - - - - - - - - - - results - - - - - - - - - -
autopkgtest [14:40:59]: test command1:  - - - - - - - - - - stderr - - - - - - - - - -
009-auxiliary-verb.test:17: Y 사과인듯하네: & 사과인듯하네 15 0: 사과인 듯하네, 사과인듯하네, 사관인듯하네, 사과인듯하나, 사과인듯하니, 사기인듯하네, 사고인듯하네, 사구인듯하네, 사교인듯하네, 사과인듯하냐, 사계인듯하네, 사과인듯하게, 사과인듯하데, 다과인듯하네, 사과일듯하네
009-auxiliary-verb.test:18: Y 선생님이신듯하고: & 선생님이신듯하고 1 0: 선생님이신 듯하고
make[1]: *** [Makefile:9: test] Error 1
make: *** [Makefile:53: hosttest] Error 2

@OctopusET

OctopusET commented Jan 9, 2023

That long string 김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는바둑이바둑이는돌돌이들 is just kind of an easter egg. It's okay to remove it.

References:
spellcheck-ko/hunspell-dict-ko#50 (comment)
https://github.com/spellcheck-ko/hunspell-dict-ko/blob/master/dict-ko-builtins.yaml#L208
https://forum.wordreference.com/threads/%EA%B9%80%EC%88%98%ED%95%9C%EB%AC%B4-%EA%B1%B0%EB%B6%81%EC%9D%B4%EC%99%80%EF%BB%BF-%EB%91%90%EB%A3%A8%EB%AF%B8.2115344/

Edit: added some more references.
