This repository has been archived by the owner on Jul 22, 2022. It is now read-only.

KeyError with specific character #68

Closed
PicoSushi opened this issue Mar 24, 2019 · 7 comments · Fixed by #74

@PicoSushi

A bad character causes KeyError in pykakasi.

from pykakasi import kakasi


def text_convert():
    bad_char = "\ue098"  # private-use character, written as an escape so it stays visible
    print(ord(bad_char))
    kks = kakasi()
    kks.setMode("J", "H")
    convert = kks.getConverter()

    text = convert.do(bad_char)
    print(text)


if __name__ == "__main__":
    text_convert()

Results:

57496
Traceback (most recent call last):
  File "/tmp/pyk.py", line 16, in <module>
    text_convert()
  File "/tmp/pyk.py", line 11, in text_convert
    text = convert.do(bad_char)
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/kakasi.py", line 146, in do
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H
    table = self._kanwa.load(text[0])
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/kanwa.py", line 40, in load
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__
    offset, size = self._index[key]
KeyError: b'e098'

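NOTE: This issue originates from a question on the Japanese Stack Overflow (スタック・オーバーフロー): "python - pykakasiで文字列置き換えの際にKeyErrorが発生する" ("KeyError occurs when replacing strings with pykakasi").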

@PicoSushi
Author

Also, this error occurred on Python 3.7.

@miurahr miurahr self-assigned this Mar 25, 2019
@miurahr miurahr added the bug label Mar 25, 2019
@miurahr
Owner

miurahr commented Mar 25, 2019

pykakasi should not raise an error when it receives bad characters; it could silently ignore them instead.

@miurahr
Owner

miurahr commented Mar 26, 2019

The character appears to be \uE098, which lies in the private-use code (PUC) area of the Unicode standard.
pykakasi should ignore it.
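
One way to confirm that a character is a private-use code point is to check its Unicode general category. The snippet below is a minimal sketch using only the standard library; the helper name `is_private_use` is hypothetical and is not part of pykakasi.

```python
import unicodedata


def is_private_use(ch):
    """Return True if ch is a private-use code point (general category "Co")."""
    return unicodedata.category(ch) == "Co"


print(hex(ord("\ue098")))        # 0xe098
print(is_private_use("\ue098"))  # True
print(is_private_use("\u57c7"))  # False
```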

@miurahr
Owner

miurahr commented Mar 26, 2019

This is why execution reaches j2.py:80: the method isRegion returns true for this character, as follows:

    def isRegion(self, c):
        return 0x3400 <= ord(c[0]) < 0xfa2e

This range includes the PUC area \ue000-\uf8ff.

It should be:

    def isRegion(self, c):
        return 0x3400 <= ord(c[0]) < 0xe000 or 0xf900 <= ord(c[0]) < 0xfa2e
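
The proposed range check can be tried on its own. The following is a minimal sketch, assuming a standalone function named `in_region` (a hypothetical name; in pykakasi the check is the method `isRegion`). The bounds are taken from the comment above.

```python
def in_region(c):
    """Range check proposed above, applied to the first character of c."""
    cp = ord(c[0])
    # Accept 0x3400-0xdfff and 0xf900-0xfa2d, skipping the
    # private-use block 0xe000-0xf8ff that sits between them.
    return 0x3400 <= cp < 0xE000 or 0xF900 <= cp < 0xFA2E


print(in_region("\ue098"))  # False
print(in_region("\u57c7"))  # True
```

With this check, \uE098 no longer reaches the dictionary lookup. A character such as \u57C7 still does, so a missing dictionary entry for an in-range character has to be handled separately.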

@harpaj

harpaj commented Jun 5, 2019

We ran into the same problem with character \u57C7:

>>> import pykakasi
>>> kks = pykakasi.kakasi()
>>> kks.setMode("J", "H")
>>> convert = kks.getConverter()
>>> convert.do("埇")

results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kakasi.py", line 146, in do
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H
    table = self._kanwa.load(text[0])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kanwa.py", line 40, in load
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__
    offset, size = self._index[key]
KeyError: b'57c7'
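
Until a fixed release is available, callers could guard the conversion themselves. The sketch below is a workaround under stated assumptions, not the library's behaviour: `convert_skipping_unknown` is a hypothetical helper that takes any callable (for example `convert.do` from the session above) and, if it raises KeyError, retries character by character and passes failing characters through unchanged. The per-character fallback may give different readings for multi-character words, so it is only a stopgap. `fake_convert` is a stand-in used here so the example runs without pykakasi.

```python
def convert_skipping_unknown(convert, text):
    """Call convert(text); on KeyError, fall back to per-character conversion."""
    try:
        return convert(text)
    except KeyError:
        pass
    out = []
    for ch in text:
        try:
            out.append(convert(ch))
        except KeyError:
            # Leave characters the converter cannot look up unchanged.
            out.append(ch)
    return "".join(out)


def fake_convert(text):
    """Stand-in converter: raises KeyError for characters missing from its table."""
    table = {"a": "A", "b": "B"}
    return "".join(table[ch] for ch in text)


print(convert_skipping_unknown(fake_convert, "a?b"))  # A?B
```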

@miurahr miurahr added this to the v0.95 milestone Jun 6, 2019
miurahr added a commit that referenced this issue Jun 6, 2019
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
miurahr added a commit that referenced this issue Jun 6, 2019
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
@miurahr miurahr closed this as completed in 3d92897 Jun 8, 2019
miurahr added a commit that referenced this issue Jun 8, 2019
@harpaj

harpaj commented Jun 11, 2019

Hi @miurahr,
first, thanks for the quick fix of this problem. While it now works for the character I reported above, there are still characters for which it fails:

>>> import pykakasi
>>> kks = pykakasi.kakasi()
>>> kks.setMode("J", "H")
>>> convert = kks.getConverter()
>>> convert.do("埇")
'よう'
>>> convert.do("潑")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kakasi.py", line 145, in do
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H
    table = self._kanwa.load(text[0])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kanwa.py", line 41, in load
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__
    offset, size = self._index[key]
KeyError: b'6f51'

I extracted a list of characters from our logs - note that this will definitely not include every character for which pykakasi fails currently, but hopefully it gives you enough information to pinpoint the problem.

3402
4f60
4fc9
4ff1
503b
51ee
541e
5496
55ce
56cd
5fb7
60a8
6852
6a45
6c78
6c85
6df8
6e53
6f51
6ff5
7028
70ab
70e4
73c9
7407
7b73
7de3
82f9
915b
9243
946b
94c2
9592
9943
9ad9
9e2d
9eb5
fa11
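
The entries above are hexadecimal code points, and the keys in the tracebacks (b'e098', b'57c7', b'6f51') match the lowercase hex code point of the failing character. A minimal sketch for turning list entries back into characters for a reproduction, and for deriving the hex string from a character; both helper names are hypothetical:

```python
def hex_to_char(code):
    """Turn a hex code point string such as "6f51" into the character."""
    return chr(int(code, 16))


def char_to_hex(ch):
    """Return the lowercase hex code point of ch, as seen in the KeyError keys."""
    return format(ord(ch), "x")


# A few entries from the list above.
for code in ["3402", "6f51", "73c9", "9ad9", "fa11"]:
    ch = hex_to_char(code)
    print(code, repr(ch), char_to_hex(ch) == code)
```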

@harpaj

harpaj commented Jun 12, 2019

@miurahr, thank you very much for the quick fix. I just tested the new version with the full dataset I had at hand, and it worked without any problems.

I left out one piece of information from the list of failures above: in this dataset at least, the two characters \u9ad9 and \u73c9 were by far the most frequent ones (each appearing a few hundred times, while the others appeared only very rarely).

Potentially you could add transliterations for these two characters instead of skipping them? It's not important, I just thought that this might be interesting to you.

I would provide a PR, but I don't speak Japanese so I can't help with this.
Again, thanks for your work on this library, it is extremely useful!
