KeyError with specific character #68

PicoSushi · 2019-03-24T10:04:35Z

A bad character causes KeyError in pykakasi.

from pykakasi import kakasi


def text_convert():
    bad_char = "\ue098"  # U+E098, a private-use character
    print(ord(bad_char))
    kks = kakasi()
    kks.setMode("J", "H")
    convert = kks.getConverter()

    text = convert.do(bad_char)
    print(text)


if __name__ == "__main__":
    text_convert()

results

57496
Traceback (most recent call last):
  File "/tmp/pyk.py", line 16, in <module>
    text_convert()
  File "/tmp/pyk.py", line 11, in text_convert
    text = convert.do(bad_char)
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/kakasi.py", line 146, in do
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H
    table = self._kanwa.load(text[0])
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/kanwa.py", line 40, in load
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__
    offset, size = self._index[key]
KeyError: b'e098'

NOTE: This issue comes from the Japanese Stack Overflow (スタック・オーバーフロー) question "python - KeyError occurs when replacing strings with pykakasi" (pykakasiで文字列置き換えの際にKeyErrorが発生する).


PicoSushi · 2019-03-24T10:05:58Z

This error occurred on Python 3.7.

miurahr · 2019-03-25T08:29:49Z

pykakasi should not raise an error when it is given a bad character; it could silently ignore it instead.

miurahr · 2019-03-26T02:24:14Z

The character appears to be \uE098, which lies in the private use code (PUC) area of the Unicode standard.
It should be ignored by pykakasi.
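
As a quick check of the point above, Python's standard `unicodedata` module reports the general category `Co` (private use) for this code point. This is only a minimal sketch; the helper name `is_private_use` is made up for illustration and is not part of pykakasi.

```python
import unicodedata


def is_private_use(ch):
    """Return True if ch is a private-use code point (general category 'Co')."""
    return unicodedata.category(ch) == "Co"


bad_char = "\ue098"
# Show the code point, its Unicode general category, and the helper's verdict.
print(hex(ord(bad_char)), unicodedata.category(bad_char), is_private_use(bad_char))
# An ordinary kanji is not in a private-use area.
print(is_private_use("漢"))
```

A category-based check like this would also cover the supplementary private use planes, which a single BMP range comparison does not.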

miurahr · 2019-03-26T02:30:41Z

The reason it reaches j2.py:80 is that the method isRegion returns true for this character. It is currently defined as follows:

    def isRegion(self, c):
        return 0x3400 <= ord(c[0]) < 0xfa2e

This range includes the PUC area \ue000-\uf8ff.

It should be:

    def isRegion(self, c):
        return 0x3400 <= ord(c[0]) < 0xe000 or 0xf900 <= ord(c[0]) < 0xfa2e
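
To see the effect of the proposed range change in isolation, here is a small standalone sketch. The free functions `is_region_old` and `is_region_new` are hypothetical stand-ins for the method before and after the change; they are not the library's actual code.

```python
def is_region_old(c):
    # Original check: one contiguous range that also covers U+E000..U+F8FF.
    return 0x3400 <= ord(c[0]) < 0xfa2e


def is_region_new(c):
    # Proposed check: the same range with the private use area carved out.
    cp = ord(c[0])
    return 0x3400 <= cp < 0xe000 or 0xf900 <= cp < 0xfa2e


# Compare both checks on the private-use character and on two reported kanji.
for ch in ("\ue098", "\u57c7", "\ufa11"):
    print(hex(ord(ch)), is_region_old(ch), is_region_new(ch))
```

Note that this only keeps private-use characters out of the kanji path; a kanji that is inside the range but missing from the dictionary would still reach the lookup.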

harpaj · 2019-06-05T08:52:23Z

We ran into the same problem with character \u57C7:

>>> import pykakasi
>>> kks = pykakasi.kakasi()
>>> kks.setMode("J", "H")
>>> convert = kks.getConverter()
>>> convert.do("埇")

results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kakasi.py", line 146, in do
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H
    table = self._kanwa.load(text[0])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kanwa.py", line 40, in load
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__
    offset, size = self._index[key]
KeyError: b'57c7'

Signed-off-by: Hiroshi Miura <miurahr@linux.com>

Fix #68

harpaj · 2019-06-11T11:19:28Z

Hi @miurahr,
first, thanks for the quick fix of this problem. While it does work for the character I reported above now, there are still characters for which it fails:

>>> import pykakasi
>>> kks = pykakasi.kakasi()
>>> kks.setMode("J", "H")
>>> convert = kks.getConverter()
>>> convert.do("埇")
'よう'
>>> convert.do("潑")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kakasi.py", line 145, in do
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H
    table = self._kanwa.load(text[0])
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/pykakasi/kanwa.py", line 41, in load
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/johannes/matching_pipeline/venv/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__
    offset, size = self._index[key]
KeyError: b'6f51'

I extracted a list of characters from our logs - note that this will definitely not include every character for which pykakasi fails currently, but hopefully it gives you enough information to pinpoint the problem.

3402
4f60
4fc9
4ff1
503b
51ee
541e
5496
55ce
56cd
5fb7
60a8
6852
6a45
6c78
6c85
6df8
6e53
6f51
6ff5
7028
70ab
70e4
73c9
7407
7b73
7de3
82f9
915b
9243
946b
94c2
9592
9943
9ad9
9e2d
9eb5
fa11
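
For anyone hitting this before a fixed release is available, one possible caller-side workaround is to catch the KeyError and fall back to converting character by character, passing unknown characters through unchanged. This is a rough sketch only: `safe_convert` is a hypothetical helper, and `fake_convert` merely stands in for the `convert.do` callable so the example runs without pykakasi installed.

```python
def safe_convert(convert_do, text):
    """Convert text, passing through characters the converter cannot handle."""
    try:
        # Fast path: the whole string converts without problems.
        return convert_do(text)
    except KeyError:
        pass
    out = []
    for ch in text:
        try:
            out.append(convert_do(ch))
        except KeyError:
            # Unknown character: keep it as-is instead of failing.
            out.append(ch)
    return "".join(out)


# Stand-in converter with a one-entry table; raises KeyError for anything else.
_table = {"埇": "よう"}


def fake_convert(text):
    return "".join(_table[ch] for ch in text)


print(safe_convert(fake_convert, "埇潑"))
```

Converting one character at a time loses any multi-character word readings, so this is a crude fallback rather than a replacement for a fix in the library.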

Fix #68 Signed-off-by: Hiroshi Miura <miurahr@linux.com>

harpaj · 2019-06-12T14:00:08Z

@miurahr, thank you very much for the quick fix. I just tested the new version with the full dataset I had at hand, it worked without any problems.

I left out one piece of information from the list of failures above: The two characters \u9ad9 and \u73c9 were at least in this dataset by far the most frequent ones (each appearing a few hundred times while the others appeared only very rarely).

Potentially you could add transliterations for these two characters instead of skipping them? It's not important, I just thought that this might be interesting to you.

I would provide a PR, but I don't speak Japanese so I can't help with this.
Again, thanks for your work on this library, it is extremely useful!

miurahr self-assigned this Mar 25, 2019

miurahr added the bug label Mar 25, 2019

miurahr added this to the v0.95 milestone Jun 6, 2019

miurahr added a commit that referenced this issue Jun 6, 2019

Fix #68

5208de5

Signed-off-by: Hiroshi Miura <miurahr@linux.com>

miurahr added a commit that referenced this issue Jun 6, 2019

Fix #68

dcf8c3b

Signed-off-by: Hiroshi Miura <miurahr@linux.com>

miurahr closed this as completed in 3d92897 Jun 8, 2019

miurahr added a commit that referenced this issue Jun 8, 2019

Merge pull request #69 from miurahr/issue_68

a73c978

Fix #68

miurahr reopened this Jun 11, 2019

This was referenced Jun 11, 2019

Raise Key error with several non-standard form of Japanese character #72

Closed

JIS X0213 characters (old IBM/NEC extensions) generate key error #73

Closed

miurahr added a commit that referenced this issue Jun 12, 2019

Fix keyerror in kanwa.py when input unknown kanji

143a7fa

Fix #68 Signed-off-by: Hiroshi Miura <miurahr@linux.com>

miurahr mentioned this issue Jun 12, 2019

Fix keyerror in kanwa.py when input unknown kanji #74

Merged

miurahr closed this as completed in #74 Jun 12, 2019
