Characters 々〇 cause Exception on conv.do() #46

10000shiro · 2018-04-26T20:57:22Z

Traceback (most recent call last):

  File "<ipython-input-1-4ab2ca517509>", line 1, in <module>
    runfile('D:/syosetsu-dl/kigou_conversion_issue.py', wdir='D:/syosetsu-dl')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/syosetsu-dl/kigou_conversion_issue.py", line 118, in <module>
    test_faulty_characters(conv)

  File "D:/syosetsu-dl/kigou_conversion_issue.py", line 18, in test_faulty_characters
    transscripted_string = conv.do(string)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pykakasi\kakasi.py", line 235, in do
    otext = otext + self._conv["E"].convert(text[i])

TypeError: must be str, not NoneType

This error happens when mode "E" is set to "a" and one attempts to convert a string containing "々" or "〇".

Attached you can find a small script demonstrating the issue.
kigou_conversion_issue.txt


miurahr · 2018-04-28T01:56:21Z

OK, what should 〇 be converted to?

"々" is a separate discussion.
"々" has no pronunciation of its own; it is a repetition mark.
So "苦々しい" should be converted to "niganiga shii", "若々しい" should become "wakawaka shii", and likewise for many other words.
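One way to picture the repetition-mark behaviour described above is a preprocessing step that replaces each "々" with the character before it, so that a later dictionary lookup sees the repeated kanji. This is only a minimal sketch: `expand_iteration_marks` is a hypothetical helper, not part of pykakasi, and it ignores sound changes in the reading (for example 人々 read as "hitobito").

```python
ITERATION_MARK = "\u3005"  # 々


def expand_iteration_marks(text: str) -> str:
    """Replace each 々 with the character that precedes it.

    Hypothetical helper for illustration only; a 々 with nothing
    before it is left unchanged.
    """
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            out.append(out[-1])  # repeat the previous character
        else:
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("苦々しい"))  # 苦苦しい
print(expand_iteration_marks("若々しい"))  # 若若しい
```

Whether the expanded form is actually present in the dictionary is a separate question that this sketch does not address.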

These symbols are in the "CJK Symbols and Punctuation" block (range 3000–303F) of the Unicode standard.
https://www.unicode.org/charts/PDF/U3000.pdf
"〇": "\u3007" "々": "\u3005"

miurahr · 2018-04-28T02:55:45Z

Definition in sym2.py:

class sym2 (object):
    # U3000 - 301F
    # \u3000、。〃〄〆〈〉《》「」『』【】〒〓〔〕〖〗〘〙
    # 〚〛〜〝〞〟〠
    _table_1 = [" ",",",".",'"',"(kigou)",None,"(sime)",None,"<",">","<<",">>","(",")","(",")",
            "(",")","(kigou)","(geta)","(",")","(",")","(",")","(",
            ")","~","(kigou)","\"","(kigou)","(kigou)"]

The lookup returns None for these two characters, and then TypeError: must be str, not NoneType is raised.
The definition comes from the original KAKASI; kakasi-2.3.6 now defines:

    static char E2alphabet_a1table[94][12] = {
	" ",",",".",",",".",".",":",";","?","!","\"","(maru)","'","`","..","~",
	"~","_","(kurikaesi)","(kurikaesi)","(kurikaesi)","(kurikaesi)","(kurikaesi)",
	"(kurikaesi)","(kurikaesi)","sime","(maru)","^","-","-","/","\\","~","||",
	"|","...","..","`","'","\"","\"","(",")","[","]","[","]","{","}","<",">",
	"<<",">>","(",")","(",")","(",")","+","-","+-","X","/","=","!=","<",">",
	"<=",">=","(kigou)","...","(osu)","(mesu)","(do)","'","\"","(Sessi)","\\",
	"$","(cent)","(pound)","%","#","&","*","@","(setu)","(hosi)","(hosi)","(maru)",
	"(maru)","(maru)","(diamond)" };

So we can modify the definition in pykakasi/sym2.py:

-    _table_1 = [" ",",",".",'"',"(kigou)",None,"(sime)",None,"<",">","<<",">>","(",")","(",")",
+    _table_1 = [" ",",",".",'"',"(kigou)","(kurikaeshi)","(sime)","(maru)","<",">","<<",">>","(",")","(",")",
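To illustrate why the None entries matter, here is a small self-contained sketch. It does not use pykakasi; `convert`, `OLD_TABLE` and `NEW_TABLE` are hypothetical names, and the tables hold only the first eight entries (U+3000 to U+3007) quoted above. The lookup by offset from U+3000 is an assumption about how such a table would be indexed.

```python
BASE = 0x3000  # first code point covered by the table

# First eight entries only (U+3000..U+3007), copied from the quoted tables.
OLD_TABLE = [" ", ",", ".", '"', "(kigou)", None, "(sime)", None]
NEW_TABLE = [" ", ",", ".", '"', "(kigou)", "(kurikaeshi)", "(sime)", "(maru)"]


def convert(text, table):
    """Concatenate table entries, mimicking `otext = otext + ...`."""
    otext = ""
    for ch in text:
        idx = ord(ch) - BASE
        if 0 <= idx < len(table):
            # str + None raises TypeError, as in the traceback
            otext = otext + table[idx]
        else:
            otext = otext + ch  # outside the table: pass through
    return otext


try:
    convert("々", OLD_TABLE)
except TypeError as exc:
    print("old table fails:", exc)

print(convert("々〇", NEW_TABLE))  # (kurikaeshi)(maru)
```

The exact wording of the TypeError message depends on the Python version.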

Any opinions?

miurahr · 2018-04-28T03:10:24Z

I found that a test case for E2a is missing, and so is the logic for this case!

miurahr · 2018-04-28T04:24:55Z

@10000shiro I've updated the code to fix this. Could you test again on the master branch?

10000shiro · 2018-04-28T13:00:51Z

With the updated symbols.py the code snippet works without an exception.

The 〇 in the snippet, together with the 三〇, was meant to translate to 30. But I do not know whether a standard interpretation of this character exists. For me, the main problem was that it caused an exception.

10000shiro · 2018-04-28T15:46:19Z

Found another faulty symbol: ： the fullwidth colon \uff1a

miurahr · 2018-04-29T02:34:03Z

@10000shiro Could you please file this as a separate issue?

Found another faulty symbol: ： the fullwidth colon \uff1a

Thanks.

miurahr · 2018-04-29T02:56:50Z

@10000shiro Could you propose a dictionary update or a better way of processing this?

The 〇 in the snippet together with the 三〇 was meant to translate to 30.

In fact, 三 is registered as 'み' and other readings in kakasidict. Internally, in the same way as the original KAKASI, 三 is converted to 'み', and then 'み' is converted to 'Mi' in J2a mode, because J2a mode is implemented as a J2H conversion (Japanese to Hiragana, by dictionary lookup) followed by an H2a conversion (Hiragana to romaji).

You can register '〇' as '０' in the translation table in pykakasi/sym2.py, but then '三〇' may be converted to 'Mi0'. You can also register '三〇' as '３０' in the Kana-Kanji dictionary; then it could be converted to '30'.
But how should we deal with '三〇〇', '二〇〇', '三〇二〇', ... and the infinite combinations of numbers?
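The two-stage conversion described above can be sketched with toy tables. Everything here is hypothetical: `J2H`, `H2A`, `SYM` and `j2a` are illustration names, each table holds a single entry, the lookup is per character rather than the real dictionary matching, and the output is lowercase, so it prints "mi0" rather than "Mi0".

```python
# Toy stand-ins for the real dictionaries (one entry each).
J2H = {"三": "み"}   # kanji -> hiragana (dictionary lookup)
H2A = {"み": "mi"}   # hiragana -> romaji
SYM = {"〇": "0"}    # symbol table entry, as if 〇 were registered as zero


def j2a(text):
    """Kanji -> hiragana -> romaji, with symbols handled separately."""
    out = []
    for ch in text:
        if ch in J2H:
            hira = J2H[ch]
            out.append("".join(H2A.get(h, h) for h in hira))
        elif ch in SYM:
            out.append(SYM[ch])
        else:
            out.append(ch)
    return "".join(out)


print(j2a("三〇"))  # mi0
```

Because each character is resolved independently, the reading chosen for 三 does not know that a 〇 follows it, which is the problem raised above.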

You can easily inspect the dictionary with the grep command, e.g. 'grep 三 pykakasi/data/kakasidict.utf8'.

10000shiro · 2018-04-29T11:43:18Z

One solution would be to preprocess the input and convert the partial strings containing numerals and "〇" accordingly, e.g. '三〇二〇' -> '三千二十', before doing the conversion. Here is a prototype implementation of this solution:
maru_replacement.txt

One issue I can see with this is that "〇" is not only used for 0; it is also often used as a placeholder to omit other kanji, which could lead to strange results.
Another concern might be a decrease in performance.

All in all, I'm not sure what the right way to handle this character is.
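For readers without the attachment, the preprocessing idea can be sketched roughly as follows. This is not the contents of maru_replacement.txt; `replace_maru_numbers` and `to_unit_notation` are hypothetical names, and the sketch assumes runs of at most four digits (below 万) and leaves runs starting with 〇 untouched, as a crude guard against 〇 used as a placeholder.

```python
import re

KANJI = "〇一二三四五六七八九"          # index == digit value
UNITS = [(1000, "千"), (100, "百"), (10, "十")]
RUN = re.compile("[" + KANJI + "]{2,}")  # two or more positional digits


def to_unit_notation(n):
    """Write 0 < n < 10000 with unit kanji, e.g. 3020 -> 三千二十."""
    parts = []
    for value, unit in UNITS:
        d, n = divmod(n, value)
        if d:
            # 1 is implicit before a unit: 10 -> 十, not 一十
            parts.append(("" if d == 1 else KANJI[d]) + unit)
    if n:
        parts.append(KANJI[n])
    return "".join(parts)


def replace_maru_numbers(text):
    """Rewrite positional runs that contain 〇 into unit notation."""
    def repl(match):
        run = match.group(0)
        if "〇" not in run or run[0] == "〇" or len(run) > 4:
            return run  # not a plain positional number: leave as-is
        return to_unit_notation(int("".join(str(KANJI.index(c)) for c in run)))
    return RUN.sub(repl, text)


print(replace_maru_numbers("三〇二〇"))  # 三千二十
print(replace_maru_numbers("三〇"))      # 三十
```

A run such as 〇〇 is left unchanged by this sketch, but a placeholder 〇 that follows a numeral would still be misread, which is the ambiguity described above.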

miurahr · 2018-05-02T02:15:05Z

@10000shiro Interesting! But I'm afraid it seems to be outside the scope of KAKASI's functionality.
It would be better as a separate Python module, named something like kansuji_converter.

10000shiro · 2018-05-02T05:38:30Z

@miurahr Yes, I agree that this goes beyond the scope of a simple kana kanji inverter. Replacing the "〇" with (maru) and having the user do the context analysis on their own is probably for the best.

miurahr self-assigned this Apr 28, 2018

miurahr added the bug label Apr 28, 2018

miurahr added a commit that referenced this issue Apr 28, 2018

Fix issue #46

9f5c5ca

10000shiro mentioned this issue Apr 29, 2018

Fullwitdh colon \u11fa causes Exception on conv.do() #51

Closed

miurahr closed this as completed Jul 1, 2018

miurahr mentioned this issue Jul 1, 2018

Issue with ー #57

Closed
