This repository has been archived by the owner on Jul 22, 2022. It is now read-only.

Characters 々 〇 cause Exception on conv.do() #46

Closed
10000shiro opened this issue Apr 26, 2018 · 11 comments

@10000shiro

10000shiro commented Apr 26, 2018

Traceback (most recent call last):

  File "<ipython-input-1-4ab2ca517509>", line 1, in <module>
    runfile('D:/syosetsu-dl/kigou_conversion_issue.py', wdir='D:/syosetsu-dl')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/syosetsu-dl/kigou_conversion_issue.py", line 118, in <module>
    test_faulty_characters(conv)

  File "D:/syosetsu-dl/kigou_conversion_issue.py", line 18, in test_faulty_characters
    transscripted_string = conv.do(string)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pykakasi\kakasi.py", line 235, in do
    otext = otext + self._conv["E"].convert(text[i])

TypeError: must be str, not NoneType

This error happens when mode "E" is set to "a" and one attempts to convert a string containing "々" or "〇".

Attached you can find a small script demonstrating the issue.
kigou_conversion_issue.txt

@miurahr
Owner

miurahr commented Apr 28, 2018

OK, what should 〇 be converted to?

"々" is a separate discussion.
"々" has no pronunciation of its own; it is a repetition mark.
So "苦々しい" should be converted to "niganiga shii", "若々しい" to "wakawaka shii", and likewise for many other words.
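As a rough illustration of the repetition-mark behavior described above, the sketch below expands each "々" into the character before it, so that a later dictionary lookup sees "苦苦しい" instead of "苦々しい". The function name expand_iteration_marks is hypothetical and this is not how pykakasi handles the character; it is only a preprocessing idea.

```python
ITERATION_MARK = "\u3005"  # "々"


def expand_iteration_marks(text: str) -> str:
    """Replace each "々" with the character that precedes it."""
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            out.append(out[-1])  # repeat the previous (already expanded) character
        else:
            out.append(ch)  # a leading "々" has nothing to repeat and is kept
    return "".join(out)


print(expand_iteration_marks("苦々しい"))
print(expand_iteration_marks("若々しい"))
```

Note that expansion alone does not settle the reading: in words such as 人々 (hitobito) the repeated part is voiced differently, so a dictionary entry for the whole word is still needed.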

These symbols are located in the "CJK Symbols and Punctuation" block (range 3000–303F) of the Unicode standard.
https://www.unicode.org/charts/PDF/U3000.pdf
"〇": "\u3007" "々": "\u3005"

@miurahr
Owner

miurahr commented Apr 28, 2018

Definition in sym2.py:

class sym2 (object):
    # U3000 - 301F
    # \u3000、。〃〄〆〈〉《》「」『』【】〒〓〔〕〖〗〘〙
    # 〚〛〜〝〞〟〠
    _table_1 = [" ",",",".",'"',"(kigou)",None,"(sime)",None,"<",">","<<",">>","(",")","(",")",
            "(",")","(kigou)","(geta)","(",")","(",")","(",")","(",
            ")","~","(kigou)","\"","(kigou)","(kigou)"]

The table entry is None, so the lookup returns None and TypeError: must be str, not NoneType is raised.
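The failure can be reproduced without pykakasi. The sketch below uses a toy table holding only the first eight entries of _table_1 (U+3000 to U+3007); the functions convert_symbol and transliterate are hypothetical stand-ins, not the library's actual code, and the exact wording of the TypeError message depends on the Python version.

```python
# Toy stand-in for the first entries of sym2._table_1 (U+3000..U+3007).
TABLE = [" ", ",", ".", '"', "(kigou)", None, "(sime)", None]


def convert_symbol(ch: str):
    """Look up a symbol by its offset from U+3000; returns None for gaps."""
    return TABLE[ord(ch) - 0x3000]


def transliterate(text: str) -> str:
    otext = ""
    for ch in text:
        # Mirrors the failing pattern in the traceback: str + None raises TypeError.
        otext = otext + convert_symbol(ch)
    return otext


try:
    transliterate("\u3001\u3005")  # "、" converts fine, "々" hits a None entry
except TypeError as exc:
    print("TypeError:", exc)
```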
The definition comes from the original KAKASI; kakasi-2.3.6 now defines:

    static char E2alphabet_a1table[94][12] = {
	" ",",",".",",",".",".",":",";","?","!","\"","(maru)","'","`","..","~",
	"~","_","(kurikaesi)","(kurikaesi)","(kurikaesi)","(kurikaesi)","(kurikaesi)",
	"(kurikaesi)","(kurikaesi)","sime","(maru)","^","-","-","/","\\","~","||",
	"|","...","..","`","'","\"","\"","(",")","[","]","[","]","{","}","<",">",
	"<<",">>","(",")","(",")","(",")","+","-","+-","X","/","=","!=","<",">",
	"<=",">=","(kigou)","...","(osu)","(mesu)","(do)","'","\"","(Sessi)","\\",
	"$","(cent)","(pound)","%","#","&","*","@","(setu)","(hosi)","(hosi)","(maru)",
	"(maru)","(maru)","(diamond)" };

So we can modify the definition in pykakasi/sym2.py:

-    _table_1 = [" ",",",".",'"',"(kigou)",None,"(sime)",None,"<",">","<<",">>","(",")","(",")",
+    _table_1 = [" ",",",".",'"',"(kigou)","(kurikaeshi)","(sime)","(maru)","<",">","<<",">>","(",")","(",")",

Any opinions?
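To check that a change like the one proposed above leaves no gaps, one could scan a table for None entries. This is only a sketch: ORIGINAL and PATCHED copy just the first eight entries from the diff, and missing_entries is a hypothetical helper, not part of pykakasi.

```python
# First eight entries (U+3000..U+3007) before and after the proposed change.
ORIGINAL = [" ", ",", ".", '"', "(kigou)", None, "(sime)", None]
PATCHED = [" ", ",", ".", '"', "(kigou)", "(kurikaeshi)", "(sime)", "(maru)"]


def missing_entries(table, base=0x3000):
    """Return the characters whose table entry is None."""
    return [chr(base + i) for i, value in enumerate(table) if value is None]


print(missing_entries(ORIGINAL))  # the two characters from this issue
print(missing_entries(PATCHED))   # empty list: every entry is a string
```

The same scan could be run over each symbol table in a test so that a None entry is caught before it reaches conv.do().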

@miurahr miurahr self-assigned this Apr 28, 2018
@miurahr miurahr added the bug label Apr 28, 2018
@miurahr
Owner

miurahr commented Apr 28, 2018

I found that a test case for E2a is missing, and so is the logic for this case!

miurahr added a commit that referenced this issue Apr 28, 2018
@miurahr
Owner

miurahr commented Apr 28, 2018

@10000shiro I've updated the code to fix this. Could you test again on the master branch?

@10000shiro
Author

With the updated symbols.py the code snippet works without an exception.

The 〇 in the snippet, as part of 三〇, was meant to translate to 30. But I do not know whether a standard interpretation of this character exists. For me, the main problem was that it caused an exception.

@10000shiro
Author

Found another faulty symbol: the fullwidth colon ":" (\uff1a)

@miurahr
Owner

miurahr commented Apr 29, 2018

@10000shiro Could you please open a separate issue for this?

Found another faulty symbol: the fullwidth colon ":" (\uff1a)

Thanks.

@miurahr
Owner

miurahr commented Apr 29, 2018

@10000shiro Could you propose a dictionary update or a better way of processing this?

The 〇 in the snippet together with the 三〇 was meant to translate to 30.

In fact, '三' is registered as 'み' and other readings in kakasidict. Internally, in the same way as the original KAKASI, it is converted to 'み', and then 'み' is converted to 'Mi' in J2a mode, because J2a mode is implemented as J2H (Japanese to hiragana, by dictionary lookup) followed by H2a (hiragana to romaji) conversion.

You can register '〇' as '0' in the translation table in pykakasi/sym2.py, but then '三〇' may be converted to 'Mi0'. You can also register '三〇' as '30' in the kana-kanji dictionary, and then it could be converted to '30'.
But then how do we deal with '三〇〇', '二〇〇', '三〇二〇', ... and the infinite combinations of numbers?

You can easily inspect the dictionary with grep, for example 'grep 三 pykakasi/data/kakasidict.utf8'.

@10000shiro
Author

One solution would be to preprocess the input and convert the substrings containing numerals and "〇" accordingly, e.g. '三〇二〇' -> '三千二十', before doing the conversion. Here is a prototype implementation of this approach:
maru_replacement.txt

One issue I can see with this is that "〇" is not only used for 0 but often also as a placeholder to omit other kanji, which could lead to strange results.
Another concern might be a decrease in performance.

All in all, I'm not sure what the right way to handle this character is.
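The preprocessing idea above could look roughly like the following. This is an independent sketch, not the attached maru_replacement.txt: the names to_kansuji, replace_positional and preprocess are hypothetical, only runs of up to four digits are handled, and runs that start with "〇" are left alone on the assumption that they are placeholders rather than numbers.

```python
import re

DIGITS = "〇一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands
RUN = re.compile(f"[{DIGITS}]+")


def to_kansuji(n: int) -> str:
    """Spell 1..9999 with 十/百/千, e.g. 3020 -> 三千二十."""
    parts = []
    for power in range(3, -1, -1):
        digit = n // 10 ** power % 10
        if digit == 0:
            continue
        # "一" is dropped before 十/百/千 (十 rather than 一十).
        prefix = "" if digit == 1 and power > 0 else DIGITS[digit]
        parts.append(prefix + UNITS[power])
    return "".join(parts)


def replace_positional(match) -> str:
    run = match.group(0)
    # Only rewrite runs that look like positional numbers: they contain 〇,
    # do not start with it, and fit in four digits.
    if "〇" not in run or run[0] == "〇" or len(run) > 4:
        return run
    return to_kansuji(int("".join(str(DIGITS.index(c)) for c in run)))


def preprocess(text: str) -> str:
    return RUN.sub(replace_positional, text)


print(preprocess("三〇二〇"))   # 三千二十
print(preprocess("〇〇さん"))   # unchanged: treated as a placeholder
```

The sketch cannot tell a numeric "〇" from a placeholder inside a run such as 三〇, so the concern about strange results still applies.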

@miurahr
Owner

miurahr commented May 2, 2018

@10000shiro Interesting! But I'm afraid this is out of scope for KAKASI functionality.
It would be better as a separate Python module, named something like kansuji_converter.

@10000shiro
Author

@miurahr Yes, I agree that this goes beyond the scope of a simple kana-kanji converter. Replacing "〇" with (maru) and letting users do the context analysis themselves is probably for the best.

@miurahr miurahr closed this as completed Jul 1, 2018
@miurahr miurahr mentioned this issue Jul 1, 2018