Multiple roots occurred in ヤニス・スマラグディス監督 #42
Comments
Similar result in "ドミニコ会修道士のハインリッヒ・クレイマーはザルツブルク大司教の助手を務めた。"
|
Another result in "ライトの兄弟オスカーはコミックブック作家だ。"
|
It seems that GiNZA considered the first input "ヤニス・スマラグディス監督の『エル・グレコ』であった。" to be two separate sentences: "ヤニス・スマラグ" and "ディス監督の『エル・グレコ』であった。". And that is the reason why there are two roots in the result. I think this is essentially the same behavior as the example below which also has two roots.
|
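The link between sentence splitting and multiple roots can be checked without GiNZA itself. The following is a minimal sketch, not GiNZA or spaCy code: it takes a list of 1-based head indices (0 meaning root, as in CoNLL-U) and groups tokens by the root they reach. The function name `group_by_root` and the `heads` values are made up for illustration; the heads are chosen only so that the roots fall at tokens 3 and 8, as in the original report, and are not actual parser output.

```python
from collections import defaultdict


def group_by_root(heads):
    """Map each root token id to the ids of the tokens in its tree.

    heads[i] is the 1-based head of token i+1; 0 marks a root (CoNLL-U style).
    """
    def find_root(tok):
        seen = set()
        # Follow head links upward until a token whose head is 0.
        while heads[tok - 1] != 0:
            if tok in seen:
                raise ValueError("cycle in head indices")
            seen.add(tok)
            tok = heads[tok - 1]
        return tok

    trees = defaultdict(list)
    for tok in range(1, len(heads) + 1):
        trees[find_root(tok)].append(tok)
    return dict(trees)


# Hypothetical heads for a 9-token input, NOT actual GiNZA output.
heads = [3, 3, 0, 8, 4, 8, 6, 0, 8]
trees = group_by_root(heads)
print(trees)  # {3: [1, 2, 3], 8: [4, 5, 6, 7, 8, 9]}
```

Two keys in the result mean two separate trees, which is what one would expect to see if the parser treated the input as two sentences.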
@KoichiYasuoka Thank you for reporting these useful examples. I analyzed the sentences and found that spaCy's parsing logic returns multiple roots in these situations. I suppose this is a kind of unexpected behavior in spaCy v2.1.x. I've pushed a patch that revises the "root_as_xxx" type POS disambiguation, but it is unrelated to the multiple-root issue above. I'd like to keep this issue open for future work. @TomokiMatsumoto Thanks for your analysis. I think your observation is correct. |
#42 Use 'dep' label when root_as_xxx arises for non-root tokens
GiNZA v2.2.0 has improved the result for two of the three examples I showed above, but the result for "ライトの兄弟オスカーはコミックブック作家だ。" has not improved yet:
|
Could you please test the new ja_ginza-3.0.0 model with some sentences and report any analysis errors you find? @KoichiYasuoka @TomokiMatsuno |
I've just tried
Umm... |
Are you using pipenv? |
No, I don't use pipenv in any of these environments: Debian, Mac OS X (High Sierra), and Cygwin64. The error output on Mac OS X (High Sierra) is as follows:
|
Incidentally, I extracted ja-ginza-3.0.0.tar.gz and found that system.dic was an empty file. |
@fortharrow Sure. The system.dic file is empty in the ja_ginza package, and it should be overwritten when ja_ginza/setup.py is executed during the pip install process. I'm going to investigate this. |
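One way to confirm whether an extracted or installed package still contains an empty dictionary file is to scan the directory for zero-byte files. This is only a sketch under assumptions: `find_empty_files` is a hypothetical helper, and the directory layout in the demo is invented; in practice `root` would point at wherever the package was extracted or installed.

```python
import tempfile
from pathlib import Path


def find_empty_files(root, name="system.dic"):
    """Return paths under `root` named `name` whose size is 0 bytes."""
    return sorted(
        p for p in Path(root).rglob(name)
        if p.is_file() and p.stat().st_size == 0
    )


# Demo on a throwaway directory (layout is invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "a" / "system.dic"
    full = Path(tmp) / "b" / "system.dic"
    empty.parent.mkdir()
    full.parent.mkdir()
    empty.write_bytes(b"")
    full.write_bytes(b"not empty")
    found = [p.relative_to(tmp).as_posix() for p in find_empty_files(tmp)]

print(found)  # ['a/system.dic']
```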
@KoichiYasuoka Thanks for testing. Could you please paste the version number of pip and the log of |
For Cygwin64:
|
Umm... It seems not to have accessed to |
@KoichiYasuoka Sure. The download process of SudachiDict_core*.zip does not appear in the pip log. |
|
@KoichiYasuoka I just released GiNZA v3.1.0 with some implementation improvements around sudachidict distribution. Please try it and report any errors you find.
|
In my environments of Debian, Mac OS X (High Sierra) and Cygwin64 (with
$ pip3.7 install 'spacy>=2.2.3' --no-build-isolation
$ pip3.7 install 'ginza>=3.1.0'
$ echo ライトの兄弟オスカーはコミックブック作家だ。 | ginza
# text = ライトの兄弟オスカーはコミックブック作家だ。
1 ライト ライト NOUN 名詞-普通名詞-一般 _ 3 nmod _ BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B
2 の の ADP 助詞-格助詞 _ 1 case _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
3 兄弟 兄弟 NOUN 名詞-普通名詞-一般 _ 4 compound _ BunsetuBILabel=B|BunsetuPositionType=CONT|SpaceAfter=No|NP_B
4 オスカー オスカー PROPN 名詞-固有名詞-人名-一般 _ 7 nsubj _ BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_I
5 は は ADP 助詞-係助詞 _ 4 case _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
6 コミックブック コミックブック NOUN 名詞-普通名詞-一般 _ 7 compound _ BunsetuBILabel=B|BunsetuPositionType=CONT|SpaceAfter=No|NP_B
7 作家 作家 NOUN 名詞-普通名詞-一般 _ 0 root _ BunsetuBILabel=I|BunsetuPositionType=ROOT|SpaceAfter=No|NP_I
8 だ だ AUX 助動詞 _ 7 cop _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
9 。 。 PUNCT 補助記号-句点 _ 7 punct _ BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No |
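The single-root result above can also be checked mechanically. The following is a minimal sketch, assuming standard tab-separated CoNLL-U where column 7 is HEAD and a value of 0 marks the root; `root_ids` is a hypothetical helper, not part of GiNZA. The rows are copied from the output above, with the MISC column replaced by `_` for brevity.

```python
def root_ids(conllu_text):
    """Return the IDs of tokens whose HEAD column is 0 in CoNLL-U text."""
    roots = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if cols[6] == "0":  # column 7 (index 6) is HEAD
            roots.append(int(cols[0]))
    return roots


# Rows taken from the output above; MISC shortened to "_".
rows = [
    ("1", "ライト", "ライト", "NOUN", "名詞-普通名詞-一般", "_", "3", "nmod", "_", "_"),
    ("2", "の", "の", "ADP", "助詞-格助詞", "_", "1", "case", "_", "_"),
    ("3", "兄弟", "兄弟", "NOUN", "名詞-普通名詞-一般", "_", "4", "compound", "_", "_"),
    ("4", "オスカー", "オスカー", "PROPN", "名詞-固有名詞-人名-一般", "_", "7", "nsubj", "_", "_"),
    ("5", "は", "は", "ADP", "助詞-係助詞", "_", "4", "case", "_", "_"),
    ("6", "コミックブック", "コミックブック", "NOUN", "名詞-普通名詞-一般", "_", "7", "compound", "_", "_"),
    ("7", "作家", "作家", "NOUN", "名詞-普通名詞-一般", "_", "0", "root", "_", "_"),
    ("8", "だ", "だ", "AUX", "助動詞", "_", "7", "cop", "_", "_"),
    ("9", "。", "。", "PUNCT", "補助記号-句点", "_", "7", "punct", "_", "_"),
]
sample = "# text = ライトの兄弟オスカーはコミックブック作家だ。\n"
sample += "\n".join("\t".join(r) for r in rows)

roots = root_ids(sample)
print(roots)  # [7]
```

A list with more than one ID would indicate the multiple-root situation discussed in this issue.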
I've just written GiNZA v3.1.0で読む「ライトの兄弟オスカーはコミックブック作家だ。」 in my blog. Thank you @hiroshi-matsuda-rit and now I close this issue. |
I've just tried the sentence "ヤニス・スマラグディス監督の『エル・グレコ』であった。" and got a curious result.
We can see that multiple roots exist, at tokens 3 and 8. It seems to be some kind of bug in
compound
but I'm not sure of the reason...