Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix same Kanji character not being replaced twice #10

Merged
merged 4 commits into from
Jun 19, 2022

Conversation

inoueakimitsu
Copy link
Contributor

Some of the kanji characters that were once furiganized
are matched again to the current regular expression.
This fix will make sure that Kanji characters in
tags will not be replaced.
To do so, a modification is made to the
regular expression for the search.

This relates #9

Some of the kanji characters that were once furiganized
are matched again to the current regular expression.
This fix will make sure that Kanji characters in
<ruby> tags will not be replaced.
To do so, a modification is made to the
regular expression for the search.
Sort tagged array in order to add
furigana for the longer Kanji series first.
Enable toggle ON/OFF of the function to give priority
to consecutive Kanji characters from options_uit.
@inoueakimitsu
Copy link
Contributor Author

To add an option to avoid splitting consecutive Kanji characters,
the following two points have been implemented.

  • Pre-sorting of tagger output to process longer kanji candidates first
  • An option prevent_splitting_consecutive_kanjis was added to options_ui

The example sentences used for testing are as follows:

  1. 作成します。作ります。
  2. 作ります。作成します。
  3. 食べる。飲食店。
  4. 飲食店。食べる。
  5. 飲食店 飲食店 飲食店 飲食店 飲食店 食べる。飲食店。 食べる。飲食店。 食べる。飲食店。食べる。飲食店。
  6. お茶について問う問題。「とうもんだい」と読んでください。
  7. 問題を問う。「もんだいをとう」と読んでください。

This relates #9

@kuanyui
Copy link
Owner

kuanyui commented Jun 19, 2022

Oh my God thanks for your PR and hard working! I'm investigating and testing your PR currently.

And add more example sentences for test:

I tried to find these sentences that cause "multiple furigana in the same kanji/tango" from NHK's Twitter.
but I don't exactly know why after I copied-and-pasted them into Github Issue, Furiganaize works correctly in most of them...
=> I guess it's because the syntax analyzer (igo.js) may give out multiple potentially correct analysis results for a same long sentence.

  1. 栃木 那須町の養豚場で豚舎3棟ほぼ全焼 豚2000頭が死んだか
  2. 水泳授業 3年ぶりの本格再開 水着も泳ぐ場所も変化が…
  3. 体に帯状に出る発疹「帯状疱疹」 (読み方は違う)
  4. 小中学生の子どもたちも春休み中でしたが外出できず、ずっと自宅で子育てしなくてはいけません。
  5. 司法取引が適用 無罪主張の元社長 業務上横領の罪で実刑確定へ
  6. ロシア海軍の7隻 伊豆諸島付近を通過し南西方向に 海自が確認
  7. 【円安加速の懸念に…】欧米の中央銀行が相次いで金融引き締めに動く中、日銀は大規模な緩和策の維持を決定
  8. 夕方のラッシュ時間帯、列車内の方もいらっしゃると思います。
  9. 強風による運転見合わせや徐行が相次いでいることから、JR西日本は最新の強風予測システムを活用して運転見合わせなどの回数を減らすための検証を行うことになりました。
  10. 自動車の安全評価で自転車への自動ブレーキ機能試験を導入
  11. 自転車が車にはねられる事故があとを絶たない中、国は車の安全性を評価する「自動車アセスメント」に、自転車に対する自動ブレーキ機能を確かめる項目を追加し試験を始めました。歩行者に比べスピードがある自転車の検知は技術的に難しいということです。
  12. 「選挙ポスターの掲示板」立候補者が多くなって、その枠が足りなくなる可能性があるとして、参院選の東京選挙区では、急きょ、掲示板の枠を増やす作業が進められています。
  13. 新検事総長に東京高検検事長の甲斐行夫氏を起用
  14. 痴漢についてそんな調査結果があります(2011年警察庁調べ)。

@kuanyui
Copy link
Owner

kuanyui commented Jun 19, 2022

The result is much better than I've expected before, I even want to make this option enable by default... But I guess I must be calm down and be patient, test it for a period before making it enabled by default.

Actually, I'm still a little worried about the reliability of RegExp new RegExp(kanji + "(?![^<]*<\/rb>)", 'g');, but donno how to prove it. Whatever, let's try to use it in real world :P

Thanks for your contribution again!

@kuanyui kuanyui merged commit af1cecd into kuanyui:master Jun 19, 2022
@kuanyui
Copy link
Owner

kuanyui commented Jun 19, 2022

Your PR has been included in latest v0.6.6, now available for update.

BTW, I add more information to the option.
Screenshot_20220619_204822

@inoueakimitsu
Copy link
Contributor Author

@kuanyui Thank you for your

  • adding sentences for testing
  • survey about igo.js's multiple outputs behavior
  • adding more information on the options_ui.

If you encounter any problems with the regular expression in question, please let me know.

@kuanyui kuanyui mentioned this pull request Aug 23, 2022
@kuanyui kuanyui mentioned this pull request Jul 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants