Add support for small ヵ/ヶ being read as large か in words #28
This closes #27.
Currently, if you try to generate furigana for a word like 一ヵ月 or 二ヶ国, the plugin throws an exception about not being able to match the Regex group. The reason is that the readings for both words have "か" (full-sized) where the text has ヵ or ヶ (small). Even after we convert these to hiragana (ゕ and ゖ), they don't match the full-sized character. When processing these two characters specifically, we need to register that they should match not only their own hiragana forms (ゕ and ゖ) but also full-sized か.
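The mismatch can be reproduced with Python's standard `re` module alone. This is a minimal sketch, not the plugin's code: the two patterns and the reading are the ones quoted in this description for ヶ月, and the variable names are only illustrative.

```python
import re

# Reading for ヶ月 as quoted in this description.
reading = "かげつ"

# Pattern built from the small kana only, and the same pattern
# with full-sized か added as an alternative.
small_only = r"^ゖ(.+?)$"
with_alternative = r"^(ゖ|か)(.+?)$"

# The small-kana literal never matches a reading that starts with か.
failed = re.match(small_only, reading)

# With the alternation, the kana and the kanji reading are captured separately.
matched = re.match(with_alternative, reading)
kana_part, kanji_part = matched.groups()
```

Here `failed` is `None`, which corresponds to the "cannot match the Regex group" failure, while `matched` splits the reading into the kana part and the part belonging to 月.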
In this PR, I rework part of the kanjiToRegex function for the case where we're processing a regular kana character. If the single hiragana/katakana character we're looking at has additional possible readings beyond its own, then instead of producing a string literal within the Regex, we now produce a Regex capture group.

kanjiToRegex("ヶ月"):
Before: ^ゖ(.+?)$
After: ^(ゖ|か)(.+?)$
reading = "かげつ" (via MeCab)

If we don't have additional readings, we continue down the regular path and output the hiragana directly.
kanjiToRegex("ローマ字"):
Before: ^ろーま(.+?)$
After: ^ろーま(.+?)$ (no change)

For ヵ and ヶ, I've chosen to include furigana readings. I did this because Jisho includes readings in this situation, and because one character is being read as a different character, which is very non-standard and can easily trip up beginning learners of Japanese.
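The kana handling described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the plugin's actual implementation: `sketch_kanji_to_regex`, `EXTRA_READINGS`, `to_hiragana`, and `is_kanji` are hypothetical names, the kanji check covers only the basic CJK block, and merging consecutive kanji into one group is my assumption.

```python
import re

# Hypothetical table: hiragana forms that may also be read as another kana.
EXTRA_READINGS = {"ゕ": ["か"], "ゖ": ["か"]}


def to_hiragana(ch: str) -> str:
    """Shift katakana in the range ァ..ヶ to hiragana; leave other characters alone."""
    if "ァ" <= ch <= "ヶ":
        return chr(ord(ch) - 0x60)
    return ch


def is_kanji(ch: str) -> bool:
    # Simplification: basic CJK Unified Ideographs block plus the iteration mark.
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"


def sketch_kanji_to_regex(text: str) -> str:
    parts = []
    prev_kanji = False
    for ch in text:
        if is_kanji(ch):
            # Assumption: a run of kanji shares one lazy capture group.
            if not prev_kanji:
                parts.append("(.+?)")
            prev_kanji = True
            continue
        prev_kanji = False
        kana = to_hiragana(ch)
        extras = EXTRA_READINGS.get(kana)
        if extras:
            # Kana with extra readings becomes a capture group of alternatives.
            parts.append("(" + "|".join([kana, *extras]) + ")")
        else:
            # Regular pathway: emit the hiragana as a literal.
            parts.append(re.escape(kana))
    return "^" + "".join(parts) + "$"


pattern = sketch_kanji_to_regex("二ヶ国")
groups = re.match(pattern, "にかこく").groups()
```

Because the small kana gets its own capture group, the matched reading for ヵ/ヶ is available separately from the surrounding kanji readings, which is what makes it possible to attach furigana to it.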
I've added unit tests to track all of this and prevent regressions.
I've tested this change in both Anki 2.1.54 and Anki 2.1.49 (the version prior to the Python 2.10 bump).