Add support for small ヵ/ヶ being read as large か in words #28

Merged
merged 1 commit into from
Feb 21, 2023

Conversation

ahlec
Collaborator

@ahlec ahlec commented Feb 20, 2023

This closes #27.

Currently, if you try to generate furigana for a word like 一ヵ月 or 二ヶ国, the plugin throws an exception about not being able to match the Regex group. The reason is that the readings for both of these words have "か" (full-sized) where the text has ヵ or ヶ (small). Even after we convert these to hiragana (ゕ and ゖ), they don't match the full-sized character. When we process these two characters specifically, we need to register that they should match not only their own hiragana forms (ゕ and ゖ) but also full-sized か.
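To make the failure concrete, here is a minimal sketch of the mismatch. The helper `kata_to_hira` is hypothetical (it is not the plugin's actual conversion code) and assumes a simple code-point shift from katakana to hiragana:

```python
import re


def kata_to_hira(text):
    # Hypothetical helper: katakana U+30A1..U+30F6 map onto hiragana
    # by subtracting 0x60 from the code point (ヶ -> ゖ, ヵ -> ゕ).
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in text
    )


reading = "かげつ"  # the reading reported for ヶ月
old_pattern = "^" + kata_to_hira("ヶ") + "(.+?)$"  # literal small ゖ, then the kanji group

# Small ゖ (U+3096) is a different character from full-sized か (U+304B),
# so the old pattern cannot match the reading.
matches = re.match(old_pattern, reading) is not None
print(old_pattern, matches)
```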

In this PR, I rework part of the kanjiToRegex function for the case where we're processing a regular kana character. If the single hiragana/katakana character we're looking at has possible readings beyond its own, then instead of emitting a string literal in the Regex, we now emit a Regex capture group listing the alternatives.

  • Given kanjiToRegex("ヶ月"):
    • Before: ^ゖ(.+?)$
    • After: ^(ゖ|か)(.+?)$
    • In both cases, the regular expression is then matched against reading = "かげつ" (via MeCab); only the After pattern matches

If we don't have additional readings, we'll continue to go down the regular pathway, where we just output the hiragana directly.

  • Given kanjiToRegex("ローマ字"):
    • Before: ^ろーま(.+?)$
    • After: ^ろーま(.+?)$ (no change)
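The two cases above can be sketched roughly as follows. This is an illustrative reimplementation, not the plugin's code: `kanji_to_regex_sketch`, `EXTRA_READINGS`, `is_kana`, and `to_hiragana` are all assumed names, and the kana detection is deliberately simplistic.

```python
import re

# Assumed lookup table: hiragana form of the small kana -> extra readings.
EXTRA_READINGS = {"ゕ": ["か"], "ゖ": ["か"]}


def is_kana(ch):
    # Hiragana U+3041..U+3096, katakana U+30A1..U+30FC (includes ー).
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fc"


def to_hiragana(ch):
    # Shift katakana onto hiragana; everything else is left unchanged.
    return chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch


def kanji_to_regex_sketch(text):
    parts = []
    in_kanji_run = False
    for ch in text:
        if not is_kana(ch):
            # One lazy capture group per run of non-kana (kanji) characters.
            if not in_kanji_run:
                parts.append("(.+?)")
                in_kanji_run = True
            continue
        in_kanji_run = False
        hira = to_hiragana(ch)
        extras = EXTRA_READINGS.get(hira)
        if extras:
            # Kana with alternative readings: capture group of alternatives.
            parts.append("(" + "|".join([hira] + extras) + ")")
        else:
            # Regular pathway: output the hiragana directly.
            parts.append(hira)
    return "^" + "".join(parts) + "$"


print(kanji_to_regex_sketch("ヶ月"))
print(kanji_to_regex_sketch("ローマ字"))
print(re.match(kanji_to_regex_sketch("一ヵ月"), "いっかげつ").groups())
```

Keeping the alternatives in a capture group rather than a non-capturing one means the matched reading for ヵ/ヶ is available afterwards, which is what makes it possible to emit furigana for that character.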

In the case of ヵ and ヶ, I've chosen to include furigana readings for them. I did this because Jisho includes readings in this situation, and because one character is being read as a different character. This is very non-standard and can easily trip up beginner learners of Japanese.
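As a rough illustration of what including a reading for ヵ means in practice, the sketch below pairs each captured group with its source segment. The function `attach_furigana`, the pre-split `segments` list, and the `text[reading]` output format are assumptions made for this example, not necessarily how the plugin formats its output.

```python
import re


def attach_furigana(segments, pattern, reading):
    # Assumes one capture group per segment, in order.
    match = re.match(pattern, reading)
    if match is None:
        raise ValueError("reading does not match pattern")
    return " ".join(
        f"{seg}[{kana}]" for seg, kana in zip(segments, match.groups())
    )


# ヵ gets its own reading (か) instead of being emitted as bare kana.
result = attach_furigana(["一", "ヵ", "月"], "^(.+?)(ゕ|か)(.+?)$", "いっかげつ")
print(result)
```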

I've added unit tests to track all of this and prevent regressions.

I've tested this change in both Anki 2.1.54 and Anki 2.1.49 (the last version before the Python version bump).

@obynio
Owner

obynio commented Feb 21, 2023

Seems good to me, I'll deploy this change asap

@obynio obynio merged commit 545631b into master Feb 21, 2023
@obynio obynio deleted the ahlec/kagetsu branch February 21, 2023 11:43
@obynio
Owner

obynio commented Feb 21, 2023

This change has been released in version 1.4.2

Development

Successfully merging this pull request may close these issues.

Error when generating readings including ヶ月
2 participants