-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NormalizerNFKC: add unify_kana_prolonged_sound_mark
option
#1522
Conversation
Should this be |
test/command/suite/normalizers/nfkc121/unify_katakana_prolonged_sound_mark.test
Outdated
Show resolved
Hide resolved
Okay, I will rename to |
unify_katakana_prolonged_sound_mark
optionunify_kana_prolonged_sound_mark
option
Thanks for your comments. |
Oops, I have added tests only for NormalizerNFKC100, we need to add tests for all normalizers. I will add them. |
Added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to use this feature with unify_hyphen_and_prolonged_sound_mark
too.
Should we also implement unify_kana_hyphen
as a separated feature?
lib/normalizer.c
Outdated
@@ -1892,6 +1892,469 @@ grn_nfkc_normalize_unify_katakana_trailing_o(grn_ctx *ctx, | |||
return current; | |||
} | |||
|
|||
static const unsigned char * | |||
grn_nfkc_normalize_unify_kana_prolonged_sound_mark(grn_ctx *ctx, | |||
const unsigned char *start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix indent?
(previous[1] == 0x82 && previous[2] == 0x8f) || | ||
/* U+3095 HIRAGANA LETTER SMALL KA */ | ||
(previous[1] == 0x82 && previous[2] == 0x95)) { | ||
unified_buffer[(*n_unified_bytes)++] = previous[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add /* U+3042 HIRAGANA LETTER A */
comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
lib/normalizer.c
Outdated
} | ||
/* U+3043 HIRAGANA LETTER SMALL I */ | ||
else if ((previous[1] == 0x81 && previous[2] == 0x83) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
/* U+3043 HIRAGANA LETTER SMALL I */ | |
else if ((previous[1] == 0x81 && previous[2] == 0x83) || | |
} else if (/* U+3043 HIRAGANA LETTER SMALL I */ | |
(previous[1] == 0x81 && previous[2] == 0x83) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
0.0 | ||
], | ||
{ | ||
"normalized": "ああぁあいいぃいううぅうええぇえおおぉお", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use あ
instead of ぁ
for ぁー
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean ぁー
should be normalized to ぁぁ
or ああ
or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ぁぁ
. Why do you choose ぁあ
? Easy to implement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you choose
ぁあ?
Easy to implement?
Because ぁぁ
is not pronounceable in Japanese.
Some one uses ぁぁ
like ぎゃぁぁぁ
, but I think it is broken Japanese.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, the pronunciation of ファー
is closer to ファア
than ファァ
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. I'm OK with ああ
then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will fix them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry. ぁあ
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see.
@@ -0,0 +1,907 @@ | |||
normalize 'NormalizerNFKC121("unify_kana_prolonged_sound_mark", true, "report_source_offset", true)' "ァーアーィーイーゥーウーェーエーォーオーヵーカーガーキーギークーグーヶーケーゲーコーゴーサーザーシージースーズーセーゼーソーゾーターダーチーヂーツーヅーテーデートードーナーニーヌーネーノーハーバーパーヒービーピーフーブープーヘーベーペーホーボーポーマーミームーメーモーャーヤーューユーョーヨーラーリールーレーローヮーワーヰーヱーヲーンーヴーヷーヸーヹーヺー" WITH_CHECKS|WITH_TYPES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Sure. |
Would you re-review this? |
lib/normalizer.c
Outdated
*n_used_characters = 1; | ||
|
||
if (*previous_length == 3 && | ||
grn_nfkc_normalize_is_prolonged_sound_mark_famity(current, char_length)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really?
I think that we should use only U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
here.
If an user wants normalized prolonged sound mark, the user should use unify_prolonged_sound_mark
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I will add a test using this option with unify_prolonged_sound_mark
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
lib/normalizer.c
Outdated
} | ||
else if (/* U+3043 HIRAGANA LETTER SMALL I */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
else if (/* U+3043 HIRAGANA LETTER SMALL I */ | |
} else if (/* U+3043 HIRAGANA LETTER SMALL I */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
lib/normalizer.c
Outdated
(previous[1] == 0x82 && previous[2] == 0x8f) || | ||
/* U+3095 HIRAGANA LETTER SMALL KA */ | ||
(previous[1] == 0x82 && previous[2] == 0x95)) { | ||
/* U+3041 HIRAGANA LETTER SMALL A */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* U+3041 HIRAGANA LETTER SMALL A */ | |
/* U+3041 HIRAGANA LETTER A */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
lib/normalizer.c
Outdated
(previous[1] == 0x82 && previous[2] == 0x8a) || | ||
/* U+3090 HIRAGANA LETTER WI */ | ||
(previous[1] == 0x82 && previous[2] == 0x90)) { | ||
/* U+3043 HIRAGANA LETTER SMALL I */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* U+3043 HIRAGANA LETTER SMALL I */ | |
/* U+3043 HIRAGANA LETTER I */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -2091,6 +2561,7 @@ grn_nfkc_normalize_unify(grn_ctx *ctx, | |||
data->options->unify_katakana_di_sound || | |||
data->options->unify_katakana_gu_small_sounds || | |||
data->options->unify_katakana_trailing_o || | |||
data->options->unify_kana_prolonged_sound_mark || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work with unify_katakana_trailing_o
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add test and fix implementation if need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That didn't work.
I have fixed it and added tests.
Thank you for your comments. |
NormalizerNFKC*
normalizeprolonged_sound_mark
(ー
) to the vowel of the previous kana character orン
orん
whenunify_kana_prolonged_sound_mark
is specified.ー
Usage:
We can use
unify_kana_prolonged_sound_mark
withunify_katakana_trailing_o
orunify_prolonged_sound_mark
.