NormalizerNFKC: add unify_kana_prolonged_sound_mark option #1522

Merged: 25 commits, Feb 27, 2023

Contributor

@HashidaTKS HashidaTKS commented Feb 13, 2023

When unify_kana_prolonged_sound_mark is specified, NormalizerNFKC* normalizes the prolonged sound mark (ー) to the vowel of the previous kana character, or to ン or ん when the previous character is ン or ん.

Previous Char          | Normalized
Katakana & vowel is A  | ア
Katakana & vowel is I  | イ
Katakana & vowel is U  | ウ
Katakana & vowel is E  | エ
Katakana & vowel is O  | オ
Hiragana & vowel is A  | あ
Hiragana & vowel is I  | い
Hiragana & vowel is U  | う
Hiragana & vowel is E  | え
Hiragana & vowel is O  | お
ァー -> ァア, アー -> アア, ヵー -> ヵア, カー -> カア, ガー -> ガア, サー -> サア, ザー -> ザア, 
ター -> タア, ダー -> ダア, ナー -> ナア, ハー -> ハア, バー -> バア, パー -> パア, マー -> マア, 
ャー -> ャア, ヤー -> ヤア, ラー -> ラア, ヮー -> ヮア, ワー -> ワア, ヷー -> ヷア, 

ィー -> ィイ, イー -> イイ, キー -> キイ, ギー -> ギイ, シー -> シイ, ジー -> ジイ, チー -> チイ,
ヂー -> ヂイ, ニー -> ニイ, ヒー -> ヒイ, ビー -> ビイ, ピー -> ピイ, ミー -> ミイ, リー -> リイ,
ヰー -> ヰイ, ヸー -> ヸイ, 

ゥー -> ゥウ, ウー -> ウウ, クー -> クウ, グー -> グウ, スー -> スウ, ズー -> ズウ, ツー -> ツウ,
ヅー -> ヅウ, ヌー -> ヌウ, フー -> フウ, ブー -> ブウ, プー -> プウ, ムー -> ムウ, ュー -> ュウ,
ユー -> ユウ, ルー -> ルウ, ヱー -> ヱウ, ヴー -> ヴウ,

ェー -> ェエ, エー -> エエ, ヶー -> ヶエ, ケー -> ケエ, ゲー -> ゲエ, セー -> セエ, ゼー -> ゼエ,
テー -> テエ, デー -> デエ, ネー -> ネエ, ヘー -> ヘエ, ベー -> ベエ, ペー -> ペエ, メー -> メエ,
レー -> レエ, ヹー -> ヹエ,

ォー -> ォオ, オー -> オオ, コー -> コオ, ゴー -> ゴオ, ソー -> ソオ, ゾー -> ゾオ, トー -> トオ,
ドー -> ドオ, ノー -> ノオ, ホー -> ホオ, ボー -> ボオ, ポー -> ポオ, モー -> モオ, ョー -> ョオ,
ヨー -> ヨオ, ロー -> ロオ, ヲー -> ヲオ, ヺー -> ヺオ, 

ンー -> ンン

ぁー -> ぁあ, あー -> ああ, ゕー -> ゕあ, かー -> かあ, がー -> があ, さー -> さあ, ざー -> ざあ, 
たー -> たあ, だー -> だあ, なー -> なあ, はー -> はあ, ばー -> ばあ, ぱー -> ぱあ, まー -> まあ, 
ゃー -> ゃあ, やー -> やあ, らー -> らあ, ゎー -> ゎあ, わー -> わあ 

ぃー -> ぃい, いー -> いい, きー -> きい, ぎー -> ぎい, しー -> しい, じー -> じい, ちー -> ちい,
ぢー -> ぢい, にー -> にい, ひー -> ひい, びー -> びい, ぴー -> ぴい, みー -> みい, りー -> りい,
ゐー -> ゐい

ぅー -> ぅう, うー -> うう, くー -> くう, ぐー -> ぐう, すー -> すう, ずー -> ずう, つー -> つう,
づー -> づう, ぬー -> ぬう, ふー -> ふう, ぶー -> ぶう, ぷー -> ぷう, むー -> むう, ゅー -> ゅう,
ゆー -> ゆう, るー -> るう, ゑー -> ゑう, ゔー -> ゔう

ぇー -> ぇえ, えー -> ええ, ゖー -> ゖえ, けー -> けえ, げー -> げえ, せー -> せえ, ぜー -> ぜえ,
てー -> てえ, でー -> でえ, ねー -> ねえ, へー -> へえ, べー -> べえ, ぺー -> ぺえ, めー -> めえ,
れー -> れえ

ぉー -> ぉお, おー -> おお, こー -> こお, ごー -> ごお, そー -> そお, ぞー -> ぞお, とー -> とお,
どー -> どお, のー -> のお, ほー -> ほお, ぼー -> ぼお, ぽー -> ぽお, もー -> もお, ょー -> ょお,
よー -> よお, ろー -> ろお, をー -> をお

んー -> んん

Usage:

normalize \
  'NormalizerNFKC100("unify_kana_prolonged_sound_mark", true, \
                     "report_source_offset", true)' \
  "アーイーウーエーオーあーいーうーえーおー" \
  WITH_CHECKS|WITH_TYPES

We can use unify_kana_prolonged_sound_mark with unify_katakana_trailing_o or unify_prolonged_sound_mark.
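
For readers who want to see the mapping as code, here is a minimal sketch of the idea in C. It is not the code in lib/normalizer.c: the names `kana_group`, `find_replacement`, `utf8_char_length` and `unify_kana_prolonged_sound_mark_sketch` are hypothetical, the table covers only a subset of the kana listed above, and it assumes UTF-8 input and a UTF-8 source file. It relies on the fact that every kana here and U+30FC are all 3 bytes in UTF-8, so the output has the same length as the input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table for illustration only: a subset of the kana in the
 * description. Every character in these strings is 3 bytes in UTF-8. */
typedef struct {
  const char *members;     /* previous characters that share a vowel */
  const char *replacement; /* what the prolonged sound mark becomes */
} kana_group;

static const kana_group kana_groups[] = {
  {"ァアカガサザタダナハバパマャヤラワ", "ア"},
  {"ィイキギシジチヂニヒビピミリ", "イ"},
  {"ゥウクグスズツヅヌフブプムュユル", "ウ"},
  {"ェエケゲセゼテデネヘベペメレ", "エ"},
  {"ォオコゴソゾトドノホボポモョヨロヲ", "オ"},
  {"ン", "ン"},
  {"ぁあかがさざただなはばぱまゃやらわ", "あ"},
  {"ぃいきぎしじちぢにひびぴみり", "い"},
  {"ぅうくぐすずつづぬふぶぷむゅゆる", "う"},
  {"ぇえけげせぜてでねへべぺめれ", "え"},
  {"ぉおこごそぞとどのほぼぽもょよろを", "お"},
  {"ん", "ん"},
};

/* Returns the 3-byte replacement for a 3-byte previous character,
 * or NULL when the character is not in the table. */
static const char *
find_replacement(const char *previous)
{
  size_t i;
  for (i = 0; i < sizeof(kana_groups) / sizeof(kana_groups[0]); i++) {
    const char *member;
    for (member = kana_groups[i].members; *member != '\0'; member += 3) {
      if (memcmp(member, previous, 3) == 0) {
        return kana_groups[i].replacement;
      }
    }
  }
  return NULL;
}

/* Length of a UTF-8 character from its lead byte. */
static size_t
utf8_char_length(unsigned char first_byte)
{
  if (first_byte < 0x80) return 1;
  if ((first_byte & 0xe0) == 0xc0) return 2;
  if ((first_byte & 0xf0) == 0xe0) return 3;
  if ((first_byte & 0xf8) == 0xf0) return 4;
  return 1; /* unexpected byte: copy it through unchanged */
}

/* `out` needs room for strlen(in) + 1 bytes. */
static void
unify_kana_prolonged_sound_mark_sketch(const char *in, char *out)
{
  static const char mark[] = "\xe3\x83\xbc"; /* U+30FC */
  size_t in_length = strlen(in);
  const char *previous = NULL;
  size_t previous_length = 0;
  size_t i = 0;

  while (i < in_length) {
    size_t char_length = utf8_char_length((unsigned char)in[i]);
    const char *replacement = NULL;
    if (char_length > in_length - i) {
      char_length = in_length - i; /* truncated trailing character */
    }
    if (char_length == 3 && previous_length == 3 &&
        memcmp(in + i, mark, 3) == 0) {
      replacement = find_replacement(previous);
    }
    /* The replacement is 3 bytes, like the mark, so offsets stay aligned. */
    memcpy(out + i, replacement ? replacement : in + i, char_length);
    previous = in + i; /* assumption: track the input character, not the output */
    previous_length = char_length;
    i += char_length;
  }
  out[in_length] = '\0';
}
```

With this sketch, the usage input "アーイーウーエーオーあーいーうーえーおー" becomes "アアイイウウエエオオああいいううええおお", and a mark that follows a non-kana character is left unchanged.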

@HashidaTKS HashidaTKS marked this pull request as ready for review February 13, 2023 06:09
@HashidaTKS
Contributor Author

@kou @komainu8

Would you review this when you have time?

@kou
Member

kou commented Feb 13, 2023

Should this be unify_kana_prolonged_sound_mark?

@HashidaTKS
Contributor Author

Should this be unify_kana_prolonged_sound_mark?

Okay, I will rename to unify_kana_prolonged_sound_mark and add Hiragana cases

@HashidaTKS HashidaTKS changed the title from "NormalizerNFKC: add unify_katakana_prolonged_sound_mark option" to "NormalizerNFKC: add unify_kana_prolonged_sound_mark option" Feb 14, 2023
@HashidaTKS
Contributor Author

HashidaTKS commented Feb 14, 2023

@kou

Thanks for your comments.
Would you re-review this when you have time?

@HashidaTKS
Contributor Author

HashidaTKS commented Feb 14, 2023

Oops, I have added tests only for NormalizerNFKC100. We need tests for all normalizers, so I will add them.

@HashidaTKS
Contributor Author

Oops, I have added tests only for NormalizerNFKC100. We need tests for all normalizers, so I will add them.

Added.

Member

@kou kou left a comment

We want to use this feature with unify_hyphen_and_prolonged_sound_mark too.

Should we also implement unify_kana_hyphen as a separate feature?

lib/normalizer.c Outdated
@@ -1892,6 +1892,469 @@ grn_nfkc_normalize_unify_katakana_trailing_o(grn_ctx *ctx,
return current;
}

static const unsigned char *
grn_nfkc_normalize_unify_kana_prolonged_sound_mark(grn_ctx *ctx,
const unsigned char *start,
Member

Could you fix indent?

lib/normalizer.c Outdated
(previous[1] == 0x82 && previous[2] == 0x8f) ||
/* U+3095 HIRAGANA LETTER SMALL KA */
(previous[1] == 0x82 && previous[2] == 0x95)) {
unified_buffer[(*n_unified_bytes)++] = previous[0];
Member

Could you add /* U+3042 HIRAGANA LETTER A */ comment?

Contributor Author

Added.
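
As background for the byte comparisons in this thread: every kana handled here lies in U+3041..U+30FF, so its UTF-8 form is three bytes with a lead byte of 0xE3, and the checks only need to compare `previous[1]` and `previous[2]`. The sketch below shows where constants such as 0x82/0x8f come from. The helper name `encode_utf8_3bytes` is hypothetical and is not part of lib/normalizer.c.

```c
#include <stdint.h>

/* Encodes a code point in U+0800..U+FFFF as 3 UTF-8 bytes.
 * Illustration only: surrogates and out-of-range values are not rejected. */
static void
encode_utf8_3bytes(uint32_t code_point, unsigned char bytes[3])
{
  bytes[0] = (unsigned char)(0xe0 | (code_point >> 12));         /* 1110xxxx */
  bytes[1] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3f)); /* 10xxxxxx */
  bytes[2] = (unsigned char)(0x80 | (code_point & 0x3f));        /* 10xxxxxx */
}
```

For example, U+308F (わ) encodes to E3 82 8F and U+3095 (ゕ) to E3 82 95, matching the comparisons quoted above, while U+3042 (あ) encodes to E3 81 82.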

lib/normalizer.c Outdated
Comment on lines 1968 to 1970
}
/* U+3043 HIRAGANA LETTER SMALL I */
else if ((previous[1] == 0x81 && previous[2] == 0x83) ||
Member

Suggested change
- }
- /* U+3043 HIRAGANA LETTER SMALL I */
- else if ((previous[1] == 0x81 && previous[2] == 0x83) ||
+ } else if (/* U+3043 HIRAGANA LETTER SMALL I */
+ (previous[1] == 0x81 && previous[2] == 0x83) ||

Contributor Author

Fixed.

0.0
],
{
"normalized": "ああぁあいいぃいううぅうええぇえおおぉお",
Member

Should we use ぁ instead of あ for ぁー?

Contributor Author

@HashidaTKS HashidaTKS Feb 15, 2023

Do you mean ぁー should be normalized to ぁぁ or ああ or something else?

Member

ぁぁ. Why do you choose ぁあ? Easy to implement?

Contributor Author

Why do you choose ぁあ? Easy to implement?

Because ぁぁ is not pronounceable in Japanese.
Some people use ぁぁ, as in ぎゃぁぁぁ, but I think that is broken Japanese.

Contributor Author

For example, the pronunciation of ファー is closer to ファア than ファァ.

Member

I understand. I'm OK with ああ then.

Contributor Author

OK, I will fix them.

Member

Ah, sorry. ぁあ.

Contributor Author

Oh, I see.
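
The outcome of this thread (ぁー becomes ぁあ, not ぁぁ) means a mark after a small vowel is replaced with the full-size vowel. In Unicode, each small kana vowel sits immediately before its full-size form (ぁ U+3041 / あ U+3042, ァ U+30A1 / ア U+30A2, and so on), so the relationship can be sketched on code points as below. `full_size_vowel` is a hypothetical helper for illustration; the merged code works on UTF-8 bytes and may be organized differently.

```c
#include <stdint.h>

/* Returns the full-size vowel for a small kana vowel (ぁぃぅぇぉ, ァィゥェォ),
 * or the code point unchanged for anything else. */
static uint32_t
full_size_vowel(uint32_t code_point)
{
  /* Small vowels are the odd code points in U+3041..U+3049 (hiragana)
   * and U+30A1..U+30A9 (katakana); the full-size form is the next one. */
  if ((code_point >= 0x3041 && code_point <= 0x3049 && (code_point & 1)) ||
      (code_point >= 0x30a1 && code_point <= 0x30a9 && (code_point & 1))) {
    return code_point + 1;
  }
  return code_point;
}
```

So for ファー the previous character ァ (U+30A1) yields ア (U+30A2), giving ファア as discussed above.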

@@ -0,0 +1,907 @@
normalize 'NormalizerNFKC121("unify_kana_prolonged_sound_mark", true, "report_source_offset", true)' "ァーアーィーイーゥーウーェーエーォーオーヵーカーガーキーギークーグーヶーケーゲーコーゴーサーザーシージースーズーセーゼーソーゾーターダーチーヂーツーヅーテーデートードーナーニーヌーネーノーハーバーパーヒービーピーフーブープーヘーベーペーホーボーポーマーミームーメーモーャーヤーューユーョーヨーラーリールーレーローヮーワーヰーヱーヲーンーヴーヷーヸーヹーヺー" WITH_CHECKS|WITH_TYPES
Member

Can we remove this file?

Contributor Author

Removed.

@HashidaTKS
Contributor Author

We want to use this feature with unify_hyphen_and_prolonged_sound_mark too.

Should we also implement unify_kana_hyphen as a separate feature?

Sure.

@HashidaTKS
Contributor Author

@kou

Would you re-review this?

lib/normalizer.c Outdated
*n_used_characters = 1;

if (*previous_length == 3 &&
grn_nfkc_normalize_is_prolonged_sound_mark_famity(current, char_length)) {
Member

Really?
I think that we should use only U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK here.
If a user wants other prolonged sound mark variants normalized too, the user should also use unify_prolonged_sound_mark.

Contributor Author

OK, I will fix it.

Contributor Author

@HashidaTKS HashidaTKS Feb 21, 2023

Also I will add a test using this option with unify_prolonged_sound_mark.

Contributor Author

Fixed.
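
To make the requested behavior concrete, here is a minimal sketch of a check that accepts only U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK (UTF-8: E3 83 BC) and nothing else. The function name `is_kana_prolonged_sound_mark` is hypothetical; the check actually merged in lib/normalizer.c may be written differently.

```c
#include <stddef.h>

/* Returns 1 only for U+30FC encoded as UTF-8 (E3 83 BC), 0 otherwise. */
static int
is_kana_prolonged_sound_mark(const unsigned char *current, size_t char_length)
{
  return char_length == 3 &&
         current[0] == 0xe3 &&
         current[1] == 0x83 &&
         current[2] == 0xbc;
}
```

A look-alike such as U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK (EF BD B0) is rejected by this check, which is why a user who wants such characters handled would combine the option with unify_prolonged_sound_mark.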

lib/normalizer.c Outdated
Comment on lines 1965 to 1966
}
else if (/* U+3043 HIRAGANA LETTER SMALL I */
Member

Suggested change
- }
- else if (/* U+3043 HIRAGANA LETTER SMALL I */
+ } else if (/* U+3043 HIRAGANA LETTER SMALL I */

Contributor Author

Fixed.

lib/normalizer.c Outdated
(previous[1] == 0x82 && previous[2] == 0x8f) ||
/* U+3095 HIRAGANA LETTER SMALL KA */
(previous[1] == 0x82 && previous[2] == 0x95)) {
/* U+3041 HIRAGANA LETTER SMALL A */
Member

Suggested change
- /* U+3041 HIRAGANA LETTER SMALL A */
+ /* U+3041 HIRAGANA LETTER A */

Contributor Author

Fixed.

lib/normalizer.c Outdated
(previous[1] == 0x82 && previous[2] == 0x8a) ||
/* U+3090 HIRAGANA LETTER WI */
(previous[1] == 0x82 && previous[2] == 0x90)) {
/* U+3043 HIRAGANA LETTER SMALL I */
Member

Suggested change
- /* U+3043 HIRAGANA LETTER SMALL I */
+ /* U+3043 HIRAGANA LETTER I */

Contributor Author

Fixed.

@@ -2091,6 +2561,7 @@ grn_nfkc_normalize_unify(grn_ctx *ctx,
data->options->unify_katakana_di_sound ||
data->options->unify_katakana_gu_small_sounds ||
data->options->unify_katakana_trailing_o ||
data->options->unify_kana_prolonged_sound_mark ||
Member

Does this work with unify_katakana_trailing_o?

Contributor Author

I will add a test and fix the implementation if needed.

Contributor Author

@HashidaTKS HashidaTKS Feb 24, 2023

That didn't work.
I have fixed it and added tests.

@HashidaTKS
Contributor Author

@kou

Thank you for your comments.
Would you please re-review this?

@kou kou merged commit ad3e229 into master Feb 27, 2023
@kou kou deleted the add_unify_katakana_prolonged_sound_mark branch February 27, 2023 07:35