NormalizerNFKC: add `unify_kana_prolonged_sound_mark` option #1522

HashidaTKS · 2023-02-13T03:12:12Z

When unify_kana_prolonged_sound_mark is specified, NormalizerNFKC* normalizes the prolonged sound mark (ー) to the vowel of the preceding kana character. After ン or ん, the mark is normalized to ン or ん respectively.

| Previous char | Normalized `ー` |
| --- | --- |
| Katakana with vowel A | ア |
| Katakana with vowel I | イ |
| Katakana with vowel U | ウ |
| Katakana with vowel E | エ |
| Katakana with vowel O | オ |
| ン | ン |
| Hiragana with vowel A | あ |
| Hiragana with vowel I | い |
| Hiragana with vowel U | う |
| Hiragana with vowel E | え |
| Hiragana with vowel O | お |
| ん | ん |

ァー -> ァア, アー -> アア, ヵー -> ヵア, カー -> カア, ガー -> ガア, サー -> サア, ザー -> ザア, 
ター -> タア, ダー -> ダア, ナー -> ナア, ハー -> ハア, バー -> バア, パー -> パア, マー -> マア, 
ャー -> ャア, ヤー -> ヤア, ラー -> ラア, ヮー -> ヮア, ワー -> ワア, ヷー -> ヷア, 

ィー -> ィイ, イー -> イイ, キー -> キイ, ギー -> ギイ, シー -> シイ, ジー -> ジイ, チー -> チイ,
ヂー -> ヂイ, ニー -> ニイ, ヒー -> ヒイ, ビー -> ビイ, ピー -> ピイ, ミー -> ミイ, リー -> リイ,
ヰー -> ヰイ, ヸー -> ヸイ, 

ゥー -> ゥウ, ウー -> ウウ, クー -> クウ, グー -> グウ, スー -> スウ, ズー -> ズウ, ツー -> ツウ,
ヅー -> ヅウ, ヌー -> ヌウ, フー -> フウ, ブー -> ブウ, プー -> プウ, ムー -> ムウ, ュー -> ュウ,
ユー -> ユウ, ルー -> ルウ, ヱー -> ヱウ, ヴー -> ヴウ,

ェー -> ェエ, エー -> エエ, ヶー -> ヶエ, ケー -> ケエ, ゲー -> ゲエ, セー -> セエ, ゼー -> ゼエ,
テー -> テエ, デー -> デエ, ネー -> ネエ, ヘー -> ヘエ, ベー -> ベエ, ペー -> ペエ, メー -> メエ,
レー -> レエ, ヹー -> ヹエ,

ォー -> ォオ, オー -> オオ, コー -> コオ, ゴー -> ゴオ, ソー -> ソオ, ゾー -> ゾオ, トー -> トオ,
ドー -> ドオ, ノー -> ノオ, ホー -> ホオ, ボー -> ボオ, ポー -> ポオ, モー -> モオ, ョー -> ョオ,
ヨー -> ヨオ, ロー -> ロオ, ヲー -> ヲオ, ヺー -> ヺオ, 

ンー -> ンン

ぁー -> ぁあ, あー -> ああ, ゕー -> ゕあ, かー -> かあ, がー -> があ, さー -> さあ, ざー -> ざあ, 
たー -> たあ, だー -> だあ, なー -> なあ, はー -> はあ, ばー -> ばあ, ぱー -> ぱあ, まー -> まあ, 
ゃー -> ゃあ, やー -> やあ, らー -> らあ, ゎー -> ゎあ, わー -> わあ 

ぃー -> ぃい, いー -> いい, きー -> きい, ぎー -> ぎい, しー -> しい, じー -> じい, ちー -> ちい,
ぢー -> ぢい, にー -> にい, ひー -> ひい, びー -> びい, ぴー -> ぴい, みー -> みい, りー -> りい,
ゐー -> ゐい

ぅー -> ぅう, うー -> うう, くー -> くう, ぐー -> ぐう, すー -> すう, ずー -> ずう, つー -> つう,
づー -> づう, ぬー -> ぬう, ふー -> ふう, ぶー -> ぶう, ぷー -> ぷう, むー -> むう, ゅー -> ゅう,
ゆー -> ゆう, るー -> るう, ゑー -> ゑう, ゔー -> ゔう

ぇー -> ぇえ, えー -> ええ, ゖー -> ゖえ, けー -> けえ, げー -> げえ, せー -> せえ, ぜー -> ぜえ,
てー -> てえ, でー -> でえ, ねー -> ねえ, へー -> へえ, べー -> べえ, ぺー -> ぺえ, めー -> めえ,
れー -> れえ

ぉー -> ぉお, おー -> おお, こー -> こお, ごー -> ごお, そー -> そお, ぞー -> ぞお, とー -> とお,
どー -> どお, のー -> のお, ほー -> ほお, ぼー -> ぼお, ぽー -> ぽお, もー -> もお, ょー -> ょお,
よー -> よお, ろー -> ろお, をー -> をお

んー -> んん
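
The mapping above can be sketched as a small table lookup. This is only an illustration, not the code from lib/normalizer.c: the names `vowel_group`, `GROUPS`, `replacement_for` and this standalone `unify_kana_prolonged_sound_mark` function are hypothetical, the table lists only a few kana per row, and the sketch assumes UTF-8 input where every kana is three bytes long. A second consecutive mark is left unchanged here; the description does not say what Groonga does in that case.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 bytes of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. */
static const char PROLONGED_SOUND_MARK[] = "\xE3\x83\xBC";

/* Hypothetical table: each row pairs a replacement with some of the kana
 * that take it. Only a few kana per row are listed to keep this short. */
typedef struct {
  const char *replacement;
  const char *members; /* concatenated 3-byte UTF-8 kana */
} vowel_group;

static const vowel_group GROUPS[] = {
  {"ア", "ァアカガサ"}, {"イ", "ィイキギシ"}, {"ウ", "ゥウクグス"},
  {"エ", "ェエケゲセ"}, {"オ", "ォオコゴソ"}, {"ン", "ン"},
  {"あ", "ぁあかがさ"}, {"い", "ぃいきぎし"}, {"う", "ぅうくぐす"},
  {"え", "ぇえけげせ"}, {"お", "ぉおこごそ"}, {"ん", "ん"},
};

/* Length in bytes of the UTF-8 character that starts with `lead`. */
static size_t
utf8_char_length(unsigned char lead)
{
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1; /* invalid lead byte: copy it through as a single byte */
}

/* Returns the replacement for a mark that follows `previous` (a 3-byte
 * UTF-8 character), or NULL when `previous` is not in the table. */
static const char *
replacement_for(const char *previous)
{
  size_t i;
  for (i = 0; i < sizeof(GROUPS) / sizeof(GROUPS[0]); i++) {
    const char *member;
    for (member = GROUPS[i].members; *member != '\0'; member += 3) {
      if (memcmp(member, previous, 3) == 0) {
        return GROUPS[i].replacement;
      }
    }
  }
  return NULL;
}

/* Copies `input` to `output`, replacing each U+30FC that directly follows
 * a kana from the table. `output` needs strlen(input) + 1 bytes, because
 * every replacement is exactly as long as the mark it replaces. */
static void
unify_kana_prolonged_sound_mark(const char *input, char *output)
{
  const char *previous = NULL;
  size_t length = strlen(input);
  size_t i = 0;
  while (i < length) {
    const char *current = input + i;
    const char *replacement = NULL;
    size_t char_length = utf8_char_length((unsigned char)*current);
    if (char_length > length - i) {
      char_length = length - i; /* truncated trailing character */
    }
    if (char_length == 3 && previous != NULL &&
        memcmp(current, PROLONGED_SOUND_MARK, 3) == 0) {
      replacement = replacement_for(previous);
    }
    memcpy(output + i, replacement ? replacement : current, char_length);
    previous = (char_length == 3) ? current : NULL;
    i += char_length;
  }
  output[length] = '\0';
}
```

Because the replacement always has the same byte length as the mark, the sketch can write into a buffer of the same size as the input. The real implementation covers every kana listed above, not just the handful in this table.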

Usage:

normalize \
  'NormalizerNFKC100("unify_kana_prolonged_sound_mark", true, \
                     "report_source_offset", true)' \
  "アーイーウーエーオーあーいーうーえーおー" \
  WITH_CHECKS|WITH_TYPES

We can use unify_kana_prolonged_sound_mark with unify_katakana_trailing_o or unify_prolonged_sound_mark.

HashidaTKS · 2023-02-13T06:09:26Z

@kou @komainu8

Would you review this when you have time?

kou · 2023-02-13T06:11:32Z

Should this be unify_kana_prolonged_sound_mark?

test/command/suite/normalizers/nfkc121/unify_katakana_prolonged_sound_mark.test

HashidaTKS · 2023-02-13T06:15:05Z

> Should this be unify_kana_prolonged_sound_mark?

Okay, I will rename it to unify_kana_prolonged_sound_mark and add Hiragana cases.

HashidaTKS · 2023-02-14T05:46:33Z

@kou

Thanks for your comments.
Would you re-review this when you have time?

HashidaTKS · 2023-02-14T05:55:01Z

Oops, I have added tests only for NormalizerNFKC100, but we need tests for all normalizers. I will add them.

HashidaTKS · 2023-02-14T05:59:00Z

> Oops, I have added tests only for NormalizerNFKC100, we need to add tests for all normalizers. I will add them.

Added.

kou

We want to use this feature with unify_hyphen_and_prolonged_sound_mark too.

Should we also implement unify_kana_hyphen as a separate feature?

kou · 2023-02-15T00:43:51Z

lib/normalizer.c

@@ -1892,6 +1892,469 @@ grn_nfkc_normalize_unify_katakana_trailing_o(grn_ctx *ctx,
  return current;
 }

+static const unsigned char *
+grn_nfkc_normalize_unify_kana_prolonged_sound_mark(grn_ctx *ctx,
+                                             const unsigned char *start,


Could you fix the indentation?


kou · 2023-02-15T00:47:18Z

lib/normalizer.c

+          (previous[1] == 0x82 && previous[2] == 0x8f) ||
+          /* U+3095 HIRAGANA LETTER SMALL KA */
+          (previous[1] == 0x82 && previous[2] == 0x95)) {
+        unified_buffer[(*n_unified_bytes)++] = previous[0];


Could you add a /* U+3042 HIRAGANA LETTER A */ comment?
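
The byte comparisons in the hunk above correspond to the second and third UTF-8 bytes of each kana. A minimal sketch of that relationship follows. `encode_utf8_3bytes` and `is_hiragana_small_ka` are hypothetical helpers written for this illustration, not functions from lib/normalizer.c. The explicit check of the first byte is also an assumption, since the hunk does not show where `previous[0]` is tested.

```c
#include <stdint.h>

/* Encodes a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
 * Surrogates are not rejected; this is only meant for kana. */
static void
encode_utf8_3bytes(uint32_t code_point, unsigned char bytes[3])
{
  bytes[0] = (unsigned char)(0xE0 | (code_point >> 12));
  bytes[1] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3F));
  bytes[2] = (unsigned char)(0x80 | (code_point & 0x3F));
}

/* Hypothetical check in the style of the diff:
 * U+3095 HIRAGANA LETTER SMALL KA is E3 82 95 in UTF-8. */
static int
is_hiragana_small_ka(const unsigned char *previous)
{
  return previous[0] == 0xE3 && previous[1] == 0x82 && previous[2] == 0x95;
}
```

Encoding U+3095 this way yields the bytes 0xE3 0x82 0x95, which is why the diff compares `previous[1]` with 0x82 and `previous[2]` with 0x95. U+3042 HIRAGANA LETTER A, the replacement the comment refers to, encodes as 0xE3 0x81 0x82.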

kou · 2023-02-15T00:48:12Z

lib/normalizer.c

+      }
+               /* U+3043 HIRAGANA LETTER SMALL I */
+      else if ((previous[1] == 0x81 && previous[2] == 0x83) ||


Suggested change

-      }
-               /* U+3043 HIRAGANA LETTER SMALL I */
-      else if ((previous[1] == 0x81 && previous[2] == 0x83) ||
+      } else if (/* U+3043 HIRAGANA LETTER SMALL I */
+                 (previous[1] == 0x81 && previous[2] == 0x83) ||


kou · 2023-02-15T03:06:10Z

test/command/suite/normalizers/nfkc100/unify_kana_prolonged_sound_mark/hiragana/a.expected

+    0.0
+  ],
+  {
+    "normalized": "ああぁあいいぃいううぅうええぇえおおぉお",


Should we use あ instead of ぁ for ぁー?

Do you mean ぁー should be normalized to ぁぁ or ああ or something else?

ぁぁ. Why do you choose ぁあ? Easy to implement?

> Why do you choose ぁあ? Easy to implement?

Because ぁぁ is not pronounceable in Japanese.
Some people write ぁぁ, as in ぎゃぁぁぁ, but I think that is broken Japanese.

For example, the pronunciation of ファー is closer to ファア than ファァ.

I understand. I'm OK with ああ then.

OK, I will fix them.

Ah, sorry. ぁあ.

kou · 2023-02-15T03:10:17Z

test/command/suite/normalizers/nfkc121/unify_kana_prolonged_sound_mark.expected

@@ -0,0 +1,907 @@
+normalize 'NormalizerNFKC121("unify_kana_prolonged_sound_mark", true, "report_source_offset", true)' "ァーアーィーイーゥーウーェーエーォーオーヵーカーガーキーギークーグーヶーケーゲーコーゴーサーザーシージースーズーセーゼーソーゾーターダーチーヂーツーヅーテーデートードーナーニーヌーネーノーハーバーパーヒービーピーフーブープーヘーベーペーホーボーポーマーミームーメーモーャーヤーューユーョーヨーラーリールーレーローヮーワーヰーヱーヲーンーヴーヷーヸーヹーヺー" WITH_CHECKS|WITH_TYPES


Can we remove this file?

Co-authored-by: Sutou Kouhei <kou@clear-code.com>

HashidaTKS · 2023-02-15T04:51:33Z

> We want to use this feature with unify_hyphen_and_prolonged_sound_mark too.
>
> Should we also implement unify_kana_hyphen as a separate feature?

Sure.

HashidaTKS · 2023-02-17T01:05:15Z

@kou

Would you re-review this?

kou · 2023-02-21T08:44:33Z

lib/normalizer.c

+  *n_used_characters = 1;
+
+  if (*previous_length == 3 &&
+      grn_nfkc_normalize_is_prolonged_sound_mark_famity(current, char_length)) {


Really?
I think that we should target only U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK here.
If a user wants prolonged sound marks to be normalized, the user should use unify_prolonged_sound_mark.

OK, I will fix it.

Also I will add a test using this option with unify_prolonged_sound_mark.
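
The narrower check requested above can be sketched as a direct comparison against the UTF-8 bytes of U+30FC. The function name `is_kana_prolonged_sound_mark` is hypothetical and does not come from lib/normalizer.c.

```c
#include <stddef.h>

/* Returns 1 only for U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK,
 * which is E3 83 BC in UTF-8. Look-alike characters are rejected. */
static int
is_kana_prolonged_sound_mark(const unsigned char *current, size_t char_length)
{
  return char_length == 3 &&
         current[0] == 0xE3 &&
         current[1] == 0x83 &&
         current[2] == 0xBC;
}
```

With a check like this, look-alike characters such as U+2015 HORIZONTAL BAR (E2 80 95) are not treated as the mark, so a user who wants them handled would combine this option with unify_prolonged_sound_mark, as the review comment suggests. U+2015 is used here only as an example of a visually similar character; the thread does not list which characters the "family" check accepted.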

kou · 2023-02-21T08:46:45Z

lib/normalizer.c

+      }
+      else if (/* U+3043 HIRAGANA LETTER SMALL I */


Suggested change

-      }
-      else if (/* U+3043 HIRAGANA LETTER SMALL I */
+      } else if (/* U+3043 HIRAGANA LETTER SMALL I */

kou · 2023-02-21T08:47:17Z

lib/normalizer.c

+          (previous[1] == 0x82 && previous[2] == 0x8f) ||
+          /* U+3095 HIRAGANA LETTER SMALL KA */
+          (previous[1] == 0x82 && previous[2] == 0x95)) {
+        /* U+3041 HIRAGANA LETTER SMALL A */


Suggested change

-        /* U+3041 HIRAGANA LETTER SMALL A */
+        /* U+3041 HIRAGANA LETTER A */

kou · 2023-02-21T08:47:33Z

lib/normalizer.c

+               (previous[1] == 0x82 && previous[2] == 0x8a) ||
+               /* U+3090 HIRAGANA LETTER WI */
+               (previous[1] == 0x82 && previous[2] == 0x90)) {
+        /* U+3043 HIRAGANA LETTER SMALL I */


Suggested change

-        /* U+3043 HIRAGANA LETTER SMALL I */
+        /* U+3043 HIRAGANA LETTER I */

kou · 2023-02-21T08:52:54Z

lib/normalizer.c

@@ -2091,6 +2561,7 @@ grn_nfkc_normalize_unify(grn_ctx *ctx,
        data->options->unify_katakana_di_sound ||
        data->options->unify_katakana_gu_small_sounds ||
        data->options->unify_katakana_trailing_o ||
+        data->options->unify_kana_prolonged_sound_mark ||


Does this work with unify_katakana_trailing_o?

I will add a test and fix the implementation if needed.

That didn't work.
I have fixed it and added tests.

HashidaTKS · 2023-02-24T09:44:28Z

@kou

Thank you for your comments.
Would you please re-review this?

HashidaTKS added 3 commits February 13, 2023 12:09

Add unify_katakana_prolonged_sound_mark option

5cfd10a

Rename actual to expected

472b275

Add tests

a3309a9

HashidaTKS marked this pull request as ready for review February 13, 2023 06:09

Fix a typo

3c4cacb

kou reviewed Feb 13, 2023


test/command/suite/normalizers/nfkc121/unify_katakana_prolonged_sound_mark.test

HashidaTKS added 4 commits February 13, 2023 17:46

Rename katakana -> kana

b993ffd

Add Hiragana case

255ccf9

Add tests

82a3829

Add expected

3547739

HashidaTKS changed the title ~~NormalizerNFKC: add unify_katakana_prolonged_sound_mark option~~ NormalizerNFKC: add unify_kana_prolonged_sound_mark option Feb 14, 2023

Remove needless test

4965021

Add tests for all normalizers

3ecc235

kou reviewed Feb 15, 2023


HashidaTKS and others added 5 commits February 15, 2023 13:17

Update lib/normalizer.c

40f976b

Co-authored-by: Sutou Kouhei <kou@clear-code.com>

Fix comments

92191b6

Remove needless check

a1a5ee1

Remove needless tests

eb3d54b

Align indent

f696a9f

Use grn_nfkc_normalize_is_prolonged_sound_mark_famity

0af0b2b

Fix tests

bf225c1

kou reviewed Feb 21, 2023


HashidaTKS added 5 commits February 24, 2023 18:25

Fix tests

211956a

Target KATAKANA-HIRAGANA PROLONGED SOUND MARK only

395632d

Modify .actual to .expected

bb25286

Align "else if" position

57d1d43

Support using with unify_katakana_trailing_o

287126f

HashidaTKS and others added 3 commits February 27, 2023 10:15

Align order of assignment

456023c

Fix indent

e33e33e

Remove garbage

6f2ae63

kou merged commit ad3e229 into master Feb 27, 2023

kou deleted the add_unify_katakana_prolonged_sound_mark branch February 27, 2023 07:35

github-actions bot mentioned this pull request May 22, 2023

nginx: update bundles version 1.23.4 komainu8/groonga#1

Closed
