NormalizerNFKC: add `unify_kana_hyphen` option #1526

HashidaTKS · 2023-02-20T05:12:26Z

When `unify_kana_hyphen` is specified, NormalizerNFKC* normalizes a hyphen (`-`) to the vowel of the previous kana character, or to ン / ん when the previous character is ン / ん.

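| Previous character | `-` is normalized to |
| --- | --- |
| Katakana whose vowel is A | ア |
| Katakana whose vowel is I | イ |
| Katakana whose vowel is U | ウ |
| Katakana whose vowel is E | エ |
| Katakana whose vowel is O | オ |
| ン | ン |
| Hiragana whose vowel is A | あ |
| Hiragana whose vowel is I | い |
| Hiragana whose vowel is U | う |
| Hiragana whose vowel is E | え |
| Hiragana whose vowel is O | お |
| ん | ん |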

ァ- -> ァア, ア- -> アア, ヵ- -> ヵア, カ- -> カア, ガ- -> ガア, サ- -> サア, ザ- -> ザア, 
タ- -> タア, ダ- -> ダア, ナ- -> ナア, ハ- -> ハア, バ- -> バア, パ- -> パア, マ- -> マア, 
ャ- -> ャア, ヤ- -> ヤア, ラ- -> ラア, ヮ- -> ヮア, ワ- -> ワア, ヷ- -> ヷア, 

ィ- -> ィイ, イ- -> イイ, キ- -> キイ, ギ- -> ギイ, シ- -> シイ, ジ- -> ジイ, チ- -> チイ,
ヂ- -> ヂイ, ニ- -> ニイ, ヒ- -> ヒイ, ビ- -> ビイ, ピ- -> ピイ, ミ- -> ミイ, リ- -> リイ,
ヰ- -> ヰイ, ヸ- -> ヸイ, 

ゥ- -> ゥウ, ウ- -> ウウ, ク- -> クウ, グ- -> グウ, ス- -> スウ, ズ- -> ズウ, ツ- -> ツウ,
ヅ- -> ヅウ, ヌ- -> ヌウ, フ- -> フウ, ブ- -> ブウ, プ- -> プウ, ム- -> ムウ, ュ- -> ュウ,
ユ- -> ユウ, ル- -> ルウ, ヱ- -> ヱウ, ヴ- -> ヴウ,

ェ- -> ェエ, エ- -> エエ, ヶ- -> ヶエ, ケ- -> ケエ, ゲ- -> ゲエ, セ- -> セエ, ゼ- -> ゼエ,
テ- -> テエ, デ- -> デエ, ネ- -> ネエ, ヘ- -> ヘエ, ベ- -> ベエ, ペ- -> ペエ, メ- -> メエ,
レ- -> レエ, ヹ- -> ヹエ,

ォ- -> ォオ, オ- -> オオ, コ- -> コオ, ゴ- -> ゴオ, ソ- -> ソオ, ゾ- -> ゾオ, ト- -> トオ,
ド- -> ドオ, ノ- -> ノオ, ホ- -> ホオ, ボ- -> ボオ, ポ- -> ポオ, モ- -> モオ, ョ- -> ョオ,
ヨ- -> ヨオ, ロ- -> ロオ, ヲ- -> ヲオ, ヺ- -> ヺオ, 

ン- -> ンン

ぁ- -> ぁあ, あ- -> ああ, ゕ- -> ゕあ, か- -> かあ, が- -> があ, さ- -> さあ, ざ- -> ざあ, 
た- -> たあ, だ- -> だあ, な- -> なあ, は- -> はあ, ば- -> ばあ, ぱ- -> ぱあ, ま- -> まあ, 
ゃ- -> ゃあ, や- -> やあ, ら- -> らあ, ゎ- -> ゎあ, わ- -> わあ 

ぃ- -> ぃい, い- -> いい, き- -> きい, ぎ- -> ぎい, し- -> しい, じ- -> じい, ち- -> ちい,
ぢ- -> ぢい, に- -> にい, ひ- -> ひい, び- -> びい, ぴ- -> ぴい, み- -> みい, り- -> りい,
ゐ- -> ゐい

ぅ- -> ぅう, う- -> うう, く- -> くう, ぐ- -> ぐう, す- -> すう, ず- -> ずう, つ- -> つう,
づ- -> づう, ぬ- -> ぬう, ふ- -> ふう, ぶ- -> ぶう, ぷ- -> ぷう, む- -> むう, ゅ- -> ゅう,
ゆ- -> ゆう, る- -> るう, ゑ- -> ゑう, ゔ- -> ゔう

ぇ- -> ぇえ, え- -> ええ, ゖ- -> ゖえ, け- -> けえ, げ- -> げえ, せ- -> せえ, ぜ- -> ぜえ,
て- -> てえ, で- -> でえ, ね- -> ねえ, へ- -> へえ, べ- -> べえ, ぺ- -> ぺえ, め- -> めえ,
れ- -> れえ

ぉ- -> ぉお, お- -> おお, こ- -> こお, ご- -> ごお, そ- -> そお, ぞ- -> ぞお, と- -> とお,
ど- -> どお, の- -> のお, ほ- -> ほお, ぼ- -> ぼお, ぽ- -> ぽお, も- -> もお, ょ- -> ょお,
よ- -> よお, ろ- -> ろお, を- -> をお

ん- -> んん
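
The mapping above can be pictured as a lookup from the previous kana to its replacement. The following is a minimal standalone sketch in C, not the code in lib/normalizer.c: the table covers only a handful of kana, input is assumed to be valid UTF-8, and the names `kana_vowel_pair`, `lookup_vowel`, `utf8_char_length` and `unify_kana_hyphen_sketch` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily truncated table: previous kana -> replacement.
 * Every kana used here is 3 bytes long in UTF-8. */
typedef struct {
  const char *kana;
  const char *replacement;
} kana_vowel_pair;

static const kana_vowel_pair kana_vowel_table[] = {
  {"ア", "ア"}, {"カ", "ア"}, {"イ", "イ"}, {"キ", "イ"},
  {"ウ", "ウ"}, {"エ", "エ"}, {"オ", "オ"}, {"ン", "ン"},
  {"あ", "あ"}, {"か", "あ"}, {"い", "い"}, {"う", "う"},
  {"え", "え"}, {"お", "お"}, {"ん", "ん"},
};

/* Returns the replacement for the 3-byte kana at `previous`, or NULL. */
static const char *
lookup_vowel(const char *previous)
{
  size_t n = sizeof(kana_vowel_table) / sizeof(kana_vowel_table[0]);
  for (size_t i = 0; i < n; i++) {
    if (memcmp(previous, kana_vowel_table[i].kana, 3) == 0) {
      return kana_vowel_table[i].replacement;
    }
  }
  return NULL;
}

/* Byte length of a UTF-8 character, derived from its lead byte. */
static size_t
utf8_char_length(unsigned char lead)
{
  if (lead < 0x80) return 1;
  if ((lead & 0xe0) == 0xc0) return 2;
  if ((lead & 0xf0) == 0xe0) return 3;
  return 4;
}

/* Copies input to output, replacing each '-' that directly follows a
 * kana found in the table. Returns false if output is too small. */
static bool
unify_kana_hyphen_sketch(const char *input, char *output, size_t output_size)
{
  size_t in_length = strlen(input);
  size_t out = 0;
  size_t previous_length = 0;
  if (output_size == 0) {
    return false;
  }
  for (size_t i = 0; i < in_length;) {
    size_t char_length = utf8_char_length((unsigned char)input[i]);
    if (char_length > in_length - i) {
      char_length = in_length - i;
    }
    const char *source = input + i;
    size_t source_length = char_length;
    if (char_length == 1 && input[i] == '-' && previous_length == 3) {
      const char *vowel = lookup_vowel(input + i - 3);
      if (vowel) {
        source = vowel;
        source_length = 3;
      }
    }
    if (out + source_length + 1 > output_size) {
      return false;
    }
    memcpy(output + out, source, source_length);
    out += source_length;
    previous_length = char_length;
    i += char_length;
  }
  output[out] = '\0';
  return true;
}
```

In this sketch a hyphen that follows a character missing from the table, such as an ASCII letter or another hyphen, is copied unchanged.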

Usage:

normalize \
  'NormalizerNFKC100("unify_kana_hyphen", true, \
                     "report_source_offset", true)' \
  "ア-イ-ウ-エ-オ-あ-い-う-え-お-" \
  WITH_CHECKS|WITH_TYPES

HashidaTKS · 2023-02-27T02:42:19Z

@kou

Would you please review this?
(Could you please review #1520 first? It has higher priority than this one.)

HashidaTKS · 2023-02-27T09:28:46Z

Rebased to the latest master branch.

@kou

Would you please review this?

HashidaTKS · 2023-02-28T05:54:21Z

@kou

Would you review this?

kou · 2023-02-28T05:59:20Z

lib/normalizer.c

@@ -988,6 +988,32 @@ grn_nfkc_normalize_unify_katakana_voiced_sound_mark(const unsigned char *utf8_ch
  return utf8_char;
 }

+grn_inline static grn_bool


Could you use bool for newly added code?

Suggested change

-grn_inline static grn_bool
+grn_inline static bool

kou · 2023-02-28T06:00:02Z

lib/normalizer.c

+  if (length == 1 &&
+      utf8_char[0] == '-') {
+    /* U+002D HYPHEN-MINUS */
+    return GRN_TRUE;
+  }
+  return GRN_FALSE;


How about simplifying this?

Suggested change

-  if (length == 1 &&
-      utf8_char[0] == '-') {
-    /* U+002D HYPHEN-MINUS */
-    return GRN_TRUE;
-  }
-  return GRN_FALSE;
+  /* U+002D HYPHEN-MINUS */
+  return (length == 1 && utf8_char[0] == '-');

kou · 2023-02-28T06:00:12Z

lib/normalizer.c

+grn_inline static grn_bool
+grn_nfkc_normalize_is_prolonged_sound_mark(const unsigned char *utf8_char,
+                                           size_t length)
+{
+  if (length == 3 &&
+      utf8_char[0] == 0xe3 &&
+      utf8_char[1] == 0x83 &&
+      utf8_char[2] == 0xbc) {
+    /* U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK */
+    return GRN_TRUE;
+  }
+  return GRN_FALSE;
+}
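
For reference, both predicates reduce to a single boolean expression once `bool` is used. This is a minimal standalone sketch; the short names `is_hyphen` and `is_prolonged_sound_mark` are stand-ins for the `grn_nfkc_normalize_*` functions and this is not the merged code.

```c
#include <stdbool.h>
#include <stddef.h>

/* U+002D HYPHEN-MINUS is the single byte 0x2d in UTF-8. */
static bool
is_hyphen(const unsigned char *utf8_char, size_t length)
{
  return (length == 1 && utf8_char[0] == '-');
}

/* U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK is the byte
 * sequence e3 83 bc in UTF-8. */
static bool
is_prolonged_sound_mark(const unsigned char *utf8_char, size_t length)
{
  return (length == 3 &&
          utf8_char[0] == 0xe3 &&
          utf8_char[1] == 0x83 &&
          utf8_char[2] == 0xbc);
}
```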


kou · 2023-02-28T06:00:39Z

lib/normalizer.c

+typedef grn_bool
+(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
+                                          size_t length);


Suggested change

-typedef grn_bool
-(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
-                                          size_t length);
+typedef bool
+(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
+                                          size_t length);

kou · 2023-02-28T07:18:33Z

lib/normalizer.c

-                                                   size_t *n_unified_bytes,
-                                                   size_t *n_unified_characters,
-                                                   void *user_data)
+grn_nfkc_normalize_unify_to_previous_kana_vowel_or_n(grn_ctx *ctx,


Can we simplify this?

Suggested change

-grn_nfkc_normalize_unify_to_previous_kana_vowel_or_n(grn_ctx *ctx,
+grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like(grn_ctx *ctx,

kou · 2023-02-28T07:21:44Z

lib/normalizer.c

+                                                     unsigned char *unified_buffer,
+                                                     size_t *n_unified_bytes,
+                                                     size_t *n_unified_characters,
+                                                     grn_nfkc_normalize_is_target_char_func func,


How about using user_data instead of adding a new argument?

typedef struct {
  grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark;
  size_t previous_length;
} grn_nfkc_normalize_prolonged_sound_mark_like_data;

static const unsigned char *
grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like(...)
{
  grn_nfkc_normalize_prolonged_sound_mark_like_data *data = user_data;
  ...
  if (data->previous_length == 3 &&
      data->is_prolonged_sound_mark(current, char_length)) {
    ...
  }
}

Sounds good.
I have added a new struct that is used as user_data.
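
The idea in this thread, carrying both the predicate and the previous character length through `user_data`, can be sketched as follows. This is a minimal standalone illustration under assumptions: the type and function names are shortened stand-ins, `is_unify_target` is hypothetical, and the field name `is_target_char` follows the rename agreed later in this review.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*is_target_char_func)(const unsigned char *utf8_char,
                                    size_t length);

/* State handed to the callback through a void *user_data argument. */
typedef struct {
  is_target_char_func is_target_char;
  size_t previous_length;
} prolonged_sound_mark_like_data;

static bool
is_hyphen(const unsigned char *utf8_char, size_t length)
{
  return (length == 1 && utf8_char[0] == '-');
}

static bool
is_prolonged_sound_mark(const unsigned char *utf8_char, size_t length)
{
  return (length == 3 &&
          utf8_char[0] == 0xe3 &&
          utf8_char[1] == 0x83 &&
          utf8_char[2] == 0xbc);
}

/* Hypothetical callback: the same function serves both options because
 * the predicate comes from user_data instead of an extra argument. */
static bool
is_unify_target(const unsigned char *current,
                size_t char_length,
                void *user_data)
{
  const prolonged_sound_mark_like_data *data = user_data;
  return (data->previous_length == 3 &&
          data->is_target_char(current, char_length));
}
```

Keeping the predicate in the struct leaves the callback signature unchanged, so one callback can be reused with either `is_hyphen` or `is_prolonged_sound_mark`.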

kou · 2023-02-28T07:24:33Z

lib/normalizer.c

-    previous = current - *previous_length;
+      func(current, char_length)) {
+    const unsigned char *previous = current - *previous_length;
+    *previous_length = char_length;


How about setting this before if() because we should always set it?

To set *previous_length before this if(), we would need a temporary variable such as saved_previous_length, as shown below, because *previous_length is used both in the if() condition and inside its body.

size_t char_length;
size_t *previous_length = user_data;
size_t saved_previous_length = *previous_length;

char_length = (size_t)grn_charlen_(ctx, current, end, GRN_ENC_UTF8);
*previous_length = char_length;
*n_used_bytes = char_length;
*n_used_characters = 1;
if (saved_previous_length == 3 &&
    func(current, char_length)) {
  const unsigned char *previous = current - saved_previous_length;
  ...

What do you think? Is this better than the current implementation?

With #1526 (comment), it looks like the following:

size_t previous_length = data->previous_length;
data->previous_length = char_length;
if (previous_length == ...)

It looks better than saved_previous_length.

Makes sense.
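
The ordering agreed on here (copy the stored length into a local, update the stored length unconditionally, then test the local) can be sketched as a standalone function. The names `scan_state` and `step` are hypothetical, and the hyphen check is inlined to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  size_t previous_length;
} scan_state;

/* Returns true when the current character is '-' directly after a
 * 3-byte character. The stored length is updated on every call,
 * whether or not the condition matches. */
static bool
step(scan_state *state, const unsigned char *current, size_t char_length)
{
  size_t previous_length = state->previous_length;
  state->previous_length = char_length;
  return (previous_length == 3 &&
          char_length == 1 &&
          current[0] == '-');
}
```

Because the update happens before the test, a second consecutive hyphen sees a previous length of 1 and does not match.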

HashidaTKS · 2023-03-01T07:55:02Z

@kou

Thank you for your comments.
I have addressed your comments.

kou · 2023-03-01T08:22:29Z

lib/normalizer.c

+    grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
+    prolonged_sound_mark_like_data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen;
+    prolonged_sound_mark_like_data.previous_length = 0;


Suggested change

-    grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
-    prolonged_sound_mark_like_data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen;
-    prolonged_sound_mark_like_data.previous_length = 0;
+    grn_nfkc_normalize_prolonged_sound_mark_like_data data;
+    data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen;
+    data.previous_length = 0;

The name data is already used as the argument of this function, which is why I named this variable prolonged_sound_mark_like_data.

grn_nfkc_normalize_unify(grn_ctx *ctx, grn_nfkc_normalize_data *data)

And we use the data variable together with prolonged_sound_mark_like_data:

grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
prolonged_sound_mark_like_data.is_target_char = grn_nfkc_normalize_is_hyphen;
prolonged_sound_mark_like_data.previous_length = 0;
grn_nfkc_normalize_unify_stateful(ctx,
                                  data,
                                  &unify,
                                  grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like,
                                  &prolonged_sound_mark_like_data,
                                  "[unify][kana-hyphen]");

I see. How about stateful_data or subdata?

I adopted subdata.

kou · 2023-03-01T08:24:07Z

lib/normalizer.c

+                                          size_t length);
+
+typedef struct {
+  grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark_like_char;


Hmm. This may be redundant. We can use a shorter name without reducing readability because the struct name already carries the prolonged_sound_mark_like information:

Suggested change

-  grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark_like_char;
+  grn_nfkc_normalize_is_target_char_func is_target_char;

Makes sense.
Fixed.

HashidaTKS · 2023-03-02T04:29:08Z

@kou

Thank you for your comments.
I have addressed your comments.

HashidaTKS force-pushed the add_unify-kana-hyphen branch 3 times, most recently from c8317c9 to 549df00 Compare February 27, 2023 02:35

HashidaTKS marked this pull request as ready for review February 27, 2023 02:38

Add unify-kana-hyphen option

a1473ea

HashidaTKS force-pushed the add_unify-kana-hyphen branch from 549df00 to a1473ea Compare February 27, 2023 09:14

HashidaTKS added 2 commits February 27, 2023 18:15

Align indent

43b2fdb

Align function position

42811dc

HashidaTKS added 2 commits February 28, 2023 14:35

Align indent of arguments

8a25e80

Delete an extra new line

29a97c9

kou reviewed Feb 28, 2023


HashidaTKS added 6 commits March 1, 2023 15:40

Rename function name

81f11bf

Simplify grn_nfkc_normalize_is_hyphen/prolonged_sound_mark

e23d1da

Unify previous_length assinment

26f5c99

Use unify_kana_prolonged_sound_mark_like directly

c06fb22

Rename saved_previous_length to previous_length

0ad746a

Modify grn_bool to bool

b5c7afa

kou reviewed Mar 1, 2023


HashidaTKS added 2 commits March 1, 2023 17:32

Rename variable.

445e6be

Rename a variable

1bf77e8

kou merged commit 07c0b25 into master Mar 2, 2023

kou deleted the add_unify-kana-hyphen branch March 2, 2023 05:20

github-actions bot mentioned this pull request May 22, 2023

nginx: update bundles version 1.23.4 komainu8/groonga#1

Closed
