NormalizerNFKC: add unify_kana_hyphen option #1526

Merged
merged 13 commits into from
Mar 2, 2023

Conversation

HashidaTKS
Contributor

@HashidaTKS HashidaTKS commented Feb 20, 2023

When unify_kana_hyphen is specified, NormalizerNFKC* normalizes a hyphen (-) to the vowel of the previous kana character. If the previous character is ン or ん, the hyphen is normalized to ン or ん instead.

| Previous Char | Normalized - |
| --- | --- |
| Katakana & vowel is A | ア |
| Katakana & vowel is I | イ |
| Katakana & vowel is U | ウ |
| Katakana & vowel is E | エ |
| Katakana & vowel is O | オ |
| Hiragana & vowel is A | あ |
| Hiragana & vowel is I | い |
| Hiragana & vowel is U | う |
| Hiragana & vowel is E | え |
| Hiragana & vowel is O | お |

ァ- -> ァア, ア- -> アア, ヵ- -> ヵア, カ- -> カア, ガ- -> ガア, サ- -> サア, ザ- -> ザア, 
タ- -> タア, ダ- -> ダア, ナ- -> ナア, ハ- -> ハア, バ- -> バア, パ- -> パア, マ- -> マア, 
ャ- -> ャア, ヤ- -> ヤア, ラ- -> ラア, ヮ- -> ヮア, ワ- -> ワア, ヷ- -> ヷア, 

ィ- -> ィイ, イ- -> イイ, キ- -> キイ, ギ- -> ギイ, シ- -> シイ, ジ- -> ジイ, チ- -> チイ,
ヂ- -> ヂイ, ニ- -> ニイ, ヒ- -> ヒイ, ビ- -> ビイ, ピ- -> ピイ, ミ- -> ミイ, リ- -> リイ,
ヰ- -> ヰイ, ヸ- -> ヸイ, 

ゥ- -> ゥウ, ウ- -> ウウ, ク- -> クウ, グ- -> グウ, ス- -> スウ, ズ- -> ズウ, ツ- -> ツウ,
ヅ- -> ヅウ, ヌ- -> ヌウ, フ- -> フウ, ブ- -> ブウ, プ- -> プウ, ム- -> ムウ, ュ- -> ュウ,
ユ- -> ユウ, ル- -> ルウ, ヱ- -> ヱウ, ヴ- -> ヴウ,

ェ- -> ェエ, エ- -> エエ, ヶ- -> ヶエ, ケ- -> ケエ, ゲ- -> ゲエ, セ- -> セエ, ゼ- -> ゼエ,
テ- -> テエ, デ- -> デエ, ネ- -> ネエ, ヘ- -> ヘエ, ベ- -> ベエ, ペ- -> ペエ, メ- -> メエ,
レ- -> レエ, ヹ- -> ヹエ,

ォ- -> ォオ, オ- -> オオ, コ- -> コオ, ゴ- -> ゴオ, ソ- -> ソオ, ゾ- -> ゾオ, ト- -> トオ,
ド- -> ドオ, ノ- -> ノオ, ホ- -> ホオ, ボ- -> ボオ, ポ- -> ポオ, モ- -> モオ, ョ- -> ョオ,
ヨ- -> ヨオ, ロ- -> ロオ, ヲ- -> ヲオ, ヺ- -> ヺオ, 

ン- -> ンン

ぁ- -> ぁあ, あ- -> ああ, ゕ- -> ゕあ, か- -> かあ, が- -> があ, さ- -> さあ, ざ- -> ざあ, 
た- -> たあ, だ- -> だあ, な- -> なあ, は- -> はあ, ば- -> ばあ, ぱ- -> ぱあ, ま- -> まあ, 
ゃ- -> ゃあ, や- -> やあ, ら- -> らあ, ゎ- -> ゎあ, わ- -> わあ 

ぃ- -> ぃい, い- -> いい, き- -> きい, ぎ- -> ぎい, し- -> しい, じ- -> じい, ち- -> ちい,
ぢ- -> ぢい, に- -> にい, ひ- -> ひい, び- -> びい, ぴ- -> ぴい, み- -> みい, り- -> りい,
ゐ- -> ゐい

ぅ- -> ぅう, う- -> うう, く- -> くう, ぐ- -> ぐう, す- -> すう, ず- -> ずう, つ- -> つう,
づ- -> づう, ぬ- -> ぬう, ふ- -> ふう, ぶ- -> ぶう, ぷ- -> ぷう, む- -> むう, ゅ- -> ゅう,
ゆ- -> ゆう, る- -> るう, ゑ- -> ゑう, ゔ- -> ゔう

ぇ- -> ぇえ, え- -> ええ, ゖ- -> ゖえ, け- -> けえ, げ- -> げえ, せ- -> せえ, ぜ- -> ぜえ,
て- -> てえ, で- -> でえ, ね- -> ねえ, へ- -> へえ, べ- -> べえ, ぺ- -> ぺえ, め- -> めえ,
れ- -> れえ

ぉ- -> ぉお, お- -> おお, こ- -> こお, ご- -> ごお, そ- -> そお, ぞ- -> ぞお, と- -> とお,
ど- -> どお, の- -> のお, ほ- -> ほお, ぼ- -> ぼお, ぽ- -> ぽお, も- -> もお, ょ- -> ょお,
よ- -> よお, ろ- -> ろお, を- -> をお

ん- -> んん

Usage:

normalize \
  'NormalizerNFKC100("unify_kana_hyphen", true, \
                     "report_source_offset", true)' \
  "ア-イ-ウ-エ-オ-あ-い-う-え-お-" \
  WITH_CHECKS|WITH_TYPES
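
The mapping above can be sketched in standalone C. This is only an illustration, not the code in lib/normalizer.c: `sketch_unify_kana_hyphen`, `sketch_rules`, and the helper functions are hypothetical names, the table covers only a handful of the rows listed in this description, and leaving a hyphen unchanged when it does not directly follow a listed kana is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, partial rule table: previous kana -> what '-' becomes.
 * Only a few rows from the PR description are included. */
typedef struct {
  const char *previous;    /* 3-byte UTF-8 kana before the hyphen */
  const char *replacement; /* 3-byte UTF-8 kana written instead of '-' */
} sketch_rule;

static const sketch_rule sketch_rules[] = {
  {"ア", "ア"}, {"カ", "ア"}, {"イ", "イ"}, {"ウ", "ウ"},
  {"エ", "エ"}, {"オ", "オ"}, {"ン", "ン"},
  {"あ", "あ"}, {"か", "あ"}, {"い", "い"}, {"う", "う"},
  {"え", "え"}, {"お", "お"}, {"ん", "ん"},
};

/* Byte length of a UTF-8 character, judged from its lead byte. */
static size_t
sketch_utf8_length(unsigned char lead)
{
  if ((lead & 0xe0) == 0xc0) return 2;
  if ((lead & 0xf0) == 0xe0) return 3;
  if ((lead & 0xf8) == 0xf0) return 4;
  return 1; /* ASCII, or an invalid byte that is copied as-is */
}

/* Returns the replacement for a following '-', or NULL if none is known. */
static const char *
sketch_find_replacement(const char *utf8_char, size_t length)
{
  size_t i;
  if (length != 3) return NULL;
  for (i = 0; i < sizeof(sketch_rules) / sizeof(sketch_rules[0]); i++) {
    if (memcmp(utf8_char, sketch_rules[i].previous, 3) == 0) {
      return sketch_rules[i].replacement;
    }
  }
  return NULL;
}

/* Copies input to output, replacing '-' right after a listed kana.
 * Returns the number of bytes written (without the NUL terminator),
 * or (size_t)-1 when output is too small. */
static size_t
sketch_unify_kana_hyphen(const char *input, char *output, size_t output_size)
{
  size_t input_length;
  size_t i = 0;
  size_t o = 0;
  const char *replacement = NULL;

  assert(input != NULL && output != NULL);
  input_length = strlen(input);
  while (i < input_length) {
    size_t length = sketch_utf8_length((unsigned char)input[i]);
    const char *source = input + i;
    size_t source_length;
    if (length > input_length - i) length = input_length - i;
    source_length = length;
    if (length == 1 && input[i] == '-' && replacement) {
      source = replacement;
      source_length = 3;
      replacement = NULL; /* a hyphen right after a hyphen stays as-is */
    } else {
      replacement = sketch_find_replacement(input + i, length);
    }
    if (o + source_length + 1 > output_size) return (size_t)-1;
    memcpy(output + o, source, source_length);
    o += source_length;
    i += length;
  }
  if (o + 1 > output_size) return (size_t)-1;
  output[o] = '\0';
  return o;
}
```

Calling `sketch_unify_kana_hyphen()` with the input string from the usage example rewrites each hyphen to the vowel of the kana before it, while a hyphen after an ASCII character is copied unchanged.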

@HashidaTKS HashidaTKS force-pushed the add_unify-kana-hyphen branch 3 times, most recently from c8317c9 to 549df00 Compare February 27, 2023 02:35
@HashidaTKS HashidaTKS marked this pull request as ready for review February 27, 2023 02:38
@HashidaTKS
Contributor Author

HashidaTKS commented Feb 27, 2023

@kou

Would you please review this?
(Could you please review #1520 first? It has higher priority than this.)

@HashidaTKS
Contributor Author

Rebased to the latest master branch.

@kou

Would you please review this?

@HashidaTKS
Contributor Author

@kou

Would you review this?

lib/normalizer.c Outdated
@@ -988,6 +988,32 @@ grn_nfkc_normalize_unify_katakana_voiced_sound_mark(const unsigned char *utf8_ch
return utf8_char;
}

grn_inline static grn_bool
Member

Could you use bool for newly added code?

Suggested change
- grn_inline static grn_bool
+ grn_inline static bool

lib/normalizer.c Outdated
Comment on lines 995 to 1000
if (length == 1 &&
utf8_char[0] == '-') {
/* U+002D HYPHEN-MINUS */
return GRN_TRUE;
}
return GRN_FALSE;
Member

How about simplifying this?

Suggested change
-   if (length == 1 &&
-       utf8_char[0] == '-') {
-     /* U+002D HYPHEN-MINUS */
-     return GRN_TRUE;
-   }
-   return GRN_FALSE;
+   /* U+002D HYPHEN-MINUS */
+   return (length == 1 && utf8_char[0] == '-');

lib/normalizer.c Outdated
Comment on lines 1003 to 1015
grn_inline static grn_bool
grn_nfkc_normalize_is_prolonged_sound_mark(const unsigned char *utf8_char,
size_t length)
{
if (length == 3 &&
utf8_char[0] == 0xe3 &&
utf8_char[1] == 0x83 &&
utf8_char[2] == 0xbc) {
/* U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK */
return GRN_TRUE;
}
return GRN_FALSE;
}
Member

ditto.
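
Applying both suggestions (bool and a single return expression) to the two predicates might look like the following. This is a sketch under assumptions, not the merged code: `static inline` stands in for Groonga's `grn_inline static`, and the name `grn_nfkc_normalize_is_hyphen` is taken from a later snippet in this review.

```c
#include <stdbool.h>
#include <stddef.h>

/* U+002D HYPHEN-MINUS is a single byte in UTF-8. */
static inline bool
grn_nfkc_normalize_is_hyphen(const unsigned char *utf8_char, size_t length)
{
  return (length == 1 && utf8_char[0] == '-');
}

/* U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK is 0xe3 0x83 0xbc in UTF-8. */
static inline bool
grn_nfkc_normalize_is_prolonged_sound_mark(const unsigned char *utf8_char,
                                           size_t length)
{
  return (length == 3 &&
          utf8_char[0] == 0xe3 &&
          utf8_char[1] == 0x83 &&
          utf8_char[2] == 0xbc);
}
```

Because `length` is checked first, the short-circuit `&&` never reads bytes beyond the current character.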

lib/normalizer.c Outdated
Comment on lines 1921 to 1923
typedef grn_bool
(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
size_t length);
Member

Suggested change
- typedef grn_bool
- (*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
-                                           size_t length);
+ typedef bool
+ (*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
+                                           size_t length);

Contributor Author

Fixed.

lib/normalizer.c Outdated
size_t *n_unified_bytes,
size_t *n_unified_characters,
void *user_data)
grn_nfkc_normalize_unify_to_previous_kana_vowel_or_n(grn_ctx *ctx,
Member

Can we simplify this?

Suggested change
- grn_nfkc_normalize_unify_to_previous_kana_vowel_or_n(grn_ctx *ctx,
+ grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like(grn_ctx *ctx,

Contributor Author

Fixed.

lib/normalizer.c Outdated
unsigned char *unified_buffer,
size_t *n_unified_bytes,
size_t *n_unified_characters,
grn_nfkc_normalize_is_target_char_func func,
Member

How about using user_data instead of adding a new argument?

typedef struct {
  grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark;
  size_t previous_length;
} grn_nfkc_normalize_prolonged_sound_mark_like_data;

static const unsigned char *
grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like(...)
{
  grn_nfkc_normalize_prolonged_sound_mark_like_data *data = user_data;
  ...
  if (data->previous_length == 3 && data->is_prolonged_sound_mark(current, char_length)) {
    ...
  }
}

Contributor Author

Sounds good.
I have added a new struct that is used as user_data.
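
The user_data approach discussed in this thread can be sketched as follows. Names prefixed with `sketch_` are hypothetical, the field name `is_target_char` follows the later snippets in this review, and the real callback in lib/normalizer.c takes more arguments than shown here.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool
(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char,
                                          size_t length);

/* State handed to the callback through void *user_data. */
typedef struct {
  grn_nfkc_normalize_is_target_char_func is_target_char;
  size_t previous_length; /* byte length of the previous character */
} sketch_prolonged_sound_mark_like_data;

static bool
sketch_is_hyphen(const unsigned char *utf8_char, size_t length)
{
  return (length == 1 && utf8_char[0] == '-');
}

static bool
sketch_is_prolonged_sound_mark(const unsigned char *utf8_char, size_t length)
{
  return (length == 3 &&
          utf8_char[0] == 0xe3 &&
          utf8_char[1] == 0x83 &&
          utf8_char[2] == 0xbc);
}

/* Hypothetical callback: the signature stays the same for every kind of
 * target character because the predicate travels inside user_data. */
static bool
sketch_is_unify_target(const unsigned char *current,
                       size_t char_length,
                       void *user_data)
{
  const sketch_prolonged_sound_mark_like_data *data = user_data;
  return (data->previous_length == 3 &&
          data->is_target_char(current, char_length));
}
```

Swapping `sketch_is_hyphen` for `sketch_is_prolonged_sound_mark` in the struct changes which character is treated as the target without touching the callback's argument list.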

lib/normalizer.c Outdated
previous = current - *previous_length;
func(current, char_length)) {
const unsigned char *previous = current - *previous_length;
*previous_length = char_length;
Member

How about setting this before if() because we should always set it?

Contributor Author

@HashidaTKS HashidaTKS Mar 1, 2023

To set *previous_length before this if(), we would need to introduce a temporary variable such as saved_previous_length, as shown below, because *previous_length is used both in the if() condition and inside its body.

  size_t char_length;
  size_t *previous_length = user_data;
  size_t saved_previous_length = *previous_length;

  char_length = (size_t)grn_charlen_(ctx, current, end, GRN_ENC_UTF8);
  *previous_length = char_length;

  *n_used_bytes = char_length;
  *n_used_characters = 1;

  if (saved_previous_length == 3 &&
      func(current, char_length)) {
    const unsigned char *previous = current - saved_previous_length;
    ...

What do you think? Is this better than the current implementation?

Member

@kou kou Mar 1, 2023

With #1526 (comment), it looks like the following:

size_t previous_length = data->previous_length;
data->previous_length = char_length;

if (previous_length == ...)

It looks better than saved_previous_length.

Contributor Author

Makes sense.

Contributor Author

Fixed.
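
The ordering agreed on in this thread (read the old length into a local, store the new length unconditionally, then test the local) can be sketched in isolation. `sketch_state` and `sketch_step` are hypothetical names, and the hyphen check is inlined here only to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  size_t previous_length; /* byte length of the previous character */
} sketch_state;

/* Returns true when the current character is a '-' that directly follows
 * a 3-byte character. The state is updated on every call, whether or not
 * the condition matches. */
static bool
sketch_step(sketch_state *data,
            const unsigned char *current,
            size_t char_length)
{
  size_t previous_length = data->previous_length; /* read the old value */
  data->previous_length = char_length;            /* always record the new one */

  if (previous_length == 3 && char_length == 1 && current[0] == '-') {
    return true;
  }
  return false;
}
```

Feeding ア, - and - in sequence returns false, true and false: the second hyphen sees a previous length of 1 because the state was updated even on the call that matched.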

@HashidaTKS
Contributor Author

@kou

Thank you for your comments.
I have addressed your comments.

lib/normalizer.c Outdated
Comment on lines 2756 to 2758
grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
prolonged_sound_mark_like_data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen;
prolonged_sound_mark_like_data.previous_length = 0;
Member

Suggested change
- grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
- prolonged_sound_mark_like_data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen;
- prolonged_sound_mark_like_data.previous_length = 0;
+ grn_nfkc_normalize_prolonged_sound_mark_like_data data;
+ data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen;
+ data.previous_length = 0;

Contributor Author

@HashidaTKS HashidaTKS Mar 1, 2023

The name data is already used for this function's argument, which is why I named this variable prolonged_sound_mark_like_data.

grn_nfkc_normalize_unify(grn_ctx *ctx,
                         grn_nfkc_normalize_data *data)

Contributor Author

@HashidaTKS HashidaTKS Mar 1, 2023

Also, the data argument is used together with prolonged_sound_mark_like_data in the same call:

    grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
    prolonged_sound_mark_like_data.is_target_char = grn_nfkc_normalize_is_hyphen;
    prolonged_sound_mark_like_data.previous_length = 0;
    grn_nfkc_normalize_unify_stateful(ctx,
                                      data,
                                      &unify,
                                      grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like,
                                      &prolonged_sound_mark_like_data,
                                      "[unify][kana-hyphen]");

Member

I see. How about stateful_data or subdata?

Contributor Author

I adopted subdata.

lib/normalizer.c Outdated
size_t length);

typedef struct {
grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark_like_char;
Member

Hmm. This may be redundant. We can use a shorter name without reducing readability because the struct name already carries the prolonged_sound_mark_like information:

Suggested change
- grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark_like_char;
+ grn_nfkc_normalize_is_target_char_func is_target_char;

Contributor Author

Makes sense.
Fixed.

@HashidaTKS
Contributor Author

@kou

Thank you for your comments.
I have addressed your comments.

@kou kou merged commit 07c0b25 into master Mar 2, 2023
@kou kou deleted the add_unify-kana-hyphen branch March 2, 2023 05:20