-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NormalizerNFKC: add unify_kana_hyphen
option
#1526
Conversation
c8317c9
to
549df00
Compare
549df00
to
a1473ea
Compare
Rebased to the latest master branch. Would you please review this? |
Would you review this? |
lib/normalizer.c
Outdated
@@ -988,6 +988,32 @@ grn_nfkc_normalize_unify_katakana_voiced_sound_mark(const unsigned char *utf8_ch | |||
return utf8_char; | |||
} | |||
|
|||
grn_inline static grn_bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use bool
for newly added code?
grn_inline static grn_bool | |
grn_inline static bool |
lib/normalizer.c
Outdated
if (length == 1 && | ||
utf8_char[0] == '-') { | ||
/* U+002D HYPHEN-MINUS */ | ||
return GRN_TRUE; | ||
} | ||
return GRN_FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about simplifying this?
if (length == 1 && | |
utf8_char[0] == '-') { | |
/* U+002D HYPHEN-MINUS */ | |
return GRN_TRUE; | |
} | |
return GRN_FALSE; | |
/* U+002D HYPHEN-MINUS */ | |
return (length == 1 && utf8_char[0] == '-'); |
lib/normalizer.c
Outdated
grn_inline static grn_bool | ||
grn_nfkc_normalize_is_prolonged_sound_mark(const unsigned char *utf8_char, | ||
size_t length) | ||
{ | ||
if (length == 3 && | ||
utf8_char[0] == 0xe3 && | ||
utf8_char[1] == 0x83 && | ||
utf8_char[2] == 0xbc) { | ||
/* U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK */ | ||
return GRN_TRUE; | ||
} | ||
return GRN_FALSE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
lib/normalizer.c
Outdated
typedef grn_bool | ||
(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char, | ||
size_t length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typedef grn_bool | |
(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char, | |
size_t length); | |
typedef bool | |
(*grn_nfkc_normalize_is_target_char_func)(const unsigned char *utf8_char, | |
size_t length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
lib/normalizer.c
Outdated
size_t *n_unified_bytes, | ||
size_t *n_unified_characters, | ||
void *user_data) | ||
grn_nfkc_normalize_unify_to_previous_kana_vowel_or_n(grn_ctx *ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this?
grn_nfkc_normalize_unify_to_previous_kana_vowel_or_n(grn_ctx *ctx, | |
grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like(grn_ctx *ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
lib/normalizer.c
Outdated
unsigned char *unified_buffer, | ||
size_t *n_unified_bytes, | ||
size_t *n_unified_characters, | ||
grn_nfkc_normalize_is_target_char_func func, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using user_data
instead of adding a new argument?
typedef struct {
grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark;
bool previous_length;
} grn_nfkc_normalize_prolonged_sound_mark_like_data;
static const unsigned char *
grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like(...)
{
grn_nfkc_normalize_prolonged_sound_mark_like_data data = user_data;
...
if (data->previous_length == 3 && data->is_prolonged_sound_mark(current, char_length)) {
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
I have added new struct
used as user_data
.
lib/normalizer.c
Outdated
previous = current - *previous_length; | ||
func(current, char_length)) { | ||
const unsigned char *previous = current - *previous_length; | ||
*previous_length = char_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about setting this before if()
because we should always set it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to set *previous_length
before this if
, it is better that we introduce a temporary variable like saved_previous_length
as below because *previous_length
is used in the if()
condition and in brackets of it.
size_t char_length;
size_t *previous_length = user_data;
size_t saved_previous_length = *previous_length;
char_length = (size_t)grn_charlen_(ctx, current, end, GRN_ENC_UTF8);
*previous_length = char_length;
*n_used_bytes = char_length;
*n_used_characters = 1;
if (saved_previous_length == 3 &&
func(current, char_length)) {
const unsigned char *previous = current - saved_previous_length;
...
How do you feel that? it is better than the current implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With #1526 (comment) , it looks like the following:
size_t previous_length = data->previous_length:
data->previous_length = char_length;
if (previous_length == ...)
It looks better than saved_previous_length
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Thank you for your comments. |
lib/normalizer.c
Outdated
grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data; | ||
prolonged_sound_mark_like_data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen; | ||
prolonged_sound_mark_like_data.previous_length = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data; | |
prolonged_sound_mark_like_data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen; | |
prolonged_sound_mark_like_data.previous_length = 0; | |
grn_nfkc_normalize_prolonged_sound_mark_like_data data; | |
data.is_prolonged_sound_mark_like_char = grn_nfkc_normalize_is_hyphen; | |
data.previous_length = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name data
is already used as the argument of this function, this is why I named this variable prolonged_sound_mark_like_data
.
grn_nfkc_normalize_unify(grn_ctx *ctx,
grn_nfkc_normalize_data *data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we use the data
variable with prolonged_sound_mark_like_data
.
grn_nfkc_normalize_prolonged_sound_mark_like_data prolonged_sound_mark_like_data;
prolonged_sound_mark_like_data.is_target_char = grn_nfkc_normalize_is_hyphen;
prolonged_sound_mark_like_data.previous_length = 0;
grn_nfkc_normalize_unify_stateful(ctx,
data,
&unify,
grn_nfkc_normalize_unify_kana_prolonged_sound_mark_like,
&prolonged_sound_mark_like_data,
"[unify][kana-hyphen]");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. How about stateful_data
or subdata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adopted subdata
.
lib/normalizer.c
Outdated
size_t length); | ||
|
||
typedef struct { | ||
grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark_like_char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. This may be redundant. We may be use more shorter name without reducing readability because struct name has prolonged_sound_mark_like
information:
grn_nfkc_normalize_is_target_char_func is_prolonged_sound_mark_like_char; | |
grn_nfkc_normalize_is_target_char_func is_target_char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
Fixed.
Thank you for your comments. |
NormalizerNFKC*
normalizehyphen
(-
) to the vowel of the previous kana character orン
orん
whenunify_kana_hyphen
is specified.-
Usage: