NormalizerNFKC: add `unify_katakana_z_sounds` option #1502
next[1] == 0x82) {
  if (next[2] == 0xa1) { /* U+30A1 KATAKANA LETTER SMALL A */
    /* U+30B6 KATAKANA LETTER ZA */
    unified_buffer[(*n_unified_bytes)++] = current[0];
Or directly specify `0xe3`.
I used `current[0]` because the existing implementations of the other normalizer options use the `current` variable wherever possible.
I think that `current[0]` is better because the existing implementation already uses that style.
Thanks.
I see.
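For readers following the `0xe3` versus `current[0]` discussion, a minimal sketch of why the two are interchangeable here: every code point in U+3000..U+3FFF, which includes the katakana block, encodes to a three-byte UTF-8 sequence whose first byte is `0xe3`. The helper name `encode_utf8_3bytes` is hypothetical and is not part of Groonga.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not Groonga code): encode a code point in the
 * range U+0800..U+FFFF as three UTF-8 bytes. Surrogates are not
 * rejected; this is only an illustration. */
static void
encode_utf8_3bytes(uint32_t code_point, unsigned char out[3])
{
  assert(code_point >= 0x0800 && code_point <= 0xffff);
  out[0] = (unsigned char)(0xe0 | (code_point >> 12));         /* 1110xxxx */
  out[1] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3f)); /* 10xxxxxx */
  out[2] = (unsigned char)(0x80 | (code_point & 0x3f));        /* 10xxxxxx */
}
```

Under this encoding, U+30BA (ZU) becomes `e3 82 ba` and U+30B6 (ZA) becomes `e3 82 b6`, so copying `current[0]` from ZU writes the same leading byte that a literal `0xe3` would.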
lib/normalizer.c
Outdated
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba);
}

if (start_char) {
If you want to introduce a descriptive variable for this condition, could you use a more meaningful name such as `is_du_or_zu`?
In this case, I don't think we need a descriptive variable because the following code is descriptive enough, but this is not a strong opinion:
if ((char_length == 3) &&
/* U+30C5 KATAKANA LETTER DU */
(current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
/* U+30BA KATAKANA LETTER ZU */
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba)) {
We would need to add extra parentheses around the character checks because `&&` has higher precedence than `||`, and I feel that somewhat undermines readability.
That is why I introduced a descriptive variable.
if ((char_length == 3) &&
/* U+30C5 KATAKANA LETTER DU */
((current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
/* U+30BA KATAKANA LETTER ZU */
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba)))
But this is also not a strong opinion...
In conclusion, I will remove the descriptive variable and fix it as above.
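The precedence point can be checked with a small standalone sketch. Both function names are hypothetical. The first function writes out explicitly how C groups the condition when the extra parentheses are missing; the second is the grouping agreed on in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How C parses `len == 3 && DU || ZU`: && binds tighter than ||, so the
 * length check guards only the DU comparison. */
static bool
is_du_or_zu_as_parsed(const unsigned char *current, size_t char_length)
{
  assert(current != NULL);
  return ((char_length == 3) &&
          (current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85)) ||
         (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba);
}

/* With the extra parentheses: the length check guards both comparisons. */
static bool
is_du_or_zu_grouped(const unsigned char *current, size_t char_length)
{
  assert(current != NULL);
  return (char_length == 3) &&
         (/* U+30C5 KATAKANA LETTER DU */
          (current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
          /* U+30BA KATAKANA LETTER ZU */
          (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba));
}
```

The two functions differ only when `char_length` is not 3: the first still reports a match for the ZU byte pattern, while the second does not.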
Fixed
lib/normalizer.c
Outdated
/* U+30C5 KATAKANA LETTER DU */
((current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
/* U+30BA KATAKANA LETTER ZU */
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {
Fixed.
179b2c4 to 62dc84d
62dc84d to fc4648b
Thanks, I have addressed your comments.
Could you split ヅァヅィヅヅェヅォ out into `unify_katakana_d_sounds`?
test/command/suite/normalizers/nfkc100/unify_katakana_z_sounds.test
Outdated
….test Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Let me confirm your suggestion. |
Yes. They aren't started with |
Thanks, I will create |
I have removed |
When `unify_katakana_z_sounds` is specified, `NormalizerNFKC*` normalizes characters as below.
ズァ -> ザ
ズィ -> ジ
ズェ -> ゼ
ズォ -> ゾ
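The mapping above can be sketched at the byte level as follows. This is a standalone illustration under stated assumptions, not the code in `lib/normalizer.c`: the function name `unify_katakana_z_sounds_sketch` is hypothetical, the input is assumed to be valid UTF-8, and `out` is assumed to have room for at least `in_len` bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not Groonga's implementation): rewrite ZU followed
 * by a small vowel into the single unified katakana. Returns the number of
 * bytes written to `out`. */
static size_t
unify_katakana_z_sounds_sketch(const unsigned char *in, size_t in_len,
                               unsigned char *out)
{
  size_t i = 0;
  size_t n = 0;
  assert(in != NULL && out != NULL);
  while (i < in_len) {
    /* U+30BA KATAKANA LETTER ZU (e3 82 ba) followed by e3 82 xx */
    if (i + 6 <= in_len &&
        in[i] == 0xe3 && in[i + 1] == 0x82 && in[i + 2] == 0xba &&
        in[i + 3] == 0xe3 && in[i + 4] == 0x82) {
      unsigned char unified = 0;
      switch (in[i + 5]) {
      case 0xa1: unified = 0xb6; break; /* U+30A1 SMALL A -> U+30B6 ZA */
      case 0xa3: unified = 0xb8; break; /* U+30A3 SMALL I -> U+30B8 ZI */
      case 0xa7: unified = 0xbc; break; /* U+30A7 SMALL E -> U+30BC ZE */
      case 0xa9: unified = 0xbe; break; /* U+30A9 SMALL O -> U+30BE ZO */
      default: break;
      }
      if (unified != 0) {
        out[n++] = in[i];     /* 0xe3, reused from the input */
        out[n++] = in[i + 1]; /* 0x82 */
        out[n++] = unified;
        i += 6;
        continue;
      }
    }
    out[n++] = in[i++]; /* copy every other byte unchanged */
  }
  return n;
}
```

Because `0xe3` is a UTF-8 lead byte and never a continuation byte, the pattern can only match at a character boundary in valid input, which is why a plain byte-wise scan is enough for this sketch.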
Usage: