Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NormalizerNFKC: add unify_katakana_z_sounds option #1502

Merged
merged 6 commits into from
Feb 3, 2023

Conversation

HashidaTKS
Copy link
Contributor

@HashidaTKS HashidaTKS commented Feb 1, 2023

When unify_katakana_z_sounds is specified, NormalizerNFKC* normalize characters as below.

ズァ -> ザ
ズィ -> ジ
ズェ -> ゼ
ズォ -> ゾ

Usage:

normalize \
  'NormalizerNFKC130("unify_katakana_z_sounds", true, \
                     "report_source_offset", true)' \
  "ズァズィズェズォ" \
  WITH_CHECKS|WITH_TYPES

@HashidaTKS HashidaTKS marked this pull request as ready for review February 2, 2023 03:18
next[1] == 0x82) {
if (next[2] == 0xa1) { /* U+30A1 KATAKANA LETTER SMALL A */
/* U+30B6 KATAKANA LETTER ZA */
unified_buffer[(*n_unified_bytes)++] = current[0];
Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or directly specify 0xe3.
I have used current[0] because current implementation for other normalizer options use the current variable as possible as we can.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that current[0] is better because existing implementation already uses the style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
I see.

lib/normalizer.c Outdated
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba);
}

if (start_char) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to introduce a descriptive variable for this condition, could you use more meaningful name like is_du_or_zu or something?

In this case, I don't thing that we need a descriptive variable because the following code is well descriptive but it's not a strong opinion:

if ((char_length == 3) &&
    /* U+30C5 KATAKANA LETTER DU */
    (current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
    /* U+30BA KATAKANA LETTER ZU */
    (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba)) {

Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add extra parentheses before and after character checks because && has higher priority than ||, and I feel it somehow undermines readability.
That is the reason why I introduced a descriptive variable.

if ((char_length == 3) &&
    /* U+30C5 KATAKANA LETTER DU */
    ((current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
    /* U+30BA KATAKANA LETTER ZU */
    (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) 

But this is also not a strong opinion...
As a conclusion, I will remove the descriptive variable and fix it like above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

next[1] == 0x82) {
if (next[2] == 0xa1) { /* U+30A1 KATAKANA LETTER SMALL A */
/* U+30B6 KATAKANA LETTER ZA */
unified_buffer[(*n_unified_bytes)++] = current[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that current[0] is better because existing implementation already uses the style.

lib/normalizer.c Outdated
/* U+30C5 KATAKANA LETTER DU */
((current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
/* U+30BA KATAKANA LETTER ZU */
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {
(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@HashidaTKS HashidaTKS force-pushed the add-unify_katakana_z_sounds branch 2 times, most recently from 179b2c4 to 62dc84d Compare February 3, 2023 03:21
@HashidaTKS
Copy link
Contributor Author

@kou

Thanks, I have addressed your comments.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you split ヅァヅィヅヅェヅォ to unify_katakana_d_sounds?

HashidaTKS and others added 2 commits February 3, 2023 13:26
@HashidaTKS
Copy link
Contributor Author

HashidaTKS commented Feb 3, 2023

Could you split ヅァヅィヅヅェヅォ to unify_katakana_d_sounds?

Let me confirm your suggestion.
Do you mean unify_katakana_d_sounds converts ヅァヅィヅヅェヅォ to ザジヅゼゾ ?

@kou
Copy link
Member

kou commented Feb 3, 2023

Yes. They aren't started with z.

@HashidaTKS
Copy link
Contributor Author

Yes. They aren't started with z.

Thanks, I will create unify_katakana_d_sounds as such behaviour.

@HashidaTKS
Copy link
Contributor Author

HashidaTKS commented Feb 3, 2023

Could you split ヅァヅィヅヅェヅォ to unify_katakana_d_sounds?

I have removed ヅァヅィヅヅェヅォ from the unify_katakana_z_sounds target.
I will create unify_katakana_d_sounds in another pull request.

@kou kou merged commit 441d800 into master Feb 3, 2023
@kou kou deleted the add-unify_katakana_z_sounds branch February 3, 2023 05:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants