NormalizerNFKC: add `unify_katakana_z_sounds` option #1502

HashidaTKS · 2023-02-01T06:50:58Z

When unify_katakana_z_sounds is specified, NormalizerNFKC* normalizes characters as below.

ズァ -> ザ
ズィ -> ジ
ズェ -> ゼ
ズォ -> ゾ

Usage:

normalize \
  'NormalizerNFKC130("unify_katakana_z_sounds", true, \
                     "report_source_offset", true)' \
  "ズァズィズェズォ" \
  WITH_CHECKS|WITH_TYPES
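
For illustration, the mapping above can be sketched in C at the UTF-8 byte level. This is a minimal standalone sketch, not the code in lib/normalizer.c: the function name unify_z_sounds_sketch and its buffer-based signature are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper, not Groonga's code: rewrites ZU (U+30BA) followed by
 * a small vowel into the single katakana from the mapping above.
 * `out` must have room for at least `in_len` bytes.
 * Returns the number of bytes written to `out`. */
static size_t
unify_z_sounds_sketch(const unsigned char *in, size_t in_len,
                      unsigned char *out)
{
  size_t i = 0;
  size_t n = 0;
  while (i < in_len) {
    /* U+30BA KATAKANA LETTER ZU followed by another e3 82 xx character */
    if (in_len - i >= 6 &&
        in[i] == 0xe3 && in[i + 1] == 0x82 && in[i + 2] == 0xba &&
        in[i + 3] == 0xe3 && in[i + 4] == 0x82) {
      unsigned char unified = 0;
      switch (in[i + 5]) {
      case 0xa1: unified = 0xb6; break; /* U+30A1 SMALL A -> U+30B6 ZA */
      case 0xa3: unified = 0xb8; break; /* U+30A3 SMALL I -> U+30B8 ZI */
      case 0xa7: unified = 0xbc; break; /* U+30A7 SMALL E -> U+30BC ZE */
      case 0xa9: unified = 0xbe; break; /* U+30A9 SMALL O -> U+30BE ZO */
      default: break;
      }
      if (unified != 0) {
        out[n++] = 0xe3;
        out[n++] = 0x82;
        out[n++] = unified;
        i += 6; /* two 3-byte characters were consumed */
        continue;
      }
    }
    out[n++] = in[i++]; /* everything else is copied through unchanged */
  }
  return n;
}
```

Matching at every byte offset is safe for well-formed UTF-8 because 0xe3 can only appear as the first byte of a character. A lone ズ that is not followed by one of the four small vowels is left as is.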

HashidaTKS · 2023-02-02T03:20:42Z

lib/normalizer.c

+        next[1] == 0x82) {
+      if (next[2] == 0xa1) { /* U+30A1 KATAKANA LETTER SMALL A */
+        /* U+30B6 KATAKANA LETTER ZA */
+        unified_buffer[(*n_unified_bytes)++] = current[0];


Alternatively, we could specify 0xe3 directly.
I used current[0] because the current implementation of the other normalizer options uses the current variable as much as possible.

I think that current[0] is better because existing implementation already uses the style.

Thanks.
I see.
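
As a side note on why both spellings produce the same byte here: ZU (U+30BA) and ZA (U+30B6) share the UTF-8 lead byte 0xe3. The following is a small sketch that checks this; encode_utf8_3bytes and shares_lead_byte are hypothetical helpers written for this example and are not part of lib/normalizer.c.

```c
#include <stdint.h>

/* Hypothetical helper: encodes a code point in U+0800..U+FFFF
 * as three UTF-8 bytes (1110xxxx 10xxxxxx 10xxxxxx). */
static void
encode_utf8_3bytes(uint32_t code_point, unsigned char bytes[3])
{
  bytes[0] = (unsigned char)(0xe0 | (code_point >> 12));
  bytes[1] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3f));
  bytes[2] = (unsigned char)(0x80 | (code_point & 0x3f));
}

/* Returns 1 when both code points start with the same UTF-8 byte,
 * in which case the output can reuse the first byte of the input. */
static int
shares_lead_byte(uint32_t a, uint32_t b)
{
  unsigned char a_bytes[3];
  unsigned char b_bytes[3];
  encode_utf8_3bytes(a, a_bytes);
  encode_utf8_3bytes(b, b_bytes);
  return a_bytes[0] == b_bytes[0];
}
```

With these helpers, U+30BA encodes to e3 82 ba and U+30B6 to e3 82 b6, so copying current[0] and writing the literal 0xe3 give the same result when current points at ZU.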

kou · 2023-02-02T06:46:02Z

lib/normalizer.c

+      (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba);
+  }
+
+  if (start_char) {


If you want to introduce a descriptive variable for this condition, could you use a more meaningful name like is_du_or_zu or something?

In this case, I don't think that we need a descriptive variable because the following code is descriptive enough, but it's not a strong opinion:

if ((char_length == 3) &&
    /* U+30C5 KATAKANA LETTER DU */
    (current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
    /* U+30BA KATAKANA LETTER ZU */
    (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba)) {

We need to add extra parentheses around the character checks because && has higher precedence than ||, and I feel that somewhat undermines readability.
That is why I introduced a descriptive variable.

if ((char_length == 3) &&
    /* U+30C5 KATAKANA LETTER DU */
    ((current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
     /* U+30BA KATAKANA LETTER ZU */
     (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {

But this is also not a strong opinion...
In conclusion, I will remove the descriptive variable and fix it as above.
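
To make the precedence point concrete, here is a small sketch comparing the two groupings. The helper names (is_du, is_zu, matches_ungrouped, matches_grouped) are hypothetical and only exist for this example. matches_ungrouped spells out how C parses the condition when the extra parentheses are missing.

```c
#include <stdbool.h>
#include <stddef.h>

/* U+30C5 KATAKANA LETTER DU */
static bool
is_du(const unsigned char *c)
{
  return c[0] == 0xe3 && c[1] == 0x83 && c[2] == 0x85;
}

/* U+30BA KATAKANA LETTER ZU */
static bool
is_zu(const unsigned char *c)
{
  return c[0] == 0xe3 && c[1] == 0x82 && c[2] == 0xba;
}

/* How `len == 3 && DU || ZU` is parsed: && binds tighter than ||,
 * so the length check guards only the DU branch. */
static bool
matches_ungrouped(size_t char_length, const unsigned char *current)
{
  return ((char_length == 3) && is_du(current)) || is_zu(current);
}

/* With the extra parentheses, the length check guards both branches. */
static bool
matches_grouped(size_t char_length, const unsigned char *current)
{
  return (char_length == 3) && (is_du(current) || is_zu(current));
}
```

The two functions disagree only when the bytes look like ZU but char_length is not 3, which well-formed input should not produce. The grouped form still states the intent directly and does not rely on that property of the input.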

kou · 2023-02-02T06:46:55Z

lib/normalizer.c

+        next[1] == 0x82) {
+      if (next[2] == 0xa1) { /* U+30A1 KATAKANA LETTER SMALL A */
+        /* U+30B6 KATAKANA LETTER ZA */
+        unified_buffer[(*n_unified_bytes)++] = current[0];


I think that current[0] is better because existing implementation already uses the style.

kou · 2023-02-02T20:21:38Z

lib/normalizer.c

+      /* U+30C5 KATAKANA LETTER DU */
+      ((current[0] == 0xe3 && current[1] == 0x83 && current[2] == 0x85) ||
+      /* U+30BA KATAKANA LETTER ZU */
+      (current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {


Suggested change

(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {

(current[0] == 0xe3 && current[1] == 0x82 && current[2] == 0xba))) {

HashidaTKS · 2023-02-03T03:25:21Z

@kou

Thanks, I have addressed your comments.

kou

Could you split ヅァヅィヅヅェヅォ to unify_katakana_d_sounds?

test/command/suite/normalizers/nfkc100/unify_katakana_z_sounds.test

….test Co-authored-by: Sutou Kouhei <kou@clear-code.com>

HashidaTKS · 2023-02-03T04:40:10Z

Could you split ヅァヅィヅヅェヅォ to unify_katakana_d_sounds?

Let me confirm your suggestion.
Do you mean that unify_katakana_d_sounds converts ヅァヅィヅヅェヅォ to ザジヅゼゾ?

kou · 2023-02-03T04:42:12Z

Yes. They don't start with z.

HashidaTKS · 2023-02-03T04:53:57Z

Yes. They don't start with z.

Thanks, I will create unify_katakana_d_sounds with that behaviour.

HashidaTKS · 2023-02-03T04:55:23Z

Could you split ヅァヅィヅヅェヅォ to unify_katakana_d_sounds?

I have removed ヅァヅィヅヅェヅォ from the unify_katakana_z_sounds target.
I will create unify_katakana_d_sounds in another pull request.

test/command/suite/normalizers/nfkc121/unify_katakana_z_sounds.test

test/command/suite/normalizers/nfkc121/unify_katakana_z_sounds.expected

HashidaTKS marked this pull request as ready for review February 2, 2023 03:18

HashidaTKS commented Feb 2, 2023

kou reviewed Feb 2, 2023

HashidaTKS mentioned this pull request Feb 2, 2023

NormalizerNFKC: add unify_katakana_di_sound option #1504

Merged

kou reviewed Feb 2, 2023

HashidaTKS force-pushed the add-unify_katakana_z_sounds branch 2 times, most recently from 179b2c4 to 62dc84d Compare February 3, 2023 03:21

Add unify_katakana_z_sounds

fc4648b

HashidaTKS force-pushed the add-unify_katakana_z_sounds branch from 62dc84d to fc4648b Compare February 3, 2023 03:22

Remove needless diff

2c0b721

kou reviewed Feb 3, 2023

test/command/suite/normalizers/nfkc100/unify_katakana_z_sounds.test (Outdated)

HashidaTKS and others added 2 commits February 3, 2023 13:26

Update test/command/suite/normalizers/nfkc100/unify_katakana_z_sounds…

0958a9d

….test Co-authored-by: Sutou Kouhei <kou@clear-code.com>

Update expected

55cd802

Support only chars start with z sounds

ac86943

kou reviewed Feb 3, 2023

test/command/suite/normalizers/nfkc121/unify_katakana_z_sounds.test (Outdated)

test/command/suite/normalizers/nfkc121/unify_katakana_z_sounds.expected (Outdated)

Remove garbage

a6b9616

kou merged commit 441d800 into master Feb 3, 2023

kou deleted the add-unify_katakana_z_sounds branch February 3, 2023 05:00

github-actions bot mentioned this pull request May 22, 2023

nginx: update bundles version 1.23.4 komainu8/groonga#1

Closed
