NormalizerNFKC: add `unify_katakana_trailing_o` option #1506

HashidaTKS · 2023-02-02T10:09:36Z

When `unify_katakana_trailing_o` is specified, NormalizerNFKC* normalizes characters as follows:

オオ -> オウ
コオ -> コウ
ソオ -> ソウ
トオ -> トウ
ノオ -> ノウ
ホオ -> ホウ
モオ -> モウ
ヨオ -> ヨウ
ロオ -> ロウ
ゴオ -> ゴウ
ゾオ -> ゾウ
ドオ -> ドウ
ボオ -> ボウ
ポオ -> ポウ

Usage:

normalize \
  'NormalizerNFKC130("unify_katakana_trailing_o", true, \
                     "report_source_offset", true)' \
  "オオコオソオトオノオホオモオヨオロオゴオゾオドオボオポオ" \
  WITH_CHECKS|WITH_TYPES
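
For readers who want to try the mapping outside Groonga, here is a minimal C sketch of the pair table above. It is not the code in `lib/normalizer.c`: the function names `utf8_char_length` and `unify_trailing_o_pairs` are hypothetical, the source file is assumed to be saved as UTF-8, and the input is assumed to be a NUL-terminated UTF-8 string. It scans character by character and rewrites each listed pair in place, which works because every pair has the same byte length before and after.

```c
#include <stddef.h>
#include <string.h>

/* Each katakana below is 3 bytes in UTF-8, so every pair is 6 bytes. */
#define PAIR_BYTES 6

/* The pairs from the list above: {source, normalized}. */
static const char *const trailing_o_pairs[][2] = {
  {"オオ", "オウ"}, {"コオ", "コウ"}, {"ソオ", "ソウ"}, {"トオ", "トウ"},
  {"ノオ", "ノウ"}, {"ホオ", "ホウ"}, {"モオ", "モウ"}, {"ヨオ", "ヨウ"},
  {"ロオ", "ロウ"}, {"ゴオ", "ゴウ"}, {"ゾオ", "ゾウ"}, {"ドオ", "ドウ"},
  {"ボオ", "ボウ"}, {"ポオ", "ポウ"}
};

/* Byte length of the UTF-8 character whose lead byte is c. */
static size_t
utf8_char_length(unsigned char c)
{
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 1; /* invalid lead byte: step over a single byte */
}

/* Hypothetical helper: rewrites the listed pairs in place. */
void
unify_trailing_o_pairs(char *text)
{
  const size_t n_pairs =
    sizeof(trailing_o_pairs) / sizeof(trailing_o_pairs[0]);
  size_t length = strlen(text);
  size_t i = 0;

  while (i < length) {
    size_t j;
    int replaced = 0;
    for (j = 0; j < n_pairs && i + PAIR_BYTES <= length; j++) {
      if (memcmp(text + i, trailing_o_pairs[j][0], PAIR_BYTES) == 0) {
        memcpy(text + i, trailing_o_pairs[j][1], PAIR_BYTES);
        i += PAIR_BYTES; /* consume both characters of the pair */
        replaced = 1;
        break;
      }
    }
    if (!replaced) {
      i += utf8_char_length((unsigned char)text[i]);
    }
  }
}
```

Calling `unify_trailing_o_pairs()` on a writable buffer holding `オオコオソオ` leaves `オウコウソウ` in the buffer. Text that contains none of the listed pairs is left unchanged.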

HashidaTKS · 2023-02-06T01:37:12Z

@kou

Would you review this when you have time?

kou · 2023-02-06T01:43:26Z

lib/normalizer.c

@@ -1783,6 +1783,84 @@ grn_nfkc_normalize_unify_katakana_g_sounds(grn_ctx *ctx,
  return current;
 }

+static const unsigned char *
+grn_nfkc_normalize_unify_katakana_trailing_o(grn_ctx *ctx,
+                                           const unsigned char *start,


Could you fix the indent?

HashidaTKS · 2023-02-06T01:43:47Z

test/command/suite/normalizers/nfkc121/unify_katakana_trailing_o.expected

+      "katakana",
+      "katakana"
+    ],
+    "checks": [


I think you explained what checks are before, but I'm not confident whether these checks results are correct...
-1 corresponds to オウ's ウ, for example, right?
And does -1 mean that the character is the second or subsequent character normalized by one definition that normalizes multiple characters at once?

The results are incorrect.
#1506 (comment) will fix them.

> -1 corresponds to オウ's ウ, for example, right?

Right.

> And does -1 mean that the character is the second or subsequent character normalized by one definition that normalizes multiple characters at once?

No. -1 means that the character (ウ) doesn't have a corresponding character in the source (オオ). But here, the second オ is the character that corresponds to ウ.

Hmm, the results after applying #1506 (comment) seem wrong...
Even the normalized result ("normalized": "オウオソウオノウオモウオロウオゾウオボウオ",) is wrong.
I will re-check it.

`*n_used_bytes` and `*n_used_characters` already include the first character's value, so the first character's value should not be added again at that line.

But the checks results still have -1...

The current implementation normalizes two characters at once. That means that when normalizing コオ -> コウ, we normalize not only オ but コオ itself. Is that the reason why the check result of オ is -1...?

I think that in order to normalize only the オ of コオ, we need to know the previous character in `grn_nfkc_normalize_unify_katakana_trailing_o`.
To know the previous character there, we need to pass data about the previous character as the `user_data` argument, like

groonga/lib/normalizer.c

Line 1888 in 8f04e6d

grn_nfkc_normalize_strip(grn_ctx *ctx,

Is that a good idea?
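
The idea of carrying the previous character through `user_data` can be sketched as follows. This is an illustration only, not Groonga's code: `trailing_o_state`, `unify_trailing_o_char`, `normalize_with_state`, and the `source_bytes` array are hypothetical names, and `source_bytes` is a simplified per-character stand-in for the checks discussed above rather than Groonga's real checks layout. The sketch also assumes that the state remembers the previous *source* character and that the source file is saved as UTF-8. Because only the trailing オ is rewritten, each output character keeps exactly one source character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical state carried between per-character calls via user_data. */
typedef struct {
  int previous_is_o_row; /* was the previous source character オ, コ, ...? */
} trailing_o_state;

/* Byte length of the UTF-8 character whose lead byte is c. */
static size_t
char_byte_length(unsigned char c)
{
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 1; /* invalid lead byte: step over a single byte */
}

/* True when ch is one of the katakana that may precede a trailing オ. */
static int
is_o_row(const char *ch, size_t length)
{
  static const char *const o_row[] = {
    "オ", "コ", "ソ", "ト", "ノ", "ホ", "モ", "ヨ", "ロ",
    "ゴ", "ゾ", "ド", "ボ", "ポ"
  };
  size_t i;
  if (length != 3) return 0;
  for (i = 0; i < sizeof(o_row) / sizeof(o_row[0]); i++) {
    if (memcmp(ch, o_row[i], 3) == 0) return 1;
  }
  return 0;
}

/* Hypothetical per-character callback: writes one normalized character
 * to out and returns the number of bytes written. */
size_t
unify_trailing_o_char(const char *ch, size_t length, char *out,
                      void *user_data)
{
  trailing_o_state *state = user_data;
  int is_trailing_o =
    state->previous_is_o_row && length == 3 && memcmp(ch, "オ", 3) == 0;
  memcpy(out, is_trailing_o ? "ウ" : ch, length);
  state->previous_is_o_row = is_o_row(ch, length);
  return length;
}

/* Hypothetical driver. out needs strlen(text) + 1 bytes; source_bytes
 * needs one slot per character. Returns the number of characters. */
size_t
normalize_with_state(const char *text, char *out, int *source_bytes)
{
  trailing_o_state state = {0};
  size_t length = strlen(text);
  size_t i = 0, n_out = 0, n_characters = 0;

  while (i < length) {
    size_t char_length = char_byte_length((unsigned char)text[i]);
    if (i + char_length > length) {
      char_length = length - i; /* truncated trailing character */
    }
    n_out += unify_trailing_o_char(text + i, char_length, out + n_out,
                                   &state);
    source_bytes[n_characters++] = (int)char_length;
    i += char_length;
  }
  out[n_out] = '\0';
  return n_characters;
}
```

For the input `コオ`, this sketch writes `コウ` and records a source length of 3 bytes for each of the two output characters, so no character is left without a source counterpart.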

HashidaTKS · 2023-02-06T05:28:17Z

@kou

Thank you for your comments.
I have addressed your comments.
Would you re-review this when you have time?

HashidaTKS changed the title ~~NormalizerNFKC: add unify_katakana_trailing_o option~~ NormalizerNFKC: add `unify_katakana_trailing_o` option Feb 2, 2023

HashidaTKS force-pushed the add-unify_katakana_trailing_o branch 4 times, most recently from 5e2797b to b78760f Compare February 3, 2023 09:36

Add unify_katakana_trailing_o option

417c9da

HashidaTKS force-pushed the add-unify_katakana_trailing_o branch from b78760f to 417c9da Compare February 3, 2023 09:37

HashidaTKS marked this pull request as ready for review February 6, 2023 01:36

kou reviewed Feb 6, 2023

HashidaTKS commented Feb 6, 2023

kou reviewed Feb 6, 2023

HashidaTKS commented Feb 6, 2023

HashidaTKS and others added 5 commits February 6, 2023 10:50

Align indents

3207253

Update lib/normalizer.c

7f75287

Co-authored-by: Sutou Kouhei <kou@clear-code.com>

Fix expected

8415225

Remove extra increment

8f04e6d

Modify to normalize only trailing "o"

5d8b62d

kou reviewed Feb 6, 2023

Remove a trailing space

3936b92

kou merged commit ba2432a into master Feb 6, 2023

kou deleted the add-unify_katakana_trailing_o branch February 6, 2023 07:17

HashidaTKS mentioned this pull request Feb 8, 2023

doc news: update news for 13.0.0 #1507

Merged

github-actions bot mentioned this pull request May 22, 2023

nginx: update bundles version 1.23.4 komainu8/groonga#1

Closed
