Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NormalizerNFKC: add unify_katakana_trailing_o option #1506

Merged
merged 7 commits into from
Feb 6, 2023

Conversation

HashidaTKS
Copy link
Contributor

When unify_katakana_trailing_o is specified, NormalizerNFKC* normalize characters as below.

オオ -> オウ
コオ -> コウ
ソオ -> ソウ
トオ -> トウ
ノオ -> ノウ
ホオ -> ホウ
モオ -> モウ
ヨオ -> ヨウ
ロオ -> ロウ
ゴオ -> ゴウ
ゾオ -> ゾウ
ドオ -> ドウ
ボオ -> ボウ
ポオ -> ポウ

Usage:

normalize \
  'NormalizerNFKC130("unify_katakana_trailing_o", true, \
                     "report_source_offset", true)' \
  "オオコオソオトオノオホオモオヨオロオゴオゾオドオボオポオ" \
  WITH_CHECKS|WITH_TYPES

@HashidaTKS HashidaTKS changed the title NormalizerNFKC: add unify_katakana_trailing_o option NormalizerNFKC: add unify_katakana_trailing_o option Feb 2, 2023
@HashidaTKS HashidaTKS force-pushed the add-unify_katakana_trailing_o branch 4 times, most recently from 5e2797b to b78760f Compare February 3, 2023 09:36
@HashidaTKS HashidaTKS marked this pull request as ready for review February 6, 2023 01:36
@HashidaTKS
Copy link
Contributor Author

@kou

Would you review this when you have time?

lib/normalizer.c Outdated
@@ -1783,6 +1783,84 @@ grn_nfkc_normalize_unify_katakana_g_sounds(grn_ctx *ctx,
return current;
}

static const unsigned char *
grn_nfkc_normalize_unify_katakana_trailing_o(grn_ctx *ctx,
const unsigned char *start,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix indent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

"katakana",
"katakana"
],
"checks": [
Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you told me what checks are before, but now I'm not confident with that these checks results are correct or not...
-1 corresponds to オウ's , for example, right?
And does -1 mean that the character is the second or subsequent character normalized by one definition that normalizes multiple characters at once?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results are incorrect.
#1506 (comment) will fix them.

-1 corresponds to オウ's , for example, right?

Right.

And does -1 mean that the character is the second or subsequent character normalized by one definition that normalizes multiple characters at once?

No. -1 means that the character () doesn't have corresponding character in the source (オオ). But the second is the corresponding character of .

Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the results applied #1506 (comment) seem wrong...
Even the normalized result ("normalized": "オウオソウオノウオモウオロウオゾウオボウオ",) is wrong.
I will re-check it.

Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*n_used_bytes and *n_used_characters are already added first character's value, so they should not have added the first character's value again at that line.

Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But checks result still have -1...

Copy link
Contributor Author

@HashidaTKS HashidaTKS Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation normalize two bytes at once, that means when normalizing コオ -> コウ, we normalize not only but コオ it self. Is that reason why the check result of is -1 ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in order to normalize only of コオ, we need to know the previous character in grn_nfkc_normalize_unify_katakana_trailing_o.
In order to know the previous character in grn_nfkc_normalize_unify_katakana_trailing_o, we need to pass data about the previous character as the user_data argument like

grn_nfkc_normalize_strip(grn_ctx *ctx,

Is it a good idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified.

lib/normalizer.c Outdated Show resolved Hide resolved
@HashidaTKS
Copy link
Contributor Author

@kou

Thank you for your comments.
I have addressed you comments.
Would you re-review this when you have time?

lib/normalizer.c Outdated Show resolved Hide resolved
@kou kou merged commit ba2432a into master Feb 6, 2023
@kou kou deleted the add-unify_katakana_trailing_o branch February 6, 2023 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants