NormalizerNFKC: add unify_katakana_trailing_o option #1506
Would you review this when you have time?
lib/normalizer.c
@@ -1783,6 +1783,84 @@ grn_nfkc_normalize_unify_katakana_g_sounds(grn_ctx *ctx,
return current;
}

static const unsigned char *
grn_nfkc_normalize_unify_katakana_trailing_o(grn_ctx *ctx,
const unsigned char *start,
Could you fix the indentation?
Fixed.
"katakana",
"katakana"
],
"checks": [
I think you told me what checks are before, but now I'm not confident whether these checks results are correct or not...
-1 corresponds to the ウ of オウ, for example, right?
And does -1 mean that the character is the second or subsequent character normalized by one definition that normalizes multiple characters at once?
The results are incorrect. #1506 (comment) will fix them.
> -1 corresponds to the ウ of オウ, for example, right?
Right.
> And does -1 mean that the character is the second or subsequent character normalized by one definition that normalizes multiple characters at once?
No. -1 means that the character (ウ) doesn't have a corresponding character in the source (オオ). But the second オ is the corresponding character of ウ, so -1 is incorrect here.
Hmm, the results after applying #1506 (comment) seem wrong...
Even the normalized result ("normalized": "オウオソウオノウオモウオロウオゾウオボウオ",) is wrong.
I will re-check it.
*n_used_bytes and *n_used_characters already include the first character's value, so the first character's value should not have been added again at that line.
But the checks result still has -1...
The current implementation normalizes two characters at once. That means when normalizing コオ -> コウ, we normalize not only オ but コオ itself. Is that the reason why the check result of オ is -1...?
I think in order to normalize only the オ of コオ, we need to know the previous character in grn_nfkc_normalize_unify_katakana_trailing_o.
In order to know the previous character in grn_nfkc_normalize_unify_katakana_trailing_o, we need to pass data about the previous character as the user_data argument, like:
Line 1888 in 8f04e6d
grn_nfkc_normalize_strip(grn_ctx *ctx,
Is it a good idea?
Let's try.
Modified.
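The idea discussed in this thread, handling one character at a time and remembering the previous character through a user_data-style pointer, can be sketched roughly as follows. This is only an illustration under assumptions, not the code in lib/normalizer.c: trailing_o_state, is_o_vowel_kana, unify_trailing_o, utf8_char_length and normalize_string are hypothetical names, and the real callback signature and user_data layout in Groonga may differ. The sketch also assumes the source file is compiled as UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state handed to the callback through a user_data pointer. */
typedef struct {
  bool previous_has_o_vowel; /* previous output character ends in "o" */
} trailing_o_state;

/* Katakana taken from the mapping list in this pull request. */
static const char *const o_vowel_kana[] = {
  "オ", "コ", "ソ", "ト", "ノ", "ホ", "モ",
  "ヨ", "ロ", "ゴ", "ゾ", "ド", "ボ", "ポ"
};

static bool
is_o_vowel_kana(const char *ch, size_t len)
{
  size_t i;
  for (i = 0; i < sizeof(o_vowel_kana) / sizeof(o_vowel_kana[0]); i++) {
    if (len == strlen(o_vowel_kana[i]) &&
        memcmp(ch, o_vowel_kana[i], len) == 0) {
      return true;
    }
  }
  return false;
}

/* Handles exactly one character and returns len bytes to emit for it. */
static const char *
unify_trailing_o(const char *ch, size_t len, trailing_o_state *state)
{
  if (state->previous_has_o_vowel && len == 3 && memcmp(ch, "オ", 3) == 0) {
    state->previous_has_o_vowel = false; /* the emitted ウ is not in the list */
    return "ウ";
  }
  state->previous_has_o_vowel = is_o_vowel_kana(ch, len);
  return ch;
}

static size_t
utf8_char_length(unsigned char lead)
{
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1; /* ASCII or an invalid lead byte */
}

/* out must have room for strlen(in) + 1 bytes. */
static void
normalize_string(const char *in, char *out)
{
  trailing_o_state state = {false};
  const char *end = in + strlen(in);
  while (in < end) {
    size_t len = utf8_char_length((unsigned char)*in);
    if (len > (size_t)(end - in)) {
      len = (size_t)(end - in); /* truncated sequence: copy what is left */
    }
    memcpy(out, unify_trailing_o(in, len, &state), len);
    in += len;
    out += len;
  }
  *out = '\0';
}
```

With this sketch, normalize_string("コオ", out) leaves コウ in out. Every output character is produced from exactly one source character, so コ passes through unchanged and only オ is rewritten. The sketch clears the flag after a replacement, so オオオ becomes オウオ; whether Groonga treats such runs the same way is not stated in this thread.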
test/command/suite/normalizers/nfkc121/unify_katakana_trailing_o.expected
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Thank you for your comments.
When unify_katakana_trailing_o is specified, NormalizerNFKC* normalizes characters as below.
オオ -> オウ
コオ -> コウ
ソオ -> ソウ
トオ -> トウ
ノオ -> ノウ
ホオ -> ホウ
モオ -> モウ
ヨオ -> ヨウ
ロオ -> ロウ
ゴオ -> ゴウ
ゾオ -> ゾウ
ドオ -> ドウ
ボオ -> ボウ
ポオ -> ポウ
Usage: