Japanese character set regexp? #4

Closed
zersiax opened this issue Dec 29, 2020 · 3 comments
Comments

@zersiax

zersiax commented Dec 29, 2020

Hello,

I see there is a character set for Chinese, as well as Russian, but I am missing one for Japanese. I know that Chinese and Japanese share a lot of characters in Unicode, so I will most likely have to remove the Chinese one and add the Japanese one. However, I am having a bit of trouble getting the regexp right for all three scripts. My attempt:
ja_ja:([一-龯])

This gives us (most) kanji, and works for those. However, ideally, I would also like to add the following regexp:
([ぁ-んァ-ン])
which should add hiragana and katakana. However, adding that line below the previous one makes the whole thing stop working. My regexp is a bit rusty and I'm not sure how to combine these two. Am I reinventing the wheel here?

@mltony
Owner

mltony commented Dec 29, 2020 via email

@zersiax
Author

zersiax commented Dec 29, 2020

Somewhat :) That first link is where I got the first regexp from. Like I said, I think it will probably work if those two regular expressions can be combined somehow. I just don't know how to do that. Do you know how those two can be combined?
Alternatively, will this notation, from the second link, be supported?
/[\u3000-\u303F]|[\u3040-\u309F]|[\u30A0-\u30FF]|[\uFF00-\uFFEF]|[\u4E00-\u9FAF]|[\u2605-\u2606]|[\u2190-\u2195]|\u203B/g;
Looking at this second one, would placing a | character between the two regexps from my first comment combine them, so that both belong to the same language?
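
For anyone who wants to check the idea outside the add-on, here is a minimal sketch in plain Python. It assumes Python's `re` module, which accepts `\uXXXX` escapes in patterns; whether the add-on's config accepts that notation is not confirmed in this thread. The names `ALTERNATION` and `MERGED` are made up for illustration, and only the hiragana, katakana and kanji ranges from the pattern above are used. It suggests that joining character classes with `|` matches the same characters as putting all the ranges into one class.

```python
import re

# Hypothetical names. The code-point ranges are copied from the
# JavaScript-style pattern quoted above (hiragana, katakana, kanji only).
ALTERNATION = re.compile(r"[\u3040-\u309F]|[\u30A0-\u30FF]|[\u4E00-\u9FAF]")
MERGED = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF]")

sample = "日本語のテキスト and English"

# findall returns every single-character match, in order.
alt_hits = ALTERNATION.findall(sample)
merged_hits = MERGED.findall(sample)

# Both patterns pick out the same Japanese characters and skip the Latin text.
print(alt_hits == merged_hits)
print("".join(alt_hits))
```

So `|` between classes and a single merged class should behave the same for matching single characters; which one to use is mostly a matter of readability.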

@zersiax
Author

zersiax commented Dec 29, 2020

That is an affirmative. For people stumbling across this issue as well, the regexp you want for Japanese, so that it reads both kanji and kana, is the following:
ja_ja:([一-龯])|([ぁ-んァ-ン])
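
A quick way to sanity-check that line is to run the regexp part through Python's `re` module. This is only a sketch: it assumes the `ja_ja:` prefix is the add-on's language label and not part of the regexp, and `classify` is a hypothetical helper, not something from the add-on.

```python
import re

# The pattern from the comment above, minus the "ja_ja:" prefix, which is
# assumed to be the add-on's language label rather than part of the regexp.
JA_PATTERN = re.compile("([一-龯])|([ぁ-んァ-ン])")

def classify(text):
    """Return (character, kind) pairs for every match (hypothetical helper)."""
    results = []
    for match in JA_PATTERN.finditer(text):
        # Group 1 is the kanji class, group 2 is the kana class.
        kind = "kanji" if match.group(1) else "kana"
        results.append((match.group(0), kind))
    return results

print(classify("Hello 漢字とカナ!"))
```

The Latin letters and punctuation are skipped, while the kanji, hiragana and katakana characters are each matched by one of the two groups.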

I am not entirely sure if this will work with half-width characters and other edge cases, but it should help in the majority of cases.
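
To make the edge cases concrete, the sketch below checks two characters against the pattern in Python: a half-width katakana letter and the prolonged sound mark ー (U+30FC), both of which fall outside the ranges used above. The `EXTENDED` pattern is an assumption of mine, not something tested with the add-on; it adds U+30FC and the half-width katakana range U+FF66 to U+FF9F to the same classes.

```python
import re

# Pattern from this thread (regexp part only).
THREAD = re.compile("([一-龯])|([ぁ-んァ-ン])")

# Hypothetical extension, written with \u escapes to avoid ambiguity:
# kanji, hiragana, katakana, prolonged sound mark, half-width katakana.
EXTENDED = re.compile(
    r"[\u4E00-\u9FAF\u3041-\u3093\u30A1-\u30F3\u30FC\uFF66-\uFF9F]"
)

half_width_ka = "\uFF76"   # half-width katakana KA
long_vowel = "\u30FC"      # prolonged sound mark, as in コーヒー

for ch in (half_width_ka, long_vowel):
    # The thread pattern misses both; the extended one catches both.
    print(ch, THREAD.search(ch) is not None, EXTENDED.search(ch) is not None)
```

Whether those extra characters should be treated as Japanese in the add-on is a judgment call; the sketch only shows which characters each pattern covers.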

@zersiax zersiax closed this as completed Dec 29, 2020