Tokenizer for Ja/Ko #30

Closed
kination opened this issue Feb 11, 2021 · 3 comments

Comments

@kination
Contributor

Hello~
I'm currently testing the tokenizer with Japanese and Korean text, but it does not seem to be working correctly.

Is there a plan to support these languages?

Thanks.

@kination
Contributor Author

In the test code, when I add the following:

    let analyzed = analyzer.analyze("Zut, l’aspirateur, j’ai oublié de l’éteindre !");
    println!("{:?}", analyzed.tokens().map(|t| t.text().to_string()).collect::<Vec<_>>());
    println!("{:?}", analyzed.reconstruct().map(|(s, _)| s.to_string()).collect::<String>());

    let analyzed_kor = analyzer.analyze("안녕하세요, 오늘은 날씨가 매우 춥습니다!");
    println!("{:?}", analyzed_kor.tokens().map(|t| t.text().to_string()).collect::<Vec<_>>());
    println!("{:?}", analyzed_kor.reconstruct().map(|(s, _)| s.to_string()).collect::<String>());

the result is:

["zut", ", ", "l", "’", "aspirateur", ", ", "j", "’", "ai", " ", "oublie", " ", "de", " ", "l", "’", "eteindre", " !"]
"Zut, l’aspirateur, j’ai oublié de l’éteindre !"
["안", "녕", "하", "세", "요", ", ", "오", "늘", "은", " ", "날", "씨", "가", " ", "매", "우", " ", "춥", "습", "니", "다", "!"]
"안녕하세요, 오늘은 날씨가 매우 춥습니다!"

but I would expect the third line to be

"안녕하세요", ",", "오늘은", "날씨가", "매우", "춥습니다!"

and the same happens with Japanese too: the text is split into single characters.
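
For reference, the segmentation expected above can be approximated by splitting on whitespace and emitting punctuation as separate tokens. This is only a minimal sketch of that idea using the standard library: `split_on_whitespace_and_punctuation` is a hypothetical helper, not part of the tokenizer's API, and it splits the trailing "!" into its own token, unlike the expected output above. It also would not help with Japanese, which is written without spaces between words and needs a dictionary-based or statistical segmenter.

```rust
/// Hypothetical helper for illustration only; not part of the tokenizer crate.
/// Splits `text` on whitespace and emits each ASCII punctuation mark as its own token.
fn split_on_whitespace_and_punctuation(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();

    for ch in text.chars() {
        if ch.is_whitespace() {
            // Whitespace ends the current word and is dropped.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if ch.is_ascii_punctuation() {
            // Punctuation ends the current word and becomes its own token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(ch.to_string());
        } else {
            // Any other character (including Hangul syllables) extends the word.
            current.push(ch);
        }
    }

    // Flush the last word if the text did not end with whitespace or punctuation.
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Calling it on "안녕하세요, 오늘은 날씨가 매우 춥습니다!" keeps each space-delimited Korean word together instead of splitting it into single syllables.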

@kination
Contributor Author

Followed up in #49.

@curquiza linked a pull request Oct 6, 2021 that will close this issue
@curquiza
Member

Japanese support was added, but Korean support was not.
If you need Korean support, please open a discussion in the product repo so that the product team is aware of your need 😄
