[WIP] Setup lindera tokenizer for ko/ja support #49
Conversation
src/tokenizer/lindera.rs
Outdated
```rust
TokenStream {
    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
```
The Jieba tokenizer uses `bytes_len` to get the size of tokens. Is there a specific reason for this, or is it just a difference between jieba and lindera?
I'm not very familiar with the tokenizing logic in meilisearch, so please let me know the details if you can 🙏
Hello @djKooks!

There are 2 kinds of indices in our tokens:
- `char_start`/`char_end`, which are indices in terms of character count
- `byte_start`/`byte_end`, which are indices in terms of byte count

They are not the same because a `char` can contain several bytes.
The `..._start` is the index of the beginning of the token from the start of the whole tokenized string.
The `..._end` is the index of the end of the token from the start of the whole tokenized string.

I need to investigate more to understand the `detail` attribute of the Lindera `Token`.
But, for now, you can just pass a tuple `(CharIndex, ByteIndex)` initialized to `(0, 0)` to `scan`.
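To make the distinction concrete, here is a small stdlib-only sketch (the token string is just an example) showing why character and byte indices diverge for CJK text:

```rust
fn main() {
    let token = "空港"; // 2 characters, each encoded as 3 bytes in UTF-8
    let char_count = token.chars().count();
    let byte_count = token.len(); // str::len() counts bytes, not characters
    assert_eq!(char_count, 2);
    assert_eq!(byte_count, 6);
}
```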
src/tokenizer/lindera.rs
Outdated
```rust
inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
    let char_start = 0;
    let char_end = lindera_token.text.len();
    Some(Token {
        kind: TokenKind::Word,
        word: Cow::Borrowed(lindera_token.text),
        char_index: 0,
        byte_start: char_start,
        byte_end: char_end,
    })
}))
```
Suggested change:

```rust
inner: Box::new(tokenized.into_iter().scan((0, 0), move |(char_index, byte_index), lindera_token| {
    let char_count = lindera_token.text.chars().count();
    let char_start = *char_index;
    *char_index += char_count;
    let byte_count = lindera_token.text.len();
    let byte_start = *byte_index;
    *byte_index += byte_count;
    Some(Token {
        kind: TokenKind::Word,
        word: Cow::Borrowed(lindera_token.text),
        char_index: char_start,
        byte_start,
        byte_end: *byte_index,
    })
}))
```
this should work. 😄
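For readers unfamiliar with `scan`: it threads a mutable state value through the iterator, which is what lets the character and byte offsets accumulate across tokens. A minimal stdlib-only sketch of the same pattern (the token strings are illustrative):

```rust
fn main() {
    let tokens = ["関西国際空港", "限定", "トートバッグ"];
    let starts: Vec<(usize, usize)> = tokens
        .iter()
        .scan((0usize, 0usize), |(char_index, byte_index), text| {
            let start = (*char_index, *byte_index);
            *char_index += text.chars().count(); // running character offset
            *byte_index += text.len();           // running byte offset
            Some(start)
        })
        .collect();
    // 関西国際空港 is 6 chars / 18 bytes, 限定 is 2 chars / 6 bytes
    assert_eq!(starts, vec![(0, 0), (6, 18), (8, 24)]);
}
```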
@ManyTheFish seems it's correct. Thanks for the guidance 🙏
# Conflicts: # src/detection.rs
@ManyTheFish thanks for the help. I have a few more questions...
-> Is
-> is there some common rule for keeping a dictionary in the repository?
Hello @djKooks! @ManyTheFish is on holiday, he will answer you once he comes back! 🙂 One request from me: is it possible to add tests for what you implemented?
```rust
fn test_japanese_language() {
    let analyzer = Analyzer::new(AnalyzerConfig::<Vec<u8>>::default());

    let orig = "関西国際空港限定トートバッグ";
```
AFAIK the correct result should be `["関西国際空港", "限定", "トートバッグ"]`, but it comes out as `["関西", "国际", "空港", "限定", "ト", "ー", "ト", "バ", "ッ", "グ"]`.
This seems to be an issue with the analyzer.
If the `orig` looks like the above, it seems to select the `jieba` tokenizer because the first character is `関`, which is used in both Chinese and Japanese.
How could this be solved?
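One rough disambiguation heuristic (a sketch only, not what the analyzer actually does): kana characters occur only in Japanese, so their presence can distinguish Japanese from Chinese even when most characters are shared Han/kanji:

```rust
// Hiragana and Katakana occupy U+3040..=U+30FF; Han-only text remains
// ambiguous between Chinese and Japanese under this check.
fn contains_kana(text: &str) -> bool {
    text.chars().any(|c| ('\u{3040}'..='\u{30FF}').contains(&c))
}

fn main() {
    assert!(contains_kana("関西国際空港限定トートバッグ")); // katakana suffix
    assert!(!contains_kana("関西国際空港")); // kanji only: still ambiguous
}
```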
Hey! @djKooks!
In the default function of the `AnalyzerConfig` you should add a new pipeline in order to use your new tokenizer when a Japanese script is detected:

```rust
fn default() -> Self {
    let mut pipeline_map: HashMap<(Script, Language), Pipeline> = HashMap::new();
    // Latin script specialized pipeline
    pipeline_map.insert((Script::Latin, Language::Other), Pipeline::default()
        .set_tokenizer(LegacyMeilisearch));
    // Chinese script specialized pipeline
    pipeline_map.insert((Script::Mandarin, Language::Other), Pipeline::default()
        .set_pre_processor(ChineseTranslationPreProcessor)
        .set_tokenizer(Jieba::default()));
    // Japanese script specialized pipeline
    pipeline_map.insert((Script::Katakana, Language::Other), Pipeline::default()
        .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic")));
    pipeline_map.insert((Script::Hiragana, Language::Other), Pipeline::default()
        .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic")));
    // Korean script specialized pipeline
    pipeline_map.insert((Script::Hangul, Language::Other), Pipeline::default()
        .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ko-dic")));

    AnalyzerConfig { pipeline_map, stop_words: None }
}
```
I don't know Japanese or Korean but the previous lines might help you to fix your test 👍
Don't hesitate to ping me if you need more help!
@ManyTheFish thanks for your feedback. I have 2 questions.
- I've made a few changes in the following commit, to set up a dictionary for the tokenizer. But because I'm not very good at Rust, I think you can suggest a better way to define the Lindera tokenizer that extends the current base tokenizer. Could you give me some suggestions?
- I've tested the code you shared above. The current main problem is that if the first characters of a text are ones used commonly across CJK (Chinese/Japanese/Korean), the text will be identified as Chinese because the ChineseTokenizer comes first (maybe I'm wrong). Are there any solutions for when there are multiple languages in a text?
Hello @djKooks, 😄
For your first point, I wrote some comments.
About your second point, I agree with you that the script detection is not the best 😅; I will enhance it sooner or later 😊
src/tokenizer/lindera.rs
Outdated
```rust
    false => Mode::Decompose,
};

let mut tokenizer = LinderaTokenizer::new(mode, &self.dict);
```
I'm not comfortable with initializing the `LinderaTokenizer` during each tokenization; you should create a `new` method where you initialize the `LinderaTokenizer` and store it in the `Lindera` struct.
Moreover, you may store this initialization in a Lazy wrapper if it is time-consuming, since the tokenizer is not useful for everybody.
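A sketch of the lazy-initialization idea using the standard library's `OnceLock` (the `once_cell` crate offers the same pattern on older Rust). The `Tokenizer` struct here is a hypothetical stand-in for the real `LinderaTokenizer`, assumed to be expensive to construct:

```rust
use std::sync::OnceLock;

// Stand-in for the real LinderaTokenizer (fields are illustrative).
struct Tokenizer {
    dict: String,
}

fn tokenizer() -> &'static Tokenizer {
    static TOKENIZER: OnceLock<Tokenizer> = OnceLock::new();
    // The closure runs only on the first call; later calls reuse the value,
    // so users who never tokenize Japanese never pay the dictionary-load cost.
    TOKENIZER.get_or_init(|| Tokenizer { dict: String::from("ja-dic") })
}

fn main() {
    assert_eq!(tokenizer().dict, "ja-dic");
    // Both calls return the same shared instance.
    assert!(std::ptr::eq(tokenizer(), tokenizer()));
}
```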
src/tokenizer/lindera.rs
Outdated
```rust
#[derive(Debug, Default)]
pub struct Lindera {
    pub normal_mode: bool,
    pub dict: &'static str,
```
Jieba has a default dictionary stored in a const; could we do the same with lindera?
@ManyTheFish could you confirm the new change is the one you're intending?
Hello @djKooks! Thanks for your changes! Can you remove the git conflicts you have on the
Hello @djKooks, sorry for the delay. I requested some changes to your implementation. 😄
Moreover, I think you will have to rebase your branch because of the merge conflicts.
```rust
// Japanese script specialized pipeline
// TODO: define dict path for japanese
```
```diff
@@ -132,9 +135,25 @@ impl<A> Default for AnalyzerConfig<'_, A> {
     .set_tokenizer(LegacyMeilisearch));

     // Chinese script specialized pipeline
```
```rust
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "");

pipeline_map.insert((Script::Katakana, Language::Other), Pipeline::default()
    .set_tokenizer(Lindera { tokenizer }));

pipeline_map.insert((Script::Hiragana, Language::Other), Pipeline::default()
    .set_tokenizer(Lindera { tokenizer }));

// TODO: define dict path for korean
pipeline_map.insert((Script::Hangul, Language::Other), Pipeline::default()
    .set_tokenizer(Lindera { tokenizer }));
```
You may impl `Default` on your `Lindera` wrapper so that it creates the lindera tokenizer directly, internally.
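A minimal sketch of that suggestion, using a hypothetical stand-in for lindera's tokenizer type (the real constructor and fields differ):

```rust
// Hypothetical stand-in for lindera's LinderaTokenizer.
struct LinderaTokenizer {
    mode: &'static str,
    dict: &'static str,
}

pub struct Lindera {
    tokenizer: LinderaTokenizer,
}

impl Default for Lindera {
    // Build the inner lindera tokenizer directly inside the wrapper,
    // so callers never have to construct it by hand.
    fn default() -> Self {
        Lindera {
            tokenizer: LinderaTokenizer { mode: "normal", dict: "" },
        }
    }
}

fn main() {
    let lindera = Lindera::default();
    assert_eq!(lindera.tokenizer.mode, "normal");
}
```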
```rust
let orig = "関西国際空港限定トートバッグ";
let analyzed = analyzer.analyze(orig);
let analyzed: Vec<_> = analyzed.tokens().map(|token| token.word).collect();
println!("Analyzed : {:?}", analyzed);
```
Suggested change:

```diff
-println!("Analyzed : {:?}", analyzed);
+assert_eq!(analyzed, vec!["関西国際空港", "限定", "トートバッグ"]);
```
I look forward to Korean language support.
Hey @jungbin-kwon, thanks for your support!
@ManyTheFish Hello. Thanks for responding to my comment. v0.23.1 Issue Test environment
PS.
I'd like to use MeiliSearch in Japanese.
Hey @yagince! On my side, I will come back to the tokenizer in a few months, but not in January 😞. Thanks for your message and have a good day! 😊
@ManyTheFish
@yagince @jungbin-kwon thanks for your comments! @ManyTheFish is there any progress here?
Hey @djKooks, sorry if I haven't been clear. For the moment we detect the most-used script in the attribute and tokenize with the corresponding tokenizer. It's an issue when we have several languages in the same field, but your PR could, at least, enhance tokenization for fields that contain mostly Japanese characters. There are some small change requests to fix, and then I think it will be good for me.
Hello, @djKooks. We are trying to use Meilisearch in Japanese, but we are having some problems with the accuracy of the search for Japanese data.
70: Setup lindera tokenizer for ja support (related with #49) r=ManyTheFish a=miiton

I implemented my idea for #49. I apologize first: I was wondering whether it would be better to open a pull request against `@djKooks'` repository or to PR here, but the difference with the main branch was too big, so I decided to send it here. Is there any problem?

With this, I think the purpose of "using Lindera for Japanese processing" has been achieved.

- As a preliminary step, implement the following from #49:
  - Merge main branch.
  - Fixed with `@ManyTheFish`'s comments.
  - lindera-rs bump to v0.8.1
- As an answer to "How to detect language, Chinese or Japanese?":
  - I implemented it with `whatlang::detect_lang()`.
  - Before this commit, only `whatlang::detect_script()` was used, to detect "English" or "Other".

| Script | Language | Tokenizer |
| ---------------- | ------------- | --------- |
| Script::Mandarin | Language::Cmn | Jieba |
| Script::Mandarin | Language::Jpn | Lindera |
| Script::Katakana | Language::Jpn | Lindera |
| Script::Hiragana | Language::Jpn | Lindera |

We can probably detect with only "Language", but I also left "Script" for backward compatibility.

### Remaining problems

`関西国際空港限定トートバッグ` was detected as "Japanese", but `関西国際空港` was detected as "Chinese" even though it is "Japanese", because only Mandarin characters are used. This is a whatlang issue. (note: `関西国际空港` is "Chinese")

As for Korean, I found some commits in the main branch, so I won't touch on that.

# Pull Request

## What does this PR do?
Fixes #49

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Co-authored-by: djKooks <inylove82@gmail.com>
Co-authored-by: miiton <468745+miiton@users.noreply.github.com>
closing this PR in favor of #70