[WIP] Setup lindera tokenizer for ko/ja support #49

Closed
wants to merge 9 commits

Conversation

kination (Contributor)

No description provided.

TokenStream {
    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
kination (Contributor Author):

In the Jieba tokenizer, bytes_len is used to get the size of tokens. Is there a specific reason for this, or is it just a difference between the jieba and lindera tokenizers?

I'm not very familiar with the tokenizing logic in meilisearch, so please let me know the details if you can 🙏

Member:

Hello @djKooks!
There are 2 kinds of indices in our tokens:

  • char_start/char_end, which are indices in terms of character count
  • byte_start/byte_end, which are indices in terms of byte count

They are not the same because a char can contain several bytes.

The ..._start is the index of the beginning of the token, counted from the start of the whole tokenized string.
The ..._end is the index of the end of the token, counted from the start of the whole tokenized string.

I need to investigate more to understand the detail attribute of the Lindera Token.

But, for now, you can just pass a tuple (CharIndex, ByteIndex) initialized to (0, 0) to scan.
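As a small illustration of the character/byte difference (an added example, not from the PR):

    fn main() {
        // "空港" is 2 characters but 6 bytes in UTF-8 (3 bytes per character).
        let text = "空港";
        assert_eq!(text.chars().count(), 2); // character count
        assert_eq!(text.len(), 6);           // byte count
    }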

@kination kination changed the title Setup lindera tokenizer for ko/ja support [WIP] Setup lindera tokenizer for ko/ja support Jun 12, 2021

Comment on lines 20 to 30
    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
        Some(Token {
            kind: TokenKind::Word,
            word: Cow::Borrowed(lindera_token.text),
            char_index: 0,
            byte_start: char_start,
            byte_end: char_end,
        })
    }))
Member:

Suggested change

    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
        Some(Token {
            kind: TokenKind::Word,
            word: Cow::Borrowed(lindera_token.text),
            char_index: 0,
            byte_start: char_start,
            byte_end: char_end,
        })
    }))

    inner: Box::new(tokenized.into_iter().scan((0, 0), move |(char_index, byte_index), lindera_token| {
        let char_count = lindera_token.text.chars().count();
        let char_start = *char_index;
        *char_index += char_count;
        let byte_count = lindera_token.text.len();
        let byte_start = *byte_index;
        *byte_index += byte_count;
        Some(Token {
            kind: TokenKind::Word,
            word: Cow::Borrowed(lindera_token.text),
            char_index: char_start,
            byte_start,
            byte_end: *byte_index,
        })
    }))

this should work. 😄
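To illustrate the running offsets the scan state keeps (an added example, not part of the suggestion):

    fn main() {
        // Two consecutive tokens, as the scan above would see them.
        let tokens = ["関西", "空港"];
        let (mut char_index, mut byte_index) = (0usize, 0usize);
        for text in tokens {
            let (char_start, byte_start) = (char_index, byte_index);
            char_index += text.chars().count(); // +2 characters per token here
            byte_index += text.len();           // +6 bytes per token (3 bytes per character)
            println!("{}: chars {}..{}, bytes {}..{}", text, char_start, char_index, byte_start, byte_index);
        }
        // 関西: chars 0..2, bytes 0..6
        // 空港: chars 2..4, bytes 6..12
    }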

kination (Contributor Author):

@ManyTheFish seems it's correct. Thanks for the guidance 🙏

kination (Contributor Author)

@ManyTheFish thanks for the help. I have a few more questions...

  • it seems the analyzer works as below, to indicate which language the input uses
pipeline_map.insert((Script::Latin, Language::Other), Pipeline::default()
            .set_tokenizer(LegacyMeilisearch));

        // Chinese script specialized pipeline
        pipeline_map.insert((Script::Mandarin, Language::Other), Pipeline::default()
            .set_pre_processor(ChineseTranslationPreProcessor)
            .set_tokenizer(Jieba::default()));

-> Is the Script::XXX value given by the user, or is it detected automatically?

  • Lindera supports both Japanese and Korean, but it needs a dictionary set up per language.
/// if japanese...
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic");

/// if korean...
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "/path/to/ko-dic");

-> Is there a common rule for keeping dictionaries in the repository?
-> Do you have a good suggestion for passing the language type as a parameter through fn tokenize<'a>(&self, s: &'a ProcessedText<'a>)?

@kination kination mentioned this pull request Jun 19, 2021
curquiza (Member)

Hello @djKooks! @ManyTheFish is on holiday, he will answer you once he comes back! 🙂

A request from me: is it possible to add tests for what you implemented?

fn test_japanese_language() {
let analyzer = Analyzer::new(AnalyzerConfig::<Vec<u8>>::default());

let orig = "関西国際空港限定トートバッグ";
kination (Contributor Author) Jun 22, 2021:

@ManyTheFish

AFAIK the correct result should be: ["関西国際空港", "限定", "トートバッグ"], but it shows as ["関西", "国际", "空港", "限定", "ト", "ー", "ト", "バ", "ッ", "グ"]

This seems to be an issue with the analyzer.
If the orig looks like the following, it seems the jieba tokenizer is selected because the first character is one that is used in both Chinese and Japanese.

How could this be solved?

Member:

Hey! @djKooks!

In the default function of the AnalyzerConfig, you should add a new pipeline in order to use your new tokenizer when a Japanese script is detected:

    fn default() -> Self {
        let mut pipeline_map: HashMap<(Script, Language), Pipeline> = HashMap::new();


        // Latin script specialized pipeline
        pipeline_map.insert((Script::Latin, Language::Other), Pipeline::default()
            .set_tokenizer(LegacyMeilisearch));

        // Chinese script specialized pipeline
        pipeline_map.insert((Script::Mandarin, Language::Other), Pipeline::default()
            .set_pre_processor(ChineseTranslationPreProcessor)
            .set_tokenizer(Jieba::default()));
            
        // Japanese script specialized pipeline
        pipeline_map.insert((Script::Katakana, Language::Other), Pipeline::default()
            .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic")));
        pipeline_map.insert((Script::Hiragana, Language::Other), Pipeline::default()
            .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic")));

        // Korean script specialized pipeline
        pipeline_map.insert((Script::Hangul, Language::Other), Pipeline::default()
            .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ko-dic")));

        AnalyzerConfig { pipeline_map, stop_words: None }
    }

I don't know Japanese or Korean but the previous lines might help you to fix your test 👍

Don't hesitate to ping me if you need more help!

kination (Contributor Author):

@ManyTheFish thanks for your feedback. I have 2 questions.

  • I've made a few changes in the following commit to set up the dictionary for the tokenizer. But because I'm not very good at Rust, I think you can suggest a better way to define a Lindera tokenizer that extends the current base tokenizer. Could you give me some suggestions?
  • I've tested the code you shared above. The current main problem is that if the first character of the text is one used commonly across CJK (Chinese/Japanese/Korean), the text will be indicated as Chinese because the ChineseTokenizer comes first (maybe I'm wrong). Are there any solutions for when there are multiple languages in the text?

Member:

Hello @djkooks, 😄

For your first point, I wrote some comments.
About your second point, I agree with you that the script detection is not the best 😅; I will enhance it sooner or later 😊

false => Mode::Decompose
};

let mut tokenizer = LinderaTokenizer::new(mode, &self.dict);
Member:

I'm not comfortable with initializing the LinderaTokenizer during each tokenization; you should create a new method where you initialize the LinderaTokenizer and store it in the Lindera struct.
Moreover, you may store this initialization in a Lazy wrapper if it is time-consuming, since the tokenizer is not useful for everybody.
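A minimal sketch of that idea (added for illustration, not from the PR): build the tokenizer once in a constructor and keep it in the struct. The import path and the LinderaTokenizer::new(Mode, &str) signature are assumptions mirroring the calls quoted earlier in this thread.

    use lindera::tokenizer::{Mode, Tokenizer as LinderaTokenizer}; // assumed import path

    pub struct Lindera {
        tokenizer: LinderaTokenizer,
    }

    impl Lindera {
        pub fn new(dict: &'static str) -> Self {
            // The dictionary is loaded once here, not on every tokenize() call.
            // If even this is too costly, the tokenizer could live behind a
            // once_cell::sync::Lazy instead, as suggested above.
            Lindera {
                tokenizer: LinderaTokenizer::new(Mode::Normal, dict),
            }
        }
    }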

#[derive(Debug, Default)]
pub struct Lindera {
    pub normal_mode: bool,
    pub dict: &'static str,
Member:

Jieba has a default dictionary stored in a const; could we do the same with lindera?


kination (Contributor Author) commented Aug 8, 2021

@ManyTheFish could you confirm the new change is the one you're intending?
Sorry that I'm not very good at Rust yet...

I'm not comfortable with initializing LinderaTokenizer during each tokenization, you should create a new method where you initialize the LinderaTokenizer and store it in the Lindera struct.

curquiza (Member) commented Aug 9, 2021

Hello @djKooks! Thanks for your changes! Can you remove the git conflicts you have on the Cargo.toml file, please? :)
@ManyTheFish will check your PR soon :)

ManyTheFish (Member) left a comment:

Hello @djKooks, sorry for the delay. I requested some changes to your implementation. 😄
Moreover, I think you will have to rebase your branch because of the merge conflicts.


// Japanese script specialized pipeline
// TODO: define dict path for japanese

Member:

Suggested change

@@ -132,9 +135,25 @@ impl<A> Default for AnalyzerConfig<'_, A> {
.set_tokenizer(LegacyMeilisearch));

// Chinese script specialized pipeline

Member:

Suggested change

Comment on lines +146 to +156
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "");

pipeline_map.insert((Script::Katakana, Language::Other), Pipeline::default()
.set_tokenizer(Lindera { tokenizer }));

pipeline_map.insert((Script::Hiragana, Language::Other), Pipeline::default()
.set_tokenizer(Lindera { tokenizer }));

// TODO: define dict path for korean
pipeline_map.insert((Script::Hangul, Language::Other), Pipeline::default()
.set_tokenizer(Lindera { tokenizer }));
Member:

You may impl Default on your Lindera wrapper so that it creates the lindera tokenizer directly, internally.
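A rough sketch of that suggestion (added for illustration, not from the PR): the DEFAULT_DICT constant is hypothetical, and the LinderaTokenizer::new(Mode, &str) call and import path mirror the usage quoted earlier in this thread.

    use lindera::tokenizer::{Mode, Tokenizer as LinderaTokenizer}; // assumed import path

    // Hypothetical default dictionary location, analogous to Jieba's built-in dictionary const.
    const DEFAULT_DICT: &str = "/path/to/ja-dic";

    pub struct Lindera {
        tokenizer: LinderaTokenizer,
    }

    impl Default for Lindera {
        fn default() -> Self {
            // The lindera tokenizer is created here, so the pipeline map can simply
            // call `.set_tokenizer(Lindera::default())`.
            Lindera {
                tokenizer: LinderaTokenizer::new(Mode::Normal, DEFAULT_DICT),
            }
        }
    }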

let orig = "関西国際空港限定トートバッグ";
let analyzed = analyzer.analyze(orig);
let analyzed: Vec<_> = analyzed.tokens().map(|token| token.word).collect();
println!("Analyzed : {:?}", analyzed);
Member:

Suggested change
println!("Analyzed : {:?}", analyzed);
assert_eq!(analyzed, vec!["関西国際空港", "限定", "トートバッグ"]);

@curquiza curquiza linked an issue Oct 6, 2021 that may be closed by this pull request
jungbin-kwon

I look forward to Korean language support.

ManyTheFish (Member) commented Oct 19, 2021

Hey @jungbin-kwon, thanks for your support!
In this PR we chose Lindera as the Korean tokenizer. I don't speak Korean, so I don't know if this tokenizer is the best one for tokenizing Korean; if you have any recommendations, we would be happy to hear them. 😃

jungbin-kwon commented Oct 20, 2021

@ManyTheFish Hello. Thanks for responding to my comment.
I'm also not sure if Lindera is a suitable Tokenizer for Korean. (Requires testing.)
However, I tried both the v0.23.1 and v0.20.0 versions, and v0.20.0 worked better for Korean searches.

v0.23.1 Issue
There is a problem where each Korean character is split into its own token.
소고기 => ["소", "고", "기"]

Test environment

  • AWS ECS Fargate
  • Docker Hub Meilisearch v0.23.1, v0.20.0

PS.
As a result of checking other versions, the problem occurs starting from v0.22.0.

yagince commented Jan 12, 2022

I'd like to use MeiliSearch in Japanese.
Do you have any plans to merge this PR?

@ManyTheFish
Copy link
Member

Hey @yagince!
This PR was started by an external contributor, so I can't promise anything about it.
However, if you have some time to finish this PR or open a new one adding the Japanese support, I'll be glad to follow your work and give some help. ☺️

On my side, I will come back to the tokenizer in a few months, but not in January 😞.

Thanks for your message and have a good day! 😊

yagince commented Jan 13, 2022

@ManyTheFish
Thank you for your answer.
I understand the situation.

kination (Contributor Author)

@yagince @jungbin-kwon thanks for your comment!
I've used lindera as a tokenizer, but there are some problems commented on here #49 (comment) (please refer to the second comment)

@ManyTheFish is there any progress here?

ManyTheFish (Member) commented Jan 17, 2022

@yagince @jungbin-kwon thanks for your comment! I've used lindera as a tokenizer, but there are some problems commented on here #49 (comment) (please refer to the second comment)

@ManyTheFish is there any progress here?

Hey @djKooks, sorry if I haven't been clear. For the moment we detect the most used Script in the attribute and tokenize with the corresponding tokenizer. It's an issue when we have several languages in the same field, but your PR could, at least, enhance tokenization for fields that contain mostly Japanese characters.
If lindera is never chosen to tokenize Japanese because of language detection, it would be a bug in Whatlang rather than in your PR, and so we should fix it in another PR 😄.

There are some small change requests to fix, and then I think it will be good for me.
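For reference, the detection step described above can be exercised directly with whatlang, the crate used for script/language detection in this project (an added sketch, not from the PR; the exact integration differs):

    use whatlang::detect_script;

    fn main() {
        // One script is detected for the whole attribute, and that single choice
        // selects the pipeline (and therefore the tokenizer) for the entire field.
        let field = "関西国際空港限定トートバッグ";
        println!("{:?}", detect_script(field));
    }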

yagince commented Feb 3, 2022

Hello, @djKooks.

We are trying to use Meilisearch in Japanese, but we are having some problems with the accuracy of the search for Japanese data.
I know you are very busy, but is it possible for you to proceed with this PR?

bors bot added a commit that referenced this pull request Feb 28, 2022
70: Setup lindera tokenizer for ja support ( related with #49 ) r=ManyTheFish a=miiton

I implemented my idea for #49.

I apologize first. I was wondering whether it would be better to open a pull request against `@djKooks'` repository or to PR here,
but the difference from the main branch was too big, so I decided to send it here.
Is that a problem?

With this, I think the purpose of "using Lindera for Japanese processing" has been achieved.

- As a preliminary step, implemented the following from #49:
  - Merged the main branch.
  - Fixed according to `@ManyTheFish`'s comments.
  - Bumped lindera-rs to v0.8.1.
- As an answer to "How to detect language, Chinese or Japanese?":
  - I implemented it with `whatlang::detect_lang()`.
  - Before this commit, only `whatlang::detect_script()` was used to detect "English" or "Other".

| Script           | Language      | Tokenizer |
| ---------------- | ------------- | --------- |
| Script::Mandarin | Language::Cmn | Jieba     |
| Script::Mandarin | Language::Jpn | Lindera   |
| Script::Katakana | Language::Jpn | Lindera   |
| Script::Hiragana | Language::Jpn | Lindera   |

We can probably detect with only the "Language", but I also left "Script" for backward compatibility.

### Remaining problems.

`関西国際空港限定トートバッグ` was detected as "Japanese", but `関西国際空港` was detected as "Chinese" even though it is "Japanese", because only Mandarin-script characters are used. This is a whatlang issue.
(note: `関西国际空港` is "Chinese")

As for Korean, I found some commits in the main branch, so I won't touch that.

# Pull Request

## What does this PR do?
Fixes #49

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: djKooks <inylove82@gmail.com>
Co-authored-by: miiton <468745+miiton@users.noreply.github.com>
ManyTheFish (Member)

Closing this PR in favor of #70.


Successfully merging this pull request may close these issues.

Tokenizer for Ja/Ko
5 participants