[WIP] Setup lindera tokenizer for ko/ja support #49

Closed
wants to merge 9 commits

Conversation

kination (Contributor)

No description provided.

TokenStream {
    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
kination (Contributor Author):

In the Jieba tokenizer, bytes_len is used to get the size of tokens. Is there a specific reason for this, or is it just a difference between the jieba and lindera tokenizers?

I'm not very familiar with the tokenizing logic in meilisearch, so please let me know the details if you can 🙏

Member:

Hello @djKooks!
There are 2 kinds of indices in our tokens:

  • char_start/char_end, which are indices in terms of character count
  • byte_start/byte_end, which are indices in terms of byte count

They are not the same because a char can contain several bytes.

The ..._start is the index of the beginning of the token, counted from the start of the whole tokenized string.
The ..._end is the index of the end of the token, counted from the start of the whole tokenized string.

I need to investigate more to understand the detail attribute of the Lindera Token.

But, for now, you can just pass a tuple (CharIndex, ByteIndex) initialized to (0, 0) to scan.
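As a small illustration of the character/byte difference (an added example, not from the PR):

    fn main() {
        // "空港" is 2 characters but 6 bytes in UTF-8 (3 bytes per character).
        let text = "空港";
        assert_eq!(text.chars().count(), 2); // character count
        assert_eq!(text.len(), 6);           // byte count
    }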

@kination kination changed the title Setup lindera tokenizer for ko/ja support [WIP] Setup lindera tokenizer for ko/ja support Jun 12, 2021

Comment on lines 20 to 30
    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
        Some(Token {
            kind: TokenKind::Word,
            word: Cow::Borrowed(lindera_token.text),
            char_index: 0,
            byte_start: char_start,
            byte_end: char_end,
        })
    }))
Member:

Suggested change

    inner: Box::new(tokenized.into_iter().scan(0, move |_, lindera_token| {
        let char_start = 0;
        let char_end = lindera_token.text.len();
        Some(Token {
            kind: TokenKind::Word,
            word: Cow::Borrowed(lindera_token.text),
            char_index: 0,
            byte_start: char_start,
            byte_end: char_end,
        })
    }))

    inner: Box::new(tokenized.into_iter().scan((0, 0), move |(char_index, byte_index), lindera_token| {
        let char_count = lindera_token.text.chars().count();
        let char_start = *char_index;
        *char_index += char_count;
        let byte_count = lindera_token.text.len();
        let byte_start = *byte_index;
        *byte_index += byte_count;
        Some(Token {
            kind: TokenKind::Word,
            word: Cow::Borrowed(lindera_token.text),
            char_index: char_start,
            byte_start,
            byte_end: *byte_index,
        })
    }))

this should work. 😄
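To illustrate the running offsets the scan state keeps (an added example, not part of the suggestion):

    fn main() {
        // Two consecutive tokens, as the scan above would see them.
        let tokens = ["関西", "空港"];
        let (mut char_index, mut byte_index) = (0usize, 0usize);
        for text in tokens {
            let (char_start, byte_start) = (char_index, byte_index);
            char_index += text.chars().count(); // +2 characters per token here
            byte_index += text.len();           // +6 bytes per token (3 bytes per character)
            println!("{}: chars {}..{}, bytes {}..{}", text, char_start, char_index, byte_start, byte_index);
        }
        // 関西: chars 0..2, bytes 0..6
        // 空港: chars 2..4, bytes 6..12
    }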

kination (Contributor Author):

@ManyTheFish seems it's correct. Thanks for the guidance 🙏

kination (Contributor Author)

@ManyTheFish thanks for the help. I have a few more questions...

  • it seems the analyzer works as below, to indicate which language the input uses
pipeline_map.insert((Script::Latin, Language::Other), Pipeline::default()
            .set_tokenizer(LegacyMeilisearch));

        // Chinese script specialized pipeline
        pipeline_map.insert((Script::Mandarin, Language::Other), Pipeline::default()
            .set_pre_processor(ChineseTranslationPreProcessor)
            .set_tokenizer(Jieba::default()));

-> Is the Script::XXX value given by the user, or is it detected automatically?

  • Lindera supports both Japanese and Korean, but it needs a dictionary set up per language.
/// if japanese...
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic");

/// if korean...
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "/path/to/ko-dic");

-> Is there a common rule for keeping dictionaries in the repository?
-> Do you have a good suggestion for passing the language type as a parameter through fn tokenize<'a>(&self, s: &'a ProcessedText<'a>)?

@kination kination mentioned this pull request Jun 19, 2021
curquiza (Member)

Hello @djKooks! @ManyTheFish is on holiday, he will answer you once he comes back! 🙂

A request from me: is it possible to add tests for what you implemented?

fn test_japanese_language() {
let analyzer = Analyzer::new(AnalyzerConfig::<Vec<u8>>::default());

let orig = "関西国際空港限定トートバッグ";
kination (Contributor Author) Jun 22, 2021:

@ManyTheFish

AFAIK the correct result should be: ["関西国際空港", "限定", "トートバッグ"], but it shows as ["関西", "国际", "空港", "限定", "ト", "ー", "ト", "バ", "ッ", "グ"]

This seems to be an issue with the analyzer.
If the orig looks like the following, it seems the jieba tokenizer is selected because the first character is one that is used in both Chinese and Japanese.

How could this be solved?

Member:

Hey! @djKooks!

In the default function of the AnalyzerConfig, you should add a new pipeline in order to use your new tokenizer when a Japanese script is detected:

    fn default() -> Self {
        let mut pipeline_map: HashMap<(Script, Language), Pipeline> = HashMap::new();


        // Latin script specialized pipeline
        pipeline_map.insert((Script::Latin, Language::Other), Pipeline::default()
            .set_tokenizer(LegacyMeilisearch));

        // Chinese script specialized pipeline
        pipeline_map.insert((Script::Mandarin, Language::Other), Pipeline::default()
            .set_pre_processor(ChineseTranslationPreProcessor)
            .set_tokenizer(Jieba::default()));
            
        // Japanese script specialized pipeline
        pipeline_map.insert((Script::Katakana, Language::Other), Pipeline::default()
            .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic")));
        pipeline_map.insert((Script::Hiragana, Language::Other), Pipeline::default()
            .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ja-dic")));

        // Korean script specialized pipeline
        pipeline_map.insert((Script::Hangul, Language::Other), Pipeline::default()
            .set_tokenizer(LinderaTokenizer::new(Mode::Normal, "/path/to/ko-dic")));

        AnalyzerConfig { pipeline_map, stop_words: None }
    }

I don't know Japanese or Korean but the previous lines might help you to fix your test 👍

Don't hesitate to ping me if you need more help!

kination (Contributor Author):

@ManyTheFish thanks for your feedback. I have 2 questions.

  • I've made a few changes in the following commit to set up the dictionary for the tokenizer. But because I'm not very good at Rust, I think you can suggest a better way to define a Lindera tokenizer that extends the current base tokenizer. Could you give me some suggestions?
  • I've tested the code you shared above. The current main problem is that if the first character of the text is one used commonly across CJK (Chinese/Japanese/Korean), the text will be indicated as Chinese because the ChineseTokenizer comes first (maybe I'm wrong). Are there any solutions for when there are multiple languages in the text?

Member:

Hello @djkooks, 😄

For your first point, I wrote some comments.
About your second point, I agree with you that the script detection is not the best 😅; I will enhance it sooner or later 😊

false => Mode::Decompose
};

let mut tokenizer = LinderaTokenizer::new(mode, &self.dict);
Member:

I'm not comfortable with initializing the LinderaTokenizer during each tokenization; you should create a new method where you initialize the LinderaTokenizer and store it in the Lindera struct.
Moreover, you may store this initialization in a Lazy wrapper if it is time-consuming, since the tokenizer is not useful for everybody.
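A minimal sketch of that idea (added for illustration, not from the PR): build the tokenizer once in a constructor and keep it in the struct. The import path and the LinderaTokenizer::new(Mode, &str) signature are assumptions mirroring the calls quoted earlier in this thread.

    use lindera::tokenizer::{Mode, Tokenizer as LinderaTokenizer}; // assumed import path

    pub struct Lindera {
        tokenizer: LinderaTokenizer,
    }

    impl Lindera {
        pub fn new(dict: &'static str) -> Self {
            // The dictionary is loaded once here, not on every tokenize() call.
            // If even this is too costly, the tokenizer could live behind a
            // once_cell::sync::Lazy instead, as suggested above.
            Lindera {
                tokenizer: LinderaTokenizer::new(Mode::Normal, dict),
            }
        }
    }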

#[derive(Debug, Default)]
pub struct Lindera {
    pub normal_mode: bool,
    pub dict: &'static str,
Member:

Jieba has a default dictionary stored in a const; could we do the same with lindera?


kination (Contributor Author) commented Aug 8, 2021

@ManyTheFish could you confirm the new change is the one you're intending?
Sorry that I'm not very good at Rust yet...

I'm not comfortable with initializing LinderaTokenizer during each tokenization, you should create a new method where you initialize the LinderaTokenizer and store it in the Lindera struct.

curquiza (Member) commented Aug 9, 2021

Hello @djKooks! Thanks for your changes! Can you remove the git conflicts you have on the Cargo.toml file, please? :)
@ManyTheFish will check your PR soon :)

ManyTheFish (Member) left a comment:

Hello @djKooks, sorry for the delay. I requested some changes to your implementation. 😄
Moreover, I think you will have to rebase your branch because of the merge conflicts.


// Japanese script specialized pipeline
// TODO: define dict path for japanese

Member:

Suggested change

@@ -132,9 +135,25 @@ impl<A> Default for AnalyzerConfig<'_, A> {
.set_tokenizer(LegacyMeilisearch));

// Chinese script specialized pipeline

Member:

Suggested change

Comment on lines +146 to +156
let mut tokenizer = LinderaTokenizer::new(Mode::Normal, "");

pipeline_map.insert((Script::Katakana, Language::Other), Pipeline::default()
.set_tokenizer(Lindera { tokenizer }));

pipeline_map.insert((Script::Hiragana, Language::Other), Pipeline::default()
.set_tokenizer(Lindera { tokenizer }));

// TODO: define dict path for korean
pipeline_map.insert((Script::Hangul, Language::Other), Pipeline::default()
.set_tokenizer(Lindera { tokenizer }));
Member:

You may impl Default on your Lindera wrapper so that it creates the lindera tokenizer directly, internally.
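A rough sketch of that suggestion (added for illustration, not from the PR): the DEFAULT_DICT constant is hypothetical, and the LinderaTokenizer::new(Mode, &str) call and import path mirror the usage quoted earlier in this thread.

    use lindera::tokenizer::{Mode, Tokenizer as LinderaTokenizer}; // assumed import path

    // Hypothetical default dictionary location, analogous to Jieba's built-in dictionary const.
    const DEFAULT_DICT: &str = "/path/to/ja-dic";

    pub struct Lindera {
        tokenizer: LinderaTokenizer,
    }

    impl Default for Lindera {
        fn default() -> Self {
            // The lindera tokenizer is created here, so the pipeline map can simply
            // call `.set_tokenizer(Lindera::default())`.
            Lindera {
                tokenizer: LinderaTokenizer::new(Mode::Normal, DEFAULT_DICT),
            }
        }
    }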

let orig = "関西国際空港限定トートバッグ";
let analyzed = analyzer.analyze(orig);
let analyzed: Vec<_> = analyzed.tokens().map(|token| token.word).collect();
println!("Analyzed : {:?}", analyzed);
Member:

Suggested change
println!("Analyzed : {:?}", analyzed);
assert_eq!(analyzed, vec!["関西国際空港", "限定", "トートバッグ"]);

@curquiza curquiza linked an issue Oct 6, 2021 that may be closed by this pull request
jungbin-kwon

I look forward to Korean language support.

ManyTheFish (Member) commented Oct 19, 2021

Hey @jungbin-kwon, thanks for your support!
In this PR we chose Lindera as the Korean tokenizer. I don't speak Korean, so I don't know if this tokenizer is the best one for tokenizing Korean; if you have any recommendations, we would be happy to hear them. 😃

jungbin-kwon commented Oct 20, 2021

@ManyTheFish Hello. Thanks for responding to my comment.
I'm also not sure if Lindera is a suitable Tokenizer for Korean. (Requires testing.)
However, I tried both the v0.23.1 and v0.20.0 versions, and v0.20.0 worked better for Korean searches.

v0.23.1 Issue
There is a problem where each Korean character is split into its own token.
소고기 => ["소", "고", "기"]

Test environment

  • AWS ECS Fargate
  • Docker Hub Meilisearch v0.23.1, v0.20.0

PS.
As a result of checking other versions, the problem occurs starting from v0.22.0.

yagince commented Jan 12, 2022

I'd like to use MeiliSearch in Japanese.
Do you have any plans to merge this PR?

@ManyTheFish
Copy link
Member

Hey @yagince!
This PR was started by an external contributor, so I can't promise anything about it.
However, if you have some time to finish this PR or open a new one adding the Japanese support, I'll be glad to follow your work and give some help. ☺️

On my side, I will come back to the tokenizer in a few months, but not in January 😞.

Thanks for your message and have a good day! 😊

yagince commented Jan 13, 2022

@ManyTheFish
Thank you for your answer.
I understand the situation.

kination (Contributor Author)

@yagince @jungbin-kwon thanks for your comment!
I've used lindera as a tokenizer, but there are some problems commented on here #49 (comment) (please refer to the second comment)

@ManyTheFish is there any progress here?

ManyTheFish (Member) commented Jan 17, 2022

@yagince @jungbin-kwon thanks for your comment! I've used lindera as a tokenizer, but there are some problems commented on here #49 (comment) (please refer to the second comment)

@ManyTheFish is there any progress here?

Hey @djKooks, sorry if I haven't been clear. For the moment we detect the most used Script in the attribute and tokenize with the corresponding tokenizer. It's an issue when we have several languages in the same field, but your PR could, at least, enhance tokenization for fields that contain mostly Japanese characters.
If lindera is never chosen to tokenize Japanese because of language detection, it would be a bug in Whatlang rather than in your PR, and so we should fix it in another PR 😄.

There are some small change requests to fix, and then I think it will be good for me.
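For reference, the detection step described above can be exercised directly with whatlang, the crate used for script/language detection in this project (an added sketch, not from the PR; the exact integration differs):

    use whatlang::detect_script;

    fn main() {
        // One script is detected for the whole attribute, and that single choice
        // selects the pipeline (and therefore the tokenizer) for the entire field.
        let field = "関西国際空港限定トートバッグ";
        println!("{:?}", detect_script(field));
    }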

yagince commented Feb 3, 2022

Hello, @djKooks.

We are trying to use Meilisearch in Japanese, but we are having some problems with the accuracy of the search for Japanese data.
I know you are very busy, but is it possible for you to proceed with this PR?

bors bot added a commit that referenced this pull request Feb 28, 2022
70: Setup lindera tokenizer for ja support ( related with #49 ) r=ManyTheFish a=miiton

I implemented my idea for #49.

I apologize first. I was wondering whether it would be better to open a pull request against `@djKooks'` repository or to PR here,
but the difference from the main branch was too big, so I decided to send it here.
Is that a problem?

With this, I think the purpose of "using Lindera for Japanese processing" has been achieved.

- As a preliminary step, implemented the following from #49:
  - Merged the main branch.
  - Fixed according to `@ManyTheFish`'s comments.
  - Bumped lindera-rs to v0.8.1.
- As an answer to "How to detect language, Chinese or Japanese?":
  - I implemented it with `whatlang::detect_lang()`.
  - Before this commit, only `whatlang::detect_script()` was used to detect "English" or "Other".

| Script           | Language      | Tokenizer |
| ---------------- | ------------- | --------- |
| Script::Mandarin | Language::Cmn | Jieba     |
| Script::Mandarin | Language::Jpn | Lindera   |
| Script::Katakana | Language::Jpn | Lindera   |
| Script::Hiragana | Language::Jpn | Lindera   |

We can probably detect with only the "Language", but I also left "Script" for backward compatibility.

### Remaining problems.

`関西国際空港限定トートバッグ` was detected as "Japanese", but `関西国際空港` was detected as "Chinese" even though it is "Japanese", because only Mandarin-script characters are used. This is a whatlang issue.
(note: `関西国际空港` is "Chinese")

As for Korean, I found some commits in the main branch, so I won't touch that.

# Pull Request

## What does this PR do?
Fixes #49

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: djKooks <inylove82@gmail.com>
Co-authored-by: miiton <468745+miiton@users.noreply.github.com>
ManyTheFish (Member)

Closing this PR in favor of #70.


Successfully merging this pull request may close these issues.

Tokenizer for Ja/Ko
5 participants