
feat: Add Lindera tokenizer #567

Merged: 8 commits into dev from rebasedming/tokenizers on Nov 27, 2023

Conversation

@rebasedming (Collaborator) commented Nov 25, 2023

Ticket(s) Closed

What

Introduces the Lindera tokenizer, which adds advanced tokenization support for Korean, Japanese, and Chinese.

Usage:

CREATE TABLE IF NOT EXISTS t (
    id SERIAL PRIMARY KEY,
    author TEXT,
    title TEXT,
    message TEXT,
    content JSONB,
    unix_timestamp_milli BIGINT,
    like_count INT,
    dislike_count INT,
    comment_count INT
);

INSERT INTO t (author, title, message, content, unix_timestamp_milli, like_count, dislike_count, comment_count)
VALUES
    ('김민준', '첫 번째 기사', '이것은 첫 번째 기사의 내용입니다', '{"details": "여기에는 일부 JSON 내용이 있습니다"}', EXTRACT(EPOCH FROM now()) * 1000, 25, 1, 5),
    ('이하은', '두 번째 기사', '이것은 두 번째 기사의 내용입니다', '{"details": "여기에는 더 많은 JSON 내용이 있습니다"}', EXTRACT(EPOCH FROM now()) * 1000, 75, 2, 10),
    ('박지후', '세 번째 기사', '이것은 세 번째 기사의 정보입니다', '{"details": "여기에도 일부 JSON 내용이 있습니다"}', EXTRACT(EPOCH FROM now()) * 1000, 15, 0, 3);

CREATE INDEX idx_t
ON t
USING bm25 ((t.*))
WITH (
    text_fields='{
        author: {tokenizer: {type: "korean_lindera"}, record: "position"},
        title: {tokenizer: {type: "korean_lindera"}, record: "position"},
        message: {tokenizer: {type: "korean_lindera"}, record: "position"}
    }',
    json_fields='{
        content: {}
    }',
    numeric_fields='{
        unix_timestamp_milli: {},
        like_count: {},
        dislike_count: {},
        comment_count: {}
    }'
);

SELECT * FROM t WHERE t @@@ 'title:번';

Why

Our existing CJK tokenizer is simplistic: it splits words on spaces. In languages like Japanese and Chinese, word boundaries are not marked by spaces, so space-based splitting produces poor tokens. The Lindera tokenizer segments text into word-level tokens using a language-specific dictionary.
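As a quick illustration of the problem (runnable in plain Postgres, independent of pg_bm25), whitespace splitting leaves a typical unspaced Japanese sentence whole:

```sql
-- Splitting on whitespace: an unspaced Japanese sentence stays one "word".
SELECT regexp_split_to_array('これは日本語の文章です', '\s+');
-- → {これは日本語の文章です}   (a single element)

-- A dictionary-based tokenizer like Lindera would instead emit word units,
-- e.g. これ / は / 日本語 / の / 文章 / です, which can then be matched individually.
```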

How

Uses the lindera Rust crates.
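The other language variants presumably follow the same configuration pattern as korean_lindera in the Usage example above. A minimal sketch — the japanese_lindera and chinese_lindera type names are assumptions by analogy, not confirmed by this PR, and the table and column names are made up for illustration:

```sql
-- Hypothetical sketch: Japanese and Chinese variants, assuming the tokenizer
-- type names mirror "korean_lindera".
CREATE INDEX idx_articles
ON articles
USING bm25 ((articles.*))
WITH (
    text_fields='{
        ja_body: {tokenizer: {type: "japanese_lindera"}, record: "position"},
        zh_body: {tokenizer: {type: "chinese_lindera"}, record: "position"}
    }'
);
```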

Tests



codecov bot commented Nov 25, 2023

Codecov Report

Merging #567 (75d2589) into dev (83fd5d5) will increase coverage by 8.17%.
The diff coverage is 84.87%.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #567      +/-   ##
==========================================
+ Coverage   60.14%   68.31%   +8.17%     
==========================================
  Files          42       28      -14     
  Lines        3475     2888     -587     
==========================================
- Hits         2090     1973     -117     
+ Misses       1385      915     -470     
Files Coverage Δ
pg_bm25/src/parade_index/fields.rs 55.17% <0.00%> (-1.17%) ⬇️
pg_bm25/src/tokenizers/lindera.rs 94.56% <94.56%> (ø)
pg_bm25/src/tokenizers/mod.rs 38.82% <0.00%> (-10.44%) ⬇️

... and 15 files with indirect coverage changes

@neilyio (Contributor) commented Nov 27, 2023

Looks great!

@rebasedming rebasedming merged commit ed8fd71 into dev Nov 27, 2023
16 checks passed
@rebasedming rebasedming deleted the rebasedming/tokenizers branch November 27, 2023 15:44
Development

Successfully merging this pull request may close these issues.

Add support for Lindera tokenizer
2 participants