
feat: Add Lindera tokenizer #567

Merged: 8 commits into dev from rebasedming/tokenizers on Nov 27, 2023

Conversation

@rebasedming (Collaborator) commented Nov 25, 2023

Ticket(s) Closed

What

Introduces the Lindera tokenizer, which adds advanced tokenization support for Korean, Japanese, and Chinese.

Usage:

CREATE TABLE IF NOT EXISTS t (
    id SERIAL PRIMARY KEY,
    author TEXT,
    title TEXT,
    message TEXT,
    content JSONB,
    unix_timestamp_milli BIGINT,
    like_count INT,
    dislike_count INT,
    comment_count INT
);

INSERT INTO t (author, title, message, content, unix_timestamp_milli, like_count, dislike_count, comment_count)
VALUES
    ('김민준', '첫 번째 기사', '이것은 첫 번째 기사의 내용입니다', '{"details": "여기에는 일부 JSON 내용이 있습니다"}', EXTRACT(EPOCH FROM now()) * 1000, 25, 1, 5),
    ('이하은', '두 번째 기사', '이것은 두 번째 기사의 내용입니다', '{"details": "여기에는 더 많은 JSON 내용이 있습니다"}', EXTRACT(EPOCH FROM now()) * 1000, 75, 2, 10),
    ('박지후', '세 번째 기사', '이것은 세 번째 기사의 정보입니다', '{"details": "여기에도 일부 JSON 내용이 있습니다"}', EXTRACT(EPOCH FROM now()) * 1000, 15, 0, 3);

CREATE INDEX idx_t
ON t
USING bm25 ((t.*))
WITH (
    text_fields='{
        author: {tokenizer: {type: "korean_lindera"}, record: "position"},
        title: {tokenizer: {type: "korean_lindera"}, record: "position"},
        message: {tokenizer: {type: "korean_lindera"}, record: "position"}
    }',
    json_fields='{
        content: {}
    }',
    numeric_fields='{
        unix_timestamp_milli: {},
        like_count: {},
        dislike_count: {},
        comment_count: {}
    }'
);

SELECT * FROM t WHERE t @@@ 'title:번';

Why

Our existing CJK tokenizer is simplistic: it splits words on spaces. In languages like Japanese and Chinese, word boundaries are not marked by spaces, so space-based splitting produces poor tokens. The Lindera tokenizer segments text into word-level tokens using a language-specific dictionary.
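As a quick illustration of the problem (runnable in plain Postgres, independent of pg_bm25), whitespace splitting leaves a typical unspaced Japanese sentence whole:

```sql
-- Splitting on whitespace: an unspaced Japanese sentence stays one "word".
SELECT regexp_split_to_array('これは日本語の文章です', '\s+');
-- → {これは日本語の文章です}   (a single element)

-- A dictionary-based tokenizer like Lindera would instead emit word units,
-- e.g. これ / は / 日本語 / の / 文章 / です, which can then be matched individually.
```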

How

Uses the lindera Rust crates.
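The other language variants presumably follow the same configuration pattern as korean_lindera in the Usage example above. A minimal sketch — the japanese_lindera and chinese_lindera type names are assumptions by analogy, not confirmed by this PR, and the table and column names are made up for illustration:

```sql
-- Hypothetical sketch: Japanese and Chinese variants, assuming the tokenizer
-- type names mirror "korean_lindera".
CREATE INDEX idx_articles
ON articles
USING bm25 ((articles.*))
WITH (
    text_fields='{
        ja_body: {tokenizer: {type: "japanese_lindera"}, record: "position"},
        zh_body: {tokenizer: {type: "chinese_lindera"}, record: "position"}
    }'
);
```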

Tests



codecov bot commented Nov 25, 2023

Codecov Report

Merging #567 (75d2589) into dev (83fd5d5) will increase coverage by 8.17%.
The diff coverage is 84.87%.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #567      +/-   ##
==========================================
+ Coverage   60.14%   68.31%   +8.17%     
==========================================
  Files          42       28      -14     
  Lines        3475     2888     -587     
==========================================
- Hits         2090     1973     -117     
+ Misses       1385      915     -470     
Files Coverage Δ
pg_bm25/src/parade_index/fields.rs 55.17% <0.00%> (-1.17%) ⬇️
pg_bm25/src/tokenizers/lindera.rs 94.56% <94.56%> (ø)
pg_bm25/src/tokenizers/mod.rs 38.82% <0.00%> (-10.44%) ⬇️

... and 15 files with indirect coverage changes

@neilyio (Contributor) commented Nov 27, 2023

Looks great!

@rebasedming rebasedming merged commit ed8fd71 into dev Nov 27, 2023
16 checks passed
@rebasedming rebasedming deleted the rebasedming/tokenizers branch November 27, 2023 15:44
Development

Successfully merging this pull request may close these issues.

Add support for Lindera tokenizer
2 participants