Support Unicode Text Segmentation in Tokenizer #507

davidmezzetti · 2023-07-18T19:27:23Z

Currently, the tokenizer pipeline primarily supports English text. That logic should be retained for backwards compatibility, but the pipeline should also support tokenization with the Unicode Text Segmentation algorithm defined in Unicode Standard Annex #29 (UAX #29).
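To illustrate the gap, here is a minimal sketch contrasting an ASCII-only alphanumeric tokenizer with a Unicode-aware one. It uses only the standard-library `re` module, whose `\w` class matches Unicode word characters for `str` patterns. This is only a rough approximation and does not implement the full UAX #29 word boundary rules. The function names are hypothetical and are not the pipeline's actual API.

```python
import re

# Hypothetical helpers for illustration only; not the tokenizer pipeline's API.

# English-oriented behavior: keep runs of ASCII letters and digits.
ASCII_WORDS = re.compile(r"[A-Za-z0-9]+")

# Unicode-aware approximation: \w matches Unicode word characters
# (letters, digits, underscore) for str patterns in Python 3.
# This is NOT a full UAX #29 implementation.
UNICODE_WORDS = re.compile(r"\w+")


def tokenize_ascii(text):
    """Lowercase the text and return runs of ASCII alphanumerics."""
    return ASCII_WORDS.findall(text.lower())


def tokenize_unicode(text):
    """Lowercase the text and return runs of Unicode word characters."""
    return UNICODE_WORDS.findall(text.lower())


if __name__ == "__main__":
    sample = "Привет мир"
    print(tokenize_ascii(sample))    # non-ASCII letters are dropped
    print(tokenize_unicode(sample))  # Cyrillic words are kept
```

Scripts written without spaces between words, such as Chinese, come back from this sketch as a single long token, which is one reason a proper segmentation algorithm is needed rather than a regular expression alone.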

davidmezzetti added this to the v5.6.0 milestone Jul 18, 2023

davidmezzetti self-assigned this Jul 18, 2023

davidmezzetti closed this as completed in dcd8067 Jul 18, 2023

This was referenced Jul 20, 2023

Better BM25 #508

Closed

ScoringFactory does not support Chinese #506

Closed

Question about BM25 with respect to the scoring parameter and the ScoringFactory class #490

Closed

Hybrid Search #509

Closed

davidmezzetti mentioned this issue Jul 29, 2023

Add multilingual graph topic modeling #511

Closed
