Revise text tokenization to better support CJK languages #30

titusz · 2019-01-27T22:08:11Z

Text tokenization should be simple and generic while also supporting CJK languages.
It must yield appropriate similarity-encoding results regardless of language and character set. Tokenization should not assume that extracted text comes with reliable word boundaries or separators.
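
One possible way to meet these requirements is to tokenize on overlapping character n-grams instead of words, so that no word segmentation is needed for CJK text. The following is only a minimal sketch under that assumption, not the tokenizer this project adopted: the function names `normalize_text` and `char_ngrams`, the NFKC normalization step, and the default `width` of 4 are all illustrative choices.

```python
import unicodedata
from typing import List


def normalize_text(text: str) -> str:
    """NFKC-normalize, lowercase, and keep only letters, numbers and marks.

    Dropping whitespace and punctuation makes the result independent of
    how (or whether) the source text separates words.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    kept = []
    for ch in text:
        # Unicode general categories: L* letters, N* numbers, M* combining marks
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            kept.append(ch)
    return "".join(kept)


def char_ngrams(text: str, width: int = 4) -> List[str]:
    """Return overlapping character n-grams of the normalized text."""
    chars = normalize_text(text)
    if not chars:
        return []
    if len(chars) <= width:
        # Text shorter than the window yields a single token
        return [chars]
    return [chars[i:i + width] for i in range(len(chars) - width + 1)]


if __name__ == "__main__":
    print(char_ngrams("Hello, World!"))
    print(char_ngrams("你好世界", width=2))
```

Because the tokens are sliding character windows, the same code path handles Latin and CJK input, and differences in spacing or punctuation do not change the token set. The window width is a trade-off: it would need tuning against real similarity data.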

titusz added this to the Version 1.1 milestone Jan 27, 2019

titusz added the Type: Feature, Priority: High, Scope: Medium, Affects: Spec, and Affects: Code labels Feb 24, 2019

titusz closed this as completed May 26, 2019
