Tokenizer

Library used by Meilisearch to tokenize queries and documents.

Role

The tokenizer’s role is to take a sentence or phrase and split it into smaller units of language, called tokens. It extracts every word from a string, taking the particularities of the string’s language into account.
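As a rough illustration of what splitting a phrase into tokens means, here is a minimal sketch in plain Rust. The `tokenize` function is a hypothetical name and is not part of this library's API; it assumes that any non-alphanumeric character separates two words, which only suits languages that delimit words with whitespace or punctuation.

```rust
/// Splits `text` into word tokens, treating every character that is not
/// alphanumeric as a separator. Illustrative only: this is an assumption
/// about what "a word" is, not the library's actual segmentation logic.
fn tokenize(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty slices; drop them.
        .filter(|word| !word.is_empty())
        .collect()
}
```

Calling `tokenize("The quick, brown fox!")` yields the four tokens `The`, `quick`, `brown`, and `fox`.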

Details

MeiliSearch’s tokenizer is modular. It processes a document field by field, determines the most likely language of each field, and runs the pipeline dedicated to that language.
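The field-by-field dispatch could be sketched as follows. Everything here is hypothetical: the `Language` enum, the detection heuristic (a field counts as Chinese when more than half of its alphanumeric characters fall in the CJK Unified Ideographs block), and the two stand-in pipelines are assumptions made for illustration, not the library's real detection or segmentation code.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Language {
    Chinese,
    Other,
}

/// Guesses the language of one field (assumed heuristic): Chinese when
/// more than half of the alphanumeric characters are CJK ideographs.
fn detect_language(field: &str) -> Language {
    let is_han = |c: char| ('\u{4E00}'..='\u{9FFF}').contains(&c);
    let total = field.chars().filter(|c| c.is_alphanumeric()).count();
    let han = field.chars().filter(|c| is_han(*c)).count();
    if total > 0 && han * 2 > total {
        Language::Chinese
    } else {
        Language::Other
    }
}

/// Stand-in pipeline for whitespace-separated languages.
fn whitespace_pipeline(field: &str) -> Vec<String> {
    field.split_whitespace().map(str::to_string).collect()
}

/// Stand-in Chinese pipeline: one token per character. A real pipeline
/// would use a proper word segmenter instead.
fn chinese_pipeline(field: &str) -> Vec<String> {
    field
        .chars()
        .filter(|c| c.is_alphanumeric())
        .map(|c| c.to_string())
        .collect()
}

/// Picks the pipeline that matches the detected language of the field.
fn tokenize_field(field: &str) -> Vec<String> {
    match detect_language(field) {
        Language::Chinese => chinese_pipeline(field),
        Language::Other => whitespace_pipeline(field),
    }
}

/// Tokenizes each field of a document independently.
fn tokenize_document(fields: &[&str]) -> Vec<Vec<String>> {
    fields.iter().map(|field| tokenize_field(field)).collect()
}
```

Because detection runs per field, a document with an English title and a Chinese description gets a different pipeline for each of the two fields.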

Pipelines include language-specific processes. For example, the Chinese pipeline converts all text into simplified Chinese before tokenization, allowing a single search query to give results in both traditional and simplified Chinese.
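To show why normalizing before tokenization lets one query match both scripts, here is a small sketch. The four-character mapping in `simplify_char` is a tiny hand-picked sample chosen for this example; a real conversion relies on a complete dictionary, and the function names are hypothetical rather than taken from the library.

```rust
/// Maps a traditional character to its simplified form. Only four sample
/// pairs are covered here (an assumption for illustration); any other
/// character is returned unchanged.
fn simplify_char(c: char) -> char {
    match c {
        '漢' => '汉',
        '語' => '语',
        '學' => '学',
        '國' => '国',
        other => other,
    }
}

/// Converts a whole string to simplified Chinese, character by character.
fn to_simplified(text: &str) -> String {
    text.chars().map(simplify_char).collect()
}

/// Returns true when a query and a document are identical once both have
/// been normalized to simplified Chinese.
fn same_after_normalization(query: &str, document: &str) -> bool {
    to_simplified(query) == to_simplified(document)
}
```

With this normalization applied to both sides, the traditional query `漢語` and a document containing the simplified `汉语` compare as equal.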

If you'd like to read more about the tokenizer design, check out the feature specification.

Supported languages

MeiliSearch is multilingual, featuring optimized support for:

Any language that uses whitespace to separate words
Chinese 🇨🇳 (through Jieba)

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our GitHub repository.

igaul/meilisearch-tokenizer
