kham

Thai word segmentation engine written in Rust. A fast, no_std-compatible core library with bindings for Python, WebAssembly, and C, plus a command-line interface and database extensions for PostgreSQL and SQLite.

Website & live demo: kham.io

Features

newmm algorithm — DAG-based maximal matching constrained to Thai Character Cluster (TCC) boundaries
Compound-first DP scoring — minimises token count before maximising dictionary matches, then uses TNC frequency as tiebreaker; 94.9% sentence-level agreement with PyThaiNLP newmm (F1 0.975)
Zero-copy API — segment() returns &str slices into the original input; no heap allocation per token
no_std core — kham-core compiles for bare-metal targets (alloc only)
Built-in dictionary — 62,102-word CC0-licensed Thai word list embedded at compile time
Thai FTS pipeline — FtsTokenizer adds stopword filtering, POS tagging, NER, RTGS romanization, phonetic soundex, abbreviation expansion, and OOV n-gram fallback
Named entity recognition — gazetteer-based NER (~36,600 entries): provinces, countries, Wikipedia places/orgs, person and family names
Part-of-speech tagging — 13-category lookup table (~9,000 entries)
Phonetic encoding — lk82, udom83, MetaSound, and Thai–English cross-language Soundex
Number normalization — Thai digits ↔ ASCII, spelled-out number words ↔ integer, Thai Baht currency text
Abbreviation expansion — 118-entry built-in TSV (months, era markers, ranks, agencies)
Date parsing — 7 input formats, Buddhist Era and Gregorian, round-trips to ISO 8601 and Thai text
Sentence segmentation — Thai terminators, Paiyannoi, punctuation, with decimal/abbreviation-aware dot rules
Multi-target — Rust crate, Python wheel, WASM module, C shared library, CLI binary, PostgreSQL FTS parser, SQLite FTS5 tokenizer
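
The compound-first scoring described above can be illustrated with a toy dynamic program. This is only a sketch under stated assumptions, not kham's implementation: `segment_min_tokens` is a hypothetical name, the dictionary is a small slice, any single character may stand alone as an out-of-dictionary fallback, and both the TCC boundary constraint and the TNC frequency tiebreaker are left out.

```rust
/// Toy segmenter: prefer the fewest tokens, then the most dictionary hits.
/// Hypothetical sketch; no TCC boundaries and no frequency tiebreaker.
fn segment_min_tokens<'a>(input: &'a str, dict: &[&str]) -> Vec<&'a str> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = input
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(input.len()))
        .collect();
    let n = bounds.len() - 1; // number of chars

    // best[k] = (token count, dictionary hits, previous boundary index)
    // for the prefix that ends at bounds[k].
    let mut best: Vec<Option<(usize, usize, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0, 0));

    for start in 0..n {
        let (tokens, hits, _) = best[start].expect("every boundary is reachable");
        for end in start + 1..=n {
            let piece = &input[bounds[start]..bounds[end]];
            let in_dict = dict.contains(&piece);
            // Out-of-dictionary edges are allowed only for a single char.
            if !in_dict && end != start + 1 {
                continue;
            }
            let cand = (tokens + 1, hits + usize::from(in_dict), start);
            let better = match best[end] {
                None => true,
                Some((t, h, _)) => cand.0 < t || (cand.0 == t && cand.1 > h),
            };
            if better {
                best[end] = Some(cand);
            }
        }
    }

    // Walk the back-pointers from the end of the input to the start.
    let mut out = Vec::new();
    let mut k = n;
    while k > 0 {
        let (_, _, prev) = best[k].expect("every boundary is reachable");
        out.push(&input[bounds[prev]..bounds[k]]);
        k = prev;
    }
    out.reverse();
    out
}

fn main() {
    let dict = ["กิน", "ข้าว", "กินข้าว", "กับ", "ปลา"];
    let tokens = segment_min_tokens("กินข้าวกับปลา", &dict);
    println!("{}", tokens.join("|")); // กินข้าว|กับ|ปลา
}
```

Because token count is compared first, the compound กินข้าว wins over กิน + ข้าว even though the split path has more dictionary hits.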

Packages

Crate	Registry	Docs	Description
`kham-core`	crates.io	(this file)	Pure Rust engine, `no_std` compatible
`kham-cli`	crates.io	(this file)	`kham` binary
`kham-python`	PyPI	kham-python/README.md	Python bindings via PyO3 / maturin
`kham-wasm`	npm	kham-wasm/README.md	WebAssembly bindings via wasm-bindgen
`kham-capi`	crates.io	kham-capi/README.md	C FFI with cbindgen-generated header
`kham-pg`	PGXN (coming soon)	kham-pg/README.md	PostgreSQL text search parser for Thai
`kham-sqlite`	—	kham-sqlite/README.md	SQLite FTS5 tokenizer for Thai

Quick start

Rust

[dependencies]
kham-core = "0.5"

use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("กินข้าวกับปลา");
for t in &tokens {
    println!("{} ({:?})", t.text, t.kind);
}
// กินข้าว (Thai)
// กับ     (Thai)
// ปลา     (Thai)

Mixed script works out of the box:

let tokens = tok.segment("ธนาคาร100แห่ง");
assert_eq!(tokens[0].text, "ธนาคาร"); // Thai
assert_eq!(tokens[1].text, "100");     // Number
assert_eq!(tokens[2].text, "แห่ง");   // Thai

CLI

cargo install kham-cli

kham "กินข้าวกับปลา"               # กินข้าว|กับ|ปลา
kham --sep " / " "สวัสดีชาวโลก"    # สวัสดี / ชาว / โลก
kham --kind "ธนาคาร100แห่ง"        # ธนาคาร:Thai|100:Number|แห่ง:Thai
kham --spans "กินข้าวกับปลา"       # กินข้าว:0-7|กับ:7-10|ปลา:10-13

# FTS pipeline — kind, POS, NE, stopword, synonyms (one token per line)
kham --fts "ทักษิณเดินทางไปกรุงเทพ"
# ทักษิณ  kind=Person  pos=-     ne=Person  stop=false  syn=-
# เดิน    kind=Thai    pos=Verb  ne=-       stop=false  syn=-
# ทาง     kind=Thai    pos=Noun  ne=-       stop=true   syn=-
# ไป      kind=Thai    pos=Verb  ne=-       stop=true   syn=-
# กรุงเทพ kind=Place   pos=-     ne=Place   stop=false  syn=-

# FTS + phonetic encoding — syn= shows the lk82 code
kham --fts --soundex lk82 "กินข้าวกับปลา" | column -t
# กินข้าว  kind=Thai  pos=-     ne=-  stop=false  syn=1619
# กับ      kind=Thai  pos=Conj  ne=-  stop=true   syn=1400
# ปลา      kind=Thai  pos=Noun  ne=-  stop=false  syn=4800

echo "กินข้าว" | kham           # stdin
RUST_LOG=debug kham "กินข้าว"  # per-token trace + timing

Other targets

Target	Quick link
Python	kham-python/README.md
JavaScript / TypeScript (WASM)	kham-wasm/README.md
C	kham-capi/README.md
PostgreSQL FTS	kham-pg/README.md
SQLite FTS5	kham-sqlite/README.md

Token contract

pub struct Token<'a> {
    pub text: &'a str,            // zero-copy slice of the input string
    pub span: Range<usize>,       // byte offsets in the original string
    pub char_span: Range<usize>,  // Unicode scalar-value (char) offsets
    pub kind: TokenKind,          // Thai | Latin | Number | Punctuation | Emoji | Whitespace | Unknown | Named(NamedEntityKind)
}

span — byte offsets; slice with &input[token.span.clone()]
char_span — Unicode scalar-value offsets for Python/JavaScript indexing
Joining all token.text values (whitespace kept) reconstructs the original input exactly
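
The difference between the two span kinds can be checked with the standard library alone. The sketch below does not use kham: `byte_to_char_span` is a hypothetical helper, and the byte spans are written out by hand for the mixed-script example used earlier (each Thai character is 3 bytes in UTF-8).

```rust
use std::ops::Range;

/// Convert a byte range within `input` to a range of Unicode scalar values.
/// Hypothetical helper; `span` must lie on char boundaries.
fn byte_to_char_span(input: &str, span: &Range<usize>) -> Range<usize> {
    let start = input[..span.start].chars().count();
    let len = input[span.clone()].chars().count();
    start..start + len
}

fn main() {
    let input = "ธนาคาร100แห่ง";
    // Hand-written byte spans for the three tokens.
    let byte_spans = [0..18, 18..21, 21..33];

    let mut rebuilt = String::new();
    for span in &byte_spans {
        let text = &input[span.clone()];
        let chars = byte_to_char_span(input, span);
        println!("{text} bytes={span:?} chars={chars:?}");
        rebuilt.push_str(text);
    }
    // Concatenating the slices gives back the original input.
    assert_eq!(rebuilt, input);
}
```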

Full-Text Search

FtsTokenizer wraps the segmenter with the full NLP pipeline:

use kham_core::fts::FtsTokenizer;

let fts = FtsTokenizer::new();

let tokens = fts.segment_for_fts("ทักษิณเดินทางไปกรุงเทพ");
for t in &tokens {
    println!("{} ne={:?} pos={:?} stop={}", t.text, t.ne, t.pos, t.is_stop);
}
// ทักษิณ  ne=Some(Person)  pos=None        stop=false
// เดิน    ne=None          pos=Some(Verb)  stop=false
// ทาง     ne=None          pos=Some(Noun)  stop=true
// ไป      ne=None          pos=Some(Verb)  stop=true
// กรุงเทพ ne=Some(Place)   pos=None        stop=false  ← merged from กรุง+เทพ

// Flat lexeme list for tsvector (stopwords removed)
let lexemes = fts.lexemes("กินข้าวกับปลา");
// → ["กินข้าว", "ปลา"]

Builder options:

use kham_core::fts::FtsTokenizer;
use kham_core::abbrev::AbbrevMap;
use kham_core::synonym::SynonymMap;
use kham_core::stopwords::StopwordSet;
use kham_core::romanizer::RomanizationMap;
use kham_core::soundex::SoundexAlgorithm;

let fts = FtsTokenizer::builder()
    .abbrevs(AbbrevMap::builtin())             // ก.ค. → กรกฎาคม before segmentation
    .synonyms(SynonymMap::from_tsv(include_str!("synonyms.tsv")))
    .stopwords(StopwordSet::from_text("ซื้อ\nขาย\n"))
    .romanization(RomanizationMap::builtin())  // adds RTGS to synonyms: กิน → "kin"
    .soundex(SoundexAlgorithm::Lk82)          // adds lk82 code to synonyms for Thai/Named tokens
    .ngram_size(3)                             // trigrams for Unknown tokens (0 = disable)
    .number_normalize(true)                    // Thai digits → ASCII synonym (default: true)
    .build();

FtsToken fields: text, position, kind, is_stop, synonyms, trigrams, pos, ne.
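
For intuition about the n-gram fallback, here is a minimal sketch of overlapping character n-grams. It is an illustration rather than kham's code: `char_ngrams` is a hypothetical helper, it works on Unicode scalar values, and it returns nothing for text shorter than `n`. How kham itself treats short Unknown tokens is not stated here.

```rust
/// Overlapping character n-grams of `text` as zero-copy slices.
/// Hypothetical helper; returns an empty list when n == 0 or the text is too short.
fn char_ngrams(text: &str, n: usize) -> Vec<&str> {
    if n == 0 {
        return Vec::new();
    }
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    // A window of n + 1 consecutive boundaries delimits exactly n chars.
    bounds.windows(n + 1).map(|w| &text[w[0]..w[n]]).collect()
}

fn main() {
    println!("{:?}", char_ngrams("kham", 3)); // ["kha", "ham"]
    println!("{}", char_ngrams("สวัสดี", 3).len()); // 4
}
```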

Number normalization

use kham_core::number::{
    thai_digits_to_ascii, parse_thai_word, u64_to_thai_word,
    parse_thai_baht, to_thai_baht_text,
};

thai_digits_to_ascii("๑๒๓")             // "123"
parse_thai_word("หนึ่งร้อยยี่สิบสาม")  // Some(123)
u64_to_thai_word(123)                   // "หนึ่งร้อยยี่สิบสาม"
parse_thai_baht("หนึ่งร้อยบาทห้าสิบสตางค์")
// Some(BahtAmount { baht: 100, satang: 50 })
to_thai_baht_text(100, 0)              // "หนึ่งร้อยบาทถ้วน"

In FtsTokenizer, number normalization runs automatically: TokenKind::Number tokens get their ASCII form added to synonyms. Opt out with .number_normalize(false).
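
The digit mapping itself is simple because Thai digits occupy ten consecutive code points (U+0E50 to U+0E59). The following sketch shows the idea with the standard library only; `thai_digits_to_ascii_sketch` is a hypothetical stand-in, not the function exported by kham-core.

```rust
/// Replace Thai digits (U+0E50..=U+0E59) with ASCII 0-9; other chars pass through.
/// Hypothetical stand-in for illustration only.
fn thai_digits_to_ascii_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0E50}'..='\u{0E59}' => {
                // Offset from Thai zero is the digit value (0 to 9).
                let digit = (c as u32 - 0x0E50) as u8;
                char::from(b'0' + digit)
            }
            other => other,
        })
        .collect()
}

fn main() {
    println!("{}", thai_digits_to_ascii_sketch("๑๒๓")); // 123
    println!("{}", thai_digits_to_ascii_sketch("ปี ๒๕๖๗")); // ปี 2567
}
```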

Abbreviation expansion

use kham_core::abbrev::AbbrevMap;

let map = AbbrevMap::builtin();
assert_eq!(map.expand_text("วันที่5ก.ค.2567"), "วันที่5กรกฎาคม2567");
assert_eq!(map.expand_text("พ.ศ.2567"),        "พุทธศักราช2567");

let exps = map.lookup("ดร.").unwrap();
assert_eq!(exps, &["ดอกเตอร์"]);

Built-in TSV covers 12 month abbreviations, era markers, military/police ranks, government agencies, and Bangkok districts. Use with FtsTokenizerBuilder::abbrevs(AbbrevMap::builtin()).
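
To show what in-text expansion involves, the sketch below scans the input and replaces any table entry found at the current position, trying longer abbreviations first. It is an assumption-laden illustration: `expand_abbrevs` and the two-entry table are hypothetical, and the real AbbrevMap may resolve ambiguity and multiple expansions differently.

```rust
/// Expand abbreviations found anywhere in `text` using a (short, long) table.
/// Hypothetical sketch; longer abbreviations are tried first.
fn expand_abbrevs(text: &str, table: &[(&str, &str)]) -> String {
    let mut entries = table.to_vec();
    entries.sort_by(|a, b| b.0.len().cmp(&a.0.len()));

    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    'scan: while !rest.is_empty() {
        for &(abbr, full) in &entries {
            // Skip empty keys so the scan always makes progress.
            if !abbr.is_empty() && rest.starts_with(abbr) {
                out.push_str(full);
                rest = &rest[abbr.len()..];
                continue 'scan;
            }
        }
        // No entry matched here: copy one char and move on.
        let ch = rest.chars().next().expect("rest is non-empty");
        out.push(ch);
        rest = &rest[ch.len_utf8()..];
    }
    out
}

fn main() {
    let table = [("ก.ค.", "กรกฎาคม"), ("พ.ศ.", "พุทธศักราช")];
    println!("{}", expand_abbrevs("วันที่5ก.ค.2567", &table)); // วันที่5กรกฎาคม2567
}
```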

Date parsing

use kham_core::date::{parse_thai_date, Era};

let d = parse_thai_date("5 กรกฎาคม 2567").unwrap();
assert_eq!(d.to_iso8601(), "2024-07-05"); // BE 2567 → CE 2024

let d = parse_thai_date("๕ ก.ค. ๒๕๖๗").unwrap();
assert_eq!(d.to_iso8601(), "2024-07-05");

let d = parse_thai_date("5/7/2567").unwrap();
assert_eq!(d.era, Era::Buddhist);

Supported formats: full month name, abbreviated month, era marker (พ.ศ. / ค.ศ.), วันที่ prefix, slash- or dash-separated numeric dates, and Thai digits. When the era marker is omitted, the era is inferred from the year: a year ≥ 2300 is treated as Buddhist Era.
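
The era rule and the BE to CE conversion amount to a threshold check and a fixed offset of 543 years. The sketch below is a standalone illustration: `EraSketch`, `infer_era`, and `to_gregorian_year` are hypothetical names, and treating every year below 2300 as Gregorian is an assumption, since only the Buddhist Era branch is stated above.

```rust
/// Hypothetical mirror of an era type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EraSketch {
    Buddhist,
    Gregorian,
}

/// Infer the era from a bare year. The Gregorian fallback is an assumption.
fn infer_era(year: i32) -> EraSketch {
    if year >= 2300 {
        EraSketch::Buddhist
    } else {
        EraSketch::Gregorian
    }
}

/// Convert a year to the Gregorian calendar (BE = CE + 543).
fn to_gregorian_year(year: i32, era: EraSketch) -> i32 {
    match era {
        EraSketch::Buddhist => year - 543,
        EraSketch::Gregorian => year,
    }
}

fn main() {
    let era = infer_era(2567);
    println!("{:?} {}", era, to_gregorian_year(2567, era)); // Buddhist 2024
}
```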

Sentence segmentation

use kham_core::sentence::split_sentences;

let text = "สวัสดีครับ! วันนี้อากาศดีมาก\nเราไปกินข้าวกันเถอะ";
let sents = split_sentences(text);
assert_eq!(sents[0].text, "สวัสดีครับ!");
assert_eq!(sents[1].text, "วันนี้อากาศดีมาก");
assert_eq!(sents[2].text, "เราไปกินข้าวกันเถอะ");

Character	Rule
`๚` `๛`	Always splits
`ฯ`	Splits unless part of `ฯลฯ`
`\n`	Always splits
`!` `?`	Always splits
`.`	Splits only when followed by whitespace or end-of-string
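
The rules in the table can be sketched as a single pass over the characters. This is a simplified illustration, not kham's splitter: `split_sentences_sketch` is a hypothetical name, it returns trimmed slices and drops empty ones, and it implements only the table rows (no extra abbreviation handling for dots).

```rust
/// Split `text` using the table rules; returns trimmed, non-empty slices.
/// Hypothetical sketch that covers only the rules listed in the table.
fn split_sentences_sketch(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut iter = text.char_indices().peekable();

    while let Some((i, c)) = iter.next() {
        let end = i + c.len_utf8();
        let next = iter.peek().map(|&(_, n)| n);
        let split = match c {
            '๚' | '๛' | '\n' | '!' | '?' => true,
            // Paiyannoi splits unless it is the first or last char of ฯลฯ.
            'ฯ' => !(text[end..].starts_with("ลฯ") || text[..i].ends_with("ฯล")),
            // A dot splits only before whitespace or at the end of the string.
            '.' => next.map_or(true, char::is_whitespace),
            _ => false,
        };
        if split {
            let sentence = text[start..end].trim();
            if !sentence.is_empty() {
                out.push(sentence);
            }
            start = end;
        }
    }

    let tail = text[start..].trim();
    if !tail.is_empty() {
        out.push(tail);
    }
    out
}

fn main() {
    for s in split_sentences_sketch("ราคา 3.50 บาท. ขอบคุณ") {
        println!("{s}");
    }
    // ราคา 3.50 บาท.
    // ขอบคุณ
}
```

The dot inside 3.50 is followed by a digit, so it does not split; the dot after บาท is followed by a space, so it does.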

Named entity recognition

The built-in gazetteer (~36,600 entries) covers Thai provinces, 246 countries, 17,000+ Wikipedia places/orgs, and 9,000+ person and family names. Multi-token matching merges compound names split by the segmenter:

กรุงเทพ  → segmenter splits → กรุง + เทพ
         → NE tagger merges → กรุงเทพ  Named(Place)

See ADR-001 for the person-name import decision.
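
The merge step shown in the diagram can be sketched as a greedy longest-match over adjacent token spans. Everything here is illustrative: `merge_entities`, the `max_tokens` window, and the one-entry gazetteer are assumptions, and the real tagger also assigns an entity kind, which this sketch reduces to a boolean.

```rust
use std::ops::Range;

/// Merge runs of adjacent tokens whose combined text is in `gazetteer`.
/// Returns (text, is_entity) pairs. Hypothetical sketch, longest run first.
fn merge_entities<'a>(
    input: &'a str,
    spans: &[Range<usize>],
    gazetteer: &[&str],
    max_tokens: usize,
) -> Vec<(&'a str, bool)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < spans.len() {
        let longest = (i + max_tokens).min(spans.len());
        let mut matched = None;
        // Try the widest window first, then shrink it one token at a time.
        for j in (i + 1..=longest).rev() {
            // Adjacent tokens are contiguous, so a merge is just a wider slice.
            let candidate = &input[spans[i].start..spans[j - 1].end];
            if gazetteer.contains(&candidate) {
                matched = Some((candidate, j));
                break;
            }
        }
        match matched {
            Some((text, next)) => {
                out.push((text, true));
                i = next;
            }
            None => {
                out.push((&input[spans[i].clone()], false));
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let input = "ไปกรุงเทพ";
    // Hand-written byte spans for ไป, กรุง, เทพ (3 bytes per Thai char).
    let spans = [0..6, 6..18, 18..27];
    let merged = merge_entities(input, &spans, &["กรุงเทพ"], 3);
    println!("{merged:?}"); // [("ไป", false), ("กรุงเทพ", true)]
}
```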

Phonetic encoding (Soundex)

use kham_core::soundex::{lk82, udom83, metasound, sounds_like, SoundexAlgorithm};
use kham_core::soundex::{thai_english_soundex, sounds_like_cross_lang};

assert_eq!(lk82("กาน"), lk82("ขาน")); // same consonant group → "1600"
assert!(sounds_like("กาน", "คาน", SoundexAlgorithm::Lk82));

// Thai–English cross-language (Suwanvisat & Prasitjutrakul 1998)
let en = thai_english_soundex("McDonald");
let th = thai_english_soundex("แมคโดนัลด์");
assert_eq!(&en[..3], &th[..3]); // shared phonetic prefix

FTS integration — emit the soundex code as a synonym:

use kham_core::fts::FtsTokenizer;
use kham_core::soundex::SoundexAlgorithm;

let fts = FtsTokenizer::builder()
    .soundex(SoundexAlgorithm::Lk82)
    .build();

Building

cargo build                          # all crates
cargo test --release                 # all tests
cargo test -p kham-core --release    # core only
cargo bench -p kham-core             # throughput benchmarks
cargo run -p kham-bench-accuracy     # word-boundary P/R/F1
cargo run -p kham-bench-accuracy -- --threshold 0.95  # CI gate

Prerequisites:

Target	Tool	Install
All	Rust ≥ 1.85	`curl -sSf https://sh.rustup.rs | sh`
WASM	`wasm-pack`	`cargo install wasm-pack`
Python	`maturin`	`pip install maturin`
C	`cbindgen`	`cargo install cbindgen`
PostgreSQL	Docker with BuildKit	docs.docker.com
SQLite (macOS)	Homebrew sqlite	`brew install sqlite`
SQLite (Linux)	SQLite dev headers	`apt install libsqlite3-dev`

CI

Job	What it checks
`fmt`	`cargo fmt --check`
`clippy`	`cargo clippy -- -D warnings`
`test`	Unit + integration + doc tests, stable and MSRV 1.85, Linux and macOS
`no_std`	`kham-core` compiles for `thumbv7em-none-eabihf`
`wasm`	`wasm-pack build --target web` succeeds
`python`	`maturin develop` on Python 3.8 and 3.12
`pg_regress`	31 SQL tests across 4 suites in Docker PostgreSQL 17

License

Licensed under either of:

Apache License (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.

Further reading

Document	Contents
doc/roadmap.md	Release history, pending action checklist, corpus import plan
doc/architecture.md	Crate graph, pipeline flowcharts, module responsibilities
doc/benchmarks.md	Throughput numbers, PostgreSQL and SQLite FTS5 benchmarks
doc/dict-format.md	`dict.bin` binary format, DARTS lifecycle, data sources
doc/adr-001-ne-person-name-import-strategy.md	Person name import strategy
doc/adr-002-syllables-corpus-import-decision.md	Why syllables_th.txt is excluded
doc/adr-003-orchid-pos-tag-mapping.md	ORCHID 44-tag → 13-category POS mapping
