docs: document state-machine tokenizer architecture by imrim12 · Pull Request #13 · imrim12/vietscript

imrim12 · 2026-04-27T19:39:41Z

Summary

Follow-up to the merged perf PR (FSM tokenizer replacing the regex tokenizer). Until now the tokenizer architecture existed only in code and commit messages; this PR brings it into the docs.

Add a new page, docs/architecture/tokenizer.md:
- FSM design: char-code dispatch, a keyword trie with per-entry boundary rules (WORD / IDENT / NONE, which mimic \b exactly, apply the VI lookahead, or mark a keyword with no boundary), bounded backtracking for multi-word identifier vs keyword, and an operator longest-match trie.
- Bench table (regex vs FSM) taken from packages/parser/bench/comparison.json: 200×–3000× speedup, with the superlinear behavior eliminated.
- Instructions for using new Parser({ tokenizer: 'regex' }) to roll back when debugging.
- Procedure for adding a new keyword (update both specs.ts and the KEYWORDS array in tokenizer-fsm.ts).
Wire the new page into the Vitepress sidebar under the "Kiến trúc" (Architecture) section.
README: add a state-machine tokenizer row to the feature table, link to the architecture doc, update the test count (249 → 402), and add pnpm bench to the dev commands.
getting-started.md: mention pnpm bench / pnpm bench:baseline and link to the architecture page.
roadmap.md: add a "Cập nhật ngoài lịch trình" (off-roadmap update) note pointing to the new doc.
CHANGELOG.md: log the perf migration with full bench numbers.
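
For readers unfamiliar with the "operator longest-match trie" mentioned in the FSM design bullet, here is a minimal sketch of the general technique. It is not taken from the vietscript codebase: `TrieNode`, `buildTrie`, `matchOperator`, and the operator list are all hypothetical names chosen for illustration, and the real tokenizer may be structured differently.

```typescript
// Hypothetical sketch of longest-match operator lookup with a trie.
// None of these names come from the vietscript source.
interface TrieNode {
  children: Map<number, TrieNode>; // keyed by UTF-16 char code
  terminal: boolean;               // true if the path to this node spells an operator
}

function buildTrie(operators: string[]): TrieNode {
  const root: TrieNode = { children: new Map(), terminal: false };
  for (const op of operators) {
    let node = root;
    for (let i = 0; i < op.length; i++) {
      const code = op.charCodeAt(i);
      let next = node.children.get(code);
      if (!next) {
        next = { children: new Map(), terminal: false };
        node.children.set(code, next);
      }
      node = next;
    }
    node.terminal = true;
  }
  return root;
}

// Returns the length of the longest operator starting at `pos`, or 0 if none.
function matchOperator(root: TrieNode, input: string, pos: number): number {
  let node = root;
  let best = 0;
  for (let i = pos; i < input.length; i++) {
    const next = node.children.get(input.charCodeAt(i));
    if (!next) break;          // no longer a prefix of any operator
    node = next;
    if (node.terminal) best = i - pos + 1; // remember the longest match so far
  }
  return best;
}

// Illustrative operator set (an assumption, not the project's actual list).
const operatorTrie = buildTrie(["=", "==", "===", "=>", "!", "!=", "!=="]);
```

The walk visits each input character at most once and never backtracks past `pos`, so matching costs time proportional to the length of the longest operator rather than to the number of operators.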

Test plan

Run pnpm test: 402/402 pass
Run pnpm docs:build and verify that no NEW dead links are introduced (the 4 dead links already present on main were reproduced and are unrelated to this PR)
Sidebar renders the "Kiến trúc Tokenizer (state machine)" entry correctly
Relative links in the new page (../roadmap.md, ../compatibility.md) resolve correctly

Notes

No code logic changed, docs only.
Pre-existing dead links (CONTRIBUTING, packages/plugins/{vite,webpack}, basics/index) are intentionally left untouched: the scope here is small, and a separate PR should handle them.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o

Generated by Claude Code

- New page docs/architecture/tokenizer.md covering FSM design (char-code dispatch, keyword trie with per-entry boundary rules, bounded backtracking for multi-word identifier vs keyword, operator longest-match), parity testing model, and bench harness
- Wire it into the vitepress sidebar under a new "Kiến trúc" section
- README: add tokenizer perf row to feature table, link to the new doc, update test count (249 → 402), add `pnpm bench` to dev commands
- getting-started.md: mention `pnpm bench` / `pnpm bench:baseline` and link to the architecture page
- roadmap.md: add an "off-roadmap update" note pointing to the new doc
- CHANGELOG.md: log the perf migration with bench numbers

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o

Now that the state-machine tokenizer is the only one, remove the dual implementation and present the codebase as if it were the first version.
- Delete `tokenizer-fsm.ts` and inline its content into `tokenizer.ts`, renaming the class from `TokenizerFSM` to `Tokenizer`.
- Remove `ITokenizer`, `TokenizerKind`, `ParserOptions`, `createTokenizer` factory from `parser.ts`. `Parser` constructor takes no options again and always uses `new Tokenizer(this)`.
- Delete `packages/parser/src/constants/specs.ts` (regex spec table) and the now-unused `Spec` type from `@vietscript/shared`.
- Drop `tokenizer-fsm.test.ts` (parity-vs-regex tests no longer apply); remaining tokenizer behavior is covered by tokenizer-edge, vietnamese-keywords, identifier-match-keyword, plus the snapshot drift / fixture parity smoke tests.
- Simplify benches: tokenizer.bench.ts now benches a single Tokenizer across the 5 fixtures; parser.bench.ts drops the "regex baseline" name.
- Drop stale comparison.json; baseline.json regenerated from the single tokenizer.

Docs:
- Rewrite `docs/architecture/tokenizer.md` to describe the current tokenizer as-is (no "switched from regex" framing, no comparison tables).
- README: replace the perf-comparison row with a neutral description of the tokenizer.
- roadmap.md / getting-started.md: drop comparison wording, link to the architecture page.
- CHANGELOG: collapse the migration entry into a neutral "Tokenizer" section describing the current design.

Tests: 357/357 pass. Lint + typecheck clean across 7 packages.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o

claude added 2 commits April 27, 2026 19:39

imrim12 merged commit f0b110f into main May 1, 2026
1 of 9 checks passed

imrim12 deleted the claude/docs-fsm-tokenizer branch May 1, 2026 07:44

imrim12 merged 2 commits into main from claude/docs-fsm-tokenizer
