docs: document state-machine tokenizer architecture #13

Merged
imrim12 merged 2 commits into main from claude/docs-fsm-tokenizer on May 1, 2026

imrim12 (Owner) commented Apr 27, 2026

Summary

Follow-up to the already-merged perf PR (FSM tokenizer replacing the regex tokenizer). Until now the tokenizer architecture existed only in the code and commit messages; this PR brings it into the docs.

  • Add a new page, docs/architecture/tokenizer.md:
    • FSM design: char-code dispatch; a keyword trie with per-entry boundary rules (WORD / IDENT / NONE, which faithfully mirror \b, the VI lookahead, or a keyword with no boundary); bounded backtracking for multi-word identifier vs. keyword; an operator longest-match trie.
    • Benchmark table (regex vs. FSM) taken from packages/parser/bench/comparison.json: 200× to 3000× speedup, with the superlinear behavior eliminated.
    • Instructions for using new Parser({ tokenizer: 'regex' }) to roll back for debugging.
    • Procedure for adding a new keyword (update both specs.ts and the KEYWORDS array in tokenizer-fsm.ts).
  • Wire the new page into the Vitepress sidebar under the "Kiến trúc" (Architecture) section.
  • README: add a state-machine tokenizer row to the feature table, link to the architecture doc, update the test count (249 → 402), add pnpm bench to the dev commands.
  • getting-started.md: mention pnpm bench / pnpm bench:baseline and link to the architecture page.
  • roadmap.md: add a "Cập nhật ngoài lịch trình" (off-roadmap update) note pointing to the new doc.
  • CHANGELOG.md: log the perf migration with the full benchmark numbers.
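
To make the keyword trie with per-entry boundary rules from the summary above more concrete, here is a minimal TypeScript sketch. It is not the VietScript source: `TrieNode`, `insertKeyword`, `matchKeyword`, the character-class helpers, and the exact meaning given to WORD and IDENT are assumptions for illustration. In particular, WORD is approximated as "the next character is not an ASCII word character" (a simplification of `\b` that assumes the keyword itself ends in a word character), and IDENT as "the next character cannot continue an identifier", with every non-ASCII code unit treated as an identifier character.

```typescript
// Illustrative sketch only; all names here are hypothetical.
type Boundary = "WORD" | "IDENT" | "NONE";

interface TrieNode {
  children: Map<number, TrieNode>; // keyed by UTF-16 char code
  keyword?: string;                // set when a keyword ends at this node
  boundary?: Boundary;             // boundary rule for that keyword
}

function makeNode(): TrieNode {
  return { children: new Map() };
}

function insertKeyword(root: TrieNode, keyword: string, boundary: Boundary): void {
  let node = root;
  for (let i = 0; i < keyword.length; i++) {
    const code = keyword.charCodeAt(i);
    let child = node.children.get(code);
    if (!child) {
      child = makeNode();
      node.children.set(code, child);
    }
    node = child;
  }
  node.keyword = keyword;
  node.boundary = boundary;
}

// ASCII word characters: 0-9, A-Z, a-z, underscore.
function isAsciiWordChar(code: number): boolean {
  return (
    (code >= 48 && code <= 57) ||
    (code >= 65 && code <= 90) ||
    (code >= 97 && code <= 122) ||
    code === 95
  );
}

// Assumption: "$" and any non-ASCII code unit may continue an identifier.
function isIdentChar(code: number): boolean {
  return isAsciiWordChar(code) || code === 36 || code > 127;
}

// Checks the character that follows a candidate keyword ending at `end`.
function boundaryOk(source: string, end: number, boundary: Boundary): boolean {
  if (boundary === "NONE" || end >= source.length) return true;
  const next = source.charCodeAt(end);
  return boundary === "WORD" ? !isAsciiWordChar(next) : !isIdentChar(next);
}

// Walks the trie from `pos` and returns the longest keyword whose
// boundary rule is satisfied, or null when no keyword matches.
function matchKeyword(root: TrieNode, source: string, pos: number): string | null {
  let node = root;
  let best: string | null = null;
  for (let i = pos; i < source.length; i++) {
    const child = node.children.get(source.charCodeAt(i));
    if (!child) break;
    node = child;
    if (
      node.keyword !== undefined &&
      node.boundary !== undefined &&
      boundaryOk(source, i + 1, node.boundary)
    ) {
      best = node.keyword;
    }
  }
  return best;
}
```

With hypothetical entries `neu` and `neu khong` both registered as IDENT, `matchKeyword(root, "neu khongx", 0)` falls back to `neu`: the longer candidate fails its boundary check, while the shorter one had already passed. Remembering the last accepted match in this way is one simple form of bounded backtracking.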

Test plan

  • pnpm test: 402/402 pass
  • pnpm docs:build: verified that no NEW dead links are introduced (the 4 dead links already present on main were reproduced and are unrelated to this PR)
  • Sidebar renders the "Kiến trúc Tokenizer (state machine)" entry correctly
  • Relative links in the new page (../roadmap.md, ../compatibility.md) resolve correctly

Notes

  • No changes to code logic, docs only.
  • Pre-existing dead links (CONTRIBUTING, packages/plugins/{vite,webpack}, basics/index) are intentionally left untouched; the scope is kept small and they are left for another PR.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o


Generated by Claude Code

claude added 2 commits April 27, 2026 19:39
- New page docs/architecture/tokenizer.md covering FSM design (char-code
  dispatch, keyword trie with per-entry boundary rules, bounded backtracking
  for multi-word identifier vs keyword, operator longest-match), parity
  testing model, and bench harness
- Wire it into the vitepress sidebar under a new "Kiến trúc" section
- README: add tokenizer perf row to feature table, link to the new doc,
  update test count (249 → 402), add `pnpm bench` to dev commands
- getting-started.md: mention `pnpm bench` / `pnpm bench:baseline` and link
  to the architecture page
- roadmap.md: add an "off-roadmap update" note pointing to the new doc
- CHANGELOG.md: log the perf migration with bench numbers
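
The char-code dispatch and operator longest-match ideas listed in this commit can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the code documented in docs/architecture/tokenizer.md: `OpNode`, `buildOperatorTrie`, `longestOperator`, `tokenize`, and the three token kinds are hypothetical names, and the operator set is supplied by the caller.

```typescript
// Illustrative sketch only; all names here are hypothetical.
interface OpNode {
  next: Map<number, OpNode>; // keyed by UTF-16 char code
  terminal: boolean;         // true when an operator ends at this node
}

function buildOperatorTrie(operators: string[]): OpNode {
  const root: OpNode = { next: new Map(), terminal: false };
  for (const op of operators) {
    let node = root;
    for (let i = 0; i < op.length; i++) {
      const code = op.charCodeAt(i);
      let child = node.next.get(code);
      if (!child) {
        child = { next: new Map(), terminal: false };
        node.next.set(code, child);
      }
      node = child;
    }
    node.terminal = true;
  }
  return root;
}

// Length of the longest operator starting at `pos`, or 0 if there is none.
function longestOperator(root: OpNode, source: string, pos: number): number {
  let node = root;
  let best = 0;
  for (let i = pos; i < source.length; i++) {
    const child = node.next.get(source.charCodeAt(i));
    if (!child) break;
    node = child;
    if (node.terminal) best = i + 1 - pos;
  }
  return best;
}

interface Token {
  kind: "number" | "operator" | "word";
  text: string;
}

const isSpace = (c: number): boolean => c === 32 || c === 9 || c === 10 || c === 13;
const isDigit = (c: number): boolean => c >= 48 && c <= 57;

// Dispatches on the char code at the current position instead of
// trying a list of regular expressions one by one.
function tokenize(source: string, ops: OpNode): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < source.length) {
    const code = source.charCodeAt(pos);
    if (isSpace(code)) {
      pos++;
      continue;
    }
    const start = pos;
    if (isDigit(code)) {
      while (pos < source.length && isDigit(source.charCodeAt(pos))) pos++;
      tokens.push({ kind: "number", text: source.slice(start, pos) });
      continue;
    }
    const opLength = longestOperator(ops, source, pos);
    if (opLength > 0) {
      pos += opLength;
      tokens.push({ kind: "operator", text: source.slice(start, pos) });
      continue;
    }
    // Anything else is consumed as a word up to whitespace or an operator.
    pos++;
    while (
      pos < source.length &&
      !isSpace(source.charCodeAt(pos)) &&
      longestOperator(ops, source, pos) === 0
    ) {
      pos++;
    }
    tokens.push({ kind: "word", text: source.slice(start, pos) });
  }
  return tokens;
}
```

Given the operator set `["=", "==", "===", "=>", "+", "++"]`, the input `x=>1++` splits into the word `x`, the operator `=>`, the number `1`, and the operator `++`; the trie walk always prefers the longest operator that is present at the current position.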

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o

Now that the state-machine tokenizer is the only one, remove the dual
implementation and present the codebase as if it were the first version.

- Delete `tokenizer-fsm.ts` and inline its content into `tokenizer.ts`,
  renaming the class from `TokenizerFSM` to `Tokenizer`.
- Remove `ITokenizer`, `TokenizerKind`, `ParserOptions`, `createTokenizer`
  factory from `parser.ts`. `Parser` constructor takes no options again
  and always uses `new Tokenizer(this)`.
- Delete `packages/parser/src/constants/specs.ts` (regex spec table) and
  the now-unused `Spec` type from `@vietscript/shared`.
- Drop `tokenizer-fsm.test.ts` (parity-vs-regex tests no longer apply);
  remaining tokenizer behavior is covered by tokenizer-edge,
  vietnamese-keywords, identifier-match-keyword, plus the snapshot
  drift / fixture parity smoke tests.
- Simplify benches: tokenizer.bench.ts now benches a single Tokenizer
  across the 5 fixtures; parser.bench.ts drops the "regex baseline" name.
- Drop stale comparison.json; baseline.json regenerated from the single
  tokenizer.

Docs:
- Rewrite `docs/architecture/tokenizer.md` to describe the current
  tokenizer as-is (no "switched from regex" framing, no comparison
  tables).
- README: replace the perf-comparison row with a neutral description of
  the tokenizer.
- roadmap.md / getting-started.md: drop comparison wording, link to the
  architecture page.
- CHANGELOG: collapse the migration entry into a neutral "Tokenizer"
  section describing the current design.

Tests: 357/357 pass. Lint + typecheck clean across 7 packages.

https://claude.ai/code/session_01DwkDg2bBKFmfGmfP3Xtg7o
imrim12 merged commit f0b110f into main on May 1, 2026
1 of 9 checks passed
imrim12 deleted the claude/docs-fsm-tokenizer branch on May 1, 2026 07:44