refactor(eval): unify 3 eval-pipeline tokenizers via monolith bm25-memory.tokenize (PR-3 re-author) #7
Merged
jaytoone merged 1 commit into jaytoone:master on May 10, 2026
Conversation
refactor(eval): unify 3 eval-pipeline tokenizers via monolith bm25-memory.tokenize

Per jaytoone#1 — re-author of PR jaytoone#3 (Draft hang-in/tunaCtx#3) directly into the upstream monolith path, since the boundary decision is to keep the package decomposition out of this functional cycle.

Three eval sites stop defining their own `tokenize()` and load the canonical implementation from `src/hooks/bm25-memory.py` via `importlib.util` (the hyphenated filename rules out a normal `import`). Same loader pattern as the re-skinned PR-5 regression test.

Sites converted:
- benchmarks/eval/g1_docs_bm25_eval.py (was r'\d+[-–]\d+|\d+\.\d+|\w+')
- benchmarks/eval/g1_longterm_baseline_eval.py (was r'\b\w+\b' — bm25 baseline)
- benchmarks/eval/g2_docs_paraphrase_eval.py (was particle strip + decimal)

Sites intentionally NOT converted (annotation rationale stays):
- src/cli/telemetry.py — identifier-frequency stats, not BM25 ranking
- src/retrieval/bm25_retriever.py — needs raw TF; canonical's dict.fromkeys() dedup would flatten code-search score distribution

Once landed, eval and production go through the same tokenizer implementation, closing the eval/production divergence flagged in issue jaytoone#1.

Validation:
- All three files: AST OK
- Monolith loader smoke test: tokenize('BM25와 검색하다 0.595') → ['bm25', 'bm25와', '검색하다', '0.595']

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
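From the smoke-test output and the regex/dedup details above, one can sketch what the canonical tokenizer plausibly looks like. This is an illustrative reconstruction, not the actual `src/hooks/bm25-memory.py`; the particle list and helper names are assumptions:

```python
import re

# Hypothetical reconstruction; the particle list and names are
# illustrative, not the actual bm25-memory.py source.
_KO_PARTICLES = ('와', '과', '은', '는', '이', '가', '을', '를')  # assumed subset
_TOKEN_RE = re.compile(r'\d+\.\d+|\w+')

def tokenize(text: str) -> list[str]:
    out = []
    for tok in _TOKEN_RE.findall(text.lower()):
        for p in _KO_PARTICLES:
            if tok.endswith(p) and len(tok) > len(p):
                out.append(tok[:-len(p)])  # particle-stripped form first
                break
        out.append(tok)                    # then the raw token
    return list(dict.fromkeys(out))        # order-preserving dedup

tokenize('BM25와 검색하다 0.595')
# -> ['bm25', 'bm25와', '검색하다', '0.595']
```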
Summary
Re-author of Draft PR #3 directly into the upstream monolith path, per the boundary decision in #1 (comment 4414942453):
> This PR keeps the `_bm25/` package decomposition out and applies the eval/production tokenizer unification directly to the upstream paths.

Sites converted
Three eval sites stop defining their own `tokenize()` and load the canonical implementation from `src/hooks/bm25-memory.py` via `importlib.util` (the hyphenated filename rules out a normal `import`). Same loader pattern as the re-skinned PR-5 regression test.

- `benchmarks/eval/g1_docs_bm25_eval.py` — was `r'\d+[-–]\d+|\d+\.\d+|\w+'`
- `benchmarks/eval/g1_longterm_baseline_eval.py` — was `r'\b\w+\b'` (bm25 baseline)
- `benchmarks/eval/g2_docs_paraphrase_eval.py` — was particle strip + decimal handling (`_KO_PARTICLES`)

After this PR, eval and production both go through the same `bm25-memory.tokenize()`.
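To make the closed divergence concrete, here is how two of the replaced regexes tokenize the smoke-test string. Python's `\w` is Unicode-aware, so Hangul survives, but the decimal splits under `\b\w+\b`, and neither pattern lowercases or particle-strips on its own:

```python
import re

s = 'BM25와 검색하다 0.595'

# Old g1_longterm_baseline_eval.py pattern: splits the decimal apart.
re.findall(r'\b\w+\b', s)
# -> ['BM25와', '검색하다', '0', '595']

# Old g1_docs_bm25_eval.py pattern: keeps decimals and numeric ranges,
# but never emits the particle-stripped 'bm25' the canonical tokenizer adds.
re.findall(r'\d+[-–]\d+|\d+\.\d+|\w+', s)
# -> ['BM25와', '검색하다', '0.595']
```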
Sites intentionally NOT converted
Per the same #1 thread, two sites are kept divergent with annotated rationale:
- `src/cli/telemetry.py` — identifier-frequency stats, not BM25 ranking
- `src/retrieval/bm25_retriever.py` — needs raw term frequency; the canonical `dict.fromkeys()` dedup would flatten the score distribution on duplicate identifiers

(These already have annotation comments in upstream — leaving as-is.)
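A two-line illustration of the `dict.fromkeys()` concern, using a hypothetical identifier stream rather than code from the retriever:

```python
# BM25 over code relies on raw term frequency; the canonical tokenizer's
# order-preserving dedup collapses every term's TF to at most 1.
tokens = ['parse', 'node', 'parse', 'parse', 'visit']
list(dict.fromkeys(tokens))  # -> ['parse', 'node', 'visit']; TF('parse') drops 3 -> 1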
Loader pattern
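A minimal sketch of the loader, assuming the eval scripts sit two directory levels below the repo root; the module alias is an arbitrary choice:

```python
import importlib.util
from pathlib import Path

# bm25-memory.py cannot be imported normally because of the hyphen,
# so load it by file path instead.
_HOOK = Path(__file__).resolve().parents[2] / 'src' / 'hooks' / 'bm25-memory.py'
_spec = importlib.util.spec_from_file_location('bm25_memory', _HOOK)
_mod = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_mod)

tokenize = _mod.tokenize  # canonical tokenizer, the same one production uses
```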
Validation
- All three converted files parse with `ast.parse` — OK
- Monolith loader smoke test: `tokenize('BM25와 검색하다 0.595')` → `['bm25', 'bm25와', '검색하다', '0.595']`
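One way to reproduce both checks locally (an assumed harness, not lifted from the PR; `tokenize` is the loader-bound function from the sketch above):

```python
import ast
from pathlib import Path

for p in (
    'benchmarks/eval/g1_docs_bm25_eval.py',
    'benchmarks/eval/g1_longterm_baseline_eval.py',
    'benchmarks/eval/g2_docs_paraphrase_eval.py',
):
    ast.parse(Path(p).read_text(encoding='utf-8'))  # raises SyntaxError on failure

assert tokenize('BM25와 검색하다 0.595') == ['bm25', 'bm25와', '검색하다', '0.595']
print('AST + smoke test OK')
```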
Closes / supersedes

Supersedes Draft PR #3 (hang-in/tunaCtx#3). Related: #1