
refactor(eval): unify 3 eval-pipeline tokenizers via monolith bm25-memory.tokenize (PR-3 re-author)#7

Merged
jaytoone merged 1 commit into jaytoone:master from hang-in:feat/upstream-pr3-tokenizer-monolith
May 10, 2026

Conversation

@hang-in (Contributor) commented May 10, 2026

Summary

Re-author of Draft PR #3 directly into the upstream monolith path, per the boundary decision in #1 (comment 4414942453):

> Decision: monolith path for now. The decomposition in tunaCtx is the right long-term architecture, but I don't want to ship the package boundary in the same cycle as the functional fixes — too much blast radius.

This PR keeps the _bm25/ package decomposition out and applies the eval/production tokenizer unification directly to the upstream paths.

Sites converted

Three eval sites stop defining their own tokenize() and load the canonical implementation from src/hooks/bm25-memory.py via importlib.util (the hyphenated filename rules out a normal import). Same loader pattern as the re-skinned PR-5 regression test.

| Site | Old local definition |
| --- | --- |
| `benchmarks/eval/g1_docs_bm25_eval.py` | `r'\d+[-–]\d+\|\d+\.\d+\|\w+'` |
| `benchmarks/eval/g1_longterm_baseline_eval.py` | `r'\b\w+\b'` (bm25 baseline) |
| `benchmarks/eval/g2_docs_paraphrase_eval.py` | particle strip + decimal preservation (in-file `_KO_PARTICLES`) |

After this PR, eval and production both go through the same bm25-memory.tokenize().
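For a concrete sense of the divergence, here is what Python's `re` actually produces for the two regex-based sites on the smoke-test input (this is just the raw regex behavior, not the canonical `bm25-memory.tokenize`, which also lowercases):

```python
# Sketch of the two OLD local tokenizer definitions, applied to the
# same input string used in the PR's smoke test.
import re

text = 'BM25와 검색하다 0.595'

# g1_docs_bm25_eval.py — alternation preserves ranges and decimals
old_docs = re.findall(r'\d+[-–]\d+|\d+\.\d+|\w+', text)
assert old_docs == ['BM25와', '검색하다', '0.595']

# g1_longterm_baseline_eval.py — plain word boundaries split decimals apart
old_baseline = re.findall(r'\b\w+\b', text)
assert old_baseline == ['BM25와', '검색하다', '0', '595']
```

Neither lowercases its output, which is one of the behavioral gaps the unification closes.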

Sites intentionally NOT converted

Per the same #1 thread, two sites are kept divergent with annotated rationale:

| Site | Why |
| --- | --- |
| `src/cli/telemetry.py` | identifier-frequency stats, not BM25 ranking |
| `src/retrieval/bm25_retriever.py` | code-search needs raw TF; canonical's `dict.fromkeys()` dedup would flatten the score distribution on duplicate identifiers |

(These already have annotation comments in upstream — leaving as-is.)
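To illustrate the retriever rationale, here is a minimal sketch of how a `dict.fromkeys()` dedup step destroys raw term frequency (the `dedup_tokens` helper below only mimics that one step of the canonical tokenizer; it is not the real implementation):

```python
# Hypothetical illustration of the dedup-vs-raw-TF tension described above.
import re

def dedup_tokens(text: str) -> list[str]:
    tokens = re.findall(r'\w+', text.lower())
    return list(dict.fromkeys(tokens))  # order-preserving dedup

code = "user_id = user_id or default_user_id(user_id)"
raw = re.findall(r'\w+', code.lower())

# Raw tokenization keeps the repeated identifier (TF signal intact)...
assert raw.count('user_id') == 3
# ...while dedup collapses it to one occurrence, flattening the
# term-frequency distribution BM25 code-search scoring relies on.
assert dedup_tokens(code).count('user_id') == 1
```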

Loader pattern

```python
import importlib.util
from pathlib import Path

# The canonical tokenizer lives in a hyphenated filename, so a normal
# `import` statement can't reach it — load the module by path instead.
_MONOLITH = Path(__file__).resolve().parents[2] / "src" / "hooks" / "bm25-memory.py"
_spec = importlib.util.spec_from_file_location("bm25_memory", _MONOLITH)
_bm25_memory = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_bm25_memory)
tokenize = _bm25_memory.tokenize  # canonical
```
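The pattern is self-contained enough to demo against a throwaway file (the stub `tokenize` below is a placeholder, not the real `bm25-memory.py` implementation):

```python
# Standalone demo of the same importlib.util loader pattern: create a
# throwaway hyphenated module and load it by path, since `import
# bm25-memory` is a SyntaxError.
import importlib.util
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    mod_path = Path(d) / "bm25-memory.py"  # hyphen: unimportable by name
    mod_path.write_text("def tokenize(text):\n    return text.lower().split()\n")

    spec = importlib.util.spec_from_file_location("bm25_memory", mod_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

assert mod.tokenize("Hello World") == ["hello", "world"]
```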

Validation

  • All three files: ast.parse OK
  • Monolith loader smoke test: tokenize('BM25와 검색하다 0.595') → ['bm25', 'bm25와', '검색하다', '0.595']
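The "ast.parse OK" check amounts to a syntax-only validation; a minimal sketch of that check (applied here to inline strings rather than the three converted files):

```python
# Sketch of the "ast.parse OK" validation: parse succeeds iff the
# source is syntactically valid Python. No execution happens.
import ast

def syntax_ok(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

assert syntax_ok("tokenize = _bm25_memory.tokenize\n")
assert not syntax_ok("def broken(:\n")
```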

Closes / supersedes

  • Closes (when merged): the Draft hang-in/tunaCtx#3 — same scope, monolith form

Related: #1

…mory.tokenize

Per jaytoone#1 — re-author of PR jaytoone#3 (Draft hang-in/tunaCtx#3) directly into
the upstream monolith path, since the boundary decision is to keep the
package decomposition out of this functional cycle.

Three eval sites stop defining their own `tokenize()` and load the
canonical implementation from `src/hooks/bm25-memory.py` via
`importlib.util` (the hyphenated filename rules out a normal `import`).
Same loader pattern as the re-skinned PR-5 regression test.

Sites converted:
  - benchmarks/eval/g1_docs_bm25_eval.py      (was r'\d+[-–]\d+|\d+\.\d+|\w+')
  - benchmarks/eval/g1_longterm_baseline_eval.py (was r'\b\w+\b' — bm25 baseline)
  - benchmarks/eval/g2_docs_paraphrase_eval.py (was particle strip + decimal)

Sites intentionally NOT converted (annotation rationale stays):
  - src/cli/telemetry.py        — identifier-frequency stats, not BM25 ranking
  - src/retrieval/bm25_retriever.py — needs raw TF; canonical's
    dict.fromkeys() dedup would flatten code-search score distribution

Once landed, eval and production go through the same tokenizer
implementation, closing the eval/production divergence flagged in
issue jaytoone#1.

Validation:
- All three files: AST OK
- Monolith loader smoke test: tokenize('BM25와 검색하다 0.595')
  → ['bm25', 'bm25와', '검색하다', '0.595']

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>