docs(readme): propagate Judge accuracy framing across all LongMemEval sections#32

Merged
jaylfc merged 1 commit into master from fix/readme-judge-accuracy-consistency
Apr 19, 2026

Conversation

@jaylfc
Owner

@jaylfc jaylfc commented Apr 19, 2026

Follow-up to #31, which clarified the intro headline. Three more README sections still labelled our 97.0% as Recall@5:

| Section | Before | After |
| --- | --- | --- |
| Benchmark Results table (line 158) | Single "Recall@5" column for all systems | Split into Score + Metric columns — per-row honest labelling |
| Fusion Strategy Comparison (line 181) | Column header "Recall@5" | "Judge accuracy" (all strategies use the Judge harness) |
| Key Features list (line 263) | "97.0% Recall@5" | "97.0% end-to-end Judge accuracy" |

Also softened "MemPalace-equivalent" to "same algorithm as MemPalace" — that row describes the algorithm, not the metric, which differs between the two systems.

Remaining Recall@5 mentions in the README are intentional: competitors' published numbers, the paragraph explaining why direct comparison isn't apples-to-apples, and the code-block comment labelling benchmarks/longmemeval_recall.py (which is the Recall@5 reproduction script).
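For readers unfamiliar with why the two metrics differ, here is a minimal sketch. This is illustrative only: the function names and toy data are made up for this comment, and the real harnesses (`benchmarks/longmemeval_recall.py` and the Judge script) are not shown here. Recall@5 only asks whether gold evidence appears in the top-5 retrieved chunks; Judge accuracy grades the final generated answer end-to-end.

```python
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    """Fraction of questions whose top-k retrieved chunks contain any gold evidence chunk."""
    hits = sum(
        1
        for top, evidence in zip(retrieved, gold)
        if evidence & set(top[:k])  # hit if any gold chunk id is in the top-k
    )
    return hits / len(gold)


def judge_accuracy(verdicts: list[bool]) -> float:
    """Fraction of final answers an (LLM) judge marked correct — end-to-end."""
    return sum(verdicts) / len(verdicts)


# Toy data for 4 questions: retrieval can succeed while the answer still fails,
# which is why the two numbers are not comparable.
retrieved = [["c1", "c2"], ["c9"], ["c3", "c4"], ["c7"]]
gold = [{"c2"}, {"c5"}, {"c4"}, {"c7"}]
print(recall_at_k(retrieved, gold))              # 0.75 — retrieval-only, looser
print(judge_accuracy([True, False, False, True]))  # 0.5 — grades the answer itself
```

The gap between the two numbers on the same questions is exactly why labelling a Judge score as Recall@5 (or vice versa) is misleading.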

Summary by CodeRabbit

  • Documentation
    • Updated benchmarking tables and narrative to report end-to-end Judge accuracy as the primary metric instead of Recall@5.
    • Clarified measurement differences across systems and added references to benchmark scripts for reproducibility.
    • Updated feature descriptions and fusion strategy comparison metrics accordingly.

… sections

PR #31 clarified the headline (lines 7-9) that our 97.0% is end-to-end
Judge accuracy, not Recall@5. But three downstream references still
labelled it as Recall@5:

1. Benchmark Results table (line 158) — column header was "Recall@5"
   even though our 97% is Judge and the competitors' numbers are the
   different, looser Recall@5 metric. Split into "Score" + "Metric"
   columns so each row is honestly labelled; added a clarifying
   paragraph below the table pointing at both benchmark scripts.
2. Fusion Strategy Comparison (line 181) — column said "Recall@5"
   but all four strategies were measured with the same Judge harness.
   Renamed header to "Judge accuracy" and softened the "MemPalace-
   equivalent" label to "same algorithm as MemPalace" so it describes
   the retrieval approach, not the metric.
3. Key Features list (line 263) — "97.0% Recall@5" → "97.0% end-to-end
   Judge accuracy".

Remaining Recall@5 references are all intentional: competitors'
published numbers, the narrative paragraph explaining the metric
difference, and the code-block comment for `longmemeval_recall.py`
which is the Recall@5 reproduction script.
@coderabbitai

coderabbitai bot commented Apr 19, 2026

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 68770700-ed8d-45df-92ac-02dd3d962c14

📥 Commits

Reviewing files that changed from the base of the PR and between 116edab and a893cf1.

📒 Files selected for processing (1)
  • README.md

📝 Walkthrough

Updated README benchmarking tables and descriptions to report end-to-end Judge accuracy as the primary metric for taOSmd instead of Recall@5, clarified metric differences with other systems, and added references to specific benchmark scripts for reproducibility.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **README Benchmark Documentation**<br>`README.md` | Shifted primary metric reporting from Recall@5 to end-to-end Judge accuracy for taOSmd; updated fusion strategy comparison table headers and metric labels; clarified benchmark methodology differences between taOSmd and other systems; added references to benchmark harness scripts; updated Key Features bullet point accordingly. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Metrics shift like morning dew,

Judge accuracy shines bright and true,

From Recall to end-to-end flow,

The benchmarks' honest truth we show,

Clarity hops through every row! 📊

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: propagating Judge accuracy framing across README sections, which matches the PR objectives of replacing misleading Recall@5 labels with accurate Judge accuracy descriptions. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |




@kilo-code-bot

kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (1 file)
  • README.md - 0 issues

Reviewed by grok-code-fast-1:optimized:free · 195,378 tokens

