docs(readme): propagate Judge accuracy framing across all LongMemEval sections#32

Merged
jaylfc merged 1 commit into master from fix/readme-judge-accuracy-consistency
Apr 19, 2026

Conversation

@jaylfc
Owner

@jaylfc jaylfc commented Apr 19, 2026

Follow-up to #31, which clarified the intro headline. Three more README sections still labelled our 97.0% as Recall@5:

| Section | Before | After |
| --- | --- | --- |
| Benchmark Results table (line 158) | Single "Recall@5" column for all systems | Split into Score + Metric columns — per-row honest labelling |
| Fusion Strategy Comparison (line 181) | Column header "Recall@5" | "Judge accuracy" (all strategies use the Judge harness) |
| Key Features list (line 263) | "97.0% Recall@5" | "97.0% end-to-end Judge accuracy" |

Also softened "MemPalace-equivalent" to "same algorithm as MemPalace" — that row describes the algorithm, not the metric, which differs between the two systems.

Remaining Recall@5 mentions in the README are intentional: competitors' published numbers, the paragraph explaining why direct comparison isn't apples-to-apples, and the code-block comment labelling benchmarks/longmemeval_recall.py (which is the Recall@5 reproduction script).
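For readers unfamiliar with why the two metrics differ, here is a minimal sketch. This is illustrative only: the function names and toy data are made up for this comment, and the real harnesses (`benchmarks/longmemeval_recall.py` and the Judge script) are not shown here. Recall@5 only asks whether gold evidence appears in the top-5 retrieved chunks; Judge accuracy grades the final generated answer end-to-end.

```python
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    """Fraction of questions whose top-k retrieved chunks contain any gold evidence chunk."""
    hits = sum(
        1
        for top, evidence in zip(retrieved, gold)
        if evidence & set(top[:k])  # hit if any gold chunk id is in the top-k
    )
    return hits / len(gold)


def judge_accuracy(verdicts: list[bool]) -> float:
    """Fraction of final answers an (LLM) judge marked correct — end-to-end."""
    return sum(verdicts) / len(verdicts)


# Toy data for 4 questions: retrieval can succeed while the answer still fails,
# which is why the two numbers are not comparable.
retrieved = [["c1", "c2"], ["c9"], ["c3", "c4"], ["c7"]]
gold = [{"c2"}, {"c5"}, {"c4"}, {"c7"}]
print(recall_at_k(retrieved, gold))              # 0.75 — retrieval-only, looser
print(judge_accuracy([True, False, False, True]))  # 0.5 — grades the answer itself
```

The gap between the two numbers on the same questions is exactly why labelling a Judge score as Recall@5 (or vice versa) is misleading.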

Summary by CodeRabbit

  • Documentation
    • Updated benchmarking tables and narrative to report end-to-end Judge accuracy as the primary metric instead of Recall@5.
    • Clarified measurement differences across systems and added references to benchmark scripts for reproducibility.
    • Updated feature descriptions and fusion strategy comparison metrics accordingly.

… sections

PR #31 clarified the headline (lines 7-9) that our 97.0% is end-to-end
Judge accuracy, not Recall@5. But three downstream references still
labelled it as Recall@5:

1. Benchmark Results table (line 158) — column header was "Recall@5"
   even though our 97% is Judge and the competitors' numbers are the
   different, looser Recall@5 metric. Split into "Score" + "Metric"
   columns so each row is honestly labelled; added a clarifying
   paragraph below the table pointing at both benchmark scripts.
2. Fusion Strategy Comparison (line 181) — column said "Recall@5"
   but all four strategies were measured with the same Judge harness.
   Renamed header to "Judge accuracy" and softened the "MemPalace-
   equivalent" label to "same algorithm as MemPalace" so it describes
   the retrieval approach, not the metric.
3. Key Features list (line 263) — "97.0% Recall@5" → "97.0% end-to-end
   Judge accuracy".

Remaining Recall@5 references are all intentional: competitors'
published numbers, the narrative paragraph explaining the metric
difference, and the code-block comment for `longmemeval_recall.py`
which is the Recall@5 reproduction script.
@coderabbitai

coderabbitai bot commented Apr 19, 2026

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 68770700-ed8d-45df-92ac-02dd3d962c14

📥 Commits

Reviewing files that changed from the base of the PR and between 116edab and a893cf1.

📒 Files selected for processing (1)
  • README.md

📝 Walkthrough

Updated README benchmarking tables and descriptions to report end-to-end Judge accuracy as the primary metric for taOSmd instead of Recall@5, clarified metric differences with other systems, and added references to specific benchmark scripts for reproducibility.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **README Benchmark Documentation**<br>`README.md` | Shifted primary metric reporting from Recall@5 to end-to-end Judge accuracy for taOSmd; updated fusion strategy comparison table headers and metric labels; clarified benchmark methodology differences between taOSmd and other systems; added references to benchmark harness scripts; updated Key Features bullet point accordingly. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Metrics shift like morning dew,

Judge accuracy shines bright and true,

From Recall to end-to-end flow,

The benchmarks' honest truth we show,

Clarity hops through every row! 📊

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: propagating Judge accuracy framing across README sections, which matches the PR objectives of replacing misleading Recall@5 labels with accurate Judge accuracy descriptions. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |




@kilo-code-bot

kilo-code-bot bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (1 file)
  • README.md - 0 issues

Reviewed by grok-code-fast-1:optimized:free · 195,378 tokens

