Conversation
… sections PR #31 clarified the headline (lines 7-9) that our 97.0% is end-to-end Judge accuracy, not Recall@5. But three downstream references still labelled it as Recall@5:

1. Benchmark Results table (line 158) — the column header was "Recall@5" even though our 97% is Judge accuracy and the competitors' numbers are the different, looser Recall@5 metric. Split into "Score" + "Metric" columns so each row is honestly labelled; added a clarifying paragraph below the table pointing at both benchmark scripts.
2. Fusion Strategy Comparison (line 181) — the column said "Recall@5", but all four strategies were measured with the same Judge harness. Renamed the header to "Judge accuracy" and softened the "MemPalace-equivalent" label to "same algorithm as MemPalace" so it describes the retrieval approach, not the metric.
3. Key Features list (line 263) — "97.0% Recall@5" → "97.0% end-to-end Judge accuracy".

Remaining Recall@5 references are all intentional: competitors' published numbers, the narrative paragraph explaining the metric difference, and the code-block comment for `longmemeval_recall.py`, which is the Recall@5 reproduction script.
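For item 1, the split table might look roughly like this (a sketch only — the competitor rows, scores, and exact wording are assumptions, not the README's actual content):

```markdown
| System    | Score | Metric                         |
| --------- | ----- | ------------------------------ |
| taOSmd    | 97.0% | End-to-end Judge accuracy      |
| MemPalace | —     | Recall@5 (published)           |
```

Separating "Score" from "Metric" keeps the rows comparable at a glance without implying the numbers were measured the same way.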
Walkthrough: Updated README benchmarking tables and descriptions to report end-to-end Judge accuracy as the primary metric for taOSmd instead of Recall@5, clarified metric differences with other systems, and added references to specific benchmark scripts for reproducibility.
Code Review Summary — Status: No Issues Found | Recommendation: Merge. Files reviewed: 1. Reviewed by grok-code-fast-1:optimized:free · 195,378 tokens.
Follow-up to #31, which clarified the intro headline. Three more README sections still labelled our 97.0% as Recall@5: the Benchmark Results table, the Fusion Strategy Comparison, and the Key Features list. Each now reports it as end-to-end Judge accuracy.
Also softened "MemPalace-equivalent" to "same algorithm as MemPalace" — that row describes the algorithm, not the metric, which differs between the two systems.
Remaining Recall@5 mentions in the README are intentional: competitors' published numbers, the paragraph explaining why direct comparison isn't apples-to-apples, and the code-block comment labelling `benchmarks/longmemeval_recall.py` (which is the Recall@5 reproduction script).
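The distinction the PR enforces can be sketched in a few lines. This is an illustrative toy, not the repo's harness: the field names (`gold_evidence_ids`, `retrieved_ids`, `judge_verdict`) and function names are hypothetical. Recall@5 only asks whether the gold evidence was retrieved in the top five; Judge accuracy scores the final generated answer, so the two are not interchangeable.

```python
def recall_at_5(queries):
    """Fraction of queries whose gold evidence appears among the
    top-5 retrieved item ids (retrieval-only metric)."""
    hits = sum(
        any(doc_id in q["gold_evidence_ids"] for doc_id in q["retrieved_ids"][:5])
        for q in queries
    )
    return hits / len(queries)


def judge_accuracy(queries):
    """Fraction of queries whose *final generated answer* an LLM judge
    marked correct (end-to-end metric, stored here as a boolean verdict)."""
    return sum(1 for q in queries if q["judge_verdict"]) / len(queries)
```

A system can retrieve the right evidence yet answer wrongly (or vice versa), so a single "Recall@5" column covering both kinds of numbers would be misleading — hence the "Score" + "Metric" split.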