[Priority 1] Implement Context Evaluator with validation pipeline #59
Status: Open
Labels: enhancement (New feature or request)
Description
Problem
We have no systematic quality control for recall results. Hallucinations, incomplete answers, and low-confidence recalls go undetected. Human knowledge (corrections, domain expertise) can only be stored via manual rlm remember calls.
Proposal (from AIGNE paper analysis)
Add a Context Evaluator after every recall session that:
- Validates output against source context
- Computes a confidence score based on:
  - Coverage: did we find entries for all query terms?
  - Coherence: do subagent findings agree or contradict?
  - Completeness: any obvious gaps?
- Triggers human review when confidence < 0.7
- Stores human corrections as new memory entries tagged human-verified
- Logs validation outcomes to performance.jsonl
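A minimal sketch of what the scoring step could look like. All names here (validate_recall_results, ValidationResult, the per-finding contradicts and sources flags, and the equal-weight average) are assumptions for illustration, not the final API:

```python
# Hypothetical evaluator sketch -- names and signal definitions are assumed.
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # proposal: trigger human review below 0.7


@dataclass
class ValidationResult:
    confidence: float
    needs_human_review: bool
    signals: dict


def validate_recall_results(query_terms, entries, findings):
    """Score a recall session on coverage, coherence, and completeness."""
    # Coverage: fraction of query terms with at least one matching entry.
    covered = sum(
        1 for term in query_terms
        if any(term.lower() in e.lower() for e in entries)
    )
    coverage = covered / len(query_terms) if query_terms else 0.0

    # Coherence: fraction of subagent findings not flagged as contradictory
    # (assumes each finding dict carries a precomputed "contradicts" flag).
    agreeing = sum(1 for f in findings if not f.get("contradicts"))
    coherence = agreeing / len(findings) if findings else 0.0

    # Completeness: crude proxy -- did every finding cite source context?
    sourced = sum(1 for f in findings if f.get("sources"))
    completeness = sourced / len(findings) if findings else 0.0

    confidence = (coverage + coherence + completeness) / 3
    return ValidationResult(
        confidence=confidence,
        needs_human_review=confidence < REVIEW_THRESHOLD,
        signals={
            "coverage": coverage,
            "coherence": coherence,
            "completeness": completeness,
        },
    )
```

The equal-weight average is a placeholder; real weights would likely be tuned against logged validation outcomes.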
Implementation
- Create rlm/evaluator.py with validate_recall_results()
- Integrate into recall pipeline (after synthesis, before return)
- Add user prompt for verification when confidence is low
- Store human annotations with provenance
- Log all validation metrics
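For the logging step, one possible shape for an append-only performance.jsonl writer. The function name, record fields, and the human_verified provenance flag are assumptions, not a spec:

```python
# Hypothetical metrics logger -- record layout is assumed, not final.
import json
import time
from pathlib import Path


def log_validation(path, query, confidence, signals, human_verified=False):
    """Append one validation outcome as a JSON line to performance.jsonl."""
    record = {
        "ts": time.time(),              # wall-clock timestamp of the session
        "query": query,
        "confidence": round(confidence, 3),
        "signals": signals,             # coverage / coherence / completeness
        "human_verified": human_verified,  # provenance for human corrections
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

JSON Lines keeps each outcome independently parseable, so later self-improvement passes can stream the file without loading it whole.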
Impact
- Quality control — catch/correct errors before they propagate
- Trust — users see confidence scores
- Self-improvement — validation failures inform learned patterns
- Human knowledge capture — corrections become first-class memories
Effort
2-3 days
Related
- Context Evaluator from 'Everything is Context' paper (arxiv 2512.05470)
- Self-improving strategies (learned_patterns.md)