Follow-up to feedback from @m13v on #198. Two concrete changes to the memify / compression pipeline.
## Triple-gate before cosine dedup
Current compression pass dedups purely on cosine similarity over normalised chunks. This merges memories that are semantically close but factually distinct:
- "met with Jay on Monday about budgets"
- "met with Jay on Tuesday about hiring"
Cosine >0.9 on both, but they're different events. Fix:
- Extract a (subject, relation, object) triple from each candidate chunk.
- Bucket candidates by overlapping triples.
- Apply cosine similarity dedup within a bucket only — cosine becomes a secondary filter, not the primary key.
- Across-bucket merges are never allowed regardless of cosine score.
This needs an entity + predicate extractor in the compression pipeline. We already have the machinery in the KG extraction step; the work is wiring it in as a pre-dedup gate instead of a parallel concern.
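A minimal sketch of the gate, assuming the KG step can hand back one triple per chunk. All names here (`dedup_with_triple_gate`, `extract_triple`, `embed`) are illustrative placeholders, not the real pipeline API:

```python
from collections import defaultdict

def cosine(a, b):
    # Plain cosine similarity over two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def dedup_with_triple_gate(chunks, embed, extract_triple, threshold=0.9):
    # 1. Bucket candidates by their (subject, relation, object) triple.
    buckets = defaultdict(list)
    for chunk in chunks:
        buckets[extract_triple(chunk)].append(chunk)
    # 2. Cosine dedup runs within a bucket only, so cosine is a secondary
    #    filter; chunks in different buckets are never merged.
    kept = []
    for _triple, members in buckets.items():
        unique = []
        for chunk in members:
            if all(cosine(embed(chunk), embed(u)) < threshold for u in unique):
                unique.append(chunk)
        kept.extend(unique)
    return kept
```

With this shape, the Monday/Tuesday pair lands in different buckets (different objects), so a cosine score above 0.9 can no longer merge them.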
## Claim-level memify granularity
Current pipeline over-chunks: one paragraph can produce 4+ overlapping facts that all carry identical metadata. The storage waste is manageable, but retrieval quality suffers because redundant atomic facts dilute the ranking.
Move to one-assertable-claim-per-node, and let the graph edges carry composition. Example:
- ❌ Today: 4 nodes for "Jay works at JAN LABS", "JAN LABS is a company", "Jay is employed", "Jay's employer is JAN LABS"
- ✅ Target: 1 node `{subject: Jay, predicate: works_at, object: JAN LABS}`; composition lives on the edges
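A rough sketch of the collapse, assuming predicate normalisation comes from the KG extraction step. The `ClaimNode` shape and the `CANONICAL_PREDICATE` table are hypothetical stand-ins for whatever that step actually emits:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClaimNode:
    subject: str
    predicate: str
    object: str

# Assumed normalisation table; the real mapping would come from the
# KG extraction step, not a hard-coded dict.
CANONICAL_PREDICATE = {
    "employed_by": "works_at",
    "employer_is": "works_at",
}

def memify_claims(extracted_triples):
    # One assertable claim per node: surface variants like "Jay works at
    # JAN LABS" and "Jay's employer is JAN LABS" normalise to one key.
    nodes = {}
    for subj, pred, obj in extracted_triples:
        key = (subj.lower(), CANONICAL_PREDICATE.get(pred, pred), obj.lower())
        nodes.setdefault(key, ClaimNode(*key))
    return list(nodes.values())
```

Retrieval then ranks one node per claim instead of four near-duplicates with identical metadata.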
## Acceptance