Summary
Add a cheap, deterministic content-fingerprint check on the add_memory / POST /api/v1/memories write path so that re-submitting identical content is a no-op before we pay for an LLM fact-extraction call.
Idea from OB1
OB1's recipes/content-fingerprint-dedup solves "if you import the same ChatGPT export twice, you get double the rows." Its algorithm:
- Normalize — lowercase, strip leading/trailing whitespace, collapse runs of whitespace to a single space.
- SHA-256 the normalized string → a deterministic 64-char hex fingerprint.
- Upsert: INSERT ... ON CONFLICT (content_fingerprint) DO UPDATE (merge metadata on collision, insert otherwise), enforced by a unique index so no caller can bypass it.
Result: "Re-running any import produces 0 new rows for already-imported content."
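A minimal sketch of the normalize and hash steps, written in Python to match memserv rather than OB1's own code. The function name content_fingerprint is hypothetical, and the whitespace rule is a literal reading of the bullets above, not a copy of OB1's implementation.

```python
import hashlib
import re

# One or more whitespace characters of any kind (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def content_fingerprint(content: str) -> str:
    """Return a 64-char hex SHA-256 of the normalized content.

    Normalization follows the steps listed above: lowercase, strip
    leading/trailing whitespace, collapse whitespace runs to one space.
    """
    normalized = _WHITESPACE_RUN.sub(" ", content.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Inputs that differ only in case or whitespace share a fingerprint.
same = content_fingerprint("  Hello   World\n") == content_fingerprint("hello world")
```

Because the hash covers the normalized string, two submissions that differ only in case or spacing collide on purpose; anything stricter (or looser, such as Unicode normalization) would be a separate design decision.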
Why it fits memserv
- mem0 already deduplicates semantically, but it does so by invoking the LLM (Claude Haiku) on every add(); see PRD §19 ("Fact-extraction cost scales with add_memory call volume"). A pre-extraction hash check skips that cost entirely for byte-identical re-submits (and, after normalization, for ones that differ only in case or whitespace), which is exactly what happens during the bulk imports proposed in the import-toolkit issue and on webhook/n8n retries.
- It's a write-path optimization that doesn't touch the single-user / dual-protocol invariants.
Proposed approach
- Compute a normalized SHA-256 fingerprint of the raw content in app/rest.py / app/mcp_server.py (or a shared helper in app/memory.py).
- Store the fingerprint in the memory's mem0 metadata (e.g. metadata["content_fp"]) on add.
- On a new add, first check Qdrant for an existing payload with that fingerprint (a payload filter, no vector search needed) and short-circuit with the existing record if found.
- Make it opt-out via a flag/param for callers who deliberately want re-extraction.
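The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not memserv's actual code: add_memory_deduped, find_by_fingerprint, extract_and_store and the skip_dedup flag are all hypothetical names, and an in-memory dict stands in for both Qdrant and the mem0 add() call so the example runs on its own.

```python
import hashlib
import re
from typing import Callable, Optional


def content_fingerprint(content: str) -> str:
    """Normalized SHA-256 fingerprint (lowercase, trim, collapse whitespace)."""
    normalized = re.sub(r"\s+", " ", content.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_memory_deduped(
    content: str,
    find_by_fingerprint: Callable[[str], Optional[dict]],
    extract_and_store: Callable[[str, dict], dict],
    skip_dedup: bool = False,  # hypothetical opt-out for deliberate re-extraction
) -> dict:
    """Short-circuit on a known fingerprint before the costly extraction step."""
    fp = content_fingerprint(content)
    if not skip_dedup:
        existing = find_by_fingerprint(fp)
        if existing is not None:
            # No LLM call: hand back the record stored earlier.
            return {"deduplicated": True, "memory": existing}
    # New content (or caller opted out): extract, and store the fingerprint
    # alongside the memory so later submissions can find it.
    stored = extract_and_store(content, {"content_fp": fp})
    return {"deduplicated": False, "memory": stored}


# --- In-memory stand-ins for the Qdrant lookup and the mem0 add() call ---
store: dict = {}
extraction_calls: list = []


def find(fp: str) -> Optional[dict]:
    return store.get(fp)


def extract(content: str, metadata: dict) -> dict:
    extraction_calls.append(content)  # counts how often "the LLM" would run
    record = {"content": content, "metadata": metadata}
    store[metadata["content_fp"]] = record
    return record


first = add_memory_deduped("Hello  world", find, extract)
second = add_memory_deduped("  hello world ", find, extract)  # served from store
```

Note that, unlike OB1's unique index, a check-then-write sequence like this is not atomic: two concurrent identical submissions could both pass the lookup and both trigger extraction.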
Notes / scope
We can't reuse OB1's Postgres ON CONFLICT mechanic directly (we're on Qdrant), but the normalize→hash→check pattern ports cleanly as a payload-filter lookup. Keep it single-user; the fingerprint lives in metadata, not a new table.
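For illustration, the payload-filter lookup might be expressed as the request body for Qdrant's scroll-points REST endpoint (POST /collections/{collection_name}/points/scroll). The helper name and the payload key are assumptions: where mem0 actually places metadata["content_fp"] inside the Qdrant payload would need to be checked against the deployed collection before relying on this.

```python
import json


def fingerprint_scroll_body(fingerprint: str, payload_key: str = "content_fp") -> dict:
    """Build a scroll request body that matches one exact payload value.

    payload_key is an assumption about where the fingerprint ends up in
    the stored payload; adjust it to the real layout.
    """
    return {
        "filter": {
            "must": [
                {"key": payload_key, "match": {"value": fingerprint}},
            ]
        },
        "limit": 1,            # one hit is enough to short-circuit
        "with_payload": True,  # return the stored record, not just its id
    }


body = fingerprint_scroll_body("0" * 64)
print(json.dumps(body, indent=2))
```

An empty result means the content is new; a single returned point is the existing record to hand back. No embedding is computed for this lookup, which is what keeps it cheap.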
Source: https://github.com/NateBJones-Projects/OB1