Content-fingerprint dedup on the write path (cheap hash dedup before mem0 extraction) #48

@imonroe

Summary

Add a cheap, deterministic content-fingerprint check on the add_memory / POST /api/v1/memories write path so that re-submitting identical content is a no-op before we pay for an LLM fact-extraction call.

Idea from OB1

OB1's recipes/content-fingerprint-dedup solves "if you import the same ChatGPT export twice, you get double the rows." Its algorithm:

  1. Normalize: lowercase, strip leading/trailing whitespace, collapse runs of whitespace to a single space.
  2. SHA-256 the normalized string → a deterministic 64-char hex fingerprint.
  3. Upsert: INSERT ... ON CONFLICT (content_fingerprint) DO UPDATE (merge metadata on collision, insert otherwise), enforced by a unique index so no caller can bypass it.

Result: "Re-running any import produces 0 new rows for already-imported content."
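For reference, the normalize and hash steps above could look roughly like this in Python. This is a minimal sketch, not OB1's actual code: the function names `normalize` and `content_fingerprint` are made up for illustration, and "whitespace" is assumed to mean whatever the regex class `\s` matches.

```python
import hashlib
import re

# Any run of whitespace (spaces, tabs, newlines) is collapsed to one space.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize(content: str) -> str:
    """Lowercase, strip the ends, and collapse internal whitespace runs."""
    return _WHITESPACE_RUN.sub(" ", content.strip()).lower()


def content_fingerprint(content: str) -> str:
    """Return the SHA-256 hex digest (64 chars) of the normalized content."""
    return hashlib.sha256(normalize(content).encode("utf-8")).hexdigest()
```

Two submissions that differ only in case or spacing, e.g. `"  Hello\n\tWorld "` and `"hello world"`, normalize to the same string and therefore get the same fingerprint.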

Why it fits memserv

  • mem0 already deduplicates semantically, but it does so by invoking the LLM (Claude Haiku) on every add() — see PRD §19 ("Fact-extraction cost scales with add_memory call volume"). A pre-extraction hash check skips that cost entirely for byte-identical re-submits, which is exactly what happens during the bulk imports proposed in the import-toolkit issue and on webhook/n8n retries.
  • It's a write-path optimization that doesn't touch the single-user / dual-protocol invariants.

Proposed approach

  • Compute a normalized SHA-256 fingerprint of the raw content in app/rest.py / app/mcp_server.py (or a shared helper in app/memory.py).
  • Store the fingerprint in the memory's mem0 metadata (e.g. metadata["content_fp"]) on add.
  • On a new add, first check Qdrant for an existing payload with that fingerprint (a payload filter, no vector search needed) and short-circuit with the existing record if found.
  • Make it opt-out via a flag/param for callers who deliberately want re-extraction.
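A rough sketch of how those pieces might fit together on the write path. Everything here is hypothetical: `InMemoryFingerprintStore` stands in for the Qdrant payload-filter lookup, `extract` stands in for the mem0 `add()` call that triggers LLM extraction, and the `dedup` parameter is one possible shape for the opt-out flag. None of these names exist in memserv today.

```python
import hashlib
import re
from typing import Callable, Optional


def content_fingerprint(content: str) -> str:
    """Normalized SHA-256 fingerprint (lowercase, trimmed, whitespace collapsed)."""
    normalized = re.sub(r"\s+", " ", content.strip()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InMemoryFingerprintStore:
    """Hypothetical stand-in for a Qdrant lookup filtered on the stored fingerprint."""

    def __init__(self) -> None:
        self._by_fp: dict[str, dict] = {}

    def find(self, fp: str) -> Optional[dict]:
        return self._by_fp.get(fp)

    def save(self, record: dict) -> None:
        self._by_fp[record["metadata"]["content_fp"]] = record


def add_memory(
    content: str,
    store: InMemoryFingerprintStore,
    extract: Callable[[str], dict],
    dedup: bool = True,
) -> dict:
    """Return the existing record on a fingerprint hit; otherwise extract and store."""
    fp = content_fingerprint(content)
    if dedup:
        existing = store.find(fp)
        if existing is not None:
            # Short-circuit: the expensive extraction call is never made.
            return existing
    record = extract(content)  # assumed to be the costly LLM-backed step
    record.setdefault("metadata", {})["content_fp"] = fp
    store.save(record)
    return record
```

With `dedup=True` (the default in this sketch), a second submit of the same content returns the stored record without calling `extract`; passing `dedup=False` forces re-extraction, which here overwrites the stored record for that fingerprint.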

Notes / scope

We can't reuse OB1's Postgres ON CONFLICT mechanic directly (we're on Qdrant), but the normalize→hash→check pattern ports cleanly as a payload-filter lookup. Keep it single-user; the fingerprint lives in metadata, not a new table.

Source: https://github.com/NateBJones-Projects/OB1
