Skip to content

embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip#322

Open
hansn74 wants to merge 2 commits into
kenn-io:mainfrom
hansn74:feat/embed-sanitize
Open

embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip#322
hansn74 wants to merge 2 commits into
kenn-io:mainfrom
hansn74:feat/embed-sanitize

Conversation

@hansn74
Copy link
Copy Markdown
Contributor

@hansn74 hansn74 commented May 13, 2026

Summary

Adds four opt-in (default-on) transforms to Preprocess() so noise-laden emails (inline base64 images, HTML residue, tracking-tagged URLs, HTML→text whitespace bloat) tokenize back down inside the embedder's context window:

  • strip_html — drop <style>…</style> / <script>…</script> blocks, generic <…> tags, decode HTML entities
  • strip_base64 — drop data:…;base64,… URIs and bare base64 runs ≥200 chars (excluding / so URL paths survive)
  • strip_url_tracking — drop utm_*, fbclid, gclid, etc. query params from http(s) URLs
  • collapse_whitespace — normalize CRLF→LF, trim per-line trailing whitespace, collapse runs of ≥3 newlines and ≥2 horizontal spaces

Motivation

Running build-embeddings against a 2.2M-message corpus on nomic-embed-text (8,192-token window) with max_input_chars = 6000, ~1.7% of messages still trip Ollama's HTTP 400: input length exceeds context length. The offenders are almost always polluted with one of the four patterns above — a 30 KB inline image, leaked <table style="…"> markup, or campaign-tagged URLs that repeat across newsletters. Each offender forces the worker to downshift to batch_size = 1 to drain (~25 s), capping real throughput at ~10 msg/s instead of ~42.

Stripping the pollution shrinks dense input to clean prose without semantic loss, eliminates the sawtooth, and improves vector quality by not averaging embeddings over CSS / base64 gibberish.

Usage

Defaults are on. Opt out per-toggle in config.toml:

[vector.preprocess]
strip_html = false           # default true
strip_base64 = false         # default true
strip_url_tracking = false   # default true
collapse_whitespace = false  # default true

Config / migration

  • Follows the existing *bool tristate pattern from strip_quotes / strip_signatures: nil ⇒ default true, explicit false preserved verbatim.
  • EmbeddingsConfig.Fingerprint() is unchanged (<model>:<dimension>), so existing partly-built indexes keep their generation. Users with mature indexes who want full benefit should --full-rebuild.

Pipeline order

Deliberate, documented inline. Notably, base64/data-URI stripping runs before HTML stripping so an oversized <img src="data:image/png;base64,…"> (longer than reHTMLTag's 500-char defensive ceiling) has its payload removed first, leaving a small enough tag for the subsequent HTML pass to sweep. Regression-tested.

Tests

  • Per-transform unit tests with both positive (strips the noise) and negative (preserves prose, URLs, hashes, short tokens) cases
  • Three explicit regression tests:
    • URL paths that look base64-ish must survive (/ excluded from blob class)
    • Oversized <img src="data:…"> tags must sweep cleanly (pipeline order)
    • CRLF input must collapse identically to LF (normalization at top)
  • Config-tier tests verify the four new toggles honour the same nil/true/false tristate as the existing pair
  • Full-pipeline end-to-end test combining every transform

Co-Authored-By: Claude Opus 4.7 (1M context)

Adds four opt-in (default-on) transforms to Preprocess() so noise-laden
emails (inline base64 images, HTML residue, tracking-tagged links,
HTML→text whitespace bloat) tokenize back down inside the embedder's
context window:

  strip_html           strip <style>/<script> blocks, generic <tags>,
                       decode HTML entities
  strip_base64         strip data:...;base64,... URIs and bare base64
                       runs >=200 chars (excluding '/' so URL paths
                       survive)
  strip_url_tracking   drop utm_*, fbclid, gclid, etc. query params
  collapse_whitespace  normalize CRLF -> LF, trim per-line trailing
                       whitespace, collapse runs of >=3 newlines, runs
                       of >=2 horizontal spaces

Motivation: while building embeddings for a 2.2M-message corpus on
nomic-embed-text (8192-token window), ~1.7% of messages tripped the
endpoint's context-length check even after the 6000-char rune cap. The
offenders were almost always polluted with one of the four patterns
above: a 30KB inline image, leaked <table style="..."> markup, or
campaign-tagged URLs repeating across newsletters. Stripping these
shrinks dense input to clean prose without semantic loss, eliminates
the downshift-to-batch-size=1 sawtooth that capped real throughput at
~10 msg/s, and improves vector quality by not averaging the embedding
over CSS gibberish.

Config follows the existing PreprocessConfig pattern: *bool in the TOML
tier (nil = "default true", explicit `false` preserved verbatim), plain
bool in the runtime tier, helpers like StripHTMLEnabled() bridge the
two. Both call sites (build-embeddings + the live worker spawned by
`serve`) are wired symmetrically.

Pipeline order matters and is deliberate:
  1. CRLF normalization (line-oriented regexes assume LF)
  2. base64 / data: URI strip (runs before HTML so an oversized
     <img src="data:..."> -- longer than reHTMLTag's 500-char ceiling
     -- has its payload removed first, leaving a small enough tag for
     the subsequent HTML pass to sweep)
  3. HTML strip + entity decode
  4. URL tracking-param strip
  5. existing quote/signature strip
  6. whitespace collapse
  7. TrimSpace + Subject prefix + rune-bounded truncation

Tests cover each transform in isolation, three regression cases
(URL-paths-look-base64, oversized-img-tag, CRLF normalization), and a
full-pipeline end-to-end. Config-tier tests verify the new toggles
honour the same nil/true/false tristate semantics as the existing pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 13, 2026

roborev: Combined Review (0a0d630)

Medium issue found: the embedding preprocessor may strip valid plain-text content when HTML stripping is enabled by default.

Medium

  • Location: internal/vector/embed/preprocess.go:47
  • Problem: The HTML stripper removes any <...> span, not just real HTML tags. Plain-text email content such as John <john@example.com>, placeholders, comparisons, or code snippets can be deleted from embeddings.
  • Fix: Use an HTML tokenizer or allowlist of real tag names, or only run tag stripping when the body appears to be HTML.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Replaces the original `<[^>]{0,500}>` with a stricter tag-name
pattern `</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>` so the
stripper no longer eats text that merely contains angle brackets:

  John <john@example.com>      kept verbatim (@ rejects tag-name)
  See <https://example.com>.   kept verbatim (: rejects tag-name)
  x < 3 and y > 4              kept verbatim (space-then-digit rejects)
  <Aug 6, 2026>                kept verbatim (space rejects)

Real HTML tags (<p>, <br/>, <a href="...">, </div>, <table style="...">)
continue to match. The {0,400} attribute-body cap is moved inside an
optional non-capturing group that only fires when a whitespace-then-
attributes section actually follows the tag name, so the stripper
treats `<p>` and `<a href="...">` symmetrically.

Caught by roborev on PR kenn-io#322.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (3823b23)

Embedding preprocessing can mix old and new vector semantics under the same active generation.

Medium

  • cmd/msgvault/cmd/embed_vector.go:82 - Generation identity still keys only on embedding model/dimension fingerprint, while preprocessing now changes the exact text sent to the embedder. Existing active generations can be reused with old vectors, then new rows get embedded with sanitized text under the same generation.
    • Fix: Include a preprocessing version/hash in the generation identity, or force/create a rebuild when effective preprocessing settings change.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@hansn74
Copy link
Copy Markdown
Contributor Author

hansn74 commented May 15, 2026

Acknowledging roborev's 3823b23 medium finding (preprocessing-vs-generation-fingerprint) — flagging that it's a real concern but deferring the fix to a maintainer call rather than handling it inline. Reasoning:

The finding is correct: EmbeddingsConfig.Fingerprint() is <model>:<dimension> only, so toggling a preprocess flag mid-life of an active generation produces mixed-preprocessing vectors in the same index. The places this matters most are messages that the sanitize layer materially rewrites (HTML newsletters, base64-leak bodies); for typical Latin prose the textual delta is tiny and cosine similarity tolerates it well, so retrieval quality is usually fine without a rebuild. The places it could surface are corner-case queries that depend on exactly the sanitized terms.

The proper fix is to add a stable hash of the effective preprocess settings to the fingerprint — any change to [vector.preprocess] then forces a new generation, with the existing pending→building→active lifecycle handling the transition cleanly. I held off because:

  1. This changes the fingerprint contract introduced in Add semantic and hybrid search via local vector embeddings #277. You have an opinion about which axes should and shouldn't trigger a rebuild; adding preprocess axes is a real architectural choice, not a localized fix to this PR. I'd rather you pick the granularity (raw config hash vs versioned major bump vs explicit [vector.preprocess].version) than impose mine.
  2. The current behavior is recoverable. Users with a mature index can --full-rebuild after enabling the new sanitize defaults. Documenting that as a release note (or surfacing a warning when the running config differs from the building-gen's recorded config) is a smaller change than changing the fingerprint identity wholesale.

Happy to push the fingerprint extension into this PR if you'd prefer strict consistency over the documented-rebuild path — let me know the shape you want (config-hash vs version-int vs other) and I'll add it. Otherwise this can ship and the fingerprint redesign can be a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant