embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip by hansn74 · Pull Request #322 · kenn-io/msgvault

hansn74 · 2026-05-13T19:07:57Z

Summary

Adds four opt-in (default-on) transforms to Preprocess() so noise-laden emails (inline base64 images, HTML residue, tracking-tagged URLs, HTML→text whitespace bloat) tokenize back down inside the embedder's context window:

strip_html — drop <style>…</style> / <script>…</script> blocks, generic <…> tags, decode HTML entities
strip_base64 — drop data:…;base64,… URIs and bare base64 runs ≥200 chars (excluding / so URL paths survive)
strip_url_tracking — drop utm_*, fbclid, gclid, etc. query params from http(s) URLs
collapse_whitespace — normalize CRLF→LF, trim per-line trailing whitespace, collapse runs of ≥3 newlines and ≥2 horizontal spaces

Motivation

Running build-embeddings against a 2.2M-message corpus on nomic-embed-text (8,192-token window) with max_input_chars = 6000, ~1.7% of messages still trip Ollama's HTTP 400: input length exceeds context length. The offenders are almost always polluted with one of the four patterns above — a 30 KB inline image, leaked <table style="…"> markup, or campaign-tagged URLs that repeat across newsletters. Each offender forces the worker to downshift to batch_size = 1 to drain (~25 s), capping real throughput at ~10 msg/s instead of ~42.

Stripping the pollution shrinks dense input to clean prose without semantic loss, eliminates the sawtooth, and improves vector quality by not averaging embeddings over CSS / base64 gibberish.

Usage

Defaults are on. Opt out per-toggle in config.toml:

[vector.preprocess]
strip_html = false           # default true
strip_base64 = false         # default true
strip_url_tracking = false   # default true
collapse_whitespace = false  # default true

Config / migration

Follows the existing *bool tristate pattern from strip_quotes / strip_signatures: nil ⇒ default true, explicit false preserved verbatim.
EmbeddingsConfig.Fingerprint() is unchanged (<model>:<dimension>), so existing partly-built indexes keep their generation. Users with mature indexes who want full benefit should --full-rebuild.

Pipeline order

Deliberate, documented inline. Notably, base64/data-URI stripping runs before HTML stripping so an oversized <img src="data:image/png;base64,…"> (longer than reHTMLTag's 500-char defensive ceiling) has its payload removed first, leaving a small enough tag for the subsequent HTML pass to sweep. Regression-tested.

Tests

Per-transform unit tests with both positive (strips the noise) and negative (preserves prose, URLs, hashes, short tokens) cases
Three explicit regression tests:
- URL paths that look base64-ish must survive (/ excluded from blob class)
- Oversized <img src="data:…"> tags must sweep cleanly (pipeline order)
- CRLF input must collapse identically to LF (normalization at top)
Config-tier tests verify the four new toggles honour the same nil/true/false tristate as the existing pair
Full-pipeline end-to-end test combining every transform

Co-Authored-By: Claude Opus 4.7 (1M context)

Adds four opt-in (default-on) transforms to Preprocess() so noise-laden emails (inline base64 images, HTML residue, tracking-tagged links, HTML→text whitespace bloat) tokenize back down inside the embedder's context window: strip_html strip <style>/<script> blocks, generic <tags>, decode HTML entities strip_base64 strip data:...;base64,... URIs and bare base64 runs >=200 chars (excluding '/' so URL paths survive) strip_url_tracking drop utm_*, fbclid, gclid, etc. query params collapse_whitespace normalize CRLF -> LF, trim per-line trailing whitespace, collapse runs of >=3 newlines, runs of >=2 horizontal spaces Motivation: while building embeddings for a 2.2M-message corpus on nomic-embed-text (8192-token window), ~1.7% of messages tripped the endpoint's context-length check even after the 6000-char rune cap. The offenders were almost always polluted with one of the four patterns above: a 30KB inline image, leaked <table style="..."> markup, or campaign-tagged URLs repeating across newsletters. Stripping these shrinks dense input to clean prose without semantic loss, eliminates the downshift-to-batch-size=1 sawtooth that capped real throughput at ~10 msg/s, and improves vector quality by not averaging the embedding over CSS gibberish. Config follows the existing PreprocessConfig pattern: *bool in the TOML tier (nil = "default true", explicit `false` preserved verbatim), plain bool in the runtime tier, helpers like StripHTMLEnabled() bridge the two. Both call sites (build-embeddings + the live worker spawned by `serve`) are wired symmetrically. Pipeline order matters and is deliberate: 1. CRLF normalization (line-oriented regexes assume LF) 2. base64 / data: URI strip (runs before HTML so an oversized <img src="data:..."> -- longer than reHTMLTag's 500-char ceiling -- has its payload removed first, leaving a small enough tag for the subsequent HTML pass to sweep) 3. HTML strip + entity decode 4. URL tracking-param strip 5. existing quote/signature strip 6. whitespace collapse 7. TrimSpace + Subject prefix + rune-bounded truncation Tests cover each transform in isolation, three regression cases (URL-paths-look-base64, oversized-img-tag, CRLF normalization), and a full-pipeline end-to-end. Config-tier tests verify the new toggles honour the same nil/true/false tristate semantics as the existing pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-13T19:09:46Z

roborev: Combined Review (`0a0d630`)

Medium issue found: the embedding preprocessor may strip valid plain-text content when HTML stripping is enabled by default.

Medium

Location: internal/vector/embed/preprocess.go:47
Problem: The HTML stripper removes any <...> span, not just real HTML tags. Plain-text email content such as John <john@example.com>, placeholders, comparisons, or code snippets can be deleted from embeddings.
Fix: Use an HTML tokenizer or allowlist of real tag names, or only run tag stripping when the body appears to be HTML.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Replaces the original `<[^>]{0,500}>` with a stricter tag-name pattern `</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>` so the stripper no longer eats text that merely contains angle brackets: John <john@example.com> kept verbatim (@ rejects tag-name) See <https://example.com>. kept verbatim (: rejects tag-name) x < 3 and y > 4 kept verbatim (space-then-digit rejects) <Aug 6, 2026> kept verbatim (space rejects) Real HTML tags (<p>, <br/>, <a href="...">, </div>, <table style="...">) continue to match. The {0,400} attribute-body cap is moved inside an optional non-capturing group that only fires when a whitespace-then- attributes section actually follows the tag name, so the stripper treats `<p>` and `<a href="...">` symmetrically. Caught by roborev on PR kenn-io#322. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-15T11:45:23Z

roborev: Combined Review (`3823b23`)

Embedding preprocessing can mix old and new vector semantics under the same active generation.

Medium

cmd/msgvault/cmd/embed_vector.go:82 - Generation identity still keys only on embedding model/dimension fingerprint, while preprocessing now changes the exact text sent to the embedder. Existing active generations can be reused with old vectors, then new rows get embedded with sanitized text under the same generation.
- Fix: Include a preprocessing version/hash in the generation identity, or force/create a rebuild when effective preprocessing settings change.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

hansn74 · 2026-05-15T14:39:06Z

Acknowledging roborev's 3823b23 medium finding (preprocessing-vs-generation-fingerprint) — flagging that it's a real concern but deferring the fix to a maintainer call rather than handling it inline. Reasoning:

The finding is correct: EmbeddingsConfig.Fingerprint() is <model>:<dimension> only, so toggling a preprocess flag mid-life of an active generation produces mixed-preprocessing vectors in the same index. The places this matters most are messages that the sanitize layer materially rewrites (HTML newsletters, base64-leak bodies); for typical Latin prose the textual delta is tiny and cosine similarity tolerates it well, so retrieval quality is usually fine without a rebuild. The places it could surface are corner-case queries that depend on exactly the sanitized terms.

The proper fix is to add a stable hash of the effective preprocess settings to the fingerprint — any change to [vector.preprocess] then forces a new generation, with the existing pending→building→active lifecycle handling the transition cleanly. I held off because:

This changes the fingerprint contract introduced in Add semantic and hybrid search via local vector embeddings #277. You have an opinion about which axes should and shouldn't trigger a rebuild; adding preprocess axes is a real architectural choice, not a localized fix to this PR. I'd rather you pick the granularity (raw config hash vs versioned major bump vs explicit [vector.preprocess].version) than impose mine.
The current behavior is recoverable. Users with a mature index can --full-rebuild after enabling the new sanitize defaults. Documenting that as a release note (or surfacing a warning when the running config differs from the building-gen's recorded config) is a smaller change than changing the fingerprint identity wholesale.

Happy to push the fingerprint extension into this PR if you'd prefer strict consistency over the documented-rebuild path — let me know the shape you want (config-hash vs version-int vs other) and I'll add it. Otherwise this can ship and the fingerprint redesign can be a follow-up.

hansn74 mentioned this pull request May 13, 2026

vector: chunk long messages instead of truncating #323

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip#322

embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip#322
hansn74 wants to merge 2 commits into
kenn-io:mainfrom
hansn74:feat/embed-sanitize

hansn74 commented May 13, 2026

Uh oh!

roborev-ci Bot commented May 13, 2026

Uh oh!

roborev-ci Bot commented May 15, 2026

Uh oh!

hansn74 commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hansn74 commented May 13, 2026

Summary

Motivation

Usage

Config / migration

Pipeline order

Tests

Uh oh!

roborev-ci Bot commented May 13, 2026

roborev: Combined Review (0a0d630)

Medium

Uh oh!

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (3823b23)

Medium

Uh oh!

hansn74 commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

roborev: Combined Review (`0a0d630`)

roborev: Combined Review (`3823b23`)