embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip #322
hansn74 wants to merge 2 commits
Adds four opt-out (default-on) transforms to Preprocess() so noise-laden
emails (inline base64 images, HTML residue, tracking-tagged links,
HTML→text whitespace bloat) tokenize back down inside the embedder's
context window:
  strip_html           strip <style>/<script> blocks and generic <tags>,
                       decode HTML entities
  strip_base64         strip data:...;base64,... URIs and bare base64
                       runs >=200 chars (excluding '/' so URL paths
                       survive)
  strip_url_tracking   drop utm_*, fbclid, gclid, etc. query params
  collapse_whitespace  normalize CRLF -> LF, trim per-line trailing
                       whitespace, collapse runs of >=3 newlines and
                       runs of >=2 horizontal spaces
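To make the strip_url_tracking transform concrete, here is a minimal sketch using the standard net/url package. It is not the PR's implementation: the function name stripTracking and the trackingParams set are assumptions for illustration, and the real list behind "etc." is not shown in this description.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// trackingParams lists exact-match query keys to drop. Illustrative only;
// the PR's full list is not shown in the description.
var trackingParams = map[string]bool{"fbclid": true, "gclid": true}

// stripTracking removes utm_* and known tracking params from a single URL.
// Non-http(s) or unparseable input is returned unchanged.
func stripTracking(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return raw
	}
	q := u.Query()
	changed := false
	for key := range q {
		lk := strings.ToLower(key)
		if strings.HasPrefix(lk, "utm_") || trackingParams[lk] {
			q.Del(key)
			changed = true
		}
	}
	if !changed {
		return raw // leave untouched URLs byte-identical
	}
	u.RawQuery = q.Encode() // note: Encode sorts the remaining keys
	return u.String()
}

func main() {
	fmt.Println(stripTracking("https://example.com/a?utm_source=x&id=7&fbclid=abc"))
}
```

One design consequence of this approach: url.Values.Encode re-serializes the surviving parameters in sorted key order, so a rewritten URL may not preserve the original parameter order. A regex-based stripper would avoid that at the cost of more fragile matching.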
Motivation: while building embeddings for a 2.2M-message corpus on
nomic-embed-text (8192-token window), ~1.7% of messages tripped the
endpoint's context-length check even after the 6000-char rune cap. The
offenders were almost always polluted with one of the four patterns
above: a 30KB inline image, leaked <table style="..."> markup, or
campaign-tagged URLs repeating across newsletters. Stripping these
shrinks dense input to clean prose without semantic loss, eliminates
the downshift-to-batch-size=1 sawtooth that capped real throughput at
~10 msg/s, and improves vector quality by not averaging the embedding
over CSS gibberish.
Config follows the existing PreprocessConfig pattern: *bool in the TOML
tier (nil = "default true", explicit `false` preserved verbatim), plain
bool in the runtime tier, helpers like StripHTMLEnabled() bridge the
two. Both call sites (build-embeddings + the live worker spawned by
`serve`) are wired symmetrically.
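The tristate bridge described above can be sketched as follows. Only the helper name StripHTMLEnabled comes from this PR; the struct name PreprocessTOML, its field, and the struct tag are assumptions made for the example.

```go
package main

import "fmt"

// PreprocessTOML mirrors the TOML tier: a nil pointer means the key was
// not set in the file. The type and field names are hypothetical.
type PreprocessTOML struct {
	StripHTML *bool `toml:"strip_html"`
}

// StripHTMLEnabled bridges to the runtime tier: nil defaults to true,
// while an explicit value (including false) is returned as written.
func (c PreprocessTOML) StripHTMLEnabled() bool {
	if c.StripHTML == nil {
		return true
	}
	return *c.StripHTML
}

func main() {
	off := false
	unset := PreprocessTOML{}
	disabled := PreprocessTOML{StripHTML: &off}
	fmt.Println(unset.StripHTMLEnabled())    // true
	fmt.Println(disabled.StripHTMLEnabled()) // false
}
```

A plain bool could not distinguish "absent" from "explicitly false", which is why the pointer is needed in the file-facing tier only.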
Pipeline order matters and is deliberate:
1. CRLF normalization (line-oriented regexes assume LF)
2. base64 / data: URI strip (runs before HTML so an oversized
<img src="data:..."> -- longer than reHTMLTag's 500-char ceiling
-- has its payload removed first, leaving a small enough tag for
the subsequent HTML pass to sweep)
3. HTML strip + entity decode
4. URL tracking-param strip
5. existing quote/signature strip
6. whitespace collapse
7. TrimSpace + Subject prefix + rune-bounded truncation
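The reason step 2 precedes step 3 can be shown with a small sketch. The tag pattern is the 500-char-capped one quoted in this PR; the data-URI pattern and the helper names (stripBase64, stripHTML, oversizedImg) are assumptions for illustration and may differ from the real code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// reDataURI is an assumed pattern for data:...;base64,... URIs.
	reDataURI = regexp.MustCompile(`data:[a-zA-Z0-9.+/-]+;base64,[A-Za-z0-9+/=]+`)
	// reHTMLTag is the 500-char-capped tag pattern from the first commit.
	reHTMLTag = regexp.MustCompile(`<[^>]{0,500}>`)
)

func stripBase64(s string) string { return reDataURI.ReplaceAllString(s, "") }
func stripHTML(s string) string   { return reHTMLTag.ReplaceAllString(s, "") }

// oversizedImg builds an <img> tag whose inline payload is n chars long.
func oversizedImg(n int) string {
	return `<img src="data:image/png;base64,` + strings.Repeat("A", n) + `">`
}

func main() {
	body := "hello " + oversizedImg(600) + " world"

	// HTML pass alone: the tag is longer than the cap, so nothing matches.
	fmt.Println(stripHTML(body) == body)

	// base64 pass first: the payload is gone, so the small tag is swept.
	fmt.Printf("%q\n", stripHTML(stripBase64(body)))
}
```

The leftover double space in the ordered result is the kind of residue the later whitespace-collapse step is there to clean up.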
Tests cover each transform in isolation, three regression cases
(URL-paths-look-base64, oversized-img-tag, CRLF normalization), and a
full-pipeline end-to-end. Config-tier tests verify the new toggles
honour the same nil/true/false tristate semantics as the existing pair.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the original `<[^>]{0,500}>` with a stricter tag-name
pattern `</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>` so the
stripper no longer eats text that merely contains angle brackets:
John <john@example.com> kept verbatim (@ rejects tag-name)
See <https://example.com>. kept verbatim (: rejects tag-name)
x < 3 and y > 4 kept verbatim (space-then-digit rejects)
<Aug 6, 2026> kept verbatim (space rejects)
Real HTML tags (<p>, <br/>, <a href="...">, </div>, <table style="...">)
continue to match. The {0,400} attribute-body cap is moved inside an
optional non-capturing group that only fires when a whitespace-then-
attributes section actually follows the tag name, so the stripper
treats `<p>` and `<a href="...">` symmetrically.
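The stricter pattern can be exercised directly with Go's regexp package. The sketch below compiles the pattern exactly as quoted above; the helper name stripTags is an assumption, and the real stripper may apply additional logic around the regex.

```go
package main

import (
	"fmt"
	"regexp"
)

// reHTMLTag is the stricter tag-name pattern quoted in the commit message.
var reHTMLTag = regexp.MustCompile(`</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>`)

// stripTags removes every match of reHTMLTag from s.
func stripTags(s string) string { return reHTMLTag.ReplaceAllString(s, "") }

func main() {
	for _, s := range []string{
		"John <john@example.com>",
		"See <https://example.com>.",
		"x < 3 and y > 4",
		`<p>Hi</p><br/><a href="https://example.com">link</a>`,
	} {
		fmt.Printf("%q -> %q\n", s, stripTags(s))
	}
}
```

The first three inputs pass through unchanged and the last reduces to "Hilink". One caveat worth checking against the real code: evaluated as quoted, the pattern does appear to match `<Aug 6, 2026>`, because `Aug` is a syntactically valid tag name and the optional attribute group accepts the text after the space. If that case is kept verbatim in practice, something beyond this regex is presumably responsible.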
Caught by roborev on PR kenn-io#322.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Acknowledging roborev's finding. The finding is correct: the proper fix is to add a stable hash of the effective preprocess settings to the fingerprint; any change to …
Happy to push the fingerprint extension into this PR if you'd prefer strict consistency over the documented-rebuild path. Let me know the shape you want (config-hash vs version-int vs other) and I'll add it. Otherwise this can ship and the fingerprint redesign can be a follow-up.
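For discussion, the config-hash option might look roughly like the sketch below. Everything here is hypothetical: the Preprocess struct, the hash and fingerprint functions, the digest length, and the example dimension are assumptions, and only the four new toggles are covered. The one detail taken from this PR is the existing `<model>:<dimension>` shape.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Preprocess holds the effective (runtime-tier) toggles. Hypothetical shape.
type Preprocess struct {
	StripHTML, StripBase64, StripURLTracking, CollapseWhitespace bool
}

// hash returns a short digest of the effective settings. The field order
// in the canonical string is fixed so the digest is reproducible.
func (p Preprocess) hash() string {
	canon := fmt.Sprintf("html=%t;b64=%t;url=%t;ws=%t",
		p.StripHTML, p.StripBase64, p.StripURLTracking, p.CollapseWhitespace)
	sum := sha256.Sum256([]byte(canon))
	return hex.EncodeToString(sum[:4]) // 8 hex chars; length is arbitrary
}

// fingerprint extends "<model>:<dimension>" with the preprocess digest.
func fingerprint(model string, dim int, p Preprocess) string {
	return fmt.Sprintf("%s:%d:%s", model, dim, p.hash())
}

func main() {
	all := Preprocess{true, true, true, true}
	noHTML := Preprocess{false, true, true, true}
	// 768 is used purely as an example dimension.
	fmt.Println(fingerprint("nomic-embed-text", 768, all))
	fmt.Println(fingerprint("nomic-embed-text", 768, noHTML))
}
```

Hashing the effective values rather than the raw TOML means an unset key and an explicit `true` produce the same fingerprint, so adding a key that matches the default would not force a rebuild. A version-int would be simpler but needs a manual bump on every behavior change.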
Summary
Adds four opt-out (default-on) transforms to `Preprocess()` so noise-laden emails (inline base64 images, HTML residue, tracking-tagged URLs, HTML→text whitespace bloat) tokenize back down inside the embedder's context window:
- `strip_html`: drop `<style>…</style>` / `<script>…</script>` blocks and generic `<…>` tags, decode HTML entities
- `strip_base64`: drop `data:…;base64,…` URIs and bare base64 runs ≥200 chars (excluding `/` so URL paths survive)
- `strip_url_tracking`: drop `utm_*`, `fbclid`, `gclid`, etc. query params from http(s) URLs
- `collapse_whitespace`: normalize CRLF→LF, trim per-line trailing whitespace, collapse runs of ≥3 newlines and ≥2 horizontal spaces
Motivation
When running `build-embeddings` against a 2.2M-message corpus on `nomic-embed-text` (8,192-token window) with `max_input_chars = 6000`, ~1.7% of messages still trip Ollama's `HTTP 400: input length exceeds context length`. The offenders are almost always polluted with one of the four patterns above: a 30 KB inline image, leaked `<table style="…">` markup, or campaign-tagged URLs that repeat across newsletters. Each offender forces the worker to downshift to `batch_size = 1` to drain (~25 s), capping real throughput at ~10 msg/s instead of ~42.

Stripping the pollution shrinks dense input to clean prose without semantic loss, eliminates the sawtooth, and improves vector quality by not averaging embeddings over CSS / base64 gibberish.
Usage
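Defaults are on. Opt out per-toggle in `config.toml` by setting the corresponding key (`strip_html`, `strip_base64`, `strip_url_tracking`, `collapse_whitespace`) to `false`.
Config / migration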
- `*bool` tristate pattern from `strip_quotes` / `strip_signatures`: nil ⇒ default true, explicit `false` preserved verbatim.
- `EmbeddingsConfig.Fingerprint()` is unchanged (`<model>:<dimension>`), so existing partly-built indexes keep their generation. Users with mature indexes who want full benefit should run with `--full-rebuild`.
Pipeline order
Deliberate, documented inline. Notably, base64/data-URI stripping runs before HTML stripping so an oversized `<img src="data:image/png;base64,…">` (longer than `reHTMLTag`'s 500-char defensive ceiling) has its payload removed first, leaving a small enough tag for the subsequent HTML pass to sweep. Regression-tested.
Tests
- … (`/` excluded from blob class)
- … `<img src="data:…">` tags must sweep cleanly (pipeline order)
Co-Authored-By: Claude Opus 4.7 (1M context)