fix(strip_html): rewrite as linear single-pass scan to avoid ReDoS#896
Merged
Conversation
The previous strip_html regex /<script[\s\S]*?<\/script>|<style[\s\S]*?<\/style>|<[\s\S]*?>|<!--[\s\S]*?-->/g contains lazy alternatives that backtrack O(n^2) on inputs with many unclosed `<script` / `<style` openers. A 350KB payload of `'<script'.repeat(50000)` blocked the Node.js event loop for ~10s, and cost grew quadratically with input size. memoryLimit only charged str.length, which does not bound regex CPU. Replace the regex with an indexOf-based single-pass scan. For each `<` we: - if `<script` opener: find next `</script>` and skip the whole block; cache "no closer after pos k" so subsequent unclosed `<script` openers do not re-scan the tail. - same for `<style` / `</style>`. - otherwise treat as a generic `<...>` tag (matches the original behavior, where the `<[\s\S]*?>` alternative also caught comments). - if no closing `>` exists, emit the tail as literal text and stop. Total work is O(n). All existing strip_html test cases pass unchanged. Add regression tests covering the PoCs (`<script` / `<style` repeats, and `<script>foo` repeats with `>` but no `</script>`) plus a memoryLimit assertion. Co-authored-by: Cursor <cursoragent@cursor.com>
Coverage Report for CI Build 25680358067Coverage increased (+0.002%) to 99.542%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsNo coverage regressions found. Coverage Stats💛 - Coveralls |
Same algorithm and complexity, fewer lines. Document why a regex-only solution can't be O(n) in V8 (no atomic groups / possessive quantifiers / memoization, so unrolled-loop patterns are still O(n^2) on unclosed openers — empirically confirmed: original 280KB ~4s, Friedl unrolled ~14s, atomic lookahead ~7s; tokenizer ~1ms). Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the module-level STRIP_BLOCKS table; the rest of the file keeps each filter self-contained (only escapeMap/unescapeMap are top-level maps shared across filters). Two openers don't justify a table. Co-authored-by: Cursor <cursoragent@cursor.com>
In HTML5, <script>, <style>, and <!-- --> are all raw-text blocks: their content is opaque until the matching closer, so a `>` inside CSS, JS, or a comment must not be treated as a tag end. The previous code only had this special handling for <script> and <style>; comments containing `>` fell through to the generic `<...>` branch and were partially stripped (e.g. `<!-- a > b -->` left `b -->` in the output). Match Shopify Liquid's STRIP_HTML_BLOCKS set (script + style + comment), and consolidate the three near-identical branches into a small opener/closer table inside the function. Algorithm and complexity unchanged (O(n) via indexOf + cached closer positions). Add a regression test for `>` inside a comment. Co-authored-by: Cursor <cursoragent@cursor.com>
Once `indexOf(closer, X)` returns -1, all subsequent searches (with monotonically increasing start) also return -1. So tracking absence is enough; storing positions is unnecessary. Make `blocks` a Set and delete a kind once its closer is known absent — no parallel `dead` bookkeeping. Use Jest's per-test timeout for the ReDoS regressions instead of manual Date.now() bookkeeping. Co-authored-by: Cursor <cursoragent@cursor.com>
Adding ['<', '>'] as the lowest-priority entry of `blocks` lets the inner loop subsume the generic-tag fallback: the `end` sentinel and its `< 0` / `<= 0` follow-up checks disappear, the "no terminator" exit becomes a single `i === lt` test, and Set<[string, string]> collapses to Map<string, string>. Co-authored-by: Cursor <cursoragent@cursor.com>
github-actions Bot
pushed a commit
that referenced
this pull request
May 14, 2026
# [10.26.0](v10.25.7...v10.26.0) (2026-05-14) ### Bug Fixes * **date:** cap strftime widths and account padding in memoryLimit ([#895](#895)) ([3129d46](3129d46)) * enforce renderLimit for empty renderTemplates calls ([#894](#894)) ([5b9c346](5b9c346)) * propagate ownPropertyOnly into Context.spawn() for {% render %} ([#893](#893)) ([dbbf628](dbbf628)) * **security:** block Object.prototype filter/tag lookups (RCE) ([#897](#897)) ([457fae0](457fae0)) * strip html newline tags ([#892](#892)) ([26ea285](26ea285)) * **strip_html:** rewrite as linear single-pass scan to avoid ReDoS ([#896](#896)) ([3616a74](3616a74)) ### Features * add sha256 and hmac_sha256 filters for cryptographic operations ([#889](#889)) ([1c816d4](1c816d4))
|
🎉 This PR is included in version 10.26.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The
strip_htmlregex/<script[\s\S]*?<\/script>|<style[\s\S]*?<\/style>|<[\s\S]*?>|<!--[\s\S]*?-->/gbacktracks O(n²) on inputs with many unclosed openers (e.g.
'<script'.repeat(50000), ~350 KB), which blocks the Node.js event loop for ~10 s.memoryLimitonly chargesstr.lengthand does not bound regex CPU.Replace it with a linear single-pass
indexOfscan. For each<we try each block kind in priority order —<script>…</script>,<style>…</style>,<!--…-->, and<…>as the catch-all — and skip past the matched closer. A kind whose closer is absent from the rest of the input is removed from the map and never tried again, so repeated unclosed openers don't re-scan the tail and total work stays O(n).Why a regex-only fix can't be O(n) here
I tried regex variants first. They all stay O(n²) on this input class:
[\s\S]*?Both regex variants grow ≈ ×4 per input doubling. The Friedl unrolled-loop pattern (
<script\b[^<]*(?:<(?!\/script>)[^<]*)*<\/script>|…) eliminates exponential backtracking from a single match attempt, but the engine still retries the alternation at every<position; for inputs with N unmatched openers that is O(N) retries × O(N) work-per-retry. JS regex has no atomic groups, no possessive quantifiers, and no cross-attempt memoization, so there is no regex-only way to remember "no</script>exists past position k" between attempts. The new code stores that fact across iterations, which is what makes it linear.Secondary fix:
<!--…-->is a raw-text blockThis PR also aligns comment handling with Shopify Liquid's
STRIP_HTML_BLOCKS:<script>,<style>, and<!--…-->are all HTML5 raw-text blocks whose content is opaque until the matching closer. The previous code only special-cased<script>and<style>, so<!-- a > b -->got partially stripped tob -->(the generic<...>branch matched<!-- a >). Now the whole comment is removed, matching Shopify and the common expectation. A regression test covers this.Tests
test/integration/liquid/dos.spec.ts: PoCs for'<script'.repeat(50000),'<style'.repeat(50000), and'<script>foo'.repeat(50000). Each test is bound by Jest's per-test timeout (it(..., fn, 1000)); an O(n²) regression would blow it.test/integration/filters/html.spec.ts: regression for<!-- a > b -->after→after.memoryLimitassertion confirmsstrip_htmlchargesstr.length.Files
src/filters/html.ts— rewrittenstrip_html.test/integration/liquid/dos.spec.ts— ReDoS +memoryLimitregressions.test/integration/filters/html.spec.ts—>-in-comment regression.