Skip to content

fix(strip_html): rewrite as linear single-pass scan to avoid ReDoS#896

Merged
harttle merged 6 commits into
masterfrom
fix/strip-html-redos
May 11, 2026
Merged

fix(strip_html): rewrite as linear single-pass scan to avoid ReDoS#896
harttle merged 6 commits into
masterfrom
fix/strip-html-redos

Conversation

@harttle
Copy link
Copy Markdown
Owner

@harttle harttle commented May 10, 2026

Summary

The strip_html regex

/<script[\s\S]*?<\/script>|<style[\s\S]*?<\/style>|<[\s\S]*?>|<!--[\s\S]*?-->/g

backtracks O(n²) on inputs with many unclosed openers (e.g. '<script'.repeat(50000), ~350 KB), which blocks the Node.js event loop for ~10 s. memoryLimit only charges str.length and does not bound regex CPU.

Replace it with a linear single-pass indexOf scan. For each < we try each block kind in priority order — <script>…</script>, <style>…</style>, <!--…-->, and <…> as the catch-all — and skip past the matched closer. A kind whose closer is absent from the rest of the input is removed from the map and never tried again, so repeated unclosed openers don't re-scan the tail and total work stays O(n).

Why a regex-only fix can't be O(n) here

I tried regex variants first. They all stay O(n²) on this input class:

Input original [\s\S]*? Friedl unrolled-loop this PR
68 KB 255 ms 971 ms 0 ms
137 KB 904 ms (×3.5) 3937 ms (×4.0) 1 ms
273 KB 3437 ms (×3.8) 14433 ms (×3.7) 0 ms
547 KB 18796 ms (×5.5) (>5 s — skipped) 1 ms
1.1 MB (skipped) (skipped) 1 ms
2.1 MB 2 ms
4.4 MB 4 ms

Both regex variants grow ≈ ×4 per input doubling. The Friedl unrolled-loop pattern (<script\b[^<]*(?:<(?!\/script>)[^<]*)*<\/script>|…) eliminates exponential backtracking from a single match attempt, but the engine still retries the alternation at every < position; for inputs with N unmatched openers that is O(N) retries × O(N) work-per-retry. JS regex has no atomic groups, no possessive quantifiers, and no cross-attempt memoization, so there is no regex-only way to remember "no </script> exists past position k" between attempts. The new code stores that fact across iterations, which is what makes it linear.

Secondary fix: <!--…--> is a raw-text block

This PR also aligns comment handling with Shopify Liquid's STRIP_HTML_BLOCKS: <script>, <style>, and <!--…--> are all HTML5 raw-text blocks whose content is opaque until the matching closer. The previous code only special-cased <script> and <style>, so <!-- a > b --> got partially stripped to b --> (the generic <...> branch matched <!-- a >). Now the whole comment is removed, matching Shopify and the common expectation. A regression test covers this.

Tests

  • test/integration/liquid/dos.spec.ts: PoCs for '<script'.repeat(50000), '<style'.repeat(50000), and '<script>foo'.repeat(50000). Each test is bound by Jest's per-test timeout (it(..., fn, 1000)); an O(n²) regression would blow it.
  • test/integration/filters/html.spec.ts: regression for <!-- a > b -->afterafter.
  • A memoryLimit assertion confirms strip_html charges str.length.
  • All 1544 existing tests pass; lint clean; perf-diff ≈ ±0.4 % (noise).

Files

  • src/filters/html.ts — rewritten strip_html.
  • test/integration/liquid/dos.spec.ts — ReDoS + memoryLimit regressions.
  • test/integration/filters/html.spec.ts>-in-comment regression.

The previous strip_html regex
  /<script[\s\S]*?<\/script>|<style[\s\S]*?<\/style>|<[\s\S]*?>|<!--[\s\S]*?-->/g
contains lazy alternatives that backtrack O(n^2) on inputs with many
unclosed `<script` / `<style` openers. A 350KB payload of
`'<script'.repeat(50000)` blocked the Node.js event loop for ~10s, and
cost grew quadratically with input size. memoryLimit only charged
str.length, which does not bound regex CPU.

Replace the regex with an indexOf-based single-pass scan. For each `<`
we:
- if `<script` opener: find next `</script>` and skip the whole block;
  cache "no closer after pos k" so subsequent unclosed `<script`
  openers do not re-scan the tail.
- same for `<style` / `</style>`.
- otherwise treat as a generic `<...>` tag (matches the original
  behavior, where the `<[\s\S]*?>` alternative also caught comments).
- if no closing `>` exists, emit the tail as literal text and stop.

Total work is O(n). All existing strip_html test cases pass unchanged.

Add regression tests covering the PoCs (`<script` / `<style` repeats,
and `<script>foo` repeats with `>` but no `</script>`) plus a
memoryLimit assertion.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coveralls
Copy link
Copy Markdown

coveralls commented May 10, 2026

Coverage Report for CI Build 25680358067

Coverage increased (+0.002%) to 99.542%

Details

  • Coverage increased (+0.002%) from the base build.
  • Patch coverage: 14 of 14 lines across 1 file are fully covered (100%).
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 3008
Covered Lines: 3001
Line Coverage: 99.77%
Relevant Branches: 1140
Covered Branches: 1128
Branch Coverage: 98.95%
Branches in Coverage %: Yes
Coverage Strength: 22040.6 hits per line

💛 - Coveralls

harttle and others added 5 commits May 10, 2026 15:17
Same algorithm and complexity, fewer lines. Document why a regex-only
solution can't be O(n) in V8 (no atomic groups / possessive quantifiers
/ memoization, so unrolled-loop patterns are still O(n^2) on unclosed
openers — empirically confirmed: original 280KB ~4s, Friedl unrolled
~14s, atomic lookahead ~7s; tokenizer ~1ms).

Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the module-level STRIP_BLOCKS table; the rest of the file keeps
each filter self-contained (only escapeMap/unescapeMap are top-level
maps shared across filters). Two openers don't justify a table.

Co-authored-by: Cursor <cursoragent@cursor.com>
In HTML5, <script>, <style>, and <!-- --> are all raw-text blocks: their
content is opaque until the matching closer, so a `>` inside CSS, JS, or
a comment must not be treated as a tag end. The previous code only had
this special handling for <script> and <style>; comments containing `>`
fell through to the generic `<...>` branch and were partially stripped
(e.g. `<!-- a > b -->` left `b -->` in the output).

Match Shopify Liquid's STRIP_HTML_BLOCKS set (script + style + comment),
and consolidate the three near-identical branches into a small
opener/closer table inside the function.

Algorithm and complexity unchanged (O(n) via indexOf + cached closer
positions). Add a regression test for `>` inside a comment.

Co-authored-by: Cursor <cursoragent@cursor.com>
Once `indexOf(closer, X)` returns -1, all subsequent searches (with
monotonically increasing start) also return -1. So tracking absence is
enough; storing positions is unnecessary. Make `blocks` a Set and
delete a kind once its closer is known absent — no parallel `dead`
bookkeeping. Use Jest's per-test timeout for the ReDoS regressions
instead of manual Date.now() bookkeeping.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adding ['<', '>'] as the lowest-priority entry of `blocks` lets the
inner loop subsume the generic-tag fallback: the `end` sentinel and
its `< 0` / `<= 0` follow-up checks disappear, the "no terminator"
exit becomes a single `i === lt` test, and Set<[string, string]>
collapses to Map<string, string>.

Co-authored-by: Cursor <cursoragent@cursor.com>
@harttle harttle merged commit 3616a74 into master May 11, 2026
14 checks passed
github-actions Bot pushed a commit that referenced this pull request May 14, 2026
# [10.26.0](v10.25.7...v10.26.0) (2026-05-14)

### Bug Fixes

* **date:** cap strftime widths and account padding in memoryLimit ([#895](#895)) ([3129d46](3129d46))
* enforce renderLimit for empty renderTemplates calls ([#894](#894)) ([5b9c346](5b9c346))
* propagate ownPropertyOnly into Context.spawn() for {% render %} ([#893](#893)) ([dbbf628](dbbf628))
* **security:** block Object.prototype filter/tag lookups (RCE) ([#897](#897)) ([457fae0](457fae0))
* strip html newline tags ([#892](#892)) ([26ea285](26ea285))
* **strip_html:** rewrite as linear single-pass scan to avoid ReDoS ([#896](#896)) ([3616a74](3616a74))

### Features

* add sha256 and hmac_sha256 filters for cryptographic operations ([#889](#889)) ([1c816d4](1c816d4))
@github-actions
Copy link
Copy Markdown

🎉 This PR is included in version 10.26.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants