
[DOS vulnerability fix] Limit how many DOM nodes get iterated upon #172

Merged
3 commits merged on Jun 1, 2020

Conversation

valeriansaliou
Copy link
Contributor

We're using node-html-to-text at scale in production for Crisp (https://crisp.chat/); thanks for the work!

Our inbound email system was taken offline a few days ago because node-html-to-text is vulnerable to attacks where the parsed HTML is either extremely deep, or has a very large number of DOM elements, or both.

A Node.js --prof run showed that the walk() method was being called a huge number of times. This can be considered a DoS vulnerability; we fixed it in our fork with hard-coded limits (set high).

After deploying the patch from this PR in production, we have seen no further load or downtime issues.

Here's how we fixed it:

  1. Limit the number of base DOM elements that can be iterated upon (MAX_BASE set to 10);
  2. Limit the width of a DOM children iteration (MAX_DOM set to 200);
  3. Limit the recursive depth of the walk() method (MAX_ELEMENTS set to 2000);
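
The three limits above can be sketched as follows. This is a hedged illustration of the strategy described in the PR, not the library's actual walk() implementation: walkLimited is a hypothetical helper, and the constant names mirror the PR text (MAX_DOM capping iteration width, MAX_ELEMENTS capping recursion depth).

```javascript
// Hypothetical sketch of a width- and depth-capped DOM walk.
// Constant names follow the PR description; walkLimited is not
// node-html-to-text's real walk() function.
const MAX_DOM = 200;       // max children visited per node (width cap)
const MAX_ELEMENTS = 2000; // max recursion depth (depth cap)

function walkLimited(node, depth = 0, out = []) {
  if (depth > MAX_ELEMENTS) return out;            // stop when too deep
  out.push(node.name);                             // "process" this node
  const children = (node.children || []).slice(0, MAX_DOM); // cap width
  for (const child of children) {
    walkLimited(child, depth + 1, out);
  }
  return out;
}
```

With caps like these, a maliciously wide or deep document can no longer force an unbounded number of walk() calls; traversal work is bounded regardless of input shape.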

I'm looking forward to getting this PR merged into the npm package 😉

@mlegenhausen
Copy link
Member

Why not add a preprocessing step that reduces the number of HTML DOM elements given to html-to-text?

@valeriansaliou
Copy link
Contributor Author

Hi! That would mean double pre-processing: a full DOM parse would be required just to count and limit DOM elements, and node-html-to-text already performs that parse. For performance reasons it's better to avoid parsing twice and apply the security limits inside the library right away.

I'm happy to make this optional and user-configurable if needed.

@tomholub
Copy link

This should definitely be done in node-html-to-text itself, since it already does the parsing and has access to information about the nodes. As for ourselves, we'll be using @valeriansaliou's fork until this or something similar gets merged.

@KillyMXI
Copy link
Member

KillyMXI commented May 7, 2020

I'm now looking at this PR for merging, but find it rather crude.

My concerns with these changes:

  • all values definitely need to be configurable so that the existing tests can still pass;
  • not all of the names (MAX_BASE, MAX_DOM, MAX_ELEMENTS) make their purpose clear;
  • what should happen to the excess data? How can I tell whether I got full or trimmed output? How can I force full output, and should that be the default?

What I see as a possible solution:

var DEFAULT_OPTIONS = {
    ...,
    limits: {
        maxBaseElements: undefined, // undefined | number
        maxChildElements: undefined, // everything opt-in?
        maxTotalElements: undefined, // advice for good values can be put into docs
        ellipsis: '...' // string | undefined (if set - insert this as a block whenever content is skipped)
    },
};
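
To illustrate the semantics sketched above, here is a hedged example of how maxChildElements and ellipsis could interact during child iteration. The function name renderChildren and its shape are assumptions for illustration, not html-to-text code; the opt-in behavior (undefined means "no limit") follows the proposal.

```javascript
// Hypothetical helper: iterate children up to maxChildElements, then
// emit the ellipsis once to mark that content was skipped.
// undefined limit = no limit (opt-in), per the options proposal above.
function renderChildren(children, { maxChildElements, ellipsis } = {}) {
  const limit = maxChildElements === undefined ? children.length : maxChildElements;
  const parts = children.slice(0, limit).map(c => c.text);
  if (limit < children.length && ellipsis !== undefined) {
    parts.push(ellipsis); // signal trimmed output to the caller
  }
  return parts;
}
```

This answers the "how can I know if I got trimmed output?" question: the ellipsis marker appears in the output exactly when content was dropped.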

I also wonder whether there is a meaningful way to move the logic to an extension point and avoid adding a lot of rigid options to html-to-text itself. But that might just overcomplicate things instead.

@valeriansaliou, @baptistejamin are you still onboard with this?
I would love to hear your thoughts.
If you still want to complete this PR, how soon will you be able to do that?

In addition to not breaking the existing tests, this PR also requires some tests of its own.

Note: while this is important to prevent crashes, it is still a crude solution. It would be better to know the common causes and see whether they can be handled properly, before resorting to this.

@KillyMXI
Copy link
Member

KillyMXI commented Jun 1, 2020

Oops. It wasn't my plan to merge this in the current state, but oh well. It will be easier for me to work from here.

The current plan is to rework the settings as described above, and to make some other improvements to better handle cases where deep nesting happens when it shouldn't.

KillyMXI added a commit that referenced this pull request Jun 3, 2020
@KillyMXI
Copy link
Member

KillyMXI commented Jun 3, 2020

And the rework is done.

The MAX_BASE option was pointless with the current implementation of the base element search.

The MAX_ELEMENTS option, as proposed here, didn't lead to any well-behaved implementation. It is replaced with maxDepth: a depth limit has a cleaner implementation and well-defined behavior for the ellipsis.

maxDepth and maxChildNodes in combination put an upper bound on how many nodes will be processed in total by html-to-text, so it's a fair trade.
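
To see why the two limits combine into a total bound: with at most B children per node (maxChildNodes) and depth at most D (maxDepth), the worst case is a full B-ary tree, so the node count is bounded by 1 + B + B² + … + B^D. A small sketch of that bound (an illustration, not library code):

```javascript
// Upper bound on nodes visited with maxChildNodes = B and maxDepth = D:
// the size of a full B-ary tree of depth D (sum of B^d for d = 0..D).
// Hypothetical helper for illustration only.
function maxVisitedNodes(B, D) {
  let total = 0;
  let levelSize = 1; // number of nodes at the current depth
  for (let d = 0; d <= D; d++) {
    total += levelSize;
    levelSize *= B;
  }
  return total;
}
```

The bound grows quickly with D, which is why the docs would need to advise sensible values, but it is finite for any finite settings, which is the point.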

BUT! Neither the original proposal nor my rework eliminates the extra work completely, because the whole input is still processed by htmlparser2. There is still a notable performance benefit on very long inputs, though. (By my estimate, a file about 7 times larger can be processed in roughly the same time if we throw away most of it.)

Known issues: I want the options to be grouped, but that raises the question of how to merge options properly. Until I address that, the ellipsis option has to be defined explicitly.

@valeriansaliou
Copy link
Contributor Author

Hello @KillyMXI ,

Sorry for the late answer; we're still onboard with this and actively using our fork. We will definitely move back to the npm html-to-text once a release ships with this patch.

For DoS safety's sake, I'd enforce default values that are far too high to affect anyone using the library (instead of undefined values), and make it clear in the documentation that the limits can be disabled by setting the options to undefined, noting that doing so is dangerous, as it has proved to lead to DoS vulnerabilities.

@KillyMXI
Copy link
Member

KillyMXI commented Jun 7, 2020

I can argue that this doesn't solve the vulnerability; it just makes exploitation a bit harder (see my previous comment). I don't want to ship an incomplete solution and create a false sense of security. Different hardware can also choke on vastly different input sizes, so I can't make assumptions about where "way too high" lies.

Solid solutions would be:

  • limit the input string length before feeding it to html-to-text (you can do this on your side);
  • make sure htmlparser2 stops at a certain point; I haven't explored this route yet.
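
The first suggestion is simple to apply on the caller's side. A minimal sketch, assuming a cap of roughly 1 MiB of text; MAX_INPUT_LENGTH and truncateInput are hypothetical names, not part of html-to-text's API:

```javascript
// Cap the raw HTML string length before handing it to the converter.
// The cap value is an assumption for illustration; tune it to your workload.
const MAX_INPUT_LENGTH = 1 << 20; // ~1 MiB of characters

function truncateInput(html) {
  return html.length > MAX_INPUT_LENGTH
    ? html.slice(0, MAX_INPUT_LENGTH)
    : html;
}
```

Truncating mid-tag can produce slightly malformed HTML at the cut point, but htmlparser2 is tolerant of that, and the cap guarantees the parser's work is bounded by the input size.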

@KillyMXI
Copy link
Member

Version 6 with all the changes is now released.

I decided to include the input string length limit and set it to a fairly large value.
