
[DOS vulnerability fix] Limit how many DOM nodes get iterated upon #172

Merged
3 commits merged on Jun 1, 2020

Conversation

valeriansaliou
Copy link
Contributor

We're using node-html-to-text at scale in production for Crisp (https://crisp.chat/); thanks for the work!

Our inbound email system was taken offline a few days ago because node-html-to-text is vulnerable to attacks where the parsed HTML is either extremely deep, or has a very large number of DOM elements, or both.

A Node.js --prof run showed that the walk() method was being called a huge number of times. This can be considered a DoS vulnerability; we fixed it in our fork with hard-coded limits (set high).

After deploying the patch from this PR in production, we have seen no further load or downtime issues.

Here's how we fixed it:

  1. Limit the number of base DOM elements that can be iterated upon (MAX_BASE set to 10);
  2. Limit the width of a DOM children iteration (MAX_DOM set to 200);
  3. Limit the recursive depth of the walk() method (MAX_ELEMENTS set to 2000);
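
The three limits above can be sketched as follows. This is a hedged illustration of the strategy described in the PR, not the library's actual walk() implementation: walkLimited is a hypothetical helper, and the constant names mirror the PR text (MAX_DOM capping iteration width, MAX_ELEMENTS capping recursion depth).

```javascript
// Hypothetical sketch of a width- and depth-capped DOM walk.
// Constant names follow the PR description; walkLimited is not
// node-html-to-text's real walk() function.
const MAX_DOM = 200;       // max children visited per node (width cap)
const MAX_ELEMENTS = 2000; // max recursion depth (depth cap)

function walkLimited(node, depth = 0, out = []) {
  if (depth > MAX_ELEMENTS) return out;            // stop when too deep
  out.push(node.name);                             // "process" this node
  const children = (node.children || []).slice(0, MAX_DOM); // cap width
  for (const child of children) {
    walkLimited(child, depth + 1, out);
  }
  return out;
}
```

With caps like these, a maliciously wide or deep document can no longer force an unbounded number of walk() calls; traversal work is bounded regardless of input shape.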

I'm looking forward to getting this PR merged into the npm package 😉

@mlegenhausen
Copy link
Member

Why not add a preprocessing step that reduces the number of HTML DOM elements given to html-to-text?

@valeriansaliou
Copy link
Contributor Author

Hi! That would mean double pre-processing: a full DOM parse would be required just to count and limit DOM elements, and node-html-to-text already performs that parse. For performance reasons it's better to avoid parsing twice and apply the security limits inside the library right away.

I'm happy to make this optional and user-configurable if needed.

@tomholub
Copy link

This should definitely be done in node-html-to-text itself, since it already does the parsing and has access to information about the nodes. As for ourselves, we'll be using @valeriansaliou's fork until this or something similar gets merged.

@KillyMXI
Copy link
Member

KillyMXI commented May 7, 2020

I'm now looking at this PR for merging, but find it rather crude.

My concerns with these changes:

  • all values definitely need to be configurable so that the existing tests can still pass;
  • not all of the names (MAX_BASE, MAX_DOM, MAX_ELEMENTS) make their purpose clear;
  • what should happen to the excess data? How can I tell whether I got full or trimmed output? How can I force full output, and should that be the default?

What I see as a possible solution:

var DEFAULT_OPTIONS = {
    ...,
    limits: {
        maxBaseElements: undefined, // undefined | number
        maxChildElements: undefined, // everything opt-in?
        maxTotalElements: undefined, // advice for good values can be put into docs
        ellipsis: '...' // string | undefined (if set - insert this as a block whenever content is skipped)
    },
};
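
To illustrate the semantics sketched above, here is a hedged example of how maxChildElements and ellipsis could interact during child iteration. The function name renderChildren and its shape are assumptions for illustration, not html-to-text code; the opt-in behavior (undefined means "no limit") follows the proposal.

```javascript
// Hypothetical helper: iterate children up to maxChildElements, then
// emit the ellipsis once to mark that content was skipped.
// undefined limit = no limit (opt-in), per the options proposal above.
function renderChildren(children, { maxChildElements, ellipsis } = {}) {
  const limit = maxChildElements === undefined ? children.length : maxChildElements;
  const parts = children.slice(0, limit).map(c => c.text);
  if (limit < children.length && ellipsis !== undefined) {
    parts.push(ellipsis); // signal trimmed output to the caller
  }
  return parts;
}
```

This answers the "how can I know if I got trimmed output?" question: the ellipsis marker appears in the output exactly when content was dropped.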

I also wonder whether there is a meaningful way to move the logic to an extension point and avoid adding a lot of rigid options to html-to-text itself. But that might just overcomplicate things instead.

@valeriansaliou, @baptistejamin are you still onboard with this?
I would love to hear your thoughts.
If you still want to complete this PR, how soon will you be able to do that?

In addition to not breaking the existing tests, this PR also requires some tests of its own.

Note: while this is important to prevent crashes, it is still a crude solution. It would be better to know the common causes and see whether they can be handled properly, before resorting to this.

@KillyMXI
Copy link
Member

KillyMXI commented Jun 1, 2020

Oops. It wasn't my plan to merge this in the current state, but oh well. It will be easier for me to work from here.

The current plan is to rework the settings as described above, and to make some other improvements to better handle cases where deep nesting happens when it shouldn't.

KillyMXI added a commit that referenced this pull request Jun 3, 2020
@KillyMXI
Copy link
Member

KillyMXI commented Jun 3, 2020

And the rework is done.

The MAX_BASE option was pointless with the current implementation of the base element search.

The MAX_ELEMENTS option, as proposed here, didn't lead to any well-behaved implementation. It is replaced with maxDepth: a depth limit has a cleaner implementation and well-defined behavior for the ellipsis.

maxDepth and maxChildNodes in combination put an upper bound on how many nodes will be processed in total by html-to-text, so it's a fair trade.
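
To see why the two limits combine into a total bound: with at most B children per node (maxChildNodes) and depth at most D (maxDepth), the worst case is a full B-ary tree, so the node count is bounded by 1 + B + B² + … + B^D. A small sketch of that bound (an illustration, not library code):

```javascript
// Upper bound on nodes visited with maxChildNodes = B and maxDepth = D:
// the size of a full B-ary tree of depth D (sum of B^d for d = 0..D).
// Hypothetical helper for illustration only.
function maxVisitedNodes(B, D) {
  let total = 0;
  let levelSize = 1; // number of nodes at the current depth
  for (let d = 0; d <= D; d++) {
    total += levelSize;
    levelSize *= B;
  }
  return total;
}
```

The bound grows quickly with D, which is why the docs would need to advise sensible values, but it is finite for any finite settings, which is the point.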

BUT! Neither the original proposal nor my rework eliminates the extra work completely, because the whole input is still processed by htmlparser2. There is still a notable performance benefit on very long inputs, though. (By my estimate, a file about 7 times larger can be processed in roughly the same time if we throw away most of it.)

Known issues: I want the options to be grouped, but that raises the question of how to merge options properly. Until I address that, the ellipsis option has to be defined explicitly.

@valeriansaliou
Copy link
Contributor Author

Hello @KillyMXI ,

Sorry for the late answer; we're still onboard with this and actively using our fork. We will definitely move back to the npm html-to-text once a release ships with this patch.

For DoS safety's sake, I'd enforce default values that are far too high to affect anyone using the library (instead of undefined values), and make it clear in the documentation that the limits can be disabled by setting the options to undefined, noting that doing so is dangerous, as it has proved to lead to DoS vulnerabilities.

@KillyMXI
Copy link
Member

KillyMXI commented Jun 7, 2020

I can argue that this doesn't solve the vulnerability; it just makes exploitation a bit harder (see my previous comment). I don't want to ship an incomplete solution and create a false sense of security. Different hardware can also choke on vastly different input sizes, so I can't make assumptions about where "way too high" lies.

Solid solutions would be:

  • limit the input string length before feeding it to html-to-text (you can do this on your side);
  • make sure htmlparser2 stops at a certain point; I haven't explored this route yet.
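
The first suggestion is simple to apply on the caller's side. A minimal sketch, assuming a cap of roughly 1 MiB of text; MAX_INPUT_LENGTH and truncateInput are hypothetical names, not part of html-to-text's API:

```javascript
// Cap the raw HTML string length before handing it to the converter.
// The cap value is an assumption for illustration; tune it to your workload.
const MAX_INPUT_LENGTH = 1 << 20; // ~1 MiB of characters

function truncateInput(html) {
  return html.length > MAX_INPUT_LENGTH
    ? html.slice(0, MAX_INPUT_LENGTH)
    : html;
}
```

Truncating mid-tag can produce slightly malformed HTML at the cut point, but htmlparser2 is tolerant of that, and the cap guarantees the parser's work is bounded by the input size.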

@KillyMXI
Copy link
Member

Version 6 with all the changes is now released.

I decided to include the input string length limit and set it to a fairly large value.
