Clarify the "should only be run on trusted input" statement in the readme? #53

rgrove · 2013-09-18T22:59:55Z

Gumbo's readme contains the following scary warning under "Non-Goals":

Security. Gumbo was initially designed for a product that worked with trusted input files only. We're working to harden this and make sure that it behaves as expected even on malicious input, but for now, Gumbo should only be run on trusted input or within a sandbox.

I was wondering if you could clarify this. Is the implication that Gumbo may be vulnerable to buffer overflows or similar attacks? The readme also says Gumbo was tested on billions of pages from Google's index, which seems to imply that it at least handled that untrusted input well.

In other words, how paranoid should I be about this? What steps would be involved in hardening Gumbo, and how might contributors help?

nostrademons · 2013-09-18T23:22:36Z

The story behind that is that someone from the Google Security team ran a fuzzer against it and came up with at least one buffer overflow bug. The bug occurs in an obscure corner case of the HTML5 spec, where several little-used features are combined, and so far there appears to be only one. It was not triggered on the testing runs against Google's index. But I haven't had time to fix it yet, and until there's time for a full security review, I figure the responsible thing to do is put up a warning.

There've also been comments from other experienced C programmers on the net about possible integer overflows or unchecked malloc calls. @Maxime2's been doing a great job fixing a number of these, but until I've gone through the code with an eye toward them, there may be more lurking. #39 describes a plan for handling & reporting memory allocation errors.

So, as for how paranoid you should be: I think that if you're just writing tools to clean up your own organization's HTML, you're fine. If you're writing tools that run as a separate process with limited privileges (eg. a Hadoop streaming job that needs to parse HTML, or an HTML playground that runs in a chroot jail or is otherwise isolated), you should probably also be okay. I would be wary about using Gumbo to parse untrusted information that then runs in a high-privilege process, eg. you probably wouldn't want to use it to parse HTML comments on your webserver and then reformat them, since someone with knowledge of specific character sequences that trigger a buffer overflow could exploit that to run arbitrary code.

rgrove · 2013-09-18T23:24:51Z

Exactly what I was looking for. Thanks for clarifying!

rgrove · 2014-05-16T19:25:30Z

@nostrademons I see that a number of fuzzer-related fixes have landed recently. Are you still of the opinion that paranoia is justified when parsing untrusted HTML?

nostrademons · 2014-05-17T02:43:52Z

Gumbo underwent and passed a security review within Google which identified and fixed 2 bugs and a few memory leaks. I'm told this is relatively solid for a C-based parser, and the bugs it found were very obscure (one involved a misnested inside a misnested formatting tag followed by trailing whitespace; the other didn't have an observable effect but made Valgrind/ASAN unhappy by reading uninitialized memory and throwing the result away). Various other GitHub users have also contributed patches to fix some potential integer overflows in array indexing. I would still like to add better OOM handling; right now, it doesn't check the return value of malloc, which is potentially an opportunity for crashes, but there is a convenient hook where this could be added (in the allocator function). I'd also like to add a fuzzer into the unit tests so that it stays clean even after changes, as I'd like to make some changes to support and some of the later HTML5 spec stuff.

So I would say that this library is probably robust enough for most usage, even within servers. Standard disclaimers about the dangers of writing a parser in an unsafe language apply, and I wouldn't use it for "high value" targets like financial websites or identity providers (but then, I would assume that these firms require a thorough internal security audit of all outside code that goes into their product....we did for Google Search). It's probably more robust than most other open-source code that purports to parse HTML, however, and has been tested on more flavors of pathologically bad HTML. (The 27,000 node deep XML document that necessitated the recursion depth limit in the unit tests actually parsed fine under Gumbo, while it crashed Webkit.)

rgrove · 2014-05-17T02:45:56Z

@nostrademons This is great news! Thanks for the update. 😄

rgrove closed this as completed Sep 18, 2013

rgrove mentioned this issue Sep 19, 2013

Error with greater and lesser than tags rgrove/sanitize#83

Closed

rgrove mentioned this issue Jul 16, 2014

Gumbo security question rgrove/sanitize#110

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify the "should only be run on trusted input" statement in the readme? #53

Clarify the "should only be run on trusted input" statement in the readme? #53

rgrove commented Sep 18, 2013

nostrademons commented Sep 18, 2013

rgrove commented Sep 18, 2013

rgrove commented May 16, 2014

nostrademons commented May 17, 2014

rgrove commented May 17, 2014

Clarify the "should only be run on trusted input" statement in the readme? #53

Clarify the "should only be run on trusted input" statement in the readme? #53

Comments

rgrove commented Sep 18, 2013

nostrademons commented Sep 18, 2013

rgrove commented Sep 18, 2013

rgrove commented May 16, 2014

nostrademons commented May 17, 2014

rgrove commented May 17, 2014