Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Clarify the "should only be run on trusted input" statement in the readme? #53

Closed
rgrove opened this issue Sep 18, 2013 · 5 comments
Closed

Comments

@rgrove
Copy link
Contributor

rgrove commented Sep 18, 2013

Gumbo's readme contains the following scary warning under "Non-Goals":

Security. Gumbo was initially designed for a product that worked with trusted input files only. We're working to harden this and make sure that it behaves as expected even on malicious input, but for now, Gumbo should only be run on trusted input or within a sandbox.

I was wondering if you could clarify this. Is the implication that Gumbo may be vulnerable to buffer overflows or similar attacks? The readme also says Gumbo was tested on billions of pages from Google's index, which seems to imply that it at least handled that untrusted input well.

In other words, how paranoid should I be about this? What steps would be involved in hardening Gumbo, and how might contributors help?

@nostrademons
Copy link
Contributor

The story behind that is that someone from the Google Security team ran a fuzzer against it and came up with at least one buffer overflow bug. The bug occurs in an obscure corner case of the HTML5 spec, where several little-used features are combined, and so far there appears to be only one. It was not triggered on the testing runs against Google's index. But I haven't had time to fix it yet, and until there's time for a full security review, I figure the responsible thing to do is put up a warning.

There've also been comments from other experienced C programmers on the net about possible integer overflows or unchecked malloc calls. @Maxime2's been doing a great job fixing a number of these, but until I've gone through the code with an eye toward them, there may be more lurking. #39 describes a plan for handling & reporting memory allocation errors.

So, as for how paranoid you should be: I think that if you're just writing tools to clean up your own organization's HTML, you're fine. If you're writing tools that run as a separate process with limited privileges (eg. a Hadoop streaming job that needs to parse HTML, or an HTML playground that runs in a chroot jail or is otherwise isolated), you should probably also be okay. I would be wary about using Gumbo to parse untrusted information that then runs in a high-privilege process, eg. you probably wouldn't want to use it to parse HTML comments on your webserver and then reformat them, since someone with knowledge of specific character sequences that trigger a buffer overflow could exploit that to run arbitrary code.

@rgrove
Copy link
Contributor Author

rgrove commented Sep 18, 2013

Exactly what I was looking for. Thanks for clarifying!

@rgrove
Copy link
Contributor Author

rgrove commented May 16, 2014

@nostrademons I see that a number of fuzzer-related fixes have landed recently. Are you still of the opinion that paranoia is justified when parsing untrusted HTML?

@nostrademons
Copy link
Contributor

Gumbo underwent and passed a security review within Google which identified and fixed 2 bugs and a few memory leaks. I'm told this is relatively solid for a C-based parser, and the bugs it found were very obscure (one involved a misnested inside a misnested formatting tag followed by trailing whitespace; the other didn't have an observable effect but made Valgrind/ASAN unhappy by reading uninitialized memory and throwing the result away). Various other GitHub users have also contributed patches to fix some potential integer overflows in array indexing. I would still like to add better OOM handling; right now, it doesn't check the return value of malloc, which is potentially an opportunity for crashes, but there is a convenient hook where this could be added (in the allocator function). I'd also like to add a fuzzer into the unit tests so that it stays clean even after changes, as I'd like to make some changes to support and some of the later HTML5 spec stuff.

So I would say that this library is probably robust enough for most usage, even within servers. Standard disclaimers about the dangers of writing a parser in an unsafe language apply, and I wouldn't use it for "high value" targets like financial websites or identity providers (but then, I would assume that these firms require a thorough internal security audit of all outside code that goes into their product....we did for Google Search). It's probably more robust than most other open-source code that purports to parse HTML, however, and has been tested on more flavors of pathologically bad HTML. (The 27,000 node deep XML document that necessitated the recursion depth limit in the unit tests actually parsed fine under Gumbo, while it crashed Webkit.)

@rgrove
Copy link
Contributor Author

rgrove commented May 17, 2014

@nostrademons This is great news! Thanks for the update. 😄

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@rgrove @nostrademons and others