
testing: apparent memory leak in fuzzer #44517

Closed
dsnet opened this issue Feb 22, 2021 · 16 comments
Labels: fuzz (Issues related to native fuzzing support), NeedsFix (The path to resolution is known, but the work has not been done.), release-blocker
Milestone: Go1.18

Comments

@dsnet
Member

dsnet commented Feb 22, 2021

If I leave a fuzzer running for sufficiently long, then it crashes with an OOM. Snippet from dmesg:

[1974087.246830] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-319973.slice/session-c1.scope,task=json.test,pid=3659815,uid=319973
[1974087.264733] Out of memory: Killed process 3659815 (json.test) total-vm:18973836kB, anon-rss:13185376kB, file-rss:0kB, shmem-rss:0kB, UID:319973 pgtables:33988kB oom_score_adj:0
[1974087.971181] oom_reaper: reaped process 3659815 (json.test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I don't believe this is because the code being tested is OOMing, but rather the fuzzer itself is retaining too much memory.

Here's a graph of the RSS memory usage over time:
[graph: RSS memory usage over time]
The machine has 32GiB of RAM.

There are large jumps in memory usage at various intervals. I don't have much understanding of how the fuzzer works, but maybe this could be the mutator discovering that some input expands coverage and adding that to some internal data structure?

I should further note that when the fuzzer crashes, it produces a testdata/FuzzXXX/YYY file as the "reproducer". Running the test with that "reproducer" does not fail the fuzz test. If possible, the fuzzer should distinguish between OOMs caused by the fuzzer itself and OOMs caused by the code being tested: the former should not result in any "repro" corpus files being added, while the latter should.

I'm using 5aacd47.
(I can provide the code I'm fuzzing; contact me privately.)

/cc @katiehockman @jayconrod

@dsnet dsnet added the fuzz Issues related to native fuzzing support label Feb 22, 2021
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 23, 2021
@cagedmantis cagedmantis added this to the Backlog milestone Feb 23, 2021
@katiehockman
Contributor

katiehockman commented Mar 4, 2021

Thanks @dsnet. We'll definitely look into this.

@katiehockman katiehockman added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 4, 2021
@katiehockman katiehockman self-assigned this Mar 4, 2021
@rolandshoemaker
Member

rolandshoemaker commented May 20, 2021

I suspect this will have been fixed by https://golang.org/cl/316030. @dsnet could you re-run your target with the current head of dev.fuzz and see if it still OOMs?

@dsnet
Member Author

dsnet commented May 23, 2021

It seems there may still be a memory leak:

[chart-oom: worker memory usage over time]

The horizontal axis is in seconds, while the vertical axis is memory utilization in percentages (on a system with 8GiB). This is the memory usage for the worker, while the parent was closer to 0.1% memory usage at all times. It seems that https://golang.org/cl/316030 had a notable impact.

Of note, the worker process seemed to allocate more and more memory until it hit 75% memory utilization around 2 hours in. Afterwards, it held at that level. It never triggered an OOM from the kernel, but did cause the system to be sufficiently unusable that I couldn't SSH into the machine to kill the fuzzer. It's unclear to me whether the process hovering around 75% utilization was intended behavior.

I'm using 60f16d7.

@rolandshoemaker
Member

rolandshoemaker commented Jun 3, 2021

@dsnet I'll take another look. I think we expect the fuzzer to use a fair amount of memory, especially if it is finding a lot of interesting inputs, but the chart does look strange.

Could you share the target you are using? It's hard to debug without knowing what is actually happening. You can email me at bracewell@ if you'd like it to remain private.

@klauspost
Contributor

klauspost commented Jun 7, 2021

For a corpus with a rather large base set, I see huge memory allocations as well. I've had to reduce the base set to inputs under 100K to keep memory usage from exploding.

go version devel go1.17-5542c10fbf Fri Jun 4 03:07:33 2021 +0000 windows/amd64 FWIW.

@rolandshoemaker
Member

rolandshoemaker commented Jun 8, 2021

I am currently unable to reproduce the behavior described in your second comment on the head of dev.fuzz (b2ff3e8) or at 60f16d7 @dsnet. Running your target on my system results in workers each using a stable ~200MB of memory (tested over ~4 hours).

One thing that may be happening is that the coordinator process currently stores a copy of the corpus in memory, so if the input corpus is extremely large, or if the workers find a lot of new inputs, the memory footprint will be quite large. We can address this by just always reading the inputs from disk, rather than keeping a copy in memory.

Another factor may be that we currently don't bias the mutator toward smaller inputs, so over time inputs will trend larger. This is capped at ~100MB, so as long as there isn't a leak in the worker we shouldn't see unbounded memory usage growth over time.

@dsnet
Member Author

dsnet commented Aug 11, 2021

Anecdotal evidence: it seems that there might be a memory leak related to the use of sub-tests.

https://github.com/go-json-experiment/json/blob/120675eb6f8ce84b755a48c2ee8dded9c29818a6/fuzz_test.go#L122-L141

In the snippet above, I eventually encounter an OOM (within 30 minutes) if t.Run is used. However, if I remove the use of sub-tests, the memory stays fairly low.

@dsnet
Member Author

dsnet commented Aug 11, 2021

Okay, it seems that this is the problem. Trivial reproduction:

func FuzzOOM(f *testing.F) {
	f.Fuzz(func(t *testing.T, b []byte) {
		for i := 0; i < 1000; i++ {
			t.Run("", func(t *testing.T) {})
		}
	})
}

This will allocate and hold onto gigabytes of memory within minutes.
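A simplified model of why this grows (an assumption based on the fix later described in CL 354749, not the actual testing package internals): the pre-fix scheme effectively retained one map entry per generated subtest name, so even duplicate names grew the map on every t.Run call:

```go
package main

import "fmt"

// naiveDedup models (as a simplification, not the real testing code) a
// subtest-name scheme that records every generated name: each duplicate
// of "" becomes "#00", "#01", ..., and each generated name is itself
// stored in the map, so the map grows by one entry per t.Run call and
// is never freed for the life of the fuzzing session.
type naiveDedup struct {
	subNames map[string]int64
}

func (d *naiveDedup) name(base string) string {
	n, ok := d.subNames[base]
	if !ok {
		d.subNames[base] = 0 // first use of this exact name
		return base
	}
	generated := fmt.Sprintf("%s#%02d", base, n)
	d.subNames[base] = n + 1
	d.subNames[generated] = 0 // generated names are also retained
	return generated
}

// entriesAfter reports the map size after `calls` uses of the empty
// name, mirroring the FuzzOOM reproduction above.
func entriesAfter(calls int) int {
	d := &naiveDedup{subNames: map[string]int64{}}
	for i := 0; i < calls; i++ {
		d.name("")
	}
	return len(d.subNames)
}

func main() {
	// Memory scales with the total number of t.Run calls across the
	// whole session, not with the number of unique names.
	fmt.Println(entriesAfter(1000))
}
```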

@gopherbot

gopherbot commented Aug 11, 2021

Change https://golang.org/cl/341336 mentions this issue: [dev.fuzz] testing: fix memory leak with sub-tests

@katiehockman katiehockman modified the milestones: Backlog, Go1.18 Sep 14, 2021
@katiehockman katiehockman removed their assignment Sep 14, 2021
@rsc rsc changed the title [dev.fuzz] apparent memory leak in fuzzer testing: apparent memory leak in fuzzer Sep 21, 2021
@gopherbot

gopherbot commented Sep 22, 2021

Change https://golang.org/cl/351452 mentions this issue: testing: reduce memory used by subtest names

@bcmills
Member

bcmills commented Oct 4, 2021

Another case to consider: the test can spawn arbitrarily many subtests with names based on the fuzz inputs, so the set of user-provided subtest names will grow over time even if the testing package only stores O(1) bits per user-provided test name.

Consider this related test:

func FuzzInputAsSubtest(f *testing.F) {
	f.Fuzz(func(t *testing.T, b []byte) {
		t.Run(string(b), func(t *testing.T) {
		})
	})
}

As far as I can tell, in order to make this scale we must ensure that the fuzz worker periodically drops the subNames map in between test runs, since we only really care that the subtest names be unique within a given Fuzz input invocation — not that they be globally unique across the full fuzzing run.
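That suggestion could be sketched as follows (hypothetical code with invented names, not the fix that actually landed): reset the dedup map at the start of each fuzz input invocation, so names are unique only within one invocation.

```go
package main

import "fmt"

// subtestNamer sketches bcmills' suggestion: since subtest names only
// need to be unique within a single Fuzz input invocation, the dedup
// map can be dropped between runs instead of growing for the whole
// fuzzing session.
type subtestNamer struct {
	subNames map[string]int64
}

// reset drops all accumulated names; called between fuzz inputs.
func (s *subtestNamer) reset() {
	s.subNames = make(map[string]int64)
}

// unique returns base unchanged on first use, or base#NN for a
// duplicate within the current invocation.
func (s *subtestNamer) unique(base string) string {
	n := s.subNames[base]
	s.subNames[base] = n + 1
	if n == 0 {
		return base
	}
	return fmt.Sprintf("%s#%02d", base, n)
}

func main() {
	s := &subtestNamer{}
	for input := 0; input < 3; input++ { // three simulated fuzz inputs
		s.reset()
		fmt.Println(s.unique("sub"), s.unique("sub"))
	}
	// Map size stays bounded by the names used in one invocation,
	// not by the total number of inputs ever fuzzed.
	fmt.Println(len(s.subNames))
}
```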

@bcmills
Member

bcmills commented Oct 4, 2021

FWIW I'm unable to reproduce unbounded memory growth with the example in #44517 (comment) from the current head, and I don't observe a leak with #44517 (comment) either.

I'm not sure how to resolve that discrepancy. @dsnet, can you still reproduce the problem?

@dsnet
Member Author

dsnet commented Oct 4, 2021

@dsnet, can you still reproduce the problem?

As of go version devel go1.18-3bbc82371e, this is still an issue. I run it with:

go.tip test -fuzz=OOM -v -parallel=1

Since each individual fuzz worker has its own namespace, running with -parallel=1 seems to manifest the problem fastest. On my Ryzen 5900X, it was allocating about 1GiB/minute.

@bcmills
Member

bcmills commented Oct 4, 2021

Hmm, I wonder if the -v and -parallel flags in that invocation are load-bearing. 🤔

@gopherbot

gopherbot commented Oct 6, 2021

Change https://golang.org/cl/354430 mentions this issue: testing: don't create unique subtest names while fuzzing

@gopherbot

gopherbot commented Oct 8, 2021

Change https://golang.org/cl/354749 mentions this issue: testing: reduce memory used by subtest names

gopherbot pushed a commit that referenced this issue Nov 1, 2021
This is heavily based on CL 341336 by Joe Tsai and CL 351452 by
Jay Conrod.

T.Run and T.Name use a map[string]int64 to hold the next suffix to use
when duplicate names are passed to T.Run. This map necessarily retains
one entry per unique name. However, it's a waste of memory to retain
one entry per duplicate name: when we encounter the Nth duplicate, we
know that names 00 through N-1 have been used just by looking at N.

We do still need to store (and check for collisions against) explicit
names provided by the caller. For example, if the user passes in "a",
then "a#01", then "a" again, we cannot deduplicate the second "a" to
"a#01" — we need to instead skip ahead to "a#02". We can do so by
checking the count of "a", then generating a proposed deduplicated
name, then double-checking that proposed name against only the
explicit names so far.

This somewhat reduces memory usage for tests that spawn large numbers
of duplicate subtests, but doesn't solve the problem of memory growth
for fuzzing — we still have to track all of the explicit,
user-provided subtest names, and in a long-running fuzz test that set
alone may be unbounded.

This fixes memory growth for the example described in
https://golang.org/issue/44517#issuecomment-897104060,
but not the one in
https://golang.org/issue/44517#issuecomment-933825661.

For #44517

Change-Id: Ia159ecfcf44561ba67508d3af6377c27856df31d
Reviewed-on: https://go-review.googlesource.com/c/go/+/354749
Trust: Bryan C. Mills <bcmills@google.com>
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
Reviewed-by: Jay Conrod <jayconrod@google.com>
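The count-based scheme this commit message describes can be modeled roughly as follows (a simplified sketch with invented names, not the actual testing package source): keep a count per explicit caller-provided name, and when generating a deduplicated name, check the proposal only against the explicit names seen so far, skipping ahead on collision.

```go
package main

import "fmt"

// namer models the commit message's scheme: duplicate names no longer
// add map entries of their own; only explicit names are stored, each
// with a count of how many times it has been used.
type namer struct {
	explicit map[string]int64 // explicit name -> times used
}

func (m *namer) unique(base string) string {
	for {
		n := m.explicit[base]
		m.explicit[base] = n + 1
		if n == 0 {
			return base // first use of this explicit name
		}
		// Propose base#NN and double-check it against the explicit
		// names only; e.g. after explicit "a", "a#01", "a" we must
		// skip the taken "a#01" and emit "a#02".
		name := fmt.Sprintf("%s#%02d", base, n)
		if _, taken := m.explicit[name]; !taken {
			return name
		}
	}
}

func main() {
	m := &namer{explicit: map[string]int64{}}
	fmt.Println(m.unique("a"))    // first use: "a"
	fmt.Println(m.unique("a#01")) // explicit suffixed name: "a#01"
	fmt.Println(m.unique("a"))    // skips past taken "a#01": "a#02"
	fmt.Println(len(m.explicit))  // entries only for explicit names
}
```

Note that the map still holds one entry per explicit user-provided name, which is why, as the commit message says, this reduces memory for duplicate-heavy tests but does not bound memory for fuzz targets that derive subtest names from the inputs.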
8 participants