[dev.fuzz] apparent memory leak in fuzzer #44517

Open
dsnet opened this issue Feb 22, 2021 · 6 comments

@dsnet (Member) commented Feb 22, 2021

If I leave a fuzzer running for sufficiently long, then it crashes with an OOM. Snippet from dmesg:

[1974087.246830] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-319973.slice/session-c1.scope,task=json.test,pid=3659815,uid=319973
[1974087.264733] Out of memory: Killed process 3659815 (json.test) total-vm:18973836kB, anon-rss:13185376kB, file-rss:0kB, shmem-rss:0kB, UID:319973 pgtables:33988kB oom_score_adj:0
[1974087.971181] oom_reaper: reaped process 3659815 (json.test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I don't believe this is because the code being tested is OOMing, but rather the fuzzer itself is retaining too much memory.
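For context, the general shape of the target (purely illustrative; the real fuzz function isn't public, and the names below are made up):

package json_test

import (
	"encoding/json"
	"testing"
)

func FuzzUnmarshal(f *testing.F) {
	f.Add([]byte(`{"a":[1,2,3]}`)) // seed input
	f.Fuzz(func(t *testing.T, data []byte) {
		var v interface{}
		// Any runaway allocation here would belong to the code under test;
		// the OOM above appears to come from the fuzzing engine itself.
		_ = json.Unmarshal(data, &v)
	})
}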

Here's a graph of the RSS memory usage over time:
[graph: RSS memory usage over time]
The machine has 32GiB of RAM.

There are large jumps in memory usage at various intervals. I don't have much understanding of how the fuzzer works, but could this be the mutator discovering that some input expands coverage and adding it to some internal data structure?
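If that guess is right, a toy model of the mechanism would look something like this (not the fuzzer's actual data structures, just an illustration of why retained inputs add up):

package fuzzmodel

// corpus is a toy stand-in for whatever the fuzzer keeps internally:
// inputs that produced new coverage are retained forever, so memory
// use grows with every "interesting" input (and with input size).
type corpus struct {
	entries [][]byte
	seen    map[uint64]bool // hypothetical coverage signature -> already seen
}

func (c *corpus) add(data []byte, coverageSig uint64) {
	if c.seen == nil {
		c.seen = make(map[uint64]bool)
	}
	if c.seen[coverageSig] {
		return // nothing new; drop the input
	}
	c.seen[coverageSig] = true
	// Keep a private copy; large inputs stay resident for the whole run.
	c.entries = append(c.entries, append([]byte(nil), data...))
}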

I should further note that when the fuzzer crashes, it produces a testdata/FuzzXXX/YYY file as the "reproducer". Running the test with that "reproducer" does not fail the fuzz test. If possible, the fuzzer should distinguish between OOMs caused by itself and OOMs caused by the code being tested: the former should not result in any "repro" corpus files being added, while the latter should.

I'm using 5aacd47.
(I can provide the code I'm fuzzing; contact me privately.)

/cc @katiehockman @jayconrod

@dsnet added the fuzz label Feb 22, 2021
@cagedmantis added this to the Backlog milestone Feb 23, 2021
@katiehockman (Member) commented Mar 4, 2021

Thanks @dsnet. We'll definitely look into this.

@rolandshoemaker (Member) commented May 20, 2021

I suspect this will have been fixed by https://golang.org/cl/316030. @dsnet could you re-run your target with the current head of dev.fuzz and see if it still OOMs?

@dsnet (Member, Author) commented May 23, 2021

It seems there may still be a memory leak?

[chart: worker memory utilization over time]

The horizontal axis is in seconds, and the vertical axis is memory utilization as a percentage (on a system with 8GiB). This is the memory usage for the worker; the parent stayed closer to 0.1% memory usage at all times. It seems that https://golang.org/cl/316030 had a notable impact.

Of note, the worker process seemed to allocate more and more memory until it hit 75% memory utilization around 2 hours in. Afterwards, it held at that level. It never triggered an OOM from the kernel, but did cause the system to be sufficiently unusable that I couldn't SSH into the machine to kill the fuzzer. It's unclear to me whether the process hovering around 75% utilization was intended behavior.

I'm using 60f16d7.

@rolandshoemaker (Member) commented Jun 3, 2021

@dsnet I'll take another look. We expect the fuzzer to typically use a fair amount of memory, especially if it is finding a lot of interesting inputs, but the chart does look strange.

Could you share the target you are using? It's hard to debug without knowing what is actually happening. You can email me at bracewell@ if you'd like it to remain private.

@klauspost (Contributor) commented Jun 7, 2021

For a fuzz target with a rather large base set, I see huge memory allocations as well. I've had to reduce the base set to inputs under 100K to keep memory usage from exploding.
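Roughly the kind of size filtering I mean (the directory layout, cutoff, and names are just examples from my setup, and the target body is elided):

package mypkg

import (
	"os"
	"path/filepath"
	"testing"
)

func FuzzDecode(f *testing.F) {
	const maxSeedSize = 100 << 10 // ~100K cutoff, per the workaround above
	seeds, _ := filepath.Glob("testdata/seeds/*") // illustrative seed location
	for _, name := range seeds {
		data, err := os.ReadFile(name)
		if err != nil || len(data) > maxSeedSize {
			continue // skip unreadable or oversized seeds
		}
		f.Add(data)
	}
	f.Fuzz(func(t *testing.T, data []byte) {
		// Call the code under test here.
		_ = data
	})
}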

go version devel go1.17-5542c10fbf Fri Jun 4 03:07:33 2021 +0000 windows/amd64 FWIW.

@rolandshoemaker (Member) commented Jun 8, 2021

@dsnet I am currently unable to reproduce the behavior described in your second comment, either at the head of dev.fuzz (b2ff3e8) or at 60f16d7. Running your target on my system results in each worker using a stable ~200MB of memory (tested over ~4 hours).

One thing that may be happening is that the coordinator process currently stores a copy of the corpus in memory, so if the input corpus is extremely large, or if the workers find a lot of new inputs, the memory footprint will be quite large. We can address this by just always reading the inputs from disk, rather than keeping a copy in memory.
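This isn't the coordinator's real code, but the idea is roughly the following: keep only the entry paths in memory and read the bytes back from disk when they are needed.

package fuzzcorpus

import "os"

// diskCorpus tracks corpus entries by path only, instead of holding
// every input's bytes for the whole run.
type diskCorpus struct {
	paths []string
}

// add records a new entry that has already been written to disk.
func (c *diskCorpus) add(path string) {
	c.paths = append(c.paths, path)
}

// load reads one entry on demand, so only the inputs currently in use
// are resident in the coordinator.
func (c *diskCorpus) load(i int) ([]byte, error) {
	return os.ReadFile(c.paths[i])
}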

Another factor may be that we currently don't bias the mutator towards smaller inputs, so over time inputs will trend larger. Input size is capped at ~100MB, though, so as long as there isn't a leak in the worker we shouldn't see unbounded memory growth over time.
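As an illustration of what biasing the mutator towards smaller inputs could look like (this is not the current dev.fuzz mutator, just a sketch):

package fuzzmutate

import "math/rand"

const maxInputSize = 100 << 20 // the ~100MB cap mentioned above

// mutate grows or shrinks an input; the larger the input, the more likely
// we are to shrink it, so sizes shouldn't trend towards the cap.
func mutate(r *rand.Rand, in []byte) []byte {
	shrinkBias := float64(len(in)) / float64(maxInputSize)
	if len(in) > 0 && r.Float64() < shrinkBias {
		return in[:r.Intn(len(in))] // truncate to a random shorter length
	}
	return append(in, byte(r.Intn(256))) // otherwise append a random byte
}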
