profiler: Fix race condition in the profile's buffer #641
Fixes: #635, #631
Context
After landing system-wide profiling (#627), we started seeing many more profiles fail with various error messages (#635), such as:
gzip: invalid header
flate: corrupt input before offset 2170
unexpected EOF
These errors pointed to some kind of data corruption. This correlates well with the addition of system-wide profiling, which not only produces many more profiles overall but also increases the rate at which they are generated.
After going through the code in the write client, I could not spot anything that looked incorrect:
First, the series are appended:
parca-agent/pkg/agent/write_client.go
Lines 146 to 149 in 0c75dc9
and then they are processed and sent over the wire, if everything goes well:
parca-agent/pkg/agent/write_client.go
Lines 86 to 99 in 0c75dc9
The underlying buffer
At that point, I started taking a look at the call site, cpu.go (formerly called profile.go), where we obtain a bytes buffer:
parca-agent/pkg/profiler/profiler.go
Line 626 in 6c74e9f
which, according to its documentation:
So, we are getting a slice that we store for sending later. It is quite possible that the buffer pool is reset before we actually send the data, which results in sending corrupt profiles.
The reason we barely saw this before is the slower pace of profile creation, which narrowed the window for this race condition.
Test plan
Ran the agent for 30 minutes and there were no errors sending profiles.