
profiler: Fix race condition in the profile's buffer #641

Merged: 1 commit into parca-dev:main on Aug 1, 2022

Conversation

@javierhonduco (Contributor) commented on Aug 1, 2022:

Fixes: #635, #631

Context

After landing system-wide profiling (#627) we started noticing way more profiles that failed with various error messages (#635), such as:

  • gzip: invalid header
  • flate: corrupt input before offset 2170
  • unexpected EOF

These errors indicated some sort of corruption. This correlates well with the addition of system-wide profiling, which not only produces way more profiles overall but also increases the pace at which they are generated.

After playing around with the write client code, I could not spot anything that seemed incorrect:

First, the series are appended:

b.series = append(b.series, &profilestorepb.RawProfileSeries{
    Labels:  profileSeries.Labels,
    Samples: profileSeries.Samples,
})

and then they are processed and, if everything goes well, sent over the wire:

_, err := b.writeClient.WriteRaw(ctx, &profilestorepb.WriteRawRequest{
    Series:     batch,
    Normalized: true, // Since everything generated by the agent is normalized, we can assume this to be true.
})
// Only log error if retrying, otherwise it will be logged outside the retry
if err != nil && expbackOff.NextBackOff().Nanoseconds() > 0 {
    level.Debug(b.logger).Log(
        "msg", "batch write client failed to send profiles",
        "retry", expbackOff.NextBackOff(),
        "count", len(batch),
        "err", err,
    )
}
return err
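
The snippet above runs inside an exponential backoff retry that is not shown here; expbackOff comes from a backoff implementation with a NextBackOff method (the github.com/cenkalti/backoff API matches). The following is only a minimal, self-contained sketch of how such a loop is typically wired up, with sendBatch and a plain logger standing in for the agent's actual code:

package main

import (
    "context"
    "errors"
    "log"
    "time"

    "github.com/cenkalti/backoff/v4"
)

// sendBatch stands in for the WriteRaw call shown above.
func sendBatch(ctx context.Context) error {
    return errors.New("simulated transient failure")
}

func main() {
    ctx := context.Background()

    // Stop retrying after roughly ten seconds of exponential backoff.
    expbackOff := backoff.NewExponentialBackOff()
    expbackOff.MaxElapsedTime = 10 * time.Second

    err := backoff.Retry(func() error {
        sendErr := sendBatch(ctx)
        // Mirror the check above: only log at debug level while retries remain;
        // the final error is returned by backoff.Retry and logged once below.
        if sendErr != nil && expbackOff.NextBackOff() != backoff.Stop {
            log.Printf("debug: batch write client failed to send profiles: %v", sendErr)
        }
        return sendErr
    }, expbackOff)
    if err != nil {
        log.Printf("error: giving up on sending this batch: %v", err)
    }
}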

The underlying buffer

At that point, I started taking a look at the call site in cpu.go (formerly called profile.go), where we obtain a bytes buffer:

RawProfile: buf.Bytes(),

which, according to its documentation:

Bytes returns a slice of length b.Len() holding the unread portion of the buffer. The slice is valid for use only until the next buffer modification (that is, only until the next call to a method like Read, Write, Reset, or Truncate). The slice aliases the buffer content at least until the next buffer modification, so immediate changes to the slice will affect the result of future reads.

So we are getting a slice that we store for sending later. It is quite possible that the buffer pool is reset before the data is actually sent, resulting in corrupt profiles being shipped.

The reason we barely saw this before is the slower pace of profile creation, which narrowed the window for hitting this race condition.
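
To make the failure mode concrete, here is a small, self-contained sketch (not this PR's actual diff) showing how a slice returned by Bytes() gets clobbered when the buffer is reused, and the kind of defensive copy that avoids it:

package main

import (
    "bytes"
    "fmt"
)

func main() {
    var buf bytes.Buffer
    buf.WriteString("profile-one")

    // What the agent effectively did: keep the slice returned by Bytes(),
    // which aliases buf's internal array.
    aliased := buf.Bytes()

    // The defensive fix: take an owned copy before the buffer can be reused,
    // at the cost of one allocation per profile.
    owned := make([]byte, buf.Len())
    copy(owned, buf.Bytes())

    // Reusing the buffer (as a pool does) overwrites the shared backing array.
    buf.Reset()
    buf.WriteString("profile-two")

    fmt.Printf("aliased: %q\n", aliased) // "profile-two": the stored slice was clobbered
    fmt.Printf("owned:   %q\n", owned)   // "profile-one": still intact
}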

Test plan

Ran the agent for 30 minutes and there were no errors sending profiles:

level=debug ts=2022-08-01T13:54:28.112511969Z caller=write_client.go:108 msg="batch write client sent profiles" count=33
level=debug ts=2022-08-01T13:54:38.151905771Z caller=write_client.go:108 msg="batch write client sent profiles" count=30
level=debug ts=2022-08-01T13:54:48.064483943Z caller=write_client.go:108 msg="batch write client sent profiles" count=25
level=debug ts=2022-08-01T13:54:58.138064657Z caller=write_client.go:108 msg="batch write client sent profiles" count=26
level=debug ts=2022-08-01T13:55:08.0212775Z caller=write_client.go:108 msg="batch write client sent profiles" count=19
level=debug ts=2022-08-01T13:55:18.053826657Z caller=write_client.go:108 msg="batch write client sent profiles" count=21
level=debug ts=2022-08-01T13:55:28.212127284Z caller=write_client.go:108 msg="batch write client sent profiles" count=34
level=debug ts=2022-08-01T13:55:38.125946646Z caller=write_client.go:108 msg="batch write client sent profiles" count=33
level=debug ts=2022-08-01T13:55:48.056939936Z caller=write_client.go:108 msg="batch write client sent profiles" count=32
level=debug ts=2022-08-01T13:55:58.149203393Z caller=write_client.go:108 msg="batch write client sent profiles" count=33
level=debug ts=2022-08-01T13:56:08.096657073Z caller=write_client.go:108 msg="batch write client sent profiles" count=31
level=debug ts=2022-08-01T13:56:18.116233418Z caller=write_client.go:108 msg="batch write client sent profiles" count=37
level=debug ts=2022-08-01T13:56:28.343900223Z caller=write_client.go:108 msg="batch write client sent profiles" count=36
level=debug ts=2022-08-01T13:56:38.145866682Z caller=write_client.go:108 msg="batch write client sent profiles" count=35
level=debug ts=2022-08-01T13:56:48.134876821Z caller=write_client.go:108 msg="batch write client sent profiles" count=29
level=debug ts=2022-08-01T13:56:58.24054575Z caller=write_client.go:108 msg="batch write client sent profiles" count=33
level=debug ts=2022-08-01T13:57:08.274369062Z caller=write_client.go:108 msg="batch write client sent profiles" count=45
level=debug ts=2022-08-01T13:57:18.121521521Z caller=write_client.go:108 msg="batch write client sent profiles" count=22
level=debug ts=2022-08-01T13:57:28.168074083Z caller=write_client.go:108 msg="batch write client sent profiles" count=28
level=debug ts=2022-08-01T13:57:38.181045685Z caller=write_client.go:108 msg="batch write client sent profiles" count=31
level=debug ts=2022-08-01T13:57:48.144038651Z caller=write_client.go:108 msg="batch write client sent profiles" count=28
level=debug ts=2022-08-01T13:57:58.075858101Z caller=write_client.go:108 msg="batch write client sent profiles" count=19
level=debug ts=2022-08-01T13:58:08.107327655Z caller=write_client.go:108 msg="batch write client sent profiles" count=26
level=debug ts=2022-08-01T13:58:18.056224463Z caller=write_client.go:108 msg="batch write client sent profiles" count=20
level=debug ts=2022-08-01T13:58:28.631909061Z caller=write_client.go:108 msg="batch write client sent profiles" count=28
level=debug ts=2022-08-01T13:58:38.0890856Z caller=write_client.go:108 msg="batch write client sent profiles" count=27
level=debug ts=2022-08-01T13:58:48.300975251Z caller=write_client.go:108 msg="batch write client sent profiles" count=42

@javierhonduco (Contributor, Author) commented:

Rebase

@kakkoyun (Member) left a comment:

LGTM.

I'm happy to accept it since it is an immediate fix.

However, as we talked offline, we need to separate certain parts of the profiler into stand-alone modules; the writer can be one of them, and it could handle its own buffer pool. I suspect that, now that we have increased the frequency of profile creation, this will drastically increase allocations, which is not something we want.
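
For reference, a writer-owned buffer pool along those lines could look roughly like the sketch below; the type and function names are made up for illustration and are not the agent's actual API. The idea is that the writer keeps each borrowed buffer alive until its bytes have been sent, and only then returns it to the pool, so no per-profile copy is needed.

package writer

import (
    "bytes"
    "sync"
)

// bufPool is a pool the writer module could own outright.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

// pendingProfile holds a borrowed buffer until its contents have been sent.
type pendingProfile struct {
    buf *bytes.Buffer
}

// borrow hands out a buffer for the profiler to serialize into.
func borrow() *pendingProfile {
    return &pendingProfile{buf: bufPool.Get().(*bytes.Buffer)}
}

// bytes exposes the serialized profile; the slice is valid until release is called.
func (p *pendingProfile) bytes() []byte { return p.buf.Bytes() }

// release returns the buffer to the pool after the batch has been written.
func (p *pendingProfile) release() {
    p.buf.Reset()
    bufPool.Put(p.buf)
    p.buf = nil
}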

@kakkoyun kakkoyun merged commit c3308e2 into parca-dev:main Aug 1, 2022
@javierhonduco javierhonduco deleted the fix-profile-buffer-race branch August 1, 2022 14:57
@javierhonduco (Contributor, Author) commented:

Totally agree, once we have dealt with the correctness issues we can focus on the efficiency work!

Linked issue: profile write client: Agent sending corrupted compressed profiles