Direct allocation attribution in profiling? #1751
I don't think this is correct. The goal of the current logic is that it is equivalent to the per-byte coinflip process:

This explains the current logic: given i.i.d. coinflip outcomes, the geometric distribution gives the number of flips until the next heads, allowing us to avoid having to do O(N) coinflips on an N-byte allocation. (The earliest usage I could find a quick cite for this trick comes from the cooperative bug isolation stuff (http://pages.cs.wisc.edu/~liblit/dissertation/dissertation.pdf), but I feel sure it's been around for longer.)

Once we've picked an object, we need to know how many times the coinflip function would have come up heads within it; this explains why the Poisson is right for the unskewing. (This is one source of inaccuracy in the current logic IIRC; we should really only count the "unaccounted for" bytes when doing the Poisson computation, but I think we don't.)

The current logic has the nice property[1] that the bytes attributed to a given stack trace are correct in expectation. It's not clear to me that there's an easy way of achieving that property using some other distribution for picking sampled allocations.

[1] Modulo the double-counting inaccuracy
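The coinflip-to-geometric equivalence described above is easy to check empirically. Below is a small Python sketch (an illustration, not jemalloc code; `P`, `N`, and `TRIALS` are arbitrary choices) comparing per-byte coinflips against geometric gap sampling:

```python
import math
import random

P = 1 / 64      # per-byte sampling probability (arbitrary for this sketch)
N = 1000        # allocation size in bytes (arbitrary)
TRIALS = 5000

def geometric(p, rng):
    # Inverse-CDF sampling of P(G = k) = (1-p)^(k-1) * p, k = 1, 2, ...
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1

def heads_per_byte(n, p, rng):
    # O(n): one coinflip per allocated byte.
    return sum(1 for _ in range(n) if rng.random() < p)

def heads_gap_sampling(n, p, rng):
    # O(#heads): jump ahead by geometric gaps between successive heads.
    heads, pos = 0, 0
    while True:
        pos += geometric(p, rng)
        if pos > n:
            return heads
        heads += 1

rng = random.Random(42)
mean_per_byte = sum(heads_per_byte(N, P, rng) for _ in range(TRIALS)) / TRIALS
mean_gap = sum(heads_gap_sampling(N, P, rng) for _ in range(TRIALS)) / TRIALS
# Both means should be close to N * P = 15.625 heads per allocation.
```

Up to sampling noise, the two estimates agree, supporting the claim that gap sampling is a drop-in replacement for per-byte coinflips.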
Just wanted to note that the idea of using a geometric distribution for quick random sampling definitely predates Ben Liblit's thesis: it dates back to at least Vitter '87 (http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf). A blog post describing its reinvention by the blog author :) -- http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
Thank you @davidtgoldblatt and @emeryberger for the explanations and the pointers, and thank you @davidtgoldblatt for the offline discussion. I want to summarize here. First, I previously had two guesses about why the geometric distribution was chosen for generating the sample intervals.

Now I understand that the first guess was correct but the second was wrong: the geometric distribution was chosen following a very careful line of mathematical reasoning. Also, my previous guess regarding the Poisson distribution was wrong: I guessed that it might be an (alternative) distribution being used for generating the random interval, but it is actually used to characterize the distribution of the estimated total bytes (per stack trace).

One thing I would like to do next is to go over some of the mathematics more carefully. Intuitively it makes sense to me, but I have not yet been totally convinced. The basis of the mathematical reasoning is the Bernoulli sampling for each allocation byte, which serves as the first step in transforming the prohibitively expensive exact byte counting into a much cheaper statistical estimation. I imagine that there may be alternative ways of achieving the same goal, and I hope to better understand them.
Coming back to the four limitations I originally described in this issue: they all still exist, but I now see them differently.
I don't yet have a fundamentally better alternative than the geometric distribution. So, we'd need to acknowledge the roots of the skewness and bear with it, and we'd also need to face the challenge of coming up with some good tests for it.
One key understanding I've gained is that the ultimate goal of profiling is to answer the following allocation attribution question: "What is the total number of bytes allocated by each stack trace?" Previously, I thought that the sampling guarantee and the sampling ratio estimation were two separate ends in themselves. Now, I see them as two parts trying to address the same allocation attribution question.
And more importantly, both implicitly assume that we have to answer the attribution question based on the current sampling strategy. Rather than immediately looking into these two parts individually, I think a better question to ask first is: "Is there a more direct way of answering the allocation attribution question, without first routing through our current sampling strategy?" There are a couple of places to start from.
Of course, there's also the more challenging route of jumping completely out of our current sampling mechanism and trying to come up with some brand-new sampling mechanism to answer the allocation attribution question. I hope to get some answers to these questions, and will update this thread.
Here's the progress I've made so far. On a high level, I'm still sticking to the mathematical basis of our current sampling strategy, i.e. the Bernoulli sampling for each allocation byte, and all the rest of the writing here aims at answering one of the questions I raised earlier.

I'll first go over the background in a bit more detail, and then elaborate on my findings.

Background

We start from trying to answer the allocation attribution question in a direct and exact manner:

```c
void attribute_exact(size_t sz) {
  attribute(get_bt(), sz);
}
```

Where `get_bt()` obtains the current back trace and `attribute()` records the given number of bytes against it. Directly executing `attribute_exact` on every allocation is prohibitively expensive, mainly because of `get_bt()`.

The Bernoulli transformation replaces the exact count with a statistical estimation: flip a coin with success probability `p` for every allocated byte, and attribute `1/p` bytes on each success:

```c
void attribute_Bernoulli(size_t sz) {
  for (size_t i = 0; i < sz; ++i) {
    if (Bernoulli(p)) {
      attribute(get_bt(), 1/p);
    }
  }
}
```

Where `Bernoulli(p)` is a coinflip that comes up true with probability `p`. This is basically what @davidtgoldblatt wrote earlier. Directly executing `attribute_Bernoulli` is of course even more expensive; the main benefit of the Bernoulli transformation is not in itself but in what it enables: for independent and repeated Bernoulli samples, the interval we need to wait until the next success follows a Geometric distribution with the same parameter `p`.
This allows us to skip directly from one sampled byte to the next:

```c
static size_t wait = Geometric(p);

void attribute_Geometric(size_t sz) {
  while (sz >= wait) {
    attribute(get_bt(), 1/p);
    sz -= wait;
    wait = Geometric(p);
  }
  wait -= sz;
  assert(wait > 0);
}
```

Where `Geometric(p)` draws the number of Bernoulli trials until the next success, and `wait` carries the number of bytes remaining until that success across allocations. `attribute_Geometric` is statistically equivalent to `attribute_Bernoulli`, but it can still call `attribute` (and thus `get_bt()`) many times within a single large allocation. To tackle this challenge, we need to "compress" the `while` loop into a single attribution.
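Before compressing, it's worth confirming that the loop version is unbiased. The following Python sketch (an illustration with arbitrary constants, not jemalloc code) runs `attribute_Geometric` over a stream of allocations with the `wait` state carried across calls; the total attributed bytes should match the total allocated bytes in expectation:

```python
import math
import random

P = 1 / 64          # per-byte sampling probability (arbitrary)
N_ALLOCS = 200000   # number of simulated allocations
rng = random.Random(7)

def geometric(p):
    # Inverse-CDF sampling of P(G = k) = (1-p)^(k-1) * p, k >= 1
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1

wait = geometric(P)     # bytes remaining until the next sampled byte
attributed = 0.0        # total bytes attributed (each sample counts 1/p)
total = 0               # total bytes actually allocated

for _ in range(N_ALLOCS):
    sz = rng.randrange(1, 129)   # allocation sizes 1..128 bytes (arbitrary)
    total += sz
    while sz >= wait:            # attribute_Geometric, inlined
        attributed += 1.0 / P
        sz -= wait
        wait = geometric(P)
    wait -= sz

ratio = attributed / total
# ratio should be close to 1.0: the estimator is unbiased in expectation.
```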
More specifically, we want to have a `magic` function such that the following is statistically equivalent to `attribute_Geometric`:

```c
static size_t wait = Geometric(p);

void attribute_magic(size_t sz) {
  if (sz >= wait) {
    attribute(get_bt(), magic(&sz, &wait));
  }
  wait -= sz;
  assert(wait > 0);
}
```

Which has at most one `attribute()` (and thus `get_bt()`) call per allocation. In the current profiling logic, we implicitly have the following `magic`:

```c
size_t magic_current(size_t *sz, size_t *wait) {
  size_t total_attr = *sz;
  *sz = 0;
  *wait = Geometric(p);
  return total_attr;
}
```

This computation is apparently wrong, for all three quantities. It can be an approximation if all allocations are very large, but in general it is very inaccurate, and thus leads to the tricky sampling ratio estimation challenge I described earlier.

My claim

After working on it for a while, I claim that the correct version of `magic` is:

```c
size_t magic_proposed(size_t *sz, size_t *wait) {
  size_t total_attr = (1 + Binomial(*sz - *wait, p)) * 1/p;
  *sz = min(Geometric(p) - 1, *sz - *wait);
  *wait = *sz + Geometric(p);
  return total_attr;
}
```

And, if desired, the `Binomial` can be approximated by a Poisson.

My proof goes as follows. To prove my claim, I find it easier to go back to the per-byte Bernoulli sampling story: as soon as the first success occurs, i.e. after `*wait` bytes, the allocation is sampled. The number of additional successes within the remaining `*sz - *wait` bytes follows `Binomial(*sz - *wait, p)`, which gives the `total_attr` expression. For any bytes after the last success, the memorylessness of the underlying Bernoulli process lets us redraw the distance to the next success afresh, which gives the `*sz` and `*wait` updates.
Next step

Please help me go over the math and check if there's anything I did wrong. If everything looks correct, I'll explore how we can implement this efficiently.
I'm about to get on a plane, but this looks mostly correct to me (I'll try to take a closer look next week). One thing I'll note is that for large n and small p, the Poisson approximates the binomial closely, and is more computationally tractable, which I believe is the motivation for using it.
Which you in fact noted in the original reply; sorry for the comment noise!
No worries. Yes, Poisson should be a very close approximation to Binomial in our case. Also, I've found a mistake in my math, so hold on, and I'll make some corrections early next week.
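The closeness of the Binomial-Poisson approximation mentioned above can be quantified deterministically. The following Python sketch (example values `n = 4096`, `p = 1/64` are arbitrary, giving rate `np = 64`) computes the total variation distance between the two distributions; classical results bound it by roughly `p` in this regime:

```python
import math

N, P = 4096, 1 / 64    # large n, small p (arbitrary example values)
LAM = N * P            # matching Poisson rate: np = 64

def binom_pmf(k):
    # Log-space evaluation to avoid overflow in the binomial coefficient.
    return math.exp(math.lgamma(N + 1) - math.lgamma(k + 1)
                    - math.lgamma(N - k + 1)
                    + k * math.log(P) + (N - k) * math.log(1 - P))

def poisson_pmf(k):
    return math.exp(-LAM + k * math.log(LAM) - math.lgamma(k + 1))

# Total variation distance: half the L1 distance between the two PMFs.
tv = 0.5 * sum(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(N + 1))
# tv is guaranteed to be below p = 0.015625 here, i.e. the two
# distributions are nearly indistinguishable.
```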
Correction

The mistake in my earlier `magic_proposed` is in how the three outputs are generated. The goal of the `magic` function is to produce the attributed total together with the new `*sz` and `*wait` state, jointly following the same distribution as in the per-byte Bernoulli story.

In my earlier `magic_proposed`, each of the three quantities had a correct marginal distribution. However, I generated them separately, which means I implicitly assumed that they were statistically independent, but they're actually not: the number of successes within the allocation and the position of the last success are clearly correlated.

The correct way should be: in addition to the first success (at `*wait`), draw the number of trailing coinflip failures at the end of the allocation, which by the symmetry of the Bernoulli process is a (truncated) `Geometric(p) - 1`; if the last success is distinct from the first, count it, plus a `Binomial` over the bytes strictly in between. The trailing failures then determine the new state. Here I'm implicitly relying on the memorylessness of the underlying Bernoulli process, so that drawing fresh Geometric variables is legitimate.

The corresponding version of the `magic` function is:

```c
size_t magic_revised(size_t *sz, size_t *wait) {
  size_t n_attr = 1;
  size_t remaining_sz = Geometric(p) - 1;
  if (remaining_sz < *sz - *wait) {
    n_attr += 1 + Binomial(*sz - *wait - remaining_sz - 1, p);
  } else {
    remaining_sz = *sz - *wait;
  }
  *sz = remaining_sz;
  *wait = remaining_sz + Geometric(p);
  return n_attr * 1/p;
}
```
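A quick Monte Carlo check (an illustration with arbitrary constants, not jemalloc code) of the decomposition underlying `magic_revised`: conditioned on the first success landing `W` bytes into an `N`-byte allocation, the direct per-byte count and the compressed count should have the same distribution, and in particular the same mean `1 + (N - W) * p`:

```python
import math
import random

P = 1 / 16       # success probability (larger than jemalloc's, to keep trials cheap)
N, W = 200, 40   # allocation size and position of the first success (arbitrary)
TRIALS = 40000
rng = random.Random(3)

def geometric(p):
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1

def binomial(n, p):
    return sum(1 for _ in range(n) if rng.random() < p)

def direct_count():
    # Per-byte story: first success at byte W, then flip coins for the
    # remaining N - W bytes and count every success in the allocation.
    return 1 + binomial(N - W, P)

def revised_count():
    # magic_revised's decomposition: draw a geometric gap; if a further
    # success lands inside the allocation, count it plus a Binomial over
    # the bytes in between.
    n_attr = 1
    remaining = geometric(P) - 1
    if remaining < N - W:
        n_attr += 1 + binomial(N - W - remaining - 1, P)
    return n_attr

mean_direct = sum(direct_count() for _ in range(TRIALS)) / TRIALS
mean_revised = sum(revised_count() for _ in range(TRIALS)) / TRIALS
# Both means should be close to 1 + (N - W) * P = 11.
```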
Taking deattribution into account

I realized that I forgot the other side of the coin: when memory is released, we need to "deattribute", so that our answer to the allocation attribution question is always up to date. The deattribution counterparts for `attribute_exact` and `attribute_Bernoulli` are:

```c
void deattribute_exact(size_t sz) {
  deattribute(get_bt(), sz);
}
```

and

```c
void deattribute_Bernoulli(size_t sz) {
  for (size_t i = 0; i < sz; ++i) {
    if (Bernoulli(p)) {
      deattribute(get_bt(), 1/p);
    }
  }
}
```

To have a Geometric counterpart on the deallocation path, we need to track some state there as well. To do so, we can have another variable `advance`, recording the number of bytes since the most recent success:
```c
static size_t wait = Geometric(p);
static size_t advance = 0;

void attribute_Geometric_symmetric(size_t sz) {
  while (sz >= wait) {
    attribute(get_bt(), 1/p);
    sz -= wait;
    wait = Geometric(p);
    advance = 0;
  }
  wait -= sz;
  advance += sz;
  assert(wait > 0);
}

void deattribute_Geometric_symmetric(size_t sz) {
  while (sz > advance) {
    deattribute(get_bt(), 1/p);
    sz -= advance;
    advance = Geometric(p);
    wait = 0;
  }
  advance -= sz;
  wait += sz;
  assert(wait > 0);
}
```

Note the asymmetries in the `wait` and `advance` manipulations between the two functions. The "compressed" version is as follows:
```c
static size_t wait = Geometric(p);
static size_t advance = 0;

void attribute_compressed_symmetric(size_t sz) {
  if (sz >= wait) {
    size_t n_attr = 1;
    size_t remaining_sz = Geometric(p) - 1;
    if (remaining_sz < sz - wait) {
      n_attr += 1 + Binomial(sz - wait - remaining_sz - 1, p);
    } else {
      remaining_sz = sz - wait;
    }
    attribute(get_bt(), n_attr * 1/p);
    sz = remaining_sz;
    wait = remaining_sz + Geometric(p);
    advance = 0;
  }
  wait -= sz;
  advance += sz;
  assert(wait > 0);
}

void deattribute_compressed_symmetric(size_t sz) {
  if (sz > advance) {
    size_t n_attr = 1;
    size_t remaining_sz = Geometric(p);
    if (remaining_sz < sz - advance) {
      n_attr += 1 + Binomial(sz - advance - remaining_sz - 1, p);
    } else {
      remaining_sz = sz - advance;
    }
    deattribute(get_bt(), n_attr * 1/p);
    sz = remaining_sz;
    advance = remaining_sz + Geometric(p) - 1;
    wait = 0;
  }
  advance -= sz;
  wait += sz;
  assert(wait > 0);
}
```

I've "inlined" the internal `magic`-style logic into both functions.
Please help me verify if my math is correct.

Digression: sample the current advancement

I was tempted to get rid of the new `advance` variable. My inspiration originated from the "memoryless" characteristic of Geometric variables: even if I don't record `advance`, I can sample it afresh at deallocation time, and the result is statistically no different. If I choose to sample rather than record the current advancement:

```c
void deattribute_Geometric_sample(size_t sz) {
  for (size_t advance = Geometric(p) - 1; sz > advance; advance = Geometric(p)) {
    deattribute(get_bt(), 1/p);
    sz -= advance;
    wait = 0;
  }
  wait += sz;
  assert(wait > 0);
}
```

It was great to see that the `advance` state can indeed be dropped. The "compressed" version becomes:

```c
void deattribute_compressed_sample(size_t sz) {
  size_t remaining_sz = Geometric(p);
  if (remaining_sz <= sz) {
    size_t n_attr = 1 + Binomial(sz - remaining_sz, p);
    deattribute(get_bt(), n_attr * 1/p);
    sz = remaining_sz;
    wait = 0;
  }
  wait += sz;
  assert(wait > 0);
}
```

It was not trivial, and hopefully I got it correct. Interestingly, the logic becomes much simpler, even without any explicit need for the `advance` variable. However, the logic simplicity is perhaps the only advantage of this approach: the common path of the deattribution logic now has an additional `Geometric` draw, which is typically more expensive than the simple increment/decrement bookkeeping it replaces.

In fact, I can even go more extreme, and replace the recorded `wait` state on the attribution side by sampling as well:

```c
void attribute_Geometric_sample(size_t sz) {
  for (size_t wait = Geometric(p); sz >= wait; wait = Geometric(p)) {
    attribute(get_bt(), 1/p);
    sz -= wait;
  }
}
```

and its "compressed" version degenerates to:

```c
void attribute_compressed_sample(size_t sz) {
  size_t n_attr = Binomial(sz, p);
  if (n_attr > 0) {
    attribute(get_bt(), n_attr * 1/p);
  }
}
```

Though seemingly clean, this logic means that each allocation must draw a Binomial (or a Poisson if we want to approximate), which is typically far more costly than the increment/decrement operations it saves.
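The degenerate `attribute_compressed_sample` is at least easy to check for unbiasedness. A Python sketch (an illustration with arbitrary constants, not jemalloc code):

```python
import random

P = 1 / 64     # sampling probability (arbitrary)
SZ = 1000      # allocation size (arbitrary)
TRIALS = 5000
rng = random.Random(5)

def binomial(n, p):
    return sum(1 for _ in range(n) if rng.random() < p)

def attribute_compressed_sample(sz):
    # One Binomial draw decides the allocation's whole sample count,
    # and each sampled byte is scaled up by 1/p.
    return binomial(sz, P) / P

mean_attr = sum(attribute_compressed_sample(SZ) for _ in range(TRIALS)) / TRIALS
# E[attributed] = sz * p * (1/p) = sz, so mean_attr should be close to 1000.
```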
Thanks to @davidtgoldblatt for the offline discussion. We have an upgraded version that is computationally more efficient. Let me first summarize everything in pseudo-code form and then explain in more detail:

```c
void attribute_upgraded(alloc_t *alloc) {
  static size_t wait = Geometric(p);
  size_t sz = get_alloc_sz(alloc);
  if (sz < wait) {
    set_alloc_sampled(alloc, false);
    wait -= sz;
  } else {
    bt_t *bt = get_bt();
    size_t bytes_to_sample = sz - wait;
    set_alloc_sampled(alloc, true);
    set_alloc_bt(alloc, bt);
    set_alloc_bytes_to_sample(alloc, bytes_to_sample);
    increase_bt_bytes_sampled(bt, 1);
    increase_bt_bytes_to_sample(bt, bytes_to_sample);
    wait = Geometric(p);
  }
  assert(wait > 0);
}

void deattribute_upgraded(alloc_t *alloc) {
  if (get_alloc_sampled(alloc)) {
    bt_t *bt = get_alloc_bt(alloc);
    size_t bytes_to_sample = get_alloc_bytes_to_sample(alloc);
    decrease_bt_bytes_sampled(bt, 1);
    decrease_bt_bytes_to_sample(bt, bytes_to_sample);
  }
}

size_t get_bt_attribution(bt_t *bt) {
  size_t bytes_sampled = get_bt_bytes_sampled(bt);
  size_t bytes_to_sample = get_bt_bytes_to_sample(bt);
  return (bytes_sampled + Binomial(bytes_to_sample, p)) * 1/p;
}
```

On a high level, this new version starts from an important methodology change: the deattribution path no longer samples bytes; instead, it simply acknowledges the samples drawn at allocation time. After all, even if deattribution sampled again, the outcome would just be another set of Bernoulli samples, which in expectation is no different from the set of Bernoulli samples originally drawn at allocation time. Moreover, by reducing the number of samples we draw, we're effectively reducing the variance of our final estimation. There are also a couple of computational savings: the deallocation path no longer draws any random numbers, and the expensive `Binomial` draw moves off the allocation path entirely.
One pretty clever optimization technique being used here is delayed sampling: when an allocation triggers a sample, the remaining `sz - wait` bytes are not Bernoulli-sampled right away; they are instead recorded as `bytes_to_sample`, and the corresponding `Binomial` draw is delayed until `get_bt_attribution` is actually called. There is one more minor correction in this new version: previously I made a mistake on the deallocation path. I should not have deattributed against the deallocation back trace, but rather against the allocation back trace, which is why the back trace is now recorded on the allocation itself.
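The upgraded scheme can be sanity-checked end to end with a single-back-trace simulation. The Python sketch below (an illustration, not jemalloc code; constants are arbitrary, and the per-`bt` bookkeeping collapses into two counters) should reproduce the total allocated bytes in expectation:

```python
import math
import random

P = 1 / 64
N_ALLOCS = 100000
rng = random.Random(11)

def geometric(p):
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1

def binomial(n, p):
    return sum(1 for _ in range(n) if rng.random() < p)

# Single-back-trace sketch of attribute_upgraded: a sampled allocation
# contributes one immediate sample plus sz - wait deferred bytes.
wait = geometric(P)
bytes_sampled = 0      # number of immediate samples
bytes_to_sample = 0    # deferred bytes, sampled lazily at read time
total = 0

for _ in range(N_ALLOCS):
    sz = rng.randrange(1, 257)    # allocation sizes 1..256 bytes (arbitrary)
    total += sz
    if sz < wait:
        wait -= sz
    else:
        bytes_sampled += 1
        bytes_to_sample += sz - wait
        wait = geometric(P)

# get_bt_attribution: the Binomial draw is deferred to read time.
estimate = (bytes_sampled + binomial(bytes_to_sample, P)) / P
ratio = estimate / total
# ratio should be close to 1.0 if the scheme is unbiased.
```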
Inference

I think it's beneficial to generalize `get_bt_attribution` a bit. On the memory allocator side, a single point estimate hides the uncertainty that the sampling introduces. I think the correct way is to treat the attribution problem as an inference problem: given the recorded `bytes_sampled` and `bytes_to_sample`, what can we infer about the total bytes allocated?

Assuming a uniform prior, which should be realistic in most cases, I went through the math and was able to derive the posterior distribution of the total bytes, along with the maximum a posteriori point estimate (which is equivalent to the maximum likelihood estimate in this case) and the posterior mean [formulas omitted].

In any case, to provide the maximal inference flexibility, we should give applications the full distribution, a.k.a. all three of `bytes_sampled`, `bytes_to_sample`, and `p`.
Here's our current approach for determining the next profiling sample interval (see `jemalloc/src/prof.c`, lines 438 to 476 at ea351a7):
According to the comment, we draw a random number from a geometric distribution with `p = 1/2^lg_prof_sample`. If we denote the random number as `I`, then the expectation is `E(I) = 1/p = 2^lg_prof_sample`, so on average, we sample every `2^lg_prof_sample` bytes.

The randomization is great for the purpose of avoiding any undesired synchronization between sampling and the application's allocation pattern. However, I find that this particular randomization mechanism has a couple of limitations.
Limitations
One key design principle of this byte-based sampling approach, in contrast to a frequency-based sampling approach, is: we want to catch all the large allocations. In other words, we want to make a guarantee in the form of "if an allocation is `K` bytes or larger, then we always sample it". However, we cannot make such a guarantee no matter how large `K` is, even though we often thought that we could use `2^lg_prof_sample` for `K`.

The probability mass function of a geometric distribution is strictly decreasing. For example, let `M = E(I) = 2^lg_prof_sample`; then the probabilities of `I` falling in `[0, 0.5*M]`, `[0.5*M, M]`, `[M, 1.5*M]`, `[1.5*M, 2*M]` and `[2*M, ∞)` are 39%, 24%, 14%, 9% and 14%, respectively. I find these numbers disturbing, especially the part that there's a roughly 40% chance that the sampling interval is below half of `2^lg_prof_sample`; applications would likely expect the distribution to be relatively more centered on `2^lg_prof_sample`.

We've seen needs for estimating the overall bytes allocated per stack trace from the sampled bytes, and I find it very hard. An example is our own leak checking facility:
(see `jemalloc/src/prof_data.c`, lines 999 to 1028 at ea351a7)
The computation relies on the claim that the probability of an allocation of size `X` being sampled is `1 - exp(-X/M)` (where `M = 2^lg_prof_sample`). I don't know how exactly we derived this. A likely source can be: https://github.com/google/pprof/blob/be90f332299a1d03fbb09bdbebdf28118ae300fb/profile/legacy_profile.go#L650-L674. However, the comment there seems to suggest that a Poisson distribution is used (perhaps for drawing the sampling interval?), whereas we use a geometric distribution.

No matter what distribution the interval `I` is generated from, the probability of an allocation of size `X` being sampled equals `E(X/max(I,X))`: when `I < X`, the probability is `1`; otherwise it's `X/I`. After working on it for a long while and consulting math experts, the conclusion is that `E(X/max(I,X))` is hard to express in closed form if `I` follows a geometric distribution. The best lower and upper bounds for `E(X/max(I,X))` I could get [formulas omitted] are such that the lower bound is strictly larger than `1 - exp(-X/M)`. In fact `1 - exp(-X/M)` only computes the probability of `I < X`, which would be correct if the allocation were always sampled when `I < X` (which is true) and never sampled otherwise (which is apparently false).

The computational steps for the randomization we have are not trivial, to the extent that we ideally should write some unit tests. However, I don't think it's easy to write one, which is probably the reason why we didn't write any.
Proposal
One way to resolve the limitations above is to use a simple uniform distribution for generating the sampling interval. For example, if a uniform distribution on the range `[0, 2*M]` is used (where `M = 2^lg_prof_sample`), the expectation `E(I)` is `M`, as before, and we get some nice properties:

- Sampling guarantee: if an allocation is `2*M` bytes or larger, then we always sample it.
- `E(X/max(I,X))` still cannot be exactly expressed in closed form, but it has much nicer lower and upper bounds, respectively `(X/2M) * max(1, log(2M) - log(X))` and `(X/2M) * (1 + log(2M) - log(X))`, if `X < 2*M`; otherwise `E(X/max(I,X))` is always `1`.

Another approach, alternative to the uniform distribution, is the Poisson distribution. It has the advantage of being more centered on its mean, but it doesn't have a sampling guarantee. The sampling ratio estimation is also tricky, and so is writing an efficient generator / tests. Therefore I think a simple uniform distribution is better.
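The proposal's properties can likewise be checked by direct summation. A Python sketch (an illustration; `M = 64` and `X = 32` are arbitrary, with `I` taken uniform on `{1, ..., 2M}`):

```python
import math

M = 64
X = 32                  # allocation size, X < 2*M (arbitrary example value)
TWO_M = 2 * M

# Sampling probability E(X/max(I,X)) for I uniform on {1, ..., 2M}.
prob = sum(min(1.0, X / i) for i in range(1, TWO_M + 1)) / TWO_M

lower = (X / TWO_M) * max(1.0, math.log(TWO_M) - math.log(X))
upper = (X / TWO_M) * (1 + math.log(TWO_M) - math.log(X))

# Sampling guarantee: an allocation of 2*M bytes or more is always sampled,
# since I never exceeds 2*M.
prob_large = sum(min(1.0, TWO_M / i) for i in range(1, TWO_M + 1)) / TWO_M
```

The exact probability lands between the two claimed bounds, and the guarantee for allocations of at least `2*M` bytes holds exactly.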