
Direct allocation attribution in profiling? #1751

Open
yinan1048576 opened this issue Jan 29, 2020 · 11 comments

@yinan1048576 (Contributor) commented Jan 29, 2020

Here's our current approach for determining the next profiling sample interval:

jemalloc/src/prof.c

Lines 438 to 476 in ea351a7

void
prof_sample_threshold_update(tsd_t *tsd) {
#ifdef JEMALLOC_PROF
    if (!config_prof) {
        return;
    }

    if (lg_prof_sample == 0) {
        thread_prof_sample_event_update(tsd,
            THREAD_EVENT_MIN_START_WAIT);
        return;
    }

    /*
     * Compute sample interval as a geometrically distributed random
     * variable with mean (2^lg_prof_sample).
     *
     *                         __        __
     *                        |  log(u)  |                     1
     * bytes_until_sample = | ---------- |, where p = ---------------
     *                        | log(1-p) |             lg_prof_sample
     *                                                2
     *
     * For more information on the math, see:
     *
     *   Non-Uniform Random Variate Generation
     *   Luc Devroye
     *   Springer-Verlag, New York, 1986
     *   pp 500
     *   (http://luc.devroye.org/rnbookindex.html)
     */
    uint64_t r = prng_lg_range_u64(tsd_prng_statep_get(tsd), 53);
    double u = (double)r * (1.0/9007199254740992.0L);
    uint64_t bytes_until_sample = (uint64_t)(log(u) /
        log(1.0 - (1.0 / (double)((uint64_t)1U << lg_prof_sample))))
        + (uint64_t)1U;
    thread_prof_sample_event_update(tsd, bytes_until_sample);
#endif
}

According to the comment, we draw a random number from a geometric distribution with p = 1/2^lg_prof_sample. If we denote the random number as I, then the expectation is E(I) = 1/p = 2^lg_prof_sample, so on average, we sample every 2^lg_prof_sample bytes.

The randomization is great for avoiding any undesired synchronization between sampling and the application's allocation pattern. However, I find that this particular randomization mechanism has a few limitations.

Limitations

  • Sampling guarantee

One key design principle of this byte-based sampling approach, in contrast to a frequency-based sampling approach, is that we want to catch all the large allocations. In other words, we want to make a guarantee in the form of "if an allocation is K bytes or larger, then we always sample it". However, we cannot make such a guarantee no matter how large K is, even though we have often assumed that K = 2^lg_prof_sample works.

  • Distribution skewness

The probability mass function of a geometric distribution is strictly decreasing. For example, let M = E(I) = 2^lg_prof_sample; then the probabilities of I falling in [0, 0.5*M], [0.5*M, M], [M, 1.5*M], [1.5*M, 2*M] and [2*M, ∞) are 39%, 24%, 14%, 9% and 14%, respectively. I find these numbers disturbing, especially the fact that there's a ~40% chance that the sampling interval is below half of 2^lg_prof_sample; applications would likely expect the distribution to be more centered on 2^lg_prof_sample.
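
(For reference, these numbers follow directly from the geometric tail: with p = 1/M, P(I > t) = (1-p)^t ≈ exp(-t/M), so P(a*M < I <= b*M) ≈ exp(-a) - exp(-b), giving 1 - exp(-0.5) ≈ 39%, exp(-0.5) - exp(-1) ≈ 24%, exp(-1) - exp(-1.5) ≈ 14%, exp(-1.5) - exp(-2) ≈ 9%, and exp(-2) ≈ 14%.)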

  • Sampling ratio estimation

We've seen a need to estimate the overall bytes allocated per stack trace from the sampled bytes, and I find it very hard to do. An example is our own leak checking facility:

jemalloc/src/prof_data.c

Lines 999 to 1028 in ea351a7

static void
prof_leakcheck(const prof_cnt_t *cnt_all, size_t leak_ngctx,
    const char *filename) {
#ifdef JEMALLOC_PROF
    /*
     * Scaling is equivalent AdjustSamples() in jeprof, but the result may
     * differ slightly from what jeprof reports, because here we scale the
     * summary values, whereas jeprof scales each context individually and
     * reports the sums of the scaled values.
     */
    if (cnt_all->curbytes != 0) {
        double sample_period = (double)((uint64_t)1 << lg_prof_sample);
        double ratio = (((double)cnt_all->curbytes) /
            (double)cnt_all->curobjs) / sample_period;
        double scale_factor = 1.0 / (1.0 - exp(-ratio));
        uint64_t curbytes = (uint64_t)round(((double)cnt_all->curbytes)
            * scale_factor);
        uint64_t curobjs = (uint64_t)round(((double)cnt_all->curobjs) *
            scale_factor);
        malloc_printf("<jemalloc>: Leak approximation summary: ~%"FMTu64
            " byte%s, ~%"FMTu64" object%s, >= %zu context%s\n",
            curbytes, (curbytes != 1) ? "s" : "", curobjs, (curobjs !=
            1) ? "s" : "", leak_ngctx, (leak_ngctx != 1) ? "s" : "");
        malloc_printf(
            "<jemalloc>: Run jeprof on \"%s\" for leak detail\n",
            filename);
    }
#endif
}

The computation relies on the claim that the probability of an allocation of size X being sampled is 1 - exp(-X/M) (where M = 2^lg_prof_sample). I don't know exactly how we derived this. A likely source is:

https://github.com/google/pprof/blob/be90f332299a1d03fbb09bdbebdf28118ae300fb/profile/legacy_profile.go#L650-L674

However, the comment there seems to suggest that a Poisson distribution is used (perhaps for drawing the sampling interval?), whereas we use a geometric distribution.

No matter what distribution the interval I is generated from, the probability of an allocation of size X being sampled equals E(X/max(I,X)): when I < X, the probability is 1; otherwise it's X/I.

After working on it for a long while and consulting math experts, my conclusion is that E(X/max(I,X)) is hard to express in closed form if I follows a geometric distribution. The best lower and upper bounds for E(X/max(I,X)) I could get are, respectively,

1 - (1/2)exp(-X/M) - (1/4)exp(-2X/M) - (1/8)exp(-4X/M) - ...

and

1 - (1/2)exp(-2X/M) - (1/4)exp(-4X/M) - (1/8)exp(-8X/M) - ...

and the lower bound is strictly larger than 1 - exp(-X/M). In fact, 1 - exp(-X/M) only computes the probability of I < X, which would be correct if the allocation were always sampled when I < X (which is true) and never sampled otherwise (which is apparently false).
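
(For reference, this is where the exponential form comes from: for a geometric I with p = 1/M, P(I <= X) = 1 - (1-p)^X, which is approximately 1 - exp(-X/M) when p is small.)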

  • Testing

The computational steps for the randomization we have are not trivial, to the extent that we ideally should have some unit tests. However, I don't think it's easy to write one, which is probably why we never did.

Proposal

One way to resolve the limitations above is to use a simple uniform distribution for generating the sampling interval. For example, if a uniform distribution on the range [0, 2*M] is used (where M = 2^lg_prof_sample), the expectation E(I) is M, as before, and we get some nice properties:

  • Sampling guarantee: if an allocation is 2*M bytes or larger, then we always sample it.
  • Distribution skewness: the distribution is uniform rather than heavily skewed to the small values.
  • Sampling ratio estimation: E(X/max(I,X)) cannot be exactly expressed in closed form, but it has much nicer lower and upper bounds respectively as (X/2M) * max(1, log(2M) - log(X)) and (X/2M) * (1 + log(2M) - log(X)) if X < 2*M; otherwise E(X/max(I,X)) is always 1.
  • Testing: a uniform random number generator is already provided by the library; no need for additional tests.
  • "One more thing": the computation for generating the random number no longer needs floating point support. In other words, if the application doesn't request leak checking, jemalloc can be compiled with floating point support completely disabled.

Another approach, alternative to the uniform distribution, is the Poisson distribution. It has the advantage of being more centered on its mean, but it doesn't offer a sampling guarantee. The sampling ratio estimation is also tricky, and so is writing an efficient generator and tests. Therefore I think a simple uniform distribution is better.
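
For concreteness, here's a minimal sketch of what the uniform draw could look like; this is not jemalloc code, and prng_range_u64() stands for an assumed PRNG helper returning a uniform value in [0, range):

/* A sketch only: draw the next sampling interval uniformly from [1, 2*M],
 * where M = 2^lg_prof_sample, using integer arithmetic only. */
uint64_t uniform_sample_interval(uint64_t *prng_state) {
    uint64_t two_m = (uint64_t)2 << lg_prof_sample;
    /* +1 shifts the range to [1, 2*M], avoiding a zero-length interval;
     * the mean stays at M (plus a negligible 1/2). */
    return prng_range_u64(prng_state, two_m) + 1;
}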

@davidtgoldblatt (Member)

One key design principle of this byte-based sampling approach, in contrast to a frequency-based sampling approach, is: we want to catch all the large allocations.

I don't think this is correct. The goal of the current logic is that it is equivalent to:

void maybe_sample_stack_trace(size_t cur_allocation_size) {
    for (size_t i = 0; i < cur_allocation_size; ++i) {
        if (coinflip(1.0 / (2 ** lg_prof_sample))) {
            attribute_cost_to_stack(get_stacktrace(), 2 ** lg_prof_sample);
        }
    }
}

This explains the current logic: given i.i.d. coinflip outcomes, the geometric distribution gives the number of flips until the next heads, allowing us to avoid doing O(N) coinflips on an N-byte allocation. (The earliest usage of this trick that I could find a quick cite for is the cooperative bug isolation work (http://pages.cs.wisc.edu/~liblit/dissertation/dissertation.pdf), but I feel sure it's been around for longer.)
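
(Concretely: with i.i.d. coinflips of success probability p, the gap between successive heads satisfies P(gap = k) = (1-p)^(k-1) * p, i.e. it is Geometric(p), so one geometric draw replaces up to N individual coinflips.)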

Once we've picked an object, we need to know how many times the coinflip function would have come up heads within it; this explains why the Poisson is right for the unskewing. (This is one source of inaccuracy in the current logic, IIRC; we should really only count the "unaccounted for" bytes when doing the Poisson computation, but I think we don't.)

The current logic has the nice property[1] that the bytes attributed to a given stack trace are correct in expectation. It's not clear to me that there's an easy way of achieving that property using some other distribution for picking sampled allocations.

[1] Modulo the double-counting inaccuracy

@emeryberger

Just wanted to note that the idea of using a geometric distribution for quick random sampling definitely predates Ben (Liblit)'s thesis: it dates back to at least Vitter '87 (http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf). Blog post here, describing its reinvention by the blog author :). -- http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/

@yinan1048576 (Contributor, Author)

Thank you @davidtgoldblatt and @emeryberger for the explanations and the pointers, and thank you @davidtgoldblatt for the offline discussion. I want to summarize here:

First, I previously thought that the geometric distribution for generating the sample intervals was chosen because:

  • We want to randomize the profiling interval so as to avoid any undesired synchronization between sampling and the application's allocation pattern.
  • We happened to choose the geometric distribution among many types of distributions on positive integers, without any particularly strong argument.

Now I understand that the first point was correct but the second point was wrong: the geometric distribution was chosen following a very careful line of mathematical reasoning.

Also, my previous guess regarding the Poisson distribution was wrong: I guessed that it might be an (alternative) distribution being used for generating the random interval, but it is actually used to characterize the distribution of the estimated total bytes (per stack trace).

One thing I would like to do next is to go over some of the mathematics more carefully. Intuitively it makes sense to me, but I have not yet been totally convinced of the following -

The basis of the mathematical reasoning is the Bernoulli sampling of each allocation byte, which serves as the first step to transform the prohibitively expensive exact byte counting into a much cheaper statistical estimation. I imagine that there may be alternative ways of achieving the same goal. I hope to better understand:

  • What alternatives can there be (if any)?
  • What kind of mathematical advantage can the Bernoulli way give us, compared to alternatives (if any)?

Coming back to what I originally described in this issue, in particular the four limitations: they all still exist, but I now see them differently:

  • Distribution skewness and testing

I don't yet have a fundamentally better alternative to the geometric distribution. So we'd need to acknowledge the roots of the skewness and bear with it, and we'd also need to face the challenge of coming up with some good tests for it.

  • Sampling guarantee and sampling ratio estimation

One key understanding I've gained is that the ultimate goal of profiling is to answer the following allocation attribution question:

"What is the total bytes allocated by each stack trace?"

Previously, I thought that the sampling guarantee and sampling ratio estimation were two separate ends in themselves. Now, I see them as two parts trying to address the same allocation attribution question:

  • Sampling guarantee: the reason we want to make a guarantee in the form of "if an allocation is K bytes or larger, then we always sample it" is that we want our sampled bytes per stack trace to approximate the total bytes (since large allocations had better be attributed more faithfully).
  • Sampling ratio estimation: given that the sampled bytes are not correct attributions, we want to compute the correct attribution from the sampled bytes in some way.

And, more importantly, they both implicitly assume that we have to answer the attribution question based on the current sampling strategy.

Rather than immediately looking into these two parts individually, I think a better question to ask first is:

"Is there a more direct way of answering the allocation attribution question, without first routing to our current sampling strategy?"

There can be two places to start from:

  • Suppose we keep the current mechanism for determining the sampling interval, i.e. the byte-based geometric distribution: can we record something other than the plain allocation size, so that we can directly get an estimate of the total bytes in the end? In that case, there'd be no need for any sampling ratio estimation, and we wouldn't need to make any sampling guarantee on large allocations.
  • Suppose we keep reporting the plain allocation size for each sample: what other mechanism could we use to determine the sampling interval, so that the allocation attribution question can be answered more easily in the end? One apparent approach is a frequency-based sampling mechanism: the sampling ratio is fixed, and we can trivially obtain a correct allocation attribution answer in expectation, but the error bar would be higher for larger allocations (which appear less frequently), so the estimation results may even be worse than a biased estimate with less variance, like what we have now.

Of course, there's also the more challenging route of jumping completely out of our current sampling mechanism, and trying to come up with some brand new sampling mechanism to answer the allocation attribution question.

I hope to get some answers to these questions, and will update on this thread.

yinan1048576 changed the title from "Change randomization mechanism for profiling sample interval" to "Direct allocation attribution in profiling?" on Jan 30, 2020
@yinan1048576 (Contributor, Author) commented Feb 7, 2020

Here's the progress I've made so far -

On a high level, I'm still sticking to the mathematical basis of our current sampling strategy, i.e. the Bernoulli sampling of each allocation byte, and the rest of this write-up aims at answering the following question that I raised earlier:

Suppose we keep the current mechanism for determining the sampling interval, i.e. the byte-based geometric distribution: can we record something other than the plain allocation size, so that we can directly get an estimate of the total bytes in the end?

I'll first go over the background in a bit more detail, and then elaborate on my findings.

Background

We start from trying to answer the allocation attribution question in a direct and exact manner:

void attribute_exact(size_t sz) {
    attribute(get_bt(), sz);
}

Where get_bt() retrieves the backtrace.

Directly executing attribute_exact() would mean invoking get_bt() and attribute() for every allocation, which is too expensive. To improve on that, we rewrite attribute_exact() following a byte-wise Bernoulli transformation:

void attribute_Bernoulli(size_t sz) {
    for (size_t i = 0; i < sz; ++i) {
        if (Bernoulli(p)) {
            attribute(get_bt(), 1/p);
        }
    }
}

Where p = 1 / prof_interval.

This is basically what @davidtgoldblatt wrote earlier. attribute_Bernoulli() is equivalent to attribute_exact() on expectation: the expectation of the total attribution remains sz, because the expected attribution is p * 1/p = 1 byte for each of the sz bytes.

Directly executing attribute_Bernoulli() would make only O(sz * p) invocations of get_bt() and attribute(), but with the additional cost of running through an O(sz) loop per allocation, assuming Bernoulli(p) is O(1).

The main benefit of the Bernoulli transformation does not lie in the transformation itself: for independent and repeated Bernoulli samples, the interval we need to wait until the next success follows a geometric distribution with the same parameter p. In other words, rather than tediously obtaining every single Bernoulli sample, we just need to obtain geometric samples and use them as the waiting times:

static size_t wait = Geometric(p);
void attribute_Geometric(size_t sz) {
    while (sz >= wait) {
        attribute(get_bt(), 1/p);
        sz -= wait;
        wait = Geometric(p);
    }
    wait -= sz;
    assert(wait > 0);
}

Where Geometric(p) is the version of the geometric distribution on positive integers (Wikipedia lists two versions; the other is on non-negative integers), and wait > 0 is my pre- and post-condition for the function.
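
For concreteness, a Geometric(p) variate on positive integers can be drawn in O(1) by inversion, which is essentially what the prof_sample_threshold_update() code quoted at the top does; uniform_0_1() below is an assumed helper returning a uniform value in (0, 1):

size_t Geometric(double p) {
    /* Inversion: the smallest k >= 1 such that 1 - (1-p)^k >= u. */
    double u = uniform_0_1();
    return (size_t)floor(log(u) / log(1.0 - p)) + 1;
}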

attribute_Geometric() is exactly equivalent to attribute_Bernoulli(). The invocation frequency of get_bt() and attribute() remains O(sz * p), and the additional per-allocation cost is reduced to O(sz * p) as well, assuming Geometric(p) is O(1). This is great, except that sz * p is still unbounded, which can be a problem for huge allocations.

To tackle this challenge, we need to "compress" the while() loop, meaning that we need to find a way to obtain an equivalent sample for:

  • Total attribution
  • Remaining sz
  • New wait

More specifically, we want an O(1)-cost function size_t magic(size_t *sz, size_t *wait) that, given the current sz and wait as input parameters, returns a sample of the total attribution, and at the same time puts a sample of the remaining sz and a sample of the new wait back into the respective input parameters. With such a magic() function, we can have:

static size_t wait = Geometric(p);
void attribute_magic(size_t sz) {
    if (sz >= wait) {
        attribute(get_bt(), magic(&sz, &wait));
    }
    wait -= sz;
    assert(wait > 0);
}

Which has O(1) invocations of get_bt() and attribute(), plus O(1) additional cost.

In the current profiling logic, we implicitly have the magic() function defined as:

size_t magic_current(size_t *sz, size_t *wait) {
    size_t total_attr = *sz;
    *sz = 0;
    *wait = Geometric(p);
    return total_attr;
}

This computation is apparently wrong, for all three quantities. It can be an approximation if all allocations are very large, but in general it is very inaccurate, and thus leads to the tricky sampling ratio estimation challenge I described earlier.

My claim

After working on it for a while, I claim that the correct version of magic() is as follows:

size_t magic_proposed(size_t *sz, size_t *wait) {
    size_t total_attr = (1 + Binomial(*sz - *wait, p)) * 1/p;
    *sz = min(Geometric(p) - 1, *sz - *wait);
    *wait = *sz + Geometric(p);
    return total_attr;
}

And, if desired, Binomial(*sz - *wait, p) can be approximated by Poisson((*sz - *wait) * p). Further, if we only want a single allocation attribution number per bt in the end, we can take the expectation and replace the first line with size_t total_attr = *sz - *wait + 1/p.
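
(Sanity check on that expectation: with n = *sz - *wait, E[(1 + Binomial(n, p)) * 1/p] = (1 + n*p) * 1/p = n + 1/p, which matches the replacement line.)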

My proof goes as follows. I'll use sz and wait rather than *sz and *wait for convenience.

To prove my claim, I find it easier to go back to the per-byte Bernoulli sampling story: as soon as the while() loop is entered, there will be an immediate 1/p attribution, but after that, the loop is doing nothing but Bernoulli sampling on the sz - wait bytes:

  • Total attribution

The number of additional 1/p attributions (after the initial attribution) equals the number of Bernoulli successes from the sz - wait bytes, which follows the Binomial distribution.

  • Remaining sz

For any s in the range [0, sz - wait), after the while() loop finishes, the remaining sz equals s if and only if: (a) the (s+1)-th-to-last byte generated a Bernoulli success, and (b) all subsequent bytes (if any) generated Bernoulli failures. Thus, prob(remaining_sz = s) = p * (1-p)^s, which equals the probability mass function of Geometric(p) - 1. The rest of the probability mass of remaining_sz is all on sz - wait. Therefore I get min(Geometric(p) - 1, sz - wait).

  • New wait

wait always comes from Geometric(p), but the particular wait value after the while() loop finishes has to satisfy the condition remaining_sz < wait. Due to the nice "memoryless" property of the geometric distribution, the difference between the new wait and remaining_sz is also a Geometric(p).
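
(Concretely, memorylessness means P(G > a + b | G > a) = P(G > b) for G = Geometric(p): conditioned on the final wait exceeding remaining_sz, the excess wait - remaining_sz is a fresh Geometric(p), hence new_wait = remaining_sz + Geometric(p).)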

Next step

Please help me go over the math and check if there's anything I did wrong. If everything looks correct, I'll explore how we can implement attribute_magic() in jemalloc.

@davidtgoldblatt (Member)

I'm about to get on a plane, but this looks mostly correct to me (I'll try to take a closer look next week). One thing I'll note is that for large n and small p, the Poisson approximates the binomial closely, and is more computationally tractable, which I believe is the motivation for using it.

@davidtgoldblatt (Member)

Which you in fact noted in the original reply, sorry for the comment noise!

@yinan1048576 (Contributor, Author)

No worries. Yes, Poisson should be a very close approximation to Binomial in our case.

I've found a mistake in my math, so hold on, and I'll make some corrections early next week.

@yinan1048576 (Contributor, Author) commented Feb 10, 2020

Correction

The magic_proposed() function I wrote earlier was not right. I'll first try to explain why it's not right, and then propose my revised version.

The goal of the magic() function is to "compress" the while() loop in attribute_Geometric(). If we adopt the following definitions for the random variables:

  • n_attr: number of times attribute(get_bt(), 1/p) is invoked in the while() loop;
  • remaining_sz: the remaining sz when the while() loop ends; and
  • new_wait: the new wait we get after the while() loop ends.

Then our goal is to come up with an O(1) algorithm to draw a sample of all three random variables from their joint distribution.

In my earlier magic_proposed() function, I was correct in:

  • the marginal distribution of n_attr, being 1 + Binomial(sz - wait, p); and
  • the marginal distribution of remaining_sz, being min(Geometric(p) - 1, sz - wait).

However, I generated them separately, which means I implicitly assumed that they were statistically independent, but they're actually not. Some obvious observations:

  • When remaining_sz equals sz - wait, n_attr degenerates to the constant 1.
  • When n_attr equals 2, remaining_sz is uniformly distributed in the range [0, sz - wait - 1].

The correct way should be:

  • Generate remaining_sz from its marginal distribution, min(Geometric(p) - 1, sz - wait).
  • Generate n_attr from its conditional distribution given remaining_sz, which, if remaining_sz < sz - wait, should be 2 + Binomial(sz - wait - remaining_sz - 1, p).
  • Generate new_wait from its conditional distribution given remaining_sz, remaining_sz + Geometric(p) (the same as what I had before).

Here I'm implicitly assuming that n_attr and new_wait are conditionally independent given remaining_sz, which I believe should be true.

The corresponding version of the magic() function is:

size_t magic_revised(size_t *sz, size_t *wait) {
    size_t n_attr = 1;
    size_t remaining_sz = Geometric(p) - 1;
    if (remaining_sz < *sz - *wait) {
        n_attr += 1 + Binomial(*sz - *wait - remaining_sz - 1, p);
    } else {
        remaining_sz = *sz - *wait;
    }
    *sz = remaining_sz;
    *wait = remaining_sz + Geometric(p);
    return n_attr * 1/p;
}
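
To make the verification easier, here is a quick Monte Carlo sketch under my assumptions (a standalone checking aid, not jemalloc code; rand() is used purely for illustration) that runs the literal while() loop from attribute_Geometric() and the compressed logic above side by side, and compares the empirical means:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define P 0.01 /* per-byte sampling probability */

/* Geometric(P) on {1, 2, ...} via inversion. */
static size_t geometric(void) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return (size_t)floor(log(u) / log(1.0 - P)) + 1;
}

/* Naive Binomial(n, P); fine for a checking aid. */
static size_t binomial(size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        k += ((double)rand() / ((double)RAND_MAX + 1.0)) < P;
    }
    return k;
}

int main(void) {
    const size_t sz = 1000, wait = 1; /* entry condition sz >= wait holds */
    const int trials = 1000000;
    double loop_n = 0, loop_rem = 0, magic_n = 0, magic_rem = 0;
    for (int t = 0; t < trials; t++) {
        /* The literal loop from attribute_Geometric(). */
        size_t s = sz, w = wait, n = 0;
        while (s >= w) {
            n++;
            s -= w;
            w = geometric();
        }
        loop_n += n;
        loop_rem += s;
        /* The compressed logic from magic_revised(). */
        size_t n_attr = 1, rem = geometric() - 1;
        if (rem < sz - wait) {
            n_attr += 1 + binomial(sz - wait - rem - 1);
        } else {
            rem = sz - wait;
        }
        magic_n += n_attr;
        magic_rem += rem;
    }
    printf("loop:  E[n_attr] = %f, E[remaining_sz] = %f\n",
        loop_n / trials, loop_rem / trials);
    printf("magic: E[n_attr] = %f, E[remaining_sz] = %f\n",
        magic_n / trials, magic_rem / trials);
    return 0;
}

(Comparing only the means won't catch every joint-distribution mistake, but it's a cheap first check; higher moments and the joint correlations can be compared the same way.)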

@yinan1048576 (Contributor, Author) commented Feb 14, 2020

Taking deattribution into account

I realized that I forgot the other side of the coin: when memory is released, we need to "deattribute", so that our answer to the allocation attribution question stays up-to-date.

The deattribution counterparts for attribute_exact() and attribute_Bernoulli() are trivial, i.e.

void deattribute_exact(size_t sz) {
    deattribute(get_bt(), sz);
}

and

void deattribute_Bernoulli(size_t sz) {
    for (size_t i = 0; i < sz; ++i) {
        if (Bernoulli(p)) {
            deattribute(get_bt(), 1/p);
        }
    }
}

To have deattribute_Geometric(), we need to know the current amount of advancement since the last attribute(get_bt(), 1/p) call, so that we know when to call deattribute(get_bt(), 1/p).

To do so, we can have another variable advance that records the current amount of advancement: it is "symmetric" to wait, except that:

  • advance can be 0 (in fact, it starts at 0), and
  • while advance is being rolled back, whenever it exactly hits 0, there should be no immediate deattribute() call.

static size_t wait = Geometric(p);
static size_t advance = 0;

void attribute_Geometric_symmetric(size_t sz) {
    while (sz >= wait) {
        attribute(get_bt(), 1/p);
        sz -= wait;
        wait = Geometric(p);
        advance = 0;
    }
    wait -= sz;
    advance += sz;
    assert(wait > 0);
}

void deattribute_Geometric_symmetric(size_t sz) {
    while (sz > advance) {
        deattribute(get_bt(), 1/p);
        sz -= advance;
        advance = Geometric(p);
        wait = 0;
    }
    advance -= sz;
    wait += sz;
    assert(wait > 0);
}

Note the asymmetries in the while() condition and in the final assertion.

The "compressed" version is as follows:

static size_t wait = Geometric(p);
static size_t advance = 0;

void attribute_compressed_symmetric(size_t sz) {
    if (sz >= wait) {
        size_t n_attr = 1;
        size_t remaining_sz = Geometric(p) - 1;
        if (remaining_sz < sz - wait) {
            n_attr += 1 + Binomial(sz - wait - remaining_sz - 1, p);
        } else {
            remaining_sz = sz - wait;
        }
        attribute(get_bt(), n_attr * 1/p);
        sz = remaining_sz;
        wait = remaining_sz + Geometric(p);
        advance = 0;
    }
    wait -= sz;
    advance += sz;
    assert(wait > 0);
}

void deattribute_compressed_symmetric(size_t sz) {
    if (sz > advance) {
        size_t n_attr = 1;
        size_t remaining_sz = Geometric(p);
        if (remaining_sz < sz - advance) {
            n_attr += 1 + Binomial(sz - advance - remaining_sz - 1, p);
        } else {
            remaining_sz = sz - advance;
        }
        deattribute(get_bt(), n_attr * 1/p);
        sz = remaining_sz;
        advance = remaining_sz + Geometric(p) - 1;
        wait = 0;
    }
    advance -= sz;
    wait += sz;
    assert(wait > 0);
}

I've "inlined" the internal magic() function for simplicity. Note the asymmetries in:

  • the if condition;
  • the sampling for remaining_sz; and
  • the sampling for the new wait/advance.

Please help me verify if my math is correct.

Digression: sample the current advancement

I was tempted to get rid of the new advance variable, together with the additional increments/decrements on the common path it introduced. It turned out to be a bad idea, but let me still try to summarize my attempt here, as a record / lesson.

My inspiration originated from the "memoryless" characteristic of Geometric variables: even if I don't record advance, I can obtain an unbiased sample of it, which is simply a Geometric(p) - 1. On the other hand, in the case of rolling back a huge sz, there was no way to record all the past intervals, so I'd have to sample them anyway.

If I choose to sample rather than record the current advance, then there's no need for the static advance variable, the attribution logic reduces back to my original attribute_Geometric(), and the deattribution logic becomes:

void deattribute_Geometric_sample(size_t sz) {
    for (size_t advance = Geometric(p) - 1; sz > advance; advance = Geometric(p)) {
        deattribute(get_bt(), 1/p);
        sz -= advance;
        wait = 0;
    }
    wait += sz;
    assert(wait > 0);
}

It was great to see that advance becomes a function local variable, or more precisely, a loop local variable.

The "compressed" version becomes:

void deattribute_compressed_sample(size_t sz) {
    size_t remaining_sz = Geometric(p);
    if (remaining_sz <= sz) {
        size_t n_attr = 1 + Binomial(sz - remaining_sz, p);
        deattribute(get_bt(), n_attr * 1/p);
        sz = remaining_sz;
        wait = 0;
    }
    wait += sz;
    assert(wait > 0);
}

It was not trivial, and hopefully I got it correct. Interestingly, the logic becomes much simpler, even without any explicit need for the advance variable, because it was completely "encapsulated" within the Binomial sampling.

However, logical simplicity is perhaps the only advantage of this approach: the common path of the deattribution logic incurs an additional Geometric(p) draw, which, even though most likely an O(1) operation, is typically much more costly than an increment/decrement. So it's not worthwhile.

In fact, I can go even more extreme and replace the wait variable by sampling as well:

void attribute_Geometric_sample(size_t sz) {
    for (size_t wait = Geometric(p); sz >= wait; wait = Geometric(p)) {
        attribute(get_bt(), 1/p);
        sz -= wait;
    }
}

and its "compressed" version is degenerated to:

void attribute_compressed_sample(size_t sz) {
    size_t n_attr = Binomial(sz, p);
    if (n_attr > 0) {
        attribute(get_bt(), n_attr * 1/p);
    }
}

Though seemingly clean, this logic means that each allocation must draw a Binomial (or a Poisson, if we want to approximate), which is typically far more costly than the increment/decrement operations it saves.

@yinan1048576 (Contributor, Author) commented Feb 26, 2020

Thanks to @davidtgoldblatt for the offline discussion. We have an upgraded version that is computationally more efficient. Let me first summarize everything in pseudo-code form and then explain in more detail:

void attribute_upgraded(alloc_t *alloc) {
    static size_t wait = Geometric(p);

    size_t sz = get_alloc_sz(alloc);
    if (sz < wait) {
        set_alloc_sampled(alloc, false);
        wait -= sz;
    } else {
        bt_t *bt = get_bt();
        size_t bytes_to_sample = sz - wait;
        set_alloc_sampled(alloc, true);
        set_alloc_bt(alloc, bt);
        set_alloc_bytes_to_sample(alloc, bytes_to_sample);
        increase_bt_bytes_sampled(bt, 1);
        increase_bt_bytes_to_sample(bt, bytes_to_sample);
        wait = Geometric(p);
    }
    assert(wait > 0);
}

void deattribute_upgraded(alloc_t *alloc) {
    if (get_alloc_sampled(alloc)) {
        bt_t *bt = get_alloc_bt(alloc);
        size_t bytes_to_sample = get_alloc_bytes_to_sample(alloc);
        decrease_bt_bytes_sampled(bt, 1);
        decrease_bt_bytes_to_sample(bt, bytes_to_sample);
    }
}

size_t get_bt_attribution(bt_t *bt) {
    size_t bytes_sampled = get_bt_bytes_sampled(bt);
    size_t bytes_to_sample = get_bt_bytes_to_sample(bt);
    return (bytes_sampled + Binomial(bytes_to_sample, p)) * 1/p;
}

On a high level, this new version starts from an important methodology change: the deattribution path no longer samples bytes; instead, it simply acknowledges the samples drawn at allocation time. After all, even if deattribution sampled again, the outcome would just be another set of Bernoulli samples, which in expectation is no different from the set originally drawn at allocation time. Moreover, by reducing the number of samples we draw, we effectively reduce the variance of our final estimation.

There are a couple of computational savings:

  • The advance variable is not needed anymore, since its main usage was to track the deallocation sampling.
  • remaining_sz can be removed, as well as its sampling logic. remaining_sz was necessary on the allocation path only because it was used to compute the next value of advance. (I realized that, had I been considering only the allocation path, I could have gotten rid of remaining_sz via some trivial tricks, and the logic would have been significantly simplified, since n_attr and new_wait could have been independently generated from their marginal distributions.)
  • We don't need to touch wait at all on the deallocation path. wait is used only for determining the samples for allocations. To make this point really clear, I relocated wait so that it lives inside attribute_upgraded().

One pretty clever optimization technique used here is delayed sampling: whenever an allocation triggers a sample, the remaining sz - wait bytes are added to a to-be-sampled count for the corresponding backtrace, rather than being used directly to draw a Binomial sample. Conversely, when a sampled allocation is freed, the same amount is subtracted. The outcome is that we only need to draw a single Binomial sample when we want the attribution number (i.e. when calling get_bt_attribution()), rather than one Binomial sample per sampled allocation.
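
(Delayed sampling is statistically valid because independent Binomial counts with the same p are additive: Binomial(n1, p) + Binomial(n2, p) is distributed as Binomial(n1 + n2, p). So accumulating the to-be-sampled byte counts and drawing once at query time yields the same distribution as drawing one Binomial per sampled allocation and summing.)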

There is one more minor correction in this new version: previously, on the deallocation path, I deattributed to the deallocation backtrace, when I should have deattributed to the allocation backtrace.

@yinan1048576 (Contributor, Author)

Inference

I think it's beneficial to generalize get_bt_attribution(): what it currently returns is a point estimate, which doesn't give application owners the whole picture. For example, in many cases it would be helpful to also know the variance, or, say, a 95% confidence interval.

On the memory allocator side, a point estimate obtained using p = 50% is much more trustworthy than the same point estimate obtained using p = 5%, and we should be able to quantify that, especially since using p = 50% incurs a performance cost that we want to get credit for.

I think the correct way is to treat the attribution problem as an inference problem:

Given bytes_sampled and bytes_to_sample for a particular stack trace, what is the probabilistic distribution of the total bytes allocated at this stack trace?

Assuming a uniform prior, which should be realistic in most cases, I went through the math and was able to show that the total bytes follow the distribution

bytes_to_sample + NegBinomial(bytes_sampled + 1, p) - 1

where NegBinomial(k, p) characterizes the distribution of the number of i.i.d. Bernoulli(p) samples we need to draw until we've collected k heads. It corresponds to the "n trials, given k successes" formulation on Wikipedia. For example, NegBinomial(1, p) is equivalent to Geometric(p), and NegBinomial(k, p) is equivalent to the sum of k independent Geometric(p) variables.
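
(In that formulation, the probability mass function is P(NegBinomial(k, p) = n) = C(n-1, k-1) * p^k * (1-p)^(n-k) for n >= k.)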

The maximum a posteriori point estimate, which is equivalent to the maximum likelihood estimate in this case, would be bytes_to_sample + bytes_sampled * 1/p, which is identical to the expectation of (bytes_sampled + Binomial(bytes_to_sample, p)) * 1/p in my get_bt_attribution().

(To be more exact, the point estimate should be bytes_to_sample + floor(bytes_sampled * 1/p), and when bytes_sampled * 1/p is an integer, there are two answers: bytes_to_sample + bytes_sampled * 1/p as well as bytes_to_sample + bytes_sampled * 1/p - 1.)

The mean of the distribution can also be useful, which is:

bytes_to_sample + bytes_sampled * 1/p + (1/p - 1)

It is 1/p - 1 larger than the point estimate above, which makes sense: for example, if bytes_sampled is 0, then, even though 0 bytes has the highest likelihood of generating 0 samples, the number of bytes generating 0 samples can be any non-negative value, and its mean is 1/p - 1. I personally prefer the mean over the point estimate.

In any case, to provide maximal inference flexibility, we should give applications the full distribution, i.e. all three of bytes_to_sample, bytes_sampled, and p, so that applications can meet whatever inference goals they have. For convenience, we can also provide some easy numbers, e.g. the point estimate, the mean, and the variance (which is (bytes_sampled + 1) * (1-p)/p^2), since most applications might just want something simple.
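
For example, a convenience API could look roughly like the following sketch; the struct and function names here are hypothetical, the bt helpers are the ones from attribute_upgraded() above, and the formulas are the ones derived in this comment:

typedef struct {
    size_t bytes_sampled;   /* sample count recorded for this bt */
    size_t bytes_to_sample; /* delayed-sampling byte count for this bt */
    double p;               /* per-byte sampling probability */
    double mean;            /* posterior mean */
    double variance;        /* posterior variance */
} bt_attr_summary_t;

void get_bt_attr_summary(bt_t *bt, bt_attr_summary_t *out) {
    out->bytes_sampled = get_bt_bytes_sampled(bt);
    out->bytes_to_sample = get_bt_bytes_to_sample(bt);
    out->p = p;
    /* mean = bytes_to_sample + bytes_sampled * 1/p + (1/p - 1) */
    out->mean = (double)out->bytes_to_sample +
        (double)out->bytes_sampled / p + (1.0 / p - 1.0);
    /* variance = (bytes_sampled + 1) * (1-p) / p^2 */
    out->variance = ((double)out->bytes_sampled + 1.0) * (1.0 - p) / (p * p);
}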
