
Add probability sampler details #331

Closed · wants to merge 5 commits

Conversation

@c24t (Member) commented Oct 22, 2019:

This PR adds details on the ProbabilitySampler sampling algorithm, following the implementation in open-telemetry/opentelemetry-python#225.

One big change in this PR is that it recommends making the sampling decision based on the leftmost bytes of the trace ID, which should be random according to https://www.w3.org/TR/trace-context/#trace-id. All OT and OC clients currently use the rightmost bytes; see, e.g., the Java implementation.

It also doesn't specify the number of bytes the sampler should consider, even though leaving this up to the implementation would mean getting different sampling decisions for the same trace ID in different clients. After chatting with @bogdandrutu, it sounds like the spec should prescribe this number. This is difficult since the Java client currently uses 63 bits so it can store an internal upper bound as a Long, which seems like an insane thing to require other languages to do.

This PR only affects the description of the sampling algorithm, but there's more work to do elsewhere in the doc to fully describe the ProbabilitySampler.

- `rate`: the sampling rate, in `[0.0, 1.0]`
- `precision`: the number of bits of the trace ID to consider, in `[1, 128]`
- `traceID`: the integer trace ID of the span to be created, assuming
  big-endian byte order
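
For concreteness, here is a minimal Python sketch of the decision these three parameters describe, assuming the trace ID is a 16-byte value interpreted as a big-endian unsigned integer; the class and method names are illustrative, not taken from the PR or the SDK:

```python
# Illustrative sketch only: names and structure are hypothetical,
# not prescribed by the spec or the SDK.
class ProbabilitySampler:
    def __init__(self, rate: float, precision: int = 64):
        # Precompute an upper bound on the leading `precision` bits.
        self.precision = precision
        self.bound = round(rate * (1 << precision))

    def should_sample(self, trace_id: bytes) -> bool:
        # Interpret the trace ID as a big-endian unsigned integer and
        # compare its leading `precision` bits against the bound.
        as_int = int.from_bytes(trace_id, "big")
        leading = as_int >> (len(trace_id) * 8 - self.precision)
        return leading < self.bound


sampler = ProbabilitySampler(rate=0.25)
print(sampler.should_sample(bytes.fromhex("8e92cf3ef4bd11e98027c8f750bf624f")))
```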
@c24t (Member Author):

Weirdly, the W3C spec says "vendors MUST keep the random part of trace-id ID on the left side" without taking a stance on trace ID endianness, so the random part may be high- or low-order bytes. We may just want to declare trace ID to be a big-endian unsigned int somewhere and never worry about this again.

@Oberon00 (Member) commented Oct 22, 2019:

The W3C spec is not really "weird"; it just follows common algorithms for UUIDs: https://en.wikipedia.org/wiki/Universally_unique_identifier. We should ensure that our specified algorithm works with those. E.g. consider:

```python
In [10]: uid = uuid.uuid1(); print(uid, hex(int(uid)))
8e92cf3e-f4bd-11e9-8027-c8f750bf624f 0x8e92cf3ef4bd11e98027c8f750bf624f

In [11]: uid = uuid.uuid1(); print(uid, hex(int(uid)))
8ee59394-f4bd-11e9-8353-c8f750bf624f 0x8ee59394f4bd11e98353c8f750bf624f

In [12]: uid = uuid.uuid1(); print(uid, hex(int(uid)))
8f15eae2-f4bd-11e9-bf05-c8f750bf624f 0x8f15eae2f4bd11e9bf05c8f750bf624f

In [13]: uid = uuid.uuid1(); print(uid, hex(int(uid)))
8f389cc8-f4bd-11e9-9c8f-c8f750bf624f 0x8f389cc8f4bd11e99c8fc8f750bf624f
```

@yurishkuro (Member) commented Oct 22, 2019:

The problem of endianness is only introduced in this PR because it attempts to interpret the byte sequence as a number. The W3C spec does not make such assumptions, so the question of endianness does not arise there.

@c24t (Member Author):

@Oberon00 this actually doesn't work with v1 UUIDs: their leftmost portion is set from the timestamp, and the proposed algorithm assumes it's random.

@Oberon00 (Member) commented Oct 23, 2019:

Maybe that assumption is too much. The W3C spec also doesn't even recommend using random values: https://www.w3.org/TR/trace-context/#trace-id.

@c24t (Member Author):

> Because tracing systems may make sampling decisions based on the value of trace-id, for increased interoperability vendors MUST keep the random part of trace-id ID on the left side.

Not the entire value, just the part we use to make the sampling decision.

@Oberon00 (Member) commented Oct 25, 2019:

The sentence you quoted only applies if the tracing system operates with IDs that are shorter than 128/64 bits. A tracing system may use the whole 128/64 bits and fill them with deterministic data that is unique (e.g. a globally unique process identifier plus a sequential ID for the created trace) but not random.

A Member commented:

Whoops, not exactly. It applies to:

> Many unique identification generation algorithms create IDs where one part of the value is constant (often time- or host-based), and the other part is a randomly generated value.

So it merely says: "If you have a random part in your ID, keep it left." The spec also fails to state whether that random value should be chosen per-process, per-tracer, or per-something-else.

@yurishkuro (Member) left a comment:

Lacks motivation for why this algorithm must be required.

```diff
@@ -122,8 +122,67 @@ These are the default samplers implemented in the OpenTelemetry SDK:

 #### Probability Sampler algorithm

-TODO: Add details about how the probability sampler is implemented as a function
-of the `TraceID`.
+The `ProbabilitySampler` makes a deterministic sampling decision for each span
```
A Member commented:

Basing the algorithm on trace id should not be a requirement. It's an optimization that is possible in some languages, but could be pretty expensive in others, considering that trace id is just a blob of bytes that needs to be interpreted as a number.

A Member commented:

I think this is not a performance optimization (at least not primarily). Using the trace ID as the only input (and the same algorithm) ensures that all spans of the trace have the same sampling decision, which is required to avoid broken traces.

A Contributor commented:

I would assume that the root span of a trace might use a probability sampler, and all the descendants would use the sampled bit from context, so there's only one decision, and I agree with @yurishkuro.

@c24t (Member Author):

I'd also expect the sampler to respect the sampled bit by default, but we say here that it should be possible to configure the sampler to apply to all spans, or any span with a remote parent.

L119 actually says the default behavior should be to make a new sampling decision for spans with remote parents, but I don't know the motivation for this, and I think there's a good argument for restricting this to root spans by default.

One benefit of this sampling algorithm is that lower-probability samplers always sample a subset of traces of higher-probability samplers. Because of this, each service could set its own sampling rate and we'd still get complete traces for a percentage of requests equal to the lowest sampling rate. For example, if we have three services with sampling rates .3, .2, and .1 respectively, we'd get complete traces for .1 of requests. With independent random samplers we'd get traces for .006 of requests.

Even if we didn't want to make it possible to change the sampling rate along the call chain, there are some clear benefits to making the sampling decision deterministic. E.g. checking that telemetry is working.
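
The subset property is easy to check empirically. Here is a quick sketch under the same assumptions as the earlier example (trace IDs as big-endian integers, decision on the leading 64 bits; the `sampled` helper is hypothetical):

```python
import os

def sampled(trace_id: bytes, rate: float, precision: int = 64) -> bool:
    # Same decision rule as the sketch above, for 16-byte trace IDs.
    leading = int.from_bytes(trace_id, "big") >> (128 - precision)
    return leading < round(rate * (1 << precision))

trace_ids = [os.urandom(16) for _ in range(100_000)]
at_10 = {t for t in trace_ids if sampled(t, 0.1)}
at_20 = {t for t in trace_ids if sampled(t, 0.2)}
at_30 = {t for t in trace_ids if sampled(t, 0.3)}

# Every trace sampled at the lowest rate is also sampled at the higher
# rates, so ~10% of traces would be complete across all three services.
assert at_10 <= at_20 <= at_30
```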

A Contributor commented:

These are good arguments. 👍

@c24t (Member Author):

> an integer comparison to check if sampling makes sense

As I understand it, @yurishkuro's complaint is that we have to convert the byte array into an int to do the comparison. FWIW an implementation could compute the bound as a byte array instead and do a left-to-right bytewise comparison (see this gist for an example in python), but it's not clear that this would be faster than converting to an int, especially if the language boxes the bytes to do the comparison.
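
The gist isn't reproduced in the thread, but a rough sketch of that bytewise variant might look like the following; everything here is an assumption for illustration, including the extra leading byte that lets `rate == 1.0` fit in the precomputed bound:

```python
class BytewiseProbabilitySampler:
    """Hypothetical sketch: precompute the bound as bytes, compare bytewise."""

    def __init__(self, rate: float, num_bytes: int = 8):
        self.num_bytes = num_bytes
        # One extra byte so that rate == 1.0 (bound == 2**64) still fits.
        bound = round(rate * (1 << (num_bytes * 8)))
        self.bound_bytes = bound.to_bytes(num_bytes + 1, "big")

    def should_sample(self, trace_id: bytes) -> bool:
        # Python compares bytes lexicographically, i.e. left to right,
        # so no int conversion of the trace ID is needed.
        prefix = b"\x00" + trace_id[: self.num_bytes]
        return prefix < self.bound_bytes
```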

@c24t (Member Author):

> Perhaps ProbabilitySampler should be renamed DeterministicProbabilitySampler to indicate that it's based on the TraceID?

DeterministicProbabilitySampler is a mouthful. I'd prefer PercentageSampler or RateSampler, but I think ProbabilitySampler is a fine description, and has the benefit of having the same name in OpenCensus.

> I would not be willing to draw statistical interpretations from a body of spans using deterministic sampling from clients written with different PRNGs.

I can think of two obvious risks for people doing statistical inference on sampled traces: one is that trace IDs aren't even nominally randomly distributed. For example, if the trace ID includes a timestamp fragment, the sampled request rate might look cyclical or bursty even if the actual request rate is constant. The other is that the service itself handles requests differently depending on the trace ID.

But these are bugs in the service or instrumentation, not problems of insufficient randomness.

AFAICT we don't actually need strong statistical randomness here. We just need the "random" part of the trace ID to be roughly uniformly distributed over all possible values, and not to be correlated with the stats we're monitoring (latency, error rate, etc.).

The fact that PRNGs are more P than R doesn't seem unusually bad in this context. And in any case a RandomProbabilitySampler would suffer from the same problems.

A Contributor commented:

How about if the sampler does not depend on the trace-id at all? There is no need to consistently select the same trace for sampling, since the sampling decision is mostly made when the root span is created.
The sampler can simply increment an atomic counter by 1 for every root span created. If the sampling rate is 0.01, then the trace is sampled if the count is a multiple of 100. The counter can be initialized to a random value once, so different instances/processes are not synchronized. This sounds more like a RateSampler, though.
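
A sketch of that counter-based alternative, with hypothetical names; under CPython, `itertools.count` advances atomically enough for illustration, but a real implementation would need a properly atomic counter:

```python
import itertools
import random

class CountingSampler:
    """Hypothetical sketch of the counter-based sampler described above."""

    def __init__(self, rate: float):
        self.period = round(1 / rate)  # e.g. rate 0.01 -> every 100th span
        # Random initial offset so separate processes aren't synchronized.
        self._counter = itertools.count(random.randrange(self.period))

    def should_sample(self) -> bool:
        # Sample whenever the running count hits a multiple of the period.
        return next(self._counter) % self.period == 0
```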

@c24t (Member Author):

I addressed this above in #331 (comment). We want to be able to make new sampling decisions for remote parent spans, which is why it's helpful to sample consistently.

Now that I read your description, RateSampler is clearly a better name for that than the behavior I'm describing.

A Member commented:

RateSampler would imply time dependency, like rate limiting; this is more of a RatioSampler.

- `rate`: the sampling rate, in `[0.0, 1.0]`
- `precision`: the number of bits of the trace ID to consider, in `[1, 128]`
- `traceID`: the integer trace ID of the span to be created, assuming
  big-endian byte order
A Member commented:

Would it be possible to describe an algorithm that operates on the traceID as bytes?

@c24t (Member Author):

Because we shouldn't describe the trace ID as an integer here when it appears in the API as a byte array, or because we want to avoid the conversion from bytes to int?

If it's just that the description is wrong then this is easy to reword. If we're trying to avoid the conversion then we could convert the bound to a byte array (once, at sampler creation time) and do a bytewise comparison with the leftmost bytes of the trace ID to make the sampling decision. It's not clear that this would be much of an improvement.

@c24t (Member Author):

I described keeping the trace ID a byte array in 3b3d321, but this makes the spec a bit more verbose and difficult to read. If this isn't what you were looking for I'm happy to revert it and find another way to say this.

@yurishkuro (Member) commented:

> One big change in this PR is that it recommends making the sampling decision based on the leftmost bytes of the trace ID, which should be random according to https://www.w3.org/TR/trace-context/#trace-id.

That specific part of the W3C spec is being seriously challenged right now: w3c/trace-context#344

There is no reason why the sampling algorithm needs to depend on which side of the byte array is more random; it can simply XOR the left and right 8 bytes into a single uint64.
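
As a sketch of that folding step (a hypothetical helper, not from the PR):

```python
def fold_trace_id(trace_id: bytes) -> int:
    # XOR the left and right 8-byte halves of a 16-byte trace ID into a
    # single uint64, so the decision doesn't care which half is random.
    left = int.from_bytes(trace_id[:8], "big")
    right = int.from_bytes(trace_id[8:], "big")
    return left ^ right
```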

@c24t (Member Author) commented Oct 26, 2019:

We can wait on w3c/trace-context#344 to resolve this.

> There is no reason why the sampling algorithm needs to depend on which side of the byte array is more random; it can simply XOR the left and right 8 bytes into a single uint64.

That sounds like a good approach. The only obvious drawback is that it makes the sampling algorithm less intuitive, since as it's written now you can tell at a glance whether a given trace ID should be sampled if you know the bound.

It would also mean that implementations can't choose to make a decision based on fewer bytes, but if we're looking at the leftmost bytes anyway we're only going to get a consistent sampling decision if every sampler is using the same length trace IDs.

@tedsuo (Contributor) commented Dec 18, 2019:

@c24t FYI the trace-context group has resolved the details on their end: w3c/trace-context#344

@jmacd (Contributor) commented Jan 22, 2020:

AI(c24t): Update this PR with the decision from W3C or try to move it forward.

@lizthegrey self-requested a review on April 28, 2020.
@Oberon00 (Member) commented:

Note that there are currently several PRs open for this sampler:

@carlosalberto added the spec:trace (Related to the specification/trace directory) label on Jun 19, 2020.
@carlosalberto added the release:required-for-ga (Must be resolved before GA release, or nice to have before GA) label on Jul 2, 2020.
@github-actions (bot) commented:

This PR was marked stale due to lack of activity. It will be closed in 7 days.

The github-actions bot added the Stale label on Aug 12, 2020.
@github-actions (bot) commented:

Closed as inactive. Feel free to reopen if this PR is still being worked on.

The github-actions bot closed this on Aug 19, 2020.
TuckTuckFloof pushed a commit to TuckTuckFloof/opentelemetry-specification that referenced this pull request Oct 15, 2020
Labels: area:sampling (Related to trace sampling) · area:sdk (Related to the SDK) · release:required-for-ga (Must be resolved before GA release, or nice to have before GA) · spec:trace (Related to the specification/trace directory) · Stale

9 participants