
Add OpenTelemetry performance benchmark spec #748

Merged (19 commits) on Nov 10, 2020

Conversation

ThomsonTan
Contributor

Documented the performance benchmark requirements for language libraries. The most important metrics, such as throughput, CPU, and memory, are included.

@ThomsonTan ThomsonTan requested a review from a team as a code owner July 28, 2020 14:54

### Number of outstanding events in local cache

If the remote endpoint is unavailable to the OpenTelemetry exporter, the library needs to cache 1M events before dropping them.

Member

Does this need to be a hard-coded number? I would expect this to be a configuration option with some reasonable default.

Member

This does not seem like a performance benchmark, but rather a requirement of the SDK?

Contributor Author

@tigrannajaryan I'll remove the hard-coded numbers, which could be hard to apply across languages and sustain over time.

@dyladan this spec is supposed to recommend that SDKs implement some common performance benchmarks (like the basic metrics listed here) which users could easily obtain (by running the benchmark program locally).


### CPU time

Under extreme workload, the library should not take more than 5% of total CPU time for event tracing, or more than 1 full logical core on a multi-core system.

Member

Guaranteeing this may require the library to monitor its own CPU usage and begin throttling, sampling, or otherwise limiting its own resource consumption.
Otherwise it is pretty easy to write an application that does nothing but make OpenTelemetry calls, in which case the library will consume close to 100% of CPU.


### Memory consumption

Under extreme workload, the peak working set increase caused by tracing should not exceed 100MB.

Member

Similarly to the number of events, I would expect this to be configurable if it is a limit that is honored by the library. However, it is not clear whether we want to put a limitation on the total memory usage. Unlike the number of events, the total memory usage of tracing is much more difficult to calculate accurately.

It is also not clear what the interaction between the limits on the number of events and on the memory usage is. Should the library begin dropping when either of these limits is hit? If that is the intent then it would be useful to call it out explicitly. This likely belongs to a functional specification of the libraries, not necessarily to the performance spec that this PR describes.

I think it will be better to have a separate document for functional requirements for libraries in the specification and refer to it as necessary from the performance specification document.

Contributor

Agree that providing numbers does not seem workable as a cross-language requirement: different languages have far too different performance characteristics. How about providing guidelines on what metrics should be looked at, without targets? We could also add tips and tricks about high-performance tracing. For example (and this is my opinion, but if others agree), observability is an area that deserves possibly gross micro-optimizations, since users don't want to pay a cost for observability; they want their money to go into serving their users. This file seems like a good avenue for providing observability best practices.

@reyang reyang added release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:miscellaneous For issues that don't match any other spec label labels Jul 29, 2020
@iNikem
Contributor

iNikem commented Jul 30, 2020

As @tigrannajaryan mentioned, do we want to measure overhead during production runtime and somehow react to it? Or are these guidelines for performance testing that happens at development time?


### Number of events that could be produced per second

For an application using the C/C++/Java library, it should be able to produce 10K events per second without dropping any event.

Member

Why the special mention of C/C++/Java?


### Number of events that could be produced per second

For an application using the C/C++/Java library, it should be able to produce 10K events per second without dropping any event.

Member

This is a bit vague. Events could be dropped by the exporter based on the specific exporter's needs/design. Is this about all the exporters, or only for the OTLP exporter?

Member

Also, please clarify what makes one event. Is it a span? A span with 10 attributes? A span with 50 attributes? A span with 0 attributes? Depending on the implementation, perf can vary, and unless it is explicitly documented here, everyone ends up with their own choice of event.


### Number of outstanding events in local cache

If the remote endpoint is unavailable to the OpenTelemetry exporter, the library needs to cache 1M events before dropping them.

Member

There is no mention of a local cache in any specs so far. It's up to individual exporters to deal with local storage, if any. Or is this specifically about the OTLP exporter?


## Benchmark Report

The language library implementations need to add the typical results of the above benchmarks to the release page.

Member
@cijothomas cijothomas Jul 30, 2020

If this can describe the type of application, along with the spans expected from it, then every language can implement the same type of application. Otherwise there will be no consistency.

Member
@cijothomas cijothomas left a comment

Left comments. The specs are a bit vague now.

@dyladan
Member

dyladan commented Aug 4, 2020

It can be difficult to measure exactly how much CPU and memory overhead is caused by tracing, and it can be wildly dependent on how the tracing is used. As mentioned by @tigrannajaryan, an application which generates 1-2 spans per request will have a much different tracing overhead than an application which generates tens or hundreds of spans per request. In order to put hard limits on CPU and memory overhead, too much is required to be known about the system being traced. In my opinion, it would be more useful to define requirements for a sample application, then using that sample application define throughput requirements as a percentage increase.

For example, you may require a sample app which generates 10 spans for each request it receives, each of which has 10 attributes with randomly generated data. You could then require that a load test on this particular sample application be able to handle 98% of the throughput of the same application without tracing enabled. This would provide a clear path to SIGs on how to implement and interpret a benchmark.
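
A minimal sketch of what such a harness could look like, assuming the OpenTelemetry Python SDK purely for illustration; the handler shape, attribute sizes, and request counts below are placeholders, not values this PR proposes. No span processor is attached, so only span creation/recording cost is compared; a real harness would pin the processor, exporter, and backend.

```python
# Hypothetical benchmark sketch: each "request" does some work wrapped in 10 spans,
# each span carrying 10 randomly generated string attributes; throughput is
# compared with and without tracing.
import random
import string
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("sample-app-benchmark")


def do_work() -> int:
    # Stand-in for the application's own work per operation.
    return sum(i * i for i in range(200))


def handle_request(tracing_enabled: bool) -> None:
    for i in range(10):  # 10 spans per request, per the proposal above
        if tracing_enabled:
            with tracer.start_as_current_span(f"operation-{i}") as span:
                for j in range(10):  # 10 random attributes per span
                    span.set_attribute(
                        f"attr.{j}",
                        "".join(random.choices(string.ascii_letters, k=16)),
                    )
                do_work()
        else:
            do_work()


def throughput(requests: int, tracing_enabled: bool) -> float:
    start = time.perf_counter()
    for _ in range(requests):
        handle_request(tracing_enabled)
    return requests / (time.perf_counter() - start)


baseline = throughput(1_000, tracing_enabled=False)
traced = throughput(1_000, tracing_enabled=True)
print(f"throughput retained with tracing enabled: {traced / baseline:.1%}")
```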

FWIW, I think it will be difficult or impossible to come up with a set of restrictions which makes sense to apply to all language SIGs.

@yurishkuro
Member

At one of the Zipkin workshops (around 2016), Spoons from Lightstep gave a presentation on the methodology of measuring tracing overhead under varying conditions of the application's own behavior. Perhaps @tedsuo can ask him to share that presentation.

@yurishkuro
Member

Also, Lightstep published a tool for benchmarking https://lightstep.com/blog/understanding-tracer-performance-with-lightstep-benchmarks/

@tigrannajaryan
Member

@ThomsonTan to make progress on this I think it is important to clarify what the intent of this PR is.

Is it to define performance requirements for OpenTelemetry SDK implementations or to define how to perform performance measurements in a consistent manner across different OpenTelemetry SDKs?

If the goal is to define performance requirements, I think we can discuss 2 distinct parts:

  1. Behavioral performance requirements, such as what happens when the application continuously generates spans as fast as it can, but the SDK is unable to send them to the configured destination. Is it OK to crash or OOM in this case, or is a certain failsafe behavior prescribed, such as queuing spans up to a certain number and then dropping the newly generated ones (or the old ones)? (A sketch of such queue-then-drop configuration follows this list.) There are a number of things that a well-behaved, production-ready library needs to do, and arguably we could make the expected behavior part of the specification.

  2. Requirement to consume resources within specified limits. In theory the limits can be part of the spec; however, in practice I find it a bit problematic, because resource consumption depends highly on the language and on the machine/environment where the benchmarking is performed. I think it is undesirable to make this part of the specification.
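
As an illustration of the queue-then-drop failsafe described in item 1, here is a sketch using the OpenTelemetry Python SDK's BatchSpanProcessor, whose queue and batching limits are configurable; the exporter choice and all numbers below are placeholders, not values prescribed by this PR.

```python
# Sketch: buffer spans up to a bound, then drop newly generated ones when the
# destination cannot keep up. The exporter and limits below are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        ConsoleSpanExporter(),       # stand-in for a real OTLP/backend exporter
        max_queue_size=2048,         # spans buffered before new spans are dropped
        schedule_delay_millis=5000,  # how often the worker flushes the queue
        max_export_batch_size=512,   # spans sent per export call
    )
)
trace.set_tracer_provider(provider)
```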

If the goal is to define the guidelines for performing SDK benchmarking then we need to define the list of scenarios about which we care the most. To me the important scenarios would be several that are within the normal operational modes (e.g. generating 1000 spans per second, sending via OTLP to a server which is fast enough not to block the SDK) plus some other scenarios which are abnormal (e.g. generating at a 10x higher rate than the SDK is capable of exporting).

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 26, 2020
@ThomsonTan
Contributor Author

@tigrannajaryan as discussed in the previous meeting, it is hard to define a consistent performance requirement across languages and hardware. The goal here is to ask SDKs to provide a consistent benchmark program so users can get a sense of what performance they could expect from a specific SDK. So we'd like to define performance measurement in a consistent manner across different OpenTelemetry SDKs, much like the Benchmarks we have in the OpenTelemetry .NET SDK.

This spec is meant to define the list of scenarios we and the users care about most; I'll incorporate the scenarios you listed into the spec.

@tigrannajaryan
Member

@ThomsonTan sounds good, please ping me when this is ready for another review round.

@github-actions github-actions bot removed the Stale label Sep 2, 2020
@ThomsonTan ThomsonTan requested a review from a team as a code owner September 8, 2020 08:41
@ThomsonTan
Contributor Author

> @ThomsonTan sounds good, please ping me when this is ready for another review round.

@tigrannajaryan I updated the doc based on the feedback here, could you please help review again?

Member
@tigrannajaryan tigrannajaryan left a comment

Thanks @ThomsonTan

I left a few comments and my assumption is that we are aiming to have comparable performance measurement results across languages. If that is not a goal then my comments may not be applicable.


### Create Span

Number of spans which could be created in 1 second per logical core and over all

Member

I think it is sufficient to measure this per one logical core. If we do want to have multicore measurements we have to specify how many cores to use so that results are comparable.

I think it is also necessary to specify some information about the composition of the Span, e.g. how many attributes, of what type and length. This can affect performance significantly.
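
A sketch of what a single-core Create Span measurement with a fixed, documented span composition could look like, assuming the OpenTelemetry Python SDK for illustration; the composition (10 string attributes of length 16) and iteration count are placeholders, not values agreed on in this thread.

```python
# Sketch: spans created per second on one logical core, with a fixed span
# composition so results are comparable across SDKs.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # no processor: isolates creation cost
tracer = trace.get_tracer("create-span-benchmark")

N = 100_000
ATTRIBUTES = {f"attr.{i}": "x" * 16 for i in range(10)}  # documented composition

start = time.perf_counter()
for _ in range(N):
    span = tracer.start_span("benchmark-span")
    for key, value in ATTRIBUTES.items():
        span.set_attribute(key, value)
    span.end()
elapsed = time.perf_counter() - start
print(f"{N / elapsed:,.0f} spans created per second on one logical core")
```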

Contributor

I think that when we go from 1 core to N cores, then we don't actually want to compare raw performance anymore. We want to compare scalability. If we have 1 core performance established, then it is our unit of work and moving to N cores we want to see the curve.

Contributor Author

Yes, measuring N cores will show the scalability relative to 1 core; also, there are big.LITTLE architectures, so performance over all cores is not N times that of a single core.


### CPU Usage

With a given event throughput specified by the user, e.g. 1000 spans per

Member

I suggest we don't use the word "event" here since it is an actual and different concept in OpenTelemetry (Span Events), which can be confusing.

Member

  1. I wonder if 1000 spans per second is too small a number for fast languages. For example, the Collector can receive/batch/export 10k spans per second using less than 10% of one CPU, and that is in Go. I expect C++ or Rust to be even more efficient, so the CPU usage may be too small for reliable measurement.
    On the other hand, I am worried that if we specify too large a number, slow languages may not be able to sustain it.
    If this turns out to be the case we may need to measure CPU usage at more than one rate to get useful results.

  2. We should probably also specify the duration of the test over which to measure the CPU usage.

  3. We need to specify which processors and exporters are used. If the exporter actually sends data then we need reasonable guarantees that the receiving end does not influence the performance of the SDK (if, for example, it does not receive quickly enough). Ideally all tests should use the same high-performance mock backend that receives and drops the data (a sketch of such a drop exporter follows this list).
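
One way to realize the "receives and drops the data" idea from item 3 is an in-process no-op exporter; a sketch assuming the OpenTelemetry Python SDK for illustration. A separate mock collector process would be the stricter alternative, since it also exercises the wire path; this in-process variant is an assumption, not something the spec prescribes.

```python
# Sketch: a receive-and-drop exporter so the receiving end cannot influence
# SDK measurements.
from typing import Sequence

from opentelemetry import trace
from opentelemetry.sdk.trace import ReadableSpan, TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SpanExporter,
    SpanExportResult,
)


class DropSpanExporter(SpanExporter):
    """Accepts every batch and discards it immediately."""

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass


provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(DropSpanExporter()))
trace.set_tracer_provider(provider)
```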

Contributor Author

Increased the measured throughput to 10k, which makes more sense on modern hardware. Added a minimum measurement duration of 1 min. Also added SimpleProcessor and the OTLP exporter to the test pipeline so different implementations could be compared apples to apples.

Contributor

Just to note, in Java there is a HUGE performance difference between pipelines with SimpleProcessor and BatchSpanProcessor. If we want to measure a production-like setup, I would argue against using SimpleProcessor.
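
For illustration, the two pipeline variants being contrasted, sketched with the OpenTelemetry Python SDK (the exporter is a placeholder): SimpleSpanProcessor exports synchronously on every span end, while BatchSpanProcessor queues spans and exports them on a background thread, so benchmark results for the two can differ dramatically.

```python
# Sketch: the two processor choices under discussion; only one would be used in
# a given benchmark run, and the choice should be stated in the report.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

simple_pipeline = TracerProvider()
simple_pipeline.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

batch_pipeline = TracerProvider()  # production-like, per the comment above
batch_pipeline.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
```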

### Memory Usage

Measure dynamic memory consumption, e.g. heap, for the above scenario as the memory cost at a given event throughput.

Member

Similarly to CPU usage, memory usage depends on the processors and exporters used, the Span composition, and the backend (a slow backend will cause memory buildup, etc.). This needs to be in the spec, otherwise SDKs will implement these differently and we will not have comparable results.
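
A sketch of one way to approximate the heap cost, assuming the OpenTelemetry Python SDK and Python's tracemalloc for illustration; the span count, composition, and the receive-and-drop exporter are placeholders and would need to be pinned in the spec exactly as the comment above argues.

```python
# Sketch: heap growth while generating spans, measured with tracemalloc. A
# receive-and-drop exporter keeps the backend from influencing the figure.
import tracemalloc
from typing import Sequence

from opentelemetry import trace
from opentelemetry.sdk.trace import ReadableSpan, TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SpanExporter,
    SpanExportResult,
)


class DropSpanExporter(SpanExporter):
    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass


trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(DropSpanExporter()))
tracer = trace.get_tracer("memory-benchmark")

tracemalloc.start()
for i in range(10_000):  # placeholder span count
    with tracer.start_as_current_span("benchmark-span") as span:
        span.set_attribute("iteration", i)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"heap in use: {current_bytes / 1e6:.1f} MB, peak: {peak_bytes / 1e6:.1f} MB")
```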


### OTLP Exporting Latency

Measure and report the latency of sending events to a local OTLP collector via

Member

We need to define latency between which moments of time we measure. Is this from the moment when the Span is started and until it is received by the Collector? There are multiple measurement points possible which will have different latencies. Also, if we involve the Collector here we need to tell what config to use for the Collector since that can affect receiving performance and thus affect the latency.

Contributor Author

The latency part was intended to measure how much the application could be blocked by the logging/trace SDK. It may not be a good metric, so I removed it for now. Let me know if you have any suggestions.

Contributor
@jkwatson jkwatson left a comment

I'm still not sure exactly how we're going to use this, but it seems like an ok starting point.

@tigrannajaryan
Member

I believe this is an additive change so I am changing the label to "allowed-for-ga" rather than "required". (We can merge it whenever it is ready but I want to make sure it is not seen as a blocker for trace spec freeze).

@tigrannajaryan tigrannajaryan added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 15, 2020
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 22, 2020
@iNikem
Contributor

iNikem commented Oct 22, 2020

@ThomsonTan Can you rebase this PR please?

@open-telemetry/technical-committee Any objections to merging this PR?

@ThomsonTan
Contributor Author

> @ThomsonTan Can you rebase this PR please?
>
> @open-telemetry/technical-committee Any objections to merging this PR?

@iNikem rebased the PR to the latest; let me know if there is anything I need to follow up on to get it merged.

@github-actions github-actions bot removed the Stale label Oct 23, 2020
@github-actions

github-actions bot commented Nov 1, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 1, 2020
@iNikem
Contributor

iNikem commented Nov 1, 2020

@open-telemetry/technical-committee Can we merge this PR?

@github-actions github-actions bot removed the Stale label Nov 2, 2020
@andrewhsu andrewhsu mentioned this pull request Nov 2, 2020
@github-actions

github-actions bot commented Nov 9, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 9, 2020
@iNikem
Contributor

iNikem commented Nov 9, 2020

@open-telemetry/technical-committee Can we merge this PR?

@github-actions github-actions bot removed the Stale label Nov 10, 2020
@tigrannajaryan
Member

Merging; there do not seem to be any other objections and it has been approved for a while.
