Switch distributor->ingester communication to more efficient PushBytes method #430

Merged: 4 commits into grafana:master, Jan 4, 2021

Conversation

@mdisibio (Contributor) commented on Dec 23, 2020

What this PR does:
This PR switches distributor->ingester communication of trace data to a more efficient PushBytes method. This API differs in that it accepts a slice of byte slices containing pre-marshaled Batches ([][]byte). The distributor marshals the trace data to a byte slice only once (instead of once per ingester), and all data is delivered to the ingester in one gRPC call (instead of one per trace). This yields large improvements in both CPU and memory when the replication factor is >= 2, and a non-trivial improvement for replication factor 1 as well.
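As a rough illustration of the request shape only (the type and method names below are hypothetical stand-ins, not the actual Tempo proto/Go definitions), a PushBytes-style API might look like:

```go
package distributorsketch

import "context"

// PushBytesRequest is a hypothetical, simplified shape of the new request:
// each element of Requests is one pre-marshaled trace push ([]byte), so a
// single gRPC call can carry every trace destined for a given ingester.
type PushBytesRequest struct {
	Requests [][]byte
}

// PushResponse is left empty in this sketch; the real response may differ.
type PushResponse struct{}

// Pusher is an illustrative client interface: one PushBytes call per
// ingester rather than one Push call per trace per ingester.
type Pusher interface {
	PushBytes(ctx context.Context, req *PushBytesRequest) (*PushResponse, error)
}
```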

Background
The main driver of Tempo TCO is compute, as the actual object storage is very cost effective. Compute is split roughly 50/50 between the distributor and ingester layers. Since the distributor is mainly a proxy that replicates traffic to the ingesters according to the replication factor, its share was higher than expected and seemed to be a good area for improvement. Pprof benchmarking of the distributor showed that most CPU and memory usage was related to proto marshaling and compression. Reviewing distributor.Push revealed several deficiencies in the current implementation. For example, when the replication factor is 2, a gRPC call is made to each ingester for each trace belonging to it. This incurs:

  • 2x marshaling per trace
  • 2x compression per trace
  • 2x gRPC calls per trace

The new API signature reduces it to the following in theory:

  • 1x marshaling per trace
  • 2x compression per trace
  • 1x gRPC call per ingester

In practice the savings are larger than expected, as less memory churn means less garbage collection.
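To make the before/after concrete, here is a minimal sketch of the marshal-once pattern, building on the hypothetical types in the sketch above (Trace, ingestersForTrace, and pushTraces are illustrative names, not the actual distributor code):

```go
package distributorsketch

import "context"

// Trace is a stand-in for a marshalable trace; hypothetical for this sketch.
type Trace interface {
	Marshal() ([]byte, error)
}

// ingestersForTrace is a hypothetical stand-in for the ring lookup: it returns
// the ingesters that own this trace (e.g. 2 entries at replication factor 2).
func ingestersForTrace(t Trace) []string { return nil }

// pushTraces sketches the marshal-once pattern: each trace is marshaled a
// single time, its bytes are appended to the batch for every owning ingester,
// and each ingester then receives exactly one PushBytes call.
func pushTraces(ctx context.Context, traces []Trace, clients map[string]Pusher) error {
	perIngester := make(map[string][][]byte)
	for _, t := range traces {
		buf, err := t.Marshal() // 1x marshaling per trace
		if err != nil {
			return err
		}
		for _, ing := range ingestersForTrace(t) {
			perIngester[ing] = append(perIngester[ing], buf) // no re-marshal per replica
		}
	}
	for ing, reqs := range perIngester {
		// 1x gRPC call per ingester (compression still happens per call).
		if _, err := clients[ing].PushBytes(ctx, &PushBytesRequest{Requests: reqs}); err != nil {
			return err
		}
	}
	return nil
}
```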

Performance Analysis
Performance before and after was measured locally and in a dev cluster. A useful docker-compose setup that configures the necessary replication factor is located at https://github.com/mdisibio/tempo-load-test/tree/master and has been added to the /integration/microservices/ folder. This setup includes cadvisor, grafana, and a dashboard.

The main metric of interest is compute efficiency, i.e. spans / s / core across all distributor and ingester pods. This metric is computed with PromQL roughly as (simplified) rate(tempo_receiver_accepted_spans) / rate(container_cpu_usage_seconds_total).

Measuring this with the current e2e and benchmarking tools was not straightforward, hence the creation of the linked compose setup.

Measured improvements:

  • Local testing: spans/s/core went from 2,700 to 10-15K (5x+)
  • Dev k8s cluster: spans/s/core went from 2,200 to 5K (2.5x+)

Screenshots:

Before:
(screenshot: Screen Shot 2020-12-23 at 2 55 13 PM)

After:
(screenshot: Screen Shot 2020-12-23 at 3 06 10 PM)

K8s cluster:
(screenshot: Screen Shot 2020-12-23 at 4 53 06 PM)

Next Steps

  • Think more about why the savings are so significant. Does gRPC really behave that much differently when operating on a complex object graph instead of a simple [][]byte?
  • There is possibly a more elegant solution using the experimental gRPC PreparedMsg API, which would allow both marshaling and compression to be done once per trace. Discussion here: "Sending large complex message limits throughput" (grpc/grpc-go#1879). However, this requires a dependency update that we are blocked on.
  • Think about how to incorporate measurement of the spans/s/core metric in CI pipeline and this main repo.
  • Think about how to better test the microservices mode, separate distributors and ingesters, and repl factor > 1.

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott (Member) commented:
This looks good to me. Excellent performance improvements. Is there a reason not to PR the microservices docker-compose file to this repo?

Since we are 0.x I think we should aggressively remove the old proto/GRPC calls and leave these in place. Perhaps cut a release with this change and then one with the old endpoint removed?

In the changelog can you add a brief description of how to migrate to this new setup? Roll ingesters first completely and then roll distributors?

@mdisibio (Contributor, Author) commented on Jan 4, 2021

> This looks good to me. Excellent performance improvements. Is there a reason not to PR the microservices docker-compose file to this repo?

The profiling setup didn't seem to fit well within the existing docker-compose examples. What about adding it to a new /integration/profiling/ folder?

> Since we are 0.x I think we should aggressively remove the old proto/GRPC calls and leave these in place. Perhaps cut a release with this change and then one with the old endpoint removed?
> In the changelog can you add a brief description of how to migrate to this new setup? Roll ingesters first completely and then roll distributors?

Agree, two phase release sounds ideal, and will add that info.

@mdisibio changed the title from "WIP: Switch distributor->ingester communication to more efficient PushBytes method" to "Switch distributor->ingester communication to more efficient PushBytes method" on Jan 4, 2021
@joe-elliott merged commit 9e0e05a into grafana:master on Jan 4, 2021
@mdisibio deleted the push-bytes branch on February 3, 2021