Toward Magma Distributed Tracing #10492

hcgatewood · 2021-11-22T20:44:10Z

Toward Magma Distributed Tracing

tl;dr

Distributed tracing gives developers better debugging tools and gives operators much better visibility into their deployments. Imagine real-time, passively generated request tracing, in dev or prod, across Orc8r and gateways.

If you zoom in on the picture below, you’ll see a cross-service, end-to-end view of a request’s path through an example application. This includes services visited, latencies, annotations, and more, both for application-level code and for DB operations.

Intro

This document provides a guiding outline for progressively outfitting the Magma project with distributed tracing functionality.

In particular, we propose

A particular set of technologies
An architecture to handle reporting gateway traces

Context

This section provides context on the distributed tracing landscape as it relates to the chosen set of technologies. Note that we’re proposing to use Jaeger and OpenTelemetry together, as complementary technologies.

For an intro to distributed tracing, see the short tracing intro or comprehensive Jaeger intro.

Why distributed tracing

The Magma project has grown large enough that debugging, especially root-causing performance bottlenecks, has become unwieldy. Distributed tracing provides a mechanism for an online, real-time understanding of how requests pass through a series of services. This will give developers better debugging tools and give operators much better visibility into their deployments.

Jaeger

CNCF graduated project. End-to-end solution for outfitting a project with distributed tracing. Compatible with OpenTracing, and mostly compatible with OpenTelemetry (with full compatibility expected soon). Supports multiple storage backends, defaulting to Elasticsearch. Also supports tunable sampling rates and patterns, to limit network and storage pressure.

Includes the following components

UI (demo video)
Collector, to aggregate and sample spans then store to DB
Agent, to queue local spans and push to collector

Each component can be deployed as an individual container, or the full solution can be managed by the Jaeger K8s operator.
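To make the agent's role concrete, here is a minimal Python sketch of a bounded span queue that is flushed to a collector. It is a hypothetical model for illustration, not the Jaeger agent's actual implementation; the names `SpanBuffer`, `add`, `flush`, and the `send` callback are assumptions introduced here.

```python
from collections import deque
from typing import Callable, Deque, List


class SpanBuffer:
    """Hypothetical bounded in-memory queue of spans awaiting delivery."""

    def __init__(self, max_spans: int) -> None:
        # A deque with maxlen silently evicts the oldest item when full.
        self._spans: Deque[dict] = deque(maxlen=max_spans)
        self.dropped = 0  # number of spans evicted before delivery

    def add(self, span: dict) -> None:
        if len(self._spans) == self._spans.maxlen:
            self.dropped += 1  # the append below evicts the oldest span
        self._spans.append(span)

    def flush(self, send: Callable[[List[dict]], bool]) -> int:
        """Try to push all queued spans; keep them if the send fails."""
        batch = list(self._spans)
        if batch and send(batch):
            self._spans.clear()
            return len(batch)
        return 0
```

With a fixed capacity, a collector that stays unreachable long enough causes the oldest spans to be evicted, which keeps memory use bounded at the cost of losing traces.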

OpenTelemetry

CNCF sandbox project (OpenTracing, its predecessor, was a CNCF incubating project). Open specification of how to represent, propagate, and store spans. Also includes alpha specifications for metrics and logs formats — may be of interest to us in the future, especially logs, but for now we can focus on tracing. OpenTelemetry is the new, backwards-compatible incarnation of OpenTracing.

Also includes language-specific (e.g. Go) and framework-specific (e.g. gRPC) libraries to generate, propagate, and report spans. Supported languages include

Go (beta)
Python (beta)
C++ (pre-alpha; can also consider the OpenTracing C++ library if necessary)

Also supports reporting spans for requests that application code makes into storage backends (e.g. reporting on Postgres lock contention), implemented as shims in the caller’s language.

Architecture

This section describes the proposed architecture for outfitting the Magma project with distributed tracing.

Desiderata include

Trivial penalty to application performance
Minimal hackery to propagate gateway spans to Orc8r
Separate sampling profiles between gateway and Orc8r spans
Low impact to existing Helm charts

With these desiderata in mind, we present the following architecture

Description

All jaeger- components are deployments of existing Jaeger components -- no custom code
Orc8r’s Jaeger functionality is managed by the Jaeger K8s operator, which injects an agent sidecar into each Orc8r service
Separate collectors for gateways vs. Orc8r, to support different sampling rates, e.g. sampling gateways at a much lower rate
On gateway, place a single, containerized agent service. Point all AGW services to the agent, and point the agent to the gateway-specific collector in the Orc8r. The agent-collector interface is unary gRPC, so this should work seamlessly
Use Elasticsearch as the storage backend, as we already use it for multiple purposes and generally expect it to be present for production deployments
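To illustrate what separate sampling profiles mean in practice, here is a minimal Python sketch of probabilistic, trace-ID-based sampling. It is illustrative only: in the proposed deployment, sampling is handled by the existing Jaeger components, and the rates and names below (`GATEWAY_RATE`, `ORC8R_RATE`, `should_sample`, `new_trace_id`) are hypothetical.

```python
import secrets

# Hypothetical rates, not taken from the proposal:
# gateways are sampled far less often than Orc8r.
GATEWAY_RATE = 0.001
ORC8R_RATE = 0.1


def new_trace_id() -> str:
    """Random 128-bit trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision derived from the trace ID.

    Every span in a trace shares the trace ID, so every service that
    applies the same rate reaches the same keep/drop decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    # Use the low 64 bits of the trace ID as a bucket in [0, 2**64).
    bucket = int(trace_id, 16) & (2**64 - 1)
    return bucket < int(rate * 2**64)
```

Because the decision depends only on the trace ID and the rate, any trace kept at the lower gateway rate would also be kept at the higher Orc8r rate.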

Affordances

✅ Pro: trivial penalty to application performance — similar to how Prometheus handles metrics
✅ Pro: no hackery necessary to propagate gateway spans
✅ Pro: separate gateway vs. Orc8r sampling rates
✅ Pro: minimal impact to Helm charts via jaeger-operator
✅ Pro: with minimal additional work, we could point the gateway’s agent to a gateway-local collector, affording headless gateways fully-local tracing functionality — including viewing the traces via the Jaeger UI
⚠️ Con: gateway traces will be dropped during extended disconnections from Orc8r. This is likely not a strong enough negative to justify the manual hackery and/or expansive infrastructure (e.g. gateway-local Kafka) that would be required to overcome it
⚠️ Con: by default, jaeger-operator expects an ingress controller to exist. Unclear whether this is a strict requirement. If not, we can probably work around it with our existing Nginx load balancer. If so, we can take it as an opportunity to upgrade our platform and incorporate an ingress controller.
🚨 Risk: lack of libraries for instrumenting C code. Unclear if there are realistic alternatives. Also may be possible to use some C vs. C++ hackery to circumvent this issue.
🚨 Risk: large refactor to propagate contexts through application code. Almost nowhere in the Magma project are contexts properly propagated, including in Orc8r. Connecting these contexts through will take a gradual, persistent effort over the next few halves.
- Note: context propagation is an established cloud-native pattern, unlocking functionalities like propagated context cancellation, auth and identity preservation across calls, service-to-service access management (i.e. not just at the edge), etc. Migrating to this pattern is something the project will want to accomplish eventually anyway, so we can use distributed tracing as a forcing function to begin this migration.
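As an illustration of what propagating context across a service boundary involves, below is a small Python sketch that carries a trace context in a `traceparent` header, using the W3C Trace Context layout (`version-traceid-spanid-flags`). In practice the OpenTelemetry libraries handle this; the names here (`SpanContext`, `start_span`, `inject`, `extract`, `handle_request`) are hypothetical stand-ins, not the OpenTelemetry API, and the validation is deliberately simplified.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

# Simplified pattern: version 00, 32-hex trace ID, 16-hex span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # shared by every span in the trace
    span_id: str   # unique per span
    sampled: bool


def start_span(parent: Optional[SpanContext] = None) -> SpanContext:
    """Continue the parent's trace if there is one, else start a new trace."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Caller side: write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Callee side: read the context back, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


def handle_request(headers: Dict[str, str]) -> SpanContext:
    """Hypothetical callee: its span joins the caller's trace when possible."""
    return start_span(parent=extract(headers))
```

If any hop fails to pass the context along, the callee starts an unrelated trace and the end-to-end view is broken at that point, which is why the refactor has to reach every call path.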

Appendix

Option: use fluentd to aggregate spans

It’s possible to use a FluentBit exporter for OpenTelemetry. This would allow our data pipeline, on the AGW side, to remain unchanged. This is something we will want to look into after the POC tasks, specifically to answer the following questions:

Does this preclude dynamic changes in sampling rates/patterns?
Is there any way to extract the spans from fluentd before placing them into ES (i.e. can we enforce that they sit in their own ES indexes, to keep log vs. span cleanup independent)?

hcgatewood · 2021-11-22T20:44:45Z

cc @mstre123 @andreilee for visibility

electronjoe · 2021-11-29T20:15:26Z

Nit request, in the Architecture subsection diagram, can we have specialty diagram annotations of some sort (color, shape, etc) for what is being produced novel in-house and what is built-in to Jaeger?

hcgatewood · 2021-11-29T20:42:04Z

@electronjoe all the jaeger- components already exist, so we won't need to custom-build anything, just deploy it. I'll add a note to make that clear

hcgatewood · 2021-12-13T20:37:08Z

Closing as accepted

hcgatewood added the type: proposal Proposals and design documents label Nov 22, 2021

hcgatewood added the status: on hold label Nov 29, 2021

hcgatewood removed the status: on hold label Dec 10, 2021

hcgatewood changed the title ~~Magma Distributed Tracing~~ Toward Magma Distributed Tracing Dec 10, 2021

hcgatewood added the status: accepted label Dec 13, 2021

hcgatewood closed this as completed Dec 13, 2021

Neudrino mentioned this issue Mar 3, 2022

Enable jaeger tracing across gateway services #3471

Closed

alexzurbonsen mentioned this issue Apr 7, 2022

chore: Upgrade to Go 1.18 #12151

Merged

Neudrino mentioned this issue Jul 22, 2022

Events-Based call tracing design exploration #9134

Closed
