Toward Magma Distributed Tracing #10492

Closed
hcgatewood opened this issue Nov 22, 2021 · 4 comments
Labels
status: accepted, type: proposal (Proposals and design documents)

Comments

@hcgatewood
Contributor

hcgatewood commented Nov 22, 2021

Toward Magma Distributed Tracing

tl;dr

Distributed tracing gives developers enhanced debugging tools and gives operators radically improved visibility into their deployments. Imagine real-time, dev-or-prod, passively generated request tracing across Orc8r and gateways.

If you zoom in on the picture below, you’ll see a cross-service, end-to-end view of a request’s path through an example application. This includes services visited, latencies, annotations, and more, for both application-level code and DB operations.

[Image: trace-detail-ss (1)]

Intro

This document outlines how to progressively outfit the Magma project with distributed tracing.

In particular, we propose

  • A particular set of technologies
  • An architecture to handle reporting gateway traces

Context

This section provides context on the distributed tracing landscape, focusing on the chosen set of technologies. Note that we’re proposing to use Jaeger and OpenTelemetry together, as complementary technologies.

For an intro to distributed tracing, see the short tracing intro or comprehensive Jaeger intro.

Why distributed tracing

The Magma project has grown large enough that debugging, especially root-causing performance bottlenecks, has become unwieldy. Distributed tracing provides a mechanism for an online, real-time understanding of how requests pass through a series of services. This will give developers enhanced debugging tools and give operators radically improved visibility into their deployments.

Jaeger

CNCF graduated project. End-to-end solution for outfitting a project with distributed tracing. Compatible with OpenTracing (and mostly, soon to be fully, compatible with OpenTelemetry). Supports multiple storage backends, defaulting to Elasticsearch. Also supports tunable sampling rates and patterns, to limit network and storage pressure.

Includes the following components

  • UI (demo video)
  • Collector, to aggregate and sample spans then store to DB
  • Agent, to queue local spans and push to collector

Each component can be deployed as an individual container, or the full solution can be managed by the Jaeger K8s operator.
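To make the agent's role concrete, here is a minimal sketch of a bounded local span queue that is flushed to a collector. Everything in it is hypothetical: `SpanQueue`, its drop-oldest policy, and the `send` callback are illustrative assumptions, not Jaeger's actual implementation or API.

```python
from collections import deque


class SpanQueue:
    """Hypothetical bounded buffer standing in for an agent's local span queue."""

    def __init__(self, max_spans):
        # deque(maxlen=...) discards the oldest entry when a new one arrives at capacity.
        self._spans = deque(maxlen=max_spans)
        self.dropped = 0

    def enqueue(self, span):
        if len(self._spans) == self._spans.maxlen:
            self.dropped += 1  # the oldest span is about to be evicted
        self._spans.append(span)

    def flush(self, send):
        """Push queued spans via send(span); stop at the first connection failure."""
        sent = 0
        while self._spans:
            try:
                send(self._spans[0])
            except ConnectionError:
                break  # collector unreachable: keep the remaining spans queued
            self._spans.popleft()
            sent += 1
        return sent
```

With a bounded queue like this, a short collector outage only delays reporting, while a long outage starts evicting spans once the buffer fills.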

[Image: architecture-v1 (1)]

Additional reading

OpenTelemetry

CNCF sandbox project (OpenTracing, its predecessor, was a CNCF incubating project). Open specification of how to represent, propagate, and store spans. Also includes alpha specifications for metrics and logs formats — may be of interest to us in the future, especially logs, but for now we can focus on tracing. OpenTelemetry is the new, backwards-compatible incarnation of OpenTracing.

[Image: spans-traces]
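As a rough illustration of how spans relate to a trace, the sketch below models a span as a record that shares a trace ID with its siblings and points at its parent. The `Span` class and `new_trace` helper are hypothetical and are not the OpenTelemetry API; the ID widths (16-byte trace ID, 8-byte span ID) follow the sizes OpenTelemetry uses.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record: one timed operation within a trace."""
    name: str
    trace_id: str                      # shared by every span in the same trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None    # None marks the root span
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def child(self, name):
        # A child inherits the trace ID and records this span as its parent.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.end = time.time()


def new_trace(name):
    """Start a new trace by creating its root span."""
    return Span(name=name, trace_id=secrets.token_hex(16))
```

A trace is then just the set of spans with the same `trace_id`, and the `parent_id` links reconstruct the request's call tree.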

Also includes language-specific (e.g. Go) and framework-specific (e.g. gRPC) libraries to generate, propagate, and report spans. Supported languages include

Also supports reporting spans for application code requests into storage backends (e.g. reporting on Postgres lock contention), as shims in the caller’s language
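The idea of a caller-side shim can be sketched as a thin wrapper that times each query and records it as a span-like entry. This is an illustration under stated assumptions: `traced`, `traced_execute`, and the `recorded` list are hypothetical names, and the stdlib `sqlite3` module stands in for Postgres so the example runs anywhere. A real deployment would use an OpenTelemetry instrumentation library instead.

```python
import sqlite3
import time
from contextlib import contextmanager

recorded = []  # hypothetical stand-in for a span exporter


@contextmanager
def traced(operation, **attributes):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        recorded.append({
            "operation": operation,
            "duration_s": time.perf_counter() - start,
            **attributes,
        })


def traced_execute(conn, sql, params=()):
    """Run a query through the shim so the caller's code stays unchanged."""
    with traced("db.query", statement=sql):
        return conn.execute(sql, params).fetchall()
```

Because the shim lives in the caller's language, application code only swaps its execute call; the database itself needs no changes.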

Additional reading

Architecture

This section describes the proposed architecture for outfitting the Magma project with distributed tracing.

Desiderata include

  • Trivial penalty to application performance
  • Minimal hackery to propagate gateway spans to Orc8r
  • Separate sampling profiles between gateway and Orc8r spans
  • Low impact to existing Helm charts

With these desiderata in mind, we present the following architecture

[Image: jaeger]

Description

  • All jaeger- components are deployments of existing Jaeger components -- no custom code
  • Orc8r’s Jaeger functionality is managed by the Jaeger K8s operator, which injects an agent sidecar into each Orc8r service
  • Separate collectors for gateways vs. Orc8r, to support different sampling rates (e.g. sampling gateways at a much lower rate)
  • On gateway, place a single, containerized agent service. Point all AGW services to the agent, and point the agent to the gateway-specific collector in the Orc8r. The agent-collector interface is unary gRPC, so this should work seamlessly
  • Use Elasticsearch as the storage backend, as we already use it for multiple purposes and generally expect it to be present for production deployments
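The separate sampling profiles described above can be sketched as a per-source rate table with a deterministic decision keyed on the trace ID. The rates, the `SAMPLING_RATES` table, and `should_sample` are hypothetical and do not reflect Jaeger's sampler implementation or configuration format.

```python
# Hypothetical rates: gateways are sampled far less often than Orc8r.
SAMPLING_RATES = {"orc8r": 0.5, "gateway": 0.01}

_BUCKETS = 10_000


def should_sample(trace_id_hex, source):
    """Decide whether to keep a trace, deterministically per trace ID.

    Hashing on the trace ID (rather than drawing a random number per span)
    means every span of a given trace gets the same keep/drop decision.
    """
    rate = SAMPLING_RATES[source]
    bucket = int(trace_id_hex, 16) % _BUCKETS
    return bucket < rate * _BUCKETS
```

Keeping the rates in separate entries mirrors the separate collectors: each side can be tuned without touching the other.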

Affordances

  • ✅ Pro: trivial penalty to application performance — similar to how Prometheus handles metrics
  • ✅ Pro: no hackery necessary to propagate gateway spans
  • ✅ Pro: separate gateway vs. Orc8r sampling rates
  • ✅ Pro: minimal impact to Helm charts via jaeger-operator
  • ✅ Pro: with minimal additional work, we could point the gateway’s agent to a gateway-local collector, affording headless gateways fully-local tracing functionality — including viewing the traces via the Jaeger UI
  • ⚠️ Con: gateway traces will be dropped during extended disconnections from Orc8r. This is likely not a strong enough negative to justify the manual hackery and/or expansive infrastructure (e.g. gateway-local Kafka) that would be required to overcome the negative
  • ⚠️ Con: by default, jaeger-operator expects an ingress controller to exist. Unclear whether this is a strict requirement. If no, we can probably work around it with our existing Nginx load balancer. If yes, we can take it as an opportunity to upgrade our platform and incorporate an ingress controller.
  • 🚨 Risk: lack of libraries for instrumenting C code. Unclear if there are realistic alternatives. Also may be possible to use some C vs. C++ hackery to circumvent this issue.
  • 🚨 Risk: large refactor to propagate contexts through application code. Almost nowhere in the Magma project are contexts properly propagated, including in Orc8r. Connecting these contexts through will take a gradual, persistent effort over the next few halves.
    • Note: context propagation is an established cloud-native pattern that unlocks functionality like propagated context cancellation, auth and identity preservation across calls, and service-to-service access management (i.e. not just at the edge). The project will want to migrate to this pattern eventually anyway, so we can use distributed tracing as a forcing function to begin the migration.
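To illustrate what propagating context involves, the sketch below carries a trace context within a process via `contextvars` and across a service boundary via a W3C Trace Context `traceparent` header. The helper names (`start_span`, `inject`, `extract`) are hypothetical, and a plain dict stands in for gRPC metadata. Real code would use the OpenTelemetry propagators rather than this hand-rolled version.

```python
import contextvars
import re
import secrets

# Ambient (trace_id, span_id) for the code currently executing.
_current = contextvars.ContextVar("current_span_context", default=None)

# traceparent layout: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(parent=None):
    """Start a span under an explicit parent, the ambient context, or a new trace."""
    parent = parent or _current.get()
    trace_id = parent[0] if parent else secrets.token_hex(16)
    ctx = (trace_id, secrets.token_hex(8))
    _current.set(ctx)
    return ctx


def inject(headers):
    """Caller side: write the ambient context into outgoing headers."""
    ctx = _current.get()
    if ctx is not None:
        headers["traceparent"] = f"00-{ctx[0]}-{ctx[1]}-01"  # 01 = sampled flag
    return headers


def extract(headers):
    """Callee side: read the caller's context, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None
```

The refactor risk noted above comes from the in-process half: every call path between an incoming request and an outgoing one has to hand the context along, or the trace breaks at that hop.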

Appendix

Option: use fluentd to aggregate spans

It’s possible to use a FluentBit exporter for OpenTelemetry. This would allow our data pipeline, on the AGW side, to remain unchanged. This is something we will want to look into after the POC tasks, specifically to answer the questions

  • Does this preclude dynamic changes in sampling rates/patterns?
  • Is there any way to extract the spans from fluentd before placing them into ES (i.e. can we enforce that they sit in their own ES indexes, to keep log vs. span cleanup independent)?
hcgatewood added the "type: proposal" label on Nov 22, 2021
@hcgatewood
Contributor Author

cc @mstre123 @andreilee for visibility

@electronjoe
Member

Nit request: in the Architecture subsection diagram, can we have some kind of annotation (color, shape, etc.) to distinguish what is being built in-house from what is built in to Jaeger?

@hcgatewood
Contributor Author

@electronjoe all the jaeger- components already exist, so we won't need to custom-build anything, just deploy it. I'll add a note to make that clear

hcgatewood changed the title from "Magma Distributed Tracing" to "Toward Magma Distributed Tracing" on Dec 10, 2021
@hcgatewood
Contributor Author

Closing as accepted
