feat(localenv): add trace collection (with Tempo) in local playground #2816

mkurapov · 2024-07-18T14:15:52Z

Changes proposed in this pull request

Adds the ability to collect traces in Rafiki.
Adds trace auto-instrumentation for HTTP requests, our GraphQL Admin API, and Postgres.
Traces are pushed to the OpenTelemetry collector, which forwards them to Grafana Tempo. They can then be visualized in Grafana.

I added a small panel to the example Grafana dashboard that shows some of the longer traces that can be explored.

Context

Checklist

# Conflicts:
# localenv/telemetry/README.md
# localenv/telemetry/docker-compose.yml
# localenv/telemetry/grafana/provisioning/dashboards/default.yaml
# localenv/telemetry/grafana/provisioning/dashboards/example.json
# localenv/telemetry/grafana/provisioning/datasources/datasources.yaml
# localenv/telemetry/otel-collector-config.yaml
# packages/backend/src/telemetry/service.ts

netlify · 2024-07-18T14:16:08Z

✅ Deploy Preview for brilliant-pasca-3e80ec canceled.

Name	Link
🔨 Latest commit	`babca58`
🔍 Latest deploy log	https://app.netlify.com/sites/brilliant-pasca-3e80ec/deploys/669eb2638ffecf000851bc34

mkurapov · 2024-07-18T14:23:02Z

packages/backend/src/telemetry/index.ts

+  instrumentations: [
+    new UndiciInstrumentation(),
+    new HttpInstrumentation(),
+    new PgInstrumentation(),
+    new GraphQLInstrumentation({
+      mergeItems: true,
+      ignoreTrivialResolveSpans: true,
+      ignoreResolveSpans: true
+    })


UndiciInstrumentation: auto-instruments Node's fetch module. Useful for tracking requests made with the Open Payments client, since it uses fetch under the hood.

HttpInstrumentation: auto-instruments the http and https modules. Useful for tracking client requests made with axios, and it also lets us track requests received by HTTP servers (e.g. you would be able to see how happy-life-bank handles each request).

PgInstrumentation: auto-instruments all of our Postgres queries.

GraphQLInstrumentation: auto-instruments our GraphQL APIs, particularly resolvers. The settings reduce the amount of noise we get; otherwise we see spans for every single field that we resolve. We can adjust if necessary.

We are always instrumenting these, regardless of enableTelemetryTraces. Is that OK? I mean the important thing is that we're not sending them anywhere, so maybe it's fine. But I wonder if this could have a performance impact or cause any unnecessary concern.

Although there shouldn't be much of a difference at all, it would probably be safer to put this behind the flag.

I moved the code behind the flags and tested with them enabled and disabled; everything seems to work. It also works if we make telemetry non-optional and try to increment a counter while everything is disabled: it is just a no-op.
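For illustration, the flag gating discussed above could look roughly like the sketch below. It is not the code from this PR: the `Instrumentation` interface and the `selectInstrumentations` helper are hypothetical stand-ins (the real classes come from the OpenTelemetry instrumentation packages), and only the `enableTelemetryTraces` flag name is taken from the thread.

```typescript
// Hypothetical stand-in for an OpenTelemetry instrumentation object. The real
// classes (UndiciInstrumentation, HttpInstrumentation, ...) live in their own packages.
interface Instrumentation {
  name: string
}

interface TelemetryFlags {
  enableTelemetryTraces: boolean
}

// Take factories instead of instances so that nothing is constructed
// (and therefore nothing is patched) when tracing is disabled.
function selectInstrumentations(
  flags: TelemetryFlags,
  factories: Array<() => Instrumentation>
): Instrumentation[] {
  if (!flags.enableTelemetryTraces) {
    return []
  }
  return factories.map((create) => create())
}

// Illustrative factories only; names mirror the four instrumentations above.
const factories: Array<() => Instrumentation> = [
  () => ({ name: 'undici' }),
  () => ({ name: 'http' }),
  () => ({ name: 'pg' }),
  () => ({ name: 'graphql' })
]

const enabled = selectInstrumentations({ enableTelemetryTraces: true }, factories)
const disabled = selectInstrumentations({ enableTelemetryTraces: false }, factories)

console.log(enabled.map((i) => i.name), disabled.length)
```

Passing factories rather than instances is the design choice that matters here: with the flag off, no instrumentation object is ever created.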

mkurapov · 2024-07-18T14:24:24Z

packages/backend/src/telemetry/index.ts

+if (Config.enableTelemetryTraces) {
+  for (const url of Config.openTelemetryTraceCollectorUrls) {
+    const traceExporter = new OTLPTraceExporter({
+      url
+    })
+
+    tracerProvider.addSpanProcessor(new BatchSpanProcessor(traceExporter))
+  }
+}


Separate config variables, to differentiate between metrics and traces (IMO we want metrics on by default, but not traces).
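A minimal sketch of what separate flags could look like when read from the environment. The environment variable names, the `enableTelemetry` field, and the `parseTelemetryConfig` helper are assumptions made for this example; only `enableTelemetryTraces` and `openTelemetryTraceCollectorUrls` appear in the diff. The defaults follow the opinion above (metrics on, traces off).

```typescript
interface TelemetryConfig {
  enableTelemetry: boolean // metrics flag (hypothetical name)
  enableTelemetryTraces: boolean
  openTelemetryTraceCollectorUrls: string[]
}

type Env = Record<string, string | undefined>

// Read a boolean flag, falling back to a default when the variable is unset.
function envBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]
  return raw === undefined ? fallback : raw.toLowerCase() === 'true'
}

// Read a comma-separated list, dropping empty entries and surrounding whitespace.
function envList(env: Env, key: string): string[] {
  return (env[key] ?? '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
}

// All environment variable names below are assumed for illustration.
function parseTelemetryConfig(env: Env): TelemetryConfig {
  return {
    enableTelemetry: envBool(env, 'ENABLE_TELEMETRY', true),
    enableTelemetryTraces: envBool(env, 'ENABLE_TELEMETRY_TRACES', false),
    openTelemetryTraceCollectorUrls: envList(
      env,
      'OPEN_TELEMETRY_TRACE_COLLECTOR_URLS'
    )
  }
}

const defaults = parseTelemetryConfig({})
const withTraces = parseTelemetryConfig({
  ENABLE_TELEMETRY_TRACES: 'true',
  OPEN_TELEMETRY_TRACE_COLLECTOR_URLS: 'http://collector-a:4317, http://collector-b:4317'
})

console.log(defaults, withTraces)
```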

mkurapov · 2024-07-18T14:29:02Z

packages/backend/Dockerfile.prod

@@ -59,4 +59,4 @@ COPY --from=builder /home/rafiki/packages/backend/dist ./packages/backend/dist
 COPY --from=builder /home/rafiki/packages/token-introspection/dist ./packages/token-introspection/dist
 COPY --from=builder /home/rafiki/packages/backend/knexfile.js ./packages/backend/knexfile.js

-CMD ["node", "/home/rafiki/packages/backend/dist/index.js"]
+CMD ["node", "-r", "/home/rafiki/packages/backend/dist/telemetry/index.js", "/home/rafiki/packages/backend/dist/index.js"]


Because auto-instrumentation essentially needs to wrap multiple modules, the telemetry file needs to be loaded as the very first thing, before app startup:

https://opentelemetry.io/docs/languages/js/getting-started/nodejs/#setup

otherwise, auto-instrumentation will not work.

We could just have this CMD call pnpm start or something, but maybe it's OK like this?

Just curious, what happens if we don't include the --require option? Some error? Does it fail silently, or maybe coincidentally work sometimes?

I don't have a very strong opinion on the pnpm start command vs. this, but I'm leaning towards this, since we may only want to start it this way in this context, as opposed to Dockerfile.dev, the host machine, etc. We could always have different start commands in that case, I guess.

It just "fails" silently and doesn't generate traces. For example, if the pg module gets loaded before the Postgres auto-instrumentation, it just won't pick up those traces.

If you uncomment the debug logger in the telemetry file, you'd be able to see those warnings.
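To make the load-order point concrete, here is a toy model of the silent failure. It is a simplified illustration, not how OpenTelemetry is implemented: `fakeDb`, `instrument`, and `spans` are all made up for this example. The idea is that code which grabbed a function before it was wrapped keeps calling the unwrapped version, so nothing errors and no span is recorded.

```typescript
// A toy "module" with one exported function, standing in for something like pg.
const fakeDb = {
  query(sql: string): string {
    return `result of ${sql}`
  }
}

// Collected "spans", so we can see which calls were traced.
const spans: string[] = []

// A toy auto-instrumentation: replaces the export with a wrapper that records a span.
function instrument(mod: { query: (sql: string) => string }): void {
  const original = mod.query
  mod.query = (sql: string) => {
    spans.push(`query: ${sql}`)
    return original(sql)
  }
}

// Consumer loaded BEFORE instrumentation: it captured the unpatched function.
const earlyQuery = fakeDb.query

instrument(fakeDb)

// Consumer loaded AFTER instrumentation: it captures the wrapper.
const lateQuery = fakeDb.query

const earlyResult = earlyQuery('SELECT 1') // works, but records no span
const lateResult = lateQuery('SELECT 2') // works and records a span

console.log(earlyResult, lateResult, spans)
```

Both calls return a result, and only the second one shows up in `spans`, which matches the "fails silently" behaviour described above.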

BlairCurrey

Looks like there are merge conflicts with the telemetry service refactor but generally looks good to me. Spun it up and viewed the traces locally. Just had the one open-ended comment here: #2816 (comment)

Ping me when ready for re-review

# Conflicts:
# localenv/telemetry/grafana/provisioning/dashboards/example.json
# packages/backend/src/telemetry/service.ts

njlie

LGTM, just some dangling comments to be cleaned up. Was able to get the trace logs in the Grafana dashboard on my machine as well.

njlie · 2024-07-22T22:33:56Z

packages/backend/src/telemetry/index.ts

+import { UndiciInstrumentation } from '@opentelemetry/instrumentation-undici'
+
+// debug logger:
+// diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG)


nit: cleanup

I was thinking about this one, but it's actually quite useful to keep for checking whether the telemetry service is working at all, particularly the connection between the backend service and the collector itself. Otherwise we need to dig into the docs to find this.

Might be a rare exception where keeping commented-out code is OK.

mkurapov added 13 commits July 10, 2024 13:50

feat: add telemetry stack to localenv under command

d883612

chore: remove tempo

4f0dad8

chore(localenv): update prometheus scrape interval

84643af

chore: explicitly set otel collector endpoint

093bc8e

chore: change prometheus scrape interval to 15s

58c5b85

chore: update dashboard queries

51c1c26

chore: add readme

0bc78f7

chore: set auto-refresh within grafana dashboard

bd88c3a

chore: add psql command

0a91ca4

feat(backend): add trace auto-instrumentation

4912ec1

feat(localenv): add tempo to telemetry stack

87f33aa

feat(localenv): add example panel of traces in the grafana dashboard

886535c

github-actions bot added the pkg: backend and type: source labels Jul 18, 2024

mkurapov commented Jul 18, 2024


chore: format for example dashboard

d3535c2

mkurapov requested review from koekiebox, BlairCurrey, njlie and JoblersTune July 18, 2024 16:14

BlairCurrey previously approved these changes Jul 22, 2024


Merge branch 'main' into 2808/mk/tracing

a763789

# Conflicts:
# localenv/telemetry/grafana/provisioning/dashboards/example.json
# packages/backend/src/telemetry/service.ts

mkurapov dismissed BlairCurrey’s stale review via 07096a8 July 22, 2024 16:30

chore: rearrange dashboard

00baa73

mkurapov force-pushed the 2808/mk/tracing branch from 07096a8 to 00baa73 Compare July 22, 2024 19:22

chore(backend): add instrumentation only if enabled

babca58

mkurapov requested a review from BlairCurrey July 22, 2024 21:41

njlie approved these changes Jul 22, 2024


mkurapov merged commit ce66ab8 into main Jul 23, 2024
36 of 42 checks passed

mkurapov deleted the 2808/mk/tracing branch July 23, 2024 10:20
