feat(localenv): add trace collection (with Tempo) in local playground #2816

Merged
mkurapov merged 17 commits into main from 2808/mk/tracing
Jul 23, 2024

Conversation

mkurapov
Contributor

@mkurapov mkurapov commented Jul 18, 2024

Changes proposed in this pull request

  • Adds the ability to collect traces in Rafiki.
  • Adds trace auto-instrumentation for HTTP requests, our GraphQL Admin API, and Postgres.
  • Traces are pushed to the OpenTelemetry Collector, which forwards them to Grafana Tempo. They can then be visualized in Grafana:
[Screenshot: 2024-07-18 at 15 55 10]

I added a small panel to the example Grafana dashboard that shows some of the longer traces, which can then be explored.

Context

Fixes #2808

Checklist

  • Related issues linked using fixes #number
  • Tests added/updated
  • Documentation added
  • Make sure that all checks pass
  • Bruno collection updated

@github-actions github-actions bot added pkg: backend Changes in the backend package. type: source Changes business logic labels Jul 18, 2024

netlify bot commented Jul 18, 2024

Deploy Preview for brilliant-pasca-3e80ec canceled.

Name Link
🔨 Latest commit babca58
🔍 Latest deploy log https://app.netlify.com/sites/brilliant-pasca-3e80ec/deploys/669eb2638ffecf000851bc34

Comment on lines 72 to 80
instrumentations: [
  new UndiciInstrumentation(),
  new HttpInstrumentation(),
  new PgInstrumentation(),
  new GraphQLInstrumentation({
    mergeItems: true,
    ignoreTrivialResolveSpans: true,
    ignoreResolveSpans: true
  })
Contributor Author

@mkurapov mkurapov Jul 18, 2024

UndiciInstrumentation: auto-instruments Node's fetch. Useful for us to track requests made with the Open Payments client, since it uses fetch under the hood.

HttpInstrumentation: auto-instruments the http and https modules. Useful for tracking client requests made with axios, and it also lets us track requests received by HTTP servers (e.g. you would be able to see how happy-life-bank handles each request).

PgInstrumentation: auto-instruments all of our postgres queries.

GraphQLInstrumentation: auto-instruments our GraphQL APIs, particularly resolvers. The settings are to reduce the amount of noise we get. Otherwise, we see spans for every single field that we resolve. We can adjust if necessary.

Contributor

We are always instrumenting these, regardless of enableTelemetryTraces. Is that OK? I mean the important thing is that we're not sending them anywhere, so maybe it's fine. But I wonder if this could have a performance impact or cause any unnecessary concern.

Contributor Author

Although there shouldn't be much of a difference at all, it would probably be safer to put this behind the flag.

I moved the code behind the flags, tested with them enabled and disabled, and everything seems to work. It also works if we make telemetry non-optional and try to increment a counter while everything is disabled: it is just a no-op.
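
The no-op behavior described above could look roughly like the following. This is only a minimal sketch, not Rafiki's telemetry service: `Counter`, `RecordingCounter`, `createCounter`, and the `enableTelemetry` flag are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch: none of these names come from the Rafiki codebase.
interface Counter {
  add(value: number): void
}

interface TelemetryFlags {
  enableTelemetry: boolean // assumed name for the metrics flag
}

// Keeps a running total in memory so the behavior is easy to observe.
class RecordingCounter implements Counter {
  total = 0

  add(value: number): void {
    this.total += value
  }
}

// Swallows every call, so callers never have to check whether telemetry is on.
const noopCounter: Counter = {
  add(): void {
    // intentionally empty
  }
}

// Returns the real counter only when the flag is enabled.
function createCounter(flags: TelemetryFlags, real: Counter): Counter {
  return flags.enableTelemetry ? real : noopCounter
}

const backing = new RecordingCounter()
createCounter({ enableTelemetry: false }, backing).add(1) // ignored
createCounter({ enableTelemetry: true }, backing).add(2) // recorded
```

With this shape, call sites can always call `add` and the flag decides whether anything is recorded.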

Comment on lines 57 to 65
if (Config.enableTelemetryTraces) {
for (const url of Config.openTelemetryTraceCollectorUrls) {
const traceExporter = new OTLPTraceExporter({
url
})

tracerProvider.addSpanProcessor(new BatchSpanProcessor(traceExporter))
}
}
Contributor Author

Separate config variables, to differentiate between metrics and traces (IMO we want metrics on by default, but not traces).
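
As a rough illustration of those defaults, here is a sketch of how two separate flags might be parsed. The property names `enableTelemetryTraces` and `openTelemetryTraceCollectorUrls` appear in the diff above; the environment variable names, the `enableTelemetry` metrics flag, and the helper functions are assumptions made for this example.

```typescript
// Hypothetical sketch: the environment variable names are assumptions,
// not Rafiki's documented configuration.
type Env = Record<string, string | undefined>

interface TelemetryConfig {
  enableTelemetry: boolean // metrics: on unless explicitly disabled (assumed name)
  enableTelemetryTraces: boolean // traces: off unless explicitly enabled
  openTelemetryTraceCollectorUrls: string[]
}

// Treats a missing or empty value as "use the fallback".
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === '') return fallback
  return value.trim().toLowerCase() === 'true'
}

// Splits a comma-separated list and drops empty entries.
function parseList(value: string | undefined): string[] {
  if (!value) return []
  return value
    .split(',')
    .map((item) => item.trim())
    .filter((item) => item.length > 0)
}

function loadTelemetryConfig(env: Env): TelemetryConfig {
  return {
    enableTelemetry: parseBool(env['ENABLE_TELEMETRY'], true),
    enableTelemetryTraces: parseBool(env['ENABLE_TELEMETRY_TRACES'], false),
    openTelemetryTraceCollectorUrls: parseList(
      env['OPEN_TELEMETRY_TRACE_COLLECTOR_URLS']
    )
  }
}

// With an empty environment, metrics are on and traces are off.
const defaults = loadTelemetryConfig({})
```

Keeping the two flags independent means traces can be enabled for a local playground without changing how metrics behave.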

@@ -59,4 +59,4 @@ COPY --from=builder /home/rafiki/packages/backend/dist ./packages/backend/dist
COPY --from=builder /home/rafiki/packages/token-introspection/dist ./packages/token-introspection/dist
COPY --from=builder /home/rafiki/packages/backend/knexfile.js ./packages/backend/knexfile.js

-CMD ["node", "/home/rafiki/packages/backend/dist/index.js"]
+CMD ["node", "-r", "/home/rafiki/packages/backend/dist/telemetry/index.js", "/home/rafiki/packages/backend/dist/index.js"]
Contributor Author

@mkurapov mkurapov Jul 18, 2024

Because auto-instrumentation essentially needs to wrap multiple modules, the telemetry setup needs to be loaded as the very first thing, before app startup:

https://opentelemetry.io/docs/languages/js/getting-started/nodejs/#setup

Otherwise, auto-instrumentation will not work.

Contributor Author

We could just turn this CMD into a call to pnpm start or something, but maybe it's OK like this?

Contributor

@BlairCurrey BlairCurrey Jul 18, 2024

Just curious, what happens if we don't include the --require option? Some error? Does it fail silently, or maybe coincidentally work sometimes?

I don't have a very strong opinion on the pnpm start command vs. this, but I'm leaning towards this, since we may only want to start it this way in this context, as opposed to Dockerfile.dev, from the host machine, etc. We could always have different start commands in that case, I guess.

Contributor Author

It just "fails" silently and doesn't generate traces. For example, if the pg module gets loaded before the Postgres auto-instrumentation, it just won't pick up those traces.

If you uncomment the debug logger in the telemetry file, you'll be able to see those warnings.
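
A toy model may help show why load order matters and why the failure is silent. It assumes instrumentation works by patching a module at the moment it is first loaded. The loader, the `db` module, and every name below are hypothetical, and this is not how OpenTelemetry is implemented internally.

```typescript
// Simplified, hypothetical model of load-time patching.
interface DbModule {
  query(sql: string): string
}

type Patch = (mod: DbModule) => DbModule

function createLoader(factories: Record<string, (() => DbModule) | undefined>) {
  const cache: Record<string, DbModule | undefined> = {}
  const hooks: Record<string, Patch | undefined> = {}
  return {
    // A hook only affects modules that have not been loaded yet.
    onLoad(name: string, patch: Patch): void {
      hooks[name] = patch
    },
    load(name: string): DbModule {
      const cached = cache[name]
      if (cached) return cached
      const factory = factories[name]
      if (!factory) throw new Error(`unknown module: ${name}`)
      const hook = hooks[name]
      const original = factory()
      const mod = hook ? hook(original) : original
      cache[name] = mod
      return mod
    }
  }
}

// Wraps query so that each call records a span name.
function instrument(spans: string[]): Patch {
  return (mod) => ({
    query: (sql: string): string => {
      spans.push(`db.query ${sql}`)
      return mod.query(sql)
    }
  })
}

// Runs a tiny "app" and returns the spans it produced.
function runApp(instrumentFirst: boolean): string[] {
  const spans: string[] = []
  const loader = createLoader({
    db: () => ({ query: (sql: string) => `rows for ${sql}` })
  })
  if (instrumentFirst) loader.onLoad('db', instrument(spans))
  const db = loader.load('db') // app startup loads the module
  if (!instrumentFirst) loader.onLoad('db', instrument(spans)) // too late
  db.query('SELECT 1') // succeeds either way, with no error
  return spans
}
```

In this model, `runApp(true)` records one span while `runApp(false)` records none, and the query succeeds in both cases, which is the kind of silent miss described above.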

BlairCurrey
BlairCurrey previously approved these changes Jul 22, 2024
Contributor

@BlairCurrey BlairCurrey left a comment

Looks like there are merge conflicts with the telemetry service refactor but generally looks good to me. Spun it up and viewed the traces locally. Just had the one open-ended comment here: #2816 (comment)

Ping me when ready for re-review

# Conflicts:
#	localenv/telemetry/grafana/provisioning/dashboards/example.json
#	packages/backend/src/telemetry/service.ts
Contributor

@njlie njlie left a comment

LGTM, just some dangling comments to be cleaned up. Was able to get the trace logs in the Grafana dashboard on my machine as well.

import { UndiciInstrumentation } from '@opentelemetry/instrumentation-undici'

// debug logger:
// diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG)
Contributor

nit: cleanup

Contributor Author

I was thinking about this one, but it's actually quite useful to keep for seeing whether the telemetry service is working at all, particularly the connection between the backend service and the collector itself. Otherwise we'd need to dig into the docs to find this.

This might be a rare case where keeping a commented-out line is worth it.

@mkurapov mkurapov merged commit ce66ab8 into main Jul 23, 2024
36 of 42 checks passed
@mkurapov mkurapov deleted the 2808/mk/tracing branch July 23, 2024 10:20
Development

Successfully merging this pull request may close these issues.

Configure Rafiki to be able to track traces
3 participants