Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(server): OpenTelemetry integration #7356

Merged
merged 29 commits into from
Mar 12, 2024
Merged

feat(server): OpenTelemetry integration #7356

merged 29 commits into from
Mar 12, 2024

Conversation

mertalev
Copy link
Contributor

@mertalev mertalev commented Feb 23, 2024

Description

This PR greatly simplifies profiling performance issues by using OpenTelemetry. The current state of the PR is the addition of a /metrics endpoint for Prometheus that includes duration, count and sum metrics for repository methods and HTTP requests as well as host metrics. Additionally, instrumentation for Nest, Postgres and Redis is available if an OTLP exporter is configured through env variables.

To do:

  • Convert these spans to metrics so Prometheus can scrape them
  • Expose specific configuration settings as env variables
  • Disable all of this by default

recognition-trace

Copy link

cloudflare-pages bot commented Feb 23, 2024

Deploying with  Cloudflare Pages  Cloudflare Pages

Latest commit: 90166eb
Status: ✅  Deploy successful!
Preview URL: https://43e1673d.immich.pages.dev
Branch Preview URL: https://feat-server-otlp.immich.pages.dev

View logs

Copy link
Contributor

@jrasm91 jrasm91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have this be something that is turned on or off from the administration settings? Can it be controlled at runtime?

@bo0tzz
Copy link
Member

bo0tzz commented Feb 23, 2024

To me this is something that is technical enough (also requiring deployment of other tools in the environment and such) that enabling it through an env var seems more logical.

Copy link
Member

@danieldietzler danieldietzler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff! Really looking forward to this

server/src/infra/infra.utils.ts Outdated Show resolved Hide resolved
server/src/infra/repositories/database.repository.ts Outdated Show resolved Hide resolved
server/src/infra/repositories/filesystem.provider.ts Outdated Show resolved Hide resolved
server/src/infra/tracing.ts Outdated Show resolved Hide resolved
@jrasm91
Copy link
Contributor

jrasm91 commented Feb 23, 2024

To me this is something that is technical enough (also requiring deployment of other tools in the environment and such) that enabling it through an env var seems more logical.

What other tools are required? You basically just need to point a pre-existing prom instance at the /metrics endpoint.

@mertalev
Copy link
Contributor Author

mertalev commented Feb 24, 2024

It looks like the functionality I'm looking for to turn spans to metrics comes with the Span Metrics Connector. But that would mean pushing to an OpenTelemetry Collector and integrating that with Prometheus. Hmm.

@mertalev
Copy link
Contributor Author

Alright, after some wrestling I finally managed to export the execution time metrics to Prometheus. It uses histograms for metrics rather than spans/traces.

@mertalev mertalev marked this pull request as ready for review February 24, 2024 21:00
@mertalev
Copy link
Contributor Author

Also, anything returning a Promise has now been made async because this is the only way for the decorator to know it should await the call.

@mertalev
Copy link
Contributor Author

Would it make sense to have this be something that is turned on or off from the administration settings? Can it be controlled at runtime?

The SDK can be turned on or off, but it also has a bunch of env variables built into it already so any configuration we add is kinda redundant.

Copy link
Member

@danieldietzler danieldietzler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff!

server/src/infra/infra.utils.ts Outdated Show resolved Hide resolved
server/src/infra/infra.utils.ts Outdated Show resolved Hide resolved
server/src/domain/repositories/server-info.repository.ts Outdated Show resolved Hide resolved
server/src/infra/infra.utils.ts Outdated Show resolved Hide resolved
server/src/infra/instrumentation.ts Outdated Show resolved Hide resolved
server/src/infra/instrumentation.ts Outdated Show resolved Hide resolved
server/src/infra/repositories/access.repository.ts Outdated Show resolved Hide resolved
server/src/infra/repositories/activity.repository.ts Outdated Show resolved Hide resolved
@alextran1502
Copy link
Contributor

Mert, please feel free to merge this after resolving the conflicts

@jrasm91
Copy link
Contributor

jrasm91 commented Mar 11, 2024

Would it make sense to have this be something that is turned on or off from the administration settings? Can it be controlled at runtime?

The SDK can be turned on or off, but it also has a bunch of env variables built into it already so any configuration we add is kinda redundant.

I just like being able to turn it on without having to restart the instance. I'm thinking of a troubleshooting situation where I turn on the sdk from the Admin Settings, then go to the problematic view, generate the monitoring, then go back and turn it off. It just seems like it could be more convenient than having to set and ENV and restart it, but that's just my opinion at least.

@mertalev
Copy link
Contributor Author

mertalev commented Mar 12, 2024

Would it make sense to have this be something that is turned on or off from the administration settings? Can it be controlled at runtime?

The SDK can be turned on or off, but it also has a bunch of env variables built into it already so any configuration we add is kinda redundant.

I just like being able to turn it on without having to restart the instance. I'm thinking of a troubleshooting situation where I turn on the sdk from the Admin Settings, then go to the problematic view, generate the monitoring, then go back and turn it off. It just seems like it could be more convenient than having to set and ENV and restart it, but that's just my opinion at least.

That makes sense - we can possibly integrate this into the config repo in a later PR. But I think there are some questions to answer around how the SDK actually handles reconfiguration and restarts, whether it's okay to always wrap repo methods with the instrumentation decorator so it can be turned on or off dynamically, etc.

@mertalev mertalev enabled auto-merge (squash) March 12, 2024 05:09
@mertalev mertalev merged commit a097e90 into main Mar 12, 2024
24 checks passed
@mertalev mertalev deleted the feat/server-otlp branch March 12, 2024 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants