
Aggregated metrics for payjoin-service (with OTel Collector sidecar) #1323

Closed
spacebear21 wants to merge 2 commits into payjoin:master from spacebear21:open-telemetry-grafana

Conversation

@spacebear21
Collaborator

This PR introduces a mechanism for collecting aggregated metrics from distributed payjoin-service operators: an (optional) OpenTelemetry Collector sidecar that scrapes the local Prometheus /metrics endpoint, collects structured logs from the service's stdout, and receives traces. It pushes all three signal types to the Grafana Cloud instance I set up for the payjoin org, using per-operator credentials. Claude drew this nice explanatory diagram:

┌─────────────────────────────────┐
│  Operator Server A              │
│  ┌───────────┐  ┌─────────────┐ │
│  │ payjoin-  │  │ OTel        │ │     OTLP/gRPC or OTLP/HTTP
│  │ service   ├──► Collector   ├─┼──────────────────────┐
│  │ (/metrics)│  │ (sidecar)   │ │                      │
│  └───────────┘  └─────────────┘ │                      │
└─────────────────────────────────┘                      │
                                                         ▼
┌─────────────────────────────────┐         ┌────────────────────────┐
│  Operator Server B              │         │  Grafana Cloud         │
│  ┌───────────┐  ┌─────────────┐ │         │                        │
│  │ payjoin-  │  │ OTel        │ │  OTLP   │  Mimir  (metrics)      │
│  │ service   ├──► Collector   ├─┼────────►│  Loki   (logs)         │
│  │           │  │             │ │         │  Tempo  (traces)       │
│  └───────────┘  └─────────────┘ │         │                        │
└─────────────────────────────────┘         │  Grafana (dashboards)  │
                                            └────────────────────────┘
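
For concreteness, here's a minimal sketch of what the sidecar config looks like in spirit. The receiver, processor, exporter, and extension names are standard OTel Collector (contrib) components, but the endpoints, ports, and env var names other than OPERATOR_DOMAIN are placeholders, not the exact files in this PR:

```yaml
# Sketch of the collector sidecar config; endpoints and credential env
# vars below are illustrative placeholders.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: payjoin-service
          scrape_interval: 30s
          static_configs:
            - targets: ["payjoin-service:8080"]  # local /metrics endpoint
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # traces/logs pushed by the service

processors:
  resource:
    attributes:
      - key: operator.domain
        value: ${env:OPERATOR_DOMAIN}  # per-operator identity (see open questions)
        action: upsert

extensions:
  basicauth/grafana:
    client_auth:
      username: ${env:GRAFANA_INSTANCE_ID}
      password: ${env:GRAFANA_API_TOKEN}  # per-operator token requested from us

exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-us-east-0.grafana.net/otlp  # placeholder gateway
    auth:
      authenticator: basicauth/grafana

service:
  extensions: [basicauth/grafana]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resource]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]
```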

This is an opt-in design: each operator who opts in needs to request an auth token from us and configure the collector accordingly.

Some open questions for reviewers:

  • Is the --telemetry feature overkill? nix2container needs to build the docker image with all features enabled anyway, so in practice payjoin-service features aren't really configurable for docker users. The same goes for the --acme feature.
  • The sidecar architecture introduces a pseudo-dependency on docker-compose for running payjoin-service, since I don't expect many operators to go through the trouble of configuring and running this stack on bare metal (see the compose sketch after this list). Is this OK?
  • The current approach requires operators to set the "OPERATOR_DOMAIN" env variable, which we could use to get per-operator stats (# active operators, uptime by operator, etc.). User error would make that data unreliable, so I wonder if this could be set dynamically somehow. Maybe using acme.domains if it's set? IP address? Random UID?
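
Regarding the docker-compose question, the stack each operator would run looks roughly like this (service names, image tags, ports, and the config filename are illustrative, not the PR's actual compose file):

```yaml
# Sketch of the operator-side compose stack implied by the sidecar design.
services:
  payjoin-service:
    image: payjoin/payjoin-service:latest   # placeholder image tag
    ports:
      - "8080:8080"                         # serves /metrics locally

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otelcol/config.yaml:ro
    environment:
      OPERATOR_DOMAIN: ${OPERATOR_DOMAIN}         # per-operator identity
      GRAFANA_INSTANCE_ID: ${GRAFANA_INSTANCE_ID}
      GRAFANA_API_TOKEN: ${GRAFANA_API_TOKEN}     # auth token requested from us
    depends_on:
      - payjoin-service
```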
[screenshot: 2026-02-10 08:21]

AI disclosure: I used Opus 4.6 to design the system and write much of the code and config files; I manually reviewed everything and edited as needed.


This enables structured log output and configures exporters for
OpenTelemetry.
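
For context, the service-side structured logging wiring is roughly of this shape (a minimal sketch using the `tracing-subscriber` crate's `json` and `env-filter` features; not the exact code in this PR):

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Emit JSON logs to stdout so the collector sidecar (or any log
    // shipper) can parse fields without custom regexes.
    tracing_subscriber::fmt()
        .json()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    tracing::info!(service = "payjoin-service", "structured logging enabled");
}
```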
@coveralls
Collaborator

coveralls commented Feb 10, 2026

Pull Request Test Coverage Report for Build 21928328465

Details

  • 0 of 25 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.1%) to 83.128%

Changes Missing Coverage:

  File                         Covered Lines   Changed/Added Lines   %
  payjoin-service/src/main.rs  0               25                    0.0%

Totals Coverage Status:
  Change from base Build 21922394027: -0.1%
  Covered Lines: 10238
  Relevant Lines: 12316

💛 - Coveralls

@DanGould
Contributor

  • I think we want to be very specific about shared metrics rather than enabling logs in general. I can see that getting away from us and leading to over-collection easily.
  • I sent Ava a message about a docker compose dependency. This is the kind of thing you gotta ask the users right away. Docker is fine for BB & Cake evidently, that's how they're set up right now afaiu. BOBSpace may be on bare metal if they're not on compose already (but I think they're on compose already).

The current approach requires operators to set the "OPERATOR_DOMAIN" env variable, which we could use to get per-operator stats (# active operators, uptime by operator, etc.). User error would make that data unreliable, so I wonder if this could be set dynamically somehow. Maybe using acme.domains if it's set? IP address? Random UID?

IP seems fine, but making the program panic unless it's set also comes to mind, and then reporting "You're running as x.y.z" at startup.
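
Something like this, roughly (a sketch of the fail-fast suggestion; OPERATOR_DOMAIN is the env var from the PR, and panicking when it's missing is only proposed, not merged code):

```rust
// Refuse to start unless the operator identity is set, then announce it.
fn operator_domain() -> String {
    std::env::var("OPERATOR_DOMAIN")
        .expect("OPERATOR_DOMAIN must be set when telemetry is enabled")
}

fn main() {
    println!("You're running as {}", operator_domain());
}
```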

@DanGould
Contributor

Yesterday, 2:47 PM — @DanGould:
Is a dependency on docker compose ok for you or do you nix everything #1323

Today, 9:53 PM — @achow101:
I nix everything
Docker can fuck right off

The OpenTelemetry Collector sidecar scrapes Prometheus metrics and
receives traces and logs from the `tracing` crate. Everything is then
tagged with operator metadata and exported to a Grafana OTLP endpoint.
@spacebear21 force-pushed the open-telemetry-grafana branch from fabf8dd to e1fd193 on February 12, 2026 at 00:15
@spacebear21 changed the title from "Aggregated metrics for payjoin-service" to "Aggregated metrics for payjoin-service (with OTel Collector sidecar)" on Feb 12, 2026
@spacebear21
Collaborator Author

Superseded by #1327
