
High cardinality http_server_* metrics from otelcol.receiver.zipkin #4764

Closed
glindstedt opened this issue Aug 9, 2023 · 1 comment · Fixed by #4769
Labels: enhancement, frozen-due-to-age

Comments

@glindstedt (Contributor)

What's wrong?

We've recently migrated to agent flow mode, and in that process moved to a setup with a single dedicated agent for ingesting traces. We ingest traces via both OTLP gRPC and Zipkin (and sometimes Jaeger). We noticed that the scrape job scraping metrics from the dedicated traces agent started becoming heavy, and found that this was due to some very high cardinality metrics exported by the Zipkin receiver.

Example:

http_server_duration_bucket{component_id="otelcol.receiver.zipkin.default",http_client_ip="10.129.128.11",http_flavor="1.1",http_method="POST",http_scheme="http",http_status_code="202",net_host_name="zipkin.monitoring.svc.cluster.local",net_sock_peer_addr="127.0.0.6",net_sock_peer_port="33389",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_version="0.42.0",le="0"} 0

Notice the http_client_ip and net_sock_peer_port labels, whose values quickly explode in cardinality. This appears to be caused by this upstream issue: open-telemetry/opentelemetry-go-contrib#3765

Even though we've configured our scrape job to drop all http_server_* metrics, merely parsing the /metrics endpoint gradually becomes unmanageable as the response grows without bound. I just tested with curl and the response was 232 MB on our traces agent.
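For context, the drop rule in our scrape pipeline looks roughly like the sketch below (a minimal example, assuming a flow-mode prometheus.relabel component; the component names and the remote_write target are illustrative, and the same could be done with a metric_relabel_configs rule in a plain Prometheus scrape config):

prometheus.relabel "drop_http_server" {
	// Drop the high cardinality http_server_* series before forwarding them.
	rule {
		source_labels = ["__name__"]
		regex         = "http_server_.*"
		action        = "drop"
	}

	// Illustrative target; any prometheus.* receiver works here.
	forward_to = [prometheus.remote_write.default.receiver]
}

This keeps the series out of storage, but the agent still has to build and serve the ever-growing /metrics page, which is the part that becomes unmanageable.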

I'm opening this issue in the hope that a workaround can be implemented in grafana-agent until the issue is fixed upstream.

Steps to reproduce

Run the agent in flow mode with an otelcol.receiver.zipkin component, start ingesting Zipkin traces, and watch the http_server_* metrics exposed by the agent explode in cardinality.

System information

Linux x86; GKE 1.24

Software version

Grafana Agent v0.35.0

Configuration

otelcol.receiver.zipkin "default" {
	output {
		metrics = [otelcol.processor.memory_limiter.default.input]
		logs = [otelcol.processor.memory_limiter.default.input]
		traces = [otelcol.processor.memory_limiter.default.input]
	}
}
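(Only the receiver is shown above; the otelcol.processor.memory_limiter.default component it forwards to is a standard memory limiter. A minimal sketch of such a component follows, with illustrative limits and exporter target rather than our exact values:)

otelcol.processor.memory_limiter "default" {
	// Illustrative limits; tune to the agent's available memory.
	check_interval = "1s"
	limit          = "512MiB"

	output {
		traces = [otelcol.exporter.otlp.default.input]
	}
}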

Logs

No response

@glindstedt added the bug label on Aug 9, 2023
@glindstedt (Contributor, Author)

It seems the OTLP receiver metrics also carry the problematic labels; however, for some reason they don't explode as much, since the net_sock_peer_port value appears to be more stable:

rpc_server_duration_milliseconds_bucket{component_id="otelcol.receiver.otlp.default",net_sock_peer_addr="127.0.0.6",net_sock_peer_port="38107",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc",otel_scope_version="0.42.0",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",le="0"} 962

@rfratto added the type/signals and enhancement labels and removed the bug label on Aug 15, 2023
@github-actions bot added the frozen-due-to-age label on Feb 21, 2024
@github-actions bot locked as resolved and limited conversation to collaborators on Feb 21, 2024