
Metric Collector CrashLoopBackOff Due to Inability to Reach Target Allocator #73

Closed
rivToadd opened this issue Feb 27, 2024 · 1 comment


rivToadd commented Feb 27, 2024

Description

The metric collector enters a continuous crash backoff loop because it fails to connect to the target allocator. The issue appears to be caused by the target allocator service being unhealthy: no endpoints are active under the service, yet there are no logs indicating why it is in this state.

Steps to Reproduce (maybe?)

1. Start with kube-otel-stack deployed from Helm chart version 3.9.
2. Upgrade the kube-otel-stack Helm chart to version 4.1 via GitOps (a roughly equivalent manual Helm command is sketched after this list).
3. Observe the metric collector's behavior post-upgrade.
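
For reference, step 2 is roughly equivalent to the following manual Helm upgrade; the release name, chart repository alias, and namespace are placeholders, and in this case the upgrade was actually applied through Argo CD:

helm repo update
helm upgrade <release-name> <repo-alias>/kube-otel-stack --version 4.1 -n <namespace>
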
Expected Behavior

The metric collector should successfully connect to the target allocator service without entering a crash loop, or at least provide logs indicating the reason for connection failures.

Actual Behavior

The metric collector continually crashes with the following errors:

Error: cannot start pipelines: Get "http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs": dial tcp 172.20adf0: connect: connection refused
2024/02/27 08:05:56 collector server run finished with error: cannot start pipelines: Get "http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs": dial tcp 172.2dff0: connect: connection refused
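
As a quick check, the /scrape_configs endpoint from the error above can be probed directly from inside the cluster to confirm whether the target allocator is serving at all; the namespace below is a placeholder and curlimages/curl is just one convenient throwaway image:

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -n <namespace> -- curl -v http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs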

Upon inspecting the target allocator service, it was found to have no active endpoints, indicating that the service was unhealthy. However, there were no logs or indicators explaining why the service was in this state.
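
The state described above can be inspected with standard kubectl commands along these lines; the namespace is a placeholder, and the target allocator deployment name and label selector are assumptions based on the names the operator normally generates:

kubectl get endpoints si-kube-otel-stack-metrics-targetallocator -n <namespace>
kubectl get pods -n <namespace> -l app.kubernetes.io/component=opentelemetry-targetallocator
kubectl logs -n <namespace> deployment/si-kube-otel-stack-metrics-targetallocator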

Temporary Fix

The issue was temporarily resolved by wiping the entire kube-otel-stack and rebuilding it. However, this is not a viable long-term solution as it involves downtime and potential data loss.
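
The wipe-and-rebuild workaround amounts to roughly the following; the release name, chart repository alias, and namespace are placeholders, and in this case the teardown and re-creation were driven through Argo CD rather than run by hand:

helm uninstall <release-name> -n <namespace>
helm install <release-name> <repo-alias>/kube-otel-stack --version 4.1 -n <namespace>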

Environment

Kubernetes version: 1.28
Helm chart version: Upgraded from 3.9 to 4.1
GitOps tool: argocd

Additional Context

https://cloud-native.slack.com/archives/C033BJ8BASU/p1709021610781889

@jaronoff97 (Collaborator) commented:

I transferred this to the operator; we think we know what's causing this now: open-telemetry/opentelemetry-operator#2873
