Metric Collector crashbackoffloop Due to Inability to Reach Target Allocator #2873

jaronoff97 · 2024-04-17T14:39:11Z

Component(s)

collector, target allocator

Describe the issue you're reporting

Description

The metric collector enters a continuous crash backoff loop because it fails to connect to the target allocator. The cause appears to be that the target allocator service is unhealthy: it has no active endpoints, and there are no logs indicating why.

Steps to Reproduce (maybe?)

1. Start with the kube-otel-stack deployed with Helm chart version 3.9.
2. Upgrade the Helm chart to version 4.1 for the kube-otel-stack via GitOps.
3. Observe the metric collector's behavior post-upgrade.

Expected Behavior

The metric collector should successfully connect to the target allocator service without entering a crash loop, or at least provide logs indicating the reason for connection failures.

Actual Behavior

The metric collector continually crashes with the following errors:

Error: cannot start pipelines: Get "http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs": dial tcp 172.20adf0: connect: connection refused
2024/02/27 08:05:56 collector server run finished with error: cannot start pipelines: Get "http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs": dial tcp 172.2dff0: connect: connection refused

Upon inspecting the target allocator service, it was found to have no active endpoints, indicating that the service was unhealthy. However, there were no logs or indicators explaining why the service was in this state.

Temporary Fix

The issue was temporarily resolved by wiping the entire kube-otel-stack and rebuilding it. However, this is not a viable long-term solution as it involves downtime and potential data loss.

Environment

Kubernetes version: 1.28
Helm chart version: Upgraded from 3.9 to 4.1
GitOps tool: Argo CD

Additional Context

https://cloud-native.slack.com/archives/C033BJ8BASU/p1709021610781889


jaronoff97 · 2024-04-17T14:40:22Z

@swiatekm-sumo found that this is probably due to this line which is merging the selector (@pavolloffay FWIW the tempo-operator does the same thing)
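
To illustrate why merging a selector could leave a service with no endpoints, here is a minimal sketch in Go. It is not the operator's actual code: `mergeSelectors`, `matches`, and all label keys and values are hypothetical, and it assumes the upgrade changed the labels in the desired selector. The only Kubernetes behavior it relies on is that a pod is selected only when it carries every key/value pair in the selector.

```go
package main

import "fmt"

// matches reports whether podLabels contains every key/value pair in selector,
// which is how an equality-based Service selector picks pods.
func matches(selector, podLabels map[string]string) bool {
	for k, want := range selector {
		got, ok := podLabels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// mergeSelectors (hypothetical) keeps every existing pair and overlays the
// desired ones, so pairs that were dropped from desired are never removed.
func mergeSelectors(existing, desired map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range desired {
		merged[k] = v
	}
	return merged
}

func main() {
	// All labels below are made up for illustration.
	existing := map[string]string{"example/name": "targetallocator", "example/chart-version": "3.9"}
	desired := map[string]string{"example/name": "targetallocator", "example/managed-by": "operator"}
	pod := map[string]string{"example/name": "targetallocator", "example/managed-by": "operator"}

	merged := mergeSelectors(existing, desired)
	fmt.Println("merged selector matches pod:", matches(merged, pod))    // false
	fmt.Println("replaced selector matches pod:", matches(desired, pod)) // true
}
```

Under these assumptions the merged selector still demands the stale `example/chart-version` pair, which no new pod carries, so the service would select nothing. Replacing the selector outright avoids that.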

jaronoff97 added the needs triage label Apr 17, 2024

jaronoff97 mentioned this issue Apr 17, 2024

Metric Collector crashbackoffloop Due to Inability to Reach Target Allocator lightstep/otel-collector-charts#73

Closed

jaronoff97 added the bug, area:collector, and area:target-allocator labels and removed the needs triage label Apr 17, 2024

jaronoff97 mentioned this issue Apr 17, 2024

Fixes internal bug for modified selector #2874

Merged

pavolloffay closed this as completed in #2874 Apr 17, 2024
