Target Allocator throwing "Node name is missing from the spec. Restarting watch routine" with high CPU/Memory #2916
Comments
thanks for moving this over :D per my comment, I'm wondering if there's some type of leak in the node strategy causing this high usage. This is certainly MUCH higher than I would anticipate. Are you able to take a profile to share? If not, I can attempt to repro, but my backlog is pretty huge right now.
my bet is that restarting the watch routine is causing a ton of memory churn...
I am happy to get a profile - if you can tell me how to do that?
@jaronoff97, opentelemetry-operator/cmd/otel-allocator/collector/collector.go, lines 134 to 137 in 48bc207:

I don't think we should be restarting the watch routine, as that feels unnecessary, but also we can't allocate a target for a pod that we know cannot be scraped yet. You can follow the steps under the "Collecting and Analyzing Profiles" section.
For others' info - I have sent @jaronoff97 CPU and memory profiles on Slack.
I don't think we ever close this watcher, actually. So if it keeps getting restarted for some reason, we get a memory/goroutine leak. #2528 should fix this by using standard informers. |
merged ^^ i'm hoping that helps out! Please let me know if it doesn't |
@jaronoff97 Awesome! Do you guys push a 'main' or 'latest' build to Docker? I can try that out and let you know if it helps.. |
yep, we do push a main :) |
Ok - good news / bad news... Good news: I am not seeing the problem anymore with the new release; it seems to work. I'm doing some cluster rolls to see if I can trigger the behavior, otherwise I'll just have to test by moving to larger clusters.
Ok.. more good news: I was able to go back and reproduce the problem with the old code just by rolling the cluster (which puts pods into a Pending state). Rolling to the newest codebase immediately fixes it.
Component(s)
target allocator
What happened?
(moved from open-telemetry/opentelemetry-collector-contrib#32747, where I opened this in the wrong place)
Description
We are looking into using OTEL to replace our current Prometheus scraping based system. The desire is to run OTEL Collectors in a DaemonSet across the cluster, and use a TargetAllocator in `per-node` mode to pick up all the existing ServiceMonitor/PodMonitor objects and pass out the configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine. We saw the TargetAllocator use ~128Mi of memory and virtually zero CPU, and the configurations it passed out seemed correct. However, as soon as we spun this up on a "small" but "real" cluster (~15 nodes, a few workloads), we see the `targetallocator` pods go into a painful loop and use a ton of CPU and memory. When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:
All of our clusters are generally configured the same: different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, etc.
What can I look for to better troubleshoot what might be wrong here?
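For context, the setup described above corresponds roughly to a CR like the following. This is a sketch only; the exact field names and API version should be checked against the operator docs for your release:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: node-collector
spec:
  mode: daemonset
  targetAllocator:
    enabled: true
    allocationStrategy: per-node
    prometheusCR:
      enabled: true   # pick up existing ServiceMonitor/PodMonitor objects
  config: |
    # collector pipeline omitted; the prometheus receiver's scrape configs
    # are handed out by the target allocator in this mode
```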
Steps to Reproduce
Expected Result
We don't expect the TargetAllocator pods to enter this loop or use this much CPU and memory on a small cluster.
Kubernetes Version
1.28
Operator version
0.98.0
Collector version
0.98.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
Log output
Additional context
OpenTelemetry Collector configuration