Target Allocator throwing "Node name is missing from the spec. Restarting watch routine" with high CPU/Memory #2916

Closed
diranged opened this issue Apr 30, 2024 · 12 comments · Fixed by #2528
Labels
area:target-allocator, bug, needs triage

@diranged

Component(s)

target allocator

What happened?

(moved from open-telemetry/opentelemetry-collector-contrib#32747, where I opened this in the wrong place)

Description

We are looking into using OTEL to replace our current Prometheus scraping-based system. The plan is to run OTEL Collectors as a DaemonSet across the cluster and use a Target Allocator in per-node mode to pick up all the existing ServiceMonitor/PodMonitor objects and hand out the scrape configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine: the Target Allocator used ~128Mi of memory and virtually zero CPU, and the configurations it handed out looked correct. However, as soon as we spun this up on a small but real cluster (~15 nodes, a few workloads), the target allocator pods went into a painful loop and used a ton of CPU and memory:

[screenshots: target allocator CPU and memory usage graphs]

When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:

{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}

All of our clusters are configured in generally the same way: different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, etc.

What can I look for to better troubleshoot what might be wrong here?

Steps to Reproduce

Expected Result

We obviously don't expect the Target Allocator pods to end up in this loop, or to use this much CPU and memory, on a small cluster.

Kubernetes Version

1.28

Operator version

0.98.0

Collector version

0.98.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

Log output

{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Starting the Target Allocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Starting server..."}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
...

Additional context

OpenTelemetry Collector configuration

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-agent
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString
  config: |-
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
      otlp/metrics:
        endpoint: 'otel-collector-metrics-collector:4317'
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          key_file: /tls/tls.key
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
    processors:
      k8sattributes:...
    receivers:
      hostmetrics:
        collection_interval: 10s
        root_path: /hostfs
        scrapers:
          cpu:
            metrics:
              system.cpu.frequency:
                enabled: true
              system.cpu.logical.count:
                enabled: true
              system.cpu.physical.count:
                enabled: true
              system.cpu.utilization:
                enabled: true
          disk: {}
          filesystem:
            exclude_fs_types:
              fs_types:
              - autofs
              - binfmt_misc
              - bpf
              - cgroup2
              - configfs
              - debugfs
              - devpts
              - fusectl
              - hugetlbfs
              - iso9660
              - mqueue
              - nsfs
              - proc
              - procfs
              - pstore
              - rpc_pipefs
              - securityfs
              - selinuxfs
              - squashfs
              - sysfs
              - tracefs
              match_type: strict
            exclude_mount_points:
              match_type: regexp
              mount_points:
              - /dev/*
              - /proc/*
              - /sys/*
              - /run/k3s/containerd/*
              - /var/lib/docker/*
              - /var/lib/kubelet/*
              - /snap/*
            metrics:
              system.filesystem.utilization:
                enabled: true
          load: {}
          memory:
            metrics:
              system.memory.utilization:
                enabled: true
          network:
            metrics:
              system.network.conntrack.count:
                enabled: true
              system.network.conntrack.max:
                enabled: true
          paging:
            metrics:
              system.paging.utilization:
                enabled: true
          processes: {}
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 15s
        endpoint: https://${env:KUBE_NODE_NAME}:10250
        insecure_skip_verify: true
      prometheus:
        config:
          scrape_configs:
          - job_name: otel-collector
        report_extra_scrape_metrics: true
    service:
      extensions:
      - health_check
      - pprof
      pipelines:...
      telemetry:
        logs:
          level: 'info'
  deploymentUpdateStrategy: {}
  env:
  - name: KUBE_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: node.name=$(KUBE_NODE_NAME)
  image: otel/opentelemetry-collector-contrib:0.98.0
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: daemonset
  nodeSelector:
    kubernetes.io/os: linux
  observability:
    metrics:
      enableMetrics: true
  podDisruptionBudget:
    maxUnavailable: 1
  podSecurityContext:
    runAsGroup: 0
    runAsUser: 0
  priorityClassName: otel-collector
  replicas: 1
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 192Mi
  securityContext:
    capabilities:
      add:
      - SYS_PTRACE
  shareProcessNamespace: true
  targetAllocator:
    allocationStrategy: per-node
    enabled: true
    filterStrategy: relabel-config
    image: otel/target-allocator:0.98.0
    observability:
      metrics: {}
    prometheusCR:
      enabled: true
      scrapeInterval: 1m0s
    replicas: 2
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 2Gi
  tolerations:
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
  upgradeStrategy: automatic
  volumeMounts:
  - mountPath: /hostfs
    mountPropagation: HostToContainer
    name: hostfs
    readOnly: true
  - mountPath: /etc/passwd
    name: etc-passwd
    readOnly: true
  - mountPath: /hostfs/var/cache
    name: host-var-cache
  - mountPath: /hostfs/run
    name: host-run
  - mountPath: /tls/ca.crt
    name: tls-ca
    readOnly: true
    subPath: ca.crt
  - mountPath: /tls/tls.key
    name: tls
    readOnly: true
    subPath: tls.key
  - mountPath: /tls/tls.crt
    name: tls
    readOnly: true
    subPath: tls.crt
  volumes:
  - hostPath:
      path: /
    name: hostfs
  - hostPath:
      path: /etc/passwd
    name: etc-passwd
  - hostPath:
      path: /run
      type: DirectoryOrCreate
    name: host-run
  - hostPath:
      path: /var/cache
      type: DirectoryOrCreate
    name: host-var-cache
  - name: tls-ca
    secret:
      defaultMode: 420
      items:
      - key: ca.crt
        path: ca.crt
      secretName: otel-collector-issuer
  - name: tls
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: otel-collector-agent
@jaronoff97
Contributor

Thanks for moving this over :D Per my comment, I'm wondering if there's some type of leak in the per-node strategy causing this high usage. This is certainly much higher than I would anticipate. Are you able to take a profile to share? If not, I can attempt to repro, but my backlog is pretty huge right now.

@jaronoff97
Contributor

my bet is that restarting the watch routine is causing a ton of memory churn...

@diranged
Author

I'm happy to capture a profile if you can tell me how to do that.

@diranged
Author

@jaronoff97,
Just looking at the code that logs this message, I don't understand what its point is... every cluster is going to have Pods where .spec.nodeName isn't filled in yet, e.g. pods that are Pending. It feels like the operator is giving up and "trying again" whenever it sees one. Am I reading this wrong?

if pod.Spec.NodeName == "" {
    k.log.Info("Node name is missing from the spec. Restarting watch routine")
    return ""
}
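
For illustration, here is a minimal, self-contained sketch of the pattern being described. This is not the operator's actual code, just a guess at the shape of the loop using client-go; the namespace and label selector are hypothetical:

package main

import (
    "context"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    ctx := context.Background()

    for {
        // A fresh watch is opened on every pass through the outer loop.
        w, err := client.CoreV1().Pods("observability").Watch(ctx, metav1.ListOptions{
            LabelSelector: "app.kubernetes.io/component=opentelemetry-collector", // hypothetical selector
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Println("Successfully started a collector pod watcher")

        for ev := range w.ResultChan() {
            pod, ok := ev.Object.(*corev1.Pod)
            if !ok {
                continue
            }
            if pod.Spec.NodeName == "" {
                // A single unscheduled (Pending) pod lands here, which aborts the
                // inner loop and reopens the watch. If w.Stop() is never called,
                // every restart leaves the previous watch's goroutine and buffers behind.
                log.Println("Node name is missing from the spec. Restarting watch routine")
                break
            }
            // ...assign this pod's targets to the collector on pod.Spec.NodeName...
        }
    }
}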

@jaronoff97
Contributor

I don't think we should be restarting the watch routine as that feels unnecessary, but also we can't allocate a target for a pod that we know cannot be scraped yet.

You can follow the steps under the "Collecting and Analyzing Profiles" header in this doc, where the port is whatever port your TA pod has exposed.

@diranged
Author

For others' info: I have sent @jaronoff97 CPU and memory profiles on Slack.

@swiatekm
Contributor

swiatekm commented May 1, 2024

my bet is that restarting the watch routine is causing a ton of memory churn...

I don't think we ever close this watcher, actually. So if it keeps getting restarted for some reason, we get a memory/goroutine leak.

#2528 should fix this by using standard informers.
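
For reference, here is a rough sketch of what an informer-based approach looks like with client-go. This is illustrative only, not the code from #2528; the namespace and label selector are hypothetical. The shared informer handles re-listing and reconnects internally, so there is no hand-rolled restart loop to leak goroutines, and unscheduled pods can simply be skipped until an update arrives with a node name:

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One shared informer, scoped to the collector pods we care about.
    factory := informers.NewSharedInformerFactoryWithOptions(
        client,
        5*time.Minute, // periodic resync
        informers.WithNamespace("observability"), // hypothetical namespace
        informers.WithTweakListOptions(func(o *metav1.ListOptions) {
            o.LabelSelector = "app.kubernetes.io/component=opentelemetry-collector" // hypothetical selector
        }),
    )

    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            pod := obj.(*corev1.Pod)
            if pod.Spec.NodeName == "" {
                // Still Pending; ignore it. An Update event will follow
                // once the scheduler assigns a node.
                return
            }
            fmt.Printf("collector %s is on node %s\n", pod.Name, pod.Spec.NodeName)
        },
        UpdateFunc: func(_, newObj interface{}) {
            pod := newObj.(*corev1.Pod)
            if pod.Spec.NodeName != "" {
                fmt.Printf("collector %s scheduled onto node %s\n", pod.Name, pod.Spec.NodeName)
            }
        },
        DeleteFunc: func(obj interface{}) {
            if pod, ok := obj.(*corev1.Pod); ok {
                fmt.Printf("collector %s removed\n", pod.Name)
            }
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    cache.WaitForCacheSync(stop, podInformer.HasSynced)
    select {} // keep handling events until the process exits
}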

@jaronoff97
Contributor

Merged ^^ I'm hoping that helps out! Please let me know if it doesn't.

@diranged
Author

diranged commented May 1, 2024

@jaronoff97 Awesome! Do you push a 'main' or 'latest' build to Docker? I can try that out and let you know if it helps.

@jaronoff97
Contributor

yep, we do push a main :)

@diranged
Author

diranged commented May 1, 2024

OK, good news / bad news...

Good news: I am not seeing the problem anymore with the new release; it seems to work.
Bad news: Whatever state our cluster or pods were in that was triggering the bug seems to be resolved now. I ran the old release (0.98.0) first and it also did not exhibit the bug.

I'm doing some cluster rolls to see if I can trigger the behavior again; otherwise I'll just have to test by moving to larger clusters.

@diranged
Author

diranged commented May 1, 2024

OK, more good news: I was able to go back and reproduce the problem with the old code just by rolling the cluster (which puts pods into a Pending state). Rolling out the newest codebase immediately fixes it.
