Target Allocator throwing "Node name is missing from the spec. Restarting watch routine" with high CPU/Memory #2916

Closed
diranged opened this issue Apr 30, 2024 · 12 comments · Fixed by #2528
Labels
area:target-allocator, bug, needs triage

@diranged

Component(s)

target allocator

What happened?

(moved from open-telemetry/opentelemetry-collector-contrib#32747, where I opened this in the wrong place)

Description

We are looking into using OTEL to replace our current Prometheus scraping-based system. The plan is to run OTEL Collectors as a DaemonSet across the cluster and use a Target Allocator in per-node mode to pick up all the existing ServiceMonitor/PodMonitor objects and hand out the scrape configs and endpoints.

We had this running on a test cluster with ~8 nodes and it worked fine: the Target Allocator used ~128Mi of memory and virtually zero CPU, and the configurations it handed out looked correct. However, as soon as we spun this up on a small but real cluster (~15 nodes, a few workloads), the target allocator pods went into a painful loop and used a ton of CPU and memory:

[screenshots: target allocator CPU and memory usage graphs]

When we look at the logs, the pods are in a loop spewing thousands of lines over and over again like this:

{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:27:28Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}

All of our clusters are configured in generally the same way: different workloads, but the same kinds of controllers, Kubernetes versions, node OSes, etc.

What can I look for to better troubleshoot what might be wrong here?

Steps to Reproduce

Expected Result

We obviously don't expect the Target Allocator pods to end up in this loop, or to use this much CPU and memory, on a small cluster.

Kubernetes Version

1.28

Operator version

0.98.0

Collector version

0.98.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

Log output

{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Starting the Target Allocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Starting server..."}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for namespace"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for servicemonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Waiting for caches to sync for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","msg":"Caches are synced for podmonitors"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":"2024-04-29T22:50:03Z","logger":"allocator","msg":"Node name is missing from the spec. Restarting watch routine","component":"opentelemetry-targetallocator"}
...

Additional context

OpenTelemetry Collector configuration

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-agent
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString
  config: |-
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
      otlp/metrics:
        endpoint: 'otel-collector-metrics-collector:4317'
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          key_file: /tls/tls.key
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
    processors:
      k8sattributes:...
    receivers:
      hostmetrics:
        collection_interval: 10s
        root_path: /hostfs
        scrapers:
          cpu:
            metrics:
              system.cpu.frequency:
                enabled: true
              system.cpu.logical.count:
                enabled: true
              system.cpu.physical.count:
                enabled: true
              system.cpu.utilization:
                enabled: true
          disk: {}
          filesystem:
            exclude_fs_types:
              fs_types:
              - autofs
              - binfmt_misc
              - bpf
              - cgroup2
              - configfs
              - debugfs
              - devpts
              - fusectl
              - hugetlbfs
              - iso9660
              - mqueue
              - nsfs
              - proc
              - procfs
              - pstore
              - rpc_pipefs
              - securityfs
              - selinuxfs
              - squashfs
              - sysfs
              - tracefs
              match_type: strict
            exclude_mount_points:
              match_type: regexp
              mount_points:
              - /dev/*
              - /proc/*
              - /sys/*
              - /run/k3s/containerd/*
              - /var/lib/docker/*
              - /var/lib/kubelet/*
              - /snap/*
            metrics:
              system.filesystem.utilization:
                enabled: true
          load: {}
          memory:
            metrics:
              system.memory.utilization:
                enabled: true
          network:
            metrics:
              system.network.conntrack.count:
                enabled: true
              system.network.conntrack.max:
                enabled: true
          paging:
            metrics:
              system.paging.utilization:
                enabled: true
          processes: {}
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 15s
        endpoint: https://${env:KUBE_NODE_NAME}:10250
        insecure_skip_verify: true
      prometheus:
        config:
          scrape_configs:
          - job_name: otel-collector
        report_extra_scrape_metrics: true
    service:
      extensions:
      - health_check
      - pprof
      pipelines:...
      telemetry:
        logs:
          level: 'info'
  deploymentUpdateStrategy: {}
  env:
  - name: KUBE_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: node.name=$(KUBE_NODE_NAME)
  image: otel/opentelemetry-collector-contrib:0.98.0
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: daemonset
  nodeSelector:
    kubernetes.io/os: linux
  observability:
    metrics:
      enableMetrics: true
  podDisruptionBudget:
    maxUnavailable: 1
  podSecurityContext:
    runAsGroup: 0
    runAsUser: 0
  priorityClassName: otel-collector
  replicas: 1
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 192Mi
  securityContext:
    capabilities:
      add:
      - SYS_PTRACE
  shareProcessNamespace: true
  targetAllocator:
    allocationStrategy: per-node
    enabled: true
    filterStrategy: relabel-config
    image: otel/target-allocator:0.98.0
    observability:
      metrics: {}
    prometheusCR:
      enabled: true
      scrapeInterval: 1m0s
    replicas: 2
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 2Gi
  tolerations:
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
  upgradeStrategy: automatic
  volumeMounts:
  - mountPath: /hostfs
    mountPropagation: HostToContainer
    name: hostfs
    readOnly: true
  - mountPath: /etc/passwd
    name: etc-passwd
    readOnly: true
  - mountPath: /hostfs/var/cache
    name: host-var-cache
  - mountPath: /hostfs/run
    name: host-run
  - mountPath: /tls/ca.crt
    name: tls-ca
    readOnly: true
    subPath: ca.crt
  - mountPath: /tls/tls.key
    name: tls
    readOnly: true
    subPath: tls.key
  - mountPath: /tls/tls.crt
    name: tls
    readOnly: true
    subPath: tls.crt
  volumes:
  - hostPath:
      path: /
    name: hostfs
  - hostPath:
      path: /etc/passwd
    name: etc-passwd
  - hostPath:
      path: /run
      type: DirectoryOrCreate
    name: host-run
  - hostPath:
      path: /var/cache
      type: DirectoryOrCreate
    name: host-var-cache
  - name: tls-ca
    secret:
      defaultMode: 420
      items:
      - key: ca.crt
        path: ca.crt
      secretName: otel-collector-issuer
  - name: tls
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: otel-collector-agent
@jaronoff97
Contributor

Thanks for moving this over :D Per my comment, I'm wondering if there's some type of leak in the per-node strategy causing this high usage. This is certainly much higher than I would anticipate. Are you able to take a profile to share? If not, I can attempt to repro, but my backlog is pretty huge right now.

@jaronoff97
Contributor

my bet is that restarting the watch routine is causing a ton of memory churn...

@diranged
Author

I'm happy to capture a profile if you can tell me how to do that.

@diranged
Author

@jaronoff97,
Just looking at the code that logs this message, I don't understand what its point is... every cluster is going to have Pods where .spec.nodeName isn't filled in yet, e.g. pods that are Pending. It feels like the operator is giving up and "trying again" whenever it sees one. Am I reading this wrong?

if pod.Spec.NodeName == "" {
    k.log.Info("Node name is missing from the spec. Restarting watch routine")
    return ""
}
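
For illustration, here is a minimal, self-contained sketch of the pattern being described. This is not the operator's actual code, just a guess at the shape of the loop using client-go; the namespace and label selector are hypothetical:

package main

import (
    "context"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    ctx := context.Background()

    for {
        // A fresh watch is opened on every pass through the outer loop.
        w, err := client.CoreV1().Pods("observability").Watch(ctx, metav1.ListOptions{
            LabelSelector: "app.kubernetes.io/component=opentelemetry-collector", // hypothetical selector
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Println("Successfully started a collector pod watcher")

        for ev := range w.ResultChan() {
            pod, ok := ev.Object.(*corev1.Pod)
            if !ok {
                continue
            }
            if pod.Spec.NodeName == "" {
                // A single unscheduled (Pending) pod lands here, which aborts the
                // inner loop and reopens the watch. If w.Stop() is never called,
                // every restart leaves the previous watch's goroutine and buffers behind.
                log.Println("Node name is missing from the spec. Restarting watch routine")
                break
            }
            // ...assign this pod's targets to the collector on pod.Spec.NodeName...
        }
    }
}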

@jaronoff97
Contributor

I don't think we should be restarting the watch routine as that feels unnecessary, but also we can't allocate a target for a pod that we know cannot be scraped yet.

You can follow the steps under the "Collecting and Analyzing Profiles" header in this doc, where the port is whatever port your TA pod has exposed.

@diranged
Author

For others' info: I have sent @jaronoff97 CPU and memory profiles on Slack.

@swiatekm
Contributor

swiatekm commented May 1, 2024

my bet is that restarting the watch routine is causing a ton of memory churn...

I don't think we ever close this watcher, actually. So if it keeps getting restarted for some reason, we get a memory/goroutine leak.

#2528 should fix this by using standard informers.
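
For reference, here is a rough sketch of what an informer-based approach looks like with client-go. This is illustrative only, not the code from #2528; the namespace and label selector are hypothetical. The shared informer handles re-listing and reconnects internally, so there is no hand-rolled restart loop to leak goroutines, and unscheduled pods can simply be skipped until an update arrives with a node name:

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One shared informer, scoped to the collector pods we care about.
    factory := informers.NewSharedInformerFactoryWithOptions(
        client,
        5*time.Minute, // periodic resync
        informers.WithNamespace("observability"), // hypothetical namespace
        informers.WithTweakListOptions(func(o *metav1.ListOptions) {
            o.LabelSelector = "app.kubernetes.io/component=opentelemetry-collector" // hypothetical selector
        }),
    )

    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            pod := obj.(*corev1.Pod)
            if pod.Spec.NodeName == "" {
                // Still Pending; ignore it. An Update event will follow
                // once the scheduler assigns a node.
                return
            }
            fmt.Printf("collector %s is on node %s\n", pod.Name, pod.Spec.NodeName)
        },
        UpdateFunc: func(_, newObj interface{}) {
            pod := newObj.(*corev1.Pod)
            if pod.Spec.NodeName != "" {
                fmt.Printf("collector %s scheduled onto node %s\n", pod.Name, pod.Spec.NodeName)
            }
        },
        DeleteFunc: func(obj interface{}) {
            if pod, ok := obj.(*corev1.Pod); ok {
                fmt.Printf("collector %s removed\n", pod.Name)
            }
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    cache.WaitForCacheSync(stop, podInformer.HasSynced)
    select {} // keep handling events until the process exits
}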

@jaronoff97
Contributor

Merged ^^ I'm hoping that helps out! Please let me know if it doesn't.

@diranged
Author

diranged commented May 1, 2024

@jaronoff97 Awesome! Do you push a 'main' or 'latest' build to Docker? I can try that out and let you know if it helps.

@jaronoff97
Contributor

yep, we do push a main :)

@diranged
Author

diranged commented May 1, 2024

OK, good news / bad news...

Good news: I am not seeing the problem anymore with the new release; it seems to work.
Bad news: Whatever state our cluster or pods were in that was triggering the bug seems to be resolved now. I ran the old release (0.98.0) first and it also did not exhibit the bug.

I'm doing some cluster rolls to see if I can trigger the behavior again; otherwise I'll just have to test by moving to larger clusters.

@diranged
Author

diranged commented May 1, 2024

OK, more good news: I was able to go back and reproduce the problem with the old code just by rolling the cluster (which puts pods into a Pending state). Rolling out the newest codebase immediately fixes it.
