[receiver/k8sobjectsreceiver] watcher not restarting when kubernetes hangs up. #18078

Closed
stokerjon opened this issue Jan 27, 2023 · 6 comments
Labels: bug (Something isn't working), priority:p2 (Medium), receiver/k8sobjects

@stokerjon

Component(s)

receiver/k8sobjects

What happened?

Description

Kubernetes periodically hangs up watch connections, the receiver does not then restart them.

Steps to Reproduce

Create a k8sobjectsreceiver watching for kubernetes events.

Expected Result

k8sobjectsreceiver continues watching for events and, when Kubernetes hangs up the connection, restarts the watch.

Actual Result

Kubernetes hangs up the k8sobjectsreceiver watch, but the receiver does not restart it.
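
For context, the expected behavior boils down to wrapping the dynamic client's Watch call in a retry loop and re-establishing the watch whenever the result channel closes. Below is only a minimal sketch of that pattern; gvr, handleEvent, and the surrounding wiring are placeholders for illustration, not the receiver's actual code.

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/dynamic"
)

// watchLoop keeps a watch open on the given resource and re-establishes it
// whenever the API server closes the connection. gvr, handleEvent and the
// surrounding wiring are placeholders for illustration only.
func watchLoop(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, handleEvent func(watch.Event)) error {
    for {
        w, err := client.Resource(gvr).Watch(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
    events:
        for {
            select {
            case <-ctx.Done():
                w.Stop()
                return ctx.Err()
            case ev, ok := <-w.ResultChan():
                if !ok {
                    // The server hung up (e.g. its request timeout elapsed);
                    // restart the watch instead of giving up silently.
                    w.Stop()
                    break events
                }
                handleEvent(ev)
            }
        }
    }
}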

Collector version

0.68.0

Environment information

Environment

OS: Amazon Linux

OpenTelemetry Collector configuration

exporters:
  splunk_hec/platform_logs:
    disable_compression: true
    endpoint: OMITTED
    index: OMITTED
    max_connections: 200
    profiling_data_enabled: false
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    source: kubernetes
    splunk_app_name: splunk-otel-collector
    splunk_app_version: 0.68.0
    timeout: 10s
    tls:
      insecure_skip_verify: false
    token: OMITTED
extensions:
  health_check: null
  memory_ballast:
    size_mib: OMITTED
processors:
  batch: null
  memory_limiter:
    check_interval: 2s
    limit_mib: OMITTED
  resource:
    attributes:
    - action: insert
      key: metric_source
      value: kubernetes
    - action: upsert
      key: k8s.cluster.name
      value: OMITTED
  resource/add_environment:
    attributes:
    - action: insert
      key: deployment.environment
      value: OMITTED
  resourcedetection:
    detectors:
    - env
    - eks
    - ec2
    - system
    override: true
    timeout: 10s
  transform/add_sourcetype:
    log_statements:
    - context: log
      statements:
      - set(resource.attributes["com.splunk.sourcetype"], Concat(["kube:object:", attributes["k8s.resource.name"]], ""))
receivers:
  k8sobjects:
    auth_type: serviceAccount
    objects:
    - group: events.k8s.io
      mode: watch
      name: events
service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    logs/objects:
      exporters:
      - splunk_hec/platform_logs
      processors:
      - memory_limiter
      - batch
      - resourcedetection
      - resource
      - transform/add_sourcetype
      - resource/add_environment
      receivers:
      - k8sobjects

Log output

2023-01-27T10:12:45.167Z        warn    k8sobjectsreceiver@v0.68.0/receiver.go:162      Watch channel closed unexpectedly       {"kind": "receiver", "name": "k8sobjects", "pipeline": "logs", "resource": "events.k8s.io/v1, Resource=events"}

Additional context

No response

stokerjon added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 27, 2023
@github-actions
Contributor

Pinging code owners:

  • receiver/k8sobjects: @dmitryax @harshit-splunk

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme added the priority:p2 (Medium) label and removed the needs triage (New item requiring triage) label on Jan 27, 2023
@nicolastakashi
Contributor

Hey @atoulme and @stokerjon, I faced the same issue, and I'm available to help with the fix.

I'm not sure how to force-reproduce the issue; do you know how we can simulate this behavior?

@atoulme
Contributor

atoulme commented Feb 19, 2023

No idea. Maybe create a mock k8s object api and restart it?
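
One way to simulate the hang-up without a real API server is client-go's fake dynamic client with a prepended watch reactor: calling Stop() on the FakeWatcher closes its result channel, which is what the receiver sees when the server hangs up. This is only a rough sketch under that assumption; the resource name and helper are illustrative, not the receiver's existing test code.

package sketch

import (
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/watch"
    dynamicfake "k8s.io/client-go/dynamic/fake"
    k8stesting "k8s.io/client-go/testing"
)

// newFakeEventsClient returns a fake dynamic client whose Watch calls for
// "events" hand back the returned FakeWatcher. Calling Stop() on that watcher
// closes its result channel, mimicking the API server hanging up. A full test
// would hand out a fresh watcher per Watch call and count the calls to assert
// that the watch gets re-established.
func newFakeEventsClient() (*dynamicfake.FakeDynamicClient, *watch.FakeWatcher) {
    client := dynamicfake.NewSimpleDynamicClient(runtime.NewScheme())
    watcher := watch.NewFake()
    client.PrependWatchReactor("events", k8stesting.DefaultWatchReactor(watcher, nil))
    return client, watcher
}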

@stokerjon
Author

Setting a short timeoutSeconds in ListOptions for the watch should have the same effect.
The hang-up isn't random; it's happening every 30 minutes, so it's probably a default timeout in the API server that hasn't been accounted for.
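
For reference, requesting a short server-side timeout could look like the sketch below: TimeoutSeconds in metav1.ListOptions asks the API server to end the watch request after that many seconds, which reproduces the hang-up in seconds rather than waiting for the observed ~30-minute interval (which is consistent with kube-apiserver's --min-request-timeout defaulting to 1800 seconds for watch requests). The client and gvr arguments here are assumed placeholders, not the receiver's actual code.

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/dynamic"
)

// watchWithShortTimeout asks the API server to end the watch request after
// timeoutSeconds, so the "watch channel closed" path can be exercised quickly
// instead of waiting for the server's default request timeout.
// client and gvr are assumed to be wired up elsewhere.
func watchWithShortTimeout(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, timeoutSeconds int64) (watch.Interface, error) {
    return client.Resource(gvr).Watch(ctx, metav1.ListOptions{
        TimeoutSeconds: &timeoutSeconds,
    })
}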

@nicolastakashi
Contributor

nicolastakashi commented Feb 19, 2023

I managed it!
I'm just adjusting the unit tests before opening a PR; if you want, you can assign this issue to me, @stokerjon.

@dmitryax
Member

Thanks @nicolastakashi
