intel_powerstat: telegraf hangs after some time running on a stressed preempt-rt system #14088

Closed
alysondeives opened this issue Oct 10, 2023 · 2 comments · Fixed by #14363
Labels
bug unexpected problem or unintended behavior

Comments

@alysondeives

### Relevant telegraf.conf

[agent]
  collection_jitter = "0s"
  debug = true
  flush_interval = "10s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "10s"
  logfile = ""
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true

[[outputs.prometheus_client]]
  listen = ":9273"

[[inputs.intel_powerstat]]
  cpu_metrics = [
    "cpu_frequency",
    "cpu_busy_frequency",
    "cpu_temperature",
    "cpu_c0_state_residency",
    "cpu_c1_state_residency",
    "cpu_c6_state_residency",
    "cpu_busy_cycles"
  ]
  package_metrics = [
    "current_power_consumption",
    "current_dram_power_consumption",
    "thermal_design_power",
    "cpu_base_frequency",
    "uncore_frequency"
  ]

[[inputs.linux_cpu]]
  metrics = [
    "cpufreq"
  ]

[[inputs.internal]]
  collect_memstats = true

[[inputs.cpu]]
  collect_cpu_time = false
  percpu = true
  report_active = false
  totalcpu = true

[[outputs.health]]
  service_address = "http://:8080"

### Logs from Telegraf

The Telegraf logs don't show much information (it was running without --debug), but they do show that it stops printing Prometheus write info.

A kernel crash dump taken after telegraf hangs indicates that the goroutines stall while waiting for a kernel response to MSR read requests:


PID: 55638 TASK: ff466bf3a7ffde00 CPU: 0 COMMAND: "telegraf"
#0 [ff85787a26767c70] __schedule at ffffffffa30ae0c6
#1 [ff85787a26767d00] schedule at ffffffffa30ae7f7
#2 [ff85787a26767d18] schedule_timeout at ffffffffa30b15a4
#3 [ff85787a26767d70] wait_for_completion at ffffffffa30afbc4
#4 [ff85787a26767db8] rdmsr_safe_on_cpu at ffffffffa2afbda8
#5 [ff85787a26767e78] msr_read at ffffffffa2640e55
#6 [ff85787a26767ec8] vfs_read at ffffffffa2903208
#7 [ff85787a26767f00] __x64_sys_pread64 at ffffffffa2904ea1
#8 [ff85787a26767f40] do_syscall_64 at ffffffffa30a6b60
#9 [ff85787a26767f50] entry_SYSCALL_64_after_hwframe at ffffffffa3200099
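
For context, frames #5 to #7 of the trace are what a positional read on the MSR device file looks like from the kernel side. The sketch below shows the userspace half under some assumptions: the Linux `msr` driver exposes each CPU's registers at `/dev/cpu/<n>/msr`, the file offset selects the register, and each read returns 8 bytes. `readMSR` and `fakeMSR` are hypothetical names for illustration, not Telegraf's code, and offset `0x10` (the time stamp counter) is just an example register.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// readMSR reads one 64-bit register value at the given offset.
// With an *os.File, ReadAt is issued as a pread system call, which is
// the call that shows up as __x64_sys_pread64 in the trace.
func readMSR(r io.ReaderAt, offset int64) (uint64, error) {
	buf := make([]byte, 8)
	if _, err := r.ReadAt(buf, offset); err != nil {
		return 0, err
	}
	return binary.LittleEndian.Uint64(buf), nil
}

// fakeMSR is a hypothetical in-memory stand-in for the device file, so
// the sketch also runs without root or the msr kernel module.
type fakeMSR []byte

func (f fakeMSR) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(f)) {
		return 0, io.EOF
	}
	n := copy(p, f[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

func main() {
	const tscOffset = 0x10 // example register: IA32 time stamp counter

	var src io.ReaderAt
	f, err := os.Open("/dev/cpu/0/msr")
	if err != nil {
		// No access to the real device: fall back to the fake.
		fake := make(fakeMSR, tscOffset+8)
		binary.LittleEndian.PutUint64(fake[tscOffset:], 12345)
		src = fake
	} else {
		defer f.Close()
		src = f
	}

	v, err := readMSR(src, tscOffset)
	fmt.Println(v, err)
}
```

On the real device this call does not return until the kernel has run the read on the target CPU, which matches the `wait_for_completion` frame above.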


### System info

Telegraf 1.28.1, Debian 11, linux-yocto 5.10 preempt-rt kernel 

### Docker

_No response_

### Steps to reproduce

1. Launch a telegraf pod on a Kubernetes cluster
2. Isolate CPU cores and launch stress pods on them (stress-ng, for instance)
3. Wait a long period of time and notice that the telegraf pod stops collecting metrics
4. Verify that the pod is unresponsive and cannot even be deleted with kubectl


### Expected behavior

Telegraf should collect metrics without interruption.

### Actual behavior

Telegraf stops collecting metrics and the pod becomes unresponsive (kubectl cannot manage it).

### Additional info

Wrapping the MSR read in a goroutine and adding a timeout to it solved the issue.
I will open a pull request with this fix.
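
A minimal sketch of that idea follows. It is not the code from the pull request. `readWithTimeout`, `errReadTimeout`, and `demoTimeout` are hypothetical names, and the `read` callback stands in for whatever function performs the actual MSR read.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errReadTimeout is a hypothetical sentinel error, not taken from Telegraf.
var errReadTimeout = errors.New("msr read timed out")

// demoTimeout is an arbitrary value chosen for this example.
const demoTimeout = 100 * time.Millisecond

type result struct {
	value uint64
	err   error
}

// readWithTimeout runs read in its own goroutine and waits at most
// timeout for it to finish.
func readWithTimeout(read func() (uint64, error), timeout time.Duration) (uint64, error) {
	// Buffered so a late result can still be delivered and the goroutine
	// can exit even after the caller has given up.
	ch := make(chan result, 1)
	go func() {
		v, err := read()
		ch <- result{value: v, err: err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case r := <-ch:
		return r.value, r.err
	case <-timer.C:
		return 0, errReadTimeout
	}
}

func main() {
	fast := func() (uint64, error) { return 42, nil }
	slow := func() (uint64, error) {
		time.Sleep(time.Second) // simulates a read stuck in the kernel
		return 0, nil
	}

	v, err := readWithTimeout(fast, demoTimeout)
	fmt.Println(v, err)

	_, err = readWithTimeout(slow, demoTimeout)
	fmt.Println(err)
}
```

Note that the timeout only unblocks the caller. A read that is stuck in the kernel stays stuck, along with the goroutine and OS thread running it, so this pattern keeps the collection loop alive rather than fixing the underlying wait.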
@alysondeives alysondeives added the bug unexpected problem or unintended behavior label Oct 10, 2023
alysondeives added a commit to alysondeives/telegraf that referenced this issue Oct 10, 2023
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
@powersj
Contributor

powersj commented Oct 12, 2023

Thanks again for another issue + PR!

next steps: waiting on PR updates, finish reviews

@alysondeives
Author

Hi @powersj, I will work on the PR

alysondeives added a commit to alysondeives/telegraf that referenced this issue Nov 10, 2023
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
alysondeives added a commit to alysondeives/telegraf that referenced this issue Nov 22, 2023
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
alysondeives added a commit to alysondeives/telegraf that referenced this issue Nov 24, 2023
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>