Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect time slip and adjust uptime in the system input #7018

Closed
Gui13 opened this issue Feb 12, 2020 · 17 comments · Fixed by #7641
Closed

Detect time slip and adjust uptime in the system input #7018

Gui13 opened this issue Feb 12, 2020 · 17 comments · Fixed by #7641
Labels
bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes
Milestone

Comments

@Gui13
Copy link

Gui13 commented Feb 12, 2020

Relevant telegraf.conf:

# Read metrics about system load & uptime
[[inputs.system]]
  ## Uncomment to remove deprecated metrics.
  # fielddrop = ["uptime_format"]

System info:

Raspberry Pi 3A+ without RTC

Steps to reproduce:

  1. Use telegraf and collect metrics on an RPI3 that has no RTC
  2. Unplug the rpi3 and let it sit for a day
  3. Plug back the RPI3, clock time is restored from last fake_hwclock so it thinks it is yesterday
  4. Telgraf starts and cannot send system data yet because no connectivity. It correctly sees uptime as a few seconds
  5. Wifi connection is established and ntp corrects the clock to today
  6. Telegraf thinks that the uptime went from 30s to 1 day and 30s
  7. Telegraf reports 1day 30s uptime

Expected behavior:

Telegraf uses the correct uptime and reports the actual number of seconds since boot, independently from the time drift/adjustments (maybe use /proc/uptime?)

Actual behavior:

See Steps

Additional info:

Restarting telegraf correctly picks up the uptime.

@Gui13
Copy link
Author

Gui13 commented Feb 12, 2020

image

Here is the graf I have from a Grafana dashboard

@danielnelson
Copy link
Contributor

This is interesting, we do use /proc/uptime on Linux for this information so I'm not sure why restarting would trigger the change.

@danielnelson
Copy link
Contributor

I haven't dug in too deep, but it looks like /proc/uptime uses ktime_get_bootime which returns a monotonic clock.

/**
 * ktime_get_boottime - Returns monotonic time since boot in ktime_t format
 *
 * This is similar to CLOCK_MONTONIC/ktime_get, but also includes the
 * time spent in suspend.
 */

@Gui13
Copy link
Author

Gui13 commented Feb 12, 2020

This is a bit strange because the /proc/uptime of the node still shows the correct number of seconds:
Here you can see the reported telegraf being 1.022 days whereas reading /proc/uptime yields 5650s , which is a bit above 1.5 hours
image

@Gui13
Copy link
Author

Gui13 commented Feb 12, 2020

I think your gopsutil has a problem: it computes the uptime with a diff from boot time to now():

https://github.com/shirou/gopsutil/blob/a9d510e7e50a22d5da1075a3d26e718e61d56f27/host/host_linux.go#L122

So yeah, it can't work if there's a big time slippage in between. The boot time stays at what it was when the go process was started.

This explains the problem and why it is fixed by rebooting the process.

@Gui13
Copy link
Author

Gui13 commented Feb 12, 2020

Left a bug over there. I'll compensate by adding a manual telegraf exec metric with an actual read of /proc/uptime...

@danielnelson
Copy link
Contributor

I see, /proc/uptime is only used on certain virtualizations otherwise it uses an absolute btime from /proc/stat.

@danielnelson danielnelson added bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes labels Feb 12, 2020
@Gui13
Copy link
Author

Gui13 commented Feb 12, 2020

Would it make sense to add an option in the system plugin to get uptime through /proc/uptime?

I don't know the policy regarding non-cross-platform options, but this could be trivial to implement.

@danielnelson
Copy link
Contributor

Let's try to handle it in gopsutil first, they are very responsive and I'm sure we will hear back soon.

@danielnelson danielnelson changed the title Detect time slip and adjust uptime Detect time slip and adjust uptime in the system input Feb 13, 2020
@Gui13
Copy link
Author

Gui13 commented Apr 15, 2020

I didn't hear from them, and looking at the code I don't see any easy solution. I'll propose a PR which removes caching the value of boot_time but I suspect this could be rather expensive.

In the meantime I have worked around the problem with this:

# Read uptime from /proc/uptime instead of bogus gopsutil (see https://github.com/shirou/gopsutil/issues/837)
[[inputs.file]]
  files = ['/proc/uptime']
  data_format = "grok"

  grok_patterns = ["%{BASE10NUM:uptime_sec:float} %{BASE10NUM:idletime_sec:float}"] 

@Gui13
Copy link
Author

Gui13 commented Apr 15, 2020

PR is here: https://github.com/shirou/gopsutil/pull/857/files

You'll have to see if they merge it and have a new version to update.

@Gui13
Copy link
Author

Gui13 commented Apr 15, 2020

In the event of gopsutil not acting on this, would you be open to a PR which adds the option (or switches) to get uptime from /proc/uptime on the linux version?

Right now your system doesn't return the actual uptime. I suspect that this is way more prevalent than you think, because time adjustments from NTP are many and recurrent. On a host with a large uptime, I suspect this could amount to minutes (although I also suspect that in the scheme of things, minutes discrepancies across 2 years of uptime is not too bad for a monitoring system)

@philjohn
Copy link

Also seeing the same issue on an embedded device (Netgear ORBI) which lacks a RTC, as here a restart of the telegraf collector fixes the issue

@Snorzze
Copy link

Snorzze commented May 28, 2020

Same here for 4 different Raspberry Pi which were unplugged for a few days. Restarting the telegraf service helped.

@Gui13
Copy link
Author

Gui13 commented May 28, 2020

FYI my PR was merged in gopsutil, maybe someone more competent than me can update the telegraf go deps to reach the right commit in gopsutils to fix this issue here?

@danielnelson
Copy link
Contributor

If you can remind me when gopsutil v2.20.5 is released, I can update Telegraf to use it. I expect it will be in the next few days.

@Gui13
Copy link
Author

Gui13 commented Jun 4, 2020

Looks like you were right on time @danielnelson : https://github.com/shirou/gopsutil/releases/tag/v2.20.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants