[inputs.ping] Native method gives higher results comparing to exec #9729

manio · 2021-09-07T07:28:16Z

Native method gives wrong results comparing to exec mode.

Relevant telegraf.conf:

[[inputs.ping]]
  urls = [
    "host1",
    "host2",
    "host3",
    "host4",
    "host5",
  ]

System info:

Telegraf 1.19.3 (git: HEAD a799489)
Debian Linux x86_64

Steps to reproduce:

Collect ping data using preferred native method
Collect ping data using the exec method
Compare the results

Expected behavior:

The ping reply time should be at least comparable

Actual behavior:

Now it look like this

You probably can see on this graph at which time I changed from native to exec :)

Additional info:

No idea why the ping is collecting the data wrongly in native mode - just guessing:

spawning new ping thread overhead?
thread's priority?

The text was updated successfully, but these errors were encountered:

alkuzad · 2021-09-07T12:11:17Z

1ms means you are pinging something really close. Maybe go-ping does have weird behavior in this case. I was testing few sites using this gist, which compiles options used by the input plugin, and it's comparable to what system ping shows

manio · 2021-09-08T07:19:55Z

Yes, I am testing it on LAN host to have as much as reliable testing environment as I can. You can check it on your local gateway and try to compare it.

I built your gist, adjusted the host to mine and I can clearly see that the values are higher and are drifting much more then a regular ping tool. Here are the stats from the ping tool:

285 packets transmitted, 285 received, 0% packet loss, time 286287ms
rtt min/avg/max/mdev = 0.507/0.665/0.980/0.062 ms

while the go-ping average seems to be about 100ms higher (calling it multiple times with sending single request as in your code) and it's not uncommon to have values above 1ms which doesn't happen at all with a ping tool.

I was trying to raise the nice priority of the telegraf tool and all its thread - no change noticed.
I also tested this on the collectd and ping module - results are same as using a ping utility.

alkuzad · 2021-09-08T10:33:36Z

@manio I looked into sources of go-ping and Linux curl, and it seems to be the difference in core mechanism:

Linux's ping uses sockets and SO_TIMESTAMP to get timestamps from Kernel
Go-ping uses time.Now() to save and calculate time differences

It's accurate enough for long periods of time, but it could be not so accurate for small ones. Also, the generation of time for sent packets is not done ideally before a packet is sent. Maybe this introduces further delay.

Hipska · 2022-03-07T10:54:58Z

I just came to the same/similar problem of this issue. On my system I'm having almost 1700 ping configs (mostly with 1 IP in urls) and the native method results in response times of a few hundreds of ms, while the exec method is only a few ms:

2022-03-07T09:55:44Z I! Loaded inputs: ping (1692x)
> ping,host=xxx,url=x.x.x.x average_response_ms=536.686787,maximum_response_ms=720.719722,minimum_response_ms=115.140587,packets_received=5i,packets_transmitted=5i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=226.911318,ttl=61i 1646646949000000000
...

When only running this one ping config:

> ping,host=xxx,url=x.x.x.x average_response_ms=7.628867,maximum_response_ms=8.917635,minimum_response_ms=6.887803,packets_received=5i,packets_transmitted=5i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=0.821779,ttl=61i 1646647152000000000

I was surprised very much when looking at my results and seeing this section in the documentation:

This plugin has two main methods of operation: exec and native. The recommended method is native, which has greater system compatibility and performance. However, for backwards compatibility the exec method is the default.

alkuzad · 2022-03-07T12:33:39Z

@Hipska docs is right, native version is written in Go, which executes faster and does not need to do any subprocesses spawn. It also provides easy interface without having to parse a lot of different versions of ping outputs.

Linux ping relies on Kernel, and gets time values directly from ICMP packets. Go version uses Time measurements, which won't be as accurate but it does not rely on things only available on one system. The only improvement to that would be using specific per-system code to get timestamps of packets.

The only thing I can suggest is to do docs update to show this "problem".

manio · 2022-03-07T12:45:04Z

FWIW after a longer testing period i finally just switched to native (having a "wrong" results) because using a exec methods gives glitches - I mean it probably parsing the results wrong (especially when there was a ping timeout) and I have eg. negative ping values which totally messed up my plots...

Hipska · 2022-03-07T14:27:41Z

@alkuzad Yes I know it means the overhead of launching external commands, but that should not result in less reliable data? In my example setup it did 3 seconds longer to process 1k7 urls, but the data was more correct. So I'm not convinced.

Also, time.Now() should have nanosecond precision, so that also doesn't explain these differences. There must be something else wrong in the library or telegraf's implementation..

alkuzad · 2022-03-07T17:13:02Z

@Hipska time.Now() precision is irrelevant if goroutines are waiting for free resources to run on. 3s longer means that golang just waited more for all of these processes to close but it does not impact measurement as system ping does not use time at all.

My simple pinger gist (from above) gives me very similar results locally for every ping to native binary, but I call 1 adress with 1 packet (default). It's the same result even for master branch. However, you are doing 1700urls and each of it does 5 packets. This is different situation as goroutines will not start/complete at Check with different number of urls each time and compare the stddev between this and non-native. You can use this simplified go program to remove impact of telegraf itself and test it there, it could be quite easy to port the channels/waiting group to do similar thing.

Hipska · 2022-03-08T09:46:47Z

I'm sorry, I'm not completely following. Are you saying my problem is different as the one from @manio and thus should be a separate issue?

On the other hand, I see you say it is a goroutine resources issue. Is there any way to tweak this so telegraf has more resources to handle it?

alkuzad · 2022-03-08T10:09:54Z

@Hipska yes these look like two different core issues. I don't think you can raise goroutines limit as they are set around max cpu cores and this makes sense. I would focus on finding out limits first and experiment a bit to see what seem to be an issue.

mandud · 2024-05-14T14:21:46Z

Hi Everyone,

Sorry bumping this thread, getting same issue when switching from exec into native (I'm using telegraf on windows)
ping result looks good (getting specific decimal ms), but the result always higher than before, this confusing another team that will monitoring the networks

Thanks

Hipska · 2024-05-14T14:29:57Z

To be complete, I solved my issue with adding a jitter to the ping plugins. I have interval at 60s and jitter is 45s, so it will spread all ping requests over 45s instead of trying to launch them all at the start of the minute.

mandud · 2024-05-14T14:40:43Z

Hi @Hipska

So, after you reconfigure both parameter (jitter & interval) the result getting better?
there's no mismatch measurement anymore?

If you don't mind, can share your latest configuration for ping plugin?
I also still looking for best configuration ping monitoring for hundred devices.

Thank You

Regards,
Mandud

manio added the bug unexpected problem or unintended behavior label Sep 7, 2021

telegraf-tiger bot added the area/ping label Sep 7, 2021

SuperQ mentioned this issue Jun 23, 2022

Investigate using SO_TIMESTAMP for packet timing prometheus-community/pro-bing#12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[inputs.ping] Native method gives higher results comparing to exec #9729

[inputs.ping] Native method gives higher results comparing to exec #9729

manio commented Sep 7, 2021

alkuzad commented Sep 7, 2021

manio commented Sep 8, 2021

alkuzad commented Sep 8, 2021 •

edited

Loading

Hipska commented Mar 7, 2022 •

edited

Loading

alkuzad commented Mar 7, 2022

manio commented Mar 7, 2022

Hipska commented Mar 7, 2022

alkuzad commented Mar 7, 2022

Hipska commented Mar 8, 2022

alkuzad commented Mar 8, 2022

mandud commented May 14, 2024 •

edited

Loading

Hipska commented May 14, 2024

mandud commented May 14, 2024

[inputs.ping] Native method gives higher results comparing to exec #9729

[inputs.ping] Native method gives higher results comparing to exec #9729

Comments

manio commented Sep 7, 2021

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

alkuzad commented Sep 7, 2021

manio commented Sep 8, 2021

alkuzad commented Sep 8, 2021 • edited Loading

Hipska commented Mar 7, 2022 • edited Loading

alkuzad commented Mar 7, 2022

manio commented Mar 7, 2022

Hipska commented Mar 7, 2022

alkuzad commented Mar 7, 2022

Hipska commented Mar 8, 2022

alkuzad commented Mar 8, 2022

mandud commented May 14, 2024 • edited Loading

Hipska commented May 14, 2024

mandud commented May 14, 2024

alkuzad commented Sep 8, 2021 •

edited

Loading

Hipska commented Mar 7, 2022 •

edited

Loading

mandud commented May 14, 2024 •

edited

Loading