Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[inputs.ping] Native method gives higher results comparing to exec #9729

Open
manio opened this issue Sep 7, 2021 · 13 comments
Open

[inputs.ping] Native method gives higher results comparing to exec #9729

manio opened this issue Sep 7, 2021 · 13 comments
Labels
area/ping bug unexpected problem or unintended behavior

Comments

@manio
Copy link
Contributor

manio commented Sep 7, 2021

Native method gives wrong results comparing to exec mode.

Relevant telegraf.conf:

[[inputs.ping]]
  urls = [
    "host1",
    "host2",
    "host3",
    "host4",
    "host5",
  ]

System info:

Telegraf 1.19.3 (git: HEAD a799489)
Debian Linux x86_64

Steps to reproduce:

  1. Collect ping data using preferred native method
  2. Collect ping data using the exec method
  3. Compare the results

Expected behavior:

The ping reply time should be at least comparable

Actual behavior:

Now it look like this
image

You probably can see on this graph at which time I changed from native to exec :)

Additional info:

No idea why the ping is collecting the data wrongly in native mode - just guessing:

  1. spawning new ping thread overhead?
  2. thread's priority?
@manio manio added the bug unexpected problem or unintended behavior label Sep 7, 2021
@alkuzad
Copy link

alkuzad commented Sep 7, 2021

1ms means you are pinging something really close. Maybe go-ping does have weird behavior in this case. I was testing few sites using this gist, which compiles options used by the input plugin, and it's comparable to what system ping shows

@manio
Copy link
Contributor Author

manio commented Sep 8, 2021

Yes, I am testing it on LAN host to have as much as reliable testing environment as I can. You can check it on your local gateway and try to compare it.

I built your gist, adjusted the host to mine and I can clearly see that the values are higher and are drifting much more then a regular ping tool. Here are the stats from the ping tool:

285 packets transmitted, 285 received, 0% packet loss, time 286287ms
rtt min/avg/max/mdev = 0.507/0.665/0.980/0.062 ms

while the go-ping average seems to be about 100ms higher (calling it multiple times with sending single request as in your code) and it's not uncommon to have values above 1ms which doesn't happen at all with a ping tool.

I was trying to raise the nice priority of the telegraf tool and all its thread - no change noticed.
I also tested this on the collectd and ping module - results are same as using a ping utility.

@alkuzad
Copy link

alkuzad commented Sep 8, 2021

@manio I looked into sources of go-ping and Linux curl, and it seems to be the difference in core mechanism:

  • Linux's ping uses sockets and SO_TIMESTAMP to get timestamps from Kernel
  • Go-ping uses time.Now() to save and calculate time differences

It's accurate enough for long periods of time, but it could be not so accurate for small ones. Also, the generation of time for sent packets is not done ideally before a packet is sent. Maybe this introduces further delay.

@Hipska
Copy link
Contributor

Hipska commented Mar 7, 2022

I just came to the same/similar problem of this issue. On my system I'm having almost 1700 ping configs (mostly with 1 IP in urls) and the native method results in response times of a few hundreds of ms, while the exec method is only a few ms:

2022-03-07T09:55:44Z I! Loaded inputs: ping (1692x)
> ping,host=xxx,url=x.x.x.x average_response_ms=536.686787,maximum_response_ms=720.719722,minimum_response_ms=115.140587,packets_received=5i,packets_transmitted=5i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=226.911318,ttl=61i 1646646949000000000
...

image

When only running this one ping config:

> ping,host=xxx,url=x.x.x.x average_response_ms=7.628867,maximum_response_ms=8.917635,minimum_response_ms=6.887803,packets_received=5i,packets_transmitted=5i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=0.821779,ttl=61i 1646647152000000000

I was surprised very much when looking at my results and seeing this section in the documentation:

This plugin has two main methods of operation: exec and native. The recommended method is native, which has greater system compatibility and performance. However, for backwards compatibility the exec method is the default.

@alkuzad
Copy link

alkuzad commented Mar 7, 2022

@Hipska docs is right, native version is written in Go, which executes faster and does not need to do any subprocesses spawn. It also provides easy interface without having to parse a lot of different versions of ping outputs.

Linux ping relies on Kernel, and gets time values directly from ICMP packets. Go version uses Time measurements, which won't be as accurate but it does not rely on things only available on one system. The only improvement to that would be using specific per-system code to get timestamps of packets.

The only thing I can suggest is to do docs update to show this "problem".

@manio
Copy link
Contributor Author

manio commented Mar 7, 2022

FWIW after a longer testing period i finally just switched to native (having a "wrong" results) because using a exec methods gives glitches - I mean it probably parsing the results wrong (especially when there was a ping timeout) and I have eg. negative ping values which totally messed up my plots...

@Hipska
Copy link
Contributor

Hipska commented Mar 7, 2022

@alkuzad Yes I know it means the overhead of launching external commands, but that should not result in less reliable data? In my example setup it did 3 seconds longer to process 1k7 urls, but the data was more correct. So I'm not convinced.

Also, time.Now() should have nanosecond precision, so that also doesn't explain these differences. There must be something else wrong in the library or telegraf's implementation..

@alkuzad
Copy link

alkuzad commented Mar 7, 2022

@Hipska time.Now() precision is irrelevant if goroutines are waiting for free resources to run on. 3s longer means that golang just waited more for all of these processes to close but it does not impact measurement as system ping does not use time at all.

My simple pinger gist (from above) gives me very similar results locally for every ping to native binary, but I call 1 adress with 1 packet (default). It's the same result even for master branch. However, you are doing 1700urls and each of it does 5 packets. This is different situation as goroutines will not start/complete at Check with different number of urls each time and compare the stddev between this and non-native. You can use this simplified go program to remove impact of telegraf itself and test it there, it could be quite easy to port the channels/waiting group to do similar thing.

@Hipska
Copy link
Contributor

Hipska commented Mar 8, 2022

I'm sorry, I'm not completely following. Are you saying my problem is different as the one from @manio and thus should be a separate issue?

On the other hand, I see you say it is a goroutine resources issue. Is there any way to tweak this so telegraf has more resources to handle it?

@alkuzad
Copy link

alkuzad commented Mar 8, 2022

@Hipska yes these look like two different core issues. I don't think you can raise goroutines limit as they are set around max cpu cores and this makes sense. I would focus on finding out limits first and experiment a bit to see what seem to be an issue.

@mandud
Copy link

mandud commented May 14, 2024

Hi Everyone,

Sorry bumping this thread, getting same issue when switching from exec into native (I'm using telegraf on windows)
ping result looks good (getting specific decimal ms), but the result always higher than before, this confusing another team that will monitoring the networks

image

Thanks

@Hipska
Copy link
Contributor

Hipska commented May 14, 2024

To be complete, I solved my issue with adding a jitter to the ping plugins. I have interval at 60s and jitter is 45s, so it will spread all ping requests over 45s instead of trying to launch them all at the start of the minute.

@mandud
Copy link

mandud commented May 14, 2024

Hi @Hipska

So, after you reconfigure both parameter (jitter & interval) the result getting better?
there's no mismatch measurement anymore?

If you don't mind, can share your latest configuration for ping plugin?
I also still looking for best configuration ping monitoring for hundred devices.

Thank You

Regards,
Mandud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ping bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants