[inputs.ping] Native method gives higher results comparing to exec #9729
Comments
1ms means you are pinging something really close. Maybe go-ping does have weird behavior in this case. I was testing a few sites using this gist, which compiles the options used by the input plugin, and the results are comparable to what the system ping shows.
Yes, I am testing it against a LAN host to have as reliable a testing environment as I can. You can check it against your local gateway and compare. I built your gist, changed the host to mine, and I can clearly see that the values are higher and drift much more than with the regular ping tool. Here are the stats from the ping tool:
while the go-ping average seems to be about 100ms higher (calling it multiple times, sending a single request each time as in your code), and it's not uncommon to see values above 1ms, which doesn't happen at all with the ping tool. I tried raising the nice priority of telegraf and all its threads; no change noticed.
@manio I looked into the sources of go-ping and Linux curl, and the difference seems to be in the core mechanism:
It's accurate enough for long durations, but may be less accurate for short ones. Also, the timestamp for a sent packet is not taken immediately before the packet is sent, which may introduce further delay.
I just ran into the same (or a similar) problem. On my system I have almost 1700 ping configs (mostly with 1 IP in
When only running this one ping config:
I was very surprised when I looked at my results and then saw this section in the documentation:
@Hipska the docs are right: the native version is written in Go, which executes faster and does not need to spawn any subprocesses. It also provides a simple interface without having to parse the output of many different versions of ping. Linux ping relies on the kernel and gets time values directly from ICMP packets. The Go version uses time measurements, which won't be as accurate but do not rely on things only available on one system. The only improvement would be to use system-specific code to get packet timestamps. The only thing I can suggest is a docs update to mention this "problem".
FWIW, after a longer testing period I finally just switched to native (accepting the "wrong" results), because the exec method produces glitches. It probably parses the results incorrectly (especially when there was a ping timeout), and I get e.g. negative ping values, which totally messed up my plots...
@alkuzad Yes, I know it means the overhead of launching external commands, but that should not result in less reliable data, should it? In my example setup it took 3 seconds longer to process 1k7 urls, but the data was more correct. So I'm not convinced. Also,
@Hipska time.Now() precision is irrelevant if goroutines are waiting for free resources to run on. 3s longer means that Go just waited longer for all of these processes to close, but that does not affect the measurement, as system ping does not use time at all. My simple pinger gist (from above) gives me results very similar to the native binary for every ping locally, but I call 1 address with 1 packet (the default). The result is the same even on the master branch. However, you are pinging 1700 urls with 5 packets each. This is a different situation, as goroutines will not start/complete at Check with a different number of urls each time and compare the stddev between this and non-native. You can use this simplified Go program to remove the impact of telegraf itself and test it there; it should be quite easy to port the channels/wait group to do a similar thing.
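For the suggested stddev comparison, a small helper along these lines could be used. It is only a sketch: `meanStddev` is a hypothetical name, it computes the population standard deviation, and the sample values in `main` are made-up placeholders rather than measurements from this issue.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the arithmetic mean and the population standard
// deviation of the given samples (for example RTTs in milliseconds).
// It returns 0, 0 for an empty slice.
func meanStddev(samples []float64) (mean, stddev float64) {
	if len(samples) == 0 {
		return 0, 0
	}
	for _, s := range samples {
		mean += s
	}
	mean /= float64(len(samples))

	var sumSq float64
	for _, s := range samples {
		d := s - mean
		sumSq += d * d
	}
	stddev = math.Sqrt(sumSq / float64(len(samples)))
	return mean, stddev
}

func main() {
	// Placeholder values only; replace with RTTs collected per method.
	native := []float64{0.61, 0.95, 0.58, 1.20, 0.70}
	exec := []float64{0.42, 0.45, 0.41, 0.44, 0.43}

	m, sd := meanStddev(native)
	fmt.Printf("native: mean=%.3f stddev=%.3f\n", m, sd)
	m, sd = meanStddev(exec)
	fmt.Printf("exec:   mean=%.3f stddev=%.3f\n", m, sd)
}
```

Running the same comparison for several different numbers of urls would show whether the spread grows with the number of concurrent pings.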
I'm sorry, I'm not completely following. Are you saying my problem is different from the one @manio has and thus should be a separate issue? On the other hand, I see you say it is a goroutine resource issue. Is there any way to tweak this so telegraf has more resources to handle it?
@Hipska yes, these look like two different core issues. I don't think you can raise the goroutine limit, as it is set around the max number of CPU cores, and this makes sense. I would focus on finding the limits first and experiment a bit to see what seems to be the issue.
To be complete: I solved my issue by adding a jitter to the ping plugins. I have the interval at 60s and the jitter at 45s, so all ping requests are spread over 45s instead of all being launched at the start of the minute.
Hi @Hipska, so after you reconfigured both parameters (jitter & interval), did the results get better? If you don't mind, could you share your latest configuration for the ping plugin? Thank you. Regards,
Native method gives wrong results compared to exec mode.
Relevant telegraf.conf:
System info:
Telegraf 1.19.3 (git: HEAD a799489)
Debian Linux x86_64
Steps to reproduce:
Expected behavior:
The ping reply time should be at least comparable
Actual behavior:
Now it looks like this:
![image](https://user-images.githubusercontent.com/583157/132302253-d98fa765-8112-4241-8035-1454b116b229.png)
You probably can see on this graph at which time I changed from native to exec :)
Additional info:
No idea why the ping is collecting the data wrongly in native mode - just guessing: