'Process' check is not reliable in 0.2.6+ as it was in 0.2.5 #267
The changes between 0.2.5 and 0.2.6 can be seen via the v0.2.5...v0.2.6 comparison. I don't see anything obvious, and the changes within github.com/mitchellh/go-ps relate only to darwin. @taurus-forever, you're using the official release binaries here, right, and always on the same runtime (kernel/OS)? Looking into the release binaries, 0.2.5 was built using go1.7rc3 and 0.2.6 using go1.8, both on Travis. So we could try building it against different Go versions.
True, all from https://github.com/aelsabbahy/goss/releases
All the tests were done on the same machine, within 1 hour of tracing the issue.
Yes, it is worth a try.
goss was reverted to 0.3.1 in a previous commit, but it looks like every goss version since 0.2.6 is affected: goss-org/goss#267. So, reverting that revert here and updating to the latest version, 0.3.3 (to test it and report upstream against the latest goss version). Change-Id: Ife6aec7883fd6d83e3645f728805feaab12c4a06
A few things:
It would be interesting to add a command test like this and see if it has the same failures.
Nope, goss restarts nothing. Monitoring must not touch services in our 'design' ;-)
I tried, and then it works well, so the problem is visible only with a certain number of tests; in our case 140+ tests with an execution time of ~0.2s. This is why I suspect some internal queue overload/cleanup.
It would be very useful if you could provide 0.2.6 (or 0.3.3) built using Go 1.7.
How do I do this?
I've attached a copy of goss 0.3.3 that has debugging information. Can you run this version and send me a dump of the output of a failed run? If you would feel better compiling it yourself, the code for this binary can be found on the debug branch of this repo.
Here you can find the trace: https://gist.github.com/taurus-forever/c09730b1d39ed73e4426a8c6acea743f The trace contains the last 3 successful checks and 1 failed one.
I guess the next steps are to create a binary with:
And see if either of those resolves the issue. I wish I could reproduce it locally; I would iterate on it faster.
I would appreciate it if you could create such test binaries for me. Thanks!
Can you please try the following on your side (goss-0.3.3-linux-amd64 is the default from the download page):
with the /tmp/test.yaml file: https://gist.github.com/taurus-forever/b7845d37c02434733bc2b5b7ef537017 For me it is reproducible in ~10 seconds on Debian Jessie. NOTE: I also noticed that the name of the "process" is important here, so add as many as you can test locally (I am not sure which processes you have locally). Have fun!
I'll set up a Docker image and test with it. I have a theory, but I'm not 100% sure why it ever worked if my theory is correct.
I was able to reproduce this locally. I'll debug and fix it within the week.
Try this one and let me know if it works for you. I usually see a failure in 5 minutes or so, but this one ran for an hour without issue. The code changes are on the debug branch; the actual change was to the go-ps library. If this fixes it, I'll submit a PR to that project. Thanks for your patience on this, it was a tricky one.
The fix has been confirmed. Nice catch: goss-org/go-ps@4433868 P.S. I wasn't aware that go-ps is from Hashimoto. I use Vagrant a lot; the world is small.
Released in And yeah, Hashimoto is awesome!.. he has tons of great work out there. |
Hi,
We are using goss 0.3.2 in production (it was the latest at the time of the investigation; 0.3.3 has the same issue as far as I can see).
Issue: false alarms in our internal monitoring system caused by the 'Process' check: a process that is expected to be running is reported by goss as not running.
Environment: a normal production server with 242 checks in total and an average check execution time of ~0.2s
Error:
Steps to reproduce:
With 0.2.5, the test can run for 10+ minutes without any issues.
With 0.2.6, the error above appears within anywhere from a second to a couple of minutes.
I have created a dedicated test for 'Process' only, which runs fast (0.008s), and the issue is not reproducible there. So it is probably related to some internal buffers, garbage collection and/or cleanup order somewhere.
So I suspect the "Upgrade to go 1.8" entry in the 0.2.6 release notes.
The issue is reproducible on: 0.2.6, 0.3.1, 0.3.3
The issue is NOT reproducible on: 0.2.4, 0.2.5
Looking for ideas on how to debug this further and track it down. Thank you in advance!