
'Process' check is not reliable in 0.2.6+ as it was in 0.2.5 #267

Closed
taurus-forever opened this issue Jul 14, 2017 · 14 comments

Comments

@taurus-forever

Hi,

We are using the latest goss, 0.3.2, in production (it was the latest at the time of investigation; 0.3.3 shows the same issue as far as I can see).

Issue: false alarms in our internal monitoring system, caused by a 'Process' check reporting that a process expected to be running is not.
Environment: a normal production server with 242 checks in total and an average check execution time of ~0.2s.

Error:

Failures/Skipped:

Process: <random service>: running:
Expected
    <bool>: false
to equal
    <bool>: true

Steps to reproduce:

state=true ; while ${state} ; do if ! /tmp/goss-0.2.6-linux-amd64 -g /tmp/goss.yaml v ; then state=false ; fi ; done

With 0.2.5, the test can run for 10+ minutes without any issues.
With 0.2.6, the error above appears anywhere from a second to a couple of minutes in.

I have created a dedicated test for 'Process' only, which runs fast (0.008s), and the issue is not reproducible there. So it is probably related to some internal buffers, garbage collection, and/or cleanup ordering somewhere.

So I suspect the "Upgrade to go 1.8" item from the 0.2.6 release notes.

The issue is reproducible on: 0.2.6, 0.3.1, 0.3.3
The issue is NOT reproducible on: 0.2.4, 0.2.5

Looking for ideas on how to debug this further and track it down. Thank you in advance!

@mika (Contributor) commented Jul 14, 2017

The changes between 0.2.5 and 0.2.6 can be seen e.g. via v0.2.5...v0.2.6. I don't see anything obvious; the changes within github.com/mitchellh/go-ps are only related to darwin.

@taurus-forever - you're using the official release binaries here, right, and always on the same runtime (kernel/OS)? Looking at the release binaries, 0.2.5 was built using go1.7rc3 and 0.2.6 with go1.8, both on Travis. So we could try building it against different Go versions.

@taurus-forever (Author)

> you're using the official release binaries here

True, all from https://github.com/aelsabbahy/goss/releases

> always on the same runtime (kernel/OS)

All the tests were done on the same machine, within 1 hour of tracing the issue.

> So we could try building it against different Go versions.

Yes, it is worth a try.

sipwise-jenkins pushed a commit to sipwise/system-tests that referenced this issue Jul 14, 2017
goss was reverted in a previous commit to goss 0.3.1 while
it looks like every goss version since 0.2.6 is affected:
goss-org/goss#267

So, reverting the revert here and updating to the latest 0.3.3 version
(to test and report upstream the latest goss version).

Change-Id: Ife6aec7883fd6d83e3645f728805feaab12c4a06
@aelsabbahy (Member) commented Jul 14, 2017

A few things:

  1. I assume none of the other tests would be terminating or starting the process?
     • Maybe try running 1/2 the tests and seeing if you can reproduce it or isolate it down to the conflicting tests.
  2. I would be curious about the 1.7 vs 1.8 compilation and seeing if there's a difference there.

  3. We can take it one level further and do a debug dump of the process tree map as goss sees it and try to determine if something is wonky there; a standalone sketch of such a dump follows below.
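
For illustration only, and assuming the github.com/mitchellh/go-ps library that goss uses for this check, a minimal standalone dump of the process table as that library sees it might look like the sketch below (this is not the debug binary attached later in this thread):

// procdump.go: a minimal sketch that prints the process table as the
// github.com/mitchellh/go-ps library sees it. Illustration only; not the
// debug build attached later in this thread.
package main

import (
	"fmt"
	"log"

	ps "github.com/mitchellh/go-ps"
)

func main() {
	procs, err := ps.Processes()
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range procs {
		// PID, parent PID and executable name, one process per line.
		fmt.Printf("%d\t%d\t%s\n", p.Pid(), p.PPid(), p.Executable())
	}
}

Running this in the same loop as the goss reproduction and diffing its output against ps -eo comm during a failure would show whether the library or goss itself is losing the process.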

It would be interesting to add a command test like this and see if it has the same failures:

command:
  ps -eo comm | tee /tmp/what_do_i_see:
    exit-status: 0
    stdout:
      - PROCESS_NAME
    timeout: 10000

@taurus-forever (Author)

> I assume none of the other tests would be terminating or starting the process?

Nope, goss restarts nothing. Monitoring must not touch services in our 'design' ;-)
Also, the same tests on goss 0.2.5 work without any issues for 10+ minutes
(running roughly 5 times per second).

> Maybe try running 1/2 the tests and seeing if you can reproduce it or isolate it down to the conflicting tests.

I tried, and then it works well, so it only shows up with a certain number of tests, in our case 140+ with an execution time of ~0.2s. This is why I suspect some internal queue overload/cleanup.

> I would be curious about the 1.7 vs 1.8 compilation and seeing if there's a difference there.

It would be very useful if you could provide 0.2.6 (or 0.3.3) built using Go 1.7,
so I can test it locally. Is that possible?

> We can take it one level further and do a debug dump of the process tree map as goss sees it and try to determine if something is wonky there.

How do I do this?

> It would be interesting to add a command test like this and see if it has the same failures...

  1. Done. No luck: https://gist.github.com/taurus-forever/f7f09c0e12412d029a718d5cd11509bf
  2. I believe that is expected given Go's concurrent nature, as every check runs independently, so the test above guarantees us nothing.

@aelsabbahy (Member)

I've attached a copy of goss 0.3.3 that has debugging information; can you run this version and send me a dump of the output of a failed run?

If you'd feel better compiling it yourself, the code for this binary can be found on the debug branch of this repo.

goss.zip

@taurus-forever (Author) commented Jul 17, 2017

Here you can find the trace: https://gist.github.com/taurus-forever/c09730b1d39ed73e4426a8c6acea743f
As suspected, goss doesn't see the processes.

The trace contains the last 3 successful checks and 1 failed one.

@aelsabbahy (Member)

Guess the next steps are to create a binary with:

  1. Older go-ps
  2. Go 1.7

And see if either of those resolves the issue. I wish I could reproduce this locally; I would iterate on it faster.

@taurus-forever (Author)

I would appreciate it if you could create such test binaries for me. Thanks!

@taurus-forever (Author) commented Jul 19, 2017

> I wish I could reproduce this locally; I would iterate on it faster.

Can you please try the following on your side (goss-0.3.3-linux-amd64 is the default from the download page):

state=true ; while ${state} ; do if ! /tmp/goss-0.3.3-linux-amd64 -g /tmp/test.yaml v ; then state=false ; fi ; done

with the /tmp/test.yaml file: https://gist.github.com/taurus-forever/b7845d37c02434733bc2b5b7ef537017

For me it is reproducible in ~10 seconds on Debian Jessie.

NOTE: I also noticed that the name of the "process" is important here, so add as many processes as you can test locally (I am not sure which processes you have on your machine).

Have fun!

@aelsabbahy (Member)

I'll set up a Docker image and test with it. I have a theory, but I'm not 100% sure why it ever worked if my theory is correct.

@aelsabbahy (Member)

I was able to reproduce this locally. I'll debug and fix it within the week.

@aelsabbahy (Member) commented Jul 21, 2017

Try this one and let me know if it works for you. I usually see a failure within 5 minutes or so, but this one ran for an hour without issue.

The code changes are on the debug branch; the actual change was to the go-ps library. If this fixes it, I'll submit a PR to that project.

goss.zip

Thanks for your patience on this, was a tricky one.
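
For context: on Linux, go-ps builds its process list by scanning /proc. The sketch below is only an assumption about the class of problem being described here (entries skipped or errors raised when processes appear and disappear while the directory is being read); it is not the actual patch, which is in the go-ps commit linked in the next comment, and listPids is a hypothetical helper name.

// Sketch under assumptions, not the actual go-ps patch: enumerate PIDs from
// /proc in a way that tolerates processes exiting mid-scan.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// listPids returns the numeric entries of /proc that still exist when checked.
func listPids() ([]int, error) {
	d, err := os.Open("/proc")
	if err != nil {
		return nil, err
	}
	defer d.Close()

	// Read the whole directory in one call; reading it in small batches while
	// processes start and stop is one way entries can get lost.
	names, err := d.Readdirnames(-1)
	if err != nil {
		return nil, err
	}

	pids := make([]int, 0, len(names))
	for _, name := range names {
		pid, err := strconv.Atoi(name)
		if err != nil {
			continue // not a PID directory (e.g. "self", "meminfo")
		}
		// The process may already be gone; treat that as "skip this entry",
		// never as "the process being checked is not running".
		if _, err := os.Stat("/proc/" + name + "/stat"); err != nil {
			continue
		}
		pids = append(pids, pid)
	}
	return pids, nil
}

func main() {
	pids, err := listPids()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("processes visible:", len(pids))
}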

@taurus-forever (Author)

The fix has been confirmed. Nice catch: goss-org/go-ps@4433868
Feel free to send it upstream. Thank you for the fast debug and fix here!

P.S. I wasn't aware that go-ps is from Hashimoto. I use Vagrant a lot; the world is small.

@aelsabbahy (Member)

Released in v0.3.4

And yeah, Hashimoto is awesome! He has tons of great work out there.
