
'Process' check is not reliable in 0.2.6+ as it was in 0.2.5 #267

Closed
taurus-forever opened this issue Jul 14, 2017 · 14 comments

Comments

@taurus-forever

Hi,

We are using the latest goss, 0.3.2, in production (it was the latest at the time of investigation; 0.3.3 shows the same issue as far as I can see).

Issue: false alarms in our internal monitoring system, caused by a 'Process' check reporting that a process expected to be running is not.
Environment: a normal production server with 242 checks in total and an average check execution time of ~0.2s.

Error:

Failures/Skipped:

Process: <random service>: running:
Expected
    <bool>: false
to equal
    <bool>: true

Steps to reproduce:

state=true ; while ${state} ; do if ! /tmp/goss-0.2.6-linux-amd64 -g /tmp/goss.yaml v ; then state=false ; fi ; done

With 0.2.5, the test can run for 10+ minutes without any issues.
With 0.2.6, the error above appears anywhere from a second to a couple of minutes in.

I have created a dedicated test for 'Process' only, which runs fast (0.008s), and the issue is not reproducible there. So it is probably related to some internal buffers, garbage collection, and/or cleanup ordering somewhere.

So I suspect the "Upgrade to go 1.8" item from the 0.2.6 release notes.

The issue is reproducible on: 0.2.6, 0.3.1, 0.3.3
The issue is NOT reproducible on: 0.2.4, 0.2.5

Looking for ideas on how to debug this further and track it down. Thank you in advance!

@mika (Contributor) commented Jul 14, 2017

The changes between 0.2.5 and 0.2.6 can be seen e.g. via v0.2.5...v0.2.6. I don't see anything obvious; the changes within github.com/mitchellh/go-ps are only related to darwin.

@taurus-forever - you're using the official release binaries here, right, and always on the same runtime (kernel/OS)? Looking at the release binaries, 0.2.5 was built using go1.7rc3 and 0.2.6 with go1.8, both on Travis. So we could try building it against different Go versions.

@taurus-forever (Author)

> you're using the official release binaries here

True, all from https://github.com/aelsabbahy/goss/releases

> always on the same runtime (kernel/OS)

All the tests were done on the same machine, within 1 hour of tracing the issue.

> So we could try building it against different Go versions.

Yes, it is worth a try.

sipwise-jenkins pushed a commit to sipwise/system-tests that referenced this issue Jul 14, 2017
goss was reverted in a previous commit to goss 0.3.1 while
it looks like every goss version since 0.2.6 is affected:
goss-org/goss#267

So, reverting the revert here and updating to the latest 0.3.3 version
(to test and report upstream the latest goss version).

Change-Id: Ife6aec7883fd6d83e3645f728805feaab12c4a06
@aelsabbahy (Member) commented Jul 14, 2017

A few things:

  1. I assume none of the other tests would be terminating or starting the process?
     • Maybe try running 1/2 the tests and seeing if you can reproduce it or isolate it down to the conflicting tests.
  2. I would be curious about the 1.7 vs 1.8 compilation and seeing if there's a difference there.

  3. We can take it one level further and do a debug dump of the process tree map as goss sees it and try to determine if something is wonky there; a standalone sketch of such a dump follows below.
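
For illustration only, and assuming the github.com/mitchellh/go-ps library that goss uses for this check, a minimal standalone dump of the process table as that library sees it might look like the sketch below (this is not the debug binary attached later in this thread):

// procdump.go: a minimal sketch that prints the process table as the
// github.com/mitchellh/go-ps library sees it. Illustration only; not the
// debug build attached later in this thread.
package main

import (
	"fmt"
	"log"

	ps "github.com/mitchellh/go-ps"
)

func main() {
	procs, err := ps.Processes()
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range procs {
		// PID, parent PID and executable name, one process per line.
		fmt.Printf("%d\t%d\t%s\n", p.Pid(), p.PPid(), p.Executable())
	}
}

Running this in the same loop as the goss reproduction and diffing its output against ps -eo comm during a failure would show whether the library or goss itself is losing the process.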

It would be interesting to add a command test like this and see if it has the same failures:

command:
  ps -eo comm | tee /tmp/what_do_i_see:
    exit-status: 0
    stdout:
      - PROCESS_NAME
    timeout: 10000

@taurus-forever (Author)

> I assume none of the other tests would be terminating or starting the process?

Nope, goss restarts nothing. Monitoring must not touch services in our 'design' ;-)
Also, the same tests on goss 0.2.5 work without any issues for 10+ minutes
(running roughly 5 times per second).

> Maybe try running 1/2 the tests and seeing if you can reproduce it or isolate it down to the conflicting tests.

I tried, and then it works well, so it only shows up with a certain number of tests, in our case 140+ with an execution time of ~0.2s. This is why I suspect some internal queue overload/cleanup.

> I would be curious about the 1.7 vs 1.8 compilation and seeing if there's a difference there.

It would be very useful if you could provide 0.2.6 (or 0.3.3) built using Go 1.7,
so I can test it locally. Is that possible?

> We can take it one level further and do a debug dump of the process tree map as goss sees it and try to determine if something is wonky there.

How do I do this?

> It would be interesting to add a command test like this and see if it has the same failures...

  1. Done. No luck: https://gist.github.com/taurus-forever/f7f09c0e12412d029a718d5cd11509bf
  2. I believe that is expected given Go's concurrent nature, as every check runs independently, so the test above guarantees us nothing.

@aelsabbahy (Member)

I've attached a copy of goss 0.3.3 that has debugging information; can you run this version and send me a dump of the output of a failed run?

If you'd feel better compiling it yourself, the code for this binary can be found on the debug branch of this repo.

goss.zip

@taurus-forever (Author) commented Jul 17, 2017

Here you can find the trace: https://gist.github.com/taurus-forever/c09730b1d39ed73e4426a8c6acea743f
As suspected, goss doesn't see the processes.

The trace contains the last 3 successful checks and 1 failed one.

@aelsabbahy (Member)

Guess the next steps are to create a binary with:

  1. Older go-ps
  2. Go 1.7

And see if either of those resolves the issue. I wish I could reproduce this locally; I would iterate on it faster.

@taurus-forever (Author)

I would appreciate it if you could create such test binaries for me. Thanks!

@taurus-forever (Author) commented Jul 19, 2017

> I wish I could reproduce this locally; I would iterate on it faster.

Can you please try the following on your side (goss-0.3.3-linux-amd64 is the default from the download page):

state=true ; while ${state} ; do if ! /tmp/goss-0.3.3-linux-amd64 -g /tmp/test.yaml v ; then state=false ; fi ; done

with the /tmp/test.yaml file: https://gist.github.com/taurus-forever/b7845d37c02434733bc2b5b7ef537017

For me it is reproducible in ~10 seconds on Debian Jessie.

NOTE: I also noticed that the name of the "process" is important here, so add as many processes as you can test locally (I am not sure which processes you have on your machine).

Have fun!

@aelsabbahy (Member)

I'll set up a Docker image and test with it. I have a theory, but I'm not 100% sure why it ever worked if my theory is correct.

@aelsabbahy (Member)

I was able to reproduce this locally. I'll debug and fix it within the week.

@aelsabbahy (Member) commented Jul 21, 2017

Try this one and let me know if it works for you. I usually see a failure within 5 minutes or so, but this one ran for an hour without issue.

The code changes are on the debug branch; the actual change was to the go-ps library. If this fixes it, I'll submit a PR to that project.

goss.zip

Thanks for your patience on this, was a tricky one.
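
For context: on Linux, go-ps builds its process list by scanning /proc. The sketch below is only an assumption about the class of problem being described here (entries skipped or errors raised when processes appear and disappear while the directory is being read); it is not the actual patch, which is in the go-ps commit linked in the next comment, and listPids is a hypothetical helper name.

// Sketch under assumptions, not the actual go-ps patch: enumerate PIDs from
// /proc in a way that tolerates processes exiting mid-scan.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// listPids returns the numeric entries of /proc that still exist when checked.
func listPids() ([]int, error) {
	d, err := os.Open("/proc")
	if err != nil {
		return nil, err
	}
	defer d.Close()

	// Read the whole directory in one call; reading it in small batches while
	// processes start and stop is one way entries can get lost.
	names, err := d.Readdirnames(-1)
	if err != nil {
		return nil, err
	}

	pids := make([]int, 0, len(names))
	for _, name := range names {
		pid, err := strconv.Atoi(name)
		if err != nil {
			continue // not a PID directory (e.g. "self", "meminfo")
		}
		// The process may already be gone; treat that as "skip this entry",
		// never as "the process being checked is not running".
		if _, err := os.Stat("/proc/" + name + "/stat"); err != nil {
			continue
		}
		pids = append(pids, pid)
	}
	return pids, nil
}

func main() {
	pids, err := listPids()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("processes visible:", len(pids))
}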

@taurus-forever (Author)

The fix has been confirmed. Nice catch: goss-org/go-ps@4433868
Feel free to send it upstream. Thank you for the fast debug and fix here!

P.S. I wasn't aware that go-ps is from Hashimoto. I use Vagrant a lot; the world is small.

@aelsabbahy (Member)

Released in v0.3.4

And yeah, Hashimoto is awesome! He has tons of great work out there.
